designing an analog crossbar based

36
Photos placed in horizontal position with even amount of white space between photos and header Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. Sapan Agarwal , Alexander Hsia, Robin Jacobs-Gedrim, David R. Hughart, Steven J. Plimpton, Conrad D. James, Matthew J. Marinella Sandia National Laboratories Designing an Analog Crossbar based Neuromorphic Accelerator 1 “For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Upload: others

Post on 18-Mar-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

Photos placed in horizontal position

with even amount of white space

between photos and header

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and

Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S.

Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

Sapan Agarwal, Alexander Hsia, Robin Jacobs-Gedrim, David R. Hughart,

Steven J. Plimpton, Conrad D. James, Matthew J. Marinella

Sandia National Laboratories

Designing an Analog Crossbar based

Neuromorphic Accelerator

1“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

The Von Neumann Bottleneck

CPU MemoryCommunications Bus

Cross chip communications ~ 1 pJ

DRAM Access >10 pJ

Ethernet ~ 1nJCurrent Transistors ~ 10 aJ

40kT Noise Limit ~ 0.2 aJ

Processor Layer Photonic Layer

Optical interconnects 100 fJ to 1 pJ

Communications require

orders of magnitude more

energy!

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Use Resistive Memories for Local Computation

• A resistive memory or ReRAM is a

programmable resistor• Apply small voltages allows the conductance

to be read: I = G × V

• Apply large voltages to change the resistance

Ta

Pt

+ +

++++

+ ++ ++

+ ++ +

+ +++

ON

Ta

Pt

+ +

++++

TaOx

+

OFF

V = I×R

I = G×V

multiplication

Addition: I=I1+I2

I1

I2

Current

VoltageVREAD VSET

VRESETSET

RESETWrite

Read Window

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Directly Process in the Memory Itself

4

Analog is efficiently and naturally

able to combine computation and

data access

Effectively, large-scale processing in

memory with a multiplier and adder

at each real-valued memory location

w11

w21

w31

w41

w12

w22

w32

w42

w13

w23

w33

w43

w14

w24

w34

w44

V1=x1 +-

+-

+-

+-

V2=x2

V3=x3

V4=x4

I1=x1*w11 + … + x4*w41

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Crossbars Can Perform Parallel Reads and Writes

w11

w21

w31

w41

w12

w22

w32

w42

w13

w23

w33

w43

w14

w24

w34

w44

V1=x1+-

+-

+-

+-

V2=x2

V3=x3

V4=x4x4=1

V t

x3=0.66V t

x2=0.33V t

x1=0V t

y1=0.25

V V V V

tt t t

y2=0.5 y3=0.75 y4=1

w11

w21

w31

w41

w12

w22

w32

w42

w13

w23

w33

w43

w14

w24

w34

w44

N r

ow

s

M columns

Energy to charge the crossbar is CV2

E ∝ C ∝ number of RRAMs ∝ N×M

E ~ O(N×M)

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

SRAM Arrays Require Charging Columns Multiple Times

WL[0]

WL[1]

WL[2]

BL[0] BL[1] BL[2]

SRAMs must be read one row at a time, charging M columns

Each column wire length is O(N).

Energy = N Rows × M Columns × O(N) wire length

Energy ~ O(N2×M)

O(N) times worse than a crossbar!

N r

ow

s

M columns

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Want To Accelerate Many Different Neural Algorithms

BackpropagationSparse

Coding

Liquid State

Machine

Input

Nodes

Output

NodesReservoir

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Crossbars Can Perform Parallel Reads and Writes

w11

w21

w31

w41

w12

w22

w32

w42

w13

w23

w33

w43

w14

w24

w34

w44

V1=x1+-

+-

+-

+-

V2=x2

V3=x3

V4=x4x4=1

V t

x3=0.66V t

x2=0.33V t

x1=0V t

y1=0.25

V V V V

tt t t

y2=0.5 y3=0.75 y4=1

w11

w21

w31

w41

w12

w22

w32

w42

w13

w23

w33

w43

w14

w24

w34

w44

N r

ow

s

M columns

Energy to charge the crossbar is CV2

E ∝ C ∝ number of RRAMs ∝ N×M

E ~ O(N×M)

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

General Purpose Neural Architecture

Neuromorphic core:

• Evaluate vector matrix multiplies along

rows or columns

• Train based on input vectors

Digital Core:

• Process neural core inputs/outputs

• For NxN crossbar, the crossbar accelerates

O(N2) operations leaving only O(N) operations

for the digital core

Run any neural algorithm on the

same hardware

R RR BusBus

R RR Bus Bus

R RR Bus Bus

Neural

Core(s)

Digital

Core

Neural

Core(s)

Digital

Core

Neural

Core(s)

Digital

Core

Neural

Core(s)

Digital

Core

Router

D/A

D/A

D/A D/A

A/D A/D

+ - + -

A/D

A/D

+ -

+ -

positive

weights

negative

weights

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

yi

Neural Core

jzje

y

1

1

i

ijij wyz

yj

O(N2)

Operations

O(N)

Operations

in

out

i

j

k

Can Run Neural Networks on this

Architecture

w11

w21

w31

w12

w22

w32

w31

w32

w33

Digital Core

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

yi

Digital Core

kjj zdz

dy )(

i

ijij wyz

O(N2) Read

Operations

O(N)

Operations

i

j

k

kerror

k

k

jkk w

j k

yiiy

O(N2) Write

Operations

Back Propagation

w11

w21

w31

w12

w22

w32

w31

w32

w33

w11

w21

w31

w12

w22

w32

w31

w32

w33

w11

w21

w31

w12

w22

w32

w31

w32

w33

jz

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Design & Model Detailed Architecture

12

Vector Matrix Multiply Matrix Vector Multiply Outer product Update

Neuron Circuitry Current Conveyor

Based Integrator

Comparator

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Row & Column Driver Circuitry

13

Row Driver Logic

Column Driver Logic

Voltage level shifter (drive

high V transistor with low V)

Array driver pass transistors

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Compare Architectures

14

1024 x1024 = 1M array operations, sum over 1 training cycle, 3 operations:

• Vector Matrix Multiply

• Matrix Vector Multiply

• Outer Product Update

Latency

35 – 800X over SRAM

Energy

430 – 6,900X over SRAM

Area

11 – 20X over SRAM

Used a commercial 14/16 nm PDK ***Requires 100 MΩ on state devices

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Neural Core Energy Analysis

15

Analog

ReRAM

Digital

ReRAM

SRAM

12,010 nJ 10,150 nJ 8,970 nJ

7,520 nJ 5,580 nJ 4,340 nJ

28 nJ 2.7 nJ 1.3 nJ

8 bits In/out

8 bit weights

4 bits In/out

8 bit weights

2 bits In/out

8 bit weights

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

16

Algorithms

Sandia Cross-Sim: Translates device measurements and crossbar circuits to algorithm-level performance

Architecture

Circuits

Devices

Materials

Target Algorithms• Deep Learning• Sparse Coding• Liquid State

Machines

x2

x2

x2

x2

w1,1

w2,1

w3,1

w4,1

w1,2

w2,2

w3,2

w4,2

w2,x

w2,x

w3,x

w4,x

...

...

...

...

Drift-diffusion model of ReRAM band diagram & transport (REOS, Charon)

DFT of model of oxide physics, bands

+++

+-

--

--

-

++

VTE

+

++

+

+

--

+

++

++

+

+

--

--

-

PtTaOxTa

Modified McPAT/CACTI: Model performance and energy requirements

In situ TEM of filament switching: Use DFT model to interpret EELS signature

Sandia’s Xyce Circuit Sim: Simulate crossbar circuits based on our devices

Memristor fabrication and measurements in MESAFab

Multiscale Model of a Neural Training Accelerator

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Numeric

Crossbar

Simulator

Xyce

Crossbar

Circuit Model

Learning

Algorithm

Neural Core

Simulator

Simple Python API:

# Do a matrix vector multiplication

result = neural_core.run_xbar_mvm(vector)

+ - + - + -

333231

232221

131211

www

www

www

Detailed but

slow

Fast but

approximate

Measured

Devices

TaOx –MNIST

Periodic Carry

Single Device

Ideal Numeric

https://cross-sim.sandia.gov

Algorithmic

Performance

Physical

Hardware

Crossbar

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

18

Simple API to model crossbars# ************** set parameters defining the crossbar

params.algorithm_params.weights.sim_type = “XYCE” # Use a XYCE based sim

params.algorithm_params.weights.maximum = 10 # clipping limits

params.algorithm_params.weights.minimum = -10 # clipping limits

params.xyce_parameters.xbar.device.TAHA_A1 = 4e-4 # Xyce Parameters

# ************** API for running neural operations

# All crossbar details are transparent to the user

# Create a neural_core object that models a crossbar

neural_core = MakeCore(params=params)

neural_core.set_matrix(weights) # set the initial weights

result = neural_core.run_xbar_vmm(vector) # Do a vector matrix multiply

result = neural_core.run_xbar_mvm(vector) # Do the transpose, a matrix vector mult.

neural_core.update_matrix(vector1,vector2) # Do an outer product update

https://cross-sim.sandia.gov

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Go from Measurement to AccuracyMeasured Pulsing ΔG Scatterplot

Cumulative

Probability of ΔG

D/A

D/A

D/A D/A

A/D A/D

+ - + -

A/D

A/D

+ -

+ -

positive

weights

negative

weights

R

R

R

R

R

R

Bus R

R

R

Bus

Neural

Core(s)

Digital

Core

Neural

Core(s)

Digital

Core

Neural

Core(s)

Digital

Core

Neural

Core(s)

Digital

Core

Router

Bus Bus

Bus Bus

TiN

Ta– 50 nm

TaOx – 10 nm

TiN

Fabricate

Device

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

If we need more bits per synapse, use multiple memristors

• Three 10 level ReRAMs could represent 1-1000!

• Adding to the weight requires reading every

ReRAM to account for any carries and serially

programming each ReRAM: VERY EXPENSIVE

Neuron

×100 ×10 ×1 • Use >10 levels to represent a base 10 system

• Ignore carry and program the crossbar in parallel.

• Periodically (once every few hundred cycles) read

the ReRAM and perform the carry

10 levels

represent the

weight

Extra levels

store the

carry

conductance

Multi-ReRAM Synapse: Periodic Carry

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Read and reset every 100 pulses

Do 300,000 small (0.02% of weight range) updates

• net of 1500 positive training pulses

Noise Sigma = 1.4% for single device

• (from )

• Write noise applied during updates and carries

Periodic Carry Compensates for Write Noise

Periodic

Carry

Single

Device

rangerangenoise GGG 1.0

Learn from a 0.5% Signal

1/5

-1/5

1/25 1/125

-1/25 -1/125

Carry

-1

1

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Pulse Number

Weig

ht

Positive

Pulses

Alternating

Pulses

1/5

-1/5

1/25 1/125

-1/25 -1/125

Carry

-1

1

Periodic Carry Mitigates Write Nonlinearity

Write NonlinearityAlternating Pulses Cause Weight Decay

Use center linear range of weights

• Train with 1% signal

• Ideal result is 0.6

Single

Device

Periodic

Carry

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

TaOx Results

A/D and D/A is modeled, Serial operations modeled• When resetting weight, need to adjust pulse size based on current state to compensate for nonlinearity

• When reading a single weight, need to adjust readout range to be smaller (change capacitor on the integrator)

Carry once every

1000 updates

1/4

-1/4

Carry

-1

1 TaOx – File Types

Periodic Carry

Ideal Numeric

Single Device

TaOx –MNIST

Periodic Carry

Single Device

Ideal Numeric

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Li-Ion Synaptic Transistor for Analog Computation (LISTA)

E. J. Fuller, et al, "Li-Ion Synaptic Transistor for Low Power Analog Computing," Advanced

Materials, vol. 29, no. 4, p. 1604310, 2017.

Off by ~ 1%

from Ideal

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Summary

Fundamental O(N) energy scaling advantage

Use CrossSim to co-design materials to algorithms Use periodic carry to overcome noise devices

Need high resistance 10-100 MΩ Devices

Need low write nonlinearities

25

Latency

35 – 800X over SRAM

Energy

430 – 6,900X over SRAM

Area

11 – 20X over SRAM

https://cross-sim.sandia.gov

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Extra Slides

26

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Overcoming the Power Limit

CPU MemoryCommunications Bus

Integrate Processing and Memory

Richard Goering, “Three Die Stack -- A Big Step “Up” for 3D-ICs with TSVs” Cadence blog

w11

w21

w31

w41

w12

w22

w32

w42

w13

w23

w33

w43

w14

w24

w34

w44

V1=x1+-

+-

+-

+-

V2=x2

V3=x3

V4=x4

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

VGI oo

VGI oo

VGI oo

Measure N resistors and determine the total output current with some signal to noise ratio (SNR)*

What is the minimum energy?

fNGVEnergy O

12

Power in each resistor ×number of resistors

Determined by noise and SNR

*we are assuming we need some fixed precision on the output, and don’t need full floating point accuracy

fGTkN

I

ob

4

Noise Thermal 2

2

2

2

I

NISNR o

24 SNRTkEnergy b

If we double the number of resistors, we can double the speed to get the same energy and SNR.

This is because the noise scales as sqrt(N) while the signal scales as N

NGVSNRTk

f o

b

2

2 14

1

The Noise Limited Energy to Read a Crossbar Column is Independent of Crossbar Size

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Experimental Device Non-idealities

Device: Write Variability, Write Nonlinearity, Asymmetry, Read Noise

Circuit: A/D, D/A noise, parasitics

29

G∝

W

Pulse Number (Vwrite=1V, tpulse=1µs)

= Ideal

Variability and Nonlinearity

= Variability Range

= Nonlinear

I∝

G∝

W

Time (Vread=100mV)

I0I0-∆I

I0+∆I

Read Noise

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Combined Effects of Nonidealities

0

90

98Linear Asymmetric, ν = 1

Asymmetric, ν = 5 Symmetric, ν = 5

AccuracyMNIST

Read Noise (σRN) Read Noise (σRN)

Write

No

ise

WN)

Write

No

ise

WN)

0

90

98Linear Asymmetric, ν = 1

Asymmetric, ν = 5 Symmetric, ν = 5

AccuracyFile Types

Read Noise (σRN) Read Noise (σRN)

Write

No

ise

WN)

Write

No

ise

WN)

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

What are the Neural ReRAM Device Requirements?

Small

Images

Large

Images

File

Types

Read Noise σ (% Range) 3% 5% 9%

Write Noise σ (% Range) 0.3% 0.4% 0.4%

Asymmetric Nonlinearity (ν) 0.1 0.1 0.1

Symmetric Nonlinearity (ν) >20 5 5

Maximum Current 160 nA 13 nA 40 nA

wmin

wmax

We

igh

t (C

on

du

cta

nce

)

Normalized Pulse Number

Positive Pulses Negative Pulses

0 0.5 1 0.5 0

Symmetric Nonlinearity

wmin

wmax

We

igh

t (C

on

du

cta

nce

)

Normalized Pulse Number

Positive Pulses Negative Pulses

0 0.5 1 0.5 0

Asymmetric Nonlinearity

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Full System Simulation

Range Bits

Row Input -1 to 1 8

Col Output -6 to 6 8

Col Input -1 to 1 8

Row Output -4 to 4 8

Row Update -0.01 to 0.01 7

Col Update -1 to 1 5

A/D & D/A Have

Minimal Impact

D/A

D/A

D/A D/A

A/D A/D

- + - +

A/D

A/D

-+

-+

positive

weights

negative

weights

Data set#Training/Test

Examples

Network Size

File Types 4,501 / 900 256×512×9

MNIST 60,000 /10,000 784×300×10

MNIST

Conductance

Wmax

-Wmax

0

0

Gmax

Weight

Gmax/2

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

TaOx Results

A/D and D/A is modeled, serial operations modeled

• When resetting weight, need to adjust pulse size based on current state to

compensate for nonlinearity

• When reading a single weight, need to adjust readout range to be smaller (change

capacitor on the integrator)

Carry once every 1000 updates

for the LSB, and every 2 updates

on others

1/4

-1/4

Carry

-1

1

TaOx – File Types

Periodic Carry

Ideal Numeric

Single Device

TaOx –MNIST

Periodic Carry

Single Device

Ideal Numeric

Update Count (x 10,000)

Dig

it 1

Wei

ght

Dig

it 0

Wei

ght

Weights During Training

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential

Information.”

LISTA ResultsWeight

Configuration

(base 7)

49

-49

-98

98

0

×7

343

-343

-686

686

0

×49

Carry

• Carry once every 1000 updates

• Use a single device per weight and

subtract a reference current

7

-7

-14

14

0

×1

LISTA - MNIST99

98

97

9695

Periodic

CarryIdeal

Numeric

Ideal

w/ A/D

Single

Device

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Neural Core Latency Analysis

35

Analog ReRAM

Min write time of 8 ns vs

1 ns incremental write1.28 µs

Digital ReRAM

All bit precisions

4 bit in/out 2 bit in/out

SRAM

All bit precisions

SRAM transpose

read expensive

0.08 µs 0.054 µs

1335 µs 44 µs

8 bit in/out

x1 x0.06 x0.04

x1040 x35

OPU = Outer Product Update

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”

Neural Core Area Analysis

36

Analog ReRAM

Digital ReRAM

SRAM

x1

x1.8

x11.1

8 bits In/out

8 bit weights

SRAM Array

MAC

Array

Drivers

Array

836k µm2

137k µm2

75k µm2

4 bits In/out

8 bit weights

2 bits In/out

8 bit weights

For the ReRAM, high voltage transistors require 8X area, improving this could give ~2X area savings

ReRAM Array

on logic

“For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information.”