
Page 1:

The 45th International Symposium on Computer Architecture - ISCA 2018

RANA: Towards Efficient Neural Acceleration

with Refresh-Optimized Embedded DRAM

Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, Shaojun Wei

Institute of Microelectronics

Tsinghua University

Page 2:

Ubiquitous Deep Neural Networks (DNNs)


Example applications: Image Classification, Object Detection, Video Surveillance, Speech Recognition

Page 3:

DNN Requires Large On-Chip Buffer

• A modern DNN's per-layer data storage can reach 0.3~6.27MB (a rough estimate of how this adds up is sketched below, after the references).

• These numbers increase further if the network processes higher-resolution images or larger batch sizes.


[1] Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS’12.

[2] Simonyan et al., “Very Deep Convolutional Networks for Large-Scale Image Recognition”, ICLR’15.

[3] Szegedy et al., “Going Deeper with Convolutions”, CVPR’15.

[4] He et al., “Deep Residual Learning for Image Recognition”, CVPR’16.
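For a rough sense of where the 0.3~6.27MB figure comes from, a convolution layer's data storage is approximately the sum of its input feature maps, output feature maps, and weights. Below is a minimal sketch with hypothetical layer dimensions and 16-bit fixed-point data (all names and numbers are illustrative, not taken from the paper):

```python
def conv_layer_storage_mb(N, M, H, W, R, C, K, bytes_per_elem=2):
    """Estimate one convolution layer's data storage in MB.

    N, M : input / output channel counts
    H, W : input feature map height / width
    R, C : output feature map height / width
    K    : kernel size (K x K); bytes_per_elem = 2 for 16-bit fixed point
    """
    inputs = N * H * W            # input feature maps
    outputs = M * R * C           # output feature maps
    weights = M * N * K * K       # convolution kernels
    return (inputs + outputs + weights) * bytes_per_elem / 2**20

# Hypothetical mid-network layer: 256 -> 256 channels on 14x14 maps, 3x3 kernels.
print(f"{conv_layer_storage_mb(256, 256, 14, 14, 14, 14, 3):.2f} MB")  # ~1.3 MB
```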

Page 4:

SRAM-based DNN Accelerators

• The small footprint limits the on-chip buffer size of

conventional SRAM-based DNN accelerators.

– Usually <500KB with area cost of 3~20mm2. (Normalized)


[Figure: block diagram of an SRAM-based DNN accelerator — heterogeneous PE array with SuperPEs, data/weight buffer banks (Bank[0]~Bank[47]), buffer controllers, controller, and configurable interface]

On-chip buffer size and normalized area of representative accelerators:

• Thinker: 348KB, 19.4mm²
• DianNao: 44KB, 3.0mm²
• Eyeriss: 182KB, 12.3mm²
• Envision: 77KB, 10.1mm²

Thinker: Yin et al., “A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications”, JSSC’18.

DianNao: Chen et al., “DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning”, ASPLOS’14.

Eyeriss: Chen et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks”, ISSCC’16.

Envision: Moons et al., “ENVISION: A 0.26-to-10TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI”, ISSCC’17.

Page 5:

SRAM vs. eDRAM (Embedded DRAM)


• eDRAM has higher density than SRAM.

• Refresh is required for data retention.

• Charge will leak over time and might cause retention failures.

Page 6:

Refresh is an Energy Bottleneck

[Charts: eDRAM power breakdown (HPCA'13 [1]) and system power breakdown (ISCA'10 [2]); the highlighted overhead is eDRAM refresh energy]

[1] Chang et al., “Technology Comparison for Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized eDRAM”, HPCA’13.

[2] Wilkerson et al., “Reducing Cache Power with Low-Cost, Multi-bit Error-Correcting Codes”, ISCA’10.

Page 7:

Opportunity to Remove eDRAM Refresh


Refresh Interval = Retention Time

Ghosh, “Modeling of Retention Time for High-Speed Embedded Dynamic Random Access Memories”, TCASI’14.

Page 8:

Opportunity to Remove eDRAM Refresh


Refresh is unnecessary if Data Lifetime < Retention Time.

Opportunity 1: Increase the tolerable retention time by training.

Opportunity 2: Reduce data lifetime by scheduling.

Page 9:

RANA: Retention-Aware Neural Acceleration Framework


[Framework overview]

Inputs: 1. DNN Accelerator; 2. Target DNN Model

1. Retention-Aware Training Method (Training): accuracy constraint + eDRAM retention time distribution → tolerable retention time

2. Hybrid Computation Pattern (Scheduling): energy modeling, data lifetime analysis, buffer storage analysis → layerwise configurations

3. Refresh-Optimized eDRAM Controller (Architecture): data mapping, memory controller modification

Output: Optimized energy consumption

(Techniques 1-2 belong to the compilation phase; technique 3 to the execution phase.)

• Strengthen DNN accelerators with refresh-optimized eDRAM:

– Increase on-chip buffer size by replacing SRAM with eDRAM.

– Reduce energy overhead by removing unnecessary eDRAM refresh.

Page 10:

RANA: Retention-Aware Neural Acceleration Framework


[Framework overview annotated with the goal of each technique: 1. Retention-Aware Training Method (Training) → Retention Time↑; 2. Hybrid Computation Pattern (Scheduling) → Data Lifetime↓; 3. Refresh-Optimized eDRAM Controller (Architecture) → Refresh Control; the layerwise scheduling flow and the eDRAM controller diagram are detailed on later slides]

Page 11:

Tech1: Retention-Aware Training Method

• Retention time varies across cells.

– Retention failure rate: the fraction of cells whose retention time falls below a given retention time (see the sketch below).

Kong et al., “Analysis of Retention Time Distribution of Embedded DRAM – A New Method to Characterize Across-Chip Threshold Voltage Variation”, ITC’08.

[Figure: typical eDRAM retention time distribution (32KB); the weakest cell appears at the 45µs point]
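To make the relationship between failure rate and tolerable retention time concrete: if a fraction r of failing cells can be tolerated, the tolerable retention time becomes the r-quantile of the per-cell retention time distribution rather than the worst-case (minimum) cell. A hedged sketch on a synthetic distribution (the lognormal parameters and sizes are made up for illustration and do not reproduce the paper's measured curve):

```python
import numpy as np

def tolerable_retention_time_us(cell_retention_times_us, failure_rate):
    """Retention time below which a fraction `failure_rate` of cells fail,
    i.e. the failure-rate quantile of the per-cell retention distribution."""
    return float(np.quantile(np.asarray(cell_retention_times_us), failure_rate))

# Synthetic example: 32KB of cells (262,144 bits) with a lognormal spread.
rng = np.random.default_rng(0)
cells = rng.lognormal(mean=7.5, sigma=0.6, size=32 * 1024 * 8)
print("worst-case cell:", cells.min())                       # refresh-every-cell bound
print("tolerating r=1e-5:", tolerable_retention_time_us(cells, 1e-5))
```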

Page 12:

Tech1: Retention-Aware Training Method

• Retrain the network to tolerate a higher failure rate and thus obtain a longer tolerable retention time.


[Retention-Aware Training Method, flow: target DNN model + failure rate (r) → fixed-point pretrain → fixed-point DNN model → add layer masks injecting random bit-level errors → retrain with weight adjustment → retention-aware DNN model]
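One way to read the "layer masks / random bit-level errors" step: during retraining, stored fixed-point values are corrupted with random bit flips at the target failure rate r, so the network learns to tolerate cells that decay before being read again. A minimal NumPy sketch of such an injection (the function name, the per-bit flip model, and the 16-bit width are assumptions for illustration, not the paper's exact implementation):

```python
import numpy as np

def inject_bit_errors(fixed_point_tensor, failure_rate, bit_width=16, rng=None):
    """Flip each stored bit independently with probability `failure_rate`,
    mimicking eDRAM retention failures on a fixed-point tensor."""
    rng = np.random.default_rng() if rng is None else rng
    data = fixed_point_tensor.astype(np.uint16).copy()
    for bit in range(bit_width):
        flip = rng.random(data.shape) < failure_rate   # which cells fail at this bit
        data[flip] ^= np.uint16(1 << bit)
    return data

# Example: corrupt a layer's quantized activations at r = 1e-5 before the
# forward pass of one retraining iteration.
acts = np.random.randint(0, 2**16, size=(64, 32, 32), dtype=np.uint16)
noisy = inject_bit_errors(acts, failure_rate=1e-5)
```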

Page 13:

Tech1: Retention-Aware Training Method

• Failure rate of 10−5: No accuracy loss, 734𝜇s.

• Failure rate of 10−4: Accuracy decreases.


[Figure: relative accuracy under different retention failure rates, with retention times of 45µs, 734µs, and 1030µs marked]

Page 14:

Tech2: Hybrid Computation Pattern

• A computation pattern is expressed as a loop nest.

• Data lifetime and buffer storage depend on the loop ordering, especially the outermost loop.


Page 15:

Tech2: Hybrid Computation Pattern

• Outputs are dynamically updated by accumulation, which recharges the cells like a periodic refresh.

• Different computation patterns have different data lifetimes and buffer storage requirements (a loop-nest sketch of the output-dependent case follows below).


[Figure: three computation patterns — Input Dependent, Output Dependent (OD), and Weight Dependent (WD)]
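As a concrete reading of the output-dependent (OD) case, here is a hedged sketch of a tiled convolution loop nest: the output tile stays resident and is re-accumulated on every inner iteration (so its cells are constantly rewritten and recharged), while input and weight tiles are short-lived. Tile names follow the <Tm, Tn, Tr, Tc> notation used on the next slide; the function itself is illustrative, not the accelerator's actual dataflow:

```python
import numpy as np

def conv_output_dependent(ifmaps, weights, Tm, Tn, Tr, Tc):
    """Tiled convolution with output-dependent (OD) loop ordering.

    ifmaps: (N, H, W) input feature maps; weights: (M, N, K, K).
    Output tiles are outermost: each one is accumulated over all input-channel
    tiles before being written back, so outputs have long residency but are
    recharged by every accumulation, while inputs/weights live only per tile.
    """
    N, H, W = ifmaps.shape
    M, _, K, _ = weights.shape
    R, C = H - K + 1, W - K + 1                       # stride 1, no padding
    ofmaps = np.zeros((M, R, C), dtype=np.float32)
    for mo in range(0, M, Tm):                        # output-channel tiles (outermost)
        for ro in range(0, R, Tr):                    # output-row tiles
            for co in range(0, C, Tc):                # output-column tiles
                out_tile = ofmaps[mo:mo+Tm, ro:ro+Tr, co:co+Tc]   # view, accumulated in place
                tr, tc = out_tile.shape[1], out_tile.shape[2]
                for no in range(0, N, Tn):            # accumulate over input-channel tiles
                    for kh in range(K):
                        for kw in range(K):
                            in_patch = ifmaps[no:no+Tn, ro+kh:ro+kh+tr, co+kw:co+kw+tc]
                            w = weights[mo:mo+Tm, no:no+Tn, kh, kw]
                            out_tile += np.einsum('mn,nrc->mrc', w, in_patch)
    return ofmaps
```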

Page 16:

[Scheduling flow: for each layer, take the layer description (from the DNN model) and the hardware constraints (from the DNN accelerator), run the scheduling scheme to select a computation pattern <OD/WD, Tm, Tn, Tr, Tc>, then switch to the next layer; after the last layer, output the configurations for each layer]

Tech2: Hybrid Computation Pattern

• Scheduling scheme:

– Input: DNN accelerator and network’s parameters.

– Optimization: Minimize total system energy.

– Output: Layerwise configurations.


min Energy
s.t. Energy = Equation (14),
     Tn · Th · Tl ≤ Ri,
     Tm · Tr · Tc ≤ Ro,
     Tm · Tn · K² ≤ Rw,
     1 ≤ Tm ≤ M, 1 ≤ Tn ≤ N, 1 ≤ Tr ≤ R, 1 ≤ Tc ≤ C.

Scheduling Scheme
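A hedged sketch of how such a scheduling scheme might be searched in practice: enumerate the computation pattern and tile sizes, discard candidates that violate the input/output/weight buffer capacities (Ri, Ro, Rw), and keep the lowest-energy configuration. The `energy_model` callback below stands in for the paper's Equation (14), which is not reproduced here; the stride-1 relation Th = Tr + K - 1 and the step-size grid are simplifying assumptions:

```python
from itertools import product

def schedule_layer(M, N, R, C, K, Ri, Ro, Rw, energy_model, step=8):
    """Brute-force stand-in for the scheduling scheme: pick
    <pattern, Tm, Tn, Tr, Tc> minimizing modeled energy under the
    buffer-capacity constraints from the formulation above."""
    best = None
    tile_ranges = (range(step, M + 1, step), range(step, N + 1, step),
                   range(step, R + 1, step), range(step, C + 1, step))
    for pattern in ("OD", "WD"):                      # output- / weight-dependent
        for Tm, Tn, Tr, Tc in product(*tile_ranges):
            Th, Tl = Tr + K - 1, Tc + K - 1           # input tile size (stride 1)
            if Tn * Th * Tl > Ri or Tm * Tr * Tc > Ro or Tm * Tn * K * K > Rw:
                continue                              # violates a buffer constraint
            e = energy_model(pattern, Tm, Tn, Tr, Tc) # stand-in for Equation (14)
            if best is None or e < best[0]:
                best = (e, pattern, Tm, Tn, Tr, Tc)
    return best   # (energy, pattern, Tm, Tn, Tr, Tc), or None if nothing fits
```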

Page 17:

Tech3: Refresh-Optimized eDRAM Controller

• eDRAM controller:

– Programmable clock divider: sets the refresh interval.

– Refresh issuer and refresh flags for each eDRAM bank.

– Configured by the outputs of Tech1 & Tech2 (a behavioral sketch follows the figure below).


[Figure: unified buffer system — the eDRAM controller contains a programmable clock divider driven by the reference clock, a refresh issuer, and per-bank eDRAM refresh flags controlling the eDRAM banks]
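To make the refresh control concrete, here is a small behavioral model (not RTL, and not the paper's implementation): a programmable divider derives refresh ticks from the reference clock, and refresh is issued only to banks whose flag is set, i.e. banks whose data lifetime exceeds the tolerable retention time for the current layer:

```python
class RefreshController:
    """Behavioral sketch of a refresh-optimized eDRAM controller:
    programmable clock divider + per-bank refresh flags."""

    def __init__(self, num_banks):
        self.flags = [False] * num_banks   # True = bank must be refreshed
        self.divide_ratio = 1              # refresh interval in reference-clock cycles
        self.counter = 0

    def configure(self, flags, divide_ratio):
        """Apply the layerwise configuration produced by the training and
        scheduling steps (which banks need refresh, and how often)."""
        self.flags = list(flags)
        self.divide_ratio = divide_ratio
        self.counter = 0

    def tick(self):
        """Advance one reference-clock cycle; return the banks to refresh now."""
        self.counter += 1
        if self.counter < self.divide_ratio:
            return []
        self.counter = 0
        return [b for b, needs in enumerate(self.flags) if needs]

# Illustrative use: 5 banks; only banks 0 and 3 hold data outliving the
# tolerable retention time, refreshed every 1000 reference-clock cycles.
ctrl = RefreshController(num_banks=5)
ctrl.configure(flags=[True, False, False, True, False], divide_ratio=1000)
```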

Page 18:

Evaluation Platform

• RTL-level cycle-accurate simulation, for performance estimation and

memory access tracing.

• System-level energy estimation, based on synthesis, Destiny and CACTI.


Platform Configurations:

• DNN Accelerator: 256 MACs, 384KB SRAM, 200MHz, 5.682mm², 65nm

• eDRAM: 1.454MB, retention time = 45µs, 65nm

Kong et al., “Analysis of Retention Time Distribution of Embedded DRAM – A New Method to Characterize Across-Chip Threshold Voltage Variation”, ITC’08.

Page 19:

Experimental Results


eDRAM refresh operations: 99.7%↓

Off-chip memory access: 41.7%↓

System energy consumption: 66.2%↓

Page 20:

Scalability to Other Architectures

• DaDianNao: 4096 MACs, 36MB eDRAM, 606MHz.


eDRAM refresh operations: 99.9%↓

System energy consumption: 69.4%↓

Chen et al., “DaDianNao: A Machine-Learning Supercomputer”, MICRO’14.

Page 21:

Takeaway

RANA: Retention-Aware Neural Acceleration Framework

• Training: Retention-aware training method.

– Exploit DNN's error resilience to improve the tolerable retention time.

• Scheduling: Hybrid computation pattern.

– Different computing orders and parallelism show different data lifetime and buffer storage requirements.

• Architecture: Refresh-optimized eDRAM controller.

– No need to refresh all the banks.

– No need to always use the worst-case refresh interval.

• Not limited to applying eDRAM to DNN acceleration.

– Approximate computing: retention and error resilience.


[Framework overview diagram, as on Page 9]

Page 22:

Thank you for your attention!

Email: [email protected]