Efficient Computing for Deep Learning, AI and Robotics - IAP


Page 1

Efficient Computing for Deep Learning, AI and Robotics

Vivienne Sze (@eems_mit), Massachusetts Institute of Technology

In collaboration with Luca Carlone, Yu-Hsin Chen, Joel Emer, Sertac Karaman, Tushar Krishna, Thomas Heldt, Trevor Henderson, Hsin-Yu Lai, Peter Li, Fangchang Ma, James Noraky, Gladynel Saavedra Peña, Charlie Sodini, Amr Suleiman, Nellie Wu, Diana Wofk, Tien-Ju Yang, Zhengdong Zhang

Slides available at https://tinyurl.com/SzeMITDL2020

1

Page 2

Compute Demands for Deep Neural Networks

Source: OpenAI (https://openai.com/blog/ai-and-compute/)
(Chart: training compute in petaflop/s-days, log scale, vs. year)

AlexNet to AlphaGo Zero: A 300,000x Increase in Compute

Page 3

Compute Demands for Deep Neural Networks

[Strubell, ACL 2019]

Page 4

Processing at “Edge” instead of the “Cloud”

Communication Privacy Latency

4

Page 5

Computing Challenge for Self-Driving Cars

(Feb 2018)

Cameras and radar generate ~6 gigabytes of data every 30 seconds.
Self-driving car prototypes use approximately 2,500 watts of computing power.
This generates wasted heat, and some prototypes need water cooling!

5

Page 6

Existing Processors Consume Too Much Power

< 1 Watt > 10 Watts

6

Page 7

Transistors are NOT Getting More Efficient
Slowdown of Moore's Law and Dennard Scaling

• General-purpose microprocessors are not getting faster or more efficient
• Need specialized hardware for significant improvement in speed and energy efficiency
• Redesign computing hardware from the ground up!

7

Page 8

Popularity of Specialized Hardware for DNNs

Big Bets On A.I. Open a New Frontier for Chips Start-Ups, Too. (January 14, 2018)

“Today, at least 45 start-ups are working on chips that can power tasks like speech and self-driving cars, and at least five of them have raised more than $100 million from investors. Venture capitalists invested more than $1.5 billion in chip start-ups last year, nearly doubling the investments made two years ago, according to the research firm CB Insights.”

Page 9

Power Dominated by Data Movement

Operation              Energy (pJ)   Area (µm²)
8b Add                 0.03          36
16b Add                0.05          67
32b Add                0.1           137
16b FP Add             0.4           1360
32b FP Add             0.9           4184
8b Mult                0.2           282
32b Mult               3.1           3495
16b FP Mult            1.1           1640
32b FP Mult            3.7           7700
32b SRAM Read (8KB)    5             N/A
32b DRAM Read          640           N/A

[Horowitz, ISSCC 2014]

(Relative energy cost spans roughly 1 to 10^4 and relative area cost roughly 1 to 10^3, both on log scales)

Memory access is orders of magnitude higher energy than compute

9

Page 10

Autonomous Navigation Uses a Lot of Data

• Geometric Understanding
  – Growing map size (2 million pixels; 10x-100x more pixels)
• Semantic Understanding
  – High frame rate
  – Large resolutions
  – Data expansion

[Pire, RAS 2017]

10

Page 11

Understanding the Environment

Depth Estimation and Semantic Segmentation

State-of-the-art approaches use Deep Neural Networks, which require up to several hundred million operations and weights to compute (>100x more complex than video compression)

11

Page 12

Deep Neural Networks

Applications: Computer Vision, Speech Recognition, Game Play, Medical

Deep Neural Networks (DNNs) have become a cornerstone of AI

12

Page 13

What Are Deep Neural Networks?

Input: Image → Low-Level Features → High-Level Features → Output: "Volvo XC90"

Modified Image Source: [Lee, CACM 2011]

13

Page 14

Weighted Sum

Y_j = activation( Σ_{i=1}^{3} W_{ij} × X_i )

(Figure: a network with input layer X1–X3, hidden layer, and output layer Y1–Y4; weights W11–W34 connect the layers, and each weighted sum passes through a nonlinear activation function)

Key operation is multiply and accumulate (MAC); accounts for > 90% of computation

Nonlinear activation functions:
  Sigmoid: y = 1/(1 + e^-x)
  Rectified Linear Unit (ReLU): y = max(0, x)
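As a concrete illustration of the weighted sum above, here is a minimal NumPy sketch (illustrative values, not from the slides) of one fully connected layer computed as MACs followed by a ReLU:

```python
import numpy as np

# Minimal sketch of the weighted sum above: 3 inputs (X1..X3), 4 outputs (Y1..Y4).
X = np.array([0.5, -1.2, 0.3])        # input activations (illustrative values)
W = np.random.randn(3, 4)             # weights W_ij

# Each output Y_j accumulates W_ij * X_i over i (3 MACs per output),
# then passes through the ReLU nonlinearity.
Y = np.maximum(0.0, X @ W)
print(Y)                               # 4 output activations
```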

Image source: Caffe tutorial

14

Page 15

Popular Types of Layers in DNNs

• Fully Connected Layer
  – Feed forward, fully connected
  – Multilayer Perceptron (MLP)

• Convolutional Layer
  – Feed forward, sparsely connected with weight sharing
  – Convolutional Neural Network (CNN)
  – Typically used for images

• Recurrent Layer
  – Feedback
  – Recurrent Neural Network (RNN)
  – Typically used for sequential data (e.g., speech, language)

• Attention Layer/Mechanism
  – Attention (matrix multiply) + feed forward, fully connected
  – Transformer [Vaswani, NeurIPS 2017]

(Figures: feed-forward vs. feedback connections, and fully connected vs. sparsely connected layers between input, hidden, and output layers)

15

Page 16

High-Dimensional Convolution in CNN

(Figure: a 2-D filter (weights) of size R×S and a plane of input activations, a.k.a. input feature map (fmap), of size H×W)

16

Page 17

High-Dimensional Convolution in CNN

(Figure: element-wise multiplication of the R×S filter with a window of the H×W input fmap, accumulated into a partial sum (psum) to produce one output activation of the E×F output fmap)

17

Page 18

High-Dimensional Convolution in CNN

(Figure: sliding window processing; the filter slides over the H×W input fmap to produce each output activation of the E×F output fmap)

18

Page 19

High-Dimensional Convolution in CNN

Many Input Channels (C)
(Figure: the filter and the input fmap each have C channels)

AlexNet: 3 – 192 Channels (C)

19

Page 20

High-Dimensional Convolution in CNN

Many Output Channels (M)
(Figure: M filters, each of size R×S×C, produce an output fmap with M channels)

AlexNet: 96 – 384 Filters (M)

20

Page 21

High-Dimensional Convolution in CNN

Many Input fmaps (N) → Many Output fmaps (N)
(Figure: a batch of N input fmaps processed with the same filters produces N output fmaps)

Image batch size: 1 – 256 (N)

21

Page 22

Define Shape for Each Layer

(Figure: M filters of size R×S×C, N input fmaps of size H×W×C, and N output fmaps of size E×F×M)

H – Height of input fmap (activations)
W – Width of input fmap (activations)
C – Number of 2-D input fmaps / filters (channels)
R – Height of 2-D filter (weights)
S – Width of 2-D filter (weights)
M – Number of 2-D output fmaps (channels)
E – Height of output fmap (activations)
F – Width of output fmap (activations)
N – Number of input fmaps / output fmaps (batch size)

Shape varies across layers
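To make the shape parameters concrete, here is a minimal reference sketch (naive Python loops, stride 1, no padding, illustrative sizes chosen here rather than taken from the slides) of the CONV layer computation:

```python
import numpy as np

# Naive reference CONV layer using the shape parameters above (stride 1, no padding).
N, C, H, W = 1, 3, 8, 8          # batch, input channels, input height/width (illustrative)
M, R, S = 4, 3, 3                # number of filters, filter height/width
E, F = H - R + 1, W - S + 1      # output height/width

ifmap = np.random.randn(N, C, H, W)
filters = np.random.randn(M, C, R, S)
ofmap = np.zeros((N, M, E, F))

for n in range(N):               # input/output fmaps in the batch
    for m in range(M):           # output channels (filters)
        for e in range(E):       # output rows
            for f in range(F):   # output columns
                for c in range(C):         # input channels
                    for r in range(R):     # filter rows
                        for s in range(S): # filter columns: one MAC per iteration
                            ofmap[n, m, e, f] += filters[m, c, r, s] * ifmap[n, c, e + r, f + s]
```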

22

Page 23

Layers with Varying Shapes

MobileNetV3-Large Convolutional Layer Configurations [Howard, ICCV 2019]

Block   Filter Size (R×S)   # Filters (M)   # Channels (C)
1       3×3                 16              3
3       1×1                 64              16
3       3×3                 64              1
3       1×1                 24              64
6       1×1                 120             40
6       5×5                 120             1
6       1×1                 40              120
...

23

Page 24

Popular DNN Models

Metrics                 LeNet-5      AlexNet        VGG-16        GoogLeNet (v1)   ResNet-50    EfficientNet-B4
Top-5 error (ImageNet)  n/a          16.4           7.4           6.7              5.3          3.7*
Input Size              28x28        227x227        224x224       224x224          224x224      380x380
# of CONV Layers        2            5              16            21 (depth)       49           96
  # of Weights          2.6k         2.3M           14.7M         6.0M             23.5M        14M
  # of MACs             283k         666M           15.3G         1.43G            3.86G        4.4G
# of FC Layers          2            3              3             1                1            65**
  # of Weights          58k          58.6M          124M          1M               2M           4.9M
  # of MACs             58k          58.6M          124M          1M               2M           4.9M
Total Weights           60k          61M            138M          7M               25.5M        19M
Total MACs              341k         724M           15.5G         1.43G            3.9G         4.4G
Reference               Lecun,       Krizhevsky,    Simonyan,     Szegedy,         He,          Tan,
                        PIEEE 1998   NeurIPS 2012   ICLR 2015     CVPR 2015        CVPR 2016    ICML 2019

DNN models are getting larger and deeper
* Does not include multi-crop and ensemble
** Increase in FC layers due to squeeze-and-excitation layers (much smaller than FC layers for classification)

24

Page 25

Efficient Hardware Acceleration for Deep Neural Networks

25

Page 26

Properties We Can Leverage

• Operations exhibit high parallelism → high throughput possible
• Memory access is the bottleneck

(Figure: each MAC* requires three memory reads (filter weight, image pixel, partial sum) and one memory write (updated partial sum))

* multiply-and-accumulate

• Example: AlexNet has 724M MACs → 2896M DRAM accesses required (4 accesses per MAC) in the worst case, where all memory reads/writes go to DRAM at roughly 200x the energy of the MAC itself

26

Page 27

Properties We Can Leverage

• Operations exhibit high parallelism → high throughput possible
• Input data reuse opportunities (up to 500x):
  – Convolutional reuse (activations, weights): CONV layers only (sliding window)
  – Fmap reuse (activations): CONV and FC layers
  – Filter reuse (weights): CONV and FC layers (batch size > 1)

Page 28

Exploit Data Reuse at Low-Cost Memories

Memory hierarchy and normalized energy cost* to fetch data to run a MAC at the ALU:
  RF (register file, 0.5 – 1.0 kB) → ALU      1× (reference)
  PE → PE (NoC: 200 – 1000 PEs) → ALU         2×
  Global Buffer (100 – 500 kB) → ALU          6×
  DRAM → ALU                                  200×

* measured from a commercial 65nm process

Farther and larger memories consume more power
Specialized hardware with small (< 1 kB) low-cost memory near compute

28

Page 29

Weight Stationary (WS)

• Minimize weight read energy consumption
  − maximize convolutional and filter reuse of weights
• Broadcast activations and accumulate partial sums spatially across the PE array
• Examples: TPU [Jouppi, ISCA 2017], NVDLA

(Figure: weights W0–W7 held stationary in the PEs; activations and psums flow through the array from/to the Global Buffer)

[Chen, ISCA 2016]

29

Page 30

Output Stationary (OS)

• Minimize partial sum R/W energy consumption
  − maximize local accumulation
• Broadcast/multicast filter weights and reuse activations spatially across the PE array
• Examples: [Moons, VLSI 2016], [Thinker, VLSI 2017]

(Figure: partial sums P0–P7 held stationary in the PEs; activations and weights flow through the array from the Global Buffer)

[Chen, ISCA 2016]

30

Page 31

Input Stationary (IS)

• Minimize activation read energy consumption
  − maximize convolutional and fmap reuse of activations
• Unicast weights and accumulate partial sums spatially across the PE array
• Example: [SCNN, ISCA 2017]

(Figure: input activations I0–I7 held stationary in the PEs; weights and psums flow through the array from/to the Global Buffer)

[Chen, ISCA 2016]
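These dataflows can be thought of as different loop orderings over the same MACs. A minimal sketch (my own illustration using a 1-D convolution, not from the slides) contrasting weight-stationary and output-stationary loop orders:

```python
import numpy as np

# Illustrative 1-D convolution: the same MACs, ordered so that either the
# weight or the output psum stays "stationary" in the inner loop,
# mimicking the WS and OS dataflows.
x = np.random.randn(16)      # input activations
w = np.random.randn(3)       # filter weights
E = len(x) - len(w) + 1

# Weight stationary: hold one weight, stream all activations that use it.
y_ws = np.zeros(E)
for s in range(len(w)):          # the weight w[s] stays fixed across the inner loop
    for e in range(E):
        y_ws[e] += w[s] * x[e + s]

# Output stationary: hold one output psum, accumulate all of its MACs locally.
y_os = np.zeros(E)
for e in range(E):               # the psum y_os[e] stays fixed across the inner loop
    for s in range(len(w)):
        y_os[e] += w[s] * x[e + s]

assert np.allclose(y_ws, y_os)   # same result, different data reuse pattern
```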

31

Page 32

Row Stationary Dataflow

• Maximize row convolutional reuse in RF
  − keep a filter row and a fmap sliding window in the RF
• Maximize row psum accumulation in RF

(Figure: PE 1 computes output Row 1 as filter Row 1 * fmap Row 1)

[Chen, ISCA 2016] Selected for Micro Top Picks

32

Page 33

Row Stationary Dataflow

Optimize for overall energy efficiency instead of for only a certain data type

(Figure: a 3×3 PE array; each output row is computed by a column of three PEs, each multiplying one filter row with one fmap row and accumulating the psums, e.g., output Row 1 = filter Row 1 * fmap Row 1 + filter Row 2 * fmap Row 2 + filter Row 3 * fmap Row 3)

[Chen, ISCA 2016] Selected for Micro Top Picks

33

Page 34

Dataflow Comparison: CONV Layers

(Chart: normalized energy per MAC (0 to 2), broken down into psums, weights, and pixels, for the CNN dataflows WS, OSA, OSB, OSC, NLR, and RS)

RS optimizes for the best overall energy efficiency

[Chen, ISCA 2016]

34

Page 35

Exploit Sparsity

Method 1. Skip memory access and computation
(Figure: when the data equals zero, disable the register file (no read/write, no switching) and skip the MAC via zero data skipping)

Method 2. Compress data to reduce storage and data movement
(Chart: DRAM access (MB) for AlexNet CONV layers 1–5, uncompressed vs. RLE-compressed fmaps + weights; compression reduces DRAM access by 1.2× to 1.9× per layer)

45% power reduction

[Chen, ISSCC 2016]
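A minimal sketch of the two methods (my own NumPy illustration; the hardware implements these with a zero-detect flag and run-length encoding, not Python):

```python
import numpy as np

acts = np.array([0.0, 0.0, 0.7, 0.0, 1.2, 0.0, 0.0, 0.3])  # sparse activations
w = 0.5

# Method 1: zero data skipping -- skip the memory access and the MAC when the input is 0.
psum = 0.0
for a in acts:
    if a != 0.0:          # in hardware: a zero flag gates the register file and the MAC
        psum += w * a

# Method 2: run-length encode the zeros to shrink storage and data movement.
def rle_encode(x):
    out = []              # list of (run_of_zeros, nonzero_value) pairs
    run = 0
    for v in x:
        if v == 0.0:
            run += 1
        else:
            out.append((run, v))
            run = 0
    return out

print(rle_encode(acts))   # [(2, 0.7), (1, 1.2), (2, 0.3)]
```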

35

Page 36

Eyeriss: Deep Neural Network Accelerator

(Die photo: 4mm × 4mm chip with an on-chip buffer and a spatial PE array)

[Joint work with Joel Emer]

Results for AlexNet

Overall >10x energy reduction compared to a mobile GPU (Nvidia TK1)

Exploits data reuse for 100x reduction in memory accesses from global buffer and 1400x reduction in memory accesses from off-chip DRAM

Eyeriss Project Website: http://eyeriss.mit.edu

[Chen, ISSCC 2016]

36

Page 37

Features: Energy vs. Accuracy

(Chart: energy/pixel (nJ, log scale from 0.1 to 10000) vs. accuracy (average precision, 0 to 80), measured on the VOC 2007 dataset; video compression and HOG (1) sit at the low-energy end, while AlexNet (2) and VGG16 (2) reach higher accuracy at orders of magnitude higher energy; energy grows exponentially while accuracy grows linearly)

Measured in 65nm*
1. DPM v5 [Girshick, 2012]
2. Fast R-CNN [Girshick, CVPR 2015]
* Only feature extraction. Does not include data, classification energy, augmentation and ensemble, etc.

(Die photos: the 4mm × 4mm feature-extraction chip and Eyeriss)

[Suleiman, VLSI 2016] (1), [Chen, ISSCC 2016] (2)

[Suleiman*, Chen*, ISCAS 2017]

37

Page 38

Energy-Efficient Processing of DNNs

V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, Dec. 2017
Book coming Spring 2020!

A significant amount of algorithm and hardware research on energy-efficient processing of DNNs
We identified various limitations to existing approaches

http://eyeriss.mit.edu/tutorial.html

38

Page 39

Design of Efficient DNN Algorithms

• Popular efficient DNN algorithm approaches:
  – Network Pruning (Figure: pruning synapses and neurons, before vs. after pruning)
  – Efficient Network Architectures (Figure: decomposing an R×S×C filter into smaller filters); examples: SqueezeNet, MobileNet
  – ... also reduced precision

• Focus on reducing number of MACs and weights
• Does it translate to energy savings and reduced latency?

[Chen*, Yang*, SysML 2018]

39

Page 40

Data Movement is Expensive

Energy of a weight access depends on the memory hierarchy and dataflow

(Figure: the same memory hierarchy as before, from DRAM to Global Buffer (100 – 500 kB) to NoC (200 – 1000 PEs) to RF (0.5 – 1.0 kB) to ALU; normalized energy cost* to fetch data for a MAC: DRAM 200×, PE-to-PE 2×, RF 1× (reference))

* measured from a commercial 65nm process

40

Page 41

Energy-Evaluation Methodology

Inputs: the DNN shape configuration (# of channels, # of filters, etc.) and the DNN weights and input data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, ...])

Memory access optimization → # of accesses at memory levels 1 … n → Edata
# of MACs calculation → # of MACs → Ecomp
Combining these counts with the hardware energy cost of each MAC and memory access gives the DNN energy consumption per layer (L1, L2, L3, ...)

Tool available at: https://energyestimation.mit.edu/
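A minimal sketch of this kind of energy model (my own illustration; the per-access costs below are placeholders, not the tool's calibrated values):

```python
# Energy = Ecomp + Edata: MAC count times MAC cost, plus accesses per memory
# level times the cost of that level (placeholder costs, normalized to 1 MAC).
cost_per_access = {"RF": 1.0, "NoC": 2.0, "buffer": 6.0, "DRAM": 200.0}

def layer_energy(num_macs, accesses, mac_cost=1.0):
    """accesses: dict mapping memory level -> number of accesses for this layer."""
    e_comp = num_macs * mac_cost
    e_data = sum(accesses[level] * cost_per_access[level] for level in accesses)
    return e_comp + e_data

# Example: a layer with 1M MACs and an assumed access profile.
print(layer_energy(1_000_000, {"RF": 3_000_000, "buffer": 200_000, "DRAM": 50_000}))
```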

[Yang, CVPR 2017]

41

Page 42

Key Observations

• Number of weights alone is not a good metric for energy
• All data types should be considered

Energy consumption of GoogLeNet: Output Feature Map 43%, Input Feature Map 25%, Weights 22%, Computation 10%

[Yang, CVPR 2017]

42

Page 43

Energy-Aware Pruning

Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings

• Sort layers based on energy and prune layers that consume the most energy first
• EAP reduces AlexNet energy by 3.7x and outperforms the previous work that uses magnitude-based pruning by 1.7x

(Chart: normalized AlexNet energy (x10^9): original (Ori.), magnitude-based pruning (DC, 2.1x reduction), energy-aware pruning (EAP, 3.7x reduction))

Pruned models available at http://eyeriss.mit.edu/energy.html

[Yang, CVPR 2017]
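A minimal sketch of the idea (my own illustration; the actual method in [Yang, CVPR 2017] uses the energy model above and retrains the network after pruning):

```python
import numpy as np

def prune_energy_aware(layers, layer_energy, sparsity=0.3):
    """layers: dict name -> weight array; layer_energy: dict name -> estimated energy.
    Visit the most energy-hungry layers first and zero out the smallest-magnitude weights."""
    for name in sorted(layers, key=lambda n: layer_energy[n], reverse=True):
        w = layers[name]
        threshold = np.quantile(np.abs(w), sparsity)    # magnitude cutoff within the layer
        layers[name] = np.where(np.abs(w) < threshold, 0.0, w)
    return layers

layers = {"conv1": np.random.randn(96, 3, 11, 11), "fc6": np.random.randn(4096, 9216)}
energy = {"conv1": 1.2e9, "fc6": 3.4e9}                 # placeholder energy estimates
pruned = prune_energy_aware(layers, energy)
```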

43

Page 44

# of Operations vs. Latency

• # of operations (MACs) does not approximate latency well

Source: Google (https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html)

44

Page 45

NetAdapt: Platform-Aware DNN Adaptation

• Automatically adapt a DNN to a mobile platform to reach a target latency or energy budget
• Use empirical measurements to guide the optimization (avoid modeling of the tool chain or platform architecture)

In collaboration with Google's Mobile Vision Team

(Figure: NetAdapt takes a pretrained network and a budget (e.g., latency 3.8, energy 10.5), generates network proposals A–Z, measures each proposal empirically on the platform (e.g., latency 15.6 … 14.3, energy 41 … 46), and outputs an adapted network)

Code available at http://netadapt.mit.edu [Yang, ECCV 2018]

45

Page 46

Simplified Example of One Iteration

1. Input: network from the previous iteration (latency: 100ms, budget: 80ms)
2. Meet Budget: simplify different layers (e.g., Layer 1, ..., Layer 4) to generate proposals that meet the 80ms budget (100ms → 90ms → 80ms)
3. Maximize Accuracy: among the proposals, select the one with the highest accuracy (e.g., 60% vs. 40%)
4. Output: network for the next iteration (latency: 80ms, budget: 60ms)

[Yang, ECCV 2018]
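A minimal sketch of the overall loop (my own paraphrase of the figure and example above; simplify_layer, measure_latency, and finetune_and_eval are placeholder hooks, and the real algorithm in [Yang, ECCV 2018] also includes short- and long-term fine-tuning):

```python
def netadapt(network, initial_latency, target_latency, step_ms,
             simplify_layer, measure_latency, finetune_and_eval):
    """Iteratively tighten the per-iteration budget until the target latency is met."""
    budget = initial_latency
    while budget > target_latency:
        budget -= step_ms                                  # e.g., 100ms -> 80ms -> 60ms
        proposals = []
        for layer in network.layers:                       # one proposal per simplified layer
            candidate = simplify_layer(network, layer)
            while measure_latency(candidate) > budget:     # keep simplifying to meet the budget
                candidate = simplify_layer(candidate, layer)
            proposals.append((finetune_and_eval(candidate), candidate))
        accuracy, network = max(proposals, key=lambda p: p[0])  # maximize accuracy
    return network
```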

46

Page 47

Improved Latency vs. Accuracy Tradeoff

• NetAdapt boosts the real inference speed of MobileNet by up to 1.7x with higher accuracy

(Chart: accuracy vs. measured latency*; NetAdapt reaches +0.3% accuracy at 1.7x faster and +0.3% accuracy at 1.6x faster than the reference curves)

References:
MobileNet: Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv 2017
MorphNet: Gordon et al., "MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks," CVPR 2018

* Tested on the ImageNet dataset and a Google Pixel 1 CPU

[Yang, ECCV 2018]

47

Page 48

FastDepth: Fast Monocular Depth Estimation

Depth estimation from a single RGB image is desirable due to the relatively low cost and size of monocular cameras.
(Figure: RGB input and depth prediction)

[Joint work with Sertac Karaman]

Auto-encoder DNN architecture (dense output): reduction (similar to classification) followed by expansion

48

Page 49

FastDepth: Fast Monocular Depth Estimation

Apply NetAdapt, compact network design, and depthwise decomposition to the decoder layers to enable depth estimation at high frame rates on an embedded platform while still maintaining accuracy

[Wofk*, Ma*, ICRA 2019]

Configuration: Batch size of one (32-bit float)

Models available at http://fastdepth.mit.edu

> 10x

~40fps on an iPhone

49

Page 50

Many Efficient DNN Design Approaches

Network Pruning (Figure: pruning synapses and neurons, before vs. after pruning)

Efficient Network Architectures (Figure: decomposing a convolutional layer into channel groups, depth-wise, and point-wise layers)

Reduce Precision (Figure: 32-bit float → 8-bit fixed → binary)

No guarantee that the DNN algorithm designer will use a given approach.
Need flexible hardware!

[Chen*, Yang*, SysML 2018]
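For the reduce-precision approach, a minimal sketch of uniform 8-bit fixed-point quantization (my own illustration; real quantization schemes also handle zero points, per-channel scales, and retraining):

```python
import numpy as np

def quantize_int8(x):
    """Uniform symmetric quantization of a float32 tensor to int8 plus a scale."""
    scale = np.max(np.abs(x)) / 127.0           # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64).astype(np.float32)      # 32-bit float weights
q, scale = quantize_int8(w)                      # 8-bit fixed-point weights
w_hat = q.astype(np.float32) * scale             # dequantized approximation
print(np.max(np.abs(w - w_hat)))                 # quantization error
```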

50

Page 51

Existing DNN Architectures

• Specialized DNN hardware often relies on certain properties of the DNN in order to achieve high energy efficiency
• Example: reduce memory access by amortizing reads across the MAC array

(Figure: a weight memory and an activation memory feed a MAC array, with weight reuse along one dimension of the array and activation reuse along the other)

51

Page 52

Limitation of Existing DNN Architectures

• Example: reuse and array utilization depend on the # of channels and the feature map/batch size
  – Not efficient across all network architectures (e.g., compact DNNs)

(Figure: a MAC array with spatial accumulation is dimensioned by # of input channels × # of filters (output channels); a MAC array with temporal accumulation is dimensioned by feature map or batch size × # of filters)

52

Page 53

Limitation of Existing DNN Architectures

• Example: reuse and array utilization depend on the # of channels and the feature map/batch size
  – Not efficient across all network architectures (e.g., compact DNNs)

(Figure: the same MAC arrays as the previous slide, with an example mapping of a depth-wise layer, which has only a single input channel per filter)

53

Page 54

Limitation of Existing DNN Architectures

• Example: reuse and array utilization depend on the # of channels and the feature map/batch size
  – Not efficient across all network architectures (e.g., compact DNNs)
  – Less efficient as the array scales up in size
  – Can be challenging to exploit sparsity

(Figure: MAC arrays with spatial and temporal accumulation, as on the previous slides)

54

Page 55

Need Flexible Dataflow

• Use a flexible dataflow (Row Stationary) to exploit reuse in any dimension of the DNN to increase energy efficiency and array utilization

Example: depth-wise layer

55

Page 56

Need Flexible NoC for Varying Reuse

• When reuse is available, need multicast to exploit spatial data reuse for energy efficiency and high array utilization
• When reuse is not available, need unicast for high bandwidth (e.g., weights for FC layers) and high PE utilization
• An all-to-all network satisfies both but is too expensive and not scalable

(Figure: NoC options between the Global Buffer and the PE array, ordered from high bandwidth / low spatial reuse to low bandwidth / high spatial reuse: unicast networks, 1D systolic networks, 1D multicast networks, broadcast network)

[Chen, JETCAS 2019]

56

Page 57

Hierarchical Mesh

(Figure: the NoC is organized into GLB clusters, router clusters, and PE clusters, with a mesh network between clusters and an all-to-all network within each cluster; it can be configured for high bandwidth, high reuse, grouped multicast, or interleaved multicast)

[Chen, JETCAS 2019]

57

Page 58

Eyeriss v2: Balancing Flexibility and Efficiency

Over an order of magnitude faster and more energy efficient than Eyeriss v1

Speed-up over Eyeriss v1 scales with the number of PEs:
  # of PEs     256       1024      16384
  AlexNet      17.9x     71.5x     1086.7x
  GoogLeNet    10.4x     37.8x     448.8x
  MobileNet    15.7x     57.9x     873.0x

Efficiently supports:
• Wide range of filter shapes – large and compact
• Different layers – CONV, FC, depth-wise, etc.
• Wide range of sparsity – dense and sparse
• Scalable architecture

[Joint work with Joel Emer]

[Chen, JETCAS 2019]

58

Page 59

Looking Beyond the DNN Accelerator for Acceleration

59

Page 60

Super-Resolution on Mobile Devices

Use super-resolution to improve the viewing experience of lower-resolution content (reduce communication bandwidth)

Screens are getting larger

(Figure: transmit a low-resolution stream for lower bandwidth, then use super-resolution on the device for high-resolution playback)

60

Page 61

FAST: A Framework to Accelerate SuperRes

A framework that accelerates any SR algorithm by up to 15x when running on compressed videos

(Figure: compressed video → FAST + SR algorithm → 15x faster, real-time SR)

[Zhang, CVPRW 2017]

61

Page 62

Free Information in Compressed Videos

(Figure: decoding a compressed video yields the pixels (a stack of frames) plus the block structure and motion compensation embedded in the compressed representation)

This representation can help accelerate super-resolution

[Zhang, CVPRW 2017]

62

Page 63

Transfer is Lightweight

(Figure: SR is run on one low-res frame to produce a high-res frame; for the following frames, the high-res output is transferred using the block structure and motion compensation information from the compressed video (fractional interpolation, bicubic interpolation, skip flags))

The complexity of the transfer is comparable to bicubic interpolation.
Transfer N frames, accelerate by N.
Transfer allows SR to run on only a subset of frames.
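A minimal sketch of the idea (my own illustration; run_sr and motion_compensate are placeholder hooks, and the real FAST transfer also exploits the block structure, fractional interpolation, and skip flags):

```python
def fast_superres(low_res_frames, motion_vectors, run_sr, motion_compensate, n=4):
    """Run SR on every n-th frame; transfer the high-res result to the next n-1 frames."""
    high_res = []
    for i, frame in enumerate(low_res_frames):
        if i % n == 0:
            hr = run_sr(frame)                 # expensive SR only on anchor frames
        else:
            # cheap transfer: warp the previous high-res frame using the decoder's
            # motion information (complexity comparable to bicubic interpolation)
            hr = motion_compensate(high_res[-1], motion_vectors[i])
        high_res.append(hr)
    return high_res                             # roughly n-fold fewer SR invocations
```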

[Zhang, CVPRW 2017]

63

Page 64

Evaluation: Accelerating SRCNN

[Zhang, CVPRW 2017]

64

Page 65

Visual Evaluation

SRCNN FAST + SRCNN

Bicubic

Code released at www.rle.mit.edu/eems/fast

Look beyond the DNN accelerator for opportunities to accelerate DNN processing (e.g., structure of data and temporal correlation)

[Zhang, CVPRW 2017]

65

Page 66

Beyond Deep Neural Networks

66

Page 67

Visual-Inertial Localization

Determines the location/orientation of the robot from images and IMU (Inertial Measurement Unit) data

(Figure: an image sequence and IMU data feed Visual-Inertial Odometry (VIO)*, which supports localization and mapping)

* Subset of SLAM algorithm (Simultaneous Localization And Mapping)

67

Page 68

Localization at Under 25 mW

First chip that performs complete Visual-Inertial Odometry
Consumes 684× and 1582× less energy than mobile and desktop CPUs, respectively

[Zhang et al., RSS 2017], [Suleiman et al., VLSI 2018]
[Joint work with Sertac Karaman]

Navion:
• Front-end for camera (feature detection, tracking, and outlier elimination)
• Front-end for IMU (pre-integration of accelerometer and gyroscope data)
• Back-end optimization of pose graph

Navion Project Website: http://navion.mit.edu

68

Page 69

Key Methods to Reduce Data Size

(Block diagram: the Vision Frontend (VFE) processes the left and right frames through undistort & rectify (UR), sparse stereo (SS), feature detection (FD), and feature tracking (FT) with line buffers for the previous and current frames, plus RANSAC in fixed-point arithmetic, producing a point cloud; the IMU Frontend (IFE) performs pre-integration in floating-point arithmetic from IMU memory; the Backend (BE) builds the graph, linearizes, marginalizes, and retracts, and runs a linear solver (matrix operations, Cholesky, back-substitution) over the horizon states in shared memory, all connected by data & control buses)

Key methods:
• Apply low-cost frame compression
• Exploit sparsity in the graph and linear solver
• Use compression and exploit sparsity to reduce memory down to 854 kB

Navion: fully integrated system – no off-chip processing or storage

[Suleiman, VLSI-C 2018] Best Student Paper Award

69

Page 70

Where to Go Next: Planning and Mapping

Robot exploration: decide where to go by computing Shannon Mutual Information

(Figure: exploration loop: select candidate scan locations, compute the Shannon MI and choose the best location, move to that location and scan, then update the occupancy map; illustrated with an occupancy map, a mutual information map/surface, and the updated map)

Exploration with a mini race car using motion capture for localization
(Figure: occupancy map with the planned path and the MI surface)

[Zhang, ICRA 2019]

70

Page 71

Challenge is Data Delivery to All Cores

Process multiple beams in parallel (Cores 1 to N)

(Figure: N cores all request map data from a memory with only two read ports)

Data delivery from memory is limited

71

Page 72

Specialized Memory Architecture

Break the map up into separate memory banks with a novel storage pattern to minimize read conflicts when processing different beams in parallel.

Compute the mutual information for an entire map of 20m x 20m at 0.1m resolution in under a second → a 100x speed-up versus a CPU at 1/10th of the power.

[Joint work with Sertac Karaman]

(Figure: the memory access pattern of beams across the X-Y map, and a diagonal banking pattern that spreads neighboring map cells across banks 0–7 so that parallel beam reads fall into different banks)

[Li, RSS 2019]
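A minimal sketch of one possible diagonal banking scheme (my own illustration of the idea; the exact mapping used in [Li, RSS 2019] may differ):

```python
NUM_BANKS = 8

def bank_of(x, y):
    # Diagonal banking: cells along a row or column land in different banks,
    # so a beam sweeping through the map rarely hits the same bank twice in a row.
    return (x + y) % NUM_BANKS

# Two beams reading vertically adjacent cells hit different banks.
print(bank_of(3, 4), bank_of(3, 5))   # 7 0
```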

72

Page 73

Monitoring Neurodegenerative Disorders

• Neuropsychological assessments are time consuming and require a trained specialist
• Repeat medical assessments are sparse, mostly qualitative, and suffer from high retest variability

Mini-Mental State Examination (MMSE)
  Q1. What is the year? Season? Date?
  Q2. Where are you now? State? Floor?
  Q3. Could you count backward from 100 by sevens? (93, 86, ...)
Clock-drawing test
Agrell et al., Age and Ageing, 1998.

[Joint work with Thomas Heldt and Charlie Sodini]

Dementia affects 50 million people worldwide today (75 million in 10 years) [World Alzheimer’s Report]

73

Page 74

Use Eye Movements for Quantitative Evaluation

(Figure: clinical setup with a high-speed camera (Phantom v25-11), substantial head support, IR illumination, and an SR EYELINK 1000 PLUS eye tracker)

Reulen et al., Med. & Biol. Eng. & Comp, 1988.

Clinical measurements of saccade latency are done in constrained environments that rely on specialized, costly equipment.

Eye movements can be used to quantitatively evaluate severity, progression or regression of neurodegenerative diseases

74

Page 75

Measure Eye Movements Using a Phone

[Saavedra Peña, EMBC 2018] [Lai, ICIP 2018]

Develop an algorithm to measure eye movement using a consumer-grade camera rather than a high-cost research-grade camera.
Enable low-cost in-home longitudinal measurements.

(Figure: eye movements captured with a smartphone; histograms of count vs. reaction time (milliseconds) of an eye-movement feature compare the Phantom camera ($100k) with an iPhone 6 (< $1k))

75

Page 76

Looking For Volunteers for Eye Reaction Time Study

If you are near or on the MIT campus and interested in volunteering your eye movements for this study, please contact us at [email protected]

Page 77

Low Power 3D Time-of-Flight Imaging

• Pulsed Time of Flight: measure distance using the round-trip time of laser light for each image pixel
  – Illumination + imager power: 2.5 – 20 W for a range of 1 – 8 m
• Use computer vision techniques and passive images to estimate changes in depth without turning on the laser
  – CMOS imaging sensor power: < 350 mW

Estimated depth maps: real-time performance on an embedded processor (VGA @ 30 fps on a Cortex-A7, < 0.5 W active power)

[Noraky, ICIP 2017]

77

Page 78

Results of Low Power Depth ToF Imaging

(Figure: RGB image, ground-truth depth map, and estimated depth map)

Mean relative error: 0.7%
Duty cycle (on-time of laser): 11%

[Noraky, ICIP 2017]

78

Page 79

Summary

• Efficient computing extends the reach of AI beyond the cloud by reducing communication requirements, enabling privacy, and providing low latency, so that AI can be used in a wide range of applications ranging from robotics to health care.

• Cross-layer design with specialized hardware enables energy-efficient AI and will be critical to the progress of AI over the next decade.

Today’s slides available at https://tinyurl.com/SzeMITDL2020

79

Page 80

Additional Resources

Overview Paper: V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, Dec. 2017

Book coming Spring 2020!

More info about the tutorial on DNN architectures: http://eyeriss.mit.edu/tutorial.html

For updates: EEMS Mailing List

80

Page 81

Additional Resources

MIT Professional Education Course on

“Designing Efficient Deep Learning Systems” http://shortprograms.mit.edu/dls

Next Offering: July 20-21, 2020 on MIT Campus

81

Page 82

Additional Resources

Talks and tutorial available online: https://www.rle.mit.edu/eems/publications/tutorials/

YouTube Channel: EEMS Group – PI: Vivienne Sze

82

Page 83

Acknowledgements

Research conducted in the MIT Energy-Efficient Multimedia Systems Group would not be possible without the support of the following organizations:

Joel Emer

Sertac Karaman

Thomas Heldt

Mailing List: http://mailman.mit.edu/mailman/listinfo/eems-news

83

Page 84

References

• Energy-Efficient Hardware for Deep Neural Networks
  – Project website: http://eyeriss.mit.edu
  – Y.-H. Chen, T. Krishna, J. Emer, V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid-State Circuits (JSSC), ISSCC Special Issue, Vol. 52, No. 1, pp. 127-138, January 2017.
  – Y.-H. Chen, J. Emer, V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," International Symposium on Computer Architecture (ISCA), pp. 367-379, June 2016.
  – Y.-H. Chen, T.-J. Yang, J. Emer, V. Sze, "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), June 2019.
  – Eyexam: https://arxiv.org/abs/1807.07928

• Limitations of Existing Efficient DNN Approaches
  – Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, "Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks," SysML Conference, February 2018.
  – V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, December 2017.
  – Hardware Architecture for Deep Neural Networks: http://eyeriss.mit.edu/tutorial.html

Page 85

References

• Co-Design of Algorithms and Hardware for Deep Neural Networks
  – T.-J. Yang, Y.-H. Chen, V. Sze, "Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  – Energy estimation tool: http://eyeriss.mit.edu/energy.html
  – T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications," European Conference on Computer Vision (ECCV), 2018.
  – D. Wofk*, F. Ma*, T.-J. Yang, S. Karaman, V. Sze, "FastDepth: Fast Monocular Depth Estimation on Embedded Systems," IEEE International Conference on Robotics and Automation (ICRA), May 2019. http://fastdepth.mit.edu/

• Energy-Efficient Visual-Inertial Localization
  – Project website: http://navion.mit.edu
  – A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, "Navion: A Fully Integrated Energy-Efficient Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones," IEEE Symposium on VLSI Circuits (VLSI-Circuits), June 2018.
  – Z. Zhang*, A. Suleiman*, L. Carlone, V. Sze, S. Karaman, "Visual-Inertial Odometry on Chip: An Algorithm-and-Hardware Co-design Approach," Robotics: Science and Systems (RSS), July 2017.
  – A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, "Navion: A 2mW Fully Integrated Real-Time Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones," IEEE Journal of Solid-State Circuits (JSSC), VLSI Symposia Special Issue, Vol. 54, No. 4, pp. 1106-1119, April 2019.

Page 86

References

• Fast Shannon Mutual Information for Robot Exploration
  – Z. Zhang, T. Henderson, V. Sze, S. Karaman, "FSMI: Fast Computation of Shannon Mutual Information for Information-Theoretic Mapping," IEEE International Conference on Robotics and Automation (ICRA), May 2019.
  – P. Li*, Z. Zhang*, S. Karaman, V. Sze, "High-throughput Computation of Shannon Mutual Information on Chip," Robotics: Science and Systems (RSS), June 2019.

• Low Power Time-of-Flight Imaging
  – J. Noraky, V. Sze, "Low Power Depth Estimation of Rigid Objects for Time-of-Flight Imaging," IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2019.
  – J. Noraky, V. Sze, "Depth Estimation of Non-Rigid Objects for Time-of-Flight Imaging," IEEE International Conference on Image Processing (ICIP), October 2018.
  – J. Noraky, V. Sze, "Low Power Depth Estimation for Time-of-Flight Imaging," IEEE International Conference on Image Processing (ICIP), September 2017.

• Monitoring Neurodegenerative Disorders Using a Phone
  – H.-Y. Lai, G. Saavedra Peña, C. Sodini, T. Heldt, V. Sze, "Enabling Saccade Latency Measurements with Consumer-Grade Cameras," IEEE International Conference on Image Processing (ICIP), October 2018.
  – G. Saavedra Peña, H.-Y. Lai, V. Sze, T. Heldt, "Determination of Saccade Latency Distributions Using Video Recordings from Consumer-Grade Devices," IEEE International Engineering in Medicine and Biology Conference (EMBC), 2018.
  – H.-Y. Lai, G. Saavedra Peña, C. Sodini, V. Sze, T. Heldt, "Measuring Saccade Latency Using Smartphone Cameras," IEEE Journal of Biomedical and Health Informatics (JBHI), March 2020.