
Efficient Computing for Deep Learning, AI and Robotics

Vivienne Sze (@eems_mit)
Massachusetts Institute of Technology

In collaboration with Luca Carlone, Yu-Hsin Chen, Joel Emer, Sertac Karaman, Tushar Krishna, Thomas Heldt, Trevor Henderson, Hsin-Yu Lai, Peter Li, Fangchang Ma, James Noraky, Gladynel Saavedra Peña, Charlie Sodini, Amr Suleiman, Nellie Wu, Diana Wofk, Tien-Ju Yang, Zhengdong Zhang

Slides available at https://tinyurl.com/SzeMITDL2020



Compute Demands for Deep Neural Networks

AlexNet to AlphaGo Zero: A 300,000x Increase in Compute

[Figure: training compute in petaflop/s-days (exponential scale) versus year]

Source: OpenAI (https://openai.com/blog/ai-and-compute/)


Compute Demands for Deep Neural Networks

[Strubell, ACL 2019]


Processing at “Edge” instead of the “Cloud”

Communication Privacy Latency



Computing Challenge for Self-Driving Cars

(Feb 2018)

Cameras and radar generate ~6 gigabytes of data every 30 seconds.

Self-driving car prototypes use approximately 2,500 watts of computing power. This generates waste heat, and some prototypes need water cooling!


Existing Processors Consume Too Much Power

< 1 Watt > 10 Watts



Transistors are NOT Getting More Efficient

Slowdown of Moore’s Law and Dennard Scaling

General-purpose microprocessors are not getting faster or more efficient

• Need specialized hardware for significant improvement in speed and energy efficiency

• Redesign computing hardware from the ground up!


Popularity of Specialized Hardware for DNNs

Big Bets On A.I. Open a New Frontier for Chips Start-Ups, Too. (January 14, 2018)

“Today, at least 45 start-ups are working on chips that can power tasks like speech and self-driving cars, and at least five of them have raised more than $100 million from investors. Venture capitalists invested more than $1.5 billion in chip start-ups last year, nearly doubling the investments made two years ago, according to the research firm CB Insights.”


Power Dominated by Data Movement

Operation              Energy (pJ)   Area (µm²)
8b Add                 0.03          36
16b Add                0.05          67
32b Add                0.1           137
16b FP Add             0.4           1360
32b FP Add             0.9           4184
8b Mult                0.2           282
32b Mult               3.1           3495
16b FP Mult            1.1           1640
32b FP Mult            3.7           7700
32b SRAM Read (8KB)    5             N/A
32b DRAM Read          640           N/A

[Figure: relative energy cost per operation (log scale, 1 to 10^4) and relative area cost per operation (log scale, 1 to 10^3)]

[Horowitz, ISSCC 2014]

Memory access is orders of magnitude higher energy than compute


Autonomous Navigation Uses a Lot of Data

• Geometric Understanding

- Growing map size

2 million pixels 10x-100x more pixels

• Semantic Understanding

- High frame rate
- Large resolutions
- Data expansion

[Pire, RAS 2017]



Understanding the Environment

Depth Estimation, Semantic Segmentation

State-of-the-art approaches use Deep Neural Networks, which require up to several hundred million operations and weights to compute!

>100x more complex than video compression


Deep Neural Networks

Computer Vision Speech Recognition

Game Play Medical

Deep Neural Networks (DNNs) have become a cornerstone of AI



What Are Deep Neural Networks?

Input: Image

Output: “Volvo XC90”

Low Level Features, High Level Features

Modified Image Source: [Lee, CACM 2011]


Weighted Sum

$$Y_j = \mathrm{activation}\left(\sum_{i=1}^{3} W_{ij} \times X_i\right)$$

[Figure: fully connected network with inputs X1 to X3, outputs Y1 to Y4, and weights W11 to W34; layers labeled input layer, hidden layer, and output layer]

Key operation is multiply and accumulate (MAC), which accounts for > 90% of computation

Nonlinear Activation Function

Sigmoid: $y = 1/(1 + e^{-x})$

Rectified Linear Unit (ReLU): $y = \max(0, x)$

Image source: Caffe tutorial
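As an illustration of the weighted sum and the two activation functions above, here is a minimal Python sketch of one fully connected layer that also counts its MACs. The function names (`layer`, `relu`, `sigmoid`) and the example inputs and weights are illustrative assumptions, not taken from the slides.

```python
import math


def sigmoid(x):
    """Sigmoid activation: y = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified Linear Unit: y = max(0, x)."""
    return max(0.0, x)


def layer(x, w, activation):
    """Compute Y_j = activation(sum_i W_ij * X_i) and count the MACs.

    x is a list of inputs; w[i][j] is the weight from input i to output j.
    """
    num_outputs = len(w[0])
    y = []
    macs = 0
    for j in range(num_outputs):
        acc = 0.0
        for i in range(len(x)):
            acc += w[i][j] * x[i]  # one multiply-and-accumulate
            macs += 1
        y.append(activation(acc))
    return y, macs


# Hypothetical 3-input, 4-output layer, matching the X1..X3 / Y1..Y4 figure.
x = [1.0, 2.0, 3.0]
w = [[1.0, 0.0, -1.0, 2.0],
     [0.0, 1.0, -1.0, 0.0],
     [1.0, 1.0, 0.0, -1.0]]
y, macs = layer(x, w, relu)
print(y, macs)
```

With 3 inputs and 4 outputs the layer performs 3 × 4 = 12 MACs, and the activation itself is a small fraction of the work.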


Popular Types of Layers in DNNs

• Fully Connected Layer
– Feed forward, fully connected
– Multilayer Perceptron (MLP)

• Convolutional Layer
– Feed forward, sparsely connected w/ weight sharing
– Convolutional Neural Network (CNN)
– Typically used for images

• Recurrent Layer
– Feedback
– Recurrent Neural Network (RNN)
– Typically used for sequential data (e.g., speech, language)

• Attention Layer/Mechanism
– Attention (matrix multiply) + feed forward, fully connected
– Transformer [Vaswani, NeurIPS 2017]

[Figures: feed forward vs. feedback networks; fully connected vs. sparsely connected layers]


High-Dimensional Convolution in CNN

[Figure: a plane of input activations, a.k.a. input feature map (fmap), of size H × W, and a filter (weights) of size R × S]



[Figure: element-wise multiplication of the R × S filter (weights) with a window of the H × W input fmap, followed by partial sum (psum) accumulation, gives an output activation in the E × F output fmap]



Sliding Window Processing

[Figure: the R × S filter (weights) slides over the H × W input fmap; each position gives an output activation in the E × F output fmap]



Many Input Channels (C)

[Figure: the filter (R × S) and the input fmap (H × W) each have C channels and together produce an E × F output fmap]

AlexNet: 3 – 192 Channels (C)



Many Output Channels (M)

[Figure: many filters (M), each R × S with C channels, applied to the same H × W × C input fmap give an E × F output fmap with M channels]

AlexNet: 96 – 384 Filters (M)



Many Input fmaps (N), Many Output fmaps (N)

[Figure: the M filters (each R × S × C) are applied to N input fmaps (each H × W × C) to give N output fmaps (each E × F × M)]

Image batch size: 1 – 256 (N)


Define Shape for Each Layer

[Figure: filters (M filters, each R × S × C), input fmaps (N fmaps, each H × W × C), and output fmaps (N fmaps, each E × F × M)]

H – Height of input fmap (activations)
W – Width of input fmap (activations)
C – Number of 2-D input fmaps/filters (channels)
R – Height of 2-D filter (weights)
S – Width of 2-D filter (weights)
M – Number of 2-D output fmaps (channels)
E – Height of output fmap (activations)
F – Width of output fmap (activations)
N – Number of input fmaps/output fmaps (batch size)

Shape varies across layers
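The shape parameters above map directly onto a seven-level loop nest. The following is a minimal sketch in plain Python, assuming no padding, a configurable stride, and nested lists indexed as `ifmap[n][c][h][w]` and `filters[m][c][r][s]`; the function name `conv_layer` and these layout choices are illustrative assumptions rather than anything specified in the slides.

```python
def conv_layer(ifmap, filters, stride=1):
    """Naive CONV layer using the slide's shape parameters.

    ifmap:   N x C x H x W nested lists (input fmaps)
    filters: M x C x R x S nested lists (weights)
    Returns (ofmap, macs) where ofmap is N x M x E x F.
    Assumes no padding.
    """
    N, C = len(ifmap), len(ifmap[0])
    H, W = len(ifmap[0][0]), len(ifmap[0][0][0])
    M = len(filters)
    R, S = len(filters[0][0]), len(filters[0][0][0])
    E = (H - R) // stride + 1
    F = (W - S) // stride + 1

    ofmap = [[[[0.0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]
    macs = 0
    for n in range(N):                      # batch
        for m in range(M):                  # output channels (filters)
            for e in range(E):              # output rows
                for f in range(F):          # output columns
                    acc = 0.0
                    for c in range(C):      # input channels
                        for r in range(R):  # filter rows
                            for s in range(S):  # filter columns
                                acc += (ifmap[n][c][e * stride + r][f * stride + s]
                                        * filters[m][c][r][s])
                                macs += 1
                    ofmap[n][m][e][f] = acc
    return ofmap, macs
```

The MAC count is N × M × E × F × C × R × S, which is why the number of MACs changes so much as the shape varies across layers.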


Layers with Varying Shapes

MobileNetV3-Large Convolutional Layer Configurations

Block   Filter Size (RxS)   # Filters (M)   # Channels (C)
1       3x3                 16              3
…       …                   …               …
3       1x1                 64              16
3       3x3                 64              1
3       1x1                 24              64
…       …                   …               …
6       1x1                 120             40
6       5x5                 120             1
6       1x1                 40              120
…       …                   …               …

[Howard, ICCV 2019]


Popular DNN Models

Metrics                  LeNet-5   AlexNet   VGG-16    GoogLeNet (v1)   ResNet-50   EfficientNet-B4
Top-5 error (ImageNet)   n/a       16.4      7.4       6.7              5.3         3.7*
Input Size               28x28     227x227   224x224   224x224          224x224     380x380
# of CONV Layers         2         5         16        21 (depth)       49          96
  # of Weights           2.6k      2.3M      14.7M     6.0M             23.5M       14M
  # of MACs              283k      666M      15.3G     1.43G            3.86G       4.4G
# of FC layers           2         3         3         1                1           65**
  # of Weights           58k       58.6M     124M      1M               2M          4.9M
  # of MACs              58k       58.6M     124M      1M               2M          4.9M
Total Weights            60k       61M       138M      7M               25.5M       19M
Total MACs               341k      724M      15.5G     1.43G            3.9G        4.4G

References: LeNet-5 [Lecun, PIEEE 1998]; AlexNet [Krizhevsky, NeurIPS 2012]; VGG-16 [Simonyan, ICLR 2015]; GoogLeNet (v1) [Szegedy, CVPR 2015]; ResNet-50 [He, CVPR 2016]; EfficientNet-B4 [Tan, ICML 2019]

DNN models getting larger and deeper

* Does not include multi-crop and ensemble
** Increase in FC layers due to squeeze-and-excitation layers (much smaller than FC layers for classification)


Efficient Hardware Acceleration for Deep Neural Networks



Properties We Can Leverage

• Operations exhibit high parallelism → high throughput possible

• Memory Access is the Bottleneck

[Figure: each MAC* in the ALU reads a filter weight, an image pixel, and a partial sum from memory, and writes the updated partial sum back to memory; relative energy is 200x for a DRAM access versus 1x for the ALU]

* multiply-and-accumulate

Worst Case: all memory R/W are DRAM accesses

• Example: AlexNet has 724M MACs → 2896M DRAM accesses required
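The AlexNet figure above follows from simple arithmetic: each MAC needs three memory reads (filter weight, image pixel, partial sum) and one memory write (updated partial sum). A small sketch of that worst-case count, with illustrative names:

```python
READS_PER_MAC = 3   # filter weight, image pixel, partial sum
WRITES_PER_MAC = 1  # updated partial sum


def worst_case_dram_accesses(num_macs):
    """Worst case: every memory read and write of every MAC goes to DRAM."""
    return num_macs * (READS_PER_MAC + WRITES_PER_MAC)


alexnet_macs = 724_000_000
print(worst_case_dram_accesses(alexnet_macs))  # 2896000000, i.e. 2896M
```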


Properties We Can Leverage

• Operations exhibit high parallelism → high throughput possible

• Input data reuse opportunities (up to 500x)

Convolutional Reuse (Activations, Weights): CONV layers only (sliding window)

Fmap Reuse (Activations): CONV and FC layers

Filter Reuse (Weights): CONV and FC layers (batch size > 1)


Exploit Data Reuse at Low-Cost Memories

Specialized hardware with small (< 1 kB) low-cost memory near compute

Normalized Energy Cost* of fetching data to run a MAC at the ALU

Data path                              Energy
RF (0.5 – 1.0 kB) → ALU                1×
PE → ALU (NoC: 200 – 1000 PEs)         2×
Global Buffer (100 – 500 kB) → ALU     6×
DRAM → ALU                             200×
ALU                                    1× (Reference)

* measured from a commercial 65nm process

Farther and larger memories consume more power


Weight Stationary (WS)

• Minimize weight read energy consumption
− maximize convolutional and filter reuse of weights

• Broadcast activations and accumulate partial sums spatially across the PE array

• Examples: TPU [Jouppi, ISCA 2017], NVDLA

[Figure: weights W0 to W7 stay in the PEs; activations and psums move between the global buffer and the PE array]

[Chen, ISCA 2016]


Output Stationary (OS)

• Minimize partial sum R/W energy consumption
− maximize local accumulation

• Broadcast/Multicast filter weights and reuse activations spatially across the PE array

• Examples: [Moons, VLSI 2016], [Thinker, VLSI 2017]

[Figure: psums P0 to P7 stay in the PEs; activations and weights move between the global buffer and the PE array]

[Chen, ISCA 2016]


Input Stationary (IS)

• Minimize activation read energy consumption
− maximize convolutional and fmap reuse of activations

• Unicast weights and accumulate partial sums spatially across the PE array

• Example: [SCNN, ISCA 2017]

[Figure: input activations I0 to I7 stay in the PEs; weights and psums move between the global buffer and the PE array]

[Chen, ISCA 2016]


Row Stationary Dataflow

• Maximize row convolutional reuse in RF
− Keep a filter row and fmap sliding window in RF

• Maximize row psum accumulation in RF

[Figure: PE 1 convolves filter row 1 with fmap row 1 to produce psum row 1]

[Chen, ISCA 2016] Selected for Micro Top Picks


Row Stationary Dataflow

Optimize for overall energy efficiency instead of for only a certain data type

PE      Filter row   Fmap row   Output row
PE 1    Row 1        Row 1      Row 1
PE 2    Row 2        Row 2      Row 1
PE 3    Row 3        Row 3      Row 1
PE 4    Row 1        Row 2      Row 2
PE 5    Row 2        Row 3      Row 2
PE 6    Row 3        Row 4      Row 2
PE 7    Row 1        Row 3      Row 3
PE 8    Row 2        Row 4      Row 3
PE 9    Row 3        Row 5      Row 3

[Chen, ISCA 2016] Selected for Micro Top Picks
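The PE mapping above can be mimicked in software: each PE performs a 1-D convolution of one filter row with one fmap row, and the psum rows of the PEs that share an output row are added together. The sketch below is a functional model only (single channel, stride 1, no padding); it does not model the register files or the NoC, and the function names are illustrative assumptions.

```python
def conv1d_row(filter_row, fmap_row):
    """Work of one PE: slide a filter row across an fmap row (stride 1)."""
    S = len(filter_row)
    F = len(fmap_row) - S + 1
    return [sum(filter_row[s] * fmap_row[f + s] for s in range(S))
            for f in range(F)]


def row_stationary_conv2d(filt, fmap):
    """2-D convolution built from per-PE 1-D row convolutions.

    For output row e, the PE handling filter row r works on fmap row e + r,
    and the R psum rows are accumulated into one output row.
    """
    R = len(filt)
    E = len(fmap) - R + 1
    ofmap = []
    for e in range(E):
        out_row = None
        for r in range(R):
            psum_row = conv1d_row(filt[r], fmap[e + r])
            if out_row is None:
                out_row = psum_row
            else:
                out_row = [a + b for a, b in zip(out_row, psum_row)]
        ofmap.append(out_row)
    return ofmap


# 3x3 filter on a 5x5 fmap gives a 3x3 output, using 9 "PEs" as in the table.
filt = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
fmap = [[5 * i + j for j in range(5)] for i in range(5)]
print(row_stationary_conv2d(filt, fmap))
```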


Dataflow Comparison: CONV Layers

[Figure: normalized energy/MAC (0 to 2), broken down into psums, weights, and pixels, for the CNN dataflows WS, OSA, OSB, OSC, NLR, and RS]

RS optimizes for the best overall energy efficiency

[Chen, ISCA 2016]


Exploit Sparsity

Method 1. Skip memory access and computation

[Figure: zero data skipping; an "== 0" check against the zero buffer gates the enable signal, so there is no R/W at the scratch pad and register file and no switching in the datapath]

Method 2. Compress data to reduce storage and data movement

[Figure: DRAM access (MB, 0 to 6) for AlexNet conv layers 1 to 5, comparing uncompressed fmaps + weights with RLE compressed fmaps + weights; reductions of 1.2×, 1.4×, 1.7×, 1.8×, and 1.9× for layers 1 to 5]

45% power reduction

[Chen, ISSCC 2016]
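Both methods can be sketched in a few lines of Python. The zero-skipping function below simply omits the MAC when the activation is zero, and the run-length encoder stores (zero-run, value) pairs. This is a simplified illustration: the pair format, the `max_run` limit, and all function names are assumptions and not the exact encoding used on the chip.

```python
def sparse_dot(activations, weights):
    """Method 1: skip the weight read and the MAC when the activation is zero."""
    acc = 0
    macs = 0
    for a, w in zip(activations, weights):
        if a == 0:
            continue  # no memory read, no computation
        acc += a * w
        macs += 1
    return acc, macs


def rle_encode(values, max_run=31):
    """Method 2: encode as (number of preceding zeros, value) pairs."""
    pairs = []
    run = 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        # Trailing zeros: emit the last zero as an explicit value.
        pairs.append((run - 1, 0))
    return pairs


def rle_decode(pairs):
    """Inverse of rle_encode."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        out.append(v)
    return out


data = [3, 0, 0, 5, 0, 0, 0, 0, 7, 0, 0]
encoded = rle_encode(data)
print(encoded)  # [(0, 3), (2, 5), (4, 7), (1, 0)]
```

The sparser the data, the fewer pairs are stored, which is consistent with the larger DRAM savings in the later, sparser AlexNet layers shown in the figure.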


Eyeriss: Deep Neural Network Accelerator

[Figure: 4mm × 4mm die photo showing the on-chip buffer and the spatial PE array]

[Joint work with Joel Emer]

Results for AlexNet

Overall >10x energy reduction compared to a mobile GPU (Nvidia TK1)

Exploits data reuse for 100x reduction in memory accesses from global buffer and 1400x reduction in memory accesses from off-chip DRAM

Eyeriss Project Website: http://eyeriss.mit.edu

[Chen, ISSCC 2016]


Features: Energy vs. Accuracy

[Figure: energy/pixel (nJ, exponential axis from 0.1 to 10000) versus accuracy (average precision, linear axis from 0 to 80) for HOG (1), AlexNet (2), and VGG16 (2), with video compression marked for reference]

Measured on VOC 2007 Dataset
1. DPM v5 [Girshick, 2012]
2. Fast R-CNN [Girshick, CVPR 2015]

Measured in 65nm*

* Only feature extraction. Does not include data, classification energy, augmentation and ensemble, etc.

[Figure: two 4mm × 4mm die photos, [Suleiman, VLSI 2016] and [Chen, ISSCC 2016]; the latter shows the on-chip buffer and spatial PE array]

[Suleiman*, Chen*, ISCAS 2017]


Energy-Efficient Processing of DNNs

A significant amount of algorithm and hardware research on energy-efficient processing of DNNs

We identified various limitations to existing approaches

V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Dec. 2017

Book Coming Spring 2020!

http://eyeriss.mit.edu/tutorial.html



Design of Efficient DNN Algorithms

• Popular efficient DNN algorithm approaches

Network Pruning: pruning synapses, pruning neurons [Figure: network before pruning and after pruning]

Efficient Network Architectures [Figure: an R × S × C filter alongside an R × S × 1 filter and a 1 × 1 × C filter]
Examples: SqueezeNet, MobileNet

... also reduced precision

• Focus on reducing number of MACs and weights
• Does it translate to energy savings and reduced latency?

[Chen*, Yang*, SysML 2018]


Data Movement is Expensive

Energy of weight depends on memory hierarchy and dataflow

Normalized Energy Cost* of fetching data to run a MAC at the ALU

Data path                              Energy
RF (0.5 – 1.0 kB) → ALU                1×
PE → ALU (NoC: 200 – 1000 PEs)         2×
Global Buffer (100 – 500 kB) → ALU     6×
DRAM → ALU                             200×
ALU                                    1× (Reference)

* measured from a commercial 65nm process


Energy-Evaluation Methodology

Inputs
• DNN Shape Configuration (# of channels, # of filters, etc.)
• DNN Weights and Input Data [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]
• Hardware Energy Costs of each MAC and Memory Access

Steps
• Memory Accesses Optimization → # acc. at mem. level 1, # acc. at mem. level 2, …, # acc. at mem. level n → Edata
• # of MACs Calculation → # of MACs → Ecomp

Output
• DNN Energy Consumption per layer (L1, L2, L3, …)

Tool available at: https://energyestimation.mit.edu/

[Yang, CVPR 2017]
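The methodology above amounts to a weighted sum: the energy of a layer is the number of MACs times the energy per MAC (Ecomp) plus, for each memory level, the number of accesses times the energy per access (Edata). The sketch below uses the normalized energy costs quoted earlier in the slides (RF 1×, PE 2×, global buffer 6×, DRAM 200×, MAC 1×); the access counts in the example are hypothetical, and this is not the code of the tool at energyestimation.mit.edu.

```python
# Normalized energy per access, relative to one MAC (values from the slides).
ENERGY_PER_ACCESS = {"RF": 1.0, "PE": 2.0, "buffer": 6.0, "DRAM": 200.0}
ENERGY_PER_MAC = 1.0


def layer_energy(num_macs, accesses):
    """Return (E_comp, E_data) in normalized units.

    accesses maps a memory level name to its number of accesses.
    """
    e_comp = num_macs * ENERGY_PER_MAC
    e_data = sum(ENERGY_PER_ACCESS[level] * count
                 for level, count in accesses.items())
    return e_comp, e_data


# Hypothetical layer: 1000 MACs with most accesses served by the RF.
e_comp, e_data = layer_energy(1000, {"RF": 3000, "buffer": 100, "DRAM": 10})
print(e_comp, e_data, e_comp + e_data)
```

Even in this made-up example, ten DRAM accesses cost as much as two thousand MACs, which is why the number of MACs or weights alone is a poor proxy for energy.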


Key Observations

• Number of weights alone is not a good metric for energy

• All data types should be considered

Energy Consumption of GoogLeNet

Output Feature Map   43%
Input Feature Map    25%
Weights              22%
Computation          10%

[Yang, CVPR 2017]


Energy-Aware Pruning

Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings

• Sort layers based on energy and prune layers that consume the most energy first

• EAP reduces AlexNet energy by 3.7x and outperforms the previous work that uses magnitude-based pruning by 1.7x

[Figure: normalized energy (AlexNet, ×10^9) for the original model (Ori.), magnitude-based pruning (DC), and energy-aware pruning (EAP); DC gives a 2.1x reduction and EAP gives a 3.7x reduction]

Pruned models available at http://eyeriss.mit.edu/energy.html

[Yang, CVPR 2017]


# of Operations vs. Latency

• # of operations (MACs) does not approximate latency well

Source: Google (https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html)



NetAdapt: Platform-Aware DNN Adaptation

• Automatically adapt DNN to a mobile platform to reach a target latency or energy budget

• Use empirical measurements to guide optimization (avoid modeling of tool chain or platform architecture)

In collaboration with Google’s Mobile Vision Team

[Figure: NetAdapt takes a pretrained network and a budget, generates network proposals (A, B, C, D, …, Z), measures them on the platform, and outputs an adapted network]

Budget

Metric    Budget
Latency   3.8
Energy    10.5
…         …

Empirical Measurements

Metric    Proposal A   …   Proposal Z
Latency   15.6         …   14.3
Energy    41           …   46
…         …            …   …

Code available at http://netadapt.mit.edu [Yang, ECCV 2018]


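Simplified Example of One Iteration

1. Input: network from previous iteration (Latency: 100ms, Budget: 80ms)

2. Meet Budget: for each layer (Layer 1, …, Layer 4), simplify that layer until the latency meets the budget (e.g., 100ms → 90ms → 80ms, or 100ms → 80ms)

3. Maximize Accuracy: compare the accuracy of the resulting proposals (e.g., Acc: 60% vs. Acc: 40%) and select the most accurate one

4. Output: network for next iteration (Latency: 80ms, Budget: 60ms)

[Yang, ECCV 2018]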
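One iteration as described above can be sketched as follows. This is a heavily simplified illustration under stated assumptions: a network is represented only by its per-layer filter counts, `measure_latency` and `evaluate_accuracy` are hypothetical callbacks standing in for on-device measurement and validation, and any fine-tuning of the proposals is omitted. It is not the released NetAdapt code.

```python
def netadapt_iteration(num_filters, budget, measure_latency, evaluate_accuracy):
    """Return the network proposal for the next iteration, or None.

    num_filters: list with the number of filters in each layer.
    budget: latency target for this iteration.
    """
    proposals = []
    for layer in range(len(num_filters)):
        candidate = list(num_filters)
        # Step 2 (meet budget): shrink only this layer until latency fits.
        while candidate[layer] > 1 and measure_latency(candidate) > budget:
            candidate[layer] -= 1
        if measure_latency(candidate) <= budget:
            proposals.append(candidate)
    if not proposals:
        return None
    # Step 3 (maximize accuracy): keep the most accurate proposal.
    return max(proposals, key=evaluate_accuracy)


# Toy stand-ins for empirical measurements (purely illustrative).
def toy_latency(filters):
    return 10 * sum(filters)          # "ms"


def toy_accuracy(filters):
    return 2 * filters[0] + filters[1]


selected = netadapt_iteration([4, 6], budget=80,
                              measure_latency=toy_latency,
                              evaluate_accuracy=toy_accuracy)
print(selected)  # [4, 4]
```

In the toy run, the starting network has a latency of 100 and the budget is 80; shrinking layer 0 gives [2, 6] and shrinking layer 1 gives [4, 4], and the second proposal wins on the toy accuracy metric.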


Improved Latency vs. Accuracy Tradeoff

• NetAdapt boosts the real inference speed of MobileNet by up to 1.7x with higher accuracy

[Figure: latency vs. accuracy tradeoff, annotated "+0.3% accuracy, 1.7x faster" and "+0.3% accuracy, 1.6x faster"]

*Tested on the ImageNet dataset and a Google Pixel 1 CPU

Reference:
MobileNet: Howard et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications”, arXiv 2017
MorphNet: Gordon et al., “Morphnet: Fast & simple resource-constrained structure learning of deep networks”, CVPR 2018

[Yang, ECCV 2018]


FastDepth: Fast Monocular Depth Estimation

Depth estimation from a single RGB image is desirable, due to the relatively low cost and size of monocular cameras.

[Figure: RGB input and depth prediction]

Auto Encoder DNN Architecture (Dense Output): Reduction (similar to classification), then Expansion

[Joint work with Sertac Karaman]


FastDepth: Fast Monocular Depth Estimation

Apply NetAdapt, compact network design, and depth-wise decomposition to decoder layer to enable depth estimation at high frame rates on an embedded platform while still maintaining accuracy

[Figure: results annotated "> 10x" and "~40fps on an iPhone"]

Configuration: Batch size of one (32-bit float)

Models available at http://fastdepth.mit.edu

[Wofk*, Ma*, ICRA 2019]
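To see why depth-wise decomposition reduces computation, compare the MAC count of a standard convolutional layer with that of a depth-wise layer followed by a point-wise (1×1) layer, using the shape parameters defined earlier (E, F, M, C, R, S). The formulas are the standard ones for depth-wise separable convolution; the function names and example shape are illustrative assumptions and are not taken from the FastDepth models.

```python
def standard_conv_macs(E, F, M, C, R, S):
    """MACs for a standard CONV layer (batch size 1)."""
    return E * F * M * C * R * S


def depthwise_separable_macs(E, F, M, C, R, S):
    """MACs for a depth-wise R x S layer plus a point-wise 1 x 1 layer."""
    depthwise = E * F * C * R * S   # one R x S filter per input channel
    pointwise = E * F * M * C       # M filters of size 1 x 1 x C
    return depthwise + pointwise


# Hypothetical layer shape.
shape = dict(E=4, F=4, M=8, C=8, R=3, S=3)
print(standard_conv_macs(**shape))        # 9216
print(depthwise_separable_macs(**shape))  # 2176
```

Fewer MACs do not automatically mean lower latency or energy, which is why the slides pair this decomposition with empirical, platform-aware adaptation.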


Many Efficient DNN Design Approaches

Network Pruning: pruning synapses, pruning neurons [Figure: network before pruning and after pruning]

Efficient Network Architectures [Figure: a convolutional layer (R × S × C filters) alongside a depth-wise layer (R × S × 1 filters), a point-wise layer (1 × 1 × C filters), and channel groups (G)]

Reduce Precision: 32-bit float (10100101000000000101000000000100), 8-bit fixed (01100110), binary (0)

No guarantee that DNN algorithm designer will use a given approach.

Need flexible hardware!

[Chen*, Yang*, SysML 2018]


Existing DNN Architectures

• Specialized DNN hardware often relies on certain properties of the DNN in order to achieve high energy efficiency

• Example: Reduce memory access by amortizing across MAC array

[Figure: weight memory and activation memory feed a MAC array, with weight reuse and activation reuse across the array]


Limitation of Existing DNN Architectures

• Example: Reuse and array utilization depend on # of channels, feature map/batch size
– Not efficient across all network architectures (e.g., compact DNNs)

[Figure: a MAC array with spatial accumulation, sized by number of filters (output channels) and number of input channels; a MAC array with temporal accumulation, sized by number of filters (output channels) and feature map or batch size]


[Figure: example mapping for a depth-wise layer (R × S × 1 filters) onto the MAC arrays with spatial accumulation and with temporal accumulation]


– Less efficient as array scales up in size
– Can be challenging to exploit sparsity


Need Flexible Dataflow

• Use flexible dataflow (Row Stationary) to exploit reuse in any dimension of DNN to increase energy efficiency and array utilization

Example: Depth-wise layer


Need Flexible NoC for Varying Reuse

• When reuse is available, need multicast to exploit spatial data reuse for energy efficiency and high array utilization

• When reuse is not available, need unicast for high BW for weights for FC and weights & activations for high PE utilization

• An all-to-all network satisfies the above but is too expensive and not scalable

[Figure: four networks connecting the global buffer to a 4 × 4 PE array, spanning from high bandwidth, low spatial reuse to low bandwidth, high spatial reuse: unicast networks, 1D systolic networks, 1D multicast networks, broadcast network]

[Chen, JETCAS 2019]


Hierarchical Mesh

[Figure: GLB clusters, router clusters, and PE clusters connected by a mesh network and all-to-all networks; panels (a) to (e) show the configurations for high bandwidth, high reuse, grouped multicast, and interleaved multicast]

[Chen, JETCAS 2019]


Eyeriss v2: Balancing Flexibility and Efficiency

Efficiently supports
• Wide range of filter shapes – Large and Compact
• Different Layers – CONV, FC, depth wise, etc.
• Wide range of sparsity – Dense and Sparse
• Scalable architecture

Over an order of magnitude faster and more energy efficient than Eyeriss v1

Speed up over Eyeriss v1 scales with number of PEs

# of PEs     256      1024     16384
AlexNet      17.9x    71.5x    1086.7x
GoogLeNet    10.4x    37.8x    448.8x
MobileNet    15.7x    57.9x    873.0x

[Figure: chart with data labels 5.6, 10.9, and 12.6]

[Joint work with Joel Emer]

[Chen, JETCAS 2019]


Looking Beyond the DNN Accelerator for Acceleration



Super-Resolution on Mobile Devices

Screens are getting larger

Transmit low resolution for lower bandwidth

Use super-resolution to improve the viewing experience of lower-resolution content (reduce communication bandwidth)

[Figure: low-resolution streaming, high-resolution playback]


FAST: A Framework to Accelerate SuperRes

A framework that accelerates any SR algorithm by up to 15x when running on compressed videos

[Figure: compressed video → FAST + SR algorithm → real-time SR, 15x faster]

[Zhang, CVPRW 2017]


Free Information in Compressed Videos

[Figure: decoding a compressed video gives pixels (video as a stack of pixels); the representation in the compressed video includes block structure and motion compensation]

This representation can help accelerate super-resolution

[Zhang, CVPRW 2017]


Transfer is Lightweight

Transfer allows SR to run on only a subset of frames

[Figure: running SR on every low-res frame vs. running SR on one low-res frame and transferring the result to the other high-res frames; the transfer uses fractional interpolation, bicubic interpolation, and a skip flag]

The complexity of the transfer is comparable to bicubic interpolation. Transfer N frames, accelerate by N

[Zhang, CVPRW 2017]


Evaluation: Accelerating SRCNN

[Zhang, CVPRW 2017]



Visual Evaluation

[Figure: visual comparison of SRCNN, FAST + SRCNN, and Bicubic]

Code released at www.rle.mit.edu/eems/fast

Look beyond the DNN accelerator for opportunities to accelerate DNN processing (e.g., structure of data and temporal correlation)

[Zhang, CVPRW 2017]


Beyond Deep Neural Networks



Visual-Inertial Localization

Determines location/orientation of robot from images and IMU

[Figure: an image sequence and IMU (Inertial Measurement Unit) data feed Visual-Inertial Odometry (VIO)*, covering localization and mapping]

* Subset of SLAM algorithm (Simultaneous Localization And Mapping)


Localization at Under 25 mW

First chip that performs complete Visual-Inertial Odometry

Navion

• Front-End for camera (feature detection, tracking, and outlier elimination)
• Front-End for IMU (pre-integration of accelerometer and gyroscope data)
• Back-End Optimization of Pose Graph

Consumes 684× and 1582× less energy than mobile and desktop CPUs, respectively

[Zhang et al., RSS 2017], [Suleiman et al., VLSI 2018]

Navion Project Website: http://navion.mit.edu


Key Methods to Reduce Data Size

Navion: Fully integrated system – no off-chip processing or storage

Use compression and exploit sparsity to reduce memory down to 854kB
• Apply Low Cost Frame Compression
• Exploit Sparsity in Graph and Linear Solver

[Figure: Navion block diagram with three parts connected by data & control buses. Vision Frontend (VFE): vision frontend control, undistort & rectify (UR) for the left and right frames, feature detection (FD), feature tracking (FT), sparse stereo (SS), RANSAC, fixed point arithmetic, previous frame and current frame line buffers. IMU Frontend (IFE): pre-integration, floating point arithmetic, IMU memory. Backend (BE): backend control, build graph, linearize, linear solver, marginal, retract, shared memory (graph, linear solver, horizon states, point cloud), floating point arithmetic, matrix operations, Cholesky, back substitute, Rodrigues operations, register file]

[Suleiman, VLSI-C 2018] Best Student Paper Award


Where to Go Next: Planning and Mapping

Robot Exploration: Decide where to go by computing Shannon Mutual Information (MI)

1. Select candidate scan locations
2. Compute Shannon MI and choose best location
3. Move to location and scan
4. Update Occupancy Map

[Figure: occupancy map → mutual information map → where to scan? → updated map]

Exploration with a mini race car using motion capture for localization

[Figure: occupancy map with planned path; MI surface]

[Zhang, ICRA 2019]


Challenge is Data Delivery to All Cores

Process multiple beams in parallel

[Figure: cores 1 to N each process a beam, but the memory holding the map offers only Read Port 1 and Read Port 2]

Data delivery from memory is limited


Specialized Memory Architecture

Break up the map into separate memory banks and use a novel storage pattern to minimize read conflicts when processing different beams in parallel.

[Figure: memory access pattern of beams 1 to 8 over the X-Y map, and the diagonal banking pattern that spreads map cells over banks 0 to 7]

Compute the mutual information for an entire map of 20m x 20m at 0.1m resolution in under a second → a 100x speed up versus CPU for 1/10th of the power.

[Li, RSS 2019]
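The idea of diagonal banking can be illustrated with a simple address mapping. The sketch below assumes the bank index is `(x + y) mod 8`, so that cells along any row or column of the map land in different banks; this particular formula and the helper names are assumptions for illustration and may differ from the storage pattern used in [Li, RSS 2019].

```python
NUM_BANKS = 8


def bank_of(x, y, num_banks=NUM_BANKS):
    """Assumed diagonal banking: cells on the same anti-diagonal share a bank."""
    return (x + y) % num_banks


def read_conflicts(cells, num_banks=NUM_BANKS):
    """Number of reads that collide with an earlier read to the same bank."""
    banks = [bank_of(x, y, num_banks) for x, y in cells]
    return len(banks) - len(set(banks))


# Eight cores reading eight neighboring cells in the same cycle.
row_cells = [(x, 3) for x in range(8)]        # cells along one map row
column_cells = [(2, y) for y in range(8)]     # cells along one map column
print(read_conflicts(row_cells), read_conflicts(column_cells))  # 0 0
```

Under this assumed mapping, parallel reads along a row or a column are conflict free, while reads along a single anti-diagonal would all hit the same bank, so the pattern has to be matched to how the beams traverse the map.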


Monitoring Neurodegenerative Disorders

Dementia affects 50 million people worldwide today (75 million in 10 years) [World Alzheimer’s Report]

Mini-Mental State Examination (MMSE)

Q1. What is the year? Season? Date?
Q2. Where are you now? State? Floor?
Q3. Could you count backward from 100 by sevens? (93, 86, …)

Clock-drawing test

Agrell et al. Age and Ageing, 1998.

• Neuropsychological assessments are time consuming and require a trained specialist

• Repeat medical assessments are sparse, mostly qualitative, and suffer from high retest variability

[Joint work with Thomas Heldt and Charlie Sodini]


Use Eye Movements for Quantitative Evaluation

Eye movements can be used to quantitatively evaluate severity, progression or regression of neurodegenerative diseases

Clinical measurements of saccade latency are done in constrained environments that rely on specialized, costly equipment.

[Figure: equipment examples: high-speed camera (Phantom v25-11), SR EYELINK 1000 PLUS, IR illumination, substantial head support]

Reulen et al., Med. & Biol. Eng. & Comp, 1988.


Measure Eye Movements Using Phone

Develop algorithm to measure eye movement using a consumer-grade camera rather than high-cost research-grade camera.

Enable low-cost in-home longitudinal measurements.

[Figure: eye movements recorded with a smartphone and the extracted eye movement feature; histograms of count versus reaction time (milliseconds) for Phantom ($100k) and iPhone 6 (< $1k)]

[Saavedra Peña, EMBC 2018] [Lai, ICIP 2018]


Looking For Volunteers for Eye Reaction Time

If you are near or on MIT Campus and interested in volunteering your eye movements for this study, please contact us at [email protected]


Low Power 3D Time of Flight Imaging

• Pulsed Time of Flight: Measure distance using round trip time of laser light for each image pixel
– Illumination + Imager Power: 2.5 – 20 W for range from 1 - 8 m

• Use computer vision techniques and passive images to estimate changes in depth without turning on laser
– CMOS Imaging Sensor Power: < 350 mW

Real-time Performance on Embedded Processor: VGA @ 30 fps on Cortex-A7 (< 0.5W active power)

[Figure: estimated depth maps]

[Noraky, ICIP 2017]
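The pulsed time-of-flight principle reduces to one formula: distance = (speed of light × round trip time) / 2. A small sketch, with illustrative function names, shows the time scales involved for the 1 - 8 m range mentioned above.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def tof_distance_m(round_trip_time_s):
    """Distance to the object from the round trip time of the laser pulse."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


def round_trip_time_s(distance_m):
    """Round trip time of the laser pulse for an object at distance_m."""
    return 2.0 * distance_m / SPEED_OF_LIGHT_M_PER_S


for d in (1.0, 8.0):
    print(d, "m ->", round_trip_time_s(d) * 1e9, "ns")
```

An object at 1 m corresponds to a round trip of only a few nanoseconds, which gives a sense of the timing precision a pulsed ToF sensor needs at every pixel.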


Results of Low Power Depth ToF Imaging

[Figure: RGB image, ground truth depth map, estimated depth map]

Mean Relative Error: 0.7%
Duty Cycle (on-time of laser): 11%

[Noraky, ICIP 2017]


Summary

• Efficient computing extends the reach of AI beyond the cloud by reducing communication requirements, enabling privacy, and providing low latency so that AI can be used in a wide range of applications ranging from robotics to health care.

• Cross-layer design with specialized hardware enables energy-efficient AI, and will be critical to the progress of AI over the next decade.

Today’s slides available at https://tinyurl.com/SzeMITDL2020


Additional Resources

Overview Paper: V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Dec. 2017

Book Coming Spring 2020!

More info about Tutorial on DNN Architectures: http://eyeriss.mit.edu/tutorial.html

For updates: EEMS Mailing List



Additional Resources

MIT Professional Education Course on

“Designing Efficient Deep Learning Systems” http://shortprograms.mit.edu/dls

Next Offering: July 20-21, 2020 on MIT Campus



Additional Resources

Talks and Tutorial Available Online: https://www.rle.mit.edu/eems/publications/tutorials/

YouTube Channel: EEMS Group – PI: Vivienne Sze


Acknowledgements

Research conducted in the MIT Energy-Efficient Multimedia Systems Group would not be possible without the support of the following organizations:

Joel Emer

Sertac Karaman

Thomas Heldt

Mailing List: http://mailman.mit.edu/mailman/listinfo/eems-news



References

• Energy-Efficient Hardware for Deep Neural Networks
– Project website: http://eyeriss.mit.edu
– Y.-H. Chen, T. Krishna, J. Emer, V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid State Circuits (JSSC), ISSCC Special Issue, Vol. 52, No. 1, pp. 127-138, January 2017.
– Y.-H. Chen, J. Emer, V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” International Symposium on Computer Architecture (ISCA), pp. 367-379, June 2016.
– Y.-H. Chen, T.-J. Yang, J. Emer, V. Sze, “Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), June 2019.
– Eyexam: https://arxiv.org/abs/1807.07928

• Limitations of Existing Efficient DNN Approaches
– Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, “Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks,” SysML Conference, February 2018.
– V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, December 2017.
– Hardware Architecture for Deep Neural Networks: http://eyeriss.mit.edu/tutorial.html



• Co-Design of Algorithms and Hardware for Deep Neural Networks
– T.-J. Yang, Y.-H. Chen, V. Sze, “Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
– Energy estimation tool: http://eyeriss.mit.edu/energy.html
– T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, “NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications,” European Conference on Computer Vision (ECCV), 2018.
– D. Wofk*, F. Ma*, T.-J. Yang, S. Karaman, V. Sze, “FastDepth: Fast Monocular Depth Estimation on Embedded Systems,” IEEE International Conference on Robotics and Automation (ICRA), May 2019. http://fastdepth.mit.edu/

• Energy-Efficient Visual Inertial Localization
– Project website: http://navion.mit.edu
– A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, “Navion: A Fully Integrated Energy-Efficient Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones,” IEEE Symposium on VLSI Circuits (VLSI-Circuits), June 2018.
– Z. Zhang*, A. Suleiman*, L. Carlone, V. Sze, S. Karaman, “Visual-Inertial Odometry on Chip: An Algorithm-and-Hardware Co-design Approach,” Robotics: Science and Systems (RSS), July 2017.
– A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, “Navion: A 2mW Fully Integrated Real-Time Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones,” IEEE Journal of Solid State Circuits (JSSC), VLSI Symposia Special Issue, Vol. 54, No. 4, pp. 1106-1119, April 2019.


• Fast Shannon Mutual Information for Robot Exploration
– Z. Zhang, T. Henderson, V. Sze, S. Karaman, “FSMI: Fast computation of Shannon Mutual Information for information-theoretic mapping,” IEEE International Conference on Robotics and Automation (ICRA), May 2019.
– P. Li*, Z. Zhang*, S. Karaman, V. Sze, “High-throughput Computation of Shannon Mutual Information on Chip,” Robotics: Science and Systems (RSS), June 2019.

• Low Power Time of Flight Imaging
– J. Noraky, V. Sze, “Low Power Depth Estimation of Rigid Objects for Time-of-Flight Imaging,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2019.
– J. Noraky, V. Sze, “Depth Estimation of Non-Rigid Objects For Time-Of-Flight Imaging,” IEEE International Conference on Image Processing (ICIP), October 2018.
– J. Noraky, V. Sze, “Low Power Depth Estimation for Time-of-Flight Imaging,” IEEE International Conference on Image Processing (ICIP), September 2017.

• Monitoring Neurodegenerative Disorders Using a Phone
– H.-Y. Lai, G. Saavedra Peña, C. Sodini, T. Heldt, V. Sze, “Enabling Saccade Latency Measurements with Consumer-Grade Cameras,” IEEE International Conference on Image Processing (ICIP), October 2018.
– G. Saavedra Peña, H.-Y. Lai, V. Sze, T. Heldt, “Determination of saccade latency distributions using video recordings from consumer-grade devices,” IEEE International Engineering in Medicine and Biology Conference (EMBC), 2018.
– H.-Y. Lai, G. Saavedra Peña, C. Sodini, V. Sze, T. Heldt, “Measuring Saccade Latency Using Smartphone Cameras,” IEEE Journal of Biomedical and Health Informatics (JBHI), March 2020.
