efficient computing for deep learning, ai and robotics - iap · •operations exhibit high...
TRANSCRIPT
Vivienne Sze ( @eems_mit)
Efficient Computing for Deep Learning, AI and Robotics
Vivienne Sze ( @eems_mit)Massachusetts Institute of Technology
In collaboration with Luca Carlone, Yu-Hsin Chen, Joel Emer, Sertac Karaman, Tushar Krishna, Thomas Heldt, Trevor Henderson, Hsin-Yu Lai, Peter Li, Fangchang Ma, James Noraky, Gladynel
Saavedra Peña, Charlie Sodini, Amr Suleiman, Nellie Wu, Diana Wofk, Tien-Ju Yang, Zhengdong Zhang
Slides available at https://tinyurl.com/SzeMITDL2020
1
Vivienne Sze ( @eems_mit)
Compute Demands for Deep Neural Networks2
Source: Open AI (https://openai.com/blog/ai-and-compute/)
Petaflop/s-days (exponential)
Year
AlexNet to AlphaGo Zero: A 300,000x Increase in Compute
Vivienne Sze ( @eems_mit)
Compute Demands for Deep Neural Networks3
[Strubell, ACL 2019]
[Strubell, ACL 2019]
Vivienne Sze ( @eems_mit)
Processing at “Edge” instead of the “Cloud”
Communication Privacy Latency
4
Vivienne Sze ( @eems_mit)
Computing Challenge for Self-Driving Cars
(Feb 2018)
Cameras and radar generate ~6 gigabytes of data every 30 seconds.
Generates wasted heat and some prototypes need water-cooling!
Self-driving car prototypes use approximately 2,500 Watts of
computing power.
5
Vivienne Sze ( @eems_mit)
Existing Processors Consume Too Much Power
< 1 Watt > 10 Watts
6
Vivienne Sze ( @eems_mit)
Transistors are NOT Getting More EfficientSlow down of Moore’s Law and Dennard Scaling
General purpose microprocessors not getting faster or more efficient
• Need specialized hardware for significant improvement in speed and energy efficiency
• Redesign computing hardware from the ground up!
Slowdown
7
Vivienne Sze ( @eems_mit)
Popularity of Specialized Hardware for DNNs8
Big Bets On A.I. Open a New Frontier for Chips Start-Ups, Too. (January 14, 2018)
“Today, at least 45 start-ups are working on chips that can power tasks like speech and self-driving cars, and at least five of them have raised more than $100 million from investors. Venture capitalists invested more than $1.5 billion in chip start-ups last year, nearly doubling the investments made two years ago, according to the research firm CB Insights.”
Vivienne Sze ( @eems_mit)
Power Dominated by Data MovementOperation: Energy
(pJ)8b Add 0.0316b Add 0.0532b Add 0.116b FP Add 0.432b FP Add 0.98b Mult 0.232b Mult 3.116b FP Mult 1.132b FP Mult 3.732b SRAM Read (8KB) 532b DRAM Read 640
Area (µm2)
366713713604184282349516407700N/AN/A
[Horowitz, ISSCC 2014]
Relative Energy Cost
1 10 102 103 104
Relative Area Cost
1 10 102 103
Memory access is orders of magnitude higher energy than compute
9
Vivienne Sze ( @eems_mit)
Autonomous Navigation Uses a Lot of Data
• Geometric Understanding
- Growing map size
2 million pixels 10x-100x more pixels
• Semantic Understanding
- High frame rate- Large resolutions- Data expansion
[Pire, RAS 2017]
10
Vivienne Sze ( @eems_mit)
Understanding the EnvironmentDepth Estimation
State-of-the-art approaches use Deep Neural Networks, which require up to several
hundred millions of operations and weights to
compute!>100x more complex than
video compression
Semantic Segmentation
11
Vivienne Sze ( @eems_mit)
Deep Neural Networks
Computer Vision Speech Recognition
Game Play Medical
Deep Neural Networks (DNNs) have become a cornerstone of AI
12
Vivienne Sze ( @eems_mit)
What Are Deep Neural Networks?
Input:Image
Output:“Volvo XC90”
Modified Image Source: [Lee, CACM 2011]
Low Level Features High Level Features
13
Vivienne Sze ( @eems_mit)
Weighted Sum
Yj = activation Wij × Xii=1
3
∑⎛
⎝⎜
⎞
⎠⎟
Input Layer
Output Layer
Hidden Layer
X1
X2
X3
Y1
Y2
Y3
Y4
W11
W34
Nonlinear ActivationFunction
Key operation is multiply and accumulate (MAC) Accounts for > 90% of computation
Sigmoid1
-1
0
0 1-1
Rectified Linear Unit (ReLU)1
-1
0
0 1-1
y=max(0,x)y=1/(1+e-x)
Image source: Caffe tutorial
14
Vivienne Sze ( @eems_mit)
• Fully Connected Layer– Feed forward, fully connected
– Multilayer Perceptron (MLP)
• Convolutional Layer– Feed forward, sparsely-connected w/ weight sharing
– Convolutional Neural Network (CNN)
– Typically used for images
• Recurrent Layer– Feedback
– Recurrent Neural Network (RNN)
– Typically used for sequential data (e.g., speech, language)
• Attention Layer/Mechanism– Attention (matrix multiply) + feed forward, fully connected
– Transformer [Vaswani, NeurIPS 2017]
Popular Types of Layers in DNNsFeedbackFeed
Forward
Input LayerOutput Layer
Hidden Layer
Input Layer
Output Layer
Hidden Layer
SparselyConnected
FullyConnected
15
Vivienne Sze ( @eems_mit)
High-Dimensional Convolution in CNN
R
S
H
a plane of input activations a.k.a. input feature map (fmap)
filter (weights)
W
16
Vivienne Sze ( @eems_mit)
High-Dimensional Convolution in CNN
R
filter (weights)
S
E
F Partial Sum (psum)
Accumulation
input fmap output fmap
Element-wise Multiplication
H
W
an output activation
17
Vivienne Sze ( @eems_mit)
High-Dimensional Convolution in CNN
H R
filter (weights)
S
E
Sliding Window Processing
input fmap an output activation
output fmap
W F
18
Vivienne Sze ( @eems_mit)
High-Dimensional Convolution in CNN
H R
S
…
…
…
C
input fmap
output fmap
…
…
…
… C …
filter
…
Many Input Channels (C)
E
W F
AlexNet: 3 – 192 Channels (C)
19
Vivienne Sze ( @eems_mit)
High-Dimensional Convolution in CNN
…
E
output fmap
…
…
many filters (M)
Many Output Channels (M)
M
…
R
S 1
R
S
…
…
…
C …
M
H
input fmap
…
…
…
… C …
C …
…
…
W F
AlexNet: 96 – 384 Filters (M)
20
Vivienne Sze ( @eems_mit)
High-Dimensional Convolution in CNN
…
M
…
Many Input fmaps (N) Many
Output fmaps (N) …
R
S
R
S
…
…
…
C …
C …
…
…
filters
…
E
F …
…
H
…
… C …
H
W
…
…
…
… C …
…
E …
…
1 1
N N
W F
Image batch size: 1 – 256 (N)
21
Vivienne Sze ( @eems_mit)
Define Shape for Each Layer
Filters
R
S
…
…
…C
H
W
…
…
…C
…
E
F
…
…
…M
…
…
…M…
R
S
…
…
…C
H
W…
…C
1
N
1
M
1
…
…
Input fmapsOutput fmaps
…
E
FN
H – Height of input fmap (activations) W – Width of input fmap (activations)C – Number of 2-D input fmaps /filters(channels)R – Height of 2-D filter (weights)S – Width of 2-D filter (weights)M – Number of 2-D output fmaps (channels)E – Height of output fmap (activations)F – Width of output fmap (activations)N – Number of input fmaps/output fmaps(batch size)
Shape varies across layers
22
Vivienne Sze ( @eems_mit)
Layers with Varying Shapes
Block Filter Size (RxS) # Filters (M) # Channels (C)1 3x3 16 3
3 1x1 64 163 3x3 64 13 1x1 24 64
6 1x1 120 406 5x5 120 16 1x1 40 120
MobileNetV3-Large Convolutional Layer Configurations
[Howard, ICCV 2019]
……
…
23
Vivienne Sze ( @eems_mit)
Popular DNN ModelsMetrics LeNet-5 AlexNet VGG-16 GoogLeNet
(v1)ResNet-50 EfficientNet-B4
Top-5 error (ImageNet)
n/a 16.4 7.4 6.7 5.3 3.7*
Input Size 28x28 227x227 224x224 224x224 224x224 380x380# of CONV Layers 2 5 16 21 (depth) 49 96# of Weights 2.6k 2.3M 14.7M 6.0M 23.5M 14M# of MACs 283k 666M 15.3G 1.43G 3.86G 4.4G# of FC layers 2 3 3 1 1 65**# of Weights 58k 58.6M 124M 1M 2M 4.9M# of MACs 58k 58.6M 124M 1M 2M 4.9MTotal Weights 60k 61M 138M 7M 25.5M 19MTotal MACs 341k 724M 15.5G 1.43G 3.9G 4.4GReference Lecun,
PIEEE 1998Krizhevsky, NeurIPS 2012
Simonyan, ICLR 2015
Szegedy, CVPR 2015
He, CVPR 2016
Tan, ICML 2019
DNN models getting larger and deeper* Does not include multi-crop and ensemble** Increase in FC layers due to squeeze-and-excitation layers (much smaller than FC layers for classification)
24
Vivienne Sze ( @eems_mit)
Efficient Hardware Acceleration for Deep Neural Networks
25
Vivienne Sze ( @eems_mit)
Properties We Can Leverage• Operations exhibit high parallelism
à high throughput possible
• Memory Access is the Bottleneck
ALU
Memory Read Memory WriteMAC*
* multiply-and-accumulate
filter weightimage pixelpartial sum updated
partial sum
• Example: AlexNet has 724M MACs à 2896M DRAM accesses required
Worst Case: all memory R/W are DRAM accesses
200x 1x
DRAM DRAM
26
Vivienne Sze ( @eems_mit)
Properties We Can Leverage• Operations exhibit high parallelism
à high throughput possible
• Input data reuse opportunities (up to 500x)
27
Filter Input Fmap
Convolutional Reuse (Activations, Weights)
CONV layers only(sliding window)
Filters
2
1
Input Fmap
Fmap Reuse(Activations)
CONV and FC layers
Filter
2
1
Input Fmaps
Filter Reuse(Weights)
CONV and FC layers(batch size > 1)
Vivienne Sze ( @eems_mit)
Exploit Data Reuse at Low-Cost Memories
DRAM Global Buffer
PE
PE PE
ALU fetch data to run a MAC here
ALU
Buffer ALU
RF ALU
Normalized Energy Cost*
200×
6×
PE ALU 2×
1× 1× (Reference)
DRAM ALU
0.5 – 1.0 kB
100 – 500 kB
NoC: 200 – 1000 PEs
* measured from a commercial 65nm process
Farther and larger memories consume more power
0.5 – 1.0 kB
Control
Reg FileSpecialized
hardware with small (< 1kB)
low cost memory near compute
28
Vivienne Sze ( @eems_mit)
Weight Stationary (WS)
• Minimize weight read energy consumption− maximize convolutional and filter reuse of weights
• Broadcast activations and accumulate partial sumsspatially across the PE array
• Examples: TPU [Jouppi, ISCA 2017], NVDLA
Global Buffer
W0 W1 W2 W3 W4 W5 W6 W7
Psum Activation
PE Weight
[Chen, ISCA 2016]
29
Vivienne Sze ( @eems_mit)
Output Stationary (OS)
[Chen, ISCA 2016]
• Minimize partial sum R/W energy consumption− maximize local accumulation
• Broadcast/Multicast filter weights and reuse activationsspatially across the PE array
• Examples: [Moons, VLSI 2016], [Thinker, VLSI 2017]
Global Buffer
P0 P1 P2 P3 P4 P5 P6 P7
Activation Weight
PE Psum
30
Vivienne Sze ( @eems_mit)
• Minimize activation read energy consumption− maximize convolutional and fmap reuse of activations
• Unicast weights and accumulate partial sums spatiallyacross the PE array
• Example: [SCNN, ISCA 2017]
Input Stationary (IS)
[Chen, ISCA 2016]
Global Buffer
I0 I1 I2 I3 I4 I5 I6 I7
Psum
ActPE
Weight
31
Vivienne Sze ( @eems_mit)
Row Stationary Dataflow
• Maximize row convolutional reuse in RF− Keep a filter row and fmap
sliding window in RF
• Maximize row psumaccumulation in RF
PE 1Row 1 Row 1
Row 1
=*
*
[Chen, ISCA 2016] Select for Micro Top Picks
32
Vivienne Sze ( @eems_mit)
Row Stationary Dataflow
Optimize for overall energy efficiency instead for only a certain data type
PE 1Row 1 Row 1
PE 2Row 2 Row 2
PE 3Row 3 Row 3
Row 1
=*
PE 4Row 1 Row 2
PE 5Row 2 Row 3
PE 6Row 3 Row 4
Row 2
=*
PE 7Row 1 Row 3
PE 8Row 2 Row 4
PE 9Row 3 Row 5
Row 3
=*
* * *
* * *
* * *
[Chen, ISCA 2016] Select for Micro Top Picks
33
Vivienne Sze ( @eems_mit)
Dataflow Comparison: CONV Layers
0
0.5
1
1.5
2
NormalizedEnergy/MAC
WS OSA OSB OSC NLR RS
psums
weights
pixels
RS optimizes for the best overall energy efficiency
CNN Dataflows
[Chen, ISCA 2016]
34
Vivienne Sze ( @eems_mit)
Exploit Sparsity
== 0 Zero Buff
Scratch Pad
Enable
Zero Data Skipping
Register FileNo R/W No Switching
Method 1. Skip memory access and computation
Method 2. Compress data to reduce storage and data movement
0
1
2
3
4
5
6
1 2 3 4 5
DRAM
Access
(MB)
AlexNetConvLayer
Uncompressed Compressed
1 2 3 4 5AlexNet Conv Layer
DRAM Access (MB)
0
2
4
61.2×
1.4×1.7×
1.8×1.9×
UncompressedFmaps + Weights
RLE CompressedFmaps + Weights
45% power reduction
[Chen, ISSCC 2016]
35
Vivienne Sze ( @eems_mit)
Eyeriss: Deep Neural Network Accelerator
On-ch
ip Bu
ffer Spatial
PE Array
4mm
4mm
[Joint work with Joel Emer]
Results for AlexNet
Overall >10x energy reduction compared to a mobile GPU (Nvidia TK1)
Exploits data reuse for 100x reduction in memory accesses from global buffer and 1400x reduction in memory accesses from off-chip DRAM
Eyeriss Project Website: http://eyeriss.mit.edu
[Chen, ISSCC 2016]
36
Vivienne Sze ( @eems_mit)
Features: Energy vs. Accuracy
0.1
1
10
100
1000
10000
0 20 40 60 80Accuracy (Average Precision)
Energy/Pixel (nJ)
VGG162
AlexNet2
HOG1
Measured in on VOC 2007 Dataset1. DPM v5 [Girshick, 2012]2. Fast R-CNN [Girshick, CVPR 2015]
Exponential
Linear
Video Compression
Measured in 65nm*
* Only feature extraction. Does not include data, classification
energy, augmentation and ensemble, etc.
On-c
hip B
uffer Spatial
PE Array
4mm
4mm
4mm
4mm
[Suleiman, VLSI 2016] [Chen, ISSCC 2016]1 2
[Suleiman*, Chen*, ISCAS 2017]
37
Vivienne Sze ( @eems_mit)
Energy-Efficient Processing of DNNs
V. Sze, Y.-H. Chen, T-J. Yang, J. Emer,
“Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE,
Dec. 2017Book Coming Spring 2020!
A significant amount of algorithm and hardware research on energy-efficient processing of DNNs
We identified various limitations to existing approaches
http://eyeriss.mit.edu/tutorial.html
38
Vivienne Sze ( @eems_mit)
• Popular efficient DNN algorithm approaches
Design of Efficient DNN Algorithms
pruning neurons
pruning synapses
after pruningbefore pruning
Network Pruning
C1
1S
R
1
R
SC
Efficient Network Architectures
Examples: SqueezeNet, MobileNet
... also reduced precision
• Focus on reducing number of MACs and weights• Does it translate to energy savings and reduced latency?
[Chen*, Yang*, SysML 2018]
39
Vivienne Sze ( @eems_mit)
Data Movement is Expensive
Energy of weight depends on memory hierarchy and dataflow
DRAM Global Buffer
PE
PE PE
ALU fetch data to run a MAC here
ALU
Buffer ALU
RF ALU
Normalized Energy Cost*
200×
6×
PE ALU 2×
1× 1× (Reference)
DRAM ALU
0.5 – 1.0 kB
100 – 500 kB
NoC: 200 – 1000 PEs
* measured from a commercial 65nm process
40
Vivienne Sze ( @eems_mit)
Energy-Evaluation Methodology
DNN Shape Configuration(# of channels, # of filters, etc.)
DNN Weights and Input Data[0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]
DNN Energy Consumption L1 L2 L3
Energy
…
Memory Accesses
Optimization
# of MACsCalculation
…
# acc. at mem. level 1# acc. at mem. level 2
# acc. at mem. level n
# of MACs
Hardware Energy Costs of each MAC and Memory Access
Ecomp
Edata
Tool available at: https://energyestimation.mit.edu/
[Yang, CVPR 2017]
41
Vivienne Sze ( @eems_mit)
• Number of weights alone is not a good metric for energy
• All data types should be considered
Key Observations
Output Feature Map43%
Input Feature Map25%
Weights22%
Computation10%
Energy Consumption of GoogLeNet
[Yang, CVPR 2017]
42
Vivienne Sze ( @eems_mit)
Directly target energy and incorporate it into the
optimization of DNNs to provide greater energy savings
Energy-Aware Pruning
• Sort layers based on energy and prune layers that consume most energy first
• EAP reduces AlexNet energy by 3.7x and outperforms the previous work that uses magnitude-based pruning by 1.7x
0 0.5
1 1.5
2 2.5
3 3.5
4 4.5
Ori. DC EAP
Normalized Energy (AlexNet)
2.1x 3.7x
x109
Magnitude Based Pruning
Energy Aware Pruning
Pruned models available at http://eyeriss.mit.edu/energy.html
[Yang, CVPR 2017]
43
Vivienne Sze ( @eems_mit)
# of Operations vs. Latency
• # of operations (MACs) does not approximate latency well
Source: Google (https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html)
44
Vivienne Sze ( @eems_mit)
NetAdapt: Platform-Aware DNN Adaptation• Automatically adapt DNN to a mobile platform to reach a
target latency or energy budget• Use empirical measurements to guide optimization (avoid
modeling of tool chain or platform architecture)
In collaboration with Google’s Mobile Vision Team
NetAdapt Measure
…
Network Proposals
Empirical MeasurementsMetric Proposal A … Proposal Z
Latency 15.6 … 14.3
Energy 41 … 46
…… …
PretrainedNetwork Metric Budget
Latency 3.8
Energy 10.5
Budget
AdaptedNetwork
… …
Platform
A B C D Z
Code available at http://netadapt.mit.edu [Yang, ECCV 2018]
45
Vivienne Sze ( @eems_mit)
Simplified Example of One Iteration
Latency: 100msBudget: 80ms
100ms 90ms 80ms
100ms 80ms
Selected
Selected
Layer 1
Layer 4
…
Acc: 60%
Acc: 40%
…
Selected
2. Meet Budget
Latency: 80msBudget: 60ms
1. Input 4. Output3. Maximize Accuracy
Network from Previous Iteration
Network for Next Iteration
[Yang, ECCV 2018]
46
Vivienne Sze ( @eems_mit)
• NetAdapt boosts the real inference speed of MobileNetby up to 1.7x with higher accuracy
Improved Latency vs. Accuracy Tradeoff
+0.3% accuracy1.7x faster
+0.3% accuracy1.6x faster
Reference:MobileNet: Howard et al, “Mobilenets: Efficient convolutional neural networks for mobile vision applications”, arXiv 2017MorphNet: Gordon et al., “Morphnet: Fast & simple resource-constrained structure learning of deep networks”, CVPR 2018
*Tested on the ImageNet dataset and a Google Pixel 1 CPU
[Yang, ECCV 2018]
47
Vivienne Sze ( @eems_mit)
FastDepth: Fast Monocular Depth EstimationDepth estimation from a single RGB image desirable, due to
the relatively low cost and size of monocular cameras.RGB Prediction
[Joint work with Sertac Karaman]
Auto Encoder DNN Architecture (Dense Output)
Reduction (similar to classification) Expansion
48
Vivienne Sze ( @eems_mit)
FastDepth: Fast Monocular Depth EstimationApply NetAdapt, compact network design, and depth wise decomposition
to decoder layer to enable depth estimation at high frame rates on an embedded platform while still maintaining accuracy
[Wofk*, Ma*, ICRA 2019]
Configuration: Batch size of one (32-bit float)
Models available at http://fastdepth.mit.edu
> 10x
~40fps on an iPhone
49
Vivienne Sze ( @eems_mit)
Many Efficient DNN Design Approaches
pruning neurons
pruning synapses
after pruningbefore pruning
Network Pruning Efficient Network Architectures
10100101000000000101000000000100
01100110
Reduce Precision
32-bit float
8-bit fixed
Binary 0
No guarantee that DNN algorithm designer will use a given approach.
Need flexible hardware!
C1
1S
R
1
R
SC
G
Depth-WiseLayer
Point-WiseLayer
ConvolutionalLayer
…ChannelGroups
[Chen*, Yang*, SysML 2018]
50
Vivienne Sze ( @eems_mit)
• Specialized DNN hardware often rely on certain properties of DNN in order to achieve high energy-efficiency
• Example: Reduce memory access by amortizing across MAC array
Existing DNN Architectures
MAC arrayWeightMemory
ActivationMemory
Weight reuse
Activationreuse
51
Vivienne Sze ( @eems_mit)
• Example: Reuse and array utilization depends on # of channels, feature map/batch size – Not efficient across all network architectures (e.g., compact DNNs)
Limitation of Existing DNN Architectures
MAC array(spatial
accumulation)
Number of filters(output channels)
Number ofinput channels
MAC array(temporal
accumulation)
Number of filters(output channels)
feature mapor batch size
52
Vivienne Sze ( @eems_mit)
• Example: Reuse and array utilization depends on # of channels, feature map/batch size – Not efficient across all network architectures (e.g., compact DNNs)
Limitation of Existing DNN Architectures
MAC array(spatial
accumulation)
Number of filters(output channels)
Number ofinput channels
MAC array(temporal
accumulation)
Number of filters(output channels)
feature mapor batch size
C1
1
S
R
1
Example mapping for depth wise layer
53
Vivienne Sze ( @eems_mit)
• Example: Reuse and array utilization depends on # of channels, feature map/batch size – Not efficient across all network architectures (e.g., compact DNNs)– Less efficient as array scales up in size– Can be challenging to exploit sparsity
Limitation of Existing DNN Architectures
MAC array(spatial
accumulation)
Number of filters(output channels)
Number ofinput channels
MAC array(temporal
accumulation)
Number of filters(output channels)
feature mapor batch size
54
Vivienne Sze ( @eems_mit)
Need Flexible Dataflow• Use flexible dataflow (Row Stationary) to exploit reuse in any
dimension of DNN to increase energy efficiency and array utilization
Example: Depth-wise layer
55
Vivienne Sze ( @eems_mit)
• When reuse available, need multicast to exploit spatial data reuse for energy efficiency and high array utilization
• When reuse not available, need unicast for high BW for weights for FC and weights & activations for high PE utilization
• An all-to-all satisfies above but too expensive and not scalable
Need Flexible NoC for Varying ReuseG
loba
l Buf
fer
PE PE PEPE
PE PE PEPE
PE PE PEPE
PEPE PE PE
Glo
bal B
uffe
r
PE PE PEPE
PE PE PEPE
PE PE PEPE
PE PE PEPE
Glo
bal B
uffe
r
PE PE PEPE
PE PE PEPE
PE PE PEPE
PE PE PEPE
Glo
bal B
uffe
r
PE PE PEPE
PE PE PEPE
PE PE PEPE
PE PE PEPE
High Bandwidth, Low Spatial Reuse Low Bandwidth, High Spatial Reuse
Unicast Networks Broadcast Network1D Multicast Networks1D Systolic Networks
[Chen, JETCAS 2019]
56
Vivienne Sze ( @eems_mit)
Hierarchical MeshGLBCluster
RouterCluster
PECluster
…
…
…
…
…… …
MeshNetwork
All-to-allNetwork
(a) (b) (c) (d) (e)
GLBCluster
RouterCluster
PECluster
…
…
…
…
…… …
MeshNetwork
All-to-allNetwork
(a) (b) (c) (d) (e)
High Bandwidth High Reuse Grouped Multicast Interleaved Multicast
All-to-AllMesh
[Chen, JETCAS 2019]
57
Vivienne Sze ( @eems_mit)
Eyeriss v2: Balancing Flexibility and Efficiency
Over an order of magnitude faster and more energy efficient than Eyeriss v1
Speed up over Eyeriss v1 scales with number of PEs
# of PEs 256 1024 16384
AlexNet 17.9x 71.5x 1086.7x
GoogLeNet 10.4x 37.8x 448.8x
MobileNet 15.7x 57.9x 873.0x
Efficiently supports• Wide range of filter shapes
– Large and Compact
• Different Layers – CONV, FC, depth wise, etc.
• Wide range of sparsity – Dense and Sparse
• Scalable architecture
[Joint work with Joel Emer]
5.610.912.6
[Chen, JETCAS 2019]
58
Vivienne Sze ( @eems_mit)
Looking Beyond the DNN Accelerator for Acceleration
59
Vivienne Sze ( @eems_mit)
Super-Resolution on Mobile Devices
Use super-resolution to improve the viewing experience of lower-resolution content (reduce communication bandwidth)
Screens are getting larger
Low ResolutionStreaming
Transmit low resolution for lower bandwidth
HighResolutionPlayback
60
Vivienne Sze ( @eems_mit)
FAST: A Framework to Accelerate SuperRes
A framework that accelerates any SR algorithm by up to 15x when running on compressed videos
FAST SR15x faster
Compressed video
SR algorithm
Real-time
[Zhang, CVPRW 2017]
61
Vivienne Sze ( @eems_mit)
Free Information in Compressed Videos
Compressed videoPixels
Video as a stack of pixels
Block-structure Motion-compensation
Representation in compressed video
This representation can help accelerate super-resolution
Decode
[Zhang, CVPRW 2017]
62
Vivienne Sze ( @eems_mit)
High-res video
Transfer is Lightweight
Low-res videoHigh-res video
SR
Low-res video
Transfer
FractionalInterpolation
BicubicInterpolation
Skip Flag
The complexity of the transfer is comparable to bicubic interpolation.Transfer N frames, accelerate by N
Transfer allows SR to run on only a subset of frames
SRSRSRSR
SR
[Zhang, CVPRW 2017]
63
Vivienne Sze ( @eems_mit)
Evaluation: Accelerating SRCNN
[Zhang, CVPRW 2017]
64
Vivienne Sze ( @eems_mit)
Visual Evaluation
SRCNN FAST + SRCNN
Bicubic
Code released at www.rle.mit.edu/eems/fast
Look beyond the DNN accelerator for opportunities to accelerate DNN processing (e.g., structure of data and temporal correlation)
[Zhang, CVPRW 2017]
65
Vivienne Sze ( @eems_mit)
Beyond Deep Neural Networks
66
Vivienne Sze ( @eems_mit)
Visual-Inertial Localization
Visual-Inertial Odometry
(VIO)
Localization
Mapping
Image sequence
IMU Inertial Measurement Unit
…
*Subset of SLAM algorithm (Simultaneous Localization And Mapping) Slide 28
Determines location/orientation of robot from images and IMU
*
67
Vivienne Sze ( @eems_mit)
Localization at Under 25 mW
[Zhang et al., RSS 2017], [Suleiman et al., VLSI 2018]
Consumes 684× and 1582×less energy than
mobile and desktop CPUs, respectively
First chip that performs complete Visual-Inertial Odometry
[Joint work with Sertac Karaman]
Navion
Front-End for camera (Feature detection, tracking, and
outlier elimination)
Front-End for IMU (pre-integration of accelerometer
and gyroscope data)
Back-End Optimization of Pose Graph
Navion Project Website: http://navion.mit.edu
68
Vivienne Sze ( @eems_mit)
Key Methods to Reduce Data Size
Backend Control
Data & Control BusBuild Graph
Linear Solver
Linearize
Marginal
Retract
GraphLinear Solver
Horizon States
Shared Memory
Floating Point
Arithmetic
Matrix Operations
Cholesky
Back Substitute
Rodrigues Operations
Feature Tracking
(FT)
Previous FrameLine Buffers
Feature Detection
(FD)
Undistort & Rectify
(UR)
Undistort & Rectify
(UR)
Data & Control Bus
Sparse Stereo (SS)
Vision Frontend Control
RANSAC Fixed Point Arithmetic Point Cloud Pre-IntegrationFloating Point
Arithmetic
IMU memory
Current Frame
Left Frame
Right Frame
Vision Frontend (VFE)
IMU Frontend (IFE)
Backend (BE)
Register File
Apply Low Cost
FrameCompression
Use compression and exploit sparsity to reduce memory down to 854kB
Exploit Sparsity in Graph and
Linear Solver
Navion: Fully integrated system – no off-chip processing or storage
[Suleiman, VLSI-C 2018] Best Student Paper Award
69
Vivienne Sze ( @eems_mit)
Where to Go Next: Planning and Mapping
Select candidate scan locations
Compute Shannon MI and choose best location
Move to location
and scan
Update Occupancy
Map
Where to scan?
Occupancy map Mutual information map
Mutual Information Updated Map
Robot Exploration: Decide where to go by computing Shannon Mutual Information
Exploration with a mini race car using motion capture for localization
Occupancy map with planned path
MI surface
[Zhang, ICRA 2019]
70
Vivienne Sze ( @eems_mit)
Challenge is Data Delivery to All CoresProcess multiple beams in parallel
Core 1
Core 2
Core 3
Core N
Core N
Core 2
Core 1
Core N
Core 2Core 1
Data delivery from memory is limited
Read Port 1
Read Port 2
71
Vivienne Sze ( @eems_mit)
Specialized Memory ArchitectureBreak up map into separate memory banks and novel storage pattern to
minimize read conflicts when processing different beams in parallel.
Compute the mutual information for an entire map of 20m x 20m at 0.1m resolution in under a second à a 100x speed up versus CPU for 1/10th of the power.
[Joint work with Sertac Karaman]
XY
X
Y
Memory Access Pattern
8 8 8
7 7 7
6 6 6
5 5 5
4 4 4 8
3 3 3 5 6 7
2 2 3 4
1 2 3 4 5 6 7 8
Diagonal Banking Pattern
X
Y
Bank 0
Bank 1
Bank 4
Bank 3
Bank 5
Bank 7
Bank 2
Bank 6
8 8 8
7 7 7
6 6 6
5 5 5
4 4 4 8
3 3 3 5 6 7
2 2 3 4
1 2 3 4 5 6 7 8
[Li, RSS 2019]
72
Vivienne Sze ( @eems_mit)
Monitoring Neurodegenerative Disorders
• Neuropsychological assessments are time consuming and require a trained specialist
• Repeat medical assessments are sparse, mostly qualitative, and suffer from high retest variability
Mini-Mental State Examination (MMSE)
Q1. What is the year? Season? Date?Q2. Where are you now? State? Floor?Q3. Could you count backward from
100 by sevens? (93, 86, …)
Clock-drawing test
Agrell et al. Age and Ageing, 1998.
[Joint work with Thomas Heldt and Charlie Sodini]
Dementia affects 50 million people worldwide today (75 million in 10 years) [World Alzheimer’s Report]
73
Vivienne Sze ( @eems_mit)
Use Eye Movements for Quantitative Evaluation
High-speed camera
Phantom v25-11
Substantial head support
SR EYELINK 1000 PLUS
IR illumination
Reulen et al., Med. & Biol. Eng. & Comp, 1988.
Clinical measurements of saccade latency are done in constrained environments that rely on specialized, costly equipment.
Eye movements can be used to quantitatively evaluate severity, progression or regression of neurodegenerative diseases
74
Vivienne Sze ( @eems_mit)
Measure Eye Movements Using Phone
[Saavedra Peña, EMBC 2018] [Lai, ICIP 2018]
Develop algorithm to measure eye movement using a consumer-grade
camera rather than high-cost research-grade camera.
Enable low-cost in-home longitudinal measurements.
Coun
t
Eye movement feature
Eye movementsSmartphone
Phantom ($100k)
iPhone 6 (< $1k)
Reaction Time (milliseconds)
75
Vivienne Sze ( @eems_mit)
Looking For Volunteers for Eye Reaction Time76
If you are near or on MIT Campus and interested
in volunteering your eye movements for this study,
please contact us at [email protected]
Vivienne Sze ( @eems_mit)
• Pulsed Time of Flight: Measure distance using round trip time
of laser light for each image pixel
– Illumination + Imager Power: 2.5 – 20 W for range from 1 - 8 m
• Use computer vision techniques and passive images to
estimate changes in depth without turning on laser
– CMOS Imaging Sensor Power: < 350 mW
Low Power 3D Time of Flight Imaging
Estimated Depth MapsReal-time Performance on Embedded Processor
VGA @ 30 fps on Cortex-A7 (< 0.5W active power)
[Noraky, ICIP 2017]
77
Vivienne Sze ( @eems_mit)
Results of Low Power Depth ToF Imaging
RGB Image Depth MapGround Truth
Depth MapEstimated
Mean Relative Error: 0.7%Duty Cycle (on-time of laser): 11%
[Noraky, ICIP 2017]
78
Vivienne Sze ( @eems_mit)
• Efficient computing extends the reach of AI beyond the cloud by reducing communication requirements, enabling privacy, and providing low latency so that AI can be used in wide range of applications ranging from robotics to health care.
• Cross-layer design with specialized hardware enables energy-efficient AI, and will be critical to the progress of AI over the next decade.
Summary
Today’s slides available at https://tinyurl.com/SzeMITDL2020
79
Vivienne Sze ( @eems_mit)
Additional Resources
Overview PaperV. Sze, Y.-H. Chen, T-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Dec. 2017
Book Coming Spring 2020!
More info about Tutorial on DNN Architectures http://eyeriss.mit.edu/tutorial.html
For updatesEEMS Mailing List
80
Vivienne Sze ( @eems_mit)
Additional Resources
MIT Professional Education Course on
“Designing Efficient Deep Learning Systems” http://shortprograms.mit.edu/dls
Next Offering: July 20-21, 2020 on MIT Campus
81
Vivienne Sze ( @eems_mit)
Additional ResourcesTalks and Tutorial Available Online
https://www.rle.mit.edu/eems/publications/tutorials/
YouTube ChannelEEMS Group – PI: Vivienne Sze
82
Vivienne Sze ( @eems_mit)
Acknowledgements
Research conducted in the MIT Energy-Efficient Multimedia Systems Group would not be possible without the support of the following organizations:
Joel Emer
Sertac Karaman
Thomas Heldt
Mailing List: http://mailman.mit.edu/mailman/listinfo/eems-news
83
Vivienne Sze ( @eems_mit)
• Energy-Efficient Hardware for Deep Neural Networks– Project website: http://eyeriss.mit.edu
– Y.-H. Chen, T. Krishna, J. Emer, V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid State Circuits (JSSC), ISSCC Special Issue, Vol. 52, No. 1, pp. 127-138, January 2017.
– Y.-H. Chen, J. Emer, V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” International Symposium on Computer Architecture (ISCA), pp. 367-379, June 2016.
– Y.-H. Chen, T.-J. Yang, J. Emer, V. Sze, “Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), June 2019.
– Eyexam: https://arxiv.org/abs/1807.07928
• Limitations of Existing Efficient DNN Approaches – Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, “Understanding the Limitations of Existing Energy-Efficient
Design Approaches for Deep Neural Networks,” SysML Conference, February 2018.
– V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, December 2017.
– Hardware Architecture for Deep Neural Networks: http://eyeriss.mit.edu/tutorial.html
References84
Vivienne Sze ( @eems_mit)
• Co-Design of Algorithms and Hardware for Deep Neural Networks– T.-J. Yang, Y.-H. Chen, V. Sze, “Designing Energy-Efficient Convolutional Neural Networks using Energy-
Aware Pruning,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
– Energy estimation tool: http://eyeriss.mit.edu/energy.html– T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, “NetAdapt: Platform-Aware Neural
Network Adaptation for Mobile Applications,” European Conference on Computer Vision (ECCV), 2018.
– D. Wofk*, F. Ma*, T.-J. Yang, S. Karaman, V. Sze, “FastDepth: Fast Monocular Depth Estimation on Embedded Systems,” IEEE International Conference on Robotics and Automation (ICRA), May 2019. http://fastdepth.mit.edu/
• Energy-Efficient Visual Inertial Localization – Project website: http://navion.mit.edu
– A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, “Navion: A Fully Integrated Energy-Efficient Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones,” IEEE Symposium on VLSI Circuits (VLSI-Circuits), June 2018.
– Z. Zhang*, A. Suleiman*, L. Carlone, V. Sze, S. Karaman, “Visual-Inertial Odometry on Chip: An Algorithm-and-Hardware Co-design Approach,” Robotics: Science and Systems (RSS), July 2017.
– A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, “Navion: A 2mW Fully Integrated Real-Time Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones,” IEEE Journal of Solid State Circuits (JSSC), VLSI Symposia Special Issue, Vol. 54, No. 4, pp. 1106-1119, April 2019.
References85
Vivienne Sze ( @eems_mit)
• Fast Shannon Mutual Information for Robot Exploration– Z. Zhang, T. Henderson, V. Sze, S. Karaman, “FSMI: Fast computation of Shannon Mutual Information for
information-theoretic mapping,” IEEE International Conference on Robotics and Automation (ICRA), May 2019.– P. Li*, Z. Zhang*, S. Karaman, V. Sze, “High-throughput Computation of Shannon Mutual Information on Chip,”
Robotics: Science and Systems (RSS), June 2019.
• Low Power Time of Flight Imaging– J. Noraky, V. Sze, “Low Power Depth Estimation of Rigid Objects for Time-of-Flight Imaging,” IEEE Transactions
on Circuits and Systems for Video Technology (TCSVT), 2019.
– J. Noraky, V. Sze, “Depth Estimation of Non-Rigid Objects For Time-Of-Flight Imaging,” IEEE International Conference on Image Processing (ICIP), October 2018.
– J. Noraky, V. Sze, “Low Power Depth Estimation for Time-of-Flight Imaging,” IEEE International Conference on Image Processing (ICIP), September 2017.
• Monitoring Neurodegenerative Disorders Using a Phone – H.-Y. Lai, G. Saavedra Peña, C. Sodini, T. Heldt, V. Sze, “Enabling Saccade Latency Measurements with
Consumer-Grade Cameras,” IEEE International Conference on Image Processing (ICIP), October 2018.
– G. Saavedra Peña, H.-Y. Lai, V. Sze, T. Heldt, “Determination of saccade latency distributions using video recordings from consumer-grade devices,” IEEE International Engineering in Medicine and Biology Conference (EMBC), 2018.
– H.-Y. Lai, G. Saavedra Peña, C. Sodini, V. Sze, T. Heldt, “Measuring Saccade Latency Using Smartphone Cameras,” IEEE Journal of Biomedical and Health Informatics (JBHI), March 2020.
References86