MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning
TRANSCRIPT
Ji Lin¹, Wei-Ming Chen¹, Han Cai¹, Chuang Gan², Song Han¹
¹Massachusetts Institute of Technology  ²MIT-IBM Watson AI Lab
Deep Learning Going "Tiny"
- Cloud AI: data centers; expensive; privacy issues.
- Mobile AI: smartphones; accessible; processes data locally.
- Tiny AI: ?
Can we go even smaller?
- The future belongs to Tiny AI.
- Billions of IoT devices around the world are based on microcontrollers.
- Low-cost: low-income people can have access. Democratize AI.
- Low-power: reduce carbon footprint. Green AI.
- Applications: personalized healthcare, smart manufacturing, driving assistance, smart home, …
tinyml.mit.edu
But Tiny AI is Difficult
- Memory budget: Cloud AI 32GB, Mobile AI 4GB, Tiny AI 256kB (16,000x smaller than mobile, 100,000x smaller than cloud).
- Tiny model design is fundamentally different from mobile AI due to the limited memory.
- Existing work optimizes for #parameters/#FLOPs, but #activations are the real bottleneck. Mobile models CANNOT be directly scaled down to fit.
Breaking the Memory Bottleneck of TinyML
- MCUNet* = TinyNAS + TinyEngine, bringing tinyML from toy applications to real-life applications.
- Remaining problems:
  - Insufficient / imbalanced memory utilization across blocks.
  - Poor performance on applications beyond classification (e.g., detection).
* Lin et al., MCUNet: Tiny Deep Learning on IoT Devices
Imbalanced Memory Distribution of CNNs
- Per-block memory usage of MobileNetV2 (chart: memory usage in kB vs. block index 0-17): the peak memory is 1372kB, far beyond the 256kB SRAM constraint of an MCU. The first blocks have high memory usage and CANNOT fit; the later blocks have low memory usage and CAN fit.
- This is a common case in efficient CNN design: in FBNet and MnasNet the initial stage also dominates, using 6.1x and 3.7x more memory than the rest, respectively.
- Key idea: reduce the memory usage of the initial stage -> reduce the overall memory usage. A sketch of how such a per-block profile can be measured follows.
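Below is a minimal sketch of how a per-block activation profile like the one on this slide can be measured. It uses torchvision's mobilenet_v2 as a stand-in for the profiled model, assumes int8 activations (1 byte per element), and approximates each block's footprint as input + output activation, so it understates blocks whose internal expanded activation is larger and will not exactly reproduce the 1372kB figure.

```python
# Per-block activation memory of MobileNetV2 (approximation).
import torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2().eval()
per_block_kb = []

def hook(module, inputs, output):
    # Block footprint ~= input activation + output activation (1 byte/elem for int8).
    n_bytes = inputs[0].numel() + output.numel()
    per_block_kb.append(n_bytes / 1024)

for block in model.features:
    block.register_forward_hook(hook)

with torch.no_grad():
    model(torch.zeros(1, 3, 224, 224))

for i, kb in enumerate(per_block_kb):
    print(f"block {i:2d}: {kb:7.1f} kB")
print(f"peak: {max(per_block_kb):.1f} kB")   # early blocks dominate the peak
```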
1. Saving Memory with Patch-based Inference
- A convolution Y = X * W operates on an activation X of size W x H x C held in SRAM (weights are usually partially fetched from Flash, so activations dominate the memory).
- Per-layer inference: the full input and output activations must be in memory at once, so Peak Mem = 2WHC.
- Per-patch inference: only one small spatial patch of size w x h is computed at a time, so Peak Mem = 2whC << 2WHC (more than 2x2 patches can be used). See the sketch below.
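A minimal numeric sketch of the peak-memory argument above, assuming int8 activations (1 byte per value) and ignoring the small halo needed at patch borders; the layer shape is illustrative.

```python
# Per-layer vs. per-patch peak activation memory for one conv layer with an
# H x W x C input and a same-sized output.
def per_layer_peak_kb(H, W, C):
    return 2 * H * W * C / 1024            # input + output held in full

def per_patch_peak_kb(H, W, C, patches_per_side=2):
    h, w = H // patches_per_side, W // patches_per_side
    return 2 * h * w * C / 1024            # only one patch held at a time

H, W, C = 112, 112, 32                     # a typical early MobileNetV2 activation
print(per_layer_peak_kb(H, W, C))          # ~784 kB, far above a 256 kB SRAM
print(per_patch_peak_kb(H, W, C, 2))       # ~196 kB with 2x2 patches
print(per_patch_peak_kb(H, W, C, 3))       # ~86 kB with 3x3 patches
```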
1. Saving Memory with Patch-based Inference
- A practical 2-layer example (conv1 with stride 1, then conv2 with stride 2):
  - Per-layer inference must hold the full intermediate activation between the two layers.
  - Per-patch inference runs both layers on one spatial patch at a time; only the final output needs to be held in full, and it is much smaller than the earlier activations. A runnable sketch follows.
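A runnable sketch of the two-layer example, assuming valid (unpadded) 3x3 convolutions so the patch arithmetic stays simple; channel counts and input size are illustrative, not from the paper. Each output patch is computed from its own input window (including the halo it needs) and matches the per-layer result exactly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv2d(3, 8, 3, stride=1, padding=0, bias=False)   # per-patch layer 1
conv2 = nn.Conv2d(8, 16, 3, stride=2, padding=0, bias=False)  # per-patch layer 2

with torch.no_grad():
    x = torch.randn(1, 3, 20, 20)
    y_full = conv2(conv1(x))                      # per-layer result: 16 x 8 x 8

    y_patch = torch.zeros_like(y_full)
    for r0, r1 in [(0, 4), (4, 8)]:               # split the 8x8 output into 2x2 patches
        for c0, c1 in [(0, 4), (4, 8)]:
            # Input window this output patch depends on (includes the halo).
            xin = x[:, :, 2*r0 : 2*r1 + 3, 2*c0 : 2*c1 + 3]
            y_patch[:, :, r0:r1, c0:c1] = conv2(conv1(xin))

    # Same result, but only one small patch of intermediate activation is live at a time.
    print(torch.allclose(y_full, y_patch, atol=1e-5))   # True
```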
1. Saving Memory with Patch-based Inference
- Applying patch-based inference to MobileNetV2 (chart: per-block memory usage in kB vs. block index): per-layer inference has a peak memory of 1372kB, while per-patch inference reduces the peak to 172kB, about 8x smaller and within the 256kB SRAM constraint of the MCU.
Problem: Computation Overhead from Overlapping
- Example with 2x2 patches and two layers (conv 3x3 s=2, then conv 3x3 s=1): neighboring input patches must overlap so that each output patch can be computed independently.
- The spatial overlapping gets larger as the receptive field grows, so the redundant computation grows with network depth (see the sketch below).
- For MobileNetV2, patch-based inference increases the computation from 301 to 330 MMACs (+10%).
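A minimal sketch of how the overlap grows with receptive field: for a per-patch stage with overall receptive field R and overall stride S, each input patch carries a halo of roughly R - S pixels on each border, so the redundant computation grows with R. All numbers below are illustrative, not the paper's.

```python
# Fraction of extra input pixels read (and re-convolved) by per-patch inference.
def compute_overhead(H, n_patches, R, S):
    out = H // S                              # spatial size after the per-patch stage
    patch_out = out // n_patches              # output patch size
    patch_in = (patch_out - 1) * S + R        # input window incl. halo
    total_in = (n_patches * patch_in) ** 2    # input pixels actually processed
    return total_in / H**2 - 1.0              # fractional overhead vs. processing each pixel once

H = 224
for R in (7, 15, 31, 63):                     # receptive field of the per-patch stage
    print(f"R={R:3d}: {compute_overhead(H, n_patches=2, R=R, S=4):+.1%} extra input pixels")
```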
2. Network Redistribution to Reduce Overhead
- Redistribute the receptive field: shrink it in the per-patch stage (less overlap) and grow it in the later per-layer stage, keeping the same overall receptive field (see the sketch below).
- After redistribution, the computation overhead of patch-based inference becomes negligible: roughly back to the original 301 MMACs for MobileNetV2, vs. 330 MMACs (+10%) without redistribution.
- Redistribution gives the same performance on image classification, object detection, ….
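A minimal sketch of the receptive-field bookkeeping behind redistribution, using the standard formula RF = 1 + sum_i (k_i - 1) * jump_i, where jump_i is the product of the strides of all earlier layers. The two stacks below are illustrative only (not the paper's configurations); they show that shrinking the early kernels and enlarging a later one can keep the overall receptive field unchanged.

```python
# Receptive field of a stack of conv layers given (kernel_size, stride) pairs.
def receptive_field(layers):
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

original      = [(3, 2), (5, 1), (5, 2), (3, 1), (3, 2)]   # large kernels early (more overlap)
redistributed = [(3, 2), (3, 1), (3, 2), (5, 1), (3, 2)]   # smaller early, larger late
print(receptive_field(original), receptive_field(redistributed))   # 35 35: same overall RF
```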
3. Joint Automated Search for Optimization
- Jointly search the neural architecture (#layers, #channels, kernel sizes, …) and the inference scheduling (#patches, #layers to run patch-based, other knobs from TinyEngine*, …); a toy sketch of the idea follows.
- The result is MCUNetV2.
* Lin et al., MCUNet: Tiny Deep Learning on IoT Devices
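A toy sketch of the joint search idea: sample architecture knobs and inference-scheduling knobs together, then keep only candidates whose estimated peak activation memory fits the SRAM budget. The knob ranges, constants, and the crude memory estimator below are made-up stand-ins for the paper's search space and TinyEngine's profiler, not the actual method.

```python
import random

SRAM_BUDGET_KB = 256

def sample_candidate():
    # Architecture knobs and scheduling knobs drawn from one joint space.
    return {
        "kernel_sizes": [random.choice([3, 5, 7]) for _ in range(12)],
        "width_mult":   random.choice([0.5, 0.75, 1.0]),
        "resolution":   random.choice([128, 160, 192, 224]),
        "n_patches":    random.choice([1, 2, 3]),      # 1 = plain per-layer inference
        "patch_layers": random.randint(0, 5),          # how many initial blocks run per-patch
    }

def est_peak_kb(c):
    # Crude stand-in for a real profiler (int8 activations): the early stage
    # dominates, and patching divides its footprint.
    r, ch = c["resolution"], int(32 * c["width_mult"])
    early = 2 * (r // 2) ** 2 * ch / 1024
    return early / (c["n_patches"] ** 2) if c["patch_layers"] > 0 else early

feasible = [c for c in (sample_candidate() for _ in range(1000))
            if est_peak_kb(c) <= SRAM_BUDGET_KB]
print(len(feasible), "of 1000 sampled candidates fit in", SRAM_BUDGET_KB, "kB")
```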
Reducing the Peak Memory of CNNs
- Baseline: TinyEngine, the state-of-the-art system stack for tinyML; measured on an STM32F746 MCU.
- Chart: measured peak SRAM (kB) for scaled-down MbV2, MbV2-RD, FBNet-A, and MCUNet models (width multipliers 0.45-1.0 at 144x144 resolution). Per-layer inference: 234-315kB; per-patch (2x2): 85-132kB (2.3x-3.2x smaller); per-patch (3x3): 51-76kB (4.1x-5.9x smaller).
MCUNetV2 for Tiny Image Classification
- Large-scale ImageNet classification; models are quantized to int8 and served with TinyEngine.
- STM32F412 (256kB SRAM), ImageNet top-1 accuracy: MbV2 0.35x 49.0%, Proxyless 0.3x 56.2%, MCUNet 60.3%, MCUNetV2 64.9% (+4.6% over MCUNet).
- STM32F746 (320kB SRAM): MCUNet 68.5%, MCUNetV2 71.8% (+3.3%).
MCUNetV2 for Tiny Image Classification
- TinyML application: Visual Wake Words (VWW). MCUNetV2 achieves higher accuracy at lower SRAM than MCUNet, MbV2+TF-Lite, and Proxyless+TF-Lite (chart: VWW accuracy vs. measured peak SRAM; all models use <1MB Flash; 256kB SRAM constraint on MCU).
- At ~90% VWW accuracy, MCUNetV2 needs only 30kB peak SRAM vs. 119kB for MCUNet (4x smaller), while improving accuracy by up to +4.0%.
MCUNetV2 for Tiny Object Detection
- Object detection is more sensitive to input resolution than classification: as the resolution shrinks from 352 to 160, Pascal VOC mAP degrades much faster than ImageNet top-1 accuracy (chart: accuracy/mAP vs. image resolution).
- Patch-based inference allows a larger input resolution under the same memory budget, improving detection performance.
- mAP on Pascal VOC (<320kB SRAM): MbV2+CMSIS 31.6, MCUNet 51.4, MCUNetV2 68.3 (+16.9% over MCUNet).
- Face detection on WIDER FACE: MCUNetV2 gives more robust detections at a smaller peak memory (figure: (a) RNNPool-Face-Quant, (b) MCUNetV2).
Dissecting MCUNetV2 Architecture
- Sample architecture from VWW. Legend: MB{expansion}_{kernel_size}x{kernel_size}.
- Kernel sizes in the per-patch stage are small, to reduce spatial overlapping.
- Expansion ratios in the middle stage are small, to reduce peak memory; they are larger in the later stage to boost performance (see the sketch below).
- A larger input resolution is used for resolution-sensitive datasets like VWW (MCUNet used 128x128).
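A minimal sketch of why a small expansion ratio lowers peak memory in an MB{e}_{k}x{k} inverted-residual block: the expanded activation (e times the input channels) must coexist with the block input, so peak memory grows roughly linearly with e. The shapes below are illustrative; int8 activations assumed (1 byte per value).

```python
def mb_block_peak_kb(H, W, C, expansion):
    inp      = H * W * C                 # block input, held for the residual connection
    expanded = H * W * C * expansion     # activation after the 1x1 expansion conv
    return (inp + expanded) / 1024

H, W, C = 28, 28, 48
for e in (3, 4, 6):
    print(f"MB{e}: ~{mb_block_peak_kb(H, W, C, e):.0f} kB peak")   # grows ~linearly with e
```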
MCUNetV2
Thanks for listening!