
MCUNet: Tiny Deep Learning on IoT Devices

NeurIPS 2020 (spotlight)

Ji Lin1, Wei-Ming Chen1,2, Yujun Lin1, John Cohn3, Chuang Gan3, Song Han1

1MIT  2National Taiwan University  3MIT-IBM Watson AI Lab

Background: The Era of AIoT on Microcontrollers (MCUs)

• Low-cost, low-power
• Rapid growth
• Wide applications: Smart Retail, Personalized Healthcare, Precision Agriculture, Smart Home

[Figure: annual MCU shipments, #Units (Billion), 0-50, for 2012-2019(F).]

Challenge: Memory Too Small to Hold DNN

                      Cloud AI    Mobile AI    Tiny AI (MCU)
Memory (Activation)   16GB        4GB          320kB
Storage (Weights)     ~TB/PB      256GB        1MB

MCU memory is 13,000x smaller than a mobile phone (4GB vs. 320kB) and 50,000x smaller than the cloud (16GB vs. 320kB).

We need to reduce the peak activation size AND the model size to fit a DNN into MCUs.

Existing efficient networks only reduce the model size, NOT the activation size!

[Figure: parameters (MB) and peak activation (MB), 0-50, at ~70% ImageNet top-1 for ResNet-18, MobileNetV2-0.75, and MCUNet. MobileNetV2-0.75 cuts parameters 4.6x vs. ResNet-18, but peak activation only 1.8x.]

Challenge: Memory Too Small to Hold DNN

[Figure: peak memory (kB, 0-8000) vs. the 320kB MCU constraint: ResNet-50 exceeds it by ~23x and MobileNetV2 by ~22x; even int8 MobileNetV2 still exceeds it by ~5x. Only MCUNet fits under 320kB.]
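To make the gap concrete, here is a minimal sketch of why peak activation, not just model size, gates MCU deployment: under layer-by-layer execution, peak memory is bounded below by the largest input+output activation pair. The shapes, the 6x expansion, and the int8 assumption below are illustrative, not taken from the slides.

```python
import numpy as np

def peak_activation_kb(layers, bytes_per_elem=1):
    """Peak activation memory (kB) under simple layer-by-layer execution:
    at each step the input and output buffers must be alive at once
    (int8 inference => 1 byte per element)."""
    peak = 0
    for in_shape, out_shape in layers:
        live = (np.prod(in_shape) + np.prod(out_shape)) * bytes_per_elem
        peak = max(peak, int(live))
    return peak / 1024

# Hypothetical early stage of a MobileNetV2-style network at 224x224,
# listed as (input, output) shapes in (channels, height, width):
layers = [
    ((3, 224, 224), (32, 112, 112)),    # stem conv, stride 2
    ((32, 112, 112), (192, 112, 112)),  # 1x1 expansion (6x channel blow-up)
    ((192, 112, 112), (192, 56, 56)),   # 3x3 depth-wise, stride 2
    ((192, 56, 56), (32, 56, 56)),      # 1x1 projection
]
print(f"peak activation: {peak_activation_kb(layers):.0f} kB vs. 320 kB SRAM")
```

Even at int8, the expanded feature map alone dwarfs the 320kB SRAM budget, which is why both model size and peak activation must shrink.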

MCUNet: System-Algorithm Co-design

(a) Search a NN model on an existing library (e.g., ProxylessNAS, MnasNet): NAS -> fixed Library
(b) Tune the deep learning library given a NN model (e.g., TVM): fixed NN Model -> Library
(c) MCUNet: system-algorithm co-design. TinyNAS (efficient neural architecture) and TinyEngine (efficient compiler/runtime) are optimized jointly.

TinyNAS: Two-Stage NAS for Tiny Memory Constraints

Search space design is crucial for NAS performance, and there is no prior expertise on MCU model design.

Full Network Space -> (memory/storage constraints) -> Optimized Search Space -> Model Specialization

TinyNAS: (1) Automated search space optimization

Revisit the ProxylessNAS search space: S = kernel size × expansion ratio × depth, with kernel size k in {3, 5, 7}, expansion ratio e in {2, 4, 6}, and depth d in {2, 3, 4} per unit of inverted bottleneck blocks (pw1 -> dw -> pw2). At MCU scale, models from this space run out of memory!

Extend the search space to cover a wide range of hardware capacity: S' = kernel size × expansion ratio × depth × input resolution R × width multiplier W.

Different R and W suit different hardware capacity (i.e., a different optimized sub-space): R=224, W=1.0 and R=260, W=1.4* serve mobile-scale devices, but for MCUs such as the STM32 F412/F746/H743 (256kB/320kB/512kB SRAM), the right R and W are unknown.

* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20

Analyze the FLOPs distribution of satisfying models in each search space: larger FLOPs -> larger model capacity -> more likely to give higher accuracy.

[Figure: cumulative probability (0-100%) over model FLOPs (25-65M) for candidate spaces labeled "width-resolution | mean FLOPs", e.g., w0.3-r160 | 32.5, w0.4-r144 | 46.9, w0.5-r144 | 52.0, w0.7-r112 | 38.4. Reading each CDF at p=80%: a bad design space reaches only 32.3M FLOPs (best acc: 74.2%), an intermediate one 45.4M (best acc: 76.4%), while a good design space reaches 50.3M (best acc: 78.7%). A good design space is likely to achieve high FLOPs under the memory constraint.]
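A minimal sketch of this selection heuristic follows: sample random configurations per candidate space, keep those that fit the MCU, and compare the FLOPs reached at a fixed CDF percentile. The profiling formulas and constants below are crude illustrative stand-ins, not the paper's analysis.

```python
import random

def rand_config(width, res):
    """Sample a random sub-network from a (width, resolution) space:
    per-stage kernel size, expansion ratio, and depth (ProxylessNAS-style)."""
    return {
        "width": width, "res": res,
        "kernel": [random.choice([3, 5, 7]) for _ in range(5)],
        "expand": [random.choice([2, 4, 6]) for _ in range(5)],
        "depth":  [random.choice([2, 3, 4]) for _ in range(5)],
    }

# Crude stand-ins for real profiling (illustrative only): FLOPs grow with
# width/resolution/expansion; peak SRAM is set by the largest expansion.
def flops_m(c):
    return 1e-4 * c["width"] * c["res"] ** 2 * sum(
        e * d for e, d in zip(c["expand"], c["depth"]))

def peak_kb(c):
    return 8e-3 * c["width"] * c["res"] ** 2 * max(c["expand"])

def flops_at_percentile(width, res, p=0.8, n=1000, sram_kb=320):
    """FLOPs (M) at the p-th percentile of the CDF over *satisfying*
    models; keep the (width, resolution) space that maximizes this."""
    flops = sorted(flops_m(c)
                   for c in (rand_config(width, res) for _ in range(n))
                   if peak_kb(c) <= sram_kb)
    return flops[int(p * len(flops))] if flops else 0.0

candidates = [(0.3, 160), (0.4, 144), (0.5, 144), (0.7, 96)]
print("chosen (width, res):", max(candidates, key=lambda wr: flops_at_percentile(*wr)))
```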

TinyNAS: (2) Resource-constrained model specialization

• One-shot NAS through weight sharing*: train a super network once, randomly sampling sub-networks (kernel size, expansion ratio, depth) and jointly fine-tuning multiple sub-networks at each step.
• Small sub-networks are nested in large sub-networks, so we can directly evaluate the accuracy of sub-nets without retraining.

* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20

The super network is elastic along three axes: kernel size, depth, and width.

Elastic Kernel Size: start with the full 7x7 kernel; a smaller kernel (5x5, 3x3) indexes the centered weights of the larger one, as in the sketch below.
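A minimal sketch of the centered-weight slicing (array shapes are illustrative):

```python
import numpy as np

def centered_kernel(w_full, k):
    """Elastic kernel size: a k x k kernel reuses the central weights of
    the full kernel. w_full: (out_ch, in_ch, 7, 7)."""
    s = (w_full.shape[-1] - k) // 2
    return w_full[:, :, s:s + k, s:s + k]

w7 = np.random.randn(16, 16, 7, 7)
print(centered_kernel(w7, 5).shape, centered_kernel(w7, 3).shape)
# (16, 16, 5, 5) (16, 16, 3, 3)
```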

Elastic Depth: train each unit with full depth, then shrink the depth by allowing the later layers in the unit to be skipped (outputs O1, O2, O3 correspond to the different depths). A minimal sketch follows.
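A minimal sketch of depth shrinking (the toy "layers" stand in for real blocks):

```python
import numpy as np

def run_unit(layers, depth, x):
    """Elastic depth: run only the first `depth` layers of a unit; the
    trailing layers are skipped when a shallow sub-network is sampled."""
    for layer in layers[:depth]:
        x = layer(x)
    return x

# Toy unit of 4 layers; a sampled sub-network may use depth 2, 3, or 4.
unit = [lambda x, i=i: x + i for i in range(4)]
x = np.zeros(3)
print(run_unit(unit, 2, x))  # only layers 0 and 1 applied -> [1. 1. 1.]
```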

Elastic Width: train with the full width; compute a per-channel importance score (e.g., 0.85, 0.63, 0.15, 0.02), sort the channels by importance ("reorg."), and keep the most important channels when shrinking the width. A sketch follows.
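A minimal sketch, assuming the L1 norm of each channel's weights as the importance score (the exact score is not specified on the slide):

```python
import numpy as np

def shrink_width(w, n_keep):
    """Elastic width: score each output channel by the L1 norm of its
    weights, sort, and keep the top n_keep channels, so a narrow
    sub-network inherits the most important channels.
    w: (out_ch, in_ch, kh, kw). The next layer's input channels must be
    reorganized with the same surviving index set."""
    importance = np.abs(w).sum(axis=(1, 2, 3))        # one score per channel
    keep = np.sort(np.argsort(-importance)[:n_keep])  # keep original order
    return w[keep], keep

w = np.random.randn(8, 4, 3, 3)
w_small, keep = shrink_width(w, 4)
print(w_small.shape, keep)  # (4, 4, 3, 3) plus the surviving channel ids
```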

TinyNAS Better Utilizes the Memory

[Figure: per-block peak memory (kB, 0-300) of the first two stages for MobileNetV2 vs. TinyNAS, with the max and average of each marked; TinyNAS's peak memory is 2.2x and 1.6x lower for the two stages.]

• TinyNAS designs networks with more uniform peak memory across blocks, allowing us to fit a larger model in the same amount of memory.

TinyEngine: A Memory-Efficient Inference Library

1. Reducing overhead with separated compilation & runtime

(a) Existing libraries are based on runtime interpretation (e.g., TF-Lite Micro, CMSIS-NN): the runtime must carry all supported ops plus the model's meta information and perform memory allocation on the fly, incurring computation, storage, and memory overhead.

(b) TinyEngine: model-adaptive code generation. At compile time (offline), it generates specialized ops and a memory schedule for the given NN model; the runtime only performs inference.

[Figure: peak memory (kB) vs. the ARM CMSIS-NN baseline (160kB): code generation alone is 1.7x smaller.]
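To illustrate the difference from interpretation, here is a toy code generator; every kernel name, signature, and buffer offset below is hypothetical, not TinyEngine's real output. Instead of an interpreter walking a graph at runtime, we emit straight-line C that calls one specialized kernel per layer and addresses a static arena planned offline.

```python
# Toy model description: (specialized kernel name, output offset in arena).
# 200704 = 112 * 112 * 16 bytes, i.e. one int8 feature map in this example.
model = [
    ("conv3x3_s2_r112_c16",    0),
    ("dwconv3x3_r112_c16",     200704),
    ("conv1x1_r112_c16_to_c8", 0),
]

def emit_c(model, arena_kb=320):
    """Emit straight-line C with a statically planned memory schedule."""
    lines = [f"static int8_t arena[{arena_kb} * 1024];",
             "void invoke(const int8_t *input) {",
             "  const int8_t *x = input;"]
    for kernel, offset in model:
        lines.append(f"  {kernel}(x, arena + {offset});  // specialized op")
        lines.append(f"  x = arena + {offset};")
    lines.append("}")
    return "\n".join(lines)

print(emit_c(model))
```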

TinyEngine: A Memory-Efficient Inference Library

2. In-place depth-wise convolution

The dilemma of the inverted bottleneck: the expansion layer inflates the activation to 6x the size of the 1x input/output, so the expanded depth-wise stage dominates peak memory.

(a) Standard depth-wise convolution keeps separate input and output activations for all N channels: peak memory 2N.
(b) In-place depth-wise convolution: each output channel depends only on the corresponding input channel, so channel i can be computed into a small temp buffer and then written back over input channel i. Peak memory: N+1.

[Figure: peak memory (kB) vs. the ARM CMSIS-NN baseline (160kB): code generation plus in-place depth-wise is 3.4x smaller.]
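A NumPy sketch of the idea (a toy, not the MCU kernel: it materializes a padded copy for brevity where a real kernel would use bounds-checked indexing):

```python
import numpy as np

def inplace_depthwise3x3(x, w):
    """In-place depth-wise 3x3 convolution (stride 1, zero padding).
    Output channel i depends only on input channel i, so each channel is
    computed into a single-channel temp buffer and written back over the
    input: peak memory ~ N+1 channels instead of 2N.
    x: (C, H, W), overwritten in place; w: (C, 3, 3)."""
    c, h, wd = x.shape
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    tmp = np.empty((h, wd), dtype=x.dtype)       # the "+1" temp buffer
    for i in range(c):                           # one channel at a time
        tmp[:] = 0
        for dy in range(3):
            for dx in range(3):
                tmp += w[i, dy, dx] * pad[i, dy:dy + h, dx:dx + wd]
        x[i] = tmp                               # write back in place
    return x

x = np.random.randn(16, 8, 8).astype(np.float32)
w = np.random.randn(16, 3, 3).astype(np.float32)
inplace_depthwise3x3(x, w)   # x now holds the output activation
```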

TinyEngine: Faster Inference Speed

Analyzing the throughput (million MAC/s) improved by each technique, starting from the ARM CMSIS-NN baseline at 52 MMAC/s:

(1) Code-generator-based compilation eliminates the overhead of runtime interpretation: 52 -> 64 MMAC/s.

(2) Model-adaptive memory scheduling increases data reuse for each layer, via (a) model-level memory scheduling and (b) tile size configuration for the specialized im2col buffer: 64 -> 70 MMAC/s. A toy im2col sketch follows.
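A toy tiled im2col convolution (tile size and shapes are illustrative; the real kernels are specialized C, not NumPy):

```python
import numpy as np

def conv_via_im2col(x, w, tile_rows=16):
    """Convolution as matmul over an im2col buffer, processed in tiles so
    the buffer of unfolded patches stays small. x: (C, H, W),
    w: (OC, C, 3, 3); stride 1, no padding."""
    c, h, wd = x.shape
    oc = w.shape[0]
    oh, ow = h - 2, wd - 2
    wmat = w.reshape(oc, -1)                        # (OC, C*9)
    out = np.empty((oc, oh * ow), dtype=x.dtype)
    patches = np.empty((tile_rows, c * 9), dtype=x.dtype)
    for start in range(0, oh * ow, tile_rows):      # one output tile at a time
        idx = range(start, min(start + tile_rows, oh * ow))
        for r, p in enumerate(idx):
            y0, x0 = divmod(p, ow)
            patches[r] = x[:, y0:y0 + 3, x0:x0 + 3].ravel()
        n = len(idx)
        out[:, start:start + n] = wmat @ patches[:n].T
    return out.reshape(oc, oh, ow)

x = np.random.randn(8, 10, 10).astype(np.float32)
w = np.random.randn(4, 8, 3, 3).astype(np.float32)
print(conv_via_im2col(x, w).shape)   # (4, 8, 8)
```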

(3) Computation kernel specialization, part 1: operation fusion. Fuse, e.g., Pad+Conv+ReLU+BN into one specialized kernel to minimize the memory footprint and optimize the overall computation: 70 -> 75 MMAC/s.
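The sketch below shows the BatchNorm-folding step that makes such a fused Conv+BN(+ReLU) kernel possible; shapes are illustrative:

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm that follows a conv into the conv's weights and
    bias, so the fused kernel runs conv (+ReLU) in one pass with no
    intermediate activation. w: (OC, IC, KH, KW), b: (OC,)."""
    scale = gamma / np.sqrt(var + eps)           # per-output-channel scale
    w_fused = w * scale[:, None, None, None]
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused

# After folding, Pad+Conv+BN+ReLU becomes a single specialized kernel:
#   y = relu(conv(pad(x), w_fused) + b_fused)
```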

(3) Computation kernel specialization, part 2: loop unrolling. Fully unroll small fixed-size loops (e.g., the 3x3 convolution kernel loop) to eliminate the branch-instruction overhead of loops: 75 -> 79 MMAC/s. A toy generator is sketched below.
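A toy generator for the unrolled inner loop (the emitted C is illustrative, not TinyEngine's actual output):

```python
def emit_unrolled_3x3():
    """Fully unroll the 3x3 multiply-accumulate loop into straight-line C:
    no loop counters or branch instructions remain in the emitted code."""
    body = [f"acc += x[{dy}][{dx}] * w[{dy}][{dx}];"
            for dy in range(3) for dx in range(3)]
    return "int32_t acc = bias;\n" + "\n".join(body)

print(emit_unrolled_3x3())
```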

(3) Computation kernel specialization, part 3: loop tiling. Tile loops based on the kernel size and the available memory, specialized per layer and per network architecture: 79 -> 82 MMAC/s, for an overall 1.6x speedup over CMSIS-NN.

[Figure: tiled im2col buffer x weights -> output, with the tile size chosen per layer.]
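A toy version of sizing the im2col tile from the kernel shape and the SRAM left over for the buffer (the sizing formula is an illustrative assumption):

```python
def im2col_tile_rows(c, kh, kw, sram_budget_bytes, bytes_per_elem=1):
    """Pick how many output pixels to unfold per tile so the im2col buffer
    (tile_rows x C*KH*KW elements) fits in the memory left for it."""
    row_bytes = c * kh * kw * bytes_per_elem
    return max(1, sram_budget_bytes // row_bytes)

# e.g. 64 input channels, 3x3 kernel, 32 kB left for the im2col buffer:
print(im2col_tile_rows(64, 3, 3, 32 * 1024))  # -> 56 rows per tile
```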

TinyEngine: A Memory-Efficient Inference Library

Summary vs. the ARM CMSIS-NN baseline: peak memory is 3.4x smaller (code generation + in-place depth-wise), and inference is 1.6x faster (52 -> 82 million MAC/s after code generation, specialized im2col, op fusion, loop unrolling, and tiling).

[Figure: normalized speed and peak memory (kB) of TF-Lite Micro, MicroTVM, tuned CMSIS-NN, and TinyEngine on SmallCifar, MobileNetV2, ProxylessNAS, and MnasNet. TinyEngine runs 1.5-3x faster than the other libraries (several of which OOM on the larger models) and uses 2.7-4.8x less peak memory.]

• Consistent improvement on different networks

Experimental Results

We focus on large-scale datasets to reflect real-life use cases:
(1) ImageNet-1000
(2) Wake words: visual (Visual Wake Words) and audio (Google Speech Commands)

System-Algorithm Co-design Gives the Best Results

• ImageNet classification on the STM32F746 MCU (320kB SRAM, 1MB Flash), top-1 accuracy:

Baseline (MbV2* + CMSIS-NN): 40%
System-only (MbV2* + TinyEngine): 44%
Model-only (TinyNAS + CMSIS-NN): 56%
Co-design (TinyNAS + TinyEngine): 62%

* scaled-down version: width multiplier 0.3, input resolution 80

Handling Diverse Hardware

• Specializing models (int4) for different MCUs (SRAM/Flash), ImageNet top-1 accuracy:

STM32F412 (256kB/1MB): 62.0%
STM32F746 (320kB/1MB): 63.5%
STM32F765 (512kB/1MB): 65.9%
STM32H743 (512kB/2MB): 70.7%

vs. the MobileNetV2 + CMSIS-NN baseline at 53.8% (+17% for MCUNet).

The first to achieve >70% ImageNet accuracy on commercial MCUs.

Reduce Both Model Size and Activation Size

[Figure: parameters (MB) and peak activation (MB), 0-50, at ~70% ImageNet top-1 for ResNet-18, MobileNetV2-0.75, and MCUNet; MCUNet cuts parameters by 24.6x and peak activation by 13.8x.]

Visual Wake Words (VWW)

[Figure: VWW accuracy (84-92%) vs. (a) measured latency (ms, with 5FPS/10FPS marks) and (b) peak SRAM (kB, with the 256kB MCU constraint), comparing MCUNet, MobileNetV2, ProxylessNAS, and Han et al. (baselines OOM beyond the constraint). MCUNet is 2.4-3.4x faster and 3.7x smaller at the same or better accuracy.]

Audio Wake Words (Google Speech Commands)

[Figure: GSC accuracy (88-96%) vs. (a) measured latency (ms, with 5FPS/10FPS marks) and (b) peak SRAM (kB, with the 256kB constraint), comparing MCUNet, MobileNetV2, and ProxylessNAS (baselines OOM beyond the constraint). MCUNet is 2.8x faster and 4.1x smaller with 2% higher accuracy.]

Visual Wake Word Detection

• Detecting whether a person is present in the frame

MCUNet: Tiny Deep Learning on IoT Devices

• Our study suggests that the era of tiny machine learning on IoT devices has arrived: Cloud AI (ResNet) -> Mobile AI (MobileNet) -> Tiny AI (MCUNet).

Project page: http://tinyml.mit.edu
