MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning
TRANSCRIPT
Ji Lin¹, Wei-Ming Chen¹, Han Cai¹, Chuang Gan², Song Han¹
¹Massachusetts Institute of Technology  ²MIT-IBM Watson AI Lab
Deep Learning Going "Tiny"
- Cloud AI: data centers; expensive; privacy issues.
- Mobile AI: smartphones; accessible; processes data locally.
- Tiny AI: ?
Can we go even smaller?
- The future belongs to Tiny AI.
- Billions of IoT devices around the world are based on microcontrollers.
- Low-cost: low-income people can have access. Democratize AI.
- Low-power: reduce carbon footprint. Green AI.
- Applications: personalized healthcare, smart manufacturing, driving assistance, smart home, …
tinyml.mit.edu
But Tiny AI is Difficult
- Memory budget: Cloud AI 32GB, Mobile AI 4GB, Tiny AI 256kB (16,000x smaller than mobile, 100,000x smaller than cloud).
- Tiny model design is fundamentally different from mobile AI due to the limited memory.
- Existing work optimizes for #parameters/#FLOPs, but #activations are the real bottleneck. Mobile models CANNOT be directly scaled down to fit.
Breaking the Memory Bottleneck of TinyML
- MCUNet* = TinyNAS + TinyEngine, bringing tinyML from toy applications to real-life applications.
- Remaining problems:
  - Insufficient / imbalanced memory utilization across blocks.
  - Poor performance on applications beyond classification (e.g., detection).
* Lin et al., MCUNet: Tiny Deep Learning on IoT Devices
Imbalanced Memory Distribution of CNNs
- Per-block memory usage of MobileNetV2 (chart: memory usage in kB vs. block index 0-17): the peak memory is 1372kB, far beyond the 256kB SRAM constraint of an MCU. The first blocks have high memory usage and CANNOT fit; the later blocks have low memory usage and CAN fit.
- This is a common case in efficient CNN design: in FBNet and MnasNet the initial stage also dominates, using 6.1x and 3.7x more memory than the rest, respectively.
- Key idea: reduce the memory usage of the initial stage -> reduce the overall memory usage. A sketch of how such a per-block profile can be measured follows.
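Below is a minimal sketch of how a per-block activation profile like the one on this slide can be measured. It uses torchvision's mobilenet_v2 as a stand-in for the profiled model, assumes int8 activations (1 byte per element), and approximates each block's footprint as input + output activation, so it understates blocks whose internal expanded activation is larger and will not exactly reproduce the 1372kB figure.

```python
# Per-block activation memory of MobileNetV2 (approximation).
import torch
from torchvision.models import mobilenet_v2

model = mobilenet_v2().eval()
per_block_kb = []

def hook(module, inputs, output):
    # Block footprint ~= input activation + output activation (1 byte/elem for int8).
    n_bytes = inputs[0].numel() + output.numel()
    per_block_kb.append(n_bytes / 1024)

for block in model.features:
    block.register_forward_hook(hook)

with torch.no_grad():
    model(torch.zeros(1, 3, 224, 224))

for i, kb in enumerate(per_block_kb):
    print(f"block {i:2d}: {kb:7.1f} kB")
print(f"peak: {max(per_block_kb):.1f} kB")   # early blocks dominate the peak
```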
1. Saving Memory with Patch-based Inference
- A convolution Y = X * W operates on an activation X of size W x H x C held in SRAM (weights are usually partially fetched from Flash, so activations dominate the memory).
- Per-layer inference: the full input and output activations must be in memory at once, so Peak Mem = 2WHC.
- Per-patch inference: only one small spatial patch of size w x h is computed at a time, so Peak Mem = 2whC << 2WHC (more than 2x2 patches can be used). See the sketch below.
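A minimal numeric sketch of the peak-memory argument above, assuming int8 activations (1 byte per value) and ignoring the small halo needed at patch borders; the layer shape is illustrative.

```python
# Per-layer vs. per-patch peak activation memory for one conv layer with an
# H x W x C input and a same-sized output.
def per_layer_peak_kb(H, W, C):
    return 2 * H * W * C / 1024            # input + output held in full

def per_patch_peak_kb(H, W, C, patches_per_side=2):
    h, w = H // patches_per_side, W // patches_per_side
    return 2 * h * w * C / 1024            # only one patch held at a time

H, W, C = 112, 112, 32                     # a typical early MobileNetV2 activation
print(per_layer_peak_kb(H, W, C))          # ~784 kB, far above a 256 kB SRAM
print(per_patch_peak_kb(H, W, C, 2))       # ~196 kB with 2x2 patches
print(per_patch_peak_kb(H, W, C, 3))       # ~86 kB with 3x3 patches
```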
1. Saving Memory with Patch-based Inference
- A practical 2-layer example (conv1 with stride 1, then conv2 with stride 2):
  - Per-layer inference must hold the full intermediate activation between the two layers.
  - Per-patch inference runs both layers on one spatial patch at a time; only the final output needs to be held in full, and it is much smaller than the earlier activations. A runnable sketch follows.
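A runnable sketch of the two-layer example, assuming valid (unpadded) 3x3 convolutions so the patch arithmetic stays simple; channel counts and input size are illustrative, not from the paper. Each output patch is computed from its own input window (including the halo it needs) and matches the per-layer result exactly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv2d(3, 8, 3, stride=1, padding=0, bias=False)   # per-patch layer 1
conv2 = nn.Conv2d(8, 16, 3, stride=2, padding=0, bias=False)  # per-patch layer 2

with torch.no_grad():
    x = torch.randn(1, 3, 20, 20)
    y_full = conv2(conv1(x))                      # per-layer result: 16 x 8 x 8

    y_patch = torch.zeros_like(y_full)
    for r0, r1 in [(0, 4), (4, 8)]:               # split the 8x8 output into 2x2 patches
        for c0, c1 in [(0, 4), (4, 8)]:
            # Input window this output patch depends on (includes the halo).
            xin = x[:, :, 2*r0 : 2*r1 + 3, 2*c0 : 2*c1 + 3]
            y_patch[:, :, r0:r1, c0:c1] = conv2(conv1(xin))

    # Same result, but only one small patch of intermediate activation is live at a time.
    print(torch.allclose(y_full, y_patch, atol=1e-5))   # True
```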
1. Saving Memory with Patch-based Inference
- Applying patch-based inference to MobileNetV2 (chart: per-block memory usage in kB vs. block index): per-layer inference has a peak memory of 1372kB, while per-patch inference reduces the peak to 172kB, about 8x smaller and within the 256kB SRAM constraint of the MCU.
Problem: Computation Overhead from Overlapping
- Example with 2x2 patches and two layers (conv 3x3 s=2, then conv 3x3 s=1): neighboring input patches must overlap so that each output patch can be computed independently.
- The spatial overlapping gets larger as the receptive field grows, so the redundant computation grows with network depth (see the sketch below).
- For MobileNetV2, patch-based inference increases the computation from 301 to 330 MMACs (+10%).
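A minimal sketch of how the overlap grows with receptive field: for a per-patch stage with overall receptive field R and overall stride S, each input patch carries a halo of roughly R - S pixels on each border, so the redundant computation grows with R. All numbers below are illustrative, not the paper's.

```python
# Fraction of extra input pixels read (and re-convolved) by per-patch inference.
def compute_overhead(H, n_patches, R, S):
    out = H // S                              # spatial size after the per-patch stage
    patch_out = out // n_patches              # output patch size
    patch_in = (patch_out - 1) * S + R        # input window incl. halo
    total_in = (n_patches * patch_in) ** 2    # input pixels actually processed
    return total_in / H**2 - 1.0              # fractional overhead vs. processing each pixel once

H = 224
for R in (7, 15, 31, 63):                     # receptive field of the per-patch stage
    print(f"R={R:3d}: {compute_overhead(H, n_patches=2, R=R, S=4):+.1%} extra input pixels")
```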
2. Network Redistribution to Reduce Overhead
- Redistribute the receptive field: shrink it in the per-patch stage (less overlap) and grow it in the later per-layer stage, keeping the same overall receptive field (see the sketch below).
- After redistribution, the computation overhead of patch-based inference becomes negligible: roughly back to the original 301 MMACs for MobileNetV2, vs. 330 MMACs (+10%) without redistribution.
- Redistribution gives the same performance on image classification, object detection, ….
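A minimal sketch of the receptive-field bookkeeping behind redistribution, using the standard formula RF = 1 + sum_i (k_i - 1) * jump_i, where jump_i is the product of the strides of all earlier layers. The two stacks below are illustrative only (not the paper's configurations); they show that shrinking the early kernels and enlarging a later one can keep the overall receptive field unchanged.

```python
# Receptive field of a stack of conv layers given (kernel_size, stride) pairs.
def receptive_field(layers):
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

original      = [(3, 2), (5, 1), (5, 2), (3, 1), (3, 2)]   # large kernels early (more overlap)
redistributed = [(3, 2), (3, 1), (3, 2), (5, 1), (3, 2)]   # smaller early, larger late
print(receptive_field(original), receptive_field(redistributed))   # 35 35: same overall RF
```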
3. Joint Automated Search for Optimization
- Jointly search the neural architecture (#layers, #channels, kernel sizes, …) and the inference scheduling (#patches, #layers to run patch-based, other knobs from TinyEngine*, …); a toy sketch of the idea follows.
- The result is MCUNetV2.
* Lin et al., MCUNet: Tiny Deep Learning on IoT Devices
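A toy sketch of the joint search idea: sample architecture knobs and inference-scheduling knobs together, then keep only candidates whose estimated peak activation memory fits the SRAM budget. The knob ranges, constants, and the crude memory estimator below are made-up stand-ins for the paper's search space and TinyEngine's profiler, not the actual method.

```python
import random

SRAM_BUDGET_KB = 256

def sample_candidate():
    # Architecture knobs and scheduling knobs drawn from one joint space.
    return {
        "kernel_sizes": [random.choice([3, 5, 7]) for _ in range(12)],
        "width_mult":   random.choice([0.5, 0.75, 1.0]),
        "resolution":   random.choice([128, 160, 192, 224]),
        "n_patches":    random.choice([1, 2, 3]),      # 1 = plain per-layer inference
        "patch_layers": random.randint(0, 5),          # how many initial blocks run per-patch
    }

def est_peak_kb(c):
    # Crude stand-in for a real profiler (int8 activations): the early stage
    # dominates, and patching divides its footprint.
    r, ch = c["resolution"], int(32 * c["width_mult"])
    early = 2 * (r // 2) ** 2 * ch / 1024
    return early / (c["n_patches"] ** 2) if c["patch_layers"] > 0 else early

feasible = [c for c in (sample_candidate() for _ in range(1000))
            if est_peak_kb(c) <= SRAM_BUDGET_KB]
print(len(feasible), "of 1000 sampled candidates fit in", SRAM_BUDGET_KB, "kB")
```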
Reducing the Peak Memory of CNNs
- Baseline: TinyEngine, the state-of-the-art system stack for tinyML; measured on an STM32F746 MCU.
- Chart: measured peak SRAM (kB) for scaled-down MbV2, MbV2-RD, FBNet-A, and MCUNet models (width multipliers 0.45-1.0 at 144x144 resolution). Per-layer inference: 234-315kB; per-patch (2x2): 85-132kB (2.3x-3.2x smaller); per-patch (3x3): 51-76kB (4.1x-5.9x smaller).
MCUNetV2 for Tiny Image Classification
- Large-scale ImageNet classification; models are quantized to int8 and served with TinyEngine.
- STM32F412 (256kB SRAM), ImageNet top-1 accuracy: MbV2 0.35x 49.0%, Proxyless 0.3x 56.2%, MCUNet 60.3%, MCUNetV2 64.9% (+4.6% over MCUNet).
- STM32F746 (320kB SRAM): MCUNet 68.5%, MCUNetV2 71.8% (+3.3%).
MCUNetV2 for Tiny Image Classification
- TinyML application: Visual Wake Words (VWW). MCUNetV2 achieves higher accuracy at lower SRAM than MCUNet, MbV2+TF-Lite, and Proxyless+TF-Lite (chart: VWW accuracy vs. measured peak SRAM; all models use <1MB Flash; 256kB SRAM constraint on MCU).
- At ~90% VWW accuracy, MCUNetV2 needs only 30kB peak SRAM vs. 119kB for MCUNet (4x smaller), while improving accuracy by up to +4.0%.
MCUNetV2 for Tiny Object Detection
- Object detection is more sensitive to input resolution than classification: as the resolution shrinks from 352 to 160, Pascal VOC mAP degrades much faster than ImageNet top-1 accuracy (chart: accuracy/mAP vs. image resolution).
- Patch-based inference allows a larger input resolution under the same memory budget, improving detection performance.
- mAP on Pascal VOC (<320kB SRAM): MbV2+CMSIS 31.6, MCUNet 51.4, MCUNetV2 68.3 (+16.9% over MCUNet).
- Face detection on WIDER FACE: MCUNetV2 gives more robust detections at a smaller peak memory (figure: (a) RNNPool-Face-Quant, (b) MCUNetV2).
Dissecting MCUNetV2 Architecture
- Sample architecture from VWW. Legend: MB{expansion}_{kernel_size}x{kernel_size}.
- Kernel sizes in the per-patch stage are small, to reduce spatial overlapping.
- Expansion ratios in the middle stage are small, to reduce peak memory; they are larger in the later stage to boost performance (see the sketch below).
- A larger input resolution is used for resolution-sensitive datasets like VWW (MCUNet used 128x128).
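A minimal sketch of why a small expansion ratio lowers peak memory in an MB{e}_{k}x{k} inverted-residual block: the expanded activation (e times the input channels) must coexist with the block input, so peak memory grows roughly linearly with e. The shapes below are illustrative; int8 activations assumed (1 byte per value).

```python
def mb_block_peak_kb(H, W, C, expansion):
    inp      = H * W * C                 # block input, held for the residual connection
    expanded = H * W * C * expansion     # activation after the 1x1 expansion conv
    return (inp + expanded) / 1024

H, W, C = 28, 28, 48
for e in (3, 4, 6):
    print(f"MB{e}: ~{mb_block_peak_kb(H, W, C, e):.0f} kB peak")   # grows ~linearly with e
```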
MCUNetV2
Thanks for listening!