MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Ji Lin¹, Wei-Ming Chen¹, Han Cai¹, Chuang Gan², Song Han¹
¹ Massachusetts Institute of Technology, ² MIT-IBM Watson AI Lab


Page 1: MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Ji Lin¹, Wei-Ming Chen¹, Han Cai¹, Chuang Gan², Song Han¹

¹ Massachusetts Institute of Technology, ² MIT-IBM Watson AI Lab

Page 2: Going “Tiny”

Page 3: Going “Tiny”

Cloud AI: data centers; expensive; privacy issues.
Mobile AI: smartphones; accessible; processes data locally.

Page 4: Going “Tiny”

Cloud AI: data centers; expensive; privacy issues.
Mobile AI: smartphones; accessible; processes data locally.
And what comes after mobile?

Page 5: Can we go even smaller?

Page 6: Can we go even smaller?

- The future belongs to Tiny AI.

Page 7: Can we go even smaller?

- The future belongs to Tiny AI.
- Billions of IoT devices around the world are based on microcontrollers.

Applications: Personalized Healthcare, Smart Manufacturing, Driving Assist, Smart Home

tinyml.mit.edu

Page 8: Can we go even smaller?

- The future belongs to Tiny AI.
- Billions of IoT devices around the world are based on microcontrollers.
- Low-cost: accessible to low-income people. Democratize AI.

Applications: Personalized Healthcare, Smart Manufacturing, Driving Assist, Smart Home

tinyml.mit.edu

Page 9: Can we go even smaller?

- The future belongs to Tiny AI.
- Billions of IoT devices around the world are based on microcontrollers.
- Low-cost: accessible to low-income people. Democratize AI.
- Low-power: reduce carbon footprint. Green AI.

Applications: Personalized Healthcare, Smart Manufacturing, Driving Assist, Smart Home

tinyml.mit.edu

Page 10: But Tiny AI is Difficult

- Tiny model design is fundamentally different from mobile AI, due to limited memory.

Page 11: But Tiny AI is Difficult

- Tiny model design is fundamentally different from mobile AI, due to limited memory.

Memory budget: Cloud AI 32 GB, Mobile AI 4 GB, Tiny AI 256 kB.
Tiny AI memory is 16,000x smaller than Mobile AI and 100,000x smaller than Cloud AI.

tinyml.mit.edu

Page 12: But Tiny AI is Difficult

- Tiny model design is fundamentally different from mobile AI, due to limited memory.
- Existing work optimizes for #parameters/#FLOPs, but #activations is the real bottleneck.
- Mobile models CANNOT simply be scaled down to fit.

Memory budget: Cloud AI 32 GB, Mobile AI 4 GB, Tiny AI 256 kB.
Tiny AI memory is 16,000x smaller than Mobile AI and 100,000x smaller than Cloud AI.

tinyml.mit.edu

Page 13: Breaking the Memory Bottleneck of TinyML

MCUNet = TinyNAS + TinyEngine: moving from toy applications to real-life applications.

* Lin et al., MCUNet: Tiny Deep Learning on IoT Devices

Page 14: Breaking the Memory Bottleneck of TinyML

MCUNet = TinyNAS + TinyEngine: moving from toy applications to real-life applications.

- Remaining problems:
  - Insufficient/imbalanced memory utilization across blocks
  - Poor performance on applications beyond classification (e.g., detection)

* Lin et al., MCUNet: Tiny Deep Learning on IoT Devices

Page 15: Imbalanced Memory Distribution of CNNs

• Per-block memory usage of MobileNetV2

[Figure: memory usage (kB) per block index of MobileNetV2]

Page 16: Imbalanced Memory Distribution of CNNs

• Per-block memory usage of MobileNetV2

[Figure: memory usage (kB) per block index of MobileNetV2; peak memory 1372 kB, far above the 256 kB SRAM constraint of the MCU]

Page 17: Imbalanced Memory Distribution of CNNs

• Per-block memory usage of MobileNetV2

[Figure: memory usage (kB) per block index of MobileNetV2, peak 1372 kB. The initial blocks have high memory usage and CANNOT fit the 256 kB MCU constraint; the later blocks have low memory usage and CAN fit]

Page 18: Imbalanced Memory Distribution of CNNs

• Per-block memory usage of MobileNetV2

[Figure: memory usage (kB) per block index of MobileNetV2. The high-memory initial stage needs about 8× more memory than the low-memory later stage, and only the latter fits the 256 kB MCU constraint]

Page 19: Imbalanced Memory Distribution of CNNs

• Common case in efficient CNN design

[Figure: per-block memory usage (kB) of MnasNet and FBNet. The imbalance is similar: the later blocks use 3.7× (MnasNet) and 6.1× (FBNet) less memory than the peak]

Page 20: Imbalanced Memory Distribution of CNNs

[Figure: memory usage (kB) per block index of MobileNetV2, split into a high-memory initial stage and a low-memory later stage, against the 256 kB MCU constraint]

Reduce the memory usage of the initial stage -> reduce the overall memory usage.
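The imbalance can be reproduced with a back-of-the-envelope estimate: inside an inverted-residual block, the input feature map and the expanded (hidden) feature map must coexist, and with int8 activations each element costs one byte. A minimal sketch with illustrative MobileNetV2-style shapes (not the paper's measured numbers):

```python
# Rough per-block activation memory (int8: 1 byte per element).
# Approximates the peak inside an inverted-residual block as
# input map + expanded hidden map, which dominates in early, high-resolution blocks.

def block_peak_kb(h, w, c_in, expansion=6):
    hidden = c_in * expansion
    return (h * w * c_in + h * w * hidden) / 1024  # kB

# Illustrative shapes for a MobileNetV2-like network at 224x224 input
blocks = [
    (112, 112, 16),   # early block: high resolution, few channels
    (56,  56,  24),
    (28,  28,  32),
    (14,  14,  96),
    (7,   7,   160),  # late block: low resolution, many channels
]

for i, (h, w, c) in enumerate(blocks):
    print(f"block {i}: ~{block_peak_kb(h, w, c):.0f} kB")
# The earliest block alone needs ~1372 kB while the later blocks shrink rapidly,
# matching the imbalanced distribution shown above.
```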

Page 21: 1. Saving Memory with Patch-based Inference

[Figure: a convolution layer, X * W = Y, with input activation X of size W x H x C]

Page 22: 1. Saving Memory with Patch-based Inference

[Figure: convolution X * W = Y; the input and output activations are held in SRAM]

* Weights are usually partially fetched from Flash.

Page 23: 1. Saving Memory with Patch-based Inference

[Figure: convolution X * W = Y, activations held in memory]

1. Per-layer inference: Peak Mem = 2·W·H·C (input and output activations coexist)

Page 24: 1. Saving Memory with Patch-based Inference

1. Per-layer inference: Peak Mem = 2·W·H·C
2. Per-patch inference: Peak Mem = 2·w·h·C << 2·W·H·C, where w x h is the spatial size of one patch

* More than 2x2 patches can be used.
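A quick numeric check of the two formulas (a sketch with illustrative shapes, assuming int8 activations):

```python
# Peak activation memory of one conv layer whose input and output
# have the same spatial size: both buffers must coexist in SRAM.

def per_layer_peak_kb(W, H, C):
    return 2 * W * H * C / 1024           # Peak Mem = 2*W*H*C (int8, 1 byte/element)

def per_patch_peak_kb(W, H, C, n=2):
    w, h = W // n, H // n                 # n x n spatial patches
    return 2 * w * h * C / 1024           # Peak Mem = 2*w*h*C << 2*W*H*C

W = H = 112; C = 16
print(per_layer_peak_kb(W, H, C))         # 392.0 kB: does not fit in 256 kB SRAM
print(per_patch_peak_kb(W, H, C, n=2))    # 98.0 kB: fits
print(per_patch_peak_kb(W, H, C, n=3))    # ~42.8 kB: more patches, more savings
```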

Page 25: 1. Saving Memory with Patch-based Inference

• A practical 2-layer example: conv1 (stride 1) followed by conv2 (stride 2).

[Figure: per-layer inference keeps the full activation map of each layer in memory]

Page 26: 1. Saving Memory with Patch-based Inference

• A practical 2-layer example: conv1 (stride 1) followed by conv2 (stride 2).

[Figure: per-layer inference vs. per-patch inference. Per-patch inference only keeps one spatial patch of the intermediate activations in memory at a time]

* The entire output of the patch-based stage still needs to be held, but it is much smaller than the earlier activations.
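A minimal sketch of this two-layer example in PyTorch (hypothetical shapes and channel counts, padding omitted to keep the index arithmetic simple): each output patch is produced by cropping the input region that covers its receptive field and running both layers on the crop, so only one patch of intermediate activations is alive at any time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two-layer stage from the slide: conv1 (3x3, stride 1) then conv2 (3x3, stride 2).
conv1 = nn.Conv2d(3, 8, kernel_size=3, stride=1, bias=False)
conv2 = nn.Conv2d(8, 16, kernel_size=3, stride=2, bias=False)

x = torch.randn(1, 3, 37, 37)
with torch.no_grad():
    y_full = conv2(conv1(x))                 # per-layer inference (reference)

_, _, H_out, W_out = y_full.shape
y_patch = torch.zeros_like(y_full)
n = 2                                        # split the output into 2x2 patches
rows = [round(i * H_out / n) for i in range(n + 1)]
cols = [round(j * W_out / n) for j in range(n + 1)]

with torch.no_grad():
    for r0, r1 in zip(rows[:-1], rows[1:]):
        for c0, c1 in zip(cols[:-1], cols[1:]):
            # Final output row r depends on input rows [2r, 2r+4] for this stack
            # (3x3 s=1 then 3x3 s=2, no padding), so crop the patch plus its halo.
            crop = x[:, :, 2 * r0 : 2 * (r1 - 1) + 5, 2 * c0 : 2 * (c1 - 1) + 5]
            y_patch[:, :, r0:r1, c0:c1] = conv2(conv1(crop))

print(torch.allclose(y_full, y_patch, atol=1e-5))   # True: same result,
# but only one patch of the large early activations is in memory at a time.
```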

Page 27: 1. Saving Memory with Patch-based Inference

• Applying to MobileNetV2

[Figure: per-block memory usage (kB) of MobileNetV2 with per-layer inference; peak memory 1372 kB, far above the 256 kB MCU constraint]

Page 28: 1. Saving Memory with Patch-based Inference

• Applying to MobileNetV2

[Figure: per-block memory usage (kB) of MobileNetV2. Per-layer inference peak memory: 1372 kB; per-patch inference peak memory: 172 kB, about 8× smaller and now under the 256 kB MCU constraint]

Page 29: Problem: Computation Overhead from Overlapping

• Using 2x2 patches

[Figure: two stacked layers, conv 3x3 stride 2 and conv 3x3 stride 1, computed patch by patch]

Page 33: Problem: Computation Overhead from Overlapping

• Using 2x2 patches

[Figure: the same two conv layers (3x3 s=2, 3x3 s=1) computed per patch; the input regions of neighboring patches overlap]

Spatial overlapping gets larger as the receptive field grows!
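The overlap can be quantified with standard receptive-field arithmetic: every conv layer adds a halo of (k-1)/2 times the accumulated stride around each patch, and neighboring patches recompute that halo redundantly. A small sketch (illustrative layer list, not the exact network):

```python
# Extra input border ("halo") each output patch needs as conv layers stack up.

layers = [(3, 2), (3, 1), (3, 1), (3, 2), (3, 1)]   # (kernel, stride), illustrative

def halo_pixels(layers):
    halo, jump = 0, 1            # jump = product of strides of the layers already passed
    for k, s in layers:
        halo += (k - 1) // 2 * jump
        jump *= s
    return halo

for depth in range(1, len(layers) + 1):
    print(f"after {depth} layer(s): halo = {halo_pixels(layers[:depth])} input pixels per side")

# With 2x2 patches on a 112x112 input, the ideal patch is 56x56 pixels;
# a halo of h pixels inflates the computed region to (56 + 2*h)^2.
# This growing redundancy is what produces the extra MACs shown on the next page.
```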

Page 34: Problem: Computation Overhead from Overlapping

[Chart: computation of MobileNetV2 in MMACs: 301 with per-layer inference vs. 330 with patch-based inference, about +10% overhead]

Page 35: 2. Network Redistribution to Reduce Overhead

[Chart: MobileNetV2 computation: 301 MMACs per-layer vs. 330 MMACs patch-based (+10%)]

Page 36: 2. Network Redistribution to Reduce Overhead

[Chart: MobileNetV2 computation: 301 MMACs per-layer vs. 330 MMACs patch-based (+10%)]

Idea: redistribute the receptive field, shrinking it in the patch-based stage and enlarging it in the later per-layer stage -> same overall receptive field.

Page 38: 2. Network Redistribution to Reduce Overhead

[Chart: MobileNetV2 computation in MMACs: 301 per-layer, 330 patch-based (+10%), 301 after redistribution]

After redistribution the overhead is negligible, with the same performance on:
• Image classification
• Object detection
• …
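The same halo arithmetic illustrates why redistribution helps: moving large kernels out of the early, patch-based stage and compensating with extra layers in the later, per-layer stage keeps the total receptive field while shrinking the per-patch halo. A sketch with made-up kernel lists (stride 1 throughout, for simplicity):

```python
# Two stride-1 conv stacks with the same total receptive field but
# different kernel placement between the patch-based and per-layer stages.

def receptive_field(kernels):
    return 1 + sum(k - 1 for k in kernels)

def patch_halo(kernels):
    # extra border each patch must recompute in the patch-based stage
    return sum((k - 1) // 2 for k in kernels)

early_a, late_a = [7, 5], [3, 3]             # original: large kernels early
early_b, late_b = [3, 3], [3, 3, 3, 3, 3]    # redistributed: small kernels early, more layers late

for name, early, late in [("original", early_a, late_a),
                          ("redistributed", early_b, late_b)]:
    print(f"{name:13s} total RF = {receptive_field(early + late)}, "
          f"per-patch halo = {patch_halo(early)} px per side")

# Both stacks reach a receptive field of 15, but the redistributed one needs
# only a 2-pixel halo in the patch-based stage instead of 5, so the repeated
# computation from overlapping patches nearly disappears.
```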

Page 39: 3. Joint Automated Search for Optimization

MCUNetV2 jointly searches over:
• Neural architecture: #layers, #channels, kernel size
• Inference scheduling: #patches, #layers to run patch-based, and other knobs from TinyEngine*

* Lin et al., MCUNet: Tiny Deep Learning on IoT Devices
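A minimal sketch of what such a joint search could look like: sample architecture and scheduling knobs together, keep only candidates whose estimated peak SRAM fits, and pick the best by some accuracy proxy. The knob ranges, cost model, and objective below are all placeholders, not the actual MCUNetV2 search space or TinyEngine analysis.

```python
import random

random.seed(0)
SRAM_LIMIT_KB = 256

def sample_candidate():
    """Jointly sample architecture and inference-scheduling knobs (illustrative ranges)."""
    arch = {
        "n_blocks":     random.choice([12, 14, 16, 18]),
        "width_mult":   random.choice([0.5, 0.75, 1.0]),
        "kernel_sizes": [random.choice([3, 5, 7]) for _ in range(4)],  # one per stage
        "resolution":   random.choice([128, 160, 192, 224]),
    }
    sched = {
        "n_patches":       random.choice([1, 4, 9]),   # 1 = purely per-layer inference
        "patch_stage_len": random.choice([0, 3, 5]),   # how many blocks run per-patch
    }
    return arch, sched

def estimate_peak_sram_kb(arch, sched):
    """Toy stand-in for a peak-memory analysis: more patches -> smaller peak."""
    r, c = arch["resolution"], int(32 * arch["width_mult"])
    full = 2 * r * r * c / 1024
    return full / sched["n_patches"] if sched["patch_stage_len"] > 0 else full

def accuracy_proxy(arch, sched):
    """Toy objective; a real search trains/evaluates each candidate."""
    return arch["resolution"] * arch["width_mult"] - 5 * max(0, sched["n_patches"] - 4)

feasible = [c for c in (sample_candidate() for _ in range(1000))
            if estimate_peak_sram_kb(*c) <= SRAM_LIMIT_KB]
best = max(feasible, key=lambda c: accuracy_proxy(*c))
print(best)
```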

Page 40: Reducing the Peak Memory of CNNs

• Baseline: TinyEngine, the SOTA system stack for tinyML
• Measured on STM32F746 MCU

[Chart: measured peak SRAM (kB) of four networks (MbV2, MbV2-RD, MCUNet, FBNet-A, at width multipliers 0.45 to 1.0 and resolution 144) under per-layer inference: 234, 310, 300, and 315 kB; per-patch results on the next pages]

Page 41: Reducing the Peak Memory of CNNs

• Baseline: TinyEngine, the SOTA system stack for tinyML
• Measured on STM32F746 MCU

[Chart: measured peak SRAM (kB) of the four networks. Per-layer inference: 234, 310, 300, 315 kB. Per-patch (2x2) inference: 85, 132, 94, 113 kB, i.e. 2.3x to 3.2x smaller]

Page 42: Reducing the Peak Memory of CNNs

• Baseline: TinyEngine, the SOTA system stack for tinyML
• Measured on STM32F746 MCU

[Chart: measured peak SRAM (kB) of the four networks. Per-layer inference: 234, 310, 300, 315 kB. Per-patch (2x2): 85, 132, 94, 113 kB. Per-patch (3x3): 56, 76, 51, 64 kB, i.e. 4.1x to 5.9x smaller than per-layer]

Page 44: MCUNetV2 for Tiny Image Classification

• Large-scale ImageNet classification
• Models are quantized to int8
• Served with TinyEngine

[Chart: ImageNet top-1 accuracy (%). On STM32F412 (256 kB SRAM): MbV2 0.35x 49.0, Proxyless 0.3x 56.2, MCUNet 60.3, MCUNetV2 64.9 (+4.6% over MCUNet). On STM32F746 (320 kB SRAM): MCUNet 68.5, MCUNetV2 71.8 (+3.3%)]

Page 45: MCUNetV2 for Tiny Image Classification

• TinyML application: Visual Wake Words (VWW)
• Higher accuracy, lower SRAM

[Chart: VWW accuracy (%) vs. measured peak SRAM (kB) for MCUNetV2, MCUNet, MbV2+TF-Lite, and Proxyless+TF-Lite (all with flash < 1 MB). MCUNetV2 gives +4.0% accuracy at a 4.0× smaller peak SRAM, well under the 256 kB MCU constraint]

Page 46: MCUNetV2 for Tiny Image Classification

• TinyML application: Visual Wake Words (VWW)
• Higher accuracy, lower SRAM

[Chart: VWW accuracy (%) vs. measured peak SRAM (kB) for MCUNetV2, MCUNet, MbV2+TF-Lite, and Proxyless+TF-Lite]

[Chart: peak SRAM (kB) required to reach 90% VWW accuracy: MCUNet 119 kB vs. MCUNetV2 30 kB, 4x smaller]

Page 47: MCUNetV2 for Tiny Object Detection

• Object detection is more sensitive to input resolution

[Chart: accuracy/mAP (%) vs. image resolution (160 to 352). Pascal VOC mAP degrades much faster than ImageNet top-1 as the resolution shrinks]

Page 48: MCUNetV2 for Tiny Object Detection

• Object detection is more sensitive to input resolution
• Patch-based inference allows a larger input resolution, improving detection performance

Page 49: MCUNetV2 for Tiny Object Detection

• Object detection is more sensitive to input resolution
• Patch-based inference allows a larger input resolution, improving detection performance

[Chart: mAP (%) on Pascal VOC under a <320 kB memory budget: MbV2+CMSIS 31.6, MCUNet 51.4, MCUNetV2 68.3 (+16.9 over MCUNet)]

Page 50: MCUNetV2 for Tiny Object Detection

• Face detection on WIDER FACE
• More robust results at a smaller peak memory

[Figure: detection examples, (a) RNNPool-Face-Quant vs. (b) MCUNetV2]

Page 51: Dissecting MCUNetV2 Architecture

[Figure: sample architecture searched for VWW. Legend: MB{expansion}_{k_size}x{k_size}]

Page 52: Dissecting MCUNetV2 Architecture

• The kernel size in the per-patch stage is small, to reduce spatial overlapping

[Figure: sample architecture searched for VWW. Legend: MB{expansion}_{k_size}x{k_size}]

Page 53: Dissecting MCUNetV2 Architecture

• The kernel size in the per-patch stage is small, to reduce spatial overlapping
• The expansion ratio in the middle stage is small, to reduce peak memory

[Figure: sample architecture searched for VWW. Legend: MB{expansion}_{k_size}x{k_size}]

Page 54: Dissecting MCUNetV2 Architecture

• The kernel size in the per-patch stage is small, to reduce spatial overlapping
• The expansion ratio in the middle stage is small, to reduce peak memory; it is large in the later stage to boost performance

[Figure: sample architecture searched for VWW. Legend: MB{expansion}_{k_size}x{k_size}]

Page 55: Dissecting MCUNetV2 Architecture

• The kernel size in the per-patch stage is small, to reduce spatial overlapping
• The expansion ratio in the middle stage is small, to reduce peak memory; it is large in the later stage to boost performance
• A larger input resolution is used for resolution-sensitive datasets like VWW (MCUNet uses 128x128)

[Figure: sample architecture searched for VWW. Legend: MB{expansion}_{k_size}x{k_size}]
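The legend describes inverted-residual (MBConv) blocks: MB{expansion}_{k}x{k} means an expansion ratio of {expansion} with a {k}x{k} depthwise convolution. A sketch of how such a block could be instantiated from the legend string (hypothetical helper in PyTorch; the searched architecture has additional details such as per-stage strides and the classifier head):

```python
import re
import torch
import torch.nn as nn

def mbconv_from_legend(spec, in_ch, out_ch, stride=1):
    """Build an inverted-residual block from a legend string like 'MB6_5x5'."""
    expansion, k = map(int, re.match(r"MB(\d+)_(\d+)x\d+", spec).groups())
    hidden = in_ch * expansion
    return nn.Sequential(
        nn.Conv2d(in_ch, hidden, 1, bias=False),              # 1x1 pointwise expand
        nn.BatchNorm2d(hidden),
        nn.ReLU6(inplace=True),
        nn.Conv2d(hidden, hidden, k, stride=stride,
                  padding=k // 2, groups=hidden, bias=False),  # k x k depthwise
        nn.BatchNorm2d(hidden),
        nn.ReLU6(inplace=True),
        nn.Conv2d(hidden, out_ch, 1, bias=False),              # 1x1 pointwise project
        nn.BatchNorm2d(out_ch),
    )

block = mbconv_from_legend("MB3_3x3", in_ch=16, out_ch=24, stride=2)
print(block(torch.randn(1, 16, 56, 56)).shape)   # torch.Size([1, 24, 28, 28])
```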

Page 56: MCUNetV2

Thanks for listening!