
MCUNet: Tiny Deep Learning on IoT Devices

NeurIPS 2020 (spotlight)

Ji Lin1, Wei-Ming Chen1,2, Yujun Lin1, John Cohn3, Chuang Gan3, Song Han1

1MIT  2National Taiwan University  3MIT-IBM Watson AI Lab

Background: The Era of AIoT on Microcontrollers (MCUs)

• Low-cost, low-power
• Rapid growth
• Wide applications: Smart Retail, Personalized Healthcare, Precision Agriculture, Smart Home

[Figure: annual MCU shipments, #Units (Billion), 0-50, for 2012-2019(F).]

Challenge: Memory Too Small to Hold DNN

                      Cloud AI    Mobile AI    Tiny AI (MCU)
Memory (Activation)   16GB        4GB          320kB
Storage (Weights)     ~TB/PB      256GB        1MB

MCU memory is 13,000x smaller than a mobile phone (4GB vs. 320kB) and 50,000x smaller than the cloud (16GB vs. 320kB).

We need to reduce the peak activation size AND the model size to fit a DNN into MCUs.

Existing efficient networks only reduce the model size, NOT the activation size!

[Figure: parameters (MB) and peak activation (MB), 0-50, at ~70% ImageNet top-1 for ResNet-18, MobileNetV2-0.75, and MCUNet. MobileNetV2-0.75 cuts parameters 4.6x vs. ResNet-18, but peak activation only 1.8x.]

Challenge: Memory Too Small to Hold DNN

[Figure: peak memory (kB, 0-8000) vs. the 320kB MCU constraint: ResNet-50 exceeds it by ~23x and MobileNetV2 by ~22x; even int8 MobileNetV2 still exceeds it by ~5x. Only MCUNet fits under 320kB.]
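To make the gap concrete, here is a minimal sketch of why peak activation, not just model size, gates MCU deployment: under layer-by-layer execution, peak memory is bounded below by the largest input+output activation pair. The shapes, the 6x expansion, and the int8 assumption below are illustrative, not taken from the slides.

```python
import numpy as np

def peak_activation_kb(layers, bytes_per_elem=1):
    """Peak activation memory (kB) under simple layer-by-layer execution:
    at each step the input and output buffers must be alive at once
    (int8 inference => 1 byte per element)."""
    peak = 0
    for in_shape, out_shape in layers:
        live = (np.prod(in_shape) + np.prod(out_shape)) * bytes_per_elem
        peak = max(peak, int(live))
    return peak / 1024

# Hypothetical early stage of a MobileNetV2-style network at 224x224,
# listed as (input, output) shapes in (channels, height, width):
layers = [
    ((3, 224, 224), (32, 112, 112)),    # stem conv, stride 2
    ((32, 112, 112), (192, 112, 112)),  # 1x1 expansion (6x channel blow-up)
    ((192, 112, 112), (192, 56, 56)),   # 3x3 depth-wise, stride 2
    ((192, 56, 56), (32, 56, 56)),      # 1x1 projection
]
print(f"peak activation: {peak_activation_kb(layers):.0f} kB vs. 320 kB SRAM")
```

Even at int8, the expanded feature map alone dwarfs the 320kB SRAM budget, which is why both model size and peak activation must shrink.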

MCUNet: System-Algorithm Co-design

(a) Search a NN model on an existing library (e.g., ProxylessNAS, MnasNet): NAS -> fixed Library
(b) Tune the deep learning library given a NN model (e.g., TVM): fixed NN Model -> Library
(c) MCUNet: system-algorithm co-design. TinyNAS (efficient neural architecture) and TinyEngine (efficient compiler/runtime) are optimized jointly.

TinyNAS: Two-Stage NAS for Tiny Memory Constraints

Search space design is crucial for NAS performance, and there is no prior expertise on MCU model design.

Full Network Space -> (memory/storage constraints) -> Optimized Search Space -> Model Specialization

TinyNAS: (1) Automated search space optimization

Revisit the ProxylessNAS search space: S = kernel size × expansion ratio × depth, with kernel size k in {3, 5, 7}, expansion ratio e in {2, 4, 6}, and depth d in {2, 3, 4} per unit of inverted bottleneck blocks (pw1 -> dw -> pw2). At MCU scale, models from this space run out of memory!

Extend the search space to cover a wide range of hardware capacity: S' = kernel size × expansion ratio × depth × input resolution R × width multiplier W.

Different R and W suit different hardware capacity (i.e., a different optimized sub-space): R=224, W=1.0 and R=260, W=1.4* serve mobile-scale devices, but for MCUs such as the STM32 F412/F746/H743 (256kB/320kB/512kB SRAM), the right R and W are unknown.

* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20

Analyze the FLOPs distribution of satisfying models in each search space: larger FLOPs -> larger model capacity -> more likely to give higher accuracy.

[Figure: cumulative probability (0-100%) over model FLOPs (25-65M) for candidate spaces labeled "width-resolution | mean FLOPs", e.g., w0.3-r160 | 32.5, w0.4-r144 | 46.9, w0.5-r144 | 52.0, w0.7-r112 | 38.4. Reading each CDF at p=80%: a bad design space reaches only 32.3M FLOPs (best acc: 74.2%), an intermediate one 45.4M (best acc: 76.4%), while a good design space reaches 50.3M (best acc: 78.7%). A good design space is likely to achieve high FLOPs under the memory constraint.]
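A minimal sketch of this selection heuristic follows: sample random configurations per candidate space, keep those that fit the MCU, and compare the FLOPs reached at a fixed CDF percentile. The profiling formulas and constants below are crude illustrative stand-ins, not the paper's analysis.

```python
import random

def rand_config(width, res):
    """Sample a random sub-network from a (width, resolution) space:
    per-stage kernel size, expansion ratio, and depth (ProxylessNAS-style)."""
    return {
        "width": width, "res": res,
        "kernel": [random.choice([3, 5, 7]) for _ in range(5)],
        "expand": [random.choice([2, 4, 6]) for _ in range(5)],
        "depth":  [random.choice([2, 3, 4]) for _ in range(5)],
    }

# Crude stand-ins for real profiling (illustrative only): FLOPs grow with
# width/resolution/expansion; peak SRAM is set by the largest expansion.
def flops_m(c):
    return 1e-4 * c["width"] * c["res"] ** 2 * sum(
        e * d for e, d in zip(c["expand"], c["depth"]))

def peak_kb(c):
    return 8e-3 * c["width"] * c["res"] ** 2 * max(c["expand"])

def flops_at_percentile(width, res, p=0.8, n=1000, sram_kb=320):
    """FLOPs (M) at the p-th percentile of the CDF over *satisfying*
    models; keep the (width, resolution) space that maximizes this."""
    flops = sorted(flops_m(c)
                   for c in (rand_config(width, res) for _ in range(n))
                   if peak_kb(c) <= sram_kb)
    return flops[int(p * len(flops))] if flops else 0.0

candidates = [(0.3, 160), (0.4, 144), (0.5, 144), (0.7, 96)]
print("chosen (width, res):", max(candidates, key=lambda wr: flops_at_percentile(*wr)))
```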

TinyNAS: (2) Resource-constrained model specialization

• One-shot NAS through weight sharing*: train a super network once, randomly sampling sub-networks (kernel size, expansion ratio, depth) and jointly fine-tuning multiple sub-networks at each step.
• Small sub-networks are nested in large sub-networks, so we can directly evaluate the accuracy of sub-nets without retraining.

* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20

The super network is elastic along three axes: kernel size, depth, and width.

Elastic Kernel Size: start with the full 7x7 kernel; a smaller kernel (5x5, 3x3) indexes the centered weights of the larger one, as in the sketch below.
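A minimal sketch of the centered-weight slicing (array shapes are illustrative):

```python
import numpy as np

def centered_kernel(w_full, k):
    """Elastic kernel size: a k x k kernel reuses the central weights of
    the full kernel. w_full: (out_ch, in_ch, 7, 7)."""
    s = (w_full.shape[-1] - k) // 2
    return w_full[:, :, s:s + k, s:s + k]

w7 = np.random.randn(16, 16, 7, 7)
print(centered_kernel(w7, 5).shape, centered_kernel(w7, 3).shape)
# (16, 16, 5, 5) (16, 16, 3, 3)
```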

Elastic Depth: train each unit with full depth, then shrink the depth by allowing the later layers in the unit to be skipped (outputs O1, O2, O3 correspond to the different depths). A minimal sketch follows.
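A minimal sketch of depth shrinking (the toy "layers" stand in for real blocks):

```python
import numpy as np

def run_unit(layers, depth, x):
    """Elastic depth: run only the first `depth` layers of a unit; the
    trailing layers are skipped when a shallow sub-network is sampled."""
    for layer in layers[:depth]:
        x = layer(x)
    return x

# Toy unit of 4 layers; a sampled sub-network may use depth 2, 3, or 4.
unit = [lambda x, i=i: x + i for i in range(4)]
x = np.zeros(3)
print(run_unit(unit, 2, x))  # only layers 0 and 1 applied -> [1. 1. 1.]
```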

Elastic Width: train with the full width; compute a per-channel importance score (e.g., 0.85, 0.63, 0.15, 0.02), sort the channels by importance ("reorg."), and keep the most important channels when shrinking the width. A sketch follows.
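A minimal sketch, assuming the L1 norm of each channel's weights as the importance score (the exact score is not specified on the slide):

```python
import numpy as np

def shrink_width(w, n_keep):
    """Elastic width: score each output channel by the L1 norm of its
    weights, sort, and keep the top n_keep channels, so a narrow
    sub-network inherits the most important channels.
    w: (out_ch, in_ch, kh, kw). The next layer's input channels must be
    reorganized with the same surviving index set."""
    importance = np.abs(w).sum(axis=(1, 2, 3))        # one score per channel
    keep = np.sort(np.argsort(-importance)[:n_keep])  # keep original order
    return w[keep], keep

w = np.random.randn(8, 4, 3, 3)
w_small, keep = shrink_width(w, 4)
print(w_small.shape, keep)  # (4, 4, 3, 3) plus the surviving channel ids
```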

TinyNAS Better Utilizes the Memory

[Figure: per-block peak memory (kB, 0-300) of the first two stages for MobileNetV2 vs. TinyNAS, with the max and average of each marked; TinyNAS's peak memory is 2.2x and 1.6x lower for the two stages.]

• TinyNAS designs networks with more uniform peak memory across blocks, allowing us to fit a larger model in the same amount of memory.

TinyEngine: A Memory-Efficient Inference Library

1. Reducing overhead with separated compilation & runtime

(a) Existing libraries are based on runtime interpretation (e.g., TF-Lite Micro, CMSIS-NN): the runtime must carry all supported ops plus the model's meta information and perform memory allocation on the fly, incurring computation, storage, and memory overhead.

(b) TinyEngine: model-adaptive code generation. At compile time (offline), it generates specialized ops and a memory schedule for the given NN model; the runtime only performs inference.

[Figure: peak memory (kB) vs. the ARM CMSIS-NN baseline (160kB): code generation alone is 1.7x smaller.]
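To illustrate the difference from interpretation, here is a toy code generator; every kernel name, signature, and buffer offset below is hypothetical, not TinyEngine's real output. Instead of an interpreter walking a graph at runtime, we emit straight-line C that calls one specialized kernel per layer and addresses a static arena planned offline.

```python
# Toy model description: (specialized kernel name, output offset in arena).
# 200704 = 112 * 112 * 16 bytes, i.e. one int8 feature map in this example.
model = [
    ("conv3x3_s2_r112_c16",    0),
    ("dwconv3x3_r112_c16",     200704),
    ("conv1x1_r112_c16_to_c8", 0),
]

def emit_c(model, arena_kb=320):
    """Emit straight-line C with a statically planned memory schedule."""
    lines = [f"static int8_t arena[{arena_kb} * 1024];",
             "void invoke(const int8_t *input) {",
             "  const int8_t *x = input;"]
    for kernel, offset in model:
        lines.append(f"  {kernel}(x, arena + {offset});  // specialized op")
        lines.append(f"  x = arena + {offset};")
    lines.append("}")
    return "\n".join(lines)

print(emit_c(model))
```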

TinyEngine: A Memory-Efficient Inference Library

2. In-place depth-wise convolution

The dilemma of the inverted bottleneck: the expansion layer inflates the activation to 6x the size of the 1x input/output, so the expanded depth-wise stage dominates peak memory.

(a) Standard depth-wise convolution keeps separate input and output activations for all N channels: peak memory 2N.
(b) In-place depth-wise convolution: each output channel depends only on the corresponding input channel, so channel i can be computed into a small temp buffer and then written back over input channel i. Peak memory: N+1.

[Figure: peak memory (kB) vs. the ARM CMSIS-NN baseline (160kB): code generation plus in-place depth-wise is 3.4x smaller.]
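A NumPy sketch of the idea (a toy, not the MCU kernel: it materializes a padded copy for brevity where a real kernel would use bounds-checked indexing):

```python
import numpy as np

def inplace_depthwise3x3(x, w):
    """In-place depth-wise 3x3 convolution (stride 1, zero padding).
    Output channel i depends only on input channel i, so each channel is
    computed into a single-channel temp buffer and written back over the
    input: peak memory ~ N+1 channels instead of 2N.
    x: (C, H, W), overwritten in place; w: (C, 3, 3)."""
    c, h, wd = x.shape
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    tmp = np.empty((h, wd), dtype=x.dtype)       # the "+1" temp buffer
    for i in range(c):                           # one channel at a time
        tmp[:] = 0
        for dy in range(3):
            for dx in range(3):
                tmp += w[i, dy, dx] * pad[i, dy:dy + h, dx:dx + wd]
        x[i] = tmp                               # write back in place
    return x

x = np.random.randn(16, 8, 8).astype(np.float32)
w = np.random.randn(16, 3, 3).astype(np.float32)
inplace_depthwise3x3(x, w)   # x now holds the output activation
```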

TinyEngine: Faster Inference Speed

Analyzing the throughput (million MAC/s) improved by each technique, starting from the ARM CMSIS-NN baseline at 52 MMAC/s:

(1) Code-generator-based compilation eliminates the overhead of runtime interpretation: 52 -> 64 MMAC/s.

(2) Model-adaptive memory scheduling increases data reuse for each layer, via (a) model-level memory scheduling and (b) tile size configuration for the specialized im2col buffer: 64 -> 70 MMAC/s. A toy im2col sketch follows.
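A toy tiled im2col convolution (tile size and shapes are illustrative; the real kernels are specialized C, not NumPy):

```python
import numpy as np

def conv_via_im2col(x, w, tile_rows=16):
    """Convolution as matmul over an im2col buffer, processed in tiles so
    the buffer of unfolded patches stays small. x: (C, H, W),
    w: (OC, C, 3, 3); stride 1, no padding."""
    c, h, wd = x.shape
    oc = w.shape[0]
    oh, ow = h - 2, wd - 2
    wmat = w.reshape(oc, -1)                        # (OC, C*9)
    out = np.empty((oc, oh * ow), dtype=x.dtype)
    patches = np.empty((tile_rows, c * 9), dtype=x.dtype)
    for start in range(0, oh * ow, tile_rows):      # one output tile at a time
        idx = range(start, min(start + tile_rows, oh * ow))
        for r, p in enumerate(idx):
            y0, x0 = divmod(p, ow)
            patches[r] = x[:, y0:y0 + 3, x0:x0 + 3].ravel()
        n = len(idx)
        out[:, start:start + n] = wmat @ patches[:n].T
    return out.reshape(oc, oh, ow)

x = np.random.randn(8, 10, 10).astype(np.float32)
w = np.random.randn(4, 8, 3, 3).astype(np.float32)
print(conv_via_im2col(x, w).shape)   # (4, 8, 8)
```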

(3) Computation kernel specialization, part 1: operation fusion. Fuse, e.g., Pad+Conv+ReLU+BN into one specialized kernel to minimize the memory footprint and optimize the overall computation: 70 -> 75 MMAC/s.
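The sketch below shows the BatchNorm-folding step that makes such a fused Conv+BN(+ReLU) kernel possible; shapes are illustrative:

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm that follows a conv into the conv's weights and
    bias, so the fused kernel runs conv (+ReLU) in one pass with no
    intermediate activation. w: (OC, IC, KH, KW), b: (OC,)."""
    scale = gamma / np.sqrt(var + eps)           # per-output-channel scale
    w_fused = w * scale[:, None, None, None]
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused

# After folding, Pad+Conv+BN+ReLU becomes a single specialized kernel:
#   y = relu(conv(pad(x), w_fused) + b_fused)
```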

(3) Computation kernel specialization, part 2: loop unrolling. Fully unroll small fixed-size loops (e.g., the 3x3 convolution kernel loop) to eliminate the branch-instruction overhead of loops: 75 -> 79 MMAC/s. A toy generator is sketched below.
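A toy generator for the unrolled inner loop (the emitted C is illustrative, not TinyEngine's actual output):

```python
def emit_unrolled_3x3():
    """Fully unroll the 3x3 multiply-accumulate loop into straight-line C:
    no loop counters or branch instructions remain in the emitted code."""
    body = [f"acc += x[{dy}][{dx}] * w[{dy}][{dx}];"
            for dy in range(3) for dx in range(3)]
    return "int32_t acc = bias;\n" + "\n".join(body)

print(emit_unrolled_3x3())
```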

(3) Computation kernel specialization, part 3: loop tiling. Tile loops based on the kernel size and the available memory, specialized per layer and per network architecture: 79 -> 82 MMAC/s, for an overall 1.6x speedup over CMSIS-NN.

[Figure: tiled im2col buffer x weights -> output, with the tile size chosen per layer.]
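A toy version of sizing the im2col tile from the kernel shape and the SRAM left over for the buffer (the sizing formula is an illustrative assumption):

```python
def im2col_tile_rows(c, kh, kw, sram_budget_bytes, bytes_per_elem=1):
    """Pick how many output pixels to unfold per tile so the im2col buffer
    (tile_rows x C*KH*KW elements) fits in the memory left for it."""
    row_bytes = c * kh * kw * bytes_per_elem
    return max(1, sram_budget_bytes // row_bytes)

# e.g. 64 input channels, 3x3 kernel, 32 kB left for the im2col buffer:
print(im2col_tile_rows(64, 3, 3, 32 * 1024))  # -> 56 rows per tile
```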

TinyEngine: A Memory-Efficient Inference Library

Summary vs. the ARM CMSIS-NN baseline: peak memory is 3.4x smaller (code generation + in-place depth-wise), and inference is 1.6x faster (52 -> 82 million MAC/s after code generation, specialized im2col, op fusion, loop unrolling, and tiling).

[Figure: normalized speed and peak memory (kB) of TF-Lite Micro, MicroTVM, tuned CMSIS-NN, and TinyEngine on SmallCifar, MobileNetV2, ProxylessNAS, and MnasNet. TinyEngine runs 1.5-3x faster than the other libraries (several of which OOM on the larger models) and uses 2.7-4.8x less peak memory.]

• Consistent improvement on different networks

Experimental Results

We focus on large-scale datasets to reflect real-life use cases:
(1) ImageNet-1000
(2) Wake words: visual (Visual Wake Words) and audio (Google Speech Commands)

System-Algorithm Co-design Gives the Best Results

• ImageNet classification on the STM32F746 MCU (320kB SRAM, 1MB Flash), top-1 accuracy:

Baseline (MbV2* + CMSIS-NN): 40%
System-only (MbV2* + TinyEngine): 44%
Model-only (TinyNAS + CMSIS-NN): 56%
Co-design (TinyNAS + TinyEngine): 62%

* scaled-down version: width multiplier 0.3, input resolution 80

Handling Diverse Hardware

• Specializing models (int4) for different MCUs (SRAM/Flash), ImageNet top-1 accuracy:

STM32F412 (256kB/1MB): 62.0%
STM32F746 (320kB/1MB): 63.5%
STM32F765 (512kB/1MB): 65.9%
STM32H743 (512kB/2MB): 70.7%

vs. the MobileNetV2 + CMSIS-NN baseline at 53.8% (+17% for MCUNet).

The first to achieve >70% ImageNet accuracy on commercial MCUs.

Reduce Both Model Size and Activation Size

[Figure: parameters (MB) and peak activation (MB), 0-50, at ~70% ImageNet top-1 for ResNet-18, MobileNetV2-0.75, and MCUNet; MCUNet cuts parameters by 24.6x and peak activation by 13.8x.]

Visual Wake Words (VWW)

[Figure: VWW accuracy (84-92%) vs. (a) measured latency (ms, with 5FPS/10FPS marks) and (b) peak SRAM (kB, with the 256kB MCU constraint), comparing MCUNet, MobileNetV2, ProxylessNAS, and Han et al. (baselines OOM beyond the constraint). MCUNet is 2.4-3.4x faster and 3.7x smaller at the same or better accuracy.]

Audio Wake Words (Google Speech Commands)

[Figure: GSC accuracy (88-96%) vs. (a) measured latency (ms, with 5FPS/10FPS marks) and (b) peak SRAM (kB, with the 256kB constraint), comparing MCUNet, MobileNetV2, and ProxylessNAS (baselines OOM beyond the constraint). MCUNet is 2.8x faster and 4.1x smaller with 2% higher accuracy.]

Visual Wake Word Detection

• Detecting whether a person is present in the frame

MCUNet: Tiny Deep Learning on IoT Devices

• Our study suggests that the era of tiny machine learning on IoT devices has arrived: Cloud AI (ResNet) -> Mobile AI (MobileNet) -> Tiny AI (MCUNet).

Project page: http://tinyml.mit.edu
