
Page 1:

Massachusetts Institute of Technology

Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han

Once for All: Train One Network and Specialize it for Efficient Deployment

Once-for-All, ICLR’20

Page 2:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: three hardware tiers, from more to less resource]
• Cloud AI — Memory: 32 GB; Computation: 10^12 FLOPS
• Mobile AI — Memory: 4 GB; Computation: 10^9 FLOPS
• Tiny AI (AIoT) — Memory: 100 KB; Computation: < 10^6 FLOPS

• Different hardware platforms have different resource constraints. We need to customize our models for each platform to achieve the best accuracy-efficiency trade-off, especially on resource-constrained edge devices.

Page 3:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: design cost (GPU hours) of specializing a model for one deployment scenario: 40K]

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-Aware Neural Architecture Search for Mobile." CVPR 2019.

Page 4:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: design cost (GPU hours) grows as hardware generations accumulate (2013, 2015, 2017, 2019): 40K → 160K]

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-Aware Neural Architecture Search for Mobile." CVPR 2019.

Page 5:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: spanning Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), and Tiny AI (10^6 FLOPS), design cost (GPU hours) keeps growing: 40K → 160K → 1600K]

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-Aware Neural Architecture Search for Mobile." CVPR 2019.

Page 6:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: design cost (GPU hours) and the corresponding carbon footprint — 40K GPU hours → 11.4k lbs CO2 (40,000 × 0.284 ≈ 11.4k), 160K → 45.4k lbs, 1600K → 454.4k lbs]

1 GPU hour translates to 0.284 lbs of CO2 emission according to Strubell, Emma, et al. "Energy and Policy Considerations for Deep Learning in NLP." ACL 2019.

Page 7:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: can a single design serve Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), and Tiny AI (10^6 FLOPS), without paying the 40K / 160K / 1600K GPU-hour design cost (11.4k / 45.4k / 454.4k lbs CO2) per scenario?]

Page 8:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: the proposed answer — a single Once-for-All Network serving Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), and Tiny AI (10^6 FLOPS), amortizing the 40K / 160K / 1600K GPU-hour design cost and its CO2 footprint]

Page 9:

Once-for-All Network: Decouple Model Training and Architecture Design

[Animation, pages 9–12: train a single once-for-all network once; then, for each deployment scenario, derive a specialized sub-network from it without retraining]

Page 13:

Progressive Shrinking for Training OFA Networks

• More than 10^19 different sub-networks are contained in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, and width (see the count sketch below).

• Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.
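The "more than 10^19" figure can be sanity-checked with a quick Python computation, assuming the architecture space described in the paper: 5 units, each with an elastic depth in {2, 3, 4}, and each layer independently choosing one of 3 kernel sizes and 3 width expand ratios.

```python
# Back-of-the-envelope count of the sub-network space (assumptions above).
choices_per_layer = 3 * 3                          # kernel sizes x widths
per_unit = sum(choices_per_layer ** d for d in (2, 3, 4))
print(f"{per_unit ** 5:.2e}")                      # ~2e19 sub-networks
```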

Page 14:

Progressive Shrinking

Train the full model → shrink the model (4 dimensions) → jointly fine-tune both large and small sub-networks → once-for-all network

• Small sub-networks are nested inside large sub-networks.
• The training process of the once-for-all network is cast as a progressive shrinking and joint fine-tuning process (a minimal training-step sketch follows).
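A minimal sketch of one progressive-shrinking training step, in PyTorch. The stage schedule, helper name, and loss weighting are illustrative assumptions, not the repo's exact training code; `set_active_subnet(ks=, d=, e=)` mirrors the interface the released network exposes, and the full network serves as a distillation teacher as described in the paper.

```python
import random
import torch
import torch.nn.functional as F

def ps_train_step(ofa_net, full_net, images, labels, stage):
    # Earlier stages keep kernel size / depth / width at their maximum;
    # later stages progressively unlock the smaller choices.
    cfg = {
        "ks": random.choice([7] if stage < 1 else [3, 5, 7]),
        "d":  random.choice([4] if stage < 2 else [2, 3, 4]),
        "e":  random.choice([6] if stage < 3 else [3, 4, 6]),
    }
    ofa_net.set_active_subnet(**cfg)        # activate a sampled sub-network
    logits = ofa_net(images)
    with torch.no_grad():                   # knowledge-distillation target
        soft_labels = F.softmax(full_net(images), dim=-1)
    loss = F.cross_entropy(logits, labels) + F.kl_div(
        F.log_softmax(logits, dim=-1), soft_labels, reduction="batchmean")
    loss.backward()
    return loss
```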

Page 15:

Connection to Network Pruning

Network Pruning: train the full model → shrink the model (only width) → fine-tune the small net → single pruned network

Progressive Shrinking: train the full model → shrink the model (4 dimensions) → fine-tune both large and small sub-nets → once-for-all network

• Progressive shrinking can be viewed as generalized network pruning with much higher flexibility across 4 dimensions.

Page 16:

Progressive Shrinking: Elastic Resolution

Randomly sample the input image size for each batch (a minimal sketch follows).

[Stage indicator: resolution is now elastic (partial); kernel size, depth, and width remain full]
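A minimal PyTorch sketch of per-batch resolution sampling; the candidate sizes follow the paper's setting of roughly 128 to 224 pixels.

```python
import random
import torch.nn.functional as F

CANDIDATE_SIZES = list(range(128, 225, 4))    # 128, 132, ..., 224

def sample_resolution(images):
    """Randomly pick an input resolution for this batch, resize on the fly."""
    size = random.choice(CANDIDATE_SIZES)
    return F.interpolate(images, size=(size, size),
                         mode="bilinear", align_corners=False)
```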

Page 17:

Progressive Shrinking: Elastic Kernel Size

Start with the full kernel size; a smaller kernel takes the centered weights via a transformation matrix: 7x7 → (25x25 transform matrix) → 5x5 → (9x9 transform matrix) → 3x3 (a code sketch follows).

[Stage indicator: resolution and kernel size are now elastic (partial); depth and width remain full]
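A minimal sketch of the kernel transformation: the smaller kernel takes the centered weights of the larger one, passed through a learned linear transform (25x25 for 7→5, 9x9 for 5→3) so each kernel size keeps its own weight distribution. The class and shapes are illustrative (the actual networks use depthwise convolutions).

```python
import torch
import torch.nn as nn

class ElasticKernel(nn.Module):
    def __init__(self, out_ch, in_ch):
        super().__init__()
        self.weight7 = nn.Parameter(torch.randn(out_ch, in_ch, 7, 7))
        self.transform_7to5 = nn.Parameter(torch.eye(25))  # on 5x5 centers
        self.transform_5to3 = nn.Parameter(torch.eye(9))   # on 3x3 centers

    def kernel(self, ks):
        w = self.weight7
        if ks <= 5:   # take the 5x5 center of the 7x7 kernel, then transform
            c = w[:, :, 1:6, 1:6].reshape(*w.shape[:2], 25)
            w = (c @ self.transform_7to5).reshape(*w.shape[:2], 5, 5)
        if ks == 3:   # then the 3x3 center of the transformed 5x5 kernel
            c = w[:, :, 1:4, 1:4].reshape(*w.shape[:2], 9)
            w = (c @ self.transform_5to3).reshape(*w.shape[:2], 3, 3)
        return w
```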

Page 18:

Progressive Shrinking: Elastic Depth

Train with the full depth, then gradually allow later layers in each unit to be skipped to reduce the depth: unit i produces outputs O1, O2, O3 at progressively smaller depths (a code sketch follows).

[Stage indicator: resolution, kernel size, and depth are now elastic (partial); width remains full]
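A minimal sketch of a depth-elastic unit: only the first `active_depth` layers run, so small sub-networks share the leading layers of large ones. Names are illustrative.

```python
import torch.nn as nn

class ElasticUnit(nn.Module):
    def __init__(self, layers, max_depth=4):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.active_depth = max_depth   # set lower to skip later layers

    def forward(self, x):
        for layer in self.layers[: self.active_depth]:
            x = layer(x)
        return x   # outputs O1/O2/O3 correspond to smaller/larger depths
```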

Page 19:

Progressive Shrinking: Elastic Width

Train with the full width, then progressively shrink the width. Channel importance is computed for each channel, channels are sorted by importance and reorganized ("reorg."), and the most important channels are kept when shrinking (a code sketch follows).

[Stage indicator: all four dimensions — resolution, kernel size, depth, and width — are now elastic (partial)]
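A minimal sketch of channel sorting: the L1 norm of each output channel's weights serves as its importance, and channels are reordered so that a shrunk layer keeps the most important ones. The function name is illustrative.

```python
import torch

def sort_channels_by_importance(conv_weight):
    """conv_weight: (out_ch, in_ch, k, k) -> (sorted weight, permutation).

    The returned permutation must also be applied to the next layer's
    input channels so the network computes the same function.
    """
    importance = conv_weight.abs().sum(dim=(1, 2, 3))  # L1 norm per channel
    order = torch.argsort(importance, descending=True)
    return conv_weight[order], order
```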

Page 20:

Performance of Sub-networks on ImageNet

[Figure: ImageNet top-1 accuracy (%) of sub-networks trained with vs. without progressive shrinking (PS); across configurations D ∈ {2, 4}, W ∈ {3, 6}, K ∈ {3, 7}, PS improves accuracy by 2.5%–3.7%]

Sub-networks under various architecture configurations; D: depth, W: width, K: kernel size.

• Progressive shrinking consistently improves the accuracy of sub-networks on ImageNet.

Page 21:

OFA: 80% Top-1 Accuracy on ImageNet

[Figure: ImageNet top-1 accuracy (%, higher is better) vs. MACs (billions, lower is better), with marker size indicating model size (2M–64M); handcrafted models (MobileNetV1/V2, ShuffleNet, IGCV3-D, InceptionV2/V3, DenseNet-121/169/264, ResNet-50/101, ResNeXt-50/101, Xception, DPN-92) and AutoML models (NASNet-A, AmoebaNet, PNASNet, DARTS, ProxylessNAS, MobileNetV3, EfficientNet); Once-for-All (ours) reaches 80.0% top-1 at 595M MACs, with up to 14x less computation]

• Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).

Page 22:

Comparison with EfficientNet and MobileNetV3

[Figure, left: ImageNet top-1 accuracy (%) vs. Google Pixel1 latency (ms), OFA vs. EfficientNet — OFA reaches 80.1% top-1; at comparable accuracy it is 2.6x faster, and at comparable latency it is 3.8% more accurate]

[Figure, right: ImageNet top-1 accuracy (%) vs. Google Pixel1 latency (ms), OFA vs. MobileNetV3 — at comparable accuracy OFA is 1.5x faster, and at comparable latency up to 4% more accurate]

• Once-for-all is 2.6x faster than EfficientNet and 1.5x faster than MobileNetV3 on the Google Pixel1 without loss of accuracy.

Page 23:

OFA for Fast Specialization on Diverse Hardware Platforms

[Figure: ImageNet top-1 accuracy (%) vs. measured latency (ms) on six platforms — Samsung S7 Edge, Google Pixel2, and LG G8 (OFA vs. MobileNetV3 and MobileNetV2), plus NVIDIA 1080Ti GPU (batch size 64), Intel Xeon CPU (batch size 1), and Xilinx ZU3EG FPGA (batch size 1, quantized); OFA dominates the baselines at every latency target, reaching up to 76.3% on the S7 Edge, 75.8% on the Pixel2, 76.4% on the LG G8, 76.4% on the 1080Ti, 75.7% on the Xeon CPU, and 73.7% on the FPGA]

Page 24:

OFA Saves Orders of Magnitude of Design Cost

• Green AI is important. The computation cost of OFA stays constant with the number of hardware platforms, reducing the carbon footprint by 1,335x compared to MnasNet under 40 platforms.

Page 25:

OFA for FPGA Accelerators

[Figure, measured results on the Xilinx ZU3EG FPGA: arithmetic intensity (OPS/Byte, 0–50) and throughput (GOPS/s, 0–80) for MobileNetV2, MnasNet, and OFA (ours) — OFA achieves 40% higher arithmetic intensity and 57% higher throughput]

• Non-specialized neural networks do not fully utilize hardware resources. There is large room for improvement via neural network specialization.

Page 26:

Summary

• We introduce the once-for-all network for efficient inference on diverse hardware platforms.
• We present progressive shrinking, an effective approach for training once-for-all networks: train the full model → shrink the model in 4 dimensions → fine-tune both large and small sub-nets → once-for-all network.
• The once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios, setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
• First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV'19; first place in the 4th Low-Power Computer Vision Challenge @NeurIPS'19.
• Released 50 different pre-trained specialized OFA models for diverse hardware platforms (CPU/GPU/FPGA/DSP): net, image_size = ofa_specialized(net_id, pretrained=True)
• Released the training code and a pre-trained OFA network that provides diverse sub-networks without retraining: ofa_network = ofa_net(net_id, pretrained=True) (a usage sketch follows below).

Project Page: https://ofa.mit.edu
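A usage sketch of the released API, based on the helpers published in the project's GitHub repo (mit-han-lab/once-for-all); the net_id string below is assumed to be one of the released model IDs.

```python
from ofa.model_zoo import ofa_net

# Load a pre-trained once-for-all network.
ofa_network = ofa_net("ofa_mbv3_d234_e346_k357_w1.0", pretrained=True)

# Pick a sub-network by kernel size, depth, and width expand ratio, then
# extract it with trained weights -- no retraining needed.
ofa_network.set_active_subnet(ks=5, d=3, e=4)
subnet = ofa_network.get_active_subnet(preserve_weight=True)
```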