
Page 1:

Massachusetts Institute of Technology

Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han

Once for All: Train One Network and Specialize it for Efficient Deployment

Once-for-All, ICLR’20

Page 2:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: three hardware tiers, from more to less resource]
• Cloud AI — Memory: 32 GB; Computation: 10^12 FLOPS
• Mobile AI — Memory: 4 GB; Computation: 10^9 FLOPS
• Tiny AI (AIoT) — Memory: 100 KB; Computation: < 10^6 FLOPS

• Different hardware platforms have different resource constraints. We need to customize our models for each platform to achieve the best accuracy-efficiency trade-off, especially on resource-constrained edge devices.

Page 3:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: design cost (GPU hours) of specializing a model for one deployment scenario: 40K]

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-Aware Neural Architecture Search for Mobile." CVPR 2019.

Page 4:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: design cost (GPU hours) grows as hardware generations accumulate (2013, 2015, 2017, 2019): 40K → 160K]

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-Aware Neural Architecture Search for Mobile." CVPR 2019.

Page 5:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: spanning Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), and Tiny AI (10^6 FLOPS), design cost (GPU hours) keeps growing: 40K → 160K → 1600K]

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-Aware Neural Architecture Search for Mobile." CVPR 2019.

Page 6:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: design cost (GPU hours) and the corresponding carbon footprint — 40K GPU hours → 11.4k lbs CO2 (40,000 × 0.284 ≈ 11.4k), 160K → 45.4k lbs, 1600K → 454.4k lbs]

1 GPU hour translates to 0.284 lbs of CO2 emission according to Strubell, Emma, et al. "Energy and Policy Considerations for Deep Learning in NLP." ACL 2019.

Page 7:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: can a single design serve Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), and Tiny AI (10^6 FLOPS), without paying the 40K / 160K / 1600K GPU-hour design cost (11.4k / 45.4k / 454.4k lbs CO2) per scenario?]

Page 8:

Challenge: Efficient Inference on Diverse Hardware Platforms

[Figure: the proposed answer — a single Once-for-All Network serving Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), and Tiny AI (10^6 FLOPS), amortizing the 40K / 160K / 1600K GPU-hour design cost and its CO2 footprint]

Page 9:

Once-for-All Network: Decouple Model Training and Architecture Design

[Animation, pages 9–12: train a single once-for-all network once; then, for each deployment scenario, derive a specialized sub-network from it without retraining]

Page 13:

Progressive Shrinking for Training OFA Networks

• More than 10^19 different sub-networks are contained in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, and width (see the count sketch below).

• Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.
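The "more than 10^19" figure can be sanity-checked with a quick Python computation, assuming the architecture space described in the paper: 5 units, each with an elastic depth in {2, 3, 4}, and each layer independently choosing one of 3 kernel sizes and 3 width expand ratios.

```python
# Back-of-the-envelope count of the sub-network space (assumptions above).
choices_per_layer = 3 * 3                          # kernel sizes x widths
per_unit = sum(choices_per_layer ** d for d in (2, 3, 4))
print(f"{per_unit ** 5:.2e}")                      # ~2e19 sub-networks
```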

Page 14:

Progressive Shrinking

Train the full model → shrink the model (4 dimensions) → jointly fine-tune both large and small sub-networks → once-for-all network

• Small sub-networks are nested inside large sub-networks.
• The training process of the once-for-all network is cast as a progressive shrinking and joint fine-tuning process (a minimal training-step sketch follows).
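A minimal sketch of one progressive-shrinking training step, in PyTorch. The stage schedule, helper name, and loss weighting are illustrative assumptions, not the repo's exact training code; `set_active_subnet(ks=, d=, e=)` mirrors the interface the released network exposes, and the full network serves as a distillation teacher as described in the paper.

```python
import random
import torch
import torch.nn.functional as F

def ps_train_step(ofa_net, full_net, images, labels, stage):
    # Earlier stages keep kernel size / depth / width at their maximum;
    # later stages progressively unlock the smaller choices.
    cfg = {
        "ks": random.choice([7] if stage < 1 else [3, 5, 7]),
        "d":  random.choice([4] if stage < 2 else [2, 3, 4]),
        "e":  random.choice([6] if stage < 3 else [3, 4, 6]),
    }
    ofa_net.set_active_subnet(**cfg)        # activate a sampled sub-network
    logits = ofa_net(images)
    with torch.no_grad():                   # knowledge-distillation target
        soft_labels = F.softmax(full_net(images), dim=-1)
    loss = F.cross_entropy(logits, labels) + F.kl_div(
        F.log_softmax(logits, dim=-1), soft_labels, reduction="batchmean")
    loss.backward()
    return loss
```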

Page 15:

Connection to Network Pruning

Network Pruning: train the full model → shrink the model (only width) → fine-tune the small net → single pruned network

Progressive Shrinking: train the full model → shrink the model (4 dimensions) → fine-tune both large and small sub-nets → once-for-all network

• Progressive shrinking can be viewed as generalized network pruning with much higher flexibility across 4 dimensions.

Page 16:

Progressive Shrinking: Elastic Resolution

Randomly sample the input image size for each batch (a minimal sketch follows).

[Stage indicator: resolution is now elastic (partial); kernel size, depth, and width remain full]
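A minimal PyTorch sketch of per-batch resolution sampling; the candidate sizes follow the paper's setting of roughly 128 to 224 pixels.

```python
import random
import torch.nn.functional as F

CANDIDATE_SIZES = list(range(128, 225, 4))    # 128, 132, ..., 224

def sample_resolution(images):
    """Randomly pick an input resolution for this batch, resize on the fly."""
    size = random.choice(CANDIDATE_SIZES)
    return F.interpolate(images, size=(size, size),
                         mode="bilinear", align_corners=False)
```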

Page 17:

Progressive Shrinking: Elastic Kernel Size

Start with the full kernel size; a smaller kernel takes the centered weights via a transformation matrix: 7x7 → (25x25 transform matrix) → 5x5 → (9x9 transform matrix) → 3x3 (a code sketch follows).

[Stage indicator: resolution and kernel size are now elastic (partial); depth and width remain full]
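A minimal sketch of the kernel transformation: the smaller kernel takes the centered weights of the larger one, passed through a learned linear transform (25x25 for 7→5, 9x9 for 5→3) so each kernel size keeps its own weight distribution. The class and shapes are illustrative (the actual networks use depthwise convolutions).

```python
import torch
import torch.nn as nn

class ElasticKernel(nn.Module):
    def __init__(self, out_ch, in_ch):
        super().__init__()
        self.weight7 = nn.Parameter(torch.randn(out_ch, in_ch, 7, 7))
        self.transform_7to5 = nn.Parameter(torch.eye(25))  # on 5x5 centers
        self.transform_5to3 = nn.Parameter(torch.eye(9))   # on 3x3 centers

    def kernel(self, ks):
        w = self.weight7
        if ks <= 5:   # take the 5x5 center of the 7x7 kernel, then transform
            c = w[:, :, 1:6, 1:6].reshape(*w.shape[:2], 25)
            w = (c @ self.transform_7to5).reshape(*w.shape[:2], 5, 5)
        if ks == 3:   # then the 3x3 center of the transformed 5x5 kernel
            c = w[:, :, 1:4, 1:4].reshape(*w.shape[:2], 9)
            w = (c @ self.transform_5to3).reshape(*w.shape[:2], 3, 3)
        return w
```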

Page 18:

Progressive Shrinking: Elastic Depth

Train with the full depth, then gradually allow later layers in each unit to be skipped to reduce the depth: unit i produces outputs O1, O2, O3 at progressively smaller depths (a code sketch follows).

[Stage indicator: resolution, kernel size, and depth are now elastic (partial); width remains full]
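A minimal sketch of a depth-elastic unit: only the first `active_depth` layers run, so small sub-networks share the leading layers of large ones. Names are illustrative.

```python
import torch.nn as nn

class ElasticUnit(nn.Module):
    def __init__(self, layers, max_depth=4):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.active_depth = max_depth   # set lower to skip later layers

    def forward(self, x):
        for layer in self.layers[: self.active_depth]:
            x = layer(x)
        return x   # outputs O1/O2/O3 correspond to smaller/larger depths
```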

Page 19:

Progressive Shrinking: Elastic Width

Train with the full width, then progressively shrink the width. Channel importance is computed for each channel, channels are sorted by importance and reorganized ("reorg."), and the most important channels are kept when shrinking (a code sketch follows).

[Stage indicator: all four dimensions — resolution, kernel size, depth, and width — are now elastic (partial)]
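A minimal sketch of channel sorting: the L1 norm of each output channel's weights serves as its importance, and channels are reordered so that a shrunk layer keeps the most important ones. The function name is illustrative.

```python
import torch

def sort_channels_by_importance(conv_weight):
    """conv_weight: (out_ch, in_ch, k, k) -> (sorted weight, permutation).

    The returned permutation must also be applied to the next layer's
    input channels so the network computes the same function.
    """
    importance = conv_weight.abs().sum(dim=(1, 2, 3))  # L1 norm per channel
    order = torch.argsort(importance, descending=True)
    return conv_weight[order], order
```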

Page 20:

Performance of Sub-networks on ImageNet

[Figure: ImageNet top-1 accuracy (%) of sub-networks trained with vs. without progressive shrinking (PS); across configurations D ∈ {2, 4}, W ∈ {3, 6}, K ∈ {3, 7}, PS improves accuracy by 2.5%–3.7%]

Sub-networks under various architecture configurations; D: depth, W: width, K: kernel size.

• Progressive shrinking consistently improves the accuracy of sub-networks on ImageNet.

Page 21:

OFA: 80% Top-1 Accuracy on ImageNet

[Figure: ImageNet top-1 accuracy (%, higher is better) vs. MACs (billions, lower is better), with marker size indicating model size (2M–64M); handcrafted models (MobileNetV1/V2, ShuffleNet, IGCV3-D, InceptionV2/V3, DenseNet-121/169/264, ResNet-50/101, ResNeXt-50/101, Xception, DPN-92) and AutoML models (NASNet-A, AmoebaNet, PNASNet, DARTS, ProxylessNAS, MobileNetV3, EfficientNet); Once-for-All (ours) reaches 80.0% top-1 at 595M MACs, with up to 14x less computation]

• Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).

Page 22:

Comparison with EfficientNet and MobileNetV3

[Figure, left: ImageNet top-1 accuracy (%) vs. Google Pixel1 latency (ms), OFA vs. EfficientNet — OFA reaches 80.1% top-1; at comparable accuracy it is 2.6x faster, and at comparable latency it is 3.8% more accurate]

[Figure, right: ImageNet top-1 accuracy (%) vs. Google Pixel1 latency (ms), OFA vs. MobileNetV3 — at comparable accuracy OFA is 1.5x faster, and at comparable latency up to 4% more accurate]

• Once-for-all is 2.6x faster than EfficientNet and 1.5x faster than MobileNetV3 on the Google Pixel1 without loss of accuracy.

Page 23:

OFA for Fast Specialization on Diverse Hardware Platforms

[Figure: ImageNet top-1 accuracy (%) vs. measured latency (ms) on six platforms — Samsung S7 Edge, Google Pixel2, and LG G8 (OFA vs. MobileNetV3 and MobileNetV2), plus NVIDIA 1080Ti GPU (batch size 64), Intel Xeon CPU (batch size 1), and Xilinx ZU3EG FPGA (batch size 1, quantized); OFA dominates the baselines at every latency target, reaching up to 76.3% on the S7 Edge, 75.8% on the Pixel2, 76.4% on the LG G8, 76.4% on the 1080Ti, 75.7% on the Xeon CPU, and 73.7% on the FPGA]

Page 24:

OFA Saves Orders of Magnitude of Design Cost

• Green AI is important. The computation cost of OFA stays constant with the number of hardware platforms, reducing the carbon footprint by 1,335x compared to MnasNet under 40 platforms.

Page 25:

OFA for FPGA Accelerators

[Figure, measured results on the Xilinx ZU3EG FPGA: arithmetic intensity (OPS/Byte, 0–50) and throughput (GOPS/s, 0–80) for MobileNetV2, MnasNet, and OFA (ours) — OFA achieves 40% higher arithmetic intensity and 57% higher throughput]

• Non-specialized neural networks do not fully utilize hardware resources. There is large room for improvement via neural network specialization.

Page 26:

Summary

• We introduce the once-for-all network for efficient inference on diverse hardware platforms.
• We present progressive shrinking, an effective approach for training once-for-all networks: train the full model → shrink the model in 4 dimensions → fine-tune both large and small sub-nets → once-for-all network.
• The once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios, setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
• First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV'19; first place in the 4th Low-Power Computer Vision Challenge @NeurIPS'19.
• Released 50 different pre-trained specialized OFA models for diverse hardware platforms (CPU/GPU/FPGA/DSP): net, image_size = ofa_specialized(net_id, pretrained=True)
• Released the training code and a pre-trained OFA network that provides diverse sub-networks without retraining: ofa_network = ofa_net(net_id, pretrained=True) (a usage sketch follows below).

Project Page: https://ofa.mit.edu
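A usage sketch of the released API, based on the helpers published in the project's GitHub repo (mit-han-lab/once-for-all); the net_id string below is assumed to be one of the released model IDs.

```python
from ofa.model_zoo import ofa_net

# Load a pre-trained once-for-all network.
ofa_network = ofa_net("ofa_mbv3_d234_e346_k357_w1.0", pretrained=True)

# Pick a sub-network by kernel size, depth, and width expand ratio, then
# extract it with trained weights -- no retraining needed.
ofa_network.set_active_subnet(ks=5, d=3, e=4)
subnet = ofa_network.get_active_subnet(preserve_weight=True)
```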