
  • Processing and Training Deep Neural Networks on Chip

    Ghouthi BOUKLI HACENE, Vincent GRIPON, Nicolas FARRUGIA, Matthieu ARZEL, Michel JEZEQUEL, Yoshua BENGIO

  • Context 2

    1024 V100 GPUs during 1 day for training, or 4 TPUs during 1 month for training.

    1T FLOPs for one decision; 100M parameters to learn.

  • Challenges 3

    Technical challenges: real-time applications; running deep learning on embedded systems with limited resources (1T FLOPs for one decision, 100M parameters to learn).

    Scientific challenges: large architectures make visualization and interpretation harder; simulation time limits the progress of the field (4 TPUs during 1 month for training, 100M parameters to learn).

    Societal challenges: large energy consumption; making deep learning accessible to everyone (1024 V100 GPUs during 1 day for training, 4 TPUs during 1 month for training).

  • Outline

    1. Deep Learning

      • 1.1 Deep Learning
      • 1.2 Some DNN Architectures
      • 1.3 Importance of the Architecture

    2. Efficient Inference

      • 2.1 Reducing DNNs Size
      • 2.2 Compression Methods
      • 2.3 Quantization
      • 2.4 Pruning

    3. Conclusion

  • Deep Learning

  • Deep Learning 5

    A deep learning architecture is basically an assembly of functions.

    Each function can be represented by an entity called a layer.

    Two most important layers:

    Fully connected layer: every input is connected to every output through a dense weight matrix, e.g. for 3 inputs and 5 outputs:

    w1,1  w1,2  w1,3  w1,4  w1,5
    w2,1  w2,2  w2,3  w2,4  w2,5
    w3,1  w3,2  w3,3  w3,4  w3,5

    Convolutional layer: a small kernel is slid over the input, which is equivalent to multiplying by a banded (Toeplitz-like) weight matrix; here three kernels (w1, w2, w3), (w4, w5, w6) and (w7, w8, w9) are applied to a length-5 input:

    w1  w2  w3  0   0
    0   w1  w2  w3  0
    0   0   w1  w2  w3

    w4  w5  w6  0   0
    0   w4  w5  w6  0
    0   0   w4  w5  w6

    w7  w8  w9  0   0
    0   w7  w8  w9  0
    0   0   w7  w8  w9
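    For concreteness, a minimal PyTorch sketch of the two layer types, with sizes matching the toy matrices above (the framework choice and the exact sizes are illustrative assumptions, not part of the slides):

```python
import torch
import torch.nn as nn

# Fully connected layer: 3 inputs, 5 outputs, i.e. a dense weight matrix (stored as 5x3).
fc = nn.Linear(in_features=3, out_features=5, bias=False)
x = torch.randn(1, 3)
y_fc = fc(x)                    # equivalent to x @ fc.weight.t()

# Convolutional layer: 3 kernels of width 3 slid over a length-5 input.
# Each kernel corresponds to one banded block of the big matrix above.
conv = nn.Conv1d(in_channels=1, out_channels=3, kernel_size=3, bias=False)
x_seq = torch.randn(1, 1, 5)    # (batch, channels, length)
y_conv = conv(x_seq)            # shape (1, 3, 3): 3 feature maps of length 3

print(y_fc.shape, y_conv.shape)
```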

  • Some DNN Architectures 6

    AlexNet-like network: Input → Layer 1 → Layer 2 → Layer 3 → Output (a plain feed-forward chain).

    Residual-like network: Input → Layer 1 → Layer 2 → Addition → Layer 3 → Output, where the addition node sums the output of Layer 2 with a skip connection (see the sketch below).
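    A minimal PyTorch sketch of the two patterns, assuming identical layer widths so the skip addition is shape-compatible (the linear layers and the width of 64 are illustrative, not the actual AlexNet/ResNet layers):

```python
import torch
import torch.nn as nn

class PlainNet(nn.Module):
    """AlexNet-like: a plain chain of layers."""
    def __init__(self, dim=64):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)
        self.layer2 = nn.Linear(dim, dim)
        self.layer3 = nn.Linear(dim, dim)

    def forward(self, x):
        return self.layer3(torch.relu(self.layer2(torch.relu(self.layer1(x)))))

class ResidualNet(nn.Module):
    """Residual-like: the output of layer 2 is summed with a skip connection."""
    def __init__(self, dim=64):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)
        self.layer2 = nn.Linear(dim, dim)
        self.layer3 = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        h = self.layer2(h) + x           # the "Addition" node of the slide
        return self.layer3(torch.relu(h))

x = torch.randn(1, 64)
print(PlainNet()(x).shape, ResidualNet()(x).shape)
```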

  • Importance of the architecture: memory 7

    [Scatter plot: memory (MB, 0–800) vs. top-1 error (%, 20–45) for models trained on 2012 ILSVRC, including alexnet, caffenet, squeezenet1-0/1-1, the vgg variants, googlenet, resnet-18/34/50/101/152, the resnext variants, inception-v3, the SE-ResNet/SE-ResNeXt/SENet family, densenet-121/161/169/201 and mcn-mobilenet.]

  • Importance of the architecture: FLOPs 8

    [Scatter plot: MFLOPs (0–22,000) vs. top-1 error (%, 20–45) for the same set of models trained on 2012 ILSVRC.]

  • Efficient Inference

  • Reducing DNNs size (CIFAR100) 9

    Scaling down proportionally the number of feature maps of each layer (see the sketch below).

    [Left plot: memory (MB, 50–200) vs. test set accuracy (%, 65–80) for VGG19, MobileNetV2, Resnet34 and Densenet121. Right plot: GFLOPs (0–3) vs. test set accuracy (%) for the same architectures.]
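    A minimal sketch of what "scaling down the feature maps" can look like: a global width multiplier applied to every convolutional block (the 0.5 factor, the layer widths and the block structure are illustrative assumptions, not the exact setup behind the curves above):

```python
import torch.nn as nn

def scaled_conv_block(in_ch, out_ch, width_mult=0.5):
    """Conv block whose number of output feature maps is scaled by width_mult."""
    out_ch = max(1, int(round(out_ch * width_mult)))
    block = nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
    return block, out_ch

# A toy 3-block backbone where every layer keeps half of its feature maps.
widths, in_ch, layers = [64, 128, 256], 3, []
for w in widths:
    block, in_ch = scaled_conv_block(in_ch, w, width_mult=0.5)
    layers.append(block)
backbone = nn.Sequential(*layers)
print(backbone)
```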

  • Compression Methods 10

    Baseline (full-precision weights):

    Layer 1:
     0.478  0.314  0.231  1.231 −0.423
    −1.987  1.332  0.977 −0.541  1.230
     0.322  0.431  0.221  0.112 −0.445
    −0.718  0.891 −0.231 −1.231 −0.331
    −1.412  0.490  0.791  0.901 −1.002

    Layer 2:
     0.528  0.710  0.730  0.231 −1.423
    −1.087  0.132  1.797 −1.041  1.131
     1.220  0.321  0.341  1.912 −1.445
    −0.798  1.291  1.481 −0.871 −0.821
     1.772 −0.484  0.179 −0.121 −1.921

  • Compression Methods 10

    Quantization (the same weights rounded to one decimal):

    Layer 1:
     0.5  0.3  0.2  1.2 −0.4
    −2.0  1.3  1.0 −0.5  1.2
     0.3  0.4  0.2  0.1 −0.4
    −0.7  0.9 −0.2 −1.2 −0.3
    −1.4  0.5  0.8  0.9 −1.0

    Layer 2:
     0.5  0.7  0.7  0.2 −1.4
    −1.1  0.1  1.8 −1.0  1.1
     1.2  0.3  0.3  1.9 −1.4
    −0.8  1.3  1.5 −0.9 −0.8
     1.8 −0.5  0.2 −0.1 −1.9
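    A minimal sketch of uniform weight quantization with a configurable bit width; a simplified symmetric scheme used here for illustration, not necessarily the exact quantizer behind the slides:

```python
import torch

def quantize_uniform(w: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Symmetric uniform quantization: map weights onto 2**(n_bits-1) - 1 levels per sign."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

w = torch.tensor([[0.478, 0.314, 0.231, 1.231, -0.423],
                  [-1.987, 1.332, 0.977, -0.541, 1.230]])
print(quantize_uniform(w, n_bits=4))
```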

  • How Many Bits Do We Need? 11

    [Two plots of test set accuracy (%, 20–100) on CIFAR10 as a function of value precision (in bits, 5–30) for Resnet18, PreActResnet18, SeNet18 and MobileNetV2. Left: weights only. Right: weights and activations.]
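    The two settings in the plots (quantizing the weights only vs. the weights and the activations) can be emulated with a simple fake-quantization helper; a hedged sketch under the same symmetric scheme as above, not the exact quantizer used to produce these curves:

```python
import torch

def fake_quantize(t: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Round a tensor onto symmetric uniform levels while keeping float storage."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-8) / qmax
    return torch.round(t / scale).clamp(-qmax, qmax) * scale

w, x = torch.randn(5, 3), torch.randn(1, 3)

y_weights_only = x @ fake_quantize(w, 4).t()                    # quantize weights only
y_weights_acts = fake_quantize(x, 4) @ fake_quantize(w, 4).t()  # quantize weights and activations
print(y_weights_only, y_weights_acts)
```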

  • Binary Connect (BC) 12

    BC straight-through principle:

    1 Map the weight values to their signs (1 or −1).

    2 Compute the feed-forward and back-propagation passes with the binarized weights.

    3 Update the initial floating-point values.

    Binary Weight Network (BWN) outperforms BC by adding a scaling factor (−α or α); see the sketch below.

    Comparison of accuracy between baseline, BC and BWN on CIFAR10:

                    Resnet34   Densenet121   MobilenetV2
    Full-precision    95.0%       95.0%         93.8%
    BC                93.6%       94.5%         93.0%
    BWN               94.3%       94.7%         93.4%
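    A minimal sketch of one BC/BWN training step with a straight-through estimator, assuming a toy linear model and plain SGD (a simplified illustration of the three steps above, not the authors' exact implementation; α is taken here as the mean absolute weight):

```python
import torch

def binarize(w: torch.Tensor, use_scaling: bool = False) -> torch.Tensor:
    """BC: signs of the weights; BWN: signs scaled by alpha = mean(|w|)."""
    wb = torch.sign(w)
    if use_scaling:
        wb = wb * w.abs().mean()
    # Straight-through estimator: the forward pass uses wb,
    # the backward pass sends the gradient to the float weights w.
    return w + (wb - w).detach()

w = torch.randn(5, 3, requires_grad=True)        # latent full-precision weights
x, target = torch.randn(8, 3), torch.randn(8, 5)
lr = 0.1

y = x @ binarize(w, use_scaling=True).t()        # steps 1-2: forward/backward with binarized weights
loss = ((y - target) ** 2).mean()
loss.backward()
with torch.no_grad():
    w -= lr * w.grad                             # step 3: update the floating-point values
    w.grad.zero_()
```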

  • Compression Methods 13

    Baseline (all weights kept):

    Layer 1 (weights w1_1 … w1_25):
    w1_1   w1_2   w1_3   w1_4   w1_5
    w1_6   w1_7   w1_8   w1_9   w1_10
    w1_11  w1_12  w1_13  w1_14  w1_15
    w1_16  w1_17  w1_18  w1_19  w1_20
    w1_21  w1_22  w1_23  w1_24  w1_25

    Layer 2 (weights w2_1 … w2_25):
    w2_1   w2_2   w2_3   w2_4   w2_5
    w2_6   w2_7   w2_8   w2_9   w2_10
    w2_11  w2_12  w2_13  w2_14  w2_15
    w2_16  w2_17  w2_18  w2_19  w2_20
    w2_21  w2_22  w2_23  w2_24  w2_25

  • Compression Methods 13

    Pruning: non-structured on Layer 1 (individual weights set to zero), structured on Layer 2 (entire rows set to zero).

    Layer 1:
    0      w1_2   0      w1_4   w1_5
    0      0      w1_8   w1_9   w1_10
    w1_11  0      w1_13  w1_14  0
    w1_16  w1_17  0      w1_19  0
    w1_21  w1_22  0      w1_24  w1_25

    Layer 2:
    0      0      0      0      0
    w2_6   w2_7   w2_8   w2_9   w2_10
    w2_11  w2_12  w2_13  w2_14  w2_15
    0      0      0      0      0
    w2_21  w2_22  w2_23  w2_24  w2_25

  • Pruning 14

    Evaluate the importance of neurons and eliminate the least important ones to reduce neural network size.

    Non-structured pruning: eliminate neurons independently; only exploitable at very high levels of sparsity.

    Structured pruning: eliminate kernels, filters or even layers; exploitable even at low levels of sparsity.
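    As an illustration, a minimal magnitude-based sketch of the two flavours (the L1-norm importance criterion and the layer sizes are assumptions for the example, not necessarily the criteria used in the methods compared later):

```python
import torch
import torch.nn as nn

def unstructured_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude individual weights."""
    k = int(sparsity * weight.numel())
    threshold = weight.abs().flatten().kthvalue(k).values if k > 0 else -1.0
    return weight * (weight.abs() > threshold)

def structured_prune(conv: nn.Conv2d, sparsity: float) -> nn.Conv2d:
    """Zero out entire output filters with the smallest L1 norm."""
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # one norm per filter
    k = int(sparsity * norms.numel())
    if k > 0:
        drop = norms.topk(k, largest=False).indices
        with torch.no_grad():
            conv.weight[drop] = 0.0
    return conv

w = torch.randn(8, 8)
print((unstructured_prune(w, 0.5) == 0).float().mean())   # about half the weights are zeroed

conv = nn.Conv2d(16, 32, kernel_size=3)
structured_prune(conv, 0.25)                              # 8 of the 32 filters are zeroed
```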

  • Structured pruning and shift layers 15

    [Diagram: a standard convolutional layer (kernel of size S over C channels, depth L) mapping input X to output Y, compared with its shift-based counterpart, in which X is first shifted and then combined by a reduced convolution.]

    Shift Attention Layer (SAL):
    Simplified operations,
    Reduced number of parameters,
    Fully exploitable technique.
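    A minimal sketch of the generic shift + pointwise-convolution pattern such layers build on; the fixed cyclic choice of shifts is an assumption for illustration, and this is not the authors' SAL itself, only the underlying shift pattern:

```python
import torch
import torch.nn as nn

class ShiftLayer(nn.Module):
    """Shift each input channel spatially by a fixed offset,
    then mix channels with a 1x1 (pointwise) convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # One fixed (dy, dx) shift per input channel, chosen cyclically here.
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
        self.shifts = [offsets[c % len(offsets)] for c in range(in_ch)]
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        shifted = [torch.roll(x[:, c:c + 1], shifts=s, dims=(2, 3))
                   for c, s in enumerate(self.shifts)]
        return self.pointwise(torch.cat(shifted, dim=1))

x = torch.randn(1, 16, 32, 32)
print(ShiftLayer(16, 32)(x).shape)   # torch.Size([1, 32, 32, 32])
```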

  • Experimental Results 16

    Comparison of accuracy, number of parameters (NP) and number of floating point operations (FLOPs) using ResNet architectures.

                    CIFAR10                        CIFAR100
    Method      Accuracy  NP (M)  MFLOPs      Accuracy  NP (M)  MFLOPs
    Pruned-B    93.06%    0.73    91          73.6%     7.83    616
    NISP        93.01%    0.49    71          −         −       −
    PCAS        93.58%    0.39    56          73.84%    4.02    475
    SAL         93.6%     0.36    42          77.6%     3.9     251

  • Quantized Shift Layers 17

    Shift Layers + BC or BWN:
    Shift layer: replace a convolution by a multiplication.
    BC/BWN: replace a multiplication by a low-cost multiplexer.
    Shift layer + BC/BWN: replace a convolution by a low-cost multiplexer.

    Comparison of accuracy and memory usage between the Resnet-20 baseline, SAL, SAL with BC and SAL with BWN on CIFAR10.

                 Accuracy (%)   Memory (Mb)
    Baseline     94.66          39.04
    SAL          95.52          31.36
    SAL + BC     93.20          6.87
    SAL + BWN    94.00          6.87

  • Comparison of Methods 18

    Evolution of accuracy when applying compression methods on different DNN architectures for the CIFAR10 dataset.

    [Plot: test set accuracy (%, 90–96) vs. relative memory footprint (0–1) for MobileNetV2, SqueezeNet, Resnet18, Resnet18+BC, Resnet18+BWN and Resnet18+SAL.]

  • Conclusion

  • Conclusion 19

    Compression Methods
    Different ways to reduce DNN size and complexity, and thus energy consumption.
    Compression methods are only applicable to the inference part, not to the learning part.

    Directions
    Which combinations of quantization methods are efficient?
    Can the training process decide the most efficient number of bits to quantize values?
    Can SAL perform well in domains other than classification?
    Can compression methods be reconsidered to reduce training complexity?
