
Page 1:

Training Simplification and Model Simplification for Deep Learning:

A Minimal Effort Back Propagation Method

Xu Sun

Peking University

[email protected]

Page 2:

Research Background & Challenges

Introduction

Machine Learning for Natural Language Processing

Language Generation

Machine Learning

Language Understanding

Page 3:

Outline

Minimal Back Propagation

meProp

meSimp

Others

Page 4:

Outline

Minimal Back Propagation

meProp

meSimp

Others

Page 5:

How to identify a taxi?

Motivation: about overfitting

Feature 1 (shape): essential feature

Feature 3 (color): non-essential feature

Feature 2 (wheel): essential feature

Page 6:

How to identify a taxi?

Motivation: about overfitting

Feature 3 is unhelpful and could even be harmful: the model may simply “memorize” the label based on many non-essential features, so the essential features can be insufficiently trained

Feature 1 (shape): essential feature

Feature 3 (color): non-essential feature

Feature 2 (wheel): essential feature

Page 7:

The question: how to identify the essential features?

This is actually quite challenging for a learning system

We try to use the back propagation information to find those essential features (and the related neurons)

In deep learning, back propagation computes the "importance" of an input feature

Thus, a feature whose back propagation signal has a higher magnitude is more essential for this sample

Motivation: about overfitting
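To make this concrete, here is a toy sketch (the model, sizes, and data below are made up for illustration) of reading feature importance from the back propagation signal: the gradient of the loss with respect to the input is computed, and the features are ranked by its magnitude.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy model and a single example; all names and sizes are illustrative.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
x = torch.randn(1, 8, requires_grad=True)   # one example with 8 input features
y = torch.tensor([1])                       # its label

loss = F.cross_entropy(model(x), y)
loss.backward()

# The back propagation signal w.r.t. the input: a larger magnitude means the
# feature matters more for this particular example.
importance = x.grad.abs().squeeze(0)
print(importance.topk(3).indices)           # the 3 "most essential" features
```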

Page 8:

An illustration of meProp

Proposal: minimal effort backprop (meProp)

Minimal effort: update only the “essential” features/parameters

Page 9:

Benefit: only some neurons are related

[Illustration with example words: “Hello World”, “I”, “Love”, “Natural”, “Language”, “Processing”]

Page 10:

The computational cost of back propagation can also be reduced

Task (Adam)    Model          Ov. FP time    Ov. BP time    Ov. time
POS-Tag        LSTM (h=500)   7,334s         16,522s        23,856s
Parsing        MLP (h=500)    3,906s         9,114s         13,020s
MNIST          MLP (h=500)    69s            171s           240s

Benefit: computational cost

Forward propagation (FP) time vs. Backward propagation (BP) time

Back propagation is costly in deep learning
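From the table above, back propagation accounts for roughly 70% of the overall training time: 16,522s of 23,856s (about 69%) for POS-Tag, 9,114s of 13,020s (about 70%) for Parsing, and 171s of 240s (about 71%) for MNIST.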

Page 11:

Back propagation is costly in deep learning

Motivation 2: computational cost

Forward propagation (FP) time vs. Backward propagation (BP) time

[Bar chart: share of overall FP time vs. overall BP time for POS-Tag, Parsing, and MNIST]

Page 12:

Back propagation is costly in deep learning

Motivation 2: computational cost

Forward propagation (FP) time vs. Backward propagation (BP) time

[Bar chart: share of overall FP time vs. overall BP time for POS-Tag, Parsing, and MNIST]

Back propagation has the major computational cost

Page 13:

An illustration of meProp (ICML 2017)

Method

Page 14:

An illustration of meProp (ICML 2017)

Method

Page 15:

An illustration of meProp (ICML 2017)

Method

Page 16:

Method

Original backprop

Page 17:

Method

meProp: Top-k sparsified backprop

Page 18:

Computation cost is proportional to n

Consider a basic computation unit

Back propagation

Method

Page 19:

Computation cost is proportional to n

Consider a basic computation unit

Back propagation

Top-k sparsifying leads to a linear reduction in the computation cost

Method
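A minimal PyTorch-style sketch of this top-k idea (an illustration only, not the authors' official lancopku/meProp code; the class name and the exact placement of the step are assumptions): the forward pass is unchanged, and in the backward pass only the k largest-magnitude components of the output gradient are kept per example, so the subsequent matrix products effectively involve k of the n dimensions.

```python
import torch

class TopKSparsify(torch.autograd.Function):
    """Identity in the forward pass; in the backward pass, keep only the
    top-k gradient components (by magnitude) per example and zero the rest."""

    @staticmethod
    def forward(ctx, x, k):
        ctx.k = k
        return x

    @staticmethod
    def backward(ctx, grad_output):
        _, idx = grad_output.abs().topk(ctx.k, dim=-1)
        sparse_grad = torch.zeros_like(grad_output)
        sparse_grad.scatter_(-1, idx, grad_output.gather(-1, idx))
        return sparse_grad, None   # no gradient w.r.t. k

# Usage sketch: apply after a linear transform inside a layer, e.g.
#   z = TopKSparsify.apply(linear(x), k)
# Note: this only zeroes most gradient entries; an actual speedup requires the
# later matrix products to exploit the sparsity (or a shared index set, below).
```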

Page 20:

An illustration of meProp on a mini-batch learning setting

Method

Page 21:

An illustration of meProp on a mini-batch learning setting

Method
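The per-example top-k above gives each example its own index set, which is awkward for batched matrix multiplication. One way to handle a mini-batch (an assumption about the spirit of the illustration, not a claim about the exact scheme on the slide) is to pick a single shared index set for the whole batch, so the backward pass reduces to ordinary dense products over k columns.

```python
import torch

def unified_topk_mask(grad_output: torch.Tensor, k: int) -> torch.Tensor:
    """Illustrative helper: choose one set of k column indices for the whole
    mini-batch, scored by the summed gradient magnitude over the batch."""
    scores = grad_output.abs().sum(dim=0)   # (hidden_dim,)
    _, idx = scores.topk(k)
    mask = torch.zeros_like(scores)
    mask[idx] = 1.0
    return mask                              # broadcasts over the batch dimension

# e.g. sparse_grad = grad_output * unified_topk_mask(grad_output, k)
```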

Page 22:

Results based on various models (LSTM/MLP) and different optimizers (AdaGrad/Adam)

Experiments

Page 23:

Results based on various models (LSTM/MLP) and different optimizers (AdaGrad/Adam)

Experiments

Page 24:

Overall forward propagation time vs. overall back propagation time.

Experiments

Page 25:

Overall forward propagation time vs. overall back propagation time.

Experiments

Varying the number of hidden layers

Page 26:

Experiments

Acceleration results on MNIST using GPU.

Acceleration results on the matrix multiplication synthetic data using GPU.

Speedup on GPU (significant for heavy models, i.e., with large h)

Page 27:

Experiments

Accuracy of MLP vs. meProp’s backprop ratio.

Further analysis

Page 28:

Experiments

Accuracy of MLP vs. meProp’s backprop ratio.

Results of top-k meProp vs. random meProp.

Further analysis

Page 29:

Experiments

Accuracy of MLP vs. meProp’s backprop ratio.

Results of top-k meProp vs. random meProp.

Results of top-k meProp vs. baseline with the hidden dimension h.

Further analysis

Page 30:

Propose a highly sparsified back propagation method:

Update only 1–4% of the weights at each backprop pass

Does not result in a larger number of training iterations

The accuracy is actually improved rather than degraded

Conclusions

Sun et al. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. ICML 2017.
GitHub: https://github.com/lancopku/meProp

Page 31:

Outline

Minimal Back Propagation

meProp

meSimp

Others

Page 32:

What is the proper size for a network w.r.t. a task?

Motivation 1: neural network size

The size of the neural network has a huge impact and needs to be adjusted for each task: a smaller network is faster to train but can underfit; a bigger network is slower to train and can overfit

Page 33:

Motivation: rare features

[Illustration with example words: “Hello World”, “I”, “Love”, “Natural”, “Language”, “Processing”]

Meaningless neurons given current data

Page 34:

The question: how to automatically determine the size?

This is very important for deep learning, where tuning the size is costly

We use the back propagation information to automatically determine the network size

Eliminating the redundant neurons

If a feature is essential for most of the samples, the related neuron plays a more important role in the layer

Motivation 1: neural network size

Page 35:

An illustration of meSimp

Proposal: minimal effort simplification (meSimp)

Minimal effort: keep only the “essential” features/parameters

[Diagram: forward propagation (original) → back propagation (meProp) → activeness collected from multiple examples → inactive neurons are eliminated (model simplification)]

Page 36:

The computational cost of forward propagation can also be reduced

Task (Adam)    Model          Ov. FP time    Ov. BP time    Ov. time
POS-Tag        LSTM (h=500)   7,334s         16,522s        23,856s
Parsing        MLP (h=500)    3,906s         9,114s         13,020s
MNIST          MLP (h=500)    69s            171s           240s

Motivation 2: computational cost

Forward propagation (FP) time vs. Backward propagation (BP) time

When a trained model is deployed, only forward propagation is executed

Page 37:

An illustration of meSimp

Method

Minimal effort: keep only the “essential” features/parameters

[Diagram: forward propagation (original) → back propagation (meProp) → activeness collected from multiple examples]

Page 38:

An illustration of meSimp

Method

Minimal effort: keep only the “essential” features/parameters

[Diagram: forward propagation (original) → back propagation (meProp) → activeness collected from multiple examples → inactive neurons are eliminated (model simplification)]

Page 39:

Computation cost is proportional to n

Consider a basic computation unit

Forward propagation

Method

Page 40:

Computation cost is proportional to n

Consider a basic computation unit

Forward propagation

Eliminating the inactive neurons leads to a linear reduction in the computation cost

Method
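A hedged sketch of the elimination step (the activeness statistic, the threshold, and the function names are illustrative assumptions, not the official implementation): count how often each hidden unit falls into the meProp top-k over recent examples, then shrink the surrounding linear layers to the units that are active often enough.

```python
import torch
import torch.nn as nn

def prune_inactive_units(fc_in: nn.Linear, fc_out: nn.Linear,
                         topk_counts: torch.Tensor, num_examples: int,
                         keep_ratio: float = 0.1):
    """topk_counts[i]: how many of the last num_examples examples had hidden
    unit i in their top-k backprop set. Units selected too rarely are treated
    as inactive and removed. The 0.1 threshold is illustrative."""
    keep = (topk_counts.float() / num_examples) >= keep_ratio
    idx = keep.nonzero(as_tuple=True)[0]

    new_in = nn.Linear(fc_in.in_features, idx.numel())
    new_out = nn.Linear(idx.numel(), fc_out.out_features)
    with torch.no_grad():
        new_in.weight.copy_(fc_in.weight[idx])       # rows producing kept units
        new_in.bias.copy_(fc_in.bias[idx])
        new_out.weight.copy_(fc_out.weight[:, idx])  # columns consuming kept units
        new_out.bias.copy_(fc_out.bias)
    return new_in, new_out
```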

Page 41:

Results based on various models (LSTM/MLP)

Experiments

Page 42:

Results based on various models (LSTM/MLP)

Experiments

The size of the network can be reduced by about 9x on two NLP tasks, and so is the computational cost of forward propagation, which indicates that forward propagation can be substantially accelerated

Page 43:

Results based on various models (LSTM/MLP)

Experiments

The performance of the model can be improved, and the appropriate dimension for a task is determined automatically

Page 44:

Experiments

Automatically determine the appropriate dimension for each layer in the neural network.

Page 45:

Experiments

Automatically determine the appropriate dimension for each layer in the neural network.

Parsing: Better than the model of the same dimension

Page 46:

Experiments

We propose to test the claim using meAct: the inactive neurons w.r.t. each example are deactivated after 10 epochs

A neural network needs an appropriate size; meSimp automatically determines the size via minimal effort back propagation

Further analysis: Redundant neurons are fitting to noise

Accuracy rises sharply; the weight update magnitude drops suddenly

Page 47:

Propose a model simplification method based on the activeness of the neurons

The size of the neural network can be reduced by up to 9x

The accuracy of the simplified model is actually improved

Conclusions

Sun et al. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. ICML 2017.
Sun et al. Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method. arXiv 2017.
GitHub: https://github.com/lancopku/meProp

Page 48:

Outline

Minimal Back Propagation

meProp

meSimp

Others

Page 49:

How to model the complex structural dependencies in natural language?

Designing models of higher structural complexity

Typically, employing complex output structure

e.g. third-order tags

StructReg: Motivation

Page 50:

StructReg: Motivation

More powerful → higher complexity → less stable

Page 51:

However, models of higher structural complexity can actually hurt accuracy

What is the relation between structural complexity and generalization?

Theoretical analysis

→ Structure regularization decoding methods

StructReg: Motivation

Test F1         Chunking        Eng-NER
BLSTM-1st-tag   93.97           87.65
BLSTM-2nd-tag   93.24 (-0.73)   87.59 (-0.06)
BLSTM-3rd-tag   92.50 (-1.47)   87.16 (-0.49)

Page 52:

StructReg: Theoretical Analysis

[Figure: risk decomposition into empirical risk and overfit bound]

Page 53:

StructReg: Theoretical Analysis

Conclusions from our analysis:
1. Complex structure → low empirical risk & high overfitting risk
2. Simple structure → high empirical risk & low overfitting risk
3. A balanced structural complexity is needed

[Figure: risk decomposition into empirical risk and overfit bound]

Page 54:

Proposal: Structure Regularization Decoding

Using two models of different structure complexity

Using the simple structure model to regularize the complex structure model

Advantage

Reduce the structural overfitting bound

Maintain the low empirical risk of the complex structure model

StructReg: Method
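As a rough, simplified sketch only (the actual decoding procedure is described in the cited papers; the score interpolation below is my own illustration, not the authors' formulation): one way to let the simple-structure model regularize the complex-structure model at decoding time is to mix their per-decision scores before running the usual search.

```python
from typing import Dict, List

def sr_interpolate(simple_scores: List[Dict[str, float]],
                   complex_scores: List[Dict[str, float]],
                   alpha: float = 0.5) -> List[Dict[str, float]]:
    """Hypothetical illustration: per-position label scores from a
    simple-structure model (e.g., first-order tags) and a complex-structure
    model (e.g., third-order tags) are interpolated; the mixed scores are then
    handed to the normal decoder."""
    mixed = []
    for s, c in zip(simple_scores, complex_scores):
        labels = set(s) | set(c)
        mixed.append({y: alpha * s.get(y, 0.0) + (1 - alpha) * c.get(y, 0.0)
                      for y in labels})
    return mixed
```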

Page 55:

Apply to linear-chain structures

Neural model based on BLSTM

StructReg: Experiments

Complex structures often hurt accuracy

Page 56:

Apply to linear-chain structures

Neural model based on BLSTM

StructReg: Experiments

SR decoding improves accuracy

Page 57:

Apply to hierarchical structures

Joint empty category detection and dependency parsing

StructReg: Experiments

Complex structures often hurt accuracy

Page 58:

Apply to hierarchical structures

Joint empty category detection and dependency parsing

StructReg: Experiments

Complex structures often hurt accuracy

SR decoding improves accuracy

Page 59:

A structural complexity regularization framework

Reduce error rate by 36.4% for the third-order models

StructReg: Conclusions

X. Sun. Structure Regularization for Structured Prediction. NIPS 2014.
Sun et al. Complex Structure Leads to Overfitting: A Structure Regularization Decoding Method for Natural Language Processing. arXiv 2017.

Page 60:

Label Embedding for Soft Training of Neural Networks

Adaptively learn meaningful embedding for labels

Using the learned embedding to soften the training

Label Embedding

[Diagram: SoftTrain vs. SoftTrain with LabelEmb]
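A generic sketch of softening the training targets with learned label embeddings (the similarity-based soft targets and the KL term are assumptions made for illustration and may differ from the paper's exact formulation).

```python
import torch
import torch.nn.functional as F

def soft_training_loss(logits, labels, label_emb, tau=1.0, beta=0.5):
    """logits: (batch, num_labels); labels: (batch,);
    label_emb: learnable (num_labels, emb_dim) label embedding matrix.
    Soft targets come from similarities between the gold label's embedding and
    all label embeddings; tau and beta are illustrative hyper-parameters."""
    hard = F.cross_entropy(logits, labels)
    sims = label_emb[labels] @ label_emb.t()            # (batch, num_labels)
    soft_targets = F.softmax(sims / tau, dim=-1)
    soft = F.kl_div(F.log_softmax(logits, dim=-1), soft_targets,
                    reduction='batchmean')
    return hard + beta * soft
```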

Page 61:

Label Embedding for Soft Training of Neural Networks

Substantial improvements for both CV and NLP tasks

Label Embedding

[Result panels: summarization, image recognition]

Page 62:

Label Embedding for Soft Training of Neural Networks. The learned label embeddings are very meaningful: NLP

Label Embedding

Label similarity for Summarization

Label similarity for Machine Translation

Page 63:

Label Embedding for Soft Training of Neural Networks. The learned label embeddings are very meaningful: CV

Label Embedding

Sun et al. Label Embedding Network: Learning Label Representation for Soft Training of Deep Networks. arXiv 2017

Page 64:

Chinese social media text summarization: existing models are based on the encoder-decoder framework

The generated summaries are literally similar to the source texts

But they have low semantic relevance

SRB: Motivation

Page 65:

Semantic Relevance Based neural model (ACL 2017)

Encourages high semantic similarity between source texts and generated summaries

It consists of a decoder, an encoder, and a cosine similarity function

Text Representation

Source text representation: V_t = h_N

Generated summary representation: V_s = s_M − h_N

Semantic relevance: cos(V_s, V_t) = (V_t · V_s) / (‖V_t‖ ‖V_s‖)

Training objective: L = −p(y | x; θ) − λ · cos(V_s, V_t)

SRB: Method
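A minimal sketch of the objective above (tensor names, the use of the log-likelihood, and the value of λ are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def srb_loss(log_likelihood, v_t, v_s, lam=0.5):
    """log_likelihood: log p(y | x; θ) per example, shape (batch,);
    v_t: source text representations, shape (batch, d);
    v_s: generated summary representations, shape (batch, d)."""
    relevance = F.cosine_similarity(v_s, v_t, dim=-1)   # cos(V_s, V_t)
    return (-log_likelihood - lam * relevance).mean()
```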

Page 66:

Large Scale Chinese Short Text Summarization Dataset (LCSTS)

SRB: Experiments

Our models achieve substantial improvements in all ROUGE scores over the baseline systems (W: word level; C: character level)

Page 67:

Example of SRB Generated Summary

SRB: Experiments

Page 68:

Proposal: Semantic Relevance Based model

Transform the text and the summary into dense vectors

Encourage high similarity between their representations

The generated summary has higher semantic relevance

SRB: Conclusions

Ma et al. Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization. ACL 2017

Page 69:

Thanks!

Also thanks to collaborators Xuancheng Ren, Shuming Ma, Binzhen Wei