
Page 1:

Training Simplification and Model Simplification for Deep Learning:

A Minimal Effort Back Propagation Method

Xu Sun

Peking University

[email protected]

Page 2:

Research Background & Challenges

Introduction

Machine Learning for Natural Language Processing

Language Generation

Machine Learning

Language Understanding

Page 3:

Outline

Minimal Back Propagation

meProp

meSimp

Others

Page 4:

Outline

Minimal Back Propagation

meProp

meSimp

Others

Page 5:

How to identify a taxi?

Motivation: about overfitting

Feature 1 (shape): essential feature

Feature 3 (color): non-essential feature

Feature 2 (wheel): essential feature

Page 6:

How to identify a taxi?

Motivation: about overfitting

Feature 3 is unhelpful and could even be harmful: the model may simply “memorize” the label based on many non-essential features, so the essential features can be insufficiently trained

Feature 1 (shape): essential feature

Feature 3 (color): non-essential feature

Feature 2 (wheel): essential feature

Page 7:

The question: how to identify the essential features?

This is actually quite challenging for a learning system

We try to use the back propagation information to find those essential features (and the related neurons)

In deep learning, back propagation computes the "importance" of an input feature

Thus, a feature whose back propagation signal has a higher magnitude is more essential for this sample

Motivation: about overfitting
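To make this concrete, here is a toy sketch (the model, sizes, and data below are made up for illustration) of reading feature importance from the back propagation signal: the gradient of the loss with respect to the input is computed, and the features are ranked by its magnitude.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy model and a single example; all names and sizes are illustrative.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
x = torch.randn(1, 8, requires_grad=True)   # one example with 8 input features
y = torch.tensor([1])                       # its label

loss = F.cross_entropy(model(x), y)
loss.backward()

# The back propagation signal w.r.t. the input: a larger magnitude means the
# feature matters more for this particular example.
importance = x.grad.abs().squeeze(0)
print(importance.topk(3).indices)           # the 3 "most essential" features
```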

Page 8:

An illustration of meProp

Proposal: minimal effort backprop (meProp)

Minimal effort: update only the “essential” features/parameters

Page 9:

Benefit: only some neurons are related

[Illustration with example words: “Hello World”, “I”, “Love”, “Natural”, “Language”, “Processing”]

Page 10:

The computational cost of back propagation can also be reduced

Task (Adam)    Model          Ov. FP time    Ov. BP time    Ov. time
POS-Tag        LSTM (h=500)   7,334s         16,522s        23,856s
Parsing        MLP (h=500)    3,906s         9,114s         13,020s
MNIST          MLP (h=500)    69s            171s           240s

Benefit: computational cost

Forward propagation (FP) time vs. Backward propagation (BP) time

Back propagation is costly in deep learning
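From the table above, back propagation accounts for roughly 70% of the overall training time: 16,522s of 23,856s (about 69%) for POS-Tag, 9,114s of 13,020s (about 70%) for Parsing, and 171s of 240s (about 71%) for MNIST.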

Page 11:

Back propagation is costly in deep learning

Motivation 2: computational cost

Forward propagation (FP) time vs. Backward propagation (BP) time

[Bar chart: share of overall FP time vs. overall BP time for POS-Tag, Parsing, and MNIST]

Page 12:

Back propagation is costly in deep learning

Motivation 2: computational cost

Forward propagation (FP) time vs. Backward propagation (BP) time

[Bar chart: share of overall FP time vs. overall BP time for POS-Tag, Parsing, and MNIST]

Back propagation has the major computational cost

Page 13:

An illustration of meProp (ICML 2017)

Method

Page 14:

An illustration of meProp (ICML 2017)

Method

Page 15:

An illustration of meProp (ICML 2017)

Method

Page 16:

Method

Original backprop

Page 17:

Method

meProp: Top-k sparsified backprop

Page 18:

Computation cost is proportional to n

Consider a basic computation unit

Back propagation

Method

Page 19:

Computation cost is proportional to n

Consider a basic computation unit

Back propagation

Top-k sparsifying leads to a linear reduction in the computation cost

Method
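A minimal PyTorch-style sketch of this top-k idea (an illustration only, not the authors' official lancopku/meProp code; the class name and the exact placement of the step are assumptions): the forward pass is unchanged, and in the backward pass only the k largest-magnitude components of the output gradient are kept per example, so the subsequent matrix products effectively involve k of the n dimensions.

```python
import torch

class TopKSparsify(torch.autograd.Function):
    """Identity in the forward pass; in the backward pass, keep only the
    top-k gradient components (by magnitude) per example and zero the rest."""

    @staticmethod
    def forward(ctx, x, k):
        ctx.k = k
        return x

    @staticmethod
    def backward(ctx, grad_output):
        _, idx = grad_output.abs().topk(ctx.k, dim=-1)
        sparse_grad = torch.zeros_like(grad_output)
        sparse_grad.scatter_(-1, idx, grad_output.gather(-1, idx))
        return sparse_grad, None   # no gradient w.r.t. k

# Usage sketch: apply after a linear transform inside a layer, e.g.
#   z = TopKSparsify.apply(linear(x), k)
# Note: this only zeroes most gradient entries; an actual speedup requires the
# later matrix products to exploit the sparsity (or a shared index set, below).
```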

Page 20:

An illustration of meProp on a mini-batch learning setting

Method

Page 21:

An illustration of meProp on a mini-batch learning setting

Method
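The per-example top-k above gives each example its own index set, which is awkward for batched matrix multiplication. One way to handle a mini-batch (an assumption about the spirit of the illustration, not a claim about the exact scheme on the slide) is to pick a single shared index set for the whole batch, so the backward pass reduces to ordinary dense products over k columns.

```python
import torch

def unified_topk_mask(grad_output: torch.Tensor, k: int) -> torch.Tensor:
    """Illustrative helper: choose one set of k column indices for the whole
    mini-batch, scored by the summed gradient magnitude over the batch."""
    scores = grad_output.abs().sum(dim=0)   # (hidden_dim,)
    _, idx = scores.topk(k)
    mask = torch.zeros_like(scores)
    mask[idx] = 1.0
    return mask                              # broadcasts over the batch dimension

# e.g. sparse_grad = grad_output * unified_topk_mask(grad_output, k)
```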

Page 22:

Results based on various models (LSTM/MLP) and different optimizers (AdaGrad/Adam)

Experiments

Page 23:

Results based on various models (LSTM/MLP) and different optimizers (AdaGrad/Adam)

Experiments

Page 24:

Overall forward propagation time vs. overall back propagation time.

Experiments

Page 25:

Overall forward propagation time vs. overall back propagation time.

Experiments

Varying the number of hidden layers

Page 26:

Experiments

Acceleration results on MNIST using GPU.

Acceleration results on the matrix multiplication synthetic data using GPU.

Speedup on GPU (significant for heavy models, i.e., with large h)

Page 27:

Experiments

Accuracy of MLP vs. meProp’s backprop ratio.

Further analysis

Page 28:

Experiments

Accuracy of MLP vs. meProp’s backprop ratio.

Results of top-k meProp vs. random meProp.

Further analysis

Page 29:

Experiments

Accuracy of MLP vs. meProp’s backprop ratio.

Results of top-k meProp vs. random meProp.

Results of top-k meProp vs. baseline with the hidden dimension h.

Further analysis

Page 30:

Propose a highly sparsified back propagation method:

Update only 1–4% of the weights at each backprop pass

Does not result in a larger number of training iterations

The accuracy is actually improved rather than degraded

Conclusions

Sun et al. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. ICML 2017.
GitHub: https://github.com/lancopku/meProp

Page 31:

Outline

Minimal Back Propagation

meProp

meSimp

Others

Page 32:

What is the proper size for a network w.r.t. a task?

Motivation 1: neural network size

The size of the neural network has a huge impact and needs to be adjusted for each task: a smaller network is faster to train but can underfit; a bigger network is slower to train and can overfit

Page 33:

Motivation: rare features

[Illustration with example words: “Hello World”, “I”, “Love”, “Natural”, “Language”, “Processing”]

Meaningless neurons given current data

Page 34:

The question: how to automatically determine the size?

This is very important for deep learning, where tuning the size is costly

We use the back propagation information to automatically determine the network size

Eliminating the redundant neurons

If a feature is essential for most of the samples, the related neuron plays a more important role in the layer

Motivation 1: neural network size

Page 35:

An illustration of meSimp

Proposal: minimal effort simplification (meSimp)

Minimal effort: keep only the “essential” features/parameters

[Diagram: forward propagation (original) → back propagation (meProp) → activeness collected from multiple examples → inactive neurons are eliminated (model simplification)]

Page 36:

The computational cost of forward propagation can also be reduced

Task (Adam)    Model          Ov. FP time    Ov. BP time    Ov. time
POS-Tag        LSTM (h=500)   7,334s         16,522s        23,856s
Parsing        MLP (h=500)    3,906s         9,114s         13,020s
MNIST          MLP (h=500)    69s            171s           240s

Motivation 2: computational cost

Forward propagation (FP) time vs. Backward propagation (BP) time

When a trained model is deployed, only forward propagation is executed

Page 37:

An illustration of meSimp

Method

Minimal effort: keep only the “essential” features/parameters

[Diagram: forward propagation (original) → back propagation (meProp) → activeness collected from multiple examples]

Page 38:

An illustration of meSimp

Method

Minimal effort: keep only the “essential” features/parameters

[Diagram: forward propagation (original) → back propagation (meProp) → activeness collected from multiple examples → inactive neurons are eliminated (model simplification)]

Page 39:

Computation cost is proportional to n

Consider a basic computation unit

Forward propagation

Method

Page 40:

Computation cost is proportional to n

Consider a basic computation unit

Forward propagation

Eliminating the inactive neurons leads to a linear reduction in the computation cost

Method
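A hedged sketch of the elimination step (the activeness statistic, the threshold, and the function names are illustrative assumptions, not the official implementation): count how often each hidden unit falls into the meProp top-k over recent examples, then shrink the surrounding linear layers to the units that are active often enough.

```python
import torch
import torch.nn as nn

def prune_inactive_units(fc_in: nn.Linear, fc_out: nn.Linear,
                         topk_counts: torch.Tensor, num_examples: int,
                         keep_ratio: float = 0.1):
    """topk_counts[i]: how many of the last num_examples examples had hidden
    unit i in their top-k backprop set. Units selected too rarely are treated
    as inactive and removed. The 0.1 threshold is illustrative."""
    keep = (topk_counts.float() / num_examples) >= keep_ratio
    idx = keep.nonzero(as_tuple=True)[0]

    new_in = nn.Linear(fc_in.in_features, idx.numel())
    new_out = nn.Linear(idx.numel(), fc_out.out_features)
    with torch.no_grad():
        new_in.weight.copy_(fc_in.weight[idx])       # rows producing kept units
        new_in.bias.copy_(fc_in.bias[idx])
        new_out.weight.copy_(fc_out.weight[:, idx])  # columns consuming kept units
        new_out.bias.copy_(fc_out.bias)
    return new_in, new_out
```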

Page 41:

Results based on various models (LSTM/MLP)

Experiments

Page 42:

Results based on various models (LSTM/MLP)

Experiments

The size of the network can be reduced by about 9x on two NLP tasks, and so is the computational cost of forward propagation, which indicates that forward propagation can be substantially accelerated

Page 43:

Results based on various models (LSTM/MLP)

Experiments

The performance of the model can be improved, and the appropriate dimension for a task is determined automatically

Page 44:

Experiments

Automatically determine the appropriate dimension for each layer in the neural network.

Page 45:

Experiments

Automatically determine the appropriate dimension for each layer in the neural network.

Parsing: Better than the model of the same dimension

Page 46:

Experiments

We propose to test the claim using meAct: the inactive neurons w.r.t. each example are deactivated after 10 epochs

A neural network needs an appropriate size; meSimp automatically determines the size via minimal effort back propagation

Further analysis: Redundant neurons are fitting to noise

Accuracy rises sharply; the weight update magnitude drops suddenly

Page 47:

Propose a model simplification method based on the activeness of the neurons

The size of the neural network can be reduced by up to 9x

The accuracy of the simplified model is actually improved

Conclusions

Sun et al. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. ICML 2017.
Sun et al. Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method. arXiv 2017.
GitHub: https://github.com/lancopku/meProp

Page 48:

Outline

Minimal Back Propagation

meProp

meSimp

Others

Page 49:

How to model the complex structural dependencies in natural language?

Designing models of higher structural complexity

Typically, employing complex output structure

e.g. third-order tags

StructReg: Motivation

Page 50:

StructReg: Motivation

More powerful → higher complexity → less stable

Page 51:

However, models of higher structural complexity can actually hurt accuracy

What is the relation between structural complexity and generalization?

Theoretical analysis

→ Structure regularization decoding methods

StructReg: Motivation

Test F1         Chunking        Eng-NER
BLSTM-1st-tag   93.97           87.65
BLSTM-2nd-tag   93.24 (-0.73)   87.59 (-0.06)
BLSTM-3rd-tag   92.50 (-1.47)   87.16 (-0.49)

Page 52:

StructReg: Theoretical Analysis

[Figure: risk decomposition into empirical risk and overfit bound]

Page 53:

StructReg: Theoretical Analysis

Conclusions from our analysis:
1. Complex structure → low empirical risk & high overfitting risk
2. Simple structure → high empirical risk & low overfitting risk
3. A balanced structural complexity is needed

[Figure: risk decomposition into empirical risk and overfit bound]

Page 54:

Proposal: Structure Regularization Decoding

Using two models of different structure complexity

Using the simple structure model to regularize the complex structure model

Advantage

Reduce the structural overfitting bound

Maintain the low empirical risk of the complex structure model

StructReg: Method
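As a rough, simplified sketch only (the actual decoding procedure is described in the cited papers; the score interpolation below is my own illustration, not the authors' formulation): one way to let the simple-structure model regularize the complex-structure model at decoding time is to mix their per-decision scores before running the usual search.

```python
from typing import Dict, List

def sr_interpolate(simple_scores: List[Dict[str, float]],
                   complex_scores: List[Dict[str, float]],
                   alpha: float = 0.5) -> List[Dict[str, float]]:
    """Hypothetical illustration: per-position label scores from a
    simple-structure model (e.g., first-order tags) and a complex-structure
    model (e.g., third-order tags) are interpolated; the mixed scores are then
    handed to the normal decoder."""
    mixed = []
    for s, c in zip(simple_scores, complex_scores):
        labels = set(s) | set(c)
        mixed.append({y: alpha * s.get(y, 0.0) + (1 - alpha) * c.get(y, 0.0)
                      for y in labels})
    return mixed
```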

Page 55:

Apply to linear-chain structures

Neural model based on BLSTM

StructReg: Experiments

Complex structures often hurt accuracy

Page 56:

Apply to linear-chain structures

Neural model based on BLSTM

StructReg: Experiments

SR decoding improves accuracy

Page 57:

Apply to hierarchical structures

Joint empty category detection and dependency parsing

StructReg: Experiments

Complex structures often hurt accuracy

Page 58:

Apply to hierarchical structures

Joint empty category detection and dependency parsing

StructReg: Experiments

Complex structures often hurt accuracy

SR decoding improves accuracy

Page 59:

A structural complexity regularization framework

Reduce error rate by 36.4% for the third-order models

StructReg: Conclusions

X. Sun. Structure Regularization for Structured Prediction. NIPS 2014.
Sun et al. Complex Structure Leads to Overfitting: A Structure Regularization Decoding Method for Natural Language Processing. arXiv 2017.

Page 60:

Label Embedding for Soft Training of Neural Networks

Adaptively learn meaningful embedding for labels

Using the learned embedding to soften the training

Label Embedding

[Diagram: SoftTrain vs. SoftTrain with LabelEmb]
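A generic sketch of softening the training targets with learned label embeddings (the similarity-based soft targets and the KL term are assumptions made for illustration and may differ from the paper's exact formulation).

```python
import torch
import torch.nn.functional as F

def soft_training_loss(logits, labels, label_emb, tau=1.0, beta=0.5):
    """logits: (batch, num_labels); labels: (batch,);
    label_emb: learnable (num_labels, emb_dim) label embedding matrix.
    Soft targets come from similarities between the gold label's embedding and
    all label embeddings; tau and beta are illustrative hyper-parameters."""
    hard = F.cross_entropy(logits, labels)
    sims = label_emb[labels] @ label_emb.t()            # (batch, num_labels)
    soft_targets = F.softmax(sims / tau, dim=-1)
    soft = F.kl_div(F.log_softmax(logits, dim=-1), soft_targets,
                    reduction='batchmean')
    return hard + beta * soft
```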

Page 61:

Label Embedding for Soft Training of Neural Networks

Substantial improvements for both CV and NLP tasks

Label Embedding

[Result panels: summarization, image recognition]

Page 62:

Label Embedding for Soft Training of Neural Networks. The learned label embeddings are very meaningful: NLP

Label Embedding

Label similarity for Summarization

Label similarity for Machine Translation

Page 63:

Label Embedding for Soft Training of Neural Networks. The learned label embeddings are very meaningful: CV

Label Embedding

Sun et al. Label Embedding Network: Learning Label Representation for Soft Training of Deep Networks. arXiv 2017

Page 64:

Chinese social media text summarization: existing models are based on the encoder-decoder framework

The generated summaries are literally similar to the source texts

But they have low semantic relevance

SRB: Motivation

Page 65:

Semantic Relevance Based neural model (ACL 2017)

Encourages high semantic similarity between source texts and generated summaries

It consists of a decoder, an encoder, and a cosine similarity function

Text Representation

Source text representation: V_t = h_N

Generated summary representation: V_s = s_M − h_N

Semantic relevance: cos(V_s, V_t) = (V_t · V_s) / (‖V_t‖ ‖V_s‖)

Training objective: L = −p(y | x; θ) − λ · cos(V_s, V_t)

SRB: Method
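A minimal sketch of the objective above (tensor names, the use of the log-likelihood, and the value of λ are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def srb_loss(log_likelihood, v_t, v_s, lam=0.5):
    """log_likelihood: log p(y | x; θ) per example, shape (batch,);
    v_t: source text representations, shape (batch, d);
    v_s: generated summary representations, shape (batch, d)."""
    relevance = F.cosine_similarity(v_s, v_t, dim=-1)   # cos(V_s, V_t)
    return (-log_likelihood - lam * relevance).mean()
```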

Page 66:

Large Scale Chinese Short Text Summarization Dataset (LCSTS)

SRB: Experiments

Our models achieve substantial improvements in all ROUGE scores over the baseline systems (W: word level; C: character level)

Page 67:

Example of SRB Generated Summary

SRB: Experiments

Page 68:

Proposal: Semantic Relevance Based model

Transform the text and the summary into dense vectors

Encourage high similarity between their representations

The generated summary has higher semantic relevance

SRB: Conclusions

Ma et al. Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization. ACL 2017

Page 69:

Thanks!

Also thanks to collaborators Xuancheng Ren, Shuming Ma, Binzhen Wei