[Slide 1]
Training Simplification and Model Simplification for Deep Learning:
A Minimal Effort Back Propagation Method
Xu Sun
Peking University
[Slide 2]
Research Background & Challenges
Introduction
Machine Learning for Natural Language Processing
Language Generation
Machine Learning
Language Understanding
[Slide 3]
Outline
Minimal Back Propagation
meProp
meSimp
Others
[Slide 5]
How to identify a taxi?
Motivation: about overfitting
Feature 1 (shape): essential feature
Feature 3 (color): non-essential feature
Feature 2 (wheel): essential feature
[Slide 6]
How to identify a taxi?
Motivation: about overfitting
Feature 3 is unhelpful and can even be harmful: the model may simply "memorize" the label based on many non-essential features, so the essential features can be insufficiently trained
[Slide 7]
The question: how to identify the essential features?
This is actually quite challenging for a learning system
We try to use the back propagation information to find those essential features (and the related neurons)
In deep learning, back propagation computes the "importance" of an input feature
Thus, a feature with a larger gradient magnitude in back propagation is more essential for this sample
Motivation: about overfitting
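As a toy illustration of this point (not the authors' code; the layer, loss, and variable names here are hypothetical), the backpropagated gradient of a linear layer can be used to rank input features by importance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear layer: y = W @ x, squared-error loss L = 0.5 * ||y - t||^2
n_in, n_out = 6, 3
W = rng.normal(size=(n_out, n_in))
x = rng.normal(size=n_in)
t = rng.normal(size=n_out)

y = W @ x
grad_y = y - t            # dL/dy
grad_x = W.T @ grad_y     # dL/dx: backprop "importance" of each input feature

# Features with larger gradient magnitude are treated as more essential
order = np.argsort(-np.abs(grad_x))
print(order)  # feature indices, most essential first
```

The ranking `order` is what a top-k selection would act on: keep the first k indices, drop the rest.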
[Slide 8]
An illustration of meProp
Proposal: minimal effort backprop (meProp)
Minimal effort: only update the “essential” features/parameters
[Slide 9]
Benefit: only some neurons are related
(Figure: example inputs “Hello World”, “I Love Natural Language Processing”; each input activates only a few neurons.)
[Slide 10]
The computational cost of back propagation can also be reduced
Task (Adam)   Model          Ov. FP time   Ov. BP time   Ov. time
POS-Tag       LSTM (h=500)   7,334s        16,522s       23,856s
Parsing       MLP (h=500)    3,906s        9,114s        13,020s
MNIST         MLP (h=500)    69s           171s          240s
Benefit: computational cost
Forward propagation (FP) time vs. Backward propagation (BP) time
Back propagation is costly in deep learning
[Slide 12]
Back propagation is costly in deep learning
Motivation 2: computational cost
Forward propagation (FP) time vs. Backward propagation (BP) time
(Chart: share of FP time vs. BP time for POS-Tag, Parsing, and MNIST.)
Back propagation has the major computational cost
[Slide 13]
An illustration of meProp (ICML 2017)
Method
[Slide 16]
Method
Original backprop
[Slide 17]
Method
meProp: Top-k sparsified backprop
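A minimal NumPy sketch of the top-k sparsified backward pass (an illustration of the idea, not the released implementation; `k` and the layer shapes are arbitrary):

```python
import numpy as np

def meprop_backward(grad_y, x, W, k):
    """Backward pass of y = W @ x, keeping only the top-k entries of dL/dy.

    Only k rows of the weight gradient are nonzero, so only ~k/n of the
    weights receive an update at this backprop pass.
    """
    topk = np.argsort(-np.abs(grad_y))[:k]   # indices of largest gradients
    sparse = np.zeros_like(grad_y)
    sparse[topk] = grad_y[topk]              # zero out the rest
    grad_W = np.outer(sparse, x)             # dL/dW: nonzero only in k rows
    grad_x = W.T @ sparse                    # dL/dx, propagated further down
    return grad_W, grad_x

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))
x = rng.normal(size=5)
grad_y = rng.normal(size=8)
gW, gx = meprop_backward(grad_y, x, W, k=2)
print((np.abs(gW).sum(axis=1) > 0).sum())  # -> 2 rows updated
```

With k = 2 of 8 output units, only 2 of the 8 weight rows are touched, matching the 1–4% update ratio reported on the slides when k is small relative to the hidden size.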
[Slide 19]
Computation cost is proportional to n
Consider a basic computation unit
Back propagation
Top-k sparsifying leads to a linear reduction in the computation cost
Method
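In symbols (a sketch consistent with the slides; W, x, and the top-k selection are as described above): for a unit y = Wx with W an n×m matrix, the backward pass costs are

```latex
% Original backprop for y = Wx, with W \in \mathbb{R}^{n \times m}:
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^{\top},
\qquad
\frac{\partial L}{\partial x} = W^{\top}\, \frac{\partial L}{\partial y}
\qquad \text{(each costs } n \cdot m \text{ multiplications)}.

% meProp keeps only the top-k entries of dL/dy, so only k rows of dL/dW
% are nonzero and both products cost k \cdot m multiplications:
\text{cost ratio} = \frac{k \cdot m}{n \cdot m} = \frac{k}{n}.
```

This is the linear reduction the slide refers to: cost scales with k instead of n.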
[Slide 20]
An illustration of meProp on a mini-batch learning setting
Method
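For mini-batches, one plausible instantiation (the exact selection rule here is an assumption, not necessarily the authors') is a unified top-k: pick the k gradient dimensions with the largest summed magnitude over the batch, so every example shares one sparsity pattern:

```python
import numpy as np

def unified_topk(grad_Y, k):
    """grad_Y: (batch, n) output gradients. Keep the same k columns for all
    examples, chosen by total |gradient| across the batch."""
    score = np.abs(grad_Y).sum(axis=0)       # per-dimension activeness
    keep = np.argsort(-score)[:k]            # shared top-k indices
    sparse = np.zeros_like(grad_Y)
    sparse[:, keep] = grad_Y[:, keep]        # one sparsity pattern per batch
    return sparse, keep

rng = np.random.default_rng(1)
G = rng.normal(size=(4, 10))
S, keep = unified_topk(G, k=3)
print(sorted(keep.tolist()))
```

A shared pattern keeps the sparse matrix product dense over the selected columns, which is what makes the GPU speedup possible.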
[Slide 22]
Results based on various models (LSTM/MLP) and different optimizers (AdaGrad/Adam)
Experiments
[Slide 24]
Overall forward propagation time vs. overall back propagation time.
Experiments
Varying the number of hidden layers
[Slide 26]
Experiments
Acceleration results on MNIST using GPU.
Acceleration results on the matrix multiplication synthetic data using GPU.
Speedup on GPU (significant for heavy models, i.e., with large h)
[Slide 27]
Experiments
Accuracy of MLP vs. meProp’s backprop ratio.
Further analysis
Results of top-k meProp vs. random meProp.
Results of top-k meProp vs. the baseline when varying the hidden dimension h.
[Slide 30]
We propose a highly sparsified back propagation method
Update only 1–4% of the weights at each backprop pass
Does not result in a larger number of training iterations
The accuracy is actually improved rather than degraded
Conclusions
Sun et al. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. ICML 2017.
GitHub: https://github.com/lancopku/meProp
[Slide 31]
Outline
Minimal Back Propagation
meProp
meSimp
Others
[Slide 32]
What is the proper size for a network w.r.t. a task?
Motivation 1: neural network size
The size of a neural network has a huge impact and needs to be adjusted for each task: a smaller network is faster to train but can underfit; a bigger network is slower to train and can overfit
[Slide 33]
Motivation: rare features
(Figure: example inputs “Hello World”, “I Love Natural Language Processing”.)
Meaningless neurons given current data
[Slide 34]
The question: how to automatically determine the size?
This is very important for deep learning, where tuning the size is costly
We use the back propagation information to automatically determine the network size
Eliminating the redundant neurons
If a feature is essential for most of the samples, the related neuron plays a more important role in the layer
Motivation 1: neural network size
[Slide 35]
An illustration of meSimp
Proposal: minimal effort simplification (meSimp)
Minimal effort: only keep the “essential” features/parameters
(Figure: forward propagation (original); back propagation (meProp); activeness is collected from multiple examples; model simplification eliminates the inactive neurons.)
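The collect-then-prune loop might be sketched like this (illustrative only; the activeness statistic and the pruning threshold are assumptions, not the paper's exact settings):

```python
import numpy as np

def collect_activeness(grads, k):
    """grads: list of (n,) gradient vectors, one per example.
    A neuron is 'active' for an example if it lands in that example's top-k."""
    n = grads[0].shape[0]
    counts = np.zeros(n)
    for g in grads:
        counts[np.argsort(-np.abs(g))[:k]] += 1
    return counts / len(grads)   # fraction of examples activating each neuron

def prune(W_in, W_out, activeness, threshold=0.1):
    """Eliminate hidden neurons that are rarely active, shrinking both the
    incoming and the outgoing weight matrices."""
    keep = np.where(activeness >= threshold)[0]
    return W_in[keep, :], W_out[:, keep], keep

rng = np.random.default_rng(0)
grads = [rng.normal(size=8) for _ in range(100)]
act = collect_activeness(grads, k=2)
W_in, W_out = rng.normal(size=(8, 4)), rng.normal(size=(3, 8))
W_in2, W_out2, keep = prune(W_in, W_out, act, threshold=0.1)
print(W_in2.shape[0] == W_out2.shape[1] == len(keep))  # -> True
```

Pruning a hidden neuron removes one row of the incoming matrix and one column of the outgoing matrix, which is why the forward-propagation cost shrinks with the network.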
[Slide 36]
The computational cost of forward propagation can also be reduced
Task (Adam)   Model          Ov. FP time   Ov. BP time   Ov. time
POS-Tag       LSTM (h=500)   7,334s        16,522s       23,856s
Parsing       MLP (h=500)    3,906s        9,114s        13,020s
MNIST         MLP (h=500)    69s           171s          240s
Motivation 2: computational cost
Forward propagation (FP) time vs. Backward propagation (BP) time
When a trained model is deployed, only forward propagation is executed
[Slide 38]
Method
An illustration of meSimp (same figure as in the meSimp proposal above)
[Slide 39]
Computation cost is proportional to n
Consider a basic computation unit
Forward propagation
Method
Eliminating the inactive neurons leads to a linear reduction in the computation cost
Method
[Slide 41]
Results based on various models (LSTM/MLP)
Experiments
The network size can be reduced by about 9x on two NLP tasks, and the computational cost of forward propagation shrinks accordingly, indicating that forward propagation can be substantially accelerated.
Model performance can actually be improved, and the appropriate dimension for a task is determined automatically.
[Slide 44]
Experiments
Automatically determine the appropriate dimension for each layer in the neural network.
Parsing: better than a baseline model of the same dimension
[Slide 46]
Experiments
Further analysis: redundant neurons are fitting the noise
A neural network needs an appropriate size; meSimp determines the size automatically via minimal effort back propagation
We test this claim using meAct: the neurons inactive w.r.t. each example are deactivated after 10 epochs
After deactivation, accuracy rises sharply and the weight updates drop suddenly
[Slide 47]
We propose a model simplification method based on the activeness of neurons
The size of the neural network can be reduced up to 9x
The accuracy of the simplified model is actually improved
Conclusions
Sun et al. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. ICML 2017.
Sun et al. Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method. arXiv 2017.
GitHub: https://github.com/lancopku/meProp
[Slide 48]
Outline
Minimal Back Propagation
meProp
meSimp
Others
[Slide 49]
How to model the complex structural dependencies in natural language?
Designing models of higher structural complexity
Typically, employing complex output structure
e.g. third-order tags
StructReg: Motivation
[Slide 50]
StructReg: Motivation
More powerful → higher complexity → less stable
[Slide 51]
However, models of higher structural complexity can actually hurt accuracy
What is the relation between structural complexity and generalization?
Theoretical analysis
→ Structure regularization decoding methods
StructReg: Motivation
Test F1         Chunking        Eng-NER
BLSTM-1st-tag   93.97           87.65
BLSTM-2nd-tag   93.24 (-0.73)   87.59 (-0.06)
BLSTM-3rd-tag   92.50 (-1.47)   87.16 (-0.49)
[Slide 52]
StructReg: Theoretical Analysis
(Bound: generalization risk ≤ empirical risk + overfit-bound)
[Slide 53]
StructReg: Theoretical Analysis
Conclusions from our analysis:
1. Complex structure → low empirical risk & high overfitting risk
2. Simple structure → high empirical risk & low overfitting risk
3. Need a balanced complexity of structures
[Slide 54]
Proposal: Structure Regularization Decoding
Using two models of different structure complexity
Using the simple structure model to regularize the complex structure model
Advantage
Reduce the structural overfitting bound
Maintain the low empirical risk of the complex structure model
StructReg: Method
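One way to picture this at decoding time (a toy sketch; the interpolation rule and the weight `alpha` are assumptions, not the paper's exact decoder):

```python
import numpy as np

def sr_decode(scores_complex, scores_simple, alpha=0.5):
    """Combine per-candidate scores from a complex-structure model and a
    simple-structure model; the simple model regularizes the complex one
    at decoding time."""
    combined = alpha * scores_complex + (1 - alpha) * scores_simple
    return int(np.argmax(combined))

# Three candidate tag sequences: the complex model slightly prefers an
# overfitted candidate (index 2); the simple model pulls decoding back.
complex_scores = np.array([1.0, 0.9, 1.1])
simple_scores = np.array([1.0, 0.2, 0.1])
print(sr_decode(complex_scores, simple_scores, alpha=0.5))  # -> 0
```

The combined score keeps the complex model's low empirical risk while the simple model's preferences bound the structural overfitting.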
[Slide 55]
Apply to linear-chain structures
Neural model based on BLSTM
StructReg: Experiments
Complex structures often hurt accuracy
SR decoding improves accuracy
[Slide 57]
Apply to hierarchical structures
Joint empty category detection and dependency parsing
StructReg: Experiments
Complex structures often hurt accuracy
SR decoding improves accuracy
[Slide 59]
A structural complexity regularization framework
Reduce error rate by 36.4% for the third-order models
StructReg: Conclusions
X. Sun. Structure Regularization for Structured Prediction. NIPS 2014.
Sun et al. Complex Structure Leads to Overfitting: A Structure Regularization Decoding Method for Natural Language Processing. arXiv 2017.
[Slide 60]
Label Embedding for Soft Training of Neural Networks
Adaptively learn meaningful embedding for labels
Using the learned embedding to soften the training
Label Embedding
(Figure: SoftTrain vs. SoftTrain with LabelEmb)
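Conceptually, the hard one-hot target is softened using similarities between learned label embeddings; this toy sketch (the function names, the softmax softening rule, and the mixing weight are assumptions) blends the two:

```python
import numpy as np

def soft_targets(gold, label_emb, tau=1.0, mix=0.9):
    """Blend a one-hot target with a distribution over labels derived from
    the gold label's embedding similarity to all label embeddings."""
    sims = label_emb @ label_emb[gold]      # similarity of each label to gold
    soft = np.exp(sims / tau)
    soft /= soft.sum()                      # softmax over label similarities
    onehot = np.zeros(len(label_emb))
    onehot[gold] = 1.0
    return mix * onehot + (1 - mix) * soft  # softened training target

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))   # 5 labels, 8-dim learned embeddings
t = soft_targets(2, E)
print(round(float(t.sum()), 6), int(t.argmax()))  # -> 1.0 2
```

Training against `t` instead of the one-hot vector lets similar labels share probability mass, which is what "soft training" refers to here.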
[Slide 61]
Label Embedding for Soft Training of Neural Networks
Substantial improvements for both CV and NLP tasks
Label Embedding
Summarization
Image Recognition
[Slide 62]
Label Embedding for Soft Training of Neural Networks
Learned label embeddings are very meaningful: NLP
Label Embedding
Label similarity for Summarization
Label similarity for Machine Translation
[Slide 63]
Label Embedding for Soft Training of Neural Networks
Learned label embeddings are very meaningful: CV
Label Embedding
Sun et al. Label Embedding Network: Learning Label Representation for Soft Training of Deep Networks. arXiv 2017
[Slide 64]
Chinese social media text summarization
Existing models are based on the encoder-decoder framework
The generated summaries are literally similar to the source texts, but have low semantic relevance
SRB: Motivation
[Slide 65]
Semantic Relevance Based neural model (ACL 2017)
Encourage high semantic similarity between texts and summaries
The model consists of a decoder (above), an encoder (below), and a cosine similarity function
Text representation:
  Source text representation: V_t = h_N
  Generated summary representation: V_s = s_M − h_N
Semantic relevance: cos(V_s, V_t) = (V_t · V_s) / (‖V_t‖ ‖V_s‖)
Training objective: L = −p(y | x; θ) − λ cos(V_s, V_t)
SRB: Method
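The objective above can be sketched numerically (a minimal illustration; the vectors are random stand-ins for the representations h_N and s_M − h_N, and the likelihood term is a placeholder number):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def srb_loss(neg_log_likelihood, V_t, V_s, lam=0.5):
    """Likelihood term minus lambda times semantic relevance: maximizing
    cos(V_s, V_t) encourages the summary to stay semantically close."""
    return neg_log_likelihood - lam * cosine(V_s, V_t)

V_t = np.array([1.0, 0.0, 1.0])      # stand-in for source representation
V_good = np.array([0.9, 0.1, 1.1])   # summary close to the source
V_bad = np.array([-1.0, 0.5, -1.0])  # summary far from the source
print(srb_loss(2.0, V_t, V_good) < srb_loss(2.0, V_t, V_bad))  # -> True
```

A semantically closer summary yields a lower loss, so training pushes the decoder toward summaries with higher semantic relevance.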
[Slide 66]
Large Scale Chinese Short Text Summarization Dataset (LCSTS)
SRB: Experiments
Our models achieve substantial improvements in all ROUGE scores over baseline systems (W: word level; C: character level).
[Slide 67]
Example of SRB Generated Summary
SRB: Experiments
[Slide 68]
Proposal: Semantic Relevance Based model
Transform the text and the summary into dense vectors
Encourage high similarity between their representations
The generated summary has higher semantic relevance
SRB: Conclusions
Ma et al. Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization. ACL 2017
[Slide 69]
Thanks!
Also thanks to collaborators Xuancheng Ren, Shuming Ma, Binzhen Wei