LSTM compression for language modeling
Artem Grachev
CS HSE, Samsung R&D Russia
April 27th, 2017
A. Grachev (CS HSE, SRR) LSTM compression April 27th, 2017 1 / 17
Language modeling
We need to compute the probability of a sentence, i.e. a sequence of words $(w_1, \dots, w_T)$, in a language $L$.
$$P(w_1, \dots, w_T) = P(w_1, \dots, w_{T-1})\, P(w_T \mid w_1, \dots, w_{T-1}) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$
Fix $N$ and try to compute $P(w_t \mid w_{t-N}, \dots, w_{t-1})$.
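As a minimal sketch of the chain rule with a context truncated to the previous N words: the helper `sentence_log_prob` and the toy `uniform_model` below are hypothetical names introduced only for illustration, and the perplexity line assumes that "PP" in the later tables stands for perplexity (the exponential of the negative mean log-probability).

```python
import math

def sentence_log_prob(words, cond_prob, N):
    """Sum log P(w_t | previous N words) over the sentence (truncated chain rule)."""
    total = 0.0
    for t, w in enumerate(words):
        context = tuple(words[max(0, t - N):t])  # at most N preceding words
        total += math.log(cond_prob(w, context))
    return total

# Hypothetical toy model: uniform over a 4-word vocabulary, ignoring the context.
def uniform_model(word, context):
    return 0.25

sentence = ["the", "cat", "sat"]
log_p = sentence_log_prob(sentence, uniform_model, N=2)
perplexity = math.exp(-log_p / len(sentence))
```

For a uniform model the perplexity equals the vocabulary size, which is a convenient sanity check for any real `cond_prob`.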
RNN
RNN: recurrent neural network. $T$ is the number of timesteps, $L$ is the number of recurrent layers. Each layer can be described as follows:
$$z_l^t = W_l x_{l-1}^t + V_l x_l^{t-1} + b_l \qquad (1)$$
$$x_l^t = \sigma(z_l^t) \qquad (2)$$
$t \in \{1, \dots, N\}$; $l \in \{1, \dots, L\}$. The output of the network is given by
$$y^t = \mathrm{softmax}\left[W_{L+1} x_L^t + b_{L+1}\right]$$
Then define
$$P(w_t \mid w_{t-N}, \dots, w_{t-1}) = y^t$$
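The recurrence in (1)-(2) and the softmax output can be sketched in NumPy as follows. The layer sizes, the random weights, and the random vectors standing in for word embeddings $x_0^t$ are assumptions made only so the example runs; they are not taken from the slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_below, x_prev, W, V, b):
    """x_l^t = sigma(W_l x_{l-1}^t + V_l x_l^{t-1} + b_l), equations (1)-(2)."""
    return sigmoid(W @ x_below + V @ x_prev + b)

rng = np.random.default_rng(0)
hidden, vocab, num_layers = 8, 20, 2          # illustrative sizes
layers = [(rng.normal(scale=0.1, size=(hidden, hidden)),
           rng.normal(scale=0.1, size=(hidden, hidden)),
           np.zeros(hidden)) for _ in range(num_layers)]
W_out = rng.normal(scale=0.1, size=(vocab, hidden))
b_out = np.zeros(vocab)

state = [np.zeros(hidden) for _ in range(num_layers)]  # x_l^{t-1} per layer
inputs = rng.normal(size=(5, hidden))                  # stand-ins for x_0^t
for x0 in inputs:
    below = x0
    for l, (W, V, b) in enumerate(layers):
        state[l] = rnn_step(below, state[l], W, V, b)
        below = state[l]
    y = softmax(W_out @ below + b_out)  # y^t: distribution over the next word
```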
LSTM
Equations for one LSTM layer:
$$i_l^t = \sigma\left[W_l^i x_{l-1}^t + V_l^i x_l^{t-1} + b_l^i + D(v_l^i)\, c_l^{t-1}\right] \quad \text{input gate} \qquad (3)$$
$$f_l^t = \sigma\left[W_l^f x_{l-1}^t + V_l^f x_l^{t-1} + b_l^f + D(v_l^f)\, c_l^{t-1}\right] \quad \text{forget gate} \qquad (4)$$
$$c_l^t = f_l^t \cdot c_l^{t-1} + i_l^t \tanh\left[W_l^c x_{l-1}^t + U_l^c x_l^{t-1} + b_l^c\right] \quad \text{cell state} \qquad (5)$$
$$o_l^t = \sigma\left[W_l^o x_{l-1}^t + V_l^o x_l^{t-1} + b_l^o + D(v_l^o)\, c_l^{t-1}\right] \quad \text{output gate} \qquad (6)$$
$$x_l^t = o_l^t \cdot \tanh\left[c_l^t\right] \qquad (7)$$
$t \in \{1, \dots, N\}$; $l \in \{1, \dots, L\}$. The output of the network is given by
$$y^t = \mathrm{softmax}\left[W_{L+1} x_L^t + b_{L+1}\right]$$
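A minimal NumPy sketch of one LSTM layer step following (3)-(7). It assumes that $D(v)$ denotes a diagonal matrix built from the vector $v$, so that $D(v)\,c$ is an elementwise product; the slides do not define $D$ explicitly. The parameter dictionary `p`, its keys, and the sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_below, x_prev, c_prev, p):
    """One LSTM layer at one timestep, equations (3)-(7).
    Assumption: D(v) c is read as diag(v) @ c, i.e. elementwise v * c."""
    i = sigmoid(p["Wi"] @ x_below + p["Vi"] @ x_prev + p["bi"] + p["vi"] * c_prev)
    f = sigmoid(p["Wf"] @ x_below + p["Vf"] @ x_prev + p["bf"] + p["vf"] * c_prev)
    c = f * c_prev + i * np.tanh(p["Wc"] @ x_below + p["Uc"] @ x_prev + p["bc"])
    o = sigmoid(p["Wo"] @ x_below + p["Vo"] @ x_prev + p["bo"] + p["vo"] * c_prev)
    x = o * np.tanh(c)
    return x, c

rng = np.random.default_rng(0)
n = 6  # illustrative layer width
p = {}
for gate in "ifo":
    p["W" + gate] = rng.normal(scale=0.1, size=(n, n))
    p["V" + gate] = rng.normal(scale=0.1, size=(n, n))
    p["b" + gate] = np.zeros(n)
    p["v" + gate] = rng.normal(scale=0.1, size=n)
p["Wc"] = rng.normal(scale=0.1, size=(n, n))
p["Uc"] = rng.normal(scale=0.1, size=(n, n))
p["bc"] = np.zeros(n)

x, c = np.zeros(n), np.zeros(n)
for x_below in rng.normal(size=(4, n)):   # four timesteps of stand-in inputs
    x, c = lstm_step(x_below, x, c, p)
```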
RNN and LSTM unit
Figure 1: Left: RNN unit, Right: LSTM unit
Problem
Neural network models can take up a lot of space on disk and in memory. They also need a lot of time for inference.
Let’s compress neural networks
Compression methods
1. Pruning
2. Quantization
3. Low-rank decomposition
Pruning
Definition
Pruning: a method for reducing the number of parameters in a neural network. All connections with weights below a threshold are removed from the network. The network is then retrained to learn the final weights for the remaining sparse connections.
Figure 2: Weights distribution before and after pruning
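A minimal sketch of magnitude pruning, assuming the threshold is chosen as a quantile of the absolute weights so that a target fraction is removed. The function name `prune_by_magnitude` and the matrix shape are illustrative; how the threshold was actually chosen is not stated in the slides.

```python
import numpy as np

def prune_by_magnitude(W, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold      # True for connections that survive
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(200, 1000))       # stand-in for an output-layer matrix
W_pruned, mask = prune_by_magnitude(W, sparsity=0.9)
kept_fraction = float(mask.mean())

# During retraining one would typically reapply `mask` after each update
# so that removed connections stay at zero.
```

The size saving only materializes if the pruned matrix is stored in a sparse format, which also has to store the indices of the surviving weights.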
Quantization [1]
Definition
Quantization: a method for compressing the size of a neural network in memory. Each float value is compressed to an eight-bit integer representing the closest real number in a linear set of 256 values within the range.
When we use the model for inference we still need to do full-precision computations.
[1] https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/
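A minimal sketch of linear 8-bit quantization, assuming the range is taken as the minimum and maximum of each weight matrix. The function names are hypothetical, and this is not the TensorFlow implementation described in the linked post. Weights are stored as `uint8` and mapped back to floats before computation, which matches the remark that inference is still done in full precision.

```python
import numpy as np

def quantize_uint8(W):
    """Map each weight to the nearest of 256 evenly spaced levels in [min, max]."""
    lo, hi = float(W.min()), float(W.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((W - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    """Recover approximate float weights for full-precision inference."""
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
W = rng.normal(size=(200, 200)).astype(np.float32)
q, lo, scale = quantize_uint8(W)
W_hat = dequantize(q, lo, scale)
max_err = float(np.abs(W - W_hat).max())   # bounded by about scale / 2
```

Storage drops from four bytes to one byte per weight, plus the two floats `lo` and `scale` per matrix.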
PTB Results
Method                                               Size (Mb)   Test PP
Original: 2-layer LSTM (200 units each)              18.6        117.659
Pruning output layer 90%, w/o additional training    5.5         149.310
Pruning output layer 90%, with additional training   5.5         121.123
Quantization                                         4.7         118.232
Table 1: Pruning and quantization results
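The first and last rows can be sanity-checked with a back-of-envelope parameter count. This sketch rests on assumptions the slides do not state: a 10,000-word PTB vocabulary, 32-bit floats, an embedding size equal to the hidden size, no peephole vectors, and "Mb" read as 10^6 bytes.

```python
vocab = 10_000         # assumption: standard PTB vocabulary size
hidden = 200           # from the table: 2 layers of 200 units
num_layers = 2
bytes_per_float = 4    # assumption: float32 storage

embedding = vocab * hidden
# Per layer: 4 blocks (3 gates + cell candidate), each with an input matrix,
# a recurrent matrix and a bias vector; peephole vectors are left out.
lstm = num_layers * 4 * (hidden * hidden + hidden * hidden + hidden)
output = hidden * vocab + vocab

total_params = embedding + lstm + output
size_fp32_mb = total_params * bytes_per_float / 1e6
size_int8_mb = total_params * 1 / 1e6   # one byte per weight after quantization
```

Under these assumptions the model has 4,651,600 parameters, i.e. about 18.6 Mb in float32 and about 4.65 Mb at one byte per weight, close to the 18.6 and 4.7 reported above. The embedding and output layers account for most of the size, which is consistent with pruning being applied to the output layer.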
Low-rank factorization
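Simple factorization can be done as follows:
$$x_l^t = \sigma\left[W_l^a W_l^b x_{l-1}^t + U_l^a U_l^b x_l^{t-1} + b_l\right]$$
Let’s require $W_l^b = U_{l-1}^b$. After this we can rewrite our equation for RNN:
$$x_l^t = \sigma\left[W_l^a m_{l-1}^t + U_l^a m_l^{t-1} + b_l\right] \qquad (8)$$
$$m_l^t = U_l^b x_l^t \qquad (9)$$
$$y^t = \mathrm{softmax}\left[W_{L+1} m_L^t + b_{L+1}\right] \qquad (10)$$
Note that for LSTM the construction is essentially the same.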
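To illustrate replacing a weight matrix $W$ by a product $W^a W^b$, here is a minimal sketch that obtains the two factors from a truncated SVD. This is one common way to get the factors; the slides do not say whether the factors were initialized this way or trained from scratch. The 600x600 shape and rank 128 echo the numbers used in the results, but the random matrix is only a stand-in.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Best rank-`rank` approximation W ~= Wa @ Wb via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    Wa = U[:, :rank] * s[:rank]   # shape (out, rank), singular values folded in
    Wb = Vt[:rank, :]             # shape (rank, in)
    return Wa, Wb

rng = np.random.default_rng(0)
W = rng.normal(size=(600, 600))   # stand-in for a trained weight matrix
Wa, Wb = low_rank_factors(W, rank=128)

params_full = W.size                   # 600 * 600
params_low_rank = Wa.size + Wb.size    # 2 * 600 * 128
rel_error = np.linalg.norm(W - Wa @ Wb) / np.linalg.norm(W)
```

A random Gaussian matrix has no low-rank structure, so `rel_error` is large here; the parameter saving per matrix (360,000 down to 153,600) is what the example is meant to show. Sharing $W_l^b = U_{l-1}^b$ removes one of the two projection matrices per layer on top of that.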
Low-rank factorization. Visual results
Figure 3: Top: original. Bottom: compressed.
Low-rank factorization. Results
Method                                                                            Size (Mb)   Test PP
Plain vanilla model: 2-layer LSTM (600 units each)                                71.1        43.4
SVD for output layer (128)                                                        52.5        43.6
Reduce embedding size (100)                                                       46.3        42.9
Reduce embedding size (100) and SVD for output layer (128)                        27.7        42.9
Reduce embedding size (100) and low-rank factorization (128)                      14.5        45.0
Reduce embedding size (100) and low-rank factorization (128) in half-precision    7.3         45.0
Table 2: Low-rank factorization results (not PTB).
Tensor Train decomposition [Oseledets, 2011]
TT-format for a tensor $A$:
$$A(i_1, \dots, i_d) = \sum_{\alpha_0, \dots, \alpha_d} G_1(\alpha_0, i_1, \alpha_1)\, G_2(\alpha_1, i_2, \alpha_2) \cdots G_d(\alpha_{d-1}, i_d, \alpha_d). \qquad (11)$$
This can be written compactly as a matrix product:
$$A(i_1, \dots, i_d) = \underbrace{G_1[i_1]}_{1 \times r_1}\, \underbrace{G_2[i_2]}_{r_1 \times r_2} \cdots \underbrace{G_d[i_d]}_{r_{d-1} \times 1} \qquad (12)$$
Terminology:
$G_i$: TT-cores (collections of matrices)
$r_i$: TT-ranks
$r = \max r_i$: maximal TT-rank
TT-format uses $O(d n r^2)$ memory to store $O(n^d)$ elements. Efficient only if the ranks are small.
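A minimal sketch of equation (12): each element of the tensor is a product of one matrix slice per core. The core layout `(r_{k-1}, n_k, r_k)`, the helper names, and the small sizes are assumptions chosen for illustration.

```python
import numpy as np

def tt_element(cores, index):
    """A(i_1..i_d) = G_1[i_1] G_2[i_2] ... G_d[i_d], equation (12).
    cores[k] has shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1."""
    result = np.ones((1, 1))
    for core, i in zip(cores, index):
        result = result @ core[:, i, :]
    return result[0, 0]

def tt_to_full(cores):
    """Contract all cores into the full tensor (only feasible for small sizes)."""
    full = cores[0]                                    # shape (1, n_1, r_1)
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full[0, ..., 0]                             # drop boundary ranks

rng = np.random.default_rng(0)
d, n, r = 4, 5, 3                                      # illustrative sizes
ranks = [1] + [r] * (d - 1) + [1]
cores = [rng.normal(size=(ranks[k], n, ranks[k + 1])) for k in range(d)]

full = tt_to_full(cores)
tt_params = sum(core.size for core in cores)           # grows like d * n * r^2
full_params = n ** d                                   # grows like n^d
```

With these sizes the cores hold 120 numbers against 625 in the full tensor; the gap widens quickly as $d$ grows while the ranks stay small.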
PTB results
Method                  Size (Mb)   Test PP
PTB 200-200             18.6        115.91
PTB 650-650             79.1        82.07
PTB 1500-1500           264.1       78.29
LR PTB 650-650          16.8        92.885
LR PTB 1500-1500        94.9        89.462
TT-LSTM PTB 600-600     50.4        168.639
Table 3: PTB matrix decomposition results
References
Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. International Conference on Learning Representations (ICLR), 2016.
Zhiyun Lu, Vikas Sindhwani, and Tara N. Sainath. Learning Compact Recurrent Neural Networks. Acoustics, Speech and Signal Processing (ICASSP), 2016.
Y. Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2009.
Ivan V. Oseledets. Tensor-Train Decomposition. SIAM J. Scientific Computing, Vol. 33(5), pp. 2295–2317, 2011.
Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, Vol. 9(8), pp. 1735–1780, 1997.
Tomas Mikolov. Statistical Language Models Based on Neural Networks. PhD Thesis, Brno, CZ, 2012.
Thank you for listening!