Neural Segmental CRFs for Sequence Modelling

Liang Lu

Based on the work with Lingpeng Kong @ CMU

28 June 2016


Carry on from Mark¹

• Sequence-to-sequence modelling
  ◦ speech synthesis: word sequence → waveform
  ◦ speech recognition: waveform → word sequence
  ◦ machine translation: word sequence → word sequence

¹ Assuming that you were in the talk with attention and long-term memory.

2 of 27


Next Question

• Is speech recognition more special?
  ◦ monotonic alignment
  ◦ long input sequence
  ◦ output sequence is much shorter (words/phonemes)

3 of 27


Speech Recognition

• monotonic alignment
  ◦ the encoder-decoder model does not naturally apply
  ◦ x1:T → c → y1:L

• long input sequence
  ◦ expensive for globally normalised models

• output sequence is much shorter (words/phonemes)
  ◦ length mismatch

4 of 27


Speech Recognition

• Hidden Markov Model

◦ monotonic alignment √

◦ long input sequence → locally normalised

◦ length mismatch → hidden states

• Connectionist Temporal Classification
  ◦ monotonic alignment

◦ long input sequence → locally normalised

◦ length mismatch → blank state

5 of 27


Speech Recognition

• Locally normalised models:
  ◦ conditional independence assumption

◦ label bias problem

◦ better results given by sequence training: local → global normalisation

• Question: why not stick with a globally normalised model from scratch?

[1] D. Andor, et al, “Globally Normalized Transition-Based Neural Networks”, ACL, 2016.

[2] D. Povey, et al, “Purely sequence-trained neural networks for ASR based on lattice-free MMI”, Interspeech, 2016.

6 of 27


(Segmental) Conditional Random Field

[Figure: graphical model of a CRF vs. a segmental CRF]

7 of 27


(Segmental) Conditional Random Field

• CRF [Lafferty et al. 2001]

$$P(y_{1:L} \mid x_{1:T}) = \frac{1}{Z(x_{1:T})} \prod_{j} \exp\left( w^\top \Phi(y_j, x_{1:T}) \right) \qquad (1)$$

where $L = T$.

• Segmental (semi-Markov) CRF [Sarawagi and Cohen 2004]

$$P(y_{1:L}, E \mid x_{1:T}) = \frac{1}{Z(x_{1:T})} \prod_{j} \exp\left( w^\top \Phi(y_j, e_j, x_{1:T}) \right) \qquad (2)$$

where $e_j = \langle s_j, n_j \rangle$ denotes the beginning ($s_j$) and end ($n_j$) time tags of $y_j$, and $E = \{e_{1:L}\}$ is the latent segmentation variable.
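For concreteness, here is a minimal NumPy sketch of the partition function Z(x_{1:T}) in equation (2). It assumes a zeroth-order model (segment potentials do not depend on neighbouring labels, matching the factorisation above) and a maximum segment length; the `log_phi` tensor and `max_seg_len` cap are illustrative assumptions, not something defined in the slides.

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(log_phi, max_seg_len):
    """Semi-Markov forward recursion for log Z(x_{1:T}).

    log_phi[s, e, y] = w . Phi(y, <s, e>, x_{1:T}): the log-potential of a
    segment covering frames s..e (0-based, inclusive) with label y.
    """
    T = log_phi.shape[0]
    alpha = np.full(T + 1, -np.inf)   # alpha[t]: log-sum over all segmentations of x[0:t]
    alpha[0] = 0.0
    for t in range(1, T + 1):
        starts = range(max(0, t - max_seg_len), t)
        # the last segment covers frames s..t-1; sum over its possible labels
        alpha[t] = logsumexp([alpha[s] + logsumexp(log_phi[s, t - 1, :]) for s in starts])
    return alpha[T]
```

With hand-crafted features, `log_phi` would be filled in from w⊤Φ; the next slide replaces this with a learned, recurrent feature function.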

8 of 27


(Segmental) Conditional Random Field

$$\frac{1}{Z(x_{1:T})} \prod_{j} \exp\left( w^\top \Phi(y_j, x_{1:T}) \right)$$

• Learnable parameter w

• Engineering the feature function Φ(·)

• Designing Φ(·) is much harder for speech than NLP

9 of 27


Neural Segmental CRF

• Using (recurrent) neural networks to learn the feature function Φ(·) (see the sketch below).

[Figure: a segmental RNN — frames x1…x6 are grouped into segments, and each segment is mapped to a label (y1, y2, y3)]

[1] Y. Liu, et al, “Exploring Segment Representations for Neural Segmentation Models”, arXiv 2016.
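A minimal PyTorch sketch of such a learned feature function follows. It is a hypothetical module in the same spirit as segmental RNNs, not the exact architecture from the papers above; the class name, hidden size and pooling choice are illustrative assumptions. A BiLSTM reads the frames inside a candidate segment and a linear layer maps the summary to per-label log-potentials — exactly the `log_phi` entries consumed by the dynamic programmes on the neighbouring slides.

```python
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    """Hypothetical segment-level feature function Phi learned by an RNN."""

    def __init__(self, feat_dim, hidden_dim, num_labels):
        super().__init__()
        self.birnn = nn.LSTM(feat_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, x, s, e):
        # x: (T, feat_dim) acoustic frames; the segment covers frames s..e inclusive
        seg = x[s:e + 1].unsqueeze(0)         # (1, e-s+1, feat_dim)
        out, _ = self.birnn(seg)              # (1, e-s+1, 2*hidden_dim)
        # crude pooling: take the states at the last time step; a fuller version
        # could also use the backward state at the first step
        summary = out[:, -1, :]
        return self.proj(summary).squeeze(0)  # (num_labels,) log-potentials for this segment
```

Because every candidate segment needs a score, this is the expensive part; the later "speed up training" slide shortens the frame sequence before this step to keep the number of candidate segments manageable.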

10 of 27


Neural conditional random fields

• Training criteria
  ◦ Conditional maximum likelihood

$$\mathcal{L}(\theta) = \log P(y_{1:L} \mid x_{1:T}) = \log \sum_{E} P(y_{1:L}, E \mid x_{1:T}) \qquad (3)$$

◦ Hinge loss – similar to structured SVM.

something complicated!
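Evaluating equation (3) takes two dynamic programmes: the denominator is log Z (the earlier `log_partition` sketch), and the numerator sums over every segmentation consistent with the reference labels y_{1:L}. A minimal sketch of that clamped forward pass, under the same zeroth-order and max-segment-length assumptions as before:

```python
import numpy as np
from scipy.special import logsumexp

def log_numerator(log_phi, labels, max_seg_len):
    """log sum_E of the unnormalised score of (y_{1:L}, E), clamped to y_{1:L}.

    log_phi[s, e, y] is the segment log-potential (see the earlier sketch);
    labels is the reference label sequence y_{1:L}.
    """
    T, L = log_phi.shape[0], len(labels)
    A = np.full((L + 1, T + 1), -np.inf)   # A[j, t]: first j labels cover frames 0..t-1
    A[0, 0] = 0.0
    for j in range(1, L + 1):
        y = labels[j - 1]
        for t in range(j, T + 1):
            starts = range(max(j - 1, t - max_seg_len), t)
            A[j, t] = logsumexp([A[j - 1, s] + log_phi[s, t - 1, y] for s in starts])
    return A[L, T]

# Conditional log-likelihood of eq. (3):
#   L(theta) = log_numerator(log_phi, labels, K) - log_partition(log_phi, K)
```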

11 of 27


Neural conditional random fields

• Viterbi decoding
  ◦ Partial Viterbi decoding

$$y^{*}_{1:L} = \arg\max_{y_{1:L}} \log \sum_{E} P(y_{1:L}, E \mid x_{1:T}) \qquad (4)$$

  ◦ Full Viterbi decoding

$$y^{*}_{1:L}, E^{*} = \arg\max_{y_{1:L},\, E} \log P(y_{1:L}, E \mid x_{1:T}) \qquad (5)$$

[1] L. Lu, et al, “Segmental Recurrent Neural Networks for End-to-end Speech Recognition”, Interspeech 2016.
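Full Viterbi decoding (eq. 5) is the easy case: it is the same semi-Markov recursion with max in place of sum, plus back-pointers. A minimal sketch under the same zeroth-order assumptions as the earlier sketches:

```python
import numpy as np

def viterbi_decode(log_phi, max_seg_len):
    """Jointly best label sequence and segmentation (eq. 5)."""
    T = log_phi.shape[0]
    best = np.full(T + 1, -np.inf)
    best[0] = 0.0
    back = [None] * (T + 1)               # back[t] = (start, label) of the last segment
    for t in range(1, T + 1):
        for s in range(max(0, t - max_seg_len), t):
            y = int(np.argmax(log_phi[s, t - 1, :]))
            score = best[s] + log_phi[s, t - 1, y]
            if score > best[t]:
                best[t], back[t] = score, (s, y)
    labels, segments, t = [], [], T
    while t > 0:                          # trace back the best path
        s, y = back[t]
        labels.append(y)
        segments.append((s, t - 1))
        t = s
    return best[T], labels[::-1], segments[::-1]
```

Partial Viterbi decoding (eq. 4) is harder, since the max over y_{1:L} wraps a sum over segmentations.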

12 of 27


Related works

• (Segmental) CRFs for speech

• Neural CRFs

• Structured SVMs

13 of 27


Comparison to CTC

[1] A. Senior, et al, “Acoustic Modelling with CD-CTC-sMBR LSTM RNNs”, ASRU 2015.

14 of 27

Comparison to CTC

[Figure, built up as an animation over pages 15–20: frames x1…x4 each emit an output y1…y4; successive overlays replace outputs with the blank symbol <b> and cross out (×) parts of the diagram]

15–20 of 27


Experiment

• TIMIT dataset
  ◦ 3696 training utterances (∼3 hours)
  ◦ core test set (192 test utterances)
  ◦ trained on 48 phonemes, mapped down to 39 for scoring
  ◦ log filterbank features (FBANK)
  ◦ LSTMs used as the RNN implementation

21 of 27


Experiment

• Speed up training by shortening the input sequence (sketch below)

[Figure: two ways to shorten the frame sequence x1, x2, x3, x4, … before the segmental layer — a) concatenate / add neighbouring frames, b) skip frames]
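A minimal NumPy sketch of the two reductions in the figure; the reduction factor `k` is an illustrative assumption.

```python
import numpy as np

def concat_frames(x, k=2):
    """a) concatenate k neighbouring frames: (T, d) -> (T//k, k*d)."""
    T, d = x.shape
    T_trim = (T // k) * k                 # drop leftover frames so T divides by k
    return x[:T_trim].reshape(T_trim // k, k * d)

def add_frames(x, k=2):
    """a) add k neighbouring frames instead of concatenating: (T, d) -> (T//k, d)."""
    T, d = x.shape
    T_trim = (T // k) * k
    return x[:T_trim].reshape(T_trim // k, k, d).sum(axis=1)

def skip_frames(x, k=2):
    """b) keep every k-th frame: (T, d) -> (ceil(T/k), d)."""
    return x[::k]
```

Either way, the segmental dynamic programmes above run over a much shorter sequence.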

22 of 27


Experiment

Table: Results of tuning the hyperparameters (PER = phone error rate, %).

Dropout   Layers   Hidden   PER
0.2       3        128      21.2
0.2       3        250      20.1
0.2       6        250      19.3
0.1       3        128      21.3
0.1       3        250      20.9
0.1       6        250      20.4
×         6        250      21.9

23 of 27


Experiment

Table: Results of three types of acoustic features.

Features       Deltas   d(x_t)   PER
24-dim FBANK   √        72       19.3
40-dim FBANK   √        120      18.9
Kaldi          ×        40       17.3

Kaldi features – 39-dimensional MFCCs spliced with a context window of 7 frames, followed by LDA and MLLT transforms, plus feature-space speaker-dependent MLLR (fMLLR).

24 of 27


Experiment

Table: Comparison to related works (LM = language model, SD = speaker-dependent features, PER in %).

System                                          LM   SD   PER
HMM-DNN                                         √    √    18.5
first-pass SCRF [Zweig 2012]                    √    ×    33.1
Boundary-factored SCRF [He 2012]                ×    ×    26.5
Deep Segmental NN [Abdel 2013]                  √    ×    21.9
Discriminative segmental cascade [Tang 2015]    √    ×    21.7
  + 2nd pass with various features              √    ×    19.9
CTC [Graves 2013]                               ×    ×    18.4
RNN transducer [Graves 2013]                    –    ×    17.7
Attention-based RNN [Chorowski 2015]            –    ×    17.6
Segmental RNN                                   ×    ×    18.9
Segmental RNN                                   ×    √    17.3

25 of 27


Conclusion

• Neural Segmental CRFs are flexible and powerful sequence models
  ◦ handwriting recognition
  ◦ joint word segmentation and POS tagging

• However, speed matters for large-vocabulary speech recognition
  ◦ WFST-based decoder
  ◦ context-dependent vs. context-independent phones

[1] L. Kong, et al, “Segmental Recurrent Neural Networks”, ICLR 2016.

26 of 27


Thank you! Questions?

27 of 27