
  • Deep Learning

    Taro Watanabe

    taro.watanabe at nict.go.jp

    https://sites.google.com/site/alaginmt2015/

  • Source sentence: 機械翻訳について勉強したい。

    k-best translation candidates and their feature scores:

                                                 log Pr(f, d|e)  log Pr(d|e)  log Pr(e)   sum
    I want to study about machine translation        -2              -3          -4        -9
    I need to master machine translation              -3              -4          -4       -11
    machine translation want to study                 -2              -5          -1        -8
    I don't want to learn anything                     -5              -2          -3       -10

    With feature weights (0.5, 0.4, 0.2):

    0.5×-2 + 0.4×-3 + 0.2×-4 = -3.0
    0.5×-3 + 0.4×-4 + 0.2×-4 = -3.9
    0.5×-2 + 0.4×-5 + 0.2×-1 = -3.2
    0.5×-5 + 0.4×-2 + 0.2×-3 = -3.9

    Weighting reorders the candidates (a small sketch follows below).
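    To make the reranking arithmetic concrete, here is a minimal Python sketch; the candidate strings and scores are copied from the slide, and the feature labels follow the table above.

```python
# k-best candidates with (log Pr(f, d|e), log Pr(d|e), log Pr(e)) feature scores
kbest = [
    ("I want to study about machine translation", (-2, -3, -4)),
    ("I need to master machine translation",      (-3, -4, -4)),
    ("machine translation want to study",         (-2, -5, -1)),
    ("I don't want to learn anything",            (-5, -2, -3)),
]
w = (0.5, 0.4, 0.2)  # feature weights

def score(h):
    """Weighted sum of the feature scores, w . h."""
    return sum(wi * hi for wi, hi in zip(w, h))

# Sort candidates by the weighted feature sum (best first)
for e, h in sorted(kbest, key=lambda c: -score(c[1])):
    print(f"{score(h):5.1f}  {e}")
```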

  • Weighting the three models:

    ê = arg max_e Pr(f, d|e)^0.5 · Pr(d|e)^0.4 · Pr(e)^0.2

    More generally, a log-linear model over a feature vector h(f, d, e) with weights w (Och and Ney, 2002):

    ê = arg max_e [ Σ_d exp(w⊤ h(f, d, e)) ] / [ Σ_{e′,d′} exp(w⊤ h(f, d′, e′)) ]
      ≈ arg max_{⟨e,d⟩} w⊤ h(f, d, e)

    Optimization = discriminative learning of the best w.
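    As a rough illustration of the normalization in the log-linear model, the weighted scores from the previous slide can be turned into a distribution over the candidate list. Note that this sums only over the four visible candidates rather than over all ⟨e, d⟩, so it is only a sketch of the true partition function.

```python
import math

# w . h(f, d, e) for the four candidates above
scores = [-3.0, -3.9, -3.2, -3.9]

Z = sum(math.exp(s) for s in scores)      # normalizer over the truncated candidate set
probs = [math.exp(s) / Z for s in scores]
print(probs, sum(probs))                  # the arg max candidate is unchanged by normalization
```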

  • Why deep learning?

    Example phrase-table entries (source ||| target ||| feature scores):

    日本 ||| a japanese ||| 0.272727 0.283379 0.00953895 0.00411563
    日本 ||| all over japan ||| 0.222222 0.541131 0.00317965 3.72914e-07
    日本 ||| around japan ||| 0.05 0.541131 0.00158983 0.000337958
    日本 ||| as some japanese ||| 1 0.283379 0.00158983 5.45274e-07
    日本 ||| japan , ||| 0.0897436 0.541131 0.0111288 0.0225871
    日本 ||| japan : ||| 1 0.541131 0.00158983 0.00131649
    日本 ||| japan ||| 0.396648 0.541131 0.338633 0.463146
    日本 ||| japanese , ||| 0.0769231 0.283379 0.00158983 0.00557971
    日本 ||| japanese ||| 0.242553 0.283379 0.09062 0.114411

    Features developed by trial and error → features learned automatically.

  • Contribution to machine translation

    We also perform 1000-best rescoring with the following features:

    • 5-gram Kneser-Ney LM
    • Recurrent neural network language model (RNNLM) (Mikolov et al., 2010)

    Although we consider the RNNLM to be part of our baseline, we give it special treatment in the results section because we would expect it to have the highest overlap with our NNJM.

    5.2 Training and Optimization
    For Arabic word tokenization, we use the MADA-ARZ tokenizer (Habash et al., 2013) for the BOLT condition, and the Sakhr⁹ tokenizer for the NIST condition. For Chinese tokenization, we use a simple longest-match-first lexicon-based approach.

    For word alignment, we align all of the training data with both GIZA++ (Och and Ney, 2003) and NILE (Riesa et al., 2011), and concatenate the corpora together for rule extraction.

    For MT feature weight optimization, we use iterative k-best optimization with an Expected-BLEU objective function (Rosti et al., 2010).

    6 Experimental Results

    We present MT primary results on Arabic-English and Chinese-English for the NIST OpenMT12 and DARPA BOLT conditions. We also present a set of auxiliary results in order to further analyze our features.

    6.1 NIST OpenMT12 Results
    Our NIST system is fully compatible with the OpenMT12 constrained track, which consists of 10M words of high-quality parallel training for Arabic, and 25M words for Chinese.¹⁰ The Kneser-Ney LM is trained on 5B words of data from English GigaWord. For test, we use the “Arabic-To-English Original Progress Test” (1378 segments) and “Chinese-to-English Original Progress Test + OpenMT12 Current Test” (2190 segments), which consists of a mix of newswire and web data.¹¹ All test segments have 4 references. Our tuning set contains 5000 segments, and is a mix of the MT02-05 eval set as well as held-out parallel training.

    ⁹ http://www.sakhr.com
    ¹⁰ We also make weak use of 30M-100M words of UN data + ISI comparable corpora, but this data provides almost no benefit.
    ¹¹ http://www.nist.gov/itl/iad/mig/openmt12results.cfm

    NIST MT12 Test (BLEU, Ar-En / Ch-En):
    OpenMT12 - 1st Place       49.5 / 32.6
    OpenMT12 - 2nd Place       47.5 / 32.2
    OpenMT12 - 3rd Place       47.4 / 30.8
    · · ·
    OpenMT12 - 9th Place       44.0 / 27.0
    OpenMT12 - 10th Place      41.2 / 25.7
    Baseline (w/o RNNLM)       48.9 / 33.0
    Baseline (w/ RNNLM)        49.8 / 33.4
    + S2T/L2R NNJM (Dec)       51.2 / 34.2
    + S2T NNLTM (Dec)          52.0 / 34.2
    + T2S NNLTM (Resc)         51.9 / 34.2
    + S2T/R2L NNJM (Resc)      52.2 / 34.3
    + T2S/L2R NNJM (Resc)      52.3 / 34.5
    + T2S/R2L NNJM (Resc)      52.8 / 34.7
    “Simple Hier.” Baseline    43.4 / 30.1
    + S2T/L2R NNJM (Dec)       47.2 / 31.5
    + S2T NNLTM (Dec)          48.5 / 31.8
    + Other NNJMs (Resc)       49.7 / 32.2

    Table 3: Primary results on Arabic-English and Chinese-English NIST MT12 Test Set. The first section corresponds to the top and bottom ranked systems from the evaluation, and are taken from the NIST website. The second section corresponds to results on top of our strongest baseline. The third section corresponds to results on top of a simpler baseline. Within each section, each row includes all of the features from previous rows. BLEU scores are mixed-case.

    Results are shown in the second section of Table 3. On Arabic-English, the primary S2T/L2R NNJM gains +1.4 BLEU on top of our baseline, while the S2T NNLTM gains another +0.8, and the directional variations gain +0.8 BLEU more. This leads to a total improvement of +3.0 BLEU from the NNJM and its variations. Considering that our baseline is already +0.3 BLEU better than the 1st place result of MT12 and contains a strong RNNLM, we consider this to be quite an extraordinary improvement.¹²

    For the Chinese-English condition, there is an improvement of +0.8 BLEU from the primary NNJM and +1.3 BLEU overall. Here, the baseline system is already +0.8 BLEU better than the

    ¹² Note that the official 1st place OpenMT12 result was our own system, so we can assure that these comparisons are accurate.

    (Devlin et al., 2014)

  • Background

  • Linear classification

    • The two classes can be separated by a linear model:

    [Figure: scatter of ☂ and ☼ points separated by a single line]

    y = sign(Wx + b)

  • The XOR problem

    • Cannot be solved by a linear model

    [Figure: ☂ and ☼ points in an XOR arrangement; no single line separates them]

  • The XOR problem

    • Several coordinate transformations + a linear classifier

    [Figure: the transformations W1x + b1 and W2x + b2, each followed by the activation function f, map the ☂/☼ points so that they become linearly separable]

    y = sign( W3 [ f(W2x + b2) ; f(W1x + b1) ] + b3 )

    (Cybenko, 1989)
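    A minimal numpy sketch of this construction, with hand-picked (not learned) parameters; the specific weight values are illustrative assumptions, and ReLU plays the role of the activation function f here.

```python
import numpy as np

# XOR points and their labels (+1 / -1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = [-1, +1, +1, -1]

def relu(z):                      # plays the role of the activation function f
    return np.maximum(z, 0.0)

# Hand-picked parameters (illustrative only, not learned)
W1, b1 = np.array([[1.0, 1.0]]), np.array([-0.5])   # fires when at least one input is 1
W2, b2 = np.array([[1.0, 1.0]]), np.array([-1.5])   # fires only when both inputs are 1
W3, b3 = np.array([[-4.0, 1.0]]), np.array([-0.25])

for x, t in zip(X, labels):
    h = np.concatenate([relu(W2 @ x + b2), relu(W1 @ x + b1)])
    y = int(np.sign(W3 @ h + b3)[0])
    print(x, "->", y, "(expected", t, ")")
```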

  • Activation functions

    [Figure: three plots over x ∈ [-4, 4]: sigmoid, tanh and softplus; hard tanh (htanh) and ReLU; maxout]

    • Nonlinear transformations allow flexible design

    (Goodfellow et al., 2013)
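    A small Python sketch of the activation functions named on the slide; the maxout helper's signature is my own illustrative choice (an elementwise max over several affine pieces, following Goodfellow et al., 2013).

```python
import numpy as np

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def softplus(x): return np.log1p(np.exp(x))       # smooth approximation of ReLU
def htanh(x):    return np.clip(x, -1.0, 1.0)     # hard tanh
def relu(x):     return np.maximum(x, 0.0)
# np.tanh covers the remaining curve on the slide

def maxout(x, pieces):
    """Maxout: elementwise max over several affine transformations (W, b) of x."""
    return np.max(np.stack([W @ x + b for W, b in pieces]), axis=0)

x = np.linspace(-4, 4, 9)
for f in (sigmoid, np.tanh, softplus, htanh, relu):
    print(f(x))
```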

  • Neural network

    [Figure: input → mapping → activation → output]

  • Neural network: higher-dimensional layers

    [Figure: input → mapping → activation → output, with vector-valued hidden units]

  • Neural network: higher-dimensional + more layers

    [Figure: input → mapping → activation → mapping → activation → output]

  • Neural network: higher-dimensional + more layers + more layers

    [Figure: input → mapping → activation → mapping → activation → mapping → activation → output]

  • Training

    • Given a training example (x, label), compute the network forward from x

    • Compute the loss, e.g. the hinge loss: max(0, 1 - label * y)

    [Figure: x → h1 = W1x + b1 → h2 = f(h1) → h3 = W2h2 + b2 → h4 = f(h3) → y = Woh4 + bo → loss; label = 1 for sunny, -1 for cloudy]
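    A minimal numpy sketch of the forward pass and the hinge loss for the network in the figure; the layer sizes, the tanh activation, and the random parameter values are illustrative assumptions.

```python
import numpy as np

def forward(x, params, f=np.tanh):
    """Forward pass of the network on the slide: two hidden layers, scalar output y."""
    W1, b1, W2, b2, Wo, bo = params
    h1 = W1 @ x + b1      # mapping
    h2 = f(h1)            # activation
    h3 = W2 @ h2 + b2     # mapping
    h4 = f(h3)            # activation
    return (Wo @ h4 + bo)[0]

def hinge_loss(y, label):
    return max(0.0, 1.0 - label * y)

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = (rng.normal(size=(d_h, d_in)), np.zeros(d_h),
          rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(1, d_h)), np.zeros(1))

x, label = rng.normal(size=d_in), 1   # label: +1 = sunny, -1 = cloudy
y = forward(x, params)
print("y =", y, " hinge loss =", hinge_loss(y, label))
```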

  • Optimization

    • Example: given the t-th training example, update the parameters θ = {W1, b1, W2, b2, Wo, bo} with SGD:

    θ^(t+1) ← θ^(t) − η ∂loss/∂θ

    [Figure: the same network; only the loss at the top layer is known directly]

  • Backpropagation

    [Figure: the same network x → h1 → h2 → h3 → h4 → y → loss]

    δ_y = ∂loss/∂y = −label            (when the hinge loss is active, i.e. 1 − label·y > 0)

    δ_4 = (∂y/∂h4) δ_y = Wo⊤ δ_y

    δ_3 = (∂h4/∂h3) δ_4 = f′(h3) δ_4

    δ_2 = (∂h3/∂h2) δ_3 = W2⊤ δ_3

    δ_1 = (∂h2/∂h1) δ_2 = f′(h1) δ_2

    • The loss is computed at the top layer and propagated down to the lower layers

  • Gradients

    [Figure: the same network, with δ_y, δ_4, δ_3, δ_2, δ_1 attached to the layers]

    • From each layer's δ, compute that layer's parameter gradients:

    ∂loss/∂Wo = δ_y h4⊤        ∂loss/∂bo = δ_y

    ∂loss/∂W2 = δ_3 h2⊤        ∂loss/∂b2 = δ_3

    ∂loss/∂W1 = δ_1 x⊤         ∂loss/∂b1 = δ_1

  • Update

    [Figure: the same network]

    • Update each parameter:

    Wo^(t+1) ← Wo^(t) − η ∂loss/∂Wo        bo^(t+1) ← bo^(t) − η ∂loss/∂bo

    W2^(t+1) ← W2^(t) − η ∂loss/∂W2        b2^(t+1) ← b2^(t) − η ∂loss/∂b2

    W1^(t+1) ← W1^(t) − η ∂loss/∂W1        b1^(t+1) ← b1^(t) − η ∂loss/∂b1
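    Putting the last three slides together, here is a minimal numpy sketch of one forward/backward/update step for this network, assuming tanh as the activation f and the hinge loss from the training slide; the toy dimensions and random initialization are illustrative.

```python
import numpy as np

def f(z):  return np.tanh(z)
def fp(z): return 1.0 - np.tanh(z) ** 2   # f'(z)

def sgd_step(x, label, params, eta=0.1):
    """One forward/backward/update step with the hinge loss max(0, 1 - label * y)."""
    W1, b1, W2, b2, Wo, bo = params

    # Forward pass
    h1 = W1 @ x + b1
    h2 = f(h1)
    h3 = W2 @ h2 + b2
    h4 = f(h3)
    y = (Wo @ h4 + bo)[0]
    loss = max(0.0, 1.0 - label * y)
    if loss == 0.0:                       # hinge loss inactive: all gradients are zero
        return params, loss

    # Backward pass: propagate the deltas from the top layer down
    d_y = np.array([-label])              # d loss / d y
    d_4 = Wo.T @ d_y                      # d loss / d h4
    d_3 = fp(h3) * d_4                    # d loss / d h3
    d_2 = W2.T @ d_3                      # d loss / d h2
    d_1 = fp(h1) * d_2                    # d loss / d h1

    # Per-layer gradients and SGD updates
    Wo = Wo - eta * np.outer(d_y, h4);  bo = bo - eta * d_y
    W2 = W2 - eta * np.outer(d_3, h2);  b2 = b2 - eta * d_3
    W1 = W1 - eta * np.outer(d_1, x);   b1 = b1 - eta * d_1
    return (W1, b1, W2, b2, Wo, bo), loss

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = (rng.normal(size=(d_h, d_in)), np.zeros(d_h),
          rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(1, d_h)), np.zeros(1))
x, label = rng.normal(size=d_in), 1
for _ in range(20):
    params, loss = sgd_step(x, label, params)
print("final loss:", loss)
```

    Repeating sgd_step over many (x, label) pairs is exactly the SGD loop of the optimization slide.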

  • • Instead of plain SGD: AdaGrad, AdaDelta, Adam

    • Prone to overfitting: dropout, regularization, pre-training

    • Training is slow: online learning, mini-batches, GPUs
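    As a flavor of these adaptive alternatives to SGD, here is a small sketch of an AdaGrad-style update; the function name and signature are my own, only the update rule itself follows AdaGrad.

```python
import numpy as np

def adagrad_update(theta, grad, cache, eta=0.1, eps=1e-8):
    """AdaGrad: scale the step for each parameter by its accumulated squared gradients,
    so frequently updated parameters get smaller effective learning rates."""
    cache = cache + grad ** 2
    theta = theta - eta * grad / (np.sqrt(cache) + eps)
    return theta, cache
```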

  • Feed-Forward NN

    • What is usually meant by a "neural network"

    • Note: the activation functions are omitted from the figures for brevity

  • Convolution

    • Connect the layers only locally (connections with the same orientation share parameters) + reduce the number of dimensions (pooling)
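    A minimal 1-D sketch of these two ideas; the filter values and input are arbitrary.

```python
import numpy as np

def conv1d(x, w, b):
    """Locally connected layer: the same filter w (shared parameters) slides over x."""
    n = len(x) - len(w) + 1
    return np.array([w @ x[i:i + len(w)] + b for i in range(n)])

def max_pool(h, size=2):
    """Pooling: reduce dimensionality by keeping the max of each window."""
    return np.array([h[i:i + size].max() for i in range(0, len(h) - size + 1, size)])

x = np.array([0.2, 0.3, -0.6, -0.3, 0.7, 1.5])
h = max_pool(conv1d(x, w=np.array([0.5, -0.5, 1.0]), b=0.0))
print(h)
```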

  • Autoencoder (AE)

    • Trained so that the input can be reconstructed from the hidden layer

    • Makes unsupervised learning possible

    [Figure: input = output]
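    A minimal reconstruction-loss sketch; the dimensions, the tanh encoder, and the squared-error objective are illustrative assumptions (tying the decoder weights to the transpose of the encoder weights is also common).

```python
import numpy as np

def autoencoder_loss(x, W, b, W_out, b_out, f=np.tanh):
    """Encode x into a smaller hidden layer, decode it back, and measure how well
    the input is reconstructed (squared error); the target is the input itself."""
    h = f(W @ x + b)              # hidden representation
    x_hat = W_out @ h + b_out     # reconstruction
    return 0.5 * np.sum((x_hat - x) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=6)
W, b = rng.normal(size=(3, 6)) * 0.1, np.zeros(3)          # 6 -> 3 encoder
W_out, b_out = rng.normal(size=(6, 3)) * 0.1, np.zeros(6)  # 3 -> 6 decoder
print(autoencoder_loss(x, W, b, W_out, b_out))
```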

  • Recurrent NN

    • Models sequences

    • Note: finer-grained connections are ignored for brevity

    [Figure: the layer at time t is computed from the input at time t and the layer at time t-1, unrolled over t-1, t, t+1]

    (Elman, 1990)
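    A minimal Elman-style recurrence in numpy; the dimensions and random parameters are illustrative.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b, f=np.tanh):
    """Recurrent step: the layer at time t depends on the input at time t and on the
    layer at time t-1; the same parameters are shared across all time steps."""
    return f(Wx @ x_t + Wh @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
Wx = rng.normal(size=(d_h, d_in)) * 0.1
Wh = rng.normal(size=(d_h, d_h)) * 0.1
b = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # a sequence of 5 input vectors
    h = rnn_step(x_t, h, Wx, Wh, b)
print(h)                                 # the final hidden state summarizes the sequence
```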

  • Recursive NN

    • Models arbitrary structures (e.g. tree structures)

    (Pollack, 1990)

  • Applications to machine translation

    • Language models

    • Translation models

    • Word alignment

  • Language models

  • FFNN language model

    • Output layer with a softmax

    [Figure: the context words "this", "is", "a" are mapped to distributed representations (small real-valued vectors), fed through the network, and scored against every vocabulary word: the, ",", and, pen, while, town, developed, scholars, ...]

    Pr(pen | this is a) = exp(y_pen) / Σ_{w∈V} exp(y_w)

    The Σ has to be computed over all words in the vocabulary.

    (Schwenk, 2007)
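    A minimal numpy sketch of this probability computation; the toy vocabulary, embedding size, and random parameters are illustrative assumptions.

```python
import numpy as np

def ffnn_lm_prob(context_vecs, W_h, b_h, W_out, b_out, vocab, word, f=np.tanh):
    """Next-word probability under a feed-forward LM: concatenate the distributed
    representations of the context words, apply one hidden layer, then a softmax
    over the whole vocabulary."""
    x = np.concatenate(context_vecs)      # e.g. the vectors for "this", "is", "a"
    h = f(W_h @ x + b_h)
    y = W_out @ h + b_out                 # one score per vocabulary word
    y = y - y.max()                       # numerical stability
    p = np.exp(y) / np.exp(y).sum()       # softmax: the sum runs over all of V
    return p[vocab.index(word)]

rng = np.random.default_rng(0)
vocab = ["the", ",", "and", "pen", "while", "town", "developed", "scholars"]
emb = {w: rng.normal(size=3) for w in ["this", "is", "a"]}
W_h, b_h = rng.normal(size=(5, 9)) * 0.1, np.zeros(5)
W_out, b_out = rng.normal(size=(len(vocab), 5)) * 0.1, np.zeros(len(vocab))

print(ffnn_lm_prob([emb["this"], emb["is"], emb["a"]], W_h, b_h, W_out, b_out, vocab, "pen"))
```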

  • Recurrent NN language model

    • The hidden layer represents the context of the whole sentence

    • Can be used (approximately) as a feature in the decoder

    [Figure: the words "this", "is", "a" are read one at a time; the recurrent hidden state after "a" is used to predict "pen"]

    (Mikolov et al., 2010)

  • Translation models

  • Phrase-level FFNN

    • Trained in the same way as the FFNN language model

    • Problem: the phrases are still extracted by the conventional method, so the improvement is small

    [Figure: the source phrase 機械 翻訳 について and the target phrase "about machine translation" are scored by a feed-forward network]

    (Schwenk, 2012)

  • Phrase-level Recursive AE

    • For each language independently, find the tree structure that minimizes the recursive autoencoder error

    [Figure: 機械 翻訳 について and "about machine translation" are each composed bottom-up into a phrase vector; the two vectors are trained to agree and are used as features]

    (Zhang et al., 2014)

  • Word-level FFNN

    • Computing over word alignments removes the phrase-level restriction

    [Figure: 機械 / 翻訳 / について aligned with "about machine translation"]

    Pr(translation | about machine, 機械 翻訳 について)

    (Devlin et al., 2014)

  • A model based on addNN

    (Liu et al., 2013)

    h(X について Y したい, I want to Y about X) =
        [ log Pr(X について Y したい, I want to Y about X),
          log Pr(I want to Y about X, X について Y したい),
          ... ]

    [Figure: the words について, したい, I want to, about are mapped to small real-valued vectors and combined through a sigmoid layer]

    • Trained with a BLEU-based loss function

  • RNN over phrase pairs

    • The phrase-level model is unrolled to the word level

    [Figure: the phrase pairs (機械翻訳, machine translation), (について, about), (勉強, to study), (したい, I want), (。, .) form a sequence; unrolling to the word level inserts NULL for unaligned words]

    (Wu et al., 2014)

  • Convolution + RNN

    [Figure: a convolution over the source sentence 機械 翻訳 について 勉強 したい 。 produces a distributed representation of the source; a recurrent network generates "I want to study ..." conditioned on it]

    (Kalchbrenner and Blunsom, 2013)

  • Sequence model

    [Figure: an encoder RNN reads the source 機械翻訳 について 勉強 したい 。 (in reverse order) and a decoder RNN generates "I want to study ..."]

    (Sutskever et al., 2014)
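    A very small numpy sketch of the encoder-decoder idea, using vanilla RNNs and greedy decoding; the vocabulary, dimensions, and random parameters are illustrative assumptions, so an untrained model produces arbitrary output.

```python
import numpy as np

def encode(src_vecs, Wx, Wh, b, f=np.tanh):
    """Encoder: an RNN reads the source vectors; the final hidden state
    summarizes the whole source sentence."""
    h = np.zeros(Wh.shape[0])
    for x in src_vecs:
        h = f(Wx @ x + Wh @ h + b)
    return h

def decode(h, emb, Wy, Wh, b, W_out, max_len=10, f=np.tanh):
    """Decoder: another RNN, initialized with the encoder state, emits target
    words one by one (greedy arg max over a softmax-style score vector)."""
    vocab = list(emb.keys())
    y_prev, out = emb["<s>"], []
    for _ in range(max_len):
        h = f(Wy @ y_prev + Wh @ h + b)
        w = vocab[int(np.argmax(W_out @ h))]
        if w == "</s>":
            break
        out.append(w)
        y_prev = emb[w]
    return out

rng = np.random.default_rng(0)
d = 4
emb = {w: rng.normal(size=d) for w in ["<s>", "</s>", "I", "want", "to", "study"]}
params_enc = (rng.normal(size=(d, d)) * 0.3, rng.normal(size=(d, d)) * 0.3, np.zeros(d))
params_dec = (rng.normal(size=(d, d)) * 0.3, rng.normal(size=(d, d)) * 0.3, np.zeros(d),
              rng.normal(size=(len(emb), d)) * 0.3)

src = [rng.normal(size=d) for _ in range(5)]   # vectors for the (reversed) source words
h = encode(src, *params_enc)
print(decode(h, emb, *params_dec))             # untrained, so the output is arbitrary
```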

  • Word alignment

  • Word alignment: FFNN

    • Trained from data annotated with word alignments

    [Figure: the target words NULL, I, want, to, study, ., about, machine, translation are linked to the source words in 機械翻訳 について 勉強 したい 。]

    • Uses the surrounding context to answer "yes or no" for each candidate link

    (Yang et al., 2013)

  • Word alignment: RNN

    [Figure: the same sentence pair 機械翻訳 について 勉強 したい 。 / NULL I want to study . about machine translation]

    (Tamura et al., 2014)

    • Represents the entire history of word alignments + unsupervised learning with noise contrastive estimation

  • Summary

    • Applications of neural networks to machine translation

    • Language models, translation models, word alignment

    • The field has only just started, yet there are already many studies

    • Impossible to list them all: many are quite similar to one another

    • Much to look forward to

  • References

    • G. Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.

    • Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380, Baltimore, Maryland, June. Association for Computational Linguistics.

    • Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

    • Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. 2013. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 1319–1327.

    • Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA, October. Association for Computational Linguistics.

    • Lemao Liu, Taro Watanabe, Eiichiro Sumita, and Tiejun Zhao. 2013. Additive neural networks for statistical machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 791–801, Sofia, Bulgaria, August. Association for Computational Linguistics.

    • Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048.

    • Jordan B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46(1-2):77–105, November.

    • Holger Schwenk. 2007. Continuous space language models. Computer Speech and Language, 21(3):492–518, July.

    • Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Proceedings of COLING 2012: Posters, pages 1071–1080, Mumbai, India, December. The COLING 2012 Organizing Committee.

    • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS, Montreal, Canada.

    • Akihiro Tamura, Taro Watanabe, and Eiichiro Sumita. 2014. Recurrent neural networks for word alignment model. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1470–1480, Baltimore, Maryland, June. Association for Computational Linguistics.

    • Youzheng Wu, Taro Watanabe, and Chiori Hori. 2014. Recurrent neural network-based tuple sequence model for machine translation. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1908–1917, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.

    • Nan Yang, Shujie Liu, Mu Li, Ming Zhou, and Nenghai Yu. 2013. Word alignment modeling with context dependent deep neural network. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 166–175, Sofia, Bulgaria, August. Association for Computational Linguistics.

    • Jiajun Zhang, Shujie Liu, Mu Li, Ming Zhou, and Chengqing Zong. 2014. Bilingually-constrained phrase embeddings for machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 111–121, Baltimore, Maryland, June. Association for Computational Linguistics.