A Report of IJCNLP 2011 #TokyoNLP
DESCRIPTION
TokyoNLP is a meetup about natural language processing held in Tokyo. This slide deck was presented as the 5th talk at the 8th event.

TRANSCRIPT
![Page 1: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/1.jpg)
A Report of IJCNLP 2011 @nokuno
#tokyonlp
![Page 2: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/2.jpg)
About the presenter
• Name: Yoh Okuno
• Software Engineer at Yahoo! Japan
• Interest: NLP, Machine Learning, Data Mining
• Skill: C/C++, Python, Hadoop, etc.
• Website: http://www.yoh.okuno.name/
![Page 3: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/3.jpg)
Recent nokuno (1)
![Page 4: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/4.jpg)
Recent nokuno (2)
![Page 5: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/5.jpg)
Recent nokuno (3)
#emnlpreading, December 23, 2011
at Cybozu Labs
![Page 6: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/6.jpg)
Today’s Topic

• Japanese Pronunciation Prediction as Statistical Machine Translation
• Integrating Models Derived from Non-Parametric Bayesian Co-segmentation into a Statistical Machine Transliteration System
• Discriminative Phrase-based Lexicalized Reordering Models using Weighted Reordering Graphs
![Page 7: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/7.jpg)
Japanese Pronunciation Prediction as Statistical Machine Translation
Jun Hatori and Hisami Suzuki
University of Tokyo, Microsoft Research
IJCNLP 2011
![Page 8: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/8.jpg)
Motivation

• Japanese words and sentences have multiple pronunciations
• The proposed method predicts the pronunciations of out-of-vocabulary (OOV) words [Hatori+ 11] and known words in a sentence simultaneously
• Uses a statistical machine translation (SMT) framework at the word and character level
![Page 9: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/9.jpg)
An Example
• Input: 東京都美術館の狩野探幽展に行った
• Output: とうきょうとびじゅつかんのかのうたんゆうてんにいった
• Training corpus: a Japanese dictionary and a corpus annotated with pronunciations
![Page 10: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/10.jpg)
Discriminative Model

• Similar to phrase-based SMT with monotone alignment, no insertion, and no deletion
• Uses averaged perceptron training

λ: parameters, f: features
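The averaged perceptron training mentioned above can be sketched as follows. This is a generic sketch, not the authors' implementation: the feature map, candidate generator, and all names here are illustrative.

```python
# Minimal averaged-perceptron sketch: decode with the current weights λ,
# update toward the gold candidate, and return the averaged weights.

def averaged_perceptron(examples, feature_fn, candidates_fn, epochs=5):
    """examples: list of (x, gold_y); candidates_fn(x) -> list of y."""
    weights = {}   # λ: feature name -> weight
    totals = {}    # running sum of weights, for averaging
    steps = 0
    for _ in range(epochs):
        for x, gold in examples:
            # Decode: pick the candidate with the highest score λ·f(x, y).
            pred = max(candidates_fn(x),
                       key=lambda y: sum(weights.get(k, 0.0) * v
                                         for k, v in feature_fn(x, y).items()))
            if pred != gold:
                # Perceptron update: promote gold features, demote predicted ones.
                for k, v in feature_fn(x, gold).items():
                    weights[k] = weights.get(k, 0.0) + v
                for k, v in feature_fn(x, pred).items():
                    weights[k] = weights.get(k, 0.0) - v
            for k, w in weights.items():
                totals[k] = totals.get(k, 0.0) + w
            steps += 1
    return {k: t / steps for k, t in totals.items()}
```

With a toy indicator feature over (source, target) pairs, a few epochs are enough to make the model prefer the gold pronunciation of each training word.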
![Page 11: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/11.jpg)
Features

• Bidirectional translation probability
• Target character n-gram model
• Target character length
• Joint n-gram model
  – Probability of (source, target) pairs
![Page 12: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/12.jpg)
Translation Process
![Page 13: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/13.jpg)
Training

• Produces the translation table and language model
![Page 14: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/14.jpg)
Experimental Result

• The dictionary-based approach outperformed the substring-based approach [Hatori+ 11]
![Page 15: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/15.jpg)
References

• [Hatori+ 11] Predicting Word Pronunciation in Japanese
• MeCab: [Kudo+ 04] Applying Conditional Random Fields to Japanese Morphological Analysis
• KyTea: [Neubig+ 10] Word-based Partial Annotation for Efficient Corpus Construction
• [Suzuki+ 05] Microsoft Research IME Corpus
• [Maekawa+ 08] Compilation of the KOTONOHA-BCCWJ Corpus (in Japanese)
![Page 16: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/16.jpg)
Integrating Models Derived from Non-Parametric Bayesian Co-segmentation into a Statistical Machine Transliteration System
Andrew Finch and Eiichiro Sumita (NICT)
NEWS 2011
![Page 17: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/17.jpg)
Transliteration Task

• Transliteration is defined as the phonetic translation of names across languages [Zhang+ 11]
![Page 18: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/18.jpg)
Nonparametric Co-segmentation

• Extends monolingual word segmentation [Mochihashi+ 09] [Goldwater+ 06]
• Uses a unigram Dirichlet process model as the language model and a Poisson distribution as the base measure (no character-level LM)
• Simple Gibbs sampling with forward-backward [Finch+ 10]
![Page 19: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/19.jpg)
Joint Source-Channel Model

• Models the parallel corpus as bilingual sequence-pairs
• Bilingual sequence-pairs do not cross word boundaries

[Finch+ 10]
s: sources, t: targets, w: words, γ: bilingual segmentation
![Page 20: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/20.jpg)
Unigram Dirichlet Process Model

• Bilingual sequence-pairs are generated from a unigram Dirichlet process
• Uses the Chinese restaurant process representation
• Each bilingual sequence-pair is generated as either:
  1. An existing type, with probability proportional to its count
  2. A new type, with probability proportional to a constant (α = 0.3 in this case)

[Finch+ 10]
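The two-way draw above (existing type by count, new type by α) is the standard Chinese restaurant process. A hedged sketch, where `base_draw` stands in for a sampler from the base measure (the double-Poisson model on the next slide):

```python
import random

# Chinese restaurant process draw: reuse an existing pair type with
# probability proportional to its count, or open a "new table" (draw a
# fresh pair from the base measure) with probability proportional to α.

def crp_draw(counts, alpha, base_draw, rng=random):
    """counts: dict mapping pair type -> number of previous draws."""
    n = sum(counts.values())
    r = rng.random() * (n + alpha)
    for pair, c in counts.items():
        r -= c
        if r < 0:
            return pair      # reuse an existing type
    return base_draw()       # new type from the base measure
```

So a pair seen 9 times out of 10 draws is returned far more often than a novel pair when α is small, which is what concentrates probability mass on frequently resampled sequence-pairs.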
![Page 21: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/21.jpg)
The Base Measure

• Double Poisson distribution over bilingual sequence-pairs
• Characters are generated uniformly

[Finch+ 10]
v: vocabulary size, λ: parameter (= 2 in this case)
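The base measure described above can be written out directly: Poisson-distributed lengths on each side, then characters drawn uniformly from each vocabulary. The function names are illustrative:

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of length k with rate λ."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def base_measure(src, tgt, v_src, v_tgt, lam=2.0):
    """P0(s, t): side lengths ~ Poisson(λ), characters uniform over each
    vocabulary of size v_src / v_tgt (double Poisson, per the slide)."""
    return (poisson_pmf(len(src), lam) * poisson_pmf(len(tgt), lam)
            * (1.0 / v_src) ** len(src) * (1.0 / v_tgt) ** len(tgt))
```

With λ = 2 this prior favors short sequence-pairs: long pairs are penalized both by the Poisson length terms and by the per-character uniform factors.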
![Page 22: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/22.jpg)
The Generative Model

• Generation from the history of bilingual sequence-pairs

−k: “up to but not including k”; α = 0.3: probability mass for a new bilingual sequence-pair
[Finch+ 10]
![Page 23: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/23.jpg)
The Generative Process [Finch+ 10]

• Generate a new pair?
  – No: sample a pair from the multinomial distribution over existing types
  – Yes: sample the length of each side of the pair, then sample each character uniformly
![Page 24: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/24.jpg)
Gibbs Sampling

• Uses the blocked version of Forward-Filtering Backward-Sampling (FFBS) [Mochihashi+ 09]

[Finch+ 10]
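To make FFBS concrete, here is a sketch of the monolingual case (the bilingual version runs over co-segmentation lattices instead of one string). `word_prob` stands in for the Dirichlet-process predictive probability; everything here is a simplified illustration:

```python
import random

def ffbs_segment(s, word_prob, max_len=4, rng=random):
    """Sample a segmentation of s from P(seg) ∝ ∏ word_prob(w)."""
    n = len(s)
    # Forward filtering: alpha[t] = total probability of all
    # segmentations of the prefix s[:t].
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for t in range(1, n + 1):
        for j in range(1, min(max_len, t) + 1):
            alpha[t] += word_prob(s[t - j:t]) * alpha[t - j]
    # Backward sampling: draw the last word length, then move left.
    words, t = [], n
    while t > 0:
        probs = [word_prob(s[t - j:t]) * alpha[t - j]
                 for j in range(1, min(max_len, t) + 1)]
        r = rng.random() * sum(probs)
        for j, p in enumerate(probs, start=1):
            r -= p
            if r < 0:
                break
        words.append(s[t - j:t])
        t -= j
    return list(reversed(words))
```

Sampling whole segmentations in one block like this (rather than flipping one boundary at a time) is what makes the blocked Gibbs sampler mix quickly.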
![Page 25: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/25.jpg)
Graph of all co-segmentations of (abba, アッバ)
![Page 26: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/26.jpg)
Experimental Result

• Outperformed the m2m baseline on all language pairs
![Page 27: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/27.jpg)
Translation Table Example
![Page 28: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/28.jpg)
References

• [Zhang+ 11] Whitepaper of NEWS 2011 Shared Task on Machine Transliteration
• [Finch+ 10] A Bayesian Model of Bilingual Segmentation for Transliteration
• [Mochihashi+ 09] Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling
• [Goldwater+ 06] Contextual Dependencies in Unsupervised Word Segmentation
![Page 29: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/29.jpg)
Discriminative Phrase-based Lexicalized Reordering Models using Weighted Reordering Graphs

Wang Ling, João Graça, David Martins de Matos, Isabel Trancoso and Alan Black
Carnegie Mellon University
IJCNLP 2011
![Page 30: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/30.jpg)
Reordering in Phrase-based SMT

• The reordering model plays an important role for language pairs like Japanese-English

P(e|f) = P(e) ∏_{i=1}^{I} P(f_i|e_i) P(p_i, o_i)

(P(e): language model, P(f_i|e_i): translation model, P(p_i, o_i): reordering model)

[Koehn+ 03]
![Page 31: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/31.jpg)
History of Reordering Models

• Distance-based reordering model [Koehn+ 03]
• Word-based lexicalized reordering [Koehn+ 05]
• Phrase-based lexicalized reordering [Tillmann+ 04]
• Weighted word-based lexicalized reordering [Ling+ 11]
  – Weighted alignment matrices [Liu+ 09]
  – Reordering graph representation [Su+ 10]
• This paper proposes weighted phrase-based lexicalized reordering
![Page 32: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/32.jpg)
Three Types of “Orientation”

• Orientations are categorized into three types:
  – monotone (m)
  – swap (s)
  – discontinuous (d)

[Koehn+ 05]
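The three orientation classes can be illustrated with a small helper that compares the source-side span of the current phrase against the previous one. This is a simplified sketch over phrase spans; the hypothetical function and span convention are not from the paper:

```python
def orientation(prev_src_span, cur_src_span):
    """Classify the reordering orientation of the current phrase with
    respect to the previous one, given inclusive (start, end) spans on
    the source side."""
    if cur_src_span[0] == prev_src_span[1] + 1:
        return "m"   # monotone: current phrase directly follows the previous
    if cur_src_span[1] == prev_src_span[0] - 1:
        return "s"   # swap: current phrase directly precedes the previous
    return "d"       # discontinuous: anything else
```

For Japanese-English, where long-range movement is common, many phrase pairs fall into the discontinuous class, which is why modeling these orientations well matters for such language pairs.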
![Page 33: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/33.jpg)
Word-based Reordering

• Currently the most popular reordering model
• Extends counts to weighted sums of probabilities

[Koehn+ 05]
![Page 34: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/34.jpg)
Weighted alignment matrices [Liu+ 09]
![Page 35: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/35.jpg)
Weighted Reordering Graph [Su+ 10]
![Page 36: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/36.jpg)
Forward-Backward Algorithm

• Used to calculate the reordering probability P(p, o)
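The forward-backward computation over a weighted reordering graph can be sketched as edge-posterior computation on a DAG: a forward pass accumulates path mass into each node, a backward pass accumulates mass out, and each edge's posterior is forward × weight × backward over the total. A minimal sketch, assuming integer node IDs with edges going from lower to higher:

```python
def edge_posteriors(n, edges):
    """Forward-backward over a DAG with nodes 0..n.
    edges: dict (u, v) -> weight with u < v; returns the posterior
    probability that a path sampled by weight passes through each edge."""
    alpha = [0.0] * (n + 1)   # forward: total path weight from node 0
    beta = [0.0] * (n + 1)    # backward: total path weight to node n
    alpha[0] = 1.0
    beta[n] = 1.0
    for (u, v), w in sorted(edges.items()):            # increasing u
        alpha[v] += alpha[u] * w
    for (u, v), w in sorted(edges.items(), reverse=True):  # decreasing u
        beta[u] += w * beta[v]
    z = alpha[n]  # total weight of all paths (partition function)
    return {e: alpha[e[0]] * w * beta[e[1]] / z for e, w in edges.items()}
```

In the paper's setting, each edge carries a phrase pair and an orientation, so these posteriors give the fractional counts that replace the hard counts of standard lexicalized reordering estimation.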
![Page 37: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/37.jpg)
Choosing the Weight Matrix

• Weighted alignment matrix
• Distance-based edge weights

(from [Liu+ 09])
![Page 38: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/38.jpg)
Experimental Result
![Page 39: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/39.jpg)
References

• [Koehn+ 03] Statistical Phrase-Based Translation
• [Koehn+ 05] Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation
• [Liu+ 09] Weighted Alignment Matrices for Statistical Machine Translation
• [Su+ 10] Learning Lexicalized Reordering Models from Reordering Graphs
• [Ling+ 11] Reordering Modeling Using Weighted Alignment Matrices
![Page 40: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/40.jpg)
Phrase Extraction for Japanese Predictive Input Method as Post-Processing
Yoh Okuno
Yahoo Japan Corporation
IJCNLP 2011
![Page 41: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/41.jpg)
Call for Papers: TokyoNLP #9
EMNLP 2011 Reading
![Page 42: A Report of IJCNLP 2011 #TokyoNLP](https://reader034.vdocuments.site/reader034/viewer/2022052620/55794df2d8b42a31678b5287/html5/thumbnails/42.jpg)
Any Questions?