Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus
Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi. University of Tokyo. 2005/10/13 IJCNLP2005.
Language & Knowledge Engineering Lab
Automatic Acquisition of
Basic Katakana Lexicon
from a Given Corpus
Toshiaki Nakazawa, Daisuke Kawahara
Sadao Kurohashi
University of Tokyo
2005/10/13 IJCNLP2005
Table of Contents
• Japanese Character Set
• About Word Segmentation
• Proposed Method
– Method using a Japanese-English Dictionary
– Method using a Huge Corpus and a Dictionary
– Method using Relation in WOD
• Evaluation and Discussion
• Conclusion
Japanese Character Set
• Kanji (ideogram): 6,000
– Nouns: 東京 (Tokyo), 大学 (university), 情報 (information) …
– Stems of verbs/adjectives: 書く (write), 美しい (beautiful) …
• Hiragana (phonogram): 83
– Function words: が, を, れる, られる …
– Endings of verbs/adjectives: 書く (write), 美しい (beautiful) …
• Katakana (phonogram): 86
– Loan words: コンピュータ (computer), ドイツ (Germany) …
Katakana Set (86 Characters)
Columns in each row: W, R, Y, M, H, B/P, N, T/D, S/Z, K, vowel
a: ワ ヮ ラ ヤ ャ マ ハ バ パ ナ タ ダ サ ザ カ ヵ ア ァ
i: リ ヰ ミ ヒ ビ ピ ニ チ ヂ シ ジ キ イ ィ
u: ヲ ル ユ ュ ム フ ブ プ ヌ ツ ヅ ッ ス ズ ク ウ ゥ ヴ
e: レ ヱ メ ヘ ベ ペ ネ テ デ セ ゼ ケ ヶ エ ェ
o: ン ロ ヨ ョ モ ホ ボ ポ ノ ト ド ソ ゾ コ オ ォ
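The count of 86 agrees with the Unicode code points U+30A1 (ァ) through U+30F6 (ヶ). A minimal Python sketch of a Katakana test built on that range follows. Treating the prolonged sound mark ー (U+30FC) as part of a Katakana word is an assumption made here, since the slide does not list it, and `is_katakana_word` is a hypothetical helper name.

```python
# The 86 Katakana letters occupy U+30A1 (ァ) .. U+30F6 (ヶ).
KATAKANA = [chr(cp) for cp in range(0x30A1, 0x30F6 + 1)]

# Assumption: the prolonged sound mark ー (U+30FC) may appear inside a word,
# as in ソース, although it is not one of the 86 letters.
PROLONGED_SOUND_MARK = "\u30FC"


def is_katakana_word(text: str) -> bool:
    """Return True if text is non-empty and uses only Katakana letters or ー."""
    allowed = set(KATAKANA) | {PROLONGED_SOUND_MARK}
    return bool(text) and all(ch in allowed for ch in text)


print(len(KATAKANA))
print(is_katakana_word("トマトソース"), is_katakana_word("大学"))
```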
Word Segmentation
• Kanji and Hiragana
Ex. 彼は大学に通う (kare-wa-daigaku-ni-kayo-u): He pp Univ. pp goes
• Katakana
Ex. エクストラバージンオリーブオイル: extra virgin olive oil
Ex. ジャパンカップサイクルロードレース: Japan cup cycle road race
Why Is Katakana Word Segmentation Necessary?
Kinds of “sauce” (something to put on some dishes):
○ トマトソース (tomato so-su): tomato sauce
○ ホワイトソース (howaito so-su): white sauce
× リソース (riso-su): resource
Similar Problem in German
Lebensversicherungsgesellschaftsangestellter
“life insurance company employee”
Donaudampfschleppschiffahrtgesellschaftskapitän
“Captain of Danube steam tow company”
Word Segmentation So Far
• Many studies on word segmentation
• No study aimed specifically at Katakana words
• Treatment of Katakana words in word segmentation so far:
– Use a dictionary with some manually registered Katakana words
– Treat a whole continuous Katakana string as one word when it is unknown
Problem Setting
Corpus: ・・・あとは粉を付けてバターで焼いたムニエルや、白ワインで蒸し直したり、パン粉をまぶしてフライにしたり、ホワイトソースやトマトソースをかけたグラタンにもなります。 ・・・
[Figure: Corpus → Word-Occurrence Data (WOD), a frequency list of Katakana strings (ラーメンスープ, トマトソース, トマトスープ, ・・・) → Basic Vocabulary, with Japanese-English Translation Information as an additional resource]
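The step from the corpus to the WOD can be sketched in Python as counting maximal Katakana strings. This is only an illustration under stated assumptions: the character class (U+30A1 to U+30F6 plus ー) and the name `build_wod` are choices made here, not details given on the slide.

```python
import re
from collections import Counter

# Assumption: a Katakana string is a maximal run of U+30A1..U+30F6 plus ー (U+30FC).
KATAKANA_RUN = re.compile(r"[\u30A1-\u30F6\u30FC]+")


def build_wod(sentences):
    """Count how often each maximal Katakana string occurs in the corpus."""
    wod = Counter()
    for sentence in sentences:
        wod.update(KATAKANA_RUN.findall(sentence))
    return wod


# The corpus sentence shown on the slide.
corpus = [
    "あとは粉を付けてバターで焼いたムニエルや、白ワインで蒸し直したり、"
    "パン粉をまぶしてフライにしたり、ホワイトソースやトマトソースをかけた"
    "グラタンにもなります。"
]
wod = build_wod(corpus)
for word, freq in wod.most_common():
    print(freq, word)
```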
Overview of the Method
[Figure: Corpus → WOD → Basic Vocabulary, using three information sources: Dictionary (highly reliable), English Corpus, and WOD Freq.]
Method using Dictionary
• Segmentation using a JE dictionary
Ex. トマトソース = “tomato sauce”
JE Dictionary: トマト = “tomato”; ソース = “sauce”, “source”
→ トマト + ソース = “tomato sauce”
• Translation is one word → single word
Ex. サンドウィッチ = “sandwich”
• Entries of a Japanese dictionary → single word
Ex. インゲン (= いんげん) (a kidney bean)
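The three dictionary rules on this slide can be sketched in Python as follows. The dictionaries are toy stand-ins holding only the slide's examples, and the function names and the exact matching rule (joining constituent translations with a space and comparing with the compound's translation) are assumptions made for illustration.

```python
from itertools import product

# Toy dictionaries built from the examples on this slide (not the real resources).
JE_DICT = {
    "トマトソース": ["tomato sauce"],
    "トマト": ["tomato"],
    "ソース": ["sauce", "source"],
    "サンドウィッチ": ["sandwich"],
}
JA_DICT = {"インゲン"}


def dictionary_splits(word, je):
    """Yield every split of word into two or more JE-dictionary entries."""
    def rec(rest):
        if not rest:
            yield []
            return
        for i in range(1, len(rest) + 1):
            if rest[:i] in je:
                for tail in rec(rest[i:]):
                    yield [rest[:i]] + tail
    return (parts for parts in rec(word) if len(parts) > 1)


def segment_by_dictionary(word, je=JE_DICT, ja=JA_DICT):
    """Return the constituents of word, or None if the dictionaries cannot decide."""
    translations = je.get(word, [])
    # A Japanese-dictionary entry or a one-word translation means a single word.
    if word in ja or any(" " not in t for t in translations):
        return [word]
    # Otherwise look for a split whose translations compose the word's translation.
    for parts in dictionary_splits(word, je):
        for combo in product(*(je[p] for p in parts)):
            if " ".join(combo) in translations:
                return parts
    return None


print(segment_by_dictionary("トマトソース"))
print(segment_by_dictionary("サンドウィッチ"))
```

Words that the dictionaries cannot decide return `None` here, which leaves them for the corpus-based methods that follow.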
Overview of the Method
[Figure: Corpus → WOD → Basic Vocabulary, using three information sources. Dictionary: highly reliable, low coverage. English Corpus: high precision, more coverage. WOD Freq.]
Method using a Huge English Corpus
• Enumerate all possible segmentations into Katakana words in the JE dictionary
• Translate each segmentation → possible English phrases
• Compare the number of phrasal hits from a Web search engine
Ex. パセリソース (parsley sauce)
(i) パセリ : ソース → “parsley source”: 554 hits; “parsley sauce”: 20,600 hits ◎
(ii) パセ : リソース → “pase resource”: 3 hits
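A small Python sketch of this selection step, using the hit counts from the example. The lookup table stands in for a Web search engine, the entry パセ = “pase” is assumed only because the slide shows that phrase, and `best_segmentation` is a hypothetical name.

```python
from itertools import product

# Toy JE entries and hit counts copied from the slide; a real system would
# obtain phrase counts from a Web search engine or a large English corpus.
JE_DICT = {
    "パセリ": ["parsley"],
    "ソース": ["sauce", "source"],
    "パセ": ["pase"],
    "リソース": ["resource"],
}
PHRASE_HITS = {"parsley source": 554, "parsley sauce": 20600, "pase resource": 3}


def candidate_phrases(parts, je):
    """All English phrases obtained by combining one translation per constituent."""
    return [" ".join(combo) for combo in product(*(je[p] for p in parts))]


def best_segmentation(segmentations, je=JE_DICT, hits=PHRASE_HITS):
    """Return (hits, parts, phrase) for the most frequent candidate phrase."""
    scored = [
        (hits.get(phrase, 0), parts, phrase)
        for parts in segmentations
        for phrase in candidate_phrases(parts, je)
    ]
    return max(scored, key=lambda item: item[0])


print(best_segmentation([["パセリ", "ソース"], ["パセ", "リソース"]]))
```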
Threshold for Hit Number
• Even an inappropriate segmentation and its incorrect translation get some hits on the Web
Ex. デミ : グラス → “demi glass”: 207 hits (demi-glace)
Ex. バン : バンジー → “van bungee”: 159 hits (Chinese food “ban-ban-ji”)
• The longer the Katakana word is, the more probable it is a compound
Threshold: C / N^L
L : the length of the Katakana word
C : 400,000, N : 2
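The threshold can be checked against the examples with a few lines of Python. Two points are assumptions here: the length L is counted in characters, and a segmentation is accepted when its hit count strictly exceeds the threshold.

```python
C = 400_000
N = 2


def hit_threshold(word: str) -> float:
    """Threshold C / N^L, with L taken as the number of characters (assumption)."""
    return C / N ** len(word)


def accept(word: str, hits: int) -> bool:
    """Assumption: a segmentation is accepted when hits exceed the threshold."""
    return hits > hit_threshold(word)


examples = [("パセリソース", 20600), ("デミグラス", 207), ("バンバンジー", 159)]
for word, hits in examples:
    print(word, len(word), hit_threshold(word), accept(word, hits))
```

Under these assumptions the threshold is 6,250 for a six-character word and 12,500 for a five-character word, so “parsley sauce” passes while “demi glass” and “van bungee” are rejected.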
Overview of the Method
[Figure: Corpus → WOD → Basic Vocabulary, using three information sources. Dictionary: highly reliable, low coverage. English Corpus: high precision, more coverage, but depends on the JE dictionary and on natural English compounds. WOD Freq.: high recall.]
Examples where the literal translation is not the natural English compound:
ハイビジョン (hai-bijyon): × “high vision” → 11,000 hits; ○ “high definition” → 5,450,000 hits
ペーパーテスト (pe-pa-tesuto): × “paper test” → 45,400 hits; ○ “written test” → 415,000 hits
Method using Relation in WOD
• Try to find compounds based only on the information in a WOD
• Geometric mean of freq. of possible constituent words ⇔ Freq. of the original word
WOD:
159 ガーリックトースト
32 ガー
9 リック
515 ガーリック
652 トースト
5 トー
60 スト
Ex. ガーリックトースト (garlic toast), Freq. 159
ガー : リック : トースト → (32 × 9 × 652)^(1/3) = 57
ガーリック : トースト → (515 × 652)^(1/2) = 579
ガー : リック : トー : スト → (32 × 9 × 5 × 60)^(1/4) = 17
ガーリック : トー : スト → (515 × 5 × 60)^(1/3) = 54
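The geometric means in the garlic toast example can be reproduced with a short Python sketch. The WOD below holds only the frequencies shown on the slide, and the helper names are hypothetical.

```python
from math import prod

# Frequencies copied from the WOD excerpt on the slide.
WOD = {
    "ガーリックトースト": 159,
    "ガー": 32,
    "リック": 9,
    "ガーリック": 515,
    "トースト": 652,
    "トー": 5,
    "スト": 60,
}


def wod_splits(word, wod):
    """Return every split of word into two or more strings found in the WOD."""
    def rec(rest):
        if not rest:
            yield []
            return
        for i in range(1, len(rest) + 1):
            if rest[:i] in wod:
                for tail in rec(rest[i:]):
                    yield [rest[:i]] + tail
    return [parts for parts in rec(word) if len(parts) > 1]


def geometric_mean(parts, wod):
    """Geometric mean of the WOD frequencies of the constituents."""
    freqs = [wod[p] for p in parts]
    return prod(freqs) ** (1 / len(freqs))


for parts in wod_splits("ガーリックトースト", WOD):
    print(" : ".join(parts), round(geometric_mean(parts, WOD)))
```

The split with the largest geometric mean is ガーリック : トースト, matching the value 579 on the slide.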
Threshold for Geometric Mean
Fo < Fg’ , Fg’ = Fg / (C / N^l + α)
Fo : Freq. of the original word
Fg : Geometric mean of freq. of constituents
Fg’ : Modified Geometric mean
l : Average length of constituents
C : 2,500
N : 4
α: 0.7
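A minimal Python sketch of the modified geometric mean, reading the divisor as C / N^l + α with l counted in characters (both are assumptions about the slide's notation). The function names are hypothetical.

```python
C = 2500
N = 4
ALPHA = 0.7


def penalty(avg_len: float) -> float:
    """Divisor C / N^l + alpha applied to the geometric mean."""
    return C / N ** avg_len + ALPHA


def modified_geometric_mean(fg: float, parts) -> float:
    """Fg' = Fg / (C / N^l + alpha), with l the average constituent length."""
    avg_len = sum(len(p) for p in parts) / len(parts)
    return fg / penalty(avg_len)


def is_compound(fo: float, fg: float, parts) -> bool:
    """The slide's condition Fo < Fg'."""
    return fo < modified_geometric_mean(fg, parts)


for avg_len in (2, 3, 4, 5, 6):
    print(avg_len, round(penalty(avg_len), 2))
```

Under this reading the divisor is about 157 for constituents of average length 2 and about 1.31 for average length 6, so splits into very short strings are penalized heavily while splits into long constituents are hardly penalized.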
Experiments
• Data
– 87K Katakana types in 5.8M sentences of newspaper articles (12-year volume)
– 43K Katakana types in 2.8M sentences of cooking-related web pages
• Evaluation
– 500-word test set for each data set: correct segmentation assigned manually
– Automatic segmentation is compared with the gold-standard data → precision/recall
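The slides do not say which unit precision and recall are counted over. The Python sketch below assumes one common convention, counting constituent words as character spans, and uses two toy test items. It is an illustration of the comparison with gold-standard data, not the authors' evaluation script.

```python
def spans(parts):
    """Convert a list of constituents into a set of (start, end) character spans."""
    result, start = set(), 0
    for part in parts:
        result.add((start, start + len(part)))
        start += len(part)
    return result


def precision_recall(gold, predicted):
    """Span-level precision and recall over all test words (assumed convention)."""
    correct = n_pred = n_gold = 0
    for word, gold_parts in gold.items():
        g, p = spans(gold_parts), spans(predicted[word])
        correct += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    return correct / n_pred, correct / n_gold


# Toy data: the second word is wrongly split by the hypothetical system.
gold = {"トマトソース": ["トマト", "ソース"], "リソース": ["リソース"]}
predicted = {"トマトソース": ["トマト", "ソース"], "リソース": ["リ", "ソース"]}
print(precision_recall(gold, predicted))
```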
Experimental Results (1/2)
(D: dictionary, C: English corpus, WOD: relation in WOD)
News        D      D+C    D+WOD  D+C+WOD
Precision   1.0    0.996  0.986  0.985
Recall      0.822  0.909  0.945  0.949
F           0.902  0.950  0.965  0.966
Cooking     D      D+C    D+WOD  D+C+WOD
Precision   1.0    1.0    0.990  0.991
Recall      0.717  0.836  0.948  0.956
F           0.835  0.910  0.968  0.973
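Assuming F is the usual harmonic mean of precision and recall, F = 2PR / (P + R), the reported values can be recomputed from the table. The sketch below does this in Python. Small differences in the third decimal are expected because the table's precision and recall are already rounded.

```python
# (precision, recall, reported F) copied from the two tables.
RESULTS = {
    "News": {
        "D": (1.0, 0.822, 0.902),
        "D+C": (0.996, 0.909, 0.950),
        "D+WOD": (0.986, 0.945, 0.965),
        "D+C+WOD": (0.985, 0.949, 0.966),
    },
    "Cooking": {
        "D": (1.0, 0.717, 0.835),
        "D+C": (1.0, 0.836, 0.910),
        "D+WOD": (0.990, 0.948, 0.968),
        "D+C+WOD": (0.991, 0.956, 0.973),
    },
}


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F)."""
    return 2 * precision * recall / (precision + recall)


for domain, settings in RESULTS.items():
    for name, (p, r, reported) in settings.items():
        print(domain, name, round(f_measure(p, r), 4), reported)
```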
Experimental Results (2/2)
[Figure: two bar charts, one for the cooking domain and one for the news domain. x-axis: word length (5 to 17); y-axis: # of words; series: single word / not registered, single word / registered, compound]
                 News    Cooking
# of words       13807   4947
# of compounds   6054    2565
Katakana words : Freq. ≧ 10
Discussion (1/2)
• Precision: no entry in the JE dictionary
– Neologisms or very rare words
× シュレッドチーズ → シュ : レッド : チーズ (shred cheese)
– Proper nouns
× パスツール → パス : ツール (Pasteur)
• Recall
– Criteria for compounds
プールサイド = poolside
– No entry in the JE dictionary
シュガーローフ (sugar loaf)
Discussion (2/2)
• Context dependency
– Segmentation
タコスライス → タコス + ライス (tacos rice) or タコ + スライス (tako (= octopus) slice)
– Compound or not
カラーリング (coloring) or カラー + リング (color ring)
Conclusion
• Segmentation of Japanese Katakana compounds
– Dictionary
– Huge English Corpus and JE-Dictionary
– Relation in WOD
• Future plan
– Integration with NE detection
– Use of automatic transliteration