Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus

Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi
Language & Knowledge Engineering Lab, University of Tokyo
2005/10/13 IJCNLP2005


Page 1: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Language & Knowledge Engineering Lab

Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus

Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi

University of Tokyo

2005/10/13 IJCNLP2005

Page 2: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus

• Japanese Character Set

• About Word Segmentation

• Proposed Method

– Method using a Japanese-English Dictionary

– Method using a Huge Corpus and a Dictionary

– Method using Relation in WOD

• Evaluation and Discussion

• Conclusion

Page 3: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Japanese Character Set

• Kanji (ideogram): 6,000

– Nouns: 東京 (Tokyo), 大学 (university), 情報 (information) …

– Stems of verbs/adjectives: 書く (write), 美しい (beautiful) …

• Hiragana (phonogram): 83

– Function words: が, を, れる, られる …

– Endings of verbs/adjectives: 書く (write), 美しい (beautiful) …

• Katakana (phonogram): 86

– Loan words: コンピュータ (computer), ドイツ (Germany) …

Page 4: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Katakana Set (86 Characters)

Columns: W R Y M H/B/P N T/D S/Z K (vowels). Rows: a, i, u, e, o. ヲ and ン occupy the free cells of the W column.

ワ ヮ ラ ヤ ャ マ ハ バ パ ナ タ ダ サ ザ カ ヵ ア ァ

リ ヰ ミ ヒ ビ ピ ニ チ ヂ シ ジ キ イ ィ

ヲ ル ユ ュ ム フ ブ プ ヌ ツ ヅ ッ ス ズ ク ウ ゥ ヴ

レ ヱ メ ヘ ベ ペ ネ テ デ セ ゼ ケ ヶ エ ェ

ン ロ ヨ ョ モ ホ ボ ポ ノ ト ド ソ ゾ コ オ ォ

Page 5: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Word Segmentation

• Kanji and Hiragana

Ex. 彼は大学に通う (kare-wa-daigaku-ni-kayo-u): He pp Univ. pp goes

• Katakana

Ex. エクストラバージンオリーブオイル: extra virgin olive oil

ジャパンカップサイクルロードレース: Japan cup cycle road race

Page 6: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Why Is Katakana Word Segmentation Necessary?

トマトソース (tomato so-su): tomato sauce

ホワイトソース (howaito so-su): white sauce

リソース (riso-su): resource

トマトソース and ホワイトソース are kinds of “sauce” (something to put on some dishes). リソース also ends in ソース, but it is not (×) something to put on some dishes.

Page 7: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus


Similar Problem in German

Lebensversicherungsgesellschaftsangestellter

“life insurance company employee”

Donaudampfschleppschiffahrtgesellschaftskapitän

“Captain of Danube steam tow company”

Page 8: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Word Segmentation so far

• Many studies address word segmentation

• No study aims specifically at Katakana words

• Word segmentation so far handles Katakana words as follows:

– Use a dictionary with some manually registered Katakana words

– Treat a whole continuous Katakana string as one word when it is unknown

Page 9: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

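Problem Setting

Corpus (excerpt):

・・・あとは粉を付けてバターで焼いたムニエルや、白ワインで蒸し直したり、パン粉をまぶしてフライにしたり、ホワイトソースやトマトソースをかけたグラタンにもなります。・・・

(... it can also be floured and fried in butter as a meunière, re-steamed in white wine, coated in breadcrumbs and deep-fried, or made into a gratin topped with white sauce or tomato sauce. ...)

[Figure: the Corpus is turned into Word-Occurrence Data (WOD), a list of Katakana strings with their frequencies (ラーメンスープ, トマトソース, トマトスープ, ・・・); the remaining boxes are labeled Basic Vocabulary and Japanese-English Translation Information]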

Page 10: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus


Table of Contents

• Japanese Character Set

• About Word Segmentation

• Proposed Method

– Method using a Japanese-English Dictionary

– Method using a Huge Corpus and a Dictionary

– Method using Relation in WOD

• Evaluation and Discussion

• Conclusion

Page 11: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Overview of the Method

[Figure: method overview. Boxes: Corpus, WOD, Basic Vocabulary. Resources: Dictionary, English Corpus, WOD Freq. Annotation: Highly Reliable]

Page 12: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Method using Dictionary

• Segmentation using a JE dictionary

Ex. トマトソース = “tomato sauce”

JE Dictionary: トマト = “tomato”, ソース = “sauce”, “source”

トマト + ソース = “tomato sauce”, so トマトソース is segmented as トマト : ソース

• Translation is one word → single-word

Ex. サンドウィッチ = “sandwich”

• Entries of Japanese Dic. → single-word

Ex. インゲン (= いんげん) (a kidney bean)
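The dictionary rule can be illustrated with a minimal Python sketch. It assumes a toy JE dictionary holding only the entries quoted on the slide; the names `JE_DICT`, `segmentations` and `segment_with_dictionary` are hypothetical and not taken from the authors' system. A word is split only when the word-by-word translations of its parts reproduce one of its own translations.

```python
from typing import Dict, List, Optional

# Toy Japanese-English dictionary (assumption: only the slide's example entries).
JE_DICT: Dict[str, List[str]] = {
    "トマト": ["tomato"],
    "ソース": ["sauce", "source"],
    "トマトソース": ["tomato sauce"],
    "サンドウィッチ": ["sandwich"],
}


def segmentations(word: str, lexicon: Dict[str, List[str]]) -> List[List[str]]:
    """Enumerate every split of `word` into two or more lexicon entries."""
    results: List[List[str]] = []

    def walk(start: int, parts: List[str]) -> None:
        if start == len(word):
            if len(parts) >= 2:
                results.append(parts)
            return
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in lexicon:
                walk(end, parts + [piece])

    walk(0, [])
    return results


def segment_with_dictionary(word: str) -> Optional[List[str]]:
    """Return the constituents of `word`, [word] for a single word, or None if undecided."""
    translations = JE_DICT.get(word, [])
    # A one-word translation marks the Katakana word as a single word.
    if any(" " not in t for t in translations):
        return [word]
    for parts in segmentations(word, JE_DICT):
        # Build every word-by-word translation of this split.
        phrases = [""]
        for part in parts:
            phrases = [(p + " " + t).strip() for p in phrases for t in JE_DICT[part]]
        if any(p in translations for p in phrases):
            return parts
    return None


print(segment_with_dictionary("トマトソース"))
print(segment_with_dictionary("サンドウィッチ"))
```

Words that the toy dictionary cannot decide, such as リソース here, come back as `None`, which is where the corpus-based methods on the following slides take over.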

Page 13: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Overview of the Method

[Figure: method overview. Boxes: Corpus, WOD, Basic Vocabulary. Resources: Dictionary, English Corpus, WOD Freq. Annotations: Highly Reliable, Low coverage, High Precision, More coverage]

Page 14: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Method using a Huge English Corpus

• All possible segmentations into Katakana words in the JE Dictionary

• Translation → possible English phrases

• # of phrasal hits from a Web search engine

Ex. パセリソース (parsley sauce)

(i) パセリ : ソース

parsley source → 554 Hits

parsley sauce → 20600 Hits ◎

(ii) パセ : リソース

pase resource → 3 Hits
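The selection step in this example can be sketched in Python as follows. This is only an illustration: `HITS` hard-codes the hit counts quoted on the slide, `phrase_hits` is a hypothetical stand-in for a phrasal Web search query, and `TRANSLATIONS` is a toy excerpt of a JE dictionary.

```python
from itertools import product
from typing import Dict, List, Tuple

# Toy dictionary excerpt (assumption: just the entries needed for the slide's example).
TRANSLATIONS: Dict[str, List[str]] = {
    "パセリ": ["parsley"],
    "ソース": ["source", "sauce"],
    "パセ": ["pase"],
    "リソース": ["resource"],
}

# Phrasal hit counts quoted on the slide.
HITS: Dict[str, int] = {
    "parsley source": 554,
    "parsley sauce": 20600,
    "pase resource": 3,
}


def phrase_hits(phrase: str) -> int:
    """Hypothetical stand-in for querying a Web search engine with a quoted phrase."""
    return HITS.get(phrase, 0)


def best_segmentation(candidates: List[List[str]]) -> Tuple[List[str], str, int]:
    """Pick the segmentation whose English phrase has the most hits."""
    best: Tuple[List[str], str, int] = ([], "", -1)
    for parts in candidates:
        # Every combination of the constituents' translations is a candidate phrase.
        for words in product(*(TRANSLATIONS[p] for p in parts)):
            phrase = " ".join(words)
            hits = phrase_hits(phrase)
            if hits > best[2]:
                best = (parts, phrase, hits)
    return best


candidates = [["パセリ", "ソース"], ["パセ", "リソース"]]
print(best_segmentation(candidates))
```

With the slide's numbers the sketch selects パセリ : ソース via "parsley sauce", the phrase marked ◎.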

Page 15: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Threshold for Hit Number

• Even an inappropriate segmentation and its nonsensical translation get some hits on the Web

Ex. デミ : グラス → demi glass: 207 (the word means demi-glace)

バン : バンジー → van bungee: 159 (the word is the Chinese dish “ban-ban-ji”)

• The longer the Katakana word is, the more probable it is a compound

Threshold = C / N^L

L : the length of the Katakana word

C : 400,000

N : 2
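As a rough worked example of this threshold, the Python sketch below evaluates C / N^L for the words mentioned on the slides. Two points are assumptions rather than statements from the slides: the length L is counted in Katakana characters, and a segmentation is accepted when its hit count reaches the threshold. The function names are hypothetical.

```python
def hit_threshold(length: int, c: float = 400_000, n: float = 2) -> float:
    """Threshold C / N^L; it shrinks as the Katakana word gets longer."""
    return c / n ** length


def accept(word: str, hits: int) -> bool:
    """Assumed decision rule: keep a segmentation only if its hits reach the threshold."""
    return hits >= hit_threshold(len(word))


# Hit counts quoted on the slides for the best phrase of each word.
for word, hits in [("デミグラス", 207), ("バンバンジー", 159), ("パセリソース", 20600)]:
    print(word, len(word), hit_threshold(len(word)), accept(word, hits))
```

Under these assumptions the spurious phrases "demi glass" and "van bungee" fall below the threshold, while "parsley sauce" stays above it.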

Page 16: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

ハイビジョン (hai-bijyon)

× high vision → 11,000 Hits

○ high definition → 5,450,000 Hits

ペーパーテスト (pe-pa-tesuto)

× paper test → 45,400 Hits

○ written test → 415,000 Hits

Overview of the Method

[Figure: method overview. Boxes: Corpus, WOD, Basic Vocabulary. Resources: Dictionary, English Corpus, WOD Freq. Annotations: Highly Reliable, Low coverage, High Precision, More coverage, High Recall, and "Depends on the JE-Dic., Natural English Compounds"]

Page 17: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Method using Relation in WOD

• Try to find compounds based only on the information in a WOD

• Geometric mean of freq. of possible constituent words ⇔ Freq. of the original word

WOD (freq., word):

159 ガーリックトースト

32 ガー

9 リック

515 ガーリック

652 トースト

5 トー

60 スト

Ex. ガーリックトースト (garlic toast), freq. 159

ガー : リック : トースト (32 × 9 × 652)^(1/3) = 57

ガーリック : トースト (515 × 652)^(1/2) = 579

ガー : リック : トー : スト (32 × 9 × 5 × 60)^(1/4) = 17

ガーリック : トー : スト (515 × 5 × 60)^(1/3) = 54
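The geometric means in this example can be reproduced with a short Python sketch. The `WOD` table copies the frequencies from the slide; the helper names are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
import math
from typing import Dict, List

# Word-occurrence data copied from the slide.
WOD: Dict[str, int] = {
    "ガーリックトースト": 159,
    "ガー": 32,
    "リック": 9,
    "ガーリック": 515,
    "トースト": 652,
    "トー": 5,
    "スト": 60,
}


def segmentations(word: str, wod: Dict[str, int]) -> List[List[str]]:
    """Enumerate every split of `word` into two or more strings found in the WOD."""
    results: List[List[str]] = []

    def walk(start: int, parts: List[str]) -> None:
        if start == len(word):
            if len(parts) >= 2:
                results.append(parts)
            return
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in wod:
                walk(end, parts + [piece])

    walk(0, [])
    return results


def geometric_mean(freqs: List[int]) -> float:
    """n-th root of the product of n frequencies."""
    return math.prod(freqs) ** (1.0 / len(freqs))


def score_segmentations(word: str) -> Dict[str, float]:
    """Map each candidate segmentation to the geometric mean of its parts' frequencies."""
    return {
        " : ".join(parts): geometric_mean([WOD[p] for p in parts])
        for parts in segmentations(word, WOD)
    }


scores = score_segmentations("ガーリックトースト")
for seg, fg in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{seg}  {fg:.0f}")
```

The candidate with the largest geometric mean is ガーリック : トースト, matching the value 579 on the slide.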

Page 18: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Threshold for Geometric Mean

Fo < Fg', where Fg' = Fg / (C / N^l + α)

Fo : Freq. of the original word

Fg : Geometric mean of freq. of constituents

Fg' : Modified geometric mean

l : Average length of constituents

C : 2,500

N : 4

α : 0.7
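The sketch below shows how this threshold might be computed in Python. It reads the flattened formula as Fg' = Fg / (C / N^l + α), which is an assumption about the grouping of terms; the function names are hypothetical and the input values in the usage lines are made up for illustration, not taken from the experiments.

```python
def modified_geometric_mean(fg: float, avg_len: float,
                            c: float = 2500, n: float = 4, alpha: float = 0.7) -> float:
    """Fg' = Fg / (C / N^l + alpha), under the assumed reading of the slide's formula.

    Short constituents (small l) make the divisor large, so their geometric
    mean is discounted heavily; for long constituents the divisor approaches alpha.
    """
    return fg / (c / n ** avg_len + alpha)


def looks_like_compound(fo: float, fg: float, avg_len: float) -> bool:
    """Condition from the slide: the word is treated as a compound when Fo < Fg'."""
    return fo < modified_geometric_mean(fg, avg_len)


# Hypothetical values: the same Fg with short versus long constituents.
print(modified_geometric_mean(100, 2))
print(modified_geometric_mean(100, 6))
print(looks_like_compound(50, 100, 2), looks_like_compound(50, 100, 6))
```

With these made-up inputs, the same geometric mean supports a split into long constituents but not a split into two-character fragments, which is the effect the length term is there to produce.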

Page 19: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus


Table of Contents

• Japanese Character Set

• About Word Segmentation

• Proposed Method

– Method using a Japanese-English Dictionary

– Method using a Huge Corpus and a Dictionary

– Method using Relation in WOD

• Evaluation and Discussion

• Conclusion

Page 20: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Experiments

• Data

– 87K Katakana types in 5.8M sentences of newspaper articles (12-year volume)

– 43K Katakana types in 2.8M sentences of cooking-related web pages

• Evaluation

– 500-word test set for each data set: correct segmentation assigned manually

– Automatic segmentation is compared with the gold-standard data → precision/recall

Page 21: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Experimental Results (1/2)

(D: dictionary method, C: English corpus method, WOD: method using relation in WOD)

News        D      D+C    D+WOD  D+C+WOD
Precision   1.0    0.996  0.986  0.985
Recall      0.822  0.909  0.945  0.949
F           0.902  0.950  0.965  0.966

Cooking     D      D+C    D+WOD  D+C+WOD
Precision   1.0    1.0    0.990  0.991
Recall      0.717  0.836  0.948  0.956
F           0.835  0.910  0.968  0.973
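The F rows are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The slides do not state the formula, so this is an assumption, but the Python check below reproduces every reported F value to within 0.001 from the rounded precision and recall figures. `RESULTS` copies the table above; the names are hypothetical.

```python
from typing import Dict, Tuple


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) copied from the results tables.
RESULTS: Dict[str, Dict[str, Tuple[float, float, float]]] = {
    "News": {
        "D": (1.0, 0.822, 0.902),
        "D+C": (0.996, 0.909, 0.950),
        "D+WOD": (0.986, 0.945, 0.965),
        "D+C+WOD": (0.985, 0.949, 0.966),
    },
    "Cooking": {
        "D": (1.0, 0.717, 0.835),
        "D+C": (1.0, 0.836, 0.910),
        "D+WOD": (0.990, 0.948, 0.968),
        "D+C+WOD": (0.991, 0.956, 0.973),
    },
}

for domain, rows in RESULTS.items():
    for method, (p, r, reported) in rows.items():
        print(f"{domain:8s} {method:8s} computed F = {f_measure(p, r):.4f}  reported F = {reported}")
```

The small differences in the third decimal place are what one would expect if the reported F values were computed from unrounded precision and recall.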

Page 22: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Experimental Results (2/2)

[Figure: two histograms, "cooking domain" and "news domain". x-axis: word length (5 to 17); y-axis: # of words; series: single word / not registered, single word / registered, compound]

Katakana words with Freq. ≥ 10:

                 News    Cooking
# of words       13807   4947
# of compounds   6054    2565

Page 23: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Discussion (1/2)

• Precision: no entry in the JE dictionary

– Neologisms or very rare words

シュレッドチーズ (shred cheese) → × シュ : レッド : チーズ

– Proper nouns

パスツール (Pasteur) → × パス : ツール

• Recall

– Criteria for compounds

プールサイド = poolside

– No entry in JE dictionary

シュガーローフ (sugar loaf)

Page 24: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Discussion (2/2)

• Context dependency

– Segmentation

タコスライス → タコス + ライス (tacos rice) or タコ + スライス (tako slice; tako = octopus)

– Compound or not

カラーリング (coloring) or カラー + リング (color ring)

Page 25: Automatic Acquisition of  Basic Katakana Lexicon  from a Given Corpus

Conclusion

• Segmentation of Japanese Katakana compounds

– Dictionary

– Huge English Corpus and JE-Dictionary

– Relation in WOD

• Future plan

– Integration with NE detection

– Use of automatic transliteration