Modeless Japanese Input Method
Hybrid method for modeless Japanese input
using N-gram based binary classification and dictionary
Yukino Ikegami, Setsuo Tsuruta
2014/01/20
Necessity of Japanese Input Method
• Japanese has many characters
  – Kana
    • Hiragana: 81 characters, e.g. いろはにほへと
    • Katakana: 81 characters, e.g. イロハニホヘト
  – Kanji (Chinese characters)
    • More than 6,000 characters, e.g. 以呂波仁保反止
• These characters cannot be typed directly on a keyboard
• A Japanese input method (converting alphabet characters to Japanese characters) is necessary
If every Japanese character were assigned to its own key…
• Far too many keys!
• A Japanese input method is necessary
Japanese Input Method: Roman to Kana-Kanji Converter
• Flow
1. Receive the Romanized alphabet input
2. Convert the Romanized alphabet input into Kana using a Roman-to-Kana table
3. Convert Kana into Kanji (if necessary)
① n e k o d e s u
② ねこです
③ 猫です
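Step 2 of the flow above can be sketched as a greedy longest-match lookup. This is only an illustration: the table `ROMAJI_TO_KANA` holds a handful of entries chosen for the slide's examples, and the function name `romaji_to_kana` is hypothetical, not taken from the presented system. Step 3 (Kana to Kanji) needs a conversion dictionary and is not covered here.

```python
# Tiny illustrative Roman-to-Kana table (an assumption; a real IME table is far larger).
ROMAJI_TO_KANA = {
    "a": "あ", "re": "れ", "ha": "は",
    "ne": "ね", "ko": "こ", "de": "で", "su": "す",
}


def romaji_to_kana(text, table=ROMAJI_TO_KANA):
    """Convert Romanized input to Kana by greedy longest match.

    Characters that match no table entry are passed through unchanged.
    """
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No entry matched: keep the character as typed.
            out.append(text[i])
            i += 1
    return "".join(out)


print(romaji_to_kana("nekodesu"))
```

With this toy table, `nekodesu` yields the Kana string shown in ②.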
Problems with Japanese Input Methods
• Need to switch input modes between Japanese and ASCII
  – e.g. to input 'あれは 8Byte です' (That is 8Byte):
    areha [Return] [ASCII Mode] 8byte [Japanese Mode] desu
    ([ASCII Mode] and [Japanese Mode] are the mode-switching keystrokes)
• Switching is cumbersome!
Adding Terms to the Dictionary for the Mode-Switching Problem
• Add terms from other languages to the dictionary of a conventional input method editor
• Shortcomings
  – New terms are created continuously
  – Homograph problem
Related Work
• Modeless Pinyin-Chinese Input [Chen et al. 2000]
  – Converts alphabet (Pinyin) to Chinese
  – Uses only the word-surface feature for classification
• Type-Any [Ehara et al. 2009]
  – Converts alphabet to any language
  – Requires pressing a delimiter key when converting
  – Uses only the word-surface feature for classification
Approach: Modeless Japanese Input Method
• Switch the input mode automatically
1. Generate a discriminative model with a Support Vector Machine (SVM)
  – The model uses multiple n-gram features
2. Use the discriminative model to decide whether each segment of the alphabet sequence is Kana or not
  – e.g. nekohacatdesu → nekoha / cat / desu → ねこは cat です (Japanese / English / Japanese)
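The segmentation in step 2 can be illustrated as follows, assuming the classifier has already produced one label per input character ("K" for Kana, "A" for ASCII). The label string and the function name `split_segments` are hypothetical; the SVM that would produce the labels is not shown.

```python
from itertools import groupby


def split_segments(text, labels):
    """Group consecutive characters that share the same label.

    `labels` holds one label per character, e.g. "K" (Kana) or "A" (ASCII).
    Returns a list of (label, segment) pairs in input order.
    """
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    pairs = zip(labels, text)
    return [
        (label, "".join(ch for _, ch in group))
        for label, group in groupby(pairs, key=lambda pair: pair[0])
    ]


# Labels are written by hand here; in the described system they would come
# from the discriminative model.
segments = split_segments("nekohacatdesu", "KKKKKKAAAKKKK")
print(segments)
```

Each "K" segment would then go through Kana conversion, while each "A" segment is left as typed.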
Main Flow of the Modeless Japanese Input Method
[Flowchart. Nodes: User input (alphabet sequence); loop over each character in the user input; decision "is the character still ASCII?" with True / False branches (False: Kana conversion); System response (Kana & alphabet sequence). Resources: Discriminative Model, Non-Japanese Dictionary.]
Flow of Generating the Discriminative Model
1. Load texts
  • 猫は cat です
2. Kanji to Kana
  • Using the Japanese morphological analyzer MeCab
  • ネコハ cat デス
3. Kana to ASCII
  • Using a Kana-to-ASCII table (the one used by Google Japanese Input)
  • nekohacatdesu
4. ASCII to n-gram
  • character-surface: ne, ek, nek, ko, eko, oh, koh, ha, oha...
  • character-type: LL, LL, LLL, LL, LLL, LL, LLL...
  • History: KK, KK, KKK, KK, KKK, KKK...
5. n-gram to ID
  • 1, 3, 4, 13, 22...
6. Describe as binary model
  • 1:1, 3:1, 4:1, 13:1, 32:1...
7. Learning on SVM
  • 1.344, 0.691, 0.023, -1.398...
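The "n-gram to ID" and "binary model" steps can be sketched as below. The `id:1` notation on the slide resembles the sparse text format accepted by common SVM tools, but that reading, the ID numbering scheme, and the names `assign_ids` and `to_sparse_line` are assumptions made for illustration only.

```python
def assign_ids(features, vocab):
    """Map feature strings to integer IDs, growing `vocab` as new ones appear.

    Returns the sorted, de-duplicated IDs for this example.
    """
    ids = set()
    for feature in features:
        if feature not in vocab:
            vocab[feature] = len(vocab) + 1  # IDs start at 1
        ids.add(vocab[feature])
    return sorted(ids)


def to_sparse_line(label, ids):
    """Write a binary feature vector as 'label id:1 id:1 ...'."""
    return " ".join([label] + [f"{i}:1" for i in ids])


vocab = {}
ids = assign_ids(["-2/ha", "-1/a8", "-2/ha"], vocab)
print(to_sparse_line("+1", ids))
```

Each feature is either present (value 1) or absent (omitted), which is what makes the representation binary.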
n-gram Features
あ れ は 8 B y t e
a r e h a 8 B y t e
(in the case of n-gram upper limit n = 2, window size m = 2, focus point xi = the 2nd "a")
• Character-Surface
  – Substrings backward and forward of the focus point
  – e.g. -2/ha -1/a8 0/8B 1/By
• Character-Type
  – Upper-case (U), Lower-case (L), Number (N), and Symbol (S)
  – e.g. -2/LL -1/LN 0/NU 1/UL
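The example features above can be reproduced with a short sketch. The indexing convention (the n-gram at offset k starts k + 1 characters after the focus point) is inferred from the slide's example and is an assumption, as are the function names; only n-grams of length exactly n are generated here.

```python
def char_type(ch):
    """Classify a character as Upper (U), Lower (L), Number (N), or Symbol (S)."""
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "N"
    return "S"


def window_ngrams(text, focus, n=2, m=2):
    """Return (offset, n-gram) pairs for offsets -m .. m-1 around `focus`.

    Assumed convention: the n-gram at offset k starts at index focus + k + 1.
    Windows that run past either end of the text are skipped.
    """
    grams = []
    for k in range(-m, m):
        start = focus + k + 1
        if start < 0 or start + n > len(text):
            continue
        grams.append((k, text[start:start + n]))
    return grams


def surface_features(text, focus, n=2, m=2):
    return [f"{k}/{gram}" for k, gram in window_ngrams(text, focus, n, m)]


def type_features(text, focus, n=2, m=2):
    return [
        f"{k}/{''.join(char_type(c) for c in gram)}"
        for k, gram in window_ngrams(text, focus, n, m)
    ]


# Focus point: the 2nd "a" of "areha8Byte" (index 4).
print(surface_features("areha8Byte", 4))
print(type_features("areha8Byte", 4))
```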
Generating the Non-Japanese Dictionary
• Words that never appear in Japanese-only text
  – Longer than 5 characters
  – Contain a substring that cannot be converted to Kana
• Source
  – Corpus of Contemporary American English (COCA)
  – Japanese Wikipedia article title list
Comparison with Conventional IME
• Conventional method
  areha [Return] [Alphabet Mode] 8Byte [Japanese Mode] desu
  ([Alphabet Mode] and [Japanese Mode] are mode switches)
  Typing: 17
• Modeless Japanese input method
  areha8Bytedesu
  Typing: 14
• The number of keystrokes is reduced
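The two keystroke counts can be checked with a small sketch. It assumes each bracketed key such as [Return] counts as one keystroke, each other non-space character counts as one, and Shift is not counted; the helper name `count_keystrokes` is hypothetical.

```python
import re


def count_keystrokes(sequence):
    """Count keystrokes: one per bracketed key, one per other non-space character."""
    tokens = re.findall(r"\[[^\]]+\]|\S", sequence)
    return len(tokens)


conventional = "areha [Return][Alphabet Mode] 8Byte [Japanese Mode] desu"
modeless = "areha8Bytedesu"

print(count_keystrokes(conventional))  # 17
print(count_keystrokes(modeless))      # 14
```

The difference of three keystrokes corresponds to [Return] and the two mode switches.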
Datasets Used in the Evaluation Experiment
• Generating the model & evaluating the method
  – Balanced Corpus of Contemporary Written Japanese (BCCWJ)
    • Books, magazines, blogs, government documents, and others
• Non-Japanese dictionary source
  – COCA
  – Japanese Wikipedia article title list
Results of Evaluation
• Outperforms the baseline

Criteria            Baseline                  Proposed method
                    (char. surface n-gram)    (char. {surface, type} n-gram & dictionary)
Kana Precision      .998                      .999
Kana Recall         .993                      .998
Kana F1-measure     .953                      .968
ASCII Precision     .989                      .996
ASCII Recall        .780                      .884
ASCII F1-measure    .858                      .924
User Test
• Outperforms the conventional method
• 4 females and 7 males
• Input example sentences (chat, mail, technological text)

Person No.         1      2      3      4      5      6      7      8      9
Conventional IME   18.18  17.89  15.4   12.71  11.09  10.18  11.42  12.38  10.48
Proposed method    13.34  14.68  9.88   12.23  6.03   7.00   11.03  11.37  10.30
…
Summary
• Switching input modes is cumbersome
• Hybrid modeless Japanese input method
  – Automatically switches the input mode between Japanese and ASCII
  – Uses an n-gram feature model for discrimination
    • character-{surface, type}
  – Outperforms conventional methods