Combining Multiple Dictionaries to Improve Tokenization
of Ainu Language
Michal Ptaszynski (1), Yuka Ito (2), Karol Nowakowski (3),
Hirotoshi Honma (2), Yoko Nakajima (2), Fumito Masui (1)
(1) Kitami Institute of Technology
(2) Kushiro National College of Technology
(3) Independent Researcher
INTRODUCTION
• The Ainu language is the language of the Ainu people, who live mostly in northern Japan.
• Population of the Ainu: about 23 thousand people.
• Number of native speakers: fewer than one hundred (Hohmann, 2008).
• The Ainu language is critically endangered (Moseley, 2010).
Purpose of this research:
• Create a language analysis toolkit including a POS tagger, a translation support tool, and a shallow parser.
• Help linguistic and language anthropology research and support translators of Ainu texts.
• Contribute to the process of reviving the Ainu language.
PREVIOUS RESEARCH ON AINU LANGUAGE
Linguistic Studies:
• collections of Ainu epic stories and myths (Chiri, 1978; Kayano, 1998; Piłsudski and Majewicz, 2004)
• dictionaries and lexicons (Hattori, 1964; Chiri, 1975-1976; Nakagawa, 1995; Kayano, 1996; Tamura, 1998; Kirikae, 2003)
• grammar descriptions (Chiri, 1974; Murasaki, 1979; Refsing, 1986; Kindaichi, 1993; Sato, 2008)
NLP-related Studies:
• automatically gathering word translations from texts (Echizen-ya et al., 2004)
• analysis / retrieval of hierarchical Ainu-Japanese translations (Azumi & Momouchi, 2009ab)
• annotating Ainu "yukar" stories for a machine translation system (Momouchi et al., 2008)
• a system for translation of Ainu place names (Momouchi & Kobayashi, 2010)
Our previous work:
• created POST-AL, a simple POS tagger for the Ainu language (Ptaszynski & Momouchi, 2012)
• expanded it into a toolkit (Ptaszynski et al., 2013)
• improved POS tagging (Ptaszynski et al., 2016)
POST-AL
DICTIONARY
• Base dictionary for POST-AL: Ainu shin-yoshu jiten (Lexicon to Yukie Chiri's Ainu Shin-yosyu (Ainu Songs of Gods)) by Kirikae (2003)
• Dictionary information transformed into an XML database with the fields:
  1. token (word, morpheme, etc.)
  2. part of speech
  3. meaning (in Japanese)
  4. usage examples (partially)
  5. reference to the story it appears in (partially)
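The slides do not show the XML format itself. As a minimal sketch of what one database entry could look like, with invented tag names and an invented sample word (this is not the actual POST-AL schema), using Python's standard library:

```python
import xml.etree.ElementTree as ET

# Hypothetical entry schema: tag names and values are invented for
# illustration, not taken from the actual POST-AL database.
entry = ET.Element("entry")
ET.SubElement(entry, "token").text = "kamuy"          # 1. token (word, morpheme, etc.)
ET.SubElement(entry, "pos").text = "noun"             # 2. part of speech
ET.SubElement(entry, "meaning").text = "god"          # 3. meaning (Japanese in the real database)
ET.SubElement(entry, "example").text = "kamuy yukar"  # 4. usage example
ET.SubElement(entry, "source").text = "song 1"        # 5. story reference

xml_string = ET.tostring(entry, encoding="unicode")
print(xml_string)
```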
POST-AL
• SYSTEM DESCRIPTION
• Tokenization: DL-LSM (Dictionary Lookup based on the Longest Match principle)
• POS Tagging: CON-POST (Contextual Part-of-Speech Tagging), based on a higher-order HMM trained on dictionary examples
• Token Translation: CON-ToT (Contextual Token Translation), a translation selected specifically for the word sense chosen by CON-POST
[Pipeline figure: input -> tokenization -> POS tagging -> token translation -> output]
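CON-POST relies on a higher-order HMM. As a simplified, hypothetical illustration of the underlying idea only, here is a first-order Viterbi decoder; the tags, words, and all probabilities below are invented for the example and do not come from the actual system:

```python
# Simplified first-order HMM tagger (CON-POST uses a higher-order HMM
# trained on dictionary examples; everything here is a toy stand-in).
def viterbi(words, tags, start_p, trans_p, emit_p):
    # V[i][tag] = (best prob of tagging words[:i+1] ending in tag, backpointer)
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-6), None) for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            best_prev = max(tags, key=lambda p: V[-1][p][0] * trans_p[p][t])
            row[t] = (V[-1][best_prev][0] * trans_p[best_prev][t]
                      * emit_p[t].get(w, 1e-6), best_prev)
        V.append(row)
    # backtrack from the best final tag
    last = max(tags, key=lambda t: V[-1][t][0])
    path = [last]
    for row in reversed(V[1:]):
        path.append(row[path[-1]][1])
    return list(reversed(path))

# Invented toy model: N = noun, V = verb
tags = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"kamuy": 0.5}, "V": {"an": 0.5}}
print(viterbi(["kamuy", "an"], tags, start_p, trans_p, emit_p))  # -> ['N', 'V']
```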
Two directions for improvement:
1. Improve the tokenization algorithm
2. Improve the dictionary base
IMPROVING TOKENIZER
APPLIED TOKENIZERS
• POST-AL Tokenizer
  • dictionary lookup
  • longest match principle
  • keeps track of already matched word patterns to avoid over-tokenization (splitting each word recursively)
• NLTK Word Tokenizer
  • NLTK (Natural Language Tool-Kit): http://www.nltk.org/
  • re-trained on the same Ainu dictionary base
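The longest-match idea behind the POST-AL tokenizer can be sketched as follows. This is a minimal greedy sketch with an invented toy dictionary; the real DL-LSM additionally tracks already matched patterns to avoid recursive re-splitting:

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match segmentation of an unspaced string.

    At each position, take the longest dictionary word that matches;
    if nothing matches, emit a single character and move on.
    """
    tokens, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:  # out-of-vocabulary character
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

# Toy dictionary, invented for illustration
vocab = {"kamuy", "yukar", "an", "a"}
print(longest_match_tokenize("kamuyyukar", vocab))  # -> ['kamuy', 'yukar']
```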
IMPROVING TOKENIZER
Test data: 5 yukar stories (tokenized by experts):
9. Tororo hanrok, hanrok! (unknown meaning)
10. Kutnisa kutunkutun (unknown meaning)
11. Tan ota hure hure (This sand, red, red!)
12. Kappa rew rew kappa (Otter, flexible otter)
13. Tonupeka ranran (unknown meaning / … raining [?])
IMPROVING TOKENIZER
        POST-AL tokenizer        NLTK word tokenizer
        Pr      Re      F1       Pr      Re      F1
Yuk09   81.3%   85.9%   83.2%    38.4%   92.9%   53.8%
Yuk10   90.2%   93.4%   91.5%    28.8%   82.1%   42.1%
Yuk11   86.0%   90.6%   87.9%    31.9%   89.1%   46.5%
Yuk12   84.5%   87.7%   85.7%    33.1%   86.0%   46.9%
Yuk13   87.4%   92.7%   89.6%    34.8%   86.1%   49.0%
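For reference, F1 is the harmonic mean of precision and recall. Recomputing it from the rounded Pr/Re values in the table gives nearly, but not exactly, the reported F1 (the slides presumably used unrounded scores):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Yuk09 with the POST-AL tokenizer: Pr = 81.3%, Re = 85.9%
print(round(f1(0.813, 0.859) * 100, 1))  # -> 83.5 (the table reports 83.2%)
```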
IMPROVING TOKENIZER
• The retrained NLTK tokenizer performs recursive tokenization (it keeps splitting already matched words), which explains its high recall but low precision.
IMPROVING DICTIONARY
APPLIED DICTIONARIES
1. Lexicon to Ainu Songs of Gods (based on written scripts), Kirikae (2003), hereafter KK
2. Ainugo kaiwa jiten (Ainu Conversational Dictionary), Jinbo & Kanazawa (1898-1986), hereafter JK
3. Combined: KK + JK
IMPROVING DICTIONARY
APPLIED DATASETS
• Yukar: Ainu Songs of Gods
• JK sample sentences
• Samples from Shibatani's The Languages of Japan
• Mukawa Dialect Samples (by Chiba Univ.): http://cas-chiba.net/Ainu-archives/mukawa/mukawa.cgi
IMPROVING DICTIONARY
RESULTS
Tokenization (F-score):

         Yuk9    Yuk10   Yuk11   Yuk12   Yuk13   JK      Mukawa  Shibatani  Average
JK       53.8%   57.7%   56.2%   50.4%   58.3%   86.2%   69.0%   64.9%      62.1%
KK       81.6%   83.1%   91.1%   77.4%   87.0%   66.8%   65.5%   57.8%      76.3%
JK+KK    73.4%   80.4%   81.9%   73.9%   85.3%   82.9%   69.0%   70.8%      77.2%

(JK = JK sample sentences; Mukawa = Mukawa dialect dictionary sample sentences; Shibatani = Shibatani colloquial samples)
• The KK-based dictionary was better on the yukar data
• The JK-based dictionary was better on the JK samples
• The combined JK+KK dictionary scored in between on both…
• …but on new data it did not hinder performance…
• …and on average improved it!
IMPROVING DICTIONARY
ERROR ANALYSIS

Tokenizer output   Gold standard   Category
kutun kutun        kutunkutun      Dictionary
a sawa             as a wa         Tokenizer
tasi ne            ta sine         Tokenizer
karku su           kar kusu        Tokenizer
neap               ne a p          Tokenizer
aw a               a wa            Tokenizer
kuru n             kur un          Tokenizer
nep                ne p            Tokenizer
cir uska           ci ruska        Tokenizer
ciki k             ci kik          Tokenizer
cioarkaye          ci oarkaye      Tokenizer
ayke               a ike           Tokenizer
pokna sir          poknasir        Dictionary
ciousi             ci ousi         Tokenizer
montum             mon tum         Tokenizer
cioarkaye          ci oarkaye      Tokenizer
petetok o          pet etoko       Tokenizer
pirkare ra         pirka rera      Tokenizer
isoytak            isoitak         Dictionary
…                  …               …
• 8% of errors = word not in the dictionary
• 92% of errors = word in the dictionary but wrongly split
Main problem: words that are in the dictionary but get wrongly split. Possible solutions:
1. Statistical (probability of a space between letters or letter n-grams)
2. Apply contextual information (usage examples from dictionaries)
3. Automatically obtain word n-grams for disambiguation
4. A hybrid of the above
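Solution 1 could be sketched as follows: estimate, from expert-segmented training text, the probability of a word boundary between each pair of adjacent letters, then insert a space wherever that probability exceeds a threshold. The training sentences and the threshold below are toy assumptions, not the actual data or method:

```python
from collections import Counter

def train_boundary_model(segmented_sentences):
    """For each adjacent letter pair, count how often a word boundary
    occurred between the two letters in expert-segmented text."""
    boundary, total = Counter(), Counter()
    for sent in segmented_sentences:
        chars = list(sent.replace(" ", ""))
        cuts, pos = set(), 0                 # boundary positions in the unspaced string
        for word in sent.split():
            pos += len(word)
            cuts.add(pos)
        for i in range(len(chars) - 1):
            pair = (chars[i], chars[i + 1])
            total[pair] += 1
            if i + 1 in cuts:
                boundary[pair] += 1
    return {p: boundary[p] / total[p] for p in total}

def segment(text, model, threshold=0.5):
    """Insert a space wherever the estimated boundary probability is high."""
    out = [text[0]]
    for a, b in zip(text, text[1:]):
        if model.get((a, b), 0.0) > threshold:
            out.append(" ")
        out.append(b)
    return "".join(out)

# Toy training data, invented for illustration
train = ["kamuy yukar", "kamuy an", "yukar an"]
model = train_boundary_model(train)
print(segment("kamuyan", model))  # -> 'kamuy an'
```

A real model would back off to letter n-grams rather than single-character pairs, as the slide suggests, since individual letter pairs are too sparse a signal on their own.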
CONCLUSIONS
• Apply NLP techniques to revitalize the Ainu language
• Created POST-AL; it needed improvements
• Improving tokenization
  • Tokenizers compared:
    • POST-AL
    • NLTK
  • Dictionary bases compared:
    • Kirikae lexicon (KK): based on written yukar stories
    • Jinbo & Kanazawa dictionary (JK): spoken language
    • KK + JK: combined
• The custom tokenizer (POST-AL) performed better than NLTK
• The combined dictionaries were in general better on new data
• The tokenization process still needs improvement