楊允言 Iunn Un-gian
2008.7.14
台語文特性分析及其處理技術
Written Taiwanese : Its Characteristic Analysis and Processing Techniques
2
Vita
• 1984-1988 NTU CSIE undergraduate
• 1990/8-1994/1 Sinica IIS assistant
• 1991-1993 NTHU IS graduate student
• 1994/2-1996/11 NTU CC programmer
• 1996 – moved to Hualien
3
Vita-2
• 1999 Dahan I.T. CSIE lecturer
• 2003/8 - assistant prof.
• 2004 - NTU CSIE PhD program
• Journal : IJCLCLP 12(4)
• Project : NSC 3, NMTL 1, Academia Historica 1
4
Outline
1.Introduction
2.Resources and Survey of Written Taiwanese Processing
3.Coding and I/O of POJ
4.Tone Sandhi Problem and Algorithm
5
Outline-2
5.Word Segmentation and Tagging Methods
6.Corpora Collection and Annotation
7.Some Applications of Written Taiwanese Corpora
8.Conclusion and Future Work
6
1. Introduction
1.1 Background
–Population : 46M (2005)
–Distribution : Taiwan, Singapore, Malaysia, Brunei, China, Thailand, Philippines, Indonesia
–Rank : 21
–Confusing names : Southern Min? Amoy? Taiwanese?
7
1. Introduction-2
1.2 Different Scripts
–Han Characters Script
–Romanization Script (POJ)
–Han-Romanization Mixed Script
–Others : Kana, Phonetic Symbols, Proverb, …
8
1. Introduction-3
1.3 Phonemes of Taiwanese
–Initials (18)
–Vowels (86)
–Tones (7)
–Compared with Mandarin : legal syllable 2726 vs 1200
9
1. Introduction-4
1.4 Some Key Points
–Not yet standardized
–The POJ characters are scattered across different blocks of the Unicode character set
–Need to annotate phonetic markers in corpora
–Interact with Taiwanese group
10
1. Introduction-5
1.5 Motivation
–My mother tongue
1.6 Definition and Glossary
1.7 Goal of This Dissertation
1.8 Organization
11
2. Resources and Survey
2.1 Resources
–Input method
–Dictionary
–Corpus
–Word segmentation
–Scripts conversion
–Text-to-speech
2.2 Survey
12
3. Coding and I/O of POJ
3.1 POJ Character Code
–Unicode encoding
3.2 Two Kinds of POJ Representation
–POJ and numbered POJ
13
3. Coding and I/O of POJ-2
3.3 Retrieval of POJ
–Issue : both case-sensitive and case-insensitive
–2-stage retrieval : execute an SQL command, then filter the results
–Fuzzy retrieval : toneless, glottal stop, checked syllable, vowel
–Examples
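The two-stage and toneless retrieval ideas above can be sketched as follows. This is only an illustration: the first stage is an in-memory stand-in for the SQL query, the function names `strip_tones` and `retrieve` are hypothetical, and the glottal-stop, checked-syllable and vowel fuzzy modes are not covered.

```python
import unicodedata

# Combining marks assumed to carry POJ tones 2, 3, 5, 7 and 8.
TONE_MARKS = {"\u0301", "\u0300", "\u0302", "\u0304", "\u030d"}

def strip_tones(text):
    """Remove tone diacritics but keep the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

def retrieve(query, entries, case_sensitive=False, toneless=False):
    # Stage 1: coarse match, ignoring case and tone (stand-in for the SQL query).
    coarse = strip_tones(query).casefold()
    candidates = [e for e in entries if strip_tones(e).casefold() == coarse]

    # Stage 2: filter the candidates according to the requested strictness.
    def norm(s):
        s = unicodedata.normalize("NFC", s)
        if toneless:
            s = strip_tones(s)
        if not case_sensitive:
            s = s.casefold()
        return s

    target = norm(query)
    return [e for e in candidates if norm(e) == target]

entries = ["lâng", "Lâng", "lang", "lāng"]
print(retrieve("lâng", entries))                       # case-insensitive
print(retrieve("lâng", entries, case_sensitive=True))  # exact case
print(retrieve("lâng", entries, toneless=True))        # toneless fuzzy match
```

Doing the coarse match first keeps the database query simple; the stricter comparison only has to run on the few candidates it returns.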
14
3. Coding and I/O of POJ-3
3.4 Display of POJ
–Strategy : Unicode (with specific fonts) or graphics
–POJ to numbered POJ
• lâng → la5ng → lang5
–Numbered POJ to POJ
• lang5 → la5ng → lâng
• Priority : o a e u i n m
• ou..5o5u ou5 ô.
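A minimal sketch of the numbered-POJ-to-POJ direction, assuming the tone digit is written at the end of the syllable and that the priority list on the slide (o a e u i n m) is applied literally: the tone mark goes on the first letter, in priority order, that occurs in the syllable. The function names and the digit-to-diacritic table are assumptions for illustration; the "ou" spelling and other special cases are not handled.

```python
import re
import unicodedata

# Assumed mapping from tone digit to combining diacritic (tones 1 and 4 are unmarked).
TONE_MARK = {"2": "\u0301", "3": "\u0300", "5": "\u0302", "7": "\u0304", "8": "\u030d"}
PRIORITY = "oaeuinm"  # priority order taken from the slide

def numbered_to_poj(syllable):
    match = re.fullmatch(r"([A-Za-z]+)([1-8])?", syllable)
    if match is None:
        return syllable  # leave anything unexpected untouched
    letters, tone = match.groups()
    mark = TONE_MARK.get(tone)
    if mark is None:
        return letters  # tone 1, tone 4, or no digit: no diacritic
    lowered = letters.lower()
    for target in PRIORITY:
        pos = lowered.find(target)
        if pos != -1:
            marked = letters[:pos + 1] + mark + letters[pos + 1:]
            return unicodedata.normalize("NFC", marked)
    return letters

def convert_text(text):
    """Convert every hyphen-separated syllable of every space-separated word."""
    return " ".join(
        "-".join(numbered_to_poj(s) for s in word.split("-"))
        for word in text.split()
    )

print(numbered_to_poj("lang5"))
print(convert_text("Kin-a2-jit8 thinn-khi3 chin ho2"))
```

Tone 8 has no precomposed Unicode letter, so the result keeps a combining mark after the base vowel; this is one reason the slides mention specific fonts.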
15
3. Coding and I/O of POJ-4
3.5 Word Processing Utilities for POJ
–Phoneme segmentation : backward direction
–Spelling checker
–Syllable / word / sentence count
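The counting utility can be sketched as below, assuming that words are separated by spaces, syllables within a word by hyphens, and sentences by ".", "!" or "?". The function name `poj_counts` is hypothetical and the conventions are assumptions, not the author's specification.

```python
import re

def poj_counts(text):
    """Count sentences, words and syllables in a POJ text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\s.,;:!?]+", text)
    # A double hyphen yields an empty piece, which is skipped.
    syllables = [s for w in words for s in w.split("-") if s]
    return {
        "sentences": len(sentences),
        "words": len(words),
        "syllables": len(syllables),
    }

print(poj_counts("Kin-a2-jit8 thinn-khi3 chin ho2."))
# {'sentences': 1, 'words': 4, 'syllables': 7}
```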
16
4. Tone Sandhi
4.1 Tone Sandhi Problem
–Types of tone sandhi
• Normal sandhi
• Following sandhi
• Neutral sandhi
• Double sandhi
• Pre-á sandhi
• Triplicate sandhi
• Rising sandhi
17
4. Tone Sandhi-2
4.1 Tone Sandhi Problem
–The most complicated in the Sinitic language family
–Need to find the boundaries of tone sandhi groups
18
4. Tone Sandhi-3
19
4. Tone Sandhi-4
4.2 Implementation
–Training and test data : POJ
–Tag set : A(adj) C(conj) D(adv) G(postposition) I(interjection) M(special marker) N(noun) P(prep) R(pron) S(time) T(aux) V(verb)
–Taiwanese-Mandarin dict & Chinese electronic dict
20
4. Tone Sandhi-5
4.3 Rule-based Algorithm
–20 rules
–Syllable / word / POS / sentence level
4.4 Result
–Training data : 97.39%
–Test data : 88.98%
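The 20 rules themselves are not listed on the slides. As a rough illustration of the simplest case only (normal sandhi inside a tone sandhi group whose boundaries are already known), the sketch below applies the commonly described tone circle to every syllable except the last. The mapping is an assumption about one common variety (tone 5 becomes 7 here; other varieties use 3), and the function names are hypothetical. It is not the author's algorithm.

```python
import re

# Assumed normal-sandhi mapping for non-checked syllables.
OPEN_TONE_SANDHI = {"1": "7", "7": "3", "3": "2", "2": "1", "5": "7"}

def split_tone(syllable):
    match = re.fullmatch(r"([A-Za-z]+)([1-8])?", syllable)
    if match is None:
        raise ValueError(f"not a numbered POJ syllable: {syllable!r}")
    letters, tone = match.groups()
    if tone is None:
        # Unmarked syllables: tone 4 if checked (-p/-t/-k/-h), otherwise tone 1.
        tone = "4" if letters[-1].lower() in "ptkh" else "1"
    return letters, tone

def sandhi_tone(letters, tone):
    if tone in ("4", "8"):
        if letters[-1].lower() == "h":
            return "2" if tone == "4" else "3"
        return "8" if tone == "4" else "4"
    return OPEN_TONE_SANDHI.get(tone, tone)

def apply_normal_sandhi(group):
    """Change every syllable except the last one in a tone sandhi group."""
    result = []
    for index, syllable in enumerate(group):
        letters, tone = split_tone(syllable)
        if index < len(group) - 1:
            tone = sandhi_tone(letters, tone)
        result.append(letters + tone)
    return result

print(apply_normal_sandhi(["chin", "ho2"]))  # ['chin7', 'ho2']
```

The hard part named on the slides, finding the group boundaries, is exactly what this sketch takes as given.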
21
5. Word Seg and Tagging
5.1 Word Segmentation
–For Han-Romanization mixed script
–Forward maximal matching (FMM) vs backward maximal matching (BMM)
• … 看台語 … : 看台 語 (FMM) or 看 台語 (BMM)?
–Ambiguity resolution : statistical
• P(看) × P(台語) >> P(看台) × P(語)
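The 看台語 example can be reproduced with a small sketch. The lexicon and its probabilities are invented toy values, and the function names are hypothetical; only the FMM/BMM procedures and the probability comparison come from the slide.

```python
# Toy lexicon with made-up unigram probabilities (assumption for illustration).
LEXICON = {"看": 0.02, "看台": 0.0001, "台語": 0.01, "語": 0.0005, "台": 0.001}

def fmm(text, lexicon=LEXICON):
    """Forward maximal matching: take the longest word starting at the left."""
    max_len = max(len(w) for w in lexicon)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in lexicon:
                words.append(cand)
                i += size
                break
    return words

def bmm(text, lexicon=LEXICON):
    """Backward maximal matching: take the longest word ending at the right."""
    max_len = max(len(w) for w in lexicon)
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            cand = text[j - size:j]
            if size == 1 or cand in lexicon:
                words.append(cand)
                j -= size
                break
    return words[::-1]

def score(words, lexicon=LEXICON, floor=1e-8):
    """Product of unigram probabilities; unknown words get a small floor."""
    p = 1.0
    for w in words:
        p *= lexicon.get(w, floor)
    return p

def segment(text):
    """When FMM and BMM disagree, keep the more probable segmentation."""
    return max((fmm(text), bmm(text)), key=score)

print(fmm("看台語"))      # ['看台', '語']
print(bmm("看台語"))      # ['看', '台語']
print(segment("看台語"))  # ['看', '台語']
```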
22
5. Word Seg and Tagging-2
5.2 POS Tagging
–Data : POJ and HR mixed parallel corpus
–Tag set : CKIP Chinese tagset
–Taiwanese-Mandarin dict
–Chinese bigram training data
23
5. Word Seg and Tagging-3
24
5. Word Seg and Tagging-4
5.2 POS Tagging
–Example (word [POJ]{candidate Mandarin translations}<selected translation>(POS tag)) :
• 因為 [in-ūi]{ 由於 ; 因為 }< 因為 >(Cbb)
• 等待 [tán-thāi]{ 留待 ; 等待 }< 等待 >(VK)
• 朋友 [pêng-iú]{ 友人 ; 朋友 }< 朋友 >(Na)
• , [,]<,>(COMMACATEGORY)
• 心適 [sim-sek]{ 好玩 ; 好玩兒 ; 有趣 ; 風趣 ; 愉快 ; 稀奇 ; 鬧著玩 }< 有趣 >(VH)
• 心適 [sim-sek]{ 好玩 ; 好玩兒 ; 有趣 ; 風趣 ; 愉快 ; 稀奇 ; 鬧著玩 }< 有趣 >(VH)
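The annotated example above follows a regular bracket pattern, so it can be read back into structured records. The sketch below assumes the layout word [POJ]{candidates}<selected>(tag), with the {candidates} part optional as in the comma token; the regular expression and the name `parse_tagged` are illustrative, not part of the original system.

```python
import re

TOKEN = re.compile(
    r"(?P<word>[^\s\[\]{}<>()]+)\s*"   # surface word
    r"\[(?P<poj>[^\]]*)\]"             # POJ reading
    r"(?:\{(?P<cands>[^}]*)\})?"       # optional candidate translations
    r"<(?P<chosen>[^>]*)>"             # selected translation
    r"\((?P<tag>[^)]*)\)"              # POS tag
)

def parse_tagged(line):
    tokens = []
    for m in TOKEN.finditer(line):
        cands = m.group("cands")
        tokens.append({
            "word": m.group("word"),
            "poj": m.group("poj").strip(),
            "candidates": [c.strip() for c in cands.split(";")] if cands else [],
            "chosen": m.group("chosen").strip(),
            "tag": m.group("tag"),
        })
    return tokens

line = ("因為 [in-ūi]{ 由於 ; 因為 }< 因為 >(Cbb)"
        "等待 [tán-thāi]{ 留待 ; 等待 }< 等待 >(VK)"
        ", [,]<,>(COMMACATEGORY)")
for token in parse_tagged(line):
    print(token["word"], token["poj"], token["chosen"], token["tag"])
```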
25
5. Word Seg and Tagging-5
5.2 POS Tagging
–Result : 91.49%
–Error analysis :
• Wrong Chinese translation word
• No best Chinese translation to select
• Unknown word
• Proper noun
• Propagation error
26
6. Collect/Annotate Corpora
6.1 Corpora Collection
–POJ (3M+ syllables)
–Han-Romanization Mixed (5M+ syllables)
–Sources :
• Project results
• Articles in magazines
• Academic papers
27
6. Collect/Annotate Corpora-2
6.2 Raw Corpus Pre-process
–Space between “-” and character
–“-” between Han char and POJ
–Alignment
28
6. Collect/Annotate Corpora-3
6.3 Corpus Annotation
–POS
–Semantic annotation
–Phonetic annotation
–Special pattern marker
29
7. Corpora Applications
7.1 Basic Count
–Syllable / word count
–Zipf's law
–Proportion of POJ in Han-Romanization mixed script
–Suggested orthography for inconsistent word usage
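For the Zipf's law check, a rank-frequency table is the usual starting point: under Zipf's law the product rank × frequency stays roughly constant. The sketch below is a generic illustration on a toy token list; the name `rank_frequency` is hypothetical.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]

tokens = "e5 e5 e5 e5 chin chin ho2".split()
for row in rank_frequency(tokens):
    print(row)
```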
30
7. Corpora Applications-2
7.2 Concordancer system
–For language learning
–For syntax study
7.3 Collocation
–MI & Correlation (χ²)
–VN, NV, AN, NN
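The two association measures can be computed from four counts: the pair count, the counts of each word, and the corpus size. The sketch below uses the standard pointwise mutual information and the 2×2 chi-square statistic; whether the dissertation uses exactly these variants is an assumption, and the counts in the usage lines are invented.

```python
import math

def mutual_information(c_xy, c_x, c_y, n):
    """Pointwise MI in bits: log2( P(x,y) / (P(x) * P(y)) )."""
    return math.log2(c_xy * n / (c_x * c_y))

def chi_square(c_xy, c_x, c_y, n):
    """Chi-square statistic of the 2x2 contingency table for the pair (x, y)."""
    a = c_xy              # x followed by y
    b = c_x - c_xy        # x without y
    c = c_y - c_xy        # y without x
    d = n - a - b - c     # neither
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Invented counts: the pair occurs 10 times, each word 20 times, in 100 bigrams.
print(mutual_information(10, 20, 20, 100))
print(chi_square(10, 20, 20, 100))  # 14.0625
```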
31
7. Corpora Applications-3
7.4 Lexical Change and Variation
–Two periods : before / after 1945
–Register :
• Japanese loanwords
• Mandarin loanwords
• Church register
32
7. Corpora Applications-4
7.4 Lexical Change and Variation
–Two Taiwanese Bible versions (New Testament) : 1916 and 1972
–Dialect difference
–Common words : 31%
–43% of words disappeared after 5 decades
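A vocabulary comparison of two versions can be sketched with set operations. The slides do not say how the percentages were defined, so the ratios below (shared words over the combined vocabulary, lost words over the older vocabulary) are assumptions, and the function name is hypothetical.

```python
def vocabulary_change(old_words, new_words):
    """Compare the word types of an older and a newer text."""
    old, new = set(old_words), set(new_words)
    return {
        # Share of the combined vocabulary found in both versions.
        "common_share": len(old & new) / len(old | new),
        # Share of the older vocabulary that is missing from the newer version.
        "disappeared_share": len(old - new) / len(old),
        # Share of the newer vocabulary that is absent from the older version.
        "new_share": len(new - old) / len(new),
    }

# Toy word lists standing in for the two versions.
print(vocabulary_change(["a", "b", "c", "d"], ["c", "d", "e"]))
```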
33
7. Corpora Applications-5
7.5 Language Learning and Test
7.6 Coarticulation
34
7. Corpora Applications-6
7.7 POJ / HR mixed script conversion
–POJ to HR mixed
• Kin-a2-jit8 thinn-khi3 chin ho2 → 今仔日天氣真好
• Lookup dictionary
• Bigram, unigram (5M syllables training data)
• (input method)
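The dictionary lookup plus bigram/unigram selection can be sketched as a small Viterbi search. Everything concrete here is a toy assumption: the dictionary entries, the counts, the back-off weight of 0.1, and the names `log_score` and `poj_to_mixed`. It illustrates the general technique, not the author's implementation.

```python
import math

# Hypothetical POJ-to-Han dictionary and toy counts.
DICTIONARY = {
    "kin-a2-jit8": ["今仔日"],
    "thinn-khi3": ["天氣"],
    "chin": ["真", "津"],
    "ho2": ["好"],
}
UNIGRAM = {"今仔日": 50, "天氣": 40, "真": 120, "津": 30, "好": 200}
BIGRAM = {("今仔日", "天氣"): 8, ("天氣", "真"): 12, ("真", "好"): 60}
TOTAL = sum(UNIGRAM.values())

def log_score(prev, word):
    """Bigram log-probability, backing off to a discounted unigram estimate."""
    unigram = (UNIGRAM.get(word, 0) + 1) / (TOTAL + len(UNIGRAM))
    if prev is None:
        return math.log(unigram)
    pair = BIGRAM.get((prev, word), 0)
    if pair and UNIGRAM.get(prev):
        return math.log(pair / UNIGRAM[prev])
    return math.log(0.1 * unigram)

def poj_to_mixed(text):
    best = {None: (0.0, [])}  # last chosen word -> (log score, path so far)
    for token in text.split():
        # Tokens missing from the dictionary are kept in POJ.
        candidates = DICTIONARY.get(token.lower(), [token])
        step = {}
        for cand in candidates:
            score, path = max(
                ((s + log_score(prev, cand), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[cand] = (score, path + [cand])
        best = step
    return "".join(max(best.values(), key=lambda item: item[0])[1])

print(poj_to_mixed("Kin-a2-jit8 thinn-khi3 chin ho2"))  # 今仔日天氣真好
```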
35
7. Corpora Applications-7
7.7 POJ / HR mixed script conversion
–HR mixed to POJ
• 今仔日天氣真好 → Kin-a2-jit8 thinn-khi3 chin ho2
• Word segmentation
• Lookup dictionary
• Bigram, unigram (3M syllables / words training data)
36
8. Future Work
8.1 Summary
8.2 Future Work
–Parser
–Machine translation
–OCR
–Submit corpora to LDC
37
8. Future Work
I hope this dissertation will become a textbook on written Taiwanese processing (written in Taiwanese or Mandarin).
敬請指教 Kèng-chhián chí-kàu
Please advise.