楊允言 iunn un-gian 2008.7.14 台語文特性分析及其處理技術 written taiwanese : its...

楊允言 Iunn Un-gian

2008.7.14

台語文特性分析及其處理技術

Written Taiwanese : Its Characteristic Analysis and Processing Techniques

2

Vita

• 1984-1988 NTU CSIE undergraduate

• 1990/8-1994/1 Sinica IIS assistant

• 1991-1993 NTHU IS graduate

• 1994/2-1996/11 NTU CC programmer

• 1996 – moved to Hualien

3

Vita-2

• 1999 Dahan I.T. CSIE lecturer

• 2003/8 - assistant prof.

• 2004 - NTU CSIE PhD program

• Journal : IJCLCLP 12(4)

• Project : NSC 3, NMTL 1, Academia Historica 1

4

Outline

1.Introduction

2.Resources and Survey of Written Taiwanese Processing

3.Coding and I/O of POJ

4.Tone Sandhi Problem and Algorithm

5

Outline-2

5.Word Segmentation and Tagging Methods

6.Corpora Collection and Annotation

7.Some Applications of Written Taiwanese Corpora

8.Conclusion and Future Work

6

1. Introduction

1.1 Background

–Population : 46M (2005)

–Distribution : Taiwan, Singapore, Malaysia, Brunei, China, Thailand, Philippines, Indonesia

–Rank : 21

–Confused Name : Southern-Min ? Amoy ? Taiwanese ?

7

1. Introduction-2

1.2 Different Scripts

–Han Characters Script

–Romanization Script (POJ)

–Han-Romanization Mixed Script

–Others : Kana, Phonetic Symbols, Proverb, …

8

1. Introduction-3

1.3 Phonemes of Taiwanese

–Initials (18)

–Vowels (86)

–Tones (7)

–Compared with Mandarin : 2726 legal syllables vs 1200

9

1. Introduction-4

1.4 Some Key Points

–Not yet standardized

–The POJ characters are scattered across different blocks of Unicode

–Need to annotate phonetic markers in corpora

–Interact with Taiwanese group

10

1. Introduction-5

1.5 Motivation

–My mother tongue

1.6 Definition and Glossary

1.7 Goal of This Dissertation

1.8 Organization

11

2. Resources and Survey

2.1 Resources

–Input method

–Dictionary

–Corpus

–Word segmentation

–Scripts conversion

–Text-to-speech

2.2 Survey

12

3. Coding and I/O of POJ

3.1 POJ Character Code

–Unicode encoding

3.2 Two Kinds of POJ Representation

–POJ and numbered POJ

13

3. Coding and I/O of POJ-2

3.3 Retrieval of POJ

–Issue : both case-sensitive and case-insensitive retrieval are needed

–2-stage retrieval : execute the SQL command, then filter the results

–Fuzzy retrieval : toneless, glottal stop, checked syllable, vowel

–Examples
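A minimal sketch of the 2-stage idea, not the system described in the dissertation. It assumes entries are stored as numbered POJ in a hypothetical SQLite table named lexicon: stage 1 runs a case-insensitive SQL match, stage 2 filters the candidates in Python when a case-sensitive answer is wanted. The toneless() helper illustrates one of the fuzzy keys (ignoring tone); all names and sample entries are illustrative.

```python
import re
import sqlite3


def toneless(numbered_poj):
    """Fuzzy key: drop tone digits so 'lang5' and 'lang7' compare equal."""
    return re.sub(r"[1-9]", "", numbered_poj)


def retrieve(conn, query, case_sensitive=True):
    # Stage 1: let SQL do a cheap case-insensitive match.
    rows = conn.execute(
        "SELECT entry FROM lexicon WHERE entry = ? COLLATE NOCASE", (query,)
    ).fetchall()
    hits = [row[0] for row in rows]
    # Stage 2: filter in Python when the caller wants exact case.
    if case_sensitive:
        hits = [h for h in hits if h == query]
    return hits


# Hypothetical toy data: a proper noun and its lower-case spelling.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicon (entry TEXT)")
conn.executemany(
    "INSERT INTO lexicon VALUES (?)",
    [("Tai5-oan5",), ("tai5-oan5",), ("lang5",)],
)
```

Keeping the SQL stage case-insensitive means one query serves both modes; only the filter changes.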

14


3.4 Display of POJ

–Strategy : Unicode (with specific fonts) or graphics

–POJ to numbered POJ

• lâng → la5ng → lang5

–Numbered POJ to POJ

• lang5 → la5ng → lâng

• Priority : o a e u i n m

• ou..5o5u ou5 ô.
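The numbered-to-diacritic direction can be sketched as follows. This is an illustration under stated assumptions, not the author's code: it applies the priority list o a e u i n m literally (the first letter in that order found in the syllable carries the mark), uses Unicode combining marks for tones 2, 3, 5, 7 and 8, and leaves tones 1 and 4 unmarked. Real POJ orthography has further conventions that this sketch does not model.

```python
import re
import unicodedata

# Combining marks for the marked tones (tones 1 and 4 carry no mark).
TONE_MARKS = {
    "2": "\u0301",  # acute
    "3": "\u0300",  # grave
    "5": "\u0302",  # circumflex
    "7": "\u0304",  # macron
    "8": "\u030d",  # vertical line above
}
PRIORITY = "oaeuinm"  # priority order taken from the slide


def numbered_to_poj(syllable):
    """Convert one numbered syllable, e.g. 'lang5' -> 'lâng'."""
    m = re.fullmatch(r"([A-Za-z]+)([1-8])?", syllable)
    if not m:
        return syllable  # leave anything unexpected untouched
    letters, tone = m.group(1), m.group(2)
    mark = TONE_MARKS.get(tone)
    if mark is None:
        return letters  # unmarked tone: just drop the digit
    lower = letters.lower()
    for target in PRIORITY:
        pos = lower.find(target)
        if pos != -1:
            # Insert the combining mark right after the chosen letter.
            marked = letters[:pos + 1] + mark + letters[pos + 1:]
            return unicodedata.normalize("NFC", marked)
    return letters


def word_to_poj(word):
    """Convert a hyphen-joined word syllable by syllable."""
    return "-".join(numbered_to_poj(s) for s in word.split("-"))
```

NFC normalization yields a precomposed character where Unicode has one (â, ó) and keeps the combining sequence otherwise (the tone-8 mark), which is why specific fonts matter for display.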

15


3.5 Word Processing Utilities for POJ

–Phoneme segmentation : backward direction

–Spelling checker

–Syllable / word / sentence count
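A rough sketch of the counting utility, assuming the usual POJ conventions that spaces separate words, hyphens separate syllables inside a word, and a sentence ends at ".", "!" or "?". The function name and the punctuation set are illustrative choices.

```python
import re


def poj_counts(text):
    """Count syllables, words and sentences in a POJ string."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # A word is a run of characters that is neither space nor punctuation.
    words = re.findall(r"[^\s.,!?;:]+", text)
    # Hyphens (single or double) separate syllables within a word.
    syllables = [s for w in words for s in re.split(r"-+", w) if s]
    return {
        "syllables": len(syllables),
        "words": len(words),
        "sentences": len(sentences),
    }
```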

16

4. Tone Sandhi

4.1 Tone Sandhi Problem

–Types of tone sandhi

• Normal sandhi

• Following sandhi

• Neutral sandhi

• Double sandhi

• Pre-á sandhi

• Triplicate sandhi

• Rising sandhi

17

4. Tone Sandhi-2

4.1 Tone Sandhi Problem

–The most complicated among the Sinitic languages

–Need to find the boundary of tone sandhi group
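To show why the group boundary matters, here is a small sketch of normal sandhi only. It assumes the commonly described tone table (the variant in which tone 5 becomes tone 7; other accents use 3), explicit tone digits on every syllable, and a sandhi group that is already given as input. It is not the 20-rule algorithm of section 4.3, and finding the boundaries is exactly the part it leaves out.

```python
import re

# Commonly described normal-sandhi table (assumed variant: 5 -> 7).
OPEN_SANDHI = {"1": "7", "2": "1", "3": "2", "5": "7", "7": "3"}
STOP_SANDHI = {"4": "8", "8": "4"}     # checked syllables ending in -p/-t/-k
GLOTTAL_SANDHI = {"4": "2", "8": "3"}  # checked syllables ending in -h


def sandhi_tone(syllable):
    """Return the sandhi tone digit of one numbered syllable."""
    m = re.fullmatch(r"([A-Za-z]+)([1-8])", syllable)
    if m is None:
        raise ValueError(f"expected an explicit tone digit: {syllable!r}")
    base, tone = m.groups()
    if tone in STOP_SANDHI:
        table = GLOTTAL_SANDHI if base.endswith("h") else STOP_SANDHI
        return table[tone]
    return OPEN_SANDHI.get(tone, tone)


def surface_tones(group):
    """Tones as pronounced: every syllable but the last one changes."""
    tones = [sandhi_tone(s) for s in group[:-1]]
    if group:
        tones.append(group[-1][-1])  # last syllable keeps its citation tone
    return tones
```

Moving the group boundary by one syllable changes which syllable keeps its citation tone, so a wrong boundary produces a wrong reading even when the table itself is right.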

18

4. Tone Sandhi-3

19

4. Tone Sandhi-4

4.2 Implementation

–Training and test data : POJ

–Tag set : A(adj) C(conj) D(adv) G(postposition) I(interjection) M(special marker) N(noun) P(prep) R(pron) S(time) T(aux) V(verb)

–Taiwanese-Mandarin dict & Chinese electronic dict

20

4. Tone Sandhi-5

4.3 Rule-based Algorithm

–20 rules

–Syllable / word / POS / sentence level

4.4 Result

–Training data : 97.39%

–Test data : 88.98%

21

5. Word Seg and Tagging

5.1 Word Segmentation

–For Han-Romanization mixed script

–Forward maximal matching (FMM) vs Backward maximal matching (BMM)

• … 看台語 … : 看台 / 語 (FMM) or 看 / 台語 (BMM)?

–Ambiguity : resolved statistically

• P( 看 )×P( 台語 ) >> P( 看台 )×P( 語 )
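The FMM/BMM comparison above can be sketched as follows. The lexicon and the unigram probabilities are invented toy values chosen only to reproduce the 看台語 example; the function names are illustrative, not taken from the dissertation.

```python
def fmm(text, lexicon, max_len=4):
    """Forward maximal matching: longest lexicon word from the left."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # single chars always match
                words.append(piece)
                i += size
                break
    return words


def bmm(text, lexicon, max_len=4):
    """Backward maximal matching: longest lexicon word from the right."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def choose(candidates, prob, floor=1e-9):
    """Pick the segmentation with the larger product of unigram probs."""
    def score(segmentation):
        p = 1.0
        for w in segmentation:
            p *= prob.get(w, floor)  # unseen words get a small floor value
        return p
    return max(candidates, key=score)


# Hypothetical unigram probabilities, for illustration only.
PROB = {"看": 0.01, "台語": 0.005, "看台": 0.0001, "語": 0.0005}
LEXICON = set(PROB)
```

With these toy values, choose([fmm("看台語", LEXICON), bmm("看台語", LEXICON)], PROB) selects the BMM reading 看 / 台語, matching the inequality on the slide.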

22

5. Word Seg and Tagging-2

5.2 POS Tagging

–Data : POJ and HR mixed parallel corpus

–Tag set : CKIP Chinese tagset

–Taiwanese-Mandarin dict

–Chinese bigram training data

23


24


5.2 POS Tagging

–Example :

• 因為 [in-ūi]{ 由於 ; 因為 }< 因為 >(Cbb)

• 等待 [tán-thāi]{ 留待 ; 等待 }< 等待 >(VK)

• 朋友 [pêng-iú]{ 友人 ; 朋友 }< 朋友 >(Na)

• ， [,]<,>(COMMACATEGORY)

• 心適 [sim-sek]{ 好玩 ; 好玩兒 ; 有趣 ; 風趣 ; 愉快 ; 稀奇 ; 鬧著玩 }< 有趣 >(VH)

• 心適 [sim-sek]{ 好玩 ; 好玩兒 ; 有趣 ; 風趣 ; 愉快 ; 稀奇 ; 鬧著玩 }< 有趣 >(VH)

25


5.2 POS Tagging

–Result : 91.49%

–Error analysis :

• Wrong Chinese translation word

• No best Chinese translation to select

• Unknown word

• Proper noun

• Propagation error

26

6. Collect/Annotate Corpora

6.1 Corpora Collection

–POJ (3M+ syllables)

–Han-Romanization Mixed (5M+ syllables)

–Sources :

• Project results

• Articles in magazines

• Academic papers

27

6. Collect/Annotate Corpora-2

6.2 Raw Corpus Pre-process

–Space between “-” and char

–“-” between Han char and POJ

–Alignment

28

6. Collect/Annotate Corpora-3

6.3 Corpus Annotation

–POS

–Semantic annotation

–Phonetic annotation

–Special pattern marker

29

7. Corpora Applications

7.1 Basic Count

–Syllable / word count

–Zipf's law

–Proportion of POJ in Han-Romanization mixed script

–Suggested orthography for inconsistent word usage

30

7. Corpora Applications-2

7.2 Concordancer system

–For language learning

–For syntax study

7.3 Collocation

–MI & Correlation (χ²)

–VN, NV, AN, NN
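The two collocation scores named above can be sketched from raw bigram counts. This uses the standard textbook formulas for pointwise mutual information and for the χ² statistic of a 2×2 contingency table; the argument names are illustrative (c_xy is the count of the pair, c_x the count of x in first position, c_y the count of y in second position, n the total number of bigrams).

```python
import math


def pointwise_mi(c_xy, c_x, c_y, n):
    """MI = log2( P(x,y) / (P(x) P(y)) ), computed from counts."""
    return math.log2(c_xy * n / (c_x * c_y))


def chi_square(c_xy, c_x, c_y, n):
    """Chi-square statistic of the 2x2 table for the pair (x, y)."""
    o11 = c_xy                    # x followed by y
    o12 = c_x - c_xy              # x followed by something else
    o21 = c_y - c_xy              # something else followed by y
    o22 = n - c_x - c_y + c_xy    # neither
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den
```

Both scores are zero-centred in the same sense: when x and y co-occur exactly as often as chance predicts, MI is 0 and χ² is 0.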

31


7.4 Lexical Change and Variation

–Two periods : before / after 1945

–Register :

• Japanese loanwords

• Mandarin loanwords

• church register

32


7.4 Lexical Change and Variation

–Two Taiwanese Bible versions (New Testament) : 1916 and 1972

–Dialect difference

–Common words : 31%

–43% of words disappeared after 5 decades

33


7.5 Language Learning and Test

7.6 Coarticulation

34


7.7 POJ / HR mixed script conversion

–POJ to HR mixed

• Kin-a2-jit8 thinn-khi3 chin ho2 → 今仔日天氣真好

• Lookup dictionary

• Bigram , unigram ( 5M syllables training data )

• (input method)
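A toy sketch of the dictionary-lookup step with a unigram choice among homophones. The dictionary, its counts and the function name are all hypothetical; a bigram model, as listed above, would score neighbouring words jointly and is not shown here. Words missing from the dictionary are left in POJ, which naturally yields mixed script.

```python
# Hypothetical toy dictionary: numbered POJ word -> {Han form: count}.
TOY_DICT = {
    "kin-a2-jit8": {"今仔日": 120},
    "thinn-khi3": {"天氣": 80},
    "chin": {"真": 900, "珍": 15},
    "ho2": {"好": 700},
}


def poj_to_hr(sentence, lexicon):
    """Convert a numbered-POJ sentence word by word."""
    out = []
    for word in sentence.split():
        candidates = lexicon.get(word.lower())
        if candidates:
            # Unigram choice: take the most frequent Han form.
            out.append(max(candidates, key=candidates.get))
        else:
            out.append(word)  # unknown word stays romanized
    return out
```

The same lookup, driven by keystrokes instead of a finished sentence, is essentially what an input method does.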

35


7.7 POJ / HR mixed script conversion

–HR mixed to POJ

• 今仔日天氣真好 → Kin-a2-jit8 thinn-khi3 chin ho2

• Word segmentation

• Lookup dictionary

• Bigram , unigram ( 3M syllables / words training data )

36

8. Future Work

8.1 Summary

8.2 Future Work

–Parser

–Machine translation

–OCR

–Submit corpora to LDC

37

8. Future Work

I hope this dissertation will become a textbook on written Taiwanese processing ( written in Taiwanese or Mandarin )

敬請指教 Kèng-chhián chí-kàu (Please advise.)
