

Automatic Correction for Graphemic Chinese Misspelled Words

Tao-Hsing Chang, Shou-Yen Su

Department of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences

[email protected], [email protected]

Hsueh-Chih Chen

Department of Educational Psychology and Counseling
National Taiwan Normal University

[email protected]

[Chinese abstract: the text is not recoverable from the source; the surviving fragments mention bi-gram models, SVM, and neural networks.]

Abstract

Whether Chinese is learned as a first or a second language, misspelled words are an important issue that must be addressed. Many studies have offered suggestions for correcting the misspelled words of students in school, as well as strategies for teaching and learning Chinese characters. Although schools take many precautions and make many corrections to keep students from producing misspelled words, students are sometimes unaware of their misspellings while writing. Therefore, in addition to emphasizing the recognition of misspelled words in teaching, preventing misspelled words from arising while words are being used has become a critical issue. Nevertheless, finding misspelled words in documents automatically and correctly by formula is not an easy matter. Researchers are currently studying the detection of graphemic misspelled words and applying it to

Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)


different fields, but the accuracy still falls far short of real-world demands. If the patterns, probabilities, and contexts of misspelled words can be analyzed in detail, misspelled words can be detected more quickly and precisely, and corrected effectively. We have already accumulated considerable research experience on graphemic misspelled words. This project combines resources provided by the mainline project to tackle the problem of graphemic misspelled words. If a breakthrough can be achieved, it will not only offer an effective auxiliary tool for teaching Chinese misspelled words, but also help to build a corpus of Chinese character errors more quickly as a learning tool.

Keywords: Misspelled Words Detection, Misspelled Words Correction, Graphemic Similarity

[Introduction and related work: the Chinese text is not recoverable from the source. The surviving fragments show that the passage surveyed prior approaches to Chinese spelling error detection and correction, including Chang [1] (1995; noted for false alerts), Ren et al. [4], Lin et al. [6] (2002), the bi-gram approach of Huang et al. [2], the studies in [7] and [8], and methods based on templates, translation, POS tags, and language models.]


[Method overview: the Chinese text is not recoverable from the source. The fragments indicate a lexicon-based step, bi-gram features, and classifiers including SVM, neural network, and linear regression, applied to the input text.]


[Character structure data: the Chinese text is not recoverable from the source. The fragments show that the character decomposition follows [9], that characters are handled as Windows Unicode with a "[]" bracket notation, and that counts of 439, 246, and 193 items are reported.]


[Structure operators: the Chinese example characters are not recoverable from the source. Eleven composition operators are defined; writing c1, c2, ... for components, the surviving operator symbols and arities are:

(1) (X): an atomic, non-decomposable character.
(2) (-): -(c1, c2).
(3) (|): |(c1, c2) or |(c1, c2, c3).
(4) (0): 0(c1, c2).
(5) (/): /(-(c1, c2), c3) or /(c1, -(c2, c3, c4)).
(6) (\): \(c1, c2) or /(c1, -(c2, c3)).
(7) (L): L(c1, c2) or L(c1, -(c2, c3)).
(8) (an operator whose symbol is lost): op(c1, -(c2, c3)).
(9) (V): V(c1, c2) or V(c1, c2, c3).
(10) (<): <(c1, c2) or <(c1, -(c2, |(c3, c4))).
(11) (T): T(c1, c2, c3) or T(c1, c2, c3, c4, c5).

Operators nest, e.g. -(|(<(c1, c2), c3), c4), so a structure expression forms a tree whose root is the outermost operator, with leaf nodes (components), subtrees, and branch nodes.]


[Levels and weights: the Chinese text is not recoverable from the source. Each node of a structure expression such as -(|(<(c1, c2), c3), c4) has a level and a weight; the subtree |(<(c1, c2), c3) sits one level below the root. The weight at level x is (1 - (x - 1) * 0.05). Worked example from the fragments: if 3 of 7 components match at level 2, the weighted score is 3/7 * 0.95 = 0.407; at level 1 it would be 3/7 * 1 = 0.429.]
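The surviving arithmetic (weight = 1 - 0.05 * (level - 1), score = matched/total * weight) can be sketched as follows. This is a minimal illustration; the function names are mine, not the paper's.

```python
# Level-weighted component matching. Only the weight formula and the
# matched/total ratio come from the paper; function names are invented.

def level_weight(level: int) -> float:
    # Level 1 -> 1.0, level 2 -> 0.95, level 3 -> 0.90, ...
    return 1 - (level - 1) * 0.05

def weighted_match_score(matched: int, total: int, level: int) -> float:
    # Share of matching components at a level, scaled by the level weight.
    return matched / total * level_weight(level)

# Worked example from the text: 3 of 7 components match.
print(round(weighted_match_score(3, 7, 2), 3))  # 3/7 * 0.95 = 0.407
print(round(weighted_match_score(3, 7, 1), 3))  # 3/7 * 1.00 = 0.429
```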


[Similarity computation examples: the Chinese text is not recoverable from the source. The fragments report a case with 12 of 18 components matching at level 2 (12/18 * 0.95 = 0.633), and a case where the two directional ratios are averaged: with 0 + 13 = 13 matches out of 16 on each side, ((13/16) + (13/16)) / 2 = 0.8125.]


[Further examples and statistics: the Chinese text is not recoverable from the source. The fragments include a pair scored ((2/7) + (2/7)) / 2 = 0.2857, a pair with 3 of 5 and 3 of 15 components matching scored ((3/5) + (3/15)) / 2 = 0.4, and a ratio 3/15 = 0.2. A set of 6097 characters is analyzed (a 30% sample is mentioned, along with the figures 10, 10, and 55), and operator frequencies are reported: (0) 28 times, (V) 4, (<) 15, (T) 35, with (-) and (|) also discussed.]


[Threshold examples: the Chinese text is not recoverable from the source. The fragments mention scores of 0.88 and 0.9 and a computation (8/12) * 0.8 = 0.5333.]

[Bi-gram detection: the Chinese text is not recoverable from the source, including Eq. (1). The fragments indicate that a bi-gram model is trained on a corpus (the figures 2003 and 7205 survive) and that, for a sentence S of n characters, the bi-gram scores of adjacent character pairs are used to flag suspicious positions.]
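The surviving fragments describe a character bi-gram model used to flag suspicious character pairs. A minimal sketch of such a scorer follows; the corpus, smoothing scheme, and all names are my assumptions, since the paper's Eq. (1) is not recoverable.

```python
# Minimal character bi-gram scorer for flagging suspicious character pairs.
# The toy corpus and add-one smoothing are invented for illustration.
from collections import Counter

def bigram_counts(corpus):
    uni, bi = Counter(), Counter()
    for sent in corpus:
        uni.update(sent)                      # unigram character counts
        bi.update(zip(sent, sent[1:]))        # adjacent character pairs
    return uni, bi

def bigram_prob(a, b, uni, bi):
    # P(b | a) with add-one smoothing over the observed vocabulary.
    v = len(uni)
    return (bi[(a, b)] + 1) / (uni[a] + v)

corpus = ["abcab", "abcb"]
uni, bi = bigram_counts(corpus)
# A frequent pair scores higher than an unseen one, so rare pairs
# can be flagged as suspicious positions.
print(bigram_prob("a", "b", uni, bi) > bigram_prob("a", "c", uni, bi))
```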


[POS bi-gram features: the Chinese text is not recoverable from the source, including Eqs. (2)-(4). The fragments show POS-tagged examples with the tag sequences (Nd) (P) (Na) (Nb) (Vc) and (Nd) (P) (Nc) (Vc), a window construction over positions x to x + k yielding pattern sets A1 and A2, and POS bi-gram probabilities such as P(Vh|Df) = 0.721 for the pair Df-Vh and P(De|Ve) = 0.01206 for De-Ve; P(X) feeds into Eq. (4).]
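The POS bi-gram probabilities quoted above (e.g. P(Vh|Df) = 0.721) are conditional probabilities estimated from a tagged corpus. A sketch of that estimation follows; the toy tag sequences are invented, so the printed value will not match the paper's figures.

```python
# Estimating POS bi-gram probabilities P(t2 | t1) from tagged sequences.
# The paper's corpus is not available; the tag sequences below are toys.
from collections import Counter

def pos_bigram_model(tag_sequences):
    uni, bi = Counter(), Counter()
    for tags in tag_sequences:
        uni.update(tags)                 # tag frequencies
        bi.update(zip(tags, tags[1:]))   # adjacent tag pairs
    def prob(t1, t2):
        # Maximum-likelihood estimate of P(t2 | t1).
        return bi[(t1, t2)] / uni[t1] if uni[t1] else 0.0
    return prob

prob = pos_bigram_model([["Nd", "P", "Nc", "Vc"], ["Nd", "P", "Na", "Vc"]])
print(prob("Nd", "P"))  # both Nd tokens are followed by P -> 1.0
```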


[Score combination: the Chinese text is not recoverable from the source, including Eq. (5). The fragments indicate that the graphemic similarity and the bi-gram scores from Eqs. (3) and (4) are combined into a single decision score y; the figures 1 and 50% also survive.]


[Experiments: the Chinese text is not recoverable from the source. The classifiers compared are an SVM (Support Vector Machine) and an NN (Neural Network); a back-propagation neural network and the RBF kernel are mentioned.]

[Data: the Chinese text is not recoverable from the source. The fragments report corpus figures of 11871, 61, 3, 258 and 6015, 81, 810.625, 647; the evaluation set contains 81 misspelled and 647 correct samples. The surviving similarity-score distribution reconstructs as follows (column assignment inferred from the totals; the 24.69% entry corrects the source's 23.69%, since 20/81 = 24.69%):

Score range | Misspelled (of 81) | Correct (of 647)
1~0.9       | 10 (12.35%)        | 1 (0.15%)
0.9~0.8     | 49 (60.49%)        | 20 (3.09%)
0.8~0.7     | 20 (24.69%)        | 106 (16.38%)
0.7~0.6     | 2 (2.47%)          | 520 (80.37%)
Total       | 81                 | 647]

4-fold cross validation was used over the 81 misspelled samples, split into folds of about 20 each (test sets of 20/40, 21, and 42 cases appear in the fragments), with each fold tested against a model trained on the other three. The evaluation measures are the standard ones (the paper's own formulas are not recoverable from the source):

Accuracy = correctly classified cases / all cases; Recall = TP / (TP + FN); Precision = TP / (TP + FP).
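The metrics above can be computed as follows, assuming the standard confusion-matrix definitions (the paper's exact formulas did not survive extraction); the example counts are invented but chosen to match one fold's 95%/95%/95% figures.

```python
# Standard detection metrics from confusion-matrix counts.
# The paper's own formulas are not recoverable; these are the usual ones.
def metrics(tp: int, fp: int, fn: int, tn: int):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)          # share of true errors that were found
    precision = tp / (tp + fp)       # share of flagged cases that were errors
    return accuracy, recall, precision

# Toy fold: 40 test cases, 38 classified correctly.
acc, rec, prec = metrics(tp=19, fp=1, fn=1, tn=19)
print(acc, rec, prec)  # 0.95 0.95 0.95
```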


1. SVM

The SVM experiments used LibSVM [3]. Parameters were selected with LibSVM's grid search, yielding an RBF kernel with cost 32 and gamma 0.5. The overall accuracy is 95.63%, the recall rate 97.50%, and the precision rate 94.10%.

SVM results per fold (fold labels such as 234 denote the three training folds; columns: training folds, test fold, test cases, correct, accuracy, recall rate, precision rate):

234 | 1 | 42 | 42 | 100%   | 100%   | 100%
134 | 2 | 40 | 37 | 92.50% | 95.00% | 90.48%
124 | 3 | 40 | 38 | 95.00% | 100%   | 90.90%
123 | 4 | 40 | 38 | 95.00% | 95.00% | 95.00%
Average: 95.63% | 97.50% | 94.10%
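The reported LibSVM configuration (RBF kernel, cost 32, gamma 0.5) can be reproduced with scikit-learn's SVC, which wraps libsvm. The feature vectors and labels below are invented for illustration; the paper's actual features are not recoverable.

```python
# SVM classifier with the paper's reported LibSVM settings:
# RBF kernel, cost C = 32, gamma = 0.5. The toy data is invented.
from sklearn.svm import SVC

X = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.1, 0.2, 0.3], [0.2, 0.1, 0.2]]
y = [1, 1, 0, 0]  # 1 = misspelled pair, 0 = correct pair

clf = SVC(kernel="rbf", C=32, gamma=0.5)
clf.fit(X, y)
print(clf.predict([[0.85, 0.8, 0.65]]))  # classified as 1 on this toy data
```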

2. Neural Network

The neural network experiments used Qnet2000, with a 3-layer network of 3 input, 10 hidden, and 1 output units, sigmoid activation, a learning rate of 0.01, momentum 0.8, and at most 10000 iterations. The overall accuracy is 96.25%, the recall rate 96.25%, and the precision rate 96.37%.

Neural network results per fold (columns: training folds, test fold, test cases, correct, accuracy, recall rate, precision rate):

234 | 1 | 42 | 42 | 100%   | 100%   | 100%
134 | 2 | 40 | 39 | 97.50% | 95.00% | 100%
124 | 3 | 40 | 37 | 92.50% | 95.00% | 90.48%
123 | 4 | 40 | 38 | 95.00% | 95.00% | 95.00%
Average: 96.25% | 96.25% | 96.37%
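The reported back-propagation setup (3-10-1 topology, sigmoid units, learning rate 0.01, momentum 0.8, up to 10000 iterations) can be approximated with scikit-learn's MLPClassifier standing in for Qnet2000; the toy data is invented.

```python
# Back-propagation network mirroring the reported Qnet2000 settings.
# MLPClassifier is a stand-in for Qnet2000; the data is invented.
from sklearn.neural_network import MLPClassifier

X = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.1, 0.2, 0.3], [0.2, 0.1, 0.2]]
y = [1, 1, 0, 0]

clf = MLPClassifier(hidden_layer_sizes=(10,),      # one hidden layer of 10
                    activation="logistic",         # sigmoid units
                    solver="sgd",
                    learning_rate_init=0.01,
                    momentum=0.8,
                    max_iter=10000,
                    random_state=0)
clf.fit(X, y)
print(clf.predict([[0.85, 0.8, 0.65]]))
```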

3. Linear Regression

A linear regression over the graphemic similarity and bi-gram features was also tested (the Chinese description is not recoverable from the source). The overall accuracy is 97.50%, the recall rate


96.25%, and the precision rate 98.75%.

Fitted regression values per fold (columns: training folds, then three feature weights and a final value; the column meanings are not recoverable from the source):

234 | 0.476108 | 0.047314 | 0.495112 | 0.82
134 | 0.533153 | 0.041051 | 0.426089 | 0.83
124 | 0.515488 | 0.065720 | 0.446411 | 0.85
123 | 0.527719 | 0.038126 | 0.446539 | 0.85

Linear regression results per fold (columns: training folds, test fold, test cases, correct, accuracy, recall rate, precision rate):

234 | 1 | 42 | 42 | 100%   | 100%   | 100%
134 | 2 | 40 | 39 | 97.50% | 95.00% | 100%
124 | 3 | 40 | 38 | 95.00% | 95.00% | 95.00%
123 | 4 | 40 | 39 | 97.50% | 95.00% | 100%
Average: 97.50% | 96.25% | 98.75%
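A linear combination with a decision cut-off can be sketched from the fold-234 row above. Treating the first three numbers as feature weights and the last (0.82) as a decision threshold is an assumption, since the original column labels did not survive; the feature values are invented.

```python
# Weighted combination of three feature scores with a decision threshold.
# Weights from the fold-234 row; treating 0.82 as a threshold is an
# assumption, and the input feature values are invented.
w = (0.476108, 0.047314, 0.495112)
THRESHOLD = 0.82

def combined_score(features):
    # Weighted sum of the three feature scores.
    return sum(wi * fi for wi, fi in zip(w, features))

def is_misspelled(features):
    # Flag the pair when the combined score reaches the threshold.
    return combined_score(features) >= THRESHOLD

print(is_misspelled((0.9, 0.8, 0.9)))  # high scores exceed the threshold
print(is_misspelled((0.1, 0.1, 0.1)))  # low scores fall below it
```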

4. Comparison

Comparing the three models, the SVM achieved 95.63% overall accuracy, the neural network 96.25%, and the linear regression 97.5%.

Model | Accuracy | Recall Rate | Precision Rate
SVM   | 95.63%   | 97.50%      | 94.10%
NN    | 96.25%   | 96.25%      | 96.37%
LR    | 97.50%   | 96.25%      | 98.75%

[Discussion: the Chinese text is not recoverable from the source. The fragments contrast precision rate and recall rate across the models, note the SVM's lower precision rate, and mention that the roughly 1:8 ratio of misspelled to correct samples affects the precision rate.]

[Conclusion: the Chinese text is not recoverable from the source. The fragments refer to the graphemic features of [9] and the bi-gram model.]


[Closing remarks and acknowledgments: the Chinese text is not recoverable from the source apart from a mention of precision rate and the project numbers 101J1A0301 and 101J1A0701.]

References

[1] Chao-Huang Chang. A New Approach for Automatic Chinese Spelling Correction. In Proceedings of the Natural Language Processing Pacific Rim Symposium '95, Seoul, Korea, pp. 278-283, 1995.

[2] Chuen-Min Huang, Mei-Chen Wu, and Ching-Che Chang. Error Detection and Correction Based on Chinese Phonemic Alphabet in Chinese Text. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, World Scientific Publishing Company, Vol. 16, Suppl. 1, pp. 89-105, 2008.

[3] Chih-Chung Chang and Chih-Jen Lin. LIBSVM -- A Library for Support Vector Machines. Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/

[4] Fuji Ren, Hongchi Shi, and Qiang Zhou. A Hybrid Approach to Automatic Chinese Text Checking and Error Correction. In Proceedings of the 2001 IEEE International Conference on Systems, Man, and Cybernetics, pp. 1693-1698, 2001.

[5] J. A. K. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters, Vol. 9, No. 3, pp. 293-300, 1999.

[6] Yih-Jeng Lin, Feng-Long Huang, and Ming-Shing Yu. A Chinese Spelling Error Correction System. In Proceedings of the Seventh Conference on Artificial Intelligence and Applications, 2002.

[7] (Chinese-language reference, 2009; authors and title not recoverable from the source.)

[8] (Chinese-language reference, 2010; authors and title not recoverable from the source.)

[9] (Chinese-language reference, 2011; authors and title not recoverable from the source.)
