

Automatic Correction for Graphemic Chinese Misspelled Words

Tao-Hsing Chang, Shou-Yen Su

Department of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences

[email protected], [email protected]

Hsueh-Chih Chen

Department of Educational Psychology and Counseling
National Taiwan Normal University

[email protected]

[Chinese abstract: the text is not recoverable from the source; the surviving fragments mention bi-gram models, SVM, and neural networks.]

Abstract

Whether Chinese is learned as a first or a second language, misspelled words are an important issue that must be addressed. Many studies have offered suggestions for correcting the misspelled words of students in school, as well as strategies for teaching and learning Chinese characters. Although schools take many precautions and make many corrections to keep students from producing misspelled words, students are sometimes unaware of their misspellings while writing. Therefore, in addition to emphasizing the recognition of misspelled words in teaching, preventing misspelled words from arising while words are being used has become a critical issue. Nevertheless, finding misspelled words in documents automatically and correctly by formula is not an easy matter. Researchers are currently studying the detection of graphemic misspelled words and applying it to

Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)


different fields, but the accuracy still falls far short of real-world demands. If the patterns, probabilities, and contexts of misspelled words can be analyzed in detail, misspelled words can be detected more quickly and precisely, and corrected effectively. We have already accumulated considerable research experience on graphemic misspelled words. This project combines resources provided by the mainline project to tackle the problem of graphemic misspelled words. If a breakthrough can be achieved, it will not only offer an effective auxiliary tool for teaching Chinese misspelled words, but also help to build a corpus of Chinese character errors more quickly as a learning tool.

Keywords: Misspelled Words Detection, Misspelled Words Correction, Graphemic Similarity

[Introduction and related work: the Chinese text is not recoverable from the source. The surviving fragments show that the passage surveyed prior approaches to Chinese spelling error detection and correction, including Chang [1] (1995; noted for false alerts), Ren et al. [4], Lin et al. [6] (2002), the bi-gram approach of Huang et al. [2], the studies in [7] and [8], and methods based on templates, translation, POS tags, and language models.]


[Method overview: the Chinese text is not recoverable from the source. The fragments indicate a lexicon-based step, bi-gram features, and classifiers including SVM, neural network, and linear regression, applied to the input text.]


[Character structure data: the Chinese text is not recoverable from the source. The fragments show that the character decomposition follows [9], that characters are handled as Windows Unicode with a "[]" bracket notation, and that counts of 439, 246, and 193 items are reported.]


[Structure operators: the Chinese example characters are not recoverable from the source. Eleven composition operators are defined; writing c1, c2, ... for components, the surviving operator symbols and arities are:

(1) (X): an atomic, non-decomposable character.
(2) (-): -(c1, c2).
(3) (|): |(c1, c2) or |(c1, c2, c3).
(4) (0): 0(c1, c2).
(5) (/): /(-(c1, c2), c3) or /(c1, -(c2, c3, c4)).
(6) (\): \(c1, c2) or /(c1, -(c2, c3)).
(7) (L): L(c1, c2) or L(c1, -(c2, c3)).
(8) (an operator whose symbol is lost): op(c1, -(c2, c3)).
(9) (V): V(c1, c2) or V(c1, c2, c3).
(10) (<): <(c1, c2) or <(c1, -(c2, |(c3, c4))).
(11) (T): T(c1, c2, c3) or T(c1, c2, c3, c4, c5).

Operators nest, e.g. -(|(<(c1, c2), c3), c4), so a structure expression forms a tree whose root is the outermost operator, with leaf nodes (components), subtrees, and branch nodes.]


[Levels and weights: the Chinese text is not recoverable from the source. Each node of a structure expression such as -(|(<(c1, c2), c3), c4) has a level and a weight; the subtree |(<(c1, c2), c3) sits one level below the root. The weight at level x is (1 - (x - 1) * 0.05). Worked example from the fragments: if 3 of 7 components match at level 2, the weighted score is 3/7 * 0.95 = 0.407; at level 1 it would be 3/7 * 1 = 0.429.]
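The surviving arithmetic (weight = 1 - 0.05 * (level - 1), score = matched/total * weight) can be sketched as follows. This is a minimal illustration; the function names are mine, not the paper's.

```python
# Level-weighted component matching. Only the weight formula and the
# matched/total ratio come from the paper; function names are invented.

def level_weight(level: int) -> float:
    # Level 1 -> 1.0, level 2 -> 0.95, level 3 -> 0.90, ...
    return 1 - (level - 1) * 0.05

def weighted_match_score(matched: int, total: int, level: int) -> float:
    # Share of matching components at a level, scaled by the level weight.
    return matched / total * level_weight(level)

# Worked example from the text: 3 of 7 components match.
print(round(weighted_match_score(3, 7, 2), 3))  # 3/7 * 0.95 = 0.407
print(round(weighted_match_score(3, 7, 1), 3))  # 3/7 * 1.00 = 0.429
```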


[Similarity computation examples: the Chinese text is not recoverable from the source. The fragments report a case with 12 of 18 components matching at level 2 (12/18 * 0.95 = 0.633), and a case where the two directional ratios are averaged: with 0 + 13 = 13 matches out of 16 on each side, ((13/16) + (13/16)) / 2 = 0.8125.]


[Further examples and statistics: the Chinese text is not recoverable from the source. The fragments include a pair scored ((2/7) + (2/7)) / 2 = 0.2857, a pair with 3 of 5 and 3 of 15 components matching scored ((3/5) + (3/15)) / 2 = 0.4, and a ratio 3/15 = 0.2. A set of 6097 characters is analyzed (a 30% sample is mentioned, along with the figures 10, 10, and 55), and operator frequencies are reported: (0) 28 times, (V) 4, (<) 15, (T) 35, with (-) and (|) also discussed.]


[Threshold examples: the Chinese text is not recoverable from the source. The fragments mention scores of 0.88 and 0.9 and a computation (8/12) * 0.8 = 0.5333.]

[Bi-gram detection: the Chinese text is not recoverable from the source, including Eq. (1). The fragments indicate that a bi-gram model is trained on a corpus (the figures 2003 and 7205 survive) and that, for a sentence S of n characters, the bi-gram scores of adjacent character pairs are used to flag suspicious positions.]
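The surviving fragments describe a character bi-gram model used to flag suspicious character pairs. A minimal sketch of such a scorer follows; the corpus, smoothing scheme, and all names are my assumptions, since the paper's Eq. (1) is not recoverable.

```python
# Minimal character bi-gram scorer for flagging suspicious character pairs.
# The toy corpus and add-one smoothing are invented for illustration.
from collections import Counter

def bigram_counts(corpus):
    uni, bi = Counter(), Counter()
    for sent in corpus:
        uni.update(sent)                      # unigram character counts
        bi.update(zip(sent, sent[1:]))        # adjacent character pairs
    return uni, bi

def bigram_prob(a, b, uni, bi):
    # P(b | a) with add-one smoothing over the observed vocabulary.
    v = len(uni)
    return (bi[(a, b)] + 1) / (uni[a] + v)

corpus = ["abcab", "abcb"]
uni, bi = bigram_counts(corpus)
# A frequent pair scores higher than an unseen one, so rare pairs
# can be flagged as suspicious positions.
print(bigram_prob("a", "b", uni, bi) > bigram_prob("a", "c", uni, bi))
```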


[POS bi-gram features: the Chinese text is not recoverable from the source, including Eqs. (2)-(4). The fragments show POS-tagged examples with the tag sequences (Nd) (P) (Na) (Nb) (Vc) and (Nd) (P) (Nc) (Vc), a window construction over positions x to x + k yielding pattern sets A1 and A2, and POS bi-gram probabilities such as P(Vh|Df) = 0.721 for the pair Df-Vh and P(De|Ve) = 0.01206 for De-Ve; P(X) feeds into Eq. (4).]
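The POS bi-gram probabilities quoted above (e.g. P(Vh|Df) = 0.721) are conditional probabilities estimated from a tagged corpus. A sketch of that estimation follows; the toy tag sequences are invented, so the printed value will not match the paper's figures.

```python
# Estimating POS bi-gram probabilities P(t2 | t1) from tagged sequences.
# The paper's corpus is not available; the tag sequences below are toys.
from collections import Counter

def pos_bigram_model(tag_sequences):
    uni, bi = Counter(), Counter()
    for tags in tag_sequences:
        uni.update(tags)                 # tag frequencies
        bi.update(zip(tags, tags[1:]))   # adjacent tag pairs
    def prob(t1, t2):
        # Maximum-likelihood estimate of P(t2 | t1).
        return bi[(t1, t2)] / uni[t1] if uni[t1] else 0.0
    return prob

prob = pos_bigram_model([["Nd", "P", "Nc", "Vc"], ["Nd", "P", "Na", "Vc"]])
print(prob("Nd", "P"))  # both Nd tokens are followed by P -> 1.0
```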


[Score combination: the Chinese text is not recoverable from the source, including Eq. (5). The fragments indicate that the graphemic similarity and the bi-gram scores from Eqs. (3) and (4) are combined into a single decision score y; the figures 1 and 50% also survive.]


[Experiments: the Chinese text is not recoverable from the source. The classifiers compared are an SVM (Support Vector Machine) and an NN (Neural Network); a back-propagation neural network and the RBF kernel are mentioned.]

[Data: the Chinese text is not recoverable from the source. The fragments report corpus figures of 11871, 61, 3, 258 and 6015, 81, 810.625, 647; the evaluation set contains 81 misspelled and 647 correct samples. The surviving similarity-score distribution reconstructs as follows (column assignment inferred from the totals; the 24.69% entry corrects the source's 23.69%, since 20/81 = 24.69%):

Score range | Misspelled (of 81) | Correct (of 647)
1~0.9       | 10 (12.35%)        | 1 (0.15%)
0.9~0.8     | 49 (60.49%)        | 20 (3.09%)
0.8~0.7     | 20 (24.69%)        | 106 (16.38%)
0.7~0.6     | 2 (2.47%)          | 520 (80.37%)
Total       | 81                 | 647]

4-fold cross validation was used over the 81 misspelled samples, split into folds of about 20 each (test sets of 20/40, 21, and 42 cases appear in the fragments), with each fold tested against a model trained on the other three. The evaluation measures are the standard ones (the paper's own formulas are not recoverable from the source):

Accuracy = correctly classified cases / all cases; Recall = TP / (TP + FN); Precision = TP / (TP + FP).
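The metrics above can be computed as follows, assuming the standard confusion-matrix definitions (the paper's exact formulas did not survive extraction); the example counts are invented but chosen to match one fold's 95%/95%/95% figures.

```python
# Standard detection metrics from confusion-matrix counts.
# The paper's own formulas are not recoverable; these are the usual ones.
def metrics(tp: int, fp: int, fn: int, tn: int):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)          # share of true errors that were found
    precision = tp / (tp + fp)       # share of flagged cases that were errors
    return accuracy, recall, precision

# Toy fold: 40 test cases, 38 classified correctly.
acc, rec, prec = metrics(tp=19, fp=1, fn=1, tn=19)
print(acc, rec, prec)  # 0.95 0.95 0.95
```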


1. SVM

The SVM experiments used LibSVM [3]. Parameters were selected with LibSVM's grid search, yielding an RBF kernel with cost 32 and gamma 0.5. The overall accuracy is 95.63%, the recall rate 97.50%, and the precision rate 94.10%.

SVM results per fold (fold labels such as 234 denote the three training folds; columns: training folds, test fold, test cases, correct, accuracy, recall rate, precision rate):

234 | 1 | 42 | 42 | 100%   | 100%   | 100%
134 | 2 | 40 | 37 | 92.50% | 95.00% | 90.48%
124 | 3 | 40 | 38 | 95.00% | 100%   | 90.90%
123 | 4 | 40 | 38 | 95.00% | 95.00% | 95.00%
Average: 95.63% | 97.50% | 94.10%
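The reported LibSVM configuration (RBF kernel, cost 32, gamma 0.5) can be reproduced with scikit-learn's SVC, which wraps libsvm. The feature vectors and labels below are invented for illustration; the paper's actual features are not recoverable.

```python
# SVM classifier with the paper's reported LibSVM settings:
# RBF kernel, cost C = 32, gamma = 0.5. The toy data is invented.
from sklearn.svm import SVC

X = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.1, 0.2, 0.3], [0.2, 0.1, 0.2]]
y = [1, 1, 0, 0]  # 1 = misspelled pair, 0 = correct pair

clf = SVC(kernel="rbf", C=32, gamma=0.5)
clf.fit(X, y)
print(clf.predict([[0.85, 0.8, 0.65]]))  # classified as 1 on this toy data
```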

2. Neural Network

The neural network experiments used Qnet2000, with a 3-layer network of 3 input, 10 hidden, and 1 output units, sigmoid activation, a learning rate of 0.01, momentum 0.8, and at most 10000 iterations. The overall accuracy is 96.25%, the recall rate 96.25%, and the precision rate 96.37%.

Neural network results per fold (columns: training folds, test fold, test cases, correct, accuracy, recall rate, precision rate):

234 | 1 | 42 | 42 | 100%   | 100%   | 100%
134 | 2 | 40 | 39 | 97.50% | 95.00% | 100%
124 | 3 | 40 | 37 | 92.50% | 95.00% | 90.48%
123 | 4 | 40 | 38 | 95.00% | 95.00% | 95.00%
Average: 96.25% | 96.25% | 96.37%
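The reported back-propagation setup (3-10-1 topology, sigmoid units, learning rate 0.01, momentum 0.8, up to 10000 iterations) can be approximated with scikit-learn's MLPClassifier standing in for Qnet2000; the toy data is invented.

```python
# Back-propagation network mirroring the reported Qnet2000 settings.
# MLPClassifier is a stand-in for Qnet2000; the data is invented.
from sklearn.neural_network import MLPClassifier

X = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.1, 0.2, 0.3], [0.2, 0.1, 0.2]]
y = [1, 1, 0, 0]

clf = MLPClassifier(hidden_layer_sizes=(10,),      # one hidden layer of 10
                    activation="logistic",         # sigmoid units
                    solver="sgd",
                    learning_rate_init=0.01,
                    momentum=0.8,
                    max_iter=10000,
                    random_state=0)
clf.fit(X, y)
print(clf.predict([[0.85, 0.8, 0.65]]))
```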

3. Linear Regression

A linear regression over the graphemic similarity and bi-gram features was also tested (the Chinese description is not recoverable from the source). The overall accuracy is 97.50%, the recall rate


96.25%, and the precision rate 98.75%.

Fitted regression values per fold (columns: training folds, then three feature weights and a final value; the column meanings are not recoverable from the source):

234 | 0.476108 | 0.047314 | 0.495112 | 0.82
134 | 0.533153 | 0.041051 | 0.426089 | 0.83
124 | 0.515488 | 0.065720 | 0.446411 | 0.85
123 | 0.527719 | 0.038126 | 0.446539 | 0.85

Linear regression results per fold (columns: training folds, test fold, test cases, correct, accuracy, recall rate, precision rate):

234 | 1 | 42 | 42 | 100%   | 100%   | 100%
134 | 2 | 40 | 39 | 97.50% | 95.00% | 100%
124 | 3 | 40 | 38 | 95.00% | 95.00% | 95.00%
123 | 4 | 40 | 39 | 97.50% | 95.00% | 100%
Average: 97.50% | 96.25% | 98.75%
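A linear combination with a decision cut-off can be sketched from the fold-234 row above. Treating the first three numbers as feature weights and the last (0.82) as a decision threshold is an assumption, since the original column labels did not survive; the feature values are invented.

```python
# Weighted combination of three feature scores with a decision threshold.
# Weights from the fold-234 row; treating 0.82 as a threshold is an
# assumption, and the input feature values are invented.
w = (0.476108, 0.047314, 0.495112)
THRESHOLD = 0.82

def combined_score(features):
    # Weighted sum of the three feature scores.
    return sum(wi * fi for wi, fi in zip(w, features))

def is_misspelled(features):
    # Flag the pair when the combined score reaches the threshold.
    return combined_score(features) >= THRESHOLD

print(is_misspelled((0.9, 0.8, 0.9)))  # high scores exceed the threshold
print(is_misspelled((0.1, 0.1, 0.1)))  # low scores fall below it
```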

4. Comparison

Comparing the three models, the SVM achieved 95.63% overall accuracy, the neural network 96.25%, and the linear regression 97.5%.

Model | Accuracy | Recall Rate | Precision Rate
SVM   | 95.63%   | 97.50%      | 94.10%
NN    | 96.25%   | 96.25%      | 96.37%
LR    | 97.50%   | 96.25%      | 98.75%

[Discussion: the Chinese text is not recoverable from the source. The fragments contrast precision rate and recall rate across the models, note the SVM's lower precision rate, and mention that the roughly 1:8 ratio of misspelled to correct samples affects the precision rate.]

[Conclusion: the Chinese text is not recoverable from the source. The fragments refer to the graphemic features of [9] and the bi-gram model.]


[Closing remarks and acknowledgments: the Chinese text is not recoverable from the source apart from a mention of precision rate and the project numbers 101J1A0301 and 101J1A0701.]

References

[1] Chao-Huang Chang. A New Approach for Automatic Chinese Spelling Correction. In Proceedings of the Natural Language Processing Pacific Rim Symposium '95, Seoul, Korea, pp. 278-283, 1995.

[2] Chuen-Min Huang, Mei-Chen Wu, and Ching-Che Chang. Error Detection and Correction Based on Chinese Phonemic Alphabet in Chinese Text. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, World Scientific Publishing Company, Vol. 16, Suppl. 1, pp. 89-105, 2008.

[3] Chih-Chung Chang and Chih-Jen Lin. LIBSVM -- A Library for Support Vector Machines. Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/

[4] Fuji Ren, Hongchi Shi, and Qiang Zhou. A Hybrid Approach to Automatic Chinese Text Checking and Error Correction. In Proceedings of the 2001 IEEE International Conference on Systems, Man, and Cybernetics, pp. 1693-1698, 2001.

[5] J. A. K. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters, Vol. 9, No. 3, pp. 293-300, 1999.

[6] Yih-Jeng Lin, Feng-Long Huang, and Ming-Shing Yu. A Chinese Spelling Error Correction System. In Proceedings of the Seventh Conference on Artificial Intelligence and Applications, 2002.

[7] (Chinese-language reference, 2009; authors and title not recoverable from the source.)

[8] (Chinese-language reference, 2010; authors and title not recoverable from the source.)

[9] (Chinese-language reference, 2011; authors and title not recoverable from the source.)
