Automatic Correction for Graphemic Chinese Misspelled Words
Tao-Hsing Chang, Shou-Yen Su
Department of Computer Science and Information Engineering, National Kaohsiung University of Applied Sciences
[email protected]@gmail.com
Hsueh-Chih Chen
Department of Educational Psychology and Counseling, National Taiwan Normal University
Abstract
Whether Chinese is learned as a first or second language, misspelled words are an important issue that needs to be addressed. Many studies have offered suggestions both for correcting the misspelled words of students who are still in school and for teachers' strategies in teaching and learning Chinese characters. Although schooling includes many precautions and corrections intended to keep students from producing misspelled words, students are sometimes unaware of their misspellings while writing. Consequently, in addition to emphasizing the recognition of misspelled words in teaching, preventing misspelled words from arising during actual writing has become a critical issue. Nevertheless, finding misspelled words in documents automatically and correctly by formula is not an easy task. Researchers are currently studying the detection of graphemic misspelled words and applying it to
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
different fields, but the accuracy still falls far short of real-world demand. If the patterns, probabilities, and contexts of misspelled words can be analyzed in detail, misspelled words could be detected more quickly and precisely, and corrected more effectively. We have already accumulated considerable research experience on graphemic misspelled words. This project combines resources provided by the mainline project to address the problem of graphemic misspelled words. A breakthrough here would not only offer an effective auxiliary tool for teaching Chinese misspelled words, but also help to build a corpus-based learning tool for Chinese character errors more quickly.
Keywords: Misspelled Words Detection, Misspelled Words Correction, Graphemic Similarity
In 1995, Chang [1] proposed an approach to automatic Chinese spelling correction, though it produced false alerts. Ren et al. [4] presented a hybrid approach to automatic Chinese text checking and error correction, and in 2002 Lin et al. [6] built a Chinese spelling error correction system. Huang et al. [2] detected and corrected errors with a bi-gram model based on the Chinese phonemic alphabet. Other studies [7][8] have approached the problem with techniques such as templates, translation, POS tagging, and language models.
The proposed method first performs a lexicon-based check of the input text, then combines bi-gram statistics with classifiers (SVM, Neural Network, and Linear Regression) to detect and correct graphemic misspelled words.
[9]
Windows Unicode “[]”
[| ]439 246 193
Following [9], the structure of each character is described with 11 operators.
The 11 operators, with component arguments written generically as a, b, c, d, are:

(1) Single component (X).
(2) (-): C = -(a, b).
(3) (|): C = |(a, b) or C = |(a, b, c).
(4) (0): C = 0(a, b).
(5) (/): C = /(-(a, b), c) or C = /(a, -(b, c, d)).
(6) (\): C = \(a, -(b, c)) or C = \(a, b).
(7) (L): C = L(a, -(b, c)) or C = L(a, b).
(8) ( ): C = (a, -(b, c)).
(9) (V): C = V(a, b) or C = V(a, b, c).
(10) (<): C = <(a, b) or C = <(a, -(b, |(c, d))).
(11) (T): C = T(a, b, c) or C = T(a, b, c, d, e).
For example, a character expressed as -(|(<(a, b), c), d) contains the sub-expression |(<(a, b), c). Such an expression forms a tree: the root (Root) is the outermost operator, components are leaf nodes (Leaf Node), and each nested operator is the branch node (Branch Node) of a subtree (Subtree).
Each node of the tree has a level (Level) and a weight (Weight). In -(|(<(a, b), c), d), the root operator - is at level 1, and the sub-expression |(<(a, b), c) lies one level deeper.
The weight of level x is 1 - (x - 1) * 0.05, so level 1 has weight 1, level 2 has weight 0.95, and so on. For example, if two characters share 3 of 7 components at level 2, that match contributes 3/7 * 0.95 = 0.407; the same match at level 1 would contribute 3/7 * 1 = 0.429.
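As a sketch of the level weighting just described (function and variable names are my own):

```python
def level_weight(x):
    """Weight of tree level x: level 1 -> 1.0, level 2 -> 0.95, and so on."""
    return 1 - (x - 1) * 0.05

def weighted_ratio(shared, total, level):
    """Contribution of `shared` matching components out of `total` found at `level`."""
    return shared / total * level_weight(level)

# 3 of 7 components matching at level 2 versus level 1:
print(round(weighted_ratio(3, 7, 2), 3))  # 0.407
print(round(weighted_ratio(3, 7, 1), 3))  # 0.429
```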
The similarity of two characters is the average of their weighted matched-component ratios. For example, when 13 of 16 components match in both characters, the similarity is ((13/16) + (13/16)) / 2 = 0.8125.
Less similar pairs score lower, e.g. ((2/7) + (2/7)) / 2 = 0.2857; when the two characters have different component counts, the two ratios differ, e.g. ((3/5) + (3/15)) / 2 = 0.4.
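The averaging step can be sketched as follows (a minimal illustration; the function name is my own):

```python
def char_similarity(matched_a, total_a, matched_b, total_b):
    """Average of the two characters' matched-component ratios."""
    return (matched_a / total_a + matched_b / total_b) / 2

print(char_similarity(13, 16, 13, 16))  # 0.8125
```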
6097 30%10 10
55 3/15=0.2
1
3 0
0 0 6097
(0) 28(V) 4
(<) 15(T) 35
0(-) (|)
0.88 9
(8/12) * 0.8 = 0.5333
1
4 0
0 4
The context-based detector uses bi-gram statistics estimated from a corpus of 7,205 articles from 2003. For a sentence S of n characters, the bi-gram probability of each adjacent character pair is computed as in Equation (1).
(1)
bi-gram
bi-gram
bi-gram
A: bi-gram 2
(2)
(Nd)_ (P)_ (Na)_ (Nb)_ (Vc) versus (Nd)_ (P)_ (Nc)_ (Vc); the resulting POS sequence is Nd, P, Nc, Vc.
2n
n+k k xx A1 x+k A2
POS bi-gram transition probabilities are estimated as in Equation (3). For example, P(Vh|Df) = 0.721 for the pair Df-Vh, while P(De|Ve) = 0.01206 for the pair De-Ve; such a low transition probability signals a suspicious position, and the score P(X) is computed as in Equation (4).
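The POS bi-gram estimation can be sketched as follows, assuming a corpus of POS-tagged sentences (the function name and toy data are my own):

```python
from collections import Counter

def pos_bigram_probs(tag_sequences):
    """Estimate P(t2 | t1) over POS tags from tagged sentences (lists of tags)."""
    unigram, bigram = Counter(), Counter()
    for tags in tag_sequences:
        unigram.update(tags[:-1])                 # count each tag that has a successor
        bigram.update(zip(tags[:-1], tags[1:]))   # count adjacent tag pairs
    return {(t1, t2): c / unigram[t1] for (t1, t2), c in bigram.items()}

probs = pos_bigram_probs([["Nd", "P", "Nc", "Vc"], ["Nd", "P", "Na"]])
print(probs[("Nd", "P")])  # 1.0
```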
bi-gram (3) (4)bi-gram
bi-gram bi-gram
( )
5
(5)
y y y
150%
For classification, both SVM (Support Vector Machine) and NN (Neural Network) models are tested, the latter including the back-propagation neural network (Back Propagation Neural Network) and RBF variants.
11871 61 3 258
6015 81 810.625 647
The graphemic similarity scores distribute as follows over the 81 misspelled samples and the 647 correct samples:

Similarity range    | 1~0.9       | 0.9~0.8     | 0.8~0.7      | 0.7~0.6      | Total
Misspelled samples  | 10 (12.35%) | 49 (60.49%) | 20 (24.69%)  | 2 (2.47%)    | 81
Correct samples     | 1 (0.15%)   | 20 (3.09%)  | 106 (16.38%) | 520 (80.37%) | 647
Performance was measured by 4-fold cross-validation: the 81 misspelled samples were divided into four folds (one of 21 and three of 20), each paired with an equal number of correct samples, giving test sets of 42 or 40 items. Accuracy, recall rate, and precision rate are defined as: Accuracy = correctly classified samples / total test samples; Recall = true errors detected / total true errors; Precision = true errors detected / total detections.
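These three measures can be sketched as follows (binary labels, 1 = misspelled; names are my own):

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, recall, precision) for binary error labels (1 = error)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    recall = tp / sum(y_true)
    precision = tp / sum(y_pred) if sum(y_pred) else 0.0
    return accuracy, recall, precision

print(evaluate([1, 1, 0, 0], [1, 0, 0, 1]))  # (0.5, 0.5, 0.5)
```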
1. SVM
The SVM experiments use LibSVM [3]. Parameters were selected with LibSVM's grid search tool: an RBF kernel with cost 32 and gamma 0.5. The resulting average accuracy is 95.63%, recall rate 97.50%, and precision rate 94.10%.
SVM results per fold:

Training folds | Test fold | #Test | #Correct | Accuracy | Recall Rate | Precision Rate
2,3,4          | 1         | 42    | 42       | 100%     | 100%        | 100%
1,3,4          | 2         | 40    | 37       | 92.50%   | 95.00%      | 90.48%
1,2,4          | 3         | 40    | 38       | 95.00%   | 100%        | 90.90%
1,2,3          | 4         | 40    | 38       | 95.00%   | 95.00%      | 95.00%
Average        |           |       |          | 95.63%   | 97.50%      | 94.10%
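The reported LibSVM settings correspond to the command-line options `-t 2 -c 32 -g 0.5` (RBF kernel, cost 32, gamma 0.5). As a sketch, the RBF kernel itself is:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2), with gamma = 0.5 as in the paper."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))  # 1.0
```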
2. Neural Network
The Neural Network experiments use Qnet2000 with a three-layer 3-10-1 architecture (3 input nodes, 10 hidden nodes, 1 output node) and the sigmoid function (Sigmoid function) as activation. The learn rate (Learn Rate) is 0.01, the momentum (Momentum) 0.8, and the max iterations (Max Iterations) 10,000. The resulting average accuracy is 96.25%, recall rate 96.25%, and precision rate 96.37%.
Neural Network results per fold:

Training folds | Test fold | #Test | #Correct | Accuracy | Recall Rate | Precision Rate
2,3,4          | 1         | 42    | 42       | 100%     | 100%        | 100%
1,3,4          | 2         | 40    | 39       | 97.50%   | 95.00%      | 100%
1,2,4          | 3         | 40    | 37       | 92.50%   | 95.00%      | 90.48%
1,2,3          | 4         | 40    | 38       | 95.00%   | 95.00%      | 95.00%
Average        |           |       |          | 96.25%   | 96.25%      | 96.37%
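A 3-10-1 sigmoid network's forward pass can be sketched as follows (Qnet2000 itself is a separate tool; weights are illustrative and biases are omitted for brevity, and training would use the reported learn rate 0.01 and momentum 0.8):

```python
import math

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def forward(x, w_hidden, w_out):
    """Forward pass of a 3-10-1 network: 3 inputs, 10 sigmoid hidden units, 1 sigmoid output."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    return sigmoid(sum(wo * h for wo, h in zip(w_out, hidden)))

# With all-zero weights every unit outputs sigmoid(0) = 0.5:
print(forward([0.9, 0.3, 0.1], [[0.0] * 3] * 10, [0.0] * 10))  # 0.5
```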
3. Linear Regression
The linear regression model combines the graphemic similarity and bi-gram features, achieving an average accuracy of 97.50%, a recall rate
of 96.25%, and a precision rate of 98.75%.
Regression weights learned per training set:

Training folds | Weight (similarity) | Weight (bi-gram) | Weight (bi-gram) | Threshold
2,3,4          | 0.476108            | 0.047314         | 0.495112         | 0.82
1,3,4          | 0.533153            | 0.041051         | 0.426089         | 0.83
1,2,4          | 0.515488            | 0.065720         | 0.446411         | 0.85
1,2,3          | 0.527719            | 0.038126         | 0.446539         | 0.85
Linear regression results per fold:

Training folds | Test fold | #Test | #Correct | Accuracy | Recall Rate | Precision Rate
2,3,4          | 1         | 42    | 42       | 100%     | 100%        | 100%
1,3,4          | 2         | 40    | 39       | 97.50%   | 95.00%      | 100%
1,2,4          | 3         | 40    | 38       | 95.00%   | 95.00%      | 95.00%
1,2,3          | 4         | 40    | 39       | 97.50%   | 95.00%      | 100%
Average        |           |       |          | 97.50%   | 96.25%      | 98.75%
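Applying one fold's learned weights can be sketched as follows (using the first row of the weight table; interpreting the last column as a decision threshold is my assumption):

```python
def lr_score(features, weights):
    """Linear-regression score: weighted sum of the three detection features."""
    return sum(w * f for w, f in zip(weights, features))

def is_misspelled(features, weights, threshold):
    """Flag a character as misspelled when its score reaches the fold's threshold."""
    return lr_score(features, weights) >= threshold

w = (0.476108, 0.047314, 0.495112)  # weights learned on training folds 2,3,4
print(is_misspelled((1.0, 1.0, 1.0), w, 0.82))  # True
```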
4. Comparison
Comparing the three methods, SVM achieves 95.63% accuracy, Neural Network 96.25%, and Linear Regression 97.5%:

Method | Accuracy | Recall Rate | Precision Rate
SVM    | 95.63%   | 97.50%      | 94.10%
NN     | 96.25%   | 96.25%      | 96.37%
LR     | 97.50%   | 96.25%      | 98.75%
Across the three methods, the precision rate varies more than the recall rate, with SVM showing the lowest precision rate. Since the ratio of misspelled to correct samples is about 1:8, the precision rate is the more demanding measure.
In summary, combining the graphemic similarity measure of [9] with bi-gram statistics yields an effective detector and corrector for graphemic misspelled words.
precision rate
This work was supported by projects 101J1A0301 and 101J1A0701.
[1] Chao-Huang Chang. A New Approach for Automatic Chinese Spelling Correction. In Proceedings of Natural Language Processing Pacific Rim Symposium '95, Seoul, Korea, pp. 278-283, 1995.
[2] Chuen-Min Huang, Mei-Chen Wu, and Ching-Che Chang. Error Detection and Correction Based on Chinese Phonemic Alphabet in Chinese Text. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, World Scientific Publishing Company, Vol. 16, Suppl. 1, pp. 89-105, 2008.
[3] Chih-Jen Lin and Chih-Chung Chang. LIBSVM -- A Library for Support Vector Machines. Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[4] Fuji Ren, Hongchi Shi, and Qiang Zhou. A Hybrid Approach to Automatic Chinese Text Checking and Error Correction. In Proceedings of the 2001 IEEE International Conference on Systems, Man, and Cybernetics, pp. 1693-1698, 2001.
[5] J.A.K. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters. Volume 9, Number 3 , 293-300, 1999.
[6] Yih-Jeng Lin, Feng-Long Huang, and Ming-Shing Yu. A Chinese Spelling Error Correction System. In Proceedings of the Seventh Conference on Artificial Intelligence and Applications, 2002.
[7] 2009 ” ”
[8] 2010 ””
[9] 2011 ””