COLING 2012 review (eNLP)
DESCRIPTION
Paper reviews presented at COLING 2012. I introduced papers related to educational applications of natural language processing.
TRANSCRIPT
![Page 1: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/1.jpg)
COLING 2012 review (eNLP version)
Mamoru Komachi
2012/12/17
Educational NLP research group
Computational Linguistics Lab
Nara Institute of Science and Technology, Japan
![Page 2: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/2.jpg)
Disclaimer
• Not a complete list, so please take a look at the paper list yourself!
• I haven’t read any of the papers yet. I will talk about my impressions from the presentations (oral, poster, demo) of the work. Please refer to the papers themselves if you are interested.
![Page 3: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/3.jpg)
COLING 2012 ORAL
• Joint English Spelling Error Correction and POS Tagging for Language Learners Writing
• Modeling ESL Word Choice Similarities by Representing Word Intensions and Extensions
• Problems in Evaluating Grammatical Error Detection Systems
• Mining Words in the Minds of Second Language Learners
• Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification
• Robust, Lexicalized Native Language Identification
• Native Language Identification using Recurring N-grams
![Page 4: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/4.jpg)
Joint English Spelling Error Correction and POS Tagging for Language Learners Writing
Keisuke Sakaguchi, Tomoya Mizumoto, Mamoru Komachi and Yuji Matsumoto
(NAIST, Japan)
Problem: Spelling errors and POS tags often coincide, but each task has been solved separately
Idea: Jointly perform spelling correction and POS tagging with a variable-length CRF (to deal with split/merge errors)
• Joint model outperforms the pipeline model
• Shorter outputs due to removal of delimiters
![Page 5: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/5.jpg)
Modeling ESL Word Choice Similarities By Representing Word Intensions and Extensions
Huichao Xue and Rebecca Hwa (University of Pittsburgh, USA)
Problem: Constructing a confusion set for grammatical error correction often relies on a manually corrected learner corpus
Idea: Use only a native corpus to create confusion sets by applying relevance component analysis
• Better confusion sets can be learned from a bilingual corpus and a native corpus
• Created confusion sets correlate well with real mistakes
![Page 6: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/6.jpg)
Problems in Evaluating Grammatical Error Detection Systems
Martin Chodorow, Markus Dickinson, Ross Israel and Joel Tetreault (City University of New York, USA)
Problem: Many evaluation metrics have been used for grammatical error detection, but none of them addresses the issue of data skewness
Idea: Propose best practices
• Report raw frequencies (tp, fn, fp, tn)
– Also report how you define true negatives
• Treat unit size (exact match/overlap) carefully
• Consider weighting the reliability of judgments
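To see why reporting the raw frequencies matters on skewed data, here is a minimal sketch (not from the paper). The counts are invented for illustration, and the `Counts` class and helper functions are hypothetical names. With few errors among many correct tokens, accuracy looks excellent while precision and recall remain low.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Raw frequencies of a binary error-detection evaluation."""
    tp: int  # errors correctly flagged
    fp: int  # correct units wrongly flagged
    fn: int  # errors missed
    tn: int  # correct units left alone (depends on how a "unit" is defined)


def precision(c: Counts) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Counts) -> float:
    errors = c.tp + c.fn
    return c.tp / errors if errors else 0.0


def f1(c: Counts) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(c: Counts) -> float:
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Hypothetical skewed data: only 20 errors among 1000 units.
skewed = Counts(tp=5, fp=5, fn=15, tn=975)

print(f"accuracy={accuracy(skewed):.3f} precision={precision(skewed):.3f} "
      f"recall={recall(skewed):.3f} f1={f1(skewed):.3f}")
```

Because `tn` dominates the total, any change in how true negatives are counted (per token, per sentence, per candidate position) shifts accuracy without changing precision or recall, so the definition should be reported along with the counts.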
![Page 7: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/7.jpg)
Mining Words in the Minds of Second Language Learners
Yo Ehara, Issei Sato, Hidekazu Oiwa and Hiroshi Nakagawa (University of Tokyo, Japan)
Problem: Though there are many studies on measuring the size of learners’ vocabulary, few studies address what kind of words they know
Idea: Define a learner-specific word difficulty measure
• Theoretically sound and practically useful extension to previous models
• Able to obtain interpretable weight vector
![Page 8: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/8.jpg)
Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification
Joel Tetreault, Daniel Blanchard, Aoife Cahill and Martin Chodorow (ETS, USA)
Problem: Previous NLI work uses ICLE, but the corpus is highly skewed
Idea: Create a new balanced corpus (TOEFL11) and evaluate across corpora
• Many trends in previous work on ICLE generalize to other corpora
• Training on a large corpus and testing on a smaller one works well, but not vice versa
• Accuracy varies across proficiency levels
![Page 9: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/9.jpg)
Robust, Lexicalized Native Language Identification
Julian Brooke and Graeme Hirst (University of Toronto, Canada)
Problem: Previous NLI research uses only small single corpora, which limits the use of lexical features
Idea: Extract an ESL corpus from Lang-8 to use lexical features and perform cross-corpus evaluation
• Shallow lexical features contribute much more than sophisticated syntactic features
• Domain adaptation gives improvement
• Evaluation on a single corpus may be questionable
![Page 10: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/10.jpg)
Native Language Identification using Recurring n-grams
Serhiy Bykh and Detmar Meurers (Universitaet Tuebingen, Germany)
Problem: Since NLI is a new task, features for NLI are not well studied
Idea: Explore surface/Open-Class-POS/POS n-gram features and evaluate across corpora
• The finer the features, the better the accuracy
• Features learned from ICLE generalize well to other corpora, unlike (Brooke and Hirst, 2011), which uses Lang-8 as a training corpus for NLI
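As a rough illustration of surface n-gram features, here is a minimal sketch (not the authors' implementation). It assumes that a "recurring" n-gram is one that occurs in at least two training texts. The function names, the `min_docs` parameter, and the toy sentences are all hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def recurring_ngrams(docs, n, min_docs=2):
    """Keep n-grams that appear in at least `min_docs` different texts.

    Assumption: "recurring" is interpreted as document frequency >= 2.
    """
    doc_freq = Counter()
    for tokens in docs:
        # A set counts each n-gram at most once per text.
        doc_freq.update(set(ngrams(tokens, n)))
    return {gram for gram, df in doc_freq.items() if df >= min_docs}


# Toy learner sentences, invented for illustration.
docs = [
    "i am agree with you".split(),
    "i am agree with this opinion".split(),
    "she agrees with you".split(),
]

recurring = recurring_ngrams(docs, n=2)
print(sorted(recurring))
```

The same function could be applied to POS tag sequences instead of word tokens to obtain the coarser feature types mentioned on the slide.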
![Page 11: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/11.jpg)
COLING 2012 POSTERS
• The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings
• Defining Syntax for Learner Language Annotation
![Page 12: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/12.jpg)
The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings
Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto
(NAIST, Japan)
Problem: Until recently, no large-scale ESL corpus has been publicly available for grammatical error correction
Idea: Extract an ESL corpus from the web and see the effect of corpus size on grammatical error correction
• Phrase-based SMT trained on large-scale data is effective for preposition, article, and lexical choice errors
• Syntax and discourse information is needed for tense, agreement, and noun number errors
![Page 13: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/13.jpg)
Defining Syntax for Learner Language Annotation
Marwa Ragheb and Markus Dickinson (Indiana University, USA)
Problem: Though POS annotation has been proposed for ESL language, annotating syntax for learner language is not well studied
Idea: Investigate multi-layered annotation (morphological dependencies, distributional dependencies, and subcategorization) for ESL texts
• Subcategorization seems preferable over the other two layers, since ESL texts are often hard to parse
• Open question: how can we generalize this framework to other non-canonical languages?
![Page 14: COLING 2012 review (eNLP)](https://reader037.vdocuments.site/reader037/viewer/2022103109/546ce57bb4af9f662c8b52aa/html5/thumbnails/14.jpg)
Summary
• Introduced eNLP-related papers presented at COLING 2012
• A lot of work on native language identification was done (there will be a shared task on NLI at BEA-8, co-located with NAACL 2013)
• Cross-corpora scalability is important
• Future research should go beyond the surface and POS level (semantic, syntactic and discourse information should be investigated)