
COLING 2012 review (eNLP version)

Mamoru Komachi

2012/12/17

Educational NLP research group

Computational Linguistics Lab

Nara Institute of Science and Technology, Japan

Disclaimer

• This is not a complete list, so please take a look at the paper list yourself!

• I haven’t read any of the papers yet. I will talk about my impressions from the presentations (oral, poster, demo) of the work. Please refer to the papers themselves if you are interested.

COLING 2012 ORAL

• Joint English Spelling Error Correction and POS Tagging for Language Learners Writing

• Modeling ESL Word Choice Similarities by Representing Word Intensions and Extensions

• Problems in Evaluating Grammatical Error Detection Systems

• Mining Words in the Minds of Second Language Learners

• Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification

• Robust, Lexicalized Native Language Identification

• Native Language Identification using Recurring N-grams

Joint English Spelling Error Correction and POS Tagging for Language Learners Writing

Keisuke Sakaguchi, Tomoya Mizumoto, Mamoru Komachi and Yuji Matsumoto

(NAIST, Japan)

Problem: Spelling errors and POS tags often coincide, but each task has been solved separately

Idea: Jointly perform spelling correction and POS tagging by variable length CRF (to deal with split/merge errors)

• Joint model outperforms the pipeline model

• Shorter outputs due to removal of delimiters

Modeling ESL Word Choice Similarities By Representing Word Intensions and Extensions

Huichao Xue and Rebecca Hwa (University of Pittsburgh, USA)

Problem: Constructing confusion sets for grammatical error correction often relies on a manually corrected learner corpus

Idea: Use only a native corpus to create confusion sets by applying relevance component analysis

• Better confusion sets can be learned from a bilingual corpus and a native corpus

• Created confusion sets correlate well with real mistakes

Problems in Evaluating Grammatical Error Detection Systems

Martin Chodorow, Markus Dickinson, Ross Israel and Joel Tetreault (City University of New York, USA)

Problem: Many evaluation metrics have been used for grammatical error detection, but none of them addresses the issue of data skewness

Idea: Propose best practices

• Report raw frequencies (tp, fn, fp, tn)

– Also report how you define true negatives

• Treat unit size (exact match/overlap) carefully

• Consider weighting the reliability of judgments
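To illustrate why raw frequencies matter under skewed data, here is a minimal Python sketch. The counts are hypothetical and chosen only for illustration (they are not from the paper), and the helper names are my own. It shows how accuracy can look high on skewed data while precision and recall tell a different story.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Raw frequencies of an error detector's decisions."""
    tp: int  # errors correctly flagged
    fp: int  # correct text wrongly flagged
    fn: int  # errors missed
    tn: int  # correct text left alone (depends on how you define it)


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def accuracy(c: Counts) -> float:
    return safe_div(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)


def precision(c: Counts) -> float:
    return safe_div(c.tp, c.tp + c.fp)


def recall(c: Counts) -> float:
    return safe_div(c.tp, c.tp + c.fn)


def f1(c: Counts) -> float:
    p, r = precision(c), recall(c)
    return safe_div(2 * p * r, p + r)


# Hypothetical, skewed data: only 50 of 1000 units contain an error.
skewed = Counts(tp=10, fp=10, fn=40, tn=940)

for name, metric in [("accuracy", accuracy), ("precision", precision),
                     ("recall", recall), ("F1", f1)]:
    print(f"{name}: {metric(skewed):.3f}")
```

With these made-up counts the detector misses 40 of 50 errors, yet accuracy is still 0.95 because true negatives dominate. Reporting all four raw counts lets readers recompute whichever metric they prefer.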

Mining Words in the Minds of Second Language Learners

Yo Ehara, Issei Sato, Hidekazu Oiwa and Hiroshi Nakagawa (University of Tokyo, Japan)

Problem: Though there are many studies on measuring the size of learners’ vocabulary, few studies address what kind of words they know

Idea: Define a learner-specific word difficulty measure

• Theoretically sound and practically useful extension to previous models

• Able to obtain interpretable weight vector

Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification

Joel Tetreault, Daniel Blanchard, Aoife Cahill and Martin Chodorow (ETS, USA)

Problem: Previous NLI work uses ICLE, but the corpus is highly skewed

Idea: Create a new balanced corpus (TOEFL11) and evaluate across corpora

• Many trends in previous work on ICLE generalize to other corpora

• Training on a large corpus and testing on a smaller one works well, but not vice versa

• Accuracy varies across proficiency levels

Robust, Lexicalized Native Language Identification

Julian Brooke and Graeme Hirst (University of Toronto, Canada)

Problem: Previous NLI research uses only small single corpora, which limits the use of lexical features

Idea: Extract an ESL corpus from Lang-8 to use lexical features and perform cross-corpus evaluation

• Shallow lexical features contribute much more than sophisticated syntactic features

• Domain adaptation gives improvement

• Evaluation on a single corpus may be questionable

Native Language Identification using Recurring n-grams

Serhiy Bykh and Detmar Meurers (Universitaet Tuebingen, Germany)

Problem: Since NLI is a new field, features for the task are not well studied

Idea: Explore surface/Open-Class-POS/POS n-gram features and evaluate across corpora

• The finer the features, the better the accuracy

• Features learned from ICLE generalize well to other corpora, unlike (Brooke and Hirst, 2011), which uses Lang-8 as a training corpus for NLI
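As a rough illustration of surface n-gram features, here is a minimal Python sketch. It assumes that "recurring" means an n-gram appearing in at least two different training texts; that threshold, the whitespace tokenization, the function names, and the toy sentences are all my assumptions, not details taken from the presentation.

```python
from collections import Counter


def all_ngrams(tokens, max_n):
    """Return the set of all n-grams (as tuples) for n = 1..max_n."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def recurring_ngrams(texts, max_n=3, min_docs=2):
    """Keep n-grams that occur in at least `min_docs` different texts."""
    doc_freq = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization
        doc_freq.update(all_ngrams(tokens, max_n))  # count once per text
    return {gram for gram, df in doc_freq.items() if df >= min_docs}


def active_features(text, vocab, max_n=3):
    """Binary features: the recurring n-grams that are present in `text`."""
    tokens = text.lower().split()
    return sorted(all_ngrams(tokens, max_n) & vocab)


# Toy training texts, invented for illustration.
train = ["I am agree with you", "I am agree that", "she agrees"]
vocab = recurring_ngrams(train)
print(active_features("they am agree", vocab))
```

The resulting binary vectors could be fed to any standard classifier. The same pipeline would apply to POS-based n-grams by replacing some or all tokens with their POS tags before extraction.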

COLING 2012 POSTERS

• The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings

• Defining Syntax for Learner Language Annotation

The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings

Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto

(NAIST, Japan)

Problem: Until recently, no large-scale ESL corpus has been publicly available for grammatical error correction

Idea: Extract an ESL corpus from the web and see the effect of corpus size on grammatical error correction

• Phrase-based SMT trained on large-scale data is effective for preposition, article, and lexical choice errors

• Syntax and discourse information is needed for tense, agreement, and noun number errors

Defining Syntax for Learner Language Annotation

Marwa Ragheb and Markus Dickinson (Indiana University, USA)

Problem: Though POS annotation has been proposed for ESL language, annotating syntax for learner language is not well studied

Idea: Investigate multiple layered annotation (morphological dependencies, distributional dependencies, and subcategorization) for ESL texts

• Subcategorization seems preferable over the other two layers, since ESL texts are often hard to parse

• Open question: how can we generalize this framework to other non-canonical languages?

Summary

• Introduced eNLP-related papers presented at COLING 2012

• A lot of work was done on native language identification (there will be a shared task on NLI at BEA-8, co-located with NAACL 2013)

• Cross-corpora scalability is important

• Future research should go beyond the surface and POS level (semantic, syntactic and discourse information should be investigated)