Proceedings of the Australasian Language Technology Association Workshop 2013 (ALTA 2013), pp132-136, Brisbane, Australia. http://aclweb.org/anthology/U/U13/TRANSCRIPT
Overview of the 2013 ALTA Shared Task
Diego Molla
Australasian Language Technology Macquarie University
ALTA 2013, Brisbane, Australia
The ALTA Shared Tasks The 2013 Task Kaggle in Class Results Use in UniMelb
Contents
The ALTA Shared Tasks
The 2013 ALTA Shared Task
Kaggle in Class
Results
Use in University of Melbourne (Karin Verspoor)
2013 ALTA Shared Task Diego Molla 2/26
The ALTA Shared Tasks
Aims
- Target university students with programming experience.
- No background in text processing required.
- Aim to expose potential researchers to NLP-related problems.
Format
- All participants attempt to solve the same problem.
- The training and test data are common to all.
- Any tools and external resources can be used.
- The solution must be completely automated.
The 2013 Shared Task
Task: Case and punctuation restoration
Categories: student, open
Prize: $350
Framework: Kaggle in Class
Student Category
- All members are university students.
- No members are employed full-time.
- No members have a PhD.
Open Category
- Any other teams.
Case and Punctuation Restoration
Input
. . . stored at the ucla television archives the archived episodes were telecast march 8 16 and 24 1971 april 1 and . . .
Output
. . . stored at the UCLA Television Archives. The archived episodes were telecast: March 8, 16, and 24, 1971, April 1 and . . .
Motivation
- In some situations, English text does not have information about capitalisation or punctuation.
  - Automated text transcriptions.
  - Quick notes.
  - Text messages, tweets.
- In some applications, a preliminary stage of case and punctuation restoration improves outcomes.
  - Machine translation.
  - Information extraction.
Case and Punctuation Restoration as a Classification Task
Baldwin and Joseph (2009)
- Multi-label classification.
- Each label indicates the information to restore.
  - COMMA: Word is followed by a comma.
  - CAPi: Character i is in uppercase.
  - ALLCAPS: All characters are in uppercase.
  - NOCHANGE: No special restoration needed.
  - . . .
corp/CAP1+FULLSTOP+COMMA
Corp.,
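As a rough illustration of how such labels could be turned back into text, the following Python sketch applies a list of labels to a lowercase word. It is not the implementation of Baldwin and Joseph (2009): the function names `apply_labels` and `restore` are hypothetical, only the labels named on this slide are supported, and it assumes that `CAPi` counts characters from 1 and that punctuation labels are appended in the order given.

```python
import re

# Punctuation labels named on the slide; the full label set is elided there.
PUNCT_LABELS = {"FULLSTOP": ".", "COMMA": ","}


def apply_labels(word, labels):
    """Apply restoration labels to a lowercase word, in the order given."""
    chars = list(word)
    trailing = ""
    for label in labels:
        cap = re.fullmatch(r"CAP(\d+)", label)
        if cap:
            index = int(cap.group(1)) - 1  # assumption: CAPi counts from 1
            chars[index] = chars[index].upper()
        elif label == "ALLCAPS":
            chars = [ch.upper() for ch in chars]
        elif label in PUNCT_LABELS:
            trailing += PUNCT_LABELS[label]
        elif label != "NOCHANGE":
            raise ValueError(f"unsupported label: {label}")
    return "".join(chars) + trailing


def restore(tagged):
    """Restore a token written as word/LABEL+LABEL+... (slide notation)."""
    word, _, label_text = tagged.partition("/")
    return apply_labels(word, label_text.split("+"))


print(restore("corp/CAP1+FULLSTOP+COMMA"))  # Corp.,
```

Under these assumptions, the three labels on `corp` produce a capitalised word followed by a full stop and then a comma.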
Simplification for the ALTA Shared Task
Only Two Labels
- Case: The word has at least one character in uppercase.
- Punct: The word is followed by at least one punctuation mark.
Punctuation Marks
,.;:?!
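The two labels can be derived from ordinary text, which is one way to see what they encode. The sketch below is an assumption-laden illustration, not the organisers' data preparation script: `derive_labels` is a hypothetical helper, tokens are split on whitespace only, and any run of the six punctuation marks at the end of a token counts as "followed by punctuation".

```python
PUNCTUATION_MARKS = ",.;:?!"  # the six marks listed on the slide


def derive_labels(token):
    """Return (word, case, punct) for one whitespace-delimited token.

    case:  the word has at least one uppercase character.
    punct: the word is followed by at least one punctuation mark.
    """
    word = token.rstrip(PUNCTUATION_MARKS)
    punct = len(word) < len(token)
    case = any(ch.isupper() for ch in word)
    return word.lower(), case, punct


text = ("stored at the UCLA Television Archives. The archived episodes "
        "were telecast: March 8, 16, and 24, 1971, April 1 and")
rows = [derive_labels(token) for token in text.split()]
for word, case, punct in rows:
    print(case, punct, word)
```

A real tokeniser would also need to handle brackets, quotes and tokens that consist only of punctuation, which this sketch ignores.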
Training Set
CAPITALIZED PUNCTUATION WORD
True False positive
False False pressure
False False ventilation
False False (
True False ppv
False False )
False False consists
False False of
False False using
False False a
False False fan
False False to
False False create
Test Set
Input
ID WORD
255 stored
256 at
257 the
258 ucla
259 television
260 archives
261 the
262 archived
263 episodes
264 were
Output
Id,documents
Case,258 259 260 261 266 272
Punct,260 265 267 268 270 271
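The output format above is a small CSV file with one row per label, each listing the IDs of the words that carry that label. A minimal sketch of producing it in Python follows; the function name `format_submission` is hypothetical, and the layout is inferred only from the example on this slide.

```python
import csv
import io


def format_submission(case_ids, punct_ids):
    """Build the submission text: a header row, then one row per label."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Id", "documents"])
    # Word IDs are written in ascending order, separated by single spaces.
    writer.writerow(["Case", " ".join(str(i) for i in sorted(case_ids))])
    writer.writerow(["Punct", " ".join(str(i) for i in sorted(punct_ids))])
    return buffer.getvalue()


submission = format_submission(
    case_ids=[258, 259, 260, 261, 266, 272],
    punct_ids=[260, 265, 267, 268, 270, 271],
)
print(submission, end="")
```

With the IDs from the slide, this reproduces the three lines shown under "Output".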
Data Sources
Test Set
- Data collected by Baldwin & Joseph (2009) from the AP Newswire (APW) and New York Times (NYT) sections of the English Gigaword Corpus.
  1. Public test set: available to participants during the competition.
  2. Private test set: released at the last minute.
Training Set
- A third partition of the data from Baldwin & Joseph (2009).
- An extract of Wikipedia.
Data Sizes
Wikipedia Extract for Training
- 18 files.
- 306,445 words in total.
Data from Baldwin & Joseph (2009)
- Training: 66,371 words.
- Public test: 64,072 words.
- Private test: 66,371 words.
Kaggle in Class
Kaggle
- Kaggle offers a Web-based framework for data-driven competitions.
- A large base of potential participants.
- Potentially large prizes for the participants.
- Fee-based for the organisers; free for the participants.
Kaggle in Class
- Free for organisers and participants.
- Limited user support by Kaggle.
- Used by course-based competitions.
ALTA Shared Task in Kaggle in Class
Features of Kaggle in Class
- Public leaderboard: all participants can submit and compare with other participants.
- Automated evaluation: organisers can choose among several evaluation metrics.
- Public and private partitions: a private partition of the test data is withheld for the final ranking.
  - But this feature does not work well with some evaluation metrics.
- Discussion forum: for communication among participants.
Evaluation Metric
Output
Id,documents
Case,258 259 260 262 270
Punct,259 260 265 270
Target
Id,documents
Case,258 259 260 261 266 272
Punct,260 265 267 268 270 271
Macro-Averaged F1
- Case: P = 3/5; R = 3/6; F1 ≈ 0.545
- Punct: P = 3/4; R = 3/6; F1 = 0.6
- Final score: (0.545 + 0.6)/2 ≈ 0.57
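The arithmetic on this slide can be checked with a short Python sketch. It treats each label's ID list as a set and averages the per-label F1 scores; the function names are hypothetical, and Kaggle's own implementation of the metric may differ in detail (for example in how empty predictions are scored).

```python
def f1_score(predicted, target):
    """F1 between two collections of word IDs, compared as sets."""
    predicted, target = set(predicted), set(target)
    true_positives = len(predicted & target)
    if true_positives == 0:
        return 0.0  # assumption: no overlap scores zero
    precision = true_positives / len(predicted)
    recall = true_positives / len(target)
    return 2 * precision * recall / (precision + recall)


def macro_f1(predicted, target):
    """Unweighted mean of the per-label F1 scores over the target labels."""
    labels = sorted(target)
    scores = [f1_score(predicted.get(label, ()), target[label]) for label in labels]
    return sum(scores) / len(scores)


output = {"Case": [258, 259, 260, 262, 270], "Punct": [259, 260, 265, 270]}
target = {"Case": [258, 259, 260, 261, 266, 272],
          "Punct": [260, 265, 267, 268, 270, 271]}

print(round(f1_score(output["Case"], target["Case"]), 3))
print(round(f1_score(output["Punct"], target["Punct"]), 3))
print(round(macro_f1(output, target), 3))
```

Because the average is unweighted, the two labels contribute equally to the final score regardless of how many words carry each label.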
A Baseline
Training data      F1 (public)  F1 (private)
Train data         0.4355       0.2895
Wikipedia 0-5      0.4077       0.2761
Wikipedia 0-10     0.4173       0.2791
Wikipedia 0-1      0.42267      0.2789
Train + Wikipedia  0.4493       0.2876
- Single-label task: each of the 4 combinations of possible labels forms a single label.
- Trained NLTK’s Hidden Markov Model (HMM).
- Results improved as we added more training data.
- Large difference between “public” and “private” test sets.
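The single-label reformulation can be sketched as a pair of conversions: from the two boolean labels to one combined tag per word, and from predicted tags back to the ID lists needed for a submission. The tag names and function names below are hypothetical, and the HMM training itself is left out; the tagged sentences produced here are in the (word, tag) form that a sequence tagger such as NLTK's HMM could be trained on.

```python
# Hypothetical names for the 4 combinations of (Case, Punct).
COMBINED = {
    (False, False): "NONE",
    (True, False): "CASE",
    (False, True): "PUNCT",
    (True, True): "CASE+PUNCT",
}
SPLIT = {tag: pair for pair, tag in COMBINED.items()}


def to_tagged_sentence(rows):
    """Turn (case, punct, word) training rows into (word, tag) pairs."""
    return [(word, COMBINED[(case, punct)]) for case, punct, word in rows]


def to_label_ids(first_id, tags):
    """Turn predicted tags back into Case and Punct word-ID lists."""
    case_ids, punct_ids = [], []
    for word_id, tag in enumerate(tags, start=first_id):
        case, punct = SPLIT[tag]
        if case:
            case_ids.append(word_id)
        if punct:
            punct_ids.append(word_id)
    return case_ids, punct_ids


# Rows taken from the training set excerpt: "positive pressure ventilation".
training_rows = [(True, False, "positive"), (False, False, "pressure"),
                 (False, False, "ventilation")]
print(to_tagged_sentence(training_rows))
print(to_label_ids(258, ["CASE", "CASE", "CASE+PUNCT", "CASE", "NONE"]))
```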
Results
Public Data
Rank  Team           Score
1     Winner         0.73763
2     Second         0.68360
3     ?              0.63232
4     ?              0.63109
5     ?              0.60251
6     ?              0.60147
7     ?              0.59517
8     ?              0.58332
9     ?              0.56832
10    ?              0.56747
11    ?              0.55793
12    ?              0.55606
13    ?              0.55087
14    ?              0.52261
15    ?              0.51954
16    ?              0.51167
17    ?              0.49311
18    ?              0.47622
19    (test system)  0.46667
20    ?              0.46490
21    ?              0.45986
22    ?              0.45291
      Baseline       0.44930
23    (8 systems)    0.44930
32    ?              0.44914
33    ?              0.42710
34    ?              0.42257
35    ?              0.41692
36    ?              0.40239
37    ?              0.38812
38    ?              0.38113
39    ?              0.32594
40    ?              0.32320
41    ?              0.30988
42    ?              0.29891
43    ?              0.29304
44    ?              0.27642
45    ?              0.23504
46    Team A         0.23108
47    ?              0.21930
48    ?              0.21771
49    ?              0.21291
50    ?              0.20226
51    ?              0.13397
52    ?              0.00000
Private Data
Rank  Team           Score
1     Winner         0.73660
2     Second         0.64934
3     ?              0.30037
4     Team A         0.07656
The ALTA Shared Task in Class at UniMelb
- Students in the UniMelb Knowledge Technologies subject were assigned the shared task as a class project.
- Blended learning: augmenting classroom learning with on-line opportunities.
- Some adaptations were made to the class context:
  - Stage 1: Data pre-processing
  - Stage 2: Feature and method exploration; report write-up
  - Stage 3: Peer review
- Emphasis on critical analysis of methods and results.
ALTA Kaggle in Class at UniMelb
- Students were given the option of participating on-line through Kaggle in Class.
- Participating in the on-line forum gave immediate feedback on performance.
- Open 'competition' through the leaderboard stimulated experimentation.
- Anecdotal observation suggested better overall marks for students who participated on-line.
Conclusions
- Larger participation than in past tasks.
- Used as an assignment in a Masters unit at the University of Melbourne.
- Many participants did much better than our baseline.
- Easy to produce training data.
- Larger training data from other domains (Wikipedia) improves results.
- Kaggle in Class was useful, though we had to use a second “final” submission that had very few participants.
Questions?