HELSINKI UNIVERSITY OF TECHNOLOGY
LABORATORY OF COMPUTER AND INFORMATION SCIENCE
ADAPTIVE INFORMATICS RESEARCH CENTRE
Unsupervised Segmentation of Words into Morphemes
Morpho Challenge Workshop 2006
Mikko Kurimo, Mathias Creutz, Krista Lagus
Opening – Welcomes
Welcome to the Morpho Challenge workshop, everybody!
• challenge participants
• workshop speakers
• other PASCAL researchers
• others interested in the topic
Motivation
To design a statistical machine learning algorithm that segments words into the smallest meaning-bearing units of language, morphemes.
Get basic vocabulary units suitable for different tasks:
• Speech and text understanding
• Machine translation
• Information retrieval
• Statistical language modelling
Rule-based systems can split simple cases such as read + ing, but have difficulties with complicated words and languages.
Workshop 12 April, final timetable
0900 Opening
0910 Introduction and evaluation report
0950 Invited talk by Richard Sproat
1050 Break
1120 Morfessor baseline by Krista Lagus
1150 Competitors' presentations
1230 Lunch
1400 Competitors (contd.)
1500 Discussion
1530 Conclusion
Morning session
09:10 Mikko Kurimo: Introduction and Evaluation report
09:50 Prof. Richard Sproat (Invited Talk), University of Illinois at Urbana-Champaign: "Computational Morphology and its Implications for the Theoretical Morphology"
10:50 – 11:20 Coffee break
Noon session
11:20 Krista Lagus: "Morfessor in MorphoChallenge"
11:50 Delphine Bernhard: "Morphological segmentation for the automatic acquisition of semantic relationships in the context of MorphoChallenge 2005"
12:10 Stefan Bordag: "Two-step approach to unsupervised morpheme segmentation"
12:30 – 14:00 Lunch
Afternoon session
14:00 Lars Johnsen: "Learning morphology on tokens"
14:20 Samarth Keshava and Emily Pitler: "Reports - Quick and Simple Unsupervised Learning of Morphemes"
14:40 Eric Atwell (Mikko Kurimo): "Combinatory Hybrid Elementary Analysis of Text"
15:00 Discussion
15:30 Conclusion
Discussion topics for afternoon
• New ways to evaluate the obtained units?
• New evaluation languages: German, Norwegian, French, Estonian, Arabic, ...?
• Other application evaluations: SLU, IR, MT, ...?
• New organizer partners?
• MorphoChallenge2?
• Journal special issue?
• 2nd Morpho Challenge workshop?
• ?
Opening - Thanks
Thanks to all who made Morpho Challenge possible!
• PASCAL network, coordinators, challenge program organizers
• Morpho Challenge organizing committee
• Morpho Challenge program committee
• Morpho Challenge participants
• Morpho Challenge evaluation team
• Challenge workshop organizers
Let’s start.
It is my pleasure to welcome the first speaker, who is...
Morpho Challenge – Introduction and evaluation report
Mikko Kurimo, Mathias Creutz, Matti Varjokallio (Helsinki, FI)
Ebru Arisoy, Murat Saraclar (Istanbul, TR)
Contents
1. Motivation
2. Call for participation
3. Rules
4. Datasets
5. Participants
6. Results of competition 1, word segmentation
7. Results of competition 2, language modeling
8. Conclusion
Motivation
The scientific goals of this challenge are:
• To learn of the phenomena underlying word construction in natural languages
• To discover approaches suitable for a wide range of languages
• To advance machine learning methodology
Call for participation
• Part of the EU Network of Excellence PASCAL's Challenge Program
• Participation is open to all and free of charge
• Word sets are provided for three languages: Finnish, English, and Turkish
• Implement an unsupervised algorithm that segments the words of each language!
• No language-specific tweaking parameters, please
• Write a paper that describes your algorithm
Rules
• Segmented words are submitted to the organizers
• Two different evaluations are made
• Competition 1: Comparison to a linguistic morpheme segmentation "gold standard"
• Competition 2: Speech recognition experiments, where statistical n-gram language models utilize the morphemes instead of entire words
Datasets
• Word lists are downloadable at our home page
• Each word in the list is preceded by its frequency
• Finnish: newspapers, books, newswires: 1.6/32M
• Turkish: web, newspapers, sports news: 0.6/17M
• English: Gutenberg, Gigaword, Brown: 170k/24M
• Small gold standard sample in each language
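A minimal sketch of reading such a word list, assuming one entry per line of the form "frequency word" separated by whitespace (the slide only says that each word is preceded by its frequency, so the exact layout is an assumption). The function name `read_word_list` and the sample entries are hypothetical. The sketch also assumes that figures such as 170k/24M can be read as distinct word forms versus running words, which is how the two totals are computed here.

```python
from typing import Dict, Iterable


def read_word_list(lines: Iterable[str]) -> Dict[str, int]:
    """Parse lines of the assumed form '<frequency> <word>' into a dict."""
    counts: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, word = line.split(maxsplit=1)
        counts[word] = int(freq)
    return counts


# Hypothetical sample entries, not taken from the challenge data.
sample = ["1290 reading", "37 boulevard", "2 cupbearers'"]
word_counts = read_word_list(sample)

types = len(word_counts)            # number of distinct word forms
tokens = sum(word_counts.values())  # number of running words
print(types, tokens)
```

For a real file, the same function can be fed an open file handle, since it only iterates over lines.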
Participants
A1 Choudri and Dang, Univ. Leeds, UK
A2 a,b, Bernhard, TIMC-IMAG, F
A3 'A.A.', Ahmad and Allendes, Univ. Leeds, UK
A4 'comb', 'lsv', Bordag, Univ. Leipzig, D
A5 Rehman and Hussain, Univ. Leeds, UK
A6 'RePortS', Pitler and Keshava, Univ. Yale, USA
A7 Bonnier, Univ. Leeds, UK
A8 Kitching and Malleson, Univ. Leeds, UK
A9 'Pacman', Manley and Williamson, Univ. Leeds, UK
A10 Johnsen, Univ. Bergen, NO
A11 'Swordfish', Jordan, Healy and Keselj, Univ. Dalhousie, CA
A12 'Cheat', Atwell and Roberts, Univ. Leeds, UK
M1-3 Morfessor, Categories-ML, MAP, Helsinki Univ. Tech, FI
Competition 1: Word segmentation
• Two samples: boule_vard, cup_bearer_s'
• Gold standard: boulevard, cup_bear_er_s_'
• 2 correct hits (H), 1 insertion (I), 2 deletions (D)
• Precision = H / (H + I) = 2 / (2 + 1) = 0.67
• Recall = H / (H + D) = 2 / (2 + 2) = 0.50
• F-Measure = harmonic mean of precision and recall = 2H / (2H + I + D) = 4 / (4 + 1 + 2) = 0.57
• A secret (random) 10% subset of words evaluated
• Morfessor Baseline: 54.2% FI, 51.3% TR, 66.0% EN
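The boundary counting on this slide can be reproduced with a short sketch. It assumes that a segmentation is written with "_" between morphs and that a boundary is identified by its character offset in the unsegmented word. The names `boundaries` and `score` are hypothetical, and this is an illustration of the formulas above rather than the organizers' evaluation script.

```python
def boundaries(segmentation: str, sep: str = "_") -> set:
    """Return the character offsets at which a morph boundary falls."""
    positions, offset = set(), 0
    for morph in segmentation.split(sep)[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def score(pairs):
    """Accumulate hits, insertions and deletions over (proposed, gold) pairs."""
    hits = insertions = deletions = 0
    for proposed, gold in pairs:
        p, g = boundaries(proposed), boundaries(gold)
        hits += len(p & g)        # boundary present in both
        insertions += len(p - g)  # proposed, but not in the gold standard
        deletions += len(g - p)   # in the gold standard, but missed
    precision = hits / (hits + insertions) if hits + insertions else 0.0
    recall = hits / (hits + deletions) if hits + deletions else 0.0
    total = 2 * hits + insertions + deletions
    f_measure = 2 * hits / total if total else 0.0
    return hits, insertions, deletions, precision, recall, f_measure


# The two sample words from the slide.
pairs = [("boule_vard", "boulevard"), ("cup_bearer_s'", "cup_bear_er_s_'")]
h, i, d, precision, recall, f_measure = score(pairs)
print(h, i, d, round(precision, 2), round(recall, 2), round(f_measure, 2))
```

On the two sample words this yields the same counts and scores as the slide: H = 2, I = 1, D = 2, precision 0.67, recall 0.50, F-measure 0.57.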
Results: F-measure in Finnish data
[Bar chart: F-measure, axis ticks 20 to 75; group: Finnish; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell]
F-measure with reference algorithms
[Bar chart: F-measure, axis ticks 20 to 75; group: Finnish; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5]
F-measure in Turkish data
[Bar chart: F-measure, axis ticks 20 to 75; group: Turkish; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell]
F-measure with reference algorithms
[Bar chart: F-measure, axis ticks 20 to 75; group: Turkish; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5]
F-measure in English data
[Bar chart: F-measure, axis ticks 30 to 80; group: English; legend: Choudri, BernhA, BernhB, Ahmad, BordagC, Rehman, Pitler, Bonnier, Kitching, Manley, Johnsen, Jordan, Atwell]
F-measure with reference algorithms
[Bar chart: F-measure, axis ticks 30 to 80; group: English; legend: Choudri, BernhA, BernhB, Ahmad, BordagC, Rehman, Pitler, Bonnier, Kitching, Manley, Johnsen, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5]
F-measure, the 3 languages task
[Bar chart: F-measure, axis ticks 20 to 75; groups: Finnish, Turkish, English; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell]
...with reference algorithms
[Bar chart: F-measure, axis ticks 20 to 75; groups: Finnish, Turkish, English; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP]
Competition 2: Language modeling
• A statistical N-gram LM trained for the obtained morphemes using a large text corpus
• Growing N-gram model for Finnish by HUT tools
• 4-gram model for Turkish using SRILM
• Free lexicon size (40,000 to 700,000)
• ~10M N-grams (Finnish) or 50-70M bytes (Turkish)
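As a rough illustration of what training an N-gram model on morphemes instead of words means, the sketch below counts morph bigrams and computes an unsmoothed maximum-likelihood probability. It is not the growing N-gram model of the HUT tools nor SRILM's 4-gram training. The toy `segmentation` table, the word-boundary token `<w>`, and all function names are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical output of a segmentation algorithm: word -> list of morphs.
segmentation = {
    "reading": ["read", "ing"],
    "readers": ["read", "er", "s"],
    "books": ["book", "s"],
}
WB = "<w>"  # assumed word-boundary token, so words can be rebuilt from morphs


def to_morphs(sentence):
    """Replace each word by its morphs, marking word boundaries with WB."""
    tokens = [WB]
    for word in sentence.split():
        tokens.extend(segmentation.get(word, [word]))  # unknown words stay whole
        tokens.append(WB)
    return tokens


def bigram_counts(sentences):
    """Count bigram histories and morph bigrams over a corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = to_morphs(sentence)
        unigrams.update(tokens[:-1])  # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, cur, unigrams, bigrams):
    """Unsmoothed maximum-likelihood estimate of P(cur | prev)."""
    return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0


corpus = ["readers reading books", "reading books"]
unigrams, bigrams = bigram_counts(corpus)
p_read_ing = bigram_prob("read", "ing", unigrams, bigrams)
print(to_morphs("reading books"), p_read_ing)
```

A practical model would add smoothing and longer histories. The sketch only shows that the vocabulary units are morphs, so the lexicon size depends on the segmentation that was submitted.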
Evaluation by speech recognition
• Realistic benchmark application: continuous reading of large-vocabulary texts (books and news)
• Letter error rate LER% = (sub + ins + del) / letters
• Baseline systems using LMs of Morfessor's segments
• Finnish recognizer made at HUT (HUT tools): speaker-dep., running speed 10-15 xRT, baseline 1.31% LER
• Turkish made at Bogazici Univ. (HTK and AT&T tools): speaker-indep., running 2-3 xRT, baseline 13.7% LER
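The letter error rate formula above can be sketched with a standard Levenshtein alignment over characters, where the minimum number of substitutions, insertions and deletions is divided by the number of letters in the reference text. Whether spaces count as letters and how the recognizers' outputs were normalized is not stated on the slide, so this sketch simply treats every character of the reference as a letter. The function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of substitutions, insertions and deletions (Levenshtein)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference letter
                cur[j - 1] + 1,          # insertion of a hypothesis letter
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def letter_error_rate(ref: str, hyp: str) -> float:
    """LER% = 100 * (sub + ins + del) / number of reference letters."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Hypothetical reference and recognizer output.
ler = letter_error_rate("reading", "reeding")
print(round(ler, 2))
```

Because the error is counted per letter rather than per word, a recognizer that gets one morph of a long word wrong is penalized only for the letters it actually missed.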
Speech recognition letter error rate (LER)
[Bar chart: LER, axis ticks 11 to 16; groups: Finnish*10, Turkish*1; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell]
LER for reference algorithms
[Bar chart: LER, axis ticks 10 to 16; groups: Finnish*10, Turkish*1; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5, Rover]
LER for grammatic rules and words, too
[Bar chart: LER, axis ticks 10 to 16; groups: Finnish*10, Turkish*1; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5, Rover, GoldStd, Words]
Update for Turkish results NEW
[Bar chart: LER, axis ticks 10 to 15; groups: Turkish pruned, Turkish full LM; legend: Choudri, BernhA, BernhB, BordagC, Rehman, Bonnier, Manley, Jordan, Atwell, Morfess., MorfML, MorfMAP, C-All, C-Top5, Rover, GoldStd, Words]
Conclusion
The scientific goals of this challenge are:
• To learn of the phenomena underlying word construction in natural languages
• To discover approaches suitable for a wide range of languages
• To advance machine learning methodology
Conclusion
• 14 different unsupervised segmentation algorithms
• 12 participating research groups
• Evaluations for 3 languages
• Full report and papers in the proceedings
• Website: http://www.cis.hut.fi/morphochallenge2005
Acknowledgments
• Text and speech data providers in all languages!
• Finnish and Turkish evaluation teams
• Funding from PASCAL, Finnish Academy, Lang. Tech. Grad school, HUT, and Bogazici Univ.
• LM and ASR tools in HUT, SRI, and AT&T
• Competition participants!
The second speaker today:
Professor Richard Sproat, University of Illinois at Urbana-Champaign:
"Computational Morphology and its Implications for the Theoretical Morphology"
Richard Sproat
Professor of Linguistics and Electrical and Computer Engineering at the University of Illinois and head of the Computational Linguistics Lab at the Beckman Institute.
Received his Ph.D. from MIT in 1985 and has since also worked at AT&T Bell Labs.
A well-known expert in language and computational linguistics, including syntax, morphology, computational morphology, articulatory and acoustic phonetics, text processing, text-to-speech synthesis, writing systems, and text-to-scene conversion.