Next slide is the Neutral Avenue System Diagram with a Morphology Learning box added
Avenue Overview
[System diagram. Section labels: Elicitation, Morphology, Rule Learning, Run-Time System, Rule Refinement. Boxes: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Morphology Analyzer, Learning Module (appears twice), Handcrafted rules, Transfer Rules, Lexical Resources, Run Time Transfer System, Lattice, Translation Correction Tool, Rule Refinement Module. Also carries the label "Do NOT Use".]
The next slide is for Ari. It has her sections highlighted but also has the extra box that I added for Morphology Learning.
Rule Refinement
[Avenue system diagram repeated, with the same boxes and section labels as the previous slide. Also carries the label "Do NOT Use".]
Here is where Christian’s presentation begins
Avenue Overview
[Avenue system diagram repeated. Also carries the label "Do NOT Use".]
The Challenge of Morphology
Mapudungun (Indigenous Language of Chile and Argentina, ~1 Million Speakers)
Allkütulekefun
Allkütu -le -ke -fu -n
Allkütu: Listen
-le: prog.
-ke: habitual
-fu: past
-n: indic.1sg
I used to listen
Tasks for Morphology
• Segment Words
• Map Morphemes onto Features
• Learn these tasks
  – unsupervised
  – from data
  – for any language
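The first two tasks can be illustrated on the Allkütulekefun example. The sketch below is only an illustration of their input and output: it assumes the stem is already known and uses a gloss table copied by hand from the slide, so it is not the unsupervised learner that the third bullet asks for. The names `segment`, `gloss` and `SUFFIX_FEATURES` are hypothetical.

```python
# Gloss table copied by hand from the slide (an assumption: not learned from data).
SUFFIX_FEATURES = {
    "le": "prog.",
    "ke": "habitual",
    "fu": "past",
    "n": "indic.1sg",
}


def segment(word, stem, suffix_features=SUFFIX_FEATURES):
    """Task 1: split `word` into a known stem followed by known suffixes."""
    if not word.startswith(stem):
        raise ValueError(f"{word!r} does not start with stem {stem!r}")
    rest = word[len(stem):]
    morphemes = [stem]
    while rest:
        # Try longer suffixes first so a short suffix cannot shadow a long one.
        for suffix in sorted(suffix_features, key=len, reverse=True):
            if rest.startswith(suffix):
                morphemes.append(suffix)
                rest = rest[len(suffix):]
                break
        else:
            raise ValueError(f"no known suffix at the start of {rest!r}")
    return morphemes


def gloss(morphemes, stem_gloss="listen", suffix_features=SUFFIX_FEATURES):
    """Task 2: map each morpheme onto its feature label."""
    stem, *suffixes = morphemes
    return [(stem, stem_gloss)] + [(s, suffix_features[s]) for s in suffixes]


if __name__ == "__main__":
    parts = segment("Allkütulekefun", "Allkütu")
    print(parts)
    print(gloss(parts))
```

Greedy left-to-right matching is enough for this single word; a real segmenter would need to handle ambiguous splits.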
Our Approach
Leverage the Natural Structure of Morphology
• Paradigm
  – Set of affixes that interchangeably attach to a set of stems
Example Vocabulary
blame blamed blames roamed roaming roams solve solves solving
Candidate paradigms (suffixes: stems)
Ø.s: blame, solve
Ø.s.d: blame
s: blame, roam, solve
e.es: blam, solv
Our Approach
Candidate paradigms for the example vocabulary (suffixes: stems)
e.es: blam, solv
e.ed: blam
es: blam, solv
Ø.s.d: blame
Ø.s: blame, solve
Ø: blame, blames, blamed, roams, roamed, roaming, solve, solves, solving
e.es.ed: blam
ed: blam, roam
d: blame, roame
Ø.d: blame
s.d: blame
s: blame, roam, solve
es.ed: blam
e: blam, solv
me.mes: bla
me.med: bla
mes: bla
me.mes.med: bla
med: bla, roa
mes.med: bla
me: bla
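The candidate paradigms listed on these slides can be derived mechanically from the example vocabulary: split every word at every position, record which stems occur with each suffix, and take the stems shared by all suffixes in a set. The sketch below is a minimal illustration of that idea; the function names are hypothetical and the presentation does not show its actual implementation.

```python
from collections import defaultdict

VOCAB = "blame blamed blames roamed roaming roams solve solves solving".split()
NULL = "Ø"  # marker for the empty suffix, as on the slides


def stems_by_suffix(vocab):
    """Map each candidate suffix to the set of stems it attaches to."""
    table = defaultdict(set)
    for word in vocab:
        # Every split with a non-empty stem; the suffix may be empty.
        for i in range(1, len(word) + 1):
            stem, suffix = word[:i], word[i:]
            table[suffix or NULL].add(stem)
    return table


def paradigm_stems(suffixes, table):
    """Stems that occur with every suffix in `suffixes`."""
    stem_sets = [table.get(s, set()) for s in suffixes]
    return set.intersection(*stem_sets) if stem_sets else set()


if __name__ == "__main__":
    table = stems_by_suffix(VOCAB)
    for suffixes in (["Ø", "s"], ["Ø", "s", "d"], ["s"], ["e", "es"]):
        print(".".join(suffixes), sorted(paradigm_stems(suffixes, table)))
```

Because every split is considered, linguistically implausible sets such as me.mes (stem bla) appear alongside the plausible ones, which is why a later selection step is needed.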
Spanish Newswire Corpus
40,011 Tokens
6,975 Types
Suffixes (Stem Type Count): Stems
a.as.o.os (43): african, cas, jurídic, l, ...
a.as.o.os.tro (1): cas
a.as.os (50): afectad, cas, jurídic, l, ...
a.as.o (59): cas, citad, jurídic, l, ...
a.o.os (105): impuest, indonesi, italian, jurídic, ...
a.as (199): huelg, incluid, industri, inundad, ...
a.os (134): impedid, impuest, indonesi, inundad, ...
as.os (68): cas, implicad, inundad, jurídic, ...
a.o (214): id, indi, indonesi, inmediat, ...
as.o (85): intern, jurídic, just, l, ...
a.tro (2): cas, cen
a (1237): huelg, ib, id, iglesi, ...
as (404): huelg, huelguist, incluid, industri, ...
os (534): humorístic, human, hígad, impedid, ...
o (1139): hub, hug, human, huyend, ...
tro (16): catas, ce, cen, cua, ...
as.o.os (54): cas, implicad, jurídic, l, ...
o.os (268): human, implicad, indici, indocumentad, ...
Reading the network
Each node lists its Suffixes, its Stem Type Count, and its Stems.
Level 5 = 5 suffixes
a.as.o.os (43): african, cas, jurídic, l, ... (Adjective Inflection Class)
From the spurious suffix “tro”
a.as.o.os.tro (1): cas
a.tro (2): cas, cen
tro (16): catas, ce, cen, cua, ...
Basic Search Procedure
[Network figure. Axis labels: Increasing Suffix Count, Decreasing Stem Count.]
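The slide names a search procedure, but the transcript preserves only the two axis labels, so the procedure itself is not recoverable here. The sketch below is therefore an assumption, not the presenter's algorithm: a simple greedy walk that starts from one suffix and repeatedly adds the suffix that keeps the most stems, stopping when too many stems would be lost. The `min_ratio` threshold and all function names are hypothetical.

```python
from collections import defaultdict

VOCAB = "blame blamed blames roamed roaming roams solve solves solving".split()
NULL = "Ø"  # marker for the empty suffix, as on the slides


def stems_by_suffix(vocab):
    """Map each candidate suffix to the set of stems it attaches to."""
    table = defaultdict(set)
    for word in vocab:
        for i in range(1, len(word) + 1):
            table[word[i:] or NULL].add(word[:i])
    return table


def greedy_search(seed, table, min_ratio=0.6):
    """Grow a suffix set from `seed`, trading stems for suffixes.

    Each step moves toward more suffixes and fewer stems. The walk stops
    when the best extension would keep fewer than `min_ratio` of the
    current stems (an illustrative stopping rule, chosen arbitrarily).
    """
    suffixes = {seed}
    stems = set(table[seed])
    while True:
        best, best_stems = None, set()
        for candidate in sorted(table):  # sorted for deterministic ties
            if candidate in suffixes:
                continue
            kept = stems & table[candidate]
            if len(kept) > len(best_stems):
                best, best_stems = candidate, kept
        if best is None or len(best_stems) < min_ratio * len(stems):
            return suffixes, stems
        suffixes.add(best)
        stems = best_stems


if __name__ == "__main__":
    found_suffixes, found_stems = greedy_search("s", stems_by_suffix(VOCAB))
    print(sorted(found_suffixes), sorted(found_stems))
```

On the nine-word example vocabulary, starting from s, the walk adds Ø and then stops, because adding d would leave only the stem blame.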
Examples and Evaluation of Automatically Selected Suffix Sets
Ø.ba.n.ndo
a.aba.ado.ados.ar.ará.arán
a.aciones.ación.adas.ado.ar
a.ada.adas.ado.ar.ará
a.adas.ado.an.ar
a.ado.ados.ar.ó
a.ado.an.arse.ó
a.ado.aron.arse.ó
aba.ada.ado.ar.o.os
aciones.ación.ado.ados
aciones.ado.ados.ará
ación.ado.an.e
ada.adas.ado.ados.aron.ó
ada.ado.ados.ar.o
ado.adores.o
ado.ados.arse.e
ado.ar.aron.arse.ará
do.dos.ndo.r.ron
e.ida.ido
emos.ido.ía.ían
ida.ido.idos.ir.ió
ido.iendo.ir
ido.ir.ro
Global Suffix Evaluation
Precision: 0.506
Recall: 0.517
F1: 0.511
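The reported F1 is consistent with the usual definition of F1 as the harmonic mean of precision and recall. A quick check, assuming that standard definition (the slide does not state which formula was used):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values taken from the slide.
    print(round(f1_score(0.506, 0.517), 3))
```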
Next Steps for Morphology Induction
• Improve the Quality of Induced Paradigms
  – Current Work
• Convert Paradigms into a Segmenter
  – Soon
• Learn Mappings from Morphemes to Features
  – Future Goal
Avenue Overview
[Avenue system diagram repeated. Also carries the label "Do NOT Use".]
Mapudungun
• Indigenous Language of Chile and Argentina
• ~ 1 Million Mapuche Speakers
Collaboration
• Mapuche Language Experts
  – Universidad de la Frontera (UFRO)
    • Instituto de Estudios Indígenas (IEI)
      – Institute for Indigenous Studies
• Chilean Funding
  – Chilean Ministry of Education (Mineduc)
    • Bilingual and Multicultural Education Program
Eliseo Cañulef
Rosendo Huisca
Hugo Carrasco
Hector Painequeo
Flor Caniupil
Luis Caniupil Huaiquiñir
Marcela Collio Calfunao
Cristian Carrillan Anton
Salvador Cañulef
Carolina Huenchullan Arrúe
Claudio Millacura Salas
Accomplishments
• Corpora Collection
  – Spoken Corpus
    • Collected: Luis Caniupil Huaiquiñir
    • Medical Domain
    • 3 of 4 Mapudungun Dialects
      – 120 hours of Nguluche
      – 30 hours of Lafkenche
      – 20 hours of Pwenche
    • Transcribed in Mapudungun
    • Translated into Spanish
  – Written Corpus
    • ~ 200,000 words
    • Bilingual Mapudungun – Spanish
    • Historical and newspaper text
nmlch-nmjm1_x_0405_nmjm_00:
M: <SPA>no pütokovilu kay ko
C: no, si me lo tomaba con agua
M: chumgechi pütokoki femuechi pütokon pu <Noise>
C: como se debe tomar, me lo tomé pués
nmlch-nmjm1_x_0406_nmlch_00:
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!
Accomplishments
• Developed At UFRO
  – Bilingual Dictionary with Examples
    • 1,926 entries
  – Spelling Corrected Mapudungun Word List
    • 117,003 fully-inflected word forms
  – Segmented Word List
    • 15,120 forms
    • Stems translated into Spanish
Accomplishments
• Developed at LTI using Mapudungun language resources from UFRO
  – Spelling Checker
    • Integrated into OpenOffice
  – Hand-built Morphological Analyzer
  – Prototype Machine Translation Systems
    • Rule-Based
    • Example-Based
  – LenguasAmerindias.org