learning morphology of romance, germanic, and slavic ... fileintroduction motivation how can we...

17
Learning Morphology of Romance, Germanic, and Slavic languages with the tool Linguistica Helena Blancafort LREC 2010

Upload: others

Post on 06-Sep-2019

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Learning Morphology of

Romance, Germanic, and Slavic

languages with the tool Linguistica

Helena Blancafort

LREC 2010

Page 2: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Outline

1. Introduction

2. State of the art

3. Linguistica: How it works

4. Experiments and Results

5. Conclusions and further work

20/05/2010LREC 2010

2

Page 3: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Introduction

Motivation

How can we predict the cost of developing a morphosyntactic lexicon for a new language?

Goals

Evaluate if we can benefit from unsupervised learning of morphology

Input: Bible parallel corpus, tool Linguistica(Goldsmith 2001, 2006)

20/05/2010LREC 2010

3

Page 4: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

State of the Art: Induction of

morphologyObjective

- induce morphological information from raw data

20/05/2010LREC 2010

4

• Brent et al. 1995; Kazakov, 1997 • MDL (Rissanen ,1998)

Affix inventory

• Schone and Jurafsky 2001;• Yarowsky andWicentowski 2001

Cluster of stems and

affixes

Page 5: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

State of the Art II

Using linguistic knowledge or not

20/05/2010LREC 2010

5

• Nakov et al (2003); Oliver (2005)• Learn all possible endings of an unknown word• Apply Maximum Likelihood Estimation (Mikheev)

Lexicon

• Clément et al. (2004)• Fosbert et al (2006); Loupy et al. (2008)• Pos-tagger Zanchetta and Baroni

(2005)

Inflection Rules

Page 6: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Linguistica: How it works I

• Knowledge-free

• Input: raw corpus

• Heuristics to generate a probabilistic morphological grammar

• MDL (minimum length description) & EM (expectation-maximization algorithm) to filter out inappropriate analysis

20/05/2010LREC 2010

6

Page 7: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Linguistica: How it works II

Signatures

Paradigm-like clusters with words sharing the same affixes

could help to build a morphological grammar

The algorithm:

- Splits a word into stem and affix

- For each stem, list of affixes

- Cluster of stems sharing the same affixes

20/05/2010LREC 2010

7

Page 8: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Linguistica: How it works III

Signatures

20/05/2010LREC 2010

8

NULL.ed.ing.s 68 7889

gather abound account ascend ask belong boil

chasten concern confirm consider delay doubt

encamp enter exceed explain fail fasten fold gain

gather glean greet groan guard hang happen harden

insult journey knock lack leap lift listen look minister

number obey offer overflow

Page 9: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Linguistica: How it works IV

Main hurdles1) Allomorphy

2) Incomplete paradigms due to bad segmentationSpanish verb anunciar:

anunci(o, en, etc.), anunciab(a)

3) No distinction between inflectional and derivational suffixes

20/05/2010LREC 2010

9

ES colgar -> colg, cuelgFR acheter -> achet, achèt

Page 10: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Experiments and Results I

20/05/2010LREC 2010

10

0

100

200

300

400

500

600

pl it cat es fr pt de nl en

number of suffixes generated by Linguistica

Page 11: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Experiments and Results II

Number of paradigmes and number of suffixes

20/05/2010LREC 2010

11

ptfrde

cat

nlen

ites

pl

0

100

200

300

400

500

600

700

800

900

1000

1234567891011121314

Page 12: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Experiments and Results III

20/05/2010LREC 2010

12

0

5

10

15

20

25

30

35

40

45

pl es cat it pt fr de nl en

Max nb forms per signature (Linguistica)

Page 13: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Experiments and Results IV

20/05/2010LREC 2010

13

Max nb forms per

signature (Linguistica)

es 31

it 28

fr 24

de 14

en 9

Max nb forms per

paradigm (Multext)

it 63

fr 62

es 55

de 29

en 14

Knowledge-free vs. Knowledge based

Page 14: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Experiments and Results VLongest signatures suggested by Linguistica for a stem

Affix Stem signature

pl 39 da NULL.ch.cie.dzą.j.je.jmy.jmyż.ją.jąc.li.liście.liśmy.m.my.na.ne.nej.ni.nie.niu.no.ny.ną.rze.sz.wa.wał.wszy.d.ł.ła.łby.łbyś.łem.łeś.ło.ły.o

es 31 anunci a.ad.ada.adas.adlo.ado.amos.an.ando.ar.ara.arles.aron.aros.arte.ará.arán.arás.aré.as.ase.asen.e.emos.en.es.o.áis.é.éis.ó

de 14 heil NULL.e.en.et.ig.los.lose.loser.sam.same.sames.t.te.ten

en 9 light NULL.ed.en.er.ing.ly.ness.ning.s

20/05/2010LREC 2010

14

Page 15: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Experiments and Results VIList of most frequent prefixes for German

Prefix Nb occ. Prefix Nb occ. Prefix Nb occ.

ge 40 her 13 er 8

aus 30 un 13 *nied 7

ver 21 weg 11 bei 6

hin 20 be 10 heim 6

auf 19 zu 10 über 5

ab 19 *üb 9 durch 5

ein 16 an 9 ent 4

20/05/2010LREC 2010

15

Page 16: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

Conclusions and Further Work

Useful information to evaluate the richness and complexity of the morphology of a language

Unsupervised techniques should be improved with human input: handwritten-rules are necessary for dealing with allomorphy and correct bad segmentation (Karasimos & Petropoulo 2010)

Complete paradigms using the web (Oliver 2005) or

Output quality is language-dependent, English better results than other languages (complete verbal paradigms)

20/05/2010LREC 2010

16

Page 17: Learning Morphology of Romance, Germanic, and Slavic ... fileIntroduction Motivation How can we predict the cost of developing a morphosyntactic lexicon for a new language? Goals Evaluate

20/05/2010LREC 2010

17

Thank you

Grazzi