
Multilingual Language Processing

Hal Daumé III (me@hal3.name)
Computer Science, University of Maryland

Blair Linguistics Club
19 Nov 2014

Piyush Rai (Duke)
Lyle Campbell (U Hawaii)
Sujith Ravi (Google)
Adam Teichert (JHU)

Statistics, Typology and NLP

Why study O(100) languages?

➢ What makes a language a human language?
➢ What properties of “Language” can be learned from, or exploited in, text?
➢ Computational challenge of dealing with large, uncertain data sets
➢ You never know what language will be important tomorrow
➢ Pairwise models of language don't scale
➢ Hard to find linguists or translators in minority languages


Typical NLP pipeline

Analysis: Source Words → Source Morphology → Source Syntax → Source Shallow Semantics → Source Semantics → Interlingua
Generation: Interlingua → Target Semantics → Target Shallow Semantics → Target Syntax → Target Morphology → Target Words

Example: The man ate a sandwich

➢ Morphology: The man eat+past a sandwich
➢ Tagging: DT NN VB DT NN
➢ Parsing: [S [NP The man] [VP ate [NP a sandwich]]]
➢ Role labeling: Agent (The man), Theme (a sandwich)
➢ Interpretation: ∃ a ∃ t ∃ e man(a) & sandwich(t) & eat(e,a,t) & past(e)


A unified approach

➢ Raw Text
➢ Linguistic Features
➢ Annotated Treebanks
➢ Typological Features: VO ⊃ PreP, PostP ⊃ OV
➢ Parallel Data

Languages: Afrikaans, Albanian, Amuzgo, Arabic, Arabic (Syrian), Armenian, Armenian, Azerbaijani, Basque, Bulgarian, Burmese, Byzantine, Cakchiquel, Chamorro, Cherokee, Chinantec, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Gaelic, German, Greek, Gujarati, Haitian Creole, Hebrew, Hiligaynon, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Jacalteco, Kannada, K'ekchí, Klingon, Korean, Latin, Latvian, Lithuanian, Low German, Macedonian, Malagasy, Malayalam, Mam, Mam of Todos, Mandan, Mandarin, Maori, Nahuatl, Ndebele, Norwegian, Orya, Persian, Polish, Portuguese, Potawatomi, Quiché, Romanian, Romani, Russian, Serbian, Shona, Slovak, Somali, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Uma, Urdu, Uspanteco, Vietnamese


How does (e.g.) syntax work?

➢ Get some linguists to annotate text with syntactic structures (Annotated Treebanks)
➢ Estimate a probabilistic context-free grammar (PCFG) from those structures
➢ Use that PCFG to parse new “test” sentences
➢ Works for any language for which we have annotated text!
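The "estimate a PCFG from annotated structures" step can be sketched with relative-frequency counting. This is only a minimal illustration: the two-tree treebank, the nested-tuple tree encoding, and the names `count_rules` and `estimate_pcfg` are assumptions made for the example, not anything taken from the talk.

```python
from collections import Counter, defaultdict

# Toy treebank (hypothetical). A tree is a nested tuple (label, child, ...);
# a leaf is a plain string.
TREEBANK = [
    ("S", ("NP", ("D", "the"), ("N", "man")),
          ("VP", ("V", "ate"), ("NP", ("D", "a"), ("N", "sandwich")))),
    ("S", ("NP", ("D", "the"), ("N", "man")),
          ("VP", ("V", "slept"))),
]

def count_rules(tree, counts):
    """Record one production (lhs, rhs) for every internal node of the tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, counts)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter()
    for tree in treebank:
        count_rules(tree, counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

pcfg = estimate_pcfg(TREEBANK)
for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

Parsing new sentences with the estimated grammar (for example with CKY) is a separate step that this sketch leaves out.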



Multilinguality as a source of x-fer

English PCFG:
  The/D man/N ate/V a/D tasty/J sandwich/N
  Constituents: NP, VP, NP, S
French PCFG:
  Le/D homme/N a/A mange/V un/D sandwich/N savoureux/J
  Constituents: NP, VP, NP, S
Spanish PCFG:
  El/D hombre/N se/A comio/V un/D bocadillo/N sabroso/J
  Constituents: NP, VP, NP, S

All three PCFGs are connected to ϴ.
[Berg-Kirkpatrick & Klein; ACL10] [Iwata, Mochihashi & Sawada; ACL10]

+21% on average over 8 languages: English, Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese

See also: Snyder, Barzilay et al....

Typology can help language processing

Language processing can help typology

Statistics is the mediator


Implicational Universals

English:
  I eat dinner in restaurants.
French:
  je mange le diner dans les restaurants
  I eat the dinner in the restaurants
Japanese:
  boku-wa bangohan-o resutoran-de taberu
  I-topic dinner-obj restaurants-in eat
Hindi:
  main raat ka khaana restra mein khaata hoon
  I night-of-meal restaurants in eat am

Verb-Object (VO): English, French
Object-Verb (OV): Japanese, Hindi
Prepositional (PreP): English, French
Postpositional (PostP): Japanese, Hindi

VO ⊃ PreP
PostP ⊃ OV

The Typologist's Life

Contingency table of PreP / PostP against VO / OV (Greenberg, 1963), based on 30 diversely sampled languages.
Cell counts as extracted: 16 0 3 11

Now, repeat for lots of feature pairs
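A small sketch of how such a table is read off as implications. One loud assumption: the flattened counts do not say which off-diagonal cell holds the 3, so the layout below (rows VO/OV, columns PreP/PostP, read row by row) is a guess, chosen because it agrees with the implications VO ⊃ PreP and PostP ⊃ OV shown elsewhere in the talk. Swap the two off-diagonal cells if the other reading is intended. The function name `implication_rate` is hypothetical.

```python
# ASSUMED cell layout: rows VO/OV, columns PreP/PostP, read row by row.
# The transcript does not settle which off-diagonal cell is 0 and which is 3.
COUNTS = {
    ("VO", "PreP"): 16, ("VO", "PostP"): 0,
    ("OV", "PreP"): 3,  ("OV", "PostP"): 11,
}

def implication_rate(counts, antecedent, consequent):
    """Fraction of languages with `antecedent` that also have `consequent`."""
    with_antecedent = sum(n for pair, n in counts.items() if antecedent in pair)
    with_both = sum(n for pair, n in counts.items()
                    if antecedent in pair and consequent in pair)
    return with_both / with_antecedent

for a, c in [("VO", "PreP"), ("PostP", "OV"), ("OV", "PostP"), ("PreP", "VO")]:
    print(f"{a} => {c}: {implication_rate(COUNTS, a, c):.2f}")
```

Under this reading the first two implications have no exceptions, while their converses do. Under the alternative reading the roles are reversed.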

Difficulties with Typical Approach

A ⊃ B (99%) is uninteresting when ∅ ⊃ B (99%)

Search process is tedious

Sampling problem when many languages considered

Process is inherently noisy
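The first difficulty can be made concrete by comparing P(B | A) with the baseline P(B). The counts below are invented purely for illustration, and `implication_lift` is a hypothetical helper name.

```python
def implication_lift(n_a_and_b, n_a, n_b, n_total):
    """Return (P(B | A), P(B), ratio of the two)."""
    p_b_given_a = n_a_and_b / n_a
    p_b = n_b / n_total
    return p_b_given_a, p_b, p_b_given_a / p_b

# Hypothetical case 1: B is nearly universal, so A => B tells us nothing new.
boring = implication_lift(n_a_and_b=99, n_a=100, n_b=990, n_total=1000)

# Hypothetical case 2: same P(B | A), but B is much rarer overall.
interesting = implication_lift(n_a_and_b=99, n_a=100, n_b=400, n_total=1000)

print("boring:      P(B|A)=%.2f  P(B)=%.2f  ratio=%.2f" % boring)
print("interesting: P(B|A)=%.2f  P(B)=%.2f  ratio=%.2f" % interesting)
```

Both cases have a 99% implication rate, but only the second one says something about A that the baseline does not already say.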

A Typological Database

➢ 2150 Languages
  ➢ 35 language families
  ➢ 275 language genera
➢ 139 Features
  ➢ 11 feature categories
➢ Sparsely sampled
  ➢ 85% missing data

[Figure: Typological Map: VO]

[Figure: Typological Map: PreP]


An Initial Model

➢ Consider two features --> 2xN matrix
➢ First, generate first column with prior probability π1
➢ Next, decide if the implication holds
➢ Finally, generate the second column:
  ➢ With probability π2 if feature 1 is not “+” or if the implication doesn't hold
  ➢ Forced to be “+” otherwise

Example data:
  VO:   ++-++?+??+-+?+-
  PreP: +?+-+++--?-+-++

Problems:
➢ Cannot handle noisy data
➢ Doesn't address sampling problem

Graphical model variables: m, π1, π2, f1, f2
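The generative story above can be sketched as a sampler. This is a minimal sketch under stated assumptions: the function name `sample_initial_model` and the parameter `p_implication` (a prior probability that the implication holds) are introduced here for illustration, and missing values ("?") are not modeled.

```python
import random

def sample_initial_model(n_languages, pi1, pi2, p_implication, rng):
    """Sample (m, rows); each row is (f1, f2) for one language."""
    # m: does the implication "f1 implies f2" hold for this feature pair?
    m = rng.random() < p_implication
    rows = []
    for _ in range(n_languages):
        f1 = rng.random() < pi1          # first feature, prior probability pi1
        if m and f1:
            f2 = True                    # forced to "+" when the implication applies
        else:
            f2 = rng.random() < pi2      # otherwise drawn with probability pi2
        rows.append((f1, f2))
    return m, rows

rng = random.Random(0)
m, rows = sample_initial_model(15, pi1=0.6, pi2=0.3, p_implication=1.0, rng=rng)
print("implication holds:", m)
print("f1:", "".join("+" if f1 else "-" for f1, _ in rows))
print("f2:", "".join("+" if f2 else "-" for _, f2 in rows))
```

With `p_implication=1.0` every language that has f1 also has f2, which is exactly the rigidity that the "cannot handle noisy data" problem refers to.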

Fixing the Noise Problem

➢ Assume language-specific noise
➢ Model remains unchanged, except a new variable causes “f” to be flipped

Graphical model variables: m, π1, π2, f1, f2, e1, e2, ε
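A minimal sketch of the flipping step, assuming that each true feature value f is observed as e and is flipped independently with the language's noise rate ε. The helper names `observe` and `observe_language` are hypothetical.

```python
import random

def observe(f, epsilon, rng):
    """Observed value e: the true value f, flipped with probability epsilon."""
    return (not f) if rng.random() < epsilon else f

def observe_language(f1, f2, epsilon, rng):
    """Apply the same language-specific noise rate to both features."""
    return observe(f1, epsilon, rng), observe(f2, epsilon, rng)

rng = random.Random(0)
print(observe_language(True, True, epsilon=0.05, rng=rng))
```

With noise in the model, an observed "+ / -" pair no longer rules the implication out. It only makes it less probable.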


Fixing the Sampling Problem

➢ Hierarchical Bayes prior...

Per-language variables, repeated for each language: f1, f2, e1, e2, ε
Hierarchy of implication variables, as labeled on the slide: m0, mIE, mGer, mRom, mAus, mOce

Inference

➢ Binomials get Beta priors
  ➢ m ~ Uniform
  ➢ ε ~ Beta with 5% mean, 0-10% with 50% probability
➢ Everything else gets uniform priors
➢ Inference by Gibbs sampling
  ➢ Plus a rejection sampler subroutine

Three Models

➢ Flat – All languages independent
➢ LingHier – Typological Hierarchy
➢ DistHier – Obtained by clustering positionally

Automatically Extracting Implications

➢ Search only over pairs with:
  ➢ At least 250 languages for which both features are known
  ➢ At least 15 languages for which both hold simultaneously
  ➢ When f1 is true, f2 is true with >50% probability
➢ Reduces space from 19,000 to 3442
➢ Sort by probability that m is true
➢ Evaluate:
  ➢ Compare restoration accuracy of the models against each other
  ➢ Compare against well-known implications
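The pair-filtering criteria above can be sketched as a predicate over two feature columns. Assumptions made here: the thresholds are treated as minimums, unknown values are encoded as `None`, and the >50% rate is computed only over languages where both features are known. `keep_pair` is a hypothetical name.

```python
def keep_pair(values_f1, values_f2, min_known=250, min_both=15, min_rate=0.5):
    """Decide whether the ordered pair (f1, f2) is worth scoring.

    values_f1, values_f2: per-language lists of True, False, or None (unknown).
    """
    known = [(a, b) for a, b in zip(values_f1, values_f2)
             if a is not None and b is not None]
    both = sum(1 for a, b in known if a and b)
    f1_true = sum(1 for a, _ in known if a)
    if len(known) < min_known or both < min_both or f1_true == 0:
        return False
    return both / f1_true > min_rate

# Hypothetical data: 300 languages, f1 true for the first 100, f2 for the first 80.
f1 = [i < 100 for i in range(300)]
f2 = [i < 80 for i in range(300)]
print(keep_pair(f1, f2))   # 80 of the 100 f1-languages also have f2
```

Only the pairs that pass this filter would then be ranked by the posterior probability that m is true.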

[Figure: Restoration Accuracy by Model]

Top Implications – LingHier

[Table: top-ranked implications under LingHier, each with an analysis. The pairing of implicant, implicand, and analysis is not recoverable from this transcript.]

Features appearing in the table: Postpositions (PostP), PreP, OV, VO, SV, Gen-N, N-Gen, Adj-N, N-Adj, Dem-N, Num-N, Noun-RelC, Suffixing, Prefixing, Strong prefixing, Tense Suf., Plural prefix, Little affixation, No pron poss afx, Oblig. subj. pron, Init. Subord., Final Sub. Word, Subord. Suffix, Intr. verb, No question prt., Negative word, Labial-velars, No uvulars, Many vowels, No fricatives, No tones, High+Mid F.V.s

Analyses cited: Greenberg #2a, #2b, #3, #4, #18, #27b (several as converses); Hawkins XVI (for postpositional languages); Lehmann; Operator-operand principle (Lehmann); Clear explanation; Appeal to economy; See paper; ???

Notes

➢ If you think this stuff is interesting, you should read the Dunn et al Nature paper
➢ Main claim:
  ➢ All of this typology stuff is bogus
  ➢ Once you account for “genetic” influences
➢ Directly contradicts what I've just told you
➢ Who is right?

Automatic Induction of Syntax

➢ INPUT: A pile of text
➢ OUTPUT: Syntactic structures of this text
➢ Current approaches are mostly based on dependency formalisms

Example: The/D man/N ate/V a/D big/J sandwich/N
Dependency arcs: MOD (The → man), SUBJ (man → ate), OBJ (sandwich → ate), MOD (a → sandwich), MOD (big → sandwich)

Probabilistic Models of Syntax

Tags: D N V D J N

One factor per attachment:
  p(V|0,r)
  p(N|V,l)
  p(D|N,l)
  p(N|V,r)
  p(D|N,l)
  p(J|N,l)

p(Data) = p(V|0,r) p(N|V,l) p(D|N,l) p(N|V,r) p(J|N,l) p(D|N,l)
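The product above can be evaluated directly once each factor has a value. In this sketch the probabilities in `PROBS` are made up for illustration, and each factor is read as p(child tag | parent tag, direction), with "0" as the root symbol. That reading is an assumption based on the notation, not something the slide spells out.

```python
import math

# One (child, parent, direction) triple per attachment, in the order of the product.
ATTACHMENTS = [
    ("V", "0", "r"), ("N", "V", "l"), ("D", "N", "l"),
    ("N", "V", "r"), ("J", "N", "l"), ("D", "N", "l"),
]

# Hypothetical parameter values, for illustration only.
PROBS = {
    ("V", "0", "r"): 0.9,
    ("N", "V", "l"): 0.6,
    ("N", "V", "r"): 0.5,
    ("D", "N", "l"): 0.4,
    ("J", "N", "l"): 0.2,
}

def log_prob(attachments, probs):
    """Sum of log factors; working in log space avoids underflow on long sentences."""
    return sum(math.log(probs[a]) for a in attachments)

p_data = math.exp(log_prob(ATTACHMENTS, PROBS))
print(f"p(Data) = {p_data:.5f}")
```

Induction then amounts to searching for the structures and parameters that make this probability high over the whole pile of text.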

Inferring Tags from the Structure

➢ INPUT: The man ate a big sandwich
➢ OUTPUT: D N V D J N
➢ Baseline:
  ➢ Random guessing: 4% accuracy

Sources of Knowledge

➢ Seeds (frequent words for each tag)
  ➢ N: membro, milhoes, obras
  ➢ D: as [the,2f] o [the,1m] os [the,2m]
  ➢ V: afector, gasta, juntar
  ➢ P: com, como, de, em
➢ Typological rules:
  ➢ Art ← Noun
  ➢ Prp → Noun
➢ Tag knowledge:
  ➢ Open class
  ➢ Closed class

Preliminary Results

[Figure: bar chart comparing No Seeds and Seeds, with series No O/C and Open/Closed; axis ticks from 0 to 60]

Preliminary Results: Open/Closed

[Figure: two bar charts, NO SEEDS and SEEDS, each comparing No Rules, Art<-N, Prp->N, and Both; axis ticks from 20 to 60]

Where does the tree come from?

Kingman's Coalescent

➢ A standard model for the genealogy of a population
➢ Each organism has exactly one parent (haploid)
➢ Thus, the genealogy is a tree
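A minimal simulation sketch of the standard Kingman coalescent over a finite set of leaves: going backward in time, with k lineages remaining, the waiting time to the next merge is exponential with rate k(k-1)/2, and the pair that merges is chosen uniformly. The leaf names and the function names are placeholders for illustration, and no feature data is attached to the tree here.

```python
import random

def sample_coalescent(leaves, rng):
    """Return (tree, merge_times): nested 2-tuples over `leaves`, and the
    backward-in-time moment of each merge."""
    lineages = list(leaves)
    time = 0.0
    merge_times = []
    while len(lineages) > 1:
        k = len(lineages)
        rate = k * (k - 1) / 2             # each of the k-choose-2 pairs merges at rate 1
        time += rng.expovariate(rate)      # waiting time until the next merge
        i, j = rng.sample(range(k), 2)     # uniformly chosen pair of lineages
        merged = (lineages[i], lineages[j])
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
        merge_times.append(time)
    return lineages[0], merge_times

def leaf_names(tree):
    """Flatten a sampled tree back into its list of leaves."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaf_names(left) + leaf_names(right)

tree, times = sample_coalescent(["English", "Dutch", "Danish", "Swedish"], random.Random(0))
print(tree)
print([round(t, 3) for t in times])
```

Because early merges happen at a high rate and the last merge at rate 1, most of the tree's height tends to come from the final few coalescences.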

An infinite tree...

Graphical model on a coalescent

Understanding language relationships

Modeling errors

The Balkans

Linguistic areas

Classic linguistic areas

Model desiderata

Pitman-Yor Processes

Generative Story

Discovered results

Shared features

Reconstruction accuracies

Conclusions + Future Steps

➢ Can infer IUs (implicational universals) from data (WALS)
  ➢ Old ones and new ones
  ➢ Can handle the sampling problem
➢ Can use typology to help tagging
  ➢ Open/closed
  ➢ Simple features
➢ Infer tree structure, too
➢ Don't assume features: just IUs!
➢ Infer multiple languages simultaneously
➢ Feedback from text to IUs