Source: uva-slpl.github.io/nlp2/resources/slides/ibm.pdf


TRANSCRIPT

Lexical alignment: IBM models 1 and 2
MLE via EM for categorical distributions

Wilker Aziz

April 11, 2017

Translation data

Let’s assume we are confronted with a new language and, luckily, we managed to obtain some sentence-aligned data

the black dog          � ~
the nice dog           � ∪
the black cat          � ~
a dog chasing a cat    � / �

Is there anything we could say about this language?

1 / 29


Translation by analogy

the black dog          � ~
the nice dog           � ∪
the black cat          � ~
a dog chasing a cat    � / �

A few hypotheses:

- � ⇐⇒ dog
- � ⇐⇒ cat
- ~ ⇐⇒ black
- nouns seem to precede adjectives
- determiners are probably not expressed
- chasing may be expressed by /, and perhaps this language is OVS
- or perhaps chasing is realised by a verb with swapped arguments

2 / 29


Probabilistic lexical alignment models

This lecture is about operationalising this intuition

- through a probabilistic learning algorithm
- for a non-probabilistic approach see, for example, [Lardilleux and Lepage, 2009]

3 / 29

Content

Lexical alignment

Mixture models

IBM model 1

IBM model 2

Remarks

4 / 29

Word-to-word alignments

Imagine you are given a text

the black dog          o cao preto
the nice dog           o cao amigo
the black cat          o gato preto
a dog chasing a cat    um cao perseguindo um gato

5 / 29


Word-to-word alignments

Now imagine the French words were replaced by placeholders

the black dog F1 F2 F3

the nice dog F1 F2 F3

the black cat F1 F2 F3

a dog chasing a cat F1 F2 F3 F4 F5

and suppose our task is to have a model explain the original data by generating each French word from exactly one English word

5 / 29

Generative story

For each sentence pair independently,

1. observe an English sentence e_1, ..., e_m and a French sentence length n

2. for each French word position j from 1 to n

   2.1 select an English position a_j
   2.2 conditioned on the English word e_{a_j}, generate f_j

We have introduced an alignment, which is not directly visible in the data.

6 / 29
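
To make this story concrete, below is a minimal Python sketch of the sampling process. It is not part of the slides: the lexical table lex and the example sentence are invented, the alignment position is drawn uniformly over English positions, and the Null word is omitted for brevity.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Toy lexical distributions P(f | e); the entries are invented for illustration.
lex = {
    "the":   {"o": 0.7, "um": 0.3},
    "black": {"preto": 1.0},
    "dog":   {"cao": 1.0},
}

def generate_french(english, n):
    """Follow the generative story: for each French position j,
    select an English position a_j, then draw f_j from P(F | e_{a_j})."""
    french, alignment = [], []
    for j in range(n):
        a_j = rng.randrange(len(english))           # 2.1 select an English position (uniform here)
        words, probs = zip(*lex[english[a_j]].items())
        f_j = rng.choices(words, weights=probs)[0]  # 2.2 generate f_j conditioned on e_{a_j}
        alignment.append(a_j)
        french.append(f_j)
    return french, alignment

print(generate_french(["the", "black", "dog"], n=3))
```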



Data augmentation

Observations:

the black dog o cao preto

Imagine data is made of pairs: (a_j, f_j) and e_{a_j} → f_j

the black dog    (1, the → o) (3, dog → cao) (2, black → preto)

the black dog    (1, the → o) (1, the → cao) (1, the → preto)

the black dog    (a_1, e_{a_1} → f_1) (a_2, e_{a_2} → f_2) (a_3, e_{a_3} → f_3)

7 / 29

Content

Lexical alignment

Mixture models

IBM model 1

IBM model 2

Remarks

8 / 29

Mixture models: generative story

[Plate diagram: component Z generates observation X, repeated for m observations]

- c mixture components
- each defines a distribution over the same data space X
- plus a distribution over components themselves

Generative story

1. select a mixture component z ∼ P(Z)

2. generate an observation from it x ∼ P(X | Z = z)

9 / 29


Mixture models: likelihood

[Plate diagram: component Z generates observation X, repeated for m observations]

Incomplete-data likelihood

P(x_1^m) = ∏_{i=1}^m P(x_i)    (1)

         = ∏_{i=1}^m ∑_{z=1}^c P(X = x_i, Z = z)    (2)
                     (the summand is the complete-data likelihood)

         = ∏_{i=1}^m ∑_{z=1}^c P(Z = z) P(X = x_i | Z = z)    (3)

10 / 29

Interpretation

Missing data

- Let Z take one of c mixture components
- Assume data consists of pairs (x, z)
- x is always observed
- z is always missing

Inference: posterior distribution over possible Z for each x

P(Z = z | X = x) = P(Z = z, X = x) / ∑_{z'=1}^c P(Z = z', X = x)    (4)

                 = P(Z = z) P(X = x | Z = z) / ∑_{z'=1}^c P(Z = z') P(X = x | Z = z')    (5)

11 / 29
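
As a small illustration of eqs. (4)-(5), the sketch below computes the posterior over components for a made-up two-component categorical mixture; the probability tables are invented, not estimated from anything.

```python
# Invented two-component mixture: P(Z) and P(X | Z).
p_z = {1: 0.5, 2: 0.5}
p_x_given_z = {1: {"a": 0.2, "b": 0.8},
               2: {"a": 0.7, "b": 0.3}}

def posterior(x):
    """P(Z = z | X = x): joint P(Z = z, X = x) divided by the marginal P(X = x)."""
    joint = {z: p_z[z] * p_x_given_z[z][x] for z in p_z}
    marginal = sum(joint.values())
    return {z: joint[z] / marginal for z in joint}

print(posterior("a"))  # roughly {1: 0.222, 2: 0.778}
```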


Non-identifiability

Different parameter settings, same distribution

Suppose X = {a, b} and c = 2, and let P(Z = 1) = P(Z = 2) = 0.5

Setting 1:
  Z      X = a   X = b
  1      0.2     0.8
  2      0.7     0.3
  P(X)   0.45    0.55

Setting 2:
  Z      X = a   X = b
  1      0.7     0.3
  2      0.2     0.8
  P(X)   0.45    0.55

This is a problem for parameter estimation by hill climbing.

12 / 29
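
This can be checked numerically. The following snippet (mine, not from the slides) computes the marginal P(X) for both settings above and confirms they coincide.

```python
p_z = {1: 0.5, 2: 0.5}

setting_1 = {1: {"a": 0.2, "b": 0.8}, 2: {"a": 0.7, "b": 0.3}}
setting_2 = {1: {"a": 0.7, "b": 0.3}, 2: {"a": 0.2, "b": 0.8}}  # components swapped

def marginal(p_x_given_z):
    """P(X = x) = sum_z P(Z = z) P(X = x | Z = z)."""
    return {x: round(sum(p_z[z] * p_x_given_z[z][x] for z in p_z), 10)
            for x in ("a", "b")}

print(marginal(setting_1))  # {'a': 0.45, 'b': 0.55}
print(marginal(setting_2))  # identical marginal: the likelihood cannot tell the settings apart
```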


Maximum likelihood estimation

Suppose a dataset D = {x^{(1)}, x^{(2)}, ..., x^{(m)}}

Suppose P(X) is one of a parametric family with parameters θ

Likelihood of i.i.d. observations:

P(D) = ∏_{i=1}^m P_θ(X = x^{(i)})

The score function is

l(θ) = ∑_{i=1}^m log P_θ(X = x^{(i)})

and then we choose

θ* = argmax_θ l(θ)

13 / 29


MLE for categorical: estimation from fully observed data

Suppose we have complete data

- D_complete = {(x^{(1)}, z^{(1)}), ..., (x^{(m)}, z^{(m)})}

Then, for a categorical distribution

P(X = x | Z = z) = θ_{z,x}

and n(z, x | D_complete) = count of (z, x) in D_complete

MLE solution:

θ_{z,x} = n(z, x | D_complete) / ∑_{x'} n(z, x' | D_complete)

14 / 29
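
A minimal sketch of this fully observed MLE, on invented (x, z) pairs: count each (z, x) and normalise per component.

```python
from collections import Counter, defaultdict

# Invented complete data: (x, z) pairs, with the component z observed.
data = [("a", 1), ("b", 1), ("b", 1), ("a", 2), ("a", 2), ("b", 2)]

counts = Counter((z, x) for x, z in data)   # n(z, x | D_complete)
totals = defaultdict(float)                 # n(z, . | D_complete)
for (z, x), c in counts.items():
    totals[z] += c

# theta[z, x] = n(z, x) / sum_x' n(z, x'); only observed (z, x) pairs appear here.
theta = {(z, x): c / totals[z] for (z, x), c in counts.items()}
print(theta)  # e.g. theta[(1, 'b')] = 2/3 and theta[(2, 'a')] = 2/3
```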


MLE for categorical: estimation from incomplete data

Expectation-Maximisation algorithm [Dempster et al., 1977]

E-step:

- for every observation x, imagine that every possible latent assignment z happened with probability P_θ(Z = z | X = x)

D_completed = {(x, Z = 1), ..., (x, Z = c) : x ∈ D}

15 / 29

MLE for categorical: estimation from incomplete data

Expectation-Maximisation algorithm [Dempster et al., 1977]

M-step:

- re-estimate θ so as to climb the likelihood surface
- for categorical distributions P(X = x | Z = z) = θ_{z,x}
  - z and x are categorical
  - 0 ≤ θ_{z,x} ≤ 1 and ∑_{x ∈ X} θ_{z,x} = 1

θ_{z,x} = E[n(z → x | D_completed)] / ∑_{x'} E[n(z → x' | D_completed)]    (6)

        = [∑_{i=1}^m ∑_{z'} P(z' | x^{(i)}) 1_z(z') 1_x(x^{(i)})] / [∑_{i=1}^m ∑_{x'} ∑_{z'} P(z' | x^{(i)}) 1_z(z') 1_{x'}(x^{(i)})]    (7)

        = [∑_{i=1}^m P(z | x^{(i)}) 1_x(x^{(i)})] / [∑_{i=1}^m ∑_{x'} P(z | x^{(i)}) 1_{x'}(x^{(i)})]    (8)

16 / 29
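
Putting the E-step and M-step (eqs. (6)-(8)) together, here is a compact EM sketch for a categorical mixture. The data and initialisation are toy assumptions; it only illustrates the updates and is not a tuned implementation.

```python
import random
from collections import defaultdict

data = ["a", "b", "b", "a", "b", "a", "a"]   # observed x's; the component z is missing
components = [1, 2]
support = sorted(set(data))
rng = random.Random(0)

# Random (normalised) initialisation of P(Z) and P(X | Z).
p_z = {z: 1.0 / len(components) for z in components}
theta = {}
for z in components:
    w = [rng.random() for _ in support]
    theta[z] = {x: wi / sum(w) for x, wi in zip(support, w)}

for iteration in range(50):
    exp_zx = defaultdict(float)   # E-step: expected counts n(z -> x)
    exp_z = defaultdict(float)    #         expected counts for z alone
    for x in data:
        joint = {z: p_z[z] * theta[z][x] for z in components}
        norm = sum(joint.values())
        for z in components:
            r = joint[z] / norm            # responsibility P(Z = z | X = x)
            exp_zx[(z, x)] += r
            exp_z[z] += r
    # M-step: normalise the expected counts (eqs. 6-8).
    p_z = {z: exp_z[z] / len(data) for z in components}
    theta = {z: {x: exp_zx[(z, x)] / exp_z[z] for x in support} for z in components}

print(p_z)
print(theta)
```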

Content

Lexical alignment

Mixture models

IBM model 1

IBM model 2

Remarks

17 / 29

IBM1: a constrained mixture model

[Plate diagram: English words e_0^m observed; for each of the n French positions, a latent alignment a selects the English word that generates the French word f; replicated over S sentence pairs]

Constrained mixture model

- mixture components are English words
- but only English words that appear in the English sentence can be assigned
- a_j acts as an indicator for the mixture component that generates French word f_j
- e_0 is occupied by a special Null component

18 / 29


Parameterisation

Alignment distribution: uniform

P(A | M = m, N = n) = 1 / (m + 1)    (9)

Lexical distribution: categorical

P(F | E = e) = Cat(F | θ_e)    (10)

- where θ_e ∈ R^{v_F}
- 0 ≤ θ_{e,f} ≤ 1
- ∑_f θ_{e,f} = 1

19 / 29

IBM1: incomplete-data likelihood

[Plate diagram: same constrained mixture model as on the previous slide]

Incomplete-data likelihood

P(f_1^n | e_0^m) = ∑_{a_1=0}^m ··· ∑_{a_n=0}^m P(f_1^n, a_1^n | e_0^m)    (11)

                 = ∑_{a_1=0}^m ··· ∑_{a_n=0}^m ∏_{j=1}^n P(a_j | m, n) P(f_j | e_{a_j})    (12)

                 = ∏_{j=1}^n ∑_{a_j=0}^m P(a_j | m, n) P(f_j | e_{a_j})    (13)

20 / 29

IBM1: posterior

Posterior

P(a_1^n | f_1^n, e_0^m) = P(f_1^n, a_1^n | e_0^m) / P(f_1^n | e_0^m)    (14)

Factorised

P(a_j | f_1^n, e_0^m) = [P(a_j | m, n) P(f_j | e_{a_j})] / [∑_{i=0}^m P(i | m, n) P(f_j | e_i)]    (15)

21 / 29
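
As a sketch of eqs. (13) and (15): assuming a toy lexical table theta (invented values, not properly normalised over a full vocabulary) and the uniform alignment probability 1/(m+1), the factorised likelihood and the per-link posteriors can be computed position by position.

```python
# Toy lexical table theta[e][f] = P(f | e); invented values,
# not properly normalised over a full French vocabulary.
theta = {
    "NULL":  {"o": 0.2,  "cao": 0.1,  "preto": 0.1},
    "the":   {"o": 0.7,  "cao": 0.1,  "preto": 0.1},
    "black": {"o": 0.05, "cao": 0.05, "preto": 0.8},
    "dog":   {"o": 0.05, "cao": 0.8,  "preto": 0.05},
}

def likelihood_and_posteriors(english, french):
    """english is e_0^m, i.e. it includes the NULL word at position 0."""
    m = len(english) - 1
    likelihood = 1.0
    posteriors = []                        # posteriors[j][i] = P(A_j = i | f, e)
    for f in french:
        scores = [theta[e][f] / (m + 1) for e in english]   # P(i | m, n) P(f_j | e_i)
        norm = sum(scores)
        likelihood *= norm                                   # one factor per j, eq. (13)
        posteriors.append([s / norm for s in scores])        # eq. (15)
    return likelihood, posteriors

lik, post = likelihood_and_posteriors(["NULL", "the", "black", "dog"], ["o", "cao", "preto"])
print(lik)
print(post[1])   # posterior alignment distribution for the second French word, "cao"
```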

MLE via EM

E-step:

E[n(e → f | A_1^n)] = ∑_{a_1=0}^m ··· ∑_{a_n=0}^m P(a_1^n | f_1^n, e_0^m) n(e → f | a_1^n)    (16)

                    = ∑_{a_1=0}^m ··· ∑_{a_n=0}^m P(a_1^n | f_1^n, e_0^m) ∑_{j=1}^n 1_e(e_{a_j}) 1_f(f_j)    (17)

                    = ∑_{j=1}^n ∑_{i=0}^m P(A_j = i | f_1^n, e_0^m) 1_e(e_i) 1_f(f_j)    (18)

M-step:

θ_{e,f} = E[n(e → f | A_1^n)] / ∑_{f'} E[n(e → f' | A_1^n)]    (19)

22 / 29

EM algorithm

Repeat until convergence to a local optimum

1. For each sentence pair

1.1 compute the posterior per alignment link
1.2 accumulate fractional counts

2. Normalise counts for each English word

23 / 29
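
A compact sketch of this loop for IBM model 1, using the toy parallel data from the earlier slides. This is my own illustration of the algorithm above: uniform initialisation, a fixed number of iterations, and no smoothing or convergence check.

```python
from collections import defaultdict

corpus = [
    ("the black dog".split(),       "o cao preto".split()),
    ("the nice dog".split(),        "o cao amigo".split()),
    ("the black cat".split(),       "o gato preto".split()),
    ("a dog chasing a cat".split(), "um cao perseguindo um gato".split()),
]
corpus = [(["NULL"] + e, f) for e, f in corpus]   # add the Null word at position 0

# Uniform initialisation of the lexical table theta[e][f] = P(f | e).
french_vocab = {f for _, fs in corpus for f in fs}
theta = {e: {f: 1.0 / len(french_vocab) for f in french_vocab}
         for es, _ in corpus for e in es}

for iteration in range(20):
    counts = defaultdict(float)   # expected counts n(e -> f)
    totals = defaultdict(float)   # expected counts n(e -> anything)
    # 1. E-step: posterior per alignment link, accumulated as fractional counts.
    for english, french in corpus:
        for f in french:
            norm = sum(theta[e][f] for e in english)   # the uniform 1/(m+1) cancels out
            for e in english:
                p = theta[e][f] / norm                 # P(A_j = i | f, e), eq. (15)
                counts[(e, f)] += p
                totals[e] += p
    # 2. M-step: normalise counts for each English word, eq. (19).
    theta = {e: {f: counts[(e, f)] / totals[e] for f in french_vocab}
             for e in totals}

# After training, "cao" should receive most of the probability mass under "dog".
print(sorted(theta["dog"].items(), key=lambda kv: -kv[1])[:3])
```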

Content

Lexical alignment

Mixture models

IBM model 1

IBM model 2

Remarks

24 / 29

Alignment distribution

Positional distribution

P(A_j | M = m, N = n) = Cat(A_j | λ_{j,m,n})

- one distribution for each tuple (j, m, n)
- support must include the length of the longest English sentence
- extremely over-parameterised!

Jump distribution [Vogel et al., 1996]

- define a jump function δ(a_j, j, m, n) = a_j − ⌊j · m/n⌋
- P(A_j | m, n) = Cat(∆ | λ)
- ∆ takes values from −longest to +longest

25 / 29
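
A small sketch of the jump parameterisation (not from the slides): the function below computes δ(a_j, j, m, n) = a_j − ⌊j·m/n⌋ and collects jump counts from a few hypothetical alignments, which could then be normalised into Cat(∆ | λ).

```python
from collections import Counter

def jump(a_j, j, m, n):
    """delta(a_j, j, m, n) = a_j - floor(j * m / n)."""
    return a_j - (j * m) // n

# Hypothetical alignments: (a_1..a_n, English length m, French length n).
alignments = [([1, 3, 2], 3, 3), ([1, 2, 5, 4, 3], 5, 5)]

jump_counts = Counter(
    jump(a_j, j, m, n)
    for a, m, n in alignments
    for j, a_j in enumerate(a, start=1)
)
print(jump_counts)   # counts over jump values; normalising them gives Cat(Delta | lambda)
```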


Content

Lexical alignment

Mixture models

IBM model 1

IBM model 2

Remarks

26 / 29

Note on terminology: source/target vs French/English

From an alignment model perspective all that matters is:

- we condition on one language and generate the other
- in IBM models terminology, we condition on English and generate French

From a noisy channel perspective, where we want to translate a source sentence f_1^n into some target sentence e_1^m:

- Bayes rule decomposes p(e_1^m | f_1^n) ∝ p(f_1^n | e_1^m) p(e_1^m)
- train p(e_1^m) and p(f_1^n | e_1^m) independently
- language model: p(e_1^m)
- alignment model: p(f_1^n | e_1^m)
- note that the alignment model conditions on the target sentence (English) and generates the source sentence (French)

27 / 29

Limitations of IBM1-2

- too strong independence assumptions
- categorical parameterisation suffers from data sparsity
- EM suffers from local optima

28 / 29

Extensions

Fertility, distortion, and concepts [Brown et al., 1993]

Dirichlet priors and posterior inference [Mermer and Saraclar, 2011]

- + no Null words [Schulz et al., 2016]
- + HMM and efficient sampler [Schulz and Aziz, 2016]

Log-linear distortion parameters and variational Bayes [Dyer et al., 2013]

First-order dependency (HMM) [Vogel et al., 1996]

- E-step requires dynamic programming [Baum and Petrie, 1966]

29 / 29

References I

L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37:1554–1563, 1966.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311, June 1993. ISSN 0891-2017. URL http://dl.acm.org/citation.cfm?id=972470.972474.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.

30 / 29

References II

Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N13-1073.

Adrien Lardilleux and Yves Lepage. Sampling-based multilingual alignment. In Proceedings of the International Conference RANLP-2009, pages 214–218, Borovets, Bulgaria, September 2009. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/R09-1040.

31 / 29

References III

Coskun Mermer and Murat Saraclar. Bayesian word alignment for statistical machine translation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 182–187, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-2032.

Philip Schulz and Wilker Aziz. Fast collocation-based Bayesian HMM word alignment. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3146–3155, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee. URL http://aclweb.org/anthology/C16-1296.

32 / 29

References IV

Philip Schulz, Wilker Aziz, and Khalil Sima’an. Word alignment without null words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 169–174, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://anthology.aclweb.org/P16-2028.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, COLING '96, pages 836–841, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics. doi: 10.3115/993268.993313. URL http://dx.doi.org/10.3115/993268.993313.

33 / 29
