Statistical Alignment and Machine Translation (Artificial Intelligence Lab, Seong-won Jung)


Upload: barbara-webb

Posted on 26-Dec-2015


TRANSCRIPT

Page 1: Artificial Intelligence Lab, Seong-won Jung. Statistical Alignment and Machine Translation

Artificial Intelligence Lab, Seong-won Jung

Statistical Alignment and Machine Translation

Page 2:

Contents
• Machine Translation
• Text Alignment
  – Length-based methods
  – Offset alignment by signal processing techniques
  – Lexical methods of sentence alignment
• Word Alignment
• Statistical Machine Translation

Page 3:

Different Strategies for MT (1)

[Figure: the MT pyramid.
English text (word string) <-> French text (word string): word-for-word
English (syntactic parse) <-> French (syntactic parse): syntactic transfer
English (semantic representation) <-> French (semantic representation): semantic transfer
Interlingua (knowledge representation) at the top: knowledge-based translation]

Page 4:

Different Strategies for MT (2)
• Machine translation: an important but hard problem
• Why is MT hard?
  – word-for-word
    • Lexical ambiguity
    • Different word order
  – syntactic transfer approach
    • Can solve problems of word order
    • Syntactic ambiguity
  – semantic transfer approaches
    • Can fix cases of syntactic mismatch
    • Unnatural, unintelligible output
  – interlingua

Page 5:

MT & Statistical Methods

• In theory, each of the arrows in the prior figure can be implemented based on a probabilistic model.
  – Most MT systems are a mix of probabilistic and non-probabilistic components.

• Text alignment
  – Used to create lexical resources such as bilingual dictionaries and parallel grammars, and to improve the quality of MT.
  – There has been more work on text alignment than on MT in statistical NLP.

Page 6:

Text Alignment

• Parallel texts or bitexts
  – The same content is available in several languages
  – Official documents of countries with multiple official languages -> literal, consistent translations

• Alignment
  – Paragraph to paragraph, sentence to sentence, word to word

• Uses of aligned text
  – Bilingual lexicography
  – Machine translation
  – Word sense disambiguation
  – Multilingual information retrieval
  – Assisting tools for translators

Page 7:

Aligning sentences and paragraphs (1)

• Problems
  – Not always one sentence to one sentence
  – Reordering
  – Large pieces of material can disappear

• Methods
  – Length-based vs. lexical-content-based
  – Match corresponding points vs. form sentence beads

Page 8:

Aligning sentences and paragraphs (2)

Page 9:

Aligning sentences and paragraphs (3)

• BEAD: an n:m grouping
  – S, T: texts in two languages
  – S = (s1, s2, …, si)
  – T = (t1, t2, …, tj)
  – Bead types: 0:1, 1:0, 1:1, 2:1, 1:2, 2:2, 2:3, 3:2, …
  – Each sentence can occur in only one bead
  – No crossing

[Figure: the sentences s1 … si of S and t1 … tj of T grouped into a sequence of beads b1 … bk.]

Page 10:

Dynamic Programming (1)

[Figure: a layered graph with node columns V0 … V5 (v01; v11–v13; v21–v23; v31, v32; v41–v43; v51) and weighted edges between adjacent columns; the task is to find the cheapest path from v01 to v51.]

Page 11:

Dynamic Programming (2)
• Computing the shortest path, backwards from v51:

  d_min(v_4j) = d(v_4j, v_51), so d_min(v_41) = 4, d_min(v_42) = 6, d_min(v_43) = 3
  d_min(v_3j) = min_j' { d(v_3j, v_4j') + d_min(v_4j') }
  d_min(v_2j) = min_j' { d(v_2j, v_3j') + d_min(v_3j') }
  d_min(v_1j) = min_j' { d(v_1j, v_2j') + d_min(v_2j') }
  P_min = d_min(v_01) = min_j { d(v_01, v_1j) + d_min(v_1j) }

• d_min(v_ij), with the chosen successor node in parentheses:

  j \ i   1        2        3        4       5
  1       22(v12)  20(v21)  11(v32)  5(v43)  4(v51)
  2       14(v22)  12(v31)  6(v41)   6(v51)
  3       18(v22)  10(v32)  3(v51)
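The layer-by-layer shortest-path computation above can be sketched as a small dynamic program. The graph below is a hypothetical stand-in with made-up node names and weights (the full edge-weight set of the slide's figure is not recoverable from the transcript), so it illustrates the recurrence, not the slide's exact numbers.

```python
def shortest_path(layers, weight):
    """layers: list of node-name lists, one per column; weight: dict (u, v) -> cost.
    Returns (cost, path) of the cheapest path from the single node in
    layers[0] to the single node in layers[-1]."""
    start = layers[0][0]
    d = {start: (0, [start])}           # node -> (best cost so far, best path)
    for prev, cur in zip(layers, layers[1:]):
        nd = {}
        for v in cur:
            # DP step: best way into v goes through some node of the previous layer
            nd[v] = min(
                (d[u][0] + weight[(u, v)], d[u][1] + [v])
                for u in prev if (u, v) in weight
            )
        d = nd
    end = layers[-1][0]
    return d[end]

# Illustrative graph: one source, two middle layers, one sink.
layers = [["v01"], ["v11", "v12"], ["v21", "v22"], ["v31"]]
weight = {("v01", "v11"): 3, ("v01", "v12"): 8,
          ("v11", "v21"): 6, ("v11", "v22"): 9,
          ("v12", "v21"): 2, ("v12", "v22"): 4,
          ("v21", "v31"): 5, ("v22", "v31"): 2}
cost, path = shortest_path(layers, weight)
print(cost, path)
```

The same recurrence runs equally well backwards from the sink, which is how the slide's table of d_min values is organized.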

Page 12:

Length-based methods

• Rationale
  – Short sentence -> short sentence
  – Long sentence -> long sentence
  – Ignores richer information, but quite effective

• Length
  – number of words or number of characters

• Pros
  – efficient (for similar languages)
  – rapid

Page 13:

Gale and Church (1)
• Find the alignment A (S, T: parallel texts):

  argmax_A P(A | S, T) = argmax_A P(A, S, T)

• Decompose the aligned texts into a sequence of aligned beads (B_1, …, B_K), assumed to be generated independently:

  P(A, S, T) = Π_{k=1}^{K} P(B_k)

• The method
  – lengths of source and translation sentences measured in characters
  – similar languages and literal translations
  – used on the Union Bank of Switzerland (UBS) corpus
    • English, French, German
    • aligned at the paragraph level

Page 14:

Gale and Church (2)

• D(i, j): the lowest cost of aligning sentences s1, …, si with t1, …, tj

  D(i, j) = min {
    D(i, j-1)   + cost(0:1 align ∅, t_j),
    D(i-1, j)   + cost(1:0 align s_i, ∅),
    D(i-1, j-1) + cost(1:1 align s_i, t_j),
    D(i-1, j-2) + cost(1:2 align s_i, t_{j-1}, t_j),
    D(i-2, j-1) + cost(2:1 align s_{i-1}, s_i, t_j),
    D(i-2, j-2) + cost(2:2 align s_{i-1}, s_i, t_{j-1}, t_j)
  }
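The D(i, j) recurrence can be sketched as follows. The bead_cost function here is a hypothetical placeholder (it just penalizes mismatched total character lengths and non-1:1 beads); in the real method it would be the −log P(align | δ) cost from the length model.

```python
import math

# Bead types the recurrence considers: (source sentences, target sentences).
BEADS = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

def bead_cost(src_lens, tgt_lens):
    # Placeholder cost, NOT Gale & Church's model: mismatch in total
    # character length, plus a flat penalty for non-1:1 beads.
    c = math.log(1 + abs(sum(src_lens) - sum(tgt_lens)))
    if (len(src_lens), len(tgt_lens)) != (1, 1):
        c += 2.0
    return c

def align(src, tgt):
    """src, tgt: lists of sentence lengths in characters.
    Returns the DP table D, where D[i][j] is the lowest cost of aligning
    the first i source with the first j target sentences."""
    I, J = len(src), len(tgt)
    INF = float("inf")
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    for i in range(I + 1):
        for j in range(J + 1):
            for di, dj in BEADS:
                if i >= di and j >= dj and D[i - di][j - dj] < INF:
                    c = D[i - di][j - dj] + bead_cost(src[i - di:i], tgt[j - dj:j])
                    if c < D[i][j]:
                        D[i][j] = c
    return D

D = align([21, 40, 15], [22, 41, 14])   # three nearly length-matched pairs
print(D[3][3])
```

With these lengths the cheapest alignment is the three 1:1 beads, as expected for a literal translation.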

Page 15:

Gale and Church (3)

[Figure: two candidate alignments of L1 sentences s1–s4 with L2 sentences t1–t3.
Alignment 1: cost(align(s1, s2, t1)) + cost(align(s3, t2)) + cost(align(s4, t3))
Alignment 2: cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3, ∅)) + cost(align(s4, t3))]

Page 16:

Gale and Church (4)
• l1, l2: the lengths in characters of the sentences of each language in the bead
• The ratio of character lengths between the two languages
  – modeled with a normal distribution N(μ, s²):

  δ = (l2 − l1·μ) / sqrt(l1·s²)

  cost(l1, l2) = −log P(align | δ(l1, l2, μ, s²))
  P(align | δ) ∝ P(δ | align) · P(align)

• average 4% error rate
• 2% error rate for 1:1 alignments
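The δ and cost computation can be sketched as below. The values of μ, s², and the bead prior p_align are illustrative assumptions (the slide does not give the estimated parameters), and a standard-normal density stands in for P(δ | align).

```python
import math

def delta(l1, l2, mu=1.0, s2=6.8):
    # Standardized difference between l2 and its expected value l1*mu.
    # mu and s2 are illustrative, not the published estimates.
    return (l2 - l1 * mu) / math.sqrt(l1 * s2)

def match_cost(l1, l2, p_align=0.89):
    # cost = -log P(align | delta) ∝ -log [ P(delta | align) * P(align) ];
    # P(delta | align) is approximated by a standard-normal density.
    d = delta(l1, l2)
    log_density = -0.5 * d * d - 0.5 * math.log(2 * math.pi)
    return -(log_density + math.log(p_align))

print(match_cost(100, 102))   # similar lengths: low cost
print(match_cost(100, 180))   # very different lengths: high cost
```

Because cost grows with δ², beads whose lengths diverge from the expected ratio quickly become expensive in the DP.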

Page 17:

Other Research
• Brown et al. (1991c)
  – Corpus: Canadian Hansard (English, French)
  – Method: comparing sentence lengths in words rather than characters
  – Goal: produce an aligned subset of the corpus
  – Feature: EM algorithm

• Wu (1994)
  – Corpus: Hong Kong Hansard (English, Cantonese)
  – Method: the Gale and Church (1993) method
  – Result: its assumptions are not as clearly met when dealing with unrelated languages
  – Feature: uses lexical cues

Page 18:

Offset alignment by signal processing techniques

• Showing roughly what offset in one text aligns with what offset in the other.

• Church (1993)
  – Background: noisy text (OCR output)
  – Method
    • define cognates at the character-sequence level -> true cognates + proper names + numbers
    • dot-plot method (character 4-grams)
  – Result: very small error rate
  – Drawbacks
    • different character sets
    • no or extremely few identical character sequences

Page 19:

DOT-PLOT

a a c g g c t t a c g

g       ● ●           ●

g       ● ●           ●

c     ●     ●       ●  

t             ● ●      

t             ● ●      

t             ● ●      

c     ●     ●       ●  

g       ● ●           ●

g       ● ●           ●

a a c g g c t t a c g

g                      

g       ●              

c     ●             ●  

t                      

t             ●        

t             ●        

c           ●          

g         ●            

g       ●              

Unigram matches (top), bigram matches (bottom)
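A dot plot like the ones above marks every position pair where the two strings share an identical character n-gram (Church used 4-grams on text; the short strings here are read off the tables above).

```python
def dot_plot(text1, text2, n=4):
    """Return the set of (i, j) positions where text1 and text2 share an
    identical character n-gram starting at i and j respectively."""
    grams1 = [text1[i:i + n] for i in range(len(text1) - n + 1)]
    grams2 = [text2[j:j + n] for j in range(len(text2) - n + 1)]
    return {(i, j)
            for i, g1 in enumerate(grams1)
            for j, g2 in enumerate(grams2)
            if g1 == g2}

s1 = "aacggcttacg"   # column string of the tables above
s2 = "ggctttcgg"     # row string of the tables above
uni = dot_plot(s1, s2, n=1)
bi = dot_plot(s1, s2, n=2)
print(len(uni), len(bi))
```

Longer n-grams thin the plot dramatically (here 24 unigram dots vs. 8 bigram dots), which is why 4-grams leave mostly the diagonal signal that marks the alignment.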

Page 20:

Fung and McKeown
• Conditions
  – without having found sentence boundaries
  – only roughly parallel texts
  – unrelated languages

• Languages: English and Cantonese
• Method:
  – arrival vectors
  – small bilingual dictionary

• A word's offsets (1, 263, 267, 519) => arrival vector (262, 4, 252)
• Choose English-Cantonese word pairs of high similarity => small bilingual dictionary => anchors for text alignment
• A strong signal in a line along the diagonal of the dot plot => good alignment
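The arrival vector of a word is simply the vector of gaps between its successive offsets in the text; the example below is the one given on the slide.

```python
def arrival_vector(offsets):
    """Gaps between successive occurrence offsets of a word."""
    return [b - a for a, b in zip(offsets, offsets[1:])]

print(arrival_vector([1, 263, 267, 519]))
```

Words that are translations of each other tend to recur at similar intervals, so comparing arrival vectors gives a similarity signal that needs neither sentence boundaries nor related scripts.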

Page 21:

Lexical methods of sentence alignment (1)

• Align beads of sentences in robust ways using lexical information

• Kay and Röscheisen (1993)
  – Features: lexical cues, a process of convergence
  – Algorithm:
    • Set initial anchors
    • Until most sentences are aligned:
      – Form an envelope of possible alignments
      – Choose pairs of words that tend to co-occur in these potential partial alignments
      – Find pairs of source and target sentences which contain many possible lexical correspondences

Page 22:

Lexical methods of sentence alignment (2)

• 96% coverage after four passes on Scientific American articles

• 7 errors after 5 passes on 1000 Hansard sentences

• Drawbacks
  – computationally intensive
  – pillow-shaped envelope => problems when text has been moved or deleted

Page 23:

Lexical methods of sentence alignment (3)

• Chen (1993)
  – Similar to the model of Gale and Church (1993)
  – A simple translation model is used to estimate the cost of an alignment
  – Corpora
    • Canadian Hansard, European Economic Community proceedings (millions of sentences)
  – Estimated error rate: 0.4%
    • most errors are due to the sentence-boundary detection method => no further improvement

Page 24:

Lexical methods of sentence alignment (4)

• Haruno and Yamazaki (1996)
  – Align structurally different languages
  – A variant of Kay and Röscheisen (1993)
  – Do lexical matching on content words only
    • POS tagger
  – To align short texts, use an online dictionary
  – Knowledge-rich approach
  – The combined methods
    • good results even on short texts between very different languages

Page 25:

Word Alignment

• Uses
  – terminology databases, bilingual dictionaries

• Methods
  – text alignment -> word alignment
  – χ² measure
  – EM algorithm

• Use of existing bilingual dictionaries
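The χ² measure mentioned above can be sketched for one candidate word pair from a 2x2 contingency table over aligned sentence beads: a = both words occur, b = only the source word, c = only the target word, d = neither. The counts below are invented for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table of co-occurrence counts."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (a + c) * (b + d) * (c + d)
    return num / den

strong = chi_square(a=59, b=6, c=8, d=927)   # words that usually co-occur
weak = chi_square(a=5, b=60, c=62, d=873)    # words that rarely co-occur
print(strong, weak)
```

Pairs with a high χ² value are taken as likely translation pairs; the two tables above share the same margins, so the difference comes entirely from the co-occurrence pattern.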

Page 26:

Statistical Machine Translation (1)

• Noisy channel model in MT
  – Language model
  – Translation model
  – Decoder

[Diagram: Language model P(e) produces e; Translation model P(f|e) turns e into f; Decoder recovers ê = argmax_e P(e|f).]

Page 27:

27

• Translation model– compute p(f/e) by summing the probabilities of all alignments

e: English sentence l : the length of e in words f : French sentence m : the length of f fj : word j in f

aj : the position in e that fj is aligned with

eaj : the word in e that fj is aligned with

p(wf/we) : translation prob.

Z : normalization constant

l

a

m

jaj

l

a m

jefP

ZefP

0 10

)/(...1

)/(1

.

.fj

.

.

.

.

.

.eaj

...

f e

Statistical Machine Translation(2)
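Under this model's word-for-word independence assumptions (IBM Model 1 style), the exponential sum over all (l+1)^m alignments factorizes into a product over French positions of a sum over English positions. The sketch below checks that identity against brute-force enumeration; the tiny translation-probability table t is hypothetical.

```python
from itertools import product

def p_f_given_e(f, e, t):
    """Factorized alignment-summed probability, up to the constant Z."""
    e = ["NULL"] + e                       # position 0: the empty word
    prob = 1.0
    for fj in f:                           # product over French words
        prob *= sum(t.get((fj, ei), 0.0) for ei in e)
    return prob

def p_f_given_e_brute(f, e, t):
    """Same quantity by explicitly enumerating every alignment vector a."""
    e = ["NULL"] + e
    total = 0.0
    for a in product(range(len(e)), repeat=len(f)):
        p = 1.0
        for j, fj in enumerate(f):
            p *= t.get((fj, e[a[j]]), 0.0)
        total += p
    return total

# Hypothetical translation probabilities P(w_f | w_e).
t = {("maison", "house"): 0.8, ("maison", "NULL"): 0.05,
     ("la", "the"): 0.7, ("la", "NULL"): 0.1}
e, f = ["the", "house"], ["la", "maison"]
print(p_f_given_e(f, e, t))
```

The factorized form is what makes training tractable: it turns an exponential enumeration into O(l·m) work per sentence pair.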

Page 28:

Statistical Machine Translation (3)

• Translation probability P(w_f | w_e)
  – Assume that we have a corpus of aligned sentences.
  – EM algorithm:

    Random initialization of P(w_f | w_e)
    E-step:  z_{w_e, w_f} = Σ_{(e,f) s.t. w_e ∈ e, w_f ∈ f} P(w_f | w_e)
    M-step:  P(w_f | w_e) = z_{w_e, w_f} / Σ_v z_{w_e, v}

• Decoder

    ê = argmax_e P(e | f) = argmax_e P(e) P(f | e) / P(f) = argmax_e P(e) P(f | e)

  – the search space is infinite => stack search
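The EM updates above can be sketched in Model 1 style on a tiny invented corpus. The corpus, vocabularies, and iteration count are illustrative assumptions; the point is that the probabilities concentrate on consistently co-occurring word pairs.

```python
from collections import defaultdict

# Hypothetical corpus of aligned (English, French) sentence pairs.
corpus = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "book"], ["un", "livre"])]

f_vocab = {wf for _, fs in corpus for wf in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization of P(w_f | w_e)

for _ in range(10):
    z = defaultdict(float)                    # E-step: expected co-occurrence counts
    for es, fs in corpus:
        for wf in fs:
            norm = sum(t[(wf, we)] for we in es)
            for we in es:
                z[(we, wf)] += t[(wf, we)] / norm
    for (we, wf), count in z.items():         # M-step: renormalize per English word
        total = sum(z[(we, v)] for v in f_vocab if (we, v) in z)
        t[(wf, we)] = count / total

print(round(t[("livre", "book")], 3))
```

Because "book" co-occurs with "livre" in two different contexts, EM drives P(livre | book) well above the competing candidates even from a uniform start.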

Page 29:

Statistical Machine Translation (4)

• Problems
  – distortion
  – fertility: the number of French words one English word generates

• Experiment
  – 48% of French sentences were decoded correctly
  – incorrect decodings
  – ungrammatical decodings

Page 30:

Statistical Machine Translation (5)
• Detailed problems
  – model problems
    • Fertility is asymmetric
    • Independence assumptions
    • Sensitivity to training data
    • Efficiency
  – lack of linguistic knowledge
    • No notion of phrase
    • Non-local dependencies
    • Morphology
    • Sparse data problems