word and phrase alignment

Post on 07-Jan-2016

Page 1: Word and Phrase Alignment

Word and Phrase Alignment

Presenters: Marta Tatu and Mithun Balakrishna

Page 2: Word and Phrase Alignment

Translating Collocations for Bilingual Lexicons: A Statistical Approach

Frank Smadja, Kathleen R. McKeown and Vasileios Hatzivassiloglou

CL-1996

Page 3: Word and Phrase Alignment

Overview – Champollion

Translates collocations from English into French using an aligned corpus (Hansards)

The translation is constructed incrementally, adding one word at a time

Correlation method: the Dice coefficient

Accuracy between 65% and 78%

Page 4: Word and Phrase Alignment

The Similarity Measure

Dice coefficient (Dice, 1945):

$$\mathrm{Dice}(X, Y) = \frac{2\, p(X, Y)}{p(X) + p(Y)}$$

where p(X,Y), p(X), and p(Y) are the joint and marginal probabilities of X and Y

If the probabilities are estimated using maximum likelihood, then

$$\mathrm{Dice}(X, Y) = \frac{2 f_{XY}}{f_X + f_Y}$$

where fX, fY, and fXY are the absolute frequencies of appearance of “1”s for X, for Y, and for X and Y together
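As a rough illustration of the maximum likelihood form, the sketch below computes the Dice coefficient for two binary indicator vectors, one entry per aligned sentence pair. The function name `dice` and the list-of-0/1 representation are assumptions made for this example, not something the slides specify.

```python
def dice(x, y):
    """Dice coefficient of two equal-length 0/1 sequences.

    x[i] == 1 means X occurs in sentence pair i; likewise for y.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    f_x = sum(x)                      # number of 1s for X
    f_y = sum(y)                      # number of 1s for Y
    f_xy = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # joint 1s
    if f_x + f_y == 0:
        return 0.0                    # neither event ever occurs
    return 2 * f_xy / (f_x + f_y)
```

Only the joint "1"s enter the numerator, so sentence pairs where neither X nor Y occurs do not influence the score.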

Page 5: Word and Phrase Alignment

Algorithm - Preprocessing

Source and target language sentences must be aligned (Gale and Church 1991)

List of collocations to be translated must be provided (Xtract, Smadja 1993)

Page 6: Word and Phrase Alignment

Algorithm 1/3

1. Champollion identifies a set S of k words highly correlated with the source collocation

The target collocation is in the powerset of S

These words have a Dice score of at least Td (= 0.10) with the source collocation and appear with it at least Tf (= 5) times

2. Form all pairs of words from S

3. Evaluate the correlation between each pair and the source collocation (Dice)

Page 7: Word and Phrase Alignment

Algorithm 2/3

4. Keep pairs that score above the threshold Td

5. Construct 3–word elements containing one of the highly correlated pairs plus a member of S

6. …

7. Until for some n ≤ k, no n-word element scores above the threshold

Page 8: Word and Phrase Alignment

Algorithm 3/3

8. Champollion selects the best translation among the top candidates

9. In case of ties, the longer collocation is preferred

10. Determine whether the selected translation is a single word or, for multiword translations, a flexible or a rigid collocation

Are the words used consistently in the same order and at the same distance?
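The incremental search in steps 1 to 9 can be sketched roughly as follows. This is a simplified illustration, not Champollion itself: the input layout (`pairs`), the function names, and the choice to apply both thresholds at every candidate size are assumptions, and step 10 (word order and distance analysis) is left out.

```python
from itertools import chain


def dice(f_x, f_y, f_xy):
    """Dice coefficient from absolute frequencies."""
    return 2 * f_xy / (f_x + f_y) if f_x + f_y else 0.0


def translate_collocation(pairs, td=0.10, tf=5):
    """Return (best_words, best_score) for one source collocation.

    pairs: list of (has_source, target_words). has_source is True when the
    source sentence contains the collocation; target_words is the set of
    words in the aligned target sentence. This layout is an assumption.
    """
    f_x = sum(1 for has_source, _ in pairs if has_source)

    def score(words):
        f_y = sum(1 for _, target in pairs if words <= target)
        f_xy = sum(1 for has_source, target in pairs
                   if has_source and words <= target)
        return dice(f_x, f_y, f_xy), f_xy

    def passes(words):
        d, f_xy = score(words)
        return d >= td and f_xy >= tf

    vocabulary = set(chain.from_iterable(target for _, target in pairs))
    # Step 1: the set S of single words correlated with the source collocation.
    s = {w for w in vocabulary if passes(frozenset([w]))}

    current = {frozenset([w]) for w in s}
    kept = set(current)
    # Steps 2-7: grow candidates one word at a time until none survives.
    while current:
        grown = {c | {w} for c in current for w in s - c}
        current = {c for c in grown if passes(c)}
        kept |= current

    if not kept:
        return frozenset(), 0.0
    # Steps 8-9: highest score wins; ties go to the longer candidate.
    best = max(kept, key=lambda c: (score(c)[0], len(c)))
    return best, score(best)[0]
```

Because every surviving candidate is extended only with members of S, the loop stops after at most k rounds, which mirrors the "for some n ≤ k" stopping condition.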

Page 9: Word and Phrase Alignment

Experimental Setup

DB1 = 3.5 × 10^6 words (8 months of 1986)

DB2 = 8.5 × 10^6 words (1986 and 1987)

C1 = 300 collocations from DB1 of mid-range frequency

C2 = 300 collocations from 1987

C3 = 300 collocations from 1988

Three fluent bilingual speakers

Canadian French vs. continental French

Page 10: Word and Phrase Alignment

Results

Page 11: Word and Phrase Alignment

Future Work

Translating the closed class words

Tools for the target language

Separating corpus-dependent translations from general ones

Handling low frequency collocations

Analysis of the effects of thresholds

Incorporating the length of the translation into the score

Using nonparallel corpora

Page 12: Word and Phrase Alignment

Comments

Page 13: Word and Phrase Alignment

A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora

Pascal Fung

ACL-1995

Page 14: Word and Phrase Alignment

Goal of the Paper

Create bilingual lexicon of nouns and proper nouns

From unaligned, noisy parallel texts of Asian/Indo-European language pairs

Pattern matching method

Page 15: Word and Phrase Alignment

Introduction

Previous research on sentence-aligned, parallel texts

Alignment not always practical:

Unclear sentence boundaries in corpora

Noisy text segments present in only one language

Two main steps:

Find small bilingual primary lexicon

Compute a better secondary lexicon from these partially aligned texts

Page 16: Word and Phrase Alignment

Algorithm

1. Tag the English half of the parallel text

Nouns and proper nouns (they have consistent translations over the entire text)

Tagged English part with a modified POS tagger

Find translations for nouns, plural nouns and proper nouns only

Page 17: Word and Phrase Alignment

Algorithm

2. Positional Difference Vectors

Correspondence between a word and its translated counterpart:

In their frequency

In their positions

Correspondence need not be linear

Calculation:

p: position vector of a word

V: positional difference vector

V[i-1] = p[i] - p[i-1]
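A minimal sketch of the calculation, assuming the text is already tokenized into a list and that positions are 0-based token offsets (both are assumptions for illustration; the function name is hypothetical):

```python
def positional_difference_vector(tokens, word):
    """Return V where V[i-1] = p[i] - p[i-1] for the positions p of word."""
    # p: position vector, the offsets at which the word occurs
    p = [i for i, token in enumerate(tokens) if token == word]
    # V: gaps between consecutive occurrences
    return [p[i] - p[i - 1] for i in range(1, len(p))]
```

A word that occurs n times yields a vector of length n - 1, so a word and its translation give vectors of similar length when their frequencies correspond.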

Page 18: Word and Phrase Alignment

Algorithm

Page 19: Word and Phrase Alignment

Algorithm

3. Match pairs of positional difference vectors, giving scores

Dynamic Time Warping (Fung & McKeown, 1994):

For non-identical vectors

Trace correspondence between all points in V1 and V2

No penalty for deletions and insertions

Statistical filters

Page 20: Word and Phrase Alignment

Dynamic Time Warping

Given V1 and V2, which point in V1 corresponds to which point in V2?
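The question can be made concrete with a textbook dynamic time warping routine. The sketch below uses absolute difference as the local cost and the standard three-way recurrence; the exact cost function and path constraints used in the paper are not given on the slides, so treat these choices as assumptions.

```python
def dtw(v1, v2):
    """Return (score, path) aligning two numeric vectors with classic DTW.

    path is a list of (i, j) index pairs saying that v1[i] corresponds
    to v2[j]. Lower scores mean more similar vectors.
    """
    n, m = len(v1), len(v2)
    inf = float("inf")
    # cost[i][j]: best cumulative cost aligning v1[:i] with v2[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(v1[i - 1] - v2[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

Because one point may be matched to several points in the other vector without an extra charge, vectors of different lengths can still align with a score of zero.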

Page 21: Word and Phrase Alignment

Algorithm

Page 22: Word and Phrase Alignment

Algorithm

5. Finding anchor points and eliminating noise

Every word pair selected to run DTW:

Obtain DTW score

Obtain DTW path

Plot DTW paths of all such word pairs

Keep highly reliable points and discard rest

Point (i,j) is noise if

Page 23: Word and Phrase Alignment

Algorithm

Page 24: Word and Phrase Alignment

Algorithm

6. Finding low frequency bilingual word pairs

Non-linear segment binary vectors

V1[i] = 1 if word occurs in ith segment

Binary vector correlation measure
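To illustrate the representation, the sketch below builds a segment binary vector and the 2×2 co-occurrence counts from which a binary correlation measure can be computed. The slides do not say which measure is used, so the code stops at the counts; the function names and the list-of-token-lists segment format are assumptions.

```python
def segment_binary_vector(segments, word):
    """V[i] = 1 if word occurs in the i-th segment, else 0."""
    return [1 if word in segment else 0 for segment in segments]


def contingency(v1, v2):
    """Return (both, only_v1, only_v2, neither) over aligned segments."""
    both = sum(1 for a, b in zip(v1, v2) if a == 1 and b == 1)
    only_v1 = sum(1 for a, b in zip(v1, v2) if a == 1 and b == 0)
    only_v2 = sum(1 for a, b in zip(v1, v2) if a == 0 and b == 1)
    neither = sum(1 for a, b in zip(v1, v2) if a == 0 and b == 0)
    return both, only_v1, only_v2, neither
```

Working per segment rather than per position is what makes this usable for low frequency words, whose positional difference vectors are too short to match reliably.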

Page 25: Word and Phrase Alignment

Results

Page 26: Word and Phrase Alignment

Comments

Page 27: Word and Phrase Alignment

Automated Dictionary Extraction for “Knowledge-Free” Example-Based Translation

Ralf D. Brown

TMIMT-1997

Page 28: Word and Phrase Alignment

Goal of the Paper

Extract a bilingual dictionary using an aligned bilingual corpus

Perform tests to compare the performance of PanEBMT using:

Collins Spanish-English dictionary + WordNet English root/synonym list

Various automatically extracted bilingual dictionaries

Page 29: Word and Phrase Alignment

Introduction

Page 30: Word and Phrase Alignment

Extracting Bilingual Dictionary

Extracted from corpus using:

Correspondence table

Threshold schema

Correspondence Table:

Two dimensional array

Indexed by source language words

Indexed by target language words

Cross-product word entries of each sentence pair are incremented
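A minimal sketch of building such a table, using a sparse `Counter` keyed by (source word, target word) in place of a dense two dimensional array. Counting each word pair at most once per sentence pair is an assumption, as is the function name.

```python
from collections import Counter
from itertools import product


def build_correspondence_table(sentence_pairs):
    """Count co-occurrences of source and target words over sentence pairs.

    sentence_pairs: iterable of (source_tokens, target_tokens).
    """
    table = Counter()
    for source_tokens, target_tokens in sentence_pairs:
        # Increment every entry in the cross product of the two sentences.
        for s, t in product(set(source_tokens), set(target_tokens)):
            table[(s, t)] += 1
    return table
```

A sparse mapping keeps memory proportional to the pairs actually seen, which matters because most source and target words never share a sentence pair.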

Page 31: Word and Phrase Alignment

Extracting Bilingual Dictionary

Biased for language pairs with similar word orders

Threshold setting:

A step function

Unreachably high for co-occurrence < MIN

Constant otherwise

A sliding scale

Start at 1.0 for co-occurrence = 1

Slide smoothly to MIN threshold value
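The two threshold shapes might look like the sketch below. The slides give only the endpoints of the sliding scale, so the linear interpolation and the `span` parameter (the co-occurrence count at which the minimum is reached) are assumptions, as are all function and parameter names.

```python
def step_threshold(cooccurrence, min_count, constant):
    """Unreachably high below min_count, constant otherwise."""
    return float("inf") if cooccurrence < min_count else constant


def sliding_threshold(cooccurrence, min_threshold, span):
    """Start at 1.0 for co-occurrence 1 and fall linearly to min_threshold.

    span (> 1) is the co-occurrence count at which min_threshold is
    reached; the linear shape is an assumed stand-in for "smoothly".
    """
    if cooccurrence >= span:
        return min_threshold
    fraction = (cooccurrence - 1) / (span - 1)
    return 1.0 - fraction * (1.0 - min_threshold)
```

The sliding scale still admits rare word pairs, but only when they co-occur almost every time either word appears.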

Page 32: Word and Phrase Alignment

Extracting Bilingual Dictionary

Filtering:

Symmetric threshold

Asymmetric threshold

Any element of the correspondence table which fails both tests is set to zero

Non-zero elements are added to the dictionary

Page 33: Word and Phrase Alignment

Extracting Bilingual Dictionary - Results

Page 34: Word and Phrase Alignment

Extracting Bilingual Dictionary - Errors

High-frequency, error-ridden terms

Short-list high-frequency words (all words which appear in at least 20% of source sentences)

Short-list sentence pairs containing exactly one or two high-frequency words

Results in 7 of 16 words – zero error

Merge with results from first pass

Page 35: Word and Phrase Alignment

Experimental Setup

Manually created tokenization – 47 equivalence classes, 880 words and translations of each word

Two test texts:

275 UN corpus sentences: in-domain

253 Newswire sentences: out-of-domain

Page 36: Word and Phrase Alignment

Results

Page 37: Word and Phrase Alignment

Comments

Page 38: Word and Phrase Alignment

Extracting Paraphrases from a Parallel Corpus

Regina Barzilay and Kathleen R. McKeown

ACL-2001

Page 39: Word and Phrase Alignment

Overview

Corpus-based unsupervised learning algorithm for paraphrase extraction

Lexical paraphrases (single and multi-word): (refuse, say no)

Morpho-syntactic paraphrases: (king’s son, son of the king), (start to talk, start talking)

Phrases which appear in similar contexts are paraphrases

Page 40: Word and Phrase Alignment

Data

Multiple English translations of literary texts written by foreign authors

Madame Bovary, Fairy Tales, Twenty Thousand Leagues Under the Sea, etc.

11 translations

Page 41: Word and Phrase Alignment

Preprocessing

Sentence alignment:

Translations of the same source contain a number of identical words

42% of the words in corresponding sentences are identical (average)

Dynamic programming (Gale & Church, 1991)

94.5% correct alignments (127 sentences)

POS tagger and chunker: NP and VP

Page 42: Word and Phrase Alignment

Algorithm – Bootstrapping

Co-training method: DLCoTrain (Collins & Singer, 1999)

Similar contexts surround two phrases → paraphrase

Having good paraphrase predictor contexts → new paraphrases

1. Analyze contexts surrounding identical words in aligned sentence pairs

2. Use these contexts to learn new paraphrases

Page 43: Word and Phrase Alignment

Feature Extraction

Paraphrase features:

Lexical: tokens for each phrase in the paraphrase pair

Syntactic: POS tags

Contextual features: left and right syntactic contexts surrounding the paraphrase (POS n-grams)

tried to comfort her: left1 = “VB1 TO2”, right1 = “PRP$3”

tried to console her: left2 = “VB1 TO2”, right2 = “PRP$3”
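The contextual features in the example can be approximated with the sketch below, which collects the POS tags to the left and right of a phrase in a tagged sentence. It omits the numeric indices shown in the example, and the input format (a list of (token, tag) tuples) and the function name are assumptions.

```python
def pos_contexts(tagged, start, end, n=2):
    """Return (left, right) POS n-gram contexts around tagged[start:end].

    tagged: list of (token, pos_tag) tuples for one sentence.
    n: maximum number of tags taken on each side.
    """
    left = " ".join(tag for _, tag in tagged[max(0, start - n):start])
    right = " ".join(tag for _, tag in tagged[end:end + n])
    return left, right
```

For the two example sentences, "comfort" and "console" receive identical left and right contexts, which is the evidence the method uses to propose them as paraphrases.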

Page 44: Word and Phrase Alignment

Algorithm

Initialization:

Identical words are the seeds (positive paraphrasing examples)

Negatives are created by pairing each word with all the other words in the sentence

Training of the context classifier:

Record contexts around positive and negative paraphrases of length ≤ 3

Identify the strong predictors based on their strength and frequency

Page 45: Word and Phrase Alignment

Algorithm

Keep the most frequent k = 10 contexts with a strength > 95%

Training of the paraphrasing classifier:

Using the context rules extracted previously, derive new pairs of paraphrases

When no more paraphrases are discovered, stop
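The context selection step could be sketched as below. The slides do not define strength, so the code assumes it is the fraction of a context's occurrences that surround positive examples; that definition, the function name, and the input format are all assumptions made for illustration.

```python
from collections import Counter


def select_contexts(positive_contexts, negative_contexts,
                    k=10, min_strength=0.95):
    """Keep the k most frequent contexts whose strength exceeds min_strength.

    positive_contexts / negative_contexts: lists of contexts (for example
    strings of POS tags) seen around positive and negative examples.
    """
    positive = Counter(positive_contexts)
    negative = Counter(negative_contexts)
    strong = []
    for context, count in positive.items():
        # Assumed definition: share of occurrences that are positive.
        strength = count / (count + negative[context])
        if strength > min_strength:
            strong.append((count, context))
    # Most frequent first; ties broken by context for a stable order.
    strong.sort(key=lambda item: (-item[0], item[1]))
    return [context for _, context in strong[:k]]
```

In the bootstrapping loop, the selected contexts would then be applied to the corpus to propose new paraphrase pairs, and the two steps would repeat until nothing new is found.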

Page 46: Word and Phrase Alignment

Results

9483 paraphrases, 25 morpho-syntactic rules

Out of 500: 86.5% (without context), 91.6% (with context) correct paraphrases

69% recall evaluated on 50 sentences

Page 47: Word and Phrase Alignment

Future Work

Extract paraphrases from comparable corpora (news reports about the same event)

Improve the context representation

Page 48: Word and Phrase Alignment

Comments

Page 49: Word and Phrase Alignment

Thank You!