
3GTM: A Third-Generation Translation Memory

Fabrizio Gotti†, Philippe Langlais†, Elliott Macklovitch†, Didier Bourigault⋆, Benoit Robichaud‡ and Claude Coulombe‡

† RALI, Département d’informatique et de recherche opérationnelle, Université de Montréal

‡ Lingua Technologies Inc., Montréal

⋆ ERSS-CNRS, Toulouse

CLiNE — August 26th 2005

RALI, Lingua Technologies Inc., ERSS-CNRS ( †RALI Departement d’informatique et de recherche operationnelle Universite de Montreal ‡ Lingua Technologies Inc. Montreal ? ERSS-CNRS Toulouse )3GTM CLiNE — August 26th 2005 1 / 35


Outline

1 Overview of the 3GTM project

2 Experimental Setting

3 Experiments
    Sentence Coverage
    Random Substring Coverage
    Chunk-Based Coverage
    Tree-Phrase Coverage

4 Discussion


Overview of the 3GTM project

Translation Memory (TM): a computer-assisted tool which eases the recycling of past translations

1st-generation TM: never translates again a sentence that has already been translated. Full-sentence repetition is a rather marginal phenomenon.

2nd-generation TM: two source sentences may be considered identical if they differ only slightly (named entities, edit distance, etc.). This is fuzzy matching.

3rd-generation TM (3GTM): recycles at a sub-sentential level
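The difference between first- and second-generation lookup can be sketched as follows. This is a minimal illustration, not the behaviour of any particular TM product: the word-level edit distance, the `max_ratio` threshold and the toy memory are all assumptions made for the example. A distance of 0 corresponds to the exact match of a 1st-generation TM.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_lookup(query, memory, max_ratio=0.3):
    """Return (source, target, distance) entries close to the query, closest first.

    max_ratio is a hypothetical threshold: the number of edits tolerated
    per query word.
    """
    q = query.split()
    hits = []
    for src, tgt in memory:
        d = edit_distance(q, src.split())
        if d <= max_ratio * max(len(q), 1):
            hits.append((src, tgt, d))
    return sorted(hits, key=lambda h: h[2])


# Toy memory (hypothetical data)
memory = [
    ("je m' excuse", "i apologize"),
    ("il travaille dans la chocolaterie", "he works in a chocolate factory"),
]
hits = fuzzy_lookup("il travaille dans une chocolaterie", memory)
```

Here the query differs from the second entry by a single word, so it is returned with distance 1, while the first entry is rejected.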

A project currently funded by PRECARN

Lingua Technologies Inc., RALI, Transetix Inc.


3GTM in a Screenshot

[Figure: screenshot of the 3GTM interface]


Experimental Setting

English-French language pair: querying the French side, proposing English material

TM populated with Canadian Hansard texts

Coverage statistics computed over a test corpus
    they help to appreciate the number of useful units that can be queried/found
    they are the easiest thing to implement at an early stage of a project
    ultimately, we target human evaluation runs (or simulations)


Training Material
Number of sentences, tokens and types in the training corpus

Language           English       French
Nb. sentences      1 753 443     1 753 443
Nb. tokens         31 637 775    34 150 039
Nb. types          85 810        106 987
Avg. word/sent.    17.5          19.3


Test Material

1000 sentences (Hansard corpus)

chronologically distinct from the training material

French = query or source language

English = output or target language


Tools used

JAPA: an in-house sentence aligner
    http://rali.iro.umontreal.ca/Japa/

LUCENE: a freely available full-featured text search engine
    http://lucene.apache.org

SIMAC: an in-house implementation of a word aligner
    (Simard and Langlais, 2003)

GIZA++: a tool to train translation models
    (Och and Ney, 2000)

GRAMMATICUM: a constituent-based parser
    (Coulombe, 1991)

SYNTEX: a dependency-based parser
    (Bourigault and Fabre, 2000)


Full Sentence Coverage
Using Verbatim Match

Nb. of sentences                     1000
Nb. of sent. found verbatim          148
Avg. size of sent. in test corpus    19.2
Avg. size of sent. found verbatim    11.1

14.8% of the test sentences are found verbatim, because of Hansard idioms such as:

    I don’t know
    Mr. Speaker : Order, please .

Within a TM ≡ TSRALI.com (6.6 M pairs of phrases), we found only 11 out of 1000 sentences of the EuroParl corpus.
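Verbatim coverage is simply the share of test sentences that occur as such on the source side of the memory (148 / 1000 = 14.8% in the table above). A minimal sketch, assuming the memory fits in an in-process set rather than a LUCENE index; the function name and the toy sentences are hypothetical:

```python
def verbatim_coverage(test_sentences, tm_sources):
    """Return (fraction of test sentences found verbatim, list of those sentences)."""
    memory = set(tm_sources)  # exact string match only, no normalisation
    found = [s for s in test_sentences if s in memory]
    return len(found) / len(test_sentences), found


# Toy data (hypothetical, not taken from the Hansard)
tm_sources = [
    "je ne sais pas",
    "il travaille dans la chocolaterie",
]
test_sentences = [
    "je ne sais pas",
    "on demande des crédits fédéraux",
]
rate, found = verbatim_coverage(test_sentences, tm_sources)
```

With this toy data, one of the two test sentences is found, so `rate` is 0.5.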


Random Substring Coverage
Protocol

1 Query the TM with any sequence of the source (French) material (length ≥ 2)
    A query found at least once is a valid one

2 Compute a source (French) optimal coverage
    Maximizing the source coverage while minimizing the number of queries

3 Consider the associated target (English) material
    By following the word alignment

4 Compute a target (English) optimal coverage
    Wait for details
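Steps 1 and 2 of the protocol can be sketched as below. The slides do not say how the optimal coverage is computed, so the dynamic programme is an assumption: it picks non-overlapping valid substrings so as to cover as many words as possible and, among equal coverages, uses the fewest units. The in-memory `is_in_tm` test stands in for a LUCENE query, and the toy memory is hypothetical.

```python
def found_substrings(tokens, is_in_tm, min_len=2):
    """Step 1: all spans (start, end) of length >= min_len found in the TM."""
    n = len(tokens)
    return {(i, j)
            for i in range(n)
            for j in range(i + min_len, n + 1)
            if is_in_tm(" ".join(tokens[i:j]))}


def optimal_coverage(n, spans):
    """Step 2: maximise covered words, then minimise the number of units."""
    # best[i] = (covered words, -units, chosen spans) for the first i tokens
    best = [(0, 0, [])]
    for i in range(1, n + 1):
        cand = best[i - 1]  # option: leave token i-1 uncovered
        for (s, e) in sorted(spans):
            if e == i:
                covered, neg_units, chosen = best[s]
                option = (covered + (e - s), neg_units - 1, chosen + [(s, e)])
                if option[:2] > cand[:2]:
                    cand = option
        best.append(cand)
    covered, _, chosen = best[n]
    return covered / n, chosen


# Toy memory (hypothetical); token-level containment via space padding
tm_sources = ["charlie et la chocolaterie", "il travaille dans un bureau"]

def is_in_tm(query):
    return any(f" {query} " in f" {s} " for s in tm_sources)

tokens = "il travaille dans la chocolaterie".split()
spans = found_substrings(tokens, is_in_tm)
coverage, chosen = optimal_coverage(len(tokens), spans)
```

In this toy case the sentence is fully covered by two units, "il travaille dans" and "la chocolaterie".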


Random Substring Coverage

S  Il travaille dans la chocolaterie
T  He works in a chocolate factory
q  la chocolaterie

Match:

S  Charlie_1 et_2 [la_3 chocolaterie_4,5]
T  Charlie and [the chocolate factory]
m  the chocolate factory
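The extraction of the target material m from a matched pair can be sketched as follows, reading the indices in the match as the 1-based target positions each source word is aligned to. Returning the smallest target span that contains all aligned positions is an assumption of this sketch, as are the function and variable names.

```python
def project(src_tokens, tgt_tokens, alignment, query_tokens):
    """Return the target material aligned with the first occurrence of the query.

    alignment maps a 0-based source position to a list of 1-based target
    positions, as in the example above.
    """
    n = len(query_tokens)
    for start in range(len(src_tokens) - n + 1):
        if src_tokens[start:start + n] == query_tokens:
            positions = sorted({p
                                for i in range(start, start + n)
                                for p in alignment.get(i, ())})
            if not positions:
                return None  # the matched words are all unaligned
            # smallest target span covering every aligned position
            return " ".join(tgt_tokens[positions[0] - 1:positions[-1]])
    return None


src = "Charlie et la chocolaterie".split()
tgt = "Charlie and the chocolate factory".split()
alignment = {0: [1], 1: [2], 2: [3], 3: [4, 5]}
m = project(src, tgt, alignment, "la chocolaterie".split())
```

For the query "la chocolaterie" this yields "the chocolate factory", as on the slide.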


Random Substring Coverage
Coverage statistics

Metric                    Source    Target
Optimal coverage          98.8%     55.8%
Cov. unit size (words)    4.09      2.98
Number of cov. units      4.65      3.23

Avg. nb. LUCENE queries per sentence: 226


Random Substring Coverage
The unsustainable Truth

S : m. mcinnes : je m’ excuse
T : mr . mcinnes : i apologize

↗          m.    mcinnes    :     je       m’      excuse
excuse     –     –          –     –        –       –
m’         –     –          –     –        –       446
je         –     –          –     –        3719    347
:          –     –          –     12330    185     107
mcinnes    –     –          43    4        0       0
m.         –     69         43    4        0       0
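A table of this kind can be produced by counting, for every substring of length ≥ 2, the TM sentences that contain it. The sketch below reads rows as the first word and columns as the last word of the substring, which is an interpretation of the slide rather than something it states. It also stops extending a substring once it has no hit, since a longer substring cannot be found either; this pruning is an assumption of the sketch, not a strategy described in the talk. The toy memory is hypothetical.

```python
def count_hits(query, tm_sources):
    """Number of TM source sentences containing the query as a token sequence."""
    padded = f" {query} "
    return sum(1 for s in tm_sources if padded in f" {s} ")


def hit_matrix(tokens, tm_sources, min_len=2):
    """Return ({(start, end): hits}, number of queries actually issued)."""
    n = len(tokens)
    hits, queries = {}, 0
    for i in range(n):
        for j in range(i + min_len, n + 1):
            queries += 1
            h = count_hits(" ".join(tokens[i:j]), tm_sources)
            hits[(i, j)] = h
            if h == 0:
                # any extension of an unseen substring is unseen too
                for k in range(j + 1, n + 1):
                    hits[(i, k)] = 0
                break
    return hits, queries


# Toy memory (hypothetical)
tm_sources = ["m. mcinnes : je m' excuse", "je m' excuse", "m. mcinnes )"]
hits, queries = hit_matrix("m. mcinnes merci beaucoup".split(), tm_sources)
```

For this four-word input, 4 queries are issued instead of the 6 possible substrings of length ≥ 2.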


Random Substring Coverage
The unsustainable Truth

S : m. mcinnes : je m’ excuse
T : mr . mcinnes : i apologize

m. mcinnes : (69)

    42  mr . mcinnes :
    17  mr . mcinnes )
     2  mr . mcinnes ) ,
     1  mr . mcinnes (
     1  mr . mcinnes : it not is required reading , mr . speaker
     1  mr . mcinnes moved
     1  mr . mcinnes moves

: je m’ excuse (107)

    46  : i am sorry ,
    16  : i am sorry
    14  : i am sorry to
    14  : i apologize ,
     8  : i apologize for interrupting
     8  : i apologize to
     8  : i do apologize
     6  : excuse me ,
     6  : : i apologize
     6  : i beg your pardon
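Frequency-ranked lists like the ones above can be obtained by counting the target material extracted from each hit of a query. A minimal sketch, assuming the extracted strings are already available as a list; the function name and the sample data are hypothetical.

```python
from collections import Counter


def rank_targets(extracted):
    """Return (target material, count) pairs, most frequent first."""
    return Counter(extracted).most_common()


# Hypothetical extraction results for one query
extracted = ["i am sorry ,"] * 3 + ["i apologize"] * 2 + ["excuse me ,"]
ranking = rank_targets(extracted)
```

The length and overlap of such lists is what makes raw substring querying hard to present to a translator.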


Chunk-Based Coverage
Querying the Memory with Chunks: Pros

Speeding up
    By limiting the number of queries

Improving the target material extraction
    By taking into account chunk boundaries computed on both sides

Avoiding overwhelming the user with too much data
    Fewer queries, reduced target material overlaps


Chunk-Based Coverage
Protocol

1 The test material was first chunked using a tool from Lingua Technologies Inc. (Coulombe, 1991)

2 28.35 chunks per source (French) sentence on average

3 11.7 chunks if we only consider those of size ≥ 2

4 We used those selected chunks to query the TM

5 Everything else was kept identical to the previous experiment
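The query selection of this protocol can be sketched as below. The chunker itself is the Lingua Technologies tool and is not reproduced here, so the chunks are taken as given (lists of tokens); dropping repeated chunks is an assumption of the sketch, and the sample chunking is hypothetical.

```python
def chunk_queries(chunks, min_len=2):
    """Keep chunks of at least min_len words, in order, without repeats."""
    queries = []
    for chunk in chunks:
        query = " ".join(chunk)
        if len(chunk) >= min_len and query not in queries:
            queries.append(query)
    return queries


# Hypothetical chunker output for "m. mcinnes : je m' excuse"
chunks = [["m.", "mcinnes", ":"], [":"], ["je", "m'", "excuse"], ["excuse"]]
queries = chunk_queries(chunks)
```

Only the selected chunks are sent to the TM, which is why the number of queries per sentence equals the number of chunks of size ≥ 2.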


Chunk-Based Coverage
Coverage Statistics

Random substring queries (previous experiment):

Metric                    Source    Target
Optimal coverage          98.8%     55.8%
Cov. unit size (words)    4.09      2.98
Number of cov. units      4.65      3.23

Avg. nb. LUCENE queries per sentence: 226

Chunk-based queries:

Metric                    Source    Target
Optimal coverage          59.9%     59.3%
Cov. unit size (words)    3.73      2.99
Number of cov. units      3.08      3.47

Avg. nb. LUCENE queries per sentence: 11.7


Chunk-Based Coverage

S : m. mcinnes : je m’ excuse
T : mr . mcinnes : i apologize

m. mcinnes : (69)

    42  mr . mcinnes :
    17  mr . mcinnes )
     2  mr . mcinnes ) ,
     1  mr . mcinnes (
     1  mr . mcinnes : it not is required reading , mr . speaker
     1  mr . mcinnes moved
     1  mr . mcinnes moves

je m’ excuse (347)

    40  i am sorry ,
    33  i apologize to
    24  i apologize
    21  i am sorry
    15  i apologize for
    13  i am sorry to
    13  i apologize ,
     6  i apologize if
     6  i do apologize to
     5  i apologize for interrupting


Tree-Phrase Coverage

Motivation: the translations of "a good friend" could be useful to translate "a very good friend", which does not appear in the TM.

From now on, TM = collection of Tree-Phrases (TPs)

where TP = a combination of a treelet (TL) and an elastic-phrase (EP)
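The motivation can be illustrated with depth-1 treelets keyed by a head word and its direct dependents. The dependency links below are assumed parses written by hand (not SYNTEX output), and ignoring word offsets when comparing treelets is an assumption made to show why "a good friend" can serve "a very good friend".

```python
def treelet_keys(tokens, links):
    """Depth-1 treelets as (head word, dependents in sentence order)."""
    deps = {}
    for governor, dependent in links:
        deps.setdefault(governor, []).append(dependent)
    return {(tokens[g], tuple(tokens[d] for d in sorted(ds)))
            for g, ds in deps.items()}


# Hand-written (governor, dependent) links, assumed for illustration
known = treelet_keys(["a", "good", "friend"], [(2, 0), (2, 1)])
new = treelet_keys(["a", "very", "good", "friend"], [(3, 0), (3, 2), (2, 1)])
shared = known & new
```

The intervening word "very" breaks the substring match, but the treelet headed by "friend" is shared by both phrases.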


Tree-Phrase Coverage
SYNTEX (Bourigault and Fabre, 2000)

Dependency parser available for French and English.

On demande des crédits fédéraux

demande
    SUB  on
    OBJ  crédits
        DET  des
        ADJ  fédéraux

A link identifies two words: a governor and its dependent (e.g. demande governs crédits)

We do not consider link labels in this study
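One simple way to hold such a parse in memory is a list of governor/dependent links over token positions. This is only a sketch of a possible representation, not SYNTEX's output format; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    governor: int   # position of the governing word
    dependent: int  # position of the dependent word
    label: str      # kept here, but ignored in the study


tokens = ["on", "demande", "des", "crédits", "fédéraux"]
links = [Link(1, 0, "SUB"), Link(1, 3, "OBJ"), Link(3, 2, "DET"), Link(3, 4, "ADJ")]


def dependents_of(links, governor):
    """Positions of the words directly governed by the given position."""
    return sorted(l.dependent for l in links if l.governor == governor)


def roots(tokens, links):
    """Positions that are not the dependent of any link."""
    governed = {l.dependent for l in links}
    return [i for i in range(len(tokens)) if i not in governed]
```

For the example sentence, `roots(tokens, links)` returns the position of "demande", and `dependents_of(links, 3)` returns the positions of "des" and "fédéraux".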


Tree-Phrase Coverage
Facts

We parsed the source (French) part of the training material with SYNTEX

We extracted all TLs of depth 1

We collected more than 3 million types of TLs from which we projected 6.5 million kinds of EPs

The TLs range in size from 2 to 8 words, and EPs from 1 to 9

Roughly half the TLs and 40% of the EPs are contiguous ones


Tree-Phrase Coverage
On demande des crédits fédéraux / We request for federal funding

alignment: demande ≡ request for // fédéraux ≡ federal // crédits ≡ funding

treelets:

demande
    on
    crédits

crédits
    des
    fédéraux

tree-phrases:

TL⋆ {{on@-1} demande {crédits@2}}
EP⋆ |request@0||for@1||funding@3|

TL {{des@-1} crédits {fédéraux@1}}
EP |federal@0||funding@1|
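The two tree-phrases above can be reproduced from the parse and the word alignment with a short sketch. The reading of the notation is inferred from the example: in a TL, each dependent carries its offset from the head; in an EP, each target word carries its offset from the first word of the EP. Function names and the handling of several dependents on the same side (joined by spaces) are assumptions.

```python
src = ["on", "demande", "des", "crédits", "fédéraux"]
tgt = ["we", "request", "for", "federal", "funding"]
links = [(1, 0), (1, 3), (3, 2), (3, 4)]   # (governor, dependent) positions
alignment = {1: [1, 2], 3: [4], 4: [3]}    # source position -> target positions


def depth1_treelets(links):
    """Map each governor to the sorted list of its direct dependents."""
    deps = {}
    for governor, dependent in links:
        deps.setdefault(governor, []).append(dependent)
    return {g: sorted(ds) for g, ds in deps.items()}


def format_tl(src, head, dependents):
    """Treelet string with dependent offsets relative to the head."""
    left = " ".join(f"{{{src[d]}@{d - head}}}" for d in dependents if d < head)
    right = " ".join(f"{{{src[d]}@{d - head}}}" for d in dependents if d > head)
    return "{" + " ".join(p for p in (left, src[head], right) if p) + "}"


def format_ep(tgt, alignment, head, dependents):
    """Elastic phrase: aligned target words with offsets from the first one."""
    positions = sorted({p for s in [head, *dependents]
                        for p in alignment.get(s, [])})
    if not positions:
        return ""
    return "".join(f"|{tgt[p]}@{p - positions[0]}|" for p in positions)


tree_phrases = [(format_tl(src, h, ds), format_ep(tgt, alignment, h, ds))
                for h, ds in sorted(depth1_treelets(links).items())]
```

A TL or EP is contiguous when its offsets are consecutive, which is the case for the second tree-phrase but not the first.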


Tree-Phrase Coverage
Coverage Statistics

Chunk-based queries (previous experiment):

Metric                    Source    Target
Optimal coverage          59.9%     59.3%
Cov. unit size (words)    3.73      2.99
Number of cov. units      3.08      3.47

Avg. nb. LUCENE queries per sentence: 11.7

Tree-phrase queries:

Metric                    Source    Target
Optimal coverage          62.7%     56.4%
Cov. by contiguous TPs    46.0%     38.6%


Tree-Phrase Coverage

S présentation de le 1er rapport de le comité permanent
T presentation of first report of standing committee

Treelets (head: dependents) and their associated target material:

rapport: de, le, 1er         of first report
rapport: de, le              report, report of, of report
rapport: le, de              report, of report, report of
rapport: de, le, 1er, de     report, of first report, first report of
comité: de, le               committee, of committee
comité: de, le, permanent    of committee, standing committee, of standing committee


Discussion

We have considered distinct approaches to querying a TM

Full-sentence queries yield poor coverage

Random substring querying does well at coverage, but does not seem viable without stringent pruning strategies (Simard, 2003)

Chunk-based querying seems attractive

TP querying seems a viable alternative, and non-contiguous units might be useful to the end user

Coverage simulations are only approximations (Langlais and Simard, 2003)
