Generation-Heavy Hybrid Machine Translation Nizar Habash Postdoctoral Researcher Center for Computational Learning Systems Columbia University Columbia University NLP Colloquium October 28, 2004


Page 1: Generation-Heavy Hybrid Machine Translation Nizar Habash Postdoctoral Researcher Center for Computational Learning Systems Columbia University Columbia

Generation-Heavy Hybrid Machine Translation

Nizar Habash Postdoctoral Researcher

Center for Computational Learning Systems
Columbia University

Columbia University NLP Colloquium

October 28, 2004

Page 2

The Intuition: Generation-Heavy Machine Translation

[Figure: diagram with the labels Español, عربي, Dictionary, gist, and English]

Page 3

Introduction: Research Contributions

• A general, reusable, and extensible Machine Translation (MT) model that transcends the need for large amounts of deep symmetric knowledge

• Development of reusable large-scale resources for English

• A large-scale Spanish-English MT system, Matador, which is more robust across genres and produces more grammatical output than simple statistical or symbolic techniques

Page 4

Roadmap

• Introduction
• Generation-Heavy Machine Translation
• Evaluation
• Conclusion
• Future Work

Page 5

Introduction: MT Pyramid

[Figure: MT pyramid. Analysis rises from source word through source syntax to source meaning; generation descends from target meaning through target syntax to target word. Labeled approaches: Gisting, Transfer, Interlingua.]

Page 6

Introduction: MT Pyramid

[Figure: the same MT pyramid (source word, source syntax, source meaning, target meaning, target syntax, target word; Analysis and Generation), labeled with resources: Dictionaries/Parallel Corpora, Transfer Lexicons, Interlingual Lexicons.]

Page 7

Introduction: MT Pyramid

[Figure: the same MT pyramid (source word, source syntax, source meaning, target meaning, target syntax, target word), labeled with Gisting and Transfer only.]

Page 8

Introduction: Why gisting is not enough

– SP: Sobre la base de dichas experiencias se estableció en 1988 una metodología.

– GIST: Envelope her basis out speak experiences them settle at 1988 one methodology.

– EN: On the basis of these experiences, a methodology was arrived at in 1988.

Page 9

Introduction: Translation Divergences

• Divergences occur in 35% of sentences in the TREC El Norte Corpus (Dorr et al. 2002)

• Divergence Types
– Categorial (X tener hambre → X be hungry)
– Conflational (X dar puñaladas a Z → X stab Z)
– Structural (X entrar en Y → X enter Y)
– Head Swapping (X cruzar Y nadando → X swim across Y)
– Thematic (X gustar a Y → Y like X)

Page 10

Roadmap

• Introduction
• Generation-Heavy Machine Translation
• Evaluation
• Conclusion
• Future Work

Page 11

Generation-Heavy Hybrid Machine Translation

• Problem: asymmetric resources
– High quality, broad coverage, semantic resources for target language
– Low quality resources for source language
– Low quality (many-to-many) translation lexicon

• Thesis: we can approximate interlingual MT without the use of symmetric interlingual resources

Page 12

Relevant Background Work

• Hybrid Natural Language Generation: Constrained Overgeneration → Statistical Ranking
– Nitrogen (Langkilde and Knight 1998), Halogen (Langkilde 2002)
– FERGUS (Rambow and Bangalore 2000)

• Lexical Conceptual Structure (LCS) based MT
– (Jackendoff 1983), (Dorr 1993)

Page 13

LCS-based MT: Example

[Figure: LCS-based MT example (Dorr, 1993)]

Page 14

Generation-Heavy Hybrid Machine Translation

[Figure: processing pipeline: Analysis → Translation → Theta Linking → Expansion → Assignment → Pruning → Linearization → Ranking, with the label "Generation"]

Page 15

Matador: Spanish-English GHMT

[Figure: the pipeline from Spanish input to English output: Analysis → Translation → Theta Linking → Expansion → Assignment → Pruning → Linearization → Ranking, with the labels "Generation" and "Expansive Rich Generation for English (EXERGE)"]

Page 16

GHMT: Analysis

• Source language syntactic dependency

• Example: Yo le di puñaladas a Juan.

• Features of representation
– Approximation of predicate-argument structure
– Long-distance dependencies

[Figure: dependency tree headed by dar, with :subj Yo, :obj puñalada, and :mod a; a has :obj Juan]

Page 17

GHMT: Translation

• Lexical transfer but NO structural change

• Translation Lexicon
(tener V) ((have V) (own V) (possess V) (be V))
(deber V) ((owe V) (should AUX) (must AUX))
(soler V) ((tend V) (usually AV))

[Figure: the source dependency tree (dar with :subj Yo, :obj puñalada, :mod a; a with :obj Juan) beside the same tree with each node replaced by its English candidates: dar → ADMINISTER, CONFER, DELIVER, EXTEND, GIVE, GRANT, HAND, LAND, RENDER; Yo → I, MY, MINE; puñalada → STAB, KNIFE_WOUND; a → AT, BY, INTO, THROUGH, TO; Juan → JOHN]
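The translation step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Matador's implementation: the `Node` class, the `transfer` function, and the pass-through rule for unknown words are hypothetical, while the candidate lists are the ones shown on the slide. The point it shows is that every node is replaced by a bag of target lexemes and the tree shape is left untouched.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A dependency node: a lexeme, its relation to the parent, and children."""
    lexeme: str
    relation: str = ":root"
    children: list = field(default_factory=list)


# Candidate translations copied from the slide's example.
LEXICON = {
    "dar": ["ADMINISTER", "CONFER", "DELIVER", "EXTEND", "GIVE",
            "GRANT", "HAND", "LAND", "RENDER"],
    "yo": ["I", "MY", "MINE"],
    "puñalada": ["STAB", "KNIFE_WOUND"],
    "a": ["AT", "BY", "INTO", "THROUGH", "TO"],
    "juan": ["JOHN"],
}


def transfer(node):
    """Replace each lexeme by its candidate list; keep the structure as-is."""
    # Assumption: words missing from the lexicon pass through unchanged.
    options = LEXICON.get(node.lexeme.lower(), [node.lexeme.upper()])
    return {
        "relation": node.relation,
        "options": options,
        "children": [transfer(child) for child in node.children],
    }


# Dependency tree for "Yo le di puñaladas a Juan."
EXAMPLE = Node("dar", ":root", [
    Node("Yo", ":subj"),
    Node("puñalada", ":obj"),
    Node("a", ":mod", [Node("Juan", ":obj")]),
])

translated = transfer(EXAMPLE)
```

The ambiguity is deliberately left unresolved here; choosing among the candidates is deferred to the later generation stages.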

Page 18

GHMT: Thematic Linking

• Syntactic Dependency → Thematic Dependency

• Which divergence

[Figure: the translated syntactic dependency tree (ADMINISTER, CONFER, DELIVER, EXTEND, GIVE, GRANT, HAND, LAND, RENDER with :subj I, MY, MINE; :obj STAB, KNIFE_WOUND; :mod AT, BY, INTO, THROUGH, TO; and :obj JOHN under the preposition) beside the thematic dependency tree (EXTEND, GIVE, GRANT, RENDER with Agent I, MY, MINE; Theme STAB, KNIFE_WOUND; Goal JOHN)]

Page 19

GHMT: Thematic Linking Resources

• Word Class Lexicon
(:NUMBER "V.13.1.a.ii" :NAME "Give - No Exchange" :POS V
 :THETA_ROLES (((ag obl) (th obl) (goal obl to)) ((ag obl) (goal obl) (th obl)))
 :LCS_PRIMS (cause go)
 :WORDS (feed give pass pay peddle refund render repay serve))

• Syntactic-Thematic Linking Map
(:subj ag instr th exp loc src goal perc mod-poss poss)
(:obj2 goal src th perc ben)
(across goal loc)
(in loc mod-poss perc goal poss prop)
(to prop goal ben info th exp perc pred loc time)
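One way to read these two resources together is as a constraint check: a verb frame is usable when every syntactic argument can be assigned a distinct thematic role that the linking map allows. The sketch below illustrates that reading and nothing more. The `link` function is hypothetical, the `:obj` entry of the map is an assumed value (the slide excerpt does not list it), and the real system may apply further criteria that this sketch does not model.

```python
# Linking map entries from the slide, plus one assumed entry for ":obj".
LINKING_MAP = {
    ":subj": {"ag", "instr", "th", "exp", "loc", "src", "goal",
              "perc", "mod-poss", "poss"},
    ":obj2": {"goal", "src", "th", "perc", "ben"},
    "to": {"prop", "goal", "ben", "info", "th", "exp", "perc",
           "pred", "loc", "time"},
    ":obj": {"th", "goal", "src", "perc", "ben"},  # hypothetical, not on the slide
}

# Frames written as (role, obligatoriness, preposition or None).
GIVE_FRAME = [("ag", "obl", None), ("th", "obl", None), ("goal", "obl", "to")]
CONFER_FRAME = [("exp", "obl", None)]


def compatible(relation, role, prep):
    """Can an argument with this relation fill this frame slot?"""
    if role not in LINKING_MAP.get(relation, set()):
        return False
    if prep is not None:
        return relation == prep          # slot demands a specific preposition
    return relation.startswith(":")      # bare slot needs a syntactic relation


def link(args, frame):
    """Return {role: word} if every argument fits the frame, else None."""
    def search(remaining, slots, assignment):
        if not remaining:
            missing = [role for role, status, _ in slots if status == "obl"]
            return None if missing else assignment
        (relation, word), rest = remaining[0], remaining[1:]
        for i, (role, _status, prep) in enumerate(slots):
            if compatible(relation, role, prep):
                found = search(rest, slots[:i] + slots[i + 1:],
                               {**assignment, role: word})
                if found is not None:
                    return found
        return None

    return search(list(args), list(frame), {})


ARGS = [(":subj", "I"), (":obj", "STAB"), ("to", "JOHN")]
give_roles = link(ARGS, GIVE_FRAME)      # agent, theme, and goal are all filled
confer_roles = link(ARGS, CONFER_FRAME)  # None: two arguments are left over
```

Frames that return `None` drop out, which is the sense in which linking narrows the set of candidate verbs.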

Page 20

GHMT: Thematic Linking

• Syntactic Dependency → Thematic Dependency

((ADMINISTER V.13.2 ((AG OBL) (TH OBL) (GOAL OPT TO)))
 (CONFER V.37.6.b ((EXP OBL)))
 (DELIVER V.11.1 ((AG OBL) (GOAL OBL) (TH OBL) (SRC OPT FROM)))
 (EXTEND V.47.1 ((TH OBL) (MOD-LOC OPT . T)))
 (EXTEND V.13.3 ((AG OBL) (TH OBL) (GOAL OPT TO)))
 (EXTEND V.13.3 ((AG OBL) (GOAL OBL) (TH OBL)))
 (EXTEND V.13.2 ((AG OBL) (TH OBL) (GOAL OPT TO)))
 (GIVE V.13.1.a.ii ((AG OBL) (TH OBL) (GOAL OBL TO)))
 (GIVE V.13.1.a.ii ((AG OBL) (GOAL OBL) (TH OBL)))
 (GRANT V.29.5.e ((AG OBL) (INFO OBL THAT)))
 (GRANT V.29.5.d ((AG OBL) (TH OBL) (PROP OBL TO)))
 (GRANT V.13.3 ((AG OBL) (TH OBL) (GOAL OPT TO)))
 (GRANT V.13.3 ((AG OBL) (GOAL OBL) (TH OBL)))
 (HAND V.11.1 ((AG OBL) (TH OBL) (GOAL OPT TO) (SRC OPT FROM)))
 (HAND V.11.1 ((AG OBL) (GOAL OBL) (TH OBL) (SRC OPT FROM)))
 (LAND V.9.10 ((AG OBL) (TH OBL)))
 (RENDER V.13.1.a.ii ((AG OBL) (TH OBL) (GOAL OBL TO)))
 (RENDER V.13.1.a.ii ((AG OBL) (GOAL OBL) (TH OBL)))
 (RENDER V.10.6.a ((AG OBL) (TH OBL) (MOD-POSS OPT OF)))
 (RENDER V.10.6.a.LOCATIVE ((AG OPT) (SRC OBL) (TH OPT OF))))

[Figure: the translated syntactic dependency tree: ADMINISTER, CONFER, DELIVER, EXTEND, GIVE, GRANT, HAND, LAND, RENDER with :subj I, MY, MINE; :obj STAB, KNIFE_WOUND; :mod AT, BY, INTO, THROUGH, TO; and :obj JOHN under the preposition]


Page 22

GHMT: Thematic Linking

• Syntactic Dependency → Thematic Dependency

[Figure: the translated syntactic dependency tree (ADMINISTER, CONFER, DELIVER, EXTEND, GIVE, GRANT, HAND, LAND, RENDER with :subj I, MY, MINE; :obj STAB, KNIFE_WOUND; :mod AT, BY, INTO, THROUGH, TO; and :obj JOHN under the preposition) beside the thematic dependency tree (EXTEND, GIVE, GRANT, RENDER with Agent I, MY, MINE; Theme STAB, KNIFE_WOUND; Goal JOHN)]

Page 23

Interlingua Approximation through Expansion Operations

• Categorial Variation: development (N) ↔ develop (V)

• Node Conflation / Inflation: put (V) butter (N) ↔ butter (V)

• Relation Conflation / Inflation

• Relation Variation

[Figure: three example trees: enter with subj John and obj room; enter with subj John and in room; go with subj John and in room]

Page 24

Interlingua Approximation: 2nd Degree Expansion

[Figure: three trees linked by two operations. cross with subj John, obj river, and mod swimming; after Relation Inflation, go with subj John, across river, and mod swimming; after Node Conflation, swim with subj John and across river]

Page 25

GHMT: Structural Expansion

• Conflation Example

[Figure: GIVE (V) with Agent I, Theme STAB (N), and Goal JOHN, alongside the conflated STAB (V) with Agent I and Goal JOHN]

Page 26

GHMT: Structural Expansion

• Conflation and Inflation

• Structural Expansion Resources
– Word Class Lexicon
(:NUMBER "V.42.2" :NAME "Poison Verbs" :POS V
 :THETA_ROLES (((ag obl) (goal obl)))
 :LCS_PRIMS (cause go)
 :WORDS (crucify electrocute garrotte hang knife poison shoot smother stab strangle))
– Categorial Variation Database (Habash and Dorr 2003)
(:V (hunger) :N (hunger hungriness) :AJ (hungry))
(:V (validate) :N (validation validity) :AJ (valid))
(:V (cross) :N (crossing cross) :P (across))
(:V (stab) :N (stab))
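The two resources above suggest how a conflation such as "give a stab to John" → "stab John" could be proposed. The following sketch is an assumption-laden illustration: the function names, the dictionary layout, and the exact matching condition (equal LCS primitives, and a frame that covers the remaining roles) are hypothetical simplifications. Only the lexicon and database entries come from the slides.

```python
# Categorial Variation Database clusters from the slide.
CATVAR = [
    {"V": ["hunger"], "N": ["hunger", "hungriness"], "AJ": ["hungry"]},
    {"V": ["validate"], "N": ["validation", "validity"], "AJ": ["valid"]},
    {"V": ["cross"], "N": ["crossing", "cross"], "P": ["across"]},
    {"V": ["stab"], "N": ["stab"]},
]

# Word class information for the two verbs in the example.
WORD_CLASSES = {
    "give": {"theta_roles": ("ag", "th", "goal"), "lcs_prims": ("cause", "go")},
    "stab": {"theta_roles": ("ag", "goal"), "lcs_prims": ("cause", "go")},
}


def verb_variants(noun):
    """Verbs that share a categorial-variation cluster with the noun."""
    return [verb
            for cluster in CATVAR if noun in cluster.get("N", [])
            for verb in cluster.get("V", [])]


def conflate(head, args):
    """Propose verbs that absorb the theme noun of `head`.

    `args` maps thematic roles to words. A verb qualifies when it is a
    categorial variant of the theme, shares the head's LCS primitives,
    and its frame covers exactly the remaining roles (assumed criteria).
    """
    theme = args.get("th")
    head_class = WORD_CLASSES.get(head)
    if theme is None or head_class is None:
        return []
    rest = {role: word for role, word in args.items() if role != "th"}
    proposals = []
    for verb in verb_variants(theme):
        verb_class = WORD_CLASSES.get(verb)
        if (verb_class is not None
                and verb_class["lcs_prims"] == head_class["lcs_prims"]
                and set(verb_class["theta_roles"]) == set(rest)):
            proposals.append((verb, rest))
    return proposals


result = conflate("give", {"ag": "I", "th": "stab", "goal": "John"})
```

The original GIVE structure is kept alongside any proposal, so expansion adds alternatives rather than replacing the input.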

Page 27

GHMT: Structural Expansion

• Conflation Example

[Figure: GIVE (V) with Agent I, Theme STAB (N), and Goal JOHN; the theme noun is related to the verb STAB (V)]

Page 28

GHMT: Structural Expansion

• Conflation Example

[Figure: STAB (V) with unfilled Agent and Goal slots (*) and GIVE (V) with Agent I, Theme STAB (N), and Goal JOHN; both verbs are annotated [CAUSE GO]]


Page 30

GHMT: Syntactic Assignment

• Thematic → Syntactic Mapping

[Figure: the two thematic trees (STAB (V) with Agent I and Goal JOHN; GIVE (V) with Agent I, Theme STAB (N), and Goal JOHN) mapped to three syntactic trees: STAB (V) with Subject I, MY … and Object JOHN; GIVE (V) with Subject I, MY …, IObject JOHN, and Object STAB (N), KNIFE_WOUND (N); GIVE (V) with Subject I, MY …, Object STAB (N), KNIFE_WOUND (N), and Mod TO, AT, … with Object JOHN]

Page 31

GHMT: Structural N-gram Pruning

• Statistical lexical selection

[Figure: three syntactic trees before and after pruning. Before: STAB (V) with Subject I, MY … and Object JOHN; GIVE (V) with Subject I, MY …, IObject JOHN, and Object STAB (N), KNIFE_WOUND (N); GIVE (V) with Subject I, MY …, Object STAB (N), KNIFE_WOUND (N), and Mod TO, AT, … with Object JOHN. After: STAB (V) with Subject I and Object JOHN; GIVE (V) with Subject I, IObject JOHN, and Object STAB (N); GIVE (V) with Subject I, Object STAB (N), and Mod TO with Object JOHN]

Page 32

GHMT: Target Statistical Resources

• Structural N-gram Model
– Long-distance
– Lexemes

• Surface N-gram Model
– Local
– Surface-forms

[Figure: the surface string "every cloud has a silver lining" beside its dependency tree: have heads cloud and lining; cloud heads every; lining heads a and silver]
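The contrast between the two models can be made concrete with the slide's own sentence. The sketch below is illustrative only: the tuple encoding of the tree and both function names are assumptions. It extracts head-child lexeme pairs from the dependency tree and adjacent word pairs from the surface string.

```python
# Dependency tree for "every cloud has a silver lining", as (lexeme, children).
TREE = ("have", [
    ("cloud", [("every", [])]),
    ("lining", [("a", []), ("silver", [])]),
])

SURFACE = "every cloud has a silver lining".split()


def structural_bigrams(node):
    """Collect (head lexeme, child lexeme) pairs over the whole tree."""
    head, children = node
    pairs = []
    for child in children:
        pairs.append((head, child[0]))
        pairs.extend(structural_bigrams(child))
    return pairs


def surface_bigrams(tokens):
    """Collect pairs of adjacent surface forms."""
    return list(zip(tokens, tokens[1:]))


struct = structural_bigrams(TREE)
surf = surface_bigrams(SURFACE)
```

Here `("have", "lining")` is a structural bigram although "has" and "lining" are three words apart on the surface, which is the long-distance property the slide lists. The structural pairs also use lexemes ("have") where the surface pairs use inflected forms ("has").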

Page 33

GHMT: Linearization & Ranking

• Oxygen Linearization (Habash 2000)

• Halogen Statistical Ranking (Langkilde 2002)

I stabbed John . [-1.670270]
I gave a stab at John . [-2.175831]
I gave the stab at John . [-3.969686]
I gave an stab at John . [-4.489933]
I gave a stab by John . [-4.803054]
I gave a stab to John . [-5.045810]
I gave a stab into John . [-5.810673]
I gave a stab through John . [-5.836419]
I gave a knife wound by John . [-6.041891]
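The final step amounts to ordering the overgenerated candidates by score and taking the top one. As a minimal sketch, assuming the bracketed numbers are scores where higher (closer to zero) is better, the ranking can be reproduced as follows. The `rank` function and the subset of candidates are illustrative; the scores themselves are copied from the slide, not recomputed.

```python
# A few candidates and scores from the slide, deliberately out of order.
CANDIDATES = [
    ("I gave a stab to John .", -5.045810),
    ("I stabbed John .", -1.670270),
    ("I gave an stab at John .", -4.489933),
    ("I gave a stab at John .", -2.175831),
]


def rank(candidates):
    """Sort (sentence, score) pairs from best to worst score."""
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


ranked = rank(CANDIDATES)
best_sentence = ranked[0][0]
```

Ungrammatical candidates such as "I gave an stab at John ." are still generated; the design relies on the statistical ranker to push them down the list instead of forbidding them symbolically.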

Page 34

Roadmap

• Introduction
• Generation-Heavy Machine Translation
• Evaluation
– Overall Evaluation
– Component Evaluation
• Conclusion
• Future Work

Page 35

Overall Evaluation: Systems

• Gisting (GIST)
– Approach: Symbolic, Word-based
– Translation Model: 400K surface-lexeme pairs
– Language Model: Unigrams, Brown Corpus, 1M words
– Development Time: 1 person-month

• Systran (SYST)
– Approach: Symbolic, Transfer-based
– Translation Model: 120K lexeme-lexeme pairs and large transfer lexicon
– Development Time: Hundreds of person-years

• IBM Model 4 (IBM4)
– Approach: Statistical, Word-based
– Translation Model: Model 4, Giza Trained, 50K UN sentence pairs
– Language Model: Bigrams, 3M words (UN)
– Development Time: 1 person-month

• Matador (MTDR)
– Approach: Hybrid, Generation-Heavy
– Translation Model: 50K lexeme-lexeme pairs
– Language Model: Bigrams, 3M words (UN), and Structural Bigrams, 1.5M words (UN)
– Development Time: 1 person-year

(Brown et al. 1990), (Al-Onaizan et al. 1999), (Germann and Marcu 2000), (Resnik 1997)

Page 36

Overall Evaluation: Bleu Metric

• Bleu – BiLingual Evaluation Understudy (Papineni et al 2001)

– Modified n-gram precision with length penalty

– Quick, inexpensive and language independent

– Correlates highly with human evaluation

– Bias against synonyms and inflectional variations

Page 37

Overall Evaluation: Test Sets

• UN
– Genre: United Nations documents
– Spanish-English sentence pairs: 2,000
– Sentence length (words): 15.39

• FBIS
– Genre: News broadcast
– Spanish-English sentence pairs: 2,000
– Sentence length (words): 19.27

• Bible
– Genre: Religious
– Spanish-English sentence pairs: 1,000
– Sentence length (words): 16.38

Page 38

Overall Evaluation: Results

[Figure: bar chart of Bleu score (y-axis, 0 to 30) by corpus (UN, FBIS, BIBLE) for SYST, IBM4, MTDR, and GIST]

Page 39

Overall Evaluation: Results

• Systran is overall best

• Gist is overall worst

• Matador is more robust than IBM4

• Matador is more grammatical than IBM4

• Matador has less information loss than IBM4

Page 40

Overall Evaluation: Grammaticality

• Example
– SP: Además dijo que solamente una inyección masiva de capital extranjero ...
– EN: Further, he said that only a massive injection of foreign capital ...
– IBM4: further stated that only a massive inyección of capital abroad ...
– MTDR: Also he spoke only a massive injection of foreign capital ...

• Parsed all sentences (Spanish, English reference and English output)
– Can we find main verb?
– Pro Drop Restoration

Page 41

Overall Evaluation: Grammaticality: Verb Determination

[Figure: bar chart of verbs per sentence (y-axis, 0 to 2.5) for English, Spanish, SYST, IBM4, MTDR, and GIST]

Page 42

Overall Evaluation: Grammaticality: Subject Realization

[Figure: bar chart of percentage of realized subjects (y-axis, 0 to 100) for English, Spanish, SYST, IBM4, MTDR, and GIST]

Page 43

Overall Evaluation: Loss of Information

• Example
– SP: El daño causado al pueblo de Sudáfrica jamás debe subestimarse.
– EN: The damage caused to the people of his country should never be underestimated.
– IBM4: the damage * the people of south * must never underestimated .
– MTDR: Never the causado damage to the people of South Africa should be underestimated.

• Reference length
– Gisting (GIST): 109%
– Systran (SYST): 109%
– IBM Model 4 (IBM4): 94%
– Matador (MTDR): 104%

Page 44

Component Evaluation

• Conducted several component evaluations
– Parser: ~75% correct (labeled dependency links)
– Categorial Variation Database: 81% Precision-Recall
– Structural Expansion
– Structural N-grams

Page 45

Component Evaluation: Structural Expansion

• Insignificant increase in Bleu score
• 40% of divergences pragmatic
• LCS lexicon coverage issues
• Minimal handling of nominal divergences
• Over-expansion

– SP: Además, destruyó totalmente sus cultivos de subsistencia …
– EN: It had totally destroyed Samoa's staple crops ...
– MTDR: Furthermore, it totaled their cultivations of subsistence …

– SP: Dicha adición se publica sólo en años impares.
– EN: That addendum is issued in odd-numbered years only.
– MTDR: concerned addendum is excluded in odd years.

Page 46

Component Evaluation: Structural N-grams

[Figure: bar chart of time in seconds (y-axis, 0 to 30,000) per module (Parsing, Translation, Expansion, Linearization, Ranking), with and without structural n-grams]

• 60% speed-up with no effect on quality

Page 47

Roadmap

• Introduction
• Generation-Heavy Machine Translation
• Evaluation
• Conclusion
• Future Work

Page 48

Conclusion: Research Contributions

• A general reusable and extensible MT model that transcends the need for large amounts of symmetric knowledge

• A systematic non-interlingual/non-transfer framework for handling translation divergences

• Extending the concept of symbolic overgeneration to include conflation and head-swapping of structural variations.

• A model for language-independent syntactic-to-thematic linking

Page 49

Conclusion: Research Contributions

• Development of reusable large-scale modules and resources: Exerge, Categorial Variation Database, etc.

• A large-scale Spanish-English GHMT implementation

• An evaluation of Matador against four models of machine translation found it to be robust across genres and to produce more grammatical output.

Page 50

Ongoing Work

• Retargetability to new languages
– Chinese, Arabic

• Extending system to use bi-texts
– Phrase dictionary
– Weighted translation pairs

• Generation-Heavy parsing
– Small dependency grammar for foreign language
– English structural n-grams to rank parses

• Extending system with new optional modules
– Cross-lingual headline generation: DepTrimmer (work with Bonnie Dorr), extending Trimmer (Dorr et al. 2003) to dependency representation

Page 51

Future Work

• Categorial Variation Database
– Improving word-cluster correctness

• Structural Expansion
– Extending to nominal divergences
– Improving thematic linking with a statistical model

• Structural N-grams
– Enriching with syntactic/thematic relations

Page 52

Thank you!

Questions?


Page 55

Overall Evaluation: Bleu Metric

Test Sentence

colorless green ideas sleep furiously

Gold Standard References

all dull jade ideas sleep irately
drab emerald concepts sleep furiously
colorless immature thoughts nap angrily

Unigram precision = 4 / 5 = 0.8
Bigram precision = 2 / 4 = 0.5

$$\text{Bleu Score} = (a_1 a_2 \cdots a_n)^{1/n} = (0.8 \times 0.5)^{1/2} = 0.6325 \;\Rightarrow\; 63.25$$
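The arithmetic on this slide can be checked with a short script. It is a sketch of the slide's simplified calculation only: it computes modified (clipped) n-gram precision against the three references and takes the geometric mean for n = 1 and 2, and it deliberately leaves out the length penalty that the full metric includes. The function names are illustrative.

```python
from collections import Counter
from math import prod


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as often as it occurs in the reference where it is most frequent."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def bleu_without_length_penalty(candidate, references, max_n=2):
    """Geometric mean of the modified precisions for n = 1..max_n."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    return prod(precisions) ** (1.0 / max_n)


TEST = "colorless green ideas sleep furiously".split()
REFS = [
    "all dull jade ideas sleep irately".split(),
    "drab emerald concepts sleep furiously".split(),
    "colorless immature thoughts nap angrily".split(),
]

p1 = modified_precision(TEST, REFS, 1)           # 4 of 5 unigrams match
p2 = modified_precision(TEST, REFS, 2)           # 2 of 4 bigrams match
score = bleu_without_length_penalty(TEST, REFS)  # sqrt(0.8 * 0.5)
```

The only unmatched unigram is "green", and the two matching bigrams are "ideas sleep" and "sleep furiously", each found in a different reference.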

Page 56

Overall Evaluation

• Investigating BLEU's bias against inflectional variants
– SP: Los programas de ajuste estructural se han aplicado rigurosamente.
– EN: Structural adjustment programmes had been rigorously implemented.
– IBM4: structural adjustment programmes have been applied strictly.
– MTDR: programmes of structural adjustment have been added rigurosament.

Page 57

Overall Evaluation: Inflectional Normalization

[Figure: bar chart of Bleu score (y-axis, 0 to 45), raw versus inflectionally normalized, for SYST, IBM4, MTDR, and GIST]