
An Overview of the AVENUE Project

Presented by Lori Levin

Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Pittsburgh, PA USA

AVENUE Project

• Dr. Jaime Carbonell, PI
• Dr. Alon Lavie, Co-PI
• Dr. Lori Levin, Co-PI
• Dr. Robert Frederking
• Dr. Ralf Brown
• Dr. Rodolfo Vega
• Erik Peterson
• Christian Monson
• Ariadna Font Llitjós
• Alison Alvarez
• Roberto Aranovich
• Dr. Jeff Good
• Dr. Katharina Probst
• Mapudungun
  – Dr. Eliseo Cañulef
  – Rosendo Huisca
  – and others
• Hebrew
  – Dr. Shuly Wintner
  – student

This research was funded in part by NSF grant number IIS-0121-631.

MT Approaches

[Figure: the three MT approaches as paths from source to target.]

Source: Mi chiamo Lori
Target: My name is Lori

• Direct: SMT, EBMT
• Transfer Rules: Syntactic Parsing (Pronoun-acc-1-sg chiamare-1sg N) to Text Generation ([np poss-1sg “name”] BE-pres N)
• Interlingua: Semantic Analysis to introduce-self to Sentence Planning

AVENUE: Automate Rule Learning

Approaches to MT

• Direct
  – Works best with large parallel corpora (millions of words)
  – Can be done without linguistic resources
• Interlingua
  – Useful when you are translating between more than two languages
  – Requires linguistic knowledge
• Transfer
  – Requires linguistic knowledge

Useful Resources for MT

• Parallel corpus

• Monolingual corpus

• Lexicon

• Morphological Analyzer (lemmatizer)

• Human Linguist

• Human non-linguist

Low Resource Situations

• Indigenous languages
  – May lack large corpora
  – May lack a computational linguist
• “Strategic” Languages
  – Aside from standard written Arabic and Chinese
• Resource-rich language: limited domain
  – Most of the large parallel corpora are newspaper, parliamentary proceedings, or broadcast news
  – Fewer resources for conversation related to humanitarian aid

Why Machine Translation for Languages with Limited Resources?

• We are in the age of information explosion
  – With the internet, the web, and Google, anyone can get the information they want anytime…
• But what about the text in all those other languages?
  – How do they read all this English stuff?
  – How do we read all the stuff that they put online?
• MT for these languages would enable:
  – Better government access to native indigenous and minority communities
  – Better minority and native community participation in information-rich activities (health care, education, government) without giving up their languages
  – Civilian and military applications (disaster relief)
  – Language preservation

Mixed Resource Situations

• Some resources are available and others aren’t.

Omnivorous MT

• Eat whatever resources are available

• Eat large or small amounts of data

AVENUE’s Inventory

• Resources
  – Parallel corpus
  – Monolingual corpus
  – Lexicon
  – Morphological Analyzer (lemmatizer)
  – Human Linguist
  – Human non-linguist
• Techniques
  – Rule based transfer system
  – Example Based MT
  – Morphology Learning
  – Rule Learning
  – Interactive Rule Refinement
  – Multi-Engine MT

The Avenue Low Resource Scenario

[Figure: AVENUE system architecture, from INPUT TEXT to OUTPUT TEXT. Components shown, by group:]

• Elicitation: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus
• Morphology: Learning Module, Morphology Analyzer
• Rule Learning: Learning Module, Learned Transfer Rules, Handcrafted rules
• Run-Time System: Lexical Resources, Run Time Transfer System, Decoder
• Rule Refinement: Translation Correction Tool, Rule Refinement Module

AVENUE

• Rules can be written by hand or learned automatically.

• Hybrid
  – Rule-based transfer
  – Statistical decoder
  – Multi-engine combinations with SMT and EBMT

AVENUE systems (small and experimental, but tested on unseen data)

• Hebrew-to-English
  – Alon Lavie, Shuly Wintner, Katharina Probst
  – Hand-written and automatically learned
  – Automatic rules trained on 120 sentences perform slightly better than about 20 hand-written rules
• Hindi-to-English
  – Lavie, Peterson, Probst, Levin, Font, Cohen, Monson
  – Automatically learned
  – Performs better than SMT when training data is limited to 50K words
• English-to-Spanish
  – Ariadna Font Llitjós
  – Hand-written, automatically corrected
• Mapudungun-to-Spanish
  – Roberto Aranovich and Christian Monson
  – Hand-written
• Dutch-to-English
  – Simon Zwarts
  – Hand-written



Elicitation

• Get data from someone who is
  – Bilingual
  – Literate (with consistent spelling)
  – Not experienced with linguistics

English-Hindi Example

Elicitation Tool: Erik Peterson

English-Chinese Example

Note: Translator has to insert spaces between words in Chinese.

English-Arabic Example

Purpose of Elicitation

• Provide a small but highly targeted corpus of hand-aligned data
  – To support machine learning from a small data set
  – To discover basic word order
  – To discover how syntactic dependencies are expressed
  – To discover which grammatical meanings are reflected in the morphology or syntax of the language

srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell

srcsent: Tú estás cayendo
tgtsent: eymi petu ütünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling

srcsent: Tú caíste
tgtsent: eymi ütrunagimi
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
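The alignment field in these records can be read mechanically. The following is a minimal sketch, assuming the field layout shown in the records above; the function names `parse_alignment` and `aligned_words` are hypothetical and are not part of the AVENUE Elicitation Tool.

```python
import re

def parse_alignment(text):
    """Parse '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    # Each inner group has the form "(source indices,target indices)",
    # with multiple indices separated by spaces.
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", text):
        pairs.append(([int(i) for i in src.split()],
                      [int(j) for j in tgt.split()]))
    return pairs

def aligned_words(record):
    """Return (source phrase, target phrase) pairs; indices are 1-based."""
    src = record["srcsent"].split()
    tgt = record["tgtsent"].split()
    return [(" ".join(src[i - 1] for i in s), " ".join(tgt[j - 1] for j in t))
            for s, t in parse_alignment(record["aligned"])]

record = {
    "srcsent": "Tú estás cayendo",
    "tgtsent": "eymi petu ütünagimi",
    "aligned": "((1,1),(2 3,2 3))",
}
print(aligned_words(record))
```

The second pair groups "estás cayendo" with "petu ütünagimi", which is how a many-to-many alignment is expressed in this format.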

Languages

• The set of feature structures with English sentences has been delivered to the Linguistic Data Consortium as part of the Reflex program.

• Translated (by LDC) into:
  – Thai
  – Bengali
• Plans to translate into:
  – Seven “strategic” languages per year for five years
  – As one small part of a language pack (BLARK) for each language

Languages

• Spanish version in progress at New Mexico State University (Helmreich and Cowie)
  – Plans to translate into Guarani
• Portuguese version in progress in Brazil (Marcello Modesto)
  – Plans to translate into Karitiana (200 speakers)
• Plans to translate into Inupiaq (Kaplan and MacLean)

Previous Elicitation Work

• Pilot corpus
  – Around 900 sentences
  – No feature structures
• Mapudungun
  – Two partial translations
• Quechua
  – Three translations
• Aymara
  – Seven translations
• Hebrew
• Hindi
  – Several translations
• Dutch



AVENUE Machine Translation System

A transfer rule consists of type information, a synchronous context-free rule, alignments, x-side constraints, y-side constraints, and xy-constraints, e.g. ((Y1 AGR) = (X1 AGR)).

;SL: the old man, TL: ha-ish ha-zaqen

; type information and synchronous context-free rule
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
; alignments
(X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
; x-side constraints
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
; y-side constraints
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
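To make the role of the alignments concrete, here is a minimal sketch of how the (Xi::Yj) links alone could place translated words into target positions. It ignores all feature constraints, the lexicon entries are assumptions chosen to match the "ha-ish ha-zaqen" example, and `apply_rule` is a hypothetical helper rather than the AVENUE transfer engine.

```python
def apply_rule(src_words, n_target, links, lexicon):
    """Fill each target slot Yj with the translation of its aligned source word Xi.

    links are 1-based (x, y) pairs, as in the (X1::Y1) notation.
    """
    target = [None] * n_target
    for x, y in links:
        target[y - 1] = lexicon[src_words[x - 1]]
    return target

# Alignments of the NP rule: (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
links = [(1, 1), (1, 3), (2, 4), (3, 2)]
# Hypothetical three-entry transfer lexicon for the example phrase.
lexicon = {"the": "ha", "old": "zaqen", "man": "ish"}

print(apply_rule(["the", "old", "man"], 4, links, lexicon))
```

Because X1 is aligned to both Y1 and Y3, the single English determiner is realized twice on the Hebrew side, once before the noun and once before the adjective.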

Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI)

Rule learning: Katharina Probst

Rule Learning - Overview

• Goal: Acquire Syntactic Transfer Rules
• Use available knowledge from the major-language side (grammatical structure)
• Three steps:
  1. Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
  2. Compositionality Learning: use previously learned rules to learn hierarchical structure
  3. Constraint Learning: refine rules by learning appropriate feature constraints

Flat Seed Rule Generation

Learning Example: NP

Eng: the big apple

Heb: ha-tapuax ha-gadol

Generated Seed Rule:

NP::NP [ART ADJ N] [ART N ART ADJ]

((X1::Y1)

(X1::Y3)

(X2::Y4)

(X3::Y2))


• Create a “flat” transfer rule specific to the sentence pair, partially abstracted to POS
  – Words that are aligned word-to-word and have the same POS in both languages are generalized to their POS
  – Words that have complex alignments (or not the same POS) remain lexicalized
• One seed rule for each translation example
• No feature constraints associated with seed rules (but mark the example(s) from which it was learned)
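The seed generation step can be sketched as follows. This is a simplified illustration, not Probst's implementation: it assumes that a word is generalized to its POS whenever every word it is aligned to carries the same POS (which reproduces the slide's example, where "the" is aligned to both occurrences of "ha"), and the names `seed_rule` and `show` are hypothetical.

```python
def seed_rule(cat, src, tgt, links):
    """Build a flat seed rule from one aligned, POS-tagged sentence pair.

    src, tgt: lists of (word, pos); links: 1-based (src_idx, tgt_idx) pairs.
    """
    def side(tokens, partner_pos):
        out = []
        for i, (word, pos) in enumerate(tokens, start=1):
            # Generalize to POS only if all aligned partners share this POS;
            # unaligned or mismatched words stay lexicalized (quoted).
            out.append(pos if set(partner_pos(i)) == {pos} else f'"{word}"')
        return out

    x = side(src, lambda i: [tgt[y - 1][1] for s, y in links if s == i])
    y = side(tgt, lambda j: [src[s - 1][1] for s, t in links if t == j])
    return {"type": f"{cat}::{cat}", "x": x, "y": y, "alignments": sorted(links)}

def show(rule):
    aligns = " ".join(f"(X{x}::Y{y})" for x, y in rule["alignments"])
    return f'{rule["type"]} [{" ".join(rule["x"])}] [{" ".join(rule["y"])}] ({aligns})'

eng = [("the", "ART"), ("big", "ADJ"), ("apple", "N")]
heb = [("ha", "ART"), ("tapuax", "N"), ("ha", "ART"), ("gadol", "ADJ")]
rule = seed_rule("NP", eng, heb, [(1, 1), (1, 3), (2, 4), (3, 2)])
print(show(rule))
```

A word whose aligned partner has a different POS, or that has no partner at all, would appear in the rule as a quoted lexical item instead of a POS tag.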

Compositionality Learning

Initial Flat Rules:

S::S [ART ADJ N V ART N] [ART N ART ADJ V P ART N]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))

NP::NP [ART ADJ N] [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

NP::NP [ART N] [ART N]
((X1::Y1) (X2::Y2))

Generated Compositional Rule:

S::S [NP V NP] [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4))


• Detection: traverse the c-structure of the English sentence, add compositional structure for translatable chunks
• Generalization: adjust constituent sequences and alignments
• Two implemented variants:
  – Safe Compositionality: there exists a transfer rule that correctly translates the sub-constituent
  – Maximal Compositionality: generalize the rule if supported by the alignments, even in the absence of an existing transfer rule for the sub-constituent
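The generalization step can be illustrated with a small sketch in the spirit of Safe Compositionality: a source span is collapsed into a constituent only when an existing rule matches both its source side and its aligned, contiguous target side. It works on POS sequences rather than the English c-structure, and `compose` is a hypothetical name, so treat it as an approximation of the idea rather than the implemented algorithm.

```python
def compose(x, y, links, sub_rules):
    """Collapse spans covered by known sub-rules; links are 1-based (src, tgt)."""
    chunks = []  # (src_start, src_end, tgt_start, tgt_end, category), 0-based half-open
    i = 0
    while i < len(x):
        for cat, sx, sy in sorted(sub_rules, key=lambda r: -len(r[1])):
            if x[i:i + len(sx)] != sx:
                continue
            span = range(i, i + len(sx))
            tgt = sorted({t - 1 for s, t in links if (s - 1) in span})
            if not tgt:
                continue
            lo, hi = tgt[0], tgt[-1] + 1
            # The target span must be contiguous, match the sub-rule's target
            # side, and not be aligned to any source word outside the span.
            closed = all((s - 1) in span for s, t in links if lo <= t - 1 < hi)
            if tgt == list(range(lo, hi)) and y[lo:hi] == sy and closed:
                chunks.append((i, i + len(sx), lo, hi, cat))
                i += len(sx) - 1
                break
        i += 1

    def collapse(seq, spans):
        out, index_map, pos = [], {}, 0
        starts = {s: (e, c) for s, e, c in spans}
        while pos < len(seq):
            end, label = starts.get(pos, (pos + 1, seq[pos]))
            out.append(label)
            for old in range(pos, end):
                index_map[old + 1] = len(out)  # old 1-based index -> new 1-based index
            pos = end
        return out, index_map

    new_x, xmap = collapse(x, [(a, b, c) for a, b, _, _, c in chunks])
    new_y, ymap = collapse(y, [(a, b, c) for _, _, a, b, c in chunks])
    return new_x, new_y, sorted({(xmap[s], ymap[t]) for s, t in links})

flat_x = ["ART", "ADJ", "N", "V", "ART", "N"]
flat_y = ["ART", "N", "ART", "ADJ", "V", "P", "ART", "N"]
flat_links = [(1, 1), (1, 3), (2, 4), (3, 2), (4, 5), (5, 7), (6, 8)]
np_rules = [("NP", ["ART", "ADJ", "N"], ["ART", "N", "ART", "ADJ"]),
            ("NP", ["ART", "N"], ["ART", "N"])]
print(compose(flat_x, flat_y, flat_links, np_rules))
```

On the flat S rule from the slide this yields [NP V NP] on the source side, [NP V P NP] on the target side, and the alignments (X1::Y1) (X2::Y2) (X3::Y4).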

Constraint Learning

Input: Rules and their Example Sets

S::S [NP V NP] [NP V P NP] {ex1,ex12,ex17,ex26}

((X1::Y1) (X2::Y2) (X3::Y4))

NP::NP [ART ADJ N] [ART N ART ADJ] {ex2,ex3,ex13}

((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

NP::NP [ART N] [ART N] {ex4,ex5,ex6,ex8,ex10,ex11}

((X1::Y1) (X2::Y2))

Output: Rules with Feature Constraints:

S::S [NP V NP] [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4)
(X1 NUM = X2 NUM)
(Y1 NUM = Y2 NUM)
(X1 NUM = Y1 NUM))

Constraint Learning

• Goal: add appropriate feature constraints to the acquired rules
• Methodology:
  – Preserve general structural transfer
  – Learn specific feature constraints from example set

• Seed rules are grouped into clusters of similar transfer structure (type, constituent sequences, alignments)

• Each cluster forms a version space: a partially ordered hypothesis space with a specific and a general boundary

• The seed rules in a group form the specific boundary of a version space

• The general boundary is the (implicit) transfer rule with the same type, constituent sequences, and alignments, but no feature constraints
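As a rough illustration of moving between the two boundaries, the sketch below keeps only those agreement constraints that hold in every example of a cluster. This is a deliberate simplification and not the version-space procedure used in AVENUE; the feature values in `cluster` are invented for illustration and `consistent_agreements` is a hypothetical name.

```python
from itertools import combinations

def consistent_agreements(examples):
    """Return (a, b, feature) triples whose values agree in every example.

    Each example maps a constituent name (X1, Y1, ...) to its feature values.
    """
    constituents = sorted(examples[0])
    found = []
    for a, b in combinations(constituents, 2):
        shared = set(examples[0][a]) & set(examples[0][b])
        for feat in sorted(shared):
            if all(feat in ex[a] and feat in ex[b] and ex[a][feat] == ex[b][feat]
                   for ex in examples):
                found.append((a, b, feat))
    return found

# Invented feature values for three examples of the S::S [NP V NP] rule.
cluster = [
    {"X1": {"NUM": "sg"}, "X2": {"NUM": "sg"}, "X3": {"NUM": "pl"},
     "Y1": {"NUM": "sg"}, "Y2": {"NUM": "sg"}},
    {"X1": {"NUM": "pl"}, "X2": {"NUM": "pl"}, "X3": {"NUM": "sg"},
     "Y1": {"NUM": "pl"}, "Y2": {"NUM": "pl"}},
    {"X1": {"NUM": "sg"}, "X2": {"NUM": "sg"}, "X3": {"NUM": "sg"},
     "Y1": {"NUM": "sg"}, "Y2": {"NUM": "sg"}},
]
for a, b, feat in consistent_agreements(cluster):
    print(f"({a} {feat} = {b} {feat})")
```

With this data the subject and verb agree in number on both sides and across languages, while the object (X3) does not, so no constraint involving X3 survives.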

Transfer and Decoding


The Transfer Engine

Analysis: Source text is parsed into its grammatical structure. This determines the order in which transfer rules apply.

Example:

ראיתי את האיש הזקן

(I) saw *acc the man the old

[Source tree: [S [VP V P [NP D N D Adj]]]]

Transfer: A target language tree is created by reordering, insertion, and deletion. Source words are translated with the transfer lexicon.

[Target tree: [S [NP N] [VP V [NP DET Adj N]]], with leaves "I saw the old man"]

Generation: Target language constraints are checked, target morphology is applied, and the final translation is produced. E.g. “saw” in past tense is selected.

Final translation:

“I saw the old man”

Symbolic Decoder

• System rarely finds a full parse/transfer for the complete input sentence
• XFER engine produces a comprehensive lattice of segment translations
• Decoder selects the best combination of translation segments
• Search for the optimal scoring path of partial translations, based on multiple features:
  – Target Language Model scores
  – XFER Rule Scores
  – Path Fragmentation
  – Other features…
• Symbolic decoding is essential for scenarios where there is insufficient data for training a large target LM
  – Effective Rule Scoring is crucial
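The search over the lattice can be sketched as a best-path computation. The version below is a toy under stated assumptions: each lattice edge carries a single combined score (standing in for rule and language model scores), fragmentation is modeled as a fixed penalty per segment, and the scores themselves are invented. `decode` is a hypothetical name, not the AVENUE decoder.

```python
def decode(n, edges, fragment_penalty=0.5):
    """Best-scoring sequence of segment translations covering positions 0..n.

    edges: (start, end, translation, score) with end > start.
    Returns (score, [translations]) or None if no path covers the input.
    """
    best = {0: (0.0, [])}
    # Edges only go forward, so best[pos] is final once pos is reached.
    for pos in range(n):
        if pos not in best:
            continue
        score, segments = best[pos]
        for start, end, text, edge_score in edges:
            if start != pos:
                continue
            candidate = score + edge_score - fragment_penalty
            if end not in best or candidate > best[end][0]:
                best[end] = (candidate, segments + [text])
    return best.get(n)

# Invented lattice for a four-word input; no single edge covers the sentence.
lattice = [
    (0, 1, "I", 1.0),
    (1, 2, "saw", 1.0),
    (2, 4, "the old man", 2.5),
    (2, 3, "the man", 0.8),
    (3, 4, "the old", 0.7),
]
print(decode(4, lattice))
```

The per-segment penalty makes the path with one longer segment ("the old man") win over the path that covers the same words with two shorter segments.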



Rule Refinement


Interactive and Automatic Refinement of Translation Rules

• Problem: Improve Machine Translation quality.

• Proposed Solution: Put bilingual speakers back into the loop; use their corrections to detect the source of the error and automatically improve the lexicon and the grammar.

• Approach: Automate post-editing efforts by feeding them back into the MT system. Automatic refinement of translation rules that caused an error beyond post-editing.

• Goal: Improve MT coverage and overall quality.

Technical Challenges

Elicit minimal MT information from non-expert users

Automatically refine and expand translation rules minimally (manually written and automatically learned)

Automatic Evaluation of Refinement process


Error Typology for Automatic Rule Refinement (simplified)

• Missing word
• Extra word
• Wrong word order
  – Local vs. long distance
  – Word vs. phrase
  – + Word change
• Incorrect word
  – Sense
  – Form
  – Selectional restrictions
  – Idiom
• Wrong agreement
  – Missing constraint
  – Extra constraint

TCTool (Demo)

Interactive elicitation of error information

Actions:
• Add a word
• Delete a word
• Modify a word
• Change word order

                      precision  recall
error detection       90%        89%
error classification  72%        71%

http://avenue.lti.cs.cmu.edu/aria/spanish/KenProject/dragshort.html?sl=Gaudi%20was%20a%20great%20artist&con=&tl=gaudi%20era%20un%20artista%20grande&al=((1,1),(2,2),(3,3),(5,4),(4,5))&id=2004-11-2-14:39:12-17393&count=0&senum=3

Types of Refinement Operations

Automatic Rule Adaptation

1. Refine a translation rule: R0 → R1 (change R0 to make it more specific or more general)

R0: NP [DET ADJ N] → NP [DET N ADJ]
a nice house → *una casa bonito

R1: NP [DET ADJ N] → NP [DET N ADJ], with N gender = ADJ gender
a nice house → una casa bonita

Types of Refinement Operations

2. Bifurcate a translation rule: R0 → R0 (same, general rule) + R1 (add a new, more specific rule)

R0: NP [DET ADJ N] → NP [DET N ADJ]
a nice house → una casa bonita

R1: NP [DET ADJ N] → NP [DET ADJ N], with ADJ type: pre-nominal
a great artist → un gran artista
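The two operations differ in what happens to R0, which a small sketch can make explicit. The dictionary representation of a rule and the helper names `refine` and `bifurcate` are assumptions made for illustration; they are not the Rule Refinement Module's data structures.

```python
import copy

def refine(rule, constraint):
    """Refinement: return R1, a copy of R0 with one more constraint; R1 replaces R0."""
    r1 = copy.deepcopy(rule)
    r1["constraints"].append(constraint)
    return r1

def bifurcate(grammar, rule, new_y, constraint):
    """Bifurcation: keep R0 in the grammar and add a more specific R1 next to it."""
    r1 = copy.deepcopy(rule)
    r1["y"] = new_y
    r1["constraints"].append(constraint)
    return grammar + [r1]

r0 = {"type": "NP::NP", "x": ["DET", "ADJ", "N"], "y": ["DET", "N", "ADJ"],
      "constraints": []}

# "una casa bonito" -> "una casa bonita": add the missing agreement constraint.
refined = refine(r0, "N gender = ADJ gender")

# "un artista grande" -> "un gran artista": keep the general rule, add a
# pre-nominal variant that applies only to adjectives of that type.
grammar = bifurcate([refined], refined, ["DET", "ADJ", "N"], "ADJ type = pre-nominal")

for rule in grammar:
    print(rule["type"], rule["x"], "->", rule["y"], rule["constraints"])
```

After both steps the grammar holds the general post-nominal rule with gender agreement and the more specific pre-nominal rule, while the original R0 object is left unchanged.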


A concrete example

Error Information Elicitation

SL: Gaudí was a great artist
MT system output (TL): Gaudí era un artista grande
User correction: *Gaudí era un artista grande → Gaudí era un gran artista

[Slide callouts on the corrected sentence: clue word, error, correction]

Refinement Operation Typology

Change word order

Mapudungun

• Indigenous Language of Chile and Argentina• ~ 1 Million Mapuche Speakers

Mapudungun Language

• 900,000 Mapuche people
• At least 300,000 speakers of Mapudungun
• Polysynthetic

sl: pe-rke-fi-ñ Maria
gloss: ver-REPORT-3pO-1pSgS/IND
tl: DICEN QUE LA VI A MARÍA
(They say that) I saw Maria.

AVENUE Mapudungun

• Joint project between Carnegie Mellon University, the Chilean Ministry of Education, and Universidad de la Frontera.

Mapudungun to Spanish Resources

• Initially:
  – Large team of native speakers at Universidad de la Frontera, Temuco, Chile
    • Some knowledge of linguistics
    • No knowledge of computational linguistics
  – No corpus
  – A few short word lists
  – No morphological analyzer
• Later: Computational linguists with non-native knowledge of Mapudungun
• Other considerations:
  – Produce something that is useful to the community, especially for bilingual education
  – Experimental MT systems are not useful

Mapudungun

[Figure: the AVENUE architecture diagram, annotated with the Mapudungun resources:]

• Corpus: 170 hours of spoken Mapudungun
• Example Based MT
• Spelling checker
• Spanish Morphology from UPC, Barcelona

Mapudungun Products

• http://www.lenguasamerindias.org/
  – Click: traductor mapudungún
  – Dictionary lookup (Mapudungun to Spanish)
  – Morphological analysis
  – Example Based MT (Mapudungun to Spanish)

http://www.lenguasamerindias.org/

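[Figure: parse tree for Mapudungun pe-la-fi-ñ Maria ("I didn’t see Maria"), with V pe, the verbal suffixes la, fi, and ñ under nested VSuffG nodes, and NP N Maria; beside it the Spanish target tree with V “no” vi, “a”, and NP N María.]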

Transfer to Spanish: Top-Down

[Figure: the same Mapudungun and Spanish trees, shown with the VP transfer rule that inserts “a” before the object NP.]

VP::VP [VBar NP] -> [VBar "a" NP]
(
(X1::Y1)
(X2::Y3)
((X2 type) = (*NOT* personal))
((X2 human) =c +)
(X0 = X1)
((X0 object) = X2)
(Y0 = X0)
((Y0 object) = (X0 object))
(Y1 = Y0)
(Y3 = (Y0 object))
((Y1 objmarker person) = (Y3 person))
((Y1 objmarker number) = (Y3 number))
((Y1 objmarker gender) = (Y3 gender))
)


Collaboration

• Mapuche Language Experts
  – Universidad de la Frontera (UFRO)
    • Instituto de Estudios Indígenas (IEI), the Institute for Indigenous Studies
• Chilean Funding
  – Chilean Ministry of Education (Mineduc)
    • Bilingual and Multicultural Education Program

Eliseo Cañulef

Rosendo Huisca

Hugo Carrasco

Hector Painequeo

Flor Caniupil

Luis Caniupil Huaiquiñir

Marcela Collio Calfunao

Cristian Carrillan Anton

Salvador Cañulef

Carolina Huenchullan Arrúe

Claudio Millacura Salas

Accomplishments

• Corpora Collection

  – Spoken Corpus
    • Collected: Luis Caniupil Huaiquiñir
    • Medical Domain
    • 3 of 4 Mapudungun Dialects
      – 120 hours of Nguluche
      – 30 hours of Lafkenche
      – 20 hours of Pwenche
    • Transcribed in Mapudungun
    • Translated into Spanish
  – Written Corpus
    • ~ 200,000 words
    • Bilingual Mapudungun – Spanish
    • Historical and newspaper text

nmlch-nmjm1_x_0405_nmjm_00:
M: <SPA>no pütokovilu kay ko
C: no, si me lo tomaba con agua

M: chumgechi pütokoki femuechi pütokon pu <Noise>
C: como se debe tomar, me lo tomé pués

nmlch-nmjm1_x_0406_nmlch_00:
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!

Accomplishments

• Developed at UFRO
  – Bilingual Dictionary with Examples
    • 1,926 entries
  – Spelling Corrected Mapudungun Word List
    • 117,003 fully-inflected word forms
  – Segmented Word List
    • 15,120 forms
    • Stems translated into Spanish

Accomplishments

• Developed at LTI using Mapudungun language resources from UFRO
  – Spelling Checker
    • Integrated into OpenOffice
  – Hand-built Morphological Analyzer
  – Prototype Machine Translation Systems
    • Rule-Based
    • Example-Based
  – Website: LenguasAmerindias.org

AVENUE Hebrew

• Joint project of Carnegie Mellon University and University of Haifa

Hebrew Language

• Native language of about 3-4 million in Israel
• Semitic language, closely related to Arabic and with similar linguistic properties
  – Root+Pattern word formation system
  – Rich verb and noun morphology
  – Particles attach as prefixes to the following word: definite article (H), prepositions (B,K,L,M), coordinating conjunction (W), relativizers ($,K$)…
• Unique alphabet and writing system
  – 22 letters represent (mostly) consonants
  – Vowels represented (mostly) by diacritics
  – Modern texts omit the diacritic vowels, which adds a level of ambiguity: “bare” word → word
  – Example: MHGR → mehager, m+hagar, m+h+ger

Hebrew Resources

• Morphological analyzer developed at Technion

• Constructed our own Hebrew-to-English lexicon, based primarily on existing “Dahan” H-to-E and E-to-H dictionary

• Human Computational Linguists

• Native Speakers

Hebrew


Challenges for Hebrew MT

• Paucity in existing language resources for Hebrew
  – No publicly available broad coverage morphological analyzer
  – No publicly available bilingual lexicons or dictionaries
  – No POS-tagged corpus or parse tree-bank corpus for Hebrew
  – No large Hebrew/English parallel corpus
• Scenario well suited for CMU transfer-based MT framework for languages with limited resources

Hebrew Morphology Example

• Input word: B$WRH

0 1 2 3 4

|--------B$WRH--------|

|-----B-----|$WR|--H--|

|--B--|-H--|--$WRH---|

Hebrew Morphology Example

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))

Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))

Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))

Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))

Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))

Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))

Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))

Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
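Each analysis above is an arc in a lattice over positions 0 to 4, and a complete reading of the word is any chain of arcs from 0 to 4. A minimal sketch of enumerating those readings, using only the span and LEX values listed above (the tuple layout and the function name `readings` are assumptions):

```python
def readings(arcs, start, end):
    """Enumerate all arc sequences that lead from position start to position end."""
    if start == end:
        return [[]]
    result = []
    for name, span_start, span_end, lex in arcs:
        if span_start == start:
            for rest in readings(arcs, span_end, end):
                result.append([lex] + rest)
    return result

# (name, SPANSTART, SPANEND, LEX) for the analyses of B$WRH listed above.
arcs = [
    ("Y0", 0, 4, "B$WRH"), ("Y1", 0, 2, "B"), ("Y2", 1, 3, "$WR"),
    ("Y3", 3, 4, "$LH"), ("Y4", 0, 1, "B"), ("Y5", 1, 2, "H"),
    ("Y6", 2, 4, "$WRH"), ("Y7", 0, 4, "B$WRH"),
]
for reading in readings(arcs, 0, 4):
    print(" + ".join(reading))
```

The transfer engine can then treat every such path as an alternative input, which is how the morphological ambiguity is carried forward instead of being resolved up front.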

Sample Output (dev-data)

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Quechua-to-Spanish MT

• V-Unit: funded Summer project in Cusco (Peru) June-August 2005 [preparations and data collection started earlier]

• Intensive Quechua course in Centro Bartolome de las Casas (CBC)

• Worked together with two Quechua native and one non-native speakers on developing infrastructure (correcting elicited translations, segmenting and translating list of most frequent words)

Quechua Spanish Prototype MT System

• Stem Lexicon (semi-automatically generated): 753 lexical entries
• Suffix lexicon: 21 suffixes – (150 Cusihuaman)
• Quechua morphology analyzer
• 25 translation rules
• Spanish morphology generation module
• User-Studies: 10 sentences, 3 users (2 native, 1 non-native)

Quechua facts

• Agglutinative language

• A stem can often have 10 to 12 suffixes, but it can have up to 28 suffixes

• Supposedly clear cut boundaries, but in reality several suffixes change when followed by certain other suffixes

• No irregular verbs, nouns or adjectives

• Does not mark for gender

• No adjective agreement

• No definite or indefinite articles (‘topic’ and ‘focus’ markers perform a similar task of articles and intonation in English or Spanish)

Quechua examples

– taki+ni (also written takiniy)
  sing 1sg (I sing)
  canto

– taki+sha+ni (takishaniy)
  sing progr 1sg (I am singing)
  estoy cantando

– taki+pa+ku+q+chu?
  taki: sing; -pa+ku: to join a group to do something; -q: agentive; -chu: interrogative
  (para) cantar con la gente (del pueblo)? (to sing with the people (of the village)?)

Quechua Resources

• A few native speakers, not linguists

• A computational linguist learning Quechua

• Two fluent, but non-native linguists

Quechua

[Figure: the AVENUE architecture diagram, annotated for Quechua with: Parallel Corpus: OCR with correction.]

Grammar rules

;taki+sha+ni -> estoy cantando (I am singing)
{VBar,3}
VBar::VBar : [V VSuff VSuff] -> [V V]
(
(X1::Y2)
((x0 person) = (x3 person))
((x0 number) = (x3 number))
((x2 mood) =c ger)
((y2 mood) = (x2 mood))
((y1 form) =c estar)
((y1 person) = (x3 person))
((y1 number) = (x3 number))
((y1 tense) = (x3 tense))
((x0 tense) = (x3 tense))
((y1 mood) = (x3 mood))
((x3 inflected) =c +)
((x0 inflected) = +)
)

Resulting feature structures:
lex = cantar, mood = ger
lex = estar, person = 1, number = sg, tense = pres, mood = ind

Spanish Morphology Generation: estoy cantando

Hindi Resources

• Large statistical lexicon from the Linguistic Data Consortium (LDC)

• Parallel Corpus from LDC
• Morphological Analyzer-Generator from LDC
• Lots of native speakers
• Computational linguists with little or no knowledge of Hindi
• Experimented with the size of the parallel corpus
  – Miserly and large scenarios

Hindi

[Figure: the AVENUE architecture diagram, annotated with the Hindi resources:]

• 15,000 Noun Phrases from Penn TreeBank
• Parallel Corpus
• EBMT
• SMT
• Supported by DARPA TIDES

Manual Transfer Rules: Example

; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
; life of (one) chapter
; ==> a chapter of life
;
{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
(
(X1::Y2)
(X2::Y1)
; ((x2 lexwx) = 'kA')
)

{NP,13}
NP::NP : [NP1] -> [NP1]
(
(X1::Y1)
)

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
(
(X1::Y2)
(X2::Y1)
)

[Figure: source tree [NP [PP [NP [N jIvana]] [P ke]] [NP1 [Adj eka] [N aXyAya]]] and target tree [NP [NP1 [Adj one] [N chapter]] [PP [P of] [NP [N life]]]].]

Hindi-English

System                          BLEU    M-BLEU  NIST
EBMT                            0.058   0.165   4.22
SMT                             0.093   0.191   4.64
XFER (naïve) man grammar        0.055   0.177   4.46
XFER (strong) no grammar        0.109   0.224   5.29
XFER (strong) learned grammar   0.116   0.231   5.37
XFER (strong) man grammar       0.135   0.243   5.59
XFER+SMT                        0.136   0.243   5.65

• Very miserly training data
• Seven combinations of components
• Strong decoder allows re-ordering
• Three automatic scoring metrics
