Building and Using an Inuktitut-English Parallel Corpus

Joel Martin, Howard Johnson, Benoit Farley & Anna Maclachlan
<firstname>.<lastname>@nrc.gc.ca



Agglutinative written form

qaisaaliniaqquunngikkaluaqpuq

Root - suffixes - grammatical suffix

qai-, -saali-, -niaq-, -qquu-, -nngit-, -galuaq-, -puq

“Actually, he probably won’t come early today.”
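The decomposition above can be checked mechanically. The sketch below simply glues the listed morphemes together and compares the result with the written word; the variable names are illustrative. The two strings differ, because sounds change at morpheme boundaries (for example -nngit- followed by -galuaq- is written nngikkaluaq), which is one reason splitting a written word back into root and suffixes is not a plain substring operation.

```python
# Written (surface) form and its morphemes, both copied from the slide.
surface = "qaisaaliniaqquunngikkaluaqpuq"
morphemes = ["qai-", "-saali-", "-niaq-", "-qquu-", "-nngit-", "-galuaq-", "-puq"]

# Naive reconstruction: drop the hyphens and concatenate.
joined = "".join(m.strip("-") for m in morphemes)

print(joined)             # underlying forms glued together
print(surface)            # what is actually written
print(joined == surface)  # the two do not coincide
```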


Nunavut Hansards

• 155 days of Nunavut Legislative Assembly
• April 1, 1999 to November 1, 2002

These symbols, like the Qamutik that rests on the floor, will find a home in our new Assembly building. I would finally like to recognize the artists who created the mace.

taakkua qamutiik natirmiittuuk iniqarumaanniaqtuuk nutaamik maligaliurvingmi. kingulliqpaami ilitarijumavakka sananngualauqtuminiujuit anautarmik.

            Characters                Words       Sentences   Paragraphs
English     20,124,587                3,432,212   348,619     112,346
Inuktitut   13,457,581 / 21,305,295   1,586,423   352,486     118,733


Difficulties Aligning Inuktitut Hansards

• No spelling checkers
• Many dialects (translators)
• “School”: ilinniarvik, ilisavik, ilinniaqvik, ilitarvik, ilinniavik

Words
• 1:1 word alignment is not usually possible
• No root dictionary for Eastern Canada

Lengths
• Aligning by length in words is not a good idea
• Aligning by length in characters: average = 1.05
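The contrast between word and character lengths can be illustrated with the corpus totals reported for the Nunavut Hansards. This is only a sketch: the dictionary names are illustrative, the transcript gives two Inuktitut character counts without saying what distinguishes them, and these corpus-level ratios are not necessarily the same statistic as the 1.05 average quoted above.

```python
# Corpus totals copied from the Nunavut Hansards table.
ENGLISH = {"characters": 20_124_587, "words": 3_432_212}
# Two character counts are reported for Inuktitut; both are kept here.
INUKTITUT = {"characters": (13_457_581, 21_305_295), "words": 1_586_423}

# English needs roughly twice as many words as Inuktitut for the same text.
word_ratio = ENGLISH["words"] / INUKTITUT["words"]

# Character totals, by contrast, stay within the same order of size.
char_ratios = [count / ENGLISH["characters"] for count in INUKTITUT["characters"]]

print(f"English words per Inuktitut word: {word_ratio:.2f}")
for count, ratio in zip(INUKTITUT["characters"], char_ratios):
    print(f"Inuktitut chars ({count:,}) per English char: {ratio:.2f}")
```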


Alignment Techniques

• Length Alignment: (Gale and Church, 1993)
  • Gaussian to estimate matching probability
  • Dynamic programming to optimize the match
• Lexical Alignment:
  • non-alphabetic sequences (9:00, 42-1(1) and 1999)
  • 8 reliable word correspondences
    • speaker/uqaqti
    • motion/pigiqati
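A minimal sketch of the length-based part, in the spirit of Gale and Church (1993): a Gaussian model scores how plausible it is that two spans with given character lengths are translations of each other, and dynamic programming picks the cheapest sequence of beads. The constants `C`, `S2` and `PRIORS` are the values usually quoted from that paper and are assumptions here, not settings reported in these slides; the function names are illustrative, and the lexical cues listed above are not modelled.

```python
import math

C = 1.0    # assumed expected target characters per source character
S2 = 6.8   # assumed variance of the length difference per character
# Assumed prior probability of each bead type (source sentences, target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def match_cost(len_src, len_tgt):
    """Negative log of the two-sided Gaussian tail probability of the length gap."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    prob = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(prob, 1e-300))

def align(src_lens, tgt_lens):
    """Return beads as (source indices, target indices) with minimum total cost."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                total = cost[i][j] + step - math.log(prior)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (di, dj)
    # Walk the back-pointers from the end to recover the bead sequence.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads

# Toy input: character lengths of three source and four target sentences.
example_beads = align([20, 50, 30], [21, 24, 27, 31])
print(example_beads)
```

In the toy input the second source sentence is about as long as the second and third target sentences together, so the cheapest path pairs them as a 1-2 bead.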


Initial Alignment Results

                   Precision           Recall
Gale & Church      2448/3670 = 66.7%   2448/3424 = 71.5%
G&C paragraphs     2978/3479 = 85.6%   2978/3424 = 87.0%
Lexical & Length   3161/3459 = 91.4%   3161/3424 = 92.3%
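The percentages follow from the fractions shown. The sketch below recomputes them, assuming the usual reading of the counts: the shared numerator is the number of correct beads, the precision denominator is the number of beads the aligner proposed, and 3424 is the number of beads in the reference alignment. The names are illustrative.

```python
def precision_recall(correct, proposed, reference):
    """Precision = correct / proposed, recall = correct / reference."""
    return correct / proposed, correct / reference

# (correct, proposed, reference) counts copied from the results table.
RESULTS = {
    "Gale & Church": (2448, 3670, 3424),
    "G&C paragraphs": (2978, 3479, 3424),
    "Lexical & Length": (3161, 3459, 3424),
}

scores = {name: precision_recall(*counts) for name, counts in RESULTS.items()}
for name, (precision, recall) in scores.items():
    print(f"{name:18s} precision {precision:.1%}  recall {recall:.1%}")
```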


Is the alignment useful?

• Term Dictionary
  • Few contemporary dictionaries
  • Few with roots and suffixes (Eastern Arctic)
  • Spelling differences, dialectal differences
• Examples:
  • -kiaq “don’t know”
  • tukisi- “understand”
  • -juma- “want”
  • maligaliur(vi)- “assembly”
  • piita “Peter”
  • kanata- “Canada”
  • makalain “McLean”


What is a term?

• Inuktitut Terms
  • Words, phrases of 2 to 4 words
  • Prefixes, internal substrings, final substrings < 10 characters
• English Terms
  • Words, phrases of 2 to 4 words
  • Prefixes

All against all

• Compare every Inuktitut term against every English term
• Slow with big files of partial results
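A minimal sketch of how candidate terms of these kinds could be enumerated for one aligned bead. The function names, the minimum substring length of 3, and the reading of "< 10" as a cap of 9 characters on every substring type are assumptions for illustration; the slides do not give these details.

```python
def word_terms(sentence, max_phrase=4):
    """Single words plus contiguous phrases of 2 to max_phrase words."""
    words = sentence.split()
    terms = set(words)
    for n in range(2, max_phrase + 1):
        for i in range(len(words) - n + 1):
            terms.add(" ".join(words[i:i + n]))
    return terms

def substring_terms(word, min_len=3, max_len=9):
    """Prefixes, internal substrings and final substrings of one word."""
    prefixes, internal, finals = set(), set(), set()
    for start in range(len(word)):
        for end in range(start + min_len, min(len(word), start + max_len) + 1):
            if start == 0 and end == len(word):
                continue  # the whole word is already a word term
            piece = word[start:end]
            if start == 0:
                prefixes.add(piece + "-")
            elif end == len(word):
                finals.add("-" + piece)
            else:
                internal.add("-" + piece + "-")
    return prefixes, internal, finals

def inuktitut_terms(sentence):
    """Words, phrases and all three substring types."""
    terms = word_terms(sentence)
    for word in sentence.split():
        for group in substring_terms(word):
            terms |= group
    return terms

def english_terms(sentence):
    """Words, phrases and prefixes only."""
    terms = word_terms(sentence)
    for word in sentence.split():
        terms |= substring_terms(word)[0]
    return terms

inuktitut = inuktitut_terms("taakkua qamutiik natirmiittuuk")
english = english_terms("like the Qamutik")
# "All against all": every Inuktitut term is paired with every English term.
print(len(inuktitut), len(english), len(inuktitut) * len(english))
```

Even for this three-word fragment the number of candidate pairs is the product of the two set sizes, which is why the all-against-all comparison becomes slow on the full corpus.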


Consistent Translations

                             Bead contains Inuktitut term   Inuktitut term is missing
Bead contains English term   I & E                          ~I & E
English term is missing      I & ~E                         ~I & ~E

$$\mathrm{PMI} = \log \frac{\Pr(I \,\&\, E)}{\Pr(I) \cdot \Pr(E)}$$

Confidence Interval around Ratios (95%)

Frequency   Total   Lower    Upper
2           2       0.3424   1.0000
2           10      0.0567   0.5098
167         1000    0.1452   0.1914
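Both quantities can be sketched from bead counts. `pmi` takes the four cells of the contingency table and uses the natural logarithm, which is an assumption since the slide does not state a base. The slide also does not name the interval method; the Wilson score interval with z = 1.96 reproduces the three tabulated rows to four decimals, so it is used here as a plausible reading rather than as the authors' confirmed choice. Function names are illustrative.

```python
import math

def pmi(both, i_only, e_only, neither):
    """Pointwise mutual information from the four contingency counts.

    Requires both > 0, otherwise the logarithm is undefined.
    """
    total = both + i_only + e_only + neither
    p_ie = both / total
    p_i = (both + i_only) / total   # beads containing the Inuktitut term
    p_e = (both + e_only) / total   # beads containing the English term
    return math.log(p_ie / (p_i * p_e))

def wilson_interval(frequency, total, z=1.96):
    """Wilson score interval for the ratio frequency / total."""
    p = frequency / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

for frequency, total in [(2, 2), (2, 10), (167, 1000)]:
    lower, upper = wilson_interval(frequency, total)
    print(f"{frequency:4d} {total:5d} {lower:.4f} {upper:.4f}")
```

The first two rows show why an interval is more informative than the raw ratio: 2 out of 2 is a ratio of 1.0, yet its lower bound is only about 0.34.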


Glossary Results

• 4362 term pairs
• 72.3% of English word occurrences (but…)
• Exact Matches (43%):
  a) half were uninflected proper nouns.
  b) inuup and person’s.
• Good (more in the Inuktitut) Matches (44%):
  pigiaqtitara and deal. “I deal with him”.


Summary

http://www.InuktitutComputing.ca/NunavutHansard/en/

1) Sentence alignment of an agglutinative language.
2) Use of the sentence alignment to build a glossary.

-lauqsimanngit- “have never”
inuliriji- “social worker”
-kiaq “don’t know”
nuu juak “New York”
tusaumajjutilirinirmut kanngunaqtulirinirmullu (kamis-) “Information and Privacy Commissioner”