Building and Using an Inuktitut-English Parallel Corpus
Joel Martin, Howard Johnson, Benoit Farley & Anna Maclachlan
<firstname>.<lastname>@nrc.gc.ca
Agglutinative written form
qaisaaliniaqquunngikkaluaqpuq
Root - suffixes - grammatical suffix
qai-, -saali-, -niaq-, -qquu-, -nngit-, -galuaq-, -puq
“Actually, he probably won’t come early today.”
Nunavut Hansards
• 155 days of the Nunavut Legislative Assembly
• April 1, 1999 to November 1, 2002
These symbols, like the Qamutik that rests on the floor, will find a home in our new Assembly building. I would finally like to recognize the artists who created the mace.
taakkua qamutiik natirmiittuuk iniqarumaanniaqtuuk nutaamik maligaliurvingmi. kingulliqpaami ilitarijumavakka sananngualauqtuminiujuit anautarmik.
           Characters               Words      Sentences  Paragraphs
English    20,124,587               3,432,212  348,619    112,346
Inuktitut  13,457,581 / 21,305,295  1,586,423  352,486    118,733
Difficulties Aligning Inuktitut Hansards
• No spelling checkers
• Many dialects (translators)
  • “School”: ilinniarvik, ilisavik, ilinniaqvik, ilitarvik, ilinniavik
• Words
  • 1:1 word alignment is not usually possible
  • No root dictionary for Eastern Canada
• Lengths
  • Aligning by length in words is not a good idea
  • Aligning by length in characters: average = 1.05
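The slides do not say how the 1.05 figure was computed. As a rough check, the sketch below takes length ratios straight from the corpus totals reported for the Nunavut Hansards. The dictionary layout and the key names `chars_small` and `chars_large` are assumptions made for illustration, since the slides give two Inuktitut character counts without labelling them.

```python
# Corpus totals as reported for the Nunavut Hansards.
# The two Inuktitut character counts are unlabelled in the slides,
# so the key names chars_small / chars_large are hypothetical.
COUNTS = {
    "english": {"chars": 20_124_587, "words": 3_432_212},
    "inuktitut": {
        "chars_small": 13_457_581,
        "chars_large": 21_305_295,
        "words": 1_586_423,
    },
}


def ratio(inuktitut_total, english_total):
    """Inuktitut-to-English ratio of two corpus totals."""
    return inuktitut_total / english_total


word_ratio = ratio(COUNTS["inuktitut"]["words"], COUNTS["english"]["words"])
char_ratio_large = ratio(COUNTS["inuktitut"]["chars_large"], COUNTS["english"]["chars"])
char_ratio_small = ratio(COUNTS["inuktitut"]["chars_small"], COUNTS["english"]["chars"])

print(f"words: {word_ratio:.2f}")
print(f"chars (larger count): {char_ratio_large:.2f}")
print(f"chars (smaller count): {char_ratio_small:.2f}")
```

On these totals Inuktitut has fewer than half as many words as English, while the larger character count is within a few percent of the English one. That is in line with the point that character length is the more stable signal for alignment.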
Alignment Techniques
• Length alignment (Gale and Church, 1993)
  • Gaussian to estimate matching probability
  • Dynamic programming to optimize the match
• Lexical alignment
  • Non-alphabetic sequences (9:00, 42-1(1) and 1999)
  • 8 reliable word correspondences
    • speaker/uqaqti
    • motion/pigiqati
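The length-based step can be sketched as follows. This is a minimal illustration in the spirit of Gale and Church (1993), not the authors' implementation. The ratio `C = 1.05` comes from the slides. The variance `S2 = 6.8` is the value usually quoted from the Gale and Church paper. The bead priors in `PRIORS` are assumed round numbers, and the function names are hypothetical.

```python
import math

C = 1.05   # expected Inuktitut/English character ratio (from the slides)
S2 = 6.8   # variance per character, as usually quoted from Gale and Church (1993)

# Assumed prior probability of each bead type:
# (number of source sentences, number of target sentences) -> probability.
PRIORS = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005, (2, 1): 0.045, (1, 2): 0.045}


def length_cost(len_src, len_tgt):
    """-log of the two-sided Gaussian tail probability of the length mismatch."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    tail = math.erfc(abs(delta) / math.sqrt(2))  # P(|Z| >= |delta|)
    return -math.log(max(tail, 1e-300))


def align(src, tgt):
    """Dynamic programming over bead types; returns a list of (src_group, tgt_group)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                total = cost[i][j] - math.log(prior) + length_cost(src_len, tgt_len)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the best bead sequence.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Sentences are measured in characters (`len` of each string), matching the observation that character lengths are more stable than word counts for this language pair. A lexical pass, as described in the slides, would add evidence from anchors such as numbers and the reliable word pairs on top of this cost.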
Initial Alignment Results

                  Precision          Recall
Gale & Church     2448/3670 = 66.7%  2448/3424 = 71.5%
G&C paragraphs    2978/3479 = 85.6%  2978/3424 = 87.0%
Lexical & Length  3161/3459 = 91.4%  3161/3424 = 92.3%
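The fractions in the results can be reproduced with the usual definitions. The slides do not label the denominators, so this assumes the first is the number of alignments each method proposed and the second (3424 throughout) is the number of reference alignments. The names `precision_recall` and `RESULTS` are illustrative.

```python
def precision_recall(correct, proposed, reference):
    """Precision = correct / proposed, recall = correct / reference."""
    return correct / proposed, correct / reference


# (correct, proposed, reference) counts as reported in the results.
RESULTS = {
    "Gale & Church": (2448, 3670, 3424),
    "G&C paragraphs": (2978, 3479, 3424),
    "Lexical & Length": (3161, 3459, 3424),
}

for name, counts in RESULTS.items():
    precision, recall = precision_recall(*counts)
    print(f"{name:18s} precision {precision:.1%}  recall {recall:.1%}")
```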
Is the alignment useful?
• Term Dictionary
  • Few contemporary dictionaries
  • Few with roots and suffixes (Eastern Arctic)
  • Spelling differences, dialectal differences
• Examples:
  • -kiaq “don’t know”
  • tukisi- “understand”
  • -juma- “want”
  • maligaliur(vi)- “assembly”
  • piita “Peter”
  • kanata- “Canada”
  • makalain “McLean”
What is a term?
• Inuktitut Terms
  • Words, phrases of 2 to 4 words
  • Prefixes, internal substrings, final substrings < 10 characters
• English Terms
  • Words, phrases of 2 to 4 words
  • Prefixes
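A minimal sketch of how such candidate terms might be enumerated. The function names, the whitespace tokenization, and the minimum substring length of 2 are assumptions. The hyphen marking follows the notation used in the examples (tukisi-, -juma-, -kiaq).

```python
def word_ngrams(sentence, max_words=4):
    """Single words and phrases of 2 to max_words consecutive words."""
    tokens = sentence.split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(tokens) - n + 1)
    }


def word_substrings(word, min_len=2, max_len=9):
    """Prefixes, internal substrings and final substrings shorter than 10 characters.

    Only proper substrings are produced; min_len is an assumed lower bound.
    """
    prefixes, internals, finals = set(), set(), set()
    for k in range(min_len, min(max_len, len(word) - 1) + 1):
        prefixes.add(word[:k] + "-")
        finals.add("-" + word[-k:])
        # Internal substrings touch neither the first nor the last character.
        for start in range(1, len(word) - k):
            internals.add("-" + word[start:start + k] + "-")
    return prefixes, internals, finals


prefixes, internals, finals = word_substrings("ilinniarvik")
print(sorted(prefixes))
print(len(internals), "internal substrings")
print(sorted(word_ngrams("our new Assembly building")))
```

Under this reading, an Inuktitut word contributes all three substring sets plus the word n-grams, while an English sentence contributes only n-grams and prefixes.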
All against all
• Consider every Inuktitut term to every English term
• Slow with big files of partial results
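The all-against-all step amounts to counting, for every aligned bead, each Inuktitut term against each English term in that bead. The sketch below is an in-memory illustration with hypothetical names and made-up toy beads. The slides mention large files of partial results, which suggests the real computation was done on disk.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(beads):
    """Count term and term-pair occurrences over aligned beads.

    beads: iterable of (inuktitut_terms, english_terms) collections.
    Each term is counted at most once per bead.
    """
    n_inuktitut, n_english, n_both = Counter(), Counter(), Counter()
    n_beads = 0
    for inuktitut_terms, english_terms in beads:
        inuktitut_terms, english_terms = set(inuktitut_terms), set(english_terms)
        n_beads += 1
        n_inuktitut.update(inuktitut_terms)
        n_english.update(english_terms)
        n_both.update(product(inuktitut_terms, english_terms))
    return n_inuktitut, n_english, n_both, n_beads


# Toy beads, invented for illustration only.
toy_beads = [
    ({"uqaqti"}, {"speaker"}),
    ({"uqaqti", "pigiqati"}, {"speaker", "motion"}),
    ({"pigiqati"}, {"motion"}),
]
n_i, n_e, n_ie, n = count_cooccurrences(toy_beads)
print(n_ie.most_common(2))
```

The pair counter grows with the product of the two vocabularies, which is one way to see why this step is slow on a corpus of this size.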
Consistent Translations

                            Bead contains Inuktitut term  Inuktitut term is missing
Bead contains English term  I & E                         ~I & E
English term is missing     I & ~E                        ~I & ~E
PMI = log [ Pr(I & E) / (Pr(I) * Pr(E)) ]
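A small sketch of how the four cells and the PMI score follow from bead counts. The probabilities are estimated as relative frequencies over beads. The natural logarithm is an assumption, since the slides do not state the base, and the function names are illustrative.

```python
import math


def contingency(n_both, n_i, n_e, n_beads):
    """2x2 table of bead counts for an Inuktitut term I and an English term E."""
    return {
        "I & E": n_both,
        "~I & E": n_e - n_both,
        "I & ~E": n_i - n_both,
        "~I & ~E": n_beads - n_i - n_e + n_both,
    }


def pmi(n_both, n_i, n_e, n_beads):
    """log[Pr(I & E) / (Pr(I) * Pr(E))] with Pr estimated as count / n_beads.

    The ratio simplifies to n_both * n_beads / (n_i * n_e).
    """
    return math.log(n_both * n_beads / (n_i * n_e))


table = contingency(n_both=10, n_i=10, n_e=10, n_beads=1000)
print(table)
print(round(pmi(10, 10, 10, 1000), 3))
```

A pair that always occurs together scores well above zero, while a pair that co-occurs only as often as chance predicts scores zero.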
Confidence Interval around Ratios (95%)

Frequency  Total  Lower   Upper
2          2      0.3424  1.0000
2          10     0.0567  0.5098
167        1000   0.1452  0.1914
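The slides do not name the interval method. The three rows are, however, numerically consistent with a Wilson score interval at z = 1.96, so the sketch below uses that formula as an assumption. The function name is illustrative.

```python
import math


def wilson_interval(frequency, total, z=1.96):
    """Wilson score interval for the ratio frequency / total (z = 1.96 for 95%)."""
    p = frequency / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


for frequency, total in [(2, 2), (2, 10), (167, 1000)]:
    lower, upper = wilson_interval(frequency, total)
    print(f"{frequency:4d} {total:5d} {lower:.4f} {upper:.4f}")
```

The first row shows why the interval matters for glossary building: a pair seen together in 2 of 2 beads has a ratio of 1.0, but its lower bound is only about 0.34, so it ranks below a pair with much more evidence.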
Glossary Results
• 4362 term pairs
• 72.3% of English word occurrences (but…)
• Exact matches (43%):
  a) half were uninflected proper nouns
  b) inuup and person’s
• Good (more in the Inuktitut) matches (44%):
  • pigiaqtitara and deal: “I deal with him”
Summary
http://www.InuktitutComputing.ca/NunavutHansard/en/
1) Sentence alignment of an agglutinative language.
2) Use of the sentence alignment to build a glossary.
• -lauqsimanngit- “have never”
• inuliriji- “social worker”
• -kiaq “don’t know”
• nuu juak “New York”
• tusaumajjutilirinirmut kanngunaqtulirinirmullu (kamis-) “Information and Privacy Commissioner”