enriching transliteration lexicon using automatic transliteration extraction

Enriching Transliteration Lexicon Using Automatic Transliteration Extraction

Sarvnaz Karimi

School of Computer Science and IT, RMIT University

Supervisors: Dr. Falk Scholer and Dr. Andrew Turpin

Keywords: Transliteration, Parallel Corpus

Machine Transliteration

• Machine transliteration transforms a word from a source language to a target language while preserving its pronunciation.

• Automatic transliteration is mainly applicable to machine translation, cross-lingual information retrieval and cross-lingual question answering.

• Transliteration has been studied in two major areas: transliteration generation and transliteration extraction.

• Transliteration generation takes an input word in the source language (e.g. Sydney in English) and generates its transliteration in the target language (e.g. سیدنی in Persian).

• Transliteration extraction discovers transliteration pairs, e.g. (Sydney, سیدنی), in bilingual texts.

Transliteration Extraction So Far!

Transliteration discovery methods in the literature consider:

• Extraction from parallel corpus:

– Statistical methods are beneficial, particularly because the sentences/words can be aligned.

– Yet parallel corpora are hard to find for many less-computerised languages.

• Extraction from comparable corpus:

– More evidence than just statistical information is required to extract pairs (e.g. temporal information, phonetic information or Web counts).

– Comparable corpora are easier to construct and find than parallel ones.

Most studies use a named entity (NE) recogniser to separate proper nouns that are subject to transliteration from other words.

Persian and English Transliteration

• Transliteration generation has been studied using n-gram based and consonant-vowel based approaches.

• Transliteration extraction has not previously been studied for this language pair, mainly due to the lack of any parallel or comparable corpus.

• Transliteration extraction has been studied using co-occurrence, temporal, edit distance measures or phonetic similarities. We aim to apply our transliteration generation methods as a basis for this task.
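To make the two families of generation approaches mentioned above more concrete, the following minimal sketch shows the kind of features each one starts from: character n-grams and a consonant-vowel pattern of a word. The function names and the vowel set are assumptions made for illustration; this is not the segmentation used in the generation systems referred to on this poster.

```python
from typing import List

# Simplifying assumption: only a, e, i, o, u count as vowels ('y' is a consonant).
VOWELS = set("aeiou")


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of a word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def cv_pattern(word: str) -> str:
    """Map each letter to 'V' (vowel) or 'C' (anything else).

    Assumes the input is a purely alphabetic English word.
    """
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())


bigrams = char_ngrams("sydney", 2)
pattern = cv_pattern("Persian")
```

Here `bigrams` holds the five bigrams of "sydney" and `pattern` is the string "CVCCVVC". A generation model would then learn which target-language characters each n-gram or consonant-vowel segment tends to map to.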

Proposed Method: Application of Transliteration Generation in Extraction

1. For each document in each language, we pre-process the text to generate a bag of words (tokenise) and remove stop-words.

2. Each word in the source language is matched against a dictionary; if it is not found, it is an out-of-dictionary word that needs transliteration in the target document.

3. A ranked list of possible transliterations for each source word is generated by the transliteration system.

4. Those transliterations that match words in the target document are considered potential pairs.

5. A score can be given to these pairs based on the rank of the transliteration and the number of times they are paired.
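The five steps above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated on the poster: documents arrive as (source, target) text pairs, the transliteration system is passed in as a hypothetical `generate` function returning a ranked candidate list, and the score is the sum of reciprocal ranks over all document pairs in which a pair is found. The poster only says that the score depends on rank and pairing frequency, so the exact formula here is a placeholder.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

PUNCTUATION = ".,;:!?()\"'،؛؟"  # includes common Persian punctuation marks


def tokenise(text: str, stop_words: Set[str]) -> List[str]:
    """Step 1: split on whitespace, strip punctuation, drop stop-words."""
    tokens = [t.strip(PUNCTUATION) for t in text.split()]
    return [t for t in tokens if t and t.lower() not in stop_words]


def extract_pairs(
    doc_pairs: Iterable[Tuple[str, str]],
    dictionary: Set[str],
    generate: Callable[[str], List[str]],  # hypothetical transliteration system
    src_stop: Set[str],
    tgt_stop: Set[str],
    top_k: int = 5,
) -> Dict[Tuple[str, str], float]:
    """Return candidate (source, target) pairs with an assumed score."""
    scores: Dict[Tuple[str, str], float] = defaultdict(float)
    for src_text, tgt_text in doc_pairs:
        tgt_words = set(tokenise(tgt_text, tgt_stop))
        for word in set(tokenise(src_text, src_stop)):
            # Step 2: keep only out-of-dictionary source words.
            if word.lower() in dictionary:
                continue
            # Step 3: ranked candidates; Step 4: match against target words.
            for rank, candidate in enumerate(generate(word)[:top_k], start=1):
                if candidate in tgt_words:
                    # Step 5 (assumed formula): reciprocal rank, summed per document pair.
                    scores[(word, candidate)] += 1.0 / rank
    return dict(scores)


# Toy usage with a hand-written candidate table standing in for a real system.
toy_table = {"Sydney": ["سیدنای", "سیدنی"]}
docs = [
    ("Sydney is a city.", "سیدنی یک شهر است"),
    ("He visited Sydney", "او به سیدنی رفت"),
]
result = extract_pairs(
    docs,
    dictionary={"is", "a", "city", "he", "visited"},
    generate=lambda w: toy_table.get(w, []),
    src_stop=set(),
    tgt_stop=set(),
)
```

In this toy run the correct Persian form is the second-ranked candidate and is found in both document pairs, so the pair (Sydney, سیدنی) accumulates a score of 0.5 + 0.5. Pairs above some threshold could then be added to the transliteration lexicon.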

Experimental Setup

• An English-Persian comparable corpus of news texts was constructed, consisting of 3,474 documents.

• An English machine-readable dictionary containing 120,177 entries was used.

• Experiments:

– Accuracy of the extracted transliterations (fixed training collection). Three matching methods were tried: (1) English documents are parsed to extract their out-of-dictionary words using dictionary look-up and stemming; (2) Persian documents are parsed by mapping characters that are allophones to one unique character; (3) the previous experiment is repeated, also using knowledge of capital letters.

– Impact of seed transliteration lexicon.
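The three matching methods can be illustrated with a short sketch. The character mapping, the suffix-stripping stemmer and all function names below are assumptions made for illustration: the poster does not list which Persian characters were merged or which stemmer was used.

```python
from typing import List, Set

# Assumed mapping: Persian letters that share a sound are folded onto one
# representative letter. Illustrative only, not the mapping used in the study.
ALLOPHONE_MAP = str.maketrans({
    "ث": "س", "ص": "س",             # letters pronounced /s/
    "ذ": "ز", "ض": "ز", "ظ": "ز",   # letters pronounced /z/
    "ط": "ت",                       # letters pronounced /t/
    "ح": "ه",                       # letters pronounced /h/
})


def normalise_persian(word: str) -> str:
    """Method 2: render same-sounding characters as one unique character."""
    return word.translate(ALLOPHONE_MAP)


def naive_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def out_of_dictionary(
    tokens: List[str], dictionary: Set[str], require_capital: bool = False
) -> List[str]:
    """Method 1: dictionary look-up plus stemming.

    With require_capital=True this approximates method 3 by keeping only
    capitalised words, which are more likely to be proper nouns.
    """
    result = []
    for token in tokens:
        lowered = token.lower()
        if lowered in dictionary or naive_stem(lowered) in dictionary:
            continue
        if require_capital and not token[:1].isupper():
            continue
        result.append(token)
    return result


tokens = ["Sydney", "visited", "towns", "blog"]
english_dictionary = {"visit", "town"}
method1 = out_of_dictionary(tokens, english_dictionary)
method3 = out_of_dictionary(tokens, english_dictionary, require_capital=True)
old_spelling = normalise_persian("طهران")  # an older spelling of Tehran
```

With this toy dictionary, method 1 keeps "Sydney" and "blog", while adding the capital-letter constraint keeps only "Sydney". Normalising the older spelling طهران yields تهران, so both spellings would match the same generated candidate.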

Experiments and Results

Experiment 1: Accuracy

Method  #Pairs  #Correct    Avg.Rank  Rank 1-5  #P   #E    #doc  Lex-Size  Lex Pr.
1       4.6     3.6 (81.3)  5.8       68.9      8.5  11.3  1662  2860      70.3
2       4.1     3.6 (90.2)  5.9       69.7      8.5  11.3  1641  2496      80.4
3       6.6     5.9 (89.2)  6.9       66.8      8.4  22.6  1725  3694      75.2

Experiment 2: Train Size

Train  #Pairs  #Correct    Avg.Rank  Rank 1-5  #P   #E    #doc  Lex-Size  Lex Pr.
200    2.6     2.2 (86.5)  2.8       84.2      9.1  24.0  1322  1287      78.2
300    3.1     2.5 (82.5)  4.2       71.9      8.8  23.5  1494  1579      73.3
400    3.1     2.6 (84.5)  4.6       71.3      8.8  23.4  1483  1569      74.1
500    2.8     2.5 (89.5)  4.3       78.0      8.9  23.5  1459  1507      79.0

Conclusions and Further Work

• Transliteration extraction can be helpful in automatically generating transliteration lexicons.

• A transliteration lexicon, as a dictionary of transliterations of proper nouns or technical terms that are not translated, is beneficial in dictionary-based machine translation applications.

• We investigated a method that uses current, yet incomplete, transliteration lexicons to enrich them using comparable corpora.

• In future, the role of an NE recogniser will be investigated and compared with a simple dictionary look-up.
