enriching transliteration lexicon using automatic transliteration extraction

Enriching Transliteration Lexicon Using Automatic Transliteration Extraction Sarvnaz Karimi School of Computer Science and IT RMIT University Supervisors: Dr. Falk Scholer and Dr. Andrew Turpin Keywords: Transliteration, Parallel Corpus


Post on 13-Apr-2017


Page 1: Enriching Transliteration Lexicon Using Automatic Transliteration Extraction

Enriching Transliteration Lexicon Using

Automatic Transliteration Extraction

Sarvnaz Karimi

School of Computer Science and IT

RMIT University

Supervisors: Dr. Falk Scholer and Dr. Andrew Turpin

Keywords: Transliteration, Parallel Corpus

Page 2

Machine Transliteration

• Machine transliteration transforms a word from a source language to a target language while preserving its pronunciation.

• Machine translation, cross-lingual information retrieval and cross-lingual question answering are the main areas where automatic transliteration is applicable.

• Transliteration has been studied in two major areas: transliteration generation and transliteration extraction.

• Transliteration generation takes an input word in the source language (e.g. Sydney in English) and generates its transliteration in the target language (e.g. سیدنی in Persian).

• Transliteration extraction discovers transliteration pairs (e.g. (Sydney, سیدنی)) in bilingual texts.

Page 3

Transliteration Extraction So Far!

Transliteration discovery methods in the literature consider:

• Extraction from a parallel corpus:

– Statistical methods are beneficial, particularly because the sentences/words can be aligned.

– Yet parallel corpora are hard to find for many less-computerised languages.

• Extraction from a comparable corpus:

– More evidence than just statistical information is required to extract pairs (e.g. temporal information, phonetic information or Web counts).

– Comparable corpora are easier to construct and find than parallel ones.

Most studies use a named entity (NE) recogniser to separate proper nouns that are subject to transliteration from other words.

Page 4

Persian and English Transliteration

• Transliteration generation has been studied using n-gram based and consonant-vowel based approaches.

• Transliteration extraction has not previously been studied for this language pair, mainly due to the lack of any parallel or comparable corpus.

• For other language pairs, transliteration extraction has been studied using co-occurrence, temporal and edit distance measures, or phonetic similarities. We aim to apply our transliteration generation methods as a basis for this task.

Page 5

Proposed Method: Application of Transliteration Generation in Extraction

1. Each document in each language is pre-processed: it is tokenised into a bag of words and stop-words are removed.

2. Each word in the source language is matched against a dictionary; if it is not found, it is an out-of-dictionary word that needs a transliteration in the target document.

3. A ranked list of possible transliterations for each source word is generated by the transliteration system.

4. Transliterations that match words in the target document are considered potential pairs.

5. A score can be given to these pairs based on the rank of the transliteration and the number of times they are paired.
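
The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system used in the experiments: the stop-word list, the punctuation set, the `toy_generate` stand-in for the transliteration system, the `top_n` cut-off and the reciprocal-rank scoring in step 5 are all hypothetical choices. The slides only say that the score depends on the rank and on the number of times a pair is found.

```python
from collections import defaultdict

# Toy resources, all hypothetical: a real system would use a full stop-word
# list, the machine-readable dictionary, and a trained transliteration model.
STOP_WORDS = frozenset({"a", "in", "is", "of", "the"})
PUNCTUATION = ".,;:!?()\"'،؛؟"
SYDNEY_FA = "سیدنی"


def tokenise(text, stop_words=frozenset()):
    """Step 1: turn a document into a bag (set) of words without stop-words."""
    words = (token.strip(PUNCTUATION).lower() for token in text.split())
    return {word for word in words if word and word not in stop_words}


def toy_generate(word):
    """Stand-in for the transliteration system: best candidate first."""
    candidates = {"sydney": [SYDNEY_FA, "صیدنی"]}
    return candidates.get(word, [])


def extract_pairs(doc_pairs, dictionary, generate, top_n=5):
    """Score (source word, target word) pairs over paired documents."""
    scores = defaultdict(float)
    for source_text, target_text in doc_pairs:
        source_words = tokenise(source_text, STOP_WORDS)
        target_words = tokenise(target_text)
        for word in source_words:
            if word in dictionary:  # step 2: keep out-of-dictionary words only
                continue
            ranked = generate(word)[:top_n]  # step 3: ranked candidates
            for rank, candidate in enumerate(ranked, start=1):
                if candidate in target_words:  # step 4: match in target document
                    # Step 5 (assumed scoring): reciprocal rank, summed over
                    # every document pair in which the match occurs.
                    scores[(word, candidate)] += 1.0 / rank
    return dict(scores)


docs = [
    ("Sydney is a city in Australia.", SYDNEY_FA + " شهری در استرالیا است."),
    ("The weather in Sydney", "هوای " + SYDNEY_FA),
]
dictionary = {"city", "australia", "weather"}
lexicon = extract_pairs(docs, dictionary, toy_generate)
```

With this scoring, a pair found at rank 1 in two document pairs accumulates a higher score than a pair found once or at a lower rank, which is one simple way to combine the two signals named in step 5.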

Page 6

Experimental Setup

• An English-Persian comparable corpus of news texts was constructed, consisting of 3,474 documents.

• An English machine-readable dictionary containing 120,177 entries was used.

• Experiments:

– Accuracy of the extracted transliterations (fixed training collection). Three matching methods were tried: (1) English documents are parsed to extract their out-of-dictionary words using dictionary look-up and stemming; (2) Persian documents are parsed so that characters that are allophones are rendered as one unique character; (3) the previous experiment is repeated with knowledge of capital characters included.

– Impact of the seed transliteration lexicon.
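
The second matching method, which renders Persian characters with the same pronunciation as one unique character, might look like the sketch below. The slides do not list which characters were merged, so the `ALLOPHONE_CLASSES` groups are an assumption based on Persian letters that share a sound, and `normalise` is a hypothetical helper name.

```python
# Assumed groups of Persian letters that are pronounced alike; each key is
# the character kept, each value lists the characters folded into it.
ALLOPHONE_CLASSES = {
    "ز": "ذضظ",  # all pronounced /z/
    "س": "ثص",   # all pronounced /s/
    "ت": "ط",    # both pronounced /t/
    "ه": "ح",    # both pronounced /h/
    "ق": "غ",    # pronounced alike in standard Persian
}

# Translation table mapping every variant character to its kept character.
_TABLE = str.maketrans(
    {
        variant: kept
        for kept, variants in ALLOPHONE_CLASSES.items()
        for variant in variants
    }
)


def normalise(word):
    """Replace each allophone character with the one kept for its group."""
    return word.translate(_TABLE)


# Two spellings of Sydney that differ only in the letter used for /s/.
same = normalise("صیدنی") == normalise("سیدنی")
```

Applying the same normalisation to both the generated transliterations and the words of the Persian document lets spellings that differ only in such letters match each other.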

Page 7

Experiments and Results

Experiment 1: Accuracy

Method  #Pairs  #Correct    Avg.Rank  Rank 1-5  #P   #E    #doc  Lex-Size  Lex Pr.
1       4.6     3.6 (81.3)  5.8       68.9      8.5  11.3  1662  2860      70.3
2       4.1     3.6 (90.2)  5.9       69.7      8.5  11.3  1641  2496      80.4
3       6.6     5.9 (89.2)  6.9       66.8      8.4  22.6  1725  3694      75.2

Experiment 2: Train Size

Train  #Pairs  #Correct    Avg.Rank  Rank 1-5  #P   #E    #doc  Lex-Size  Lex Pr.
200    2.6     2.2 (86.5)  2.8       84.2      9.1  24.0  1322  1287      78.2
300    3.1     2.5 (82.5)  4.2       71.9      8.8  23.5  1494  1579      73.3
400    3.1     2.6 (84.5)  4.6       71.3      8.8  23.4  1483  1569      74.1
500    2.8     2.5 (89.5)  4.3       78.0      8.9  23.5  1459  1507      79.0

Page 8

Conclusions and Further Work

• Transliteration extraction can be helpful in automatically generating transliteration lexicons.

• A transliteration lexicon, as a dictionary of transliterations of proper nouns or technical terms that are not translated, is beneficial in dictionary-based machine translation applications.

• We investigated a method that uses current, incomplete transliteration lexicons to enrich themselves from comparable corpora.

• In future, the role of an NE recogniser will be investigated and compared with a simple dictionary look-up.