
9th International Conference on Intelligent Systems, (SITA'14), 07-08 May 2014, Rabat, Morocco

A hybrid approach to translate Moroccan Arabic dialect

Ridouane Tachicart
Mohammadia School of Engineers, Mohamed Vth Agdal University, Rabat, Morocco

Karim Bouzoubaa
Mohammadia School of Engineers, Mohamed Vth Agdal University, Rabat, Morocco

Abstract: Today, several tools exist for the automatic processing of Standard Arabic. In comparison, resources and tools for Arabic dialects, especially the Moroccan dialect, are still lacking. Given the proximity of the Moroccan dialect to Standard Arabic, one way forward is to translate the former into the latter by combining a rule-based approach and a statistical approach, using tools designed for Standard Arabic and adapting them to the Moroccan dialect. In this paper we describe an architecture for such a translation system.

Keywords: Automatic translation, Moroccan dialect, Standard Arabic, language model, translation model, bilingual dictionary.

I. INTRODUCTION

The Arabic-speaking world is characterized by diglossia [15]. On the one hand, Modern Standard Arabic (MSA) is shared by the entire Arab world, but it is not the mother tongue of any Arab country. It is used in the press, in broadcast news, in official government documents, etc. On the other hand, several Arabic dialects exist and are mother tongues. These dialects are usually not written and therefore have no standard spelling convention, although initiatives have been launched to this end [1]. This situation is problematic for the automatic processing of Arabic dialects, and especially for machine translation, since resources for these languages are almost nonexistent. In addition, directly integrating resources designed for MSA into translation systems for Arabic dialects yields very low performance [1]. It is therefore necessary either to develop new tools and resources for Arabic dialects or to adapt those that exist for MSA.

Research on translation systems for Arabic dialects started only in the 2000s and is still in its early stages. The Moroccan dialect (Darija), in turn, has not been the subject of any such research, which leaves it isolated and difficult to understand for the majority of people in other Arab countries. In this paper we present an approach that translates the Moroccan dialect into MSA by using tools designed for MSA and adapting them to the Moroccan dialect. On the one hand, this translation is a first step towards the automatic processing of the Moroccan dialect. On the other hand, given the growing amount of content in this dialect on the web, it may facilitate cultural exchanges with people of the other Arab countries. We present the architecture of the future translation system, which combines a rule-based approach (analysis, transfer and generation) and a statistical approach to benefit from the advantages of both. We will proceed to the development of the system as soon as its architecture is validated. In the first step, we use linguistic resources (a morphological analyzer and a bilingual dictionary) to produce possible translations of the Moroccan dialect source text; we will then use statistical tools to improve the results of this translation. Finally, the quality of the system will be evaluated using metrics that compare its output with the expected reference translations.

The rest of this article is organized as follows. Section 2 describes the characteristics of the Moroccan dialect. Section 3 presents the existing approaches to building a translation system. Section 4 discusses previous work on the automatic translation of Arabic dialects. Section 5 presents our approach. Finally, Section 6 concludes the paper and suggests extensions and future work.

II. MOROCCAN DIALECT

The Moroccan dialect, along with the Amazigh language, is the mother tongue of the Moroccan people. It is an evolving dialect, as it continues to integrate many French, English and Spanish words, especially in technical areas, such as البيسي /PC/ and بروكرام /program/. We describe the characteristics of the dialect below.

A. Transcription

There are many ways to transcribe the sounds of the Moroccan dialect. First, a simplified writing using MSA characters is very often used. This writing sometimes has unnecessary cases. For example, a silent letter such as ة (at the end of a word) is pronounced ا in the Moroccan dialect, since this letter has no phonological or grammatical function, which is not the case in Standard Arabic. In addition, this script cannot transcribe phenomena specific to Darija, such as the pronunciation of G, V and P, which we transcribe with the letters ك, ف and ب respectively, as in جلس /sit/, البوطو /bar/ and الفيدانج /drain/.

A second method of transcribing Moroccan Arabic uses the Latin script. Its main drawback is that several letters can transcribe the same phoneme, as with K, Q and QU, which can all transcribe the letter ق. Conversely, ط, ث and ت can all be represented by T. The last transcription is the alphanumeric writing, which writes Darija using the Latin alphabet and some numbers in order to note phonemes that do not exist in the French language, such as ع and ق. For example, the word عقوبة /sanction/ is transcribed as 3o9oba. This writing, in addition to its unusual appearance, is difficult to read and cannot represent emphatic consonants, such as the emphatic 'S' or the emphatic 'D', with numbers. The example below illustrates this:

B. Phonology

The pronunciation of Darija and MSA is similar, with two major differences. On the one hand, vowels are often omitted in Darija, especially when they are at the end of an open syllable. This is probably due to interference with the Amazigh language [17]. For example, the word كتبت /to write/ is pronounced 'Katabto' in MSA, while in Darija it is pronounced 'Ktbt'. On the other hand, the Moroccan dialect is characterized by the pronunciation of the consonants G, V and W [16], which do not exist in MSA. For example, 'Gouli', which means 'tell me', corresponds to قولي in MSA.

C. Vocabulary

The vocabulary of Moroccan Arabic borrows words from several languages, but it is strongly influenced by the vocabulary of MSA, from which the majority of its words come. The Moroccan dialect also contains words of French origin and a few words of Spanish and Amazigh origin. Table I shows the sources of some words of the Moroccan dialect.

TABLE I. VOCABULARY OF MOROCCAN DIALECT

Word | Origin | Darija equivalent
شربت | MSA | شربت
أتمشى | MSA | كنتمشى
فهمتني | MSA | فهمتني
ابجاو | Amazigh | ابجاو
مش | Amazigh | موش
تمارة | Amazigh | تمارة
Autobus | French | طوبيس
Fromage | French | فرماج
Fourchette | French | فورشيطا
Cuerda | Spanish | كوردا
Roueda | Spanish | رويدا
Cucina | Spanish | كوزينة

D. Morpho-syntax

The morphology of the Moroccan dialect is much less complex than that of MSA. First, sentence structure in Darija [17] is dominated by the subject-verb-object order. Second, words of MSA origin either keep their morphology or undergo some modifications in their patterns or affixes. For example, the prefix س in MSA becomes غادي or غا in Darija. In addition, Moroccan dialect verbs are conjugated using a list of prefixes and suffixes with a slight modification of the word pattern. Moreover, unlike in MSA, there is no conjugation of verbs in the feminine plural or the dual; both are replaced by the simple plural. Table II compares the conjugation of a verb in both languages.

TABLE II. VERB CONJUGATION

Form | MSA | MD
Dual | ضربوهما | ضربوهم
Plural | ضربوهم | ضربوهم
Feminine plural | ضربوهن | ضربوهم

Finally, negation in the Moroccan Arabic dialect is similar to that of the other Arabic dialects of North Africa, where the verb is always enclosed between the prefix ما and the suffix ش, as in مانكتبش /I don't write/.
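The circumfix pattern can be checked with plain string operations. The following is a minimal sketch, not part of the proposed system: the function name `strip_negation` is hypothetical, and the test is purely orthographic, so it would also fire on any non-negated word that happens to start with ما and end with ش.

```python
def strip_negation(word):
    """Detect the Darija negation circumfix and return (is_negated, core).

    Purely orthographic check: the word must start with the prefix and end
    with the suffix, with at least one character left in between.
    """
    prefix, suffix = "ما", "ش"
    has_room = len(word) > len(prefix) + len(suffix)
    if has_room and word.startswith(prefix) and word.endswith(suffix):
        return True, word[len(prefix):-len(suffix)]
    return False, word


print(strip_negation("مانكتبش"))
print(strip_negation("كتبت"))
```

A real analyzer would confirm that the remaining core is a valid verb form before accepting the negation reading.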

III. EXISTING APPROACHES TO AUTOMATIC TRANSLATION

Developing a system for translating the Moroccan dialect requires a machine translation approach. There are two basic strategies. On the one hand, the rule-based approach relies on morphological analysis and exploits linguistic information about the source and target languages. It takes place in three phases (Figure 1). The first is the analysis phase, which produces a series of units and determines the grammatical structure of each word; a morphological analyzer is often used in this phase. Then, in the transfer phase, each analysis is associated with one or more analyses in the target language using a bilingual dictionary. Finally, in the generation phase, a target-language text is produced with an appropriate word order.

Example of the three transcriptions of the same phrase (Section II.A):
Arabic script: ما لقيتش بحالو
Latin script: Malkitch bhalo
Alphanumeric script: Mal9itch b7alo
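The digit conventions of the alphanumeric transcription can be sketched as a lookup table. The mapping below contains only the correspondences that can be read off the examples in this paper (3 for ع and 9 for ق in 3o9oba, 7 for ح in b7alo); the function name `digit_letters` is hypothetical, and the sketch does not attempt to resolve the ambiguous Latin letters.

```python
# Digit-to-letter correspondences visible in the paper's own examples.
DIGIT_TO_ARABIC = {"3": "ع", "9": "ق", "7": "ح"}


def digit_letters(word):
    """Return the Arabic letters encoded by the digits of an alphanumeric word."""
    return [DIGIT_TO_ARABIC[ch] for ch in word if ch in DIGIT_TO_ARABIC]


for token in ["3o9oba", "Mal9itch", "b7alo"]:
    print(token, digit_letters(token))
```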


Fig. 1. Rule-based translation

On the other hand, the statistical approach, which is used to reduce the cost of developing translation systems, relies on two analyses: the first is an analysis of a parallel corpus, carried out by a translation model; the second is an analysis of a monolingual corpus, carried out by a language model. These analyses yield parameters that the decoder uses to compute translation probabilities. The language model captures the constraints imposed by the syntax of the target language and estimates the probability of a sentence in that language, while the translation model models the process of generating a source sentence from a target sentence. The outputs of these two models are used by a decoder that searches, in an acceptable time, for the target sentence that maximizes the likelihood of translating the source sentence (Figure 2).

Fig. 2. Statistical translation
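The interplay of the two models can be written compactly with the usual noisy-channel formulation of statistical translation. The symbols $s$ (source sentence), $t$ (candidate target sentence) and $\hat{t}$ (decoder output) are notation introduced here for illustration and do not appear in the paper.

```latex
\hat{t} \;=\; \arg\max_{t} P(t \mid s)
        \;=\; \arg\max_{t} \underbrace{P(t)}_{\text{language model}} \;
              \underbrace{P(s \mid t)}_{\text{translation model}}
```

The second equality follows from Bayes' rule; the denominator $P(s)$ is dropped because it does not depend on $t$.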

IV. PREVIOUS WORK

The amount of research on automatic translation between Arabic dialects and MSA has recently increased but remains low compared to research on MSA. Researchers usually rely on three types of approaches to build these systems. First, some have built their translation systems using rules [6, 7]. Others [8, 9] have relied on statistical tools. Still others have used a third, hybrid approach to implement their systems [4]. Nizar Habash and Owen Rambow [2] developed the MAGEAD morphological analyzer for Arabic dialects, which can output the root, the pattern and the affixes for each entry. Along the same lines, the authors of [3] rely on the Buckwalter morphological analyzer [14] to translate the Egyptian dialect into diacritized MSA, combining a statistical approach with knowledge of linguistic rules. Wael Salloum and Nizar Habash used ADAM [4], a morphological analyzer for Arabic dialects, to create ELISSA, a rule-based system that translates Arabic dialects into MSA. Finally, Yahya alAmlahi [5] introduced a system that translates the Yemeni dialect into MSA without the use of external tools, relying only on an algorithm that analyzes the words of this dialect on the basis of a list of affixes.

V. APPROACH OF OUR SYSTEM

A. Adopted approach

A rule-based approach may be particularly suited to certain phrases, while other linguistic phenomena are properly handled by a corpus-based approach (statistical method). We therefore opt for a hybrid approach in order to leverage the strengths of each and produce a satisfactory translation. We proceed through several stages (Figure 3) to translate the Moroccan dialect into MSA.


Fig. 3. Architecture of our system

Our approach is divided into two phases. The first is to collect and produce the linguistic resources needed to develop such a translation system, namely a bilingual dictionary (MSA versus Moroccan dialect) and a Moroccan dialect corpus. To prepare the corpus, we will use the scripts of some television productions, and we will use some MSA dictionaries to produce the bilingual dictionary. The bilingual dictionary will then be extended by collecting other resources on the web to ensure maximum coverage of the vocabulary of the Moroccan dialect.

The second phase is to develop the translation system in several steps. In the first step, identification, the system separates the content of the source text in order to distinguish dialectal content from MSA content and thus facilitate the translation. Then, in the analysis step, the source text is segmented into annotated dialectal units using the Alkhalil morphological analyzer [10]. In addition to being free and open source, this analyzer has the advantage of its performance (it can analyze ten words per second). It also provides, for each word, all possible analyses based on morphological and syntactic rules. It relies on an independent database containing extensive linguistic resources, which makes it possible to adapt the database to our needs without modifying the body of the morphological analyzer. In fact, we extend Alkhalil's list of prefixes and suffixes to cover the Moroccan dialect, and we adapt its dictionary with our own dictionary (Moroccan dialect versus MSA), which includes words of several origins (French, Spanish, Amazigh, ...). In the transfer step, each unit produced above is linked to one or more corresponding units in the target language (MSA) using our bilingual dictionary. Then, in the generation step, the system uses the previous analyses to generate different MSA phrases. After that, it must optimize the result to produce the most fluent sentences using a language model built with the SRILM tool [11]. Note that SRILM runs under Linux, is free to use and allows language models to be built quickly. These models rely on n-gram analysis and can represent and process texts in several languages.

Finally, to measure the translation quality of our system, we use standard indicators such as Recall, Precision, METEOR [12] and BLEU [13]. BLEU has gained the status of reference automatic metric within the machine translation community, and is characterized by a brevity penalty that penalizes systems that try to artificially boost their scores by producing deliberately short sentences. METEOR, which aims to improve the correlation between automatic scores and human judgments at the segment level, is characterized by the harmonic mean of unigram precision and recall.

TABLE III. TRANSLATION EXAMPLE

Row | Step | Content
1 | Input | دابا ماشوفتوهمش ؟ /You did not see them now?/
2 | Analysis (words) | دابا + ماشوفتوهمش
3 | Analysis (annotated units) | دابا: ظرف زمان; ما: حرف نفي; شوف: الجذع; ت: تاء المخاطب; وهم: ضمير الغائبين; ش: حرف نفي
4 | Transfer | الآن: ظرف زمان; لم: حرف نفي; رأى: الجذع; ت: تاء المخاطب; وهم: ضمير الغائبين
5 | Generation | الآن لم رأيتموهم
6 | Generation | الآن لم رأيتموهم ؟
7 | Language model | لم تروهم الآن ؟
8 | Output | لم تروهم الآن /You did not see them now?/

Tags: ظرف زمان /adverb of time/, حرف نفي /negation particle/, الجذع /stem/, تاء المخاطب /second-person marker/, ضمير الغائبين /third-person plural pronoun/.
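The two metric properties described above can be made concrete with a short sketch. The brevity penalty follows the standard BLEU definition; the second function is the plain harmonic mean named in the text, which is a simplification, since METEOR in practice uses a weighted variant that favors recall. The function names are chosen here for illustration.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """BLEU brevity penalty: 1 for long enough output, exp(1 - r/c) otherwise."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


def harmonic_mean(precision, recall):
    """Unweighted harmonic mean of unigram precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A candidate half as long as the reference is scaled down by exp(-1).
print(brevity_penalty(5, 10))
print(harmonic_mean(0.5, 1.0))
```

The penalty only ever lowers the score of translations shorter than the reference, which is what discourages deliberately short output.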

B. Example of translation

Table III shows an example of the translation of the Moroccan dialect text دابا ماشوفتوهمش /you did not see them now/ into MSA according to the approach that we have just explained. In the analysis phase, the system uses the Alkhalil morphological analyzer to split this sentence into two words, دابا and ماشوفتوهمش, and to produce the annotated units of each word, as shown in the third row of Table III. The word دابا is analyzed as ظرف زمان /adverb of time/, while the word ماشوفتوهمش is analyzed as a verbal unit and decomposed into the prefix ما, the suffixes ت, وهم and ش, and the stem شوف. Then, in the transfer phase (row 4), each unit produced in the previous phase is linked to its equivalent in MSA using the bilingual dictionary already created. In our example, the system translates the units دابا, ما+ش, شوف, ت and وهم into their respective MSA equivalents الآن, لم, رأى, ت and وهم.
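The analysis and transfer steps of this example can be sketched in a few lines of Python. This is an illustration under strong assumptions, not the Alkhalil analyzer or its API: the affix lists, the tiny dictionary and the function names `analyze` and `transfer` are hypothetical and cover only the units of Table III.

```python
# Hypothetical resources covering only the units that appear in Table III.
ADVERBS = {"دابا"}
PREFIXES = ["ما"]
SUFFIXES = ["ش", "وهم", "ت"]  # stripped from the end of the word inward
BILINGUAL = {"دابا": "الآن", "شوف": "رأى", "ت": "ت", "وهم": "وهم"}
NEGATION = {("ما", "prefix"), ("ش", "suffix")}
MSA_NEGATION = "لم"


def analyze(word):
    """Analysis step: split one dialect word into (unit, tag) pairs."""
    if word in ADVERBS:
        return [(word, "adverb of time")]
    prefixes, suffixes = [], []
    for prefix in PREFIXES:
        if word.startswith(prefix):
            prefixes.append(prefix)
            word = word[len(prefix):]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            suffixes.insert(0, suffix)
            word = word[:-len(suffix)]
    return (
        [(p, "prefix") for p in prefixes]
        + [(word, "stem")]
        + [(s, "suffix") for s in suffixes]
    )


def transfer(units):
    """Transfer step: map dialect units to MSA units with the dictionary."""
    negated = NEGATION <= set(units)
    target = [MSA_NEGATION] if negated else []
    for unit in units:
        if negated and unit in NEGATION:
            continue  # the negation circumfix is rendered by one MSA particle
        target.append(BILINGUAL.get(unit[0], unit[0]))
    return target


for word in "دابا ماشوفتوهمش".split():
    units = analyze(word)
    print(units, "->", transfer(units))
```

A real analyzer returns every possible analysis of a word rather than a single greedy segmentation, so this sketch only mirrors the one reading shown in rows 3 and 4 of Table III.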

Then, in the generation phase (rows 5 and 6), the MSA phrase الآن لم رأيتموهم is generated from the previous analysis. Finally, the language model component (row 7) uses the SRILM tool to improve the quality of the translation and leads to the final sentence لم تروهم الآن, which is more fluent than the sentence الآن لم رأيتموهم produced in the generation phase.
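The role of the language model in this last step can be illustrated with a toy bigram model that ranks the two candidate sentences. This is a pure-Python stand-in, not SRILM: the three-sentence training corpus is invented for illustration, add-one smoothing is an assumption, and the names `train_bigram` and `log_prob` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized text."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(set(contexts) | {"</s>"}) + 1  # +1 slot for unseen words
    return contexts, bigrams, vocab_size


def log_prob(sentence, model):
    """Add-one smoothed log-probability of a sentence under the bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


# Invented toy corpus standing in for a large monolingual MSA corpus.
toy_corpus = ["لم تروهم الآن", "لم تروهم أمس", "لم أكتب الآن"]
model = train_bigram(toy_corpus)

candidates = ["الآن لم رأيتموهم", "لم تروهم الآن"]
best = max(candidates, key=lambda c: log_prob(c, model))
print(best)
```

With this toy corpus the model prefers the second candidate, because all of its bigrams were observed in training, while none of the first candidate's were.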

VI. CONCLUSION

We have presented an approach for translating the Moroccan dialect into MSA by exploiting processing tools designed for MSA. To our knowledge, this is the first translation work concerning the Moroccan dialect. Among its extensions, we plan to introduce a grammar and spelling corrector in the identification phase.

REFERENCES

[1] Habash N., Diab M., Rambow O. (2011). Conventional Orthography for Dialectal Arabic. In: Proc. Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 21-27.

[2] Habash N., Rambow O. (2011). MAGEAD: a morphological analyzer and generator for the Arabic dialects (2006). In: Proc. 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, Stroudsburg, USA, July 2006.

[3] Abo Bakr H., Shaalan K., Ziedan I. (2008). A Transferring Egyptian Colloquial Dialect into Modern Standard Arabic. In: Proc. International Conference on Recent Advances in Natural Language Processing, RANLP 2007, Borovets, Bulgaria, September 27-29.

[4] Salloum W., Habash N. (2012). Elissa: A Dialectal to Standard Arabic Machine Translation System. In: Proc. 24th International Conference on Computational Linguistics, COLING 2012, Mumbai, India, December 2012.

[5] Almalahi Y., Fateh A. (2007). Sana'ani Dialect to Modern Standard Arabic: Rule-based Direct Machine Translation.

[6] Abo Bakr H., Shaalan K., Ziedan I. (2008). A Hybrid Approach for Converting Written Egyptian Colloquial Dialect into Diacritized Arabic. In: Proc. 6th International Conference on Informatics and Systems, INFOS 2008, Cairo, Egypt, March 27-28.

[7] Mohamed E., Mohit B., Oflazer K. (2012). Transforming Standard Arabic to Colloquial Arabic. In: Proc. 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012, Jeju, Korea, July 8-14.

[8] Zbib R., Malchiodi E., Devlin J., Stallard D., Matsoukas S., Schwartz R., Makhoul J., Zaidan O., Callison-Burch C. (2012). Machine translation of Arabic dialects. In: Proc. 12th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2012, Montreal, Canada, June 3-8.

[9] Salloum W., Habash N., Plamondon L. (2011). Dialectal to Standard Arabic Paraphrasing to Improve Arabic-English Statistical Machine Translation. In: Proc. Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Edinburgh, Scotland, UK, July 27-31.

[10] Boudlal A., Lakhouaja A., Mazroui A., Meziane A., Ould abdallahi ould bebah A., Shoul M. (2010). Alkhalil Morpho Sys1: A Morphosyntactic analysis system for Arabic texts. In: Proc. The International Arab Conference on Information Technology, ACIT 2010, Benghazi, Libya, December 14-15.

[11] Stolcke A. (2002). SRILM: An Extensible language modeling toolkit. In: Proc. 7th International Conference on Spoken Language Processing, ICSLP 2002, Denver, Colorado, USA, September 16-20.

[12] Denkowski M., Lavie A. (2011). Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. In: Proc. 6th Workshop on Statistical Machine Translation, EMNLP 2011, Edinburgh, Scotland, UK, July 30-31.

[13] Papineni K., Roukos S., Ward T., Zhu W. (2002). BLEU: a method for automatic evaluation of machine translation. In: Proc. 40th Annual Meeting on Association for Computational Linguistics, ACL 2002, Philadelphia, Pennsylvania, USA, July 7-12.

[14] Buckwalter T. (2002). Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data Consortium, University of Pennsylvania.

[15] Ferguson C. (1959). Diglossia. Word 15.

[16] Ennaji M. (1989). Multilingualism, cultural identity, and education in Morocco. p. 130-134.

[17] Ait cherif A., Boukbout M., Mahmoudi M., Ouhmouch A. (2011). Moroccan Arabic Textbook. Peace Corps Morocco.