CITALA2009 - Morocco
Rule-based approach in Arabic NLP: Tools, Systems and Resources
Dr Khaled Shaalan
Professor, Faculty of Computers & Information, Cairo University
On Secondment to BUiD, UAE
Khaled.shaalan@{buid.ac.ae, gmail.com}



Agenda
• Objective
• Language tasks
• NLP approaches
• Rule-based Arabic analysis and generation tools
• Rule-based Arabic NLP applications
• Some Arabic NLP free resources
• Major and Arabic mailing lists
• Conclusion

Objective

To show how the rule-based approach has been successfully used to develop Arabic natural language processing tools and applications.

Separating Language Tasks

• English vs. French vs. Arabic vs. ...
• Spoken language (dialogue) vs. written text vs. handwritten script
• Genuine script vs. transliterated (Romanized) script
• Vocalized (vowelized) vs. non-vocalized
• Understanding vs. generation
• First-language learner vs. second-language learner
• Classical or Qur'anic Arabic vs. Modern Standard Arabic vs. colloquial (dialects)
• Stem-based vs. root-based

Rules

Situation/Action

If match(stem.prefix, def_article)
then remove(stem.prefix, Stem_FS)

If match(stem.definiteness, indefinite)
then morph_gen(stem.definiteness, Stem_FS)

Common Mistake

The rule-based approach is not a rule-based expert system!

• Both consist of rules.
• A rule-based expert system solves the problem by:
  • Recognize-act cycle
  • Loop
  • Conflict resolution strategy


Recognize-Act Cycle

[Figure: recognize-act cycle. Domain knowledge (rule base) and working memory (fact base) feed Match, then Conflict Resolution, then Execute, which produces a new fact or a new rule.]

loop
1. Match: rules are compared to working memory to determine matches. If no rule matches, then stop.
2. Conflict resolution: select or enable a single rule for execution.
3. Execute: fire the selected rule.
   • Add a new fact, or
   • Learn a new rule
end loop
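The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule names, the facts, and the "first matching rule wins" conflict resolution strategy are hypothetical, and the sketch only adds facts (it does not learn new rules).

```python
def recognize_act(facts, rules, max_cycles=100):
    """Run a recognize-act cycle.

    rules: list of (name, condition, action) where condition(facts) -> bool
    and action(facts) -> set of facts to add to working memory.
    """
    facts = set(facts)
    fired = []
    for _ in range(max_cycles):
        # 1. Match: keep rules whose condition holds and that would add something new.
        matches = [r for r in rules if r[1](facts) and not (r[2](facts) <= facts)]
        if not matches:
            break  # no rule matches: stop
        # 2. Conflict resolution (assumed strategy): pick the first rule in rule-base order.
        name, _, action = matches[0]
        # 3. Execute: fire the rule and add its new facts to working memory.
        facts |= action(facts)
        fired.append(name)
    return facts, fired


# Hypothetical rule base and facts, for illustration only.
RULES = [
    ("mark_definite", lambda f: "has_prefix_al" in f, lambda f: {"definite"}),
    ("build_np", lambda f: "definite" in f, lambda f: {"noun_phrase"}),
]

if __name__ == "__main__":
    print(recognize_act({"has_prefix_al"}, RULES))
```

The contrast with the rule-based NLP approach is that here the control flow is the generic match/resolve/execute loop, not a linguistic procedure written by the developer.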

NLP Approaches

• Rule-based
• Statistical-based

NLP Approaches (1)

Rule-based:
• Relies on hand-constructed rules that are acquired from language specialists
• Requires only a small amount of training data
• Development can be very time-consuming

Statistical-based:
• Developers do not need language specialists' expertise
• Requires a large amount of annotated training data (very large corpora)
• Automated

NLP Approaches (2)

Rule-based:
• Some changes may be hard to accommodate
• Not easy to obtain high coverage of the linguistic knowledge
• Useful for limited domains
• Can be used with both well-formed and ill-formed input
• High quality, based on solid linguistic knowledge

Statistical-based:
• Some changes may require re-annotation of the entire training corpus
• Coverage depends on the training data
• Not easy to work with ill-formed input, as both well-formed and ill-formed input are still probable
• Lower quality: does not explicitly deal with syntax

Rule-based Arabic NLP tools

• Morphological analyzers
• Morphological generators
• Syntactic analyzers
• Syntactic generators

Rule-based Arabic Morphological Analyzer

Morphological analysis
• Break down the inflected Arabic word into a root/stem, affixes, and features.
• Example: sa-'uEty-kumA (سأعطیكما), 'I will give you…'

sa- (س):
  TYPE: Particle; INFLECTION: 'Future'
'uEty- (أعطی):
  TYPE: VERB; ASPECT: IMPERF; MOOD: IND; PERS: 1; GENDER: M/F; NUMBER: SG; SUBJ: I
-kumA (كما):
  TYPE: AFFPR; GENDER: M/F; NUMBER: DUAL; GF: OBJ

Rules: Augmented Transition Network (ATN) technique
• Rules associated with arcs represent the context-sensitive knowledge about the relation between a root and its inflections.
• More than one rule may be associated with one arc.
• Conditions associated with the arcs are ordered so that the arc traversed first is the one that leads to the most probable solution.

Arabic Morphology using ATN Technique

Types of Rules

• Remove prefix or suffix
• Remove doubled letter
• Add/change Hamza, weak letter, …
• …

Analysis of the verb "شاهدتك" (I saw you): remove suffixes

[Figure: ATN with states S0, S1, S2, S3 and S10. Input: شاهدتك. The arc last1 = "ك" leaves شاهدت; the arc last2 = "ت" leaves the stem شاهد.]

• stem: "شاهد" (saw)
• perfect
• 1st person sg pronoun: "ت"
• 2nd person sg pronoun: "ك"

Analysis of the verb "يلعبون" (they are playing): remove prefix & suffix

[Figure: ATN with states S0, S1, S2, S3 and S10. Input: يلعبون. The arc Begin2 = "ي" leaves لعبون; the arc last2 = "ون" leaves the stem لعب.]

• stem: "لعب" (played)
• imperfect
• plural subject
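The two analyses above can be approximated by a small affix-stripping search. This is a simplified sketch, not the ATN implementation described in the slides: the affix tables, the two-entry stem lexicon, and the feature labels are hypothetical, and a real analyzer would attach conditions to arcs and cover far more rules.

```python
# Hypothetical, tiny affix tables and stem lexicon (illustration only).
PREFIXES = {"ي": "imperfect"}
SUFFIXES = {
    "ك": "2nd person sg pronoun",
    "ت": "1st person sg pronoun",
    "ون": "plural subject",
}
STEMS = {"شاهد": "saw", "لعب": "played"}


def analyze(word):
    """Strip known affixes until a lexicon stem remains.

    Returns (stem, features) for the first analysis found, or None.
    """
    def search(rest, features):
        if rest in STEMS:
            return rest, features
        for prefix, label in PREFIXES.items():
            if rest.startswith(prefix) and len(rest) > len(prefix):
                found = search(rest[len(prefix):], features + [label])
                if found:
                    return found
        for suffix, label in SUFFIXES.items():
            if rest.endswith(suffix) and len(rest) > len(suffix):
                found = search(rest[:-len(suffix)], features + [label])
                if found:
                    return found
        return None

    return search(word, [])


if __name__ == "__main__":
    print(analyze("شاهدتك"))
    print(analyze("يلعبون"))
```

The order in which affixes are tried plays the role of arc ordering: trying the most likely arc first returns the most probable analysis first.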

Issues in the morphological analysis

• Overgeneration (too many outputs)
• Ambiguity
• Reconstruction of vowels
• Multiword/compound expressions
• Out-of-vocabulary (OOV) words
• Handling ill-formed input
  • Detection (spell checking)
  • Correction by relaxation, e.g. "ه" instead of "ة"
• Preventing ill-formed output
  • Check the compatibility: the prefix "ف" cannot come after the prefix "ب" (or "ك")
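Both checks mentioned above are simple to express as code. The following is a minimal sketch: the function names, the table layout, and the one-word lexicon in the usage example are assumptions made for illustration, and only the two constraints named on the slide are encoded.

```python
# cur -> set of prefixes it must not follow (from the slide: "ف" cannot come after "ب" or "ك").
INCOMPATIBLE_AFTER = {"ف": {"ب", "ك"}}

# (written letter, intended letter) pairs tried when a word is not found.
RELAXATIONS = [("ه", "ة")]


def prefixes_compatible(prefixes):
    """Return False if any prefix follows a prefix it is incompatible with."""
    for prev, cur in zip(prefixes, prefixes[1:]):
        if prev in INCOMPATIBLE_AFTER.get(cur, set()):
            return False
    return True


def relax_final_letter(word, lexicon):
    """Return the word itself, a relaxed spelling found in the lexicon, or None."""
    if word in lexicon:
        return word
    for written, intended in RELAXATIONS:
        if word.endswith(written):
            candidate = word[:-len(written)] + intended
            if candidate in lexicon:
                return candidate
    return None


if __name__ == "__main__":
    print(prefixes_compatible(["ف", "ب"]))        # "ف" before "ب" is allowed
    print(prefixes_compatible(["ب", "ف"]))        # "ف" after "ب" is rejected
    print(relax_final_letter("مدرسه", {"مدرسة"}))
```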

Rule-based Arabic Morphological Generator

Morphological generation
• Synthesis of an inflected Arabic word from a given root/stem according to a combination of morphological properties that include:
  • definiteness (definite article "ال")
  • gender (masculine, feminine)
  • number (singular, dual, plural)
  • case (nominative, genitive, accusative, …)
  • person (first, second, third)
  • …

Types of Rules

Synthesis of inflected:
• Noun
• Verb
• Particle

Synthesis of inflected nouns
• definite noun
• feminine noun
• pluralize noun
• dual noun
• attach a prefix preposition
• attach a suffix pronoun
• end case
• …

Synthesis of feminine noun

If noun.gender = masculine
Then attach suffix feminine letter

Example: "زوج" (husband) → "زوجة" (wife)

Synthesis of suffix pronoun

If pronoun.person = first and pronoun.number = singular
Then attach first person singular suffix pronoun

Example: "زوجة" (wife) → "زوجتي" (my wife)
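The two noun rules above can be sketched as string operations. This is a minimal illustration with hypothetical function names; it covers only the regular case shown in the examples, where the feminine letter "ة" opens to "ت" before a suffix pronoun.

```python
TA_MARBUTA = "ة"   # feminine suffix letter
TA_MAFTUHA = "ت"   # form taken by the feminine letter before a suffix
FIRST_SG_SUFFIX = "ي"


def synthesize_feminine(masculine_noun):
    """Attach the feminine letter to a masculine noun (regular case only)."""
    return masculine_noun + TA_MARBUTA


def attach_first_person_singular(noun):
    """Attach the first person singular suffix pronoun to a noun."""
    if noun.endswith(TA_MARBUTA):
        noun = noun[:-1] + TA_MAFTUHA
    return noun + FIRST_SG_SUFFIX


if __name__ == "__main__":
    wife = synthesize_feminine("زوج")
    print(wife, attach_first_person_singular(wife))
```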

Synthesis of inflected verbs (very complex: rich in form and meaning)
• conjugate a verb with tense
• conjugate a verb with number
• conjugate a verb with prefix pronoun
• conjugate a verb with suffix pronoun
• …

Rule: synthesize first person plural of assimilated verbs

Input: first person singular past verb
Output: inflected verb
Example: وصلنا - سنصل - نصل

If verb.tense = future
then remove first weak letter & attach_prefix("سن")
else if verb.tense = present
then remove first weak letter & attach_prefix("ن")
else attach_suffix(verb.stem, "نا")
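A direct Python rendering of this rule might look as follows. It is a sketch under assumptions: the function name and the set of weak letters are chosen for illustration, vowels are ignored, and only the assimilated-verb pattern of the example is handled.

```python
WEAK_LETTERS = {"و", "ي"}  # assumed set of initial weak letters


def first_person_plural(stem, tense):
    """Synthesize the first person plural of an assimilated verb such as وصل."""
    if tense == "past":
        return stem + "نا"
    # Present and future drop the initial weak letter before prefixing.
    base = stem[1:] if stem[0] in WEAK_LETTERS else stem
    if tense == "future":
        return "سن" + base
    if tense == "present":
        return "ن" + base
    raise ValueError(f"unknown tense: {tense}")


if __name__ == "__main__":
    for tense in ("past", "present", "future"):
        print(tense, first_person_plural("وصل", tense))
```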

Issues in the morphological generation

• Multiword/compound expressions
• Out-of-vocabulary (OOV) words
• Some forms need special handling:
  • Substitution: This man = هذا الرجل
  • Literal numbers (complex nouns)
  • Arabic script:
    • 'ل' + 'ال' → 'للـ'
    • 'زملاء' + 'ي' → 'زملائي' (not 'زملاءي')
    • 'غرفة' → 'غرفتان'

Rule-based Arabic Syntactic Analyzer

Types of Rules

Grammatical rules:
• Describe sentence and phrase structures, and ensure the agreement relations between various elements in the sentence.

Parsing
• Accepts the input and generates the sentence structure (parse tree)

Parsing of the sentence "الطالبة مجتهدة": The student (sg, f) is diligent (sg, f)

[Figure: parse tree of a nominal sentence. The inchoative الطالبة is a noun (definite, fem, sg); the enunciative مجتهدة is a noun (indefinite, fem, sg). Agreement: number and gender.]

Nominal sentence -> definite_Inchoative(Number, Gender) indefinite_enunciative(Number, Gender)
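The nominal-sentence rule can be read as a check that both constituents share the same Number and Gender values. The sketch below assumes a hypothetical `Word` record carrying the features shown in the parse tree; a real parser would obtain these features from the morphological analyzer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str
    definite: bool
    gender: str   # "fem" or "masc"
    number: str   # "sg", "dual" or "pl"


def parse_nominal_sentence(inchoative, enunciative):
    """Return a small parse tree, or None if the rule does not apply."""
    if not inchoative.definite or enunciative.definite:
        return None  # rule needs a definite inchoative and an indefinite enunciative
    if (inchoative.number, inchoative.gender) != (enunciative.number, enunciative.gender):
        return None  # agreement in number and gender fails
    return ("nominal_sentence",
            ("inchoative", inchoative.form),
            ("enunciative", enunciative.form))


if __name__ == "__main__":
    student = Word("الطالبة", definite=True, gender="fem", number="sg")
    diligent = Word("مجتهدة", definite=False, gender="fem", number="sg")
    print(parse_nominal_sentence(student, diligent))
```

Sharing the `Number` and `Gender` variables between the two constituents in the grammar rule corresponds to the tuple comparison here.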

Issues in the syntactic analysis

• Ambiguity (more than one parse tree)
  • Disambiguation techniques
• Handling ill-formed input
  • Detection (grammar checking)
  • Recovery (partial parsing: the parses are chunks to be related)

Rule-based Arabic Syntactic Generator

Types of Rules

• Determine phrase structures
• Determine syntactic structure
• Ensure the agreement relations between various elements in the sentence

Rule: verb-subject agreement

Input: verb and inflected subject (a pre-verbal NP)
Output: inflected verb that agrees with its inflected subject

synthesize_verb(Subject.number, verb.stem)
synthesize_verb(Subject.gender, verb.stem)

An agreement example:

الأولاد زاروا خمس متاحف قديمة
the-boys visited-they five museum old
The boys visited five old museums

Agreement relations:
• verb-subject (زاروا / الأولاد): number and gender (N, G)
• counted-numeral (متاحف / خمس): gender (G)
• adjective-noun (قديمة / متاحف): gender (G)

Issues in the syntactic generation

• Word order (VSO, SVO, etc.)
• Agreement (full/partial)
• Dropping the subject pronoun (called pro-drop), i.e., having a null subject when the inflected verb includes subject affixes
• Syntax that captures the source/intended meaning
  • My son is 8 = ابني عمره ثماني سنوات
  • I did not understand the last sentence = أنا لم أفهم الجملة الأخيرة

Rule-based Arabic NLP applications

• Named Entity Recognition
• Machine translation
• Transferring Egyptian Colloquial Dialect into Modern Standard Arabic

What is entity recognition?

Identifying, extracting, and normalizing entities from documents such as names of people, locations, or companies.

Makes unstructured data more structured

Entity Extractor

Politics of Ukraine

In July 1994, Leonid Kuchma was elected as Ukraine's second president in free and fair elections. Kuchma was reelected in November 1999 to another five-year term, with 56 percent of the vote. International observers criticized aspects of the election, especially slanted media coverage; however, the outcome of the vote was not called into question. In March 2002, Ukraine held its most recent parliamentary elections, which were characterized by the Organization for Security and Cooperation in Europe (OSCE) as flawed, but an improvement over the 1998 elections. The pro-presidential For a United Ukraine bloc won the largest number of seats, followed by the reformist Our Ukraine bloc of former Prime Minister Viktor Yushchenko, and the Communist Party. There are 450 seats in parliament, with half chosen from party lists by proportional vote and half from individual constituencies.

[Highlighting legend: Person, Location, Date]

Person Entity Recognition (1)

Example: 'الملك الأردني عبد الله الثاني' (The Jordanian king Abdullah II)

We want to have a rule that recognizes a person name composed of a first name followed by optional last names, based on a preceding person indicator pattern.

Person Entity Recognition (2)

The rule components of this example:
• Name entity: عبد الله [Abdullah]
• Indicator pattern: an honorific such as "الملك" [The king]
• Nasab (optional): inflected from a location name, e.g. "الأردني" [Jordanian]
• The rule also matches an optional ordinal number appearing at the end of some names, such as "الثاني" [II].

Person Entity Recognition (3)

((honorific+(location(ي|ية))?)+first_Name(last_Name)?+(number)?)

This (regular expression) rule can recognize:
• الملك عبد الله
• الملك الأردني عبد الله
• الملك الأردني عبد الله الثاني
• الملكة الأردنية رانيا
• …
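The rule can be tried out with Python's standard `re` module. The gazetteers below (honorifics, locations, first names, ordinals) are tiny hypothetical lists covering only the slide's examples, and the optional last-name slot is omitted for brevity; a real system would load much larger word lists.

```python
import re

# Hypothetical mini-gazetteers; longer alternatives are listed first.
HONORIFICS = ["الملكة", "الملك"]
LOCATIONS = ["الأردن"]
FIRST_NAMES = ["عبد الله", "رانيا"]
ORDINALS = ["الثاني"]


def alternation(items):
    """Build a non-capturing alternation from literal strings."""
    return "(?:" + "|".join(re.escape(item) for item in items) + ")"


PERSON = re.compile(
    rf"(?P<honorific>{alternation(HONORIFICS)})"
    rf"(?: (?P<nasab>{alternation(LOCATIONS)}(?:ية|ي)))?"   # optional Nasab
    rf" (?P<name>{alternation(FIRST_NAMES)})"
    rf"(?: (?P<ordinal>{alternation(ORDINALS)}))?"          # optional ordinal
)


def find_person(text):
    """Return the named groups of the first person mention, or None."""
    match = PERSON.search(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(find_person("الملك الأردني عبد الله الثاني"))
    print(find_person("الملكة الأردنية رانيا"))
```

Named groups keep the indicator pattern separate from the name itself, so the honorific can trigger the match without being reported as part of the entity.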

Issues in the Arabic NER

• Complex morphological system (inflections)
• Non-casing language (no initial capital for proper nouns)
• Non-standardization and inconsistency in Arabic written text (typos and spelling variants)
• Ambiguity

Machine Translation

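• Direct
• Transfer
• Interlingua

MT Approaches: MT Pyramid

[Figure: MT pyramid. Analysis rises from the source word to the source syntax; generation descends from the target syntax to the target word. Direct translation links the word level, transfer links the syntax level, and the interlingua sits at the top.]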

English-to-Arabic Transfer-based Approach

[Figure: pipeline from the source sentence (English) to the target sentence (Arabic).
1. Sentence analysis: morphological and syntactic analysis rules of English, with an English dictionary; output is the English parse tree.
2. Transfer: English-to-Arabic transformation rules, with a bilingual dictionary; output is the Arabic parse tree.
3. Sentence synthesis: morphological generation and synthesis rules of Arabic, with an Arabic dictionary; output is the target sentence.]

Transfer approach

• Involves analysis, transfer, and generation components
• If you have an Arabic parser and an Arabic syntactic generator, all you need is to acquire the transfer rules and build the transfer component

Simple Transfer

(1) [w_i:$1, w_i+1:$2, …, w_k:$k] → [w_k:$k, w_k-1:$k-1, …, w_i:$1]   (1 ≤ i ≤ k)

[Figure: transfer of a noun-phrase tree. English np: nouns networks (pl), performance (sg), evaluation (sg). After transfer, Arabic np: nouns تقييم (sg), أداء (sg), شبكة (pl).]

Networks performance evaluation → تقييم أداء شبكة
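For a noun compound, the simple transfer rule amounts to reversing the constituent order and translating each word. The sketch below assumes a hypothetical three-entry bilingual dictionary that already stores inflected Arabic forms; in the architecture described above, inflection would instead be produced by the Arabic morphological generator from the stem and its features.

```python
# Hypothetical bilingual dictionary (illustration only).
BILINGUAL = {
    "networks": "شبكات",
    "performance": "أداء",
    "evaluation": "تقييم",
}


def transfer_noun_compound(english_words):
    """Reverse the English noun sequence and look each word up in the dictionary."""
    return [BILINGUAL[word.lower()] for word in reversed(english_words)]


if __name__ == "__main__":
    arabic = transfer_noun_compound(["Networks", "performance", "evaluation"])
    print(" ".join(arabic))
```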

Issues in the transfer-based MT approach

• Synonyms of a word
  • Acquisition: "اكتساب" or "استخلاص"
• Agreement
  • intelligent tutoring systems: "نظم التعليم الذكية" or "نظم التعليم الذكي"
• Problems with prepositions
  • did you do fungal analysis? "هل قمت بـتحليل الفطر؟"

Interlingua MT: multilingual translation

• Interlingua = semantic representation
• Deep analysis (no need for a transfer component)
• Only analysis and generation components
• Add an Arabic analyzer to translate to other languages
• Add an Arabic generator to translate from other languages

Analysis of Arabic to Interlingua

Input: العميل: أنا أرغب في حجز غرفة في الفندق

Interlingua (IF): c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))

[Figure: analysis components. Preprocessor; Sentence Analyzer with Morphological Analyzer, using the Arabic grammar rules, Arabic morphology rules and Arabic lexicon to produce a parse tree; Mapper, using the map lexicon and ontology to produce the interlingua.]

Generating Arabic from Interlingua

Interlingua (IF): c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))

[Figure: generation components. Mapper, using the map rules, map lexicon and ontology to produce a feature structure; Sentence Generator with Morphological Generator, using the Arabic grammar rules, Arabic morphology rules and Arabic lexicon.]

Output: العميل: أنا أرغب في حجز غرفة في الفندق

Issues in the interlingua approach

• Interlingua: a language-neutral representation that captures the intended meaning of the source sentence
• Requires a fully disambiguating parser

Transferring Egyptian Colloquial Dialect into Modern Standard Arabic

Be able to reuse MSA processing tools with colloquial Arabic by transferring colloquial Arabic words into their corresponding MSA words.

Facilitate the communication with colloquial Arabic speakers

Restore the Arabic dialect to the standard language in use nowadays.

A one-to-one transfer example

امتي؟ → (mapping) → متي؟ (when?)

A one-to-many transfer example

عال (on-the) → (mapping) → علي (on) + ال (the)

A complete sentence example

جيت امتي؟ (You-came when?)
→ (mapping) → جئت متي؟
→ (reordering) → متي جئت؟ (When did-you-come?)

• Step (1): mapping
  • جيت → جئت
  • امتي → متي
• Step (2): reordering
  • The new segment position for the word "امتى" is start of sentence (SoS)
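The two steps can be sketched as a lexicon lookup followed by a position rule. The lexicon format below, a colloquial word mapped to its MSA words plus an optional new segment position, is an assumption made for illustration; the entries are the ones used in the examples above.

```python
# colloquial word -> (MSA words, new segment position or None); hypothetical format.
LEXICON = {
    "جيت": (["جئت"], None),
    "امتي": (["متي"], "SoS"),       # moves to start of sentence
    "عال": (["علي", "ال"], None),   # one-to-many mapping
}


def to_msa(tokens):
    """Step 1: map each colloquial token; step 2: move SoS segments to the front."""
    front, rest = [], []
    for token in tokens:
        msa_words, position = LEXICON.get(token, ([token], None))
        (front if position == "SoS" else rest).extend(msa_words)
    return front + rest


if __name__ == "__main__":
    print(" ".join(to_msa(["جيت", "امتي"])))
```

Unknown tokens are passed through unchanged, which is a design choice of this sketch and not something the slides specify.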

Issues in the transfer to MSA

More investigations are needed

Arabic NLP Free Resources

Arabic Morphological Analyzers
• Tim Buckwalter Morphological Analyzer
  http://www.qamus.org/
  http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49
• Xerox
  http://www.cis.upenn.edu/~cis639/arabic/input/keyboard_input.html
• Aramorph
  http://www.nongnu.org/aramorph/english/index.html

Arabic spell checker
• Aspell
  http://aspell.net/
  http://www.freshports.org/arabic/aspell

Arabic Morphological Generation
• Sarf
  http://sourceforge.net/projects/sarf

Tokenization & POS tagging
• ArabicSVMTools: the tools use the Yamcha SVM tools to tokenize, POS tag and base-phrase chunk Arabic text
  http://www1.cs.columbia.edu/~mdiab/
  http://www1.cs.columbia.edu/~mdiab/software/AMIRA-1.0.tar.gz
• MADA: a full morphological tagger for Modern Standard Arabic
  http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html
• Attia's Finite State Tools for Modern Standard Arabic
  http://www.attiaspace.com/getrec.asp?rec=htmFiles/fsttools

POS tagging
• Stanford Log-linear Part-Of-Speech Tagger
  http://nlp.stanford.edu/software/tagger.shtml
  http://nlp.stanford.edu/software/stanford-arabic-tagger-2008-09-28.tar.gz

Arabic Parsers
• Dan Bikel's Parser
  http://www.cis.upenn.edu/~dbikel/
  http://www.cis.upenn.edu/~dbikel/software.html
• Attia Arabic Parser
  http://www.attiaspace.com/
  http://decentius.aksis.uib.no/logon/xle.xml

Arabic wordnet
• Arabic WordNet
  http://www.globalwordnet.org/AWN/
  http://personalpages.manchester.ac.uk/staff/paul.thompson/AWNBrowser.zip

Translation resources
• Tools: GIZA++, MOSES, Pharaoh, Rewrite and BLEU
  http://www.statmt.org/
• APIs:
  http://code.google.com/apis/ajax/playground/#translate
  http://code.google.com/apis/ajax/playground/#batch_translate

Transliterate
• Transliterate
  http://code.google.com/apis/ajax/playground/#transliterate_arabic

Mailing Lists – just to be connected to the NLP community

• http://mailman.uib.no/listinfo/corpora
• http://www.linguistlist.org/
• http://www.semitic.tk/
• http://www.arabicscript.org/CAASL3/index.html

Conclusion (1)

• Arabic requires treatment of the language constituents at all levels: morphology, syntax, and semantics.
• Most research in Arabic NLP concentrates on the analysis part, aiming at automated understanding of the Arabic language.

Conclusion (2)

• Arabic NLP in general is significantly underdeveloped.
• To bridge this gap and help Arabic NLP research catch up with the many recent advances for Latin languages, we need collaborative efforts from the Arabic research community.

Conclusion (3)

We need public-domain resources (in electronic form):
• Linguistic resources such as large Arabic (bilingual) corpora and treebanks
• Machine-readable (bilingual) dictionaries
• Morphological analyzers
• Parsers
• …

Conclusion (4)

We need to secure funding for:
• Exchanging visits (experience, expert network)
• Buying software
• Securing dedicated RAs and/or PhD students for the NLP task


Thank you! Merci! Shukran! شكرا