Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

Post on 14-Apr-2017


Computational Processing of Arabic Dialects: Challenges, Advances & Future Directions

Keynote, The 2nd Workshop on Arabic Corpora and Processing Tools

LREC, May 24, 2016

Nizar Habash, New York University Abu Dhabi

nizar.habash@nyu.edu

CAMeL Lab

Roadmap

•  Introduction
•  Orthographic processing
•  Morphological processing
•  Working on a new dialect?
•  Future directions

Introduction
•  Forms of Arabic
–  Classical Arabic (CA)
   •  Classical historical texts
   •  Liturgical texts
–  Modern Standard Arabic (MSA)
   •  News media & formal speeches and settings
   •  Only written standard
–  Dialectal Arabic (DA)
   •  Predominantly spoken vernaculars
   •  No written standards
•  Dialect vs. Language

Arabic and its Dialects
•  Official language: Modern Standard Arabic (MSA)
   Ø  No one’s native language
•  What is a ‘dialect’?
–  Political and religious factors
•  Regional dialects
–  Egyptian Arabic (EGY)
–  Levantine Arabic (LEV)
–  Gulf Arabic (GLF)
–  North African Arabic (NOR): Moroccan, Algerian, Tunisian
–  Iraqi, Yemenite, Sudanese, Maltese?
•  Social dialects
–  City, Rural, Bedouin
–  Gender, religious variants

Introduction
•  Arabic Diglossia
–  Diglossia is where two forms of the language exist side by side
–  MSA is the formal public language
   •  Perceived as “language of the mind”
–  Dialectal Arabic is the informal private language
   •  Perceived as “language of the heart”
•  General Arab perception: dialects are a deteriorated form of Classical Arabic
•  Continuum of dialects

Code Switching

الأنامابعتقدألنهعمليةالليعمبيعارضوااليومتمديدللرئيسلحودهمالليطالبوابالتمديدللرئيسالهراويوبالتاليموضوعمنهموضوعمبدئيعلىاألرضأنابحترمأنهيكونفينظرةديمقراطيةلألموروأنهيكونفياحترامللعبةالديمقراطيةوأنيكونفيممارسةديمقراطيةوبعتقدإنهالكلفي

علىموضوعإنجازاتبسبدييرجعلحظةأكثريةساحقةفيلبنانتريدهذااملوضوع،لبنانأوفيلبنانمنالنظامرئاسينظامفيلبنانالنظامعنإنجازاتالعهدلكنهليعنينعمنحكيالعهد

عمليابيدالحكومةمجتمعةوالرئيسلحودأثبتهيرئاسيوبالتاليالسلطةنظامبعدالطائفليسشخصمسؤولفيمنصبمعنيوأناعشتهذااملوضوعبأنهملابيكونفياألخيرةممارستهخالل

صالحةضمنخطابومبادئخطابملابياخدمواقفشخصيابممارستيفيموضوعاالتصاالتالسلطةالتنفيذيةألنهمنهرئيسجمهوريةهويكونرئيسمشمطلوبمنإنماهوإلىجانبهالقسم

عليهالتوجيهعليهإبداءاملالحظاتعليهبقىفيلبنانمابعدإتفاقالطائفرئيسالسلطةالتنفيذيةالوطنيةالشاملةكييظلفيمصالحةوطنيةكييظلالقولماهوخطأوماهوصحعليهتثميرجهود

باتجاهيروحتوافقمابنياملسلمواملسيحيفيلبنانيحتضنأبناءهذاالبلدمايتركاملسارفيوآمنوافيهاالليمشيوامعهالخطأنعمإنماخطابالقسمكانموضوعمبادئطرحتهوملتزمفيها

التزموافيهاأناأثبتخاللاألربعسنواتباملمارسةالحكوميةأنيالتزمتفيهاوملاالتزمنابهذاأنابتفهمتمامااملوضوعكانالرئيسلحودإلىجنبنافيهذااملوضوع،أمااملوضوعالديمقراطي

فتحإعادةانتخابهذاهالوجهةالنظربسماممكننقولإنهالدستورأوتعديلههوأوإمكانيةمسحهيئةفيجمهوريةبواليةثانيةهوديمقراطيضمناملجلسوالتصويتإلىماهنالكلرئيس

قناعتيفيهذااملوضوع.يعنيجوهرالديمقراطيةهذاباألقل

MSA and dialect mixing in speech
•  Phonology, morphology and syntax

Aljazeera transcript: http://www.aljazeera.net/programs/op_direction/articles/2004/7/7-23-1.htm

MSA

LEV

Why is Arabic processing hard?

Property | Arabic | English
Orthographic ambiguity | More | Less
Orthographic inconsistency | More | Less
Morphological inflections | More | Less
Morpho-syntactic complexity | More | Less
Word order freedom | More | Less
Dialectal variation | More | Less

Computational Processing of Standard Arabic
•  There has been a large and growing amount of work on Standard Arabic processing:
–  Multiple morphological analyzers and taggers
   •  BAMA/SAMA, Elixir, AlKhalil, ALMOR, MADAMIRA, etc.
–  Multiple treebanks and parsers
   •  Penn ATB, Prague DTB, CATiB, Quran Corpus
–  Large collections of monolingual text
   •  Gigaword, news collections, QALB, and others
–  Large collections of bilingual/multilingual text
   •  UN corpus, news collections, etc.
–  Sentiment resources
   •  ArSenL, SLSA, SAMAR, etc.
–  Not to mention the traditional resources on lexicography, morphology and syntax!
•  Much more to do still!
•  Resources and work on dialects are very limited in comparison.

Why Work on Arabic Dialects?
•  Dialects are the primary form of Arabic used in all unscripted spoken genres: conversational, talk shows, interviews, etc.
–  Speech recognition and dialogue systems must model dialects
•  Dialects are increasingly in use in new written media (newsgroups, weblogs, forums, etc.)
–  Text analytics of Arabic must include dialectal modeling
•  Substantial Dialect-MSA differences impede direct application of MSA NLP tools

Computational Challenges
•  Enormous variety
–  Many dialects and sub-dialects, code switching
•  Orthographic ambiguity
–  Under-specification and inconsistency
•  Morphological complexity
–  More clitics and fewer morphological features than MSA
•  Overall annotated resource poverty
–  There is a lot of monolingual raw data
–  Limited lexicons
–  Limited treebanks, propbanks, etc.

Computational Solutions
•  Treat Arabic dialects as different languages
–  Build resources and tools from scratch
   •  Morphological analyzers, annotated treebanks, parallel data…
–  Pro: model different genres
–  Con: expensive, effort duplication
•  Exploit similarity between dialects and MSA and among dialects
–  Convert (or relate) dialectal resources to MSA or vice versa to adapt
–  Pro: less duplication, exploits relationships
–  Con: there is a limit to how well this will work
•  Hybrid approach
•  Community standards
–  Orthography, morphological analysis, POS tag sets, treebanks, etc.


Dialectal Phonological Variations
•  Major variants

Letter | MSA | Dialects
ق | /q/ | /q/, /k/, /ʔ/, /g/, /ʤ/
ث | /θ/ | /θ/, /t/, /s/
ذ | /δ/ | /δ/, /d/, /z/
ج | /ʤ/ | /ʤ/, /g/, /j/

•  Some of many limited variants
–  /l/ → /n/  MSA: /burtuqāl/ → LEV: /burtʔān/ ‘orange’
–  /ʕ/ → /ħ/  MSA: /kaʕk/ → EGY: /kaħk/ ‘cookie’
–  Emphasis add/delete: MSA: /fustān/ → LEV: /fustān/ ‘dress’

Arabic Script Orthographic Variants

Sound | IRQ | LEV | EGY | TUN | MOR
/ʤ/ | ج | ج | چ | ج | ج
/g/ | گ | چ | ج | ڨ | ڭ
/tʃ/ | چ | تش | تش | تش | تش
/p/ | پ | پ | پ | پ | پ
/v/ | ڤ | ڤ | ڤ | ڥ | ڥ

Latin Script for Arabic?
•  Several proposals to the Arabic Language Academy in the 1940s
•  Said Akl Experiment (1961)
•  Web Arabic (Arabizi, Arabish, Franco-arabe)
–  No standard, but common conventions

عربي | IPA | Latin
أ إ آ ء ؤ ئ | /ʔ/ | ‘ 2 Ø
ة | /a/, /t/ | a t
ح | /ħ/ | H h 7
خ | /x/ | kh 7’ x 8
ذ | /δ/ | th
ش | /ʃ/ | sh ch
ث | /θ/ | th
ط | /tʕ/ | t T 6
ع | /ʕ/ | ‘ 3 Ø
غ | /ʁ/ | g gh 3’
ق | /q/ | q
ي | /y/ /ay/ /ī/ /ē/ | y, i, e, ai, ei, …

Akl 1961
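The digit conventions in the Arabizi table can be expressed as a small lookup. The following is a minimal Python sketch, not the 3Arrib system presented later in the talk: the name `map_arabizi_digits` and the dictionary are illustrative, only digit codes that appear in the table are covered, bare hamza (ء) is an assumed stand-in for the whole hamza family, and a plain ASCII apostrophe is assumed in the two-character codes. Latin letters are left untouched because their mapping is ambiguous (for example, `th` may stand for ث or ذ).

```python
# Digit codes taken from the Arabizi table; two-character codes are tried first.
# Hypothetical helper for illustration only.
ARABIZI_DIGITS = {
    "3'": "غ",
    "7'": "خ",
    "2": "ء",   # assumption: bare hamza stands in for all hamza forms
    "3": "ع",
    "6": "ط",
    "7": "ح",
}


def map_arabizi_digits(word: str) -> str:
    """Replace Arabizi digit codes with Arabic letters and keep everything else."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ARABIZI_DIGITS:
            out.append(ARABIZI_DIGITS[pair])
            i += 2
        elif word[i] in ARABIZI_DIGITS:
            out.append(ARABIZI_DIGITS[word[i]])
            i += 1
        else:
            out.append(word[i])
            i += 1
    return "".join(out)
```

A full converter would also need to resolve the letters and pick among candidate spellings in context, which is why this sketch stops at the unambiguous digits.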

Lack of Orthographic Standards
•  Orthographic inconsistency
•  Egyptian /mabinʔulhalakʃ/
–  mAbinquwlhAlak$ مابنقولهالكش
–  mAbin&ulhalak$ مابنؤلهالكش
–  mAbin}ulhAlak$ مابنئلهالكش
–  mAbinqulhAlak$ مابنقلهالكش
–  …

Spelling Inconsistency
•  Social media spelling variations
–  +ak
–  +aaaaak
–  +k
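Elongated spellings such as `+aaaaak` are one of the easier variations to neutralize before further processing. A minimal sketch, assuming that three or more repetitions of the same character signal a speech effect (the threshold and the name `collapse_elongation` are illustrative choices, not something stated in the talk):

```python
import re

# Three or more repeats of one character are treated as elongation.
# Doubled letters are left alone because they can be legitimate spellings.
_ELONGATION = re.compile(r"(.)\1{2,}")


def collapse_elongation(token: str) -> str:
    """Reduce a run of 3+ identical characters to a single character."""
    return _ELONGATION.sub(r"\1", token)
```

Reducing to a single character is a simplification: when the underlying word really has a doubled letter, a lexicon lookup would be needed to decide between one and two.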

Arabic Lexical Variation
•  Arabic dialects vary widely lexically
•  Arabic orthography allows consolidating some variations

English | Table | Cat | Of | I_want | There_is | There_isn’t
MSA | Tāwila طاولة | qiTTa قطة | idafa Ø | ‘uridu اريد | yūjadu يوجد | lā yujadu لا يوجد
Moroccan | mida ميدة | qeTTa قطة | dyāl ديال | bγīt بغيت | kāyn كاين | mākāynš ماكاينش
Egyptian | Tarabēza طربيزة | ‘oTTa قطة | bitāς بتاع | ςāwez عاوز | fi في | mafiš مفيش
Syrian | Tāwle طاولة | bisse بسة | tabaς تبع | biddi بدي | fi في | māfi مافي
Iraqi | mēz ميز | bazzūna بزونة | māl مال | ‘arīd اريد | aku اكو | māku ماكو

CODA: A Conventional Orthography for Dialectal Arabic
•  Developed by CADIM for computational processing
•  Objectives
–  CODA covers all DAs, minimizing differences in choices
–  CODA is easy to learn and produce consistently
–  CODA is intuitive to readers unfamiliar with it
–  CODA uses Arabic script
•  Inspired by previous efforts from the LDC and linguistic studies

CODA Examples

CODA االمتحانات قبل اللي الفترة صحابي ماشفتش

gloss the exams before which the period my friends I did not see

Spelling variants

متحاناتإلا بلأ ـىاللـ هالفتر ـىصحابـ شفتشماـمتحاناتلـا بلا لليإ ةرطـالفـ حابيوصـ شفتشمـ

ناتـحـاالمتـ abl ـىللـإ هرطـالفـ ـىحابـوصـ فتشوماشـناتـحـمتـإلا qbl ـيلـا ilftra Su7abi فتشوشـماناتـحــمتـلـا qabl لىا sohaby فتشوشـمـ

ilimti7anat ـيإلـ mashoftish

limtihanaat إلى illi

CODA Examples

Phenomenon | Original | CODA
Spelling errors | الاجابه | الإجابة
Typos | شبب | سبب
Speech effects | كبيييييييير | كبير
Merges | اليومبريستيج | اليوم بريستيج
Splits | المع روف | المعروف
MSA root cognate | آلب، كلب | قلب
Dialectal clitic guidelines | عهلبيت، مشفناش | عهالبيت، ماشافناش
Unique dialect words | بردو، برضو | برضه

CODAfication: Raw Orthography to CODA Conversion
•  What:
–  Converts from raw DA orthography to CODA
–  Corrects typos and various speech effects
•  Approach
–  Eskander et al. (2012) (CODAFY)
   •  Model specific phenomena: hamza, plural wA suffix, etc.
   •  Supervised learning
   •  Classification problem
–  Farra et al. (2014)
   •  Generalized character replacement model
–  Best results: integrated in morphological analysis (MADA-ARZ)
•  Example:
–  Input: مشفتش صحابى الفتره الى فاتت (m$ft$ SHAbY Alftrh AlY fAtt)
–  Output: ما شفتش صحابي الفترة اللي فاتت (mA $ft$ SHAby Alftrp Ally fAtt)
•  Evaluation: Egyptian Arabic

System | CODAfication accuracy (tokens) | A/Y-normalized accuracy (tokens)
Baseline (doing nothing) | 76.8% | 90.5%
CODAFY v0.4 | 91.5% | 95.2%
MADA-ARZ | 92.9% | 95.5%
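The two accuracy columns differ only in whether Alif and Ya variants are collapsed before comparison. A minimal sketch of such a token-level metric follows. It assumes that "A/Y normalization" means mapping hamzated Alif forms to bare Alif and Alif Maqsura to Ya, and that system output and gold tokens are already aligned one-to-one; neither detail is spelled out on the slide, and the names are illustrative.

```python
# Assumed definition of Alif/Ya (A/Y) normalization:
# Alif variants -> bare Alif, Alif Maqsura -> Ya.
ALIF_YA = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ى": "ي"})


def token_accuracy(predicted, gold, normalize=False):
    """Fraction of aligned tokens that match, optionally after A/Y normalization."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("token sequences must be non-empty and aligned one-to-one")
    if normalize:
        predicted = [t.translate(ALIF_YA) for t in predicted]
        gold = [t.translate(ALIF_YA) for t in gold]
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

With the raw tokens صحابى and الفتره against the CODA forms صحابي and الفترة, exact accuracy is 0 while the normalized score is 0.5, since only the Ya difference is forgiven. Note that CODAfication can change the token count (مشفتش becomes ما شفتش), so a real evaluation needs an alignment step first.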

3Arrib: Arabizi-to-Arabic Conversion
•  A system for automatic mapping of Arabizi to Arabic script in CODA
•  Evaluation
–  Transliteration is correct for 83.6% of Arabic words and names.

ana msh 3aref a2ra elly enta katbo
AnA m$ EArf AqrA Ally Ant kAtbh
انا مش عارف اقرا اللي انت كاتبه

wfelaa5er tele3 fshenkw mab2raash arabic
wflAxr TlE f$nkw mab2raash ArAbyk
و+فال+اخر طلع فشنكو mab2raash ارابيك

(Al-Badrashiny et al., CoNLL 2014; Eskander et al., EMNLP Code Switch Workshop 2014)

3Arrib: http://nlp.ldeo.columbia.edu/arrib/


Dialectal Arabic Morphological Variation
•  Nouns
–  No case marking
   •  Word order implications
–  Paradigm reduction
   •  Consolidating masculine & feminine plural
•  Verbs
–  Paradigm reduction
   •  Loss of dual forms
   •  Consolidating masculine & feminine plural (2nd, 3rd person)
   •  Loss of morphological moods
      –  Subjunctive/jussive form dominates in some dialects
      –  Indicative form dominates in others
•  Other aspects increase in complexity

DA Morphological Variation: Verb Morphology

Labeled parts in the slide diagram: conj, verb, object, subj, tense, IOBJ, neg, neg

MSA: ولم تكتبوها له
/wa lam taktubūhā lahu/
/wa+ lam taktubū+hā la+hu/
and+ not_past write_you+it for+him

EGY: وماكتبتوهالوش
/wimakatabtuhalūʃ/
/wi+ma+katab+tu+ha+lū+ʃ/
and+not+wrote+you+it+for_him+not

‘And you didn’t write it for him’

DA Morphological Variation

Dialect | Perfect: Past | Imperfect: Subjunctive | Imperfect: Present habitual | Imperfect: Present progressive | Imperfect: Future
MSA | كتب /kataba/ | يكتب /jaktuba/ | يكتب /jaktubu/ | يكتب /jaktubu/ | سيكتب /sajaktubu/
LEV | كتب /katab/ | يكتب /jiktob/ | بيكتب /bjoktob/ | عم بيكتب /ʕam bjoktob/ | حيكتب /ħajiktob/
EGY | كتب /katab/ | يكتب /jiktib/ | بيكتب /bjiktib/ | بيكتب /bjiktib/ | هيكتب /hajiktib/
IRQ | كتب /kitab/ | يكتب /jiktib/ | يكتب /jiktib/ | ديكتب /dajiktib/ | رح يكتب /raħ jiktib/
MOR | كتب /kteb/ | يكتب /jekteb/ | كيكتب /kjekteb/ | كيكتب /kjekteb/ | غيكتب /ʁajekteb/

DA Morphological Variation: Verb Conjugation

Dialect | Perfect 1S | Perfect 2S♂ | Perfect 2S♀ | Imperfect 1S | Imperfect 1P | Imperfect 2S♀
MSA | كتبت /katabtu/ | كتبت /katabta/ | كتبت /katabti/ | اكتب /aktubu/ | نكتب /naktubu/ | تكتبين /taktubīna/, تكتبي /taktubī/
LEV | كتبت /katabt/ | كتبت /katabt/ | كتبتي /katabti/ | اكتب /aktob/ | نكتب /noktob/ | تكتبي /toktobi/
IRQ | كتبت /kitabt/ | كتبت /kitabt/ | كتبتي /kitabti/ | اكتب /aktib/ | نكتب /niktib/ | تكتبين /tikitbīn/
MOR | كتبت /ktebt/ | كتبتي /ktebti/ | كتبتي /ktebti/ | نكتب /nekteb/ | نكتبوا /nektebu/ | تكتبي /tektebi/

Morphological Ambiguity
•  Morphological richness
–  Token Arabic/English = 80%
–  Type Arabic/English = 200%
•  Morphological ambiguity
–  Each word: 12.3 analyses and 2.7 lemmas
•  Derivational ambiguity
–  العين: the eye, the water spring, Al-Ain city, the notable

Analysis vs. Disambiguation

هل سينجح بين أفليك في دور باتمان؟
Will Ben Affleck be a good Batman?

Analyses of بين:

POS | Form | Gloss
PV+PVSUFF_SUBJ:3MS | bay~an+a | He demonstrated
PV+PVSUFF_SUBJ:3FP | bay~an+~a | They demonstrated (f.p)
NOUN_PROP | biyn | Ben
ADJ | bay~in | Clear
PREP | bayn | Between, among

•  Morphological Analysis is out-of-context
•  Morphological Disambiguation is in-context


MORPHOLOGICAL ANALYZER
•  Rule-based
•  Human-created

MORPHOLOGICAL CLASSIFIERS
•  Multiple independent classifiers
•  Corpus-trained

RANKER
•  Heuristic or corpus-trained
•  Output: analyses ranked 1st, 2nd, 3rd, 4th, 5th

MADA (Habash & Rambow 2005; Roth et al. 2008)
MADAMIRA (Pasha et al., 2014)
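The analyzer, classifier and ranker stages can be illustrated with a toy ranker. This is a simplified sketch of the general idea (score each out-of-context analysis by how many in-context classifier predictions it agrees with), not the actual MADA or MADAMIRA implementation. The function name, the feature names, the uniform default weights and the classifier output shown are all assumptions made for the example.

```python
from typing import Dict, List, Optional

Analysis = Dict[str, str]


def rank_analyses(
    analyses: List[Analysis],
    predictions: Dict[str, str],
    weights: Optional[Dict[str, float]] = None,
) -> List[Analysis]:
    """Order analyzer candidates by weighted agreement with classifier predictions."""
    weights = weights or {}

    def score(analysis: Analysis) -> float:
        return sum(
            weights.get(feature, 1.0)
            for feature, value in predictions.items()
            if analysis.get(feature) == value
        )

    # sorted() is stable, so tied candidates keep the analyzer's original order.
    return sorted(analyses, key=score, reverse=True)


# Out-of-context candidates for the ambiguous word in the Ben Affleck sentence.
candidates = [
    {"pos": "verb", "form": "bay~an+a", "gloss": "he demonstrated"},
    {"pos": "noun_prop", "form": "biyn", "gloss": "Ben"},
    {"pos": "adj", "form": "bay~in", "gloss": "clear"},
    {"pos": "prep", "form": "bayn", "gloss": "between, among"},
]
# Hypothetical in-context classifier output for this word.
ranked = rank_analyses(candidates, {"pos": "noun_prop"})
```

The design point is that the analyzer proposes and the classifiers choose: the ranker never invents an analysis, it only reorders what the rule-based component produced.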

MADAMIRA
•  Newest tool from the CADIM group (Pasha et al., 2014)
•  Combines MADA (Habash & Rambow, 2005) and AMIRA (Diab et al., 2004)
–  Morphological disambiguation
–  Tokenization
–  Base phrase chunking
–  Named entity recognition
•  MSA and Egyptian Arabic modes
•  Server mode with XML interface
•  Online demo
–  http://nlp.ldeo.columbia.edu/madamira/
–  http://camel.abudhabi.nyu.edu/madamira/

Pipeline: Input Arabic Text → Morphological Disambiguation → Tokenization → Base Phrase Chunking → Named Entity Recognition → User NLP Applications

Morphological Disambiguation

System | MDMRA-MSA | MDMRA-MSA | MADA-ARZ | MADA-ARZ
Training data | MSA | MSA | ARZ | MSA+ARZ
Test set | MSA | EGY | EGY | EGY
All | 84.3% | 27.0% | 75.4% | 64.7%
POS + Features | 85.4% | 35.7% | 84.5% | 75.5%
Full diacritization | 86.4% | 32.2% | 83.2% | 72.2%
Lemmatization | 96.1% | 67.1% | 86.3% | 82.8%
Base POS-tagging | 96.1% | 82.1% | 91.1% | 91.4%
ATB segmentation | 99.1% | 90.5% | 97.4% | 97.5%

Example: wkAtb وكاتب ‘and (the) writer of’
wakAtibu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:0
w+ kAtb
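The analysis line in the example is a flat sequence of `key:value` features preceded by the diacritized form and the lemma. A minimal sketch of turning such a line into a dictionary follows; the function name is illustrative, and fields without a colon are simply skipped rather than interpreted.

```python
def parse_features(analysis: str) -> dict:
    """Collect the key:value fields of a whitespace-separated analysis line."""
    features = {}
    for field in analysis.split():
        key, sep, value = field.partition(":")
        if sep:  # fields without ':' (diacritized form, lemma) are ignored
            features[key] = value
    return features


line = (
    "wakAtibu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 "
    "asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:0"
)
features = parse_features(line)
```

Here `features["prc2"]` holds the conjunction proclitic `wa_conj`, which is what licenses the `w+ kAtb` tokenization shown on the slide.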

CALIMA-Egyptian v0.5
•  CALIMA is the Columbia Arabic Language Morphological Analyzer
•  CALIMA-EGY
–  Extends the Egyptian Colloquial Arabic Lexicon (ECAL) (Kilany et al., 2002) and the Standard Arabic Morphological Analyzer (SAMA) (Graff et al., 2009).
–  Follows the part-of-speech (POS) guidelines used by the LDC for Egyptian Arabic (Maamouri et al., 2012b).
–  Accepts multiple orthographic variants and normalizes them to CODA (Habash et al., 2012).
–  Incorporates annotations by the LDC for Egyptian Arabic (~250K words).

CALIMA-ARZ Example

Input: mktbtlhA$ مكتبتلهاش

Analysis 1
Lemma: katab_1
CODA: mA_katabt_lahA$
POS: mA/NEG_PART+katab/PV+t/PVSUFF_SUBJ:2MS++li/PREP+hA/PRON_3FS+$/NEG_PART
Gloss: not+write+you+to/for+it/them/her+not

Analysis 2
Lemma: katab_1
CODA: mA_katabit_lahA$
POS: mA/NEG_PART+katab/PV+it/PVSUFF_SUBJ:3FS+li/PREP+hA/PRON_3FS+$/NEG_PART
Gloss: not+write+she/it/they+to/for+it/them/her+not

CALIMA-Egyptian v0.5
•  Incorporates LDC ARZ annotations (p1-p6)
–  251K tokens, 52K types
–  Annotation cleanup needed
–  Extends SAMA (Standard Arabic Morph Analyser)

System | Token recall | Type recall
SAMA v3.1 (Standard Arabic) | 67.7% | 59.7%
CALIMA-EGY v0.5 (Egyptian core) | 88.7% | 75.8%
CALIMA-EGY v0.5 (++SAMA dialect extensions) | 92.6% | 81.5%
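Token recall and type recall answer different questions: the first weights every running word, the second counts each distinct word form once. A minimal sketch of the two measures, assuming recall here means the share of words for which the analyzer covers the gold analysis (the `is_covered` callback is a hypothetical stand-in for that check):

```python
from typing import Callable, Iterable, Tuple


def token_and_type_recall(
    words: Iterable[str], is_covered: Callable[[str], bool]
) -> Tuple[float, float]:
    """Return (token recall, type recall) for a list of running words."""
    tokens = list(words)
    if not tokens:
        raise ValueError("empty word list")
    types = set(tokens)
    token_recall = sum(is_covered(w) for w in tokens) / len(tokens)
    type_recall = sum(is_covered(w) for w in types) / len(types)
    return token_recall, type_recall
```

Because frequent words tend to be covered first, token recall is usually the higher of the two, which matches the pattern in the table.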


Morphological Disambiguation

System | MDMRA-MSA | MDMRA-MSA | MDMRA-EGY | MDMRA-EGY
Training data | MSA | MSA | EGY | MSA+EGY
Test set | MSA | Egyptian Arabic (EGY) | Egyptian Arabic (EGY) | Egyptian Arabic (EGY)
All | 84.3% | 27.0% | 75.4% | 64.7%
POS + Features | 85.4% | 35.7% | 84.5% | 75.5%
Full diacritization | 86.4% | 32.2% | 83.2% | 72.2%
Lemmatization | 96.1% | 67.1% | 86.3% | 82.8%
Base POS-tagging | 96.1% | 82.1% | 91.1% | 91.4%
ATB segmentation | 99.1% | 90.5% | 97.4% | 97.5%

MADAMIRA: http://camel.abudhabi.nyu.edu/madamira/


Towards Morphological Tagging of a New Dialect?
•  Review the literature
–  Hidden gems from previous efforts
•  Data collection
•  Data annotation
–  Guidelines: CODA, POS tags, etc.
–  Noisy automatic processing: Egyptian MADAMIRA?
–  Training annotators, quality control
–  This is necessary at least for benchmarking
•  Building the morphological analyzer
–  Eskander et al.’s (2013) technique for paradigm completion
–  Salloum and Habash’s (2011) ADAM method for extending MSA
•  Building the morphological tagger
–  MADAMIRA framework, e.g. Egyptian Arabic (Habash et al. 2012)
–  Other tagging techniques


•  Curras Corpus (Jarrar et al., 2014)

•  Gumar Corpus (Khalifa et al., 2016)

The Gumar Corpus: A Morphologically Annotated Corpus of Gulf Arabic
•  ~100 million words
•  Mainly long conversational novels published anonymously online (روايات النت ‘Internet novels’).
•  Writers of the novels remain anonymous under pen names. Although there is no claim of copyright, it is conventional to credit the writer when the material is copied or transferred, as per the writer’s request.

السالم علیكم

القصه هاذي قطریه روعه أتمنى انها تعجبكم طبعا أهي قصه منقوله من منتدى ثاني

وطبعا مصرحه الكاتبه نقل القصه مع ذكر اسمها وهي الكاتبهتحفه فنیه )) القطریه ((

نبدأ .....

الكاتبة تحفة فنیة

الفصل االول :-

وضحة والتوتر بدا یظهر علیها : الجازي ماتدرین عمي متى بیجي ؟ الجازي : واهللا یختس مدري بس ماهو باطي ،اله انت وشعندس الیوم على ابوي

؟ اخبرس ما تحبین مقعاد معاه ؟توترها : سالمتس بس بغیت اسلم علیه قبل ما وضحة وهي تحاول السیطرة على یجي حمد و نروح البیت ، قدلي كم مرة اجي وال القاه عد مهب عدله من زمان

ماوجهته . الجازي وهي تغمز عینها : ماوجهتي ابوي وال تنطرین ناس ؟

خجل على طول صار وجه وضحة احمر مثل الطماطم ، والجازي اعتبرت انه وتمت تضحك على وضحة ما تدري ان سبب احمرار وضحة هو القهر وجرح

الكرامة الى تحس به من بدت تلمح عن راشد و تقول في نفسها ماتدرین یالجازي، وفي هذه اللحظة انزلت علیهم ام راشد مرت عم ان اتمنه العمى وال اشوفه

وضحة جایه من غرفتها وفي ایدها كیسه كبیره ومدته على وضحة وهي تقول :خلها توزعه كلن وضحة یمس هذي صوغتن لكم من عند راشد عطیها امس

تعطیه حقه .

An example of raw text (Qatari) from a novel

Gumar Corpus Statistics

Words | 112,410,688
Sentences | 9,335,224
Documents | 1,236

•  Words are whitespace tokenized and the counts include punctuation.
•  The number of sentences represents the number of lines.
•  Each document generally represents a single novel.
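The counting conventions above are simple enough to state in code. A minimal sketch, assuming each document is available as one string and that blank lines are not counted as sentences (the slide does not say how blank lines were handled, and `corpus_stats` is an illustrative name):

```python
from typing import Dict, Iterable


def corpus_stats(documents: Iterable[str]) -> Dict[str, int]:
    """Count whitespace tokens, non-empty lines and documents."""
    words = sentences = n_docs = 0
    for text in documents:
        n_docs += 1
        # str.split() with no argument splits on any whitespace, so
        # free-standing punctuation marks are counted as words.
        words += len(text.split())
        # Assumption: a "sentence" is any non-empty line.
        sentences += sum(1 for line in text.splitlines() if line.strip())
    return {"words": words, "sentences": sentences, "documents": n_docs}
```

Punctuation attached to a word (as in `جاهز؟`) stays part of that token under this scheme, so the word count is an upper-level approximation rather than a linguistic token count.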

Gumar Corpus Dialect Distribution (Document level)

Dialect | Percentage
SA | 60.52
AE | 13.35
KW | 5.91
OM | 1.13
QA | 0.65
BH | 0.94
GA (other) | 10.03
Arabic (other) | 7.93

•  92% of the corpus is written in GA, with SA being the most dominant.
•  GA (other) covers novels that combine several GA dialects, as well as cases of dialect ambiguity (especially between OM, QA and AE).
•  The rest of the corpus (7.93%) is mostly MSA (original text or translation attempts of existing non-Arabic text) and other DA such as Egyptian, Iraqi, Levantine, etc.

Morphological Analysis Evaluation
•  A preliminary investigation into GA annotation was performed.
•  4000 words from the text were annotated manually for:
–  Orthography (CODA)
–  Morphology (tokenization)
–  Part-of-speech
–  Lemma
•  The same text was given to MADAMIRA (MSA & EGY)
–  Outputs were then evaluated against the gold standard.

Gulf CODA
•  CODA: Conventional Orthography for Dialectal Arabic (Habash et al. 2012).
•  CODA guidelines exist for both EGY and PAL (Palestinian Arabic).
•  CODA guidelines for different dialects share general rules that apply to all of them.
•  Exceptional cases differ from one dialect to another.

Gulf CODA
•  One main feature that differs among dialects is the set of root consonant mapping rules.
•  General rules: spelling of Al, Ta Marbuta, clitic attachment
•  Other examples of specific spelling…
سيدا، مب، مانيب، +ج\+ك

MSA/CODA | Variants | CODA compliant | CODA non-compliant
قدام | /q/ or /ɡ/ or /ʤ/ | ق | جدام
كبد، كذب | /k/ or /ʧ/ or /ts/ | ك | جبد، تسذب
جلس | /ʤ/ or /j/ | ج | يلس
شاي | /ʃ/ or /ʧ/ | ش | چاي

CODAfied text examples

Example 1
Raw: ياويلتس منتس هالحتسي اسمع
CODA: ياويلج منج هالحكي اسمع
English:

Example 2
Raw: جاهز؟ الغدى عسى
CODA: جاهز؟ الغدا عسى
English:

Example 3
Raw: الجامعهفياللحنياناصغيررونهمنيبساره
CODA: الجامعةفيالحنياناصغيرونةمانيبسارة
English:

An Annotation Example


Morphological Analysis Evaluation
•  Accuracy of the automatic output of MADAMIRA in two modes (MSA and EGY), measured against the manually annotated features
•  MADAMIRA-EGY outperforms MADAMIRA-MSA on all four features, confirming that it is the better baseline for manual annotation.
•  Similar conclusions were reported by Jarrar et al. (2014)

Feature | MADAMIRA-MSA | MADAMIRA-EGY
Ortho | 83.81 | 88.34
Morph | 76.16 | 83.62
POS | 72.37 | 80.39
Lemma | 64.03 | 81.51
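A per-feature comparison of this kind can be sketched as follows. The record layout (one dictionary per word with `ortho`, `morph`, `pos` and `lemma` keys) and the function name are assumptions for illustration; the slides do not describe the evaluation script itself.

```python
from typing import Dict, List, Sequence


def feature_accuracy(
    gold: List[Dict[str, str]],
    predicted: List[Dict[str, str]],
    features: Sequence[str] = ("ortho", "morph", "pos", "lemma"),
) -> Dict[str, float]:
    """Percentage of words whose predicted value matches gold, per feature."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and word-aligned")
    return {
        f: 100.0 * sum(g[f] == p.get(f) for g, p in zip(gold, predicted)) / len(gold)
        for f in features
    }
```

Running it once per MADAMIRA mode against the same gold annotation would yield the two columns of a table like the one above.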

Summary & Future Directions
•  Arabic dialects pose many challenges to NLP
–  No orthographic standards
–  Limited resources
–  Large number of differences from MSA
•  A combination of solutions works best
–  Exploit similarities between dialects and MSA
–  Exploit similarities among dialects
–  Address differences through resource building
•  Our goal is to bring basic support for MSA and dialects to the level of English
–  So we can focus more on higher-level applications!

Summary & Future Directions
Although dialect processing may seem daunting, just remember:
•  Breathe! There are rules in the dialects. Just not the same rules as the ones in MSA.
•  All these challenges are amazing opportunities to advance NLP
–  Not just for Arabic but for all languages.
•  For Arabic native speakers, working with dialects is an eye opener (and can be a lot of fun!)

Announcements
•  Project MADAR
–  Multi-Arabic Dialect Applications and Resources
–  QNRF funded project
–  Collaboration among CMUQ, NYUAD and Columbia
–  Modeling 25 Arabic city dialects
   •  Lexical resources, parallel data, dialect id, dialect MT
–  Looking for linguists and postdocs!
•  WARDAT 2016
–  First Workshop on Arabic Dialect Technologies
–  Discuss future of collaborations on Arabic Dialect Technologies
–  Funded by the NYUAD Institute; to be held in NYU Abu Dhabi
–  By invitation. Limited slots. Contact me if interested.
•  CAMeL Lab
–  Hiring postdocs!
–  Funded NYU PhD Program in Computer Science.
–  Contact me if interested.


•  http://nyuad.nyu.edu/en/


Thank You! Questions?
