arTenTen: A new, vast corpus for Arabic
Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel
MIT / Columbia / Lexical Computing Ltd. / Univ Saarlandes / Masaryk Univ Cz
We all want corpora to be
•Bigger
•Better
▫More text types
▫Richer metadata
▫Cleaner
▫Better linguistic processing
Arabic
•Since 2003: Arabic Gigaword
▫Good on most fronts except variety
▫Newswire only
•Leeds
▫2005 Arabic web corpus (oldish)
•Others
▫Mostly small, not available, or newswire
arTenTen
•TenTen family (see paper in main conference)
▫Web crawled: SpiderLing (Pomikalek and Suchomel, WAC 2012)
▫Cleaning and deduplication: jusText, Onion (Pomikalek)
Size
•5.8 billion space-separated tokens
•Fully processed: 200M words
▫Tokenise, lemmatise, POS-tag by MADA (Columbia U)
▫Sketch grammar: new work (Belinkov)
Varieties/dialects
•We don’t know yet
Availability
•In Sketch Engine
•Demo
Encoding
•‘Vertical’ format (Sketch Engine input format)
▫One word per line, twenty-nine tab-separated columns
▫Structural markup: XML
•For each word:
▫word (as written, in Arabic), trans, diac, lemma, lemma_ar, non_voc_lemma, non_voc_lemma_ar, stem, tag, bw
▫pref3, pref3tag, pref2, pref2tag, pref1, pref1tag, pref0, pref0tag
▫person, aspect, vox, modus, gender, number, state, case, enclitic, gloss, source
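Reading the vertical format: a minimal sketch
The encoding described on this slide (one word per line, twenty-nine tab-separated columns, XML structural markup) could be read roughly as below. This is an illustrative sketch, not the Sketch Engine loader: the function name parse_vertical, the rule that a line starting with "<" and ending with ">" is structural markup, and the sample values (including the "_" placeholders) are all assumptions made for the example. The column names are taken from the slide, reading the final entry as the two columns gloss and source.

```python
from typing import Dict, Iterable, Iterator, List

# Column names as listed on the slide (29 in total).
COLUMNS: List[str] = [
    "word", "trans", "diac", "lemma", "lemma_ar",
    "non_voc_lemma", "non_voc_lemma_ar", "stem", "tag", "bw",
    "pref3", "pref3tag", "pref2", "pref2tag",
    "pref1", "pref1tag", "pref0", "pref0tag",
    "person", "aspect", "vox", "modus", "gender", "number",
    "state", "case", "enclitic", "gloss", "source",
]


def parse_vertical(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one record per non-empty line of a vertical-format file.

    Assumption: a line that starts with '<' and ends with '>' is
    structural (XML) markup; any other line is a token row.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("<") and line.endswith(">"):
            yield {"type": "structure", "tag": line}
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}"
            )
        yield {"type": "token", "attrs": dict(zip(COLUMNS, fields))}


# Hypothetical sample: one token row with placeholder values ("_")
# in every column except the first, wrapped in made-up <doc>/<s> tags.
sample_token = "\t".join(["كتاب"] + ["_"] * (len(COLUMNS) - 1))
sample = ["<doc>", "<s>", sample_token, "</s>", "</doc>"]

records = list(parse_vertical(sample))
tokens = [r["attrs"] for r in records if r["type"] == "token"]
```

Keeping the column names in a single list means the twenty-nine-column check and the field lookup (for example tokens[0]["lemma"]) stay in step if the column inventory changes.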