arTenTen
A new, vast corpus for Arabic
Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel
MIT / Columbia / Lexical Computing Ltd. / Univ Saarlandes / Masaryk Univ Cz
We all want corpora to be
• Bigger
• Better
  ▫ More text types
  ▫ Richer metadata
  ▫ Cleaner
  ▫ Better linguistic processing
Arabic
• Since 2003: Arabic Gigaword
    Good on most fronts except variety
    Newswire only
• Leeds
  ▫ 2005 Arabic web corpus (oldish)
• Others
  ▫ Mostly small or not available or newswire
arTenTen
• TenTen family
    See paper in main conference
  ▫ Web crawled
      SpiderLing (Pomikalek and Suchomel, WAC 2012)
  ▫ Cleaning and deduplication
      jusText, Onion (Pomikalek)
Size
• 5.8 billion space-separated tokens
Fully processed:
• 200M words
    Tokenise, lemmatise, POS-tag by MADA, Columbia U
    Sketch grammar: new work (Belinkov)
Varieties/dialects
• We don’t know yet
Availability
• In Sketch Engine
• demo
Encoding
• ‘Vertical’ format
    Sketch Engine input format
  ▫ One word per line, tab-separated columns
      Twenty-nine columns
  ▫ Structural markup: XML
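As a rough illustration of the vertical format described above, the minimal sketch below separates structural markup lines from token lines. The function name `read_vertical`, the tag names `<doc>` and `<s>`, and the three-column sample line (with made-up column values) are all assumptions for illustration; they are not taken from the arTenTen release, which has twenty-nine columns.

```python
from typing import Iterable, Iterator, List, Tuple


def read_vertical(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield ("tag", [markup]) for structural lines and
    ("token", columns) for one-word-per-line token lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Simplification: any line of the form <...> is treated as markup.
        if line.startswith("<") and line.endswith(">"):
            yield ("tag", [line])
        else:
            yield ("token", line.split("\t"))


# Hypothetical sample: one document, one sentence, one token with 3 columns.
sample = "<doc>\n<s>\nكتاب\tkitAb\tnoun\n</s>\n</doc>\n"
events = list(read_vertical(sample.splitlines()))
```

Keeping markup and tokens as separate event types lets a caller rebuild sentences or documents without assuming any particular set of structure names.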
For each word
word (as written, in Arabic), trans, diac, lemma, lemma_ar, non_voc_lemma, non_voc_lemma_ar, stem, tag, bw,
pref3, pref3tag, pref2, pref2tag, pref1, pref1tag, pref0, pref0tag,
person, aspect, vox, modus, gender, number, state, case, enclitic, gloss, source
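The attribute list above can be turned into a small record parser. This is only a sketch: it assumes the columns appear in the file in the order listed on the slide, and the names `COLUMNS` and `parse_token_line` are hypothetical. The test line uses placeholder values, not real corpus data.

```python
from typing import Dict

# Attribute names as listed on the slide; the file order is an assumption.
COLUMNS = [
    "word", "trans", "diac", "lemma", "lemma_ar", "non_voc_lemma",
    "non_voc_lemma_ar", "stem", "tag", "bw",
    "pref3", "pref3tag", "pref2", "pref2tag",
    "pref1", "pref1tag", "pref0", "pref0tag",
    "person", "aspect", "vox", "modus", "gender", "number", "state",
    "case", "enclitic", "gloss", "source",
]


def parse_token_line(line: str) -> Dict[str, str]:
    """Map one tab-separated token line to a dict keyed by attribute name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} columns, got {len(values)}"
        )
    return dict(zip(COLUMNS, values))


# Placeholder line: column i holds the string "v<i>".
placeholder = "\t".join(f"v{i}" for i in range(len(COLUMNS)))
record = parse_token_line(placeholder)
```

Rejecting lines with the wrong column count is a deliberate choice here: a silent mismatch would shift every attribute after the missing one.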