arTenTen
A new, vast corpus for Arabic
Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel
MIT / Columbia / Lexical Computing Ltd. / Univ Saarlandes / Masaryk Univ Cz

Post on 03-Jan-2016



We all want corpora to be

• Bigger
• Better
  ▫ More text types
  ▫ Richer metadata
  ▫ Cleaner
  ▫ Better linguistic processing


Arabic

• Since 2003: Arabic Gigaword
  ▫ Good on most fronts except variety
  ▫ Newswire only
• Leeds
  ▫ 2005 Arabic web corpus (oldish)
• Others
  ▫ Mostly small, not available, or newswire


arTenTen

• TenTen family
  ▫ See paper in main conference
  ▫ Web crawled: SpiderLing (Pomikalek and Suchomel, WAC 2012)
  ▫ Cleaning and deduplication: jusText, Onion (Pomikalek)


Size

• 5.8 billion space-separated tokens

Fully processed:
• 200M words
  ▫ Tokenise, lemmatise, POS-tag by MADA, Columbia U
  ▫ Sketch grammar: new work (Belinkov)

Varieties/dialects
• We don’t know yet


Availability

• In Sketch Engine
• Demo


Encoding

• ‘Vertical’ format
  ▫ Sketch Engine input format
  ▫ One word per line, tab-separated columns (twenty-nine columns)
  ▫ Structural markup: XML


For each word

• word (as written, in Arabic), trans, diac
• lemma, lemma_ar, non_voc_lemma, non_voc_lemma_ar, stem
• tag, bw
• pref3, pref3tag, pref2, pref2tag, pref1, pref1tag, pref0, pref0tag
• person, aspect, vox, modus, gender, number, state, case
• enclitic, gloss, source
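
To make the encoding concrete, here is a minimal Python sketch of reading such a file. It assumes that the 29 column names appear in the file in the same order as on the slide, and that any line that starts with `<` and ends with `>` is structural markup rather than a word. The function name `read_vertical` and the sample values are hypothetical, chosen only for illustration; this is not Sketch Engine's or MADA's own code.

```python
import io

# Column names taken from the slide. Assumption: the slide lists them in
# the same order as the tab-separated columns of the vertical file.
FIELDS = [
    "word", "trans", "diac",
    "lemma", "lemma_ar", "non_voc_lemma", "non_voc_lemma_ar", "stem",
    "tag", "bw",
    "pref3", "pref3tag", "pref2", "pref2tag",
    "pref1", "pref1tag", "pref0", "pref0tag",
    "person", "aspect", "vox", "modus", "gender", "number", "state", "case",
    "enclitic", "gloss", "source",
]


def read_vertical(lines):
    """Yield ("struct", text) for markup lines and ("token", dict) for word lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # Heuristic (an assumption): XML-like structural markup sits alone on a line.
        if line.startswith("<") and line.endswith(">"):
            yield ("struct", line)
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} columns, got {len(values)}"
            )
        yield ("token", dict(zip(FIELDS, values)))


# Usage with made-up placeholder values (v0 ... v28), not real corpus data.
sample_row = "\t".join(f"v{i}" for i in range(len(FIELDS)))
sample = io.StringIO("<doc>\n<s>\n" + sample_row + "\n</s>\n</doc>\n")
items = list(read_vertical(sample))
tokens = [fields for kind, fields in items if kind == "token"]
```

Keeping each token as a dictionary keyed by column name means later steps can refer to `lemma` or `tag` by name instead of by column position.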