p03- mandiac: a web-based annotation system for manual arabic diacritization

Post on 15-Feb-2017

MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization

Collaborators: Houda Bouamor, Wajdi Zaghouani, Mahmoud Ghoneim, Abdelati Hawwari, Mona Diab and Kemal Oflazer

Ossama Obeid
Carnegie Mellon University in Qatar

owo@qatar.cmu.edu

Introduction
• Arabic text is composed of consonants, long vowels, and short vowels (diacritics).
• Absence of diacritics:
  o Adds lexical and morphological ambiguity.
  o Confusing to beginners.
  o Impacts performance of Arabic NLP tasks.
• Very few texts are diacritized.

Introduction

Possible pronunciation and meanings of the undiacritized Arabic word ذكر.

Introduction
• Most automatic diacritization systems are trained on Arabic Treebanks.
• Different genres and dialects need new datasets:
  o Building them is time-consuming.
  o Data quality and consistency must be ensured.

Currently Available Annotation Tools
• Very basic text-editor-like interfaces.
• Can’t handle a large number of documents and annotators.
• Not easily customizable.

MANDIAC
• Web-based.
• Intuitive and easy to use.
• Easily manages thousands of documents.
• Distributes tasks (including IAA evaluation tasks) to tens of annotators.
• Doubles annotation speed!
• Based on QAWI.
• Provides Annotation and Annotation Management interfaces.

Annotation Interface
• Token-based annotation system similar to QAWI.
• Annotators can choose pre-computed diacritizations (derived using MADAMIRA) and/or manually edit diacritics.
• Additional features to increase annotator productivity.

Annotation Interface
Extra Features:
• Undo/Redo buttons
• Edits restricted to diacritics only
• Timer
• Counter indicating number of words left to annotate
• Link to annotation guidelines
• Token highlighting:
  o Annotated words
  o Tokens that should not be edited (e.g., digits, non-Arabic words, punctuation)
• Flag documents
• Mark tokens as ambiguous
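Two of the restrictions above (edits limited to diacritics, and tokens such as digits, non-Arabic words, and punctuation that should not be edited) could be checked roughly as follows. This is a minimal Python sketch, not MANDIAC's actual code: the function names, the diacritic range (U+064B to U+0652 plus U+0670), and the Arabic letter range (U+0621 to U+064A) are assumptions made for illustration.

```python
import re

# Assumed diacritic set: tanween, short vowels, shadda, sukun, superscript alef.
DIACRITICS_RE = re.compile("[\u064B-\u0652\u0670]")
# Assumed letter set: the basic Arabic letter range (hamza through yeh).
ARABIC_LETTER_RE = re.compile("[\u0621-\u064A]")


def strip_diacritics(text):
    """Remove every character in the assumed diacritic set."""
    return DIACRITICS_RE.sub("", text)


def is_editable(token):
    """True if the token consists only of Arabic letters plus diacritics."""
    bare = strip_diacritics(token)
    return bool(bare) and all(ARABIC_LETTER_RE.fullmatch(ch) for ch in bare)


def is_diacritics_only_edit(original, edited):
    """True if the edit added, removed, or changed diacritics and nothing else."""
    return strip_diacritics(original) == strip_diacritics(edited)


# The undiacritized word from the Introduction (U+0630 U+0643 U+0631) ...
bare_word = "\u0630\u0643\u0631"
# ... and one fully diacritized reading of it (a fatha on every letter).
diacritized = "\u0630\u064E\u0643\u064E\u0631\u064E"

print(is_editable(bare_word))                           # Arabic word
print(is_editable("2017"))                              # digits
print(is_diacritics_only_edit(bare_word, diacritized))  # only diacritics added
```

An interface could use `is_editable` to decide which tokens to grey out and `is_diacritics_only_edit` to reject an edit that touches the base letters.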

Annotation Interface

Annotation Interface at a glance

Annotation Interface

Dropdown showing top 3 automatically diacritized candidates.

Manual token editor

Management Interface
User Management:
• Add/remove users.
• Add users to annotation groups.
• Display user activity log and statistics.

Management Interface
Annotation Workflow Management:
• Upload files in various formats.
• Organize files into groups.
• Assign files to individuals or to a group (for IAA).
• Highlight tasks as untouched, edited, or completed.
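One way to picture the assignment step is a function that gives a random share of the files to every annotator in a group (for IAA) and spreads the rest across individuals. The Python sketch below is only an illustration under stated assumptions: `assign_documents`, its parameters, and the round-robin split are hypothetical, and the default share of 0.1 reflects one reading of the 10% IAA figure given in these slides.

```python
import random


def assign_documents(doc_ids, annotators, iaa_fraction=0.1, seed=0):
    """Split documents into shared IAA tasks and individual tasks.

    Returns (assignments, iaa_docs), where assignments maps each annotator
    to a list of document ids and iaa_docs is the set of shared documents.
    """
    if not annotators:
        raise ValueError("at least one annotator is required")
    rng = random.Random(seed)      # seeded so the split is reproducible
    docs = list(doc_ids)
    rng.shuffle(docs)

    n_iaa = round(len(docs) * iaa_fraction)
    iaa_docs, single_docs = docs[:n_iaa], docs[n_iaa:]

    # Every annotator receives all of the IAA documents ...
    assignments = {name: list(iaa_docs) for name in annotators}
    # ... and the remaining documents are dealt out round-robin.
    for i, doc in enumerate(single_docs):
        assignments[annotators[i % len(annotators)]].append(doc)
    return assignments, set(iaa_docs)


assignments, iaa_docs = assign_documents(range(100), ["a1", "a2", "a3", "a4", "a5"])
print(len(iaa_docs), {name: len(docs) for name, docs in assignments.items()})
```

Because the shared documents are annotated by everyone in the group, they can later be compared pairwise to estimate agreement.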

Management Interface
Evaluation and Monitoring:
• Evaluate IAA.
• Compare annotations to gold reference.
• Use WER and DER as metrics.
• 10% of assigned documents are randomly assigned for IAA.
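The slides name WER and DER without defining them. A common reading is that DER (diacritic error rate) is the share of letters whose diacritics differ from the reference, and WER (word error rate) is the share of words containing at least one such letter. The Python sketch below assumes that reading and assumes both texts have the same tokens and base letters. The function names and the diacritic set are illustrative, not taken from MANDIAC.

```python
# Assumed diacritic set: U+064B..U+0652 (tanween, short vowels, shadda, sukun)
# plus U+0670 (superscript alef).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def split_letters(word):
    """Return a list of (base_letter, diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append((ch, ""))
        elif pairs:  # attach the mark to the preceding base letter
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
    return pairs


def der_wer(reference, hypothesis):
    """Compute (DER, WER) for two whitespace-tokenized diacritized texts."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("token counts differ")
    letters = letter_errors = word_errors = 0
    for ref, hyp in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = split_letters(ref), split_letters(hyp)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("base letters differ")
        # Sort the marks so that shadda+vowel order does not matter.
        wrong = sum(sorted(r) != sorted(h)
                    for (_, r), (_, h) in zip(ref_pairs, hyp_pairs))
        letters += len(ref_pairs)
        letter_errors += wrong
        word_errors += wrong > 0
    der = letter_errors / letters if letters else 0.0
    wer = word_errors / len(ref_words) if ref_words else 0.0
    return der, wer


# Two words, eight letters; the hypothesis gets two letters of word one wrong.
gold = "\u0630\u064E\u0643\u064E\u0631\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
pred = "\u0630\u064F\u0643\u0650\u0631\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
print(der_wer(gold, pred))  # (0.25, 0.5)
```

The same function can score an annotator against a gold reference or two annotators against each other for IAA. Published definitions differ on details such as whether word-final case endings are counted, so the numbers are only comparable under one fixed definition.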

Management Interface

User management view

Management Interface

Task assignment popup

Task list view

System Design and Architecture
• Four main components:
  o Annotation interface
  o Management interface
  o Back-end server
  o MADAMIRA

Component interaction diagram

System Design and Architecture
Data storage:
• Relational database (SQL):
  o Fast data search and retrieval.
  o Almost any SQL database can be used.
• Annotation data stored as JSON blobs:
  o Flexible data format.
  o Quickly add new functionality and annotation modes with little back-end modification.
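The storage idea (relational columns for lookup, a JSON blob for the annotation payload) can be sketched with Python's built-in `sqlite3` and `json` modules. The table layout, column names, and JSON fields below are hypothetical, since the slides do not show MANDIAC's schema. SQLite stands in for whichever SQL database is used.

```python
import json
import sqlite3


def create_schema(conn):
    """Create a hypothetical annotations table with a JSON payload column."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS annotations (
               id INTEGER PRIMARY KEY,
               document_id INTEGER NOT NULL,
               annotator TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'untouched',
               data TEXT NOT NULL  -- annotation payload, stored as a JSON blob
           )"""
    )


def save_annotation(conn, document_id, annotator, status, data):
    """Serialize the annotation payload to JSON and insert one row."""
    conn.execute(
        "INSERT INTO annotations (document_id, annotator, status, data) "
        "VALUES (?, ?, ?, ?)",
        (document_id, annotator, status, json.dumps(data, ensure_ascii=False)),
    )
    conn.commit()


def load_annotations(conn, document_id):
    """Return (annotator, status, payload) tuples for one document."""
    rows = conn.execute(
        "SELECT annotator, status, data FROM annotations "
        "WHERE document_id = ? ORDER BY id",
        (document_id,),
    ).fetchall()
    return [(a, s, json.loads(d)) for a, s, d in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
# A new field such as "ambiguous" needs no schema change: it lives in the blob.
payload = {"tokens": [{"index": 0, "text": "\u0630\u064E\u0643\u064E\u0631\u064E",
                       "ambiguous": False}]}
save_annotation(conn, 1, "annotator_1", "edited", payload)
print(load_annotations(conn, 1))
```

The relational columns keep common queries (by document, annotator, or status) fast, while the blob lets a new annotation mode add fields without a schema migration.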

Evaluation
Experimental setup:
• Around 1,500 words were extracted from the Penn Arabic Treebank.
• Five annotators were asked to fully diacritize the extracted words:
  o First half of the text using a text editor.
  o Second half of the text with MANDIAC:
    − Use automatically diacritized candidate if possible.
    − Manually edit otherwise.

Evaluation

• Experimental results:
  o Using a text editor: 302 words/hour
  o Using MANDIAC: 618 words/hour (about twice the text-editor rate)
• Using the text editor introduced typos.

Acknowledgements
• This project has been funded by the Qatar National Research Fund (grant NPRP 6-1020-1-199).
• We also thank the annotators for their feedback on MANDIAC.

Thank You!
