A call to action: Using data science in the advancement of African languages
Dr. Vukosi Marivate & Collaborators
ABSA UP Chair of Data Science
Dept. of Computer Science, UP
Overview
▪ Setting the scene [Why do we care?]
▪ How did we get here?
▪ How can we tackle these challenges?
▪ Some results and future work
Language, AI and ML
Artificial Intelligence (AI) & Machine Learning (ML)
● Text and language are a rich interface for sharing information and interacting with machines.
● We need to ask ourselves a few questions.
  ○ How do machines process language information?
  ○ Why is local language important?
Brief overview of AI/ML
Artificial Intelligence (AI) & Machine Learning (ML)
Russell and Norvig
Manning
● AI
  ○ Machine
  ○ In an environment (can perceive it)
  ○ Perform Actions
  ○ Reach a Goal
● ML
  ○ Learning patterns from data
● Natural Language Processing (NLP)
  ○ Learning language tasks
Challenges with African Languages
Use Case: South Africa
● Lack of sufficient language resources.
● Inequality in data availability.
● Publicly available annotated datasets (for different NLP tasks) are rare.
● How can we innovate in collection, curation, annotation and classification?
Why Data is Important
Where is the data?
A framework to understand the challenge
Martinus and Abbott 2019 https://arxiv.org/pdf/1906.05685.pdf
● Low Availability
● Discoverability
● Focus
● Reproducibility and Benchmarks
Languages world map [Wikimedia]
Importance of indigenous languages
UNESCO and DSI reporting on indigenous languages
● What does language capture?
  ○ Indigenous knowledge
  ○ Culture
● How did we get here?
  ○ Inequality of language
  ○ Colonial legacies
  ○ Move to monolingualism
  ○ Lack of Data
  ○ Who develops the systems
Importance of indigenous languages
UNESCO and DSI reporting on indigenous languages
● Internet is becoming more and more monolingual.
● How do we increase access to local populations?
● How
So what are we going to do about it?
Some thoughts on
Expand the current community of practice across the African Continent and Global South
Innovate on what has come before!
● We have great new tools in ML/DL, exploit them.
● Expand data availability and gathering.
Questioning the Status Quo
● Civil Disobedience?
● Can we afford to wait longer?
● Tapping into our youth.
How do we move forward
Martinus and Abbott 2019
● Better public understanding
● Collect, collate and annotate data
● Expanding practice and skill: building community
Current and Future Directions
▪ More data collection, curation and annotation
  ▪ See AI4D and Lacuna Data Set Challenges
▪ Ongoing research at DSFSI
  ▪ Enhancement of ML pipelines for low resource scenarios
  ▪ New models for augmentation
  ▪ Dataset Curation & Pretrained Models (Masakhane isiZulu)
  ▪ Masakhane Web Tools
  ▪ Teaching NLP https://dsfsi.github.io/cos802/
▪ Building on our foundations
  ▪ Masakhane Community - https://www.masakhane.io/
  ▪ Sauti-Yetu Unconference - 10 October 2020 - https://sites.google.com/view/sautiyetu-nlp/
Low resource language dataset creation, curation and classification: Setswana and Sepedi
Vukosi Marivate, Tshephisho Sefara, Vongani Chabalala,
Keamogetswe Makhaya, Tumisho Mokgonyane,
Rethabile Mokoena, Abiodun Modupe,
Moseli Motsoeli
Masakhane Community
Idea 1: Use National Broadcaster as Resource
South African Broadcasting Corporation [SABC]
● SABC is South Africa's state broadcaster
  ○ 19 Radio Stations
  ○ 5 TV Channels
  ○ Online digital news
● Currently publishes digital news only in English.
● Radio stations in all 11 official languages [scripts exist, not public]
● Idea: Get headlines from the radio stations' Facebook pages and annotate them for category classification
SEPEDI [nso]
SETSWANA [tn]
Example Setswana Data and Annotations
Datasets available: https://zenodo.org/record/3668495
Idea 2: Train pre-trained vectorisers
We can get some data for all 11 official South African languages
Embeddings available: https://zenodo.org/record/3668481 (Being updated)
● Sources of local language data
  ○ Wikipedia
  ○ JW300
  ○ Bible
  ○ South African Constitution
  ○ SADiLaR
● Train traditional vectorisers
● Train word embeddings [Word2Vec]
● Train sentence embeddings [Doc2Vec]
● Useful for downstream tasks
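To make "traditional vectorisers" concrete, here is a minimal TF-IDF sketch using only the Python standard library. It is an illustration, not the pipeline behind the released embeddings: the function names `fit_idf` and `tfidf_vector`, the smoothed IDF formula, and the English toy headlines (stand-ins for Setswana or Sepedi text) are all assumptions.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every token."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document.
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: log((1 + N) / (1 + df)) + 1, which is never zero.
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}

def tfidf_vector(doc, idf):
    """Map one document to a sparse, L2-normalised TF-IDF dictionary."""
    # Tokens not seen when fitting the IDF table are ignored.
    tf = Counter(tok for tok in doc.lower().split() if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}

# Hypothetical toy headlines; real input would be local-language text.
headlines = [
    "minister opens new school",
    "team wins cup final",
    "minister visits school",
]
idf = fit_idf(headlines)
vectors = [tfidf_vector(h, idf) for h in headlines]
```

Tokens that occur in many headlines ("minister", "school") receive a lower weight than tokens that occur in only one ("cup"), which is the property that makes these vectors useful as classifier features.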
Idea 3: Text Augmentation with Quality Check
Increase data sizes with data augmentation
● Build robust classifiers with data augmentation for text.
● Contextual augmentation using word2vec “synonyms”
● Novel: Added a quality check using doc2vec [Algorithm 1]
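The idea of replacing words with embedding "synonyms" and then checking the quality of the result can be sketched as follows. This is not the paper's Algorithm 1: the tiny hand-made vectors in `WORD_VECS` stand in for a trained word2vec model, the averaged sentence vector stands in for a doc2vec embedding, and the names `nearest`, `sentence_vec`, `augment` and the parameters `p_replace` and `threshold` are assumptions made for illustration.

```python
import math
import random

# Hypothetical toy word vectors standing in for a trained word2vec model.
WORD_VECS = {
    "school":   (1.0, 0.1, 0.0),
    "college":  (0.9, 0.2, 0.0),
    "opens":    (0.0, 1.0, 0.1),
    "starts":   (0.1, 0.9, 0.1),
    "cup":      (0.0, 0.0, 1.0),
    "minister": (0.5, 0.5, 0.0),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest(word):
    """Most similar other word in the toy space (a word2vec-style "synonym")."""
    if word not in WORD_VECS:
        return word  # unknown words are left unchanged
    others = [w for w in WORD_VECS if w != word]
    return max(others, key=lambda w: cosine(WORD_VECS[word], WORD_VECS[w]))

def sentence_vec(tokens):
    """Average of known word vectors; a crude stand-in for doc2vec."""
    vecs = [WORD_VECS[t] for t in tokens if t in WORD_VECS]
    if not vecs:
        return (0.0, 0.0, 0.0)
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))

def augment(sentence, p_replace=0.5, threshold=0.9, rng=None):
    """Return an augmented sentence, or None if it fails the quality check."""
    rng = rng or random.Random(0)
    tokens = sentence.lower().split()
    new_tokens = [nearest(t) if rng.random() < p_replace else t for t in tokens]
    if new_tokens == tokens:
        return None  # nothing was replaced
    # Quality check: keep the new sentence only if it stays close to the original.
    similarity = cosine(sentence_vec(tokens), sentence_vec(new_tokens))
    return " ".join(new_tokens) if similarity >= threshold else None
```

In this toy space, `augment("opens school", p_replace=1.0)` yields "starts college", while `augment("cup", p_replace=1.0)` is rejected because the nearest neighbour of "cup" is not close enough to pass the threshold. That rejection is the role the quality check plays: it filters out replacements that drift too far from the original text.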
Initial Results: Benchmarks
Ran through
Full paper available (RAIL workshop at LREC): https://arxiv.org/abs/2004.04813
Models:
● Logistic Regression
● Support Vector Classifier
● XGBoost
● MLP
● Comparisons
  ○ TF, TFIDF, W2V
  ○ With and without augmentation
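The comparison described above can be read as a grid of experiments. The sketch below only enumerates that grid; it trains nothing and reports no results. Whether every combination was actually run is an assumption, and the names `MODELS`, `FEATURES`, `AUGMENTATION` and `experiment_grid` are hypothetical.

```python
from itertools import product

# Labels taken from the slide; the strings are identifiers, not library classes.
MODELS = ["LogisticRegression", "SupportVectorClassifier", "XGBoost", "MLP"]
FEATURES = ["TF", "TFIDF", "W2V"]
AUGMENTATION = [False, True]

def experiment_grid():
    """Yield one configuration per model / feature / augmentation combination."""
    for model, features, augmented in product(MODELS, FEATURES, AUGMENTATION):
        yield {"model": model, "features": features, "augmented": augmented}

grid = list(experiment_grid())
```

Laying the experiments out this way makes it straightforward to run each model on each feature type twice, once on the original data and once on the augmented data, so that the effect of augmentation can be compared like for like.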
What's Next
One More Thing
Text Augment Library
We release this library as part of this paper
▪ https://github.com/dsfsi/textaugment
Masakhane
Why African Natural Language Processing Now
Dr. Vukosi Marivate & Masakhane Collaborators
ABSA UP Chair of Data Science
Dept. of Computer Science, UP
Masakhane
Machine Translation for Africa
MASAKHANE is a research effort for machine translation for African languages that is
OPEN SOURCE
CONTINENT-WIDE
DISTRIBUTED
ONLINE
Online and Accessible
Masakhane: The Reach
https://masakhane.io
2020 so far [Workshop papers]
● ∀. "Masakhane - Machine Translation for Africa." (2020).
● Dossou, Bonaventure F. P., and Chris C. Emezue. "FFR v1.0: Fon-French Neural Machine Translation." (2020).
● Orife, Iroro. "Towards Neural Machine Translation for Edoid Languages." (2020).
● Orife, Iroro, et al. "Improving Yorùbá Diacritic Restoration." (2020).
● Marivate, Vukosi, et al. "Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi." (2020).
● Ahia, Orevaoghene, and Kelechi Ogueji. "Towards Supervised and Unsupervised Neural Machine Translation Baselines for Nigerian Pidgin." (2020).
● Van Biljon, Elan, Arnu Pretorius, and Julia Kreutzer. "On optimal transformer depth for low-resource language translation." (2020).
● Öktem, Alp, Mirko Plitt, and Grace Tang. "Tigrinya Neural Machine Translation with Transfer Learning for Humanitarian Response." (2020).
● Martinus, Laura, et al. "Neural Machine Translation for South Africa's Official Languages." (2020).
Masakhane: Impact
EMNLP 2020
Thank You
It takes a village (literally)
Keep in touch
Join our research group newsletter https://tinyletter.com/datascience-up/
Made with ❤ in Tshwane
Dr. Vukosi Marivate
https://dsfsi.github.io
@vukosi