Page 1: A call to action. Dr. Vukosi Marivate ... - USAf

A call to action. Using data science in the advancement of African Languages.

Dr. Vukosi Marivate & Collaborators
ABSA UP Chair of Data Science
Dept. of Computer Science, UP

Page 2

Overview

▪ Setting the scene [Why do we care?]

▪ How did we get here?

▪ How can we tackle these challenges?

▪ Some results and future work

Page 3

Language, AI and ML
Artificial Intelligence (AI) & Machine Learning (ML)

● Text and language are a rich interface for sharing information and interacting with machines.
● We need to ask ourselves a few questions.
  ○ How do machines process language information?
  ○ Why is local language important?

Page 4

Brief overview of AI/ML
Artificial Intelligence (AI) & Machine Learning (ML)

Russell and Norvig
Manning

● AI
  ○ Machine
  ○ In an environment (can perceive it)
  ○ Perform Actions
  ○ Reach a Goal
● ML
  ○ Learning patterns from data
● Natural Language Processing (NLP)
  ○ Learning language tasks

Page 5

Challenges with African Languages
Use Case: South Africa

● Lack of sufficient language resources.

● Inequality in data availability.

● Annotated datasets (for different NLP tasks) are rarely publicly available.

● How can we innovate in collection, curation, annotation and classification?

Page 6

Why Data is Important

Where is the data?

Page 7

A framework to understand the challenge

Martinus and Abbott 2019 https://arxiv.org/pdf/1906.05685.pdf

● Low Availability

● Discoverability

● Focus

● Reproducibility and Benchmarks

Languages world map [Wikimedia]

Page 8

Importance of indigenous languages

UNESCO and DSI reporting on indigenous languages

● What does language capture?
  ○ Indigenous knowledge
  ○ Culture
● How did we get here?
  ○ Inequality of language
  ○ Colonial legacies
  ○ Move to monolingualism
  ○ Lack of Data
  ○ Who develops the systems

Page 9

Importance of indigenous languages

UNESCO and DSI reporting on indigenous languages

● The internet is becoming more and more monolingual.

● How do we increase access to local populations?

● How

Page 10

So what are we going to do about it?
Some thoughts on

Expand the current community of practice across the African Continent and Global South

Innovate on what has come before!
● We have great new tools in ML/DL, exploit them.
● Expand data availability and gathering.

Page 11

Questioning the Status Quo

● Civil Disobedience?

● Can we afford to wait longer?

● Tapping into our youth.

Page 12

How do we move forward

Martinus and Abbott 2019

● Better public understanding

● Collect, collate and annotate data

● Expanding practice and skill - Building community

Page 13

Current and Future Directions
▪ More data collection, curation and annotation
  ▪ See AI4D and Lacuna Data Set Challenges
▪ Ongoing research at DSFSI
  ▪ Enhancement of ML pipelines for low resource scenarios
  ▪ New models for augmentation
  ▪ Dataset Curation & Pretrained Models (Masakhane isiZulu)
  ▪ Masakhane Web Tools
  ▪ Teaching NLP https://dsfsi.github.io/cos802/
▪ Building on our foundations
  ▪ Masakhane Community - https://www.masakhane.io/
  ▪ Sauti-Yetu Unconference - 10 October 2020 - https://sites.google.com/view/sautiyetu-nlp/

Page 14

Low resource language dataset creation, curation and classification: Setswana and Sepedi

Vukosi Marivate, Tshephisho Sefara, Vongani Chabalala,
Keamogetswe Makhaya, Tumisho Mokgonyane,
Rethabile Mokoena, Abiodun Modupe

Moseli Motsoeli
Masakhane Community

Page 15

Idea 1: Use National Broadcaster as Resource
South African Broadcasting Corporation [SABC]

● SABC is South Africa's state broadcaster
  ○ 19 Radio Stations
  ○ 5 TV Channels
  ○ Online digital news.
● Currently publishes digital news only in English.
● Radio stations in all 11 official languages [scripts exist, not public]
● Idea: Get headlines from the radio stations' Facebook pages and annotate them for category classification

SEPEDI [nso]

SETSWANA [tn]

Page 16

Idea 1: Use National Broadcaster as Resource

SEPEDI [nso]

SETSWANA [tn]

Example Setswana Data and Annotations

Datasets available: https://zenodo.org/record/3668495

Page 17

Idea 1: Use National Broadcaster as Resource

SEPEDI [nso]

SETSWANA [tn]

Datasets available: https://zenodo.org/record/3668495

Page 18

Idea 2: Train pre-trained vectorisers
We can get some data for all 11 South African Languages

Embeddings available: https://zenodo.org/record/3668481 (Being updated)

● Sources of local language data
  ○ Wikipedia
  ○ JW300
  ○ Bible
  ○ South African Constitution
  ○ SADiLaR

Page 19

Idea 2: Train pre-trained vectorisers
We can get some data for all 11 South African Languages

Embeddings available: https://zenodo.org/record/3668481 (Being updated)

● Sources of local language data
  ○ Wikipedia
  ○ JW300
  ○ Bible
  ○ South African Constitution
  ○ SADiLaR

● Train traditional vectorisers
● Train word embeddings [Word2Vec]
● Train sentence embeddings [Doc2Vec]
● Useful for downstream tasks
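To make "traditional vectorisers" concrete, here is a minimal TF-IDF sketch in plain Python. It is an illustration only, not the code used for the released embeddings: the function names `fit_idf` and `tfidf_vector` and the placeholder English corpus are assumptions, and whitespace tokenisation is a simplification. The smoothing and L2 normalisation mirror the defaults of scikit-learn's `TfidfVectorizer`.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every token."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document.
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: log((1 + N) / (1 + df)) + 1, so no seen token gets weight zero.
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


def tfidf_vector(doc, idf):
    """Return an L2-normalised sparse TF-IDF vector (a dict) for one document."""
    counts = Counter(tok for tok in doc.lower().split() if tok in idf)
    weights = {tok: count * idf[tok] for tok, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else {}


# Hypothetical placeholder headlines; a real run would use the collected corpora.
corpus = [
    "news sport results",
    "news weather report",
    "sport team results",
]
idf = fit_idf(corpus)
vec = tfidf_vector("news sport sport", idf)
```

Tokens that appear in fewer documents (here `weather`) receive a higher IDF than common ones (here `news`), which is what lets a downstream classifier focus on the distinctive words in a headline.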

Page 20

Idea 3: Text Augmentation with Quality Check
Increase data sizes with data augmentation

● Build robust classifiers with data augmentation for text.

● Contextual augmentation using word2vec “synonyms”

● Novel: Added a quality check using doc2vec [Algorithm 1]
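The idea in these bullets can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 1: the toy `WORD_VECTORS` stand in for trained Word2Vec embeddings, an averaged word vector stands in for a Doc2Vec document vector, and the names `augment`, `replace_prob` and `min_similarity` are hypothetical.

```python
import math
import random

# Toy 2-d vectors standing in for trained Word2Vec embeddings (made-up values).
WORD_VECTORS = {
    "good": (1.0, 0.1),
    "great": (0.9, 0.2),
    "bad": (-1.0, 0.1),
    "match": (0.1, 1.0),
    "game": (0.2, 0.9),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_word(word):
    """Return the most similar other vocabulary word, or None if unknown."""
    if word not in WORD_VECTORS:
        return None
    candidates = [w for w in WORD_VECTORS if w != word]
    return max(candidates, key=lambda w: cosine(WORD_VECTORS[word], WORD_VECTORS[w]))


def doc_vector(tokens):
    """Average the known word vectors; a simple stand-in for Doc2Vec."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(known) for dim in zip(*known))


def augment(tokens, replace_prob=0.5, min_similarity=0.9, rng=None):
    """Swap words for embedding neighbours; keep the result only if it stays close."""
    rng = rng or random.Random(0)
    candidate = []
    for tok in tokens:
        neighbour = nearest_word(tok)
        if neighbour is not None and rng.random() < replace_prob:
            candidate.append(neighbour)
        else:
            candidate.append(tok)
    # Quality check: reject augmentations that drift too far from the original.
    similarity = cosine(doc_vector(tokens), doc_vector(candidate))
    if candidate != tokens and similarity >= min_similarity:
        return candidate
    return None
```

With these toy vectors, `augment(["good", "match"], replace_prob=1.0)` swaps in the close neighbours "great" and "game" and passes the check, while `augment(["bad"], replace_prob=1.0)` is rejected because the only available neighbour is not actually similar. That second case is the reason for the quality check: in a small, low-resource embedding space the nearest neighbour is not always a usable "synonym".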

Page 21

Initial Results: Benchmarks
Ran through

Full paper available through RAIL LREC workshop paper https://arxiv.org/abs/2004.04813

Models:
● Logistic Regression
● Support Vector Classifier
● XGBoost
● MLP
● Comparisons
  ○ TF, TFIDF, W2V
  ○ With and without augmentation
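One way to read the comparison list is as a grid of experiments. The sketch below only enumerates that grid; it assumes every model was crossed with every feature type and both augmentation settings, which the slide does not confirm, and `experiment_grid` is a hypothetical helper name.

```python
from itertools import product

MODELS = ["Logistic Regression", "Support Vector Classifier", "XGBoost", "MLP"]
FEATURES = ["TF", "TFIDF", "W2V"]
AUGMENTATION = [False, True]


def experiment_grid():
    """Enumerate every (model, features, augmentation) combination."""
    return [
        {"model": model, "features": features, "augmented": augmented}
        for model, features, augmented in product(MODELS, FEATURES, AUGMENTATION)
    ]


runs = experiment_grid()
```

Under that assumption there are 4 x 3 x 2 = 24 runs per language, each of which would be trained and scored on the same train/test split so that the effect of augmentation can be isolated.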

Page 22

What's Next

Page 23

One More Thing

Text Augment Library
We release this library as part of this paper
▪ https://github.com/dsfsi/textaugment

Page 24

Masakhane
Why African Natural Language Processing Now

Dr. Vukosi Marivate & Masakhane Collaborators
ABSA UP Chair of Data Science
Dept. of Computer Science, UP

Masakhane
Machine Translation for Africa

Page 25

MASAKHANE is a research effort for machine translation for African languages that is

OPEN SOURCE

CONTINENT-WIDE

DISTRIBUTED

ONLINE

Page 26

Online and Accessible

Page 27

Masakhane: The Reach

https://masakhane.io

Page 28

Masakhane: The Reach

https://masakhane.io

2020 so far [Workshop papers]
● ∀, Masakhane - Machine Translation for Africa (2020)
● Dossou, Bonaventure FP, and Chris C. Emezue. "FFR V1.0: Fon-French Neural Machine Translation." (2020).
● Orife, Iroro. "Towards Neural Machine Translation for Edoid Languages." (2020).
● Orife, Iroro, et al. "Improving Yorùbá Diacritic Restoration." (2020).
● Marivate, Vukosi, et al. "Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi." (2020).
● Ahia, Orevaoghene, and Kelechi Ogueji. "Towards Supervised and Unsupervised Neural Machine Translation Baselines for Nigerian Pidgin." (2020).
● Van Biljon, Elan, Arnu Pretorius, and Julia Kreutzer. "On optimal transformer depth for low-resource language translation." (2020).
● Öktem, Alp, Mirko Plitt, and Grace Tang. "Tigrinya Neural Machine Translation with Transfer Learning for Humanitarian Response." (2020).
● Martinus, Laura, et al. "Neural Machine Translation for South Africa's Official Languages." (2020).

Page 29

Masakhane: Impact

EMNLP 2020

Page 30

Thank You

It takes a village (literally)

Keep in touch
Join our research group newsletter https://tinyletter.com/datascience-up/

Made with ❤ in Tshwane

Dr. Vukosi Marivate
[email protected]
https://dsfsi.github.io
@vukosi