![Page 1: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/1.jpg)
Sarawak Language Technology (SaLT) Research Group
SaLT Initiatives: Preservation and Maintenance of Sarawak Languages
Faculty of Computer Science and Information Technology
Universiti Malaysia Sarawak
Associate Professor Alvin W. Yeo
![Page 2: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/2.jpg)
Overview
• Languages in Sarawak
• Maintenance and Revitalisation: Holistic Approach
  – Sarawak Language Technology (SaLT) Research Group
• SaLT Projects
  – Borneo Corpus Management System (BCMS)
  – Iban-English Machine Translation
    • TRanslation IBan-English (TRIBE)
  – Multimodal-INTegration (MINT) of Sketch and Melanau Daro-Matu Speech in Spatial Queries
  – Speech Language Dialog Systems (SLaDS)
  – Development of Language Tools
• Current findings
![Page 3: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/3.jpg)
Where are we?
![Page 4: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/4.jpg)
East Malaysia > Sarawak > Kuching
![Page 5: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/5.jpg)
Introduction (cont’d)
• Sarawak is a state rich in culture.
  – 27 ethnic groups in Sarawak (STB, 2005), each with its own culture and language.
  – Sarawak has 46 living languages and 1 extinct language, according to the Ethnologue (Gordon, 2005).
  – Each ethnic group may have different languages.
  – Sarawak Dewan Bahasa dan Pustaka
    • 63 known languages in Sarawak
![Page 6: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/6.jpg)
Rationale
| Population | No. of languages | Cumulative no. of languages | Cumulative (%) |
|---|---|---|---|
| 1 – 100 | 4 | 4 | 9% |
| 101 – 500 | 8 | 12 | 27% |
| 501 – 1000 | 4 | 16 | 36% |
| 1001 – 5000 | 18 | 34 | 76% |
| 5001 – 10,000 | 4 | 38 | 84% |
| 10,001 – 50,000 | 6 | 44 | 98% |
| 50,001 – 100,000 | 0 | 44 | 98% |
| 100,001 and above | 1 | 45 | 100% |
| Extinct | 1 | 1 | |
| No data available | 1 | 1 | |

Cumulative number of languages by number of speakers (Ethnologue, 2005)
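The cumulative columns can be recomputed from the per-band counts alone. The following is a minimal sketch, assuming the percentages are taken over the 45 languages with known speaker numbers (the extinct language and the one without data are excluded); the function name `cumulative_table` is made up for illustration.

```python
# Language counts per speaker-population band, copied from the table above.
# The extinct language and the language with no data are left out.
bands = [
    ("1-100", 4),
    ("101-500", 8),
    ("501-1000", 4),
    ("1001-5000", 18),
    ("5001-10,000", 4),
    ("10,001-50,000", 6),
    ("50,001-100,000", 0),
    ("100,001 and above", 1),
]


def cumulative_table(rows):
    """Return (band, count, running total, cumulative percentage) tuples."""
    total = sum(count for _, count in rows)
    running = 0
    out = []
    for band, count in rows:
        running += count
        out.append((band, count, running, round(100 * running / total)))
    return out


table = cumulative_table(bands)
for band, count, running, pct in table:
    print(f"{band:<20}{count:>3}{running:>4}{pct:>5}%")
```

Read this way, the table says that 34 of the 45 languages (76%) have 5,000 speakers or fewer.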
![Page 7: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/7.jpg)
Problem
• The world’s linguistic and cultural diversity is under threat.
  – Many minority languages are on the brink of extinction.
• Minority language communities
  – Further disadvantaged economically and socially.
  – Dominant languages
  – Exogamy
• Revitalizing minority languages can bring economic and social benefits as well as cultural benefits.
![Page 8: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/8.jpg)
Holistic Approach: Framework for Language Revitalization and Maintenance
Preservation of Culture
People: Trainers; Translators; Linguists; IT specialists; Computer Scientists; Researchers; Social Scientists
Stakeholders: Community/civil society; Research institutions; Government agencies; NGOs; Industry; Ethnic group organisations; …
Applications:
  – Internet: Online Presence
  – Software applications and operating systems
  – Hardware: input devices (keyboards, tablets/pen/stylus)
Supporting Technologies:
  – Web technologies: Java, Flash
  – Methodologies: engaging communities; development lifecycles
  – Computing Technologies: Natural Language Processing, Image Processing, Speech Recognition and Generation
  – Community readiness: ICT literacy
![Page 9: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/9.jpg)
Sarawak Language Technologies (SaLT) Research Group
![Page 10: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/10.jpg)
SaLT
• Role of technology in language maintenance and revitalisation
• Focus on revitalising and maintaining the existing conventional languages by building corpora, conducting research and developing tools for Sarawak ethnic languages.
Sarawak Language Technologies (SaLT) Research Group covers
• Codification of the ethnic languages
  – Creation of corpora of the various languages in Sarawak
• Research in computational linguistics projects
  – which involve the languages and peoples of Sarawak
• Development of tools: word processors, spellcheckers
![Page 11: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/11.jpg)
Language Technology
• Understanding and explication of language phenomena in a
  – computationally tractable form, resulting in
  – techniques for interchanging various linguistic forms
    • speech, text, morphology, syntax, semantics/meaning, discourse, knowledge,
  – thus leading to the creation and development of intelligent applications involving language.
![Page 12: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/12.jpg)
Levels of Technology
INPUT (corpus)
APPLICATION (machine translation, multimodal spatial application)
PROCESSOR (tagger, parser, multimodal integration)
Lexicographer/Linguist/comp. scientist
Linguist/comp. scientist
General and conceptual dictionary
![Page 13: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/13.jpg)
Specialists Needed
• Lexicographers
• Computer scientists: DBA, SE & N/W (data maintenance & grid)
• Linguists
• Information Scientists
• Psychologists
• Anthropologists
• Computational Linguistics: Natural Language Processing
![Page 14: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/14.jpg)
Current Projects
![Page 15: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/15.jpg)
Current Projects (cont’d)
![Page 16: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/16.jpg)
Roadmap for SaLT
![Page 17: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/17.jpg)
Advisors and Organisations Involved
| No. | Name | Expertise | Organisation |
|---|---|---|---|
| 1 | Prof. Zaharin Yusoff | Computational Linguistics (CL) & Natural Language Processing (NLP) | MMU |
| 2 | Prof. Ahmad Zaki Abu Bakar | CL & NLP | UTM |
| 3 | AP Dr Normaziah Abdul Aziz | NLP & Artificial Intelligence | UIAM |
| 4 | Prof. Dr. Tang Enya Kong | CL & NLP | MMU |
| 5 | Dr. Bali Ranaivo | NLP & CL | MMU |
| 6 | Prof. Dr. Zuraidah Mohd. Don | Linguistics | UM |
| 7 | Dr. Gerry Knowles | Phonetics and Phonology | MIQUEST Worldwide Sdn Bhd |
| 8 | Professor Dr. Peter Songan | Community development | UNIMAS |
![Page 18: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/18.jpg)
Collaborators
Organisations Involved
1. Tun Jugah Foundation
2. Dewan Bahasa dan Pustaka (Sarawak Branch)
3. Melanau Association
4. Dayak Bidayuh National Association
5. Sarawak Museum
6. Pustaka Negeri Sarawak
7. Majlis Adat Istiadat

Universities Involved
1. UNIMAS (FCSIT, FCSHD, FSS, CLS)
2. Multimedia University
3. Universiti Teknologi Malaysia
4. Universiti Islam Antarabangsa Malaysia
5. Universiti Sains Malaysia
6. Universiti Malaya
7. Localisation Research Centre, University of Limerick, Ireland
8. University of Waikato, New Zealand
![Page 19: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/19.jpg)
Team members
a. Staff
FCSIT
• AP Dr Alvin Yeo Wee (Head)
• AP Dr. Narayanan K.
• Dr Edwin Mit
• Suhaila Saee
• Sarah Flora Samson
• Nurfauza Jali
• Suriati Khartini Jali
• Sy. Fazlin Seyed Fadzir
• Lee Jun Choi
FCSHD
• Dr. Ng Giap Weng
• D’oria Islamiah
• Wan Norizan
CLS
• Dr. Ting Su Hie
• Salbia Hassan
• Yvonne Michelle Campbell
![Page 20: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/20.jpg)
Team members (cont’d)
b. Research Assistants
1. Beatrice Chin (FCSIT)
2. Teh Lee Na (FCSIT)
3. Jennifer Wilfred (FCSIT)
4. Lai Nyong Fock (FCSIT)
5. Mohd. Hanafiah Semuni (FCSHD)
6. Loh Chee Wyai (FCSIT)
7. Ang Siaw Tiong (FCSIT)

c. Students

| Level | No. of Students |
|---|---|
| Post-graduate: PhD | 2 |
| Post-graduate: Master by Research | 6 |
| Post-graduate: Master by Coursework | 5 |
| Undergraduate | 22 |
| Total | 35 |
![Page 21: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/21.jpg)
Borneo Corpus Management System (BCMS)
• Problem/Background:
  – Currently, no corpus management system exists to manage the corpora available in the minority languages of Sarawak
• Solution:
  – Build a system that is able to manage and maintain the corpora
• Objectives:
  – To design an easy-to-use Corpus Processing Toolkit for researchers
  – To integrate the various tools in a single platform
• Current Status:
  – Working on the Morphological Analysers and Spell Checkers
![Page 22: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/22.jpg)
Corpus Manager (After processing)
Editable Content
Used to highlight the extracted information in the content
File tree that displays the processed files. Each file is stored in a folder based on its category
Original Content
Processed Content
![Page 23: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/23.jpg)
Corpus Analyser: Sentence Splitter
The output is each sentence of the current document
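A sentence splitter of this kind can be approximated with a single regular expression. This is a minimal sketch, not the BCMS implementation: it assumes a sentence ends at `.`, `!` or `?` followed by whitespace, and the function name `split_sentences` is made up for illustration.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences on terminal punctuation followed by whitespace."""
    parts = _BOUNDARY.split(text.strip())
    return [p for p in parts if p]


document = "The corpus was loaded. Is it tagged yet? Not all of it!"
sentences = split_sentences(document)
for s in sentences:
    print(s)
```

A rule this simple will wrongly split after abbreviations such as "Dr."; a production splitter would need a language-specific abbreviation list.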
![Page 24: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/24.jpg)
Iban-Corpus Development
• Problem/Background
  – Indigenous languages in Sarawak are slowly dying out due to:
  – One way to stem this “extinction” of languages:
    • Provide more local content – but how?
• Solution
  – Translate English documents into documents in minority languages
  – MT is needed to facilitate and accelerate the translation process
• Objectives
  – Identify a methodology that can be used to translate English to minority languages, taking Iban as a case study
• Current Status
  – Built an Iban corpus of 23,833 words with 3,831 distinct words
  – Constructed a bilingual lexicon of 1,688 words with 1,192 distinct words
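The "words" and "distinct words" figures correspond to what corpus linguists call tokens and types. The sketch below shows one way such counts could be produced; it is an assumption about the method, not the project's tool. The English sample sentence and the tokenisation rule (runs of letters, compared case-insensitively) are illustrative only.

```python
import re
from collections import Counter


def corpus_stats(text):
    """Return (running-word count, distinct-word count, frequency table)."""
    # Assumption: a word is a run of letters, optionally joined by ' or -,
    # and words are compared case-insensitively.
    tokens = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())
    freq = Counter(tokens)
    return len(tokens), len(freq), freq


sample = "The river is wide. The river is deep, and the boat is small."
n_tokens, n_types, freq = corpus_stats(sample)
print(n_tokens, n_types, round(n_types / n_tokens, 3))

# Type-token ratio implied by the reported Iban corpus figures
iban_ttr = 3831 / 23833
print(round(iban_ttr, 3))
```

For the reported corpus, 3,831 distinct words out of 23,833 gives a type-token ratio of about 0.16.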
![Page 25: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/25.jpg)
Iban-English Machine Translation
• Problem/Background
  – Traditional knowledge (TK) is tacit knowledge; it is generally not stored and is known only by the older generation, who speak little English
  – TK is very important. It needs to be preserved and protected.
  – Machine Translation (MT) can help to preserve TK
  – Translate available resources into English so that they are accessible to all, e.g. researchers (social scientists) and the younger generation
  – However, translation of closely related languages is easier
• Solution
  – Translate TK documents to English through a closely related language as pivot language
  – Case study: Iban as source language, Malay as pivot language and English as target language
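At the lexicon level, the pivot idea amounts to chaining two dictionaries. The sketch below is a deliberately simplified, word-for-word illustration and not the project's system; the three lexicon entries are toy examples chosen for illustration, and the names `compose` and `translate` are hypothetical.

```python
# Toy lexicons for illustration only; they are not taken from the project's data.
IBAN_TO_MALAY = {"makai": "makan", "ai": "air", "rumah": "rumah"}
MALAY_TO_ENGLISH = {"makan": "eat", "air": "water", "rumah": "house"}


def compose(src_to_pivot, pivot_to_tgt):
    """Build a source-to-target lexicon by chaining two lexicons through the pivot."""
    return {
        src: pivot_to_tgt[pivot]
        for src, pivot in src_to_pivot.items()
        if pivot in pivot_to_tgt
    }


def translate(words, lexicon):
    """Word-for-word lookup; unknown words are passed through in angle brackets."""
    return [lexicon.get(w, f"<{w}>") for w in words]


IBAN_TO_ENGLISH = compose(IBAN_TO_MALAY, MALAY_TO_ENGLISH)
print(translate(["makai", "ai", "xyz"], IBAN_TO_ENGLISH))
```

A real pivot MT system also has to handle morphology, word order and ambiguity at each step. The practical appeal of the pivot is that an existing Malay-English resource can be reused for every language close to Malay.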
![Page 26: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/26.jpg)
• Objectives
  – To demonstrate that the performance of translation through a pivot language is comparable with the performance of direct translation
    • Realise benefits (efficiency) of translating multiple “similar” languages through a common pivot language
• Current Status
  – Building of Iban corpus and lexicon
  – Linguistic comparison of the Iban and Malay languages
![Page 27: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/27.jpg)
![Page 28: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/28.jpg)
Multimodal Integration: Preamble
User sketching on the Wacom tablet with the CogSketch sketch interface to describe a place.
Dragon Naturally Speaking software for capturing the speech with a microphone.
![Page 29: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/29.jpg)
Multimodal Integration of Sketch and Melanau Daro-Matu Speech in Spatial Queries (MINT)
• Problem/Background
  – English: main communication medium
  – Language is unique and distinct
    • Individuals who use different languages may have different approaches to conceptualizing, communicating, reasoning and expressing their thoughts
  – Translation alone is not sufficient
  – Building an entire system for each targeted group of speakers is time-consuming
• Solution
  – Internationalisation (i18n)
  – Localisation (l10n)
![Page 30: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/30.jpg)
• Objectives
  – Integrate Melanau Daro-Matu speech and sketch (image) modalities
  – Identify the interaction patterns of Melanau users.
  – Identify the similarities and differences of English, Malay and Melanau (extending to Iban as well)
  – Localise architecture and representation of multimodal integration in Melanau Daro-Matu, and other languages
![Page 31: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/31.jpg)
MINT architecture (diagram labels)
• Input Capturing: Sketch; Speech
• Input Interpretation
  – Sketch Interpretation
  – Speech Interpretation: Transcription; Sentence Splitter; Part-Of-Speech Tagging; Annotated Text
• Part-Of-Speech Tagging: Tokenization; Tagging using trained corpus; Tagging corrections acquired from templates
• Language-Dependent Components: Lexicon required; Grammar rules required
• Modalities Representation: Sketch Representation; Speech Representation
• Modalities Integration: Sketch and Speech Integration
• Spatial information retrieval: Database Searching
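The part-of-speech tagging steps named in the diagram (tokenization, tagging using a trained corpus, corrections acquired from templates) can be sketched as a most-frequent-tag tagger followed by contextual correction rules. This is one common way to realise those steps, not necessarily the one used in MINT. English stands in for Melanau Daro-Matu here, and the training sentences, tag names and the single rule are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lower-case the sentence and split it into word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training corpus."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def tag(tokens, model, default="NOUN"):
    """Tag each token with its most frequent tag; unknown words get the default."""
    return [(t, model.get(t, default)) for t in tokens]


def apply_corrections(tagged, rules):
    """Apply template rules of the form (from_tag, to_tag, previous_tag)."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, current = out[i]
        for from_tag, to_tag, prev_tag in rules:
            if current == from_tag and out[i - 1][1] == prev_tag:
                out[i] = (word, to_tag)
                break
    return out


# Hypothetical English training data and one correction rule
TRAIN = [
    [("the", "DET"), ("bridge", "NOUN"), ("is", "VERB"),
     ("near", "ADP"), ("the", "DET"), ("river", "NOUN")],
    [("we", "PRON"), ("cross", "VERB"), ("the", "DET"), ("river", "NOUN")],
    [("they", "PRON"), ("bridge", "VERB"), ("gaps", "NOUN")],
    [("we", "PRON"), ("bridge", "VERB"), ("gaps", "NOUN")],
]
RULES = [("VERB", "NOUN", "DET")]  # a VERB right after a determiner becomes NOUN

model = train_unigram(TRAIN)
raw = tag(tokenize("The bridge is near the market"), model)
fixed = apply_corrections(raw, RULES)
print(raw)
print(fixed)
```

The unigram model tags "bridge" as a verb because that is its most frequent tag in the toy corpus; the template rule then repairs it in the context "the bridge". The lexicon and the rules are the language-dependent parts that would have to be rebuilt for each new language.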
![Page 32: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/32.jpg)
Spoken Language Dialogue System (SLaDS)
• Problem/Background
  – Spoken language systems (SLS) have become an increasingly common human-system interface.
  – Many studies have been conducted by foreign researchers to address the challenges in the design of spoken language systems.
  – This study focuses on the design and development of a spoken language dialogue system within the context of Malaysian users.
• Solution
  – The project conducts a simulation test of a real SLS with local users.
  – The system is then evaluated using the Wizard of Oz method, with the objective of determining its efficacy.
  – The results of this testing will be useful for the future development of Malaysian SLS.
![Page 33: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/33.jpg)
• Objectives
  – To investigate spoken language and interaction design, and their use in the development of Spoken Language Dialogue Systems
  – To determine the efficacy of imported usability evaluation techniques applied to Spoken Language Dialogue Systems
  – To identify speech patterns to develop a predictive model for speech recognition
• Current Status
  – To date, the study is in its testing stage, capturing the dialogue content.
  – Respondents are prompted to interact with the system.
  – The dialogue from the interaction will be taped, transcribed and analysed.
![Page 34: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/34.jpg)
Screenshots: Wizard’s Control Panel; User’s view
Video: interaction sample
![Page 35: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/35.jpg)
Research Projects: Fundamental Research Grant
• Minority Languages Online (MiLO): Preserving Cultures by Mobilising Minority Languages (of Sarawak) Online (completed 30 June 2007)
  – Continued with CLS, Univ. of Waikato
  – Wikipedia approach to development of Bidayuh lexicon
• Bario Lakuh Digital Library (completed)
  – Recordings of Kelabit songs
  – Transcribed, translated
  – With audio and video
![Page 36: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/36.jpg)
e-Vocabulary for Sarawak Malay
• Problem: Language endangerment
• Vocabulary of Sarawak Malay (original source)
• Main source: vocabulary book written by W.S.B. Buck from Bau, published by the Sarawak Civil Service on 11 May 1932.
• Total word entries: 1,026
![Page 37: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/37.jpg)
AbiWord in Local Languages
Background
• One of the most widely used computer applications nowadays is the word processor.
• Open Source Software (OSS): can be used, studied, and redistributed in modified or unmodified form without restriction
Solution/Objectives
• AbiWord (a comprehensive word processor) to be localised
• To identify the processes of translation of computing terminology
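Localising an interface usually comes down to replacing each user-visible string with an entry from a per-language table, falling back to the source language where no translation exists yet. The sketch below illustrates that general pattern only; it is not AbiWord's actual string mechanism. The key names are hypothetical, and standard Malay labels stand in for whichever local language is being targeted.

```python
# Hypothetical string tables keyed by locale code. The Malay entries stand in
# for any target language and are not taken from the project's translations.
CATALOGS = {
    "en": {
        "menu.file": "File",
        "menu.edit": "Edit",
        "menu.file.open": "Open",
        "menu.file.save": "Save",
    },
    "ms": {
        "menu.file": "Fail",
        "menu.edit": "Sunting",
        "menu.file.open": "Buka",
    },
}


def localise(key, locale, fallback="en"):
    """Return the label for key in locale, falling back to the source language."""
    return CATALOGS.get(locale, {}).get(key) or CATALOGS[fallback][key]


def untranslated(locale, source="en"):
    """List keys still missing a translation, to track localisation progress."""
    done = CATALOGS.get(locale, {})
    return sorted(k for k in CATALOGS[source] if k not in done)


print(localise("menu.file.open", "ms"))
print(localise("menu.file.save", "ms"))
print(untranslated("ms"))
```

Keeping a list of untranslated keys per interface area (toolbar, menu, submenu, tooltips) gives a simple way to report which parts are completed and which are ongoing.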
![Page 38: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/38.jpg)
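Current Status:

| Task | Progress |
|---|---|
| Data collection: Template | Ongoing |
| Interface: Toolbar | Completed |
| Interface: Menu | Completed |
| Interface: Submenu | Ongoing |
| Interface: Icon | Ongoing |
| Interface: Tooltips | Ongoing |
| Interface: Operation | Running |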
![Page 39: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/39.jpg)
Screen shots
Interface
Example of Menu Panel
![Page 40: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/40.jpg)
Current Findings: Challenges
• Resources of some languages available
  – Generally lacking; data collection very challenging
• Writing systems and grammar rules do not exist
• Lack of human resources
  – Fluent in the (untainted) form (translating, POS tagging)
![Page 41: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/41.jpg)
Current Findings: Bright future
• Community Awareness
  – Associations of ethnic groups are aware of the need
  – Older people are interested; the younger generation less so
• Protocol followed
  – Upper management support required to “open doors”
• Local researchers are interested
  – Colleagues & students
• Machine translation, speech to text, text to speech
• Development of speech corpus
![Page 42: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/42.jpg)
Multi-ethnic Group
![Page 43: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/43.jpg)
Concluding Remarks
• Decreasing number of speakers of languages in Sarawak
• Maintenance and Revitalisation: Holistic Approach
  – Sarawak Language Technology (SaLT) Research Group
• SaLT Projects
  – Machine translation, multimodal integration, speech language dialog system, corpus management systems, online dictionaries/repositories, digital libraries
• Challenges: community involvement and data collection and analysis
• Silver lining: committed NGOs and researchers
• Internationalisation and localisation approach
![Page 44: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/44.jpg)
Acknowledgements
• Institutional support from
  – Universiti Malaysia Sarawak
  – Jugah Foundation, Melanau Association, Dewan Bahasa dan Pustaka (Sarawak Branch), Majlis Adat Istiadat, Dayak Bidayuh National Association
• Financial support grants
  – UNIMAS Fundamental Research Grant Scheme
  – Federal Ministry of Science, Technology and Innovation Science Fund Grant Scheme (01-09-SF0028, SF0029, SF0030)
![Page 45: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/45.jpg)
Fifth International Cyberspace Conference on Ergonomics (CybErg 2008)
– Theme: Local knowledge, Global Applications
– Special Discussion on Maintenance and Preservation of Languages
– On-going 15 Sept – 15 Oct 2008
– Free Registration
– http://www.cyberg08.org/forum
![Page 46: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/46.jpg)
Sixth International Conference on IT In Asia (CITA’09)
• Theme: “Enabling technologies for Knowledge-driven Society: People-Powered Systems”
• Tracks on Computational Linguistics, Human Computer Interaction, Software Engineering
• Kuching, Malaysia, 6-9 July 2009; Rainforest Music Festival
![Page 47: Sarawak Language Technology](https://reader034.vdocuments.site/reader034/viewer/2022042611/589062e51a28ab3d4b8c078c/html5/thumbnails/47.jpg)
Thank You
Terima Kasih
Jian Kenin