Spoken Language Corpus Project: Spoken Corpora for the 9 Official South African African Languages


Post on 20-Dec-2015

TRANSCRIPT

  • Slide 1
  • SPOKEN LANGUAGE CORPUS PROJECT SPOKEN CORPORA FOR THE 9 OFFICIAL SOUTH AFRICAN AFRICAN LANGUAGES
  • Slide 2
  • Workshop Overview
      The Asmara Declaration - Rusandre
      What's the point of spoken language corpora? - Jens
      Overview of the project and its phases - Rusandre
      The recording phase - Jens/Mmem
      The transcription phase - Jens
      The checking phase - Jens
      The tagging phase - Leif/Rusandre
      Research output - Jens
  • Slide 3
  • THE ASMARA DECLARATION - 2000 Dialogue among African languages is essential: African languages must use the instrument of translation to advance communication among all people, including the disabled. All African children have the inalienable right to attend school and learn in their mother tongues. All effort should be made to develop African languages at all levels of education.
  • Slide 4
  • ASMARA DECLARATION - CNTD Promoting research on African languages is vital for their development, while the advancement of African research and documentation will be best served by the use of African languages. The effective and rapid development of science and technology in Africa depends on the use of African languages and modern technology must be used for the development of African languages.
  • Slide 5
  • What's the point of spoken language corpora?
      Jens Allwood
      Corpus linguistics / Armchair linguistics
  • Slide 6
  • PROJECT MANAGEMENT (organisation chart, flattened)
      Goteborg/Unisa
      Nguni: Rhodes, Fort Hare, UPE/Vista, Natal, Unizul
      Sotho: N-Sotho, Tswana; Univ of North, Northwest Univ, Venda Univ., Univ. Botswana
      Venda: Venda Univ
      Tsonga
  • Slide 7
  • OBJECTIVES
      To develop a platform of computer-supported basic linguistic resources for the previously disadvantaged languages of SA.
      The resources will be in the form of:
          archived audio-visual recordings of activity-based natural language use;
          machine-readable transcriptions of recordings for corpus-driven searches;
          morphologically tagged corpora for corpus-based searches.
  • Slide 8
  • PROJECT PHASES 2002 - 2004
      1. Ongoing audio-video recordings of activity-based spoken language use (min. 200 hrs per language).
      2. Transcriptions (enriched with comment lines) of recordings in machine-readable text format.
      3. Checking and editing of transcriptions.
      4. Manual morphological tagging of corpora.
      5. Automated tagging of corpora.
      6. Research outputs.
  • Slide 9
  • The recording phase
      What to record
      Activity types
      What to think about when recording natural language dialogues
      Keep it natural
      The video camera, microphone, etc.
      Keep the camera fixed!
  • Slide 10
  • Recording and transcription
      Practical exercise!
      1. A short recording
      2. Transcribe together
  • Slide 11
  • Transcription Structure
      Header (background information about transcription and recorded activity)
      Body (the actual transcription, consisting of two kinds of elements)
          Contributions (transcribed utterances of participants in the recorded activity)
          Information lines (mark various peculiar aspects in the contributions and recorded activity)
  • Slide 12
  • Example of a header
      @ Recorded activity ID: V010501
      @ Activity type: Informal conversation
      @ Recorded activity title: Getting to know each other
      @ Recorded activity date: 20020725
      @ Recorder: Britta Zawada
      @ Participant: A = F2 (Lunga)
      @ Participant: B = F1 (Bukiwe)
      @ Transcriber: Mvuyisi Siwisa
      @ Transcription date: 20020805
      @ Checker: Rusandre Hendrikse
      @ Checking date: 20020912
      @ Anonymised: No
      @ Activity Medium: face-to-face
      @ Activity duration: 00:44:30
      @ Other time coding: Each section
      @ Tape: V0105
      @ Section: Family affairs
      @ Section: Crime
      @ Section: Unemployment
      @ Section: Closing
      @ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga
  • Slide 13
  • Transcription header
      @ Recorded activity ID: V010501
          V = Video
          01 = project number
          05 = tape number within this project
          01 = recording number
      @ Activity type: Informal conversation
      @ Recorded activity title: Getting to know each other
      @ Recorded activity date: 20020725
      @ Recorder: Britta Zawada
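The ID breakdown above can be sketched in Python. This is a minimal illustration, assuming every ID has the same shape as the example (one capital letter followed by three two-digit fields); the names `ActivityID`, `parse_activity_id` and `tape_label` are hypothetical and not part of the project's tooling.

```python
import re
from typing import NamedTuple


class ActivityID(NamedTuple):
    medium: str     # "V" = video, as stated on the slide
    project: str    # two-digit project number
    tape: str       # two-digit tape number within the project
    recording: str  # two-digit recording number


# Assumed layout: one capital letter plus three two-digit fields.
_ID_PATTERN = re.compile(r"^([A-Z])(\d{2})(\d{2})(\d{2})$")


def parse_activity_id(activity_id: str) -> ActivityID:
    """Split a recorded activity ID such as 'V010501' into its fields."""
    match = _ID_PATTERN.match(activity_id.strip())
    if match is None:
        raise ValueError(f"unexpected activity ID: {activity_id!r}")
    return ActivityID(*match.groups())


def tape_label(parsed: ActivityID) -> str:
    """Rebuild the tape label, which is the ID without the recording number."""
    return parsed.medium + parsed.project + parsed.tape


example = parse_activity_id("V010501")
print(example)
print(tape_label(example))
```

For the example ID, the tape label comes out as `V0105`, which matches the `@ Tape:` field of the sample header.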
  • Slide 14
  • Transcription header, cont
      @ Participant: A = F2 (Lunga)
      @ Participant: B = F1 (Bukiwe)
          F stands for female
          F1 is unique for Bukiwe in the entire corpus
          A and B are IDs for the participants
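A participant value such as `A = F2 (Lunga)` carries three pieces of information: the local speaker ID, the corpus-wide code, and the name. The sketch below shows one way to pull them apart. It assumes the code is always one capital letter plus a number and that the name in parentheses is optional; `Participant` and `parse_participant` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Participant(NamedTuple):
    speaker_id: str      # local ID used in the transcription body, e.g. "A"
    corpus_code: str     # corpus-wide unique code, e.g. "F2"
    category: str        # letter part of the code; the slide only defines F = female
    number: int          # running number within that category
    name: Optional[str]  # name or pseudonym, if given


# Assumed shape: "<speaker> = <letter><digits> (<name>)", name optional.
_PARTICIPANT = re.compile(r"^\s*(\w+)\s*=\s*([A-Z])(\d+)\s*(?:\((.*?)\))?\s*$")


def parse_participant(value: str) -> Participant:
    """Parse the value of an '@ Participant:' header field."""
    match = _PARTICIPANT.match(value)
    if match is None:
        raise ValueError(f"unexpected participant value: {value!r}")
    speaker_id, category, number, name = match.groups()
    return Participant(speaker_id, category + number, category, int(number), name)


print(parse_participant("A = F2 (Lunga)"))
print(parse_participant("B = F1 (Bukiwe)"))
```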
  • Slide 15
  • Transcription header, cont
      @ Transcriber: Mvuyisi Siwisa
      @ Transcription date: 20020805
      @ Checker: Rusandre Hendrikse
      @ Checking date: 20020912
  • Slide 16
  • Transcription header, cont
      @ Anonymised: No
          Indicates whether personal names, etc. have been changed to pseudonyms (Yes) or not (No), both in the header and in the conversation
      @ Activity Medium: face-to-face
          Normally spoken, face to face, but could also have other values, such as telephone conversation
  • Slide 17
  • Transcription header, cont
      @ Activity duration: 00:44:30
          Duration in hours, minutes and seconds
      @ Other time coding: Each section
          There is a time line for each section
      @ Tape: V0105
          This is a part of the recorded activity ID
  • Slide 18
  • Transcription header, cont
      @ Section: Family affairs
      @ Section: Crime
      @ Section: Unemployment
      @ Section: Closing
      @ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga
          Any relevant information that is not covered by any of the required headings
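Because every header field follows the pattern `@ Name: value`, a header can be read into a simple lookup table. The sketch below is a minimal illustration, assuming each `@` field sits on its own line in the transcription file; `parse_header` and `duration_seconds` are hypothetical helpers. Fields that can repeat (`Participant`, `Section`) are why every value is kept in a list.

```python
from collections import defaultdict

# A shortened copy of the example header, one field per line (assumed file layout).
HEADER_TEXT = """\
@ Recorded activity ID: V010501
@ Activity type: Informal conversation
@ Participant: A = F2 (Lunga)
@ Participant: B = F1 (Bukiwe)
@ Activity duration: 00:44:30
@ Section: Family affairs
@ Section: Crime
"""


def parse_header(text: str) -> dict:
    """Map each header field name to the list of values it was given."""
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("@"):
            continue  # not a header line
        # Split on the first colon only, so times like 00:44:30 stay intact.
        name, separator, value = line[1:].partition(":")
        if separator:
            fields[name.strip()].append(value.strip())
    return dict(fields)


def duration_seconds(hhmmss: str) -> int:
    """Convert an 'hours:minutes:seconds' duration into seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


header = parse_header(HEADER_TEXT)
print(header["Participant"])
print(header["Section"])
print(duration_seconds(header["Activity duration"][0]))
```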
  • Slide 19
  • The body
      This is the actual transcription - the background information is in the header
      Four kinds of lines:
          $A: uyakhonza kanene    Contribution
          @                       Information line
          At office               Section line
          # 00:10:00              Time line
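The four kinds of body lines can be told apart by their first character. The sketch below assumes that `$` opens a contribution, `@` an information line and `#` a time line, as in the examples on the slide. The marker for section lines is not preserved in this transcript, so as a labelled assumption any other non-empty line is treated as a section line. `classify_line` and `split_contribution` are hypothetical names.

```python
def classify_line(line: str) -> str:
    """Return the kind of a transcription body line."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("$"):
        return "contribution"
    if stripped.startswith("@"):
        return "information"
    if stripped.startswith("#"):
        return "time"
    # Assumption: the section marker is unknown here, so everything else
    # is counted as a section line.
    return "section"


def split_contribution(line: str) -> tuple:
    """Split '$A: text' into the speaker ID and the transcribed text."""
    speaker, _, text = line.strip()[1:].partition(":")
    return speaker.strip(), text.strip()


for example in ["$A: uyakhonza kanene", "@", "At office", "# 00:10:00"]:
    print(classify_line(example), "|", example)

print(split_contribution("$A: uyakhonza kanene"))
```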
  • Slide 20
  • Sections
      Family affairs
      $B: sibabini kuphela esibabalwe sada safunda ke noko sakwazi ukuphangela sikwazi ke noko kuba ndinobhuti wam osebenzayo...
      Religion
      $B: uyakhonza kanene
      $A: ndiyakhonza owu ndiyamthand{a} [4 uthixo ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
      $B: [4 nantso ke sisi e: e: ]4
      $B: nantso ke into efunekayo uthixo ulithemba lethu [5 uthixo ulithemba lethu ulixhadi lethu ]5 uligwiba
      $A: [5 ulixhadi lethu ulixhadi lethu]5
      $B: [6 uligwiba andazi ukuba ndingangendithini ngendiphi na xa uthixo heyi ]6
      Situation on their arrival at Medunsa
      $A: [6 ucinga ukuba ngesiphi na ngesisemedunsa ]6
      $B: uye wasithatha khona waza kusibeka kule ndawo...
  • Slide 21
  • Contributions
      Religion
      $B: uyakhonza kanene
      $A: ndiyakhonza owu ndiyamthand{a} [4 ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
      @
  • Slide 22
  • Overlaps
      Religion
      $B: uyakhonza kanene
      $A: ndiyakhonza owu ndiyamthand{a} [4 ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
      $B: [4 nantso ke sisi // e: e: ]4
      @
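In the example above, overlapping speech is wrapped in numbered brackets, and the same number appears in both speakers' contributions. The sketch below collects the bracketed stretches by number so that the overlapping parts can be lined up. It assumes the form `[n ... ]n` seen in the example; `overlaps_by_index` is a hypothetical helper.

```python
import re
from collections import defaultdict

# "[4 some words ]4": the closing bracket repeats the opening number.
_OVERLAP = re.compile(r"\[(\d+)\s+(.*?)\s*\]\1(?!\d)")


def overlaps_by_index(contributions):
    """Group overlapped stretches of speech by their overlap number."""
    found = defaultdict(list)
    for line in contributions:
        speaker, _, text = line.strip().lstrip("$").partition(":")
        for index, stretch in _OVERLAP.findall(text):
            found[int(index)].append((speaker.strip(), stretch))
    return dict(found)


lines = [
    "$A: ndiyakhonza owu ndiyamthand{a} [4 ndiyamthanda andisoze ndimlahle "
    "undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela",
    "$B: [4 nantso ke sisi // e: e: ]4",
]

for index, stretches in overlaps_by_index(lines).items():
    for speaker, stretch in stretches:
        print(index, speaker, stretch)
```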
  • Slide 23
  • Contrastive stress, pauses and lengthening
      $B: abanye ke bazihlalele nje: / abanye ABAZANGE bafune sikolo // uyayiqonda ke la meko yokungabikho mzali uqhubayo / uthi aba baza emva kwam bobabini ABAZANGE bafunde kuyaphi // kodwa ke // andigxeki nto kuba ke / ndibakhona ngethuba le ngxaki nobhuti ke [2 abeyinkxaso kakhulu ]2
      $A: [2 ya / m: ewe ]2 hayi izinto zikuthixo azikho kuthi nam obu bushuman bam ndiseza kutshata ndiseza kutshata
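Once these features are marked consistently, they can be counted automatically. The sketch below reads the example under three assumptions that the slide itself does not spell out: capitals mark contrastive stress, `/` and `//` mark shorter and longer pauses, and a trailing colon marks lengthening. `prosody_summary` is a hypothetical helper.

```python
import re


def prosody_summary(text: str) -> dict:
    """Count pauses and list stressed and lengthened words in a contribution."""
    tokens = text.split()
    return {
        # Assumed reading: "/" is a shorter pause, "//" a longer one.
        "short_pauses": tokens.count("/"),
        "long_pauses": tokens.count("//"),
        # Assumed reading: a word written entirely in capitals is stressed.
        "stressed": [t for t in tokens if t.isalpha() and t.isupper()],
        # Assumed reading: a word ending in ":" has a lengthened final sound.
        "lengthened": [t for t in tokens if re.fullmatch(r"\w+:", t)],
    }


sample = "abanye ke bazihlalele nje: / abanye ABAZANGE bafune sikolo //"
print(prosody_summary(sample))
```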
  • Slide 24
  • Unclear speech and glottal stop
      $M: loo nto ke njengo{ku}ba sekunyanzeleke ukuba ndiye phaya nje (...) ndikwazi ukuncedisa phaya ndiyiphushile ukwenzela ukuba ndibe neclaim endizakuba nayo that is why ndithole because ndiyaclaimer so that at least uba ndiclayimile ndikwazi ukuhamba
      $T: ke ngoku ke yenye yezinto endifuna ukuyoyenza
      $M: ngolwesithathu (what she said to me ngoku bendiphaya ngecawe) besingcwaba umfazi kasicaka jama
      $T: ee andekufuni ukutya
  • Slide 25
  • Comment Lines
      $A: kunetha imvula sinemithwalo engaka nako sisa
      @
      $B: esingazi lo mntwana ngoba kaloku siza apha asazi mntu ukuba wayengekho ngesasitheni na asazi mntu
      @
  • Slide 26
  • Research output
      Jens Allwood
      A distributed database (corpus)
      Networks (homepages)
      Spoken language corpus activities (seminars, workshops)
  • Slide 27
  • TAGGING SPOKEN LANGUAGE SAMPLES
      PROBLEMATIC ISSUES
      CONVENTIONS & STANDARDS
      A P Hendrikse
      16/03/04
  • Slide 28
  • PROBLEMATIC ISSUES
      Loans and codeswitching
      Fixed expressions
      Spoken language reductions
      Morphophonological issues
      Designing a tag set
      Manual tagging
      A drag-and-drop tagger
      Automated tagging
  • Slide 29
  • Loans and Codeswitching
      Non-indigenised codeswitching
          ndifuna
      Indigenised but non-standardised codeswitching loans
          >ndiyakleyimisha? ndiyaklayimisha? ndiyafonisha? ndiyafowunisha?
  • Slide 30
  • Fixed Expressions
      A continuum: idioms/proverbs, prefabricated expressions, collocations
      How fixed is fixed?
          Into yokuba (*izinto zokuba)
          Nantso ke (*nantsi ke?)
          (Ke) kaloku (ke)
          Bafondini/mfondini
          Undincedile
          Ungadinwa nangomso
  • Slide 31
  • Fixed Expressions cntd
      Flagging fixed phrases
          Into_yokuba
          Ke_kaloku_ke
      Morphosyntactic tagging or not?
          Ke >_kaloku >_ke > >
          Or
          Ke_kaloku_ke >
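Flagging a fixed phrase means joining its words with underscores so that later tools treat it as one unit. The sketch below shows one possible way to do this over a list of tokens. The phrase list is a small illustrative sample taken from the slides, and `FIXED_PHRASES` and `flag_fixed_phrases` are hypothetical names, not the project's own tool.

```python
# Illustrative sample only; a real list would be compiled by the language teams.
FIXED_PHRASES = [
    ("ke", "kaloku", "ke"),
    ("into", "yokuba"),
    ("nantso", "ke"),
]


def flag_fixed_phrases(tokens, phrases=FIXED_PHRASES):
    """Join the words of known fixed phrases with underscores."""
    # Try longer phrases first so that "ke kaloku ke" wins over shorter matches.
    ordered = sorted(phrases, key=len, reverse=True)
    flagged = []
    position = 0
    while position < len(tokens):
        for phrase in ordered:
            window = tokens[position:position + len(phrase)]
            if tuple(word.lower() for word in window) == phrase:
                flagged.append("_".join(window))
                position += len(phrase)
                break
        else:
            # No fixed phrase starts here: keep the word unchanged.
            flagged.append(tokens[position])
            position += 1
    return flagged


print(flag_fixed_phrases("nantso ke into efunekayo".split()))
print(flag_fixed_phrases("ke kaloku ke into yokuba".split()))
```

Matching the longest phrase first is a design choice in this sketch; whether a flagged phrase should also receive tags for its individual words is the open question raised on the slide.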
  • Slide 32
  • Spoken language reductions
      Standardised reductions
          Ngokuba > ngoba
          Written standard reduction: reconstruction convention {} not used, i.e. *ngo{ku}ba
      Non-standardised reductions
          Musa ukuhamba > sukuhamba (wsr) > Suhamba (non-standardised)
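The curly-bracket convention makes it possible to recover two forms from one transcribed word. The sketch below assumes that the material inside `{}` was not pronounced but belongs to the written standard form, which is how the transcribed examples `njengo{ku}ba` and `ndiyamthand{a}` appear to work. `spoken_form` and `reconstructed_form` are hypothetical names.

```python
import re

# Matches one reconstructed stretch such as "{ku}".
_BRACED = re.compile(r"\{([^{}]*)\}")


def spoken_form(token: str) -> str:
    """Drop the reconstructed material: what the speaker actually said."""
    return _BRACED.sub("", token)


def reconstructed_form(token: str) -> str:
    """Keep the reconstructed material but remove the brackets: the full form."""
    return _BRACED.sub(r"\1", token)


for token in ["njengo{ku}ba", "ndiyamthand{a}"]:
    print(token, "->", spoken_form(token), "/", reconstructed_form(token))
```

Keeping both forms available lets a search match either what was actually said or the full written form.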
  • Slide 33
  • Spoken Reductions cntd Reconstruction convention S{uku}hamba Tagged S >{uku >}hamb >a > >
  • Slide 34
  • Morphophonological Issues Coalescence Nenkomo > ne >n >komo > Neenkomo > ne >en >komo > Syllabification Ngasendl{w}ini > nga >se >n >dl{w} >ini > Ayikafiki > ayi >ka >fik >i >
  • Slide 35
  • Morphophonological cntd Elision Andinamoto > andi >na > >m oto > > Stem modifications Emlanjeni > e >m >lanj >en i > >
  • Slide 36
  • Designing a tag set Granularity Lexical categories N, V (Tagging lexical categories is problematic in an agglutinating language) Syntagmatic morphological slots amadodana > a >ma >dod >ana >
  • Slide 37
  • Designing cntd Paradigmatic instantiations within a syntagmatic slot gnp = >--- > Word categories nje (wenjenje) > nje >; njalo >; njeya > ke > ke > kaloku > ke > ke_kaloku_ke > e >m >lanj >eni >??
  • Slide 38
  • Designing cntd
      Spoken language expressions
      Non-word like expressions
      2 problems
          1. Standardising orthographic representation
          2. Tags
      e: > mh: > uh_uh_uh >
  • Slide 39
  • Designing cntd Word-like expressions >thixo > Thixo > Heyi_wethu Nantso_ke Suka_(wena)
  • Slide 40
  • Manual tagging
      Manual tagging necessary for 3 reasons
          Identifying tagging problems and problematic phenomena and revising the tag set
          Developing a training corpus
          Correcting automated tagging errors
      Manual (typing) tagging not ideal
          Tedious
          Error-prone
      Solution: Drag-and-drop tagger
  • Slide 41
  • Drag-and-drop tagger
      Demonstration of drag-and-drop tagger