Spoken Speech Corpora
Shu-Chuan Tseng
Academia Sinica
January 2002
Contents
• Speech Corpora and Annotation
• Related Research Issues
• Transcribing and Annotating Mandarin Spontaneous Dialogues
• Interface and Annotation Tags
State of the Art – Speech Corpora
• Air Travel Information System. Spoken Language Systems Pilot Corpus (ATIS, h-c, English, MADCOW 1992)
http://www.ldc.upenn.edu/Catalog/LDC93S4A.html
http://morph.ldc.upenn.edu/readme_files/atis/sspcrd/corpus.html
• SRI’s Amex Travel Agent Data (AMEX, h-h, English, Kowtko & Price 1989): 66 conversations
http://www.ai.sri.com/~communic/amex/amex.html
• Switchboard Corpus (SWBD, h-h, English, Godfrey, Holliman and McDaniel 1992): 2400 conversations
http://stripe.colorado.edu/~jurafsky/manual.august1.html
State of the Art - Speech Corpora
• HCRC Map Task Corpus (h-h, English, Anderson et al. 1991): 64 subjects
http://www.hcrc.ed.ac.uk/dialogue/maptask.html
• BAUFIX (h-h, h-c, German, Sagerer et al. 1994, Brindöpke et al. 1995): h-h: 22 dialogues, h-c: 32 dialogues
http://www.sfb360.uni-bielefeld.de/transkript/
• TRAINS 93 (h-h, English, Heeman & Allen 1995): 20 different tasks, 34 different speakers, 6.5 hrs, 5900 turns and 55000 transcribed words
http://www.cs.rochester.edu/research/cisd/projects/trains/
• Pattern-Description Monologues (h, Dutch, Levelt 1983): 53 different patterns, 53 subjects
State of the Art - Speech Annotation
• Transliteration:
- audio data => written transcripts
- systems: verbatim, annotated, shortened/cleaned, conversation acts, suprasegmental
- contents: words, boundaries, tones, non-speech sequences, discourse- and syntax-related roles
• Labelling = Transcripts + Time-aligned Signals
- raw acoustical data => time-aligned, segmented acoustical data
- tiers: phone, phoneme, syllable, word, boundary & tone
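The idea of time-aligned tiers can be illustrated with a small data structure. This is only a sketch: the class names `Interval` and `Tier`, the use of seconds as the time unit, and the sample timings are assumptions made for illustration and do not come from any of the corpora listed here.

```python
from dataclasses import dataclass, field


@dataclass
class Interval:
    start: float  # seconds from the beginning of the sound file (assumed unit)
    end: float
    label: str


@dataclass
class Tier:
    name: str  # e.g. "word" or "syllable"
    intervals: list = field(default_factory=list)

    def add(self, start, end, label):
        """Append a labelled, time-aligned segment to this tier."""
        if end <= start:
            raise ValueError("interval must have positive duration")
        self.intervals.append(Interval(start, end, label))

    def at(self, time):
        """Return the label of the interval containing `time`, or None."""
        for interval in self.intervals:
            if interval.start <= time < interval.end:
                return interval.label
        return None


# Hypothetical alignment of one word on two tiers.
word = Tier("word")
word.add(0.00, 0.50, "limian")

syllable = Tier("syllable")
syllable.add(0.00, 0.20, "li")
syllable.add(0.20, 0.50, "mian")

# The same point in time can be looked up on every tier.
print(word.at(0.30), syllable.at(0.30))
```

Keeping each tier as its own list of intervals means that a word, its syllables and its tones can all be queried by the same time stamp.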
Transcription: Turns (BAUFIX)
01K020 ach so, {ich}<spk: I, s/> dachte, die wären auf der anderen Seite.
<hum: lachen> <noise> war gerade am überlegen ist das </noise: rascheln> <par> <noise> nicht (ei)n bißchen kurz </noise: rascheln> </par: 2>
01I022 <par> also nicht über Kreuz </par: 2> <noise> sondern wirklich so gerade übereinander. </noise: rascheln>
01K021 {ja, ja}<noise: klappern>
01I023 {mhm}<noise: klappern>
01K022 <noise> <sil: 48> hm, irgendwie geht das nicht so toll fest. <hum: lachen> </noise: klappern>
01I024 <noise> <sil: 1> {hält das nicht?}<spk: K, ?> </noise: klappern>
01K023 <noise> <sil: 1> hm, nee dieses eine Rautenteil war zu klein. das ging <quest: da> nicht drüber. <hum: lachen> ja <-> und nun? </noise: rascheln>
01I025 <noise> <hum: atmen> ja und du hast also jetzt diese <hum: atmen> {äh ja}<hum: atmen> diese benzolförmigen Dinger sind jetzt sag(e) ich mal oben <-> und auch oben <-> auf diesen <-> beiden Platten ist jetzt dieser <--> Würfel. </noise: rascheln>
01K024 {mhm}<noise: rascheln>
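A transcript in this style can be processed mechanically. The sketch below is an assumption-laden illustration, not the BAUFIX project's own tooling: it assumes that a turn label such as `01K022` consists of a two-digit dialogue number, a one-letter speaker code and a three-digit turn number, and that everything in angle brackets is an annotation tag. The function name `parse_turn` is hypothetical.

```python
import re

# Assumed layout of a turn line: 2-digit dialogue, speaker letter, 3-digit turn.
TURN_RE = re.compile(r"^(\d{2})([A-Z])(\d{3})\s+(.*)$")
# Anything in angle brackets is treated as an annotation tag.
TAG_RE = re.compile(r"<[^<>]*>")


def parse_turn(line):
    """Split a turn line into its label parts, its tags and its plain text."""
    match = TURN_RE.match(line)
    if match is None:
        raise ValueError(f"not a turn line: {line!r}")
    dialogue, speaker, number, body = match.groups()
    tags = TAG_RE.findall(body)
    # Remove the tags and the curly braces that mark tagged stretches.
    # Note: this also drops material inside tags such as <quest: da>.
    text = TAG_RE.sub(" ", body)
    text = re.sub(r"[{}]", "", text)
    text = " ".join(text.split())
    return {
        "dialogue": dialogue,
        "speaker": speaker,
        "turn": int(number),
        "tags": tags,
        "text": text,
    }


line = ("01K022 <noise> <sil: 48> hm, irgendwie geht das nicht so toll fest. "
        "<hum: lachen> </noise: klappern>")
turn = parse_turn(line)
print(turn["speaker"], turn["turn"], turn["text"])
print(turn["tags"])
```

Separating tags from text in this way makes it possible to count words on the cleaned text while still keeping the noise, silence and laughter annotations for later analysis.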
Transcription: Conversation Acts (TRAINS, Traum/Hinkelman 1992)
• DU (Discourse Unit)
• UU (Utterance Unit)
• Sub UU
Prosodic Annotation
• ToBI: Tones and Break Indices
Pan-Mandarin ToBI System (http://deall.ohio-state.edu/chan.9/MToBI.htm)
Related Issues: Lexical Distribution
[Figure: number of types (y-axis, 0 to 400) plotted against word frequency (x-axis, from 90 down to 1)]
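A distribution of this kind, the number of word types at each frequency, can be computed from a token list in two counting steps. The sketch below is illustrative only: the function name `frequency_spectrum` and the toy token list are assumptions, and it does not reproduce the figures of the corpus.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each word frequency to the number of types with that frequency."""
    word_freq = Counter(tokens)          # type -> number of occurrences
    return Counter(word_freq.values())   # frequency -> number of types


# Toy example in Pinyin; a real input would be the transcribed tokens.
tokens = "en na en dui en na wo".split()
spectrum = frequency_spectrum(tokens)

# Print from the highest frequency down to 1, as on the x-axis of the chart.
for freq in sorted(spectrum, reverse=True):
    print(freq, spectrum[freq])
```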
Related Issues: turn-initial words
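Speaker: turn-initial word (occurrences)
D1-A: 嗯 en (37), 他 ta (10), 哎 ai (9), 對 dui (7), 嗷 ou (6), 她 ta (5), 那 na (4), 我 wo (4), 有 you (4), 好像 haoxiang (4)
D1-B: 嗯 en (45), 那 na (9), 哎 ai (8), 對 dui (7), 嗷 ou (7), 他 ta (6), 我 wo (5), 她 ta (5), 就 jiu (4), 哦 o (4)
D2-A: 嗯 en (18), 嗷 ou (8), 嗨 hai (7), 這樣子 zheyangzi (7), 哦 o (6), 那 na (6), 對 dui (5), 嗯嗯 enen (5), 真的 zhende (5), 哎 ai (5)
D2-B: 嗯 en (9), 對 dui (8), 是 shi (8), 哎 ai (8), 哦 o (7), 呵 he (7), 嗷 ou (5), 啊 a (5), 我 wo (5), 呃 e (5)
D3-A: 嗨 hai (22), 那 na (18), 嗷 ou (7), 我 wo (7), 對啊 duia (6), 哎 ai (6), 對 dui (4), 嗯 en (4), 這樣 zheyang (4), 哦 o (4)
D3-B: 嗯 en (18), 那 na (17), 啊 a (9), 嗷 ou (6), 哎 ai (5), 我 wo (4), 呃 e (4), 對啊 duia (3), 就 jiu (3), 哦 o (3)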
Related Issues: Repair Types
Type                       Occurrence
Repetition                 202
Substitution               46
Addition                   43
Deletion                   9
Addition/Substitution      1
Addition/Repetition        5
Deletion/Repetition        1
Repetition/Addition        10
Repetition/Addition/Sub    1
Repetition/Deletion        1
Repetition/Substitution    3
Substitution/Repetition    3
Related Issues: POS in Chinese Repairs
POS                  Abbreviation   Occurrences
Verb                 V              258
Noun                 N              521
Preposition          P              29
Adverbial            D              322
Conjunction          C              23
Particle             T              22
Interjection         I              20
Non-predicate Adj.   A              1
Foreign Word         FW             2
Verb: be             SHI            27
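Raw counts like these are easier to compare across data sets once they are turned into proportions. The sketch below does this for the ten rows of the table. It assumes that these rows cover all POS-tagged words in the repairs, which the slide does not state explicitly, and the variable names are illustrative.

```python
# Occurrences of each POS tag in repairs, copied from the table above.
repair_counts = {
    "V": 258, "N": 521, "P": 29, "D": 322, "C": 23,
    "T": 22, "I": 20, "A": 1, "FW": 2, "SHI": 27,
}

total = sum(repair_counts.values())

# Relative frequency of each tag (assumes the ten rows are the full set).
shares = {tag: count / total for tag, count in repair_counts.items()}

# List the tags from most to least frequent.
for tag, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag:4s} {repair_counts[tag]:4d} {share:.3f}")
```

Under this assumption nouns, adverbials and verbs together account for most of the words involved in repairs.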
Related Issues: POS (Repairs vs. Overall Data)
[Figure: percentage (y-axis, 0 to 0.45) of each POS tag (x-axis: V, N, P, D, C, T, I, A, b, SHI, FW), shown for repairs and for the overall data]
Related Issues: Prosodic Signalling
• reset hypothesis; baseline declination; intonation units
• pitch contour; location of editing terms
• duration
在 家 的 在 裡面 的 哦
zai jia de zai limian de O
At home, inside O
Related Issues: Intonation Unit vs. Repair
Collecting Speech Data
• Types of Speech: read, spontaneous, monologues, dialogues
• Scenario Design: daily conversation, direction-giving, instruction, pattern-description, task-oriented, topic-oriented
• Selection of Subjects: age, linguistic and social background, gender, education
• Recording: digital audio tape, MD, video, eye-tracking
• Transcription: orthographic transcript, discourse-related function or annotation, intonation-units
• Labelling: phonemic, word, prosodic
• Documentation: subjects, recording device, annotation system
Building a Large Mandarin Dialogue Corpus
• Content, Size, Style, Topic, Subject (corpus setting)
• Dialogue Transcription (computer-aided)
• Transcription Programme (speaker, sound file, tags, time, content)
• Convention Systems (depending on research directions)
• Content Annotation (computer-aided, automatically merged into database)
• Database Construction (database in Access format)
• Speech Labelling (sound file index and time alignment available in the database)
Statistics of Corpus
• 30 conversational dialogues (37 female/23 male)
• Age 16-25: 20 (15f/5m), age 26-35: 19 (9f/10m), age 36-45: 21 (13f/8m)
• Total length: 26.5 hr.; each dialogue is about 50 min. long
• Topics: family, work, economics, politics, movie, education, exams, language learning, TV, internet, internet café, children, childhood, traveling, music, jobs, school, dialect, social problems, environmental problems, colleagues, personal experience, computer, traffic, marriage, China, Taipei MRT etc.
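The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given above; the assumption that each dialogue has exactly two speakers is inferred from 60 subjects in 30 dialogues and is not stated on the slide.

```python
# (female, male) subjects per age group, as listed on the slide.
age_groups = {"16-25": (15, 5), "26-35": (9, 10), "36-45": (13, 8)}

female = sum(f for f, _ in age_groups.values())
male = sum(m for _, m in age_groups.values())
subjects = female + male

dialogues = 30
total_hours = 26.5

# Assumption: two speakers per dialogue.
speakers_per_dialogue = subjects / dialogues
mean_minutes = total_hours * 60 / dialogues

print(female, male, subjects)
print(speakers_per_dialogue, mean_minutes)
```

The age-group breakdown adds up to the stated 37 female and 23 male subjects, and the mean dialogue length comes out slightly above the "about 50 min." quoted on the slide.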
Computer-Aided Transcription
• transcribing conversation in Pinyin and in Chinese characters
• including transcriber and subject information
• documenting location of audio files
• marking start and end time of speech segment in corresponding audio files
• inserting flexible tags to annotate linguistic features
• outputting data in database format
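The information collected by such a transcription tool can be pictured as one database record per speech segment. The sketch below uses SQLite from the Python standard library as a stand-in; the slides mention a database in Access format, and the table name, column names, file name and sample values here are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# One row per transcribed speech segment (hypothetical schema).
conn.execute("""
    CREATE TABLE segment (
        id          INTEGER PRIMARY KEY,
        dialogue    TEXT NOT NULL,
        speaker     TEXT NOT NULL,
        transcriber TEXT NOT NULL,
        audio_file  TEXT NOT NULL,
        start_time  REAL NOT NULL,   -- seconds into the audio file
        end_time    REAL NOT NULL,
        pinyin      TEXT NOT NULL,
        characters  TEXT NOT NULL,
        tags        TEXT NOT NULL DEFAULT ''
    )
""")

conn.execute(
    "INSERT INTO segment (dialogue, speaker, transcriber, audio_file, "
    "start_time, end_time, pinyin, characters, tags) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("D1", "D1-A", "T01", "d1.wav", 12.40, 14.05,
     "zai jia de zai limian de o", "在家的在裡面的哦", "repair"),
)

# Example query: segments longer than one second.
rows = conn.execute(
    "SELECT speaker, pinyin FROM segment WHERE end_time - start_time > ?",
    (1.0,),
).fetchall()
print(rows)
```

Storing start and end times next to the audio file location is what lets a transcript line be played back or aligned with the signal later.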
Illustration - Interface
Illustration - Database
Extragrammatical Sequences in Human-Human Conversation
• disfluency
- prosodic, repair, syntactic, pragmatic
• socio-linguistic phenomena
- code switching, new words
• particular vocalisation
- lengthening, assimilation, syllable contraction
• unintelligible and non-speech sounds
Annotation Taglist - I
Annotation Taglist - II