Structural Metadata Annotation of Speech Corpora:
Comparing Broadcast News and Broadcast Conversations
Jáchym Kolář Jan Švec
University of West Bohemia in Pilsen, Czech Republic
29.5.2008 J. Kolar and J. Svec 2
Talk Overview
• Structural metadata annotation
• Speech data
• Statistics about fillers
• Statistics about edit disfluencies
• Statistics about sentence-like units
• Summary
Structural Metadata Extraction
• Metadata Extraction (MDE) research started as part of DARPA EARS program
• Metadata annotation scheme for MDE introduced by LDC (originally for English; we have extended it to Czech)
• ULTIMATE GOAL of MDE: Automatic conversion of raw speech recognition output to forms more useful to humans and downstream automatic processes
MDE Annotation Subtasks
• Boundaries of syntactic/semantic units (SUs)
  • Statements, Interrogatives, Incompletes
  • Coordination breaks, Clausal breaks
• Non-content words (fillers):
  • Filled pauses (FPs)
  • Discourse markers (DMs)
• Speech disfluencies (edits):
  • Deletable regions (DelRegs), Interruption points, Explicit editing terms, Corrections
MDE Annotation Example
but I you know really pre- uh prefer this form of of um presentation she Sheila told me on Tuesday no on Wednesday she didn’t so let’s move on because we don’t have uh don’t have time well do you like this this example
but I you know really [pre-]* uh prefer this form [of]* of um presentation/. [she]* Sheila told me [on Tuesday]* no on Wednesday/, she didn’t/. so let’s move on/, because we [don’t have]* uh don’t have time/. well do you like [this]* this example/?
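The cleanup that this annotation makes possible can be sketched in a few lines. This is a minimal illustration, not the authors' tooling. It assumes the notation as it appears in the example: `[...]*` marks a deletable region that ends at an interruption point, and `/.`, `/?`, `/,` are SU symbols. The filler markup did not survive in this transcript, so the `FILLERS` list is a hypothetical stand-in that covers this example only.

```python
import re

# Assumed notation, inferred from the example above:
# [ ... ]* is a deletable region ending at an interruption point.
DELREG = re.compile(r"\[[^\]]*\]\*\s*")
# /. /? /, are SU symbols attached to the preceding word.
SU_SYMBOL = re.compile(r"/([.?,])")

# Hypothetical filler list, chosen only to cover this example.
FILLERS = ("you know", "uh", "um")


def clean(annotated: str) -> str:
    """Drop deletable regions and listed fillers, turn SU symbols into punctuation."""
    text = DELREG.sub("", annotated)
    for filler in FILLERS:
        text = re.sub(rf"\b{re.escape(filler)}\b\s*", "", text)
    text = SU_SYMBOL.sub(r"\1", text)
    # Collapse any whitespace runs left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


first_su = "but I you know really [pre-]* uh prefer this form [of]* of um presentation/."
print(clean(first_su))
```

Explicit editing terms (such as "no" in the second SU) are not marked in the plain-text notation, so this sketch leaves them in place.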
Goal of This Paper
• Analyse and compare two Czech MDE corpora from different domains in terms of metadata statistics
• Compare Czech Broadcast News (BN) vs. Broadcast Conversations (BC)
• Also compare Czech and English MDE corpora – English Broadcast News and Conversational Telephone Speech (CTS)
Czech Broadcast News Data
• News from 3 TV channels and 4 radio stations
• Both public and commercial broadcast companies
• Differing in presentation style
• 26 hours of transcribed speech
• ~ 300 speakers
• Speech recordings and verbatim transcripts publicly available from LDC
Broadcast Conversation Data
• 52 recordings of a Czech radio talk show – Radioforum
• 24 hours of transcribed speech
• ~ 100 speakers
• 1-3 guests spontaneously answer questions asked by 1-2 interviewers
• Mostly political debates
• Currently being extended by an additional 20 recordings (~10 hours)
Statistics about Fillers
• Filled pauses more frequent in Czech Broadcast Conversations (3.8% of words) than in News (0.5%)
• English MDE: CTS – 2.2%, BN – 1.4%
• Discourse markers also more frequent in Czech Conversations (1.6%) than in News (0.1%)
• English MDE: CTS – 4.4%, BN – 0.5%
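Rates of this kind are shares of words carrying a given label. A minimal sketch of that computation follows, assuming one label per word. The label names (`W`, `FP`, `DM`) and the toy sequence are made up for illustration and are not corpus data.

```python
from collections import Counter


def label_rates(labels):
    """Return the share of words carrying each label, in percent."""
    if not labels:
        raise ValueError("need at least one word label")
    counts = Counter(labels)
    total = len(labels)
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy per-word labels (hypothetical): W = ordinary word,
# FP = filled pause, DM = word inside a discourse marker.
toy_labels = ["W", "W", "FP", "W", "DM", "W", "W", "W", "FP", "W"]
rates = label_rates(toy_labels)
print(rates["FP"], rates["DM"])
```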
Statistics about Edit Disfluencies
• Deletable regions – 2.8% of words in Conversations and 0.2% in News
• English MDE: 5.4% in CTS and 1.5% in BN
• Percentage of disfluencies having a correction is higher in News (94.6%) than in Conversations (83.8%)
• Explicit editing terms are rare in both corpora, occurring in only 4% of disfluencies
POS Analysis of Edit Disfluencies
• Tagged the Czech corpora employing an automatic POS tagger
• Czech uses structured tags with 15 positions; we only used the first position distinguishing 10 basic POS
• Computed and compared three POS distributions:
  1) Whole corpus
  2) Deletable regions only
  3) Corrections only
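The distribution step can be sketched as follows, assuming each word comes with a positional tag string whose first character encodes the basic POS. The tag strings below are toy values for illustration (only their first character matters here) and the function name is hypothetical.

```python
from collections import Counter


def pos_distribution(tags):
    """Relative frequency of the first tag position (basic POS)."""
    counts = Counter(tag[0] for tag in tags if tag)
    total = sum(counts.values())
    return {pos: n / total for pos, n in counts.items()}


# Toy 15-position tags (illustrative, not taken from the corpora).
toy_tags = [
    "NNFS1-----A----",
    "NNFS1-----A----",
    "VB-S---3P-AA---",
    "RR--6----------",
]
dist = pos_distribution(toy_tags)
print(dist)
```

Running the same function over all words, over words inside deletable regions, and over words inside corrections gives the three distributions to compare.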
POS Analysis of Edit Disfluencies
[Figure: two bar charts, one for BC and one for BN, showing relative frequencies (axis from 0 to 0.4) of Noun, Verb, Pronoun, Adverb, Conj, Adject and Prep for three series: All, DelRegs, Corr]
Statistics about SUs
• Average SU length: Conversations (14.5 words) show longer SUs than News (13.0)
• English BN (12.5) similar to Czech, but CTS shows much shorter SUs (7.0) than Broadcast Conversations
• SU-internal breaks (clausal and coordination) more frequent in Conversations than in News (49% vs. 31% of all SU symbols)
• Complex and compound sentences more common in spontaneous conversations than in prearranged news
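Both SU statistics can be derived from annotated text in one pass. The sketch below is an illustration under assumptions: it reuses the plain-text notation from the annotation example, treats `/.` and `/?` as SU-final symbols and `/,` as an SU-internal break, and uses a hypothetical function name and a toy input.

```python
import re

SU_FINAL = {"/.", "/?"}   # assumed SU-final symbols
SU_INTERNAL = {"/,"}      # assumed SU-internal break symbol


def su_stats(annotated):
    """Return (average SU length in words, share of internal breaks among SU symbols)."""
    # Detach each SU symbol from the word it follows.
    tokens = re.sub(r"(/[.?,])", r" \1 ", annotated).split()
    lengths, words_in_su, internal, symbols = [], 0, 0, 0
    for tok in tokens:
        if tok in SU_FINAL:
            lengths.append(words_in_su)
            words_in_su = 0
            symbols += 1
        elif tok in SU_INTERNAL:
            internal += 1
            symbols += 1
        else:
            words_in_su += 1
    # Words after the last SU-final symbol belong to an unfinished SU and are ignored.
    avg_len = sum(lengths) / len(lengths) if lengths else 0.0
    internal_share = internal / symbols if symbols else 0.0
    return avg_len, internal_share


toy = "so let's move on/, because we don't have time/. do you like this example/?"
avg_len, internal_share = su_stats(toy)
print(avg_len, internal_share)
```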
Summary
• Broadcast Conversations contain significantly more fillers and disfluencies than News
• Conversations also show longer SUs and contain a higher number of complex sentences than News
• Deletable regions and corrections in both corpora show different POS distributions in comparison with the general POS distributions
• We plan to make the Czech MDE corpora publicly available