Prosody in Spoken Language Understanding
Gina Anne Levow, University of Chicago
January 4, 2008, NLP Winter School 2008
Error Correction Spiral
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?
Roadmap
- Corrections: a motivating example
- Defining prosody
- Why prosody?
- Challenges in prosody
- Prosody in language understanding:
  - Recognizing tone and pitch accent
  - Spoken corrections, topic segmentation
- Conclusions
Defining Prosody
- Prosody: phonetic phenomena in speech that span more than a single segment ("suprasegmental")
- Prosody includes: stress, focus, tone, intonation, length/pause, rhythm
- Prosodic features include:
  - Pitch: perceptual correlate of fundamental frequency (f0), the rate of vocal fold vibration
  - Loudness/intensity, duration, segment quality
Why Prosody?
- Prosody plays a crucial role at all levels of language: lexical, syntactic, pragmatic/discourse
  - Establishes meaning; disambiguates sense and structure
- Across language families: common physiological, articulatory basis
- In synthesis and recognition of fluent speech
Prosody and the Lexicon
- Lexical: determines word identity
- Prosodic effect at the syllable level (the minimal unit)
- Lexical stress: syllable prominence
  - Combination of length, pitch movement, loudness: REcord (N) vs. reCORD (V)
- Pitch accent can differentiate words in some languages
- Lexical tone: tone languages, e.g. Chinese, Punjabi
  - Pitch height (register) and/or shape (contour)
  - Ma (high): mother; Ma (rising): hemp; Ma (low): horse; Ma (falling): scold
Prosody and Syntax
- Prosody can disambiguate structure: associated with chunking and attachment
- Not identical with syntactic phrase boundaries: "Prosody is predictable from syntax, except when it isn't"
- Prosodic phrasing indicated by some combination of pause, change in pitch
- Chunking, or "phrasing":
  A1: I met Mary and Elena's mother at the mall yesterday.
  A2: I met Mary and Elena's mother at the mall yesterday.
  (The two readings are identical in text; they differ only in prosodic phrasing.)
[Pitch track figures (50-400 Hz) for the two phrasings; example from Jennifer Venditti]
Punctuation & Prosody: Humor
A panda goes into a restaurant and has a meal. Just before he leaves he takes out a gun and fires it. The irate restaurant owner says, "Why did you do that?" The panda replies, "I'm a panda. Look it up." The restaurateur goes to his dictionary and under "panda" finds: "black and white arboreal, bear-like creatures; eats, shoots and leaves."
Prosody in Pragmatics & Discourse
- Focus: prominence, new information: pitch accent ("October eleventh")
- Sentence type, dialogue act: statement vs. declarative question ("It's raining(?)")
- Discourse structure (topic), emotion
[from Shih, Prosody Learning and Generation]
Challenges in Prosody I
- Highly variable: actual realization differs from the ideal
- Speaker variation: gender, vocal tract differences, idiosyncrasy
- Tonal coarticulation: neighboring tones influence each other (as with segments)
  - An underlying fall can become a rise
- Parallel encoding: effects at multiple levels realized simultaneously
Challenges in Prosody II
- Challenges for learning: lack of training data
- Sparseness: many prosodic phenomena are infrequent
  - E.g., non-declarative utterances, topic boundaries, contrastive accents, etc.
  - Challenging for machine learning methods
- Costs of labeling:
  - Many prosodic events require expert labeling
  - Need a large corpus to attest them; time-consuming, expensive
Strategy: Context
- Common model across languages
  - Pure acoustic-prosodic model: no word label, POS, or lexical stress information
  - English, Mandarin Chinese (also Cantonese, isiZulu)
- Exploit contextual information: features from adjacent syllables, phrase contour
- Analyze impact of context position, context encoding, context type
- > 12.5% reduction in error over no context
Data Collections
- English (Ostendorf et al., 95): Boston University Radio News Corpus, f2b
  - Manually annotated, aligned, syllabified; 4 pitch accent labels, aligned to syllables
- Mandarin: TDT2 Voice of America Mandarin Broadcast News
  - Automatically aligned, syllabified; 4 main tones plus neutral
Local Feature Extraction
- Uniform representation for tone and pitch accent
- Motivated by the Pitch Target Approximation Model: the tone/pitch accent target is approached exponentially; linear target: height, slope (Xu et al., 99)
- Base features:
  - Pitch, intensity: max, mean, min, range (Praat, speaker-normalized)
  - Pitch at 5 points across the voiced region
  - Duration; initial/final position in phrase
  - Slope: linear fit to the last half of the pitch contour
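The base features above are straightforward to compute once per-syllable f0 samples are in hand. The sketch below is an illustrative reconstruction, not the talk's actual code; the function and feature names are invented for the example, and speaker normalization is assumed to have happened upstream.

```python
# Illustrative sketch: base prosodic features for one syllable from its
# pitch (f0) samples, following the slide: max/mean/min/range, 5 evenly
# spaced pitch points, and the slope of a least-squares linear fit to
# the last half of the contour.

def base_features(f0):
    """f0: list of (speaker-normalized) pitch samples for one voiced region."""
    n = len(f0)
    feats = {
        "max": max(f0),
        "min": min(f0),
        "mean": sum(f0) / n,
        "range": max(f0) - min(f0),
    }
    # 5 pitch points evenly spaced across the voiced region
    for i in range(5):
        feats[f"p{i}"] = f0[round(i * (n - 1) / 4)]
    # Slope: least-squares linear fit to the last half of the contour
    tail = f0[n // 2:]
    m = len(tail)
    xbar = (m - 1) / 2
    ybar = sum(tail) / m
    num = sum((x - xbar) * (y - ybar) for x, y in enumerate(tail))
    den = sum((x - xbar) ** 2 for x in range(m))
    feats["slope"] = num / den if den else 0.0
    return feats
```

A rising contour such as `[100, 110, 120, 130, 140]` yields a positive slope, as the target-approximation picture predicts for a rising tone.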
Context Features
- Local context:
  - Extended features: pitch max, mean, and adjacent points of the preceding and following syllables
  - Difference features: differences in pitch max, mean, mid, slope and intensity max, mean between the current syllable and its preceding and following neighbors
- Phrasal context:
  - Compute the collection-average phrase slope
  - Compute scalar pitch values, adjusted for slope
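The two local-context encodings can be sketched as a small feature-augmentation step. This is a hedged illustration under assumed dict-based feature vectors; the exact feature set and naming in the original system are not specified on the slide.

```python
# Illustrative sketch of the context-feature step: extend each syllable's
# local features with values from adjacent syllables ("Extend") and with
# current-minus-neighbor differences ("Diffs").

def add_context(feats, prev, nxt):
    """feats/prev/nxt: dicts of local features; prev/nxt may be None at edges."""
    out = dict(feats)
    for name, nbr in (("prev", prev), ("next", nxt)):
        if nbr is None:
            continue
        # Extended features: copy the neighbor's pitch max and mean
        out[f"{name}_max"] = nbr["max"]
        out[f"{name}_mean"] = nbr["mean"]
        # Difference features: current minus neighbor
        out[f"d_{name}_max"] = feats["max"] - nbr["max"]
        out[f"d_{name}_mean"] = feats["mean"] - nbr["mean"]
    return out
```

Dropping the `nxt` argument gives the "Pre"-only conditions in the results table; dropping `prev` gives the "Post"-only conditions.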
Classification Experiments
- Classifier: Support Vector Machine; linear kernel; multiclass formulation
  - SVMlight (Joachims), LibSVM (Chang & Lin 01)
- 4:1 training/test splits
- Experiments: effects of
  - Context position: preceding, following, none, both
  - Context encoding: Extended/Difference
  - Context type: local, phrasal
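The 4:1 split can be sketched generically; whether the original experiments used random or stratified splits is not stated on the slide, so the shuffling here is only an assumption.

```python
import random

# Illustrative sketch of a 4:1 train/test split (80%/20%). The seed and
# the use of uniform shuffling are assumptions, not details from the talk.

def split_4_to_1(items, seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * 4 // 5
    return items[:cut], items[cut:]
```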
Results: Local Context

Context         Mandarin Tone   English Pitch Accent
Full            74.5%           81.3%
Extend PrePost  74%             80.7%
Extend Pre      74%             79.9%
Extend Post     70.5%           76.7%
Diffs PrePost   75.5%           80.7%
Diffs Pre       76.5%           79.5%
Diffs Post      69%             77.3%
Both Pre        76.5%           79.7%
Both Post       71.5%           77.6%
No context      68.5%           75.9%
Discussion: Local Context
- Any context information improves over none
- Preceding context consistently improves over no context or following context alone
- English: generally, more context features are better; Mandarin: following context can degrade performance
- Little difference between encodings (Extend vs. Diffs)
- Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory coarticulation
Results & Discussion: Phrasal Context

Phrase Context  Mandarin Tone   English Pitch Accent
Phrase          75.5%           81.3%
No Phrase       72%             79.9%

- Phrase contour compensation enhances recognition
- Simple strategy; use of non-linear slope compensation may improve results further
Strategy: Training
- Challenge: can we use the underlying acoustic structure of the language, through unlabeled examples, to reduce the need for expensive labeled training data?
- Exploit semi-supervised and unsupervised learning:
  - Semi-supervised Laplacian SVM
  - K-means and asymmetric k-lines clustering
- Substantially outperform baselines; can approach supervised levels
Semi-supervised Learning
- Approach: employ a small amount of labeled data; exploit information from additional, presumably more available, unlabeled data
- Few prior examples; several weakly supervised (Wong et al., '05)
- Classifier: Laplacian SVM (Sindhwani, Belkin & Niyogi '05)
  - Semi-supervised variant of the SVM; exploits unlabeled examples
  - RBF kernel, typically 6 nearest neighbors, transductive
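The Laplacian SVM itself is too involved for a short sketch, but the core intuition, that labels should spread through a neighborhood graph built over labeled and unlabeled points together, can be shown with a deliberately tiny stand-in: iterative nearest-neighbor label propagation over 1-D feature values. This is not the talk's method, only a toy illustration of why unlabeled data helps.

```python
# Toy stand-in for graph-based semi-supervised learning: repeatedly give
# the unlabeled point closest to any labeled point that neighbor's label,
# so labels flow outward through dense regions of the feature space.

def propagate(points, labels):
    """points: list of floats; labels: parallel list of class names or None."""
    labels = list(labels)
    while None in labels:
        best, best_d, best_lab = None, None, None
        for i, lab in enumerate(labels):
            if lab is not None:
                continue  # only unlabeled points receive labels
            for j, lab2 in enumerate(labels):
                if lab2 is None:
                    continue  # only labeled points donate labels
                d = abs(points[i] - points[j])
                if best_d is None or d < best_d:
                    best, best_d, best_lab = i, d, lab2
        labels[best] = best_lab
    return labels
```

With one labeled example per cluster, the remaining points inherit the label of their own cluster, which is the effect the Laplacian SVM achieves in a principled, kernelized way.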
Experiments
- Pitch accent recognition: binary classification (unaccented/accented)
  - 1000 instances, proportionally sampled; labeled training: 200 unaccented, 100 accented
  - 80% accuracy (cf. 84% for an SVM with 15x the labeled data)
- Mandarin tone recognition: 4-way classification, n(n-1)/2 binary classifiers
  - 400 instances, balanced; 160 labeled
  - Clean lab speech, in-focus: 94% (cf. 99% for an SVM with 1000s of training samples; 85% for an SVM with 160)
  - Broadcast news: 70% (cf. < 50% for an SVM with 160 training samples)
Unsupervised Learning
- Question: can we identify the tone structure of a language from the acoustic space without training? Analogous to language acquisition
- Significant recent research in unsupervised clustering
  - Established approaches: k-means; spectral clustering (Shi & Malik '97; Fischer & Poland 2004): asymmetric k-lines
- Little research for tone:
  - Self-organizing maps (Gauthier et al., 2005): tones identified in lab speech using f0 velocities
  - Cluster-based bootstrapping (Narayanan et al., 2006)
  - Prominence clustering (Tamburini '05)
Contrasting Clustering
- Clustering: 2-16 clusters; label each cluster with its most frequent class
- 3 spectral approaches (all perform spectral decomposition of an affinity matrix):
  - Asymmetric k-lines (Fischer & Poland 2004)
  - Symmetric k-lines (Fischer & Poland 2004)
  - Laplacian Eigenmaps (Belkin, Niyogi & Sindhwani 2004): binary weights, k-lines clustering
- K-means: standard Euclidean distance; 2-16 clusters
- Best results: > 78%
  - 2 clusters: asymmetric k-lines; > 2 clusters: k-means
  - With larger numbers of clusters, all approaches are similar
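The evaluation step on these slides, labeling each cluster with its most frequent true class and scoring accuracy against those labels, is simple to make concrete. The clustering algorithms themselves (k-means, asymmetric k-lines) are not re-implemented here; this sketch assumes cluster assignments are already available.

```python
from collections import Counter

# Majority-label evaluation of a clustering: each cluster gets its most
# frequent true class as its label, and accuracy is the fraction of items
# whose true class matches their cluster's label.

def cluster_accuracy(cluster_ids, true_labels):
    majority = {}
    for c in set(cluster_ids):
        members = [t for ci, t in zip(cluster_ids, true_labels) if ci == c]
        majority[c] = Counter(members).most_common(1)[0][0]
    hits = sum(majority[c] == t for c, t in zip(cluster_ids, true_labels))
    return hits / len(true_labels)
```

Note that with many small clusters this measure becomes optimistic (in the limit of one item per cluster it is always 100%), which is one reason the slides report results across a range of 2-16 clusters.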
Tone Clustering: I
- Mandarin four tones: 400 samples, balanced
- 2-phase clustering: 2-5 clusters each; asymmetric k-lines, k-means clustering
- Clean read speech:
  - In-focus syllables: 87% (cf. 99% supervised)
  - In-focus and pre-focus: 77% (cf. 93% supervised)
- Broadcast news: 57% (cf. 74% supervised)
- K-means requires more clusters to reach the k-lines level

Tone Structure
- The first phase of clustering splits high/rising from low/falling tones by slope; the second phase splits by pitch height
Conclusions
- Common prosodic framework for tone and pitch accent recognition
- Contextual modeling enhances recognition: local context and broad phrase contour
  - Carryover coarticulation has a larger effect for Mandarin
- Exploiting unlabeled examples for recognition: semi- and unsupervised approaches
  - Best cases approach supervised levels with less training
  - Exploits the acoustic structure of the tone and accent space
Error Correction Spiral
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?
Recognizing Spoken Corrections
- Recognize user attempts to correct ASR failures
- Compare original inputs to repeat corrections; significant differences:
  - Corrections: increases in duration, pause number/length, final fall
  - Increases in pitch accent for misrecognitions
- Automatic recognition with decision trees, boosting
  - Distinguish corrective/not (at human level)
  - Key features: raw/normalized duration, pause
- Identify the specific word being corrected
  - Key features: highest pitch, widest pitch range
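The slides report decision trees and boosting over duration and pause features; as a minimal illustration of the kind of rule such a tree learns, here is a single decision stump that flags a repeat as a correction when it is stretched out or preceded by a long pause. Both threshold values are invented for the example, not taken from the talk.

```python
# Toy decision stump over correction cues: a repeat that is noticeably
# longer than the original input, or preceded by a long pause, is flagged
# as a correction. Thresholds (1.2x duration, 0.5 s pause) are invented
# example values.

def is_correction(orig_dur, repeat_dur, pause_len, dur_ratio=1.2, max_pause=0.5):
    """Durations and pause length in seconds; returns True for a correction."""
    return repeat_dur / orig_dur > dur_ratio or pause_len > max_pause
```

A real system would combine many such stumps (e.g. via boosting) over the full feature set rather than rely on two hand-picked thresholds.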
The Problem: Speech Topic Segmentation
- Separate an audio stream into its component topics:
On "World News Tonight" this Thursday, another bad day on stock markets, all over the world, global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock, Texas prepares for catastrophe, Bangalore, in India, sees only profit. ||
Recognizing Shifts in Topic & Turn
- Topic and turn boundaries in English and Mandarin
- Initial syllables have significantly higher pitch and loudness than final syllables
- Lexical and prosodic cues: cue words, tf*idf similarity; pitch, loudness, silence
- Automatic recognition with decision trees, boosting
  - Voting to combine text, prosody, silence: 97% accuracy
  - Key features: pause; pitch and loudness contrast between syllables
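The lexical cue on this slide, tf*idf similarity, can be sketched as cosine similarity between tf*idf-weighted word vectors for the windows before and after a candidate boundary: low similarity suggests a topic shift. The idf table is assumed to be precomputed over the collection; terms missing from it are ignored.

```python
import math
from collections import Counter

# tf*idf cosine similarity between two word windows. A score near 0
# (little vocabulary overlap) is evidence for a topic boundary between them.

def cosine_sim(words_a, words_b, idf):
    a, b = Counter(words_a), Counter(words_b)
    va = {w: c * idf.get(w, 0.0) for w, c in a.items()}
    vb = {w: c * idf.get(w, 0.0) for w, c in b.items()}
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

On the "World News Tonight" example, the stock-market and Kosovo windows share almost no content words, so their similarity is near zero at the true topic boundary.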
Conclusions & Opportunities
- Prosody: a rich source of information across languages; challenging due to variation and paucity of data
- Can be successfully employed, with learning, to improve language understanding: pitch accent, tone, dialogue act, turn, topic, ...
- Unrestricted conversational, multi-party, multimodal speech is much more challenging: increased variability, interaction with non-verbal evidence
Thanks
Dinoj Surendran, Siwei Wang, Yi Xu
V. Sindhwani, M. Belkin & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin
This work supported by NSF Grant #0414919
http://people.cs.uchicago.edu/~levow/tai
Phrasing can disambiguate
I met Mary and Elena's mother at the mall yesterday
(Chunks: "Mary & Elena's mother", "mall")
One intonation phrase with a relatively flat overall pitch range.
Phrasing can disambiguate
I met Mary and Elena's mother at the mall yesterday
(Chunks: "Mary", "Elena's mother", "mall")
Separate phrases, with expanded pitch movements.
Lists of numbers, nouns
twenty.eight.five
ninety.four.three
seventy.three.seven
forty.seven.seven
seventy.seven.seven
coffee cake and cream
chocolate ice cream and cake
fish fingers and bottles
cheese sandwiches and milk
cream buns and chocolate
[from the Prosody on the Web tutorial on chunking]
Clustering
- Pitch accent clustering: 4-way distinction; 1000 samples, proportional
- 2-16 clusters constructed; assign the most frequent class label to each cluster
- Classifier: asymmetric k-lines: context-dependent kernel radii, non-spherical clusters
- > 78% accuracy; 2 clusters: asymmetric k-lines best
- Context effects: a vector with preceding context and a vector with no context are comparable