
Context and Learning in Multilingual Tone and Pitch Accent Recognition

Gina-Anne Levow

University of Chicago

May 18, 2007

Roadmap

• Challenges for Tone and Pitch Accent
  – Contextual effects
  – Training demands
• Modeling Context for Tone and Pitch Accent
  – Data collections & processing
  – Integrating context
  – Context in Recognition
• Asides: More tones and features
• Reducing Training Demands
  – Data collections & structure
  – Semi-supervised learning
  – Unsupervised clustering
• Conclusion

Challenges: Context

• Tone and Pitch Accent Recognition
  – Key component of language understanding
    » Lexical tone carries word meaning
    » Pitch accent carries semantic, pragmatic, and discourse meaning
  – Non-canonical form (Shen 90, Shih 00, Xu 01)
    » Tonal coarticulation modifies surface realization; in extreme cases, a fall becomes a rise
  – Tone is relative
    » To speaker range: high for a male speaker may be low for a female speaker
    » To phrase range and to other tones, e.g. downstep

Challenges: Training Demands

• Tone and pitch accent recognition
  – Exploit data-intensive machine learning
    » SVMs (Thubthong 01, Levow 05, SLX05)
    » Boosted and bagged decision trees (X. Sun 02)
    » HMMs (Wang & Seneff 00, Zhou et al 04, Hasegawa-Johnson et al 04, …)
  – Can achieve good results with huge sample sets
    » SLX05: ~10K lab syllable samples -> > 90% accuracy
  – Training data is expensive to acquire
    » Time: pitch accent labeling runs at tens of times real time
    » Money: requires skilled labelers
    » Limits investigation across domains, styles, etc.
  – Human language acquisition doesn’t use labels

Strategy: Overall

• Common model across languages
  – Common machine learning classifiers
  – Acoustic-prosodic model
    » No word label, POS, or lexical stress information
    » No explicit tone label sequence model
  – English, Mandarin Chinese, isiZulu
    » (also Cantonese)

Strategy: Context

• Exploit contextual information
  – Features from adjacent syllables
    » Height, shape: direct, relative
  – Compensate for phrase contour
  – Analyze impact of context position, context encoding, and context type
    » > 12.5% reduction in error over no context

Data Collections: I

• English: (Ostendorf et al 95)
  – Boston University Radio News Corpus, f2b
  – Manually ToBI annotated, aligned, syllabified
  – Pitch accent aligned to syllables (Sun 02, Ross & Ostendorf 95)
    » Unaccented, High, Downstepped High, Low

Data Collections: II

• Mandarin:
  – TDT2 Voice of America Mandarin Broadcast News
  – Automatically force-aligned to anchor scripts
    » Automatically segmented; pinyin pronunciation lexicon
    » Manually constructed pinyin-ARPABET mapping
    » CU Sonic – language porting
  – Tones: High, Mid-rising, Low, High falling, Neutral

Data Collections: III

• isiZulu: (Govender et al 2005)
  – Sentence text collected from the Web
    » Selected based on grapheme bigram variation
  – Read by a male native speaker
  – Manually aligned, syllabified
  – Tone labels assigned by a second native speaker, based only on the utterance text
  – Tone labels: High, Low

Local Feature Extraction

• Uniform representation for tone and pitch accent
  – Motivated by the Pitch Target Approximation Model
    » Tone/pitch accent target approached exponentially
    » Linear target: height, slope (Xu et al 99)
• Base features (sketched in code below):
  – Pitch and intensity max, mean, min, range
    » (Praat, speaker-normalized)
  – Pitch at 5 points across the voiced region
  – Duration
  – Initial, final position in phrase
• Slope:
  – Linear fit to the last half of the pitch contour
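A minimal sketch of this per-syllable feature vector, assuming pitch and intensity contours have already been extracted (e.g. with Praat) and restricted to the voiced region; the function name and arguments are illustrative, not from the original system:

    import numpy as np

    def syllable_features(f0, intensity, duration, spk_mean, spk_std):
        # Speaker-normalize pitch (z-score over the speaker's data)
        f0 = (f0 - spk_mean) / spk_std
        feats = [f0.max(), f0.mean(), f0.min(), f0.max() - f0.min(),
                 intensity.max(), intensity.mean(), intensity.min(),
                 intensity.max() - intensity.min(),
                 duration]
        # Pitch at 5 evenly spaced points across the voiced region
        idx = np.linspace(0, len(f0) - 1, 5).astype(int)
        feats.extend(f0[idx])
        # Slope: linear fit to the last half of the pitch contour
        half = f0[len(f0) // 2:]
        feats.append(np.polyfit(np.arange(len(half)), half, 1)[0])
        return np.array(feats)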

Context Features

• Local context:
  – Extended features
    » Pitch max, mean, and adjacent points of the preceding and following syllables
  – Difference features (sketched below)
    » Differences between the preceding/following syllable and the current one in pitch max, mean, mid, and slope, and in intensity max and mean
• Phrasal context (also sketched below):
  – Compute the collection-average phrase slope
  – Compute scalar pitch values adjusted for that slope
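A rough sketch of the difference features and the phrase-slope adjustment; the feature keys and function names are assumptions for illustration:

    DIFF_KEYS = ['pitch_max', 'pitch_mean', 'pitch_mid', 'pitch_slope',
                 'int_max', 'int_mean']

    def difference_features(prev, cur, nxt):
        # Deltas between the current syllable and each neighbor,
        # for pitch max/mean/mid/slope and intensity max/mean
        return [cur[k] - nb[k] for nb in (prev, nxt) for k in DIFF_KEYS]

    def compensate_phrase(pitch_vals, times, phrase_slope):
        # Subtract the collection-average phrase slope so scalar pitch
        # features reflect local targets rather than phrase declination
        return [p - phrase_slope * t for p, t in zip(pitch_vals, times)]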

Classification Experiments

• Classifier: Support Vector Machine
  – Linear kernel
  – Multiclass formulation
    » SVMlight (Joachims), LibSVM (Chang & Lin 01)
  – 4:1 training/test splits (a setup sketch follows this list)
• Experiments: effects of
  – Context position: preceding, following, none, both
  – Context encoding: Extended / Difference
  – Context type: local, phrasal
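For concreteness, the same setup in scikit-learn, whose SVC wraps LibSVM and handles multiclass problems as pairwise one-vs-one classifiers; X and y stand for the feature vectors and tone/accent labels and are assumed here:

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # 4:1 train/test split, linear-kernel SVM
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    clf = SVC(kernel='linear').fit(X_train, y_train)
    print('accuracy:', clf.score(X_test, y_test))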

Results: Local Context

Context          Mandarin Tone   English Pitch Accent   isiZulu Tone
Full             74.5%           81.3%                  75.9%
Extend PrePost   74%             80.7%                  73.8%
Extend Pre       74%             79.9%                  73.6%
Extend Post      70.5%           76.7%                  72.3%
Diffs PrePost    75.5%           80.7%                  75.8%
Diffs Pre        76.5%           79.5%                  75.5%
Diffs Post       69%             77.3%                  72.8%
Both Pre         76.5%           79.7%                  75.5%
Both Post        71.5%           77.6%                  72.5%
No context       68.5%           75.9%                  72.2%


Discussion: Local Context

• Any context information improves over none
  – Preceding context consistently improves over no context or following context alone
    » English/isiZulu: generally, more context features are better
    » Mandarin: following context can degrade performance
  – Little difference between encodings (Extend vs Diffs)
• Consistent with phonetic analysis (Xu) showing that carryover coarticulation is greater than anticipatory coarticulation
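As a worked check on the size of the effect, using the Mandarin column of the table above (no-context error 100 - 68.5 = 31.5%, best Diffs Pre error 100 - 76.5 = 23.5%):

\[
\text{relative error reduction} = \frac{e_{\text{none}} - e_{\text{ctx}}}{e_{\text{none}}} = \frac{31.5 - 23.5}{31.5} \approx 25.4\%
\]

which is consistent with the "up to > 20% relative reduction" cited in the summary below.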

Results & Discussion: Phrasal Context

Phrase Context   Mandarin Tone   English Pitch Accent
Phrase           75.5%           81.3%
No Phrase        72%             79.9%

• Phrase contour compensation enhances recognition
• Simple strategy
• Non-linear slope compensation may yield further improvement

Context: Summary

• Employ a common acoustic representation
  – Tone (Mandarin, isiZulu), pitch accent (English)
• SVM classifiers, linear kernel: 76%, 76%, 81%
• Local context effects:
  – Up to > 20% relative reduction in error
  – Preceding context makes the greatest contribution
    » Carryover vs anticipatory coarticulation
• Phrasal context effects:
  – Compensation for phrasal contour improves recognition

Aside: More Tones

• Cantonese:
  – CUSENT corpus of read broadcast news text
  – Same feature extraction & representation
  – 6 tones: high level, high rise, mid level, low fall, low rise, low level
  – SVM classification:
    » Linear kernel: 64%; Gaussian kernel: 68%
    » Tones 3 and 6: 50% pairwise, i.e. mutually indistinguishable
    » Human levels: 50% without context; 68% with context
  – Augmenting with the syllable's phone sequence: 86% accuracy
    » For 90% of syllables carrying tone 3 or 6, one of the two tones dominates

Aside: Voice Quality & Energy

• With Dinoj Surendran
• Assess local voice quality and energy features for tone
  – Not typically associated with tones: Mandarin/isiZulu
• Considered:
  – Voice quality: NAQ, AQ, etc.; spectral balance; spectral tilt; band energy
• Useful: band energy significantly improves recognition (an illustrative computation follows)
  – Mandarin: neutral tone
    » Supports identification of unstressed syllables
    » (Spectral balance predicts stress in Dutch)
  – isiZulu: band energy outperforms pitch
    » In conjunction with pitch -> ~78%
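One illustrative way to compute band-energy features from a waveform; the band edges below are assumptions for the sketch, not the values used in this work:

    import numpy as np
    from scipy.signal import stft

    def band_energy(signal, sr, bands=((0, 500), (500, 2000), (2000, 8000))):
        # Mean log power within each frequency band of an STFT
        freqs, _, Z = stft(signal, fs=sr)
        power = np.abs(Z) ** 2
        return [np.log(power[(freqs >= lo) & (freqs < hi)].mean() + 1e-10)
                for lo, hi in bands]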

Roadmap

• Challenges for Tone and Pitch Accent
  – Contextual effects
  – Training demands
• Modeling Context for Tone and Pitch Accent
  – Data collections & processing
  – Integrating context
  – Context in Recognition
• Reducing Training Demands
  – Data collections & structure
  – Semi-supervised learning
  – Unsupervised clustering
• Conclusion

Strategy: Training

• Challenge:
  – Can we use the underlying acoustic structure of the language – through unlabeled examples – to reduce the need for expensive labeled training data?
• Exploit semi-supervised and unsupervised learning
  – Semi-supervised Laplacian SVM
  – K-means and asymmetric k-lines clustering
  – Both substantially outperform baselines
    » Can approach supervised levels

Data Collections & Processing

• English: (as before)
  – Boston University Radio News Corpus, f2b
  – Binary: Unaccented vs accented
  – 4-way: Unaccented, High, Downstepped High, Low
• Mandarin:
  – Lab speech data (Xu, 1999)
    » 5-syllable utterances varying tone and focus position: in-focus, pre-focus, post-focus
  – TDT2 Voice of America Mandarin Broadcast News
  – 4-way: High, Mid-rising, Low, High falling
• isiZulu: (as before)
  – Read web sentences
  – 2-way: High vs low

Semi-supervised Learning

• Approach:
  – Employ a small amount of labeled data
  – Exploit information from additional – presumably more available – unlabeled data
  – Few prior examples; several weakly supervised (Wong et al ’05)
• Classifier:
  – Laplacian SVM (Sindhwani, Belkin & Niyogi ’05)
  – Semi-supervised variant of the SVM that exploits unlabeled examples
  – RBF kernel, typically 6 nearest neighbors, transductive (see the sketch below)
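Laplacian SVM itself has no standard library implementation; as a rough, runnable stand-in for the same transductive, nearest-neighbor-graph idea, here is a sketch with scikit-learn's LabelSpreading. X and y are assumed feature/label arrays; unlabeled points are marked -1:

    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    # Keep a small labeled seed (here 300 points, echoing the
    # 200 + 100 labeled set below) and hide the remaining labels
    y_semi = np.full_like(y, -1)
    seed = np.random.RandomState(0).choice(len(y), size=300, replace=False)
    y_semi[seed] = y[seed]

    # Propagate labels over a 6-nearest-neighbor graph (transductive)
    model = LabelSpreading(kernel='knn', n_neighbors=6).fit(X, y_semi)
    acc = (model.transduction_ == y)[y_semi == -1].mean()
    print('accuracy on unlabeled points:', acc)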

Experiments

• Pitch accent recognition:
  – Binary classification: Unaccented/Accented
  – 1000 instances, proportionally sampled
  – Labeled training: 200 unaccented, 100 accented
    » 80% accuracy (cf. 84% for an SVM with 15x the labeled data)
• Mandarin tone recognition:
  – 4-way classification: n(n-1)/2 binary classifiers
  – 400 instances, balanced; 160 labeled
  – Clean lab speech, in-focus: 94%
    » cf. 99% for an SVM with 1000s of training samples; 85% for an SVM with 160 training samples
  – Broadcast news: 70%
    » cf. < 50% for an SVM with 160 training samples

Unsupervised Learning

• Question:
  – Can we identify the tone structure of a language from the acoustic space without training?
    » Analogous to language acquisition
• Significant recent research in unsupervised clustering
  – Established approaches: k-means
  – Spectral clustering (Shi & Malik ’97; Fischer & Poland 2004): asymmetric k-lines
• Little research for tone
  – Self-organizing maps (Gauthier et al 2005)
    » Tones identified in lab speech using f0 velocities
  – Cluster-based bootstrapping (Narayanan et al 2006)
  – Prominence clustering (Tamburini ’05)

Clustering

• Pitch accent clustering:
  – 4-way distinction: 1000 samples, proportional
  – 2-16 clusters constructed
    » Assign the most frequent class label to each cluster (sketched below)
  – Clusterer: asymmetric k-lines
    » Context-dependent kernel radii, non-spherical clusters
  – > 78% accuracy
    » 2 clusters: asymmetric k-lines best
  – Context effects:
    » Vectors with preceding context and vectors with no context perform comparably
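A minimal sketch of the cluster-then-label evaluation, with k-means standing in for asymmetric k-lines (which has no standard library implementation); X and y are assumed feature/label arrays:

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_label_accuracy(X, y, n_clusters=16):
        # Cluster without labels, then assign each cluster its most
        # frequent true class and score the induced labeling
        assign = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=0).fit_predict(X)
        pred = np.empty_like(y)
        for c in range(n_clusters):
            members = assign == c
            if members.any():
                vals, counts = np.unique(y[members], return_counts=True)
                pred[members] = vals[counts.argmax()]
        return (pred == y).mean()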

Contrasting Clustering

• Contrasts:
  – Clustering: 3 spectral approaches, each performing spectral decomposition of an affinity matrix:
    » Asymmetric k-lines (Fischer & Poland 2004)
    » Symmetric k-lines (Fischer & Poland 2004)
    » Laplacian Eigenmaps (Belkin, Niyogi & Sindhwani 2004): binary weights, k-lines clustering
  – K-means: standard Euclidean distance
  – Number of clusters: 2-16
• Best results: > 78%
  – 2 clusters: asymmetric k-lines; > 2 clusters: k-means
  – At larger numbers of clusters, all methods perform similarly

Contrasting Learners

Tone Clustering: I

• Mandarin four tones
  – 400 samples, balanced
  – 2-phase clustering: 2-5 clusters in each phase
  – Asymmetric k-lines and k-means clustering
• Results:
  – Clean read speech:
    » In-focus syllables: 87% (cf. 99% supervised)
    » In-focus and pre-focus: 77% (cf. 93% supervised)
  – Broadcast news: 57% (cf. 74% supervised)
  – K-means requires more clusters to reach the k-lines level

Tone Structure

• The first phase of clustering splits high/rising from low/falling tones by slope
• The second phase splits by pitch height (see the sketch below)
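A minimal sketch of that two-phase scheme, assuming each sample has been reduced to scalar slope and height features (names illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    def two_phase_clusters(slope, height):
        # Phase 1: split on slope (high/rising vs low/falling)
        phase1 = KMeans(n_clusters=2, n_init=10,
                        random_state=0).fit_predict(slope.reshape(-1, 1))
        labels = np.empty(len(slope), dtype=int)
        # Phase 2: within each slope group, split on pitch height
        for g in (0, 1):
            idx = np.flatnonzero(phase1 == g)
            sub = KMeans(n_clusters=2, n_init=10,
                         random_state=0).fit_predict(height[idx].reshape(-1, 1))
            labels[idx] = 2 * g + sub
        return labels  # four induced clusters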

Tone Clustering: II

• isiZulu High/Low tones
  – 3225 samples, no labels
  – Proportional: ~62% low, 38% high
  – K-means clustering, 2 clusters
• Read speech, web-based sentences:
  – 70% accuracy (vs 76% fully supervised)

Conclusions

• Common prosodic framework for tone and pitch accent recognition
  – Contextual modeling enhances recognition
    » Local context and broad phrase contour
    » Carryover coarticulation has a larger effect for Mandarin
  – Exploiting unlabeled examples for recognition
    » Semi-supervised and unsupervised approaches
    » Best cases approach supervised levels with less training
    » Exploits the acoustic structure of the tone and accent space

Current and Future Work

• Interactions of tone and intonation
  – Recognition of topic and turn boundaries
  – Effects of topic and turn cues on tone realization
• Child-directed speech & tone learning
• Support for computer-assisted tone learning
• Structured sequence models for tone
  – Sub-syllable segmentation & modeling
• Feature assessment
  – Band energy and intensity in tone recognition

Thanks

• Dinoj Surendran, Siwei Wang, Yi Xu

• Natasha Govender and Etienne Barnard

• V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin

• This work supported by NSF Grant #0414919

• http://people.cs.uchicago.edu/~levow/tai