Unsupervised and Semi-Supervised Learning of Tone and Pitch Accent
Gina-Anne Levow, University of Chicago
June 6, 2006


Page 1: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Unsupervised and Semi-Supervised Learning of Tone and Pitch Accent

Gina-Anne Levow

University of Chicago

June 6, 2006

Page 2: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Roadmap

• Challenges for Tone and Pitch Accent
  – Variation and Learning

• Data collections & processing

• Learning with less
  – Semi-supervised learning
  – Unsupervised clustering

• Approaches, structure, and context

• Conclusion

Page 3: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Challenges: Tone and Variation

• Tone and Pitch Accent Recognition
  – Key component of language understanding
    • Lexical tone carries word meaning
    • Pitch accent carries semantic, pragmatic, and discourse meaning
  – Non-canonical form (Shen 90, Shih 00, Xu 01)
    • Tonal coarticulation modifies surface realization
      – In extreme cases, a fall becomes a rise
    • Tone is relative
      – To speaker range: high for a male voice may be low for a female voice
      – To phrase range and other tones, e.g. downstep

Page 4: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Challenges: Training Demands

• Tone and pitch accent recognition
  – Exploits data-intensive machine learning
    • SVMs (Thubthong 01, Levow 05, SLX 05)
    • Boosted and bagged decision trees (X. Sun, 02)
    • HMMs (Wang & Seneff 00, Zhou et al. 04, Hasegawa-Johnson et al. 04, …)
  – Can achieve good results with large sample sets
    • ~10K lab syllable samples -> >90% accuracy
  – Training data is expensive to acquire
    • Time: pitch accent labeling runs at tens of times real time
    • Money: requires skilled labelers
    • Limits investigation across domains, styles, etc.
  – Human language acquisition doesn't use labels

Page 5: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Strategy: Training

• Challenge:
  – Can we use the underlying acoustic structure of the language – through unlabeled examples – to reduce the need for expensive labeled training data?
• Exploit semi-supervised and unsupervised learning
  – Semi-supervised Laplacian SVM
  – K-means and asymmetric k-lines clustering
  – Substantially outperform baselines
    • Can approach supervised levels

Page 6: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Data Collections I: English

• English (Ostendorf et al., 95)
  – Boston University Radio News Corpus, f2b
  – Manually ToBI annotated, aligned, syllabified
  – Pitch accent aligned to syllables
    • 4-way: Unaccented, High, Downstepped High, Low (Sun 02, Ross & Ostendorf 95)
    • Binary: Unaccented vs. Accented

Page 7: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Data Collections II: Mandarin

• Mandarin:
  – Lexical tones: High, Mid-rising, Low, High-falling, Neutral

Page 8: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Data Collections III: Mandarin

• Mandarin Chinese:
  – Lab speech data (Xu, 1999)
    • 5-syllable utterances varying tone and focus position
      – In-focus, pre-focus, post-focus
  – TDT2 Voice of America Mandarin Broadcast News
    • Automatically force-aligned to anchor scripts
      – Automatically segmented, pinyin pronunciation lexicon
      – Manually constructed pinyin-ARPABET mapping
      – CU Sonic – language porting
  – 4-way: High, Mid-rising, Low, High-falling

Page 9: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Local Feature Extraction

• Motivated by the Pitch Target Approximation Model
  – The tone/pitch accent target is approached exponentially
  – Linear target: height, slope (Xu et al., 99)
• Scalar features:
  – Pitch and intensity max and mean (Praat, speaker normalized)
  – Pitch at 5 points across the voiced region
  – Duration
  – Initial/final position in phrase
• Slope:
  – Linear fit to the last half of the pitch contour
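These local features can be sketched as follows. This is a minimal illustration, not the talk's actual extraction pipeline: it assumes the f0 and intensity contours over the voiced region have already been extracted (e.g. with Praat) and speaker-normalized.

```python
import numpy as np

def local_features(f0, intensity, duration):
    """Scalar features per syllable: pitch/intensity max and mean,
    pitch at 5 evenly spaced points across the voiced region,
    duration, and the slope of a linear fit to the last half of
    the pitch contour."""
    x = np.linspace(0.0, 1.0, len(f0))
    five_points = np.interp(np.linspace(0.0, 1.0, 5), x, f0)
    half = f0[len(f0) // 2:]
    slope = np.polyfit(np.arange(len(half)), half, 1)[0]
    return {
        "pitch_max": f0.max(), "pitch_mean": f0.mean(),
        "int_max": intensity.max(), "int_mean": intensity.mean(),
        "pitch_5pts": five_points,
        "duration": duration,
        "slope": slope,
    }
```

The slope of the last half of the contour approximates the final approach toward the pitch target, in line with the target approximation motivation above.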

Page 10: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Context Features

• Local context:
  – Extended features
    • Pitch max, mean, and adjacent points of the adjacent syllable
  – Difference features w.r.t. the adjacent syllable
    • Differences in
      – Pitch max, mean, mid, slope
      – Intensity max, mean
• Phrasal context:
  – Compute the collection-average phrase slope
  – Compute scalar pitch values, adjusted for that slope
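The phrasal-context adjustment can be sketched as below. This is a hypothetical illustration, assuming each phrase is represented by its per-syllable pitch means and that a single average slope is removed.

```python
import numpy as np

def phrase_slope(phrases):
    """Collection-average phrase slope: fit a line to each phrase's
    per-syllable pitch values (indexed by position) and average the
    fitted slopes."""
    slopes = [np.polyfit(np.arange(len(p)), p, 1)[0] for p in phrases]
    return float(np.mean(slopes))

def compensate(pitch_values, avg_slope):
    """Adjust scalar pitch values for the global phrase contour by
    removing the average trend at each syllable position."""
    pos = np.arange(len(pitch_values))
    return np.asarray(pitch_values, dtype=float) - avg_slope * pos
```

Removing the average declination means late-phrase syllables are no longer penalized for being globally lower in pitch.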

Page 11: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Experimental Configuration

• English pitch accent:
  – Proportionally sampled: 1000 examples
    • 4-way and binary classification
  – Contextualized representation, preceding syllables
• Mandarin tone:
  – Balanced tone sets: 400 examples
    • Vary data-set difficulty: clean lab speech -> broadcast news
    • 4-tone classification
  – Simple local pitch-only features
    • Prior lab speech experiments were effective with local features

Page 12: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Semi-supervised Learning

• Approach:
  – Employ a small amount of labeled data
  – Exploit information from additional – presumably more available – unlabeled data
    • Few prior examples: EM, co- and self-training (Ostendorf '05)
• Classifier:
  – Laplacian SVM (Sindhwani, Belkin & Niyogi '05)
  – Semi-supervised variant of the SVM
    • Exploits unlabeled examples
  – RBF kernel, typically 6 nearest neighbors

Page 13: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Experiments

• Pitch accent recognition:
  – Binary classification: Unaccented/Accented
  – 1000 instances, proportionally sampled
    • Labeled training: 200 unaccented, 100 accented
  – >80% accuracy (cf. 84% for an SVM with 15x the labeled data)
• Mandarin tone recognition:
  – 4-way classification: n(n-1)/2 binary classifiers
  – 400 instances, balanced; 160 labeled
    • Clean lab speech, in-focus: 94%
      – cf. 99% for an SVM with 1000s of training samples; 85% for an SVM with 160 training samples
    • Broadcast news: 70%
      – cf. <50% for a supervised SVM with 160 training samples; 74% with 4x the training data

Page 14: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Unsupervised Learning

• Question:
  – Can we identify the tone structure of a language from the acoustic space without training?
    • Analogous to language acquisition
• Significant recent research in unsupervised clustering
  – Established approaches: k-means
  – Spectral clustering: eigenvector decomposition of the affinity matrix
    • (Shi & Malik 2000, Fischer & Poland 2004, BNS 2004)
  – Little research on tone
    • Self-organizing maps (Gauthier et al., 2005)
      – Tones identified in lab speech using f0 velocities

Page 15: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Unsupervised Pitch Accent

• Pitch accent clustering:
  – 4-way distinction: 1000 samples, proportional
    • 2-16 clusters constructed
      – Assign the most frequent class label to each cluster
• Learner:
  – Asymmetric k-lines clustering (Fischer & Poland '05)
    • Context-dependent kernel radii, non-spherical clusters
  – >78% accuracy
  – Context effects:
    • Vectors with and without context perform comparably
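Scoring a clustering by assigning each cluster its most frequent class label can be sketched as follows; this is a generic evaluation routine, independent of the learner (here asymmetric k-lines) that produced the cluster assignments.

```python
from collections import Counter
import numpy as np

def majority_label_accuracy(cluster_ids, true_labels):
    """Label each cluster with its most frequent true class, then
    score the clustering as plain accuracy under those labels."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    correct = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        correct += Counter(members.tolist()).most_common(1)[0][1]
    return correct / len(true_labels)
```

Note that this score can only rise as the number of clusters grows, which is why the slide sweeps 2-16 clusters.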

Page 16: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Contrasting Clustering

• Approaches
  – 3 spectral approaches:
    • Asymmetric k-lines (Fischer & Poland 2004)
    • Symmetric k-lines (Fischer & Poland 2004)
    • Laplacian Eigenmaps (Belkin, Niyogi & Sindhwani 2004)
      – Binary weights, k-lines clustering
  – K-means: standard Euclidean distance
  – Number of clusters: 2-16
• Best results: >78%
  – 2 clusters: asymmetric k-lines; >2 clusters: k-means
    • With larger numbers of clusters, the approaches become more similar

Page 17: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Contrasting Learners

Page 18: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Tone Clustering

• Mandarin four tones
  – 400 samples, balanced
  – 2-phase clustering: 2-3 clusters each
  – Asymmetric k-lines
    • Clean read speech:
      – In-focus syllables: 87% (cf. 99% supervised)
      – In-focus and pre-focus: 77% (cf. 93% supervised)
    • Broadcast news: 57% (cf. 74% supervised)
• Contrast:
  – K-means, in-focus syllables: 74.75%
    • Requires more clusters to reach the asymmetric k-lines level

Page 19: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Tone Structure

The first phase of clustering splits high/rising tones from low/falling tones by slope; the second phase splits by pitch height or slope.

Page 20: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Conclusions

• Exploiting unlabeled examples for tone and pitch accent
  – Semi-supervised and unsupervised approaches
    • Best cases approach supervised levels with less training data
  – Leveraging both labeled and unlabeled examples works best
  – Both spectral approaches and k-means are effective
    • Contextual information is less well exploited than in the supervised case
• Exploit the acoustic structure of the tone and accent space

Page 21: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Future Work

• Additional languages and tone inventories
  – Cantonese: 6 tones
  – Bantu-family languages: truly rare data
• Language acquisition
  – Use of child-directed speech as input
  – Determination of the number of clusters

Page 22: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Thanks

• V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin

• Dinoj Surendran, Siwei Wang, Yi Xu

• This work supported by NSF Grant #0414919

• http://people.cs.uchicago.edu/~levow/tai

Page 23: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Spectral Clustering in a Nutshell

• Basic spectral clustering
  – Build an affinity matrix
  – Determine the dominant eigenvectors and eigenvalues of the affinity matrix
  – Compute a clustering based on them
• Approaches differ in:
  – Affinity matrix construction
    • Binary weights, conductivity, heat weights
  – Clustering: cut, k-means, k-lines
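The recipe above can be sketched with a heat-kernel affinity and k-means on the spectral embedding. The kernel choice, symmetric normalization, and deterministic initialization are this sketch's assumptions, not the specific variants compared in the talk.

```python
import numpy as np

def spectral_embed(X, k, sigma=1.0):
    """Build an affinity matrix, take its dominant eigenvectors,
    and return a k-dimensional spectral embedding."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))     # heat-kernel weights
    d = W.sum(1)
    A = W / np.sqrt(np.outer(d, d))          # normalized affinity
    _, vecs = np.linalg.eigh(A)              # eigenvalues ascending
    return vecs[:, -k:]                      # top-k eigenvectors

def kmeans(Y, k, iters=50):
    """Plain k-means on the embedding, with deterministic
    farthest-point initialization (the final step could equally
    be a cut or k-lines clustering)."""
    centers = [Y[0]]
    for _ in range(k - 1):
        d = np.min([((Y - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(Y[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        assign = ((Y[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = Y[assign == j].mean(0)
    return assign
```

Points that are strongly connected in the affinity graph land close together in the embedding, so even simple k-means separates them cleanly there.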

Page 24: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

K-Lines Clustering Algorithm

• Due to Fischer & Poland 2005
• 1. Initialize vectors m1...mK (e.g. randomly, or as the first K eigenvectors of the spectral data yi)
• 2. For j = 1...K: define Pj as the set of indices of all points yi that are closest to the line defined by mj, and create the matrix Mj = [yi], i in Pj, whose columns are the corresponding vectors yi
• 3. Compute the new value of every mj as the first eigenvector of Mj·Mj^T
• 4. Repeat from 2 until the mj's do not change
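A minimal sketch of these steps, assuming the data points arrive as rows and using a spread-out deterministic initialization (the slide also allows random or spectral initialization):

```python
import numpy as np

def k_lines(Y, K, iters=20):
    """k-lines per the slide: assign each point y_i to the nearest
    line through the origin with direction m_j, then reset m_j to
    the first eigenvector of M_j M_j^T (equivalently, the top right
    singular vector of the members stacked as rows)."""
    # Spread the initial directions over the data.
    idx = np.linspace(0, len(Y) - 1, K).astype(int)
    M = Y[idx] / np.linalg.norm(Y[idx], axis=1, keepdims=True)
    assign = np.zeros(len(Y), dtype=int)
    for _ in range(iters):
        # Squared distance from y to the unit-direction line m is
        # ||y||^2 - (y . m)^2.
        proj = Y @ M.T
        dist = (Y ** 2).sum(1, keepdims=True) - proj ** 2
        assign = dist.argmin(1)
        for j in range(K):
            Pj = Y[assign == j]
            if len(Pj):
                _, _, vt = np.linalg.svd(Pj, full_matrices=False)
                M[j] = vt[0]
    return assign, M
```

Because cluster prototypes are lines rather than centroids, points that share a direction (e.g. a pitch slope) but differ in magnitude can still fall in the same cluster.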

Page 25: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Asymmetric Clustering

• Replace the Gaussian kernel of fixed width with a context-dependent kernel
  – (Fischer & Poland, TR-IDSIA-12-04, p. 12)
  – where tau = 2d+1 or 10; results are largely insensitive to tau

Page 26: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Laplacian SVM

• Manifold regularization framework
  – Hypothesis: the intrinsic (true) data lies on a low-dimensional manifold
    • The ambient (observed) data lies in a possibly high-dimensional space
  – Preserves locality:
    • Points close in ambient space should be close in intrinsic space
  – Use labeled and unlabeled data to warp the function space
  – Run an SVM on the warped space

Page 27: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Laplacian SVM (Sindhwani)

Page 28: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

• Input: l labeled and u unlabeled examples
• Output:
• Algorithm:
  – Construct the adjacency graph. Compute the Laplacian.
  – Choose a kernel K(x,y). Compute the Gram matrix K.
  – Compute …
  – And …
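The first two algorithm steps can be sketched as below. The neighbor count and kernel width are illustrative defaults, and the final optimization (the formulas lost from the slide) is deliberately omitted.

```python
import numpy as np

def knn_adjacency(X, k=6):
    """Symmetrized k-nearest-neighbor adjacency graph (the slides
    mention roughly 6 neighbors with the RBF-kernel setting)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros_like(d2)
    for i in range(len(X)):
        nbrs = np.argsort(d2[i])[1:k + 1]   # skip the point itself
        W[i, nbrs] = 1.0
    return np.maximum(W, W.T)               # symmetrize

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    return np.diag(W.sum(1)) - W

def rbf_gram(X, gamma=1.0):
    """Gram matrix for the RBF kernel K(x, y) = exp(-gamma ||x-y||^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)
```

The Laplacian encodes the geometry of labeled and unlabeled points together, which is what lets the warped SVM exploit the unlabeled data.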

Page 29: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Current and Future Work

• Interactions of tone and intonation
  – Recognition of topic and turn boundaries
  – Effects of topic and turn cues on tone realization
• Child-directed speech and tone learning
• Support for computer-assisted tone learning
• Structured sequence models for tone
  – Sub-syllable segmentation and modeling
• Feature assessment
  – Band energy and intensity in tone recognition

Page 30: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Related Work

• Tonal coarticulation:
  – Xu & Sun 02; Xu 97; Shih & Kochanski 00
• English pitch accent:
  – X. Sun 02; Hasegawa-Johnson et al. 04; Ross & Ostendorf 95
• Lexical tone recognition:
  – SVM recognition of Thai tone: Thubthong 01
  – Context-dependent tone models
    • Wang & Seneff 00; Zhou et al. 04

Page 31: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Pitch Target Approximation Model

• Pitch target:
  – Linear model: T(t) = a·t + b
  – Exponentially approximated: y(t) = β·exp(−λ·t) + a·t + b
  – In practice, assume the target is well approximated by its mid-point (Sun, 02)
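The pitch target approximation model can be written out directly: the surface pitch y(t) = β·exp(−λ·t) + a·t + b starts β above or below the linear underlying target T(t) = a·t + b and decays toward it at rate λ. A minimal sketch (the symbol names are assumptions, since the slide's formula images did not survive extraction):

```python
import math

def pitch_target(t, a, b, beta, lam):
    """Surface f0 under the pitch target approximation model: the
    linear underlying target T(t) = a*t + b is approached
    exponentially, y(t) = beta * exp(-lam * t) + a*t + b."""
    return beta * math.exp(-lam * t) + a * t + b
```

With a negative β the contour starts below the target and rises toward it; the mid-point assumption (Sun, 02) amounts to sampling y(t) once the exponential term has largely decayed.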

Page 32: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Classification Experiments

• Classifier: Support Vector Machine
  – Linear kernel
  – Multiclass formulation
    • SVMlight (Joachims), LibSVM (Chang & Lin 01)
  – 4:1 training/test splits
• Experiments: effects of
  – Context position: preceding, following, none, both
  – Context encoding: extended/difference
  – Context type: local, phrasal

Page 33: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Results: Local Context

Context        | Mandarin Tone | English Pitch Accent
---------------|---------------|---------------------
Full           | 74.5%         | 81.3%
Extend PrePost | 74.0%         | 80.7%
Extend Pre     | 74.0%         | 79.9%
Extend Post    | 70.5%         | 76.7%
Diffs PrePost  | 75.5%         | 80.7%
Diffs Pre      | 76.5%         | 79.5%
Diffs Post     | 69.0%         | 77.3%
Both Pre       | 76.5%         | 79.7%
Both Post      | 71.5%         | 77.6%
No context     | 68.5%         | 75.9%


Page 36: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Discussion: Local Context

• Any context information improves over none
  – Preceding context consistently improves over no context or following context
    • English: generally, more context features are better
    • Mandarin: following context can degrade performance
  – Little difference between encodings (Extended vs. Differences)
• Consistent with the phonological analysis (Xu) that carryover coarticulation is greater than anticipatory coarticulation

Page 37: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Results & Discussion: Phrasal Context

Phrase Context | Mandarin Tone | English Pitch Accent
---------------|---------------|---------------------
Phrase         | 75.5%         | 81.3%
No Phrase      | 72.0%         | 79.9%

• Phrase contour compensation enhances recognition
• Simple strategy
• Non-linear slope compensation may improve results further

Page 38: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Context: Summary

• Employ a common acoustic representation
  – Tone (Mandarin), pitch accent (English)
• SVM classifiers, linear kernel: 76%, 81%
• Local context effects:
  – Up to >20% relative reduction in error
  – Preceding context makes the greatest contribution
    • Carryover vs. anticipatory coarticulation
• Phrasal context effects:
  – Compensation for the phrasal contour improves recognition

Page 39: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Aside: More Tones

• Cantonese:
  – CUSENT corpus of read broadcast news text
  – Same feature extraction and representation
  – 6 tones: high level, high rise, mid level, low fall, low rise, low level
  – SVM classification:
    • Linear kernel: 64%; Gaussian kernel: 68%
  – Tones 3 and 6: 50% pairwise accuracy – mutually indistinguishable
    • Human levels: 50% without context; 68% with context
• Augment with the syllable's phone sequence
  – 86% accuracy: for 90% of syllables with tone 3 or 6, one tone dominates

Page 40: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Aside: Voice Quality & Energy

• By Dinoj Surendran
• Assess local voice quality and energy features for tone
  – Not typically associated with Mandarin
• Considered:
  – VQ: NAQ, AQ, etc.; spectral balance; spectral tilt; band energy
• Useful: band energy significantly improves results
  – Especially for the neutral tone
• Supports identification of unstressed syllables
  – Spectral balance predicts stress in Dutch

Page 41: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Roadmap

• Challenges for Tone and Pitch Accent
  – Contextual effects
  – Training demands
• Modeling Context for Tone and Pitch Accent
  – Data collections & processing
  – Integrating context
  – Context in recognition
• Reducing Training Demands
  – Data collections & structure
  – Semi-supervised learning
  – Unsupervised clustering

• Conclusion

Page 42: Unsupervised and Semi-Supervised Learning  of Tone and Pitch Accent

Strategy: Context

• Exploit contextual information
  – Features from adjacent syllables
    • Height, shape: direct, relative
  – Compensate for phrase contour
  – Analyze impact of:
    • Context position, context encoding, context type
    • >20% relative improvement over no context