cuhk system for the spoken web search task at mediaeval 2012


The CUHK Systems for the Spoken Web Search Task at MediaEval 2012

Haipeng Wang and Tan Lee

Department of Electronic Engineering, The Chinese University of Hong Kong

September 30, 2012


Outline

1 Overview

2 System Description
  - PTDTW framework
  - Tokenizers
  - DTW detection
  - Pseudo-relevance Feedback and Score Normalization

3 System configuration and performance

4 Conclusion

5 Acknowledgement


Overview

2012 Spoken Web Search task [Metze et al., 2012]
  - QbyE STD (query-by-example spoken term detection): audio search using audio queries.
  - Multilingual: four South African languages.
  - Low-resource: less than 4 hours of DEV audio data in total.
  - Extreme case: one example for each query term.

Overview of our systems
  - Aim: a language-independent QbyE STD system.
  - Multiple resources: 1) the DEV audio data; 2) rich-resource languages.
  - Combining different resources: the PTDTW framework.
  - Pseudo-relevance feedback (PRF).
  - Score normalization.


Posteriorgram-based template matching

Figure: Posteriorgram-based template matching [Hazen et al., 2009]. Training resources are used to build a tokenizer; the tokenizer converts the query example and the test utterance into query posteriorgrams and test posteriorgrams; DTW detection on the posteriorgrams yields the detection score.

Training resources: audio data with or without transcriptions.

Tokenizer: unsupervised if trained without transcriptions; otherwise, supervised.

Posteriorgrams: more robust than spectral features.

How to effectively combine different resources?


PTDTW framework

Figure: PTDTW framework. The query example and the test utterance pass through N parallel tokenizers (Tokenizer 1, ..., Tokenizer N). Tokenizer n produces query posteriorgrams n and test posteriorgrams n, from which DTW distance matrix Dn is computed. The matrices D1, ..., DN are combined into a single DTW distance matrix D, and DTW detection on D gives the raw detection score.

Parallel tokenizers followed by DTW detection (PTDTW).

Modified from the posteriorgram-based template matching approach.

Key idea: Combining DTW distance matrices.
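The slides do not say how the per-tokenizer matrices D1, ..., DN are merged into D. A minimal sketch, assuming a weighted average (the function name, the `weights` parameter, and the uniform default are all hypothetical), could look like this:

```python
import numpy as np

def combine_distance_matrices(matrices, weights=None):
    """Merge per-tokenizer DTW distance matrices D1..DN into one matrix D.

    Assumption: a weighted average. The combination rule is not given
    in the slides, so this is only one plausible choice.
    """
    stacked = np.stack(matrices)                 # shape (N, I, J)
    if weights is None:
        weights = np.ones(len(matrices))         # uniform weights by default
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # keep D on the scale of the inputs
    return np.tensordot(weights, stacked, axes=1)  # shape (I, J)
```

All matrices must share the same shape (query frames by test frames), which holds when every tokenizer uses the same frame rate.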


Unsupervised tokenizers

MFCC-GMM tokenizer [Zhang and Glass, 2009]
  - Unsupervised training from the DEV data without transcription.
  - 1024 Gaussian components.
  - Features: 39-dim MFCC + MVN + VTLN.

MFCC-ASM tokenizer [Lee et al., 1988, Wang et al., 2012]
  - Acoustic segment model (ASM), also called self-organized unit (SOU) [Siu et al., 2010].
  - Unsupervised training from the DEV data without transcription.
  - 256 ASM units; each unit has 3 states, with 16 Gaussian components per state.
  - Features: 39-dim MFCC + MVN + VTLN.
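To make the GMM tokenizer concrete, here is a rough sketch of how a Gaussian posteriorgram can be computed from feature frames once a GMM has been trained. It assumes diagonal covariances and takes the GMM parameters as plain arrays; the function name and argument layout are illustrative, not taken from the authors' system.

```python
import numpy as np

def gaussian_posteriorgram(frames, means, variances, weights):
    """Posterior probability of each Gaussian component for each frame.

    frames:    (T, D) feature vectors (e.g. 39-dim MFCCs)
    means:     (M, D) component means
    variances: (M, D) diagonal covariances (an assumption of this sketch)
    weights:   (M,)   mixture weights
    Returns a (T, M) matrix whose rows sum to 1.
    """
    diff = frames[:, None, :] - means[None, :, :]               # (T, M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_joint = log_gauss + np.log(weights)                      # (T, M)
    log_joint -= log_joint.max(axis=1, keepdims=True)            # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)
```

With 1024 components, each frame becomes a 1024-dimensional posterior vector, and the sequence of these vectors is the posteriorgram passed to DTW.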


Phoneme recognizers

Czech, Hungarian, and Russian phoneme recognizers
  - Developed by BUT [Schwarz, 2009].
  - Trained on the SpeechDat-E corpora.

Mandarin phoneme recognizer
  - 179 tonal phonemes.
  - About 15 hours of training data from the CallHome and CallFriend corpora.

English phoneme recognizer
  - 40 phonemes.
  - About 15 hours of training data from the Fisher and Switchboard Cellular corpora.


Phoneme recognizers

Figure: Tandem structure. Blocks shown: Input Data, Phoneme Recognizers, Taking Logarithm, PCA Transform, GMM, Gaussian Posteriorgrams.

256 Gaussian components trained on the DEV data.

Using the tandem structure, we have 5 tokenizers: CZ-GMM, HU-GMM, RU-GMM, MA-GMM, and EN-GMM.
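The logarithm and PCA steps of the tandem structure can be sketched as follows. This is an illustration under stated assumptions: the PCA is fitted on the same data it transforms, and the output dimension `n_components` is a hypothetical parameter, since the slides do not give it. The resulting features would then be modeled by the 256-component GMM to obtain Gaussian posteriorgrams.

```python
import numpy as np

def tandem_features(phone_post, n_components, eps=1e-10):
    """Log-compress phoneme posteriors and decorrelate them with PCA.

    phone_post: (T, P) phoneme posteriors from a phoneme recognizer
    Returns (T, n_components) features for subsequent GMM modeling.
    """
    log_post = np.log(np.maximum(phone_post, eps))   # floor avoids log(0)
    centered = log_post - log_post.mean(axis=0)
    # PCA via SVD: rows of vt are principal directions, strongest first
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

In practice the PCA projection would more likely be estimated once on training data and then reused for queries and test utterances, so that both sides live in the same feature space.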


DTW detection

DTW detection is performed with a sliding window.

Find the path minimizing the normalized distance:

$$ d = \min_{K,\, i(k),\, j(k)} \frac{\sum_{k=1}^{K} d(i(k), j(k))\, w_k}{Z(w)} $$

where d(i(k), j(k)) is set to the inner-product distance, w_k = 1, and Z(w) = K.

Additional constraint: |i(k)− j(k)| ≤ R.

Because the query length varies widely, R is not set to a fixed number but is proportional to the query length I: R = α × I (α = 1/3 in our systems).
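A small sketch of the detection step described above, under several assumptions that are not in the slides: the inner-product distance is taken as -log(q·t), the window has the same length as the query and moves one frame at a time, and the path-length normalization is handled greedily (each cell keeps the predecessor with the lowest running average), which approximates rather than exactly solves the minimization.

```python
import numpy as np

def inner_product_distance(query, test, eps=1e-10):
    """(I, J) matrix of -log(q_i . t_j) between posteriorgram frames (assumed form)."""
    return -np.log(np.maximum(query @ test.T, eps))

def banded_dtw(dist, alpha=1.0 / 3.0):
    """Path-length-normalized DTW distance with the band |i - j| <= R, R = alpha * I."""
    I, J = dist.shape
    R = alpha * I
    total = np.full((I, J), np.inf)      # accumulated distance along the kept path
    steps = np.zeros((I, J), dtype=int)  # path length K along the kept path
    for i in range(I):
        for j in range(J):
            if abs(i - j) > R:
                continue                 # outside the band
            if i == 0 and j == 0:
                total[0, 0], steps[0, 0] = dist[0, 0], 1
                continue
            best = None
            for pi, pj in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
                if pi < 0 or pj < 0 or not np.isfinite(total[pi, pj]):
                    continue
                cand = (total[pi, pj] + dist[i, j], steps[pi, pj] + 1)
                # greedy choice: lowest running average (approximation)
                if best is None or cand[0] / cand[1] < best[0] / best[1]:
                    best = cand
            if best is not None:
                total[i, j], steps[i, j] = best
    if not np.isfinite(total[-1, -1]):
        return np.inf                    # end point unreachable inside the band
    return total[-1, -1] / steps[-1, -1]

def sliding_window_detect(query, test, alpha=1.0 / 3.0):
    """Normalized DTW distance for every window offset in the test utterance."""
    I = query.shape[0]
    return np.array([
        banded_dtw(inner_product_distance(query, test[o:o + I]), alpha)
        for o in range(test.shape[0] - I + 1)
    ])
```

The offset with the smallest distance marks the most likely hit region; a detection score can then be derived from that distance, for example by negating it.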


Pseudo-relevance Feedback and Score Normalization

Pseudo-relevance feedback for each query:
  1) The top H hits from all the test utterances were selected as the relevance examples. Selection criteria: a) H ≤ 3; b) the raw detection score must be larger than a pre-set threshold.
  2) The relevance examples were used to score the top H (H = 2 for this task) hits from each test utterance.
  3) The scores obtained by the relevance examples were linearly fused with the scores of the original query examples.
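The selection and fusion steps of pseudo-relevance feedback can be sketched as below. The threshold value, the fusion weight, and the averaging over relevance examples are assumptions made for illustration; the slides only state that the fusion is linear.

```python
def select_relevance_examples(hits, max_examples=3, threshold=0.0):
    """Pick up to `max_examples` top-scoring hits whose score exceeds `threshold`.

    hits: list of (score, hit_id) pairs collected over all test utterances.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    return [h for h in ranked[:max_examples] if h[0] > threshold]

def fuse_scores(query_score, feedback_scores, weight=0.5):
    """Linearly fuse the original query score with relevance-example scores.

    Assumption: the relevance-example scores are averaged before fusion,
    and `weight` is a hypothetical interpolation factor.
    """
    if not feedback_scores:
        return query_score               # no feedback available: keep the raw score
    feedback = sum(feedback_scores) / len(feedback_scores)
    return (1.0 - weight) * query_score + weight * feedback
```

For example, with hits scored 0.9, 0.8, 0.7, 0.6, and 0.2 and a threshold of 0.75, only the two hits at 0.9 and 0.8 become relevance examples, even though up to three are allowed.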

Score normalization for each query:

$$ \hat{s}_{q,t} = (s_{q,t} - \mu_q) / \delta_q $$

where $s_{q,t}$ is the score of the $q$th query on the $t$th hit region, and $\mu_q$ and $\delta_q^2$ are the mean and variance of the scores for the $q$th query, estimated from the development data.
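As a quick illustration of the per-query normalization, the sketch below subtracts the mean and divides by the standard deviation (the square root of the variance) of a query's development-set scores. The function name is hypothetical, and the use of the population variance rather than the sample variance is an assumption.

```python
import numpy as np

def normalize_scores(scores, dev_scores):
    """Normalize one query's scores with statistics from its development-set scores."""
    mu = np.mean(dev_scores)
    delta = np.std(dev_scores)           # square root of the (population) variance
    if delta == 0:
        raise ValueError("development scores have zero variance")
    return (np.asarray(scores, dtype=float) - mu) / delta
```

After this step the scores of different queries share a common scale, so a single decision threshold can be applied across queries.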


System Configuration and Performance

Table: System configurations and ATWV performance.

Configuration (check marks as extracted; their alignment to system columns 1-5 was lost):
  MFCC-GMM              √ √ √ √
  MFCC-ASM              √ √ √ √
  PHNREC-GMM(1)         √ √ √
  PRF                   √ √
  Score Normalization   √ √ √ √ √

ATWV:
  System No.      1     2     3     4     5
  devQ - devC    0.68  0.63  0.73  0.78  0.74
  devQ - evlC    0.60  0.55  0.70  0.75  0.70
  evlQ - devC    0.68  0.65  0.73  0.77  0.75
  evlQ - evlC    0.64  0.59  0.72  0.74  0.74

Systems 1 and 2 belong to the required run condition.

Systems 3, 4, and 5 belong to the general run condition.

The best performance (system 4) is achieved when all the tokenizers, PRF, and score normalization are used.

(1) PHNREC-GMM denotes the combination of the five tandem tokenizers: CZ-GMM, HU-GMM, RU-GMM, MA-GMM, and EN-GMM.








Supervised tokenizers perform better than the unsupervised tokenizers.

Training resources for unsupervised tokenizers are limited in this task, but not limited for supervised tokenizers.

The PTDTW framework provides a flexible way to combine all these resources.








Combining supervised and unsupervised tokenizers leads to consistent improvement.

Pseudo-relevance Feedback provides consistent improvement.


Conclusion

A PTDTW framework was proposed for the query-by-example STD task in this evaluation.

Supervised tokenizers performed better than unsupervised tokenizers for this task. The combination of supervised and unsupervised tokenizers provided consistent gain.

Pseudo-relevance feedback and score normalization were used.


Acknowledgement

We thank Cheung-Chi Leung from IIR for helpful discussions.

We thank the organizers for organizing this evaluation.

We thank BUT for sharing the phoneme recognizers and scripts.

This research is partially supported by the General Research Funds (Ref: 414010 and 413811) from the Hong Kong Research Grants Council.


Thank you!


References

Hazen, T., Shen, W., and White, C. (2009). Query-by-example spoken term detection using phonetic posteriorgram templates. In ASRU.

Lee, C., Soong, F., and Juang, B. (1988). A segment model based approach to speech recognition. In ICASSP.

Metze, F., Barnard, E., Davel, M., van Heerden, C., Anguera, X., Gravier, G., and Rajput, N. (2012). The spoken web search task. In MediaEval 2012 Workshop.

Schwarz, P. (2009). Phoneme recognition based on long temporal context. PhD thesis.

Siu, M., Gish, H., Chan, A., and Belfield, W. (2010). Improved topic classification and keyword discovery using an HMM-based speech recognizer trained without supervision. In INTERSPEECH.

Wang, H., Leung, C., Lee, T., Li, H., and Ma, B. (2012). An acoustic segment modeling approach to query-by-example spoken term detection. In ICASSP.

Zhang, Y. and Glass, J. (2009). Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In ASRU.
