CUHK System for the Spoken Web Search Task at MediaEval 2012
Overview System Description System performance Conclusion Acknowledgement
The CUHK Systems for Spoken Web Search task at MediaEval 2012
Haipeng Wang and Tan Lee
Department of Electronic Engineering
The Chinese University of Hong Kong
September 30, 2012
Outline
1 Overview
2 System Description
  PTDTW framework
  Tokenizers
  DTW detection
  Pseudo-relevance Feedback and Score Normalization
3 System configuration and performance
4 Conclusion
5 Acknowledgement
Overview
2012 Spoken Web Search task [Metze et al., 2012]
- Query-by-example spoken term detection (QbyE STD): audio search using audio queries.
- Multilingual: four South African languages.
- Low-resource: less than 4 hours of DEV audio data in total.
- Extreme case: one example for each query term.

Overview of our systems
- Aiming at a language-independent QbyE STD system.
- Multiple resources: 1) the DEV audio data; 2) rich-resource languages.
- Combine different resources: PTDTW framework.
- Pseudo-relevance feedback (PRF).
- Score normalization.
Posteriorgram-based template matching
Figure: Posteriorgram-based template matching [Hazen et al., 2009]. Blocks: Training Resources, Tokenizer, Query Example, Test Utterance, Query Posteriorgrams, Test Posteriorgrams, DETECT by DTW, Detection Score.
- Training resources: audio data with or without transcriptions.
- Tokenizer: if trained without transcriptions, unsupervised; otherwise, supervised.
- Posteriorgrams: more robust than spectral features.
- How to effectively combine different resources?
PTDTW framework
Figure: PTDTW framework. The Query Example and the Test Utterance are passed through Tokenizers 1 to N; tokenizer n yields Query Posteriorgrams n, Test Posteriorgrams n and a DTW distance matrix Dn. The matrices D1, ..., DN are combined into one DTW distance matrix D, followed by DETECT by DTW, which outputs the Raw Detection Score.
- Parallel tokenizers followed by DTW detection (PTDTW).
- Modified from the posteriorgram-based template matching approach.
- Key idea: combining DTW distance matrices.
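
The key idea above can be illustrated with a minimal sketch. The slides do not state how the matrices D1, ..., DN are combined, so the equal-weight average used here is an assumption, as are the function names and the choice of -log(inner product) as the frame distance.

```python
import numpy as np

def inner_product_distance(query_post, test_post, eps=1e-10):
    """Frame-level distance matrix between two posteriorgrams.

    query_post: (I, M) array, test_post: (J, M) array, rows are posterior vectors.
    Assumption: the distance is -log of the inner product of the two vectors.
    """
    similarity = query_post @ test_post.T          # (I, J)
    return -np.log(np.maximum(similarity, eps))    # floor avoids log(0)

def combine_distance_matrices(query_posts, test_posts, weights=None):
    """Combine the per-tokenizer distance matrices D1..DN into one matrix D.

    Assumption: a weighted average with equal weights by default.
    Tokenizers may have different posterior dimensions; only the frame
    counts (I and J) must agree across tokenizers.
    """
    matrices = [inner_product_distance(q, t)
                for q, t in zip(query_posts, test_posts)]
    if weights is None:
        weights = np.full(len(matrices), 1.0 / len(matrices))
    return sum(w * m for w, m in zip(weights, matrices))
```

Because the combination happens at the distance-matrix level, a single DTW pass over D is enough, regardless of how many tokenizers are used.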
Unsupervised tokenizers
MFCC-GMM tokenizer [Zhang and Glass, 2009]
- Unsupervised training from the DEV data without transcription.
- 1024 Gaussian components.
- 39-dim MFCC + MVN + VTLN

MFCC-ASM tokenizer [Lee et al., 1988, Wang et al., 2012]
- Acoustic segment model, also known as self-organized unit (SOU) [Siu et al., 2010].
- Unsupervised training from the DEV data without transcription.
- 256 ASM units. Each unit has 3 states, with 16 Gaussian components for each state.
- 39-dim MFCC + MVN + VTLN
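
As an illustration of what a GMM tokenizer outputs, the sketch below computes a Gaussian posteriorgram from feature frames and an already trained GMM. It assumes diagonal covariances, which the slides do not specify, and the function and parameter names are hypothetical.

```python
import numpy as np

def gaussian_posteriorgram(features, weights, means, variances):
    """Posterior probability of each Gaussian component for each frame.

    features:  (T, D) feature frames (e.g. 39-dim MFCCs)
    weights:   (M,)   mixture weights
    means:     (M, D) component means
    variances: (M, D) diagonal covariances (assumption: diagonal)
    Returns a (T, M) array whose rows sum to 1.
    """
    diff = features[:, None, :] - means[None, :, :]                  # (T, M, D)
    log_lik = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                      + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_joint = log_lik + np.log(weights)                            # (T, M)
    log_joint -= log_joint.max(axis=1, keepdims=True)                # stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)
```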
Phoneme recognizers
Czech, Hungarian, Russian phoneme recognizers
- Developed by BUT [Schwarz, 2009].
- Trained from SpeechDat-E corpora.

Mandarin phoneme recognizer
- 179 tonal phonemes.
- About 15 hours of training data from the CallHome corpus and the CallFriend corpus.

English phoneme recognizer
- 40 phonemes.
- About 15 hours of training data from the Fisher corpus and the Switchboard Cellular corpus.
Phoneme recognizers
Figure: Tandem structure. Input Data -> Phoneme Recognizers -> Taking Logarithm -> PCA Transform -> GMM -> Gaussian Posteriorgrams.
- 256 Gaussian components trained on the DEV data.
- Using the tandem structure, we have 5 tokenizers: CZ-GMM, HU-GMM, RU-GMM, MA-GMM and EN-GMM.
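
The front part of the tandem structure (logarithm followed by a PCA transform) might look roughly like the sketch below. The number of retained PCA dimensions is not given in the slides, so `n_components` is a hypothetical parameter, and the helper names are illustrative only. The resulting features would then be modelled by the GMM to obtain Gaussian posteriorgrams.

```python
import numpy as np

def log_posteriors(phone_post, eps=1e-10):
    """Take the logarithm of phoneme posteriors, flooring to avoid log(0)."""
    return np.log(np.maximum(phone_post, eps))

def fit_pca(phone_post, n_components):
    """Estimate a PCA transform on log phoneme posteriors.

    phone_post: (T, P) phoneme posteriors from a phoneme recognizer.
    Returns the mean vector and the top principal directions (n_components, P).
    """
    log_post = log_posteriors(phone_post)
    mean = log_post.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(log_post - mean, full_matrices=False)
    return mean, vt[:n_components]

def tandem_features(phone_post, mean, components):
    """Project log phoneme posteriors onto the PCA directions."""
    return (log_posteriors(phone_post) - mean) @ components.T
```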
DTW detection
- DTW detection is performed with a sliding window.
- Find the path minimizing the normalized distance:

  $$\hat{d} = \min_{K,\, i(k),\, j(k)} \frac{\sum_{k=1}^{K} d(i(k), j(k))\, w_k}{Z(w)}$$

  where $d(i(k), j(k))$ is set to the inner-product distance, $w_k = 1$, and $Z(w) = K$.
- Additional constraint: $|i(k) - j(k)| \le R$.
- Because the query length varies widely, R is not set to a fixed number but in proportion to the query length I: $R = \alpha \times I$ ($\alpha = 1/3$ in our systems).
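
A minimal sketch of the detection step described above, assuming w_k = 1 and Z(w) = K so that the score is the average frame distance along the path. To minimize this average exactly, the dynamic program below keeps one layer per path length K. The step set (horizontal, vertical, diagonal), the window length and the window shift are assumptions, and the function names are hypothetical.

```python
import numpy as np

def normalized_dtw(dist, alpha=1.0 / 3.0):
    """Minimum path-length-normalized DTW distance over a distance matrix.

    dist: (I, J) frame distance matrix (query frames x window frames).
    The band constraint |i - j| <= R with R = alpha * I is enforced.
    Returns inf if no valid path reaches the last cell.
    """
    n_query, n_test = dist.shape
    radius = alpha * n_query
    max_len = n_query + n_test - 1
    # acc[k, i, j]: smallest summed distance over paths with k + 1 points
    # that start at (0, 0) and end at (i, j).
    acc = np.full((max_len, n_query, n_test), np.inf)
    acc[0, 0, 0] = dist[0, 0]
    for k in range(1, max_len):
        for i in range(n_query):
            for j in range(n_test):
                if abs(i - j) > radius:
                    continue
                best = np.inf
                if i > 0:
                    best = min(best, acc[k - 1, i - 1, j])
                if j > 0:
                    best = min(best, acc[k - 1, i, j - 1])
                if i > 0 and j > 0:
                    best = min(best, acc[k - 1, i - 1, j - 1])
                if best < np.inf:
                    acc[k, i, j] = best + dist[i, j]
    lengths = np.arange(1, max_len + 1)
    return float(np.min(acc[:, -1, -1] / lengths))

def sliding_window_scores(dist, window, shift=1, alpha=1.0 / 3.0):
    """Normalized DTW distance for each window position in a test utterance."""
    n_test = dist.shape[1]
    return [normalized_dtw(dist[:, start:start + window], alpha)
            for start in range(0, n_test - window + 1, shift)]
```

The three-dimensional table is written for clarity, not speed; a practical system would restrict the loops to the band and vectorize the inner updates.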
Pseudo-relevance Feedback and Score Normalization
Pseudo-relevance feedback for each query:
1) The top H hits from all the test utterances were selected as the relevance examples. Selection criteria: a) H ≤ 3; b) the raw detection score should be larger than a pre-set threshold.
2) The relevance examples were used to score the top H (H = 2 for this task) hits from each test utterance.
3) The scores obtained by the relevance examples were linearly fused with the scores of the original query examples.
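
The selection and fusion steps above can be sketched as follows. The threshold value, the fusion weight, and the averaging over relevance examples are not given in the slides and are assumptions here; the function names are hypothetical, and a higher score is taken to mean a better match.

```python
def select_relevance_examples(hits, max_examples=3, threshold=0.0):
    """Pick the relevance examples for one query.

    hits: list of (hit_id, raw_score) pairs over all test utterances.
    Keeps at most max_examples top-scoring hits whose raw score exceeds
    the threshold (the threshold value is an assumption).
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [hit for hit in ranked[:max_examples] if hit[1] > threshold]

def fuse_scores(query_score, relevance_scores, weight=0.5):
    """Linearly fuse the original query score with the relevance-example scores.

    Assumption: the relevance-example scores are averaged before fusion,
    and weight is the interpolation weight on the original query score.
    """
    if not relevance_scores:
        return query_score
    relevance_mean = sum(relevance_scores) / len(relevance_scores)
    return weight * query_score + (1.0 - weight) * relevance_mean
```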
Score normalization for each query:

$$\hat{s}_{q,t} = (s_{q,t} - \mu_q) / \delta_q$$

- $s_{q,t}$ is the score of the qth query on the tth hit region.
- $\mu_q$ and $\delta_q^2$ are the mean and variance of the scores for the qth query, estimated from the development data.
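
The normalization above amounts to a per-query z-normalization. A minimal sketch, with hypothetical function names and a zero-variance guard that is an added assumption:

```python
import numpy as np

def fit_query_stats(dev_scores):
    """Mean and standard deviation of one query's scores on development data."""
    scores = np.asarray(dev_scores, dtype=float)
    return float(scores.mean()), float(scores.std())

def normalize_score(score, mean, std, eps=1e-10):
    """Apply (s - mu) / delta; eps guards against a zero standard deviation."""
    return (score - mean) / max(std, eps)
```

After normalization, scores from different queries lie on a comparable scale, so one global decision threshold can be applied to all queries.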
System Configuration and Performance
Table: System configurations and ATWV performance.

Configuration (systems 1 to 5; one √ per system using the component):
  MFCC-GMM             √ √ √ √
  MFCC-ASM             √ √ √ √
  PHNREC-GMM [1]       √ √ √
  PRF                  √ √
  Score Normalization  √ √ √ √ √

ATWV:
  System No.      1     2     3     4     5
  devQ - devC   0.68  0.63  0.73  0.78  0.74
  devQ - evlC   0.60  0.55  0.70  0.75  0.70
  evlQ - devC   0.68  0.65  0.73  0.77  0.75
  evlQ - evlC   0.64  0.59  0.72  0.74  0.74

- Systems 1 and 2 belong to the required run condition.
- Systems 3, 4 and 5 belong to the general run condition.
- The best performance (system 4) is achieved when all the tokenizers, PRF and score normalization are used.

[1] PHNREC-GMM denotes the combination of the five tandem tokenizers: CZ-GMM, HU-GMM, RU-GMM, MA-GMM, and EN-GMM.
- Supervised tokenizers perform better than the unsupervised tokenizers.
- Training resources for unsupervised tokenizers are limited in this task, but not limited for supervised tokenizers.
- The PTDTW framework provides a flexible way to combine all these resources.
- Combination of supervised tokenizers and unsupervised tokenizers leads to consistent improvement.
- Pseudo-relevance feedback provides consistent improvement.
Conclusion
- A PTDTW framework was proposed for the query-by-example STD task in this evaluation.
- Supervised tokenizers performed better than unsupervised tokenizers for this task. The combination of supervised and unsupervised tokenizers provided consistent gain.
- Pseudo-relevance feedback and score normalization were used.
Acknowledgement
- We thank Cheung-Chi Leung from IIR for helpful discussions.
- We thank the organizers for organizing this evaluation.
- We thank BUT for sharing the phoneme recognizers and scripts.
- This research is partially supported by the General Research Funds (Ref: 414010 and 413811) from the Hong Kong Research Grants Council.
Thank you!
References
Hazen, T., Shen, W., and White, C. (2009). Query-by-example spoken term detection using phonetic posteriorgram templates. In ASRU.
Lee, C., Soong, F., and Juang, B. (1988). A segment model based approach to speech recognition. In ICASSP.
Metze, F., Barnard, E., Davel, M., van Heerden, C., Anguera, X., Gravier, G., and Rajput, N. (2012). The spoken web search task. In MediaEval 2012 Workshop.
Schwarz, P. (2009). Phoneme recognition based on long temporal context. PhD thesis.
Siu, M., Gish, H., Chan, A., and Belfield, W. (2010). Improved topic classification and keyword discovery using an HMM-based speech recognizer trained without supervision. In INTERSPEECH.
Wang, H., Leung, C., Lee, T., Li, H., and Ma, B. (2012). An acoustic segment modeling approach to query-by-example spoken term detection. In ICASSP.
Zhang, Y. and Glass, J. (2009). Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In ASRU.