TRANSCRIPT
IBM Research
Spoken Term Detection Workshop NIST, Gaithersburg, Dec. 14-15, 2006 © 2006 IBM Corporation
The IBM 2006 Spoken Term Detection System
Olivier Siohan
Bhuvana Ramabhadran
IBM T. J. Watson Research Center
Jonathan Mamou
IBM Haifa Research Labs
Outline
• System description
• Indexing
• Audio processing for each source type: generation of CTM, word confusion networks (WCN) and phone transcripts
• Index generation and storage
• Search
• Experiments/Results
System Overview
[Figure: system overview — ASR systems produce word and phone transcripts in an offline indexing stage; the INDEXER builds a word index and a phone index with posting lists; at query time the SEARCHER takes a term list, extracts posting lists (word index for in-vocabulary terms, phone index for OOV terms), merges and scores the candidate results, and a deciding step produces the STD result list.]
Audio Processing
Broadcast News Transcription System (BN)
Acoustic Training Data
430-hour corpus (1996, 1997 BN Speech collection and TDT4 Multilingual BN Speech corpus)
Language Model Training Data
198K word corpus (1996, 1997 BN News Transcripts, EARS BN 03 Closed Captions and TDT4 Multilingual BN Speech corpus, GALE Y1Q1 and Y2Q2 English closed captions)
Lexicon 67K words (based on frequency in the LM training data + named entities (mentions) identified in the TDT4 corpus)
Conversational Telephone Speech Transcription System (CTS)
Acoustic Training Data
2100 hours comprising Fisher parts 1-7, SWB-1, BBN/CTRAN SWB-2, SWB Cellular, and CallHome English
Language Model Training Data
Switchboard-1,2 Switchboard Cellular and Callhome English, Fisher parts 1-7, BN news transcripts, web data from UW
Lexicon 30.5K words (extended RT-02 lexicon to cover the 5000 most frequent words in the Fisher data)
D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, G. Zweig, "fMPE: Discriminatively Trained Features for Speech Recognition", in Proceedings International Conference on Acoustics Speech and Signal Processing, Philadelphia, PA, 2005.
Meeting Transcription System (confmtg)
Acoustic Training Data
470 hours of MDM training data comprising: ICSI meetings (70 hours), NIST meeting pilot (15 hours), RT04 dev/eval (2.5 hours), RT05 dev excluding CHIL'05 eval (4.5 hours), AMI seminars (16 hours), CHIL06 dev (3 hours)
Language Model Training Data
Meeting transcripts (1.5M words), conference paper text (37M words) and Fisher data (3M words)
Lexicon 37K words: words in the meeting and Fisher data, plus the 20K most frequent words in the other text corpora
Huang, J. et al., "The IBM RT06S Speech-To-Text Evaluation System", NIST RT06S Workshop, May 3-4, 2006.
Phonetic Lattice Generation
O. Siohan, M. Bacchiani, "Fast vocabulary independent audio search using path based graph indexing", Proceedings of Interspeech 2005, Lisbon, pp. 53-56.
Two-step algorithm:
1. Generate sub-word lattices using word fragments as decoding units
2. Convert word-fragment lattices into phonetic lattices
Required resources:
1. A word-fragment inventory
2. A word-fragment lexicon
3. A word-fragment language model
Main Issue: designing a fragment inventory
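The second step — turning a word-fragment lattice into a phonetic lattice — is mechanical once each fragment's phone pronunciation is known. A minimal sketch (the fragment names, lattice encoding, and helper function are illustrative assumptions, not the system's actual data structures):

```python
# Expand a word-fragment lattice into a phonetic lattice by replacing each
# fragment arc with a chain of phone arcs through fresh intermediate states.

def expand_fragment_lattice(arcs, frag_phones):
    """arcs: list of (src, dst, fragment); frag_phones: fragment -> phone list.
    Returns a list of phone arcs (src, dst, phone)."""
    phone_arcs = []
    next_state = 1 + max(max(s, d) for s, d, _ in arcs)  # first unused state id
    for src, dst, frag in arcs:
        phones = frag_phones[frag]
        cur = src
        for phone in phones[:-1]:                 # chain through fresh states
            phone_arcs.append((cur, next_state, phone))
            cur, next_state = next_state, next_state + 1
        phone_arcs.append((cur, dst, phones[-1])) # last phone reaches dst
    return phone_arcs

# Toy fragment inventory: each fragment is a phone n-gram.
frag_phones = {"K_AE_T": ["K", "AE", "T"], "S": ["S"]}
arcs = [(0, 1, "K_AE_T"), (1, 2, "S")]            # "cats" as two fragments
print(expand_fragment_lattice(arcs, frag_phones))
```

Arc weights and times would be carried along in the same way; only the arc labels change granularity.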
Fragment based system design
1. Use a word-based system to convert the training material to phone strings
2. Train a phone n-gram with “large n” (say 5)
3. Prune the phone n-gram using entropy based pruning
A. Stolcke, "Entropy-based pruning of backoff language models", in Proceedings DARPA Broadcast News Transcription and Understanding Workshop, pp. 270-274, Lansdowne, VA, Feb. 1998.
4. Use the retained n-grams as the selected fragments (n-gram structure ensures coverage of all strings)
5. Phonetic pronunciations for word fragments are trivial
6. Train a fragment-based n-gram model for use in the fragment-based ASR system
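Step 4 can be pictured with a toy sketch: once pruning has left a retained set of phone n-grams, training phone strings are segmented into those fragments (here with a greedy longest-match, which is an invented stand-in for whatever segmentation the system actually uses; the fragment set and phone strings are made-up examples):

```python
# Segment a phone string into retained fragments, longest match first.
# Because the n-gram structure keeps all unigrams, coverage is guaranteed:
# any phone not starting a longer retained fragment is emitted on its own.

def segment(phones, fragments, max_n=5):
    out, i = [], 0
    while i < len(phones):
        for n in range(min(max_n, len(phones) - i), 0, -1):
            cand = tuple(phones[i:i + n])
            if n == 1 or cand in fragments:   # unigram fallback = full coverage
                out.append(cand)
                i += n
                break
    return out

fragments = {("K", "AE", "T"), ("IH", "NG")}
print(segment(["K", "AE", "T", "IH", "NG"], fragments))
```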
Indexing
Indexing
Indices are stored using Juru storage
– Juru is a full-text search library written in Java, developed at IBM
• D. Carmel, E. Amitay, M. Herscovici, Y. S. Maarek, Y. Petruschka, and A. Soffer, "Juru at TREC 10 - Experiments with Index Pruning", Proceedings of TREC-10, NIST, 2001.
– We have adapted the Juru storage model to store speech-related data (e.g., begin time, duration)
– The posting lists are compressed using classical index compression techniques (d-gap)
• Gerard Salton and Michael J. McGill, Introduction to Modern Information Retrieval, 1983.
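The d-gap technique mentioned above stores differences between consecutive sorted ids rather than the ids themselves, so the numbers stay small and compress well. A minimal sketch with variable-byte encoding (the encoding details here are the textbook scheme, not necessarily Juru's exact format):

```python
# d-gap + variable-byte compression of a sorted posting list of transcript ids.

def vbyte_encode(n):
    """Encode a non-negative int as 7-bit groups; high bit marks the last byte."""
    out = []
    while True:
        out.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    out[0] |= 0x80                      # mark the low-order (final) byte
    return bytes(reversed(out))

def compress_postings(tids):
    prev, buf = 0, bytearray()
    for tid in tids:                    # tids must be sorted ascending
        buf += vbyte_encode(tid - prev) # store the gap, not the id
        prev = tid
    return bytes(buf)

print(compress_postings([824, 829, 215406]).hex())
```

Small gaps (like 829 - 824 = 5) fit in a single byte, which is where the compression comes from on dense posting lists.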
Indexing Algorithm
Input: a corpus of word/sub-word transcripts
Process:
1. Extract units of indexing from the transcript
2. For each unit of indexing (word or sub-word), store in the index its posting
- transcript/speaker identifier (tid)
- begin time (bt)
- duration
- for WCN: posterior probability and rank relative to the other hypotheses
Output: an index on the corpus of transcripts
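The steps above can be sketched as follows; the transcript record layout (a dict of slots, each slot a list of ranked hypotheses) is an illustrative assumption, not the actual Juru storage format:

```python
from collections import defaultdict

# Build an inverted index over WCN-style transcripts: every hypothesis of a
# slot is indexed, with its posterior probability and its rank among the
# slot's alternatives (rank 1 = most probable).

def build_index(transcripts):
    index = defaultdict(list)            # unit -> posting list
    for tid, slots in transcripts.items():
        for bt, dur, hyps in slots:      # hyps: [(unit, posterior), ...]
            ranked = sorted(hyps, key=lambda h: -h[1])
            for rank, (unit, post) in enumerate(ranked, start=1):
                index[unit].append((tid, bt, dur, post, rank))
    return index

wcn = {"speech01": [(0.00, 0.31, [("hello", 0.9), ("halo", 0.1)]),
                    (0.31, 0.24, [("world", 0.8)])]}
idx = build_index(wcn)
print(idx["halo"])   # a lower-ranked alternative is indexed and searchable too
```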
Search System
In-Vocabulary Search
Miss probability can be reduced by expanding the 1-best transcript with extra words taken from the other alternatives provided by the WCN transcript.
Such an expansion reduces miss probability but increases FA probability.
We therefore need an appropriate scoring model that decreases FA probability by penalizing "bad" results.
J. Mamou, D. Carmel and R. Hoory, "Spoken Document Retrieval from Call-center conversations", Proceedings of SIGIR, 2006
Improving Retrieval Effectiveness for In Voc search
Our scoring model is based on two pieces of information provided by WCN:
– the posterior probability of the hypothesis given the signal: it reflects the confidence level of the ASR in the hypothesis.
– the rank of the hypothesis among the other alternatives: it reflects the relative importance of the occurrence.
score(q_i, tid, bt_i) = rank(q_i, tid, bt_i) · Pr(q_i, tid, bt_i)

score(Q = q_1…q_N, tid, bt_1…bt_N) = [ ∏_{i=1..N} score(q_i, tid, bt_i) ]^(1/N)
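A toy numeric sketch of the scoring described on this slide: each hit is scored by its ASR posterior weighted by a rank-dependent factor, and a multi-keyword term takes the geometric mean of its keywords' scores. The 1/rank weighting is an invented stand-in — the slides only say that rank is a factor, not what function of it is used:

```python
# Per-hit score: rank weight (assumed 1/rank here) times posterior probability.
def hit_score(posterior, rank):
    return (1.0 / rank) * posterior

# Term score: geometric mean over the N keyword hits making up the term.
def term_score(hits):                       # hits: [(posterior, rank), ...]
    prod = 1.0
    for post, rank in hits:
        prod *= hit_score(post, rank)
    return prod ** (1.0 / len(hits))

print(term_score([(0.9, 1), (0.4, 2)]))
```

The geometric mean keeps multi-word scores on the same scale as single-word scores instead of letting the raw product of probabilities shrink with term length.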
Improving Retrieval Effectiveness with OOV search
BN model: 39 OOV queries
CTS model: 117 OOV queries
CONFMTG model: 89 OOV queries
Since the accuracy of the phone transcript is worse than the accuracy of the word transcript, we use the phone transcript only for OOV keywords.
This tends to reduce miss probability without increasing FA probability too much.
Grapheme-to-phoneme conversion
OOV keywords are converted to phone sequences using a joint Maximum Entropy n-gram model
– Given a letter sequence L, find the phone sequence P* that maximizes Pr(L,P)
Details in
– Stanley Chen, “Conditional and Joint Models for Grapheme-to-Phoneme Conversion”, in Proc. of Eurospeech 2003.
P* = argmax_P Pr(L, P)

Pr(L, P) = Pr(C), where C = c_1…c_m is a chunk sequence with letters(C) = L and phones(C) = P

Pr(C) = ∏_{i=1..m} Pr(c_i | c_{i-1} … c_1)
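A toy illustration of the joint-model idea: each unit c pairs a letter chunk with a phone chunk, and the best phone sequence maximizes the product of unit probabilities over all segmentations. This sketch uses a made-up unigram unit table and exhaustive dynamic programming, whereas the real system uses a Maximum Entropy n-gram over such units:

```python
import math
from functools import lru_cache

# Invented grapheme-phoneme units: (letter chunk, phone chunk) -> probability.
UNITS = {("ph", "F"): 0.9, ("p", "P"): 0.8, ("h", "HH"): 0.7, ("o", "OW"): 0.6,
         ("n", "N"): 0.9, ("e", "IY"): 0.3, ("e", ""): 0.5}

def g2p(letters):
    """Return the phone sequence of the highest-probability segmentation."""
    @lru_cache(maxsize=None)
    def best(i):                          # best (log-prob, phones) from position i
        if i == len(letters):
            return 0.0, ()
        cands = []
        for (ltr, ph), p in UNITS.items():
            if letters.startswith(ltr, i):
                lp, rest = best(i + len(ltr))
                cands.append((lp + math.log(p), ((ph,) if ph else ()) + rest))
        return max(cands) if cands else (float("-inf"), ())
    return best(0)[1]

print(g2p("phone"))
```

Here the "ph"→F unit outscores the letter-by-letter path, and the silent-"e" unit maps the final letter to no phone — the kind of alignment a joint model learns from data.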
Search Algorithm
Input: a query term, a word-based index, and a sub-word-based index
Process:
1. Extract the query keywords
2. For In-Voc query keywords, extract the posting lists from the word based index
3. For OOV query keywords, convert the keywords to sub-words and extract the posting list of each sub-word from the sub-word index
4. Merge the different posting lists according to the timestamp of the occurrences in order to create results matching the query
- check that the words and sub-words appear in the right order according to their begin times
- check that the words/sub-words are adjacent (less than 0.5 sec for word-word and word-phoneme pairs, and less than 0.2 sec for phoneme-phoneme pairs)
Output: the set of all the matches of the given term
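Step 4 above can be sketched as follows; the posting format (tid, begin time, duration, unit kind) and the left-to-right extension strategy are illustrative assumptions:

```python
# Combine per-keyword posting lists into multi-keyword matches, requiring the
# correct temporal order and an adjacency gap below the per-pair threshold
# (0.5 s for word-word / word-phone, 0.2 s for phone-phone).

GAP = {("word", "word"): 0.5, ("word", "phone"): 0.5,
       ("phone", "word"): 0.5, ("phone", "phone"): 0.2}

def merge(posting_lists):
    matches = [[p] for p in posting_lists[0]]
    for postings in posting_lists[1:]:
        extended = []
        for match in matches:
            tid, bt, dur, kind = match[-1]
            for p in postings:
                gap = p[1] - (bt + dur)        # next begin minus previous end
                if p[0] == tid and 0 <= gap < GAP[(kind, p[3])]:
                    extended.append(match + [p])
        matches = extended
    return matches

words = [[("t1", 1.0, 0.3, "word")],
         [("t1", 1.4, 0.2, "word"), ("t1", 3.0, 0.2, "word")]]
print(merge(words))   # only the occurrence 0.1 s after the first word survives
```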
Search Algorithm
[Figure: flow diagram of the search — terms are extracted from the query; In-Voc terms pull posting lists from the word index and OOV terms from the phone index; the lists are merged based on begin time and adjacency (Word-Word, Word-Phone: < 0.5 s; Phone-Phone: < 0.2 s), producing the set of matches for all terms in the query.]
Scoring for hard-decision
Decision thresholds are set according to the analysis of the DET curve obtained on the development set.
– We have used different threshold values per source type
BN: 0.4    CTS: 0.61    CONFMTG: 0.91
We have boosted the score of multi-word terms
score(Q = q_1…q_N, tid, bt_1…bt_N) = [ ∏_{i=1..N} score(q_i, tid, bt_i) ]^(1/N)
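The hard YES/NO decision reduces to comparing each putative hit's score against the per-source-type threshold tuned on the development DET curve (threshold values from the slide; the function wrapper is illustrative):

```python
# Per-source-type decision thresholds tuned on the development set.
THRESHOLDS = {"BN": 0.4, "CTS": 0.61, "CONFMTG": 0.91}

def decide(score, source_type):
    """Hard decision: YES if the term score reaches the source's threshold."""
    return "YES" if score >= THRESHOLDS[source_type] else "NO"

print(decide(0.5, "BN"), decide(0.5, "CTS"))
```

The same score yields different decisions per source type, reflecting the much noisier CONFMTG and CTS transcripts.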
Primary and Contrast system differences
Primary system (WCN): WCN for all the types, CONFMTG transcripts generated using the BN model. Combination with phonetic 1-best transcripts for BN and CTS.
Contrastive 1 (WCN-C): same as the primary system, except that the CONFMTG WCN was generated using the CONFMTG model
Contrastive 2 (CTM): CTM for all the types, CONFMTG transcripts generated using the BN model. Combination with phonetic 1-best transcripts for BN and CTS.
Contrastive 3 (1-best-WCN): 1-best path extracted from WCN, CONFMTG transcripts generated using the BN model. Combination with phonetic 1-best transcripts for BN and CTS.
Experiments/Results
Data Results

system                 BN       CTS      CONFMTG
WER Dry-Run            12.4     19.6     47.1
TWV Dry-Run WCN        0.8498   0.6597   0.2921
Eval WCN        ATWV   0.8485   0.7392   0.2365
                MTWV   0.8532   0.7408   0.2508
Eval CTM        ATWV   0.8485   0.7392   0.0016
                MTWV   0.8532   0.7408   0.0115
Eval WCN-BN     ATWV   0.8293   0.6763   0.1092
                MTWV   0.8293   0.6763   0.1092
Eval 1-best-WCN ATWV   0.8279   0.7101   0.2381
                MTWV   0.8319   0.7117   0.2514

Retrieval performance is improved:
• using WCNs, relative to the 1-best path
• using the 1-best from WCN, rather than CTM
Our ATWV is close to the MTWV; we have used appropriate thresholds for punishing bad results.
Condition performance
In general we performed better on long terms.
By term duration (quantiles):

quantile       0-0.25   0.25-0.5   0.5-0.75   0.75-1
BN      ATWV   0.7396   0.8789     0.8942     0.9061
        MTWV   0.7627   0.8877     0.9002     0.9061
CTS     ATWV   0.6413   0.8163     0.8135     0.8379
        MTWV   0.6417   0.8334     0.8251     0.8663
CONFMTG ATWV   0.1276   0.3585     0.4178     0.2937
        MTWV   0.1636   0.3855     0.4400     0.3095

By term character length (percentiles):

quantile       0-33     33-66    66-100
BN      ATWV   0.7655   0.8794   0.9088
        MTWV   0.7819   0.8914   0.9124
CTS     ATWV   0.6545   0.8308   0.8378
        MTWV   0.6551   0.8727   0.8479
CONFMTG ATWV   0.1677   0.3493   0.3651
        MTWV   0.1955   0.4109   0.3880
System characteristics (Eval)
Index size: 0.3267 MB/HP
– Compression of the index storage
Indexing time: 7.5627 HP/HS
Search speed: 0.0041 sec.P/HS
Index Memory Usage: 1653.4297 MB
Search Memory Usage: 269.1250 MB
Conclusion
Our system combines a word retrieval approach with a phonetic retrieval approach
Our work exploits additional information provided by WCNs
– Extending the 1-best transcript with all the hypotheses of the WCN, considering confidence levels and boosting by term rank.
ATWV is increased compared to the 1-best transcript
– Miss probability is significantly improved by indexing all the hypotheses provided by the WCN.
– Decision scores are set to NO for "bad" results in order to attenuate the effect of the false alarms added by the WCN.