
IBM Research

Spoken Term Detection Workshop NIST, Gaithersburg, Dec. 14-15, 2006 © 2006 IBM Corporation

The IBM 2006 Spoken Term Detection System

Olivier Siohan

Bhuvana Ramabhadran

IBM T. J. Watson Research Center

Jonathan Mamou

IBM Haifa Research Labs


Outline

• System description

• Indexing

• Audio processing for each source type: generation of CTM, word confusion networks (WCN) and phone transcripts

• Index generation and storage

• Search

• Experiments/Results


System Overview

[System diagram: the ASR systems produce word transcripts and phone transcripts; offline indexing builds a word index and a phone index of posting lists. At search time, the SEARCHER takes the term list, extracts the posting lists (word index for in-vocabulary terms, phone index for OOV terms), merges them, scores each result, and makes the hard decision, producing the STD list.]


Audio Processing


Broadcast News Transcription System (BN)

Acoustic Training Data

430-hour corpus (1996, 1997 BN Speech collection and TDT4 Multilingual BN Speech corpus)

Language Model Training Data

198K word corpus (1996, 1997 BN News Transcripts, EARS BN 03 Closed Captions and TDT4 Multilingual BN Speech corpus, GALE Y1Q1 and Y2Q2 English closed captions)

Lexicon: 67K words (based on frequency in the LM training data, plus named entities (mentions) identified in the TDT4 corpus)


Conversational Telephone Speech Transcription System (CTS)

Acoustic Training Data

2100 hours comprising Fisher parts 1-7, SWB-1, BBN/CTRAN SWB-2, SWB Cellular, and CallHome English

Language Model Training Data

Switchboard-1 and 2, Switchboard Cellular, CallHome English, Fisher parts 1-7, BN news transcripts, and web data from UW

Lexicon: 30.5K words (the RT-02 lexicon, extended to cover the 5000 most frequent words in the Fisher data)

D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, G. Zweig, "fMPE: Discriminatively Trained Features for Speech Recognition", in Proceedings International Conference on Acoustics Speech and Signal Processing, Philadelphia, PA, 2005.


Meeting Transcription System (confmtg)

Acoustic Training Data

470 hours of MDM training data comprising: ICSI meetings (70 hours), NIST meeting pilot (15 hours), RT04 dev/eval (2.5 hours), RT05 dev excluding CHIL'05 eval (4.5 hours), AMI seminars (16 hours), and CHIL06 dev (3 hours)

Language Model Training Data

Meeting transcripts (1.5M words), conference paper text (37M words) and Fisher data (3M words)

Lexicon: 37K words (the words in the meeting and Fisher data, plus the 20K most frequent words in the other text corpora)

Huang, J. et al., "The IBM RT06S Speech-To-Text Evaluation System", NIST RT06S Workshop, May 3-4, 2006.


Phonetic Lattice Generation

O. Siohan, M. Bacchiani, "Fast vocabulary independent audio search using path based graph indexing", Proceedings of Interspeech 2005, Lisbon, pp. 53-56.

Two-step algorithm:

1. Generate sub-word lattices using word fragments as decoding units

2. Convert word-fragment lattices into phonetic lattices

Required resources:

1. A word-fragment inventory

2. A word-fragment lexicon

3. A word-fragment language model

Main Issue: designing a fragment inventory
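The second step above can be pictured as rewriting every word-fragment arc as a chain of phone arcs using the fragment lexicon. A minimal sketch in Python, assuming a toy arc-list lattice representation and a hypothetical fragment lexicon (FRAG_LEX); it illustrates the idea only and is not the decoder's actual lattice machinery:

# Sketch: expand word-fragment lattice arcs into chains of phone arcs.
# Assumes a toy arc representation (start state, end state, label, weight)
# and a hypothetical fragment lexicon mapping fragments to phone strings.
from itertools import count

FRAG_LEX = {                          # hypothetical fragment pronunciations
    "K_AE_T": ["K", "AE", "T"],
    "S_IH": ["S", "IH"],
}

def fragment_to_phone_lattice(arcs, next_state):
    """arcs: list of (src, dst, fragment, weight). Returns phone arcs."""
    phone_arcs = []
    new_state = count(next_state)         # generator of fresh state ids
    for src, dst, frag, w in arcs:
        phones = FRAG_LEX[frag]
        # Chain the phones between src and dst, putting the weight on the
        # first arc so the total path weight is unchanged.
        prev = src
        for i, ph in enumerate(phones):
            nxt = dst if i == len(phones) - 1 else next(new_state)
            phone_arcs.append((prev, nxt, ph, w if i == 0 else 0.0))
            prev = nxt
    return phone_arcs

if __name__ == "__main__":
    frag_arcs = [(0, 1, "K_AE_T", 1.2), (1, 2, "S_IH", 0.4)]
    print(fragment_to_phone_lattice(frag_arcs, next_state=3))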


Fragment based system design

1. Use a word-based system to convert the training material to phone strings

2. Train a phone n-gram with “large n” (say 5)

3. Prune the phone n-gram using entropy-based pruning

A. Stolcke, "Entropy-based pruning of backoff language models", in Proceedings DARPA Broadcast News Transcription and Understanding Workshop, pp. 270-274, Lansdowne, VA, Feb. 1998.

4. Use the retained n-grams as the selected fragments (n-gram structure ensures coverage of all strings)

5. Phonetic pronunciations for word fragments are trivial

6. Train a fragment-based n-gram model for use in the fragment-based ASR system
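A rough sketch of steps 2-4, assuming phone-level training strings are already available. A simple count threshold stands in for Stolcke's entropy-based pruning, and the names and thresholds (select_fragments, MIN_COUNT) are illustrative only:

# Sketch: derive a word-fragment inventory from phone-level training data.
# Count phone n-grams up to order 5 and keep the frequent ones as fragments;
# the count threshold is a simplified stand-in for entropy-based LM pruning.
from collections import Counter

MAX_ORDER = 5
MIN_COUNT = 3            # illustrative pruning threshold

def select_fragments(phone_strings, max_order=MAX_ORDER, min_count=MIN_COUNT):
    counts = Counter()
    for phones in phone_strings:                # each utterance: list of phones
        for n in range(1, max_order + 1):
            for i in range(len(phones) - n + 1):
                counts[tuple(phones[i:i + n])] += 1
    # Unigrams are always kept so any phone string remains coverable.
    return {ng for ng, c in counts.items() if len(ng) == 1 or c >= min_count}

if __name__ == "__main__":
    data = [["K", "AE", "T"], ["K", "AE", "B"], ["K", "AE", "T", "S"]]
    for frag in sorted(select_fragments(data, min_count=2)):
        print("_".join(frag))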


Indexing


Indexing

Indices are stored using Juru storage

– Juru is a full-text search library written in Java, developed at IBM

• D. Carmel, E. Amitay, M. Herscovici, Y. S. Maarek, Y. Petruschka, and A. Soffer, "Juru at TREC 10 - Experiments with Index Pruning", Proceedings of TREC-10, NIST, 2001.

– We have adapted the Juru storage model in order to store speech-related data (e.g., begin time, duration)

– The posting lists are compressed using classical index compression techniques (d-gap)

• Gerard Salton and Michael J. McGill, Introduction to Modern Information Retrieval, 1983.
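As an illustration of the d-gap idea, a sorted posting list can be stored as successive differences that are then variable-byte encoded. A minimal sketch of the classical technique, not Juru's actual storage format:

# Sketch: d-gap + variable-byte compression of a sorted posting list.
def vbyte_encode(n):
    """Variable-byte encode one non-negative integer."""
    out = bytearray()
    while True:
        out.insert(0, n & 0x7F)
        if n < 128:
            break
        n >>= 7
    out[-1] |= 0x80                     # high bit marks the last byte
    return bytes(out)

def compress_postings(positions):
    """positions: sorted occurrence ids. Store gaps instead of absolutes."""
    prev, encoded = 0, bytearray()
    for p in positions:
        encoded += vbyte_encode(p - prev)    # the d-gap
        prev = p
    return bytes(encoded)

if __name__ == "__main__":
    postings = [3, 7, 8, 150, 1000]
    blob = compress_postings(postings)
    print(len(blob), "bytes for", len(postings), "postings")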


Indexing Algorithm

Input: a corpus of word/sub-word transcripts

Process:

1. Extract units of indexing from the transcript

2. For each unit of indexing (word or sub-word), store in the index its posting

- transcript/speaker identifier (tid)

- begin time (bt)

- duration

- for WCN: the posterior probability and the rank relative to the other hypotheses

Output: an index on the corpus of transcripts
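A compact sketch of this indexing loop, assuming each transcript is already available as a list of (unit, begin time, duration, posterior, rank) tuples; the Posting and build_index names are illustrative, not the Juru API:

# Sketch: build an in-memory inverted index of word/sub-word postings.
# Each posting keeps the fields listed above: transcript id, begin time,
# duration, and (for WCN input) posterior probability and hypothesis rank.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Posting:
    tid: str          # transcript/speaker identifier
    bt: float         # begin time (seconds)
    dur: float        # duration (seconds)
    post: float = 1.0 # posterior probability (WCN only)
    rank: int = 1     # rank among the alternatives at that position (WCN only)

def build_index(transcripts):
    """transcripts: {tid: [(unit, bt, dur, post, rank), ...]}"""
    index = defaultdict(list)
    for tid, units in transcripts.items():
        for unit, bt, dur, post, rank in units:
            index[unit].append(Posting(tid, bt, dur, post, rank))
    for postings in index.values():              # keep lists time-sorted
        postings.sort(key=lambda p: (p.tid, p.bt))
    return index

if __name__ == "__main__":
    idx = build_index({"bn_001": [("boston", 12.3, 0.42, 0.91, 1),
                                  ("austin", 12.3, 0.42, 0.07, 2)]})
    print(idx["boston"])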


Search System


In-Vocabulary Search

Miss probability can be reduced by expanding the 1-best transcript with extra words taken from the other alternatives provided by the WCN transcript.

Such an expansion is likely to reduce the miss probability while increasing the FA probability.

We need an appropriate scoring model in order to decrease the FA probability by penalizing "bad" results.

J. Mamou, D. Carmel and R. Hoory, "Spoken Document Retrieval from Call-center conversations", Proceedings of SIGIR, 2006


Improving Retrieval Effectiveness for In Voc search

Our scoring model is based on two pieces of information provided by WCN:

– the posterior probability of the hypothesis given the signal: it reflects the confidence level of the ASR in the hypothesis.

– the rank of the hypothesis among the other alternatives: it reflects the relative importance of the occurrence.

score(q_i, tid, bt_i) = rank(q_i, tid, bt_i) \times \Pr(q_i \mid tid, bt_i)

score(Q = q_1 \dots q_N, tid, bt_1, \dots, bt_N) = \sum_{i=1}^{N} score(q_i, tid, bt_i)
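A small sketch of this scoring, assuming the per-keyword rank and posterior are read from the index. The slides do not give the exact rank-to-weight mapping, so a placeholder 1/rank weight is used; the keyword scores are accumulated as in the query-level formula above:

# Sketch of the WCN-based scoring: each keyword occurrence is scored by a
# rank-dependent weight times its posterior probability, and a multi-word
# term accumulates the scores of its keywords. The 1/rank weight is only a
# placeholder; the slides do not give the exact rank-to-weight mapping.
def keyword_score(posterior, rank):
    return (1.0 / rank) * posterior

def term_score(keyword_occurrences):
    """keyword_occurrences: list of (posterior, rank), one per query keyword."""
    return sum(keyword_score(p, r) for p, r in keyword_occurrences)

if __name__ == "__main__":
    # Two-keyword term: first keyword is the top hypothesis, second is rank 2.
    print(term_score([(0.92, 1), (0.40, 2)]))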


Improving Retrieval Effectiveness with OOV search

BN model: 39 OOV queries

CTS model: 117 OOV queries

CONFMTG model: 89 OOV queries

Since the accuracy of the phone transcripts is lower than that of the word transcripts, we use the phone transcripts only for OOV keywords

This tends to reduce the miss probability without increasing the FA probability too much


Grapheme-to-phoneme conversion

OOV keywords are converted to phone sequences using a joint maximum entropy n-gram model

– Given a letter sequence L, find the phone sequence P* that maximizes Pr(L,P)

Details in

– Stanley Chen, “Conditional and Joint Models for Grapheme-to-Phoneme Conversion”, in Proc. of Eurospeech 2003.

P^* = \arg\max_P \Pr(L, P)

with

\Pr(L, P) = \sum_{C:\, letters(C) = L,\ phones(C) = P} \Pr(C), \qquad \Pr(C) = \prod_{i=1}^{m} \Pr(c_i \mid c_1 \dots c_{i-1})

where C = c_1 \dots c_m is a sequence of joint letter-phone units.
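To make the joint-segmentation idea concrete, the sketch below decodes an OOV word with a toy unigram model over hand-made letter-phone units ("graphones"), used here in place of the joint maximum-entropy n-gram model of Chen (2003); all units and probabilities are made up:

# Sketch of joint letter-phone ("graphone") decoding for an OOV keyword.
# A toy unigram model over hand-made graphones replaces the real joint
# MaxEnt n-gram model, just to show how Pr(L, P) is maximised over
# joint segmentations.
import math

# (letter chunk, phone chunk) -> probability; illustrative values only.
GRAPHONES = {
    ("ph", "F"): 0.02, ("o", "OW"): 0.03, ("ne", "N"): 0.01,
    ("p", "P"): 0.04, ("h", "HH"): 0.02, ("e", "EH"): 0.03,
    ("n", "N"): 0.05, ("o", "AA"): 0.01,
}
MAX_LETTERS = max(len(l) for l, _ in GRAPHONES)

def g2p(word):
    """Return (log-prob, phone list) of the best joint segmentation."""
    best = {0: (0.0, [])}                  # letters consumed -> best path
    for i in range(len(word)):
        if i not in best:
            continue
        lp, phones = best[i]
        for n in range(1, MAX_LETTERS + 1):
            chunk = word[i:i + n]
            for (l, p), prob in GRAPHONES.items():
                if l == chunk:
                    cand = (lp + math.log(prob), phones + [p])
                    if i + n not in best or cand[0] > best[i + n][0]:
                        best[i + n] = cand
    return best.get(len(word))

if __name__ == "__main__":
    print(g2p("phone"))        # e.g. (log-prob, ['F', 'OW', 'N'])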


Search Algorithm

Input: a query term, a word-based index, and a sub-word-based index

Process:

1. Extract the query keywords

2. For In-Voc query keywords, extract the posting lists from the word based index

3. For OOV query keywords, convert the keywords to sub-words and extract the posting list of each sub-word from the sub-word index

4. Merge the different posting lists according to the timestamp of the occurrences in order to create results matching the query

- check that the words and sub-words appear in the right order according to their begin times

- check that the words/sub-words are adjacent (less than 0.5 sec for word-word and word-phone pairs, and less than 0.2 sec for phone-phone pairs)

Output: the set of all the matches of the given term
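A sketch of the merge step (step 4), assuming each query unit's posting list is a time-sorted list of (begin time, duration) pairs from one audio file; the thresholds are those quoted above and the helper names are illustrative:

# Sketch of step 4: merge per-keyword posting lists into term matches by
# walking the keywords left to right and requiring correct temporal order
# plus adjacency (< 0.5 s for word-word / word-phone, < 0.2 s phone-phone).
def adjacency_gap(prev_is_phone, cur_is_phone):
    return 0.2 if (prev_is_phone and cur_is_phone) else 0.5

def merge_postings(posting_lists, is_phone):
    """posting_lists[i]: sorted [(bt, dur), ...] for the i-th query unit.
    is_phone[i]: whether that unit came from the phone index."""
    # Each partial match is (start_bt, end_time_of_last_unit).
    matches = [(bt, bt + dur) for bt, dur in posting_lists[0]]
    for i in range(1, len(posting_lists)):
        gap = adjacency_gap(is_phone[i - 1], is_phone[i])
        extended = []
        for start, prev_end in matches:
            for bt, dur in posting_lists[i]:
                if prev_end <= bt < prev_end + gap:   # right order + adjacent
                    extended.append((start, bt + dur))
        matches = extended
    return matches      # (term begin time, term end time) for every match

if __name__ == "__main__":
    word = [(10.0, 0.3), (42.0, 0.3)]       # postings for an in-vocab word
    phones = [(10.35, 0.1)]                 # postings for an OOV sub-word
    print(merge_postings([word, phones], is_phone=[False, True]))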


Search Algorithm

[Flow diagram: extract the terms in the query; for in-vocabulary terms, extract the posting list from the word index; for OOV terms, extract the posting list from the phone index; merge based on begin time and adjacency (word-word and word-phone: < 0.5 s, phone-phone: < 0.2 s) to obtain the set of matches for all terms in the query.]


Scoring for hard-decision

Decision thresholds are set according to the analysis of the DET curve obtained on the development set.

– We have used different threshold values per source type

BN: 0.4    CTS: 0.61    CONFMTG: 0.91

We have boosted the score of multiple-word terms

score(Q = q_1 \dots q_N, tid, bt_1, \dots, bt_N) = \sum_{i=1}^{N} score(q_i, tid, bt_i)
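A minimal sketch of the resulting hard decision, applying the per-source thresholds above to the query-level score:

# Sketch: turn query-level scores into YES/NO decisions with one threshold
# per source type, as tuned on the development DET curves.
THRESHOLDS = {"BN": 0.4, "CTS": 0.61, "CONFMTG": 0.91}

def decide(source_type, score):
    """Return the hard decision reported to the STD scorer."""
    return "YES" if score >= THRESHOLDS[source_type] else "NO"

if __name__ == "__main__":
    for src, s in [("BN", 0.55), ("CTS", 0.55), ("CONFMTG", 0.95)]:
        print(src, s, decide(src, s))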


Primary and Contrast system differences

Primary system (WCN): WCNs for all source types; CONFMTG transcripts generated using the BN model. Combined with phonetic 1-best transcripts for BN and CTS.

Contrastive 1 (WCN-C): same as the primary, except that the CONFMTG WCNs were generated using the CONFMTG model.

Contrastive 2 (CTM): CTMs for all source types; CONFMTG transcripts generated using the BN model. Combined with phonetic 1-best transcripts for BN and CTS.

Contrastive 3 (1-best-WCN): 1-best path extracted from the WCNs; CONFMTG transcripts generated using the BN model. Combined with phonetic 1-best transcripts for BN and CTS.


Experiments/Results


Data Results

                           BN       CTS      CONFMTG
WER, Dry-Run               12.4     19.6     47.1
TWV, Dry-Run (WCN)         0.8498   0.6597   0.2921
Eval WCN (primary)  ATWV   0.8485   0.7392   0.2365
                    MTWV   0.8532   0.7408   0.2508
Eval WCN-C          ATWV   0.8485   0.7392   0.0016
                    MTWV   0.8532   0.7408   0.0115
Eval CTM            ATWV   0.8293   0.6763   0.1092
                    MTWV   0.8293   0.6763   0.1092
Eval 1-best-WCN     ATWV   0.8279   0.7101   0.2381
                    MTWV   0.8319   0.7117   0.2514

Retrieval performance is improved:

– using WCNs rather than the 1-best path

– using the 1-best path from the WCN rather than the CTM

Our ATWV is close to the MTWV; we have used appropriate thresholds for penalizing "bad" results.


Condition performance

In general, we performed better on longer terms.

By term duration (quantiles):

quantile          0-0.25    0.25-0.5   0.5-0.75   0.75-1
BN       ATWV     0.7396    0.8789     0.8942     0.9061
         MTWV     0.7627    0.8877     0.9002     0.9061
CTS      ATWV     0.6413    0.8163     0.8135     0.8379
         MTWV     0.6417    0.8334     0.8251     0.8663
CONFMTG  ATWV     0.1276    0.3585     0.4178     0.2937
         MTWV     0.1636    0.3855     0.4400     0.3095

By term length in characters (quantiles):

quantile          0-33      33-66      66-100
BN       ATWV     0.7655    0.8794     0.9088
         MTWV     0.7819    0.8914     0.9124
CTS      ATWV     0.6545    0.8308     0.8378
         MTWV     0.6551    0.8727     0.8479
CONFMTG  ATWV     0.1677    0.3493     0.3651
         MTWV     0.1955    0.4109     0.3880


System characteristics (Eval)

Index size: 0.3267 MB/HP

– Compression of the index storage

Indexing time: 7.5627 HP/HS

Search speed: 0.0041 sec.P/HS

Index Memory Usage: 1653.4297 MB

Search Memory Usage: 269.1250 MB


Conclusion

Our system combines a word retrieval approach with a phonetic retrieval approach

Our work exploits additional information provided by WCNs

– Extending the 1-best transcript with all the hypotheses of the WCN, considering confidence levels and boosting by term rank.

ATWV is increased compared to the 1-best transcript

– Miss probability is significantly improved by indexing all the hypotheses provided by the WCN.

– Decisions are set to NO for "bad" results in order to attenuate the effect of the false alarms added by the WCN expansion.