ARF @ MediaEval 2012: A Romanian ASR-based Approach to Spoken Term Detection

Motivation

Spoken Term Detection through ASR
Based on the Romanian ASR for continuous speech:
    acoustic model trained with 64 h of speech
    language model trained with 170 million words
    18% WER on clean speech
Adaptation of the Romanian ASR to the Lwazi language
Searching algorithms provided for the different outputs of the ASR

ASR adaptation

Tuning the Romanian ASR to minimize PhER at 8 kHz
77 African phones mapped to 28 Romanian phones
Romanian to Lwazi phone mapping rules:
    1) directly by IPA classification
    2) to the closest phone according to the full IPA chart
    3) based on the confusion matrix
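Rule 3 can be illustrated with a minimal sketch, assuming the confusion matrix is built from aligned pairs of (source phone, phone recognized by the Romanian ASR) and that each source phone is mapped to the phone it is most often recognized as. The function name, the phone labels and the toy counts are hypothetical and are not taken from the slides.

```python
from collections import Counter, defaultdict


def map_by_confusion(aligned_pairs):
    """Map each source phone to the target phone it is most often confused with.

    aligned_pairs: iterable of (source_phone, recognized_phone) tuples.
    """
    confusion = defaultdict(Counter)  # confusion[source][recognized] = count
    for source_phone, recognized_phone in aligned_pairs:
        confusion[source_phone][recognized_phone] += 1
    # Pick the most frequent recognized phone in each row of the matrix.
    return {src: row.most_common(1)[0][0] for src, row in confusion.items()}


# Toy, made-up alignments (hypothetical labels, for illustration only).
toy_pairs = [
    ("dental_click", "t"), ("dental_click", "t"), ("dental_click", "k"),
    ("implosive_b", "b"), ("implosive_b", "b"), ("implosive_b", "p"),
]
mapping = map_by_confusion(toy_pairs)
print(mapping)
```

Rules 1 and 2 would typically be applied first, with a data-driven step like this one covering the phones that have no obvious IPA counterpart.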

MAP adaptation of acoustic model with the development data set

ASR accuracy

Adaptation steps                          PhER [%]
Romanian ASR for continuous speech        36.8
Romanian ASR - beam width tuned           31.4
Romanian ASR - language model tuned       25.3
African speech with Romanian ASR          61.2
MAP adaptation with Lwazi dev set         48.1
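For reference, the phone error rate is commonly computed as the number of substitutions, deletions and insertions divided by the number of reference phones. The sketch below assumes that standard definition and uses a plain Levenshtein alignment; the slides do not state which scoring tool was used, and the example phone strings are made up.

```python
def phone_error_rate(reference, hypothesis):
    """Return 100 * (S + D + I) / N using Levenshtein distance over phone lists."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one phone")
    prev = list(range(m + 1))  # distances for the empty reference prefix
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return 100.0 * prev[m] / n


ref = "s a l u t".split()
hyp = "s a l o t e".split()
per = phone_error_rate(ref, hyp)  # one substitution and one insertion over 5 phones
print(per)  # 40.0
```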

Searching techniques

The ASR output can be:
    String of characters
    Lattice
    Confusion networks
Character comparison based techniques:
    DTW String Search (DTWSS)
    Sausage Technique (ST)
Acoustics based technique:
    Lattice Grammar (LG)

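DTWSS

Sliding window length proportional to the query length
Shorter DTW matches are given higher scores
Longer queries are given higher scores
The score formula:

$$ s = (1 - \mathrm{PhER}) \left(1 + \alpha \, \frac{L_Q - L_{Qm}}{L_{QM} - L_{Qm}}\right) \left(1 + \beta \, \frac{L_W - L_S}{L_Q}\right) $$

Here L_Q is taken to be the query length, L_Qm and L_QM the minimum and maximum query lengths, L_W the sliding window length and L_S the length of the DTW match; α and β are the weights reported with the DTWSS results.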
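The scoring rule can be sketched in code, assuming the score has the form s = (1 - PhER)(1 + α(L_Q - L_Qm)/(L_QM - L_Qm))(1 + β(L_W - L_S)/L_Q), where PhER is a fraction in [0, 1], L_Q is the query length, L_Qm and L_QM are the minimum and maximum query lengths, L_W is the sliding window length and L_S is the DTW match length. This form is an assumption inferred from the subscripts and from the two stated rules (shorter matches and longer queries score higher). The function and parameter names are hypothetical.

```python
def dtwss_score(pher, len_query, len_query_min, len_query_max,
                len_window, len_match, alpha=0.8, beta=0.4):
    """Assumed DTWSS score: low PhER, long queries and short matches score higher."""
    if len_query_max == len_query_min:
        query_term = 0.0  # all queries have the same length: no length bonus
    else:
        query_term = (len_query - len_query_min) / (len_query_max - len_query_min)
    match_term = (len_window - len_match) / len_query
    return (1.0 - pher) * (1.0 + alpha * query_term) * (1.0 + beta * match_term)


# A 10-phone query matched on 10 phones inside a 15-phone window, PhER of 0.2.
score = dtwss_score(0.2, len_query=10, len_query_min=4, len_query_max=16,
                    len_window=15, len_match=10)
print(round(score, 3))  # 1.344
```

With α = β = 0 the score reduces to 1 - PhER, so the two weights control how strongly query length and match length influence the decision.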

Sausage Technique (ST)

Lattice Grammar (LG)

Recognition of the query
Building of a finite state grammar (FSG) from the lattice (query) output of the ASR
Recognition of the contents with the FSG
Calculation of the likelihood probability
Normalization of the likelihood probability and its use as the decision score

Results on evaluation data set

Results on all data sets

ATWV                    evalQ-evalC   evalQ-devC   devQ-evalC   devQ-devC
DTWSS (α=0.8 β=0.4)     0.31          0.47         0.33         0.49
DTWSS (α=0.6 β=0.6)     0.31          0.48         0.33         0.47
DTWSS (α=0.1 β=0.4)     0.27          0.44         0.32         0.47
ST                      0.12          0.22         0.17         0.25
LG                      0             0.02         0            -
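To help read the ATWV figures, here is a minimal sketch of the term-weighted value as defined for the NIST 2006 Spoken Term Detection evaluation: one minus the per-term average of P_miss plus a weighted P_FA. ATWV is that value at the system's actual decision threshold. The weight of 999.9 is the NIST 2006 default and is an assumption here, since the slides do not state which parameters the evaluation used. The function name and the toy counts are hypothetical.

```python
def term_weighted_value(term_stats, speech_seconds, fa_weight=999.9):
    """TWV = 1 - mean over terms of (P_miss + fa_weight * P_FA).

    term_stats: dict mapping term -> (n_true, n_correct, n_false_alarm).
    speech_seconds: duration of the searched audio, in seconds.
    """
    losses = []
    for n_true, n_correct, n_false_alarm in term_stats.values():
        if n_true == 0:
            continue  # terms without true occurrences are left out of the average
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_false_alarm / (speech_seconds - n_true)
        losses.append(p_miss + fa_weight * p_fa)
    if not losses:
        raise ValueError("no term has a true occurrence")
    return 1.0 - sum(losses) / len(losses)


# Toy counts: two query terms searched in one hour of audio, no false alarms.
toy_stats = {"term_a": (4, 3, 0), "term_b": (2, 1, 0)}
twv = term_weighted_value(toy_stats, speech_seconds=3600.0)
print(twv)  # 0.625
```

Under this definition a perfect system scores 1, a system that returns no detections scores 0, and false alarms can push the value below 0.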

Conclusions

The Romanian ASR was adapted to recognize African phones
DTWSS obtains by far the best results
Penalizing long DTW matches and short queries helped increase the ATWV
The ST and LG methods suffer from the high PhER (48%)
