The TUM Cumulative DTW Approach for the MediaEval 2012 Spoken Web Search Task

Post on 18-Dec-2014


The TUM Cumulative DTW Approach for the Spoken Web Search Task

Cyril Joder, Felix Weninger, Martin Wöllmer, Björn Schuller

Institute for Human-Machine Communication, Technische Universität München

Summary

• Not a “system”
• Low-level features only
• No ASR
• Little “engineering”
• Method of integrating discriminative training into DTW

Mediaeval 2012 Workshop 2

Cumulative DTW (CDTW)

• Limitations of DTW:
  – Only one local cost function (distance)
  – Usually manual parameter tuning

• Idea:
  – Use different local cost functions for each step
  – Automatic learning of these functions as a combination of general features

From DTW to CDTW

• Local cost function:

[Diagram: the three steps into cell (i, j), from (i, j−1), (i−1, j) and (i−1, j−1), with weights α1, α2, α3]

• Dynamic Programming:
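The formulas on this slide did not survive in the transcript. As a point of reference, here is a minimal sketch of the classic DTW recursion with a single local distance and fixed step weights. The Euclidean distance, the boundary conditions, and the assignment of the three weights to the horizontal, vertical and diagonal steps are assumptions for illustration, not taken from the slide.

```python
import math


def weighted_dtw(x, y, alphas=(1.0, 1.0, 1.0)):
    """Classic DTW cost between two sequences of feature vectors.

    alphas are hypothetical step weights, assumed to apply to the steps
    from (i, j-1), (i-1, j) and (i-1, j-1), in that order.
    """
    def dist(a, b):
        # Single local cost function: Euclidean distance (an assumption).
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    n, m = len(x), len(y)
    inf = float("inf")
    # D[i][j] is the cost of the best path aligning x[:i] with y[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(x[i - 1], y[j - 1])
            D[i][j] = min(
                D[i][j - 1] + alphas[0] * d,      # horizontal step
                D[i - 1][j] + alphas[1] * d,      # vertical step
                D[i - 1][j - 1] + alphas[2] * d,  # diagonal step
            )
    return D[n][m]
```

The hard `min` keeps only the single best path, and the weights have to be set by hand, which are the two limitations the previous slide lists.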

From DTW to CDTW

• Local step function:

[Diagram: the three steps into cell (i, j), from (i, j−1), (i−1, j) and (i−1, j−1), with step functions s1(i, j), s2(i, j), s3(i, j)]

• Dynamic Programming:


Softmax?

+ Differentiable
  – Allows for an optimization of the

+ Combines several alignment paths
  – More robust to local changes

- Only gives a score (not the optimal path)
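The exact recursion is not preserved in the transcript, so the following is only a sketch of how a soft maximum could replace the hard `min`/`max` of DTW. It assumes that the step functions return scores (higher is better), that the soft maximum is a log-sum-exp, and that the path must start with a diagonal step. The function names and the `step_scores(k, i, j)` callback are hypothetical.

```python
import math


def logsumexp(values):
    """Smooth, differentiable stand-in for max: log(sum(exp(v)))."""
    top = max(values)
    if top == float("-inf"):
        return top  # no reachable predecessor
    return top + math.log(sum(math.exp(v - top) for v in values))


def cumulative_score(step_scores, n, m):
    """Soft cumulative score S(n, m) over all alignment paths.

    step_scores(k, i, j) is a hypothetical callback returning the score of
    reaching (i, j) via step k: 0 from (i, j-1), 1 from (i-1, j),
    2 from (i-1, j-1).
    """
    ninf = float("-inf")
    S = [[ninf] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Every predecessor contributes, so all paths are combined.
            S[i][j] = logsumexp([
                S[i][j - 1] + step_scores(0, i, j),
                S[i - 1][j] + step_scores(1, i, j),
                S[i - 1][j - 1] + step_scores(2, i, j),
            ])
    return S[n][m]
```

With all step scores set to zero, the result is the logarithm of the number of admissible paths, which illustrates both points on the slide: every path contributes to the score, and no single optimal path can be read back from it.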

Features

• Acoustic descriptors: MFCC++ (D=36)
  – HTK, 25 ms, CMN, global normalization

• Features (k=1…D):
  – Local distance
  – “Local self-similarity”
  – Distance / product of the self-similarities

Decision

• Are the two sequences instances of the same word/expression?

• Learning of the parameters
  – Backpropagation (stochastic gradient descent)
  – Training data: queries/utterances of dev set

Decision: S(I, J) / (I + J)
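The decision slide shows the final cumulative score S(I, J) together with the term I + J. Reading this as a score normalized by the combined sequence lengths is an assumption, as are the threshold and the convention that higher scores indicate a match. Under those assumptions, the decision could be sketched as:

```python
def same_expression(final_score, query_len, utterance_len, threshold=0.0):
    """Decide whether two sequences are instances of the same word/expression.

    final_score stands for S(I, J); query_len and utterance_len stand for
    I and J. The threshold value is hypothetical.
    """
    # Normalize so that long and short sequence pairs are comparable.
    normalized = final_score / (query_len + utterance_len)
    return normalized > threshold, normalized
```

For example, `same_expression(6.0, 10, 20, threshold=0.1)` normalizes the score to 0.2 and accepts the pair.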

Search Procedure

Given query and utterance

1) Feature extraction

2) Candidate search in

3) CDTW comparison

4) Score post-processing

Candidate Search

• Align query with entire utterance
  – CDTW with backtracking
  – “Scores” for each point

• Extract potential starts and ends
  – Peak-picking of scores

• Filter by duration
  – Only allow warping factors < 2
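The peak-picking and duration filter above can be sketched as follows. The slide gives no details, so the peak criterion (a simple local maximum above a minimum score) and the reading of "warping factor" as the ratio between candidate length and query length, in either direction, are assumptions. All function names are hypothetical.

```python
def pick_peaks(scores, min_score):
    """Return indices of local maxima whose score is at least min_score."""
    peaks = []
    for t in range(1, len(scores) - 1):
        if scores[t] >= min_score and scores[t - 1] < scores[t] >= scores[t + 1]:
            peaks.append(t)
    return peaks


def candidate_segments(starts, ends, query_len, max_warp=2.0):
    """Pair start and end frames, keeping segments with warping factor < max_warp."""
    segments = []
    for s in starts:
        for e in ends:
            length = e - s + 1
            if length <= 0:
                continue  # the end must not precede the start
            # Stretching or compressing the query by max_warp or more is rejected.
            warp = max(length / query_len, query_len / length)
            if warp < max_warp:
                segments.append((s, e))
    return segments
```

With a query of 10 frames, `candidate_segments([0, 10], [7, 24], 10)` keeps `(0, 7)` and `(10, 24)` and rejects `(0, 24)`, whose 25 frames would need a warping factor of 2.5.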

CDTW Score Post-Processing

• Same decision function as for learning
  – Many false positives
  – Bias toward some queries

• Heuristic post-processing:
  – For each query, subtract a specific threshold
  – Threshold: 90th percentile of the CDTW scores for that query
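The per-query threshold subtraction can be sketched as below. The slide does not say how the percentile is computed, so the linear interpolation between ranks is an assumption, and the dictionary layout and function names are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0 to 100), linearly interpolated between sorted ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def subtract_query_threshold(scores_by_query, q=90.0):
    """Subtract each query's own q-th percentile from all of its scores."""
    adjusted = {}
    for query, scores in scores_by_query.items():
        threshold = percentile(scores, q)
        adjusted[query] = [s - threshold for s in scores]
    return adjusted
```

After this step, roughly the top 10% of candidates for each query have a positive score, whatever the raw score range of that query was, which counteracts the bias toward some queries.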

Results

run       devQ-devC   evalQ-devC   devQ-evalC   evalQ-evalC
P(miss)   55.6%       59.5%        60.2%        54.5%
P(FA)     1.18%       1.13%        1.17%        1.13%
ATWV      0.263       0.333        0.164        0.290

• Great improvement over naive DTW
  – ATWV = 0.065 on devQ-devC

• ATWV scores depend on the run

Results

• DET curves similar

• CDTW seems to generalize well

• Decision function has to be improved

Conclusion

• CDTW: promising results
  – Data-based approach with satisfactory results
  – Significantly outperforms (naive) DTW
  – Good generalization

• Future work:
  – Decision function
  – Acoustic descriptors
  – Integrate “hard” path constraints into search

Thank you.

Cyril.Joder@tum.de
