The TUM Cumulative DTW Approach for the MediaEval 2012 Spoken Web Search Task

Cyril Joder, Felix Weninger, Martin Wöllmer, Björn Schuller

Institute for Human-Machine Communication, Technische Universität München

Summary

• Not a “system”
• Low-level features only
• No ASR
• Little “engineering”
• Method of integrating discriminative training into DTW

MediaEval 2012 Workshop

Cumulative DTW (CDTW)

• Limitations of DTW:
  – Only one local cost function (distance)
  – Usually manual parameter tuning

• Idea:
  – Use different local cost functions for each step
  – Automatic learning of these functions as a combination of general features

From DTW to CDTW

• Local cost function:

[Figure: DTW grid cell (i, j) with its three predecessors (i, j−1), (i−1, j) and (i−1, j−1)]

• Dynamic Programming:
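
For reference, the classic DTW recursion that this slide starts from can be sketched as below. The slide's own formulas did not survive, so this is only the standard textbook form: the Euclidean local cost and the names `dtw_cost`, `x` and `y` are assumptions made for illustration, not the authors' notation.

```python
import numpy as np

def dtw_cost(x, y):
    """Classic DTW cost between feature sequences x (I x D) and y (J x D)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    I, J = len(x), len(y)
    # Local cost d(i, j): Euclidean distance between frames (an assumption).
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    # Accumulated cost, padded with infinity so the borders need no special case.
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            # Best of the three predecessors (i, j-1), (i-1, j), (i-1, j-1).
            best_prev = min(D[i, j - 1], D[i - 1, j], D[i - 1, j - 1])
            D[i, j] = d[i - 1, j - 1] + best_prev
    return D[I, J]
```

Every step shares the single local cost `d`, which is the limitation the CDTW slides address.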

From DTW to CDTW

• Local step function:

[Figure: DTW grid cell (i, j) with its three predecessors (i, j−1), (i−1, j) and (i−1, j−1); each incoming step has its own step function: s1(i, j), s2(i, j), s3(i, j)]

Softmax?

+ Differentiable
  – Allows for an optimization of the
+ Combines several alignment paths
  – More robust to local changes
- Only gives a score (not the optimal path)
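
The trade-offs listed above can be illustrated with a soft accumulation over the three steps. The exact CDTW formula is not preserved in these slides, so the following is only a sketch under stated assumptions: scores are "higher is better", the soft maximum is a log-sum-exp, and the step scores `s[k, i, j]` are given as a precomputed array (in the approach described here they would be learned combinations of features). The function name and the ordering of the three steps are hypothetical.

```python
import numpy as np

def soft_cumulative_score(s):
    """Soft accumulation of step scores s[k, i, j], k = 0, 1, 2.

    Assumed convention: k = 0 is the step from (i, j-1), k = 1 the step
    from (i-1, j), k = 2 the step from (i-1, j-1).
    """
    s = np.asarray(s, dtype=float)
    _, I, J = s.shape
    # Padded score table; -inf marks cells no path can reach.
    S = np.full((I + 1, J + 1), -np.inf)
    S[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            candidates = np.array([
                S[i, j - 1] + s[0, i - 1, j - 1],
                S[i - 1, j] + s[1, i - 1, j - 1],
                S[i - 1, j - 1] + s[2, i - 1, j - 1],
            ])
            # Log-sum-exp: a smooth maximum that blends all incoming paths.
            S[i, j] = np.logaddexp.reduce(candidates)
    return S[I, J]
```

Replacing `np.logaddexp.reduce` with `np.max` would recover a hard best-path score. The soft version is differentiable in every `s[k, i, j]` and sums over all alignment paths, but for the same reason it yields only a score and no single optimal path.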

Features

• Acoustic descriptors: MFCC++ (D=36)
  – HTK, 25 ms, CMN, global normalization

• Features (k = 1…D):
  – Local distance
  – “Local self-similarity”
  – Distance / product of the self-similarities

Decision

• Are the two sequences instances of the same word/expression?

• Learning of the parameters
  – Backpropagation (stochastic gradient descent)
  – Training data: queries/utterances of dev set

• Decision based on $\frac{S(I, J)}{I + J}$

Search Procedure

Given query and utterance

1) Feature extraction

2) Candidate search in

3) CDTW comparison

4) Score post-processing

Candidate Search

• Align query with entire utterance
  – CDTW with backtracking
  – “Scores” for each point

• Extract potential starts and ends
  – Peak-picking of scores

• Filter by duration
  – Only allow warping factors < 2
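
The duration filter in the last step can be sketched as follows. The slides do not define the warping factor, so this sketch assumes it is the ratio between candidate length and query length, taken in whichever direction is larger. The function name `filter_by_duration` and its parameters are hypothetical, and the peak-picking that produces `starts` and `ends` is not shown.

```python
def filter_by_duration(starts, ends, query_len, max_warp=2.0):
    """Keep (start, end) pairs whose length is within a factor max_warp of the query."""
    kept = []
    for s in starts:
        for e in ends:
            length = e - s
            if length <= 0:
                continue  # an end must come after its start
            # Assumed definition: stretch or compression relative to the query.
            warp = max(length / query_len, query_len / length)
            if warp < max_warp:
                kept.append((s, e))
    return kept
```

For a query of 40 frames, candidate starts at frames 0 and 50 and ends at frames 60 and 100, only the pairs (0, 60) and (50, 100) survive; the other two combinations are too long or too short.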

CDTW Score Post-Processing

• Same decision function as for learning
  – Many false positives
  – Bias toward some queries

• Heuristic post-processing:
  – For each query, subtract a specific threshold
  – Threshold: 90th percentile of the CDTW scores for that query
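
The per-query thresholding described above can be sketched with NumPy. The dictionary layout and the name `subtract_query_threshold` are assumptions for illustration; only the rule itself (subtract each query's 90th-percentile CDTW score) comes from the slide.

```python
import numpy as np

def subtract_query_threshold(scores_by_query, percentile=90.0):
    """Subtract from each query's scores that query's own percentile threshold."""
    adjusted = {}
    for query_id, scores in scores_by_query.items():
        scores = np.asarray(scores, dtype=float)
        # Threshold computed over all candidate scores of this query only.
        threshold = np.percentile(scores, percentile)
        adjusted[query_id] = scores - threshold
    return adjusted
```

After this shift, a query whose raw scores are uniformly high no longer dominates the ranking, since roughly the top 10% of candidates per query end up with positive scores.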

Results

run      devQ-devC  evalQ-devC  devQ-evalC  evalQ-evalC
P(miss)  55.6%      59.5%       60.2%       54.5%
P(FA)    1.18%      1.13%       1.17%       1.13%
ATWV     0.263      0.333       0.164       0.290

• Great improvement over naive DTW
  – Naive DTW: ATWV = 0.065 on devQ-devC

• ATWV scores depend on the run

Results

• DET curves similar

• CDTW seems to generalize well

• Decision function has to be improved

Conclusion

• CDTW: promising results
  – Data-based approach with satisfactory results
  – Significantly outperforms (naive) DTW
  – Good generalization

• Future work:
  – Decision function
  – Acoustic descriptors
  – Integrate “hard” path constraints into search

Thank you.

Cyril.Joder@tum.de
