The SPL-IT Query by Example Search on Speech system for MediaEval 2014
TRANSCRIPT

Jorge Proença 1,2
Arlindo Veiga 1,2
Fernando Perdigão 1,2

1 Instituto de Telecomunicações, Coimbra, Portugal
2 Electrical and Computer Eng. Department, University of Coimbra, Portugal

The 2014 Query by Example Search on Speech (QUESST) task
SPL-IT system
MediaEval 2014 | October 16-17 2014, Barcelona, SPAIN
Overview of the system:
Fuses Dynamic Time Warping (DTW) modifications
Fuses results from systems with phonetic recognizers for 3 languages
Phonetic Recognizer
Hard to extract good posteriorgrams with an HMM system (our in-house system).
Used 3 systems/languages (for 8 kHz) based on long temporal context and neural networks from Brno University of Technology (BUT):
Czech
Hungarian
Russian
Output: posteriorgrams (3 states per phoneme).
Leading and trailing silence/noise removed.
[Figure: State posteriorgram example for one query (phoneme state vs. frame)]
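The silence/noise trimming step can be sketched as follows. This is an illustrative implementation, not the authors' code: it assumes frames are dropped at each end while their total posterior mass on silence/noise states exceeds a threshold (function name and threshold are assumptions).

```python
import numpy as np

def trim_silence(post, sil_states, threshold=0.5):
    """Drop leading and trailing frames whose combined silence/noise
    posterior mass exceeds `threshold`.
    post: (frames, states) posteriorgram; sil_states: silence state indices."""
    sil_mass = post[:, sil_states].sum(axis=1)
    speech = np.flatnonzero(sil_mass < threshold)
    if speech.size == 0:
        return post[:0]  # query was all silence/noise
    return post[speech[0]:speech[-1] + 1]
```

Note that only leading and trailing frames are removed; silence inside the query is kept, matching the slide's description.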
Dynamic Time Warping
Local Distance matrix:
Dot product of query and audio posterior probability vectors;
Back-off with λ = 10^-4:
d(q, x) = -log(q · x + λ)

[Figure: Distance matrix of Query vs. Audio]
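The local distance computation above can be sketched in a few lines. This is a minimal illustration under the assumption that the back-off constant λ is added inside the log to avoid -log(0) for near-orthogonal posterior pairs (the slide's formula is ambiguous on this point; function and variable names are illustrative):

```python
import numpy as np

def local_distance_matrix(query, audio, lam=1e-4):
    """query: (Tq, S) posteriors, audio: (Ta, S) posteriors.
    Returns the (Tq, Ta) matrix of -log(dot product), backed off
    by lam so orthogonal frame pairs stay finite."""
    dots = query @ audio.T          # all pairwise dot products at once
    return -np.log(dots + lam)
```

Identical one-hot posteriors give a distance near 0, while orthogonal ones saturate at -log(λ) ≈ 9.2.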
Dynamic Time Warping (cont.)
Basic DTW strategy (A1):
Smallest distance with identically weighted unitary jumps.

[Figure: Distance matrix (top) and accumulated distance matrix (bottom) of Query vs. Audio]
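The A1 accumulation step can be sketched as a standard subsequence DTW: equally weighted unitary steps (vertical, horizontal, diagonal), with the first query frame free to align to any audio frame. This is a sketch of the general technique, not the authors' exact recursion:

```python
import numpy as np

def subsequence_dtw(D):
    """Accumulate a local distance matrix D of shape (Tq, Ta) with
    identically weighted unitary jumps. Row 0 is the local cost itself,
    so a match may start at any audio frame; the minimum of the last
    row scores the best match ending at each audio frame."""
    Tq, Ta = D.shape
    A = np.full((Tq, Ta), np.inf)
    A[0, :] = D[0, :]
    for i in range(1, Tq):
        A[i, 0] = A[i - 1, 0] + D[i, 0]
        for j in range(1, Ta):
            A[i, j] = D[i, j] + min(A[i - 1, j],      # vertical
                                    A[i, j - 1],      # horizontal
                                    A[i - 1, j - 1])  # diagonal
    return A
```

Backtracking from the minimum of the last row recovers the detected audio segment.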
DTW Modifications
4 additional approaches:
(A2) – Cutting up to 250 ms at the end of the query, keeping the segment above 500 ms
(A3) – Cutting up to 250 ms at the beginning of the query, keeping the segment above 500 ms

[Figure: Query vs. Audio posterior distance matrix (top) and the best path from A2 (bottom)]
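The end-cutting of A2 can be sketched as generating progressively truncated query copies while respecting the two limits above. This is an illustrative helper, assuming 10 ms frames (the frame rate is not stated on the slide):

```python
def end_cut_variants(query, max_cut_ms=250, min_len_ms=500, frame_ms=10):
    """Yield truncated copies of `query` (a frame sequence) with up to
    max_cut_ms removed from the end (approach A2), never shortening
    the query below min_len_ms. For A3, the same logic would cut
    frames from the beginning instead."""
    max_cut = max_cut_ms // frame_ms
    min_len = min_len_ms // frame_ms
    for cut in range(1, max_cut + 1):
        if len(query) - cut < min_len:
            break
        yield query[:len(query) - cut]
```

Each truncated variant is then searched with the same DTW as A1, and the best-scoring variant is kept.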
DTW Modifications (cont.)
(A4) – Allowing one jump in the path of up to ½ the query's length:
cannot occur in the initial and final 250 ms of the query
cannot occur for queries shorter than 800 ms

[Figure: Query vs. Audio posterior distance matrix (top) and the best path from A4 (bottom)]
DTW Modifications (cont.)
(A5) – Swaps: accounting for re-ordering of words.
Backtrack the best 5 candidates from (A1) from the end;
Find the best path for the beginning of the query, ahead of the end of the first one, with restrictions similar to (A4).

[Figure: Query vs. Audio posterior distance matrix (top) and the best path from A5 (bottom)]
Fusing systems
Different approaches:
Minimum of the approaches – not the best.
Harmonic mean found to be a good compromise.
Per-query normalization (standard score): z = (x - μ) / σ
Different languages:
Arithmetic mean of the 3 scores.
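The two fusion operations can be sketched as below. This is an illustration, not the authors' pipeline: the slide does not state in which order normalization and the harmonic mean are applied, and the harmonic mean assumes positive values (e.g. raw distances), so the combination order here is an assumption.

```python
import numpy as np

def znorm(scores):
    """Per-query standard score: (x - mean) / std over that query's scores."""
    mu, sigma = scores.mean(), scores.std()
    return (scores - mu) / sigma

def harmonic_mean(score_rows):
    """Harmonic mean across approaches (one row of scores per approach).
    Requires positive scores, e.g. DTW distances."""
    s = np.asarray(score_rows, dtype=float)
    return s.shape[0] / (1.0 / s).sum(axis=0)
```

Per the slide, the per-language fused scores would then be combined by a plain arithmetic mean over the 3 language systems.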
Submissions and Results
Primary: fusing (A1) and (A2) (basic and cutting the end)
Late: fusing the 5 approaches.
Late provided worse overall results:

                      primary          late
Cnxe, MinCnxe - Dev   0.6797, 0.5438   0.7106, 0.5881
Cnxe, MinCnxe - Eval  0.6588, 0.5080   0.6708, 0.5240
ATWV, MTWV - Dev      0.4494, 0.4494   0.4051, 0.4052
ATWV, MTWV - Eval     0.4399, 0.4423   0.3918, 0.4218
Submissions and Results (cont.)
Primary: fusing (A1) and (A2) (basic and cutting the end)
Late: fusing the 5 approaches.
Cnxe for isolated approaches on Eval:
A1: 0.6823, A2: 0.6721, A3: 0.6947, A4: 0.6957, A5: 0.6999
For Type 3 queries, the late system was better:
0.8049 Cnxe on primary vs. 0.7865 Cnxe on late

                      primary          late
Cnxe, MinCnxe - Eval  0.6588, 0.5080   0.6708, 0.5240
Conclusions
Although this year's task has added difficulty, a simple DTW still works well for most cases.
Cutting queries at the end proved to be the best strategy, and fusing it with A1 was even better.
Including the possibility of jumps and re-orders increased false positives overall, since these special cases are a small part of the database.
We lacked an optimization method for Cnxe, which would greatly improve the results.
END – Thank You
Processing Speed:
Hardware – CRAY CX1 cluster, running Windows Server 2008 HPC, using 16 of 56 cores (7 nodes with dual Intel Xeon 5520 2.27 GHz quad-core CPUs and 24 GB RAM per node).
Indexing Speed Factor – 1.4
Searching Speed Factor – 0.0029 per second and per language
Peak Memory – 0.098 GB