Dagstuhl Seminar 13451 (Computational Audio Analysis) inspirational talk

DESCRIPTION
This is a 5-minute presentation I was invited to give at a Dagstuhl seminar on low/zero-resource processing in November 2013.

TRANSCRIPT
Multimedia analysis for the poor (in training resources)
Xavier Anguera
Telefonica Research
Dagstuhl Seminar 13451 - Inspirational talk
Does this affect me?
• You work in areas where not much training data is available
– Maybe it exists in domains other than your test data.
• The task you are pursuing does not have a well-annotated corpus for training
– E.g. finding structure in signals
• It is difficult / you do not know how to define training “units” for your task
• You like to work on complicated stuff
Typical Speech paper diagram
[Diagram: labeled training data → my favorite ML technique → “I am a model” → my favorite decoding technique; testing data in, my result out]
…making it as complicated as you would like to
Resource-free technologies
• Summarization
– Acoustic word cloud of the most repeated acoustic items
– Repetition-based summarization (MODIS software @ INRIA-Rennes)
• Structure analysis in music
• Audio-visual unsupervised learning (e.g. the Google cats)
• Acquisition of unknown sounds (e.g. Tuomo’s talk)
• Exemplar-based ASR (Leuven Univ.)
EXAMPLE: Spoken Audio Search (or Query-by-Example Spoken-Term Detection)
Given a single spoken query, we search for instances of it, at the lexical level, within spoken documents (a minimal sketch of the matching step follows the list below).
It is similar to Spoken Term Detection (NIST STD 2006, OpenKWS 2013) but…
• Queries are spoken
• Different speakers
• Different acoustic conditions
• No prior knowledge of the language(s) may be available
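Since no transcriptions or language models are assumed, the search typically reduces to template matching of acoustic features by dynamic programming. Below is a minimal sketch of subsequence DTW in Python; the feature matrices, the cosine local distance, and the length normalization are illustrative choices, not the specifics of any SWS 2013 system.

import numpy as np

def subsequence_dtw(query, doc):
    """Best match cost and end frame of `query` inside `doc`.

    `query` and `doc` are (frames x dims) feature matrices, e.g. MFCCs
    (feature extraction is not shown). The first row of the accumulated
    cost matrix is initialized with the local distance alone, so a match
    may start at any document frame (the standard subsequence-DTW trick).
    """
    # Cosine distance between every query frame and every document frame.
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    d = doc / (np.linalg.norm(doc, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - q @ d.T                      # shape (Nq, Nd)

    acc = np.full(dist.shape, np.inf)
    acc[0, :] = dist[0, :]                    # free start along the document
    for i in range(1, dist.shape[0]):
        acc[i, 0] = dist[i, 0] + acc[i - 1, 0]
        for j in range(1, dist.shape[1]):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # insertion
                                         acc[i, j - 1],      # deletion
                                         acc[i - 1, j - 1])  # match
    end = int(np.argmin(acc[-1, :]))
    return acc[-1, end] / query.shape[0], end  # length-normalized cost

# Usage: rank utterances by their best (lowest) match cost.
# hits = sorted((subsequence_dtw(q_feats, d_feats)[0], utt_id)
#               for utt_id, d_feats in corpus.items())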
Mediaeval SWS 2013
• 9 languages in different acoustic contexts: 4 African languages (isiXhosa, isiZulu, Sepedi, Setswana), Albanian, Basque, Czech, non-native English, Romanian

                 #utts   time      Avg. length/utt.
Search corpus    10762   19:57:55  6.67s
Dev queries        505    0:11:26  1.35s
Extended dev*     1046    0:08:42  0.49s
Eval queries       503    0:11:37  1.38s
Extended eval*    1037    0:08:57  0.51s
Total            13853   20:38:37

*Only Basque (3x) and Czech (10x) queries have extended versions
[DET plot: miss probability (%) vs. false alarm probability (%) for the primary systems on the evaluation data, with random performance shown for reference. Legend values below.]

System       MTWV   Thr
GTTS         0.399  5.243
L2F          0.342  3.551
CUHK         0.306  0.618
BUT          0.297  0.914
CMTECHETAL   0.257  18.153
IIITH        0.224  2.721
ELIRF        0.159  2.759
TID          0.093  5.051
GTC          0.084  3.341
SPEED        0.059  0.923
LIA-Late     0.000  1079.003
UNIZA-Late   0.001  1.000
TUKE-Late    0.000  3.000
Mediaeval SWS 2013 (results per language)
How do children learn? (from someone who is not a parent…)
1. They hear their environment and identify/isolate particular audio-visual stimuli they do not know.
2. An expert (parent/grandparent) tells them the “meaning” of those stimuli.
– If the stimulus appears in different forms (or the child is not sharp) they will need to repeat it a couple of times…
3. The child learns and is able to identify these stimuli from then on.
[Diagram: repeated spoken instances of “book” fed to machine learning yield a “book” model; unknown stimuli fed to machine learning yield a “?” model]
• How do we incorporate acoustic modeling into dynamic programming techniques?
• How do we describe the acoustic space (or whatever space) in an unsupervised (but robust) manner? (see the sketch after this list)
• How do we discriminate between “interesting/relevant” and “filler” events?
• Does it all make any sense? (or can we assume we will always have enough training data?)
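One common answer in the literature (not necessarily the talk’s own) to the second question is to describe the acoustic space with a GMM trained on pooled unlabeled frames and to replace raw features with Gaussian posteriorgrams, which are more robust across speakers and plug directly into dynamic-programming search like the DTW sketch above. A minimal sketch assuming scikit-learn; the component count and covariance type are arbitrary illustrative choices:

from sklearn.mixture import GaussianMixture

def train_frame_gmm(unlabeled_frames, n_components=64):
    """Fit a GMM on pooled, unlabeled feature frames (N x dims)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          max_iter=100, random_state=0)
    gmm.fit(unlabeled_frames)
    return gmm

def posteriorgram(gmm, frames):
    """Map frames (N x dims) to per-component posteriors (N x n_components).

    Each frame becomes a distribution over the learned "pseudo-phone"
    classes; these vectors can replace raw MFCCs in subsequence_dtw above.
    """
    return gmm.predict_proba(frames)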