Dagstuhl Seminar 13451 (Computational Audio Analysis) inspirational talk

DESCRIPTION
This is a 5-minute presentation I was invited to give at a Dagstuhl seminar on low/zero-resource processing in November 2013.

TRANSCRIPT
Multimedia analysis for the poor (in training resources)
Xavier Anguera
Telefonica Research
Dagstuhl Seminar 13451 - Inspirational talk
Does this affect me?
• You work in areas where not much training data is available
– Maybe it exists in domains other than your test data.
• The task you are pursuing does not have a well-annotated corpus for training
– E.g. finding structure in signals
• It is difficult / you do not know how to define training “units” for your task
• You like to work on complicated stuff
Typical Speech paper diagram
[Diagram: labeled training data → my favorite ML technique → “I am a model” → my favorite decoding technique; testing data in, my result out]
…making it as complicated as you would like to
Resource-free technologies
• Summarization
– Acoustic word cloud of the most repeated acoustic items
– Repetition-based summarization (MODIS software @ INRIA-Rennes)
• Structure analysis in music
• Audio-visual unsupervised learning (e.g. the Google cats)
• Acquisition of unknown sounds (e.g. Tuomo’s talk)
• Exemplar-based ASR (Leuven Univ.)
EXAMPLE: Spoken Audio Search (or Query-by-Example Spoken-Term Detection)
Given a single spoken query, we search for instances of it, at the lexical level, within spoken documents (a minimal sketch of the matching step follows the list below).
It is similar to Spoken Term Detection (NIST STD 2006, OpenKWS 2013) but…
• Queries are spoken
• Different speakers
• Different acoustic conditions
• No prior knowledge of the language(s) may be available
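Since no transcriptions or language models are assumed, the search typically reduces to template matching of acoustic features by dynamic programming. Below is a minimal sketch of subsequence DTW in Python; the feature matrices, the cosine local distance, and the length normalization are illustrative choices, not the specifics of any SWS 2013 system.

import numpy as np

def subsequence_dtw(query, doc):
    """Best match cost and end frame of `query` inside `doc`.

    `query` and `doc` are (frames x dims) feature matrices, e.g. MFCCs
    (feature extraction is not shown). The first row of the accumulated
    cost matrix is initialized with the local distance alone, so a match
    may start at any document frame (the standard subsequence-DTW trick).
    """
    # Cosine distance between every query frame and every document frame.
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    d = doc / (np.linalg.norm(doc, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - q @ d.T                      # shape (Nq, Nd)

    acc = np.full(dist.shape, np.inf)
    acc[0, :] = dist[0, :]                    # free start along the document
    for i in range(1, dist.shape[0]):
        acc[i, 0] = dist[i, 0] + acc[i - 1, 0]
        for j in range(1, dist.shape[1]):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # insertion
                                         acc[i, j - 1],      # deletion
                                         acc[i - 1, j - 1])  # match
    end = int(np.argmin(acc[-1, :]))
    return acc[-1, end] / query.shape[0], end  # length-normalized cost

# Usage: rank utterances by their best (lowest) match cost.
# hits = sorted((subsequence_dtw(q_feats, d_feats)[0], utt_id)
#               for utt_id, d_feats in corpus.items())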
Mediaeval SWS 2013
• 9 languages in different acoustic contexts: 4 African languages (isiXhosa, isiZulu, Sepedi, Setswana), Albanian, Basque, Czech, non-native English, Romanian

                 #utts   time      Avg. length/utt.
Search corpus    10762   19:57:55  6.67s
Dev queries        505    0:11:26  1.35s
Extended dev*     1046    0:08:42  0.49s
Eval queries       503    0:11:37  1.38s
Extended eval*    1037    0:08:57  0.51s
Total            13853   20:38:37

*Only Basque (3x) and Czech (10x) queries have extended versions
[DET plot: miss probability (%) vs. false alarm probability (%) for the primary systems on the evaluation data, with random performance shown for reference. Legend values below.]

System       MTWV   Thr
GTTS         0.399  5.243
L2F          0.342  3.551
CUHK         0.306  0.618
BUT          0.297  0.914
CMTECHETAL   0.257  18.153
IIITH        0.224  2.721
ELIRF        0.159  2.759
TID          0.093  5.051
GTC          0.084  3.341
SPEED        0.059  0.923
LIA-Late     0.000  1079.003
UNIZA-Late   0.001  1.000
TUKE-Late    0.000  3.000
Mediaeval SWS 2013 (results per language)
How do children learn? (from someone who is not a parent…)
1. They hear their environment and identify/isolate particular audio-visual stimuli they do not know.
2. An expert (parent/grandparent) tells them the “meaning” of those stimuli.
– If the stimulus appears in different forms (or the child is not sharp) they will need to repeat it a couple of times…
3. The child learns and is able to identify these stimuli from then on.
[Diagram: repeated spoken instances of “book” fed to machine learning yield a “book” model; unknown stimuli fed to machine learning yield a “?” model]
• How do we incorporate acoustic modeling into dynamic programming techniques?
• How do we describe the acoustic space (or whatever space) in an unsupervised (but robust) manner? (see the sketch after this list)
• How do we discriminate between “interesting/relevant” and “filler” events?
• Does it all make any sense? (or can we assume we will always have enough training data?)
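One common answer in the literature (not necessarily the talk’s own) to the second question is to describe the acoustic space with a GMM trained on pooled unlabeled frames and to replace raw features with Gaussian posteriorgrams, which are more robust across speakers and plug directly into dynamic-programming search like the DTW sketch above. A minimal sketch assuming scikit-learn; the component count and covariance type are arbitrary illustrative choices:

from sklearn.mixture import GaussianMixture

def train_frame_gmm(unlabeled_frames, n_components=64):
    """Fit a GMM on pooled, unlabeled feature frames (N x dims)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          max_iter=100, random_state=0)
    gmm.fit(unlabeled_frames)
    return gmm

def posteriorgram(gmm, frames):
    """Map frames (N x dims) to per-component posteriors (N x n_components).

    Each frame becomes a distribution over the learned "pseudo-phone"
    classes; these vectors can replace raw MFCCs in subsequence_dtw above.
    """
    return gmm.predict_proba(frames)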