Spatio-Temporal Analysis of Multimodal Speaker Activity

Guillaume Lathoud, IDIAP

Supervised by Dr Iain McCowan, IDIAP

Context

• Spontaneous multi-party speech.

• Goal: extract salient information:
– Who? What? When? Where?
– Automatic meeting annotation/transcription.
– Speaker tracking, speech acquisition.
– Surveillance.

• Approach: based on speaker location.

• Multisource problem (overlaps, noise).

How?

• Audio location: microphone array.

• Audio content: speaker identification.

• Video: one or several cameras.

• Combination.

Audio: Global Picture

• Multiple waveforms. Resolution: 0.0000625 s.

• Task 1: data acquisition and annotation (complete).

• Task 2: develop robust multisource strategies (in progress).

• Active speakers’ locations. Resolution: 0.016 s.

• Link locations across space and time. Resolution: 0.016 s; links up to 0.25 s.

• Task 3: joint segmentation and tracking (complete).
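
The three time scales on this slide can be related with a little arithmetic. The sketch below is only a reading of the numbers on the slide: the variable names are illustrative, and it assumes that location estimates are produced once every 0.016 s.

```python
import math

# Values taken from the slide.
SAMPLE_PERIOD_S = 0.0000625  # waveform resolution
FRAME_PERIOD_S = 0.016       # location resolution
LINK_SPAN_S = 0.25           # maximum link length

# Waveform resolution expressed as a sampling rate.
sample_rate_hz = round(1.0 / SAMPLE_PERIOD_S)

# Number of waveform samples covered by one location estimate.
samples_per_frame = round(FRAME_PERIOD_S / SAMPLE_PERIOD_S)

# Number of whole location frames that fit in one link
# (assumption: one location estimate every 0.016 s).
frames_per_link = math.floor(LINK_SPAN_S / FRAME_PERIOD_S)

print(sample_rate_hz)     # 16000
print(samples_per_frame)  # 256
print(frames_per_link)    # 15
```

In other words, the waveforms correspond to 16 kHz sampling, each location estimate covers 256 samples, and a link can span up to 15 whole location frames.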

Task 1: AV16.3 corpus

• At IDIAP: 16 microphones, 3 cameras.

• 40 short recordings, about 1h30 overall.
– ‘meeting’: seated.
– ‘surveillance’: standing.
– pathological test cases (A, V, AV).

• 3D mouth annotation.

• Used in the AMI project.

• http://mmm.idiap.ch/Lathoud/av16.3_v6

AV16.3 corpus

Task 2: Multisource Localization

• Problem:
– Detect: how many speakers?
– Localize: where?

• Sector-based approach (coarse-to-fine): is there at least one active source in a given sector?

• Tested on real data: AV16.3 corpus.

• To do:
– Finalize (optimization, multi-level).
– Compare with existing methods.
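
The coarse-to-fine idea can be illustrated with a small sketch. The slides do not say how the sector test is computed or how sectors are refined, so everything below is an assumption: sectors are azimuth ranges in degrees, refinement is done by recursive halving, and the sector test `is_active` is a hypothetical stand-in that simply checks a list of known source azimuths.

```python
from typing import Callable, List, Sequence, Tuple

Sector = Tuple[float, float]  # [start, end) azimuth range in degrees


def refine_sectors(
    is_active: Callable[[float, float], bool],
    start: float = 0.0,
    end: float = 360.0,
    min_width: float = 11.25,
) -> List[Sector]:
    """Return the narrowest active sectors, found by recursive halving."""
    # Ask the sector question: at least one active source in [start, end)?
    if not is_active(start, end):
        return []  # prune: nothing to refine in an inactive sector
    if end - start <= min_width:
        return [(start, end)]  # finest level reached
    mid = (start + end) / 2.0
    return (refine_sectors(is_active, start, mid, min_width)
            + refine_sectors(is_active, mid, end, min_width))


def make_oracle(source_azimuths: Sequence[float]) -> Callable[[float, float], bool]:
    """Hypothetical sector test: true if a known source lies in [lo, hi)."""
    def is_active(lo: float, hi: float) -> bool:
        return any(lo <= az < hi for az in source_azimuths)
    return is_active


# Two simulated sources at 30 and 200 degrees.
sectors = refine_sectors(make_oracle([30.0, 200.0]))
print(sectors)       # [(22.5, 33.75), (191.25, 202.5)]
print(len(sectors))  # 2 active sectors: an estimate of the number of sources
```

With a real microphone array, the oracle would be replaced by a test computed from the signals. The point of the sketch is only that inactive sectors are discarded early, so the fine search is spent on the few sectors that contain a source.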

Task 2: Single Speaker Example

Task 2: Multiple Loudspeakers

2 loudspeakers simultaneously active:

| Metric | Ideal | Result |
|---|---|---|
| >=1 detected | 100% | 100% |
| Average number detected | 2.0 | 1.9 |

3 loudspeakers simultaneously active:

| Metric | Ideal | Result |
|---|---|---|
| >=1 detected | 100% | 99.8% |
| Average number detected | 3.0 | 2.5 |
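
The slides do not state how the two metrics are computed. One plausible reading, sketched below, is that both are averages over time frames of the number of sources detected in each frame. The function name and the frame counts are made up for illustration and are not data from the experiments.

```python
from typing import Sequence, Tuple


def detection_metrics(counts_per_frame: Sequence[int]) -> Tuple[float, float]:
    """Return (fraction of frames with >=1 detection, average number detected)."""
    n_frames = len(counts_per_frame)
    if n_frames == 0:
        raise ValueError("at least one frame is required")
    at_least_one = sum(1 for c in counts_per_frame if c >= 1) / n_frames
    average_detected = sum(counts_per_frame) / n_frames
    return at_least_one, average_detected


# Illustrative counts only: number of sources detected in each of 8 frames.
example_counts = [2, 1, 2, 0, 2, 2, 1, 2]
at_least_one, average_detected = detection_metrics(example_counts)
print(f">=1 detected: {at_least_one:.1%}")                  # 87.5%
print(f"Average number detected: {average_detected:.1f}")   # 1.5
```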

Real data: Humans

2 speakers simultaneously active (includes short silences):

| Metric | Ideal | Result |
|---|---|---|
| >=1 detected | ~89.4% | 90.8% |
| Average number detected | ~1.3 | 1.3 |

3 speakers simultaneously active (includes short silences):

| Metric | Ideal | Result |
|---|---|---|
| >=1 detected | ~96.5% | 95.1% |
| Average number detected | ~2.0 | 1.6 |

Task 3: Segmentation/Tracking

• Speech:
– Short and sporadic utterances.
– Overlaps.
– Filtering is difficult (Kalman filter, particle filter).

• Alternative: short-term clustering.

• Short-term = 0.25 s.

• Threshold-free, online, unsupervised.

• Unknown number of objects.
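
To make the idea of linking locations over short time spans concrete, here is a deliberately simplified stand-in. It is not the method of the talk: it uses a fixed angular threshold (`max_angle_deg`, a hypothetical parameter), whereas the proposed approach is described as threshold-free, and it runs in batch rather than online. Locations are assumed to be single azimuth values in degrees, and wrap-around at 360 degrees is ignored.

```python
from typing import Dict, List, Tuple

Estimate = Tuple[float, float]  # (time in seconds, azimuth in degrees)


def link_estimates(
    estimates: List[Estimate],
    max_gap_s: float = 0.25,      # links span at most 0.25 s (from the slides)
    max_angle_deg: float = 5.0,   # hypothetical threshold, NOT in the slides
) -> List[List[Estimate]]:
    """Group estimates by chaining links between estimates close in time and space."""
    parent = list(range(len(estimates)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    order = sorted(range(len(estimates)), key=lambda i: estimates[i][0])
    for pos, i in enumerate(order):
        t_i, az_i = estimates[i]
        for j in order[pos + 1:]:
            t_j, az_j = estimates[j]
            if t_j - t_i > max_gap_s:
                break  # later estimates are even further away in time
            if abs(az_j - az_i) <= max_angle_deg:
                parent[find(j)] = find(i)  # link the two estimates

    clusters: Dict[int, List[Estimate]] = {}
    for i in order:
        clusters.setdefault(find(i), []).append(estimates[i])
    return list(clusters.values())


# Made-up estimates: two overlapping speakers, then one speaker resuming later.
example = [(0.000, 30.0), (0.016, 31.0), (0.016, 120.0),
           (0.032, 29.5), (0.048, 121.0), (1.000, 30.5)]
clusters = link_estimates(example)
print(len(clusters))  # 3
```

The number of clusters is not fixed in advance, which mirrors the "unknown number of objects" point. The estimate at 1.0 s ends up in its own cluster because no link of at most 0.25 s reaches it, so a pause in speech also produces a segment boundary.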

Example: iterations 1 to 4, each shown as a partition step followed by a merge step, then the result.

Task 3: Application

• Data: annotated IDIAP corpus of short meetings (total 1h45), http://mmm.idiap.ch

• Single source localization.

• Link locations across space and time (resolution 0.016 s, links up to 0.25 s).

• Task 3: joint segmentation and tracking (complete).

Application (2)

Task 3: Metrics

• Precision (PRC):
– An active speaker is detected in the result.
– PRC = probability that this speaker is truly active.

• Recall (RCL):
– A speaker is truly active.
– RCL = probability of detecting this speaker in the result.

• F-measure: $F = \frac{2 \cdot \mathrm{PRC} \cdot \mathrm{RCL}}{\mathrm{PRC} + \mathrm{RCL}}$
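
As a worked example of these definitions, the sketch below estimates PRC, RCL and F by counting speaker-frames. The frame-level representation (one set of speaker labels per frame) and the function names are assumptions made for illustration, since the slides define the metrics only as probabilities. The labels are made up.

```python
from typing import List, Set, Tuple


def f_measure(prc: float, rcl: float) -> float:
    """Harmonic mean of precision and recall: F = 2 * PRC * RCL / (PRC + RCL)."""
    return 2.0 * prc * rcl / (prc + rcl) if (prc + rcl) > 0.0 else 0.0


def prc_rcl_f(
    detected: List[Set[str]], truth: List[Set[str]]
) -> Tuple[float, float, float]:
    """Estimate PRC, RCL and F from per-frame sets of speaker labels."""
    n_detected = sum(len(d) for d in detected)   # detected speaker-frames
    n_truth = sum(len(t) for t in truth)         # truly active speaker-frames
    n_correct = sum(len(d & t) for d, t in zip(detected, truth))
    prc = n_correct / n_detected if n_detected else 0.0
    rcl = n_correct / n_truth if n_truth else 0.0
    return prc, rcl, f_measure(prc, rcl)


# Made-up example with four frames and speakers A, B, C.
truth = [{"A"}, {"A", "B"}, {"B"}, set()]
detected = [{"A"}, {"A"}, {"B", "C"}, {"C"}]
prc, rcl, f = prc_rcl_f(detected, truth)
print(f"PRC={prc:.3f} RCL={rcl:.3f} F={f:.3f}")  # PRC=0.600 RCL=0.750 F=0.667
```

Applying `f_measure` to the rounded values reported for the proposed method on the entire data, `f_measure(0.797, 0.946)`, gives about 0.865, which agrees with the reported F of 86.5%.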

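Task 3: Results

Entire data:

| Metric | Proposed | Lapel baseline |
|---|---|---|
| PRC | 79.7% | 84.3% |
| RCL | 94.6% | 93.3% |
| F | 86.5% | 88.6% |

Overlaps only:

| Metric | Proposed | Lapel baseline |
|---|---|---|
| PRC | 55.4% | 46.6% |
| RCL | 84.8% | 66.4% |
| F | 67.0% | 54.7% |

$F = \frac{2 \cdot \mathrm{PRC} \cdot \mathrm{RCL}}{\mathrm{PRC} + \mathrm{RCL}}$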

Conclusion

• Spontaneous speech = multisource problem.

• AV16.3 corpus recorded, annotated.

• Approach: detect, localize, track, segment.

• Location is not identity!
– Fusion with monochannel analysis.
– Fusion with video.

Thank you!

Detection: Energy and Localization

Task 2: Delay-sum vs Proposed (1/3)

With optimized centroids (this work)

With delay-sum centroids (this work)

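Task 2: Delay-sum vs Proposed (2/3)

2 loudspeakers simultaneously active:

| Metric | Ideal | Delay-sum | Proposed |
|---|---|---|---|
| >=1 detected | 100% | 99.9% | 100% |
| Average number detected | 2.0 | 1.8 | 1.9 |

3 loudspeakers simultaneously active:

| Metric | Ideal | Delay-sum | Proposed |
|---|---|---|---|
| >=1 detected | 100% | 99.2% | 99.8% |
| Average number detected | 3.0 | 1.9 | 2.5 |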

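Task 2: Delay-sum vs Proposed (3/3)

2 humans simultaneously active:

| Metric | Ideal | Delay-sum | Proposed |
|---|---|---|---|
| >=1 detected | ~89.4% | 80.0% | 90.8% |
| Average number detected | ~1.3 | 1.0 | 1.3 |

3 humans simultaneously active:

| Metric | Ideal | Delay-sum | Proposed |
|---|---|---|---|
| >=1 detected | ~96.5% | 86.7% | 95.1% |
| Average number detected | ~2.0 | 1.4 | 1.6 |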