Spatio-Temporal Analysis of Multimodal Speaker Activity
Guillaume Lathoud, IDIAP
Supervised by Dr Iain McCowan, IDIAP
Context
• Spontaneous multi-party speech.
• Goal: extract salient information:
  – Who? What? When? Where?
  – Automatic meeting annotation/transcription.
  – Speaker tracking, speech acquisition.
  – Surveillance.
• Approach: based on speaker location.
• Multisource problem (overlaps, noise).
How?
• Audio location: microphone array.
• Audio content: speaker identification.
• Video: one or several cameras.
• Combination.
Audio: Global Picture
• Multiple waveforms. Resolution 0.0000625 s.
• Task 1: data acquisition and annotation (complete).
• Task 2: develop robust multisource strategies (in progress).
• Active speakers’ locations. Resolution 0.016 s.
• Link locations across space and time. Resolution 0.016 s, links up to 0.25 s.
• Task 3: joint segmentation and tracking (complete).
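The three time scales in this overview are related by simple arithmetic. The sketch below assumes, which the slides do not state, that 0.0000625 s is the sample period of the waveforms and that 0.016 s is the step between successive location estimates. All variable names are illustrative.

```python
# Time scales quoted in the overview. Reading them as a sample period and a
# frame step is an assumption made for this sketch.
SAMPLE_PERIOD_S = 0.0000625   # waveform resolution
FRAME_STEP_S = 0.016          # resolution of the location estimates
MAX_LINK_S = 0.25             # longest link across time

# Sampling rate implied by the sample period.
sampling_rate_hz = round(1.0 / SAMPLE_PERIOD_S)

# Number of waveform samples covered by one location frame.
samples_per_frame = round(FRAME_STEP_S / SAMPLE_PERIOD_S)

# Number of whole frames that fit inside the longest link.
frames_per_link = int(MAX_LINK_S / FRAME_STEP_S + 1e-9)

print(sampling_rate_hz, samples_per_frame, frames_per_link)
```

Under these assumptions, one location estimate summarizes a few hundred waveform samples, and a link can span roughly fifteen consecutive estimates.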
Task 1: AV16.3 corpus
• At IDIAP: 16 microphones, 3 cameras.
• 40 short recordings, about 1h30 overall.
  – ‘meeting’: seated.
  – ‘surveillance’: standing.
  – pathological test cases (A, V, AV).
• 3D mouth annotation.
• Used in the AMI project.
• http://mmm.idiap.ch/Lathoud/av16.3_v6
Task 2: Multisource Localization
• Problem:
  – Detect: how many speakers?
  – Localize: where?
• Sectors (coarse-to-fine).
• Tested on real data: AV16.3 corpus.
• To do:
  – Finalize (optimization, multi-level).
  – Compare with existing methods.
Task 2: Multiple Loudspeakers
2 loudspeakers simultaneously active:
Metric                 Ideal    Result
>=1 detected           100%     100%
Average nb detected    2.0      1.9
3 loudspeakers simultaneously active:
Metric                 Ideal    Result
>=1 detected           100%     99.8%
Average nb detected    3.0      2.5
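As a reading aid for the two metrics in this table, here is a minimal sketch of how they could be computed from per-frame detection counts. It assumes both are plain per-frame statistics. How a detection is matched to a true source is not specified in the slides and is left out. The function name and the toy counts are hypothetical.

```python
def detection_metrics(counts):
    """Return (fraction of frames with >=1 detection, average number detected).

    counts: number of sources detected in each time frame (hypothetical input).
    """
    if not counts:
        raise ValueError("need at least one frame")
    at_least_one = sum(1 for c in counts if c >= 1) / len(counts)
    average = sum(counts) / len(counts)
    return at_least_one, average


# Toy per-frame counts, invented for illustration (not corpus data).
toy_counts = [2, 2, 1, 2, 0, 2, 2, 1, 2, 2]
rate, avg = detection_metrics(toy_counts)
print(f">=1 detected: {rate:.1%}, average nb detected: {avg:.1f}")
```

With N sources truly active in every frame, the ideal values are 100% and N, which is how the "Ideal" column reads for the loudspeaker experiments.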
Real data: Humans
2 speakers simultaneously active (includes short silences):
Metric                 Ideal     Result
>=1 detected           ~89.4%    90.8%
Average nb detected    ~1.3      1.3
3 speakers simultaneously active (includes short silences):
Metric                 Ideal     Result
>=1 detected           ~96.5%    95.1%
Average nb detected    ~2.0      1.6
Task 3: Segmentation/Tracking
• Speech:
  – Short and sporadic utterances.
  – Overlaps.
  – Filtering is difficult (Kalman filter, particle filter).
• Alternative: short-term clustering.
• Short-term = 0.25 s.
• Threshold-free, online, unsupervised.
• Unknown number of objects.
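To illustrate the idea of linking location estimates over short time spans, here is a simplified stand-in, not the method presented in these slides. It works in batch mode and uses a fixed angular tolerance (`max_angle_deg`, a hypothetical parameter), whereas the approach above is described as online and threshold-free. Estimates are assumed to be (time in seconds, azimuth in degrees) pairs.

```python
def short_term_clusters(estimates, max_link_s=0.25, max_angle_deg=10.0):
    """Group (time_s, azimuth_deg) estimates into tracks.

    Two estimates are linked when they are at most max_link_s apart in time
    and at most max_angle_deg apart in azimuth. The angular tolerance is an
    assumption of this sketch only.
    """
    parent = list(range(len(estimates)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    order = sorted(range(len(estimates)), key=lambda i: estimates[i][0])
    for pos, a in enumerate(order):
        t_a, az_a = estimates[a]
        for b in order[pos + 1:]:
            t_b, az_b = estimates[b]
            if t_b - t_a > max_link_s:
                break  # all later estimates are further away in time
            diff = abs(az_a - az_b) % 360.0
            if min(diff, 360.0 - diff) <= max_angle_deg:
                parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for i in order:
        clusters.setdefault(find(i), []).append(estimates[i])
    return list(clusters.values())


# Toy input: two overlapping sources, then a late isolated estimate.
toy = [(0.000, 30.0), (0.016, 31.0), (0.032, 120.0),
       (0.048, 29.5), (0.064, 121.0), (1.000, 30.5)]
tracks = short_term_clusters(toy)
print([len(t) for t in tracks])
```

The number of tracks is not fixed in advance: it falls out of the links, which matches the "unknown number of objects" setting. The late estimate near 30 degrees starts a new track because it is more than 0.25 s after the previous one.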
Task 3: Application
Link locations across space and time
Resolution 0.016 s
Links up to 0.25 s
Task 3: joint segmentation and tracking (complete)
Annotated IDIAP corpus of short meetings (total 1h45) http://mmm.idiap.ch
Single source localization
Task 3: Metrics
• Precision (PRC):
  – An active speaker is detected in the result.
  – PRC = probability that he is truly active.
• Recall (RCL):
  – A speaker is truly active.
  – RCL = probability to detect him in the result.
• F-measure: F = 2 * PRC * RCL / (PRC + RCL)
Task 3: Results
Entire data:
Metric    Proposed    Lapel baseline
PRC       79.7%       84.3%
RCL       94.6%       93.3%
F         86.5%       88.6%
Overlaps only:
Metric    Proposed    Lapel baseline
PRC       55.4%       46.6%
RCL       84.8%       66.4%
F         67.0%       54.7%
F = 2 * PRC * RCL / (PRC + RCL)
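The F values in these results can be cross-checked against the F-measure formula. The short sketch below does this for the "Proposed" column, with PRC and RCL entered as fractions. The function name is illustrative.

```python
def f_measure(prc, rcl):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if prc + rcl == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2.0 * prc * rcl / (prc + rcl)


# (PRC, RCL) pairs copied from the "Proposed" column of the results.
entire = f_measure(0.797, 0.946)
overlaps = f_measure(0.554, 0.848)
print(f"entire data: {entire:.1%}, overlaps only: {overlaps:.1%}")
```

Because the PRC and RCL inputs are already rounded to one decimal, a recomputed F may differ from a reported one in the last digit.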
Conclusion
• Spontaneous speech = multisource problem.
• AV16.3 corpus recorded, annotated.
• Approach: detect, localize, track, segment.
• Location is not identity!
  – Fusion with monochannel analysis.
  – Fusion with video.
Task 2: Delay-sum vs Proposed (1/3)
With optimized centroids (this work)
With delay-sum centroids (this work)
Task 2: Delay-sum vs Proposed (2/3)
2 loudspeakers simultaneously active:
Metric                 Ideal    Delay-sum    Proposed
>=1 detected           100%     99.9%        100%
Average nb detected    2.0      1.8          1.9
3 loudspeakers simultaneously active:
Metric                 Ideal    Delay-sum    Proposed
>=1 detected           100%     99.2%        99.8%
Average nb detected    3.0      1.9          2.5