
Learning to Separate Object Sounds by Watching Unlabeled Video

Ruohan Gao¹, Rogerio Feris², Kristen Grauman¹

¹The University of Texas at Austin, ²IBM Research

Sight and Sound CVPR 2018 Workshop, June 2018

Listening to learn

Goal: a repertoire of objects and their sounds (“woof”, “meow”, “ring”, “clatter”)

Challenge: a single audio channel usually mixes the sounds of multiple objects

Visually-guided audio source separation

Traditional approach:
• Detect low-level correlations within a single video
• Learn from clean, single-audio-source examples

[Darrell et al. 2000; Fisher et al. 2001; Rivet et al. 2007; Barzelay & Schechner 2007; Casanovas et al. 2010; Parekh et al. 2017; Pu et al. 2017; Li et al. 2017]


Learning to separate object sounds

Our idea: Leverage visual objects to learn from unlabeled video with multiple audio sources

(Figure: unlabeled video → disentangle → object sound models, e.g., Violin, Dog, Cat)

Deep multi-instance multi-label learning (MIML) to disentangle which visual objects make which sounds

(Pipeline: visual frames → ResNet-152 object predictions and top visual detections, e.g., Guitar, Saxophone; audio → non-negative matrix factorization → audio basis vectors. Output: a group of audio basis vectors per object class.)
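To make the factorization step concrete, here is a minimal sketch of per-video NMF on a magnitude spectrogram, using librosa and scikit-learn. The file name and parameter values (FFT size, number of basis vectors) are illustrative assumptions, not the paper's settings.

```python
# Sketch of the per-video NMF step: factorize a magnitude spectrogram
# V (freq x time) into basis vectors W and activations H, so V ~= W @ H.
import numpy as np
import librosa
from sklearn.decomposition import NMF

# Illustrative file name; any mono audio track extracted from a clip works.
y, sr = librosa.load("clip_audio.wav", sr=None, mono=True)
V = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))  # magnitude spectrogram

nmf = NMF(n_components=25, init="random", max_iter=500, random_state=0)
W = nmf.fit_transform(V)   # (freq, 25): one candidate audio basis per column
H = nmf.components_        # (25, time): activations of each basis over time

# Each column of W is a candidate "object sound" basis; the MIML network
# later decides which visual object label each basis belongs to.
```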

Our approach: training

(Training diagram: each unlabeled video contributes its NMF audio basis vectors together with its visual predictions; the deep MIML network learns which bases go with which object labels, grouping the learned bases into violin bases, piano bases, etc.)
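The bag-of-instances structure of that MIML step can be sketched as follows. This is a deliberately simplified stand-in (the paper's deep MIML network is more elaborate), and all layer sizes here are made up: each video is a bag of audio basis vectors supervised only by video-level object labels taken from the visual network.

```python
# Simplified multi-instance multi-label (MIML) sketch: per-instance class
# scores are max-pooled over the bag to produce video-level predictions,
# trained against multi-hot labels from the visual predictions.
import torch
import torch.nn as nn

class MIMLNet(nn.Module):
    def __init__(self, basis_dim=513, num_classes=25):
        super().__init__()
        self.instance_net = nn.Sequential(
            nn.Linear(basis_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, bases):                 # bases: (batch, M, basis_dim)
        scores = self.instance_net(bases)     # per-basis, per-class scores
        bag_scores, _ = scores.max(dim=1)     # pool over instances -> video level
        return bag_scores

model = MIMLNet()
bases = torch.randn(8, 25, 513)          # 8 videos, 25 basis vectors each
video_labels = torch.zeros(8, 25)        # multi-hot labels from visual network
video_labels[0, 3] = 1.0                 # e.g., one object present in video 0
loss = nn.BCEWithLogitsLoss()(model(bases), video_labels)
loss.backward()
```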

Our approach: inference

Given a novel video, use the discovered object sound models to guide audio source separation.

(Inference diagram: frames → visual predictions (ResNet-152 objects, e.g., Violin, Piano) → initialize the audio basis matrix with the corresponding violin and piano bases → semi-supervised source separation using NMF → estimate activations → separated violin sound and piano sound)
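A minimal sketch of that semi-supervised NMF step, assuming the learned per-object bases are already available: the basis matrix W is fixed to the bases of the visually detected objects, and only the activations H are estimated for the novel video. Euclidean-distance multiplicative updates are used here as one standard choice, and all arrays are synthetic placeholders.

```python
# Semi-supervised NMF at inference: hold W fixed, estimate H in V ~= W @ H,
# then reconstruct each source's spectrogram from its own bases.
import numpy as np

def estimate_activations(V, W, n_iter=200, eps=1e-8):
    """Multiplicative updates for H with W held fixed (Euclidean objective)."""
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

rng = np.random.default_rng(0)
violin_bases = rng.random((513, 25))    # placeholder for learned violin bases
piano_bases  = rng.random((513, 25))    # placeholder for learned piano bases
V_novel      = rng.random((513, 400))   # magnitude spectrogram of novel video

W = np.hstack([violin_bases, piano_bases])   # bases of detected objects only
H = estimate_activations(V_novel, W)

M = violin_bases.shape[1]
violin_spec = violin_bases @ H[:M]      # reconstructed violin spectrogram
piano_spec  = piano_bases  @ H[M:]      # reconstructed piano spectrogram
```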

Train on 100,000 unlabeled multi-source video clips, then separate audio for a novel video

Results

Baseline: M. Spiertz, Source-filter based clustering for monaural blind source separation. International Conference on Digital Audio Effects, 2009


Failure cases

Visually-aided audio source separation (SDR)


Our method achieves large gains, and it can also match each separated source to a meaningful acoustic object in the video.

Visually-aided audio denoising (NSDR)
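For reference, SDR and NSDR could be computed with the mir_eval toolbox roughly as below; NSDR is the SDR improvement of the separated source over the unprocessed mixture. The arrays are synthetic stand-ins, and this is a sketch, not the authors' evaluation code.

```python
# Hedged sketch of BSS evaluation: SDR of the estimates, and NSDR as the
# gain over simply using the mixture as the estimate for every source.
import numpy as np
import mir_eval

rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 48000))                 # ground-truth sources
estimated = reference + 0.1 * rng.standard_normal((2, 48000))
mixture   = np.tile(reference.sum(axis=0), (2, 1))          # mixture baseline

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
sdr_mix, _, _, _    = mir_eval.separation.bss_eval_sources(reference, mixture)
nsdr = sdr - sdr_mix    # NSDR: per-source SDR improvement over the mixture
```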

Concurrent Work on Audio-Visual Source Separation

⎼ Owens & Efros , Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, 2018

⎼ Ephrat et al., Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation, 2018

⎼ Zhao et al. , The Sound of Pixels, 2018

⎼ Afouras et al., The Conversation: Deep Audio-Visual SpeechEnhancement, 2018

We learn from uncurated multi-object, multi-source videos, and study a diverse set of object categories.

Conclusion

⎼ Learn object sound models from unlabeled videos to perform audio-visual source separation

⎼ Integrate localized object detections and motion

(Recap figure: unlabeled video → disentangle → object sound models)
