-
Ambient Sound Provides Supervision for Visual Learning
Andrew Owens1, Jiajun Wu1, Josh H. McDermott1, William T. Freeman1,2, and Antonio Torralba1
1MIT & 2Google Research
ECCV 2016
Presented by An T. Nguyen
-
Introduction
Problem
• Learn an image representation without labels...
• ...that is useful for a real task (e.g. object recognition).
Idea
• Set up a pretext task.
• To solve the pretext task, the model must learn a good representation.
Learn to predict a “natural signal”...
• ...that is available for ‘free’.
• This paper: sound.
• Others: camera motion (Agrawal et al. 2015; Jayaraman & Grauman 2015).
-
Data
Yahoo Flickr Creative Commons 100 Million dataset (Thomee et al. 2015).
• A subset of 360,000 videos.
• Sample one frame every 10 seconds.
• Extract 3.75 seconds of sound around each frame (see the sketch below).
• 1.8 million training examples.
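A minimal sketch of this sampling step using ffmpeg, assuming a local video file; this is not the authors' pipeline, and the helper name and paths are hypothetical.

```python
import subprocess

def extract_pair(video_path, t, frame_out, audio_out, audio_len=3.75):
    """Grab one frame at time t and the surrounding audio clip."""
    # Single frame at time t.
    subprocess.run(["ffmpeg", "-ss", str(t), "-i", video_path,
                    "-vframes", "1", frame_out], check=True)
    # Mono audio clip of audio_len seconds centered on t.
    start = max(0.0, t - audio_len / 2)
    subprocess.run(["ffmpeg", "-ss", str(start), "-i", video_path,
                    "-t", str(audio_len), "-vn", "-ac", "1", audio_out],
                   check=True)

# One (image, sound) pair every 10 seconds of video:
# for t in range(0, 60, 10): extract_pair("clip.mp4", t, f"{t}.jpg", f"{t}.wav")
```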
-
Examples 1 (flickr.com/photos/41894173046@N01/4530333858)
[Video frame and sound clip shown in the talk]
-
Examples 2 (flickr.com/photos/42035325@N00/8029349128)
[Video frame and sound clip shown in the talk]
-
Examples 3 (flickr.com/photos/zen/2479982751)
[Video frame and sound clip shown in the talk]
-
Challenges
• Sound is sometimes indicative of the image...
• ...but sometimes not.
Sound-producing objects
• may lie outside the image.
• do not always produce sound.
Videos
• are edited.
• contain noisy background sound.
Question: What representation can we learn?
-
Representing sound
Pre-processing
• Filter the waveform into frequency channels (mimicking the human ear).
• Compute statistics (e.g. the mean of each frequency channel).
• → a sound texture: a 502-dimensional vector.
Two labeling models (see the sketch below)
1. Cluster the sound textures (k-means).
2. PCA, 30 projections, thresholding → binary codes.
Given an image
1. Predict its sound cluster.
2. Predict the 30 binary codes (multi-label classification).
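A minimal sketch of the two labeling schemes, assuming precomputed sound-texture vectors; the file name, the cluster count, and the median threshold are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

textures = np.load("sound_textures.npy")   # (N, 502) sound textures, hypothetical file

# Scheme 1: k-means over sound textures; each image is labeled with the
# cluster index of its accompanying sound (30 clusters assumed here).
cluster_labels = KMeans(n_clusters=30, random_state=0).fit_predict(textures)

# Scheme 2: project onto the top 30 PCA directions, then binarize each
# projection (thresholding at the per-dimension median is an assumption).
proj = PCA(n_components=30).fit_transform(textures)   # (N, 30)
binary_codes = (proj > np.median(proj, axis=0)).astype(int)
```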
-
Training
Convolutional neural network
• Architecture similar to AlexNet (Krizhevsky et al. 2012).
• Implemented in Caffe (a sketch in PyTorch follows).
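The authors trained in Caffe; below is a minimal PyTorch sketch of the pretext objective for the clustering variant, with a dummy batch in place of a real data loader and hyperparameters chosen only for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

model = alexnet(num_classes=30)    # predict one of 30 sound clusters
criterion = nn.CrossEntropyLoss()  # the binary-code variant would use BCEWithLogitsLoss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Dummy batch standing in for a real (image, sound-cluster) loader.
loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 30, (8,)))]

for images, clusters in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), clusters)
    loss.backward()
    optimizer.step()
```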
-
Visualizing neurons (in upper layers)
Method: for each neuron (a sketch of step 1 follows)
1. Find the images with the largest activations.
2. Find the locations that contribute most to the activation.
3. Highlight these regions.
4. Show them to humans on Amazon Mechanical Turk (AMT).
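A hedged sketch of step 1, where `features` stands for a hypothetical function returning a unit's activation map for one image:

```python
import numpy as np

def top_activating_images(images, features, unit, k=10):
    """Indices of the k images that most strongly activate `unit`."""
    scores = [float(features(img)[unit].max()) for img in images]  # peak response per image
    return list(np.argsort(scores)[::-1][:k])
```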
-
Visualizing neurons
[Figures: example units shown with their top-activating images and highlighted regions]
-
Detector histogram
[Figure: histogram of object/scene detectors learned from Sound, Ego Motion, and Labeled Scenes (supervised)]
-
Observations
• Each method learns a particular kind of representation...
• ...depending on its pretext task.
Representations learned from sound
• capture objects with distinctive sounds.
• are complementary to those from other methods.
-
Object/Scene Recognition (1-vs-rest SVM on frozen features)
Comparable performance to other self-supervised methods (a sketch of the protocol follows).
[1] Agrawal et al. 2015; [4] Doersch et al. 2015; [20] Krähenbühl et al. 2016; [35] Wang & Gupta 2015
[Figure: recognition accuracy comparison]
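A hedged sketch of this evaluation protocol; the feature and label files are hypothetical, and the SVM settings are illustrative rather than the paper's exact choices.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Frozen features from the sound-pretrained CNN, hypothetical files.
X_train, y_train = np.load("feats_train.npy"), np.load("labels_train.npy")
X_test, y_test = np.load("feats_test.npy"), np.load("labels_test.npy")

# One linear SVM per class, one-vs-rest.
clf = OneVsRestClassifier(LinearSVC(C=1.0)).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```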
-
Object Detection (pretraining for Fast R-CNN)
Similar performance to motion-based pretraining (a sketch of weight reuse follows).
[1] Agrawal et al. 2015; [4] Doersch et al. 2015; [20] Krähenbühl et al. 2016; [35] Wang & Gupta 2015
[Figure: detection mAP comparison]
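A loose sketch of reusing the pretext weights to initialize a detector backbone; the checkpoint path is hypothetical, and wiring the layers into Fast R-CNN itself is omitted.

```python
import torch
from torchvision.models import alexnet

# Rebuild the pretext network and load its trained weights.
model = alexnet(num_classes=30)
model.load_state_dict(torch.load("sound_pretext.pth"))  # hypothetical checkpoint

# Reuse the convolutional layers as the detector's backbone;
# the detection head would be trained on top of these features.
backbone = model.features
```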
-
Discussion
Sound
• is abundant and free to collect.
• supports learning good visual representations.
• is complementary to visual information.
Future work
• Other sound representations.
• Which objects/scenes are detectable from sound?
-
Bonus: Visually Indicated Sounds (Owens et al. 2016, vis.csail.mit.edu)