
  • Ambient Sound Provides Supervision for Visual Learning

    Andrew Owens¹, Jiajun Wu¹, Josh H. McDermott¹, William T. Freeman¹,², and Antonio Torralba¹

    ¹MIT, ²Google Research

    ECCV 2016

    Presented by An T. Nguyen

  • Introduction

    Problem

    - Learn an image representation without labels...

    - ...that is useful for a real task (e.g. object recognition).

    Idea

    - Set up a pretext task.

    - To solve the pretext task, the model must learn a good representation.

    Learn to predict a "natural signal"...

    - ...that is available for 'free'.

    - This paper: sound.

    - Others: camera motion (Agrawal et al. 2015; Jayaraman & Grauman 2015).


  • Data

    Yahoo Flickr Creative Commons 100 Million dataset (Thomee et al. 2015).

    - A 360,000-video subset.

    - Sample one image every 10 sec.

    - Extract 3.75 sec of sound around each sampled frame (see the sketch below).

    - 1.8 million training examples.
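    A minimal sketch of this sampling scheme, assuming ffmpeg is available on the system. The centered audio window and the output file names are illustrative assumptions, not the paper's pipeline.

    # Sample one frame every 10 s and a ~3.75 s audio clip around it.
    import subprocess

    def sample_video(path, duration, every=10.0, clip=3.75):
        """Extract (frame, audio) training pairs from one video file."""
        t, idx = every / 2.0, 0
        while t + clip / 2.0 <= duration:
            # One video frame at time t.
            subprocess.run(
                ["ffmpeg", "-y", "-ss", str(t), "-i", path,
                 "-frames:v", "1", f"frame_{idx:04d}.jpg"],
                check=True)
            # The audio window centered on t (-vn drops the video stream).
            subprocess.run(
                ["ffmpeg", "-y", "-ss", str(t - clip / 2.0), "-i", path,
                 "-t", str(clip), "-vn", f"audio_{idx:04d}.wav"],
                check=True)
            t, idx = t + every, idx + 1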


  • Examples 1 (flickr.com/photos/41894173046@N01/4530333858)

    [Embedded video and sound clip.]

  • Examples 2 (flickr.com/photos/42035325@N00/8029349128)

    [Embedded video and sound clip.]

  • Examples 3 (flickr.com/photos/zen/2479982751)

    [Embedded video and sound clip.]

  • Challenges

    - Sound is sometimes indicative of the image...

    - ...but sometimes not.

    Sound-producing objects

    - may lie outside the image.

    - do not always produce sound.

    Video

    - is edited.

    - has noisy background sound.

    Question: What representation can we learn?


  • Represent sound

    Pre-process

    - Filter the waveform into frequency channels (mimics the human ear).

    - Compute statistics (e.g. the mean of each frequency channel).

    - → sound texture: a 502-dim vector.

    Two labeling models (a toy sketch of both follows)

    1. Cluster the sound textures (k-means).

    2. PCA, 30 projections, threshold → binary codes.

    Given an image

    1. Predict the sound cluster.

    2. Predict the 30 binary codes (multi-label classification).
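    A toy sketch of both labeling models, assuming scikit-learn. The real sound texture uses a cochlear filterbank and richer statistics (hence 502 dims); the crude spectral feature, the placeholder waveforms, the cluster count, and the median threshold below are all illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def sound_texture(waveform, frame=1024, n_bands=32):
        """Stand-in texture: mean and std of short-time energy per band."""
        n = len(waveform) // frame * frame
        spec = np.abs(np.fft.rfft(waveform[:n].reshape(-1, frame), axis=1))
        bands = np.stack([b.mean(axis=1)                  # energy per frame
                          for b in np.array_split(spec, n_bands, axis=1)],
                         axis=1)                          # (frames, n_bands)
        return np.concatenate([bands.mean(axis=0), bands.std(axis=0)])

    rng = np.random.default_rng(0)
    waveforms = [rng.normal(size=80000) for _ in range(64)]  # placeholder clips
    textures = np.stack([sound_texture(w) for w in waveforms])

    # Labeling model 1: k-means cluster id -> a multi-class target per clip.
    cluster_labels = KMeans(n_clusters=30, n_init=10).fit_predict(textures)

    # Labeling model 2: 30 PCA projections, thresholded at the per-projection
    # median -> a 30-bit binary code, i.e. a multi-label target per clip.
    proj = PCA(n_components=30).fit_transform(textures)
    binary_codes = (proj > np.median(proj, axis=0)).astype(np.int64)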


  • Training

    Convolutional neural network (sketched below)

    - Architecture similar to Krizhevsky et al. 2012 (AlexNet).

    - Implemented in Caffe.
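    The paper implements its network in Caffe; below is a minimal PyTorch sketch of an AlexNet-style trunk with a 30-logit head for the binary-code task. Layer sizes follow Krizhevsky et al. 2012; the input resolution and head width are assumptions.

    import torch
    import torch.nn as nn

    class SoundPretextNet(nn.Module):
        def __init__(self, n_codes=30):
            super().__init__()
            self.features = nn.Sequential(           # AlexNet-style trunk
                nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(),
                nn.MaxPool2d(3, 2),
                nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(),
                nn.MaxPool2d(3, 2),
                nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
                nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
                nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(3, 2))
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
                nn.Linear(4096, n_codes))            # one logit per binary code

        def forward(self, x):                        # x: (B, 3, 227, 227)
            return self.head(self.features(x))

    # Multi-label pretext loss: independent sigmoid cross-entropy per code.
    net, loss_fn = SoundPretextNet(), nn.BCEWithLogitsLoss()
    logits = net(torch.randn(4, 3, 227, 227))
    loss = loss_fn(logits, torch.randint(0, 2, (4, 30)).float())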


  • Visualizing neurons (in upper layers)

    Method: for each neuron,

    1. Find the images with the largest activation.

    2. Find the locations with the largest contribution to the activation.

    3. Highlight these regions.

    4. Show them to humans on AMT (Amazon Mechanical Turk).

    (Steps 1-2 are sketched below.)
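    A minimal sketch of steps 1-2, assuming a PyTorch conv trunk (`features`, e.g. the one above) and a standard image loader; both names are placeholders, not the paper's code.

    import torch

    @torch.no_grad()
    def top_activations(features, unit, loader, k=10):
        """Rank images by one conv unit's peak activation and record where
        in each feature map the unit fired most."""
        scored = []
        for images, _ in loader:
            fmap = features(images)[:, unit]            # (B, H, W)
            score, flat = fmap.flatten(1).max(dim=1)    # peak per image
            for i in range(len(images)):
                h, w = divmod(int(flat[i]), fmap.shape[-1])
                scored.append((float(score[i]), images[i], (h, w)))
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:k]   # highlight these regions, then show on AMT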


  • Visualizing neurons

    [Result figures: top-activating images with highlighted regions for selected units.]

  • Detectors histogram

    [Figure: histogram of the object/scene detector units that emerge under each training signal: Sound, Ego Motion, and Labeled Scenes (supervised).]

  • Observations

    - Each method learns some kind of representation...

    - ...which depends on the pretext task.

    Representation learned from sound

    - Detects objects with distinctive sounds.

    - Complementary to other methods.


  • Object/Scene Recognition (1-vs-rest SVM; sketched below)

    Comparable performance to other methods.

    References in the figure: [1] Agrawal et al. 2015, [4] Doersch et al. 2015, [20] Krähenbühl et al. 2016, [35] Wang & Gupta 2015.
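    A sketch of the protocol named in this slide's title, assuming scikit-learn; the random matrix stands in for frozen, pooled CNN features of a labeled object/scene dataset.

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(200, 4096)), rng.integers(0, 10, size=200)

    # One linear SVM per class on the frozen features (1-vs-rest).
    clf = OneVsRestClassifier(LinearSVC(C=1.0)).fit(X, y)
    mean_accuracy = clf.score(X, y)   # evaluate on held-out data in practice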


  • Object Detection (pretraining Fast R-CNN)

    Similar performance to motion-based pretraining.

    References in the figure: [1] Agrawal et al. 2015, [4] Doersch et al. 2015, [20] Krähenbühl et al. 2016, [35] Wang & Gupta 2015.


  • Discussion

    Sound

    - is abundant.

    - can supervise good representations.

    - is complementary to visual information.

    Future work

    - Other sound representations.

    - Which objects/scenes are detectable by sound?


  • Bonus: Visually Indicative Sound (Owens et al. 2016, vis.csail.mit.edu)