
  • Ambient Sound Provides Supervision for Visual Learning

    Andrew Owens¹, Jiajun Wu¹, Josh H. McDermott¹, William T. Freeman¹,², and Antonio Torralba¹

    ¹MIT, ²Google Research

    ECCV 2016

    Presented by An T. Nguyen

  • Introduction

    Problem

    - Learn an image representation without labels...

    - ...that is useful for a real task (e.g. object recognition).

    Idea

    - Set up a pretext task.

    - To solve the pretext task, the model must learn a good representation.

    Learn to predict a "natural signal"...

    - ...that is available for 'free'.

    - This paper: sound.

    - Others: camera motion (Agrawal et al. 2015; Jayaraman & Grauman 2015).


  • Data

    Yahoo Flickr Creative Commons 100 Million dataset (Thomee et al. 2015).

    - A 360,000-video subset.

    - Sample one image every 10 sec.

    - Extract 3.75 sec of sound around each sampled frame (see the sketch below).

    - 1.8 million training examples.
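    A minimal sketch of this sampling scheme, assuming ffmpeg is available on the system. The centered audio window and the output file names are illustrative assumptions, not the paper's pipeline.

    # Sample one frame every 10 s and a ~3.75 s audio clip around it.
    import subprocess

    def sample_video(path, duration, every=10.0, clip=3.75):
        """Extract (frame, audio) training pairs from one video file."""
        t, idx = every / 2.0, 0
        while t + clip / 2.0 <= duration:
            # One video frame at time t.
            subprocess.run(
                ["ffmpeg", "-y", "-ss", str(t), "-i", path,
                 "-frames:v", "1", f"frame_{idx:04d}.jpg"],
                check=True)
            # The audio window centered on t (-vn drops the video stream).
            subprocess.run(
                ["ffmpeg", "-y", "-ss", str(t - clip / 2.0), "-i", path,
                 "-t", str(clip), "-vn", f"audio_{idx:04d}.wav"],
                check=True)
            t, idx = t + every, idx + 1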


  • Examples 1 (flickr.com/photos/41894173046@N01/4530333858)

    [Embedded video and sound clip.]

  • Examples 2 (flickr.com/photos/42035325@N00/8029349128)

    [Embedded video and sound clip.]

  • Examples 3 (flickr.com/photos/zen/2479982751)

    [Embedded video and sound clip.]

  • Challenges

    - Sound is sometimes indicative of the image...

    - ...but sometimes not.

    Sound-producing objects

    - may lie outside the image.

    - do not always produce sound.

    Video

    - is edited.

    - has noisy background sound.

    Question: What representation can we learn?


  • Represent sound

    Pre-process

    - Filter the waveform into frequency channels (mimics the human ear).

    - Compute statistics (e.g. the mean of each frequency channel).

    - → sound texture: a 502-dim vector.

    Two labeling models (a toy sketch of both follows)

    1. Cluster the sound textures (k-means).

    2. PCA, 30 projections, threshold → binary codes.

    Given an image

    1. Predict the sound cluster.

    2. Predict the 30 binary codes (multi-label classification).
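    A toy sketch of both labeling models, assuming scikit-learn. The real sound texture uses a cochlear filterbank and richer statistics (hence 502 dims); the crude spectral feature, the placeholder waveforms, the cluster count, and the median threshold below are all illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def sound_texture(waveform, frame=1024, n_bands=32):
        """Stand-in texture: mean and std of short-time energy per band."""
        n = len(waveform) // frame * frame
        spec = np.abs(np.fft.rfft(waveform[:n].reshape(-1, frame), axis=1))
        bands = np.stack([b.mean(axis=1)                  # energy per frame
                          for b in np.array_split(spec, n_bands, axis=1)],
                         axis=1)                          # (frames, n_bands)
        return np.concatenate([bands.mean(axis=0), bands.std(axis=0)])

    rng = np.random.default_rng(0)
    waveforms = [rng.normal(size=80000) for _ in range(64)]  # placeholder clips
    textures = np.stack([sound_texture(w) for w in waveforms])

    # Labeling model 1: k-means cluster id -> a multi-class target per clip.
    cluster_labels = KMeans(n_clusters=30, n_init=10).fit_predict(textures)

    # Labeling model 2: 30 PCA projections, thresholded at the per-projection
    # median -> a 30-bit binary code, i.e. a multi-label target per clip.
    proj = PCA(n_components=30).fit_transform(textures)
    binary_codes = (proj > np.median(proj, axis=0)).astype(np.int64)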


  • Training

    Convolutional neural network (sketched below)

    - Architecture similar to Krizhevsky et al. 2012 (AlexNet).

    - Implemented in Caffe.
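    The paper implements its network in Caffe; below is a minimal PyTorch sketch of an AlexNet-style trunk with a 30-logit head for the binary-code task. Layer sizes follow Krizhevsky et al. 2012; the input resolution and head width are assumptions.

    import torch
    import torch.nn as nn

    class SoundPretextNet(nn.Module):
        def __init__(self, n_codes=30):
            super().__init__()
            self.features = nn.Sequential(           # AlexNet-style trunk
                nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(),
                nn.MaxPool2d(3, 2),
                nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(),
                nn.MaxPool2d(3, 2),
                nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
                nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
                nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(3, 2))
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
                nn.Linear(4096, n_codes))            # one logit per binary code

        def forward(self, x):                        # x: (B, 3, 227, 227)
            return self.head(self.features(x))

    # Multi-label pretext loss: independent sigmoid cross-entropy per code.
    net, loss_fn = SoundPretextNet(), nn.BCEWithLogitsLoss()
    logits = net(torch.randn(4, 3, 227, 227))
    loss = loss_fn(logits, torch.randint(0, 2, (4, 30)).float())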


  • Visualizing neurons (in upper layers)

    Method: for each neuron,

    1. Find the images with the largest activation.

    2. Find the locations with the largest contribution to the activation.

    3. Highlight these regions.

    4. Show them to humans on AMT (Amazon Mechanical Turk).

    (Steps 1-2 are sketched below.)
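    A minimal sketch of steps 1-2, assuming a PyTorch conv trunk (`features`, e.g. the one above) and a standard image loader; both names are placeholders, not the paper's code.

    import torch

    @torch.no_grad()
    def top_activations(features, unit, loader, k=10):
        """Rank images by one conv unit's peak activation and record where
        in each feature map the unit fired most."""
        scored = []
        for images, _ in loader:
            fmap = features(images)[:, unit]            # (B, H, W)
            score, flat = fmap.flatten(1).max(dim=1)    # peak per image
            for i in range(len(images)):
                h, w = divmod(int(flat[i]), fmap.shape[-1])
                scored.append((float(score[i]), images[i], (h, w)))
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:k]   # highlight these regions, then show on AMT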


  • Visualizing neurons

    [Result figures: top-activating images with highlighted regions for selected units.]

  • Detectors histogram

    [Figure: histogram of the object/scene detector units that emerge under each training signal: Sound, Ego Motion, and Labeled Scenes (supervised).]

  • Observations

    - Each method learns some kind of representation...

    - ...which depends on the pretext task.

    Representation learned from sound

    - Detects objects with distinctive sounds.

    - Complementary to other methods.


  • Object/Scene Recognition (1-vs-rest SVM; sketched below)

    Comparable performance to other methods.

    References in the figure: [1] Agrawal et al. 2015, [4] Doersch et al. 2015, [20] Krähenbühl et al. 2016, [35] Wang & Gupta 2015.
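    A sketch of the protocol named in this slide's title, assuming scikit-learn; the random matrix stands in for frozen, pooled CNN features of a labeled object/scene dataset.

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(200, 4096)), rng.integers(0, 10, size=200)

    # One linear SVM per class on the frozen features (1-vs-rest).
    clf = OneVsRestClassifier(LinearSVC(C=1.0)).fit(X, y)
    mean_accuracy = clf.score(X, y)   # evaluate on held-out data in practice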


  • Object Detection (pretraining Fast R-CNN)

    Similar performance to motion-based pretraining.

    References in the figure: [1] Agrawal et al. 2015, [4] Doersch et al. 2015, [20] Krähenbühl et al. 2016, [35] Wang & Gupta 2015.


  • Discussion

    Sound

    - is abundant.

    - can supervise good representations.

    - is complementary to visual information.

    Future work

    - Other sound representations.

    - Which objects/scenes are detectable by sound?


  • Bonus: Visually Indicative Sound (Owens et al. 2016, vis.csail.mit.edu)