Future Directions in Computer Vision

Larry Davis
Computer Vision Laboratory
University of Maryland
College Park, MD, USA


Page 1:

Future Directions in Computer Vision

Larry Davis
Computer Vision Laboratory
University of Maryland
College Park, MD, USA

Page 2:

Presentation overview

Future Directions Workshop on Computer Vision
Object detection using CNNs without object proposals
Incorporating context into detection
Scale-dependent pooling to detect small object instances
Resolving referring expressions using context
Summary

Page 3:

Strategic Directions Workshop on "Visual Commonsense," Nov. 12-13 in Washington, D.C.

• Sponsored by OSTP in the US
• Participants: Poggio, Malik, Zhu, Berg (Alex), Kohli, Hoiem, Grauman, Zitnick, Gupta, Fox, Tellex, Oliva, Scholl (absent), Domingos, Daumé
• Organized by me, Fei-Fei Li, and Devi Parikh

Page 4:

The computer vision landscape

• Breakthroughs in CV (and AI generally) would clearly be "disruptive." This has been known "forever."
• Our field has more than doubled in size in less than a decade, and there are currently more than 175 computer vision startups worldwide, according to Crunchbase.
• Feeding frenzy in self-driving cars.
• So, has the field finally progressed to the point where real vision problems can be solved?

Page 5:

So, what has changed?

Deep learning

Page 6:

So, what has changed?

Deep learning
SFM and stereo

Page 7:

So, what has changed?

Deep learning
SFM and stereo
Human pose estimation and tracking

Page 8:

So, what has changed?

Deep learning
SFM and stereo
Human pose estimation and tracking
Computing infrastructure
  Big data
  Crowdsourcing
  GPUs
  Cloud computing and "free" storage

Page 9:

So, what has changed?

Deep learning
SFM and stereo
Human pose estimation and tracking
Computing infrastructure
  Big data
  Crowdsourcing
  GPUs
  Cloud computing and "free" storage
Open source software

Page 10:

Commercial indicators

Driving aids and autonomous driving – Mobileye
Face recognition under the hood at social media companies
Image search – TinEye, Clarifai
Google self-driving cars – 1.5M miles and counting

Page 11:

And what about the next 10 years?

So what do you think the future of the field is?

Here are some of the workshop recommendations.

Page 12:

Workshop recommendations

Develop the field of "social perception"
  Understand the "internal state" of people as they interact with each other and with the world
  Crucial for human-robot interaction
Perceptual robotics – and testbeds for measuring progress in situated vision research
Visual search – intelligent sampling of the visual world
Acquisition and representation of visual commonsense from observation and interaction
Vision and language

Page 13:

Language and vision – how do we test the ability to accumulate and integrate knowledge? The VQA dataset

• Many useful challenges
  – Where to look to answer a question?
  – How to relate existing detectors, pose estimators, attribute classifiers, etc. to this task?
  – How to combine general knowledge with vision?

Page 14:

Workshop recommendations

Structured prediction
  The relationship between parts, objects, and scenes
  The hierarchical structure of human behavior: movement, goals, actions, and events
"Explainable" perception. Don't just classify; explain your answer.

Page 15:

Workshop recommendations

Deep learning
  Why/when does it work?
  Why are all local minima created equal?
Visual learning with minimal (or no) supervision
  Developmental learning (NEIL)

Page 16:

Are object proposals necessarily the answer?

G-CNN: an iterative grid-based object detector
Mahyar Najibi and Mohammad Rastegari
CVPR 2016

Page 17:

Object detection

Localization – bounding boxes, segmentation masks
Classification

Page 18:

In your camera – sliding window detection

[Figure: sliding windows are extracted from the image, and each extracted box is scored by a multi-class classifier, e.g. horse = 0.9, person = 0.8; most windows score near zero.]
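As a concrete picture of what the camera is doing, here is a minimal sliding-window detector sketch; the classifier callable, window size, stride, and threshold are illustrative assumptions, not any particular product's pipeline.

def sliding_window_detect(image, classifier, win=64, stride=16, thresh=0.5):
    """Slide a fixed-size window over the image and keep confident boxes.
    `classifier` is a hypothetical crop -> {class: score} function."""
    H, W = image.shape[:2]
    detections = []
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            crop = image[y:y + win, x:x + win]
            scores = classifier(crop)          # e.g. {"horse": 0.9, "person": 0.3}
            for cls, s in scores.items():
                if s >= thresh:
                    detections.append((x, y, x + win, y + win, cls, s))
    return detections

# In practice the image is also rescaled to a pyramid so the fixed window
# covers multiple object scales, which is exactly what makes this slow.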

Page 19:

Object proposals

Sliding windows are slow – scale, orientation, ...

Object proposals are (learning-based) multi-segmentation algorithms that generate fewer regions for classification (typically boxes).

The consensus is that region proposals are crucial to state-of-the-art detection systems, whether they are given to the network or constructed by the network.

However, localization is poor, so (class-dependent) post-processing with a bounding-box regressor is typically employed.

Page 20:

Object proposals and CNNs

R-CNN pushes each proposal through the CNN; slow because the network is run multiple times.

SPP-Net [1] computes filter responses only once for each image and pools from them to form features for the proposals.

Fast R-CNN [2] builds on this and packs all stages of the system except the region proposal into one CNN.

[Figure: the Fast R-CNN architecture.]

1. He, Kaiming, et al. "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition." ECCV 2014. Springer, 2014, pp. 346-361.
2. Girshick, Ross. "Fast R-CNN." ICCV 2015.
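To make the shared-computation idea concrete, here is a rough NumPy sketch of ROI max pooling in the spirit of SPP-Net / Fast R-CNN; the 7x7 output grid and 1/16 spatial scale are typical choices, not values from the slide, and this is an illustration rather than either paper's implementation.

import numpy as np

def roi_max_pool(feature_map, roi, out_size=7, spatial_scale=1.0 / 16):
    """feature_map: (C, H, W) conv features shared across proposals;
    roi: (x1, y1, x2, y2) box in image pixel coordinates."""
    C, H, W = feature_map.shape
    # Project the image-space box onto the downsampled feature map, clamped.
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in roi]
    x1, y1 = max(0, min(x1, W - 1)), max(0, min(y1, H - 1))
    x2, y2 = max(x1 + 1, min(x2, W)), max(y1 + 1, min(y2, H))
    xs = np.linspace(x1, x2, out_size + 1).astype(int)
    ys = np.linspace(y1, y2, out_size + 1).astype(int)
    pooled = np.zeros((C, out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Each output cell max-pools its sub-window of the projected ROI.
            cell = feature_map[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                  xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[:, i, j] = cell.max(axis=(1, 2))
    return pooled  # fixed-size features for one proposal; conv layers ran once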

Page 21:

Region proposal stage

These methods use an external object proposal stage (e.g. selective search, with ~2K proposals/image).

In Fast R-CNN, computing object proposals is the bottleneck, taking around 2 sec/image.

Faster R-CNN [3] increases efficiency by reducing the number of proposed bounding boxes.
  Jointly learns the proposal generator and the features
  Fast and accurate

3. Ren, Shaoqing, et al. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." NIPS 2015.

Page 22:

G-CNN training network structure

Page 23:

G-CNN: Training

Training set for step 1

Page 24:

G-CNN: Training

Added samples for step 2

Page 25:

G-CNN detection

[Figure: at each step the highest-scoring class is "car", so the car regressor moves the box closer to the object.]

Iteratively update the positions of the initial bounding boxes with the regressor corresponding to the class with the highest score.

Page 26:

G-CNN structure at detection time

To reduce detection time, the G-CNN network is divided into two parts:
• The global part is called only once for each image.
• The regression part is called S_test times, once per step.

A schematic sketch of the resulting loop follows.
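This sketch illustrates the two-part split described above; `global_net`, `classify`, and `regress` are hypothetical stand-ins for the actual network heads, not the authors' code.

def gcnn_detect(image, initial_grid, global_net, classify, regress, s_test=5):
    features = global_net(image)        # global part: run once per image
    boxes = list(initial_grid)          # coarse multi-scale grid of boxes
    for _ in range(s_test):             # regression part: run S_test times
        new_boxes = []
        for box in boxes:
            scores = classify(features, box)               # per-class scores
            cls = max(scores, key=scores.get)              # highest-scoring class
            new_boxes.append(regress(features, box, cls))  # class-specific step
        boxes = new_boxes
    return boxes  # final boxes are then scored and NMS-ed as usual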

Page 27:

Experimental setup

• Experiments are performed on the VOC 2007 and VOC 2012 datasets.
• G-CNN is trained with S = 3 steps over an initial grid with three scales [2, 5, 10] and overlaps [0.9, 0.8, 0.7] at each scale.
• At test time, a coarser grid with overlaps [0.7, 0.5, 0.0] is used (around 180 initial boxes).
• After 5 iterations, G-CNN achieves the same mAP as Fast R-CNN with around 2K bounding boxes.

Page 28:

VOC 2012 using VGG16

Page 29:

How effective are the regressors?

IoU histogram of the best-overlapping boxes to ground-truth boxes at each iteration.

Page 30:

Page 31:

How can a neural network learn and utilize context?

Mahyar Najibi, Mohammad Rastegari, Abhinav Gupta, Ali Farhadi – Deep Saccadic Detectors

Page 32:

The top choices of FRCNN are very accurate.

Page 33:

Detection with GT boxes (AP per class)

Method        Aeroplane  Bicycle  Bird  Boat  Bottle  Bus   Car   Cat   Chair  Cow
FRCNN SS      66.4       71.6     53.8  43.3  24.7    69.2  69.7  71.5  31.1   63.4
FRCNN SS+GT   68.2       74.1     56.0  50.6  31.5    72.6  72.8  73.1  34.8   63.8
FRCNN GT      83.0       84.1     78.7  81.5  73.7    85.5  88.0  83.5  69.9   75.4

Method        Table  Dog   Horse  Mbike  Person  Plant  Sheep  Sofa  Train  TV    Avg
FRCNN SS      59.8   62.2  73.1   65.9   57.0    26.0   52.0   56.4  67.8   57.7  57.1
FRCNN SS+GT   61.1   63.5  76.8   68.5   63.7    29.7   54.4   57.8  70.5   61.0  60.2
FRCNN GT      80.2   78.1  81.9   85.1   87.7    83.2   71.7   78.5  88.8   88.5  81.3

Methods are trained on VOC 2007 trainval; AlexNet is the CNN backbone.
FRCNN SS: Fast R-CNN using selective-search proposals.
FRCNN SS+GT: GT boxes are added to the SS boxes.
FRCNN GT: only GT boxes are used.

Page 34:

Sequential detection

This suggests a simple strategy for detection:
  Commit to the most confident detection.
  Use it as context for determining the next most confident detection.
  And so on.

All integrated into one CNN architecture; a toy sketch of the greedy loop follows.
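This toy sketch assumes a hypothetical score(proposal, committed) function that rescores a proposal given the detections committed so far; in the actual system that conditioning happens inside a single CNN, not in Python.

def sequential_detect(proposals, score, max_detections=10):
    committed = []                       # detections committed to so far
    remaining = list(proposals)
    while remaining and len(committed) < max_detections:
        # Rescore every remaining proposal in the context of committed detections.
        scored = [(p,) + score(p, committed) for p in remaining]
        best = max(scored, key=lambda t: t[2])      # highest-confidence detection
        proposal, cls, conf = best
        committed.append((proposal, cls, conf))     # commit; it becomes context
        remaining.remove(proposal)
    return committed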

Page 35:

Deep Sequential Detection

[Architecture figure: the input passes through shared convolutional layers and ROI pooling (with ROI info); linear (fc6) + ReLU and linear (fc7) + ReLU feed a regressor and an active selector; a hidden-state path through linear (h1) + ReLU and linear (h2) + ReLU feeds a hidden-state selector; the selected inputs are concatenated, and a classifier produces class-based classification output, with MAX and NMS choosing the detection committed at each step.]

Page 36:

Datasets

• PASCAL VOC 2007: 20 classes, ~10K images
• PASCAL VOC 2012: 20 classes, ~15K images
• MSCOCO (2015): 80 classes, ~300K images

Page 37:

VOC 2012

Page 38:

MSCOCO

Precision and recall

Methods are trained on the train set and evaluated on the validation set.
The top 2K selective-search proposals are used for all methods.

Class-based relative improvement

Page 39:

Scale-dependent pooling – Fan Yang (CVPR 2016)

Goal: detect (even small) objects effectively and efficiently using CNNs + object proposals.

Challenges: scale variance; a huge number of proposals.

Page 40:

Scale-dependent pooling

Pool proposals of different scales from different conv layers: an n-branch structure.

Small object instances are well represented by features pooled from lower conv layers.

Page 41:

Scale-dependent pooling

Divide proposals into groups based on their size.
Pool small proposals at lower conv layers and larger ones at higher conv layers.
Train the entire system end-to-end.

[Figure: small proposals are pooled from an earlier conv layer, large proposals from a later one; a routing sketch follows.]
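A simplified sketch of the routing idea, reusing the roi_max_pool sketch from the Fast R-CNN discussion above; the branch size cutoffs are illustrative assumptions, since the slide does not give the actual thresholds.

def scale_dependent_pool(conv_maps, proposals, out_size=7):
    """conv_maps: list of (feature_map, spatial_scale) pairs from shallow to
    deep, e.g. [(conv3, 1/4), (conv4, 1/8), (conv5, 1/16)]."""
    # Illustrative per-branch size cutoffs (sqrt of box area, in pixels).
    size_cutoffs = [64, 128, float("inf")]
    pooled = []
    for (x1, y1, x2, y2) in proposals:
        size = ((x2 - x1) * (y2 - y1)) ** 0.5
        # Route each proposal to the branch matching its scale.
        branch = next(i for i, c in enumerate(size_cutoffs) if size <= c)
        fmap, scale = conv_maps[branch]
        pooled.append((branch, roi_max_pool(fmap, (x1, y1, x2, y2),
                                            out_size=out_size,
                                            spatial_scale=scale)))
    return pooled  # each proposal's features come from its scale-matched layer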

Page 42:

Experiments

KITTI (mAP)
Inner-city (mAP)

Page 43:

Detection as a function of size – KITTI (AP per size group S1-S4 and overall S)

Method          Inputs   Car: S1   S2   S3   S4   S      Ped: S1   S2   S3   S4   S      Cyc: S1   S2   S3   S4  S      mAP
FRCNN+AlexNet   4             52.8 60.7 75.8 55.5 61.6        19.7 47.5 88.4 24.1 61.4        42.0 51.6 44.9 0   46.5   56.5
FRCNN+VGG16     1 (400)       33.9 68.3 82.8 68.8 57.3        7.9  50.4 95.3 55.8 64.6        19.0 63.8 66.6 0   42.3   54.7
FRCNN+VGG16     1 (500)       42.2 70.0 85.1 65.9 62.3        12.6 55.9 94.6 44.9 66.8        29.1 63.8 68.7 0   48.8   59.3
FRCNN+VGG16     1 (800)       47.6 70.0 84.8 60.5 64.5        14.7 54.5 94.5 47.2 66.4        34.9 61.2 67.4 0   50.4   60.4
FRCNN+VGG16     2             47.4 70.2 83.1 54.5 64.1        14.9 55.2 94.5 63.1 66.5        35.8 61.2 65.9 0   50.4   60.3
SDP             1 (400)       59.1 73.8 84.7 73.6 70.7        12.6 54.8 94.9 70.7 65.7        29.3 65.6 71.7 0   49.4   61.9
SDP             1 (500)       64.2 74.4 86.0 68.4 73.7        17.3 58.4 94.9 44.8 66.9        37.5 67.3 68.6 0   54.6   65.1
SDP             1 (800)       65.2 73.5 86.0 61.0 73.8        16.9 57.1 94.3 44.1 65.5        36.5 61.5 61.9 0   49.9   63.1
SDP+CRC         1 (500)       63.9 74.3 85.8 68.2 73.5        17.5 52.0 93.7 45.9 65.5        35.1 65.7 69.2 0   52.9   64.0
SDP+CRC ft      1 (500)       63.9 74.2 85.5 62.9 73.7        17.6 50.0 93.4 61.0 65.9        35.8 66.5 67.6 0   53.1   64.2

Page 44:

Modeling Context between Objects for Understanding Referring Expressions

Varun Nagaraja, Vlad Morariu, Larry Davis
ECCV 2016

Page 45:

Referring Expressions

Descriptions that identify a particular object instance:
"Man sitting on the left holding a game controller"
"Woman in the middle sitting on the bed"
"Man wearing a red jacket and blue jeans sitting on the right"

Page 46:

Referring expressions rely on attributes and context

Blonde fluffy dog

Tan colored sofa

Giraffe bending down

Person riding a blue motorcycle

Plant on the right side of the TV

Page 47:

Problem Formulation

Input: an image I and a sentence, e.g. "Girl wearing a red jacket"
Output: the region of the image that the sentence refers to

Page 48:

Solution Framework

Hypothesize a set of region candidates.

(Generation and Comprehension of Unambiguous Object Descriptions, J. Mao et al., CVPR 2016)

Page 49:

Solution Framework

Pick the region candidate with the highest probability of generating the query referring expression.

(Generation and Comprehension of Unambiguous Object Descriptions, J. Mao et al., CVPR 2016)

Page 50:

Baseline Method

Modeling referring expression probability using an LSTM.

[Figure: an unrolled LSTM. At each step, the word embedding of the current token (<BOS>, "Girl", "wearing", "a", "red", "jacket") is combined with region CNN features, image CNN features, and bounding-box features, and the LSTM predicts the next token, ending at <EOS>.]

(Generation and Comprehension of Unambiguous Object Descriptions, J. Mao et al., CVPR 2016)
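A conceptual sketch of the comprehension step: score each candidate region by the log-probability the LSTM assigns to the query expression given that region, and pick the best. lstm_step and embed are hypothetical stand-ins for the trained model, not the paper's API.

import math

def expression_log_prob(words, region_feats, lstm_step, embed):
    """Sum log P(word_t | words_<t, region) over the expression."""
    state = None
    log_p = 0.0
    prev = "<BOS>"
    for word in words + ["<EOS>"]:
        # Every step is conditioned on the region's CNN + bounding-box features.
        probs, state = lstm_step(embed(prev), region_feats, state)
        log_p += math.log(probs[word])
        prev = word
    return log_p

def comprehend(expression, candidates, lstm_step, embed):
    words = expression.lower().split()
    # Pick the region most likely to have generated the expression.
    return max(candidates, key=lambda r: expression_log_prob(
        words, r["features"], lstm_step, embed))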

Page 51:

Max-margin Method

The baseline method can be improved by training the model to assign lower probability to negative regions.

[Figure: for "Girl wearing a red jacket", the referred region contrasted with negative regions.]

(Generation and Comprehension of Unambiguous Object Descriptions, J. Mao et al., CVPR 2016)
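One plausible hinge-style rendering of the max-margin idea (the paper's exact formulation may differ): the referred region's log-probability must beat each negative region's by a margin.

def max_margin_loss(log_p_pos, log_p_negs, margin=0.1):
    """log_p_pos: log P(expression | referred region);
    log_p_negs: log P(expression | negative region) for sampled negatives."""
    hinge = sum(max(0.0, margin - (log_p_pos - lp)) for lp in log_p_negs)
    # Maximum-likelihood term plus the margin penalty on negatives.
    return -log_p_pos + hinge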

Page 52:

Modeling Context

"The plant on the right side of the TV"

Previous methods do not model the locations of contextual objects.

Page 53:

Modeling Context

Baseline and max-margin architecture: the LSTM input combines the word embedding, region CNN features, region bounding box, and image features.

Page 54:

Modeling Context

Context model architecture: the LSTM input combines the word embedding, region CNN features, region bounding box, context region features, and context region bounding box.

Page 55:

Modeling Context

[Figure: the expression is scored for the pair (Region1, Region2), with each region's CNN features and bounding box feeding the LSTM along with the word embedding.]

Page 56:

Modeling Context

[Figure: the same scoring for the pair (Region1, Region3).]

Page 57:

Modeling Context

[Figure: the same scoring for the pair (Region1, Region4).]

Page 58:

Modeling Context

Pooling context from multiple pairs of regions

Page 59:

Modeling Context

We can also use noisy-or pooling, which is more robust: the referred region scores high if at least one context region supports the expression (sketched below).
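Noisy-or pooling in one small function; the pooled probability is high if at least one (region, context) pair explains the expression, which makes the score robust to irrelevant context candidates.

def noisy_or(pair_probs):
    """pair_probs: P(expression | referred region, context region j) for each j."""
    miss = 1.0
    for p in pair_probs:
        miss *= (1.0 - p)   # probability that *no* pair explains the expression
    return 1.0 - miss

# e.g. noisy_or([0.1, 0.8, 0.2]) is about 0.856: one strong pair dominates.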

Page 60:

Training the Context Model

The challenge is that there are no annotations available for context objects.

"The plant on the right side of the TV"

Page 61:

Multiple Instance Learning

"The plant on the right side of the TV"

So we use a MIL-based technique, with the annotation of the referred object serving as weak supervision; a sketch of the resulting bag-margin objective follows.
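A rough sketch of how the MIL bags might be scored and trained, reusing noisy_or from above. Pairs built around the annotated referred region form the positive bag; pairs built around other regions form negative bags. The exact bag construction and losses in the paper may differ.

def bag_margin_loss(pos_pair_probs, neg_bags, margin=0.1):
    pos_score = noisy_or(pos_pair_probs)   # positive bag: referred region pairs
    loss = 0.0
    for bag in neg_bags:                   # each negative region's bag of pairs
        neg_score = noisy_or(bag)
        loss += max(0.0, margin - (pos_score - neg_score))
    return loss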

Page 62:

Experiments

Implemented in Caffe

Region and image features
  • VGG16 fc8 layer, fine-tuned
Bounding box features
  • scaled <xmin, ymin, xmax, ymax, area> (a sketch follows this list)
Word embedding size – 1024
LSTM hidden dimension – 1024
Region candidates – MCG technique
Region filtering process
  • Obtain scores from Fast R-CNN and select regions above a threshold
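One plausible reading of the scaled bounding-box feature, normalizing coordinates and area by the image dimensions; the slide does not spell out the scaling, so treat this normalization as an assumption.

def bbox_features(box, img_w, img_h):
    """box: (xmin, ymin, xmax, ymax) in pixels -> 5-d scaled feature vector."""
    xmin, ymin, xmax, ymax = box
    area = (xmax - xmin) * (ymax - ymin)
    # Coordinates scaled by width/height; area scaled by the image area.
    return [xmin / img_w, ymin / img_h,
            xmax / img_w, ymax / img_h,
            area / (img_w * img_h)]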

Page 63:

Google RefExp Results (validation partition)

Method \ Proposals             GT     MCG
Max Likelihood [Mao et al.]    57.5   42.4
Max Margin [Mao et al.]        65.7   47.8
Ours, Neg. Bag margin          68.4   49.5
Ours, Pos. & Neg. Bag margin   68.4   50.0

All results use noisy-or pooling.
A detection is considered a true positive if its IoU with the ground truth exceeds 0.5 (see the sketch below).
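The evaluation criterion in one small function: intersection-over-union between a predicted box and the ground-truth box, thresholded at 0.5.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero if the boxes do not overlap).
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = ((ax2 - ax1) * (ay2 - ay1) +
             (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, thresh=0.5):
    return iou(pred, gt) > thresh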

Page 64:

Google RefExp Results (qualitative)

[Figure: ground truth vs. image-context-only vs. noisy-or pooling predictions for "The chair closest to the lady" and "A white truck in front of a yellow truck".]

Page 65:

UNC RefExp Results – TestB partition (object-centric)

Method \ Proposals             GT     MCG
Max Likelihood [Mao et al.]    70.6   50.0
Max Margin [Mao et al.]        76.3   55.1
Ours, Neg. Bag margin          78.0   56.4
Ours, Pos. & Neg. Bag margin   76.1   56.3

Page 66:

UNC RefExp Results (qualitative) – TestB partition (object-centric)

[Figure: ground truth vs. image-context-only vs. noisy-or pooling predictions for "Elephant towards the back" and "Food on the far back on the plate".]

Page 67:

A few closing observations

Success depends on the region proposal algorithm including candidates for both the correct referred object and the context objects.
  This is much more demanding than just requiring a candidate for the referred object.
  It is ameliorated somewhat by having the entire image as a candidate context object.

A straightforward extension to include additional context objects (language can be deeply nested) is intractable.

(Methodological) We would like to evaluate performance restricted to "relevant" referring expressions, but it is difficult to specify the correct selection criteria.

Page 68:

Summary

The intellectual landscape of computer vision has changed dramatically over the past decade.

Many of the "future research directions" identified by the workshop are already well underway.

And there are still huge performance shortfalls on basic problems like detection and recognition (compare MSCOCO vs. VOC).

My favorite future research directions:
  Context – sooner or later it has to make a difference
  Visual search
  Tasking visual surveillance systems – compositional models and video analysis (structured prediction)