Future directions in computer vision
Larry Davis
Computer Vision Laboratory
University of Maryland
College Park, MD, USA
Presentation overview
Future Directions Workshop on Computer Vision
Object detection using CNNs without object proposals
Incorporating context into detection
Scale-dependent pooling to detect small object instances
Resolving referring expressions using context
Summary
Strategic Directions Workshop on “Visual Commonsense,” Nov 12-13 in D.C.
• Sponsored by OSTP in the US
• Poggio, Malik, Zhu, Berg (Alex), Kohli, Hoiem, Grauman, Zitnick, Gupta, Fox, Tellex, Oliva, Scholl (absent), Domingos, Daumé.
• Organized by me, Fei-Fei Li and Devi Parikh
The computer vision landscape
• Breakthroughs in CV (and AI generally) would
clearly be “disruptive.” This has been known
“forever.”
• Our field has more than doubled in size in less than a decade, and there are currently more than 175 startups in computer vision worldwide according to Crunchbase.
• Feeding frenzy in self-driving cars
• So, has the field finally progressed to the point
where real vision problems can be solved?
So, what has changed?
Deep learning
SFM and stereo
Human pose estimation and tracking
Computing infrastructure
Big Data
Crowd sourcing
GPUs
Cloud computing and “free” storage
Open source software
Commercial indicators
Driving aids and autonomous driving - Mobileye
Face recognition under the hood
at social media companies
Image search – Tineye,
Clarifai
Google self-driving cars – 1.5M miles and counting
And what about the next 10 years?
So what do you think the future of the field is?
Here are some of the workshop
recommendations.
Workshop recommendations
Develop the field of “social perception”
Understand the “internal state” of people as they interact with each other and with the world
Crucial for human robot interaction
Perceptual Robotics – and testbeds for measurement of progress in situated vision research.
Visual Search – intelligent sampling of the visual world
Acquisition and Representation of Visual Commonsense from Observation and Interaction
Vision and Language
Language and vision – How to test the ability to accumulate and integrate knowledge?
VQA Dataset
• Many useful challenges
– Where to look to answer a question?
– How to relate existing detectors, pose estimators, attribute classifiers, etc. to this task?
– How to combine general knowledge with vision?
Workshop recommendations
Structured prediction
Relationship between parts, objects and scenes
The hierarchical structure of human behavior – movement, goals, actions and events
“Explainable” perception. Don’t just classify, explain your answer
Workshop recommendations
Deep learning.
Why/when does it work?
Why are all local minima created equal?
Visual learning with minimal (no) supervision
Developmental learning (NEIL)
Are object proposals
necessarily the answer?
G-CNN – an iterative grid based object detector
Mahyar Najibi and Mohammad Rastegari
CVPR 2016
Object detection
Localization – bounding box, segmentation
masks
Classification
In your camera – sliding window
detection
[Figure: a sliding window scans the image; extracted boxes are fed to a multi-class classifier, producing per-class scores such as horse = 0.9 and person = 0.9 for boxes on the objects, and near-zero scores elsewhere.]
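As a rough illustration of the sliding-window pipeline the figure depicts, here is a minimal sketch; `classifier.score` is a hypothetical per-patch scorer, and the window size, stride, and threshold are illustrative values:

```python
def sliding_window_detect(image, classifier, win=64, stride=16, thresh=0.5):
    """Slide a fixed-size window over the image and score each patch.

    classifier.score(patch) is assumed (hypothetically) to return a dict of
    class-name -> confidence, e.g. {"horse": 0.9, "person": 0.3}.
    """
    H, W = image.shape[:2]
    detections = []
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            patch = image[y:y + win, x:x + win]
            for cls, score in classifier.score(patch).items():
                if score >= thresh:
                    detections.append((cls, score, (x, y, x + win, y + win)))
    return detections
```

Note the cost: one classifier call per window position, and a real system repeats this over scales (an image pyramid), which is exactly why sliding windows are slow.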
Object proposals
Sliding windows are slow – scale, orientation, ...
Object proposals are (learning-based) multi-segmentation algorithms that generate fewer regions for classification (typically boxes).
The consensus is that region proposals are crucial to state-of-the-art detection systems, whether they are given to the network or constructed by the network.
However, localization is poor, so (class-dependent) post-processing is typically employed.
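One common form of such post-processing is greedy non-maximum suppression (NMS); a minimal per-class sketch (the 0.5 IoU threshold is a conventional default, my assumption rather than the slide's):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```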
Object proposals and CNNs
R-CNN – push each proposal through the CNN; slow because the network is run multiple times.
SPP-net [1] computes filter responses only once for each image and pools from them to form features for the proposals (a ROI-pooling sketch follows the references below).
Fast R-CNN [2] builds on this and packs all stages of the system except the region proposal into one CNN.
[Figure: Fast R-CNN architecture]
1. He, Kaiming, et al. “Spatial pyramid pooling in deep convolutional networks for visual recognition.” ECCV 2014, Springer, pp. 346-361.
2. Girshick, Ross. “Fast R-CNN.” ICCV 2015.
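To illustrate the compute-once, pool-per-proposal idea behind SPP-net and Fast R-CNN, here is a simplified single-level ROI max-pooling sketch (a feature-map stride of 16 is typical of VGG16; the exact pooling schemes of the papers differ):

```python
import numpy as np

def roi_pool(feature_map, box, out_size=7, stride=16):
    """Max-pool one proposal's region of a conv feature map to a fixed grid.

    feature_map: (C, H, W) array, computed once per image.
    box: (x1, y1, x2, y2) in image pixel coordinates.
    stride: image-to-feature-map downsampling factor (assumed, typical of VGG16).
    """
    C, H, W = feature_map.shape
    x1, y1, x2, y2 = [int(round(c / stride)) for c in box]
    # Clamp to the feature map and guarantee at least one cell.
    x1, y1 = max(0, min(x1, W - 1)), max(0, min(y1, H - 1))
    x2, y2 = max(x1 + 1, min(x2, W)), max(y1 + 1, min(y2, H))
    region = feature_map[:, y1:y2, x1:x2]
    h, w = region.shape[1:]
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((C, out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))  # max-pool each output cell
    return out
```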
Region Proposal Stage
These methods use an external object proposal stage (e.g., selective search with ~2K proposals/image).
In Fast R-CNN, computing object proposals is the bottleneck, taking around 2 sec/image.
Faster R-CNN [3] increases efficiency by reducing the number of proposed bounding boxes.
Jointly learns proposal generator and features
Fast and accurate
3. Ren, Shaoqing, et al. “Faster R-CNN: Towards real-time object detection with region proposal networks.” NIPS 2015.
G-CNN training network structure
[Figures: G-CNN training – the training set for step 1, and the samples added for step 2.]
G-CNN Detection
[Figure: over successive iterations, the highest-scoring class for a box is “car,” and the car regressor refines the box.]
Iteratively update the position of the initial bounding boxes with the regressor corresponding to the class with the highest score.
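A sketch of that iterative loop; `classify` and `regress` are hypothetical stand-ins for the shared network's classification head and class-specific regression heads:

```python
def g_cnn_detect(image, grid_boxes, classify, regress, steps=3, thresh=0.5):
    """Iterative grid-based detection in the spirit of G-CNN.

    classify(image, box) -> (class_label, confidence)   # hypothetical head
    regress(image, box, class_label) -> refined box     # hypothetical head
    """
    detections = []
    for box in grid_boxes:
        for _ in range(steps):
            cls, conf = classify(image, box)
            # Move the box with the regressor of the highest-scoring class.
            box = regress(image, box, cls)
        if cls != "background" and conf >= thresh:
            detections.append((cls, conf, box))
    return detections
```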
G-CNN structure at detection time
To reduce detection time, the G-CNN network is divided into two parts:
• The global part is called only once for each image.
• The regression part is called S_test times, once for each step.
Experimental Setup
• Experiments are performed on the VOC 2007 and VOC 2012 datasets.
• G-CNN is trained with S=3 steps over an initial grid with three scales [2,5,10] and overlaps [0.9,0.8,0.7] at each scale.
• At test time, a coarser grid with overlaps [0.7,0.5,0.0] is used (around 180 initial boxes); a grid-generation sketch follows this list.
• After 5 iterations, G-CNN achieves the same mAP as Fast R-CNN with around 2K bounding boxes.
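One plausible reading of the grid parameters – “scale” as cells per image side and “overlap” as the fractional overlap between neighboring boxes (my interpretation, not a specification from the paper):

```python
def make_grid(width, height, scales=(2, 5, 10), overlaps=(0.9, 0.8, 0.7)):
    """Generate overlapping grid boxes at several scales."""
    boxes = []
    for n, ov in zip(scales, overlaps):
        bw, bh = width / n, height / n           # box size at this scale
        sx, sy = bw * (1 - ov), bh * (1 - ov)    # stride shrinks as overlap grows
        y = 0.0
        while y + bh <= height + 1e-6:
            x = 0.0
            while x + bw <= width + 1e-6:
                boxes.append((x, y, x + bw, y + bh))
                x += sx
            y += sy
    return boxes
```

Under this reading, the test-time settings above (scales [2,5,10], overlaps [0.7,0.5,0.0]) yield on the order of 200 boxes, consistent with the ~180 quoted.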
[Table: VOC 2012 results using VGG16.]
How effective are the regressors?
[Figure: IoU histogram of the best boxes overlapping ground-truth boxes at each iteration.]
How can a neural network
learn and utilize context?
Mahyar Najibi, Mohammad Rastegari,
Abhinav Gupta, Ali Farhadi – Deep
Saccadic Detectors
Top choices of FRCNN are very accurate
Detection with GTs
| Method | Aeroplane | Bicycle | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow |
|---|---|---|---|---|---|---|---|---|---|---|
| FRCNN SS | 66.4 | 71.6 | 53.8 | 43.3 | 24.7 | 69.2 | 69.7 | 71.5 | 31.1 | 63.4 |
| FRCNN SS+GT | 68.2 | 74.1 | 56 | 50.6 | 31.5 | 72.6 | 72.8 | 73.1 | 34.8 | 63.8 |
| FRCNN GT | 83 | 84.1 | 78.7 | 81.5 | 73.7 | 85.5 | 88 | 83.5 | 69.9 | 75.4 |

| Method | Dining table | Dog | Horse | Motorbike | Person | Potted plant | Sheep | Sofa | Train | TV monitor | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FRCNN SS | 59.8 | 62.2 | 73.1 | 65.9 | 57 | 26 | 52 | 56.4 | 67.8 | 57.7 | 57.1 |
| FRCNN SS+GT | 61.1 | 63.5 | 76.8 | 68.5 | 63.7 | 29.7 | 54.4 | 57.8 | 70.5 | 61 | 60.2 |
| FRCNN GT | 80.2 | 78.1 | 81.9 | 85.1 | 87.7 | 83.2 | 71.7 | 78.5 | 88.8 | 88.5 | 81.3 |
Methods are trained on VOC 2007 trainval; AlexNet is employed as the CNN structure.
FRCNN GT: only GT boxes are used.
FRCNN SS: Fast R-CNN using selective search proposals.
FRCNN SS+GT: GT boxes are added to the SS boxes.
Sequential detection
This suggests a simple strategy for detection (sketched below):
Commit to the most confident detection.
Use it as context for determining the next most confident detection.
And so on.
All integrated into one CNN architecture.
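A sketch of this greedy loop; `score` is a hypothetical stand-in for the network's context-conditioned classifier, and `max_dets` and `thresh` are illustrative stopping criteria:

```python
def sequential_detect(image, proposals, score, max_dets=10, thresh=0.5):
    """Greedy sequential detection: commit to the most confident box,
    then condition the next round of scoring on everything committed so far.

    score(image, box, context) -> (class_label, confidence), where context
    is the list of detections already committed (hypothetical API).
    """
    committed = []
    remaining = list(proposals)
    while remaining and len(committed) < max_dets:
        scored = [(score(image, b, committed), b) for b in remaining]
        (cls, conf), best = max(scored, key=lambda t: t[0][1])
        if conf < thresh:
            break                      # nothing confident enough remains
        committed.append((cls, conf, best))
        remaining.remove(best)
    return committed
```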
Deep Sequential Detection
[Figure: network architecture – the input image passes through convolutional layers and ROI pooling; ROI features flow through linear (fc6) + ReLU and linear (fc7) + ReLU layers into a regressor and a class-based classifier, alongside an active selector and a hidden-state selector (linear h1/h2 + ReLU); classification outputs are concatenated, followed by a max and NMS.]
Datasets
• Pascal VOC 2007: 20 classes, ~10K images
• Pascal VOC 2012: 20 classes, ~15K images
• MSCOCO (2015): 80 classes, ~300K images
[Results figures: precision and recall on VOC 2012 and MSCOCO, and class-based relative improvement. Methods are trained on the train set and evaluated on the validation set; the top 2K selective search proposals are used.]
Scale-dependent pooling – Fan Yang (CVPR 2016)
Goal – detect (even small) objects effectively and efficiently using CNNs + object proposals
[Figure: the two challenges – scale variance and a huge number of proposals.]
Scale-dependent pooling
Pool proposals of different scales from different
conv layers: n-branch structure
Small instances of objects are well represented using
features pooled from lower conv layers
Divide proposals into groups based on their size.
Pool small proposals at lower conv layers and larger ones at higher conv layers (see the sketch below).
Train the entire system end-to-end.
[Figure: small proposals and large proposals routed to different pooling branches.]
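A minimal sketch of the size-based routing; the layer names, per-branch heads, and size thresholds are illustrative assumptions, not values from the paper:

```python
def scale_dependent_pool(features_by_layer, proposals, heads):
    """Route each proposal to the branch matching its size.

    features_by_layer: {"conv3": fmap, "conv4": fmap, "conv5": fmap}
    heads: {"conv3": head, ...} -- hypothetical per-branch pool+classify heads.
    The height thresholds below are illustrative, not from the paper.
    """
    results = []
    for box in proposals:
        h = box[3] - box[1]            # proposal height in pixels
        if h < 64:
            layer = "conv3"            # small objects: pool from a lower layer
        elif h < 128:
            layer = "conv4"
        else:
            layer = "conv5"            # large objects: pool from the top layer
        results.append(heads[layer](features_by_layer[layer], box))
    return results
```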
Experiments
[Tables: KITTI (mAP) and Inner-city (mAP) results.]
Detection as a function of size – KITTI (AP per size group S1-S4 and overall S):

| Method | Inputs | Car S1 | S2 | S3 | S4 | S | Ped. S1 | S2 | S3 | S4 | S | Cyc. S1 | S2 | S3 | S4 | S | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FRCNN+AlexNet | 4 | 52.8 | 60.7 | 75.8 | 55.5 | 61.6 | 19.7 | 47.5 | 88.4 | 24.1 | 61.4 | 42 | 51.6 | 44.9 | 0 | 46.5 | 56.5 |
| FRCNN+VGG16 | 1 (400) | 33.9 | 68.3 | 82.8 | 68.8 | 57.3 | 7.9 | 50.4 | 95.3 | 55.8 | 64.6 | 19 | 63.8 | 66.6 | 0 | 42.3 | 54.7 |
| FRCNN+VGG16 | 1 (500) | 42.2 | 70 | 85.1 | 65.9 | 62.3 | 12.6 | 55.9 | 94.6 | 44.9 | 66.8 | 29.1 | 63.8 | 68.7 | 0 | 48.8 | 59.3 |
| FRCNN+VGG16 | 1 (800) | 47.6 | 70 | 84.8 | 60.5 | 64.5 | 14.7 | 54.5 | 94.5 | 47.2 | 66.4 | 34.9 | 61.2 | 67.4 | 0 | 50.4 | 60.4 |
| FRCNN+VGG16 | 2 | 47.4 | 70.2 | 83.1 | 54.5 | 64.1 | 14.9 | 55.2 | 94.5 | 63.1 | 66.5 | 35.8 | 61.2 | 65.9 | 0 | 50.4 | 60.3 |
| SDP | 1 (400) | 59.1 | 73.8 | 84.7 | 73.6 | 70.7 | 12.6 | 54.8 | 94.9 | 70.7 | 65.7 | 29.3 | 65.6 | 71.7 | 0 | 49.4 | 61.9 |
| SDP | 1 (500) | 64.2 | 74.4 | 86 | 68.4 | 73.7 | 17.3 | 58.4 | 94.9 | 44.8 | 66.9 | 37.5 | 67.3 | 68.6 | 0 | 54.6 | 65.1 |
| SDP | 1 (800) | 65.2 | 73.5 | 86 | 61 | 73.8 | 16.9 | 57.1 | 94.3 | 44.1 | 65.5 | 36.5 | 61.5 | 61.9 | 0 | 49.9 | 63.1 |
| SDP+CRC | 1 (500) | 63.9 | 74.3 | 85.8 | 68.2 | 73.5 | 17.5 | 52 | 93.7 | 45.9 | 65.5 | 35.1 | 65.7 | 69.2 | 0 | 52.9 | 64 |
| SDP+CRC ft | 1 (500) | 63.9 | 74.2 | 85.5 | 62.9 | 73.7 | 17.6 | 50 | 93.4 | 61 | 65.9 | 35.8 | 66.5 | 67.6 | 0 | 53.1 | 64.2 |
Modeling Context between Objects for
Understanding Referring Expressions
Varun Nagaraja, Vlad Morariu, Larry
Davis
ECCV 2016
Man sitting on the left holding a game controller
Woman in the middle sitting on the bed
Man wearing a red jacket and blue jeans sitting on the right
Descriptions that identify a particular object instance
Referring Expressions
Referring expressions rely on attributes and context
Blonde fluffy dog
Tan colored sofa
Giraffe bending down
Person riding a blue motorcycle
Plant on the right side of the TV
Problem Formulation
Sentence: “Girl wearing a red jacket”; Image I
[Figure: the input image/sentence pair and the output region.]
Solution Framework
(J. Mao et al., “Generation and Comprehension of Unambiguous Object Descriptions,” CVPR 2016)
Hypothesize a set of region candidates.
Pick the region candidate with the highest probability of generating the query referring expression.
Baseline Method
Modeling referring expression probability using an LSTM (J. Mao et al., CVPR 2016).
[Figure: the LSTM is unrolled over the words of “Girl wearing a red jacket,” from <BOS> to <EOS>; each step is conditioned on region CNN features, image CNN features, bounding-box features, and the word embedding of the previous word.]
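A sketch of the comprehension step this implies: score each candidate region by the log-probability the LSTM assigns to the expression, then take the argmax. `lstm_word_logprob` is a hypothetical stand-in for one conditioned LSTM step:

```python
def expression_logprob(words, region_feats, lstm_word_logprob):
    """Sum of per-word log-probabilities of the expression given a region.

    lstm_word_logprob(prev_words, word, region_feats) -> log p(word | ...)
    is a hypothetical stand-in for one LSTM step conditioned on region CNN
    features, whole-image features, and bounding-box features.
    """
    tokens = ["<BOS>"] + words + ["<EOS>"]
    return sum(lstm_word_logprob(tokens[:i], tokens[i], region_feats)
               for i in range(1, len(tokens)))

def comprehend(expression, regions, lstm_word_logprob):
    """Pick the region most likely to have generated the expression."""
    words = expression.lower().split()
    return max(regions,
               key=lambda r: expression_logprob(words, r, lstm_word_logprob))
```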
Max-margin Method
The baseline method can be improved by training the model to assign lower probability to negative regions (a loss sketch follows).
[Figure: the referred region vs. negative regions for “Girl wearing a red jacket.”]
(J. Mao et al., CVPR 2016)
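A sketch of a max-margin objective of this kind, reusing `expression_logprob` from the previous sketch; the hinge form and margin value are illustrative assumptions:

```python
def max_margin_loss(words, pos_region, neg_regions,
                    lstm_word_logprob, margin=1.0):
    """Penalize negative regions whose expression log-probability comes
    within `margin` of the referred (positive) region's."""
    pos = expression_logprob(words, pos_region, lstm_word_logprob)
    loss = -pos   # maximum-likelihood term for the referred region
    for neg in neg_regions:
        neg_lp = expression_logprob(words, neg, lstm_word_logprob)
        loss += max(0.0, margin - (pos - neg_lp))   # hinge on the gap
    return loss
```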
Modeling Context
The plant on the right side of the TV
Previous methods do not model locations of contextual
objects
Baseline and max-margin architecture: the LSTM input combines the word embedding with region CNN features, whole-image features, and the region bounding box.
Context model architecture: the LSTM input additionally includes a context region's CNN features and bounding box.
[Figure: the pairwise model is evaluated over multiple region pairs – (region1, region2), (region1, region3), (region1, region4), ...]
Pooling context from multiple pairs of regions.
We can also use noisy-or pooling, which is more robust.
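In noisy-or form (my transcription of the standard noisy-or rule; the paper's exact parameterization may differ), the score of region r for expression w pools over candidate context regions c_1, ..., c_n:

```latex
% Noisy-or pooling of the pairwise model over candidate context regions:
% the expression w is "explained" by region r if at least one context
% region explains it.
s(r, w) \;=\; 1 - \prod_{i=1}^{n} \bigl(1 - p(w \mid r, c_i)\bigr)
```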
Training the Context Model
The challenge is that there are no annotations available for context objects: in “The plant on the right side of the TV,” which region is the TV?
Multiple Instance Learning
So we use a MIL-based technique and use the annotation of the referred object as weak supervision (sketched below).
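A sketch of that weak-supervision setup, combining noisy-or pooling with a bag loss over candidate context regions; `pair_prob` is a hypothetical stand-in for the pairwise LSTM model:

```python
import math

def noisy_or_score(region, context_regions, words, pair_prob):
    """Noisy-or pooling: the expression is explained by `region` if at
    least one candidate context region explains it.

    pair_prob(words, region, context) -> p(words | region, context),
    a hypothetical stand-in for the pairwise LSTM model.
    """
    prod = 1.0
    for c in context_regions:
        prod *= 1.0 - pair_prob(words, region, c)
    return 1.0 - prod

def mil_bag_loss(words, referred, candidates, pair_prob, eps=1e-8):
    """Weak supervision: only the referred object is annotated, so every
    other candidate region acts as a possible context object (one bag)."""
    context = [c for c in candidates if c is not referred]
    return -math.log(noisy_or_score(referred, context, words, pair_prob) + eps)
```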
Experiments
Implemented in Caffe
Region and image features
• VGG16 fc8 layer – fine-tuned
Bounding box features
• scaled <xmin, ymin, xmax, ymax, area>
Word embedding size – 1024
LSTM hidden dimension – 1024
Region candidates – MCG technique
Region filtering process
• Obtain scores from Fast R-CNN and select regions above a threshold
Google RefExp Results (validation partition)

| Method \ Proposals | GT | MCG |
|---|---|---|
| Max Likelihood [Mao et al.] | 57.5 | 42.4 |
| Max Margin [Mao et al.] | 65.7 | 47.8 |
| Ours, Neg. Bag margin | 68.4 | 49.5 |
| Ours, Pos. & Neg. Bag margin | 68.4 | 50.0 |

All results use noisy-or pooling. A detection is considered a true positive if its IoU with ground truth is greater than 0.5.
Google RefExp Results
[Qualitative examples – ground truth vs. image context only vs. noisy-or pooling: “The chair closest to the lady”; “A white truck in front of a yellow truck.”]
UNC RefExp Results (TestB partition, object-centric)

| Method \ Proposals | GT | MCG |
|---|---|---|
| Max Likelihood [Mao et al.] | 70.6 | 50.0 |
| Max Margin [Mao et al.] | 76.3 | 55.1 |
| Ours, Neg. Bag margin | 78.0 | 56.4 |
| Ours, Pos. & Neg. Bag margin | 76.1 | 56.3 |

[Qualitative examples – ground truth vs. image context only vs. noisy-or pooling: “Elephant towards the back”; “Food on the far back on the plate.”]
A few closing observations
Success depends on the region proposal algorithm including candidates for both the correct referred object and the context objects.
This is much more demanding than just requiring a candidate for the referred object.
It is ameliorated somewhat by having the entire image as a candidate context object.
The straightforward extension to include additional context objects (language can be deeply nested) is intractable.
(Methodological) We would like to evaluate performance restricted to “relevant” referring expressions, but it is difficult to specify the correct criteria for selection.
Summary
The intellectual landscape of computer vision has changed dramatically over the past decade.
Many of the “future research directions” identified by the workshop are already well underway.
And there are still huge performance shortfalls on basic problems like detection and recognition (compare MSCOCO vs. VOC).
My favorite future research directions:
Context – sooner or later it has to make a difference
Visual search
Tasking visual surveillance systems – compositional models and video analysis (structured prediction)