Future directions in computer vision
Larry Davis
Computer Vision Laboratory
University of Maryland
College Park, MD, USA
Presentation overview
Future Directions Workshop on Computer Vision
Object detection using CNNs without object proposals
Incorporating context into detection
Scale-dependent pooling to detect small object instances
Resolving referring expressions using context
Summary
Strategic Directions Workshop on “Visual Commonsense,” Nov 12-13 in D.C.
• Sponsored by OSTP in the US
• Poggio, Malik, Zhu, Berg (Alex), Kohli, Hoiem, Grauman, Zitnick, Gupta, Fox, Tellex, Oliva, Scholl (absent), Domingos, Daumé.
• Organized by me, Fei-Fei Li and Devi Parikh
The computer vision landscape
• Breakthroughs in CV (and AI generally) would
clearly be “disruptive.” This has been known
“forever.”
• Our field has more than doubled in size in less than a decade, and there are currently more than 175 startups in computer vision worldwide according to Crunchbase.
• Feeding frenzy in self-driving cars
• So, has the field finally progressed to the point
where real vision problems can be solved?
So, what has changed?
Deep learning
SFM and stereo
Human pose estimation and tracking
Computing infrastructure
Big Data
Crowd sourcing
GPUs
Cloud computing and “free” storage
Open source software
Commercial indicators
Driving aids and autonomous driving - Mobileye
Face recognition under the hood
at social media companies
Image search – Tineye,
Clarifai
Google self-driving cars – 1.5M miles and counting
And what about the next 10 years?
So what do you think the future of the field is?
Here are some of the workshop
recommendations.
Workshop recommendations
Develop the field of “social perception”
Understand the “internal state” of people as they interact with each other and with the world
Crucial for human robot interaction
Perceptual Robotics – and testbeds for measurement of progress in situated vision research.
Visual Search – intelligent sampling of the visual world
Acquisition and Representation of Visual Commonsense from Observation and Interaction
Vision and Language
Language and vision – How to test the ability to accumulate and integrate knowledge?
VQA Dataset
• Many useful challenges
– Where to look to answer a question?
– How to relate existing detectors, pose estimators, attribute classifiers, etc. to this task?
– How to combine general knowledge with vision?
Workshop recommendations
Structured prediction
Relationship between parts, objects and scenes
The hierarchical structure of human behavior – movement, goals, actions and events
“Explainable” perception. Don’t just classify, explain your answer
Workshop recommendations
Deep learning.
Why/when does it work?
Why are all local minima created equal?
Visual learning with minimal (no) supervision
Developmental learning (NEIL)
Are object proposals
necessarily the answer?
G-CNN – an iterative grid based object detector
Mahyar Najibi and Mohammad Rastegari
CVPR 2016
Object detection
Localization – bounding box, segmentation
masks
Classification
In your camera – sliding window
detection
[Figure: a sliding window scans the image; extracted boxes are fed to a multi-class classifier, producing per-class scores such as horse = 0.9 and person = 0.9 for boxes on the objects, and near-zero scores elsewhere.]
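As a rough illustration of the sliding-window pipeline the figure depicts, here is a minimal sketch; `classifier.score` is a hypothetical per-patch scorer, and the window size, stride, and threshold are illustrative values:

```python
def sliding_window_detect(image, classifier, win=64, stride=16, thresh=0.5):
    """Slide a fixed-size window over the image and score each patch.

    classifier.score(patch) is assumed (hypothetically) to return a dict of
    class-name -> confidence, e.g. {"horse": 0.9, "person": 0.3}.
    """
    H, W = image.shape[:2]
    detections = []
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            patch = image[y:y + win, x:x + win]
            for cls, score in classifier.score(patch).items():
                if score >= thresh:
                    detections.append((cls, score, (x, y, x + win, y + win)))
    return detections
```

Note the cost: one classifier call per window position, and a real system repeats this over scales (an image pyramid), which is exactly why sliding windows are slow.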
Object proposals
Sliding windows are slow – scale, orientation, ...
Object proposals are (learning-based) multi-segmentation algorithms that generate fewer regions for classification (typically boxes).
The consensus is that region proposals are crucial to state-of-the-art detection systems, whether they are given to the network or constructed by the network.
However, localization is poor, so (class-dependent) post-processing is typically employed.
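One common form of such post-processing is greedy non-maximum suppression (NMS); a minimal per-class sketch (the 0.5 IoU threshold is a conventional default, my assumption rather than the slide's):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```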
Object proposals and CNNs
R-CNN – push each proposal through the CNN; slow because the network is run multiple times.
SPP-net [1] computes filter responses only once for each image and pools from them to form features for the proposals (a ROI-pooling sketch follows the references below).
Fast R-CNN [2] builds on this and packs all stages of the system except the region proposal into one CNN.
[Figure: Fast R-CNN architecture]
1. He, Kaiming, et al. “Spatial pyramid pooling in deep convolutional networks for visual recognition.” ECCV 2014, Springer, pp. 346-361.
2. Girshick, Ross. “Fast R-CNN.” ICCV 2015.
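To illustrate the compute-once, pool-per-proposal idea behind SPP-net and Fast R-CNN, here is a simplified single-level ROI max-pooling sketch (a feature-map stride of 16 is typical of VGG16; the exact pooling schemes of the papers differ):

```python
import numpy as np

def roi_pool(feature_map, box, out_size=7, stride=16):
    """Max-pool one proposal's region of a conv feature map to a fixed grid.

    feature_map: (C, H, W) array, computed once per image.
    box: (x1, y1, x2, y2) in image pixel coordinates.
    stride: image-to-feature-map downsampling factor (assumed, typical of VGG16).
    """
    C, H, W = feature_map.shape
    x1, y1, x2, y2 = [int(round(c / stride)) for c in box]
    # Clamp to the feature map and guarantee at least one cell.
    x1, y1 = max(0, min(x1, W - 1)), max(0, min(y1, H - 1))
    x2, y2 = max(x1 + 1, min(x2, W)), max(y1 + 1, min(y2, H))
    region = feature_map[:, y1:y2, x1:x2]
    h, w = region.shape[1:]
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((C, out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))  # max-pool each output cell
    return out
```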
Region Proposal Stage
These methods use an external object proposal stage (e.g., selective search with ~2K proposals/image).
In Fast R-CNN, computing object proposals is the bottleneck, taking around 2 sec/image.
Faster R-CNN [3] increases efficiency by reducing the number of proposed bounding boxes.
Jointly learns proposal generator and features
Fast and accurate
3. Ren, Shaoqing, et al. “Faster R-CNN: Towards real-time object detection with region proposal networks.” NIPS 2015.
G-CNN training network structure
[Figures: G-CNN training – the training set for step 1, and the samples added for step 2.]
G-CNN Detection
[Figure: over successive iterations, the highest-scoring class for a box is “car,” and the car regressor refines the box.]
Iteratively update the position of the initial bounding boxes with the regressor corresponding to the class with the highest score.
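A sketch of that iterative loop; `classify` and `regress` are hypothetical stand-ins for the shared network's classification head and class-specific regression heads:

```python
def g_cnn_detect(image, grid_boxes, classify, regress, steps=3, thresh=0.5):
    """Iterative grid-based detection in the spirit of G-CNN.

    classify(image, box) -> (class_label, confidence)   # hypothetical head
    regress(image, box, class_label) -> refined box     # hypothetical head
    """
    detections = []
    for box in grid_boxes:
        for _ in range(steps):
            cls, conf = classify(image, box)
            # Move the box with the regressor of the highest-scoring class.
            box = regress(image, box, cls)
        if cls != "background" and conf >= thresh:
            detections.append((cls, conf, box))
    return detections
```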
G-CNN structure at detection time
To reduce detection time, the G-CNN network is divided into two parts:
• The global part is called only once for each image.
• The regression part is called S_test times, once for each step.
Experimental Setup
• Experiments are performed on the VOC 2007 and VOC 2012 datasets.
• G-CNN is trained with S=3 steps over an initial grid with three scales [2,5,10] and overlaps [0.9,0.8,0.7] at each scale.
• At test time, a coarser grid with overlaps [0.7,0.5,0.0] is used (around 180 initial boxes); a grid-generation sketch follows this list.
• After 5 iterations, G-CNN achieves the same mAP as Fast R-CNN with around 2K bounding boxes.
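One plausible reading of the grid parameters – “scale” as cells per image side and “overlap” as the fractional overlap between neighboring boxes (my interpretation, not a specification from the paper):

```python
def make_grid(width, height, scales=(2, 5, 10), overlaps=(0.9, 0.8, 0.7)):
    """Generate overlapping grid boxes at several scales."""
    boxes = []
    for n, ov in zip(scales, overlaps):
        bw, bh = width / n, height / n           # box size at this scale
        sx, sy = bw * (1 - ov), bh * (1 - ov)    # stride shrinks as overlap grows
        y = 0.0
        while y + bh <= height + 1e-6:
            x = 0.0
            while x + bw <= width + 1e-6:
                boxes.append((x, y, x + bw, y + bh))
                x += sx
            y += sy
    return boxes
```

Under this reading, the test-time settings above (scales [2,5,10], overlaps [0.7,0.5,0.0]) yield on the order of 200 boxes, consistent with the ~180 quoted.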
[Table: VOC 2012 results using VGG16.]
How effective are the regressors?
[Figure: IoU histogram of the best boxes overlapping ground-truth boxes at each iteration.]
How can a neural network
learn and utilize context?
Mahyar Najibi, Mohammad Rastegari,
Abhinav Gupta, Ali Farhadi – Deep
Saccadic Detectors
Top choices of FRCNN are very accurate
Detection with GTs
| Method | Aeroplane | Bicycle | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow |
|---|---|---|---|---|---|---|---|---|---|---|
| FRCNN SS | 66.4 | 71.6 | 53.8 | 43.3 | 24.7 | 69.2 | 69.7 | 71.5 | 31.1 | 63.4 |
| FRCNN SS+GT | 68.2 | 74.1 | 56 | 50.6 | 31.5 | 72.6 | 72.8 | 73.1 | 34.8 | 63.8 |
| FRCNN GT | 83 | 84.1 | 78.7 | 81.5 | 73.7 | 85.5 | 88 | 83.5 | 69.9 | 75.4 |

| Method | Dining table | Dog | Horse | Motorbike | Person | Potted plant | Sheep | Sofa | Train | TV monitor | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FRCNN SS | 59.8 | 62.2 | 73.1 | 65.9 | 57 | 26 | 52 | 56.4 | 67.8 | 57.7 | 57.1 |
| FRCNN SS+GT | 61.1 | 63.5 | 76.8 | 68.5 | 63.7 | 29.7 | 54.4 | 57.8 | 70.5 | 61 | 60.2 |
| FRCNN GT | 80.2 | 78.1 | 81.9 | 85.1 | 87.7 | 83.2 | 71.7 | 78.5 | 88.8 | 88.5 | 81.3 |
Methods are trained on VOC 2007 trainval; AlexNet is employed as the CNN structure.
FRCNN GT: only GT boxes are used.
FRCNN SS: Fast R-CNN using selective search proposals.
FRCNN SS+GT: GT boxes are added to the SS boxes.
Sequential detection
This suggests a simple strategy for detection (sketched below):
Commit to the most confident detection.
Use it as context for determining the next most confident detection.
And so on.
All integrated into one CNN architecture.
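A sketch of this greedy loop; `score` is a hypothetical stand-in for the network's context-conditioned classifier, and `max_dets` and `thresh` are illustrative stopping criteria:

```python
def sequential_detect(image, proposals, score, max_dets=10, thresh=0.5):
    """Greedy sequential detection: commit to the most confident box,
    then condition the next round of scoring on everything committed so far.

    score(image, box, context) -> (class_label, confidence), where context
    is the list of detections already committed (hypothetical API).
    """
    committed = []
    remaining = list(proposals)
    while remaining and len(committed) < max_dets:
        scored = [(score(image, b, committed), b) for b in remaining]
        (cls, conf), best = max(scored, key=lambda t: t[0][1])
        if conf < thresh:
            break                      # nothing confident enough remains
        committed.append((cls, conf, best))
        remaining.remove(best)
    return committed
```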
Deep Sequential Detection
[Figure: network architecture – the input image passes through convolutional layers and ROI pooling; ROI features flow through linear (fc6) + ReLU and linear (fc7) + ReLU layers into a regressor and a class-based classifier, alongside an active selector and a hidden-state selector (linear h1/h2 + ReLU); classification outputs are concatenated, followed by a max and NMS.]
Datasets
• Pascal VOC 2007: 20 classes, ~10K images
• Pascal VOC 2012: 20 classes, ~15K images
• MSCOCO (2015): 80 classes, ~300K images
[Results figures: precision and recall on VOC 2012 and MSCOCO, and class-based relative improvement. Methods are trained on the train set and evaluated on the validation set; the top 2K selective search proposals are used.]
Scale-dependent pooling – Fan Yang (CVPR 2016)
Goal – detect (even small) objects effectively and efficiently using CNNs + object proposals
[Figure: the two challenges – scale variance and a huge number of proposals.]
Scale-dependent pooling
Pool proposals of different scales from different
conv layers: n-branch structure
Small instances of objects are well represented using
features pooled from lower conv layers
Divide proposals into groups based on their size.
Pool small proposals at lower conv layers and larger ones at higher conv layers (see the sketch below).
Train the entire system end-to-end.
[Figure: small proposals and large proposals routed to different pooling branches.]
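A minimal sketch of the size-based routing; the layer names, per-branch heads, and size thresholds are illustrative assumptions, not values from the paper:

```python
def scale_dependent_pool(features_by_layer, proposals, heads):
    """Route each proposal to the branch matching its size.

    features_by_layer: {"conv3": fmap, "conv4": fmap, "conv5": fmap}
    heads: {"conv3": head, ...} -- hypothetical per-branch pool+classify heads.
    The height thresholds below are illustrative, not from the paper.
    """
    results = []
    for box in proposals:
        h = box[3] - box[1]            # proposal height in pixels
        if h < 64:
            layer = "conv3"            # small objects: pool from a lower layer
        elif h < 128:
            layer = "conv4"
        else:
            layer = "conv5"            # large objects: pool from the top layer
        results.append(heads[layer](features_by_layer[layer], box))
    return results
```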
Experiments
[Tables: KITTI (mAP) and Inner-city (mAP) results.]
Detection as a function of size – KITTI (AP per size group S1-S4 and overall S):

| Method | Inputs | Car S1 | S2 | S3 | S4 | S | Ped. S1 | S2 | S3 | S4 | S | Cyc. S1 | S2 | S3 | S4 | S | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FRCNN+AlexNet | 4 | 52.8 | 60.7 | 75.8 | 55.5 | 61.6 | 19.7 | 47.5 | 88.4 | 24.1 | 61.4 | 42 | 51.6 | 44.9 | 0 | 46.5 | 56.5 |
| FRCNN+VGG16 | 1 (400) | 33.9 | 68.3 | 82.8 | 68.8 | 57.3 | 7.9 | 50.4 | 95.3 | 55.8 | 64.6 | 19 | 63.8 | 66.6 | 0 | 42.3 | 54.7 |
| FRCNN+VGG16 | 1 (500) | 42.2 | 70 | 85.1 | 65.9 | 62.3 | 12.6 | 55.9 | 94.6 | 44.9 | 66.8 | 29.1 | 63.8 | 68.7 | 0 | 48.8 | 59.3 |
| FRCNN+VGG16 | 1 (800) | 47.6 | 70 | 84.8 | 60.5 | 64.5 | 14.7 | 54.5 | 94.5 | 47.2 | 66.4 | 34.9 | 61.2 | 67.4 | 0 | 50.4 | 60.4 |
| FRCNN+VGG16 | 2 | 47.4 | 70.2 | 83.1 | 54.5 | 64.1 | 14.9 | 55.2 | 94.5 | 63.1 | 66.5 | 35.8 | 61.2 | 65.9 | 0 | 50.4 | 60.3 |
| SDP | 1 (400) | 59.1 | 73.8 | 84.7 | 73.6 | 70.7 | 12.6 | 54.8 | 94.9 | 70.7 | 65.7 | 29.3 | 65.6 | 71.7 | 0 | 49.4 | 61.9 |
| SDP | 1 (500) | 64.2 | 74.4 | 86 | 68.4 | 73.7 | 17.3 | 58.4 | 94.9 | 44.8 | 66.9 | 37.5 | 67.3 | 68.6 | 0 | 54.6 | 65.1 |
| SDP | 1 (800) | 65.2 | 73.5 | 86 | 61 | 73.8 | 16.9 | 57.1 | 94.3 | 44.1 | 65.5 | 36.5 | 61.5 | 61.9 | 0 | 49.9 | 63.1 |
| SDP+CRC | 1 (500) | 63.9 | 74.3 | 85.8 | 68.2 | 73.5 | 17.5 | 52 | 93.7 | 45.9 | 65.5 | 35.1 | 65.7 | 69.2 | 0 | 52.9 | 64 |
| SDP+CRC ft | 1 (500) | 63.9 | 74.2 | 85.5 | 62.9 | 73.7 | 17.6 | 50 | 93.4 | 61 | 65.9 | 35.8 | 66.5 | 67.6 | 0 | 53.1 | 64.2 |
Modeling Context between Objects for
Understanding Referring Expressions
Varun Nagaraja, Vlad Morariu, Larry
Davis
ECCV 2016
Man sitting on the left holding a game controller
Woman in the middle sitting on the bed
Man wearing a red jacket and blue jeans sitting on the right
Descriptions that identify a particular object instance
Referring Expressions
Referring expressions rely on attributes and context
Blonde fluffy dog
Tan colored sofa
Giraffe bending down
Person riding a blue motorcycle
Plant on the right side of the TV
Problem Formulation
Sentence: “Girl wearing a red jacket”; Image I
[Figure: the input image/sentence pair and the output region.]
Solution Framework
(J. Mao et al., “Generation and Comprehension of Unambiguous Object Descriptions,” CVPR 2016)
Hypothesize a set of region candidates.
Pick the region candidate with the highest probability of generating the query referring expression.
Baseline Method
Modeling referring expression probability using an LSTM (J. Mao et al., CVPR 2016).
[Figure: the LSTM is unrolled over the words of “Girl wearing a red jacket,” from <BOS> to <EOS>; each step is conditioned on region CNN features, image CNN features, bounding-box features, and the word embedding of the previous word.]
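A sketch of the comprehension step this implies: score each candidate region by the log-probability the LSTM assigns to the expression, then take the argmax. `lstm_word_logprob` is a hypothetical stand-in for one conditioned LSTM step:

```python
def expression_logprob(words, region_feats, lstm_word_logprob):
    """Sum of per-word log-probabilities of the expression given a region.

    lstm_word_logprob(prev_words, word, region_feats) -> log p(word | ...)
    is a hypothetical stand-in for one LSTM step conditioned on region CNN
    features, whole-image features, and bounding-box features.
    """
    tokens = ["<BOS>"] + words + ["<EOS>"]
    return sum(lstm_word_logprob(tokens[:i], tokens[i], region_feats)
               for i in range(1, len(tokens)))

def comprehend(expression, regions, lstm_word_logprob):
    """Pick the region most likely to have generated the expression."""
    words = expression.lower().split()
    return max(regions,
               key=lambda r: expression_logprob(words, r, lstm_word_logprob))
```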
Max-margin Method
The baseline method can be improved by training the model to assign lower probability to negative regions (a loss sketch follows).
[Figure: the referred region vs. negative regions for “Girl wearing a red jacket.”]
(J. Mao et al., CVPR 2016)
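A sketch of a max-margin objective of this kind, reusing `expression_logprob` from the previous sketch; the hinge form and margin value are illustrative assumptions:

```python
def max_margin_loss(words, pos_region, neg_regions,
                    lstm_word_logprob, margin=1.0):
    """Penalize negative regions whose expression log-probability comes
    within `margin` of the referred (positive) region's."""
    pos = expression_logprob(words, pos_region, lstm_word_logprob)
    loss = -pos   # maximum-likelihood term for the referred region
    for neg in neg_regions:
        neg_lp = expression_logprob(words, neg, lstm_word_logprob)
        loss += max(0.0, margin - (pos - neg_lp))   # hinge on the gap
    return loss
```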
Modeling Context
The plant on the right side of the TV
Previous methods do not model locations of contextual
objects
Baseline and max-margin architecture: the LSTM input combines the word embedding with region CNN features, whole-image features, and the region bounding box.
Context model architecture: the LSTM input additionally includes a context region's CNN features and bounding box.
[Figure: the pairwise model is evaluated over multiple region pairs – (region1, region2), (region1, region3), (region1, region4), ...]
Pooling context from multiple pairs of regions.
We can also use noisy-or pooling, which is more robust.
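In noisy-or form (my transcription of the standard noisy-or rule; the paper's exact parameterization may differ), the score of region r for expression w pools over candidate context regions c_1, ..., c_n:

```latex
% Noisy-or pooling of the pairwise model over candidate context regions:
% the expression w is "explained" by region r if at least one context
% region explains it.
s(r, w) \;=\; 1 - \prod_{i=1}^{n} \bigl(1 - p(w \mid r, c_i)\bigr)
```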
Training the Context Model
The challenge is that there are no annotations available for context objects: in “The plant on the right side of the TV,” which region is the TV?
Multiple Instance Learning
So we use a MIL-based technique and use the annotation of the referred object as weak supervision (sketched below).
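A sketch of that weak-supervision setup, combining noisy-or pooling with a bag loss over candidate context regions; `pair_prob` is a hypothetical stand-in for the pairwise LSTM model:

```python
import math

def noisy_or_score(region, context_regions, words, pair_prob):
    """Noisy-or pooling: the expression is explained by `region` if at
    least one candidate context region explains it.

    pair_prob(words, region, context) -> p(words | region, context),
    a hypothetical stand-in for the pairwise LSTM model.
    """
    prod = 1.0
    for c in context_regions:
        prod *= 1.0 - pair_prob(words, region, c)
    return 1.0 - prod

def mil_bag_loss(words, referred, candidates, pair_prob, eps=1e-8):
    """Weak supervision: only the referred object is annotated, so every
    other candidate region acts as a possible context object (one bag)."""
    context = [c for c in candidates if c is not referred]
    return -math.log(noisy_or_score(referred, context, words, pair_prob) + eps)
```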
Experiments
Implemented in Caffe
Region and image features
• VGG16 fc8 layer – fine-tuned
Bounding box features
• scaled <xmin, ymin, xmax, ymax, area>
Word embedding size – 1024
LSTM hidden dimension – 1024
Region candidates – MCG technique
Region filtering process
• Obtain scores from Fast R-CNN and select regions above a threshold
Google RefExp Results (validation partition)

| Method \ Proposals | GT | MCG |
|---|---|---|
| Max Likelihood [Mao et al.] | 57.5 | 42.4 |
| Max Margin [Mao et al.] | 65.7 | 47.8 |
| Ours, Neg. Bag margin | 68.4 | 49.5 |
| Ours, Pos. & Neg. Bag margin | 68.4 | 50.0 |

All results use noisy-or pooling. A detection is considered a true positive if its IoU with ground truth is greater than 0.5.
Google RefExp Results
[Qualitative examples – ground truth vs. image context only vs. noisy-or pooling: “The chair closest to the lady”; “A white truck in front of a yellow truck.”]
UNC RefExp Results (TestB partition, object-centric)

| Method \ Proposals | GT | MCG |
|---|---|---|
| Max Likelihood [Mao et al.] | 70.6 | 50.0 |
| Max Margin [Mao et al.] | 76.3 | 55.1 |
| Ours, Neg. Bag margin | 78.0 | 56.4 |
| Ours, Pos. & Neg. Bag margin | 76.1 | 56.3 |

[Qualitative examples – ground truth vs. image context only vs. noisy-or pooling: “Elephant towards the back”; “Food on the far back on the plate.”]
A few closing observations
Success depends on the region proposal algorithm including candidates for both the correct referred object and the context objects.
This is much more demanding than just requiring a candidate for the referred object.
It is ameliorated somewhat by having the entire image as a candidate context object.
The straightforward extension to include additional context objects (language can be deeply nested) is intractable.
(Methodological) We would like to evaluate performance restricted to “relevant” referring expressions, but it is difficult to specify the correct criteria for selection.
Summary
The intellectual landscape of computer vision has changed dramatically over the past decade.
Many of the “future research directions” identified by the workshop are already well underway.
And there are still huge performance shortfalls on basic problems like detection and recognition (compare MSCOCO vs. VOC).
My favorite future research directions:
Context – sooner or later it has to make a difference
Visual search
Tasking visual surveillance systems – compositional models and video analysis (structured prediction)