R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS
JIFENG DAI (MICROSOFT RESEARCH), YI LI (TSINGHUA UNIVERSITY), KAIMING HE (MICROSOFT RESEARCH), JIAN SUN (MICROSOFT RESEARCH)
SPEED/ACCURACY TRADE-OFFS FOR MODERN CONVOLUTIONAL OBJECT DETECTORS
JONATHAN HUANG, VIVEK RATHOD, CHEN SUN, MENGLONG ZHU, ANOOP KORATTIKARA, ALIREZA FATHI, IAN FISCHER, ZBIGNIEW WOJNA, YANG SONG, SERGIO GUADARRAMA, KEVIN MURPHY
Gilad Uziel
Netzah Calamaro
Deep Learning Seminar
Tel-Aviv university
Instructor: Dr. Raja Giryes
Introduction
There are two families of methods for object detection:
region-based (two stages)
single-shot (one stage)
R-FCN is a hybrid of both:
Uses a Region Proposal Network (RPN)
Works on the entire image simultaneously
[Figure: feature maps with RoIs divided into k × k bins, with a numeric pooling example]
The Main Idea
R-FCN Architecture
Bounding Box
Aside from the $k^2(c + 1)$-d convolutional layer, we append a sibling $4k^2$-d convolutional layer for bounding box regression.
Position-sensitive RoI pooling is performed on this bank of $4k^2$ maps, producing a $4k^2$-d vector for each RoI.
This vector is then aggregated into a 4-d vector by average voting.
The 4-d vector parameterizes a bounding box as $t = (t_x, t_y, t_w, t_h)$.
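The average-voting step can be sketched in a few lines of NumPy. This is only an illustration: the function name `vote_bbox` and the coordinate-major layout of the pooled vector are assumptions, not details given on the slide.

```python
import numpy as np

def vote_bbox(pooled, k):
    """Aggregate a 4*k*k-d pooled vector into a 4-d box vector.

    Assumed layout (hypothetical): the first k*k entries belong to t_x,
    the next k*k to t_y, then t_w, then t_h.
    """
    pooled = np.asarray(pooled, dtype=float)
    # Average voting: each coordinate is the mean over its k*k bins.
    return pooled.reshape(4, k * k).mean(axis=1)

k = 3
pooled = np.arange(4 * k * k, dtype=float)  # toy values 0..35
t = vote_bbox(pooled, k)                    # (t_x, t_y, t_w, t_h)
```

With the toy input above, each coordinate is simply the mean of nine consecutive integers.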
Visualization of R-FCN for the person category when an RoI
does not correctly overlap the object (k × k = 3 × 3).
Visualization
Visualization of R-FCN for the person category when an RoI
does correctly overlap the object (k × k = 3 × 3).
Visualization
Loss Function
$L(s, t_{x,y,w,h}) = L_{cls}(s_{c^*}) + \lambda [c^* > 0] L_{reg}(t, t^*)$
$L_{cls}(s_{c^*}) = -\log(s_{c^*})$ is the cross-entropy loss on the softmax scores.
$L_{reg}(t, t^*)$ is the smooth L1 loss.
$[c^* > 0]$ is an indicator that equals 1 if the argument is true and 0 otherwise.
The balance weight is set to $\lambda = 1$.
$c^*$ is the RoI's ground-truth label ($c^* = 0$ means background).
$t^*$ is the RoI's ground-truth box.
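A minimal per-RoI sketch of this loss in NumPy, assuming the standard smooth L1 definition (0.5 x² for |x| < 1, |x| - 0.5 otherwise) and raw scores that are passed through a softmax. The function names are illustrative, not from the paper's code.

```python
import numpy as np

def smooth_l1(x):
    """Standard smooth L1: quadratic near zero, linear elsewhere."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def rfcn_roi_loss(scores, t, c_star, t_star, lam=1.0):
    """Loss for one RoI.

    scores : raw scores for the C+1 categories (index 0 = background)
    t      : predicted box vector (t_x, t_y, t_w, t_h)
    c_star : ground-truth label, 0 means background
    t_star : ground-truth box vector
    """
    scores = np.asarray(scores, dtype=float)
    z = scores - scores.max()                 # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    l_cls = -np.log(probs[c_star])            # cross-entropy on softmax score
    diff = np.asarray(t, dtype=float) - np.asarray(t_star, dtype=float)
    l_reg = smooth_l1(diff).sum()
    # The regression term is switched off for background RoIs.
    return l_cls + lam * float(c_star > 0) * l_reg

bg_loss = rfcn_roi_loss([0.0, 0.0, 0.0], [1, 1, 1, 1], 0, [0, 0, 0, 0])
fg_loss = rfcn_roi_loss([0.0, 0.0, 0.0], [0.5, 2.0, 0, 0], 1, [0, 0, 0, 0])
```

For the background RoI only the classification term contributes; for the foreground RoI the smooth L1 term is added on top.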
Backpropagation
RoIs proposed by the RPN are labeled as positive examples if they have an intersection-over-union (IoU) overlap of at least 0.5 with a ground-truth box, and as negative otherwise.
All RoIs (positive and negative) are sorted by loss, and backpropagation is performed only on the selected B = 128 RoIs with the highest loss.
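The labeling rule and the selection of the highest-loss RoIs can be sketched as follows. The helper names and the (x1, y1, x2, y2) box format are assumptions made for illustration.

```python
import numpy as np

def iou(box, gt):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_box + area_gt - inter
    return inter / union if union > 0 else 0.0

def is_positive(roi, gt_boxes, thresh=0.5):
    """An RoI is positive if it overlaps any ground-truth box with IoU >= thresh."""
    return any(iou(roi, g) >= thresh for g in gt_boxes)

def select_hard_examples(losses, B=128):
    """Return the indices of the B RoIs with the highest loss."""
    losses = np.asarray(losses, dtype=float)
    order = np.argsort(-losses, kind="stable")  # descending by loss
    return order[:B]

hard = select_hard_examples([0.1, 2.0, 0.5, 1.5], B=2)
```

Only the RoIs whose indices are returned by `select_hard_examples` would contribute gradients in the backward pass.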
Backbone Architecture
This incarnation of R-FCN is based on ResNet-101.
ResNet-101 has 100 convolutional layers followed by global average pooling and a 1000-class fc layer.
We remove the average pooling layer and the fc layer and use only the convolutional layers to compute feature maps.
The last convolutional block in ResNet-101 is 2048-d, and we attach a randomly initialized 1024-d 1 × 1 convolutional layer to reduce the dimension.
Then we apply the $k^2(c + 1)$-channel convolutional layer to generate score maps.
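As a quick sanity check on the channel counts, the sketch below evaluates k²(c + 1) for the score maps and 4k² for the sibling regression layer. It assumes k = 7 (the value on the Results slide) and c = 20 object categories (the PASCAL VOC class count); the function name is hypothetical.

```python
def rfcn_head_channels(k, num_classes):
    """Channel counts of the two position-sensitive heads."""
    cls_channels = k * k * (num_classes + 1)  # score maps, +1 for background
    reg_channels = 4 * k * k                  # bounding box regression maps
    return cls_channels, reg_channels

cls_ch, reg_ch = rfcn_head_channels(k=7, num_classes=20)
print(cls_ch, reg_ch)
```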
Results
Number of proposals: 300
k × k = 7 × 7
83.6% mAP on PASCAL VOC 2007
170 ms test time per image
Speed/accuracy trade-offs for modern convolutional
object detectors
Comparative study of R-FCN, SSD and Faster R-CNN
Motivation of the 2nd paper
Most works discuss only accuracy. This work also focuses on memory and speed, and on the accuracy/speed/memory trade-off.
The goal is to select the right algorithm for a specific purpose and to optimize the parameters within that algorithm:
1. Mobile devices (cellular) require a low memory footprint
2. Autonomous cars require real-time performance, i.e. both speed and accuracy
3. Server-side applications such as Google/Facebook require accuracy, but throughput is still a bottleneck
4. Contests require accuracy
To compare apples with apples, an objective, comprehensive test bench that can show the differences is required, so one had to be developed.
R-FCN performance (Vatsal Srivastava)
Faster R-CNN performance (Kaiming He)
SSD (Summit Bhala)
SSD + YOLO
Location of videos:
YOLO vs. SSD:
https://www.youtube.com/watch?v=8QL69cAj2kU
R-FCN:
https://www.youtube.com/watch?v=JLjHxuZOeaQ
Faster R-CNN:
https://www.youtube.com/watch?v=WZmSMkK9VuA
Motivation of the 2nd paper
There are sweet spots on the trade-off graph, beyond which investing a lot more GPU time yields only a small accuracy gain. Viewed the other way around, one may invest much less GPU time with little accuracy loss.
Results
SSD Faster R-CNN R-FCN
Comparative architecture reminder
EXPERIMENTAL PLATFORM
Six feature extractors are used with all detectors:
VGG-16
ResNet-101
Inception V2 (semantic segmentation)
Inception V3
Inception ResNet
MobileNet
Which platforms were used
EXPERIMENTAL PLATFORM – ADDITIONAL DETAILS
Loss function configuration
Matching anchors to ground-truth instances
Argmax vs. bipartite matching
Input size configuration
Computation platform: Intel Xeon E5-1650 (6 cores)
Nvidia GTX Titan GPU: 4 times more computation than a home gaming card
w × h: size of the rectangle
L1-norm location loss function
Variable-resolution input size: ensures a wide range
What is the loss function
Training and hyper-parameter tuning
Asynchronous SGD: when a worker finishes computing the gradient of its mini-batch, the gradient is added to the total gradient without waiting for the others, and the worker continues training. This may cause delayed (stale) SGD updates but is faster.
Results:
[Figure: Mean Average Precision (mAP) vs. GPU time (ms)]
R-FCN and SSD are faster than Faster R-CNN on average
Faster R-CNN is more accurate
Sweet spot: a point beyond which a lot of speed must be sacrificed to gain a little more accuracy.
Another way to view the sweet spot: little GPU time is invested without sacrificing too much accuracy.
Larger feature extractors are slower
Mean Average Precision (mAP) conclusions
What is inception?
[Figure: Mean Average Precision (mAP) vs. GPU time (ms), colored by feature extractor]
Larger feature extractors are slower
The colored clusters show the relation to the feature extractor. The architectures (R-FCN, Faster R-CNN, SSD) were implemented using various feature extractors,
which makes the test bench variable.
MobileNet and Inception V2 are faster on average than Inception ResNet V2, because they are smaller feature extractors
Sweet spot: a point beyond which a lot of speed must be sacrificed to gain a little more accuracy
Mean Average Precision (mAP)
Colored by feature extractor - conclusions
Memory vs. GPU time for different feature extractors
[Figure: memory vs. GPU time (ms)]
Larger feature extractors are both slower and demand more memory. The two go together: larger means more memory and, in some cases, more GPU time.
Inception ResNet V2 consumes more memory and GPU time than the others
MobileNet with SSD is the fastest and consumes the least GPU time and memory
Sweet spots: R-FCN with ResNet-101, and Faster R-CNN with ResNet-101 using only 50 proposals
R-FCN with ResNet-101 runs at 100 ms GPU time with high accuracy and not too high memory consumption
Memory vs. GPU for different feature extractors - conclusions
mAP for each object size by meta-architecture and feature extractor
[Figure: bar chart; y-axis: accuracy]
How to read the bar graph: each feature extractor is partitioned by object size (small, medium, large), and the 3 meta-architectures are drawn for each feature extractor.
SSD has (very) poor performance on small objects but is competitive with Faster R-CNN and R-FCN on larger objects, even outperforming them with lightweight feature extractors.
For small objects, higher resolution may compensate for their size in terms of accuracy.
mAP for each object size by meta-architecture and feature extractor
mAP on small objects vs mAP on large objects colored by input
resolution
SSD
High-resolution models lead to significantly better mAP on small objects (roughly 2×) and somewhat better mAP on large objects.
In SSD, higher resolution improves accuracy on large objects but is less successful at improving accuracy on small objects.
R-FCN, Faster R-CNN, SSD: strong performance on small objects implies strong performance on large objects.
The opposite does not hold: SSD performs well on large objects but poorly on small objects.
mAP on small objects vs mAP on large objects colored by input
resolution
mAP vs Top-1 accuracy of the feature extractor on imagenet
There is an overall correlation between classification (= feature extractor) accuracy and detection (= overall) accuracy.
This correlation appears to be significant only for Faster R-CNN and R-FCN.
The performance of SSD appears to be less reliant on its feature extractor's classification accuracy.
SSD is unable to fully leverage the power of the ResNet and "Inception ResNet" feature extractors.
Using cheaper feature extractors does not hurt SSD too much.
On large objects it is competitive with Faster R-CNN and R-FCN.
mAP vs Top-1 accuracy of the feature extractor on imagenet
Effect of proposing fewer regions in (a) Faster R-CNN and (b) R-FCN
on mAP (solid line) and GPU time (dash line)
[Figure panels: (a) Faster R-CNN, (b) R-FCN]
The effect of adjusting the number of proposals on performance
Figure (a): Faster R-CNN with the Inception ResNet feature extractor and 50 proposals obtains 96% of the accuracy of 300 proposals while reducing GPU runtime by a factor of 3.
Figure (a): with Inception ResNet, which has 35.4% mAP with 300 proposals, accuracy remains fairly high (29% mAP) with only 10 proposals. The sweet spot is around 50 proposals.
Figure (a): similar trade-offs hold for other feature extractors, although they are less pronounced.
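The numbers on this slide can be put side by side with a little arithmetic. In the sketch below, 35.4 and 29 are the mAP values quoted above; the 50-proposal mAP is not stated on the slide and is only derived here from the "96%" figure, so treat it as an estimate.

```python
map_300 = 35.4   # % mAP with 300 proposals (from the slide)
map_10 = 29.0    # % mAP with 10 proposals (from the slide)

# Fraction of the 300-proposal accuracy that survives with 10 proposals.
retained_10 = map_10 / map_300

# Estimated mAP at 50 proposals, derived from the quoted 96% retention.
map_50_est = 0.96 * map_300

print(f"10 proposals keep {retained_10:.1%} of the accuracy")
print(f"50 proposals give roughly {map_50_est:.1f}% mAP")
```

So 10 proposals keep a bit over four fifths of the accuracy, while 50 proposals lose only about 1.4 mAP points for a threefold reduction in GPU time.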
Effect of proposing fewer regions in (a) Faster R-CNN and (b) R-FCN
on mAP (solid line) and GPU time (dash line)
Figure (b): savings from using fewer proposals in the R-FCN setting are minimal, since the box classifier (the expensive part) is run only once per image.
Figure (b): at 100 proposals, the speed and accuracy of Faster R-CNN models with ResNet become comparable, in both mAP and GPU time, to those of equivalent R-FCN models that use 300 proposals.
Faster R-CNN: the number of proposals has a dramatic effect on GPU time and a less significant effect on accuracy.
R-FCN: the number of proposals has a mild effect on both GPU time and accuracy.
Effect of proposing fewer regions in (a) Faster R-CNN and (b) R-FCN
on mAP (solid line) and GPU time (dash line)
State-of-the-art detection with MS COCO dataset
What is mAP, AR?
What is multi cropping inference?
Facts:
Run on COCO dataset.
Average precision is averaged over IoU thresholds from 50% to 95%
Table 3: the test model is an ensemble of the 5 best-performing Faster R-CNN models with ResNet feature extractors
Table 2 results:
The model's average accuracy is 41.3% mAP, better than the previous best result of 37.1%
Relative improvement of ~60% on small objects over the previous result
Interpretation of tables 2,3,4
Thank you –
Questions?
Modifies the proposal generator to directly output class probabilities (instead of objectness). 1) No separate proposal generator as in R-CNN. 2) Direct link from the feature extractor to the detection generator.
Pros: very fast (suitable for mobile applications and autonomous vehicles)
Cons: not good at detecting smaller objects (YOLO), but using feature maps from different layers can help a lot (SSD)
Start with the feature extractor, continue with the proposal generator, then the box classifier
Feature extractor: 5 convolution layers
Proposal generator: inserted after conv5 of the feature extractor; output = bounding boxes and objectness
Box classifier: input = crops of conv5 at the bounding boxes, with RoI pooling to get feature maps of fixed size; pass through = fc*; output = class probabilities
Pro: best accuracy
Con: GPU runtime depends on the number of proposals
Reminder:
Translation variance in detection: the classification network should output the same thing if the cat moves from the top left to the bottom right (object detection), but the Region Proposal Network (object location) should output something different.
Box classifier is given the crop of fc6 instead of conv5. Computation for each proposal is reduced.
New position-sensitive score maps: shape = (k * k * (c + 1), h, w). This encodes the position into the channel dimension.
New position-sensitive RoI pooling: input = (k * k * (c + 1), roi_h, roi_w); pool = (c + 1, k, k); output = c + 1. In other words, the top-left bin pools only from some of the filters.
Classifier: input = feature maps
Pro: a variation of R-FCN (TA-FCN) is the best instance segmentation architecture
Pro: fast and pretty accurate
Cons: less accurate than Faster R-CNN
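The shapes listed for position-sensitive RoI pooling can be traced in a small NumPy sketch. It is a simplified illustration under stated assumptions: channels are laid out bin-major (channel = bin index × (c + 1) + category), RoI coordinates are in feature-map cells with an exclusive end, and both the per-bin pooling and the final vote are averages. The function name is hypothetical.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k, num_classes):
    """score_maps: (k*k*(c+1), H, W); roi: (x1, y1, x2, y2). Returns (c+1,) scores."""
    n_cat = num_classes + 1
    _, h, w = score_maps.shape
    # Assumed layout: one group of (c+1) maps per spatial bin.
    maps = score_maps.reshape(k * k, n_cat, h, w)
    x1, y1, x2, y2 = roi
    xs = np.linspace(x1, x2, k + 1)   # bin edges along x
    ys = np.linspace(y1, y2, k + 1)   # bin edges along y
    pooled = np.zeros((n_cat, k, k))
    for i in range(k):                # bin row
        r0 = int(np.floor(ys[i]))
        r1 = max(int(np.ceil(ys[i + 1])), r0 + 1)
        for j in range(k):            # bin column
            c0 = int(np.floor(xs[j]))
            c1 = max(int(np.ceil(xs[j + 1])), c0 + 1)
            # Bin (i, j) pools only from its own group of score maps.
            region = maps[i * k + j, :, r0:r1, c0:c1]
            pooled[:, i, j] = region.mean(axis=(1, 2))
    # Average voting over the k*k bins gives one score per category.
    return pooled.mean(axis=(1, 2))

k, c = 2, 1
# Toy score maps: channel n is filled with the constant value n.
score_maps = np.arange(k * k * (c + 1), dtype=float)[:, None, None] * np.ones((1, 4, 4))
scores = ps_roi_pool(score_maps, roi=(0, 0, 4, 4), k=k, num_classes=c)
```

Because each bin reads a different group of channels, the position inside the RoI is encoded in which filters respond, which is the point made on the slide.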
What is mAP and AR
Multi-cropping inference
A novel pooling strategy that crops different regions from convolutional feature maps and applies max-pooling at varying times
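Loss function:
$L(a, I; \theta) = \alpha \cdot 1[a \text{ is positive}] \cdot \ell_{loc}(\phi(b_a; a) - f_{loc}(I; a, \theta)) + \beta \cdot \ell_{cls}(y_a, f_{cls}(I; a, \theta))$
$\alpha, \beta$ - weights balancing localization and classification losses
$f_{loc}(I; a, \theta)$ - predicted box encoding, $\ell_{loc}$ - location loss
$f_{cls}(I; a, \theta)$ - predicted class, $\ell_{cls}$ - classification loss
$\phi(b_a; a)$ - box encoding of box $b_a$ with respect to anchor $a$
$y_a$ - class label
$I$ - image, $\theta$ - model parameters, $y_a = 0$ - "negative anchor"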
What is Region-of-Interest (RoI) pooling?
It is used in object detection, for example to detect multiple cars and pedestrians in a single image. Its purpose is to perform max pooling on inputs of nonuniform sizes to obtain fixed-size feature maps (e.g. 7×7).
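A minimal single-channel sketch of RoI max pooling, showing how regions of different sizes are mapped to the same output size. The function name, the (x1, y1, x2, y2) end-exclusive RoI format, and the way bin edges are rounded are assumptions for illustration; real implementations differ in the rounding details.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=7):
    """Max-pool the RoI of a 2-D feature map into an out_size x out_size grid."""
    x1, y1, x2, y2 = roi
    xs = np.linspace(x1, x2, out_size + 1)   # bin edges along x
    ys = np.linspace(y1, y2, out_size + 1)   # bin edges along y
    out = np.empty((out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        r0 = int(np.floor(ys[i]))
        r1 = max(int(np.ceil(ys[i + 1])), r0 + 1)   # at least one row per bin
        for j in range(out_size):
            c0 = int(np.floor(xs[j]))
            c1 = max(int(np.ceil(xs[j + 1])), c0 + 1)
            out[i, j] = feature_map[r0:r1, c0:c1].max()
    return out

fm = np.arange(36).reshape(6, 6)
big = roi_max_pool(fm, (0, 0, 6, 6), out_size=2)     # 6x6 region -> 2x2
small = roi_max_pool(fm, (0, 0, 4, 2), out_size=2)   # 4x2 region -> 2x2
```

Both RoIs produce a 2 × 2 output even though they cover different areas, which is what lets a fixed-size classifier follow.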
Inception module
By parallelizing layers and combining their outputs, less computation is invested for an effect equivalent to using additional layers in depth.