speed/accuracy trade-offs for modern convolutional...

R-FCN: Object Detection via Region-based Fully Convolutional Networks

Jifeng Dai (Microsoft Research), Yi Li (Tsinghua University), Kaiming He (Microsoft Research), Jian Sun (Microsoft Research)

Speed/Accuracy Trade-offs for Modern Convolutional Object Detectors

Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy

Gilad Uziel

Netzah Calamaro

Deep Learning Seminar

Tel-Aviv university

Instructor: Dr. Raja Giryes

6465-r-fcn-object-detection-via-region-based-fully-convolutional-networks.pdf


Huang_SpeedAccuracy_Trade-Offs_for_CVPR_2017_paper.pdf


Introduction

There are two families of object detection methods:

Region-based (two-stage)

Single-shot (one-stage)

R-FCN is a hybrid of both:

It uses a Region Proposal Network (RPN)

It works on the entire image simultaneously

The Main Idea

[Figure: feature maps with several RoIs, each divided into a k × k grid of bins; the numeric bin scores shown on the slide are omitted]

R-FCN Architecture

Bounding Box

Aside from the $k^2(c+1)$-d convolutional layer, we append a sibling $4k^2$-d convolutional layer for bounding box regression.

Position-sensitive RoI pooling is performed on this bank of $4k^2$ maps, producing a $4k^2$-d vector for each RoI.

This vector is then aggregated into a 4-d vector by average voting.

The 4-d vector parameterizes a bounding box as $t = (t_x, t_y, t_w, t_h)$.
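The average-voting step can be illustrated with a minimal sketch. The function name `average_vote` and the memory layout of the pooled vector (the four box coordinates of bin 0 first, then those of bin 1, and so on) are assumptions made for this example; the slides do not specify a layout.

```python
def average_vote(pooled, k):
    """Aggregate a 4*k*k vector into a 4-d box vector by averaging over bins.

    Assumed layout (hypothetical): pooled[bin_index * 4 + coord], i.e. the four
    box coordinates voted by bin 0 come first, then those of bin 1, and so on.
    """
    n_bins = k * k
    if len(pooled) != 4 * n_bins:
        raise ValueError("expected a vector of length 4*k*k")
    # For each coordinate j, average the votes of all k*k bins.
    return [sum(pooled[b * 4 + j] for b in range(n_bins)) / n_bins
            for j in range(4)]


# Toy input for k = 2: four bins, each voting for (t_x, t_y, t_w, t_h).
votes = [0.0, 1.0, 2.0, 3.0,
         2.0, 1.0, 2.0, 1.0,
         4.0, 1.0, 2.0, 3.0,
         2.0, 1.0, 2.0, 1.0]
print(average_vote(votes, k=2))  # [2.0, 1.0, 2.0, 2.0]
```

Each output coordinate is simply the mean of the k × k per-bin votes, so no learned layer is needed after the pooling.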

Visualization of R-FCN for the person category when an RoI

does not correctly overlap the object (k × k = 3 × 3).

Visualization

Visualization of R-FCN for the person category when an RoI

does correctly overlap the object (k × k = 3 × 3).

Visualization

Loss Function

$L(s, t_{x,y,w,h}) = L_{cls}(s_{c^*}) + \lambda [c^* > 0] L_{reg}(t, t^*)$

$c^*$: the RoI's ground-truth label ($c^* = 0$ means background).

$t^*$: the RoI's ground-truth box.

$L_{cls}(s_{c^*}) = -\log(s_{c^*})$: cross-entropy loss on the softmax score of the ground-truth class.

$L_{reg}(t, t^*)$: bounding box regression loss, computed with the smooth L1 function.

$[c^* > 0]$: indicator that equals 1 if the argument is true and 0 otherwise.

We set the balance weight $\lambda = 1$.
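As a rough illustration of how this per-RoI loss combines its two terms, here is a small sketch in plain Python. The function names are hypothetical, the class scores are assumed to be pre-softmax values, and the smooth L1 form (0.5x² for |x| < 1, |x| − 0.5 otherwise) is the usual Fast R-CNN definition rather than something stated on the slide.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def rfcn_roi_loss(class_scores, c_star, t, t_star, lam=1.0):
    """Classification loss plus, for non-background RoIs, box regression loss."""
    probs = softmax(class_scores)
    l_cls = -math.log(probs[c_star])          # cross-entropy on the true class
    l_reg = 0.0
    if c_star > 0:                            # indicator [c* > 0]
        l_reg = sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return l_cls + lam * l_reg


# Toy RoI with uniform scores over background + 2 classes.
fg = rfcn_roi_loss([0.0, 0.0, 0.0], 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))
bg = rfcn_roi_loss([0.0, 0.0, 0.0], 0, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))
print(round(fg, 4), round(bg, 4))  # 2.7236 1.0986
```

The background RoI pays only the classification term, because the indicator switches the regression term off when the label is 0.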

Backpropagation

Among the RoIs proposed by the RPN, positive examples are those whose intersection-over-union (IoU) overlap with a ground-truth box is at least 0.5; all others are negative.

All RoIs (positive and negative) are sorted by loss, and backpropagation is performed only on the B = 128 RoIs with the highest loss.
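The labeling rule and the selection of high-loss RoIs can be sketched as follows. The box format `(x1, y1, x2, y2)` and all function names are assumptions made for this example, and the loss values in the usage lines are made up.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive(roi, gt_boxes, thresh=0.5):
    """An RoI is positive if it overlaps some ground-truth box with IoU >= thresh."""
    return any(iou(roi, gt) >= thresh for gt in gt_boxes)


def select_hard_examples(losses, batch_size=128):
    """Indices of the batch_size RoIs with the highest loss, hardest first."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:batch_size]


gt = [(0, 0, 10, 10)]
print(is_positive((0, 0, 10, 8), gt))    # True  (IoU = 0.8)
print(is_positive((5, 0, 15, 10), gt))   # False (IoU = 1/3)
print(select_hard_examples([0.1, 2.0, 0.7, 1.5], batch_size=2))  # [1, 3]
```

Only the selected indices would contribute gradients; the remaining RoIs are evaluated in the forward pass but ignored in the backward pass.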

Backbone Architecture

This incarnation of R-FCN is based on ResNet-101.

ResNet-101 has 100 convolutional layers followed by global average pooling

and a 1000-class fc layer.

We remove the average pooling layer and the fc layer and only use the

convolutional layers to compute feature maps.

The last convolutional block in ResNet-101 is 2048-d, and we attach a

randomly initialized 1024-d 1 × 1 convolutional layer for reducing dimension.

Then we apply the $k^2(c+1)$-channel convolutional layer to generate score maps.
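To make the channel counts concrete, the short sketch below evaluates them for k = 7 and c = 20. The value c = 20 is an assumption corresponding to the 20 object categories of PASCAL VOC; the function name is hypothetical, and the sibling box-regression layer is assumed to have 4k² channels.

```python
def rfcn_head_channels(k, c):
    """Channel counts of the two sibling R-FCN output layers."""
    return {
        "score_maps": k * k * (c + 1),  # k^2 position bins x (c classes + background)
        "bbox_maps": 4 * k * k,         # k^2 position bins x 4 box coordinates
    }


print(rfcn_head_channels(k=7, c=20))  # {'score_maps': 1029, 'bbox_maps': 196}
```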

Results

Number of proposals: 300

$k \times k = 7 \times 7$

83.6% mAP on PASCAL VOC 2007

170 ms test time per image

Speed/accuracy trade-offs for modern convolutional

object detectors

Comparative study of R-FCN, SSD and Faster R-CNN

Motivation of the 2nd paper

Most works discuss only accuracy. This work also focuses on memory and speed, and on the accuracy/speed/memory trade-off.

The goal is to select the right algorithm for a specific purpose and to optimize the parameters within that algorithm:

1. Mobile devices (cellular) require a low memory footprint

2. Autonomous cars require real-time performance, i.e. both speed and accuracy

3. Server-side applications, such as those at Google or Facebook, require accuracy but are still limited by throughput

4. Contests require accuracy

Comparing apples to apples requires an objective, comprehensive test bench that can show the differences, so such a test bench had to be developed.

R-FCN performance

(Vatsal Srivastava)

Faster R-CNN performance (Kaiming He)

SSD (Summit Bhala)

SSD + YOLO

Object_Detection_with_Region-based_Fully_Convolutional_Networks_(R-FCN).avi

Object_detection_in_the_wild_by_Faster_R-CNN_26_ResNet-101.avi

Object Detection with SSD.mp4

YOLOvsSSD.avi

Location of videos:

Yolo vs SSD:

https://www.youtube.com/watch?v=8QL69cAj2kU

R-FCN:

https://www.youtube.com/watch?v=JLjHxuZOeaQ

Faster R-CNN:

https://www.youtube.com/watch?v=WZmSMkK9VuA


Motivation of the 2nd paper

There are sweet spots on the trade-off graph, beyond which investing a lot more GPU time yields only a small accuracy gain. Viewed the other way around, one may invest much less GPU time with little accuracy loss.

Results

SSD Faster R-CNN R-FCN

Comparative architecture reminder

EXPERIMENTAL PLATFORM

Six feature extractors are used with all detectors:

VGG-16

ResNet-101

Inception V2 (semantic segmentation)

Inception V3

Inception ResNet

MobileNet

Which platforms were used

EXPERIMENTAL PLATFORM – ADDITIONAL DETAILS

Loss function configuration

Matching anchors to ground-truth instances

Argmax vs. Bipartite matching

Input size configuration

Computation platform: Intel Xeon E5-1650 (6 cores)

Nvidia GTX Titan GPU: about 4 times the compute of a home gaming card

w × h: size of the rectangle

L1-norm location loss function

Variable-resolution input size, to cover a wide range

What is the loss function

Training and hyper-parameter tuning

Asynchronous SGD: when a worker finishes computing the gradient of its mini-batch, the gradient is added to the total gradient without waiting for the other workers, and the worker continues training. This may cause delayed (stale) gradient updates, but it is faster.
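The effect of delayed gradients can be imitated with a toy, single-threaded simulation. This is only a sketch under strong assumptions: a one-dimensional quadratic objective, a fixed delay instead of real concurrent workers, and hypothetical names throughout. It illustrates the staleness, not the speed-up.

```python
from collections import deque


def async_sgd(w0, target, lr, staleness, steps):
    """Simulate SGD on f(w) = 0.5 * (w - target)**2 with stale gradients.

    Each update applies a gradient that was computed `staleness` updates ago,
    mimicking a worker that read the shared weight, computed for a while, and
    then wrote its gradient without waiting for the other workers.
    staleness = 0 reduces to ordinary sequential SGD.
    """
    w = w0
    in_flight = deque()                    # gradients computed but not yet applied
    for _ in range(steps):
        in_flight.append(w - target)       # a worker reads w and computes a gradient
        if len(in_flight) > staleness:
            w -= lr * in_flight.popleft()  # the oldest pending gradient is applied
    return w


w_sync = async_sgd(w0=0.0, target=1.0, lr=0.1, staleness=0, steps=300)
w_async = async_sgd(w0=0.0, target=1.0, lr=0.1, staleness=3, steps=300)
print(abs(w_sync - 1.0) < 1e-6, abs(w_async - 1.0) < 1e-6)  # True True
```

With a small learning rate the delayed updates still converge on this toy problem; larger learning rates or longer delays can make stale gradients overshoot.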

Results:

[Figure: mean Average Precision (mAP) vs. GPU time (ms)]

R-FCN and SSD are faster than Faster R-CNN on average

Faster R-CNN is more accurate

Sweet spot: a point beyond which a lot of speed must be sacrificed to obtain a little more accuracy.

Another way to view the sweet spot: little GPU time is invested without sacrificing too much accuracy.

Larger feature extractors are slower
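One way to make the sweet-spot idea concrete is to keep only the models that are not dominated in both speed and accuracy, then look at how much mAP each extra millisecond buys along that frontier. The sketch below does this; the model names and numbers are purely hypothetical and are not taken from the paper.

```python
def pareto_front(models):
    """Models not dominated in (lower GPU time, higher mAP), sorted by GPU time.

    models: list of (name, gpu_ms, m_ap) tuples.
    """
    front = []
    best_map = float("-inf")
    # Visit models from fastest to slowest; keep one only if it beats every
    # faster model on accuracy.
    for name, gpu_ms, m_ap in sorted(models, key=lambda m: (m[1], -m[2])):
        if m_ap > best_map:
            front.append((name, gpu_ms, m_ap))
            best_map = m_ap
    return front


def marginal_gains(front):
    """mAP gained per extra millisecond between consecutive frontier points."""
    return [(b[2] - a[2]) / (b[1] - a[1]) for a, b in zip(front, front[1:])]


# Hypothetical (name, GPU ms, mAP) triples, for illustration only.
toy_models = [("A", 50, 20.0), ("B", 100, 30.0), ("C", 120, 25.0), ("D", 400, 32.0)]
front = pareto_front(toy_models)
print([name for name, _, _ in front])  # ['A', 'B', 'D']
print(marginal_gains(front))
```

In this toy data, going from A to B buys 0.2 mAP per millisecond, while going from B to D buys far less, so B plays the role of the sweet spot.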

Mean Average Precision (mAP) conclusions

What is inception?

[Figure: mean Average Precision (mAP) vs. GPU time (ms), colored by feature extractor]

Larger feature extractors are slower

The colored clusters show the relation to the feature extractor. The architectures (R-FCN, Faster R-CNN, SSD) were implemented with various feature extractors

This makes the test bench variable

MobileNet and Inception V2 are faster on average than Inception ResNet V2, because they are smaller feature extractors

Sweet spot: a point beyond which a lot of speed must be sacrificed to obtain a little more accuracy

Mean Average Precision (mAP)

Colored by feature extractor - conclusions

Memory vs. GPU time for different feature extractors

[Figure: memory vs. GPU time (ms)]

Larger feature extractors are both slower and more memory-hungry. The two go together: larger means more memory and occasionally more GPU time

Inception ResNet V2 demands more memory and more GPU time

SSD with MobileNet is the fastest and uses the least GPU time and memory

Sweet spot: R-FCN with ResNet-101, and Faster R-CNN with ResNet-101 using only 50 proposals

R-FCN with ResNet-101 runs at 100 ms GPU time with high accuracy and moderate memory consumption

Memory vs. GPU for different feature extractors - conclusions

mAP for each object size by meta-architecture and feature extractor

[Figure: bar graph; vertical axis: accuracy]

How to read the bar graph: each feature extractor is partitioned by object size (small, medium, large), and the 3 meta-architectures are drawn for each feature extractor

SSD has (very) poor performance on small objects. It is competitive with Faster R-CNN and R-FCN on larger objects, even outperforming them with lightweight feature extractors

For small objects, higher input resolution may compensate for the object's size in terms of accuracy

mAP for each object size by meta-architecture and feature extractor

mAP on small objects vs. mAP on large objects, colored by input resolution

SSD

Higher-resolution models lead to significantly better mAP on small objects (by a factor of about 2) and somewhat better mAP on large objects

In SSD, higher resolution improves accuracy on large objects but is less successful at improving accuracy on small objects

R-FCN, Faster R-CNN, SSD: strong performance on small objects implies strong performance on large objects.

The converse does not hold: SSD performs well on large objects but poorly on small objects

mAP on small objects vs mAP on large objects colored by input

resolution

mAP vs. Top-1 accuracy of the feature extractor on ImageNet

There is an overall correlation between classification (= feature extraction) accuracy and detection (= overall) accuracy

This correlation appears to be significant only for Faster R-CNN and R-FCN

The performance of SSD appears to be less reliant on its feature extractor's classification accuracy

SSD is unable to fully leverage the power of the ResNet and Inception ResNet feature extractors

Using cheaper feature extractors does not hurt SSD too much. On large objects it is competitive with Faster R-CNN and R-FCN

mAP vs. Top-1 accuracy of the feature extractor on ImageNet

Effect of proposing fewer regions in (a) Faster R-CNN and (b) R-FCN on mAP (solid line) and GPU time (dashed line)

[Figure: panels (a) Faster R-CNN and (b) R-FCN, showing the effect of adjusting the number of proposals on performance]

Figure (a): for Faster R-CNN with the Inception ResNet feature extractor, 50 proposals obtain 96% of the accuracy of 300 proposals while reducing GPU runtime by a factor of 3

Figure (a): Inception ResNet, which has 35.4% mAP with 300 proposals, still reaches 29% mAP with only 10 proposals. The sweet spot is around 50 proposals

Figure (a): similar trade-offs hold for other feature extractors, although less pronounced
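The figures quoted for Figure (a) can be related to each other with a little arithmetic. The sketch below uses only the numbers stated above (35.4% mAP at 300 proposals, 29% mAP at 10 proposals, 96% retained at 50 proposals); the variable names are illustrative, and the derived values are implied by those numbers rather than read from the paper.

```python
map_300 = 35.4   # mAP (%) with 300 proposals, as quoted above
map_10 = 29.0    # mAP (%) with 10 proposals, as quoted above
frac_50 = 0.96   # fraction of the 300-proposal accuracy kept at 50 proposals

map_50 = frac_50 * map_300       # implied mAP at 50 proposals
retained_10 = map_10 / map_300   # fraction of accuracy kept at 10 proposals

print(round(map_50, 1))             # 34.0
print(round(100 * retained_10, 1))  # 81.9
```

So cutting from 300 to 50 proposals costs roughly 1.4 mAP points, whereas cutting all the way to 10 proposals keeps about 82% of the accuracy.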

Effect of proposing fewer regions in (a) Faster R-CNN and (b) R-FCN


Figure (b): savings from using fewer proposals in the R-FCN setting are minimal, since the box classifier (the expensive part) is run only once per image.

Figure (b): at 100 proposals, the speed and accuracy of Faster R-CNN models with ResNet become comparable to those of the equivalent R-FCN models using 300 proposals, in both mAP and GPU speed.

Faster R-CNN: the number of proposals has a dramatic effect on GPU time and a less significant effect on accuracy

R-FCN: the number of proposals has a mild effect on both GPU time and accuracy

Effect of proposing fewer regions in (a) Faster R-CNN and (b) R-FCN


State-of-the-art detection with MS COCO dataset

What are mAP and AR?

What is multi-crop inference?

Facts:

Run on the COCO dataset.

Average accuracy (mAP) is averaged over IoU thresholds from 50% to 95%

Table 3: the tested model is an ensemble of the 5 best-performing Faster R-CNN models with ResNet feature extractors

Table 2 results:

The model's average accuracy is 41.3%, better than the previous result of 37.1%

Relative improvement of about 60% on small objects over the previous result
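The averaging over thresholds mentioned in the facts above can be sketched as follows, assuming the standard COCO convention of IoU thresholds from 0.50 to 0.95 in steps of 0.05. The function names are hypothetical and the per-threshold AP values are made up for illustration; they are not results from the paper.

```python
def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 used for COCO-style averaging."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def mean_ap_over_thresholds(ap_per_threshold):
    """Average the per-threshold AP values into a single number."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


thresholds = coco_iou_thresholds()
# Hypothetical AP values, one per threshold; NOT results from the paper.
toy_ap = [0.60, 0.57, 0.54, 0.50, 0.45, 0.40, 0.33, 0.25, 0.15, 0.05]
print(len(thresholds), thresholds[0], thresholds[-1])   # 10 0.5 0.95
print(round(mean_ap_over_thresholds(toy_ap), 3))        # 0.384
```

Because stricter thresholds usually give lower AP, the averaged number is typically well below the AP measured at IoU 0.5 alone.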

Interpretation of tables 2,3,4

Thank you

Questions?

Reminder: SSD

Modifies the proposal generator to directly output class probabilities (instead of objectness). 1) No separate proposal generator, unlike Faster R-CNN. 2) Direct link from the feature extractor to the detection generator

Pros: very fast (suitable for mobile applications, autonomous vehicles)

Cons: not good at detecting smaller objects (YOLO), but using feature maps from different layers can help a lot (SSD)

Reminder: Faster R-CNN

Start with the feature extractor, continue with the proposal generator, then the box classifier

Feature extractor: 5 convolution layers

Proposal generator: inserted after conv5 of the feature extractor; output = bounding boxes and objectness

Box classifier: input = crop of conv5 from the bounding boxes, with RoI pooling to get feature maps of fixed size; pass through = fc*; output = class probability

Pro: best accuracy

Con: GPU runtime depends on the number of proposals

Reminder: R-FCN

Translation variance in detection: the classification network should output the same thing if the cat moves from the top left to the bottom right (object detection), but the Region Proposal Network (object location) should output something different

The box classifier is given the crop of fc6 instead of conv5, so the computation for each proposal is reduced

New position-sensitive score maps: shape = (k * k * (c + 1), h, w), so the position is encoded into the channel dimension

New position-sensitive RoI pooling: input = (k * k * (c + 1), roi_h, roi_w); pool = (c + 1, k, k); output = c + 1. In other words, each bin pools only from its own filters; for example, the top-left bin pools only from the top-left filters.
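Position-sensitive RoI pooling followed by average voting can be sketched in plain Python. Everything specific here is an assumption for illustration: the channel layout (`channel = bin * (c + 1) + class`), RoIs given as integer feature-map cells `(x1, y1, x2, y2)` with exclusive upper bounds, average pooling inside each bin, and the function names.

```python
import math


def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive RoI pooling with average voting.

    score_maps: k*k*(num_classes+1) channels, each an H x W list of lists.
    Returns num_classes + 1 scores (index 0 = background).
    """
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / k
    bin_h = (y2 - y1) / k
    n_cls = num_classes + 1
    scores = [0.0] * n_cls
    for i in range(k):            # bin row
        for j in range(k):        # bin column
            rows = range(y1 + math.floor(i * bin_h), y1 + math.ceil((i + 1) * bin_h))
            cols = range(x1 + math.floor(j * bin_w), x1 + math.ceil((j + 1) * bin_w))
            for cls in range(n_cls):
                # Bin (i, j) reads only from its own channel for this class.
                channel = score_maps[(i * k + j) * n_cls + cls]
                total = sum(channel[y][x] for y in rows for x in cols)
                scores[cls] += total / (len(rows) * len(cols))
    return [s / (k * k) for s in scores]   # average voting over the k*k bins


def quadrant_map(i, j):
    """4 x 4 map that fires only in quadrant (i, j) of the image."""
    return [[1.0 if (y // 2 == i and x // 2 == j) else 0.0 for x in range(4)]
            for y in range(4)]


zero = [[0.0] * 4 for _ in range(4)]
maps = []
for i in range(2):
    for j in range(2):
        maps.append(zero)                # background channel of bin (i, j)
        maps.append(quadrant_map(i, j))  # class-1 channel of bin (i, j)

print(ps_roi_pool(maps, (0, 0, 4, 4), k=2, num_classes=1))  # [0.0, 1.0]
print(ps_roi_pool(maps, (0, 0, 2, 2), k=2, num_classes=1))  # [0.0, 0.25]
```

The RoI that covers the whole toy object gets the full class score, while the RoI that covers only its top-left part gets a quarter of it, because three of its four bins read from maps that do not fire there.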

Classifier: input = feature maps

Pro: a variation of R-FCN (TA-FCN) is the best instance segmentation architecture

Pro: fast and fairly accurate

Cons: less accurate than Faster R-CNN

What are mAP and AR

Multi-crop inference

A novel pooling strategy that crops different regions from the convolutional feature maps and applies max-pooling at varying times

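Loss function:

$L(a, I; \theta) = \alpha \cdot 1[a \text{ is positive}] \cdot \ell_{loc}(\phi(b_a; a) - f_{loc}(I; a, \theta)) + \beta \cdot \ell_{cls}(y_a, f_{cls}(I; a, \theta))$

$\alpha, \beta$: weights balancing the localization and classification losses

$f_{loc}(I; a, \theta)$: predicted box encoding; $\ell_{loc}$: location loss

$f_{cls}(I; a, \theta)$: predicted class; $\ell_{cls}$: classification loss

$\phi(b_a; a)$: box encoding of box $b_a$ with respect to anchor $a$

$y_a$: class label of anchor $a$ ($y_a = 0$ for a "negative anchor")

$I$: image; $\theta$: model parameters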

What is Region-of-Interest (RoI) pooling?

RoI pooling is used in object detection, for example to detect multiple cars and pedestrians in a single image. Its purpose is to perform max pooling on inputs of nonuniform sizes to obtain fixed-size feature maps (e.g. 7×7).
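A minimal sketch of RoI max pooling on a single-channel feature map follows. The RoI format (integer cells `(x1, y1, x2, y2)` with exclusive upper bounds), the floor/ceil bin boundaries, and the function name are assumptions for this example; real implementations differ in how they quantize coordinates.

```python
import math


def roi_max_pool(feature_map, roi, out_size):
    """Max-pool an arbitrary-size RoI into an out_size x out_size grid."""
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = []
    for i in range(out_size):
        rows = range(y1 + math.floor(i * bin_h), y1 + math.ceil((i + 1) * bin_h))
        out_row = []
        for j in range(out_size):
            cols = range(x1 + math.floor(j * bin_w), x1 + math.ceil((j + 1) * bin_w))
            # Each output cell is the maximum over the cells of its bin.
            out_row.append(max(feature_map[y][x] for y in rows for x in cols))
        out.append(out_row)
    return out


# 4 x 6 toy feature map with increasing values.
fmap = [[y * 6 + x for x in range(6)] for y in range(4)]
print(roi_max_pool(fmap, (0, 0, 6, 4), out_size=2))  # [[8, 11], [20, 23]]
print(roi_max_pool(fmap, (1, 1, 4, 3), out_size=2))  # [[8, 9], [14, 15]]
```

Both RoIs have different sizes (6 × 4 and 3 × 2 cells), yet each produces a 2 × 2 output, which is what lets a fixed-size classifier follow the pooling layer.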


Inception module

By running layers in parallel and combining their outputs, less computation is invested than with an equivalent network that uses additional layers in depth
