FPN-based Network for Panoptic Segmentation
Caribbean Team
Yanwei Li*1,2, Naiyu Gao*1,2, Chaoxu Guo1,2, Xinze Chen1, Qian Zhang1, Guan Huang1, Xin Zhao2, Kaiqi Huang2, Dalong Du1, Chang Huang1
1 Horizon Robotics, 2 Institute of Automation, CAS


Page 1: FPN-based Network for Panoptic Segmentation (presentations.cocodataset.org/ECCV18/COCO18-Panoptic, 2018-10-29)

FPN-based Network for Panoptic Segmentation

Caribbean Team

Yanwei Li*1,2, Naiyu Gao*1,2, Chaoxu Guo1,2, Xinze Chen1, Qian Zhang1,

Guan Huang1, Xin Zhao2, Kaiqi Huang2, Dalong Du1, Chang Huang1

1 Horizon Robotics, 2 Institute of Automation, CAS

Page 2:

Kirillov, A., He, K., Girshick, R., et al. Panoptic Segmentation. arXiv, 2018.
Lin, Tsung-Yi, et al. Feature Pyramid Networks for Object Detection. CVPR 2017.

FPN-based backbone

[Embedded excerpt from the FPN paper (arXiv:1612.03144v1 [cs.CV], 9 Dec 2016), as shown on the slide:]

Feature Pyramid Networks for Object Detection

Tsung-Yi Lin1,2, Piotr Dollár1, Ross Girshick1, Kaiming He1, Bharath Hariharan1, and Serge Belongie2
1 Facebook AI Research (FAIR), 2 Cornell University and Cornell Tech

Abstract

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.

1. Introduction

Recognizing objects at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)). These pyramids are scale-invariant in the sense that an object's scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels.

Featurized image pyramids were heavily used in the era of hand-engineered features [5, 24]. They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [18, 19]. Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [14, 11, 28] (Fig. 1(b)). But even with this robustness, pyramids are still needed to get the most accurate results. All recent top entries in the ImageNet [31] and COCO [20] detection challenges use multi-scale testing on featurized image pyramids (e.g., [15, 33]). The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.

Nevertheless, featurizing each level of an image pyramid has obvious limitations. Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. Moreover, training deep [...]

Figure 1. (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow to compute. (b) Recent detection systems have opted to use only single-scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines and thicker outlines denote higher-level features.

[Diagram labels: Input, Shared FPN backbone]

FPN-based Network
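To make the excerpt's top-down pathway concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code): lateral 1x1 convolutions project the backbone maps to a common channel dimension, and each coarser map is upsampled 2x and added in. Nearest-neighbor upsampling stands in for the paper's upsampling, and the 3x3 smoothing convolutions applied after each merge in the real FPN are omitted.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x spatial upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, weight):
    """1x1 convolution: apply a (C_out, C_in) weight pointwise to a (C_in, H, W) map."""
    c_in, h, w = x.shape
    return (weight @ x.reshape(c_in, -1)).reshape(-1, h, w)

def fpn_top_down(bottom_up, lateral_weights):
    """Build an FPN from backbone maps ordered finest-first (e.g. C2..C5).

    Each lateral 1x1 conv projects a backbone map to a common channel
    dimension; the running top-down map is upsampled 2x and added in.
    Returns pyramid maps finest-first (P2..P5)."""
    laterals = [conv1x1(c, w) for c, w in zip(bottom_up, lateral_weights)]
    pyramid = [laterals[-1]]                     # coarsest level starts the pathway
    for lat in reversed(laterals[:-1]):
        pyramid.append(lat + upsample2x(pyramid[-1]))
    return pyramid[::-1]
```

With three toy backbone maps at strides differing by 2x, the function returns three pyramid maps of matching spatial sizes, all with the shared lateral channel count.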

Page 3:


Instance head

Semantic head

FPN-based backbone


[Diagram labels: Input, Shared FPN backbone]

Function Head

FPN-based Network

Page 4:


Instance head

Semantic head

merge

FPN-based backbone


[Diagram labels: Input, Shared FPN backbone]

Function Head

Result Choose

FPN-based Network

Page 5:

Unified Framework
• Mask R-CNN architecture
• Shared FPN backbone

[Diagram: Shared FPN backbone → Seg Head for Thing]

FPN Architecture

[Figure: FPN architecture. Backbone maps at 1/4, 1/8, 1/16, and 1/32 of the image resolution; the low-resolution levels carry strong features, and the top-down pathway makes the high-resolution levels strong as well.]

[1] He, K., Zhang, X., Ren, S., & Sun, J. Deep Residual Learning for Image Recognition. CVPR 2016.
[2] Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. Aggregated Residual Transformations for Deep Neural Networks. CVPR 2017.
[3] Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. Feature Pyramid Networks for Object Detection. CVPR 2017.

Feature Pyramid Network (FPN) [3]

[Figure: channel dimensions. Input image: 3 channels; backbone stages: 256, 512, 1024, 2048; FPN outputs: 256 at every level.]

Head outputs: Classification / Regression / Mask

He, K., Gkioxari, G., Dollár, P., et al. Mask R-CNN. ICCV 2017.
Lin, Tsung-Yi, et al. Feature Pyramid Networks for Object Detection. CVPR 2017.
Hu, J., Shen, L., Sun, G., et al. Squeeze-and-Excitation Networks. CVPR 2018.
Cai, Z., & Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. CVPR 2018.

Panoptic Segmentation (Thing)

Page 6:

[Embedded excerpt from the Squeeze-and-Excitation Networks paper, as shown on the slide:]

Figure 1: A Squeeze-and-Excitation block.

[...] features in a class-agnostic manner, bolstering the quality of the shared lower-level representations. In later layers, the SE block becomes increasingly specialised, and responds to different inputs in a highly class-specific manner. Consequently, the benefits of feature recalibration conducted by SE blocks can be accumulated through the entire network.

The development of new CNN architectures is a challenging engineering task, typically involving the selection of many new hyperparameters and layer configurations. By contrast, the design of the SE block outlined above is simple, and can be used directly with existing state-of-the-art architectures whose modules can be strengthened by direct replacement with their SE counterparts.

Moreover, as shown in Sec. 4, SE blocks are computationally lightweight and impose only a slight increase in model complexity and computational burden. To support these claims, we develop several SENets and provide an extensive evaluation on the ImageNet 2012 dataset [34]. To demonstrate their general applicability, we also present results beyond ImageNet, indicating that the proposed approach is not restricted to a specific dataset or a task.

Using SENets, we won first place in the ILSVRC 2017 classification competition. Our top-performing model ensemble achieves a 2.251% top-5 error on the test set (http://image-net.org/challenges/LSVRC/2017/results). This represents a ~25% relative improvement in comparison to the winning entry of the previous year (with a top-5 error of 2.991%).

2. Related Work

Deep architectures. VGGNets [39] and Inception models [43] demonstrated the benefits of increasing depth. Batch normalization (BN) [16] improved gradient propagation by inserting units to regulate layer inputs, stabilising the learning process. ResNets [10, 11] showed the effectiveness of learning deeper networks through the use of identity-based skip connections. Highway networks [40] employed a gating mechanism to regulate shortcut connections. Reformulations of the connections between network layers [5, 14] have been shown to further improve the learning and representational properties of deep networks.

An alternative line of research has explored ways to tune the functional form of the modular components of a network. Grouped convolutions can be used to increase cardinality (the size of the set of transformations) [15, 47]. Multi-branch convolutions can be interpreted as a generalisation of this concept, enabling more flexible compositions of operators [16, 42, 43, 44]. Recently, compositions which have been learned in an automated manner [26, 54, 55] have shown competitive performance. Cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [6, 20] or jointly by using standard convolutional filters [24] with 1×1 convolutions. Much of this work has concentrated on the objective of reducing model and computational complexity, reflecting an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we claim that providing the unit with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process, and significantly enhance the representational power of the network.

Attention and gating mechanisms. Attention can be viewed, broadly, as a tool to bias the allocation of available processing resources towards the most informative components of an input signal [17, 18, 22, 29, 32]. The benefits of such a mechanism have been shown across a range of tasks, from localisation and understanding in images [3, 19] to sequence-based models [2, 28]. It is typically implemented in combination with a gating function (e.g. a softmax or sigmoid) and sequential techniques [12, 41]. Recent work has shown its applicability to tasks such as image captioning [4, 48] and lip reading [7]. In these applications, it is often used on top of one or more layers representing higher-level abstractions for adaptation between modalities. Wang et al. [46] introduce a powerful trunk-and-mask attention mechanism using an hourglass module [31]. This high-capacity unit is inserted into deep residual networks between intermediate stages. In contrast, our proposed SE block is a lightweight gating mechanism, specialised to model channel-wise relationships in a computationally efficient manner and designed to enhance the representational power of basic modules throughout the network.

3. Squeeze-and-Excitation Blocks

The Squeeze-and-Excitation block is a computational unit which can be constructed for any given transformation F_tr : X → U, with X ∈ R^(H′×W′×C′) and U ∈ R^(H×W×C). For [...]

He, K., Gkioxari, G., Dollár, P., et al. Mask R-CNN. ICCV 2017.
Hu, J., Shen, L., Sun, G., et al. Squeeze-and-Excitation Networks. CVPR 2018.
Cai, Z., & Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. CVPR 2018.

Unified Framework
• Mask R-CNN architecture
• Shared FPN backbone
Stronger Network
• SENet154
• Deformable Conv.
• Nonlocal Conv.

Panoptic Segmentation (Thing)
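The team's stronger backbone is SENet154; the channel recalibration the excerpt above describes can be sketched in NumPy as follows. This is a toy illustration with hypothetical names, ignoring the bias terms and reduction-ratio bookkeeping of the real network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(u, w_reduce, w_expand):
    """Squeeze-and-Excitation on a (C, H, W) feature map U.

    Squeeze: global average pooling gives a (C,) channel descriptor.
    Excitation: FC reduce (w_reduce: (C/r, C)), ReLU, FC expand
    (w_expand: (C, C/r)), sigmoid, yielding per-channel gates in (0, 1).
    Scale: each channel of U is multiplied by its gate."""
    z = u.mean(axis=(1, 2))                              # squeeze
    s = sigmoid(w_expand @ np.maximum(w_reduce @ z, 0))  # excitation gates
    return u * s[:, None, None]                          # channel-wise rescale
```

Because the gates lie strictly in (0, 1) and depend on globally pooled statistics, the block can suppress or emphasise whole channels at negligible extra cost, which is why it can be dropped into existing modules by direct replacement.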

Page 7:


Unified Framework
• Mask R-CNN architecture
• Shared FPN backbone
Stronger Network
• SENet154
• Deformable Conv.
• Nonlocal Conv.
Bbox Head
• Cascade R-CNN

Panoptic Segmentation (Thing)

Page 8:

Unified Framework
• Mask R-CNN architecture
• Shared FPN backbone
Stronger Network
• SENet154
• Deformable Conv.
• Nonlocal Conv.
Bbox Head
• Cascade R-CNN
Mask Head
• Cascade Mask

Exploits RoIs from the box head to refine the mask results.

Test-tricks

[Diagram labels: RoIs of second stage from Cascade R-CNN, RoI Feature, RoI Mask, Concat., RoIs of first stage from Cascade R-CNN, Share params]

Panoptic Segmentation (Thing)

Page 9:

• ResNet50 baseline

[Table: thing results; numbers garbled in extraction]

Shared FPN Inference Thing

Page 10:

[Table: thing results; numbers garbled in extraction]

• ResNet50 baseline

• Using multi-scale training

Shared FPN Inference Thing

Page 11:

• ResNet50 baseline

• Using multi-scale training

• Adopt DCN & Nonlocal & Adaptive RoI pooling

[Table: thing results; numbers garbled in extraction]

Shared FPN Inference Thing

Page 12:

• ResNet50 baseline

• Using multi-scale training

• Adopt DCN & Nonlocal & Adaptive RoI pooling

• Cascade RCNN

[Table: thing results; numbers garbled in extraction]

Shared FPN Inference Thing

Page 13:

• ResNet50 baseline

• Using multi-scale training

• Adopt DCN & Nonlocal & Adaptive RoI pooling

• Cascade RCNN

• Cascade Mask

[Table: thing results; numbers garbled in extraction]

Shared FPN Inference Thing

Page 14:

• ResNet50 baseline

• Using multi-scale training

• Adopt DCN & Nonlocal & Adaptive RoI pooling

• Cascade RCNN

• Cascade Mask

• Using test tricks

[Table: thing results; numbers garbled in extraction]

Shared FPN Inference Thing

Page 15:

[Table: thing results; numbers garbled in extraction]

• ResNet50 baseline

• Using multi-scale training

• Adopt DCN & Nonlocal & Adaptive RoI pooling

• Cascade RCNN

• Cascade Mask

• Using test tricks

• Using larger model & BN

45.9% → 51.7%

Shared FPN Inference Thing

Page 16:

Unified Framework
• FCN-based architecture
• Shared FPN backbone
Stronger Network
• SENet154
• Deformable Conv.
• Nonlocal Conv.
Test-tricks
• Flip
• Multi-Scale testing
• Other tricks

[Diagram: Shared FPN backbone → Seg Head for Stuff]


Panoptic Segmentation (Stuff)
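The slides say the stuff branch is an FCN-based head on the shared FPN backbone but do not spell out its design. The sketch below shows one common arrangement (upsample, sum, then predict, in the spirit of later Panoptic FPN work); the fusion choice and all names are assumptions rather than the team's actual head.

```python
import numpy as np

def upsample_to(x, h, w):
    """Nearest-neighbor upsample a (C, h0, w0) map to (C, h, w) (integer factors)."""
    return x.repeat(h // x.shape[1], axis=1).repeat(w // x.shape[2], axis=2)

def stuff_head(pyramid, w_cls):
    """Hypothetical FCN-style stuff head over a shared FPN.

    Every pyramid level is upsampled to the finest level's resolution and
    summed; a 1x1 conv (w_cls: (num_classes, C)) then predicts per-pixel
    class scores, and argmax gives the stuff label map."""
    h, w = pyramid[0].shape[1:]
    fused = sum(upsample_to(p, h, w) for p in pyramid)
    c = fused.shape[0]
    logits = (w_cls @ fused.reshape(c, -1)).reshape(-1, h, w)
    return logits.argmax(axis=0)          # (h, w) stuff label map
```

In a real system the label map would then be upsampled to full image resolution and merged with the thing predictions.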

Page 17:

[Table: stuff results; numbers garbled in extraction]

• ResNet50 baseline

Shared FPN Inference Stuff

Page 18:

[Table: stuff results; numbers garbled in extraction]

• ResNet50 baseline

• Using all test skills

Shared FPN Inference Stuff

Page 19:

[Table: stuff results; numbers garbled in extraction]

• ResNet50 baseline
• Using all test skills
• Adopt larger model

[Results on the slide: 45.9%, 61.95%, 34.7%]

Shared FPN Inference Stuff

Page 20:

Merge method

• Sort thing results by score

• Thing first, stuff second

• Merge tricks


Panoptic Segmentation Merge
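The bullets above describe the greedy merge heuristic of Kirillov et al.: paint thing instances onto an empty canvas in descending score order, then fill the remaining pixels with stuff. Here is a NumPy sketch; the overlap threshold, the drop rule, and all names are assumptions, and the slide's extra "merge tricks" are not reproduced.

```python
import numpy as np

def merge_panoptic(thing_results, stuff_masks, overlap_thresh=0.5):
    """Greedy panoptic merge (thing first, stuff second).

    thing_results: list of (score, boolean mask); stuff_masks: {label: mask}.
    Things are painted in descending score order; an instance is dropped if,
    after removing pixels already claimed, its visible fraction falls below
    overlap_thresh. Stuff fills only still-empty pixels. Stuff labels should
    be chosen not to collide with the instance ids 1..N assigned here.
    Returns an int canvas: 0 = void, 1..N = thing ids, given labels = stuff."""
    shape = (thing_results[0][1].shape if thing_results
             else next(iter(stuff_masks.values())).shape)
    canvas = np.zeros(shape, dtype=np.int64)
    seg_id = 0
    # 1) things first: the highest score wins contested pixels
    for score, mask in sorted(thing_results, key=lambda r: -r[0]):
        visible = mask & (canvas == 0)
        if visible.sum() < overlap_thresh * max(mask.sum(), 1):
            continue                      # mostly occluded: drop this instance
        seg_id += 1
        canvas[visible] = seg_id
    # 2) stuff second: only on pixels no instance claimed
    for label, mask in stuff_masks.items():
        canvas[mask & (canvas == 0)] = label
    return canvas
```

For two overlapping instances, the lower-scoring one keeps only its unoccluded pixels (or is dropped entirely if too little of it remains), which matches the "thing first, stuff second" ordering on the slide.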

Page 21:


Visualize Results

Page 22:

[Figure columns: Input, Baseline, Ensemble]

Failure Cases

Page 23:

Thanks & questions

For more questions, please contact: [email protected]