towards weakly supervised object segmentation & scene parsing · self-erasing network for...

Towards Weakly Supervised Object Segmentation & Scene Parsing

Yunchao Wei

IFP, Beckman Institute, University of Illinois at Urbana-Champaign, IL, USA

Self-Erasing Network for Integral Object Attention

Qibin Hou¹, Peng-Tao Jiang¹, Yunchao Wei², Ming-Ming Cheng¹

¹College of Computer Science, Nankai University, Tianjin, China

²IFP, Beckman Institute, University of Illinois at Urbana-Champaign, IL, USA

[Figure: image → object localization map, produced by fully convolutional networks]

Background

Weak Supervision: annotations at the training stage that are of a lower degree (i.e., cheaper and simpler) than the outputs required at the testing stage.

[Figure: weak annotations for an image containing a horse and a person: image-level labels, points, scribbles and bounding boxes]

Weak Supervision

Background

[Figure: an FCN trained with a loss against a pseudo mask (classes: bkg, horse, person); the key step is to learn to produce the pseudo mask]

The Popular Pipeline

[Figure: example images containing a dog and a bird]

Our target: dense and integral object localization maps.

Current issue: Class Activation Mapping [Zhou CVPR16] produces small and sparse object localization maps.

Our Target & Current Issue
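Class Activation Mapping [Zhou CVPR16] projects the weights of the classifier that follows global average pooling back onto the last convolutional feature maps. The NumPy sketch below assumes features of shape (K, H, W) and a linear classifier weight matrix of shape (num_classes, K); the function name `class_activation_map` and the min-max normalization are our own illustrative choices.

```python
import numpy as np

def class_activation_map(features: np.ndarray, fc_weights: np.ndarray, class_idx: int) -> np.ndarray:
    """Compute a CAM as a class-weighted sum of feature maps.

    features:   (K, H, W) activations of the last conv layer
    fc_weights: (num_classes, K) weights of the linear classifier applied after GAP
    class_idx:  target class
    Returns an (H, W) map min-max normalized to [0, 1] (the normalization is an assumption).
    """
    w = fc_weights[class_idx]                     # (K,)
    cam = np.tensordot(w, features, axes=(0, 0))  # (H, W): sum_k w_k * f_k(x, y)
    cam -= cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: 2 channels on a 3x3 grid
feats = np.zeros((2, 3, 3))
feats[0, 1, 1] = 4.0              # channel 0 fires only at the centre
feats[1, 0, 0] = 1.0              # channel 1 fires at the top-left corner
weights = np.array([[1.0, 0.5]])  # a single class
cam = class_activation_map(feats, weights, 0)
print(cam[1, 1], cam[0, 0])       # -> 1.0 0.125
```

Because the map is dominated by the few locations where highly weighted channels fire, it tends to peak on the most discriminative parts only, which matches the "small and sparse" issue above.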

Revisit Adversarial Erasing

Object Region Mining with Adversarial Erasing [Wei CVPR17]


Adversarial Complementary Learning [Zhang CVPR18]
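Adversarial erasing [Wei CVPR17] hides the regions the current classifier already attends to, so that a retrained classifier must find other object parts. A minimal sketch of one erasing step, assuming an attention map normalized to [0, 1], a hypothetical threshold of 0.5, and that erased pixels are filled with the per-channel image mean (the fill value is our assumption):

```python
import numpy as np

def erase_discriminative_regions(image: np.ndarray, attention: np.ndarray,
                                 threshold: float = 0.5) -> tuple[np.ndarray, np.ndarray]:
    """One adversarial-erasing step (sketch).

    image:     (H, W, C) float image
    attention: (H, W) attention map in [0, 1] for the target class
    Returns the erased image and the boolean mask of erased pixels.
    """
    mask = attention > threshold                                    # most discriminative pixels
    erased = image.copy()
    erased[mask] = image.reshape(-1, image.shape[-1]).mean(axis=0)  # fill with per-channel mean
    return erased, mask

img = np.ones((2, 2, 3))
img[0, 0] = [0.0, 0.0, 0.0]
att = np.array([[0.9, 0.1], [0.2, 0.8]])
out, m = erase_discriminative_regions(img, att)
```

In the full procedure this step is repeated: the classifier is retrained on the erased images and the erased masks are accumulated into the mined object region. If the loop keeps going after the object is covered, the classifier starts relying on background, which is the over-erasing failure described next.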


Over Erasing: The Failure Case of Adversarial Erasing

The background regions cannot be suppressed!

Our Solution: Self-Erasing Network

Motivation

[Figure: image → attention map → ternary mask, which marks object priors and background priors]


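Ternary Mask: two thresholds on the attention map, $\delta_h$ and $\delta_l$, separate object priors (high attention) from background priors (low attention).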
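Thresholding an attention map with $\delta_h$ and $\delta_l$ can be sketched as follows. The numeric encoding (0 for object priors, -1 for background priors, 1 for the uncertain zone in between) and the default threshold values are assumptions for illustration, not values given on these slides.

```python
import numpy as np

OBJECT, UNCERTAIN, BACKGROUND = 0, 1, -1   # assumed encoding of the ternary mask

def ternary_mask(attention: np.ndarray, delta_h: float = 0.7, delta_l: float = 0.05) -> np.ndarray:
    """Split a [0, 1] attention map into object priors, background priors and an uncertain zone."""
    if not delta_l < delta_h:
        raise ValueError("delta_l must be smaller than delta_h")
    t = np.full(attention.shape, UNCERTAIN, dtype=np.int8)
    t[attention >= delta_h] = OBJECT        # confident object regions
    t[attention < delta_l] = BACKGROUND     # confident background regions
    return t

att = np.array([[0.9, 0.4], [0.01, 0.7]])
print(ternary_mask(att).tolist())   # -> [[0, 1], [-1, 0]]
```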

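[Figure: Self-Erasing Network framework. A backbone feeds three branches, SA, SB and SC, each built from Conv layers, GAP (global average pooling) and a classification loss. The attention map is thresholded into the ternary mask T, which conditions the later branches through C-ReLU strategy I and C-ReLU strategy II. Example image class: dining table.]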

Framework

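The C-ReLU units condition a branch's features on the ternary mask. A minimal sketch, assuming C-ReLU (which we read as a conditionally reversed ReLU) multiplies the ReLU output by a per-pixel mask whose entries keep (1), erase (0) or sign-reverse (-1) an activation. How strategies I and II choose their masks is not specified on the slides, so the mask below is purely illustrative.

```python
import numpy as np

def c_relu(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Conditionally reversed ReLU (sketch).

    features: (C, H, W) pre-activation features of a branch
    mask:     (H, W) values in {-1, 0, 1}, e.g. a ternary mask T
              (1 keeps, 0 erases, -1 reverses the sign of the activation)
    """
    return np.maximum(features, 0.0) * mask[None, :, :]

def gap_scores(features: np.ndarray) -> np.ndarray:
    """Global average pooling over the spatial dimensions -> (C,) class scores."""
    return features.mean(axis=(1, 2))

feats = np.array([[[2.0, 1.0], [3.0, -1.0]]])  # one class channel on a 2x2 map
T = np.array([[0, 1], [-1, 1]])                 # object prior, uncertain, background, uncertain
out = c_relu(feats, T)
print(out[0].tolist(), gap_scores(out).tolist())  # -> [[0.0, 1.0], [-3.0, 0.0]] [-0.5]
```

With this reading, a strongly activated background pixel lowers the pooled class score, so minimizing the classification loss pushes the branch to suppress background activations instead of merely ignoring them.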

Experimental Results

[Figure: qualitative comparison of attention maps, Ours vs. ACoL [Zhang CVPR18]]

PASCAL VOC 2012

Weakly Supervised Scene Parsing with Point-based Distance Metric Learning

Rui Qian¹,³, Yunchao Wei³, Honghui Shi²,³, Jiachen Li³, Jiaying Liu¹ and Thomas Huang³

¹Institute of Computer Science and Technology, Peking University, Beijing, China

²IBM T.J. Watson Research Center; ³IFP, Beckman Institute, UIUC

[Figure: point annotation vs. full annotation]

Review of Weakly Supervised Methods

◼ Weakly supervised methods for scene parsing
  ◼ Image-level supervision
  ◼ Box supervision
  ◼ Scribble supervision
  ◼ Point supervision

[Figure: example annotations with classes Person, Bike, Tree, Sky, Road]

Annotation Comparison

◼ Annotation burden comparison

| | Full | Scribble | Point |
|---|---|---|---|
| Average annotated pixels / image | 170K | 1817.48 | 12.26 |
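The relative annotation cost follows from simple arithmetic. Taking 170K as 170,000 pixels per fully labeled image (an assumption about the rounding on the slide):

```python
full, scribble, point = 170_000, 1817.48, 12.26   # average annotated pixels per image

def percent_of_full(pixels: float) -> float:
    """Annotated pixels as a percentage of a fully labeled image."""
    return 100.0 * pixels / full

print(f"scribble: {percent_of_full(scribble):.2f}% of full")  # -> scribble: 1.07% of full
print(f"point:    {percent_of_full(point):.4f}% of full")     # -> point:    0.0072% of full
```

The point ratio of roughly 0.007% is presumably the origin of the "0.007% annotated data" figure quoted in the PASCAL-Context results.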

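Motivation

[Figure: input images and their PointSup predictions; point supervision provides only limited labels]

How can such limited annotation be utilized? Through cross-image semantic similarity!

[Figure: images sharing semantic classes such as train, track and tree]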

Proposed Method (1/5)

◼ Overview
  ◼ Point-based distance metric learning (PDML)
  ◼ Point supervision (PointSup)
  ◼ Online extension supervision (ExtendSup)

[Figure: Input → Feature Extract → Feature → Classification Module → Prediction; PDML acts on the features, while PointSup and ExtendSup supervise the prediction with a cross-entropy loss]


◼ Point supervision
  ◼ Compute the cross-entropy loss only on annotated pixels
  ◼ Back-propagate gradients accordingly
  ◼ Optimize by stochastic gradient descent

[Figure: Input → Feature Extract → Feature → Classification Module → Prediction, supervised by PointSup through a cross-entropy loss]
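Restricting cross-entropy to annotated pixels amounts to a masked loss. A NumPy sketch, assuming unannotated pixels carry a hypothetical ignore label of -1 and the loss is averaged over annotated pixels only; in PyTorch the same effect is usually obtained with the `ignore_index` argument of `torch.nn.CrossEntropyLoss`.

```python
import numpy as np

IGNORE = -1  # hypothetical label for pixels without a point annotation

def point_cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Cross-entropy over annotated pixels only.

    logits: (num_classes, H, W) raw class scores
    labels: (H, W) class index per pixel, IGNORE where no point was annotated
    """
    valid = labels != IGNORE
    if not valid.any():
        return 0.0
    # numerically stable log-softmax over the class axis
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    ys, xs = np.nonzero(valid)
    picked = log_probs[labels[ys, xs], ys, xs]   # log p(true class) at annotated pixels
    return float(-picked.mean())

logits = np.zeros((3, 2, 2))        # uniform prediction over 3 classes
labels = np.full((2, 2), IGNORE)
labels[0, 1] = 2                    # a single annotated point
print(point_cross_entropy(logits, labels))  # -> log(3) ≈ 1.0986
```

Since unannotated pixels never enter the sum, their gradients are zero, which is what "back-propagate gradients accordingly" amounts to.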


◼ Online extension supervision
  ◼ Extension method 1 (region): select pixels within a 5×5 square around each annotated pixel
  ◼ Extension method 2 (score): select pixels whose predicted score exceeds 0.7
  ◼ Finally, take the intersection of the two selections

[Figure: the point-supervised pipeline with an online extension step that enlarges the set of supervised pixels]
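The two selection rules and their intersection can be sketched as below. We assume that each annotated point carries a class label, that the "score" is the predicted probability of that same class, and that extended pixels inherit the point's label; these details, the ignore value -1 and the overwrite order when windows overlap are our assumptions.

```python
import numpy as np

IGNORE = -1  # hypothetical label for unsupervised pixels

def extend_points(points, probs: np.ndarray, radius: int = 2, score_thr: float = 0.7) -> np.ndarray:
    """Online extension of point labels (sketch).

    points: iterable of (y, x, cls) annotated pixels
    probs:  (num_classes, H, W) softmax predictions of the current model
    radius: 2 gives the 5x5 square around each point
    Returns an (H, W) label map; IGNORE marks pixels that stay unsupervised.
    """
    _, h, w = probs.shape
    labels = np.full((h, w), IGNORE, dtype=np.int64)
    for y, x, cls in points:
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        region = np.zeros((h, w), dtype=bool)
        region[y0:y1, x0:x1] = True              # method 1: 5x5 neighbourhood
        confident = probs[cls] > score_thr       # method 2: score above threshold
        labels[region & confident] = cls         # intersection of both rules
        labels[y, x] = cls                       # the annotated point itself is always kept
    return labels

probs = np.empty((2, 6, 6))
probs[0], probs[1] = 0.6, 0.4
probs[0, :4, :4], probs[1, :4, :4] = 0.1, 0.9   # class 1 is confident in the top-left 4x4 block
ext = extend_points([(0, 0, 1)], probs)          # window clipped to 3x3 at the image corner
```

Taking the intersection keeps the extension conservative: a pixel is added only if it is both spatially close to a labeled point and confidently predicted as that point's class.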


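◼ Point-based distance metric learning

[Figure: (a) Training images $I_a$, $I_b$, $I_c$, $I_d$ are grouped into subgroups for point-based distance metric learning (example classes: train, platform). (b) Images $I_a$ and $I_b$ pass through feature extractors with shared weights; distances between same-class point features are minimized and distances between different-class point features are maximized.]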


◼ Loss function of PDML
  ◼ For each image $I_a$, define the embedding vector set $E_a = \bigcup_{i=1}^{|M_a|} \{P_a^i\}$, where $|M_a|$ is the number of annotated pixels in $I_a$ and $P_a^i$ is the feature vector of the $i$-th annotated pixel
  ◼ We optimize over triplets $\{P_a^i, P_b^j, P_b^k\}$, where $P_a^i$ and $P_b^j$ share the same category and $P_b^k$ belongs to a different category from $P_a^i$ and $P_b^j$
  ◼ The loss function is:
    $L_t(P_a^i, P_b^j, P_b^k) = \alpha L_p(P_a^i, P_b^j) + \beta L_n(P_a^i, P_b^j, P_b^k)$
    $L_p(P_a^i, P_b^j) = \|P_a^i - P_b^j\|_2$
    $L_n(P_a^i, P_b^j, P_b^k) = \max(\|P_a^i - P_b^j\|_2 - \|P_a^i - P_b^k\|_2 + m, 0)$
  ◼ $\alpha$, $\beta$ and $m$ are hyperparameters, set to 0.8, 1 and 20 in practice
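The loss above translates directly into code. A NumPy sketch using the stated values $\alpha = 0.8$, $\beta = 1$, $m = 20$; the helper `pdml_loss`, which averages $L_t$ over every valid triplet between the annotated points of two images, is our own assumption about how triplets are aggregated and is not taken from the slides.

```python
import numpy as np

ALPHA, BETA, MARGIN = 0.8, 1.0, 20.0   # hyperparameters given on the slide

def triplet_loss(p_a: np.ndarray, p_pos: np.ndarray, p_neg: np.ndarray,
                 alpha: float = ALPHA, beta: float = BETA, m: float = MARGIN) -> float:
    """L_t = alpha * L_p + beta * L_n for one triplet of pixel embeddings."""
    d_pos = np.linalg.norm(p_a - p_pos)   # ||P_a^i - P_b^j||_2
    d_neg = np.linalg.norm(p_a - p_neg)   # ||P_a^i - P_b^k||_2
    l_p = d_pos                           # pull same-class points together
    l_n = max(d_pos - d_neg + m, 0.0)     # push other classes at least m further away
    return float(alpha * l_p + beta * l_n)

def pdml_loss(emb_a: np.ndarray, cls_a: np.ndarray,
              emb_b: np.ndarray, cls_b: np.ndarray) -> float:
    """Hypothetical aggregation: mean L_t over all valid triplets between images I_a and I_b.

    emb_a: (|M_a|, D) embeddings E_a, cls_a: (|M_a|,) their classes; likewise for I_b.
    """
    losses = []
    for p, c in zip(emb_a, cls_a):
        for q in emb_b[cls_b == c]:        # positives: same class, other image
            for r in emb_b[cls_b != c]:    # negatives: different class, other image
                losses.append(triplet_loss(p, q, r))
    return float(np.mean(losses)) if losses else 0.0

print(triplet_loss(np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([6.0, 8.0])))  # -> 19.0
```

Here $L_p = 5$ and $L_n = \max(5 - 10 + 20, 0) = 15$, so $L_t = 0.8 \cdot 5 + 15 = 19$.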

Datasets

◼ Scene parsing datasets
  ◼ PASCAL-Context
  ◼ ADE20K

| Dataset | #Training | #Evaluation | #Instances/Image |
|---|---|---|---|
| PASCAL-Context | 4998 | 5105 | 12.26 |
| ADE20K | 20210 | 2000 | 13.96 |

Experimental Results

| FullSup | PointSup | PDML | Online Ext. | mIoU | Pixel Acc. |
|---|---|---|---|---|---|
| √ | | | | 39.6 | 78.6% |
| | √ | | | 27.9 | 55.3% |
| | √ | √ | | 29.7 | 57.5% |
| | √ | √ | √ | 30.0 | 57.6% |

◼ Quantitative evaluation on PASCAL-Context
  ◼ Combining all three techniques gives the best result
  ◼ With only about 0.007% of the pixels annotated, we reach over 75% of the fully supervised mIoU (30.0 vs. 39.6)!

Experimental Results

| FullSup | PointSup | PDML | Online Ext. | mIoU | Pixel Acc. |
|---|---|---|---|---|---|
| √ | | | | 33.9 | 75.8% |
| √ (SegNet) | | | | 21.0 | / |
| | √ | | | 17.7 | 58.0% |
| | √ | √ | | 19.0 | 59.0% |
| | √ | √ | √ | 19.6 | 61.0% |

◼ Quantitative evaluation on ADE20K
  ◼ Combining all three techniques gives the best result
  ◼ Our method approaches the result of SegNet trained under full supervision (19.6 vs. 21.0 mIoU)

Subjective Evaluation

[Figure: qualitative results; columns: Image, PointSup, PDML, Final, GT]


Discussion

◼ Visualizing the effect of PDML
  ◼ dis(+): L2 distance between same-class feature vectors
  ◼ dis(−): L2 distance between different-class feature vectors
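Both statistics can be measured directly on pixel embeddings. A sketch, assuming dis(+) and dis(−) are averaged over all same-class and all different-class pairs of annotated pixels (the averaging scheme and the function name are our assumptions):

```python
import numpy as np

def class_distance_stats(emb: np.ndarray, cls: np.ndarray) -> tuple[float, float]:
    """Mean L2 distance between same-class (dis+) and different-class (dis-) embedding pairs.

    emb: (N, D) feature vectors of annotated pixels, cls: (N,) their class labels
    """
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)         # (N, N) pairwise L2 distances
    same = cls[:, None] == cls[None, :]
    off_diag = ~np.eye(len(cls), dtype=bool)     # ignore each vector's distance to itself
    pos_pairs = dist[same & off_diag]
    neg_pairs = dist[~same]
    dis_pos = float(pos_pairs.mean()) if pos_pairs.size else 0.0
    dis_neg = float(neg_pairs.mean()) if neg_pairs.size else 0.0
    return dis_pos, dis_neg

emb = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
cls = np.array([0, 0, 1])
dis_pos, dis_neg = class_distance_stats(emb, cls)
```

An embedding trained with PDML should yield a small dis(+) and a large dis(−), since $L_p$ shrinks the former and $L_n$ enforces a margin between the two.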

Discussion

◼ Ablation on the design of the PDML loss function:
  $L_t(P_a^i, P_b^j, P_b^k) = \alpha L_p(P_a^i, P_b^j) + \beta L_n(P_a^i, P_b^j, P_b^k)$

Conclusion

◼ Problem
  ◼ Point-guided scene parsing
◼ Point-based distance metric learning
  ◼ Exploits semantic relationships across images
◼ Experimental results
  ◼ Good performance, both quantitatively and qualitatively

Weakly Supervised Learning for Real-World Computer Vision Applications & The 1st Learning from Imperfect Data (LID) Challenge

CVPR 2019 Workshop, Long Beach, CA

Task 1: Object Segmentation on ILSVRC DET (Image-level Supervision)

Task 2: Scene Parsing on ADE20K (Point Supervision)

https://lidchallenge.github.io/

Thank you
