Annotating Object Instances with a Polygon-RNN
Source: cvrr.ucsd.edu/ece285sp18/files/mandar_ece285.pdf
TRANSCRIPT
ANNOTATING OBJECT INSTANCES
WITH A POLYGON-RNN
Authors: Castrejon et al.
(Dept. of Computer Science,
University of Toronto)
Presented by Mandar Pradhan
OBJECTIVE OF THE PAPER
● To annotate instances in an image as quickly as possible
(AUTOMATIC ANNOTATION)
● To keep the annotation as close to the ground truth as possible (POLYGON
FOR ANNOTATION)
● To allow scope for human intervention to correct automated annotations
(SEMI-AUTOMATIC ANNOTATION)
MOTIVATION BEHIND THE IDEA
● More data == more annotation == time consuming and lots of hard work (if
done by manual polygon annotation)
● Weaker forms of supervision (image tags, bounding boxes, scribbles, single-
point objects) are easier to obtain but not as accurate as full supervision
● Need for human intervention to correct automated annotations, to prevent the
model from breaking down
EARLIER RELATED WORKS
● Semi-automatic annotations:
○ Scribbles / multi-scribbles - segmentation using graph cut, combining
appearance cues and a smoothness term (needs an additional layer of
training examples; not accurate)
○ GrabCut - annotation as 2D bounding boxes + per-pixel labelling using the
EM algorithm (idea extended to 3D bounding boxes + point clouds)
(Figures: Scribbles and GrabCut examples)
Drawbacks:
- Hard to incorporate shape priors
- Labellings can have holes
- Hard to correct (not ideal)
EARLIER RELATED WORKS
● Semi-automatic annotations:
- Done at the superpixel level
- May merge small objects or parts
● Object instance segmentation (**USED IN THIS PAPER):
- CNN used to label a box / patch
- Detect edges and link them to obtain a coherent region
- Combine small polygons into object regions to label images
- HERE, RNNS ARE USED TO DIRECTLY PREDICT THE FINAL
POLYGONS
Polygon - RNN (High level overview)
● Performs automated annotation using a CNN followed by an RNN
● CNN extracts image features from the crop inside the instance's bounding box
● RNN input: image crop inside the bounding box + vertices predicted at t-1 and
t-2 + the initial vertex (details in a subsequent slide)
● RNN output: a “polygon object” outlining the instance within the bounding box
(a polygon is a list of 2D vertices)
● Trained end to end
● The CNN is fine-tuned to object boundaries; the RNN encodes priors on object
shapes
Polygon - RNN (Some more details)
● “Polygon object” : List of vertices of bounding polygon
● Defining a specific polygon admits multiple parameterizations: we can choose
any vertex as the starting point and traverse the rest in either orientation
● Convention: any starting point, clockwise orientation
● Why are the vertices from both t-1 and t-2 fed into the RNN input?
○ To fix the orientation (direction of traversal)
● Why is the initial vertex of the polygon fed into the RNN input?
○ To decide when to close the polygon
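The clockwise, arbitrary-start convention above can be sketched in a few lines (an illustration, not the authors' code). The shoelace signed area tells us the traversal direction; note the sign convention flips in image coordinates, where y points down.

```python
# Sketch: normalize a polygon to clockwise order with an arbitrary
# starting vertex. Assumes a y-up coordinate system; in image
# coordinates (y-down) the sign test below would be inverted.

def signed_area(poly):
    """Shoelace formula: positive for counter-clockwise vertex order."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        area += x1 * y2 - x2 * y1
    return area / 2.0

def to_clockwise(poly, start=0):
    if signed_area(poly) > 0:          # counter-clockwise -> reverse
        poly = poly[::-1]
    return poly[start:] + poly[:start]

square = [(0, 0), (2, 0), (2, 2), (0, 2)]   # counter-clockwise
print(to_clockwise(square))                  # same vertices, clockwise
```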
CNN Module - CNN + Skip connects
● Based on the VGG16 architecture, with the fully connected layers and the last
max-pooling layer removed
● Skip connections from the lower layers are stacked after each passes through a
3x3 convolutional layer + ReLU and is resized to 28 x 28
● Output is downsampled by a factor of 16
● Why skip connections? - To pull out low-level features (like edges and corners)
as well as the semantics of the instance
● How to handle skip connections of different spatial dimensions?
- Bilinear upsampling after an additional convolution at conv5
- 2x2 max-pooling before an additional convolution at pool2
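The multi-scale fusion above can be sketched as follows. This is a simplified illustration: nearest-neighbour upsampling stands in for the learned 3x3 convolutions and bilinear upsampling, and the channel counts are assumptions, not the paper's exact figures.

```python
import numpy as np

# Sketch: bring feature maps from different VGG stages to a common
# 28 x 28 resolution, then stack them along the channel axis.

def maxpool2x2(x):
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample(x, factor):
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

pool2 = np.random.rand(128, 56, 56)   # high-res, low-level features
conv4 = np.random.rand(512, 28, 28)   # already at the target resolution
conv5 = np.random.rand(512, 14, 14)   # low-res, high-level features

fused = np.concatenate([maxpool2x2(pool2),     # 56 -> 28
                        conv4,
                        upsample(conv5, 2)],   # 14 -> 28
                       axis=0)
print(fused.shape)
```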
RNN Module for vertex prediction
● Aim of the RNN - capture the history (previous vertices) and predict the future
(the next vertex of the polygon)
● Makes coherent predictions in ambiguous cases (occlusion, shadows)
● Units: convolutional LSTMs - they operate in 2D, preserve the spatial
information from the CNN, and reduce the number of parameters
(Figure: overall network architecture)
RNN Module for vertex prediction
● Two-layer ConvLSTM with 16 channels and 3x3 kernels
● Representation of the output vertex: one-hot encoding over D x D + 1 entries
● The D x D entries represent the possible 2D grid positions of the vertex
● The additional entry denotes the end-of-sequence token (the polygon is
complete)
● At the input, apart from the CNN representation of the image, we have the
one-hot encodings of the vertices at t-1 and t-2 along with the initial vertex
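The vertex representation above can be sketched directly (D = 28 here is an assumption, matching the 28 x 28 feature grid):

```python
import numpy as np

# Sketch: a vertex (x, y) on a D x D grid is one-hot encoded over
# D*D + 1 entries; the extra entry is the end-of-sequence token that
# marks the polygon as complete.

D = 28
EOS = D * D                       # index of the end-of-sequence token

def encode_vertex(x=None, y=None, eos=False):
    v = np.zeros(D * D + 1, dtype=np.float32)
    v[EOS if eos else y * D + x] = 1.0
    return v

print(encode_vertex(x=3, y=5).argmax())     # row-major: 5 * 28 + 3 = 143
print(encode_vertex(eos=True).argmax())     # 784
```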
RNN Module for vertex prediction
● Prediction of the starting vertex
- Reuses the CNN architecture with two additional branches
- The first branch predicts the object boundaries
- The second branch takes the first branch's output as well as the image
features as input and predicts the vertices
- Both of the above are formulated as binary classification problems
Training Details
● Loss - cross entropy
● Smoothing of the target distribution (the D x D + 1 target is not strictly one-hot)
- To prevent over-penalising nearly-correct predictions
- Non-zero probability is assigned to grid locations within distance 2 of the
target
● Optimizer - Adam
● Batch size - 8
● Learning rate - 10^-4, decayed by a factor of 10 every 10 epochs
● β1 = 0.9, β2 = 0.999 (Adam momentum constants)
● Logistic loss used for the boundary and vertex prediction layers
● Ground truth for the boundary layer - edges of the ground-truth polygon
● Ground truth for the vertex layer - vertices of the ground-truth polygon
● GPU - NVIDIA TITAN X
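The smoothed target can be sketched as below. The uniform weighting over the distance-2 window is an assumption for illustration; the paper only states that nearby cells receive non-zero probability.

```python
import numpy as np

# Sketch: build a smoothed training target over the D x D grid. Cells
# within chessboard (Chebyshev) distance 2 of the true vertex get
# non-zero probability, so near-misses are penalised less by the
# cross-entropy loss. Uniform weights are an assumption.

D = 28

def smoothed_target(x, y):
    t = np.zeros((D, D))
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            yy, xx = y + dy, x + dx
            if 0 <= yy < D and 0 <= xx < D:
                t[yy, xx] = 1.0
    return t / t.sum()               # normalise to a distribution

t = smoothed_target(10, 10)
print(t[10, 10], t[8, 12], t[10, 13])   # inside, corner, outside window
```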
Implementation details
● How to choose the best vertex at each RNN time step? - take the one with the
highest log-probability
● How does correction of a vertex take place? - the annotator feeds the correct
vertex in at the next time step
● Inference time - 250 ms
● Polygon simplification
- Remove the middle vertex when three consecutive vertices are collinear,
and merge consecutive vertices that fall in the same grid cell
● Data augmentation:
- Flip the image crop and annotation at random
- Randomly enlarge the context (10-20% of the bounding box)
- Randomly pick the starting vertex
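The simplification step can be sketched as a single pass over the vertex list (an illustration; it ignores collinearity across the polygon's wrap-around, which a full implementation would also check):

```python
# Sketch: drop the middle vertex of any three collinear consecutive
# vertices, and collapse consecutive vertices in the same grid cell.

def collinear(a, b, c):
    # zero cross product => the three points lie on one line
    return (b[0] - a[0]) * (c[1] - a[1]) == (b[1] - a[1]) * (c[0] - a[0])

def simplify(poly):
    out = []
    for p in poly:
        if out and p == out[-1]:                      # same grid cell
            continue
        if len(out) >= 2 and collinear(out[-2], out[-1], p):
            out.pop()                                 # redundant middle vertex
        out.append(p)
    return out

poly = [(0, 0), (1, 0), (2, 0), (2, 0), (2, 2)]
print(simplify(poly))   # [(0, 0), (2, 0), (2, 2)]
```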
Results
● Datasets: KITTI, Cityscapes
● Goals of the model:
- The polygon must be as accurate as possible
- Minimal number of clicks
● Yardsticks to gauge performance:
- Intersection-over-union measure
- Number of vertex corrections needed to obtain the polygon
● Bounding boxes are assumed given - they are easy to obtain via AMT or an
in-house detector
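The IoU yardstick is straightforward to compute on binary masks; a minimal sketch:

```python
import numpy as np

# Sketch: intersection-over-union between two binary instance masks.

def iou(mask_a, mask_b):
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

a = np.zeros((10, 10), dtype=bool); a[2:8, 2:8] = True    # 36 pixels
b = np.zeros((10, 10), dtype=bool); b[4:10, 4:10] = True  # 36 pixels
print(iou(a, b))   # 16 / 56, about 0.286
```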
Results : Cityscape
● What is in this dataset? - 27 cities, 2950 train images, 500 validation, 1525 test
● Issue faced - the test set has no ground-truth instances
● Solution - the 500 validation images are used as the test set
- Images from Weimar and Zurich form the validation set
● Labels - Person, Car, Rider, Truck, Bus, Train and Motorcycle
● Size of instances - 28-1792 pixels
● The built-in instance segmentation is provided both as pixel labellings and as
polygons
● New problem - polygons in Cityscapes also cover occluded portions
● Solution - use depth ordering to remove the occluded parts (we want only the
visible part)
Results : Cityscape
● What do we do about objects split into multiple components by occlusion?
● The authors treat each component as a single object
● What happens if the RNN keeps adding new vertices without terminating?
● The authors set a hard limit of 70 vertices per polygon (GPU constraints)
Results : Evaluation Metric
● Intersection over union: prediction vs. ground truth (averaged over all
instances)
● How to evaluate the human action (corrections of vertices)? - simulate an
annotator who corrects a predicted vertex whenever it deviates from the
ground truth
● Testing game plan: first do a sanity check in PREDICTION mode (no annotator
interaction to correct). Then evaluate the amount of human intervention
needed
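The simulated annotator can be sketched as below. Matching predicted to ground-truth vertices one-to-one, and the threshold T on chessboard distance, are simplifications of the actual simulation for illustration.

```python
# Sketch: simulate an annotator who "clicks" to replace a predicted
# vertex whenever it lies farther than T (chessboard distance) from
# its ground-truth counterpart.

def chessboard(p, q):
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def simulate_corrections(pred, gt, T):
    clicks, corrected = 0, []
    for p, g in zip(pred, gt):
        if chessboard(p, g) > T:
            clicks += 1
            corrected.append(g)     # annotator supplies the true vertex
        else:
            corrected.append(p)
    return corrected, clicks

pred = [(0, 0), (5, 1), (9, 9)]
gt   = [(0, 0), (5, 4), (9, 9)]
poly, clicks = simulate_corrections(pred, gt, T=2)
print(clicks)   # 1
```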
Results : Baselines
● DeepMask: uses a CNN to output pixel labels, class-agnostic
● SharpMask: improves on DeepMask by upsampling the output to obtain higher
resolution
● Performance is reported using ground-truth boxes
● Network structure: 50-layer ResNet architecture trained on the COCO dataset
● For DeepMask and SharpMask, the ResNet part is trained for 150 epochs
and the upsampling part of SharpMask is trained for 70 epochs
Results : Baselines
● SquareBox: the object is represented by a (slightly shrunk) bounding box, with
an individual box for each component of the object
● Dilation10: uses a semantic segmentation network; pixels of the object's class
inside the box are grouped into an instance mask
Results : Baselines
● Verdict
- Baselines are hard to correct
- Better overall average, and tops the charts in 6 / 8 categories
- Outperforms SharpMask in the Car, Rider and Person classes by 12%, 6%
and 7% respectively
- Why is the previous point worth noting? - SharpMask uses a ResNet
architecture, which is much more powerful than VGG
- The baselines have an advantage on larger objects like Bus and Train due
to their higher output resolution
Results : Annotators in the loop
● How are the quality of the annotation and the amount of human intervention
quantified? - the number of mouse clicks needed to reach different levels of
accuracy
● What is meant by different “levels” of segmentation accuracy? - a threshold on
the chessboard distance of the vertex errors
● The resulting IoU is also reported for comparison
● Methodology in a nutshell
- In the first setting, pick 10 images per annotator and ask them to
annotate freely, without any cues or hints
- In the second setting, crop the images and place blue markers on the
instances to be annotated (to disambiguate the target)
Results : Annotators in the loop
● Verdict
- Human annotator IoU: 69.5% in the free-viewing setting and 78.6% for
cropped images
- Indicates the need to collect multiple annotations to reduce variance and
bias across annotators
Results : Annotators in the loop
● Comparison with GrabCut:
- 54 randomly chosen instances
● GrabCut stats: 42.2 s and 17.5 clicks per instance, 70.7% IoU
● This model's stats: 5-9.6 clicks per instance, 77.6% IoU
● Verdict - this model is faster, needing fewer clicks for a higher IoU at
comparable inference time
Results : Final Verdict
Advantages
● Polygon-RNN provides plausible annotations with relatively low latency
● Performance is good on smaller objects. This is visible in the performance
across instances of varying sizes within the same dataset (Cityscapes) as well
as between the two datasets (smaller objects in KITTI vs. larger objects in
Cityscapes)
● Competes well with SharpMask, which uses a ResNet-based architecture
● Definitely reduces annotation cost, with IoU comparable to human annotation
● The provision for human intervention gives scope to avoid extremely bad
polygons
Results : Final Verdict
Disadvantages
● The low output resolution and associated quantization error show up in the
segmentation of larger instances
● Memory intensive - polygons have more vertices to predict than a single
bounding box, which may add latency in return for more accuracy
● Cannot exploit the Velodyne point clouds in the KITTI dataset like some other
methods can, which puts it at a disadvantage
Results : Final Verdict
Takeaways
● Tries to address the speed and accuracy of annotation together
● The novelty of allowing human intervention keeps the worst cases from
becoming very bad
● Performance is good for smaller objects but degrades for larger instances due
to the limited output resolution
● Scope for future work: improving the output resolution and exploiting Velodyne
point-cloud data to address the KITTI-specific issues
OTHER REFERENCES
[1] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised
convolutional networks for semantic segmentation. In CVPR, 2016.
[2] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground
extraction using iterated graph cuts. In SIGGRAPH, 2004.
QUESTIONS??