Annotating Object Instances with a Polygon-RNN
Source: cvrr.ucsd.edu/ece285sp18/files/mandar_ece285.pdf
TRANSCRIPT
ANNOTATING OBJECT INSTANCES
WITH A POLYGON-RNN
Authors: Castrejon et al.
(Dept. of Computer Science,
University of Toronto)
Presented by Mandar Pradhan
OBJECTIVE OF THE PAPER
● To annotate instances in an image as quickly as possible
(AUTOMATIC ANNOTATION)
● To keep the annotation as close to the ground truth as possible (POLYGON
FOR ANNOTATION)
● To allow scope for human intervention to correct automated annotations
(SEMI-AUTOMATIC ANNOTATION)
MOTIVATION BEHIND THE IDEA
● More data == more annotation == time consuming and lots of hard work (if
done by manual polygon annotation)
● Weaker forms of supervision (image tags, bounding boxes, scribbles, single-
point objects) are easier to obtain but not as accurate as full supervision
● Need for human intervention to correct automated annotations, to prevent the
model from breaking down
EARLIER RELATED WORKS
● Semi-automatic annotations:
○ Scribbles / multi-scribbles - segmentation using graph cut, combining
appearance cues and a smoothness term (needs an additional layer of
training examples; not accurate)
○ GrabCut - annotation as 2D bounding boxes + per-pixel labelling using the
EM algorithm (idea extended to 3D bounding boxes + point clouds)
(Figures: Scribbles and GrabCut examples)
Drawbacks:
- Hard to incorporate shape priors
- Labellings can have holes
- Hard to correct (not ideal)
EARLIER RELATED WORKS
● Semi-automatic annotations:
- Done at the superpixel level
- May merge small objects or parts
● Object instance segmentation (**USED IN THIS PAPER):
- CNN used to label a box / patch
- Detect edges and link them to obtain a coherent region
- Combine small polygons into object regions to label images
- HERE, RNNS ARE USED TO DIRECTLY PREDICT THE FINAL
POLYGONS
Polygon - RNN (High level overview)
● Performs automated annotation using a CNN followed by an RNN
● CNN extracts image features from the crop inside the instance's bounding box
● RNN input: image crop inside the bounding box + vertices predicted at t-1 and
t-2 + the initial vertex (details in a subsequent slide)
● RNN output: a “polygon object” outlining the instance within the bounding box
(a polygon is a list of 2D vertices)
● Trained end to end
● The CNN is fine-tuned to object boundaries; the RNN encodes priors on object
shapes
Polygon - RNN (Some more details)
● “Polygon object” : List of vertices of bounding polygon
● Defining a specific polygon admits multiple parameterizations: we can choose
any vertex as the starting point and traverse the rest in either orientation
● Convention: any starting point, clockwise orientation
● Why are the vertices from both t-1 and t-2 fed into the RNN input?
○ To fix the orientation (direction of traversal)
● Why is the initial vertex of the polygon fed into the RNN input?
○ To decide when to close the polygon
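The clockwise, arbitrary-start convention above can be sketched in a few lines (an illustration, not the authors' code). The shoelace signed area tells us the traversal direction; note the sign convention flips in image coordinates, where y points down.

```python
# Sketch: normalize a polygon to clockwise order with an arbitrary
# starting vertex. Assumes a y-up coordinate system; in image
# coordinates (y-down) the sign test below would be inverted.

def signed_area(poly):
    """Shoelace formula: positive for counter-clockwise vertex order."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        area += x1 * y2 - x2 * y1
    return area / 2.0

def to_clockwise(poly, start=0):
    if signed_area(poly) > 0:          # counter-clockwise -> reverse
        poly = poly[::-1]
    return poly[start:] + poly[:start]

square = [(0, 0), (2, 0), (2, 2), (0, 2)]   # counter-clockwise
print(to_clockwise(square))                  # same vertices, clockwise
```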
CNN Module - CNN + Skip connects
● Based on the VGG16 architecture, with the fully connected layers and the last
max-pooling layer removed
● Skip connections from the lower layers are stacked after each passes through a
3x3 convolutional layer + ReLU and is resized to 28 x 28
● Output is downsampled by a factor of 16
● Why skip connections? - To pull out low-level features (like edges and corners)
as well as the semantics of the instance
● How to handle skip connections of different spatial dimensions?
- Bilinear upsampling after an additional convolution at conv5
- 2x2 max-pooling before an additional convolution at pool2
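The multi-scale fusion above can be sketched as follows. This is a simplified illustration: nearest-neighbour upsampling stands in for the learned 3x3 convolutions and bilinear upsampling, and the channel counts are assumptions, not the paper's exact figures.

```python
import numpy as np

# Sketch: bring feature maps from different VGG stages to a common
# 28 x 28 resolution, then stack them along the channel axis.

def maxpool2x2(x):
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample(x, factor):
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

pool2 = np.random.rand(128, 56, 56)   # high-res, low-level features
conv4 = np.random.rand(512, 28, 28)   # already at the target resolution
conv5 = np.random.rand(512, 14, 14)   # low-res, high-level features

fused = np.concatenate([maxpool2x2(pool2),     # 56 -> 28
                        conv4,
                        upsample(conv5, 2)],   # 14 -> 28
                       axis=0)
print(fused.shape)
```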
RNN Module for vertex prediction
● Aim of the RNN - capture the history (previous vertices) and predict the future
(the next vertex of the polygon)
● Makes coherent predictions in ambiguous cases (occlusion, shadows)
● Units: convolutional LSTMs - they operate in 2D, preserve the spatial
information from the CNN, and reduce the number of parameters
(Figure: overall network architecture)
RNN Module for vertex prediction
● Two-layer ConvLSTM with 16 channels and 3x3 kernels
● Representation of the output vertex: one-hot encoding over D x D + 1 entries
● The D x D entries represent the possible 2D grid positions of the vertex
● The additional entry denotes the end-of-sequence token (the polygon is
complete)
● At the input, apart from the CNN representation of the image, we have the
one-hot encodings of the vertices at t-1 and t-2 along with the initial vertex
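The vertex representation above can be sketched directly (D = 28 here is an assumption, matching the 28 x 28 feature grid):

```python
import numpy as np

# Sketch: a vertex (x, y) on a D x D grid is one-hot encoded over
# D*D + 1 entries; the extra entry is the end-of-sequence token that
# marks the polygon as complete.

D = 28
EOS = D * D                       # index of the end-of-sequence token

def encode_vertex(x=None, y=None, eos=False):
    v = np.zeros(D * D + 1, dtype=np.float32)
    v[EOS if eos else y * D + x] = 1.0
    return v

print(encode_vertex(x=3, y=5).argmax())     # row-major: 5 * 28 + 3 = 143
print(encode_vertex(eos=True).argmax())     # 784
```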
RNN Module for vertex prediction
● Prediction of the starting vertex
- Reuses the CNN architecture with two additional branches
- The first branch predicts the object boundaries
- The second branch takes the first branch's output as well as the image
features as input and predicts the vertices
- Both of the above are formulated as binary classification problems
Training Details
● Loss - cross entropy
● Smoothing of the target distribution (the D x D + 1 target is not strictly one-hot)
- To prevent over-penalising nearly-correct predictions
- Non-zero probability is assigned to grid locations within distance 2 of the
target
● Optimizer - Adam
● Batch size - 8
● Learning rate - 10^-4, decayed by a factor of 10 every 10 epochs
● β1 = 0.9, β2 = 0.999 (Adam momentum constants)
● Logistic loss used for the boundary and vertex prediction layers
● Ground truth for the boundary layer - edges of the ground-truth polygon
● Ground truth for the vertex layer - vertices of the ground-truth polygon
● GPU - NVIDIA TITAN X
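The smoothed target can be sketched as below. The uniform weighting over the distance-2 window is an assumption for illustration; the paper only states that nearby cells receive non-zero probability.

```python
import numpy as np

# Sketch: build a smoothed training target over the D x D grid. Cells
# within chessboard (Chebyshev) distance 2 of the true vertex get
# non-zero probability, so near-misses are penalised less by the
# cross-entropy loss. Uniform weights are an assumption.

D = 28

def smoothed_target(x, y):
    t = np.zeros((D, D))
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            yy, xx = y + dy, x + dx
            if 0 <= yy < D and 0 <= xx < D:
                t[yy, xx] = 1.0
    return t / t.sum()               # normalise to a distribution

t = smoothed_target(10, 10)
print(t[10, 10], t[8, 12], t[10, 13])   # inside, corner, outside window
```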
Implementation details
● How to choose the best vertex at each RNN time step? - take the one with the
highest log-probability
● How does correction of a vertex take place? - the annotator feeds the correct
vertex in at the next time step
● Inference time - 250 ms
● Polygon simplification
- Remove the middle vertex when three consecutive vertices are collinear,
and merge consecutive vertices that fall in the same grid cell
● Data augmentation:
- Flip the image crop and annotation at random
- Randomly enlarge the context (10-20% of the bounding box)
- Randomly pick the starting vertex
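The simplification step can be sketched as a single pass over the vertex list (an illustration; it ignores collinearity across the polygon's wrap-around, which a full implementation would also check):

```python
# Sketch: drop the middle vertex of any three collinear consecutive
# vertices, and collapse consecutive vertices in the same grid cell.

def collinear(a, b, c):
    # zero cross product => the three points lie on one line
    return (b[0] - a[0]) * (c[1] - a[1]) == (b[1] - a[1]) * (c[0] - a[0])

def simplify(poly):
    out = []
    for p in poly:
        if out and p == out[-1]:                      # same grid cell
            continue
        if len(out) >= 2 and collinear(out[-2], out[-1], p):
            out.pop()                                 # redundant middle vertex
        out.append(p)
    return out

poly = [(0, 0), (1, 0), (2, 0), (2, 0), (2, 2)]
print(simplify(poly))   # [(0, 0), (2, 0), (2, 2)]
```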
Results
● Datasets: KITTI, Cityscapes
● Goals of the model:
- The polygon must be as accurate as possible
- Minimal number of clicks
● Yardsticks to gauge performance:
- Intersection-over-union measure
- Number of vertex corrections needed to obtain the polygon
● Bounding boxes are assumed given - they are easy to obtain via AMT or an
in-house detector
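The IoU yardstick is straightforward to compute on binary masks; a minimal sketch:

```python
import numpy as np

# Sketch: intersection-over-union between two binary instance masks.

def iou(mask_a, mask_b):
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

a = np.zeros((10, 10), dtype=bool); a[2:8, 2:8] = True    # 36 pixels
b = np.zeros((10, 10), dtype=bool); b[4:10, 4:10] = True  # 36 pixels
print(iou(a, b))   # 16 / 56, about 0.286
```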
Results : Cityscape
● What is in this dataset? - 27 cities, 2950 train images, 500 validation, 1525 test
● Issue faced - the test set has no ground-truth instances
● Solution - the 500 validation images are used as the test set
- Images from Weimar and Zurich form the validation set
● Labels - Person, Car, Rider, Truck, Bus, Train and Motorcycle
● Size of instances - 28-1792 pixels
● The built-in instance segmentation is provided both as pixel labellings and as
polygons
● New problem - polygons in Cityscapes also cover occluded portions
● Solution - use depth ordering to remove the occluded parts (we want only the
visible part)
Results : Cityscape
● What do we do about objects split into multiple components by occlusion?
● The authors treat each component as a single object
● What happens if the RNN keeps adding new vertices without terminating?
● The authors set a hard limit of 70 vertices per polygon (GPU constraints)
Results : Evaluation Metric
● Intersection over union: prediction vs. ground truth (averaged over all
instances)
● How to evaluate the human action (corrections of vertices)? - simulate an
annotator who corrects a predicted vertex whenever it deviates from the
ground truth
● Testing game plan: first do a sanity check in PREDICTION mode (no annotator
interaction to correct). Then evaluate the amount of human intervention
needed
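The simulated annotator can be sketched as below. Matching predicted to ground-truth vertices one-to-one, and the threshold T on chessboard distance, are simplifications of the actual simulation for illustration.

```python
# Sketch: simulate an annotator who "clicks" to replace a predicted
# vertex whenever it lies farther than T (chessboard distance) from
# its ground-truth counterpart.

def chessboard(p, q):
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def simulate_corrections(pred, gt, T):
    clicks, corrected = 0, []
    for p, g in zip(pred, gt):
        if chessboard(p, g) > T:
            clicks += 1
            corrected.append(g)     # annotator supplies the true vertex
        else:
            corrected.append(p)
    return corrected, clicks

pred = [(0, 0), (5, 1), (9, 9)]
gt   = [(0, 0), (5, 4), (9, 9)]
poly, clicks = simulate_corrections(pred, gt, T=2)
print(clicks)   # 1
```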
Results : Baselines
● DeepMask: uses a CNN to output pixel labels, class-agnostic
● SharpMask: improves on DeepMask by upsampling the output to obtain higher
resolution
● Performance is reported using ground-truth boxes
● Network structure: 50-layer ResNet architecture trained on the COCO dataset
● For DeepMask and SharpMask, the ResNet part is trained for 150 epochs
and the upsampling part of SharpMask is trained for 70 epochs
Results : Baselines
● SquareBox: the object is represented by a (slightly shrunk) bounding box, with
an individual box for each component of the object
● Dilation10: uses a semantic segmentation network; pixels of the object's class
inside the box are grouped into an instance mask
Results : Baselines
● Verdict
- Baselines are hard to correct
- Better overall average, and tops the charts in 6 / 8 categories
- Outperforms SharpMask in the Car, Rider and Person classes by 12%, 6%
and 7% respectively
- Why is the previous point worth noting? - SharpMask uses a ResNet
architecture, which is much more powerful than VGG
- The baselines have an advantage on larger objects like Bus and Train due
to their higher output resolution
Results : Annotators in the loop
● How are the quality of the annotation and the amount of human intervention
quantified? - the number of mouse clicks needed to reach different levels of
accuracy
● What is meant by different “levels” of segmentation accuracy? - a threshold on
the chessboard distance of the vertex errors
● The resulting IoU is also reported for comparison
● Methodology in a nutshell
- In the first setting, pick 10 images per annotator and ask them to
annotate freely, without any cues or hints
- In the second setting, crop the images and place blue markers on the
instances to be annotated (to disambiguate the target)
Results : Annotators in the loop
● Verdict
- Human annotator IoU: 69.5% in the free-viewing setting and 78.6% for
cropped images
- Indicates the need to collect multiple annotations to reduce variance and
bias across annotators
Results : Annotators in the loop
● Comparison with GrabCut:
- 54 randomly chosen instances
● GrabCut stats: 42.2 s and 17.5 clicks per instance, 70.7% IoU
● This model's stats: 5-9.6 clicks per instance, 77.6% IoU
● Verdict - this model is faster, needing fewer clicks for a higher IoU at
comparable inference time
Results : Final Verdict
Advantages
● Polygon-RNN provides plausible annotations with relatively low latency
● Performance is good on smaller objects. This is visible in the performance
across instances of varying sizes within the same dataset (Cityscapes) as well
as between the two datasets (smaller objects in KITTI vs. larger objects in
Cityscapes)
● Competes well with SharpMask, which uses a ResNet-based architecture
● Definitely reduces annotation cost, with IoU comparable to human annotation
● The provision for human intervention gives scope to avoid extremely bad
polygons
Results : Final Verdict
Disadvantages
● The low output resolution and associated quantization error show up in the
segmentation of larger instances
● Memory intensive - polygons have more vertices to predict than a single
bounding box, which may add latency in return for more accuracy
● Cannot exploit the Velodyne point clouds in the KITTI dataset like some other
methods can, which puts it at a disadvantage
Results : Final Verdict
Takeaways
● Tries to address the speed and accuracy of annotation together
● The novelty of allowing human intervention keeps the worst cases from
becoming very bad
● Performance is good for smaller objects but degrades for larger instances due
to the limited output resolution
● Scope for future work: improving the output resolution and exploiting Velodyne
point-cloud data to address the KITTI-specific issues
OTHER REFERENCES
[1] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised
convolutional networks for semantic segmentation. In CVPR, 2016.
[2] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground
extraction using iterated graph cuts. In SIGGRAPH, 2004.
QUESTIONS??