
SSD: Single Shot MultiBox Detector

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg

Slides by: Sulabh Shrestha

Receptive Field

Ref: https://cv-tricks.com/object-detection/single-shot-multibox-detector-ssd/

▪ Deep feature maps
  ▪ Smaller size
  ▪ Larger receptive fields
  ▪ May miss small objects

▪ Shallow feature maps
  ▪ Larger size
  ▪ Smaller receptive fields
  ▪ May not be able to see larger objects

▪ Use multiple feature maps, each for objects whose size matches its receptive field


Architecture

▪ Base Network + Extra Feature Layer

▪ No FC layer

▪ Specific feature maps are responsive to a particular scale of objects
  ▪ Not necessarily the same as the receptive field
  ▪ A hyper-parameter
  ▪ Dependent on data

[Figure: 8x8 feature map and 4x4 feature map]

VGG

Base Network

▪ VGG-16

▪ Pool5 changed:
  ▪ 3x3 kernel instead of 2x2
  ▪ Stride 1 instead of 2

▪ First two FC layers replaced by convolutional layers
  ▪ As in DeepLab-LargeFOV

▪ Last FC layer removed altogether

▪ No dropout used

▪ Conv4_3 also used for prediction
  ▪ 4th group of conv layers
  ▪ 3rd conv layer in that group

Ref: Very Deep Convolutional Networks for Large-Scale Image Recognition, https://arxiv.org/abs/1409.1556

Multiple Default Boxes

▪ Similar to the anchor boxes of Faster R-CNN

▪ Example feature map:
  ▪ m x n
  ▪ p channels

▪ For each location (i, j)
  ▪ Multiple default boxes (k)
  ▪ 3 x 3 x p convolutional predictor for each box
    ▪ Confidence of each class, c_i; i ∈ [1, C]
    ▪ x, y, w, h
    ▪ (C + 4) outputs

▪ Total outputs for one feature map:
  ▪ m * n * k * (C + 4)
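The output count above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the per-layer feature-map sizes and boxes-per-location are an assumption taken from the SSD300 configuration described in the SSD paper, not from these slides.

```python
def outputs_per_feature_map(m: int, n: int, k: int, num_classes: int) -> int:
    """Predicted values for an m x n feature map with k default boxes per location.

    Each default box gets num_classes confidences plus 4 box offsets.
    """
    return m * n * k * (num_classes + 4)


# Assumption (not in the slides): SSD300 prediction layers as (map size, k),
# following the configuration reported in the SSD paper.
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
NUM_CLASSES = 21  # assumption: 20 PASCAL VOC classes + background

# Total number of default boxes over all prediction layers.
total_boxes = sum(size * size * k for size, k in SSD300_LAYERS)

# Total number of predicted values over all prediction layers.
total_outputs = sum(
    outputs_per_feature_map(size, size, k, NUM_CLASSES)
    for size, k in SSD300_LAYERS
)

print(total_boxes, total_outputs)
```

Under these assumptions the network predicts several thousand default boxes per image, which is why the inference stage needs confidence filtering and NMS.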


Scale and Aspect ratio

▪ How many default boxes per location?

▪ Scale
  ▪ Related to, but not exactly the same as, the receptive field
  ▪ If m feature maps are used for prediction:
    ▪ s_min = 0.2
    ▪ s_max = 0.9
  ▪ E.g.
    ▪ s = 0.2
    ▪ img-size = 300
    ▪ Corresponding default box size = 0.2 * 300 = 60

▪ Aspect ratios (a_r)
  ▪ {1, 2, 3, 1/2, 1/3}, one default box each (~ k per location)
  ▪ Width: w_k^a = s_k * √a_r
  ▪ Height: h_k^a = s_k / √a_r
  ▪ E.g.
    ▪ s = 0.2, img-size = 300
    ▪ a_r = 1 --> w = 0.2 * 300 = 60, h = 0.2 * 300 = 60
    ▪ a_r = 2 --> w = 0.2 * √2 * 300 = 85, h = 0.2 / √2 * 300 = 42
    ▪ a_r = 1/2 --> w = 0.2 * √½ * 300 = 42, h = 0.2 / √½ * 300 = 85
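The worked example above can be reproduced in code. This is a minimal sketch with hypothetical function names. The linear spacing of scales between s_min and s_max is an assumption taken from the SSD paper; the slides give only the two endpoints.

```python
import math


def scale_for_layer(k: int, m: int, s_min: float = 0.2, s_max: float = 0.9) -> float:
    """Scale of the k-th (1-indexed) of m prediction feature maps.

    Assumption: scales are linearly spaced between s_min and s_max,
    as in the SSD paper; the slides only state the endpoints.
    """
    if m == 1:
        return s_min
    return s_min + (s_max - s_min) * (k - 1) / (m - 1)


def default_box_size(scale: float, aspect_ratio: float, img_size: int = 300):
    """Width and height in pixels: w = s * sqrt(ar), h = s / sqrt(ar)."""
    w = scale * math.sqrt(aspect_ratio) * img_size
    h = scale / math.sqrt(aspect_ratio) * img_size
    return w, h


# Reproduce the slide's example for s = 0.2 and img-size = 300.
for ar in (1, 2, 0.5):
    w, h = default_box_size(0.2, ar)
    print(f"ar = {ar}: w = {round(w)}, h = {round(h)}")
```

Note that w * h stays equal to (s * img-size)² for every aspect ratio, so all default boxes of one scale cover the same area.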

Training

• Base network pre-trained on the ImageNet CLS-LOC dataset
  • Fine-tuned for the respective dataset

• Matching strategy
  • Any default box with IoU(default box, ground truth) > 0.5 → positive
  • Simplifies the learning problem
  • An object can be detected in multiple overlapping default boxes

• Loss
  • Confidence loss (c)
    • Softmax loss over multiple classes
  • Localization loss (x, y, w, h)
    • Smooth L1 loss
    • Predicted box (l) vs ground truth box (g)
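The matching rule above can be sketched as follows. The helper names and the corner box format (xmin, ymin, xmax, ymax) are assumptions for illustration. The sketch covers only the IoU > 0.5 rule from the slide; the paper additionally matches each ground truth box to its single best default box first, which is omitted here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """For each default box, return the index of its matched ground truth box.

    A default box is positive if its best IoU with any ground truth box
    exceeds the threshold; otherwise it is marked -1 (negative).
    """
    matches = []
    for d in default_boxes:
        best_idx, best_iou = -1, threshold
        for g_idx, g in enumerate(gt_boxes):
            overlap = iou(d, g)
            if overlap > best_iou:
                best_idx, best_iou = g_idx, overlap
        matches.append(best_idx)
    return matches


# Hypothetical example: one ground truth box, three default boxes.
defaults = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
ground_truth = [(0, 0, 10, 8)]
print(match_default_boxes(defaults, ground_truth))
```

Because several default boxes can exceed the threshold for the same object, one object may produce multiple positives, as the slide notes.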

Ref: https://github.com/rbgirshick/py-faster-rcnn/files/764206/SmoothL1Loss.1.pdf
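For reference, the Smooth L1 loss cited above is quadratic for small errors and linear for large ones. The sketch below uses the common form with the transition at |x| = 1; the function names are hypothetical, and the normalization and encoding of box offsets used in SSD are left out.

```python
def smooth_l1(x: float) -> float:
    """Smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def localization_loss(predicted, target) -> float:
    """Sum of Smooth L1 over the four box parameters (x, y, w, h)."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted, target))


# Hypothetical offsets: small errors are penalized quadratically,
# large errors linearly, which limits the influence of outliers.
print(localization_loss([0.5, 2.0, 0.0, -1.0], [0.0, 0.0, 0.0, 0.0]))
```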


Results

PASCAL VOC2007 test detection results

PASCAL VOC2012 test detection results

Inference

• Filter boxes with low confidence

• NMS with 0.45 IOU

• Take top 200 detections

• Better mAP

• Higher FPS

(Results on VOC2007 test data)
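The three inference steps listed above (confidence filtering, NMS at 0.45 IoU, keeping the top 200 detections) can be sketched for a single class as follows. The function names and box format are assumptions, and the confidence threshold of 0.01 is the value reported in the SSD paper rather than in these slides. In practice NMS is applied per class.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(boxes, scores, conf_threshold=0.01, iou_threshold=0.45, top_k=200):
    """Return indices of kept detections, highest score first.

    1. Drop boxes whose confidence is below conf_threshold.
    2. Greedy NMS: keep a box only if its IoU with every already kept
       box is at most iou_threshold.
    3. Keep at most top_k detections.
    """
    candidates = [i for i, s in enumerate(scores) if s >= conf_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep[:top_k]


# Hypothetical detections: box 1 overlaps box 0 heavily, box 3 has low confidence.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.005]
print(filter_and_nms(boxes, scores))
```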

Analysis

• Better than two-stage networks:
  • Single network for localization and classification

• Better than YOLO
  • Uses multiple feature maps
  • Uses many more default boxes
  • No FC layer

• Faster inference
  • Fewer parameters
  • Smaller input size
    • Faster R-CNN: 600 min. size
    • YOLO: 448 x 448

Ablation Studies - 1

• Data augmentation helps; each training image is one of:
  • Original image
  • Random sample of a patch
  • Sampled patch with minimum IoU with the objects of 0.1, 0.3, 0.5, 0.7, or 0.9

• More default boxes help

• Using FC instead of convolution (atrous)
  • Similar result
  • About 20% slower
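The data augmentation options listed at the top of this slide can be sketched as a sampling routine. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the patch size range (0.1 to 1 of the image) and aspect ratio range (1/2 to 2) are taken from the SSD paper rather than the slides, and requiring the minimum IoU with at least one object is an assumption.

```python
import random


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sample_patch(img_w, img_h, gt_boxes, min_iou, rng, max_tries=50):
    """Sample a patch; if min_iou is set, require it for at least one object.

    Returns (xmin, ymin, xmax, ymax), or None if no valid patch was found.
    """
    for _ in range(max_tries):
        w = rng.uniform(0.1, 1.0) * img_w
        h = rng.uniform(0.1, 1.0) * img_h
        if not 0.5 <= w / h <= 2.0:  # keep aspect ratio between 1/2 and 2
            continue
        x = rng.uniform(0.0, img_w - w)
        y = rng.uniform(0.0, img_h - h)
        patch = (x, y, min(x + w, float(img_w)), min(y + h, float(img_h)))
        if min_iou is None or any(iou(patch, g) >= min_iou for g in gt_boxes):
            return patch
    return None


def sample_training_region(img_w, img_h, gt_boxes, rng):
    """Pick one option: original image, random patch, or min-IoU patch."""
    full_image = (0.0, 0.0, float(img_w), float(img_h))
    option = rng.choice(["original", None, 0.1, 0.3, 0.5, 0.7, 0.9])
    if option == "original":
        return full_image
    patch = sample_patch(img_w, img_h, gt_boxes, option, rng)
    return patch if patch is not None else full_image  # fall back to the full image


rng = random.Random(0)
objects = [(50.0, 50.0, 250.0, 250.0)]
print(sample_training_region(300, 300, objects, rng))
```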

Ablation Studies - 2

• Use different numbers of feature maps
  • Similar # of default boxes to make the comparison fair

• More feature maps are better
  • Up to a certain extent

• Not using boundary default boxes is better
  • Avoids default boxes lying outside the image

Thank you

Questions?
