SSD: Single Shot MultiBox Detector

Wei Liu(1), Dragomir Anguelov(2), Dumitru Erhan(3), Christian Szegedy(3), Scott Reed(4), Cheng-Yang Fu(1), Alexander C. Berg(1)

UNC Chapel Hill(1), Zoox Inc.(2), Google Inc.(3), University of Michigan(4)


Post on 02-Mar-2021


Pages 2-3:

VGGNet, Titan X Pascal

Pages 4-9:

Figure: detection speed (fps) vs. VOC2007 test mAP.

Method                       mAP    Speed
R-CNN, Girshick 2014         66%    0.02 fps
Fast R-CNN, Girshick 2015    70%    0.4 fps
Faster R-CNN, Ren 2015       73%    7 fps
YOLO, Redmon 2016            66%    21 fps
SSD300                       74%    46 fps     (annotated "6.6x faster")
SSD512                       77%    19 fps     (annotated "11% better")

All with VGGNet pretrained on ImageNet, batch_size = 1 on Titan X.

Pages 7-9 replace the SSD points with updated results: SSD300 at 77% mAP / 46 fps and SSD512 at 80% mAP / 19 fps.

Page 9 groups the methods into Two-Stage (box proposal + post-classify) and Single Shot.
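The "6.6x faster" and "11% better" annotations can be checked with a little arithmetic. The sketch below copies the figure's numbers and assumes, since the slides do not name the baselines, that the speedup is measured against Faster R-CNN and the accuracy gain against YOLO in absolute mAP points.

```python
# Numbers copied from the speed/accuracy figure (VOC2007 test, batch size 1).
RESULTS = {
    "Faster R-CNN": {"map": 73.0, "fps": 7.0},
    "YOLO": {"map": 66.0, "fps": 21.0},
    "SSD300": {"map": 74.0, "fps": 46.0},
    "SSD512": {"map": 77.0, "fps": 19.0},
}


def speedup(method, baseline):
    """Ratio of frames per second between two methods."""
    return RESULTS[method]["fps"] / RESULTS[baseline]["fps"]


def map_gain(method, baseline):
    """Absolute difference in mAP (percentage points)."""
    return RESULTS[method]["map"] - RESULTS[baseline]["map"]


# Assumed baselines: Faster R-CNN for speed, YOLO for accuracy.
print(round(speedup("SSD300", "Faster R-CNN"), 1))
print(map_gain("SSD512", "YOLO"))
```

Under those assumed baselines the two values come out as 6.6 and 11.0, matching the annotations.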

Pages 10-20:

Bounding Box Prediction

Classical sliding windows
• Discretize the box space densely
• Ask a yes/no question for each window (e.g. "Is it a cat? No")

SSD and other deep approaches
• Discretize the box space more coarsely
• Refine the coordinates of each box
• Score every class for each box (e.g. "cat: 0.8, dog: 0.1" or "dog: 0.4, cat: 0.2")
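A minimal sketch of "discretize coarsely, then refine": a coarse grid of default boxes is laid over the image, and each box is adjusted by predicted offsets. The slides do not give the offset parameterization, so the center/size encoding below (center shifts scaled by box size, log-scale width and height) is an assumption, and `make_default_boxes` and `refine` are hypothetical helper names.

```python
import numpy as np


def make_default_boxes(grid_size, box_size):
    """One square default box per grid cell, as (cx, cy, w, h) in [0, 1]."""
    centers = (np.arange(grid_size) + 0.5) / grid_size
    cx, cy = np.meshgrid(centers, centers)
    wh = np.full(cx.size, box_size)
    return np.stack([cx.ravel(), cy.ravel(), wh, wh], axis=1)


def refine(default_boxes, offsets):
    """Apply predicted offsets (dx, dy, dw, dh) to each default box.

    Assumed encoding: center shifts are relative to the box size and
    size changes are in log space.
    """
    cx = default_boxes[:, 0] + offsets[:, 0] * default_boxes[:, 2]
    cy = default_boxes[:, 1] + offsets[:, 1] * default_boxes[:, 3]
    w = default_boxes[:, 2] * np.exp(offsets[:, 2])
    h = default_boxes[:, 3] * np.exp(offsets[:, 3])
    return np.stack([cx, cy, w, h], axis=1)


boxes = make_default_boxes(grid_size=4, box_size=0.25)  # 16 coarse boxes
refined = refine(boxes, np.zeros_like(boxes))           # zero offsets keep the boxes unchanged
```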

Pages 21-24:

SSD Output Layer

• A ConvNet produces a feature map.
• A small (e.g. 3x3) conv kernel is applied at each feature map location.
• Each location is associated with a default box.
• For each default box, the output is a box regression and multiclass probabilities.
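As a rough illustration of the output layer, the sketch below splits a conv output into per-box class probabilities and box offsets. It assumes that each location carries `boxes_per_location * (num_classes + 4)` channels and that the channels are laid out box by box with class scores first; that layout and the function name `split_predictions` are assumptions, not something the slides specify.

```python
import numpy as np


def split_predictions(conv_out, boxes_per_location, num_classes):
    """Split an (H, W, k * (classes + 4)) conv output into per-box predictions."""
    h, w, channels = conv_out.shape
    per_box = num_classes + 4
    if channels != boxes_per_location * per_box:
        raise ValueError("channel count must equal k * (classes + 4)")
    preds = conv_out.reshape(h * w * boxes_per_location, per_box)
    class_scores = preds[:, :num_classes]
    box_offsets = preds[:, num_classes:]
    # Softmax over classes, shifted by the row maximum for numerical stability.
    shifted = class_scores - class_scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs, box_offsets


# Hypothetical example: 5x5 feature map, 4 boxes per location, 21 classes.
conv_out = np.zeros((5, 5, 4 * (21 + 4)))
probs, offsets = split_predictions(conv_out, boxes_per_location=4, num_classes=21)
```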

Page 25:

SSD Training

• Match default boxes to ground truth boxes to determine true/false positives.
• Loss = SmoothL1(box param) + Softmax(class prob)

(Figure labels: box regression feeds the Smooth L1 loss; multiclass probabilities feed the Softmax loss.)
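The loss can be sketched in NumPy as follows. Smooth L1 uses the usual form (quadratic below 1, linear above). Applying the box term only to matched (positive) boxes and dividing by the number of positives are assumptions here, since the slide gives only the sum of the two terms.

```python
import numpy as np


def smooth_l1(pred, target):
    """Sum of Smooth L1 terms: 0.5 * d^2 if |d| < 1, else |d| - 0.5."""
    diff = np.abs(pred - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum()


def softmax_cross_entropy(logits, labels):
    """Sum of softmax cross-entropy terms over all boxes."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].sum()


def ssd_loss(box_pred, box_target, logits, labels, positive_mask):
    """Box loss on positives plus class loss, normalized by the positive count (assumed)."""
    loc = smooth_l1(box_pred[positive_mask], box_target[positive_mask])
    conf = softmax_cross_entropy(logits, labels)
    num_pos = max(int(positive_mask.sum()), 1)
    return (loc + conf) / num_pos
```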

Pages 26-32:

Related Work

MultiBox [Erhan et al. CVPR14]
• Fully connected
• Outputs: P(objectness) for K boxes, offsets for K boxes
• + post classify boxes

YOLO [Redmon et al. CVPR16]
• Fully connected
• Outputs: multiclass prob for K boxes, offsets for K boxes

Faster R-CNN [Ren et al. NIPS15]
• Convolutional
• Outputs: P(objectness), box offsets
• + post classify boxes

SSD
• Convolutional
• Outputs: multiclass prob, box offsets

Pages 33-35:

Contribution #1: Multi-Scale Feature Maps

• The ConvNet feature map feeds a predictor for box regression and multiclass scores.
• A stride 2 convolution produces a smaller feature map, which feeds its own predictor for box regression and multiclass scores.

Pages 36-37:

Multi-Scale Feature Maps

SSD: 8×8 feature map and 4×4 feature map
vs.
Faster R-CNN Objectness Proposal, Ren 2015: 8×8 feature map

Pages 38-46:

Multi-Scale Feature Maps Experiment

Prediction source layers come from the 38×38, 19×19, 10×10, 5×5, 3×3, and 1×1 feature maps. The mAP columns report results with and without boundary boxes.

Source layers used    mAP (boundary boxes: Yes)    mAP (boundary boxes: No)    # Boxes
6 of 6                74.3                         63.4                        8732
3 of 6                70.7                         69.2                        9864
1 of 6                62.4                         64.0                        8664

Pages 47-49:

Contribution #2: Splitting the Region Space

(Figure labels: ConvNet, convolution.)

SSD300
include {1/2, 2} box?    no      yes     yes
include {1/3, 3} box?    no      no      yes
number of boxes          3880    7760    8732
VOC2007 test mAP         71.6    73.7    74.3

Use 38x38 feature map (conv4_3): +2.5 mAP
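The box totals 3880, 7760, and 8732 can be reproduced by counting default boxes over the six feature maps. The slides give only the totals, so the per-location counts below are assumptions: 2 boxes per location for the aspect-ratio-1 baseline, 4 when {1/2, 2} is added, and 4, 6, 6, 6, 4, 4 across the layers when {1/3, 3} is added.

```python
# Feature map sizes from the slides: 38x38, 19x19, 10x10, 5x5, 3x3, 1x1.
FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]


def count_default_boxes(boxes_per_location):
    """Total default boxes given one per-location count for each feature map."""
    return sum(s * s * k for s, k in zip(FEATURE_MAP_SIZES, boxes_per_location))


# Assumed per-location counts (not stated on the slides).
print(count_default_boxes([2] * 6))             # 3880
print(count_default_boxes([4] * 6))             # 7760
print(count_default_boxes([4, 6, 6, 6, 4, 4]))  # 8732
```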

Pages 50-56:

Why So Many Default Boxes?

                   Faster R-CNN    YOLO       SSD300     SSD512
# Default Boxes    6000            98         8732       24564
Resolution         1000x600        448x448    300x300    512x512

(Figure: a GT box compared with a DETECTION box.)

• SmoothL1 or L2 loss for box shape averages among likely hypotheses
• Need to have enough default boxes (discrete bins) to do accurate regression in each
• General principle for regressing complex continuous outputs with deep nets
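A toy illustration of the averaging effect, using a deliberately simplified setup with made-up numbers: a single regressor trained with L2 loss on two plausible box shapes converges to their mean, which matches neither. Splitting the targets into bins (standing in for default boxes) lets each bin regress to its own shape.

```python
import numpy as np


def best_l2_prediction(targets):
    """The single prediction minimizing summed squared error is the mean."""
    return targets.mean(axis=0)


def binned_predictions(targets, default_shapes):
    """Assign each target to its nearest default shape, then average per bin."""
    dists = np.linalg.norm(targets[:, None, :] - default_shapes[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    return {int(b): targets[assignment == b].mean(axis=0) for b in np.unique(assignment)}


# Two hypothetical (width, height) hypotheses for the same location.
targets = np.array([[0.2, 0.2], [0.8, 0.8]])
default_shapes = np.array([[0.25, 0.25], [0.75, 0.75]])

single = best_l2_prediction(targets)                  # [0.5, 0.5], matches neither target
per_bin = binned_predictions(targets, default_shapes)  # each bin recovers its own target
```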

Pages 57-67:

Handling Many Default Boxes

• Matching ground truth and default boxes
  ‣ Match each GT box to closest default box
  ‣ Also match each GT box to all unassigned default boxes with IoU > 0.5
• Hard negative mining
  ‣ Unbalanced training: 1-30 TP, 8k-25k FP
  ‣ Keep TP:FP ratio fixed (1:3), use worst-misclassified FPs.

(Figure: GT boxes and default boxes, with default boxes labeled TP, TP, FP, and "?".)
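The matching and hard negative mining steps can be sketched as below. "Closest" is interpreted as highest IoU, which is an assumption, as are the corner box format (xmin, ymin, xmax, ymax) and the function names. If two GT boxes claim the same default box in step 1, this simplified version lets the later one win.

```python
import numpy as np


def iou_matrix(gt, defaults):
    """IoU between every GT box and every default box."""
    lt = np.maximum(gt[:, None, :2], defaults[None, :, :2])
    rb = np.minimum(gt[:, None, 2:], defaults[None, :, 2:])
    wh = np.clip(rb - lt, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    area_df = (defaults[:, 2] - defaults[:, 0]) * (defaults[:, 3] - defaults[:, 1])
    return inter / (area_gt[:, None] + area_df[None, :] - inter)


def match_default_boxes(gt, defaults, threshold=0.5):
    """Per default box: index of the matched GT box, or -1 for a negative."""
    iou = iou_matrix(gt, defaults)
    matches = np.full(defaults.shape[0], -1)
    # Step 1: each GT box claims its highest-IoU default box.
    matches[iou.argmax(axis=1)] = np.arange(gt.shape[0])
    # Step 2: unassigned default boxes match their best GT box if IoU > threshold.
    extra = (matches == -1) & (iou.max(axis=0) > threshold)
    matches[extra] = iou.argmax(axis=0)[extra]
    return matches


def hard_negatives(matches, negative_loss, ratio=3):
    """Indices of the highest-loss negatives, at most ratio x the positive count."""
    num_pos = int((matches >= 0).sum())
    neg_idx = np.flatnonzero(matches < 0)
    order = neg_idx[np.argsort(-negative_loss[neg_idx])]
    return order[: ratio * num_pos]
```

With `ratio=3` the selection keeps at most three negatives per positive, following the 1:3 TP:FP ratio on the slide.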

Page 68:

SSD Architecture

Figure: a 300x300 input image passes through VGG16, followed by Extra Convolutional Feature Maps. Feature map sizes shown: 38, 19, 10, 5, 3, 1. Classifier layers are labeled Conv: 3x3x(3x(Classes+4)), Conv: 3x3x(6x(Classes+4)), and Conv: 3x3x(4x(Classes+4)). Detections: 8732 per class, followed by Non-Maximum Suppression. SSD: 74.3 mAP, 46 FPS.

Pages 69-77:

Contribution #3: The Devil is in the Details

Data Augmentation

data augmentation                 SSD300
horizontal flip                   yes     yes     yes
random crop & color distortion    no      yes     yes
random expansion                  no      no      yes
VOC2007 test mAP                  65.5    74.3    77.2

Random expansion creates more small training examples.
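A minimal sketch of random expansion, assuming pixel-coordinate boxes as (xmin, ymin, xmax, ymax): the image is pasted at a random position on a larger canvas and the boxes are shifted to match. Resizing the canvas back to the network input size then makes objects smaller. The `max_ratio` and `fill` defaults are illustrative assumptions, not values taken from the slides.

```python
import numpy as np


def random_expand(image, boxes, max_ratio=4.0, fill=0.0, rng=None):
    """Paste `image` onto a larger canvas and shift `boxes` accordingly."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    ratio = rng.uniform(1.0, max_ratio)
    new_h, new_w = int(h * ratio), int(w * ratio)
    # Random top-left corner such that the image fits inside the canvas.
    top = int(rng.integers(0, new_h - h + 1))
    left = int(rng.integers(0, new_w - w + 1))
    canvas = np.full((new_h, new_w, c), fill, dtype=image.dtype)
    canvas[top:top + h, left:left + w] = image
    shifted = boxes + np.array([left, top, left, top], dtype=boxes.dtype)
    return canvas, shifted


image = np.ones((10, 20, 3))
boxes = np.array([[2.0, 3.0, 8.0, 9.0]])
canvas, shifted = random_expand(image, boxes, rng=np.random.default_rng(0))
```

The box keeps its pixel size while the canvas grows, so its size relative to the image shrinks.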

Page 78: SSD: Single Shot MultiBox Detectorwliu/papers/ssd_eccv2016_slide.pdfSSD: Single Shot MultiBox Detector Wei Liu(1), Dragomir Anguelov(2), Dumitru Erhan(3), Christian Szegedy(3), Scott

Results on VOC2007 test

Method mAP FPS batch size # Boxes Input resolution

Faster R-CNN (VGG16) 73.2 7 1 ⇠ 6000 ⇠ 1000⇥ 600

Fast YOLO 52.7 155 1 98 448⇥ 448YOLO (VGG16) 66.4 21 1 98 448⇥ 448

SSD300 74.3 46 1 8732 300⇥ 300SSD512 76.8 19 1 24564 512⇥ 512SSD300 74.3 59 8 8732 300⇥ 300SSD512 76.8 22 8 24564 512⇥ 512

Page 79: SSD: Single Shot MultiBox Detectorwliu/papers/ssd_eccv2016_slide.pdfSSD: Single Shot MultiBox Detector Wei Liu(1), Dragomir Anguelov(2), Dumitru Erhan(3), Christian Szegedy(3), Scott

Results on VOC2007 test

Method mAP FPS batch size # Boxes Input resolution

Faster R-CNN (VGG16) 73.2 7 1 ⇠ 6000 ⇠ 1000⇥ 600

Fast YOLO 52.7 155 1 98 448⇥ 448YOLO (VGG16) 66.4 21 1 98 448⇥ 448

SSD300 74.3 46 1 8732 300⇥ 300SSD512 76.8 19 1 24564 512⇥ 512SSD300 74.3 59 8 8732 300⇥ 300SSD512 76.8 22 8 24564 512⇥ 512

6.6x

Page 80: SSD: Single Shot MultiBox Detectorwliu/papers/ssd_eccv2016_slide.pdfSSD: Single Shot MultiBox Detector Wei Liu(1), Dragomir Anguelov(2), Dumitru Erhan(3), Christian Szegedy(3), Scott

Results on VOC2007 test

Method mAP FPS batch size # Boxes Input resolution

Faster R-CNN (VGG16) 73.2 7 1 ⇠ 6000 ⇠ 1000⇥ 600

Fast YOLO 52.7 155 1 98 448⇥ 448YOLO (VGG16) 66.4 21 1 98 448⇥ 448

SSD300 74.3 46 1 8732 300⇥ 300SSD512 76.8 19 1 24564 512⇥ 512SSD300 74.3 59 8 8732 300⇥ 300SSD512 76.8 22 8 24564 512⇥ 512

Page 81: SSD: Single Shot MultiBox Detectorwliu/papers/ssd_eccv2016_slide.pdfSSD: Single Shot MultiBox Detector Wei Liu(1), Dragomir Anguelov(2), Dumitru Erhan(3), Christian Szegedy(3), Scott

Results on VOC2007 test

Method mAP FPS batch size # Boxes Input resolution

Faster R-CNN (VGG16) 73.2 7 1 ⇠ 6000 ⇠ 1000⇥ 600

Fast YOLO 52.7 155 1 98 448⇥ 448YOLO (VGG16) 66.4 21 1 98 448⇥ 448

SSD300 74.3 46 1 8732 300⇥ 300SSD512 76.8 19 1 24564 512⇥ 512SSD300 74.3 59 8 8732 300⇥ 300SSD512 76.8 22 8 24564 512⇥ 512

10%


Page 85

Results on VOC2007 test

77.2

79.8


Page 90

Results on More Datasets

Method        VOC2007 test  VOC2012 test  MS COCO test-dev  ILSVRC2014 val2
Fast R-CNN    70.0          68.4          19.7              N/A
Faster R-CNN  73.2          70.4          21.9              N/A
YOLO          63.4          57.9          N/A               N/A
SSD300        74.3          72.4          23.2              43.4
SSD512        76.8          74.9          26.8              46.4
SSD300*       77.2          75.8          25.1              N/A
SSD512*       79.8          78.5          28.8              N/A


Page 92

COCO Bounding Box precision

mAP @ IoU     0.5   0.75  0.5:0.95
Faster R-CNN  45.3  23.5  24.2
SSD512*       48.5  30.3  28.8
gain          +3.2  +6.8  +4.6
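The column headers are IoU (intersection over union) thresholds: a detection counts as correct only if its overlap with a ground-truth box reaches the threshold, so 0.75 demands tighter localization than 0.5. The following is a minimal sketch of that overlap test for axis-aligned boxes given as (xmin, ymin, xmax, ymax). The function names and the example boxes are hypothetical, and this is not the official COCO evaluation code.

```python
# Sketch: IoU of two axis-aligned boxes and a threshold test.
# Boxes are (xmin, ymin, xmax, ymax); names and values are illustrative only.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero if disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_match(pred_box, gt_box, threshold):
    return iou(pred_box, gt_box) >= threshold

gt_box = (0.0, 0.0, 10.0, 10.0)
pred_box = (2.0, 0.0, 12.0, 10.0)  # same size, shifted right by 2 units

overlap = iou(pred_box, gt_box)               # 80 / 120, about 0.667
match_at_050 = is_match(pred_box, gt_box, 0.5)   # True
match_at_075 = is_match(pred_box, gt_box, 0.75)  # False
```

A box shifted by a fifth of its width still counts at IoU 0.5 but not at 0.75, which is why a larger gain in the 0.75 column indicates better localization.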


Page 97

• Object detection + pose estimation

Figure 2. Two-stage vs. Proposed. (a) The two-stage approach separates the detection and pose estimation steps. After object detection, the detected objects are cropped and then processed by a separate network for pose estimation. This requires resampling the image at least three times: once for region proposals, once for detection, and once for pose estimation. (b) The proposed method, in contrast, requires no resampling of the image and instead relies on convolutions for detecting the object and its pose in a single forward pass. This offers a large speedup because the image is not resampled, and computation for detection and pose estimation is shared.

3. Model

For an input RGB image, a single evaluation of the model network is performed and produces scores for category, bounding box offset directions, and pose, for a constant number of boxes. These are filtered by non-max suppression to produce the final output. The network is a variant of the single shot detection (SSD) network from [10] with additional outputs for pose. Here we present the network's design choices, structure of the outputs, and training.
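The filtering step can be illustrated with a standard greedy non-max suppression: keep the highest-scoring box, drop any remaining box that overlaps a kept box too much, and repeat. The excerpt does not say which NMS variant or threshold is used, so the 0.45 threshold, the function names, and the example boxes below are assumptions made for illustration.

```python
# Sketch: greedy non-max suppression over (xmin, ymin, xmax, ymax) boxes.
# ASSUMPTION: threshold 0.45 and all names are illustrative, not from the excerpt.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.45):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with IoU 81/119 (about 0.68), so it is suppressed; the third box is disjoint and survives.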

An SSD-style detector [10] works by adding a sequence of feature maps of progressively decreasing spatial resolution to an image classification network such as VGG [17]. These feature layers replace the last few layers of the image classification network, and 3x3 and 1x1 convolutional filters are used to transform one feature map to the next along with max-pooling. See Fig. 3 for a depiction of the model.

Predictions for a regularly spaced set of possible detections are computed by applying a collection of 3x3 filters to channels in one of the feature layers. Each 3x3 filter produces one value at each location, where the outputs are either classification scores, localization offsets, and, in our case, discretized pose predictions for the object (if any) in a box. See Fig. 1. Note that different sized detections are produced by different feature layers instead of taking the more traditional approach of resizing the input image or predicting different sized detections from a single feature layer.

We take one of two different approaches for pose predictions, either sharing outputs for pose across all the object categories (share) or having separate pose outputs for each object category (separate). One output is added for each of N possible poses. With N_c categories of objects, there are N_c × N pose outputs for the separate model and N pose outputs for the share model. While we do add a 3x3 filter for each of the pose outputs, this added cost is relatively small and the original SSD pipeline is quite fast, so the result is still faster than two stage approaches that rely on a (often slower) detector followed by a separate pose classification stage. See Fig. 2 (a).
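As a quick check of the output counts for the two variants, the sketch below computes the number of pose outputs for the share and separate models. The function name and the example values (8 pose bins, 12 categories) are hypothetical and chosen only to make the arithmetic concrete.

```python
# Sketch: number of pose outputs for the "share" and "separate" variants.
# ASSUMPTION: the function name and example sizes are illustrative only.

def pose_output_count(num_pose_bins, num_categories, mode):
    if mode == "share":
        # One set of pose outputs reused for every category: N
        return num_pose_bins
    if mode == "separate":
        # One set of pose outputs per category: N_c x N
        return num_categories * num_pose_bins
    raise ValueError("mode must be 'share' or 'separate'")

share_outputs = pose_output_count(8, 12, "share")
separate_outputs = pose_output_count(8, 12, "separate")
print(share_outputs, separate_outputs)  # 8 96
```

With these example sizes the separate model needs twelve times as many pose filters as the share model, which is the added cost the paragraph above describes as relatively small.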

3.1. Pose Estimation Formulation

There are a number of design choices for a joint detection and pose estimation method. This section details three particular design choices, and Sec. 4.1.1 shows justifications for them through experimental results.

One important choice is in how the pose estimation task is formulated. A possibility is to train for continuous pose estimation and formulate the problem as a regression. However, in this work we discretize the pose space into N disjoint bins and formulate the task as a classification problem. Doing so not only makes the task feasible (since both the quantity and consistency of pose labels is not high enough for continuous pose estimation), but also allows us to measure the confidence of our pose prediction. Furthermore, discrete pose estimation still presents a very challenging problem.
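Discretizing pose into bins and reading off a confidence might look like the sketch below. It assumes the pose is an azimuth angle in degrees, that the N bins are equal-width and start at 0 degrees, and that confidence comes from a softmax over per-bin scores; none of these details are given in the excerpt, and all names are hypothetical.

```python
import math

# Sketch: map an angle to one of N equal-width bins and score a prediction.
# ASSUMPTION: azimuth in degrees, bins starting at 0, softmax confidence.

def pose_bin(azimuth_deg, num_bins):
    """Return the index of the bin containing the angle (wraps around 360)."""
    width = 360.0 / num_bins
    # The final modulo guards against floating-point wraparound at 360.
    return int((azimuth_deg % 360.0) // width) % num_bins

def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_pose(logits):
    """Return (most likely bin, its probability) from per-bin scores."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

label = pose_bin(100.0, 8)                      # bins are 45 degrees wide
best_bin, confidence = predict_pose([0.0, 2.0, 0.0, 0.0])
```

Here `label` is 2 because 100 degrees falls in the 90 to 135 degree bin, and `predict_pose` returns bin 1 together with a probability that can serve as the confidence measure mentioned above.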

Another design choice is whether to predict poses separately for the N_c object classes or to use the same weights to predict poses for all classes. Sec. 4.1.1 assesses these options.

The final design choice is the resolution of the input image. Specifically, we consider two resolutions for input:

[Poirson et al, coming out at 3DV, 2016]

Future Work

• Single shot 3D bounding box detection

• Joint object detection + tracking model

Page 98

Check out the code/models

https://github.com/weiliu89/caffe/tree/ssd

Page 99

Thank you! Come by our poster O-1A-02