cs231n 2017 lecture11 detection and segmentation
TRANSCRIPT
![Page 1: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/1.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20171
Lecture 11:Detection and Segmentation
![Page 2: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/2.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20172
Administrative
Midterms being gradedPlease don’t discuss midterms until next week - some students not yet taken
A2 being graded
Project milestones due Tuesday 5/16
![Page 3: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/3.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
HyperQuest
3
![Page 4: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/4.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20174
HyperQuest
![Page 5: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/5.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20175
HyperQuest
![Page 6: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/6.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20176
HyperQuest
![Page 7: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/7.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20177
![Page 8: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/8.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20178
![Page 9: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/9.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20179
![Page 10: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/10.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201710
HyperQuest
Will post more details on Piazza this afternoon
![Page 11: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/11.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201711
Last Time: Recurrent Networks
![Page 12: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/12.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201712
Last Time: Recurrent Networks
![Page 13: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/13.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201713
Figure from Karpathy et a, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015; figure copyright IEEE, 2015.Reproduced for educational purposes.
Last Time: Recurrent Networks
A cat sitting on a suitcase on the floor
A cat is sitting on a tree branch
Two people walking on the beach with surfboards
A tennis player in action on the court
A woman is holding a cat in her hand
A person holding a computer mouse on a desk
![Page 14: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/14.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201714
Last Time: Recurrent NetworksVanilla RNNSimple RNNElman RNN
Elman, “Finding Structure in Time”, Cognitive Science, 1990.Hochreiter and Schmidhuber, “Long Short-Term Memory”, Neural computation, 1997
Long Short Term Memory(LSTM)
![Page 15: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/15.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201715
Today: Segmentation, Localization, Detection
![Page 16: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/16.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201716
Class ScoresCat: 0.9Dog: 0.05Car: 0.01...
So far: Image Classification
This image is CC0 public domain Vector:4096
Fully-Connected:4096 to 1000
![Page 17: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/17.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201717
Other Computer Vision TasksClassification + Localization
SemanticSegmentation
Object Detection
Instance Segmentation
CATGRASS, CAT, TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple ObjectNo objects, just pixels This image is CC0 public domain
![Page 18: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/18.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201718
Semantic Segmentation
CATGRASS, CAT, TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple ObjectNo objects, just pixels This image is CC0 public domain
![Page 19: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/19.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201719
Semantic Segmentation
Cow
Grass
SkyTrees
Label each pixel in the image with a category label
Don’t differentiate instances, only care about pixels
This image is CC0 public domain
Grass
Cat
Sky Trees
![Page 20: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/20.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201720
Semantic Segmentation Idea: Sliding Window
Full image
Extract patchClassify center pixel with CNN
Cow
Cow
Grass
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
![Page 21: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/21.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201721
Semantic Segmentation Idea: Sliding Window
Full image
Extract patchClassify center pixel with CNN
Cow
Cow
GrassProblem: Very inefficient! Not reusing shared features between overlapping patches Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
![Page 22: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/22.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201722
Semantic Segmentation Idea: Fully Convolutional
Input:3 x H x W
Convolutions:D x H x W
Conv Conv Conv Conv
Scores:C x H x W
argmax
Predictions:H x W
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
![Page 23: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/23.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201723
Semantic Segmentation Idea: Fully Convolutional
Input:3 x H x W
Convolutions:D x H x W
Conv Conv Conv Conv
Scores:C x H x W
argmax
Predictions:H x W
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
Problem: convolutions at original image resolution will be very expensive ...
![Page 24: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/24.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201724
Semantic Segmentation Idea: Fully Convolutional
Input:3 x H x W Predictions:
H x W
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
High-res:D1 x H/2 x W/2
High-res:D1 x H/2 x W/2
Med-res:D2 x H/4 x W/4
Med-res:D2 x H/4 x W/4
Low-res:D3 x H/4 x W/4
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
![Page 25: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/25.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201725
Semantic Segmentation Idea: Fully Convolutional
Input:3 x H x W Predictions:
H x W
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
High-res:D1 x H/2 x W/2
High-res:D1 x H/2 x W/2
Med-res:D2 x H/4 x W/4
Med-res:D2 x H/4 x W/4
Low-res:D3 x H/4 x W/4
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Downsampling:Pooling, strided convolution
Upsampling:???
![Page 26: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/26.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201726
In-Network upsampling: “Unpooling”
1 2
3 4
Input: 2 x 2 Output: 4 x 4
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4
Nearest Neighbor
1 2
3 4
Input: 2 x 2 Output: 4 x 4
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0
“Bed of Nails”
![Page 27: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/27.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201727
In-Network upsampling: “Max Unpooling”
Input: 4 x 4
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8
1 2
3 4
Input: 2 x 2 Output: 4 x 4
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4
Max UnpoolingUse positions from pooling layer
5 6
7 8
Max PoolingRemember which element was max!
… Rest of the network
Output: 2 x 2
Corresponding pairs of downsampling and upsampling layers
![Page 28: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/28.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201728
Learnable Upsampling: Transpose Convolution
Recall:Typical 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4 Output: 4 x 4
![Page 29: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/29.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201729
Learnable Upsampling: Transpose Convolution
Recall: Normal 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4 Output: 4 x 4
Dot product between filter and input
![Page 30: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/30.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201730
Learnable Upsampling: Transpose Convolution
Input: 4 x 4 Output: 4 x 4
Dot product between filter and input
Recall: Normal 3 x 3 convolution, stride 1 pad 1
![Page 31: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/31.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201731
Input: 4 x 4 Output: 2 x 2
Learnable Upsampling: Transpose Convolution
Recall: Normal 3 x 3 convolution, stride 2 pad 1
![Page 32: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/32.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201732
Input: 4 x 4 Output: 2 x 2
Dot product between filter and input
Learnable Upsampling: Transpose Convolution
Recall: Normal 3 x 3 convolution, stride 2 pad 1
![Page 33: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/33.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201733
Learnable Upsampling: Transpose Convolution
Input: 4 x 4 Output: 2 x 2
Dot product between filter and input
Filter moves 2 pixels in the input for every one pixel in the output
Stride gives ratio between movement in input and output
Recall: Normal 3 x 3 convolution, stride 2 pad 1
![Page 34: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/34.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201734
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
Input: 2 x 2 Output: 4 x 4
![Page 35: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/35.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201735
Input: 2 x 2 Output: 4 x 4
Input gives weight for filter
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
![Page 36: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/36.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201736
Input: 2 x 2 Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input
![Page 37: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/37.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201737
Input: 2 x 2 Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input
![Page 38: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/38.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201738
Input: 2 x 2 Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input
Other names:-Deconvolution (bad)-Upconvolution-Fractionally strided convolution-Backward strided convolution
![Page 39: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/39.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201739
Transpose Convolution: 1D Example
a
b
x
y
z
ax
ay
az + bx
by
bz
Input FilterOutput
Output contains copies of the filter weighted by the input, summing at where at overlaps in the output
Need to crop one pixel from output to make output exactly 2x input
![Page 40: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/40.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201740
Convolution as Matrix Multiplication (1D Example)We can express convolution in terms of a matrix multiplication
Example: 1D conv, kernel size=3, stride=1, padding=1
![Page 41: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/41.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201741
Convolution as Matrix Multiplication (1D Example)We can express convolution in terms of a matrix multiplication
Example: 1D conv, kernel size=3, stride=1, padding=1
Convolution transpose multiplies by the transpose of the same matrix:
When stride=1, convolution transpose is just a regular convolution (with different padding rules)
![Page 42: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/42.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201742
Convolution as Matrix Multiplication (1D Example)We can express convolution in terms of a matrix multiplication
Example: 1D conv, kernel size=3, stride=2, padding=1
![Page 43: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/43.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201743
Convolution as Matrix Multiplication (1D Example)We can express convolution in terms of a matrix multiplication
Example: 1D conv, kernel size=3, stride=2, padding=1
Convolution transpose multiplies by the transpose of the same matrix:
When stride>1, convolution transpose is no longer a normal convolution!
![Page 44: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/44.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201744
Semantic Segmentation Idea: Fully Convolutional
Input:3 x H x W Predictions:
H x W
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
High-res:D1 x H/2 x W/2
High-res:D1 x H/2 x W/2
Med-res:D2 x H/4 x W/4
Med-res:D2 x H/4 x W/4
Low-res:D3 x H/4 x W/4
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Downsampling:Pooling, strided convolution
Upsampling:Unpooling or strided transpose convolution
![Page 45: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/45.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201745
Classification + Localization
CATGRASS, CAT, TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple ObjectNo objects, just pixels This image is CC0 public domain
![Page 46: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/46.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201746
Class ScoresCat: 0.9Dog: 0.05Car: 0.01...
Classification + Localization
This image is CC0 public domain Vector:4096
FullyConnected:4096 to 1000
Box Coordinates(x, y, w, h)
FullyConnected:4096 to 4
Treat localization as a regression problem!
![Page 47: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/47.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201747
Class ScoresCat: 0.9Dog: 0.05Car: 0.01...
Classification + Localization
Vector:4096
FullyConnected:4096 to 1000
Box Coordinates(x, y, w, h)
FullyConnected:4096 to 4
Softmax Loss
L2 Loss
Correct label:Cat
Correct box:(x’, y’, w’, h’)
This image is CC0 public domain
Treat localization as a regression problem!
![Page 48: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/48.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201748
Class ScoresCat: 0.9Dog: 0.05Car: 0.01...
Classification + Localization
Vector:4096
FullyConnected:4096 to 1000
Box Coordinates(x, y, w, h)
FullyConnected:4096 to 4
Softmax Loss
L2 Loss
Loss
Correct label:Cat
Correct box:(x’, y’, w’, h’)
+This image is CC0 public domain
Treat localization as a regression problem!
Multitask Loss
![Page 49: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/49.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201749
Class ScoresCat: 0.9Dog: 0.05Car: 0.01...
Classification + Localization
Vector:4096
FullyConnected:4096 to 1000
Box Coordinates(x, y, w, h)
FullyConnected:4096 to 4
Softmax Loss
L2 Loss
Loss
Correct label:Cat
Correct box:(x’, y’, w’, h’)
+This image is CC0 public domain Often pretrained on ImageNet
(Transfer learning)
Treat localization as a regression problem!
![Page 50: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/50.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201750
Aside: Human Pose Estimation
This image is licensed under CC-BY 2.0.
Represent pose as a set of 14 joint positions:
Left / right footLeft / right kneeLeft / right hipLeft / right shoulderLeft / right elbowLeft / right handNeck Head top
Johnson and Everingham, "Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation", BMVC 2010
![Page 51: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/51.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201751
Toshev and Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks”, CVPR 2014
Aside: Human Pose Estimation
Vector:4096
Left foot: (x, y)
Right foot: (x, y)
…
Head top: (x, y)
![Page 52: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/52.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201752
Toshev and Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks”, CVPR 2014
Aside: Human Pose Estimation
Vector:4096
Left foot: (x, y)
Right foot: (x, y)
…
Head top: (x, y)
L2 loss
Correct left foot: (x’, y’)
L2 loss
Correct head top: (x’, y’)
L2 loss
Loss+...
![Page 53: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/53.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201753
Object Detection
CATGRASS, CAT, TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple ObjectNo objects, just pixels This image is CC0 public domain
![Page 54: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/54.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201754
Object Detection: Impact of Deep Learning
Figure copyright Ross Girshick, 2015. Reproduced with permission.
![Page 55: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/55.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201755
Object Detection as Regression?
CAT: (x, y, w, h)
DOG: (x, y, w, h)DOG: (x, y, w, h)CAT: (x, y, w, h)
DUCK: (x, y, w, h)DUCK: (x, y, w, h)….
![Page 56: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/56.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201756
Object Detection as Regression?
CAT: (x, y, w, h)
DOG: (x, y, w, h)DOG: (x, y, w, h)CAT: (x, y, w, h)
DUCK: (x, y, w, h)DUCK: (x, y, w, h)….
4 numbers
16 numbers
Many numbers!
Each image needs a different number of outputs!
![Page 57: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/57.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201757
Object Detection as Classification: Sliding Window
Dog? NOCat? NOBackground? YES
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
![Page 58: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/58.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201758
Object Detection as Classification: Sliding Window
Dog? YESCat? NOBackground? NO
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
![Page 59: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/59.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201759
Object Detection as Classification: Sliding Window
Dog? YESCat? NOBackground? NO
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
![Page 60: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/60.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201760
Object Detection as Classification: Sliding Window
Dog? NOCat? YESBackground? NO
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
![Page 61: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/61.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201761
Object Detection as Classification: Sliding Window
Dog? NOCat? YESBackground? NO
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
![Page 62: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/62.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201762
Region Proposals● Find “blobby” image regions that are likely to contain objects● Relatively fast to run; e.g. Selective Search gives 1000 region
proposals in a few seconds on CPU
Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
![Page 63: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/63.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201763
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 64: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/64.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201764
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 65: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/65.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201765
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 66: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/66.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201766
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 67: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/67.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201767
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 68: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/68.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201768
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 69: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/69.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201769
R-CNN: Problems
• Ad hoc training objectives• Fine-tune network with softmax classifier (log loss)• Train post-hoc linear SVMs (hinge loss)• Train post-hoc bounding-box regressions (least squares)
• Training is slow (84h), takes a lot of disk space
• Inference (detection) is slow• 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]• Fixed by SPP-net [He et al. ECCV14]
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Slide copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 70: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/70.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201770
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 71: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/71.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201771
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 72: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/72.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201772
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 73: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/73.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201773
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 74: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/74.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201774
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 75: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/75.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201775
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 76: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/76.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201776
Fast R-CNN(Training)
Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 77: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/77.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201777
Fast R-CNN(Training)
Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 78: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/78.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201778
Faster R-CNN: RoI Pooling
Hi-res input image:3 x 640 x 480
with region proposal
Hi-res conv features:512 x 20 x 15;
Projected region proposal is e.g.
512 x 18 x 8(varies per proposal)
Fully-connected layers
Divide projected proposal into 7x7 grid, max-pool within each cell
RoI conv features:512 x 7 x 7
for region proposal
Fully-connected layers expect low-res conv features:
512 x 7 x 7
CNN
Project proposal onto features
Girshick, “Fast R-CNN”, ICCV 2015.
![Page 79: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/79.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201779
R-CNN vs SPP vs Fast R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014Girshick, “Fast R-CNN”, ICCV 2015
![Page 80: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/80.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201780
R-CNN vs SPP vs Fast R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014Girshick, “Fast R-CNN”, ICCV 2015
Problem:Runtime dominated by region proposals!
![Page 81: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/81.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201781
Faster R-CNN: Make CNN do proposals!
Insert Region Proposal Network (RPN) to predict proposals from features
Jointly train with 4 losses:1. RPN classify object / not object2. RPN regress box coordinates3. Final classification score (object
classes)4. Final box coordinates
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015Figure copyright 2015, Ross Girshick; reproduced with permission
![Page 82: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/82.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201782
Faster R-CNN: Make CNN do proposals!
![Page 83: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/83.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201783
Detection without Proposals: YOLO / SSD
Divide image into grid7 x 7
Image a set of base boxes centered at each grid cell
Here B = 3
Input image3 x H x W
Within each grid cell:- Regress from each of the B
base boxes to a final box with 5 numbers:(dx, dy, dh, dw, confidence)
- Predict scores for each of C classes (including background as a class)
Output:7 x 7 x (5 * B + C)
Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
![Page 84: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/84.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201784
Detection without Proposals: YOLO / SSD
Divide image into grid7 x 7
Image a set of base boxes centered at each grid cell
Here B = 3
Input image3 x H x W
Within each grid cell:- Regress from each of the B
base boxes to a final box with 5 numbers:(dx, dy, dh, dw, confidence)
- Predict scores for each of C classes (including background as a class)
Output:7 x 7 x (5 * B + C)
Go from input image to tensor of scores with one big convolutional network!
Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
![Page 85: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/85.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201785
Object Detection: Lots of variables ...
Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
Base NetworkVGG16ResNet-101Inception V2Inception V3Inception ResNetMobileNet
R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017
Object Detection architectureFaster R-CNNR-FCNSSD
Image Size# Region Proposals…
TakeawaysFaster R-CNN is slower but more accurate
SSD is much faster but not as accurate
![Page 86: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/86.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201786
Aside: Object Detection + Captioning= Dense Captioning
Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016Figure copyright IEEE, 2016. Reproduced for educational purposes.
![Page 87: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/87.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201787
Aside: Object Detection + Captioning= Dense Captioning
Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016Figure copyright IEEE, 2016. Reproduced for educational purposes.
![Page 88: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/88.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201788
![Page 89: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/89.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201789
Instance Segmentation
CATGRASS, CAT, TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple ObjectNo objects, just pixels This image is CC0 public domain
![Page 90: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/90.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201790
Mask R-CNN
He et al, “Mask R-CNN”, arXiv 2017
RoI Align Conv
Classification Scores: C Box coordinates (per class): 4 * C
CNN Conv
Predict a mask for each of C classes
C x 14 x 14
256 x 14 x 14 256 x 14 x 14
![Page 91: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/91.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201791
Mask R-CNN: Very Good Results!
He et al, “Mask R-CNN”, arXiv 2017Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.
![Page 92: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/92.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201792
Mask R-CNNAlso does pose
He et al, “Mask R-CNN”, arXiv 2017
RoI Align Conv
Classification Scores: C Box coordinates (per class): 4 * CJoint coordinates
CNN Conv
Predict a mask for each of C classes
C x 14 x 14
256 x 14 x 14 256 x 14 x 14
![Page 93: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/93.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201793
Mask R-CNNAlso does pose
He et al, “Mask R-CNN”, arXiv 2017Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.
![Page 94: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/94.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201794
Recap:Classification + Localization
SemanticSegmentation
Object Detection
Instance Segmentation
CATGRASS, CAT, TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple ObjectNo objects, just pixels This image is CC0 public domain
![Page 95: Cs231n 2017 lecture11 Detection and Segmentation](https://reader031.vdocuments.site/reader031/viewer/2022020314/5a647b307f8b9a3b568b4937/html5/thumbnails/95.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201795
Next time:Visualizing CNN featuresDeepDream + Style Transfer