TRANSCRIPT
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES CONVNETS FOR OBJECT DETECTION, SEGMENTATION AND STRUCTURED OUTPUTS - 1
Lecture 6: Convnets for object detection and segmentation · Deep Learning @ UvA
o What do convolutions look like?
o Build on the visual intuition behind Convnets
o Deep Learning Feature maps
o Transfer Learning
Previous Lecture
o What are object detection, segmentation and structured outputs?
o How to use a convnet to localize objects?
o What is the relation between convnets and Conditional Random Fields (CRFs)?
Lecture Overview
Object localization
o The task of discovering the location of one or more objects in an image
o Images
◦ Detect the spatial object location
o Videos
◦ Detect the spatial object location
◦ Optionally, detect the temporal object location
o Generic
◦ Define the same model for all types of categories
o Specific
◦ Define specialized models for particular categories (“face”, “pedestrian”)
Object localization
o Robotics
o Self-driving cars
o Better classification
◦ Isolates the object signal from the background signal
o Huge pictures
◦ E.g. in astronomy, pictures have ultra-high resolution
◦ Detecting a new star or galaxy goes beyond image classification
Why is localization important?
o Box level, aka object detection
◦ Predict a bounding box surrounding the objects
o Pixel level, aka semantic segmentation
◦ Predict the category label for each pixel
o Pixel and instance level
◦ Predict a free-form delineation around object instances and predict their category
◦ Separate between instances of the same category (much more difficult!)
Localization granularity
[Figure: Image classification · Object detection · Pixel classification · Pixel and instance classification]
o Supervised (category specific)
◦ Examples for “bicycle”, “dog”, “person”, “face”
o Unsupervised (category agnostic)
◦ aka bounding box/segment proposal algorithms
◦ Methods typically return hundreds of possible object locations
Localization supervision
o Training
◦ Crop objects from the images
◦ Train a classifier (SVM) for each of the categories
o Inference
◦ Detect possible, category-agnostic object locations
◦ Crop all of them from the image
◦ Assign a per-category score to each of them
◦ Keep the ones with a score above a threshold
◦ Success: > 50% overlap with ground truth
Agnostic to category-specific detection
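The "> 50% overlap with ground truth" success criterion is the standard intersection-over-union (IoU) test. A minimal sketch in Python (function names are ours, not from the lecture):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)        # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def is_success(pred_box, gt_box, threshold=0.5):
    # A detection counts as correct if it overlaps the ground truth by > 50%
    return iou(pred_box, gt_box) > threshold
```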
o Training
◦ Crop objects from the images
◦ Train a classifier for each of the categories
o Inference
◦ Parse the image at several locations and scales
◦ For each location, compute the category score
◦ Keep boxes with a high enough score
◦ More thorough search
◦ Potentially more expensive (or maybe not)
Sliding window and direct classification
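The sliding-window inference above can be sketched as a plain loop over locations; `score_fn` is a hypothetical placeholder for any trained per-category classifier, and the window size, stride and threshold are illustrative values:

```python
import numpy as np

def sliding_window_detect(image, score_fn, window=(64, 64), stride=16, threshold=0.5):
    """Scan the image at one scale; rescale `image` and rerun for a multi-scale search."""
    h, w = image.shape[:2]
    wh, ww = window
    detections = []
    for y in range(0, h - wh + 1, stride):
        for x in range(0, w - ww + 1, stride):
            score = score_fn(image[y:y + wh, x:x + ww])
            if score > threshold:          # keep boxes with a high enough score
                detections.append((x, y, x + ww, y + wh, score))
    return detections
```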
o Training
◦ Crop objects from the images
◦ Train a classifier for each of the categories
o Inference
◦ Regress bounding box coordinates
◦ Cheap
◦ Not accurate enough
Sliding window and direct classification
Convnets and object localization
o Simple pipeline!
o Use the convnet as a feature extractor
o Given a bounding box, compute a “deep” feature
o Run an SVM (one per category) on the deep features
o Warp cropped images to make them “square”
o Add some background context
o Very accurate!
o Unfortunately slow
◦ More than 60 seconds per image
R-CNN [Girshick2013]
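The crop-warp step of the pipeline above can be sketched as follows. The function name, the 224-pixel input size and the 16-pixel context margin are illustrative assumptions, and nearest-neighbour resampling stands in for proper image warping:

```python
import numpy as np

def warp_with_context(image, box, out_size=224, context=16):
    """Crop a proposal box, add some background context, and warp it to a
    fixed square input, as the R-CNN pipeline does before the convnet.
    Nearest-neighbour resampling keeps this sketch dependency-free."""
    x1, y1, x2, y2 = box
    h, w = image.shape[:2]
    # Grow the box to include context pixels, clipped to the image bounds
    x1, y1 = max(0, x1 - context), max(0, y1 - context)
    x2, y2 = min(w, x2 + context), min(h, y2 + context)
    crop = image[y1:y2, x1:x2]
    # Map output pixel coordinates back to crop coordinates (nearest neighbour)
    ys = (np.arange(out_size) * crop.shape[0] / out_size).astype(int)
    xs = (np.arange(out_size) * crop.shape[1] / out_size).astype(int)
    return crop[np.ix_(ys, xs)]   # fixed-size "square" input for the convnet
```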
R-CNN results
R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, arXiv 2013
+25% improvement!
R-CNN examples
o Observation: a convnet is convolutional!
o Convolutions are location specific
◦ The same filter is applied at different locations
o R-CNN assumes that the image locations for different bounding boxes are different
o However, because of the convolutions, the convolved image locations are shared between box proposals
o Can we reuse the convolutional feature maps?
Fast R-CNN
o Do not compute the forward propagation from scratch for each box
o Compute the feature maps once
o Reuse the convolutions for different boxes
o Region-of-Interest pooling
◦ Don’t define the stride absolutely (e.g. sample every 4 pixels)
◦ Define the stride relatively (box width divided by a predefined number of “poolings” T)
◦ Fixed-length vector
o End-to-end training!
o More accurate
o Much faster
◦ Less than a second per image
o Still, external box proposals needed
Fast-RCNN
T=5
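Region-of-Interest pooling as described above can be sketched as max pooling over a T × T grid of relatively sized cells (a simplified sketch, not the exact Fast R-CNN implementation):

```python
import numpy as np

def roi_pool(feature_map, box, T=5):
    """Region-of-Interest max pooling: split the box into a T x T grid whose
    cell size depends on the box (stride = box extent / T), max-pool each cell,
    and return a fixed-length vector regardless of the box size."""
    x1, y1, x2, y2 = box
    ys = np.linspace(y1, y2, T + 1).astype(int)   # relative stride: box height / T
    xs = np.linspace(x1, x2, T + 1).astype(int)   # relative stride: box width / T
    out = np.empty((T, T))
    for i in range(T):
        for j in range(T):
            # max(...) guards against empty cells when the box is smaller than T
            cell = feature_map[ys[i]:max(ys[i] + 1, ys[i + 1]),
                               xs[j]:max(xs[j] + 1, xs[j + 1])]
            out[i, j] = cell.max()
    return out.ravel()   # fixed length T*T, whatever the box size
```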
Fast-RCNN results
R. Girshick, Fast R-CNN, CVPR 2015
+4% improvement!
o Not a single-number output (“car” vs. “not car”)
◦ Pixel-wise output (as many outputs as inputs)
o A fully connected layer can be viewed as a convolution
◦ The size of the convolution filter equals the size of the whole layer
o Upsampling
◦ Deconvolution filters learnt with backpropagation
◦ Deconvolution filters return feature maps larger than the input
◦ Outputs get progressively bigger until they reach the input size
Fully Convolutional Networks [LongCVPR2014]
“Slices” of 1-of-1000 classification layer
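The fully-connected-as-convolution view can be checked numerically: applying a fully connected layer to a K × K × C feature map is the same linear map as one K × K filter per output unit, evaluated at the single valid position (here "convolution" means cross-correlation, as is usual for convnets):

```python
import numpy as np

# A fully connected layer over a K x K x C feature map equals a K x K
# convolution with C input channels: each output unit becomes one filter
# whose size is that of the whole layer.
K, C, n_out = 3, 4, 10
rng = np.random.default_rng(0)
feat = rng.standard_normal((K, K, C))
W = rng.standard_normal((n_out, K * K * C))      # fully connected weights

fc_out = W @ feat.ravel()                        # fully connected view
filters = W.reshape(n_out, K, K, C)              # same weights, seen as filters
conv_out = np.array([(f * feat).sum() for f in filters])  # "valid" conv, one spot

assert np.allclose(fc_out, conv_out)             # identical linear map
```

Applied to a larger input, the same filters would slide over all valid positions and produce a spatial map of class scores instead of a single vector, which is exactly the fully convolutional trick.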
Fully Convolutional Networks architecture
o Connect intermediate layers to output
Fully Convolutional Networks results
o Identity verification
◦ Find whether two pictures belong to the same person
o Essentially similar to metric learning
◦ Find a projection matrix (metric) with which distances between faces of the same person are smaller than distances between faces of different persons
o Triplet “Euclidean” loss

E(W') = \sum_{(a,p,n) \in T} \max\left(0,\ \alpha - \|x_a - x_n\|_2^2 + \|x_a - x_p\|_2^2\right), \qquad x_i = \frac{W' \varphi(\ell_i)}{\|\varphi(\ell_i)\|_2}

o Start from a very deep architecture
o More similar to classification, but face detection is already very accurate
Deep Face Recognition
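The triplet loss above can be sketched for a single (anchor, positive, negative) triple; the margin value alpha = 0.2 is an illustrative assumption:

```python
import numpy as np

def triplet_loss(x_a, x_p, x_n, alpha=0.2):
    """Triplet "Euclidean" loss for one (anchor, positive, negative) triple:
    pushes the anchor closer to the positive than to the negative by a margin
    alpha. Embeddings are L2-normalised first, mirroring
    x_i = W' phi(l_i) / ||phi(l_i)||_2 from the slide."""
    x_a, x_p, x_n = (v / np.linalg.norm(v) for v in (x_a, x_p, x_n))
    d_pos = np.sum((x_a - x_p) ** 2)   # squared distance to the positive
    d_neg = np.sum((x_a - x_n) ** 2)   # squared distance to the negative
    return max(0.0, alpha - d_neg + d_pos)
```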
Deep Face Recognition results
Image Segmentation using Conditional Random Fields
Deep Learning @ UvA · Patrick Putzky
[Shotton et al., 2009]
Outlook
[Krähenbühl & Koltun, 2011]
Conditional Random Fields

General form (y: labels, x: image):

p(y|x) = \frac{1}{Z(x)} \exp(-E(y|x)), \qquad E(y|x) = \sum_{c \in C_G} \psi_c(y|x)

Unary form:

E(y|x) = \sum_i \psi_u(y_i|x)

Pairwise CRF:

E(y|x) = \sum_i \psi_u(y_i|x) + \sum_{i<j} \psi_p(y_i, y_j|x)
Unary models

E(y|x) = \sum_i \psi_u(y_i|x)

- Per-pixel predictions
- No interactions between neighbouring class predictions
Example: Fully Convolutional Networks
[Long et al., 2015]
Unary models
- Per-pixel predictions
- No interactions between neighbouring class predictions
- No object coherency
Fully Convolutional Networks: fail case
[Zheng et al., 2015]
Adjacency CRFs

E(y|x) = \underbrace{\sum_i \psi_u(y_i|x)}_{\text{unary potential}} + \underbrace{\sum_{j \in N(i)} \psi_p(y_i, y_j|x)}_{\text{pairwise potential}}

- Define a neighbourhood graph
- Arises naturally in images
- Efficient inference
- Only local interactions
- Graph is not trainable
Adjacency CRFs
- No distinction between classes

Popular: Potts model

\psi_p(y_i, y_j|x) = \mathbb{1}[y_i \neq y_j]\left(w_1 \exp\left(-\beta \|x_i - x_j\|^2\right) + w_2\right)

- Shrinking bias
- Limited edge-awareness
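A minimal sketch of the contrast-sensitive Potts pairwise potential described above; the weights w1, w2 and the bandwidth beta are illustrative values:

```python
import numpy as np

def potts_potential(y_i, y_j, x_i, x_j, w1=1.0, w2=0.5, beta=0.1):
    """Contrast-sensitive Potts pairwise potential: zero when the labels
    agree; when they disagree, penalise strongly across similar pixels
    (small ||x_i - x_j||) and only weakly across strong image edges."""
    if y_i == y_j:
        return 0.0          # the indicator 1[y_i != y_j] is zero
    diff = np.asarray(x_i, float) - np.asarray(x_j, float)
    return w1 * np.exp(-beta * np.sum(diff ** 2)) + w2
```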
Fully connected CRFs

E(y|x) = \underbrace{\sum_i \psi_u(y_i|x)}_{\text{unary potential}} + \underbrace{\sum_{j \neq i} \psi_p(y_i, y_j|x)}_{\text{pairwise potential}}

- Edges between all pairs of nodes
- Long-range interactions
- Strength of edges defines connectivity
- No more shrinking bias
- N² edges -> computationally expensive
Gaussian Edge Potentials

k_m(f_i, f_j) = \exp\left(-\tfrac{1}{2}(f_i - f_j)^T Q_m (f_i - f_j)\right)

Potts model:

\psi_p(y_i, y_j|x) = \mathbb{1}[y_i \neq y_j]\left(w_1 \exp\left(-\beta \|x_i - x_j\|^2\right) + w_2\right)

Generalisation, with Gaussian kernels:

\psi_p(y_i, y_j|x) = \mu(y_i, y_j) \sum_{m=1}^{M} w_m k_m(f_i, f_j)

- Potts: \mu(y_i, y_j) = \mathbb{1}[y_i \neq y_j]
- The label compatibility \mu(y_i, y_j) models interactions between classes
- w_m, Q_m and \mu are trainable parameters
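The generalised pairwise potential with Gaussian kernels can be sketched directly; here `mu` is the label-compatibility matrix, and the example weights and precision matrices are illustrative:

```python
import numpy as np

def gaussian_kernel(f_i, f_j, Q):
    """k_m(f_i, f_j) = exp(-1/2 (f_i - f_j)^T Q_m (f_i - f_j))."""
    d = np.asarray(f_i, float) - np.asarray(f_j, float)
    return np.exp(-0.5 * d @ Q @ d)

def pairwise_potential(y_i, y_j, f_i, f_j, mu, weights, precisions):
    """psi_p(y_i, y_j | x) = mu(y_i, y_j) * sum_m w_m k_m(f_i, f_j).
    `mu` is the label-compatibility matrix (Potts: 1[y_i != y_j])."""
    k = sum(w * gaussian_kernel(f_i, f_j, Q) for w, Q in zip(weights, precisions))
    return mu[y_i, y_j] * k
```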
Edge potentials as convolutions

E_p(y|x) = \sum_i \sum_{j \neq i} \psi_p(y_i, y_j|x)
         = \sum_i \sum_{j \neq i} \mu(y_i, y_j) \sum_{m=1}^{M} w_m k_m(f_i, f_j)
         = \sum_{m=1}^{M} w_m \sum_i \sum_{j \neq i} \mu(y_i, y_j) k_m(f_i, f_j)
         = \sum_{m=1}^{M} w_m \sum_i \left[ \sum_{j=1}^{N} \mu(y_i, y_j) k_m(f_i, f_j) - \mu(y_i, y_i) k_m(f_i, f_i) \right]
         = \sum_{m=1}^{M} w_m \sum_i \left[ \sum_{l} \mu(y_i, y_{i-l}) k_m(f_i, f_{i-l}) - \mu(y_i, y_i) k_m(f_i, f_i) \right]

where the inner sum over offsets l runs over all shifts, i.e. it has the form of a convolution.
Edge potentials as convolutions
- If k_m is a symmetric kernel, the pairwise potential can be implemented as a convolution
- A Gaussian has most of its probability mass in a close region around its mean
- Approximate this using a truncated Gaussian \tilde{k}_m
[Wikipedia]
Permutohedral lattices for convolution in high-dimensional space
Problem:
- This is a convolution in a high-dimensional feature space
- f_i can be high-dimensional
- As a result, the feature space will be sparsely populated
- Naive convolution on a grid is O(2^d n)
- The permutohedral lattice allows convolutions in O(d^2 n), and O(d n) for separable filters
[Adams et al., 2010]
Efficient inference in dense CRFs

Mean-field approximation:

P(y|x) \approx Q(y|x) = \prod_i Q_i(y_i|x)

with

p(y|x) = \frac{1}{Z(x)} \exp\left(-\sum_i \psi_u(y_i|x) - \sum_{j \neq i} \psi_p(y_i, y_j|x)\right)

Mean-field updates:

Q_i(y_i = l) = \frac{1}{Z_i} \exp\left(-\psi_u(y_i|x) - \sum_{l' \in L} \mu(l, l') \sum_{m=1}^{M} w_m \sum_{j \neq i} k_m(f_i, f_j) Q_j(l')\right)

[Krähenbühl & Koltun, 2011]
- The sum over j ≠ i is the hard-to-evaluate part
- Naively implemented, an update over all Q_i costs O(N²)
- With the approximations from before it is O(N)
Hard to evaluate
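One mean-field update can be sketched with dense matrices for a single kernel (a didactic O(N²) version; the permutohedral-lattice filtering that makes it O(N) is not shown):

```python
import numpy as np

def mean_field_step(Q, unary, mu, w, K):
    """One mean-field update for a dense CRF with one Gaussian kernel.
    Q:     (N, L) current marginals          unary: (N, L) unary potentials
    mu:    (L, L) label compatibility        K:     (N, N) kernel k(f_i, f_j)
    K must have a zero diagonal (the j != i constraint)."""
    message = K @ Q                   # sum_{j != i} k(f_i, f_j) Q_j(l')  per l'
    pairwise = w * message @ mu.T     # sum_{l'} mu(l, l') message(i, l')
    logits = -unary - pairwise
    # Subtract the row max for numerical stability, then normalise by Z_i
    Q_new = np.exp(logits - logits.max(axis=1, keepdims=True))
    return Q_new / Q_new.sum(axis=1, keepdims=True)
```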
10 iterations in about 0.2 seconds
Efficient inference in dense CRFs
[Krähenbühl & Koltun, 2011]
What are these features?

k_m(f_i, f_j) = \exp\left(-\tfrac{1}{2}(f_i - f_j)^T Q_m (f_i - f_j)\right), \qquad f_i = \begin{pmatrix} p_i \\ x_i \end{pmatrix}

p_i: position vector, x_i: vector of pixel values
- This is known as bilateral filtering
- It is an edge-preserving approach to filtering
[Adams et al., 2010]
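The feature vectors f_i = (p_i, x_i) can be built by stacking each pixel's position and value; the bandwidth parameters theta_p and theta_x are hypothetical normalisers (a Gaussian kernel on these features then performs bilateral, edge-preserving filtering):

```python
import numpy as np

def bilateral_features(image, theta_p=1.0, theta_x=1.0):
    """Build one feature row f_i = (p_i / theta_p, x_i / theta_x) per pixel:
    the position part makes nearby pixels similar, the value part makes the
    resulting Gaussian kernel edge-preserving (bilateral)."""
    h, w = image.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    pos = np.stack([yy, xx], axis=-1).reshape(-1, 2) / theta_p   # p_i
    vals = image.reshape(h * w, -1) / theta_x                    # x_i
    return np.concatenate([pos, vals], axis=1)   # one feature row per pixel
```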
CRFs as RNNs
- Mean-field iterations as steps in an RNN
- End-to-end training
Demo: http://www.robots.ox.ac.uk/~szheng/crfasrnndemo
[Zheng et al., 2015]
Learning the filters
[Kiefel et al., 2015]
Summary
- We can improve the performance of segmentation algorithms by adding pairwise interactions between pixel predictions
- Approximate inference in dense CRFs can be done efficiently
- End-to-end learning can improve performance
- Pairwise kernels can be learned and lead to potential improvements
- These methods could have an impact in other problem domains
References
(1) Adams A, Baek J, Davis MA. Fast high-dimensional filtering using the permutohedral lattice. Comput Graph Forum. 2010;29(2):753-762.
(2) Krähenbühl P, Koltun V. Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems; 2011:109-117. http://papers.nips.cc/paper/4296-efficient-inference-in-fully-connected-crfs-with-gaussian-edge-potentials
(3) Kiefel M, Gehler PV. Sparse convolutional networks using the permutohedral lattice. 2015. http://arxiv.org/abs/1503.04949
(4) Shotton J, Winn J, Rother C, Criminisi A. TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context. Int J Comput Vis. 2009. http://research.microsoft.com/apps/pubs/default.aspx?id=117885
(5) Zheng S, Jayasumana S, Romera-Paredes B, et al. Conditional random fields as recurrent neural networks. In: International Conference on Computer Vision (ICCV); 2015. http://arxiv.org/abs/1502.03240