14.01.14 | Seminar aus maschinellem Lernen

Learning Hierarchical Features for Scene Labeling

FB Informatik | Knowledge Engineering Group | Prof. Dr. Johannes Fürnkranz | Seminar Machine Learning

Author: Tanya Harizanova

Contents

➢ Introduction

➢ Multiscale Feature Extraction For Scene Parsing

➢ Scene Labeling Strategies

➢ Experiments

➢ Important insights on the experiments

➢ Conclusion

➢ Questions / Discussion


Introduction – Scene Parsing

➢ Scene parsing (full-scene labeling): labeling every pixel in an image with the category of the object it belongs to.


Introduction – Scene Parsing (2)

➢ Open questions in scene parsing:

➢ How can a good internal representation of the visual information be produced?

➢ How can contextual information be used to ensure the self-consistency of the interpretation?

➔ The paper presents a scene parsing system that relies on deep learning methods to address both questions.

➔ Main idea: use a convolutional network operating on a large input window to produce label hypotheses for each pixel location.


Introduction – Convolutional Network

Convolutional networks are trainable hierarchical architectures composed of multiple stages, each of which contains three layers: a filter bank module, a non-linearity module, and a spatial pooling module.

A typical convolutional network is composed of two or three such stages, followed by a classification module.
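The three layers of a single stage can be illustrated with a small sketch in plain Python. This is only an illustration under assumptions: the function names, the single 3x3 filter, the tanh non-linearity, and the 2x2 max pooling are choices made for this example and are not taken from the slides.

```python
import math

def conv2d_valid(image, kernel):
    """Filter bank module (one filter): valid 2D cross-correlation,
    the operation convolutional layers usually compute."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]

def tanh_map(fmap, bias=0.0):
    """Non-linearity module: pointwise tanh after adding a bias."""
    return [[math.tanh(v + bias) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Spatial pooling module: non-overlapping max pooling.
    Rows/columns that do not fill a complete window are dropped."""
    out_h, out_w = len(fmap) // size, len(fmap[0]) // size
    return [[max(fmap[y * size + i][x * size + j]
                 for i in range(size) for j in range(size))
             for x in range(out_w)]
            for y in range(out_h)]

def stage(image, kernel, bias=0.0):
    """One stage: filter bank -> non-linearity -> pooling."""
    return max_pool(tanh_map(conv2d_valid(image, kernel), bias))

# Hypothetical 6x6 input and a hand-written horizontal-difference filter.
image = [[float((x + y) % 3) for x in range(6)] for y in range(6)]
kernel = [[1.0, 0.0, -1.0]] * 3
features = stage(image, kernel)  # 6x6 -> 4x4 after filtering -> 2x2 after pooling
```

In a trained network the filter values would be learned rather than written by hand, and each stage would hold many filters instead of one.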


Introduction – Convolutional Network (2)

Problem

Labeling each pixel by looking only at a small region around it is difficult: the category of a pixel may depend on relatively short-range information, but it may also depend on long-range information.

Solution

Multi-scale convolutional networks can take a large input window into account while keeping the number of free parameters to a minimum.


Introduction – Scene Parsing Architecture

➢ The scene parsing architecture of this system relies on two main components:

1. Multi-scale convolutional representation

2. Graph-based classification

➔ Superpixels, conditional random field over superpixels, multilevel cut with class purity criterion


Multiscale feature extraction for scene parsing – Scale-invariant, scene-level feature extraction

➢ Good internal representations are hierarchical.

➢ Convolutional networks provide a simple framework to learn such hierarchies of features, composed of multiple stages.

➢ The feature extractor of this model is a three-stage convolutional network.

➢ The convolutional kernels are the parameters that are actually trained.


Multiscale feature extraction for scene parsing – Scale-invariant, scene-level feature extraction (2)

➢ Convention: a bank of images is represented as a 3D array.

➢ The maps of the multiscale pyramid are computed from the input image $I$ with a scaling/normalization function $g_s$:

$$X_s = g_s(I), \quad \forall s \in \{1, \dots, N\}$$

➢ For a network $f_s$ with $L$ layers and parameters $\theta_s$:

$$f_s(X_s; \theta_s) = W_L H_{L-1}$$

where the vector of hidden units at layer $l$ is (with $H_0 = X_s$)

$$H_l = \mathrm{pool}(\tanh(W_l H_{l-1} + b_l))$$

or, written for a single output feature map $p$:

$$H_{lp} = \mathrm{pool}\Big(\tanh\Big(b_{lp} + \sum_{q \in \mathrm{parents}(p)} w_{lpq} * H_{l-1,q}\Big)\Big)$$

➢ The outputs of the $N$ networks are upsampled and concatenated to produce $F$, where $u$ is an upsampling function:

$$F = [f_1, u(f_2), \dots, u(f_N)]$$
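The pyramid-then-concatenate structure of $F$ can be sketched as follows. Everything concrete here is an assumption made for illustration: `downsample` stands in for the scaling function $g_s$ (plain subsampling, no normalization), `toy_network` is a hypothetical placeholder for $f_s$ that keeps the spatial size of its input, `upsample` is a nearest-neighbor version of $u$, and the image size is assumed to be divisible by every scale factor.

```python
def downsample(image, factor):
    """Stand-in for g_s: keep every `factor`-th pixel in each direction."""
    return [row[::factor] for row in image[::factor]]

def upsample(fmap, factor):
    """Stand-in for u: nearest-neighbor upsampling by `factor`."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def toy_network(x):
    """Hypothetical placeholder for f_s: one feature per pixel (value doubled)."""
    return [[2.0 * v for v in row] for row in x]

def multiscale_features(image, factors=(1, 2, 4)):
    """Return F, where F[y][x] concatenates the features of all scales at (y, x)."""
    h, w = len(image), len(image[0])
    maps = [upsample(toy_network(downsample(image, f)), f) for f in factors]
    return [[[m[y][x] for m in maps] for x in range(w)] for y in range(h)]

image = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
F = multiscale_features(image)
# F[3][5] holds one feature per scale for the pixel in row 3, column 5.
```

Because the coarser maps are upsampled back to the input resolution, every pixel ends up with a feature vector that mixes fine local detail with wider context.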


Multiscale feature extraction for scene parsing – Learning discriminative scale-invariant features

➢ Multiclass cross-entropy loss function

➢ Normalized prediction vector $\hat{c}_i$

➢ Normalized predicted probability distribution over classes $\hat{c}_{i,a}$

➢ Computed using the softmax function:

$$\hat{c}_{i,a} = \frac{e^{W_a^T F_i}}{\sum_{b \in \mathrm{classes}} e^{W_b^T F_i}}$$

➢ $W$ is a temporary weight matrix only used to learn the features

➢ The cross entropy between the predicted class distribution $\hat{c}$ and the target class distribution $c$ penalizes their deviation and is measured by

$$L_{cat} = -\sum_{i \in \mathrm{pixels}} \sum_{a \in \mathrm{classes}} c_{i,a} \ln(\hat{c}_{i,a})$$
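The softmax prediction and the cross-entropy loss above can be written out in a few lines of plain Python. This is a minimal sketch: the weight values, the two-dimensional feature vectors, and the one-hot targets are made up for the example, and `W` is stored as one weight vector per class.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, F_i):
    """c_hat_i: softmax of the scores W_a^T F_i, one score per class a."""
    scores = [sum(w * f for w, f in zip(W_a, F_i)) for W_a in W]
    return softmax(scores)

def cross_entropy(targets, predictions):
    """L_cat = -sum_i sum_a c[i][a] * ln(c_hat[i][a]); terms with c[i][a] == 0 are skipped."""
    loss = 0.0
    for c_i, c_hat_i in zip(targets, predictions):
        loss -= sum(c * math.log(p) for c, p in zip(c_i, c_hat_i) if c > 0.0)
    return loss

# Hypothetical values: 3 classes, 2 pixels with 2-dimensional features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
features = [[2.0, 0.0], [0.0, 3.0]]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
predictions = [predict(W, F_i) for F_i in features]
loss = cross_entropy(targets, predictions)
```

With one-hot targets the loss reduces to the negative log-probability that each pixel assigns to its true class, so confident correct predictions drive it toward zero.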


Scene Labeling Strategies

● The simplest strategy for scene labeling is to use a linear classifier and assign each pixel the argmax of the prediction at its location.

● The resulting labeling l, although fairly accurate, is not visually satisfying, as it lacks spatial consistency and precise delineation of objects.
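The argmax strategy can be sketched as follows. The 2x3 grid of class probabilities is invented for the example, and `argmax_labels` is a hypothetical helper name.

```python
def argmax_labels(class_probs):
    """class_probs[y][x] is a list of per-class scores for pixel (y, x).
    Returns the index of the highest-scoring class for every pixel."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row]
            for row in class_probs]

# Hypothetical predictions for a 2x3 image with two classes.
class_probs = [
    [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]],
    [[0.7, 0.3], [0.9, 0.1], [0.6, 0.4]],
]
labels = argmax_labels(class_probs)
# labels == [[0, 1, 0], [0, 0, 0]]
```

The single pixel labeled 1 in an otherwise uniform region is the kind of isolated, spatially inconsistent decision that this strategy cannot correct, since every pixel is decided on its own.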


Scene Labeling Strategies - Superpixels

➢ Predicting the class of each pixel independently from its neighbors yields noisy predictions.

➢ Classify each location of the image densely and aggregate these predictions in each superpixel by computing the average class distribution within the superpixel.

➢ Superpixels do not involve a global understanding of the scene.
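The aggregation step for superpixels can be sketched as below. The superpixel segmentation is assumed to be given as a map of segment ids (how it is computed is outside this example), and the probabilities are invented.

```python
def superpixel_labels(class_probs, segments):
    """class_probs[y][x]: list of class probabilities for pixel (y, x).
    segments[y][x]: id of the superpixel containing pixel (y, x).
    Returns {superpixel id: argmax of the average class distribution}."""
    sums, counts = {}, {}
    for prob_row, seg_row in zip(class_probs, segments):
        for p, s in zip(prob_row, seg_row):
            acc = sums.setdefault(s, [0.0] * len(p))
            for a, v in enumerate(p):
                acc[a] += v
            counts[s] = counts.get(s, 0) + 1
    labels = {}
    for s, acc in sums.items():
        mean = [v / counts[s] for v in acc]
        labels[s] = max(range(len(mean)), key=mean.__getitem__)
    return labels

# Hypothetical 2x3 image, two classes, two superpixels (ids 0 and 1).
class_probs = [
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],
    [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]],
]
segments = [
    [0, 0, 1],
    [0, 0, 1],
]
labels = superpixel_labels(class_probs, segments)
# labels == {0: 0, 1: 1}
```

The pixel in row 0, column 1 would be labeled 1 on its own, but averaging over superpixel 0 assigns it class 0, which removes the isolated noisy decision.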


Scene Labeling Strategies - Conditional Random Fields

➢ A classical CRF model is constructed on the superpixels.

➢ Each pixel in the image is a vertex in a graph, edges are added between all neighboring nodes, and an energy function is defined on this graph.

➢ The CRF energy is minimized using alpha-expansions.
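The slides do not spell out the energy function, so the following sketch uses a generic form chosen purely for illustration: a unary term equal to the negative log-probability of the chosen label, plus a Potts pairwise term that charges a constant `gamma` for every pair of 4-connected neighbors with different labels. It only evaluates the energy of a given labeling; the alpha-expansion minimization itself is not implemented here.

```python
import math

def crf_energy(labels, class_probs, gamma=1.0):
    """Energy of a labeling on a 4-connected grid (assumed generic form):
    sum of -ln p(label) per pixel + gamma per neighboring pair with different labels."""
    h, w = len(labels), len(labels[0])
    energy = 0.0
    for y in range(h):
        for x in range(w):
            energy += -math.log(class_probs[y][x][labels[y][x]])
            if x + 1 < w and labels[y][x] != labels[y][x + 1]:
                energy += gamma  # horizontal neighbor disagrees
            if y + 1 < h and labels[y][x] != labels[y + 1][x]:
                energy += gamma  # vertical neighbor disagrees
    return energy

# Hypothetical 2x3 image: one pixel slightly prefers class 1, all others class 0.
class_probs = [
    [[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]],
    [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]],
]
noisy = [[0, 1, 0], [0, 0, 0]]   # per-pixel argmax labeling
smooth = [[0, 0, 0], [0, 0, 0]]  # spatially consistent labeling

energy_noisy = crf_energy(noisy, class_probs)
energy_smooth = crf_energy(smooth, class_probs)
```

With `gamma=1.0` the smooth labeling has the lower energy, so a minimizer would prefer it; with `gamma=0.0` the pairwise term vanishes and the per-pixel argmax labeling wins again.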


Scene Labeling Strategies - Parameter-free Multilevel Parsing

➢ Observation level problem

➢ Parameter-free multilevel parsing: a method to analyze a family of segmentations and automatically discover the best observation level for each pixel in the image


Scene Labeling Strategies - Parameter-free Multilevel Parsing (2)

Optimal purity cover: an optimization problem that searches for the most adapted neighborhood of a pixel

➢ For each pixel $i$, we wish to find the index $k^*(i)$ of the component $C_{k^*(i)}$ that best explains this pixel, that is, the component containing $i$ with the minimal cost $S_{k^*(i)}$:

$$k^*(i) = \arg\min_{k \,\mid\, i \in C_k} S_k$$
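The argmin over all components that contain a pixel can be sketched directly. The components are given here as plain sets of pixel indices and the costs as a list, both invented for the example; `optimal_cover` is a hypothetical helper name.

```python
def optimal_cover(components, costs, num_pixels):
    """components[k]: set of pixel indices belonging to component C_k.
    costs[k]: cost S_k of that component.
    Returns a list with k*(i) for each pixel i (None if no component contains i)."""
    best = [None] * num_pixels
    for k, pixels in enumerate(components):
        for i in pixels:
            if best[i] is None or costs[k] < costs[best[i]]:
                best[i] = k
    return best

# Hypothetical family of nested components over 4 pixels.
components = [{0, 1, 2, 3}, {0, 1}, {2, 3}, {2}]
costs = [0.9, 0.2, 0.6, 0.7]
k_star = optimal_cover(components, costs, num_pixels=4)
# k_star == [1, 1, 2, 2]
```

Pixel 2 belongs to three components (indices 0, 2, and 3) and is assigned to component 2, the one with the lowest cost among them, even though a smaller component also contains it.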



Producing the confidence costs: construction of the cost function that is minimized

➢ Confidence costs $S_k$: given a set of components $C_k$, and using the set of object descriptors $O_k$, we define a function $\hat{c} : O_k \rightarrow [0,1]^{N_c}$ that predicts the distribution $d_k$ of the classes present in component $C_k$.
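The slides do not state how the cost $S_k$ is derived from the predicted distribution $d_k$. One natural choice for a class-purity criterion, assumed here for illustration only, is the entropy of $d_k$: it is zero when a component contains a single class and grows as the classes become more mixed.

```python
import math

def entropy_cost(distribution):
    """Assumed purity cost: S_k = -sum_a d_k[a] * ln(d_k[a]), skipping zero entries."""
    return -sum(p * math.log(p) for p in distribution if p > 0.0)

# Hypothetical predicted class distributions for three components.
pure = [1.0, 0.0, 0.0]         # a single class: lowest possible cost
skewed = [0.8, 0.1, 0.1]       # mostly one class
mixed = [1 / 3, 1 / 3, 1 / 3]  # evenly mixed: highest cost for 3 classes

costs = [entropy_cost(d) for d in (pure, skewed, mixed)]
```

Under this assumption, minimizing the cost per pixel favors the components whose predicted content is closest to a single class.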


Training procedure: the training procedure used to produce the confidence costs

➢ Segmentation collections $(T)_{T \in \tau}$ are constructed on the entire training set and, for all $T \in \tau$, the classifier $\hat{c}$ is trained to predict the distribution of the classes in component $C_k$, as well as the costs $S_k$.


Experiments

Semantic scene understanding results on three different datasets

➢ Stanford Background: contains 715 images of outdoor scenes with 8 classes; all images are 320x240 pixels and contain at least one foreground object. 5-fold cross-validation: 572 images used for training and 143 for testing.

➢ SIFT Flow: composed of 2,688 images, thoroughly labeled by LabelMe users, split into 2,488 training images and 200 test images. Synonym correction was used to obtain 33 semantic labels.

➢ Barcelona: has 14,871 training and 279 test images. The test set consists of street scenes from Barcelona, while the training set ranges in scene type but has no street scenes from Barcelona. Synonyms in the label set were manually consolidated to 170 unique labels.
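The 572/143 split quoted for the Stanford Background dataset follows from 5-fold cross-validation on 715 images (715 / 5 = 143). A minimal sketch of such a split is shown below; it assumes the number of items is divisible by the number of folds and uses contiguous, unshuffled folds, which may differ from the split actually used in the experiments.

```python
def k_fold_splits(num_items, k):
    """Return k (train_indices, test_indices) pairs using contiguous folds.
    Assumes num_items is divisible by k."""
    indices = list(range(num_items))
    fold_size = num_items // k
    splits = []
    for f in range(k):
        test = indices[f * fold_size:(f + 1) * fold_size]
        train = indices[:f * fold_size] + indices[(f + 1) * fold_size:]
        splits.append((train, test))
    return splits

splits = k_fold_splits(715, 5)
train, test = splits[0]
# Each fold: len(train) == 572 and len(test) == 143
```

Every image serves as a test image in exactly one of the five folds, and reported accuracies are typically averaged over the folds.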


Experiments on the Stanford Background dataset

Method                                                      Pixel Acc.  Class Acc.  CT (sec.)
● Convolutional network alone                               66.0%       56.5%       0.35
● Multiscale convolutional network, raw pixel prediction    78.8%       72.4%       0.6
● Superpixel-based predictions                              80.4%       74.56%      0.7
● CRF-based predictions                                     80.4%       75.24%      61
● Cover-based predictions                                   81.4%       76.0%       60.5
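The two accuracy columns are not defined on the slides. The sketch below assumes the usual reading: "Pixel Acc." is the fraction of correctly labeled pixels, and "Class Acc." is the mean of the per-class accuracies, so that rare classes count as much as frequent ones. The label lists are invented.

```python
def pixel_and_class_accuracy(truth, predicted):
    """truth, predicted: flat lists of class ids, one entry per pixel.
    Returns (per-pixel accuracy, average per-class accuracy)."""
    correct, total = {}, {}
    for t, p in zip(truth, predicted):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    pixel_acc = sum(correct.values()) / sum(total.values())
    class_acc = sum(correct[c] / total[c] for c in total) / len(total)
    return pixel_acc, class_acc

# Hypothetical image: 8 pixels of class 0 and 2 pixels of the rare class 1.
truth     = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
pixel_acc, class_acc = pixel_and_class_accuracy(truth, predicted)
# pixel_acc is 0.9, class_acc is 0.75
```

A single mistake on the rare class barely moves the pixel accuracy but costs a quarter of the class accuracy, which is one reason the class accuracy column tends to be lower than the pixel accuracy column.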


Experiments on the Stanford Background dataset (2)

[Figure: example scene labelings. Legend classes: Sky, Grass, Building, Mountain, Tree, Object]


Experiments on the SIFT Flow dataset

Method                            Pixel Acc.  Class Acc.
● Raw multiscale net              67.9%       45.9%
● Multiscale net + superpixels    71.9%       50.08%
● Multiscale net + cover (1)      72.3%       50.08%



Experiments on SIFT Flow dataset (2)


Experiments on the Barcelona dataset

Method                            Pixel Acc.  Class Acc.
● Raw multiscale net              37.8%       12.1%
● Multiscale net + superpixels    44.1%       12.4%




Real World Experiment

● For the real-world experiment, multiscale features were combined with classification using the superpixel strategy, trained on the SIFT Flow dataset.

● The test movie was built from 4 videos, stitched to form a 360° video stream of 1280x256 images.

● Result: the system is the first approach to achieve real-time performance, with one frame processed in less than a second on a 4-core Intel i7 (with a dedicated FPGA implementation, this can be reduced to 60 ms).


Real World Experiment - Video – Real Time Performance


Important Insights on the experiments

➢ Using a high-capacity feature-learning system fed with raw pixels yields excellent results compared with systems using engineered features.

➢ Feeding the system with a wide contextual window is critical to the quality of the results.

➢ When a wide context is taken into account to produce each pixel label, the role of the post-processing is greatly reduced.

➢ The use of highly sophisticated post-processing schemes does not seem to improve the results significantly over simple schemes.

➢ Relying heavily on a highly accurate feed-forward pixel labeling system, while simplifying the post-processing module to its bare minimum, cuts down the inference time considerably.


Conclusion

➢ A feed-forward convolutional network can produce state-of-the-art performance on standard scene parsing datasets without relying on engineered features.

➢ Even in the absence of any post-processing, simply labeling each pixel with the highest-scoring category produced by the convolutional network for that location yields near state-of-the-art pixel-wise accuracy and better per-class accuracy than all previously published results.

➢ Results on datasets with few categories are good, but the accuracy of the best existing scene parsing systems is still low when the number of categories is large.


Questions / Discussion


Sources

➢ http://www.clement.farabet.net/index.html

➢ http://yann.lecun.com/exdb/publis/pdf/farabet-pami-13.pdf

➢ http://yann.lecun.com/exdb/lenet/

➢ http://eblearn.sourceforge.net/old/tutorials/libeblearn/

➢ http://parse.ele.tue.nl/education/cluster2

➢ http://step.polymtl.ca/~rv101/energy/
