Lecture 6 - 4-May-15, Fei-Fei Li, Alexandre Alahi, Vignesh Ramanathan

Project 2 Q&A

Alexandre Alahi Vignesh Ramanathan

Outline

•  TLD Review
•  Error metrics
•  Code Overview
•  Project 2 Report
•  Project 2 Presentations



TLD review

•  Tracker & Detector (T&D) run in parallel
•  Both contribute to the output
•  “Not visible” is a possible output
•  Updates of T&D depend on the Learning module (L)


TLD: Tracking

•  Median-shift tracker: estimates translation & scale

•  Tracker validation: the detector is updated only if the track is forward-backward consistent
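A minimal Python sketch of the two bullets above, assuming point tracks are already available (for example from the LK tracker in the starter code). The function names, the `max_fb` threshold and the rule of keeping the half of the points with the smallest forward-backward error are illustrative assumptions, not part of the starter code.

```python
from math import hypot
from statistics import median


def fb_error(p0, p0_back):
    """Distance between a start point and its position after tracking
    forward to the next frame and then back again."""
    return hypot(p0[0] - p0_back[0], p0[1] - p0_back[1])


def estimate_motion(pts0, pts1, pts0_back, max_fb=10.0):
    """Return (dx, dy, scale, consistent).

    pts0: points in frame t, pts1: the same points tracked to frame t+1,
    pts0_back: pts1 tracked back to frame t.
    max_fb is a hypothetical threshold (in pixels) on the median error.
    """
    errs = [fb_error(a, b) for a, b in zip(pts0, pts0_back)]
    med_err = median(errs)
    # Keep only the points whose forward-backward error is at most the median.
    keep = [i for i, e in enumerate(errs) if e <= med_err]
    # Translation: median displacement of the reliable points.
    dx = median(pts1[i][0] - pts0[i][0] for i in keep)
    dy = median(pts1[i][1] - pts0[i][1] for i in keep)
    # Scale: median ratio of pairwise point distances after / before.
    ratios = []
    for n, i in enumerate(keep):
        for j in keep[n + 1:]:
            d0 = hypot(pts0[i][0] - pts0[j][0], pts0[i][1] - pts0[j][1])
            d1 = hypot(pts1[i][0] - pts1[j][0], pts1[i][1] - pts1[j][1])
            if d0 > 0:
                ratios.append(d1 / d0)
    scale = median(ratios) if ratios else 1.0
    return dx, dy, scale, med_err <= max_fb
```

When `consistent` is False, the tracker output for that frame would be treated as unreliable and not used to update the detector.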


TLD: Detection

•  Three stages:
   -  1st stage: filtering (patch variance)
   -  2nd stage: detection model
   -  3rd stage: classifier (NN, NCC), confidence = d-/(d- + d+)

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 6, NO. 1, JANUARY 2010

Fig. 9. Block diagram of the object detector. (Stages: patch variance, ensemble classifier, 1-NN classifier; each stage separates accepted patches from rejected patches.)

5.3 Object detector

The detector scans the input image by a scanning-window and for each patch decides about presence or absence of the object.

Scanning-window grid. We generate all possible scales and shifts of an initial bounding box with the following parameters: scales step = 1.2, horizontal step = 10% of width, vertical step = 10% of height, minimal bounding box size = 20 pixels. This setting produces around 50k bounding boxes for a QVGA image (240x320), the exact number depends on the aspect ratio of the initial bounding box.
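The grid parameters quoted above can be sketched in Python as follows. The function name, the range of scale exponents and the choice to take the 10% step relative to the scaled box are assumptions made for illustration; the excerpt does not specify them.

```python
def scan_grid(img_w, img_h, box_w, box_h, scale_step=1.2, shift=0.1,
              min_size=20, exponents=range(-10, 11)):
    """Return (x, y, w, h) windows for all scales and shifts of the
    initial box (box_w x box_h) that fit inside the image."""
    boxes = []
    for k in exponents:
        s = scale_step ** k
        w, h = round(box_w * s), round(box_h * s)
        # Skip boxes below the minimal size or larger than the image.
        if min(w, h) < min_size or w > img_w or h > img_h:
            continue
        dx = max(1, round(shift * w))  # horizontal step: 10% of width
        dy = max(1, round(shift * h))  # vertical step: 10% of height
        for y in range(0, img_h - h + 1, dy):
            for x in range(0, img_w - w + 1, dx):
                boxes.append((x, y, w, h))
    return boxes


boxes = scan_grid(320, 240, 40, 30)  # QVGA image, 40x30 initial box
```

The number of boxes produced depends on the initial box and on the assumed exponent range, so it will not necessarily match the 50k figure quoted in the text.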

Cascaded classifier. As the number of bounding boxes to be evaluated is large, the classification of every single patch has to be very efficient. A straightforward approach of directly evaluating the NN classifier is problematic as it involves evaluation of the Relative similarity (i.e. search for two nearest neighbours). As illustrated in figure 9, we structure the classifier into three stages: (i) patch variance, (ii) ensemble classifier, and (iii) nearest neighbor. Each stage either rejects the patch in question or passes it to the next stage. In our previous work [59] we used only the first two stages. Later on, we observed that the performance improves if the third stage is added. Templates allowed us to better estimate the reliability of the detection.

5.3.1 Patch variance

Patch variance is the first stage of our cascade. This stage rejects all patches, for which gray-value variance is smaller than 50% of variance of the patch that was selected for tracking. The stage exploits the fact that gray-value variance of a patch p can be expressed as $E(p^2) - E^2(p)$, and that the expected value $E(p)$ can be measured in constant time using integral images [35]. This stage typically rejects more than 50% of non-object patches (e.g. sky, street). The variance threshold restricts the maximal appearance change of the object. However, since the parameter is easily interpretable, it can be adjusted by a user for particular application. In all of our experiments we kept it constant.
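The variance test can be sketched in plain Python with two integral images, one over pixel values and one over their squares. The function names and the list-of-rows image representation are assumptions for illustration only.

```python
def integral(img, power=1):
    """Summed-area table of img**power, with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x] ** power
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Sum over the box with top-left (x, y) and size w x h, in constant time."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def patch_variance(ii, ii2, x, y, w, h):
    """Variance as E(p^2) - E(p)^2 using the two integral images."""
    n = w * h
    mean = box_sum(ii, x, y, w, h) / n
    return box_sum(ii2, x, y, w, h) / n - mean * mean


def passes_variance_filter(ii, ii2, box, init_var, ratio=0.5):
    """Keep a patch only if its variance is at least 50% of the variance
    of the patch that was selected for tracking (init_var)."""
    return patch_variance(ii, ii2, *box) >= ratio * init_var


img = [[1, 2], [3, 4]]
ii, ii2 = integral(img), integral(img, power=2)
var_full = patch_variance(ii, ii2, 0, 0, 2, 2)
```

Both tables are built once per frame, after which every candidate box costs a fixed number of lookups regardless of its size.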

5.3.2 Ensemble classifier

Ensemble classifier is the second stage of our detector. The input to the ensemble is an image patch that was not rejected by the variance filter. The ensemble consists of n base classifiers. Each base classifier i performs a number of pixel comparisons on the patch resulting in a binary code x, which indexes to an array of posteriors $P_i(y|x)$, where $y \in \{0, 1\}$. The posteriors of individual base classifiers are averaged and the ensemble classifies the patch as the object if the average posterior is larger than 50%.

Fig. 10. Conversion of a patch to a binary code. (Labels: input image + bounding box, blur, blurred image, measure, pixel comparisons, binary code, output.)

Pixel comparisons. Every base classifier is based on a set of pixel comparisons. Similarly as in [60], [61], [62], the pixel comparisons are generated offline at random and stay fixed in run-time. First, the image is convolved with a Gaussian kernel with standard deviation of 3 pixels to increase the robustness to shift and image noise. Next, the predefined set of pixel comparison is stretched to the patch. Each comparison returns 0 or 1 and these measurements are concatenated into x.

Generating pixel comparisons. The vital element of ensemble classifiers is the independence of the base classifiers [63]. The independence of the classifiers is in our case enforced by generating different pixel comparisons for each base classifier. First, we discretize the space of pixel locations within a normalized patch and generate all possible horizontal and vertical pixel comparisons. Next, we permutate the comparisons and split them into the base classifiers. As a result, every classifier is guaranteed to be based on a different set of features and all the features together uniformly cover the entire patch. This is in contrast to standard approaches [60], [61], [62], where every pixel comparison is generated independent of other pixel comparisons.

Posterior probabilities. Every base classifier i maintains a distribution of posterior probabilities $P_i(y|x)$. The distribution has $2^d$ entries, where d is the number of pixel comparisons. We use 13 comparisons, which gives 8192 possible codes that index to the posterior probability. The probability is estimated as $P_i(y|x) = \frac{\#p}{\#p + \#n}$, where #p and #n correspond to number of positive and negative patches, respectively, that were assigned the same binary code.

Initialization and update. In the initialization stage, all base posterior probabilities are set to zero, i.e. vote for negative class. During run-time the ensemble classifier is updated as follows. The labeled example is classified by the ensemble and if the classification is incorrect, the corresponding #p and #n are updated which consequently updates $P_i(y|x)$.
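A compact Python sketch of the ensemble described in this subsection. It makes several simplifying assumptions that are not in the excerpt: comparisons are given directly in patch pixel coordinates (no blurring or stretching), counts are stored in dictionaries rather than a $2^d$ array, and all class and function names are hypothetical.

```python
class BaseClassifier:
    def __init__(self, comparisons):
        self.comparisons = comparisons  # list of ((x1, y1), (x2, y2))
        self.pos = {}                   # binary code -> #p
        self.neg = {}                   # binary code -> #n

    def code(self, patch):
        """Concatenate the 0/1 results of all pixel comparisons into x."""
        x = 0
        for (x1, y1), (x2, y2) in self.comparisons:
            x = (x << 1) | (patch[y1][x1] > patch[y2][x2])
        return x

    def posterior(self, patch):
        """P_i(y=1|x) = #p / (#p + #n); zero when the code is unseen."""
        c = self.code(patch)
        p, n = self.pos.get(c, 0), self.neg.get(c, 0)
        return p / (p + n) if p + n else 0.0

    def update(self, patch, is_object):
        c = self.code(patch)
        table = self.pos if is_object else self.neg
        table[c] = table.get(c, 0) + 1


def ensemble_posterior(classifiers, patch):
    """Average the posteriors of the individual base classifiers."""
    return sum(c.posterior(patch) for c in classifiers) / len(classifiers)


def train(classifiers, patch, is_object):
    """Update the counts only if the ensemble misclassifies the example."""
    predicted = ensemble_posterior(classifiers, patch) > 0.5
    if predicted != is_object:
        for c in classifiers:
            c.update(patch, is_object)
```

Because every posterior starts at zero, the first positive example is always misclassified and therefore always triggers an update.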

5.3.3 Nearest neighbor classifier

After filtering the patches by the variance filter and the ensemble classifier, we are typically left with several of bounding boxes that are not decided yet ($\approx 50$). Therefore, we can use the online model and classify the patch using a NN classifier. A patch is classified as the object if $S^r(p, M) > \theta_{NN}$, where $\theta_{NN} = 0.6$. This parameter has been set empirically and its value is not critical. We observed that similar performance is achieved in the range (0.5-0.7). The positively classified patches represent the responses of the object detector. When the number of templates in NN classifier exceeds some threshold (given by memory), we use random forgetting of templates. We observed that the number of templates stabilizes around


Slide credit from D. Capel (http://vision.cse.psu.edu/seminars/talks/2009/random_tff/ForestsAndFernsTalk.pdf)




TLD: Learning

•  P-constraints: Patches close to the trajectory update the detector with a Positive label
•  N-constraints: Non-maximally confident detections update the detector with a Negative label
•  Both constraints make errors.


TLD: Integrator

Tracker      Detector     Integrator
Found box    Found box    (to implement)
No box       Found box    (to implement)
Found box    No box       (to implement)
No box       No box       (to implement)

You need to implement the output
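One possible way to fill in the four cases is sketched below in Python. This is only an illustrative policy, not the provided integrator: the function name, the use of confidence scores to break ties and the returned `source` tag are all assumptions.

```python
def integrate(tracker_box, tracker_conf, detector_box, detector_conf):
    """Combine tracker and detector outputs for one frame.

    Returns (box, source); box is None when the object is reported
    as not visible.
    """
    if tracker_box is not None and detector_box is not None:
        # Both found a box: trust the more confident one (assumed policy).
        if detector_conf > tracker_conf:
            return detector_box, "detector"
        return tracker_box, "tracker"
    if tracker_box is not None:
        return tracker_box, "tracker"
    if detector_box is not None:
        # Tracker failed: the detection can be used to restart it.
        return detector_box, "detector"
    return None, "none"
```

A design question left open here is when a detection should override a still-valid track; comparing confidences is the simplest option and may not suit every sequence.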


TLD: Learning (init)

•  For 1st frame:
   –  Sample 200 P
•  For other frames:
   –  Sample 100 P


TLD: Learning (model update)

•  Augment both P & N when:
   -  the patch is wrongly classified by NN ⇔ when the integrator relies on the tracker response
•  The NN uses a threshold to determine P & N patches

Integrator   NN   Retain or discard
P            N    Retain as P
N            P    Retain as N
P            P    Discard
N            N    Discard
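The retain/discard table can be written as a short Python rule. The function name, the list-based storage and the default threshold of 0.6 (borrowed from the $\theta_{NN}$ value quoted in the paper excerpt) are assumptions for illustration.

```python
def update_nn_model(pex, nex, patch, integrator_label, nn_conf, thr=0.6):
    """Add the patch to the positive (pex) or negative (nex) examples
    only when the NN label disagrees with the integrator label.

    integrator_label is "P" or "N"; nn_conf is the NN confidence.
    Returns True if the patch was retained.
    """
    nn_label = "P" if nn_conf > thr else "N"
    if nn_label == integrator_label:
        return False  # NN already classifies this patch correctly: discard
    (pex if integrator_label == "P" else nex).append(patch)
    return True
```

Retaining only the disagreements keeps the stored example sets small, which matters because the NN stage compares each candidate against every stored patch.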


TLD Questions?


Deviation from ground-truth

•  Ground-truth bounding box
•  Bounding box from TLD (confidence): Conf=0.9, Conf=0.2, Conf=0.7
•  Compute overlap as (Intersection area)/(Union area)
•  Example: Overlap = 0.7, Overlap = 0.55, Overlap = 0.15
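The overlap measure can be sketched in Python as below. The `(x1, y1, x2, y2)` corner convention with continuous coordinates is an assumption; the provided `mex/bb_overlap.cpp` may use a different box format or pixel convention.

```python
def box_overlap(a, b):
    """Intersection area divided by union area of two boxes,
    each given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # boxes do not intersect
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

For example, two 10x10 boxes offset horizontally by 5 pixels share an area of 50 out of a union of 150, giving an overlap of 1/3.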


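Metric 1: Average Overlap

Average overlap = $\frac{1}{N}\sum_{i=1}^{N} o_i$, where $o_i$ is the overlap between the ground-truth and tracked bounding box in frame #i and $N$ is the number of frames.

Example: Overlap = 0.7, Overlap = 0.55, Overlap = 0.15

Average Overlap = (0.7 + 0.55 + 0.15)/3 = 0.467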


Problem with average overlap


Doesn’t account for confidence score from tracking algorithm.

More confident boxes should be weighted higher


Metric 2: Mean Average Precision

1.  Sort frames by confidence of bounding box from TLD algorithm (Frame #1, Frame #2, Frame #3 in order of decreasing confidence)
2.  A bounding box from TLD is said to be tracked correctly if the overlap > 0.5 (in the example: Frame #1 Correct, Frame #2 Wrong, Frame #3 Correct)
3.  Compute precision at different values of recall
    -  recall = 0.33, precision = 1.0
    -  recall = 0.67, precision = 0.67
4.  Compute average precision
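The four steps can be sketched in Python as follows. Two things here are assumptions rather than facts from the slides: the pairing of confidences with overlaps in the example data (chosen so that the middle-confidence box is the wrong one, as the Correct/Wrong/Correct labels imply), and the exact averaging rule, which is one common definition of average precision and may differ from what `tldEvaluate.m` computes.

```python
def precision_recall(results, min_overlap=0.5):
    """results: non-empty list of (confidence, overlap), one per frame.
    Returns (recall, precision) after each frame, taken in order of
    decreasing confidence."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    points, correct = [], 0
    for rank, (_, ov) in enumerate(ranked, start=1):
        correct += ov > min_overlap
        points.append((correct / len(ranked), correct / rank))
    return points


def average_precision(results, min_overlap=0.5):
    """Sum of the precision at each correctly tracked frame, divided by
    the number of frames (assumed definition)."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    correct, total = 0, 0.0
    for rank, (_, ov) in enumerate(ranked, start=1):
        if ov > min_overlap:
            correct += 1
            total += correct / rank
    return total / len(ranked)


# (confidence, overlap) pairs; the pairing is assumed for illustration.
frames = [(0.9, 0.7), (0.7, 0.15), (0.2, 0.55)]
points = precision_recall(frames)
ap = average_precision(frames)
```

With this data the first and last points are (recall 0.33, precision 1.0) and (recall 0.67, precision 0.67), which are the two values shown in step 3.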



What We Provide

•  TLD_project_starter_codes_release.tar.gz
   –  Matlab wrapper with various utility functions and display methods for TLD tracking
   –  Also includes evaluation code
   –  Modified from the original implementation of TLD by Zdenek Kalal
•  tiny_tracking_data.tar.gz
   –  4 validation video sequences (sequence of image frames)
   –  5 test video sequences (sequence of image frames)
   –  Initializing bounding box on first frame + ground-truth bounding box in each frame
   –  All videos less than 200 frames


Starter code: Param. Initialization (A modified version of the original Matlab implementation from Zdenek Kalal)

•  run_TLD_on_video.m
   –  Sets up tld parameters, calls tldExample and saves tracking results to a text file
   –  TODO: Set all the parameters for the TLD algorithm
      •  Minimal window size of object bbox
      •  Patch size to resize every patch before learning/detection
      •  Parameters specific to your learning algo (such as regularization constant)
      •  Parameters for selecting positive and negative patches for learning


Starter code: Wrapper

•  tldExample.m (Nothing to do)
   –  Initializes with tldInit
   –  Calls the tldProcessFrame function on every frame
   –  Also saves the output images with tracked bbox to output directory


Starter code: Initialization

•  tldInit.m
   –  Initializes the LK tracker and also chooses positive and negative examples from the first frame for initializing the detector and Nearest Neighbor (NN) method
   –  TODO: Initialize your detector based on the positive and negative examples from first frame


Starter code: Process frame

•  tldProcessFrame.m
   –  Calls the LK tracker tldTracking.m to track densely sampled keypoints from bounding box
   –  Calls the trained detector to identify potential object boxes in frame
   –  Integrates detection and tracking bounding boxes
      •  TODO: Modify the integrator to improve performance. The provided integrator might not be a good strategy for all video sequences
   –  Calls tldLearning to update detector and NN model


Starter code: Detection

•  tldDetection.m
   –  Calls the detector to identify candidate bounding boxes in the current frame
   –  TODO: Run your detection method on provided image patches
•  tldNN.m
   –  Runs Nearest Neighbor model on the patches selected by detector from previous step
   –  TODO: Compute a confidence measure to determine how confident the NN is about each patch being a bbox
      •  Use Normalized Cross Correlation
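A Python sketch of an NCC-based confidence of the form d-/(d- + d+) mentioned on the detection slide. Treating d+ and d- as one minus the similarity to the nearest positive and negative example, and mapping NCC from [-1, 1] to a similarity in [0, 1], are assumptions made for this sketch; the names `ncc` and `nn_confidence` are hypothetical, while `pex` and `nex` mirror the `tld.pex` and `tld.nex` fields named in the slides.

```python
from math import sqrt


def ncc(a, b):
    """Normalized cross-correlation of two equal-length vectors, in [-1, 1]."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [v - ma for v in a]
    db = [v - mb for v in b]
    denom = sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def nn_confidence(patch, pex, nex):
    """confidence = d- / (d- + d+), where d is the distance to the nearest
    positive (d+) or negative (d-) stored example."""
    def sim(example):
        return 0.5 * (ncc(patch, example) + 1.0)  # similarity in [0, 1]

    d_pos = 1.0 - max((sim(e) for e in pex), default=0.0)
    d_neg = 1.0 - max((sim(e) for e in nex), default=0.0)
    total = d_pos + d_neg
    return d_neg / total if total else 0.0
```

A patch that matches a positive example perfectly and is anti-correlated with every negative gets a confidence of 1; the result can then be compared against a threshold to decide whether the patch is the object.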


Starter code: Learning

•  tldLearning.m
   –  Updates the detection model
   –  Calls tldTrainNN to update the NN model
   –  TODO: Train your detection method
•  tldTrainNN.m
   –  TODO: Update stored positive and negative patches tld.pex and tld.nex based on newly seen positive and negative patches


Starter code: +ve & -ve examples

•  tldGeneratePositiveData.m
   –  Called by tldLearning.m and tldInit.m
   –  TODO: Choose positive examples from current image based on overlap of the grid boxes with the tracked box from frame
•  tldGenerateNegativeData.m
   –  Called by tldLearning.m and tldInit.m
   –  TODO: Choose negative examples from current image based on overlap of the grid boxes with the tracked box from frame
•  tldPatch2Pattern.m
   –  TODO: Compute features from the given patches to be used by learning/detection/NN
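One simple overlap-based selection rule is sketched below in Python. The thresholds, the cap on the number of positives and the function name are hypothetical values chosen for illustration; the sampling strategy is one of the components you are expected to design yourself.

```python
def select_examples(overlaps, pos_thr=0.6, neg_thr=0.2, max_pos=10):
    """overlaps[i] is the overlap of grid box i with the tracked box.

    Returns (positive indices, negative indices): positives are the
    best-overlapping boxes above pos_thr (at most max_pos of them),
    negatives are all boxes below neg_thr.
    """
    ranked = sorted(range(len(overlaps)), key=lambda i: overlaps[i],
                    reverse=True)
    pos = [i for i in ranked if overlaps[i] > pos_thr][:max_pos]
    neg = [i for i in range(len(overlaps)) if overlaps[i] < neg_thr]
    return pos, neg
```

Boxes whose overlap falls between the two thresholds are left unlabeled, which avoids training on ambiguous patches.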


Starter code: Other Utils (Nothing to do)

•  tldDisplay.m
   –  Plots the tracked bounding box on each image
   –  Shows the points tracked by LK tracker in blue
   –  Shows center points of patches selected by your detector in grey
•  tldEvaluate.m
   –  Evaluates tracking by computing avg. overlap and avg. precision
•  mex/bb_overlap.cpp: Computes overlap between bboxes
•  mex/lk.cpp: Lucas-Kanade tracker
•  bbox/bb_cluster.m: Clusters bounding boxes
•  bbox/bb_scan.m: Generates a dense grid of bounding boxes in image


What You Need To Do

1.  Implement the TODO sections in code
    1.  Learning / Detection method
    2.  Features used
    3.  Positive and Negative sampling strategy
    4.  Integrator to combine detection and tracking results
2.  Measure performance with provided ground-truth for all videos (main.m)
    •  Sanity check: Our baseline TLD has average overlap=0.68, average precision=0.78. You should be able to get better performance.


What You Need To Do

Q: Do I have to use the Matlab starter code?
A: No! But ask the TAs if you want to use another language. You might have to be careful about the LK tracking implementation and integration.
Q: Do I need to turn in my code?
A: Yes. There should be a script we can call that’ll e.g. run your method on an image without any/much modification.


Project 2 Report

•  Write-up template provided on website (link)
•  Use CVPR LaTeX template
•  No more than 5 pages


Project 2 Report

Rough sections:

1.  Overview of the field (online single object tracking)
2.  The algorithm overview
3.  Components implemented by you (Your contribution)
    1.  Learning / Detection method
    2.  Features used
    3.  Positive and Negative sampling strategy
    4.  Integrator to combine detection and tracking results
4.  Code README
5.  Results
    1.  Quantitative result for each sequence (Validation + Test)
        1.  Avg. overlap, Avg. precision and time taken/frame
    2.  Qualitative result with analysis
    3.  Error analysis for difficult sequences


Project 2 Report

Overview of the field
•  What is the problem
•  What is the general scope of methods we’ve talked about in class
•  Mini-summary of class papers
•  Cite papers!


Project 2 Report

The algorithm overview
•  Your understanding of how TLD works
•  Why would just using a LK tracker fail?
•  Why is using only detection/learning prohibitive?
•  How do the tracker (T) and learning/detection (LD) interact?
•  Suggestions for improving the method!


Project 2 Report

Components implemented by you:
•  Learning / Detection method
•  Features used
•  Positive and Negative sampling strategy
•  Integrator to combine detection and tracking results

•  Motivate your choice for each component!
•  Provide a quantitative/qualitative comparison with other possible model choices


Project 2 Report

Code
•  A README for your code
•  What are the key files/functions (if you added additional files, explain them too)
•  How can the TAs reproduce your results?


Project 2 Report

Results
•  Quantitative results
   –  Average precision per video sequence
   –  Average overlap per video sequence
   –  Time taken per frame to track object in the video
•  For project 2, provide results separately for the validation and test sets.
•  Qualitative results
   –  2 interesting examples where your detection method succeeded and 2 examples where it failed
   –  Detailed error analysis for cases where it failed


Extensions

•  The assignment is open ended in terms of the features/learning/detection methods you choose to use
•  Plenty of possibilities to try different methods :)


Possible Extensions

•  Comparison of different features (patch2pattern)
   –  Binary features are usually fast to compute and give reasonable performance
   –  Try openly available implementations of BRIEF, LBP, FREAK
   –  Dense features give good performance but are slower
      •  Resized patch after mean subtraction (or whitening)
      •  HOG from resized patch
•  Try different sampling and pooling strategies for features
   –  Densely sample the entire patch or use keypoints
   –  Spatial pyramids for pooling


Possible Extensions

•  Comparison of different learning methods
   –  Slower batch trained classifiers such as linear SVM
   –  Faster online SVM, random ferns
   –  Detection strategy
      •  Run classifier densely on all grid bounding boxes
      •  Pre-select a smaller subset of good candidates to run classifier
•  Data augmentation for learning
   –  Warping/shifting/noise addition to positives and negatives
   –  Mine only hard negatives for training classifiers


Possible Extensions

•  Experimenting with the integrator
   –  When to restart the LK tracker?
   –  How to weight the tracker and detection results?
   –  Adapting the integrator method based on video properties
•  Introducing priors to regularize the tracking
   –  Penalize sudden and large bbox transitions between frames
   –  Penalize sudden change in direction of motion


Project 2 Presentations

•  These happen the day before the report/code is due.
•  Every team should submit 4-5 slides to Alex ([email protected]) by 5 pm the day before (Sun May 10)
•  Reminder: Teams of 1 or 2 people
   •  If two people, make sure both present!
•  We will randomly pick ~10 teams to present.


Things to include in presentation

•  Important contributions in your implementation
•  Subtleties/things you didn’t expect
•  Important: 2 video results for your tracking
   –  Provide result on one video which is not from the provided dataset
   –  (Note: You may use ffmpeg to combine the output frames generated by the tracking method into a video)
•  Any insights!


Grading

•  35%: Technical Approach and Code
   •  Is your code correct? Do anything cool?
•  35%: Experimental Evaluation
   •  Performance, insights, thorough evaluation
•  20%: Write-up
   •  Contains everything, formatted well, etc.
•  10%: Project Presentation
   •  Clarity, Content.
   •  Not counted if no presentation in a week.


Submitting

•  Submit via CourseWork
•  One submission per team
•  We’ll use cheating-detection software
   –  Do not use the openly available TLD code!
   –  Cite any external code/library you use!
   –  Please don’t make this an issue!


Late Days

•  You have 7, split between the three projects any way you want
•  But your project presentation itself still needs to be on time (in class). Late days only apply to write-up/code submission


Working in Groups

•  You can work with up to one other person
•  Shared code/report.
•  We’ll grade fairly regardless of team size


Important Dates

•  May 11 (in class): Presentations
•  May 12 (5 pm): Reports due


Other Questions?
