TRANSCRIPT
Face Detection
Lecturers: Mor Yakobovits, Roni Karlikar
Supervisor: Hagit Hel-Or
Introduction
• Humans can easily detect faces, even though faces can be very different from each other.
• Humans also have a tendency to see face patterns even where none really exist.
Faces everywhere
http://www.marcofolio.net/imagedump/faces_everywhere_15_images_8_illusions.html
Face Detection
• The problem of face detection: given an image, determine whether it contains faces or not.
• The idea of face detection in computer vision is to let the computer learn to detect faces in images, just as a human can do.
Applications of Face Detection
• Auto-focus in cameras
• Security systems (recognizing faces of certain people)
• Human-computer interfaces
• Marketing systems
• Much more…
Difficulties of Face Detection
Building a model for faces is not a simple task: faces are complex and vary from each other. Faces in images are also affected by the environment.
Difficulties – Changing Lighting
• Affects color and facial features
Difficulties - Skin Tone
• Large variety of skin tones.
Difficulties - Facial Expressions
• Affects the shape of the face and its features
Difficulties – Scaling and Angles
Difficulties - Obstructions
• Obstruction of facial features
Today’s Lecture
• We will talk about:
– Skin detection
– Eigenfaces
– Viola-Jones algorithm
Today’s Lecture
• All three approaches we'll see today are based on learning.
– The computer learns to detect faces.
Learning - Intro
• The learning model we'll use is a classifier.
– Purpose: classify data into several classes.
– Training stage: let the computer learn the features of each class (face & non-face). This is done using a dataset with examples of each class (the instances are already labeled).
– Classification: given a new instance, tell which class it belongs to.
Example: studying for an exam by solving previous exams.
Face Detection Using Skin Detection
Probabilistic Approach
Skin Detection
• Purpose:
– Find “skin pixels” in a given image.
• The main question:
– How to determine if a pixel is a “skin pixel”?
– Our approach will be to teach the computer which colors are skin colors and which aren't.
Skin Detection
• Skin detection is a color (pixel)-based approach for detecting faces.
• This approach is quite simple,
• but has limited results due to
– high sensitivity to illumination and other changes in skin tones
– not only faces contain skin (arms, legs…)
– some objects having colors similar to skin (for example, wooden furniture)
Example of how illumination causes false-negative and false-positive detections.
“Detecting Faces in Color Images” by Hsu, Abdel-Mottaleb & Jain
(a) a yellow-biased face image (b) a light-compensated image (c) skin regions of (a) in white (d) skin regions of (b)
More examples of false-positive (chair in top-left corner) and false-negative (dark area of the face in the image of the soccer player) skin detections.
Rehg & Jones (1999)
Skin Colors In RGB Color Space
97% of the skin-color bins overlap with non-skin color bins. An explanation could be the many objects whose colors resemble skin, like walls, rail tracks, furniture and wooden objects.
Rehg & Jones (1999)
Skin Classifier
• The problem: given a pixel x with color (r,g,b), determine whether it is skin or not.
Skin Classifier
• Given x = (R,G,B), how do we determine its class? (skin/non-skin)
• Nearest neighbor
– find the labeled pixel closest to x
– choose that pixel's class
• Data modeling
– fit a model (curve, surface, or volume) to each class
• Probabilistic data modeling
– fit a probability model to each class
– we'll focus on this approach
(In the scatter plot: orange dots – skin, purple dots – non-skin)
Probabilistic Skin Classifier
• Two approaches we’ll discuss– Gaussian-Based (parametric model)– Histogram-Based (non-parametric model)
Parametric modeling
Main Idea:
• Assume the type/shape of the distribution we're trying to find.
• Find the parameter values for the assumed type from a training set.
Gaussian-Based Approach (Parametric model)
• Single Gaussian Model
– We assume the probabilistic distribution we are trying to find is a normal distribution (Gaussian function).
• To find that distribution, all we need is:
– μ – the mean of the learned skin colors (μ is a color vector!)
– Σ – the covariance matrix of the learned skin colors
• These parameters are estimated separately for each class.
• After we have the mean & covariance, we get:
P(x | j) = exp(−½ (x − μj)ᵀ Σj⁻¹ (x − μj)) / ((2π)^(3/2) |Σj|^(1/2))
• where μj is the mean vector and Σj is the covariance matrix of class j
– for j = skin and j = non-skin
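As a concrete sketch (not the lecturers' code), the class-conditional Gaussian density above can be evaluated directly with NumPy. The `mu_skin` and `cov_skin` values here are stand-ins for illustration, not parameters learned from real skin data:

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Multivariate normal density P(x | class) for a color vector x."""
    d = len(mu)
    diff = np.asarray(x, float) - np.asarray(mu, float)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    expo = -0.5 * diff @ np.linalg.inv(cov) @ diff
    return np.exp(expo) / norm

# Toy parameters (illustrative only): a "skin" class centered on a reddish color.
mu_skin = np.array([180.0, 120.0, 100.0])
cov_skin = np.diag([400.0, 300.0, 300.0])

# Density is maximal at the mean color and falls off away from it.
p_at_mean = gaussian_density([180, 120, 100], mu_skin, cov_skin)
```

In a full classifier, the same function would be called twice, once with the skin parameters and once with the non-skin parameters.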
Gaussian-Based Approach (Parametric model)
What we have
• P(rgb | skin) & P(rgb | ~skin) – “the probability that a (non-)skin pixel will have the color rgb”
But that's not what we want.
• We need P(skin | rgb) & P(~skin | rgb) – “the probability that a pixel with the color rgb is (non-)skin”
What we need
After we achieve that, we can use MAP estimation.
Remember Bayes’ Rule?
Bayes' Rule
P(skin | R) = P(R | skin) · P(skin) / P(R)
• P(skin) is the portion of skin pixels out of the total pixels in the learning dataset.
• P(R) can be calculated using the probabilities we already have:
P(R) = P(R | skin) · P(skin) + P(R | ~skin) · P(~skin)
MAP Estimation (Maximum A Posteriori estimation)
Classification: a pixel with color R will be classified as skin iff P(skin | R) is higher than P(~skin | R).
MAP estimation maximizes the posterior probability, and so minimizes the probability of misclassification.
False-negative misclassification: a skin pixel classified as non-skin.
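A minimal sketch of the MAP decision rule (the likelihood and prior values used below are illustrative, not learned from data):

```python
def map_classify(p_color_given_skin, p_color_given_nonskin, p_skin):
    """Classify a pixel as 'skin' iff P(skin | color) > P(~skin | color).

    By Bayes' rule both posteriors share the denominator P(color), so it
    cancels and it is enough to compare the joints P(color|class)*P(class)."""
    skin_score = p_color_given_skin * p_skin
    nonskin_score = p_color_given_nonskin * (1.0 - p_skin)
    return "skin" if skin_score > nonskin_score else "non-skin"
```

Note how the prior matters: with the same likelihoods, `map_classify(0.1, 0.02, 0.3)` yields "skin" while `map_classify(0.1, 0.02, 0.1)` yields "non-skin".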
Another Gaussian-Based Approach (Parametric model)
• Problem with the Single Gaussian Model:
– The actual skin distribution might be too complex to be represented as a single Gaussian distribution.
• Solution: Mixture of Gaussians (MoG)
– Represent the distribution with several different Gaussian distributions, to allow more flexible modeling of the distribution.
Skin Color distribution in HSV color space
– HSV (Hue, Saturation, Value) separates the color components from intensity (in RGB, intensity affects all channels).
– Not the best color space for color-based approaches, but the conversion is very simple compared to the better color spaces.
• Drawback:
– Slower learning, because we need the EM algorithm to estimate the MoG.
– Slower classification, since it requires evaluating all of the Gaussians.
Mixed Non-skin Model Mixed Skin Model
Gaussian-Based Approach (Parametric model)
• In the case of a Mixture of Gaussians, the class-conditional density becomes a weighted sum of Gaussians:
P(x | j) = Σk wj,k · N(x; μj,k, Σj,k), where the weights wj,k sum to 1.
Classification: use Bayes' rule and then MAP, as before.
MoG vs. Single Gaussian
(Panels: training set distribution, MoG fit, single Gaussian fit)
EM algorithm – Expectation Maximization algorithm
Non-parametric modeling
Main Idea:
• Do not assume anything about the distribution we are looking for.
• Derive the distribution directly from the dataset.
Histogram-Based Approach (Non-parametric model)
• Learn from a labeled dataset:
– for each color bin (256×256×256 ≈ 16.7M bins in RGB), count
• how many pixels of that color were skin
• how many pixels of that color were non-skin
• We get a skin histogram, and an equivalent histogram for non-skin pixels.
(Our histograms will have three dimensions.)
• we have P(rgb / skin) & P(rgb / ~skin)
• we need P(skin / rgb) & P(~skin / rgb)
Histogram-Based Approach (Non-parametric model)
• A 3D histogram looks like this:
(1999)Rehg & Jones
Histogram-Based Approach (Non-parametric model)
• Viewing direction along the green-magenta axis which joins corners (0,255,0) and (255,0,255) in RGB
• The viewpoint was chosen to orient the gray line horizontally
• 8 bins in each color channel• Only shows bins with counts greater than 336,818
Histogram-Based Approach (Non-parametric model)
• Step-by-step explanation:
– Learning:
1. Using a labeled dataset, for each color X, count the occurrences of X as a skin pixel and as a non-skin pixel: Ns[X] and Nn[X] respectively.
2. Normalize each histogram, for each color X: P(X | skin) = Ns[X] / Ts and P(X | ~skin) = Nn[X] / Tn, where Ts and Tn are the total counts of skin and non-skin pixels.
• Step-by-step explanation (cont'd):
– Learning:
3. Apply Bayes' rule, for each color X:
P(skin | X) = P(X | skin) · P(skin) / P(X)
• We have P(X | skin) from the normalized histogram.
• P(skin) = Ts / (Ts + Tn) – the portion of skin pixels in the dataset.
• P(X) = P(X | skin) · P(skin) + P(X | ~skin) · P(~skin).
• Symmetrically for P(~skin | X).
Histogram-Based Approach (Non-parametric model)
• Step-by-step explanation (cont'd):
– Classification:
• We are given a color X.
• Determine the class with MAP estimation: classify as skin iff P(skin | X) > P(~skin | X).
• Only 2 table look-ups!
– One in the skin histogram and one in the non-skin histogram.
Histogram-Based Approach (Non-parametric model)
• Assume we observed in the dataset:
– 534 skin pixels with the color (100, 100, 100)
– 330 non-skin pixels with the color (100, 100, 100)
– a total of 10000 observed pixels:
• 5000 skin pixels
• 5000 non-skin pixels
• We get the corresponding probabilities:
– P((100, 100, 100) | Skin) = 534/5000 = 0.1068
– P((100, 100, 100) | Non-skin) = 330/5000 = 0.066
• The histograms (skin & non-skin):
Histogram-Based Approach – Example (Non-parametric model)
Example – cont’d
• P((100, 100, 100) | Skin) = 0.1068
• P((100, 100, 100) | ~Skin) = 0.066
• Using Bayes' Rule with equal priors P(skin) = P(~skin) = 0.5:
P(Skin | (100, 100, 100)) = 0.1068·0.5 / (0.1068·0.5 + 0.066·0.5) ≈ 0.618
• Since the priors are equal, comparing the likelihoods is the same as comparing the posteriors: P((100, 100, 100) | Skin) is bigger than P((100, 100, 100) | ~Skin),
– and so every pixel with the color (100, 100, 100) will be classified as a skin pixel.
Reminder: Bayes' Rule – P(A | B) = P(B | A) · P(A) / P(B)
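The worked example above can be checked in a few lines; the counts are exactly the ones from the slide:

```python
# Counts from the example: 534 skin and 330 non-skin pixels of color
# (100, 100, 100), out of 5000 skin and 5000 non-skin pixels in total.
n_skin, n_nonskin = 534, 330
t_skin, t_nonskin = 5000, 5000

p_color_given_skin = n_skin / t_skin           # P(X | skin)   = 0.1068
p_color_given_nonskin = n_nonskin / t_nonskin  # P(X | ~skin)  = 0.066
p_skin = t_skin / (t_skin + t_nonskin)         # P(skin)       = 0.5

# Bayes' rule: P(skin | X) = P(X | skin) P(skin) / P(X)
p_color = p_color_given_skin * p_skin + p_color_given_nonskin * (1 - p_skin)
p_skin_given_color = p_color_given_skin * p_skin / p_color
```

`p_skin_given_color` comes out to about 0.618, above 0.5, so MAP classifies the color as skin.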
Parametric vs. Non-parametric Modeling
• Dataset size needed:
– Parametric: can generalize with a rather small dataset.
– Non-parametric: requires a big dataset to achieve good performance.
• Learning:
– Parametric: slower than non-parametric if we use the EM algorithm.
– Non-parametric: fast and simple.
• Classification:
– Parametric: rather slow – need to evaluate (at least) 2 Gaussians for each classification.
– Non-parametric: very fast – only 2 table look-ups required.
• Storage space:
– Parametric: disproportionately smaller than the non-parametric model (Rehg & Jones – MoG needed 896 bytes).
– Non-parametric: very big – we explicitly store the distribution for each color (16.7M bins for a 3D color space; Rehg & Jones needed 262 KB).
Bibliography
• “Statistical Color Models with Application to Skin Detection” by Rehg & Jones (1999)
• “Detecting Faces in Color Images” by Hsu, Abdel-Mottaleb & Jain (2002)
• “A Survey on Pixel-Based Skin Color Detection Techniques” by Vezhnevets, Sazonov & Andreeva (2003)
• http://alumni.media.mit.edu/~maov/classes/comp_photo_vision08f/lect/05_skin_detection.pdf
• http://pages.cs.wisc.edu/~lizhang/courses/cs766-2007f/syllabus/10-23-recognition/10-22-recognition.ppt
Eigenfaces
M.A. Turk and A.P. Pentland: “Eigenfaces for Recognition.” Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
What is an image?
• An image is a point in a high-dimensional space:
– an N × M image is a point in R^(NM)
• We can define a vector for every image in this space.
The Space of Faces
• Images of faces, being similar in overall configuration (nose, mouth, eyes…), will not be randomly distributed in this huge image space, even though the image space is very big (a 200×200 image is a point in R^40000).
• Therefore, they can be described by a low-dimensional subspace.
Eigenfaces-key ideas
Eigenfaces look somewhat like generic faces.
• Find basis vectors that describe the face space without losing a lot of data.
• Use them to detect faces
• Dimensionality reduction: we can represent the yellow points with only their v1 coordinates,
• since the v2 coordinates are all essentially 0.
(In the figure: v1 and v2 are the two axis directions.)
Dimensionality Reduction
Motivation:
• This makes it much cheaper to store and compare points.
• It is an even bigger deal for higher-dimensional problems (today there are 8-megapixel images).
The problem:
• In a perfect world we could find a small sub-space that describes the face space without losing any data – e.g., reducing 2 dimensions (x1, x2) to 1 dimension (y1).
• But this is not the situation!
(Figure: 2-dimensional data (x, y) that cannot be reduced to 1 dimension without loss.)
• What can we do? Use PCA.
PCA- Principal Component Analysis
The goal of PCA is to reduce the dimensionality of the data while retaining as much information as possible in the original dataset.
Principal Component Analysis (PCA)
Dimensionality reduction
• PCA allows us to compute a linear transformation that maps data from a high-dimensional space to a lower-dimensional sub-space, using a K × N matrix T:
y = T·x, i.e.
y1 = t11·x1 + t12·x2 + … + t1N·xN
y2 = t21·x1 + t22·x2 + … + t2N·xN
⋮
yK = tK1·x1 + tK2·x2 + … + tKN·xN
− Dimensionality reduction implies information loss!
− PCA preserves as much information as possible; that is, it minimizes the reconstruction error.
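A small sketch of PCA as described here, via eigendecomposition of the data covariance. The toy 2-D points below stand in for face vectors and are strongly correlated, so one component captures almost all the variance:

```python
import numpy as np

def pca(X, k):
    """Return the mean, the top-k principal components (as columns),
    and all eigenvalues (descending) for the rows of X."""
    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False)      # covariance of the centered data
    vals, vecs = np.linalg.eigh(C)        # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]        # reorder to descending
    return mu, vecs[:, order[:k]], vals[order]

# Toy, strongly correlated 2-D data: points lie almost on the line y = x.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]])
mu, components, eigvals = pca(X, 1)

# Reconstruct each point from its single PCA coordinate.
proj = (X - mu) @ components[:, 0]
X_hat = mu + np.outer(proj, components[:, 0])
```

The reconstruction error ||X − X_hat|| stays small because the discarded direction carries almost no variance.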
Principal Component Analysis (PCA)
||x − x̂||, where x is the original vector and x̂ is the vector after dimensionality reduction (represented back in the original space).
• PCA assumes that the data follows a Gaussian distribution (mean μ, covariance matrix Σ).
Principal Component Analysis (PCA)
PCA-example
Consider the variance of the projections onto a unit direction v among all of the orange points:
var(v) = vᵀ A Aᵀ v, where the columns of A are the (mean-centered) data points.
What unit vector v minimizes var? What unit vector v maximizes var?
Solution:
• v1 (the maximizer) is the eigenvector of AAᵀ with the largest eigenvalue.
• v2 (the minimizer) is the eigenvector of AAᵀ with the smallest eigenvalue.
From 2 dimensions to 1 dimension: the data we lose is the variance along v2.
The best low-dimensional space is determined by the “best” eigenvectors of the covariance matrix of x (i.e., the eigenvectors corresponding to the “largest” eigenvalues – also called “principal components”).
Why is it true? Intuition
For the red points (taken with zero mean), the covariance matrix is

C = Σᵢ (xᵢ, yᵢ)ᵀ (xᵢ, yᵢ) = [ Σxᵢ²    Σxᵢyᵢ
                              Σxᵢyᵢ   Σyᵢ²  ]

Σxᵢyᵢ ≈ 0, because x and y are independent of each other.
Result:

C = [ Σxᵢ²   0
      0      Σyᵢ² ]
Intuition (cont'd)
• What to do when the data is not axis-aligned, i.e. Σxᵢyᵢ ≠ 0?
• Rotate the axes until we get the previous (diagonal) example:
• find the eigenvectors and rotate them onto the original axes.
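The rotation intuition can be verified numerically. Assumed toy data below, correlated so that Σxᵢyᵢ is clearly non-zero; rotating into the eigenvector axes makes the covariance diagonal:

```python
import numpy as np

# Correlated 2-D points; after centering, sum(x*y) is clearly non-zero.
pts = np.array([[-2.0, -1.9], [-1.0, -1.2], [0.0, 0.1], [1.0, 0.8], [2.0, 2.2]])
pts = pts - pts.mean(axis=0)

C = pts.T @ pts                  # covariance matrix (up to a 1/n factor)
vals, vecs = np.linalg.eigh(C)   # eigenvectors = rotation to the "easy" axes

rotated = pts @ vecs             # express the points in the eigenvector axes
C_rot = rotated.T @ rotated      # covariance after rotation: diagonal
```

`C_rot` equals diag(eigenvalues) up to floating-point noise: the cross term vanishes once the axes are aligned with the data.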
Use PCA to Find Eigenfaces
Image representation: an N × N image is represented by a vector of size N².
The training set is a collection of such vectors: x1, x2, x3, …, xM.
(Example in the figure: a small image unrolled, pixel by pixel, into a vector.)
Use PCA to Find Eigenfaces
(Very important: the face images must be centered and of the same size.)
Example: training images, their mean μ, and the top eigenvectors u1, …, uk.
The result: the eigenfaces.
Problem 1: Choosing the Dimension K
• How many eigenfaces should we use?
• Look at the decay of the eigenvalues:
– the i-th eigenvalue tells you the amount of variance “in the direction” of the i-th eigenface
– ignore eigenfaces with low variance
(Plot: eigenvalues λi for i = 1 … NM.)
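One common way to turn "look at the decay" into a rule (a sketch, assuming a fixed energy fraction such as 95%) is to keep the smallest K whose eigenvalues capture that fraction of the total variance:

```python
def choose_k(eigenvalues, energy=0.95):
    """Smallest K such that the top-K eigenvalues capture `energy` of the
    total variance. `eigenvalues` must be sorted in descending order."""
    total = sum(eigenvalues)
    acc = 0.0
    for k, lam in enumerate(eigenvalues, start=1):
        acc += lam
        if acc / total >= energy:
            return k
    return len(eigenvalues)

# Illustrative spectrum with fast decay (total variance = 10).
k = choose_k([5.0, 3.0, 1.0, 0.5, 0.5])
```

With the spectrum above, the first four eigenvalues already hold 95% of the variance, so `k` is 4; a looser 80% target would keep only 2.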
Choosing the Dimension K- Example
Problem 2: Size of the Covariance Matrix C
• Suppose each data point is N²-dimensional (N × N pixels):
– the size of the covariance matrix C is N² × N²
– the number of meaningful (nonzero-eigenvalue) eigenfaces is at most M, the number of training images
– Example: for N × N = 1024 × 1024 pixels, the size of C will be 1048576 × 1048576, and the number of eigenvectors will be 1048576!
• Typically, only 20-30 eigenvectors suffice. So, this method is very inefficient!
Efficient Computation of Eigenvectors
For every eigenvector u of AAᵀ with eigenvalue λ:
(AAᵀ)u = λu
⇒ Aᵀ(AAᵀ)u = Aᵀ(λu)
⇒ (AᵀA)(Aᵀu) = λ(Aᵀu)
⇒ v = Aᵀu is an eigenvector of AᵀA (with the same eigenvalue λ).
Find u from v:
v = Aᵀu ⇒ Av = AAᵀu = λu ⇒ u = (1/λ)·A·v
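The trick can be checked numerically on a small random matrix (a sketch; here the lift uses A·v/√λ rather than (1/λ)·A·v so that u comes out unit-norm, since ||A·v||² = vᵀAᵀAv = λ):

```python
import numpy as np

rng = np.random.default_rng(0)
# A: columns are M=4 centered training vectors of dimension N^2=10.
A = rng.standard_normal((10, 4))

small = A.T @ A                       # M x M matrix: cheap to diagonalize
vals, V = np.linalg.eigh(small)
lam, v = vals[-1], V[:, -1]           # largest eigenpair of A^T A

u = (A @ v) / np.sqrt(lam)            # lift back to an eigenvector of A A^T
```

`u` satisfies (AAᵀ)u = λu even though we never diagonalized the big 10×10 matrix; for real eigenfaces the saving is N²×N² vs. M×M.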
Eigenfaces – summary in words
• Eigenfaces are the eigenvectors of the covariance matrix of the probability distribution of the vector space of human faces
• Eigenfaces are the ‘standardized face ingredients’ derived from the statistical analysis of many pictures of human faces
• A human face may be considered to be a combination of these standardized faces
Eigenfaces-key ideas
• Find basis vectors that describe the face space without losing a lot of data.
• Use them to detect faces
Projecting onto the Eigenfaces
• The eigenfaces v1, …, vK span the space of faces.
– A face x is converted to eigenface coordinates by projecting onto each eigenface:
wi = viᵀ(x − μ), for i = 1, …, K
Projecting onto the Eigenfaces
Detection with Eigenfaces
x̂ = μ + w1·v1 + … + wK·vK
A window is accepted as a face iff the reconstruction error is small:
||x − x̂|| < threshold
Problem: determining a good threshold.
• Option 1: use the training set to check how much data is lost.
• Option 2: use validation data.
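The project-reconstruct-threshold loop can be sketched in a few lines. The 4-dimensional "face space" below is a toy stand-in (orthonormal columns, zero mean), not real eigenfaces:

```python
import numpy as np

def face_distance(x, mu, eigenfaces):
    """Reconstruction error ||x - x_hat|| after projecting onto the
    eigenfaces. `eigenfaces` has orthonormal basis vectors as columns."""
    w = eigenfaces.T @ (x - mu)        # eigenface coordinates w_i
    x_hat = mu + eigenfaces @ w        # reconstruction from those coordinates
    return float(np.linalg.norm(x - x_hat))

def is_face(x, mu, eigenfaces, threshold):
    return face_distance(x, mu, eigenfaces) < threshold

# Toy 4-D "face space": the span of the first two axes, with zero mean.
mu = np.zeros(4)
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])

in_span = np.array([3.0, -2.0, 0.0, 0.0])   # lies in the face space
off_span = np.array([0.0, 0.0, 5.0, 0.0])   # orthogonal to the face space
```

A vector inside the span reconstructs perfectly (distance 0), while an orthogonal vector keeps its full norm as error, so a threshold between the two separates them.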
Advantages of the Approach:
• Fast
• Simple
• Has learning ability
• Robust to small change in the face
Limitations of Eigenfaces Approach
• Variations in lighting conditions
– Different lighting conditions for enrollment and query.
– Bright light causing image saturation.
• Light changes degrade performance (not Drastically)− Light normalization helps.
• Performance decreases quickly with changes to face size.
– Mitigations: multi-scale eigenspaces; scale the input image to multiple sizes.
• Performance decreases with changes to face orientation (but not as fast as with scale changes)
− Plane rotations are easier to handle.− Out-of-plane rotations are more difficult to handle.
Limitations of Eigenfaces Approach
Limitations of Eigenfaces Approach• Not robust to misalignment
Visualization:
https://www.youtube.com/watch?v=YWRiF7FAuKE
http://demonstrations.wolfram.com/FaceRecognitionUsingTheEigenfaceAlgorithm/
Reconstruction using the eigenfaces
Bibliography
• M.A. Turk and A.P. Pentland: “Eigenfaces for Recognition.” Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
• http://www.cs.cmu.edu/afs/cs/academic/class/15385-s12/www/lec_slides/lec-19.ppt
• http://www.cse.unr.edu/~bebis/CS485/Lectures/Eigenfaces.pptx
Viola Jones Algorithm for Face Detection
Paul Viola & Michael J. Jones
First publication: 2001. Revised and improved: 2004.
Viola – Jones Algorithm
• Object-detection algorithm– Feature-based– Today: example for implementation of face detector– Face detection was the motivation for the algorithm
• First real-time face detector– Speed is very important!
• Implemented in OpenCV library– Improved version of what we will see today
Features – what are they?
• How many features are needed to indicate the existence of a face?
• All faces share some common features:
– The eye region is darker than the upper cheeks.
– The nose-bridge region is brighter than the eyes.
– That is useful domain knowledge.
• How can we encode such domain features?
Rectangle Features (or Haar-like Features)
• We will look for features inside a 24×24-pixel window.
• Each feature consists of black & white rectangles; the feature value is defined by:
∑(pixels in white area) − ∑(pixels in black area)
• Another way to see it: the correlation of the image with a mask that has +1 in the pixels of white areas and −1 in the pixels of black areas, e.g.:
0  0  0  0  0
0  0 -1  1  0
0  0  1 -1  0
0  0  0  0  0
Rectangle Features (Haar Features)
• The basic 4 features are shown on the right.
– All other features are obtained by changing the orientation and/or the scale of those 4.
Rectangle Features (Haar Features)
• For a 24x24 detection region, the number of possible rectangle features is ~160,000
Rectangle Features (Haar Features)
• Some features correspond to common facial features. Examples:
Challenges
1) Feature Computation – as fast as possible
2) Feature Selection – too many features, need to select the most informative ones
3) Real-timeliness – focus mainly on potentially positive image areas (potentially faces)
Rectangles Feature Evaluation
• Feature evaluation is one of the basic operations in the Viola & Jones algorithm, which uses it a lot.
• Naively summing over each rectangle is not practical.
• We must find a way to evaluate features fast.
Integral Image
• Definition:
– The integral image at location (x,y) is the sum of the pixels above and to the left of (x,y), inclusive.
• The integral image can be computed in a single pass.
Formal definition:
ii(x, y) = Σ_{x'≤x, y'≤y} i(x', y')
Recursive definition:
s(x, y) = s(x, y−1) + i(x, y)
ii(x, y) = ii(x−1, y) + s(x, y)
where
s(x, y) = the sum of pixels in row x, columns 1…y
i(x, y) is the image and ii(x, y) is its integral image
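The recursive definition translates directly into a single pass over the image (a sketch with 0-based indices; `s` is the running row sum s(x, y)):

```python
def integral_image(img):
    """Build ii with the recursive definition in one pass.

    Convention: ii[x][y] is the sum of img over rows 0..x, columns 0..y."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for x in range(h):
        s = 0                                   # cumulative row sum s(x, y)
        for y in range(w):
            s += img[x][y]                      # s(x, y) = s(x, y-1) + i(x, y)
            ii[x][y] = (ii[x - 1][y] if x > 0 else 0) + s
    return ii

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
```

With this table, the sum of the bottom-right 2×2 block is `ii[2][2] - ii[0][2] - ii[2][0] + ii[0][0]`, i.e. four look-ups instead of summing four pixels (the saving grows with rectangle size).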
Computing the Integral Image
(Figure: the previously computed values of ii and s are combined with i(x, y) by the recursive definition to produce ii(x, y).)
Integral Image – Motivation
• Using the values of the integral image we can compute any rectangular sum (e.g., the white part of a feature) in constant time.
– Example: with ii(a) = A, ii(b) = A+B, ii(c) = A+C, ii(d) = A+B+C+D, the sum of rectangle D can be computed with:
D = ii(d) − ii(b) − ii(c) + ii(a)
• Result: rapid feature evaluation!
– Two-, three- and four-rectangle features can be computed with 6, 8 and 9 array accesses respectively.
Feature Evaluation Using Integral Image
• ∑(pixels in white area) − ∑(pixels in black area)
Example: a vertical two-rectangle feature whose corners fall on the integral-image values
A  B
C  D
E  F
Black square = D − B − C + A
White square = F − D − E + C
White − Black = −A + B + 2C − 2D − E + F
(so the feature needs only the coefficients −1, +1, +2, −2, −1, +1 on the six look-ups)
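The corner-lookup evaluation can be sketched as follows. The `pad_integral` helper is an assumption of this sketch (a zero-padded integral image avoids boundary special-cases), not Viola & Jones' code:

```python
def pad_integral(img):
    """Integral image with one zero row/column of padding."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for x in range(h):
        for y in range(w):
            ii[x + 1][y + 1] = (img[x][y] + ii[x][y + 1]
                                + ii[x + 1][y] - ii[x][y])
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of img over rows top..bottom, cols left..right (inclusive,
    0-based), using 4 corner look-ups on the padded integral image."""
    return (ii[bottom + 1][right + 1] - ii[top][right + 1]
            - ii[bottom + 1][left] + ii[top][left])

# Two-rectangle feature: white right half minus black left half.
img = [[1, 2, 7, 8],
       [3, 4, 9, 6]]
ii = pad_integral(img)
white = rect_sum(ii, 0, 2, 1, 3)    # right half
black = rect_sum(ii, 0, 0, 1, 1)    # left half
feature = white - black
```

The two rectangles share an edge, so of the eight corner look-ups two coincide, matching the six accesses quoted above for a two-rectangle feature.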
Our achievements – so far
1) Feature Computation – as fast as possible
2) Feature Selection – select the most informative features
3) Real-timeliness: focus mainly on potentially positive image areas (potentially faces)
Feature Selection
• The problem: too many features.
– In a 24×24 sub-window there are ~160,000 possible features.
– It is impractical to evaluate all of the features in every candidate sub-window.
• The solution: select the most informative features.
– How? AdaBoost.
AdaBoost Algorithm
• Introduced by Yoav Freund & Robert E. Schapire in 1995– Received Gödel Prize in 2003 for their work
• It is a machine-learning algorithm• Stands for Adaptive Boost
• AdaBoost is used to improve learning algorithms– Combines several “weak“ learners into a
“strong“ one
AdaBoost Algorithm
• Main idea: create a strong classifier by a linear combination of weighted simple weak classifiers:
H(x) = Σt αt·ht(x)
(H – strong classifier; ht – weak classifier; αt – weight; x – image)
AdaBoost – Intro
• What are our “weak” classifiers?
– Each single rectangle feature is regarded as a “weak” classifier.
• It's an iterative algorithm.
– Iteratively choose the best “weak” classifiers.
– Tweak each “weak” classifier in favor of instances that were misclassified by previous “weak” classifiers (hence Adaptive).
• What about the weights? Learning!
The weak classifiers
• A weak classifier hj(x) consists of a feature fj, a threshold θj, and a parity pj indicating the direction of the inequality sign:
hj(x) = 1 if pj·fj(x) < pj·θj, and 0 otherwise
(x is a 24×24 sub-window of an image)
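The decision rule is one line of code; the parity flips the direction of the inequality so the same form covers both "dark feature" and "bright feature" tests:

```python
def weak_classify(feature_value, theta, parity):
    """h(x) = 1 iff parity * f(x) < parity * theta, else 0.

    parity = +1 fires when the feature value is BELOW the threshold;
    parity = -1 fires when it is ABOVE the threshold."""
    return 1 if parity * feature_value < parity * theta else 0
```

For example, with threshold 10: a feature value of 5 fires with parity +1, while 15 fires only with parity −1.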
AdaBoost – How it works
• Given a training set:
– Xi is a 24×24 image, Yi is its label – face/non-face.
– Each Xi has a weight.
• Initially, all weights are equal.
– These weights will be used to force the chosen “weak” classifiers to focus on the examples in the training set that previous classifiers handled poorly.
AdaBoost – How it works
Given: example images labeled +/−. Initially, all weights are set equally.
Repeat T times (T – the number of “weak” classifiers we want):
Step 1: choose the most efficient weak classifier; it will be a component of the final strong classifier.
Step 2: update the weights of the dataset images to emphasize the examples from the training set which were incorrectly classified. This makes the next weak classifier focus on “harder” examples.
AdaBoost – Feature Selection
• Step 1 is slow – there is a large set of possible weak classifiers to check (each “weak” classifier is in fact a single feature)
• Which feature to choose?
– Choose the most informative one:
• test each “weak” classifier on the weighted training set
• choose the “weak” classifier with the best detection rate
AdaBoost – Feature Selection
• Finally we get a “strong” classifier that is a weighted combination of the best T “weak” classifiers.
– The weight of each classifier depends on its detection rate on the training set.
– Higher weight for a better classifier.
h(x) = 1 if Σ_{t=1}^{T} αt·ht(x) ≥ ½·Σ_{t=1}^{T} αt, and 0 otherwise
(Each ht is a “weak” classifier, αt is its weight, h is the “strong” classifier, x is a 24×24 image.)
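The whole loop can be sketched end to end. This toy version uses 1-D threshold "stumps" as stand-ins for Haar-feature weak classifiers (the data and thresholds are illustrative, not from Viola & Jones):

```python
import math

def train_adaboost(xs, ys, T):
    """Minimal AdaBoost over threshold stumps h(x) = [parity*x < parity*theta].

    Returns a list of (alpha, theta, parity) weak classifiers."""
    n = len(xs)
    w = [1.0 / n] * n
    thresholds = [t - 0.5 for t in sorted(set(xs))] + [max(xs) + 0.5]
    model = []
    for _ in range(T):
        best = None                           # (error, theta, parity, preds)
        for theta in thresholds:
            for parity in (1, -1):
                preds = [1 if parity * x < parity * theta else 0 for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, theta, parity, preds)
        err, theta, parity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # guard against err = 0 or 1
        alpha = 0.5 * math.log((1 - err) / err)
        # Re-weight: emphasize the examples this stump got wrong.
        w = [wi * math.exp(alpha if p != y else -alpha)
             for wi, p, y in zip(w, preds, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
        model.append((alpha, theta, parity))
    return model

def strong_classify(model, x):
    """h(x) = 1 iff the weighted vote reaches half the total weight."""
    score = sum(a for a, theta, par in model if par * x < par * theta)
    return 1 if score >= 0.5 * sum(a for a, _, _ in model) else 0

# Toy labels that no single stump can separate.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1, 1, 0, 1, 0, 0]
model = train_adaboost(xs, ys, T=3)
```

On this data the best single stump still misclassifies one example, but the weighted combination of three stumps labels all six points correctly.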
Boosting illustration
Weak Classifier 1
Slide source
Boosting illustration
WeightsIncreased
Boosting illustration
Weak Classifier 2
Boosting illustration
WeightsIncreased
Boosting illustration
Weak Classifier 3
Boosting illustration
Final classifier is a combination of weak
classifiers
AdaBoost - Conclusion
• AdaBoost selects a small set of informative features and uses them to build a strong classifier.
Feature Selection
• Top two features weighted by AdaBoost:
(specific to the training dataset that Viola & Jones used in their experiment)
Our achievements – so far
1) Feature Computation – as fast as possible
2) Feature Selection – select the most informative features
3) Real-timeliness: focus mainly on potentially positive image areas (potentially faces)
Real-timeliness
• On average, only 0.01% of all sub-windows in an image are positives (faces).
• Yet we spend equal time on negative & positive windows.
• Can we spend less time on non-faces?
Real-timeliness
• The Attentional Cascade is the answer!
– The idea: cascade classifiers with gradually increasing complexity.
• An instance reaches layer 10 only if it passed layers 1-9.
• The 1st layer uses, say, 2-3 features to filter out easy-to-reject negative windows (non-faces).
• The 2nd layer uses, say, 10 features to filter out more challenging negatives.
• And so on…
– Each layer is a “strong” classifier obtained using AdaBoost.
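The early-rejection control flow is the whole trick, and it fits in a few lines. The stages below are hypothetical cheap tests standing in for AdaBoost-trained strong classifiers, just to show the flow:

```python
def cascade_classify(window, stages):
    """Run the window through the cascade; reject at the first stage that
    says 'not a face'. Returns (is_face, number_of_stages_evaluated)."""
    for i, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, i        # early rejection: later stages are skipped
    return True, len(stages)

# Hypothetical stages of increasing cost (stand-ins for strong classifiers).
stages = [
    lambda w: w["mean"] > 10,        # cheap stage 1: a couple of features
    lambda w: w["contrast"] > 5,     # stage 2: more features
    lambda w: w["symmetry"] > 0.9,   # expensive final stage
]

face_like = {"mean": 50, "contrast": 20, "symmetry": 0.95}
flat_bg = {"mean": 2, "contrast": 0, "symmetry": 0.0}
```

A face-like window pays for all stages, but the overwhelmingly common background windows are dropped after the first cheap test, which is where the real-time speedup comes from.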
Cascading classifiers
Training a cascade
• First, we should decide:
– How many layers (strong classifiers)?
– How many features in each layer?
– The threshold of each strong classifier?
• What is the optimal combination?
– This is a complex problem.
Training a cascade
• Finding the optimum is not practical – any workaround?
• Viola & Jones' goal: no worse than a 95% TP rate and a 10⁻⁶ FP rate.
– Viola & Jones suggested an algorithm that:
• does not guarantee optimality, but
• is able to generate a cascade that meets their goal.
Training a cascade - outline
• The user selects:
– fi (maximum acceptable false-positive rate per layer)
– di (minimum acceptable true-positive rate per layer)
– Ftarget (target overall FP rate)
• Trial & error process until the target rates are met:
• Until Ftarget is met:
– Add a new layer to the cascade; until the fi, di rates are met for this layer:
• increase the feature number & train a new strong classifier with AdaBoost
• determine the rates of the updated layer on the training set
Our achievements – so far
1) Feature Computation – as fast as possible
2) Feature Selection – select the most informative features
3) Real-timeliness: focus mainly on potentially positive image areas (potentially faces)
Classifier cascade framework
Training phase: Training set (sub-windows) → Integral representation → Feature computation → AdaBoost feature selection → Cascade trainer
Testing phase: Strong Classifier 1 (cascade stage 1) → Strong Classifier 2 (cascade stage 2) → … → Strong Classifier N (cascade stage N) → FACE IDENTIFIED
Viola & Jones Algorithm – Visualization
• Viola & Jones algorithm visualized
• Viola & Jones algorithm – OpenCV implementation visualization
Viola & Jones Algorithm
Detection rates for various numbers of false detections:
False detections:       10     31     50     65     78     95     167    422
Viola-Jones:            76.1%  88.4%  91.4%  92.0%  92.1%  92.9%  93.9%  94.1%
Rowley-Baluja-Kanade:   83.2%  86.0%  –      –      89.2%  89.2%  90.1%  89.9%
Schneiderman-Kanade:    –      –      –      94.4%  –      –      –      –
Viola & Jones prepared their final detector cascade: 38 layers, 6060 total features, including:
• 1st classifier layer: 2 features, 50% FP rate, 99.9% TP rate
• 2nd classifier layer: 10 features, 20% FP rate, 99.9% TP rate
• next 2 layers: 25 features each; next 3 layers: 50 features each; and so on…
Tested on the MIT+CMU test set: a 384×288-pixel image on a PC (dated 2001) took about 0.067 seconds.
Detection rates for various numbers of false positives on the MIT+CMU test set containing 130 images and 507 faces (Viola & Jones 2002).
Bibliography
• Rapid Object Detection using a Boosted Cascade of Simple Features by Viola & Jones (2001)
• Robust Real-Time Face Detection by Viola & Jones (2004)
• A Short Introduction to Boosting by Freund & Schapire (1999)
• http://webdocs.cs.ualberta.ca/~nray1/CMPUT466_551/ViolaJones.ppt