Face Recognition Using SURF Interest Points

"Politehnica" University of Timişoara Faculty of Automation and Computers Departament of Computer and Software Engineering 2, Vasile Pârvan Bv., 300223 – Timişoara, Romania Tel: +40 256 403261, Fax: +40 256 403214 Web: http://www.cs.upt.ro F ACE R ECOGNITION U SING SURF I NTEREST P OINTS Dissertation Thesis Supervisor: Prof. Dr. Eng. Marius CRIȘAN Áron VIRGINÁS-TAR Timişoara, 2013



Abstract

Interest point detection and feature extraction is the initial step in many computer vision and object recognition applications. In this thesis we propose a method for using interest points in a face recognition system.

We detect interest points in facial images with the SURF algorithm and then match them against each other using approximate nearest neighbor search. SURF is a fast and robust interest point detector and descriptor, first published in 2006, that is rapidly gaining popularity as a more computationally efficient alternative to SIFT.

We propose a spatially constrained interest point matching scheme in order to improve execution time and reduce the number of false matches. The similarity of two facial images is given by a combination of the number and average quality of the interest points matched between them. The database image that yields the highest value for this similarity measure will be selected as a match for the query image.

We implement the proposed face recognition scheme in C++ and test it on the AT&T Face Database. The system relies on a single template image per subject and is shown to be effective for frontal face identification.

Page 4: Face Recognition Using SURF Interest Points
Page 5: Face Recognition Using SURF Interest Points

Contents

1 Introduction 5
  1.1 Motivation 5
  1.2 Objectives 6
  1.3 Thesis Outline 6

2 Face Recognition 8
  2.1 Overview 8
  2.2 Challenges in Face Recognition 9
  2.3 Face Recognition Techniques 11
    2.3.1 Eigenfaces 11
    2.3.2 Fisherfaces 13
    2.3.3 Linear Subspaces 14
    2.3.4 Geometrical Feature Matching 15
    2.3.5 Hidden Markov Models (HMMs) 16
    2.3.6 Support Vector Machines (SVMs) 17
    2.3.7 Neural Networks (NNs) 17
    2.3.8 Face Recognition Using Interest Points 18

3 Theoretical Foundation 19
  3.1 Interest Point Detection 19
    3.1.1 Notable Detectors 20
  3.2 Speeded-Up Robust Features (SURF) 24
    3.2.1 Interest Point Detection 24
    3.2.2 Interest Point Description 28
    3.2.3 Characteristics 31
  3.3 Interest Point Matching 31
    3.3.1 Brute-Force Solution 32
    3.3.2 Approximate Nearest Neighbor Search (ANNS) 32
  3.4 Histogram Equalization 33

4 Proposed Solution 36
  4.1 Outline 36
    4.1.1 Preconditions 36
    4.1.2 Interest Point Detection and Matching 36
    4.1.3 Similarity Measure 38
  4.2 Implementation Details 39
    4.2.1 Functionality and Usage 39
    4.2.2 Technologies Used 40

5 Experimentation 42
  5.1 Experimental Methodology 42
  5.2 Results 43
    5.2.1 Parameters 43
    5.2.2 Recognition Rate 43
    5.2.3 Statistics 44

6 Contributions 46

7 Conclusions and Future Work 47
  7.1 Limitations and Possible Improvements 47
  7.2 Other Future Work 48

A Examples 49


List of Figures

2.1 Types of variation in facial images 10

3.1 Neighborhood of a point in scale-space 22
3.2 Computation of the SIFT interest point descriptor 24
3.3 Using integral images to calculate the sum of intensities inside a rectangular image region 25
3.4 9×9 box filters for the Gaussian second order partial derivatives and their approximations 26
3.5 Comparison of the SIFT and SURF scale-space representations 27
3.6 Filters D_yy and D_xy for two successive scale levels 27
3.7 Filter sizes for the first three octaves 28
3.8 Orientation assignment in SURF 29
3.9 Haar wavelet filters used to describe the interest points in the x and y directions 29
3.10 Examples of sub-region feature vectors 30
3.11 Construction of the SURF descriptor 30
3.12 Four images and their respective histograms 34
3.13 Effect of histogram equalization on an image 35

4.1 Interest points detected in a grayscale face image using SURF 37
4.2 Matching interest points between face images 37

5.1 Example of a manually created occluded test image 43


List of Tables

5.1 Parameter values that yielded the best results for the AT&T Face Database 43
5.2 Recognition rates for different query sets 44
5.3 Statistics about the number of interest points and good matches 44
5.4 Statistics about the similarity measure 45

A.1 Example of successful recognition 49
A.2 Example of no match being found 51


Chapter 1

Introduction

1.1 Motivation

As our society becomes more and more dependent on automatic information processing, the need for reliable computerized personal identification methods has resulted in an increased interest in biometrics. The most important biometric methods currently investigated include fingerprints, iris, speech, signature dynamics, and face recognition. Iris recognition is extremely accurate but expensive to implement and intrusive. Fingerprint-based identification is reliable and less intrusive, but it requires the full cooperation of the individuals being identified, and therefore it becomes impractical in many real-life scenarios. Some systems integrate multiple biometric methods for increased security, e.g. fingerprints and voice recognition. These hybrid systems, however, usually trade off flexibility for accuracy.

Face recognition seems to be a good compromise between reliability, flexibility and social acceptance. While it provides a lower security level in unconstrained acquisition conditions, it has the advantage of being able to function in an unobtrusive way. In consequence, face recognition has been the subject of intensive research in the last few decades.

Although research conducted in this area goes back as early as the 1960s, the problem of automatic face recognition is still not adequately solved. While recent years have seen significant progress in face recognition, with many systems capable of achieving recognition rates greater than 90% under controlled conditions, real-world scenarios still remain a challenge due to the wide range of variations the face acquisition process is subject to.

Face recognition is essentially a pattern matching problem that can be addressed by identifying characteristic features in the facial images and then matching them against each other. In recent years several robust interest point detectors have been developed and used with considerable success in object recognition. They capture local image features around characteristic points and are invariant to scale, rotation and, to some extent, illumination. Hence, interest point detectors seem to be good candidates for being employed in a face recognition system.

1.2 Objectives

The main objective of this thesis was to design, implement and validate a face recognition scheme based on interest point matching. We have identified the following partial objectives:

• study current face recognition techniques and identify their advantages and drawbacks;

• study the state of the art in interest point detection and select an adequate detector to be used in a face recognition system;

• devise an interest point matching scheme that suits the purpose of face recognition;

• implement a proof-of-concept face recognition system based on the proposed technique;

• validate the face recognition technique on a face database;

• analyze the advantages and drawbacks of our system and propose solutions for possible future improvements.

1.3 Thesis Outline

Chapter 2 offers a general overview of the face recognition problem, including the problem statement and specific challenges. In section 2.3 we offer a brief survey of different face recognition techniques, with special regard to previous efforts in interest point-based face identification.

In chapter 3 we present the theoretical background of the face recognition scheme introduced in this thesis. In section 3.1 we introduce the notion of interest point detection and briefly present some well-known interest point detectors. The next section describes in more detail the SURF interest point detector and descriptor, which lies at the foundation of our face recognition system. In section 3.3 we study the problem of efficiently matching the interest points extracted from two different images.


The face recognition method we propose in this thesis is described in chapter 4. First we offer a theoretical overview of the system in section 4.1, in which we expose the operating principles of the system. In section 4.2 we present the details of our implementation, including the functionality offered by the application and the third-party technologies used.

The experimental methodology and the results we have obtained are presented in chapter 5. The experimental results are collected into tables and analyzed in order to identify the advantages and drawbacks of our system.

In chapter 6 we highlight the contributions of this thesis, while in chapter 7 we draw some conclusions. In section 7.1 we discuss the most significant limitations of our face recognition scheme and offer some guidelines for possible future improvements.

Appendix A contains some detailed execution examples for facial images from the AT&T Face Database.


Chapter 2

Face Recognition

2.1 Overview

The problem of face recognition can be viewed as a standard pattern classification or machine learning problem, and can be informally expressed as follows:

Problem 1. Given static images or a video capture of a scene, identify or verify one or more persons in the scene by comparing them with facial images stored in a database.

Face recognition systems generally fall into one of the following two categories:

Face verification is a 1:1 match that compares a face image against a template face image whose identity is known;

Face identification is a 1:N problem, in which we compare a query face image against multiple template images from a face database to classify the query face.

The method described in this thesis belongs to the latter category. Face verification is essentially a special case of face identification, in which the database contains a single template image. Hence, every face identification method will be applicable for verification as well.

The face recognition process can be divided into two major steps:

1. Face detection and normalization is the step responsible for identifying and extracting human faces from scene images and bringing them to a standard form that is suitable to be analyzed by a face identification algorithm;


2. Face identification refers to the step that receives the normalized facial image as input and attempts to identify the subject by comparing the image to the template faces from the database.

The algorithms that implement both parts are called fully automatic algorithms, while those that consist only of the second part are referred to as partially automatic. In this thesis we only consider the step of face identification; face detection and normalization are out of scope for our research.

Depending on the kind of input images, one can distinguish the following three categories of face recognition algorithms:

• frontal;

• profile;

• view-tolerant.

The classical approach is frontal recognition, while view-tolerant algorithms are usually more robust and applicable in less ideal scenarios. Stand-alone profile-based recognition techniques have a rather marginal significance; however, they are practical as part of hybrid recognition schemes [3]. The face identification method presented in this thesis follows the classical approach of frontal recognition.

2.2 Challenges in Face Recognition

Current face recognition systems usually work well if the test and training images are captured under similar conditions. Nevertheless, real-world scenarios remain a challenge, because the face acquisition process is subject to different kinds of variation, which can significantly affect the efficiency of the face recognition process:

• variations in illumination due to varying light conditions in the moment of image acquisition;

• orientation (or pose) changes, which introduce projective deformations and even self-occlusion;

• different facial expressions, which can significantly alter the geometry of the face;

• significant time delay between the acquisition of the training and test images, which leads to age variation. As the human face changes with age in a non-linear manner, the problem of age variation is often seen as one of the most challenging issues in face recognition;


• occlusions can also affect the performance of a face recognition system, especially if they are located in the upper side of the face. Sources of occlusion include glasses, sunglasses, scarves, hats, or any other objects that cover a portion of the face.

Figure 2.1 illustrates different kinds of variation that can affect the face recognition process. The facial images used in this figure have been extracted from references [5], [36] and [37].

Figure 2.1: Types of variation in facial images (illumination variation, different poses, different facial expressions, occlusion)

In general, the approaches proposed for dealing with variation belong to the following three major classes of solutions [4]:


• invariant features aim to extract and utilize features that are invariant to the variations presented above;

• canonical forms attempt to normalize away the variation by image processing techniques or by synthesizing a canonical form from the test image;

• variation-modeling refers to the techniques that attempt to identify and model the extent of variation.

Though each of the above methods works well in a given scenario, performance degrades rapidly when other variations are also present.

The wide range of variation that can affect the performance of face recognition systems has motivated many researchers to compile different face databases. These databases are collections of face images specially designed to cover different types of variation.

2.3 Face Recognition Techniques

In the last two decades numerous different techniques have been proposed for modeling the idiosyncrasies of the human face and designing face identification systems. In this section we will briefly describe the most significant approaches.

Most face recognition methods are based on intensity images and 2D image processing, and can achieve a recognition rate above 90% under controlled conditions. Face recognition systems based on infrared photography are considered out of scope for this brief survey.

2.3.1 Eigenfaces

Automatic face recognition can be seen as a template matching problem, where recognition has to be performed in a high-dimensional space. The higher the dimensionality of the search space, the higher the computational cost of finding a match. Therefore, a solution is to apply a dimensionality reduction technique to project the problem into a lower-dimensional space. Eigenfaces represent one of the first approaches in this sense [3].

The eigenface method [6], also known as the Karhunen-Loève expansion or the Principal Component Analysis (PCA) approach, is one of the most thoroughly investigated linear approaches to face recognition. Originally published in the early 90s by Turk and Pentland, the eigenface approach is the first genuinely successful method for automatic face recognition [4]. It is based on the idea that any face image can be approximately reconstructed starting from a set of generic face images, so-called eigenfaces, and a collection of weights corresponding to each eigenface.

The eigenface method achieves a recognition rate of around 80% when averaged over different types of variation.

Eigenface Generation

Let $I$ be an image represented as a matrix of $N \times N$ intensity values. Such an image can be linearized into an array of $N^2$ elements, so that it represents a point in an $N^2$-dimensional space. Because images of human faces are similar in overall configuration, they will not be distributed randomly in this huge image space, and thus can be described by a relatively small subspace. PCA attempts to find the vectors that best describe the distribution of face images within the entire image space.

Let $\Gamma_1, \Gamma_2, \ldots, \Gamma_M$ denote the linearized training images. The average intensity vector of this set is defined as

$$\Psi = \frac{1}{M} \sum_{i=1}^{M} \Gamma_i. \tag{2.1}$$

Each face vector differs from the average vector by $\Phi_i = \Gamma_i - \Psi$. These difference vectors are grouped together into an $N^2 \times M$ matrix $A = [\Phi_1, \Phi_2, \ldots, \Phi_M]$. The covariance matrix $C$ can then be obtained as the outer product of $A$ with itself, i.e.

$$C = A A^T. \tag{2.2}$$

The covariance matrix has $N^2 \times N^2$ elements, and thus it would be computationally nonviable to determine its $N^2$ eigenvalues for typical image sizes. On the other hand, if the number of images is less than the dimension of the space, i.e. $M < N^2$, the covariance matrix will only have $M$ rather than $N^2$ meaningful eigenvectors. Therefore, we only need to find the eigenvectors of an equivalent $M \times M$ matrix. Turk and Pentland observe that the eigenvectors $v_i$ of the matrix $L$, constructed as

$$L_{mn} = \Phi_m^T \Phi_n, \tag{2.3}$$

determine the eigenvectors of $C$ through linear combination.

Once the eigenvectors $v_i$ are determined, the eigenfaces $u_i$ are computed as linear combinations of the difference vectors, weighted by the components of the corresponding eigenvector:

$$u_i = \sum_{j=1}^{M} v_{ij} \Phi_j. \tag{2.4}$$


If the training set contained $M > N$ faces, the eigenfaces are sorted in descending order according to eigenvalue, and only the first $N$ eigenfaces are retained.
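The construction in equations (2.1)-(2.3) can be sketched in C++ as follows. This is only an illustrative sketch operating on small linearized vectors, not the thesis implementation; function names such as `meanFace` and `reducedMatrix` are our own:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Average intensity vector Psi of M linearized training images (eq. 2.1).
Vec meanFace(const std::vector<Vec>& gamma) {
    Vec psi(gamma[0].size(), 0.0);
    for (const Vec& g : gamma)
        for (std::size_t k = 0; k < g.size(); ++k) psi[k] += g[k];
    for (double& v : psi) v /= static_cast<double>(gamma.size());
    return psi;
}

// Difference vectors Phi_i = Gamma_i - Psi.
std::vector<Vec> differenceVectors(const std::vector<Vec>& gamma, const Vec& psi) {
    std::vector<Vec> phi(gamma.size(), Vec(psi.size()));
    for (std::size_t i = 0; i < gamma.size(); ++i)
        for (std::size_t k = 0; k < psi.size(); ++k)
            phi[i][k] = gamma[i][k] - psi[k];
    return phi;
}

// Reduced M x M matrix L_mn = Phi_m^T Phi_n (eq. 2.3), whose eigenvectors
// yield the eigenfaces via the linear combination of eq. 2.4.
std::vector<Vec> reducedMatrix(const std::vector<Vec>& phi) {
    const std::size_t M = phi.size();
    std::vector<Vec> L(M, Vec(M, 0.0));
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < M; ++n)
            for (std::size_t k = 0; k < phi[m].size(); ++k)
                L[m][n] += phi[m][k] * phi[n][k];
    return L;
}
```

An off-the-shelf eigensolver would then be applied to the small matrix $L$; that step is omitted here.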

Using Eigenfaces to Classify a Face Image

Once a set of eigenfaces has been obtained, classifying new faces is computationally simple. Given an image $\Gamma$, which has been normalized and scaled to the pixel size of the training images, a vector of $M$ weights $\omega_i$ is calculated as follows:

$$\omega_i = u_i^T (\Gamma - \Psi). \tag{2.5}$$

The weights form a vector $\Omega = (\omega_1, \omega_2, \ldots, \omega_M)$, which describes the contribution of each eigenface in representing the query image.

By comparing the Euclidean distance between the weight vector obtained for the query image and the weight vectors of the template images, we can decide which template image is the most similar to the query image.
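The classification step, equation (2.5) followed by a Euclidean nearest-neighbor decision over the weight vectors, can be sketched as follows (again an illustrative sketch with our own names, not the referenced implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>;

// Weight of one eigenface for a normalized image (eq. 2.5):
// omega_i = u_i . (Gamma - Psi).
double eigenfaceWeight(const Vec& u, const Vec& gamma, const Vec& psi) {
    double w = 0.0;
    for (std::size_t k = 0; k < u.size(); ++k)
        w += u[k] * (gamma[k] - psi[k]);
    return w;
}

// Index of the template whose weight vector Omega is closest (in squared
// Euclidean distance) to the query's weight vector.
std::size_t closestTemplate(const std::vector<Vec>& templates, const Vec& query) {
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::infinity();
    for (std::size_t t = 0; t < templates.size(); ++t) {
        double d = 0.0;
        for (std::size_t i = 0; i < query.size(); ++i) {
            double diff = templates[t][i] - query[i];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = t; }
    }
    return best;
}
```

Comparing squared distances avoids the square root while preserving the ordering of candidates.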

2.3.2 Fisherfaces

While very effective under controlled lighting conditions and constrained pose, the eigenface method is highly sensitive to pose and illumination variation. To address this problem, Belhumeur et al. have proposed a method called Fisherfaces [7]. It is similar in nature to the eigenface technique, as it also maps high-dimensional face images to a low-dimensional space using linear projection. Rather than performing PCA, however, the Fisherface algorithm relies on Linear Discriminant Analysis (LDA) using Fisher's Linear Discriminant (FLD). It is based on the assumption that variations in lighting and pose over a single individual alter the face image more than the variations between distinct individuals under similar acquisition conditions.

The Fisherface technique uses class-specific linear methods for dimensionality reduction, in contrast to the eigenface method, which applies the same transformation to each image. Considering each face vector as a point in an $N$-dimensional space, FLD is applied to maximize the ratio of the between-class scatter and the within-class scatter, such that the images are grouped in distinct clusters.

Let $\mu$ be the overall mean image, $\mu_i$ the mean image of class $X_i$ and $N_i$ the number of samples in class $X_i$. The between-class scatter matrix $S_B$ is defined as

$$S_B = \sum_{i=1}^{c} N_i (\mu_i - \mu)(\mu_i - \mu)^T, \tag{2.6}$$


where $c$ represents the total number of classes. The within-class scatter matrix is given by

$$S_W = \sum_{i=1}^{c} \sum_{x_j \in X_i} (x_j - \mu_i)(x_j - \mu_i)^T. \tag{2.7}$$

In order to maximize the ratio of $S_B$ to $S_W$, an optimal projection matrix $W_{opt}$ is first computed. $W_{opt}$ is chosen as the matrix with orthonormal columns that maximizes the ratio of the determinant of the between-class scatter matrix to the determinant of the within-class scatter matrix, i.e.

$$W_{opt} = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} = [w_1, w_2, \ldots, w_M], \tag{2.8}$$

where the columns $w_i$ are the generalized eigenvectors of $S_B$ and $S_W$, sorted in descending order according to eigenvalue.
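As a toy illustration of equations (2.6) and (2.7), the following C++ sketch computes $S_B$ and $S_W$ for a handful of low-dimensional sample vectors. The data layout and function names are ours, chosen for clarity; a real Fisherface implementation works on linearized images and then solves the generalized eigenproblem:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Component-wise mean of a set of sample vectors.
Vec mean(const std::vector<Vec>& xs) {
    Vec m(xs[0].size(), 0.0);
    for (const Vec& x : xs)
        for (std::size_t k = 0; k < x.size(); ++k) m[k] += x[k];
    for (double& v : m) v /= static_cast<double>(xs.size());
    return m;
}

// Rank-one update: S += w * (a - b)(a - b)^T.
void addOuter(Mat& S, const Vec& a, const Vec& b, double w) {
    for (std::size_t r = 0; r < a.size(); ++r)
        for (std::size_t c = 0; c < a.size(); ++c)
            S[r][c] += w * (a[r] - b[r]) * (a[c] - b[c]);
}

// Between-class scatter S_B (eq. 2.6) and within-class scatter S_W (eq. 2.7)
// for a set of classes, each given as a list of sample vectors.
void scatterMatrices(const std::vector<std::vector<Vec>>& classes, Mat& SB, Mat& SW) {
    std::vector<Vec> all;
    for (const auto& cls : classes) all.insert(all.end(), cls.begin(), cls.end());
    Vec mu = mean(all);                 // overall mean image
    const std::size_t d = mu.size();
    SB.assign(d, Vec(d, 0.0));
    SW.assign(d, Vec(d, 0.0));
    for (const auto& cls : classes) {
        Vec mui = mean(cls);            // class mean mu_i
        addOuter(SB, mui, mu, static_cast<double>(cls.size()));
        for (const Vec& x : cls) addOuter(SW, x, mui, 1.0);
    }
}
```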

Experiments conducted by Belhumeur et al. have shown that the Fisherface method is able to achieve a recognition rate above 90% for light source directions up to 45° from the camera axis. In contrast, the eigenface method has an error rate above 50% for the same training set.

2.3.3 Linear Subspaces

Belhumeur et al. [7] attempt to overcome the negative effects of illumination variation by constructing an illumination-invariant 3D model for each face. Faces are described as Lambertian surfaces, i.e. ideal diffusely reflecting surfaces, for which the apparent brightness of a given point is independent of the observer's angle of view.

Let $p$ be a point on a Lambertian surface illuminated by a light source at infinity. Let $s \in \mathbb{R}^3$ be a column vector given by the product of the light source intensity with the unit vector for the light source direction. When the surface is captured by a camera, the resulting image intensity of the point $p$ is given by

$$E(p) = a(p)\, n(p)^T s, \tag{2.9}$$

where $n(p)$ is the unit inward normal vector to the surface at point $p$ and $a(p)$ is the albedo (reflection coefficient) of the surface at the same point. Lambertian surfaces with lower albedo values appear darker than those with higher $a(p)$ values under the same illumination conditions. Because the image intensity of point $p$ is linear in $s$, the albedo and surface normal can be obtained, given three images of the surface taken under three different, linearly independent light source directions.
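Equation (2.9) is straightforward to evaluate; the following minimal C++ helper (our own illustrative code, not from [7]) computes the intensity of a surface point from its albedo, its normal and the light vector:

```cpp
#include <array>
#include <cassert>

// Image intensity of a Lambertian surface point (eq. 2.9):
// E = albedo * dot(normal, s), where s is the light direction
// scaled by the light source intensity.
double lambertianIntensity(double albedo,
                           const std::array<double, 3>& normal,
                           const std::array<double, 3>& s) {
    double dot = 0.0;
    for (int k = 0; k < 3; ++k) dot += normal[k] * s[k];
    return albedo * dot;
}
```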


The images of a Lambertian surface lie in a 3D linear subspace of the high-dimensional image space. The authors use three or more images taken under different lighting directions to construct the linear subspaces. Recognition is then performed by calculating the distance of a new image to each linear subspace and choosing the face that yields the smallest distance value.

2.3.4 Geometrical Feature Matching

Geometrical feature matching techniques attempt to extract individual features, such as eyes, nose, mouth and head outline, and construct a face model based on these features.

Gabor Filtering

Gabor filters are linear filters used for edge detection, and they have proved effective in capturing the geometrical features of facial images.

Barbu [10] proposed a face recognition technique based on processing each facial image with a filter bank containing several 2D anti-symmetrical Gabor filters. Facial images are filtered at various orientations, frequencies and standard deviations. This process produces a feature vector $V(I)$ of size $W \times H \times 2n$ for each image $I$, where $W$ is the width of the image, $H$ is its height and $n$ is the number of orientations. Recognition is done by comparing the feature vectors obtained for different images using the squared Euclidean metric. Since the size of each vector depends on the size of the corresponding face image, the images are first resampled to have the same dimensions.

Barbu has obtained a recognition rate of about 90% for a large database of frontal images.

Wavelet Decomposition

Wavelet decomposition is an increasingly popular tool in image processing and computer vision. Wavelets are an extension of Fourier analysis, and they decompose complex signals into sums of basis functions.

Kakarwal and Deshmukh [11] proposed a face recognition technique which extracts features from facial images using a two-dimensional Discrete Wavelet Transform (DWT). Each image is described by a subset of band-filtered images containing wavelet coefficients. These wavelet coefficients characterize the face texture and form the basis for building compact feature vectors.


Line Edge Map (LEM)

Edge information is insensitive to illumination changes to a certain extent. Edge maps are widely used in various pattern recognition fields and have recently found their way into face recognition as well.

The LEM method, described by Gao and Leung in [12], extracts lines as features from facial images, and it can be considered a combination of template matching and geometrical feature matching. The LEM representation records the end points of the line segments extracted from a facial image.

The authors propose a distance measure named the Line Segment Hausdorff Distance (LHD), designed to measure the similarity of face LEMs. The LHD measure combines the orientation distance, the parallel distance and the perpendicular distance between two line segments. It is more tolerant to perturbations than correlation techniques, since it measures proximity rather than exact superposition.

2.3.5 Hidden Markov Models (HMMs)

HMMs are a set of statistical models used to characterize the properties of a signal. An HMM consists of two interrelated processes:

• an underlying hidden Markov chain with a finite number of states, a state transition probability matrix and an initial state probability distribution;

• a set of probability density functions associated with each state.

HMMs have been extensively and effectively used in speech recognition. The one-dimensional nature of the sound signal makes it well suited to being analyzed by an HMM.

Samaria and Fallside [14] have proposed a method for applying HMMs in face recognition. The input images of width $W$ and height $H$ are first divided into overlapping blocks of width $W$ and height $L$. Each of these blocks corresponds to a facial region (hair, forehead, eyes, nose, mouth) and is assigned to a state in a left-to-right 1D continuous HMM.

Each individual in the database is represented by an HMM. The HMMs are trained with 5 images per individual, each image showing the subject in a different state. Recognition is done by comparing the probabilities obtained for the observation vectors of the query face to the probability matrices of the HMMs.


2.3.6 Support Vector Machines (SVMs)

SVMs are supervised learning models that can solve a classical two-class pattern recognition problem. An SVM is a binary classifier that finds the optimal linear decision surface based on structural risk minimization. The decision surface is a weighted combination of training elements. These elements are called support vectors, and they characterize the boundary between the two classes of elements.

Jonathon Phillips [13] adapts SVMs to face recognition by modifying the interpretation of the output of an SVM classifier and devising a representation that transforms the face recognition problem into a two-class problem. The problem is formulated in difference space, which captures the dissimilarities between two facial images. Difference space is characterized by the following two classes:

• the dissimilarities between images of the same individual;

• the dissimilarities between images of different subjects.

These two classes become the input of the SVM classifier. The decision surface produced by the SVM is interpreted as a similarity metric between two facial images.

2.3.7 Neural Networks (NNs)

Bio-inspired techniques have proved efficient in solving a wide range of computationally hard problems, including pattern recognition and data processing. The models proposed for face recognition are mostly based on NNs.

First a dimensionality reduction technique is applied to transform the images into a relatively low-dimensional space. The techniques proposed for feature extraction include PCA [15], LDA [17] and the Bunch Graph Method (BGM) [16]. The features thus extracted from the database images are then used to train an NN for supervised learning.

Radial Basis Function (RBF) NNs perform well for face recognition problems, as they have a compact topology and their learning speed is fast. On the other hand, face recognition approaches based on NNs encounter problems when the number of image classes is large. Moreover, NNs are not suitable for a single template image recognition task, because multiple template images per person are needed for training the network [3].


2.3.8 Face Recognition Using Interest Points

Face recognition schemes based on interest point matching are similar in principle to the geometrical feature matching techniques presented earlier. Such systems use either the SIFT or the SURF detector for local feature extraction, and then match the interest points extracted from the query image against the keypoints extracted from the template images.

The main difference between the schemes lies in the manner in which the interest points are matched and the most similar face image is selected. The most basic solution is the SIFT-based technique described in [16]. Matches for a query interest point are searched for in the whole template image. Only those matches are considered which are significantly better than the second best match. In the end, the template image with the most matching interest points is considered the most similar. This scheme has the disadvantage that matches are also searched for, and sometimes found, in image areas where they cannot possibly belong to the same object.

Reference [17] introduces a more elaborate matching solution using a non-overlapping grid-based search. This reduces the complexity and the number of false matches by eliminating improbable candidates.

In [18] the authors propose a mixed scheme, which compares the matches from a rectangular neighborhood of the query interest point to global matches in order to decide which matches are valid.


Chapter 3

Theoretical Foundation

3.1 Interest Point Detection

Interest point detection refers to the process of identifying interest points in an image. Interest points, or keypoints, are points that have a high degree of variation in multiple directions and are rich in terms of local information content. Corners are generally good interest point candidates, but so are black dots on a white background or any image location with a significant texture. Interest points are expected to capture local image features in a robust and scale-independent manner in order to facilitate reliable matching for object recognition. Interest points are also expected to be repeatable, meaning that they can be detected independently of changes in the image acquisition conditions, such as the camera's angle relative to the scene or varying illumination conditions [21].

Each interest point is described by a feature vector (or descriptor) which captures the characteristics of the point's neighborhood. Interest points extracted from two different images can then be matched against each other by comparing their descriptors using some distance metric.

Historically, the notion of interest point detection goes back to the earlier notion of corner detection. Corner detectors were designed to detect points at the intersection of two edges, e.g., L-corners, T-junctions, Y-junctions, X-junctions, etc. In practice, however, most corner detection algorithms are sensitive not only to corners, but in general to local image regions with a high degree of variation in multiple directions. Hence, these algorithms are essentially interest point detectors.

Blob detection is another closely related notion in computer vision. Blob detectors are aimed at detecting regions in an image which differ in brightness or color from other regions of the same image. Such a region is informally


called a blob and has the property that all its points are in some sense similar to each other. In the case of many blob detectors, each blob is described by a well-defined point, which may correspond to a local maximum in the operator response or be the center of gravity of the region. Therefore, it can be said that many blob detectors also extract characteristic interest points from the image.

3.1.1 Notable Detectors

Harris Detector

The Harris corner detector [22], proposed back in 1988 by Harris and Stephens, was among the first effective interest point detectors [21]. It is based on the auto-correlation function, which measures the local changes in the image by shifting a window in different directions.

Given a point (x, y) and a shift (Δx, Δy), the auto-correlation function is defined as

c(x, y, Δx, Δy) = Σ_{(u,v)∈W} w(u, v) (I(u, v) − I(u + Δx, v + Δy))², (3.1)

where I is the image function and W is a window centered at (x, y). w(u, v) is a Gaussian weighting factor of the form

w(u, v) = e^{−((u − x)² + (v − y)²) / (2σ²)}, (3.2)

where σ² represents the desired variance. The Gaussian is applied in order to obtain a smooth circular window, and thus reduce the noisiness of the response.

The shifted function is approximated by a first-order Taylor expansion, which in matrix form can be written as follows:

I(u + Δx, v + Δy) ≈ I(u, v) + [I_x(u, v), I_y(u, v)][Δx, Δy]^T, (3.3)

where I_x and I_y are the partial derivatives of I. This is used to approximate the auto-correlation function with the following quadratic formula:

c(x, y, Δx, Δy) = [Δx, Δy] M(x, y) [Δx, Δy]^T, (3.4)

where

M(x, y) = [ Σ_W I_x(x, y)²           Σ_W I_x(x, y) I_y(x, y) ]
          [ Σ_W I_x(x, y) I_y(x, y)  Σ_W I_y(x, y)²          ]  (3.5)

is the moment matrix that captures the intensity structure of the local neighborhood.


Let λ1 and λ2 denote the eigenvalues of M. The eigenvalues provide a description of how the auto-correlation measure changes in space. Based on these eigenvalues, Harris defines the following response measure:

R = λ1 λ2 − α(λ1 + λ2)². (3.6)

Large positive values of R will indicate a corner region, while small values will indicate flat regions where the intensity is approximately constant.

The sum and the product of the eigenvalues can be obtained from the trace and the determinant of M, without the need to explicitly compute the eigenvalues themselves:

λ1 + λ2 = tr(M),    λ1 λ2 = det(M). (3.7)

Interest points are identified as local maxima of R that are above a specified threshold. A local maximum is defined as a point greater than all its neighbors within a 3 × 3 neighborhood.
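As a toy illustration, the response measure above can be evaluated directly from pixel differences. The following pure-Python sketch is illustrative only: the synthetic image, the window size and the value α = 0.04 are assumptions, not values taken from this thesis.

```python
# Minimal sketch of the Harris response R = det(M) - alpha * tr(M)^2 (Eq. 3.6);
# the image, window size and alpha are hypothetical example values.

def harris_response(image, x, y, alpha=0.04, half_window=1):
    """Compute the Harris response at pixel (x, y) from the moment matrix M."""
    m_xx = m_yy = m_xy = 0.0
    for v in range(y - half_window, y + half_window + 1):
        for u in range(x - half_window, x + half_window + 1):
            # Central-difference estimates of the partial derivatives Ix, Iy
            ix = (image[v][u + 1] - image[v][u - 1]) / 2.0
            iy = (image[v + 1][u] - image[v - 1][u]) / 2.0
            m_xx += ix * ix
            m_yy += iy * iy
            m_xy += ix * iy
    det_m = m_xx * m_yy - m_xy * m_xy
    tr_m = m_xx + m_yy
    return det_m - alpha * tr_m * tr_m

# An 8x8 image with a bright square whose corner sits at (4, 4)
img = [[0] * 8 for _ in range(8)]
for r in range(4, 8):
    for c in range(4, 8):
        img[r][c] = 255

corner = harris_response(img, 4, 4)  # large positive R at the corner
flat = harris_response(img, 2, 2)    # R = 0 in the flat region
```

A real detector would evaluate R at every pixel, apply the Gaussian weighting of Eq. 3.2, threshold, and keep only 3 × 3 local maxima.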

Early approaches to interest point detection, such as the Harris detector, lacked invariance to scale and were sensitive to projective distortion and changes in illumination conditions [27]. More recent feature detectors attempt to achieve scale invariance based on Lindeberg's scale-space theory. Scale-space theory relies on building a multi-scale image pyramid by successively reducing the size of the input image through smoothing and resampling [23]. Mikolajczyk and Schmid have extended the Harris detector with automatic scale selection, creating a scale-invariant variant of the detector [24].

Scale-Invariant Feature Transform (SIFT)

SIFT is a scale-invariant interest point detector proposed by Lowe in 1999 [25] and later revised in [26]. It detects interest points by convolving the image with Gaussian filters at different scales and looking for locations that are maxima or minima of the Difference of Gaussians (DoG) function.

Keypoint Localization: In the first step SIFT detects so-called scale-space extrema. The scale space L(x, y, σ) of an image is defined as the convolution of a variable-scale Gaussian filter G(x, y, σ) with the input image I(x, y). More formally:

L(x, y, σ) = G(x, y, σ) ∗ I(x, y), (3.8)


where σ denotes the scale. The Gaussian filter is defined as follows:

G(x, y, σ) = (1 / (2πσ²)) e^{−(x² + y²) / (2σ²)}. (3.9)

In practice, the scale space is constructed by iteratively smoothing and downscaling the image. Lowe achieves this through bilinear interpolation with a pixel spacing of 1.5. Scales are grouped into octaves, with each octave encompassing the scales from σ to 2σ.

In order to locate scale-space extrema, the SIFT algorithm calculates the DoG images from adjacent scales separated by a constant multiplicative factor k:

D(x, y, σ) = L(x, y, kσ) − L(x, y, σ) (3.10)

In order to find the local maxima and minima of D(x, y, σ), each sample point is compared to its neighbors from the current and the adjacent scales. As illustrated by Figure 3.1, there are 8 neighbors in the current image and 9 + 9 neighbors in the scales above and below. A point is selected only if it is larger or smaller than all of its 26 neighbors.
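The 26-neighbor comparison can be sketched as follows; the three DoG levels below are made-up toy data, not real filter outputs.

```python
# Sketch of the 26-neighbor extremum test used by SIFT (Figure 3.1).

def is_extremum(dog, s, x, y):
    """True if dog[s][y][x] is strictly greater or strictly smaller than
    all 26 neighbors in the current, previous and next scale levels."""
    value = dog[s][y][x]
    neighbors = []
    for ds in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if ds == 0 and dy == 0 and dx == 0:
                    continue  # skip the sample point itself
                neighbors.append(dog[s + ds][y + dy][x + dx])
    return value > max(neighbors) or value < min(neighbors)

# Three 3x3 DoG levels with a clear maximum at the center of the middle level
dog = [
    [[1, 2, 1], [2, 3, 2], [1, 2, 1]],
    [[2, 3, 2], [3, 9, 3], [2, 3, 2]],
    [[1, 2, 1], [2, 3, 2], [1, 2, 1]],
]
print(is_extremum(dog, 1, 1, 1))  # True: 9 exceeds all 26 neighbors
```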

Scale-space extrema detection produces too many interest point candidates, some of which are unstable. In the next step the algorithm analyzes the neighborhood of each point in order to eliminate candidates that have low contrast or are poorly localized along an edge.

Figure 3.1: Neighborhood of a point in scale-space [26]

Eliminating Edge Responses: The DoG function will have a strong response along edges. In order to eliminate edge responses, Lowe calculates the principal curvatures of D from a 2 × 2 Hessian matrix computed at the location and scale of the interest point:

H = [ D_xx  D_xy ]
    [ D_xy  D_yy ], (3.11)


where D_xx, D_xy and D_yy are partial derivatives of the DoG function. They can be estimated by taking the differences of neighboring sample points.

The eigenvalues λ1 and λ2 of H are proportional to the principal curvatures of D. Based on the approach proposed by Harris and Stephens [22], Lowe avoids the explicit computation of the eigenvalues. Instead, he obtains their sum from the trace of H and their product from the determinant of H. To check that the ratio of the principal curvatures is below some threshold r, one only needs to check whether

tr(H)² / det(H) < (r + 1)² / r. (3.12)

The keypoints which fail to satisfy this inequality are rejected. Lowe proposes a value of r = 10, which eliminates the interest point candidates whose ratio between the principal curvatures is greater than 10.
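The test of Eq. 3.12 is cheap to implement, since it never computes the eigenvalues explicitly. A minimal sketch with Lowe's r = 10; the Hessian entries below are illustrative values only.

```python
# Sketch of SIFT's edge-response rejection test (Eq. 3.12).

def is_well_localized(dxx, dyy, dxy, r=10):
    """Keep a keypoint only if tr(H)^2 / det(H) < (r + 1)^2 / r."""
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False  # principal curvatures of opposite sign: reject
    return tr * tr / det < (r + 1) ** 2 / r

print(is_well_localized(10.0, 9.0, 1.0))   # True: blob-like, kept
print(is_well_localized(100.0, 1.0, 0.0))  # False: edge-like, rejected
```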

Orientation Assignment: SIFT assigns a consistent orientation to each interest point based on local image properties. The interest point descriptor can thus be represented relative to this orientation, thereby achieving invariance to rotation.

Orientation assignment is performed on the Gaussian-smoothed image closest to the keypoint's scale. For each sample L(x, y) from this image, the gradient magnitude m(x, y) and the orientation θ(x, y) are computed using pixel differences:

m(x, y) = √((L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²)

θ(x, y) = arctan((L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y))) (3.13)

An orientation histogram is then created from the gradient orientations of the sample points within the keypoint's neighborhood. The orientation histogram has 36 bins and covers the whole 360° range of orientations. Each sample added to the histogram is weighted by its gradient magnitude m and by a circular Gaussian window with a σ that equals 1.5 times the scale of the interest point.

Peaks in the orientation histogram correspond to the dominant directions of local gradients. The orientation of the highest bin will become the orientation assigned to the interest point. In case there are other local peaks within 80% of the highest peak, a new keypoint will be created for each of them and assigned the orientation of the corresponding bin. Hence, for locations with multiple peaks of similar magnitude, there will be multiple interest points created at the same location and scale, but assigned different orientations.
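The histogram voting and the 80% rule can be sketched as follows. The gradient samples and the function name are hypothetical, and the Gaussian weighting described above is omitted for brevity.

```python
# Sketch of a 36-bin orientation histogram with extra orientations kept
# for peaks within 80% of the highest peak (SIFT-style assignment).

def dominant_orientations(gradients, num_bins=36):
    """gradients: list of (magnitude, angle_in_degrees) pairs.
    Returns the start angles of the highest bin and of any other bin
    within 80% of it."""
    hist = [0.0] * num_bins
    bin_width = 360.0 / num_bins
    for magnitude, angle in gradients:
        hist[int(angle % 360 / bin_width) % num_bins] += magnitude
    peak = max(hist)
    return [i * bin_width for i, h in enumerate(hist) if h >= 0.8 * peak]

# Toy gradient samples with two strong directions near 10 and 90 degrees
grads = [(3.0, 12.0), (2.9, 14.0), (3.5, 95.0), (3.4, 97.0), (0.5, 200.0)]
print(dominant_orientations(grads))  # two keypoint orientations are created
```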


Keypoint Descriptor: The SIFT keypoint descriptor is created by first computing the gradient magnitude and orientation at each image sample point in a 16 × 16 region around the keypoint location. These are weighted by a circular Gaussian window in the manner described earlier. The neighborhood is then divided into 4 × 4 subregions and each subregion is assigned an orientation histogram of 8 bins.

Figure 3.2 illustrates this process. Please note that the figure shows a 2 × 2 array of histograms computed from an 8 × 8 set of samples instead of a 4 × 4 array obtained from a 16 × 16 neighborhood, as described above.

Figure 3.2: Computation of the SIFT interest point descriptor [26]

The descriptor of the keypoint is finally obtained by concatenating all the orientation histogram arrays. Thus we get a feature vector of 4 × 4 × 8 = 128 elements for each interest point. This vector is then normalized to unit length in order to guarantee invariance to affine illumination changes. The effects of non-linear illumination changes are reduced by thresholding the values in the unit feature vector to be no larger than 0.2, and then renormalizing to unit length.

3.2 Speeded-Up Robust Features (SURF)

SURF is a robust scale- and rotation-invariant interest point detector and descriptor originally published by Bay et al. in 2006 [27]. It was intended as a significantly faster but comparably effective alternative to SIFT.

3.2.1 Interest Point Detection

SURF uses the determinant of an approximate Hessian matrix as the core of the detector and detects blob-like structures at locations where the determinant is at a maximum. The algorithm employs integral images in the Hessian matrix approximation, which allow for fast computation of box type convolution filters.

Integral Images

The entry of an integral image I_Σ in a point x = (x, y) is the sum of all pixels in the input image I within the rectangular region formed by the origin and x. More formally:

I_Σ(x) = Σ_{i=0}^{x} Σ_{j=0}^{y} I(i, j). (3.14)

Once the integral image has been computed, it takes only three additions to calculate the sum of the intensities over any rectangular area, as illustrated by Figure 3.3. Thus the calculation time becomes independent of the size of the convolution filter. This allows for the use of larger filters without a trade-off in computational efficiency.

Figure 3.3: Using integral images to calculate the sum of intensities inside a rectangular image region [27]
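The construction of Eq. 3.14 and the constant-time box sum of Figure 3.3 can be sketched in a few lines; the toy image and function names are illustrative.

```python
# Sketch of integral-image construction and constant-time box sums.

def integral_image(image):
    """ii[y][x] = sum of all pixels in the rectangle (0, 0)..(x, y)."""
    h, w = len(image), len(image[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += image[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum over the inclusive rectangle (x0, y0)-(x1, y1) from the four
    corner look-ups, independent of the rectangle's size."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

Because `box_sum` touches only four entries regardless of the rectangle, the 9 × 9 and 99 × 99 SURF box filters cost the same to evaluate.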

Hessian Matrix Approximation

Given a point x = (x, y) in an image I, the Hessian matrix H(x, σ) in x at scale σ is defined as follows:

H(x, σ) = [ L_xx(x, σ)  L_xy(x, σ) ]
          [ L_xy(x, σ)  L_yy(x, σ) ], (3.15)


where L_xx(x, σ), L_xy(x, σ) and L_yy(x, σ) are the convolutions of the Gaussian second order partial derivatives with the image I in the point x.

SURF uses 9 × 9 box filters denoted D_xx, D_xy and D_yy as approximations of a Gaussian with σ = 1.2 (see Figure 3.4). They represent the lowest scale for computing blob responses. The approximated determinant of the Hessian matrix represents the blob response in the image at location x. It can be calculated using the following formula:

det(H_approx) = D_xx D_yy − (w D_xy)², (3.16)

where w is a weight used for energy conservation between the Gaussian kernels and the approximated Gaussian kernels. This weight is defined as

w = (‖L_xy(1.2)‖_F ‖D_yy(9)‖_F) / (‖L_yy(1.2)‖_F ‖D_xy(9)‖_F) ≈ 0.9, (3.17)

where kxkF

stands for the Frobenius norm. In theory, the weighting changeswith the scale. In practice, however, Bay et al. use it as a constant, becausethey have found that it does not have a significant impact on the results.

Figure 3.4: 9 × 9 box filters for the Gaussian second order partial derivatives L_yy and L_xy (left), respectively their approximations D_yy and D_xy (right) [27]

The approximated determinant of the Hessian matrix represents the blob response at location x. Responses are stored in a blob response map over different scales, and local maxima are detected in a 3 × 3 × 3 scale-space neighborhood.

Scale-Space Representation

Scale spaces are usually implemented as an image pyramid constructed by repeatedly smoothing and sub-sampling an image. This is the approach also followed by SIFT. Bay et al. propose a different, computationally more efficient solution for scale-space representation.

Due to the use of integral images, the cost of applying box filters of arbitrary size remains constant. Thus, instead of iteratively reducing the image size, the scale space is analyzed by increasing the filter size for each


scale, as illustrated by Figure 3.5. The output of the 9 × 9 filter is employed as the initial scale layer. The following layers are then obtained by filtering the image with gradually bigger masks (15 × 15, 21 × 21, 27 × 27, etc.).

Figure 3.5: Comparison of the SIFT (left) and SURF (right) scale-space representation [27]

Scales are grouped into octaves, each octave containing a constant number of scale levels. In total, an octave encompasses a scaling factor of 2, which implies that the filter size more than doubles inside an octave. Due to the discrete nature of integral images, the minimum scale difference between subsequent scales depends on the length l0 of the positive or negative lobes of the partial second order derivative in the direction of derivation (x or y). This is set to a third of the filter size length, thus in the case of the initial 9 × 9 filter l0 = 3. To obtain the next level, l0 must be increased by a minimum of 2 pixels (1 on every side) in order to keep the size uneven and ensure the presence of a central pixel. This results in a total increase of the mask size by 6 pixels between consecutive levels, as shown in Figure 3.6.

Figure 3.6: Filters D_yy (left) and D_xy (right) for two successive scale levels (9 × 9 and 15 × 15) [27]

The octaves in SURF are overlapping, with the first scale of each octave being the second scale of the previous octave. Hence, if the filter sizes for the first octave are 9, 15, 21 and 27, the second octave will incorporate the filter sizes 15, 27, 39 and 51. A third octave is computed with the filter sizes


27, 51, 75 and 99, and so on. See Figure 3.7 for a graphical representation of the filter sizes used in the first three octaves.

Figure 3.7: Filter sizes for the first three octaves [27]
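The overlapping filter-size schedule just described can be reproduced with a few lines of code; the function name and parameters are illustrative, not taken from any SURF implementation.

```python
# Sketch of the SURF filter-size schedule: the step between levels doubles
# per octave, and each octave starts at the second filter of the previous one.

def surf_filter_sizes(num_octaves=3, levels_per_octave=4):
    octaves = []
    start, step = 9, 6  # initial 9x9 filter, +6 pixels per level
    for _ in range(num_octaves):
        sizes = [start + i * step for i in range(levels_per_octave)]
        octaves.append(sizes)
        start, step = sizes[1], step * 2  # overlap and double the step
    return octaves

print(surf_filter_sizes())
# [[9, 15, 21, 27], [15, 27, 39, 51], [27, 51, 75, 99]]
```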

3.2.2 Interest Point Description

The SURF descriptor describes the distribution of the intensity content within the neighborhood of the interest point, similarly to the gradient information extracted by SIFT. The descriptor uses the distribution of first order Haar wavelet responses instead of the gradient magnitudes employed by Lowe. The feature vector thus constructed has only 64 dimensions. This reduces the time for feature computation and matching when compared to SIFT. Bay et al. show that, despite the smaller descriptor size, SURF is comparably robust to SIFT and its variants [27].

Orientation Assignment

The first step consists of assigning a reproducible orientation to each keypoint based on the information extracted from a circular neighborhood, in order to ensure invariance to rotation. For this purpose Bay et al. calculate the Haar wavelet responses within a circular neighborhood of radius 6s around the interest point, where s is the scale at which the interest point was found.

The wavelet responses are calculated in both the x and y directions and weighted with a Gaussian of σ = 2s centered at the interest point. The responses are then represented as points in a space with the horizontal response strength along the abscissa and the vertical response strength along the ordinate. The dominant orientation is estimated based on the sum of all responses within a sliding orientation window of size π/3, as shown in Figure 3.8. The sums obtained for the horizontal and vertical responses yield


a local orientation vector. The longest vector over all windows defines the orientation of the interest point.

Figure 3.8: Orientation assignment in SURF [27]

Construction of the Local Feature Vector

In order to extract the descriptor, Bay et al. construct a square window of size 20s centered around the interest point and oriented along the orientation previously computed.

This region is then split up regularly into 4 × 4 square sub-regions. For each sub-region, the Haar wavelet responses are computed at 5 × 5 regularly spaced sample points using the filters shown in Figure 3.9. The filter size is 2s.

Figure 3.9: Haar wavelet filters used to describe the interest points in the x (left) and y (right) direction. The dark parts have the weight −1, while the light parts have +1 [27]

Let d_x denote the Haar wavelet response in the horizontal direction and d_y the response in the vertical direction (defined in relation to the orientation of the interest point). d_x and d_y are first weighted with a Gaussian of σ = 3.3s centered at the interest point and then summed up over each sub-region. These sums represent a first set of entries in the descriptor. To this Bay et al. add the sum of the absolute values of the responses, |d_x| and |d_y|, in order to bring in information about the polarity of the intensity changes. Hence, each


sub-region is assigned a four-dimensional vector v that captures its intensity structure:

v = (Σ d_x, Σ d_y, Σ |d_x|, Σ |d_y|). (3.18)

Figure 3.10 illustrates how this vector captures the nature of the underlying intensity pattern of a sub-region for three distinctively different intensity patterns. In case of a homogeneous region, all the elements of v are relatively low values (left). In the presence of frequencies in the x direction, the value of Σ|d_x| is high, but the other components are low (middle). When the intensity is gradually increasing in the x direction, both Σd_x and Σ|d_x| are high, while the values for the y direction remain low (right).

Figure 3.10: Examples of sub-region feature vectors [27]

The descriptor of the interest point is finally constructed by concatenating the intensity vectors obtained for all the sub-regions (16 in number). This yields a feature vector of 64 elements for each interest point.
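The assembly of the 64-element vector from the Eq. 3.18 sums can be sketched as follows. The wavelet responses are random placeholders, not real filter outputs, and the function names are illustrative.

```python
# Sketch of SURF descriptor assembly: each of the 16 sub-regions
# contributes the four sums of Eq. 3.18.

import random

def subregion_vector(responses):
    """responses: list of (dx, dy) Haar wavelet responses in a sub-region."""
    return [
        sum(dx for dx, _ in responses),
        sum(dy for _, dy in responses),
        sum(abs(dx) for dx, _ in responses),
        sum(abs(dy) for _, dy in responses),
    ]

def surf_descriptor(subregions):
    """Concatenate the 4-element vectors of the 4x4 = 16 sub-regions."""
    descriptor = []
    for responses in subregions:
        descriptor.extend(subregion_vector(responses))
    return descriptor

random.seed(0)
# 16 sub-regions, each with 5x5 = 25 sampled (dx, dy) responses
subregions = [[(random.uniform(-1, 1), random.uniform(-1, 1))
               for _ in range(25)] for _ in range(16)]
desc = surf_descriptor(subregions)
print(len(desc))  # 64
```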

Figure 3.11 summarizes the process of building the SURF descriptor. The 2 × 2 sub-divisions of each region represent the actual fields of the descriptor.

Figure 3.11: Construction of the SURF descriptor [27]


3.2.3 Characteristics

SURF is, to some extent, similar in principle to SIFT, since both detect interest points based on the spatial distribution of gradient information. Nevertheless, SURF outperforms SIFT in several aspects, as demonstrated in [27]:

• it is less sensitive to noise, because it integrates the gradient information within a subpatch, while SIFT depends on the orientations of the individual gradients;

• while it runs more than 5 times faster than the SIFT detector, the interest points detected by SURF are comparable or even better in terms of stability and repeatability than those detected by SIFT;

• SURF captures the local intensity structure in a feature vector twice as compact as its SIFT counterpart (in its standard form).

These benefits contributed to our choice of using SURF instead of SIFT as the interest point detector employed in our face recognition system.

3.3 Interest Point Matching

The interest points extracted from two different images can be matched against each other using Nearest Neighbor Search (NNS). This is a fundamental optimization problem in computer vision, and it can be formulated as follows:

p* = argmin_{p ∈ P} d(p, q), (3.19)

where p* is the nearest neighbor of the query vector q in a multi-dimensional dataset P, which is a non-empty subset of some vector space. d is a distance metric used for quantifying how close one vector is to another.

In our case the dataset and the query are given by the interest point descriptors, while the distance measure is either the Euclidean or the Manhattan distance. The Euclidean distance d_E between two vectors p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) is given by

d_E(p, q) = √(Σ_{i=1}^{n} (p_i − q_i)²), (3.20)

while the Manhattan distance d_M is defined as

d_M(p, q) = Σ_{i=1}^{n} |p_i − q_i|. (3.21)


k-NNS is a variant of NNS which aims to identify the top k nearest neighbors of the query.

3.3.1 Brute-Force Solution

The naïve solution to the NNS problem is to compute the distance from the query interest point to every single point in the dataset and then select the closest match by comparing the distances. While it is guaranteed to find the exact nearest neighbors in any case, this approach has a computational complexity of O(nd), where n is the number of elements in the dataset and d is the dimensionality of each entry. This can be prohibitively expensive for large datasets.
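The brute-force scheme can be sketched with the Euclidean distance of Eq. 3.20; the toy 3-dimensional descriptors and the function names are illustrative.

```python
# Sketch of brute-force k-NNS: score every dataset point against the
# query (O(n*d)) and keep the k smallest distances.

import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_brute_force(dataset, query, k=2, distance=euclidean):
    """Return the k nearest neighbors of `query` as (distance, index) pairs."""
    scored = sorted((distance(p, query), i) for i, p in enumerate(dataset))
    return scored[:k]

data = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
print(knn_brute_force(data, [0.9, 0.1, 0.0]))
# nearest is index 1, second nearest is index 0
```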

Several data structures and related algorithms have been proposed for NNS in an endeavor to reduce the complexity of the search, the most prominent of them being the k-d tree algorithm proposed by Friedman et al. in 1977 [28]. While these approaches work well for low-dimensional data, they quickly lose their effectiveness as the dimensionality increases [30]. In fact, for high-dimensional spaces they do not perform essentially better than the brute-force search method [33].

The difficulty of finding the exact nearest neighbors in acceptable time has led to the development of Approximate Nearest Neighbor Search (ANNS) algorithms. They reduce execution time by searching for a good enough approximation of the nearest neighbor instead of the exact nearest neighbor.

3.3.2 Approximate Nearest Neighbor Search (ANNS)

In the last two decades numerous different approaches have been proposed for solving the ANNS and the k-ANNS problem. In what follows we mention only a few of them in a very condensed manner.

Arya et al. [31] have implemented ANNS by modifying the original k-d tree algorithm. They introduce the notion of an ε-approximate nearest neighbor. Given any positive real threshold value ε, a data point p is an ε-approximate nearest neighbor of q if its distance from q is within a factor of (1 + ε) of the distance to the true nearest neighbor, i.e., d(p, q) ≤ (1 + ε) d(q, p*).

The dataset is first preprocessed into a tree-like structure. The algorithm then parses the tree until an ε-approximate nearest neighbor of q is found. A priority queue is used to speed up the search by visiting tree nodes in order of their distance from the query point. Given a set of n data points in real d-dimensional space R^d, the search tree can be constructed in O(dn log n) time and O(dn) space.


A similar k-d tree based algorithm is proposed by Beis and Lowe in [29]. The main difference is that they rely on a stopping criterion based on examining a fixed number E_max of leaf nodes instead of using a distance ratio threshold. Thus they obtain slightly better execution times than Arya et al.

The original k-d tree algorithm splits the data in half at each level of the tree on the dimension for which the data exhibits the greatest variance. Silpa-Anan and Hartley [32] propose a randomized variant of the k-d tree algorithm, in which the split dimension is chosen randomly from the first D dimensions on which the data has the greatest variance.

Automatic Algorithm Configuration

The optimal algorithm for ANNS is highly dependent on the structure of the dataset (e.g. whether there is any correlation between the features in the dataset) and the desired search precision [33].

If we consider the algorithm itself as a parameter of a generic NNS routine, the problem of finding the best algorithm for a given scenario can be formulated as an optimization problem in parameter space.

Based on this observation, Muja and Lowe [33] propose a solution for automatically determining the best suited ANNS algorithm for a given dataset. The cost is computed as a weighted combination of the search time, tree build time, and tree memory overhead.

The above solution is implemented by the FLANN [34] library, along with several fast ANNS algorithms. See section 4.2.2 for more details about this library and the way we used it in this thesis.

3.4 Histogram Equalization

The SURF interest point detector is robust enough to overcome slight variations in illumination. Nevertheless, we chose to apply histogram equalization as a preprocessing step in order to normalize illumination. Hence, we also include a brief explanation of the image histogram and histogram equalization in this chapter.

The histogram of an image captures the frequency of each intensity value within the image. Each intensity level corresponds to a bin, and the height of the bin is given by the number of pixels having that particular intensity. Figure 3.12 shows the histograms of four different images that depict the same scene.


Figure 3.12: Four images and their respective histograms [2]

Histogram equalization is a technique for adjusting image intensities and belongs to the more general class of histogram remapping methods. It achieves a non-linear remapping of intensity levels based on the cumulative distribution function (CDF) (or cumulative histogram) of the image.

Let H(i) denote the histogram of a grayscale image I(x, y), where i is a bin number between 0 and 255 and H(i) represents the height of bin i. In order to use the CDF as the remapping function, its output must cover the range [0, 255]. H(i) first needs to be normalized by dividing each bin by the total number of pixels n. We thus obtain p(i), which captures the probability of a pixel having the intensity level i in the image. The CDF is then calculated from p(i) using the following formula:

c(i) = Σ_{j=0}^{i} p(j). (3.22)

The histogram equalized image I′(x, y) can then be obtained from the original image and the cumulative histogram, scaled back to the [0, 255] intensity range, using the following remapping formula:

I′(x, y) = 255 · c(I(x, y)). (3.23)
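The procedure (normalized histogram, CDF, remapping scaled to [0, 255]) can be sketched as follows; the tiny "image" and the function name are illustrative.

```python
# Sketch of histogram equalization: histogram -> CDF -> intensity remapping.

def equalize(image, levels=256):
    h, w = len(image), len(image[0])
    n = h * w
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1
    # Cumulative distribution function of the normalized histogram
    cdf, running = [0.0] * levels, 0.0
    for i in range(levels):
        running += hist[i] / n
        cdf[i] = running
    # Remap each pixel through the CDF, scaled back to [0, 255]
    return [[round(255 * cdf[value]) for value in row] for row in image]

img = [[50, 50], [51, 52]]  # low-contrast image
print(equalize(img))        # intensities spread toward the full range
```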


Histogram equalization enhances the global contrast of most images, as illustrated by Figure 3.13.

Figure 3.13: Effect of histogram equalization on an image [2]


Chapter 4

Proposed Solution

4.1 Outline

In this section we present a theoretical overview of the frontal face recognition system developed as part of this thesis. Details about the proof-of-concept implementation of the system will follow in the next section.

4.1.1 Preconditions

In this thesis we only handle the identification step of the face recognition process described in section 3.1. Hence, for simplicity, we assume that both the query face and the database images are normalized and have the exact same pixel resolution. Thus we can apply a single coordinate system to both images and calculate the spatial distances between different points with minimal effort.

The reasoning behind the face recognition technique presented in this chapter relies heavily on the non-violation of this precondition. The possibility of extending the matching scheme to handle facial images of different resolutions will be discussed in section 7.1.

4.1.2 Interest Point Detection and Matching

We apply the SURF detector to detect interest points in the query image. These interest points are then matched against the interest points previously extracted from each database image. The interest point detection step is preceded in all cases by a preprocessing step consisting of histogram equalization. Figure 4.1 shows the interest points extracted from a grayscale face image of 92 × 112 pixels originating from the AT&T Face Database [36].


Figure 4.1: Interest points detected in a grayscale face image using SURF

Since the input images are assumed to be normalized and identically sampled frontal pictures of human faces, we can assume that the same regions of the face will occupy approximately the same position within both the query and the template images. Hence, we can avoid the complexity of searching the whole template image for possible matches for a query interest point. Instead, for every interest point x = (x, y) ∈ I we only consider those interest points which lie within a circle of a predefined radius r centered at x′ = (x, y) ∈ J (see Figure 4.2). By applying this spatial constraint, we can significantly reduce the chance of finding false matches between points that have similar local properties but belong to very different facial features.

Figure 4.2: Matching interest points between face images: each interest point from the query image is only matched against those interest points from the database image that reside in the neighboring area (marked with a blue circle)

We match interest points using approximate k-NNS. For each interest point from the query image we find the two closest matches among the interest points situated in the neighborhood of the query point's projection on the database image. The nearest neighbor becomes a candidate match.

We then calculate the ratio of the feature distances that characterize the best and the second best match. The candidate match is kept only if this ratio is below a predefined threshold α; otherwise it is rejected. In other words, we only keep those matches which are significantly better than the second best candidate. Thus we can drastically reduce the number of false matches. The candidate match is also ignored in the special case when


there is only one interest point in the search area and, in consequence, there is no second best match to compare with.
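The acceptance rule above can be sketched as a small predicate; this is an illustration of the ratio test as described, not the thesis code itself, and the names are hypothetical:

```cpp
// Nearest-neighbor ratio test: keep the candidate only if the best feature
// distance is significantly smaller than the second best one. `alpha` is
// the threshold from the text; `hasSecond` covers the special case of a
// single interest point in the search area.
bool acceptMatch(float bestDist, float secondDist, bool hasSecond, float alpha) {
    if (!hasSecond) return false;          // no second best match to compare with
    if (secondDist == 0.0f) return false;  // degenerate tie, reject conservatively
    return (bestDist / secondDist) < alpha;
}
```

With α = 0.5 (the value used later in the experiments), a candidate survives only when its descriptor distance is less than half that of the runner-up.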

4.1.3 Similarity Measure

After the matching step we have obtained a number n of good matches for each database image. Each match is characterized by the distance between the descriptors of the two corresponding interest points. The smaller this distance, the more similar the features the two points exhibit.

From our initial hypothesis we can deduce that the spatial proximity of the interest points also contributes to the quality of the match. The closer the two interest points are to each other in space, the higher the probability that they belong to the same area of the face. We quantify this using the Euclidean distance between the coordinates of the points. Given two interest points p = (x₁, y₁) ∈ I and q = (x₂, y₂) ∈ J, we calculate their spatial distance using the following formula:

d_s(p, q) = √((x₁ − x₂)² + (y₁ − y₂)²).   (4.1)

The combination of these two distances yields the quality of the match. We define the quality of the match between interest points p and q as follows:

Q(p, q) = 1 / (1 + d_s(p, q) · d_d(p, q)),   (4.2)

where d_s represents the spatial distance between the interest points and d_d is the distance between their descriptors. Q takes values from the interval (0, 1], where 1 stands for the quality of the theoretical perfect match.
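Equations (4.1) and (4.2) translate directly into code; a minimal sketch with illustrative names:

```cpp
#include <cmath>

// Spatial distance (4.1): plain Euclidean distance between the matched
// points' image coordinates.
double spatialDistance(double x1, double y1, double x2, double y2) {
    return std::sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2));
}

// Match quality (4.2): combines spatial distance ds and descriptor
// distance dd; yields 1 for the theoretical perfect match (both zero)
// and approaches 0 as either distance grows.
double matchQuality(double ds, double dd) {
    return 1.0 / (1.0 + ds * dd);
}
```

As a sanity check against Appendix A: for Subject 6 in Example 1, d_s = 9.225 and d_d = 0.391 give Q = 1 / (1 + 3.607) ≈ 0.217, the reported average quality.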

The average quality Q̄ of the interest point matches between two images will be one of the measures used to determine their degree of similarity.

The other component will be a function of n, since more matches indicate a higher number of similar features between the images. For this purpose we have chosen the logarithmic function, because its curve has a larger slope for small values of n, and thus ln n makes a greater difference in the critical scenario when there are only a few matches for all faces.

Hence, we define the similarity measure between two face images I and J as follows:

S(I, J) = Q̄ · ln n  if n ≥ n_min, and 0 otherwise,   (4.3)

where n_min is a predefined threshold that represents the minimum number of matches needed to consider the similarity measure sufficiently reliable.


Once we have calculated S for all the template images from the database, we can choose the database image that is most similar to the query image I as follows:

J_best = argmax_J S(I, J).   (4.4)

In the particular case when all database images yield S = 0, we consider that there is no good enough match in the database for the query image, therefore J_best = ∅.
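The similarity measure (4.3) and the database scan (4.4) can be sketched together; function and variable names are illustrative, not taken from the implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Similarity measure (4.3): avgQuality is Q-bar, n the number of good
// matches, nMin the reliability threshold from the text.
double similarity(double avgQuality, std::size_t n, std::size_t nMin) {
    return (n >= nMin) ? avgQuality * std::log(static_cast<double>(n)) : 0.0;
}

// Database scan (4.4): returns the index of the most similar template,
// or -1 when every template yields S = 0 (the "no match" outcome).
int bestTemplate(const std::vector<double>& similarities) {
    int best = -1;
    double bestS = 0.0;
    for (std::size_t j = 0; j < similarities.size(); ++j) {
        if (similarities[j] > bestS) {
            bestS = similarities[j];
            best = static_cast<int>(j);
        }
    }
    return best;
}
```

With the figures reported for Subject 2 in Appendix A (Q̄ = 0.580, n = 10 and n_min = 3), this gives S = 0.580 · ln 10 ≈ 1.335, matching the reported similarity of 1.334 up to rounding.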

4.2 Implementation Details

Based on the algorithm described in the previous section, we have implemented a proof-of-concept face recognition system as a command line application written in C++. In this section we present the functionality offered by our application, as well as some implementation details, including the third-party libraries we used.

4.2.1 Functionality and Usage

To run the application, one has to issue the following command:

frec [options] query-img-file db-config-file

Here query-img-file represents the path to the face image we want to classify, while db-config-file stands for the path to the database configuration file. The latter is a simple YAML file that specifies the paths to the images belonging to the database. The example below presents the expected structure of the database configuration file:

%YAML:1.0

folder: /home/aron/workspace/frec/testdata/database10

subjects:

- { name:Subject 1, filename:s01, extension:pgm }

- { name:Subject 2, filename:s02, extension:pgm }

- { name:Subject 3, filename:s03, extension:pgm }

- { name:Subject 4, filename:s04, extension:pgm }

- { name:Subject 5, filename:s05, extension:pgm }

- { name:Subject 6, filename:s06, extension:pgm }

- { name:Subject 7, filename:s07, extension:pgm }

- { name:Subject 8, filename:s08, extension:pgm }

- { name:Subject 9, filename:s09, extension:pgm }

- { name:Subject 10, filename:s10, extension:pgm }


By default the application only produces a simple output: the name of the subject recognized in the query image, or the message "No match" in case the query face is not sufficiently similar to any of the template images. The following optional parameters can be used for a more verbose execution:

-d enables detailed output mode;

-v enables visualization mode.

In detailed output mode the program displays statistics about the number of interest points found in the images, the number and quality of the matches found, as well as the degree of similarity between the query image and each image from the database.

When visualization mode is enabled, the application displays the face images being processed in a window and illustrates the interest point matching process in detail.

Execution Workflow

The application first parses the database configuration file and locates the template images. Template images are expected to be found at the top level of the database folder and be named using the following convention: (unknown).{extension}.

Next, for each image from the database, the program looks for a file named (unknown) keypoints.yml, which contains the interest points previously extracted and serialized. If these files are missing, i.e., the given database has not been previously used, the application extracts and serializes the interest points and their descriptors on the fly.

The query image is matched against each template image and the similarity measures are computed. The best similarity measure and the index of the corresponding database image are retained at each step. Finally, if the best similarity value is greater than zero, the program displays the name of the subject; otherwise it announces that no match has been found.

4.2.2 Technologies Used

Our application uses the OpenCV [35] library for detecting and matching SURF interest points, and also for general image manipulation. OpenCV (Open Source Computer Vision), started in 1999 as an Intel research initiative, is an open source library mainly aimed at real-time computer vision. The most important OpenCV classes used by our application are the following:


SurfFeatureDetector – implements the SURF interest point detector. It is defined in opencv2/nonfree/features2d.hpp and extends the generic FeatureDetector. The SURF detector is invoked with the default parameters of 4 octaves and 2 layers per octave, as well as a Hessian threshold of 400;

SurfDescriptorExtractor – extends DescriptorExtractor and is responsible for computing the descriptors of the interest points detected by SurfFeatureDetector;

FlannBasedMatcher – performs quick and efficient matching using the FLANN [34] library. It implements the DescriptorMatcher interface.

FLANN (Fast Library for Approximate Nearest Neighbors) is a library developed by Muja for performing fast approximate nearest neighbor search in high-dimensional spaces. It contains a collection of algorithms found to work best for nearest neighbor search and a system for automatically choosing the best algorithm and optimum parameters depending on the dataset. See 3.3.2 for more details about the automatic algorithm selection technique employed by FLANN.


Chapter 5

Experimentation

5.1 Experimental Methodology

We have conducted all tests and experiments on a subset of the AT&T Cambridge Laboratories Face Database [36]. The database contains facial images of 40 different subjects. For each subject there are 10 images, which cover various poses and facial expressions.

From this database we have randomly chosen subsets of 5, 10 and 20 subjects for our experiments. For each subject we have then manually selected one frontal image with a facial expression as neutral as possible. These images were used as facial templates for the respective subjects. The remaining images were used as queries.

We have compiled several sets of query images, each capturing a different scenario:

• frontal with neutral facial expression;

• head rotation (horizontal);

• head rotation (vertical);

• frontal with different facial expression (smiling, laughing);

• occlusion (eyes covered).

Since the AT&T Face Database does not contain occluded face images, we have simulated occlusion by drawing a black stripe to cover the eyes of the subject (see Figure 5.1).


Figure 5.1: Example of manually created occluded test image

5.2 Results

In this section we present the results of our tests and experiments. First we discuss the configurable parameters r, α and n_min, then we present the recognition rates achieved by our experimental setup.

5.2.1 Parameters

The parameters described in 4.1.2 have a significant influence on the performance of the system. The values that yielded the best results in our experimental setup are presented in Table 5.1. We have obtained these values empirically, using as guidelines the parameter values suggested in the literature.

Parameter   Value
r           15-20
α           0.5
n_min       3-4

Table 5.1: Parameter values that yielded the best results for the AT&T Face Database

Please note that r is highly dependent on the pixel size of the input images, hence the value range suggested here is only valid for the specific database we used for testing. In general, r should increase proportionally with pixel size, but determining an exact relation requires further investigation and was considered beyond the scope of our current experiments.

5.2.2 Recognition Rate

Table 5.2 shows the average recognition rates obtained for the query sets described in 5.1. Each query set contained 20 query images.


Query Set                    Correctly Identified   Recognition Rate
Frontal                      19                     95%
Head rotation (horizontal)   9                      45%
Head rotation (vertical)     8                      40%
Facial expression            14                     70%
Occlusion                    16                     80%

Table 5.2: Recognition rates for different query sets

It can be seen that the system performs very well for frontal images with a neutral facial expression. The recognition rate of 95% obtained for this scenario is comparable to those reported for other face recognition techniques. On the other hand, in the presence of considerable pose variation the performance of the system degrades drastically, e.g. head rotation of about 30° yields recognition rates below 50%.

Due to the local nature of the features matched by our system, occlusions and slight differences in facial expression are handled significantly better than pose variation. The system is often able to find enough matching interest points to calculate a reliable similarity measure even if the query face is partially occluded.

5.2.3 Statistics

In Table 5.3 we have gathered different statistics regarding the number of interest points detected in the test images and the number of good matches between them. As can be seen from the table, the number of good matches between two images showing the same subject is larger by an order of magnitude than the number of good matches between images of two different subjects.

Description                                     Value
Average number of interest points               35.07
Avg. no. of good matches (same subject)         10.90
Avg. no. of good matches (different subjects)   0.67

Table 5.3: Statistics about the number of interest points and good matches

Table 5.4 presents statistics about the average quality of interest point matches and the average values obtained for the similarity measure when comparing facial images of the same subject and of different subjects, respectively.


Description                                         Value
Avg. quality of good matches (same subject)         0.47
Avg. quality of good matches (different subjects)   0.26
Average similarity (same subject)                   1.11
Average similarity (different subjects)             0.02

Table 5.4: Statistics about the similarity measure


Chapter 6

Contributions

In this thesis we have proposed a method for using SURF interest points for face identification. This method is applicable if the query image and the database images are normalized and have the same pixel size. We have chosen SURF over other interest point detectors due to its superior speed.

The main contribution of this thesis lies in the matching procedure and the similarity measure. Unlike previous keypoint-based systems presented in section 2.3.8, we search for matching interest points only in the circular vicinity of the query point's projection on the template image. Thus we decrease the complexity of the search procedure and drastically reduce the chance of false matches.

The similarity measure we use in our system is more elaborate than those proposed in related publications. While other works rely either on the number of good matches or on the average spatial or feature distance between the matching interest points, we combine all of these into a single quality measure in an attempt to find the best possible matches.


Chapter 7

Conclusions and Future Work

In this thesis we have shown that local image features can be effectively used for frontal face recognition when combined with a spatially constrained matching scheme. Our system has the advantage of being able to perform reliable frontal recognition with only one template image per subject, and it can handle occlusions tolerably well. On the other hand, it also has several disadvantages, which should be addressed as part of future research.

7.1 Limitations and Possible Improvements

In this section we analyze the main drawbacks of our system and propose guidelines for possible future improvements. We have identified the following intrinsic drawbacks in our face recognition system:

Lack of view-tolerance: The main drawback of our system lies in the fact that the quality measure based on the spatial distance is not effective in situations when significant pose variation is present. This happens because our system relies heavily on the presumption that both the query image and the database images show the face from approximately the same angle, and thus corresponding facial features appear in the same areas.

Improving view-tolerance could be one of the most important subjects of future research. We could possibly achieve a view-tolerant system by using a training set of more than one template image per subject, thus covering several different orientations.

Too restrictive preconditions: Another disadvantage of our face recognition system is that it expects all images to have the same pixel size, thus we cannot fully exploit the scale independence of the SURF detector.


This is a much simpler problem to solve than view-tolerance. We could overcome this limitation by using a size-adaptive search neighborhood that is adjusted relative to the size of each image instead of being constant. We would also need to modify the distance calculation formula to take into consideration the size differences between the images.

7.2 Other Future Work

In this section we present other possible directions for future research, which go beyond the immediate solutions that address the limitations of the system.

Parallelize the matching process: Even though we use the fastest available algorithms for interest point detection and matching, the execution time required for matching will always be a function of the number of template images and the average number of interest points detected in them. We could reduce the execution time by parallelizing the matching process.

Since template images are completely independent of each other, their interest points could be matched in parallel against those of the query image. Furthermore, each interest point within the same query image could also be matched in parallel against the interest points of the template image.
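The template-level parallelism suggested above could be sketched with standard C++ futures; `matchScore` is a hypothetical placeholder for the full per-template matching pipeline, used here only to show the task structure:

```cpp
#include <future>
#include <vector>

// Placeholder for the per-template work: in the real system this would run
// the spatially constrained SURF matching against one template image and
// return its similarity value.
double matchScore(int templateId) {
    return templateId * 0.1;   // dummy cost model for illustration
}

// Launch one asynchronous task per template; templates do not interact,
// so the tasks are embarrassingly parallel. Results are collected in
// template order regardless of completion order.
std::vector<double> matchAllParallel(int templateCount) {
    std::vector<std::future<double>> tasks;
    for (int j = 0; j < templateCount; ++j)
        tasks.push_back(std::async(std::launch::async, matchScore, j));

    std::vector<double> scores;
    for (auto& t : tasks) scores.push_back(t.get());
    return scores;
}
```

The finer-grained per-interest-point parallelism mentioned above would fit the same pattern, with one task per query interest point instead of per template.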

Extend the system to achieve fully automatic recognition: Starting from the face identification scheme proposed in this thesis, we could design and implement a fully automatic face recognition system. In order to achieve this, we would have to address the problem of face detection and normalization.

We could also experiment with the possibility of achieving real-time recognition by connecting the system to image acquisition hardware.


Appendix A

Examples

Here we present some concrete examples of execution of our application. The test images have been taken from the AT&T Face Database [36].

Example 1: In Table A.1 we present the details of a successful scenario, in which the query face is correctly identified from a database of 10 subjects.

Table A.1: Example of successful recognition

Subject      Candidates   Good matches   Avg. feature dist.   Avg. spatial dist.   Avg. quality   Similarity
Subject 1    58           0              –                    –                    –              0
Subject 2    59           10             0.306                2.994                0.580          1.334
Subject 3    59           0              –                    –                    –              0
Subject 4    58           0              –                    –                    –              0
Subject 5    55           0              –                    –                    –              0
Subject 6    59           1              0.391                9.225                0.217          0
Subject 7    59           0              –                    –                    –              0
Subject 8    59           0              –                    –                    –              0
Subject 9    59           0              –                    –                    –              0
Subject 10   55           0              –                    –                    –              0

Match: Subject 2

Example 2: Table A.2 presents a scenario in which no match is found for the query image among 5 template images.

Table A.2: Example of no match being found

Subject     Candidates   Good matches   Avg. feature dist.   Avg. spatial dist.   Avg. quality   Similarity
Subject 1   34           0              –                    –                    –              0
Subject 2   33           0              –                    –                    –              0
Subject 3   34           1              0.442                5.333                0.298          0
Subject 4   34           1              0.295                3.471                0.494          0
Subject 5   33           1              0.340                12.543               0.190          0

No match


Bibliography

[1] S. Z. Li, A. K. Jain (eds.), “Handbook of Face Recognition”, Springer New York, ISBN: 0-387-40595-X (2005)

[2] T. B. Moeslund, “Introduction to Video and Image Processing – Building Real Systems and Applications”, Springer London, ISBN: 978-1-4471-2502-0 (2012)

[3] A. F. Abate, M. Nappi, D. Riccio and G. Sabatino, “2D and 3D Face Recognition: A Survey”, in Pattern Recognition Letters, vol. 28, pp. 1885-1906 (2007)

[4] A. S. Tolba, A. H. El-Baz and A. A. El-Harby, “Face Recognition: A Literature Review”, in International Journal of Information and Communication Engineering, vol. 2, no. 2, pp. 88-103 (2006)

[5] P. N. Belhumeur, “Ongoing Challenges in Face Recognition”, in Frontiers of Engineering: Reports on Leading-Edge Engineering from the 2005 Symposium, ISBN-13: 978-0-309-10102-8, pp. 5-14 (2006)

[6] M. Turk and A. Pentland, “Face Recognition Using Eigenfaces”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Maui, Hawaii, pp. 586-591 (1991)

[7] P. Belhumeur, J. Hespanha and D. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection”, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720 (1997)

[8] M. Lades et al., “Distortion Invariant Object Recognition in the Dynamic Link Architecture”, in IEEE Transactions on Computers, vol. 42, no. 3, pp. 300-310 (1993)

[9] M. Sharif, A. Khalid, M. Raza and S. Moshin, “Face Recognition using Gabor Filters”, in Journal of Applied Computer Science & Mathematics, no. 11, Suceava, pp. 53-57 (2011)

[10] T. Barbu, “Gabor Filter-based Face Recognition Technique”, in Proceedings of the Romanian Academy, Series A, vol. 11, no. 3, pp. 277-283 (2010)

[11] S. Kakarwal and R. Deshmukh, “Wavelet Transform based Feature Extraction for Face Recognition”, in International Journal of Computer Science and Application, pp. 100-104 (2010)

[12] Y. Gao and M. K. H. Leung, “Face Recognition Using Line Edge Map”, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 764-779 (2002)

[13] P. Jonathon Phillips, “Support Vector Machines Applied to Face Recognition”, in Advances in Neural Information Processing Systems 11, MIT Press, pp. 803-809 (1999)

[14] F. Samaria and F. Fallside, “Face Identification and Feature Extraction Using Hidden Markov Models”, in Image Processing: Theory and Application, G. Vernazza, ed., Elsevier (1993)

[15] P. Latha, L. Ganesan and S. Annadurai, “Face Recognition Using Neural Networks”, in Signal Processing: An International Journal (SPIJ), vol. 3, no. 5, pp. 153-160 (2010)

[16] D. Aradhana, H. Girish, K. Karibasappa and A. Chennakeshava Reddy, “Face Recognition by Bunch Graph Method Using a Group Based Adaptive Tolerant Neural Network”, in British Journal of Mathematics & Computer Science, vol. 1, no. 4, pp. 194-203 (2011)

[17] K. Rama Linga Reddy, G. R. Babu and K. L. Kishore, “Face Recognition Based on Eigen Features of Multi Scaled Face Components and an Artificial Neural Network”, in International Journal of Security and Its Applications, vol. 5, no. 3, pp. 23-44 (2011)

[18] M. Aly, “Face Recognition using SIFT Features”, Technical Report, California Institute of Technology, USA (2006)
www.vision.caltech.edu/malaa/publications/aly06face.pdf

[19] P. Dreuw, P. Steingrube, H. Hanselmann and H. Ney, “SURF-Face: Face Recognition Under Viewpoint Consistency Constraints”, in Proceedings of the British Machine Vision Conference, vol. 7, pp. 1-11 (2009)

[20] G. Du, F. Su and A. Cai, “Face Recognition Using SURF Features”, in SPIE Proceedings Vol. 7496, MIPPR 2009: Pattern Recognition and Computer Vision (2009)

[21] C. Schmid, R. Mohr and C. Bauckhage, “Evaluation of Interest Point Detectors”, in International Journal of Computer Vision, vol. 37, no. 2, pp. 152-172 (2000)

[22] C. Harris and M. Stephens, “A Combined Corner and Edge Detector”, in Proceedings of the Alvey Vision Conference, pp. 147-151 (1988)

[23] T. Lindeberg, “Scale-Space Theory: A Basic Tool for Analysing Structures at Different Scales”, in Journal of Applied Statistics, vol. 21, no. 2, pp. 225-270 (1994)

[24] K. Mikolajczyk and C. Schmid, “Scale & Affine Invariant Interest Point Detectors”, in International Journal of Computer Vision, vol. 60, no. 1, pp. 63-86 (2004)

[25] D. G. Lowe, “Object Recognition from Local Scale-Invariant Features”, in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150-1157 (1999)

[26] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, in International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110 (2004)

[27] H. Bay, T. Tuytelaars and L. van Gool, “SURF: Speeded Up Robust Features”, in Proceedings of the 9th European Conference on Computer Vision, pp. 404-417 (2006)

[28] J. H. Friedman, J. L. Bentley and R. A. Finkel, “An Algorithm for Finding Best Matches in Logarithmic Expected Time”, in ACM Transactions on Mathematical Software, vol. 3, no. 3, pp. 209-226 (1977)

[29] J. S. Beis and D. G. Lowe, “Shape Indexing Using Approximate Nearest-Neighbour Search in High-Dimensional Spaces”, in Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR ’97), pp. 1000-1006 (1997)

[30] P. Indyk and R. Motwani, “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality”, in Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC ’98), pp. 604-613 (1998)

[31] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman and A. Y. Wu, “An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions”, in Journal of the ACM (JACM), vol. 45, no. 6, pp. 891-923 (1998)

[32] C. Silpa-Anan, R. Hartley, “Optimised KD-trees for Fast Image Descriptor Matching”, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1-8 (2008)

[33] M. Muja and D. G. Lowe, “Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration”, in International Conference on Computer Vision Theory and Applications (VISAPP ’09) (2009)

[34] M. Muja, “FLANN: Fast Library for Approximate Nearest Neighbors” (2011)
http://mloss.org/software/view/143/

[35] “OpenCV: Open Source Computer Vision Library”
http://opencv.org/

[36] “AT&T Laboratories Cambridge Face Database”
http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html

[37] “Yale Face Database”
http://cvc.yale.edu/projects/yalefaces/yalefaces.html
