Cognitive Vision Model Gauging Interest in
Advertising Hoardings
Saad Choudri
MSc Cognitive Systems
Session 2005/2006

The candidate confirms that the work submitted is his own and that appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material obtained from another source may be considered plagiarism.
(Signature of student) ___________________________________
Acknowledgments

At the outset, I would like to acknowledge the support of my parents, Rishad and Samar Choudri,
who encouraged and sponsored my MSc Cognitive Systems. My other family members and friends
who supported me considerably include Sana Lutfullah, Rayika, Zayd and Khizar Diwan to whom I
am thankful.
Prof. David Hogg (Supervisor) and Dr. Andy Bulpit (Assessor), with their unmatched experience in
Computer Vision and guidance, fueled the “from idea to project” concept to help me refine the
conception of this project.
Prof. David Hogg supported this project from the very start and most of all encouraged a self-driven
approach which in itself had immense advantages including novelty. Dr. Hannah Dee (stand-in
Supervisor) played a vital role in his absence to steer the report toward its current state and make me
think “Japanese garden” vs. “English garden”.
I am very grateful to Eric Atwell for giving me an edge for the road ahead and to Katja Markert,
Prof. Tony Cohn, Prof. Ken Brodlie, Martyn Clark and Tony Jenkins for various roles they played.
I would also like to thank all members of [email protected] including Savio Pirondini,
Graham Hardman and Pritpal Rehal for none other than their support.
Special thanks to Simon Baker and Ralph Gross at Carnegie Mellon University for making available
the Pose, Illumination, and Expression (PIE) database of 41,368 images.
Thanks to Lee Kitching and Khurram Ahmed for participating in a few evaluation videos. Mention
must also be made of Arnold, the silverfish in my room who, if he could, would gladly have eaten
this report.
Last but not least, I would like to thank administration members of the School of Computing,
especially Mrs. Irene Rudling, Yasmeen, Judy and Teresa for, among other assistance, helping me
locate Prof. Hogg for my million plus questions.
Abstract
In order to gauge pedestrian interest in advertising hoardings, a gaze or head pose estimation system
is required. Proposed here is a novel “spirit-level” approach to head pose estimation in which heads
may be as small as 20 pixels high and lack detail, adding to the few existing models that deal with
such head sizes. Heads are found using a Viola-Jones face detector. A binary feature vector, drawn
horizontally from approximately the centre of the eyes to the tip of the nose, represents skin pixels
as a bubble against non-skin pixels. The feature vector is classified by a previously trained support
vector machine containing support vectors for three generalised head poses: left, right and centre.
Designed specifically to gauge interest in advertisement hoardings, the model is complemented with
a regional interest gauge to measure interest in specific objects. A hoarding is divided into nine
regions, and five discretised face kernel templates classify the region of interest through a
combination of two or more classifications. The “spirit-level” system performs on par with other
existing systems at 89% accuracy.
Table of Contents
Chapter 1: Introduction........................................................................................................................ 1
1.1 Motivation ................................................................................................................................. 1
1.2 Project Overview....................................................................................................................... 1
1.2.1 Aim and Objective ............................................................................................................. 1
1.2.3 Requirements ..................................................................................................................... 2
1.2.4 Enhancements .................................................................................................................... 2
1.2.5 Deliverables ....................................................................................................................... 2
1.3 Scope ......................................................................................................................................... 2
1.4 Methodology ............................................................................................................................. 4
1.4.1 Outline Description and Specification ............................................................................... 4
1.4.1.1 Research Methods....................................................................................................... 5
1.4.1.2 Schedule...................................................................................................................... 5
1.4.2 Development and Validation.............................................................................................. 5
Chapter 2: Background and Previous Work ........................................................................................ 7
2.1 Literature Review: Background ................................................................................................ 7
2.2 Literature Review: Possible Solutions....................................................................................... 8
2.2.1 Determining Direction of Gaze at Close Range................................................................. 8
2.2.2 Determining Direction of Gaze at a Distance .................................................................... 9
2.2.3 Detecting the Face and Eyes ............................................................................................ 11
2.3 Literature Review: Machine Learning..................................................................................... 14
2.3.1 Regression Trees vs. Classification Trees........................................................................ 15
2.3.2 Support Vector Machines vs. Neural Networks............................................................... 15
2.3.3 Partition-Based vs. Hierarchical Clustering ..................................................................... 17
2.3.4 Mahalanobis Distance vs. Euclidean Distance................................................................. 17
Chapter 3: Model and Experiment Design......................................................................................... 19
3.1 Development Environment...................................................................................................... 19
3.2 Dataset ..................................................................................................................................... 19
3.3 Component Analysis and Testing............................................................................................ 21
3.4 Testing and Training Data ....................................................................................................... 22
Chapter 4: Experimental Feature Based Model ................................................................................. 23
4.1 Plan and Architecture .............................................................................................................. 23
4.1.1 Integrating MPT and Feature Extraction.......................................................................... 24
4.1.2 Classification.................................................................................................................... 26
4.2 Evaluation................................................................................................................................ 27
Chapter 5: Spirit-Level and Face Kernels.......................................................................................... 31
5.1 Plan and Architecture .............................................................................................................. 31
5.1.1 Initial Version .................................................................................................................. 33
5.1.2 Image Segmentation Explored ......................................................................................... 38
5.2 Spirit-Level Approach ............................................................................................................. 43
5.3 Cross-Referencing Objects With Pose Using Face Kernels .................................................... 44
5.4 Evaluation of the Final Model including Spirit-Level vs. 13 Features.................................... 46
5.5 Final Model Combined............................................................................................................ 49
Chapter 6: Evaluation Against Ground Truth.................................................................................... 51
6.1 The Set Up............................................................................................................................... 51
6.2 MPT’s Suitability .................................................................................................................... 52
6.3 Robustness............................................................................................................................... 53
6.4 DoG Estimation Ability........................................................................................................... 54
6.5 Evaluation Summary ............................................................................................................... 58
Chapter 7: Future Work ..................................................................................................................... 59
Chapter 8: Conclusion ....................................................................................................................... 60
References.......................................................................................................................................... 61
Appendix A: Reflection ..................................................................................................................... 65
Appendix B: Objectives and Deliverables Form ............................................................................... 66
Appendix C: Marking Scheme and Header Sheet ............................................................................. 67
Appendix D: Gantt Chart and Project Management .......................................................................... 70
Appendix E: CD ................................................................................................................................ 71
Appendix F: Dataset .......................................................................................................................... 72
Appendix G: 13 Feature Analysis Samples ....................................................................................... 74
Appendix H: Image Segmentation Samples ...................................................................................... 76
Appendix I: Ground Truth Evaluation Form ..................................................................................... 83
Chapter 1: Introduction

Provided here are the basis and scope of the project, to set the scene for the rest of the report.
1.1 Motivation
During 2005, an estimated £196m [1] was invested in outdoor advertising in the United Kingdom.
Competing firms in consumer markets employ advertising companies to get their messages across.
Whether advertisement hoardings, “in-game advertising” or internet advertising, success is often
gauged in sales reports. Amongst the media mentioned, hoardings are widely used. It is easy to
make a pretty picture and attach a message to it, but determining what people are looking at, which
market demographics are looking, what expressions the observers have, and how long they gaze is
data sought after by all. Gaze estimation data is essential in that it allows advertisers to see what
attracts attention most: the model in the image, the product message, the company logo, and
possibly even the location of the hoarding. Viable solutions can be adapted from
within the field of Computer Vision (CV). Benefits will include allowing firms to test their
approaches to advertising on a smaller scale, before making large commitments to clients. But while
gaze estimation has been researched extensively for close-range subjects, limited attention has been
given to outdoor situations.
1.2 Project Overview
1.2.1 Aim and Objective
This project's aim was to design, implement and evaluate a CV model to gauge whether pedestrians
are showing interest in advertisement hoardings. This was to be done by estimating their viewing
direction and duration. Hence, it primarily set out to answer the question: “is someone in the video
looking towards the hoarding or not, and if so, for how long?” The approach was to combine
state-of-the-art face and eye detection techniques with self-implemented algorithms. Though this
was only the beginning of a solution for the motivating requirements, it was a challenging task to
undertake and provides a novel platform for future work.
The educational objective here was to get “hands-on” experience in implementing CV techniques
and Machine Learning (ML) for a “real-world” problem. This is always the ideal way to learn and is
the purpose of most academic projects. The acquisition of this experience and knowledge gained
along the way was intended to be good enough to tackle several areas where CV could be used to
solve problems. This endeavour was also an exercise in project management skills: the ability to
schedule, organise, research and troubleshoot effectively.
1.2.3 Requirements
The minimum requirements provided a basis to expand on. They were:
1. Integrate off-the-shelf face and eye detection software.
2. Estimate viewing direction using a face and eye detection package augmented with novel
algorithms and/or approaches found through literature research.
3. Devise a measure of interest from viewing angle and duration of viewing.
4. Evaluate the system and document the evaluation in a report.
Since this project was an idea, i.e., without an industry representative, this was the “requirements
elicitation phase” in the “requirements engineering process” as described in [2] and [3].
1.2.4 Enhancements
There were a number of optional enhancements that could have been incorporated. These
included:
1. Incorporation of facial expression or gesture recognition.
2. Cross-reference of gaze to objects in the display to gauge level of interest in an object.
1.2.5 Deliverables
These deliverables were discussed and agreed:
1. Project report with an evaluation from ground truth, i.e. video footage from a hoarding.
2. The video used in the evaluation, which could be from a ground-level hoarding, etc.
3. Software, i.e. source code.
1.3 Scope
An elaboration on the minimum requirements and what the project was meant to do is described
here. This report’s style assumes the reader is familiar with CV. Integrating off-the-shelf face and
eye detection software involved studying whether the Machine Perception Toolbox (MPT) [29],
suggested in formal discussions, would be ideal to use. Steps similar to the Component-Based
Software Engineering (CBSE) methodology described in [2] and [3] were taken at the start to assess
this component, with careful attention to ensuring that no requirements adjustment [2] would be
required. The decision was made on the basis of: 1) whether other papers described its use; and 2)
whether it performed well enough for the algorithm under development to be evaluated. A detailed
study of all existing face detection packages was outside the scope of this project, and
non-functional aspects such as memory usage were not deciding
factors. Though the concept of emergence [3, 5] applies within such a component-based system,
exploring such details would have disrupted the schedule.
Figure 1.1 (from left to right) Right, centre and left poses (laterally inverted). Source [4]
Estimating viewing direction, requirement two, was to be dependent on the training images
available and the fact that hoardings in general vary in size, shape, and location. Thus it was decided
that three generalised limits as shown in Figure 1.1 would be the basic left, centre and right
representatives for the development of the project, and if the model correctly classified these above
a ZeroR baseline, it would be a credible proposal for the problem. It is important to note that head
pose estimation is intended mainly for pedestrians. It does not apply to anyone in a motorised
vehicle, where the face is often obscured or unclear. It is therefore more useful in countries where
target consumers generally travel on foot, such as much of Europe and the United States, than in
countries like Pakistan, where consumers are normally hidden inside automobiles. It was also not
necessarily required to work on faces with accessories such as hats or veils, but if it did this was an
advantage. Spectacles and sunglasses were considered a challenge but were included in what the
system was expected to work with. As for the machine learning methods chosen, a brief
justification is provided rather than an extensive survey. This project was meant to explore a
non-model-based approach; therefore techniques such as principal component analysis (PCA) [9]
were left out.
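For reference, the ZeroR baseline mentioned above simply predicts the most frequent class, so any useful classifier must exceed its accuracy. A minimal illustrative sketch (the function name and label values are hypothetical, not drawn from the project's data):

```python
from collections import Counter

def zero_r_accuracy(labels):
    """Accuracy achieved by always predicting the most frequent class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical test-set labels for the three generalised poses
labels = ["left", "centre", "centre", "right", "centre", "left"]
print(zero_r_accuracy(labels))  # 0.5 (3 of 6 are "centre")
```

Any pose classifier scoring above this figure on the same data is doing better than blind majority guessing.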
Requirement 3, devising an interest measure, should be interpreted in its basic sense: taking the
gaze estimation model from requirement 2 and using it to count the total seconds, or “duration”, for
which the billboard was viewed. This does not take into account how many people viewed it but
simply a “viewed or not viewed” count, per face, per frame. This was not required of all the
iterations and prototypes that were developed. Instead, only the prototype that outperforms the rest,
according to evaluation criteria discussed later, is further tuned to measure this duration in either
text form or as an image annotation.
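The per-face, per-frame count described above converts to a viewing duration once the frame rate is known. A minimal sketch of this bookkeeping (the function name and the 25 fps default are illustrative assumptions, not the project's actual code):

```python
def viewing_duration(frame_decisions, fps=25):
    """Total seconds viewed, given one "looking"/"not looking"
    decision per frame for a single tracked face."""
    looking_frames = sum(1 for d in frame_decisions if d == "looking")
    return looking_frames / fps

# 50 "looking" frames at 25 fps correspond to 2 seconds of viewing
decisions = ["looking"] * 50 + ["not looking"] * 25
print(viewing_duration(decisions))  # 2.0
```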
The further enhancements include two possibilities. The first is facial gesture or facial expression
recognition, and the second involves cross-referencing objects in the hoarding with the estimated
head pose. The latter was not meant to say precisely which object was being viewed; a regional
interest estimate sufficed instead, because object recognition or any similar approach was beyond
the scope of this project. A count is therefore incremented for each region of the hoarding as it is
viewed.
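The regional tally can be sketched as a simple accumulator over the nine hoarding regions (the region names and the per-frame classification input are hypothetical; the project's actual labelling may differ):

```python
# Nine regions of the hoarding, laid out as a 3x3 grid
REGIONS = ["top-left", "top-centre", "top-right",
           "mid-left", "mid-centre", "mid-right",
           "bottom-left", "bottom-centre", "bottom-right"]

def tally_regions(classified_frames):
    """Increment a count for each region as it is viewed."""
    counts = {region: 0 for region in REGIONS}
    for region in classified_frames:
        if region in counts:
            counts[region] += 1
    return counts

views = ["mid-centre", "mid-centre", "top-left"]
print(tally_regions(views)["mid-centre"])  # 2
```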
The deliverables included a report, software and an evaluation video. It should be clarified that a
hoarding or billboard can be at any level and is assumed to be of a size normally larger than A1;
both terms refer mainly to a large outdoor sign or advertisement [6, 7].
Therefore the evaluation video was not specific to any hoarding. In fact, since placing a camera in
front of a billboard without concealing it has the effect of making people look directly at the
camera, recording could be done from a simulated location. The software is in the form of prototype
source code and not a user-interfaced application, as specified.
1.4 Methodology
As with most projects not involving third-party stakeholders, especially academic research-oriented
ones, conventional methodologies such as the waterfall model cannot be applied, since such projects
do not usually involve sign-offs. In fact, almost all projects that are not critical systems generally
follow an adapted or modified methodology [3]. Though publications demonstrate inconsistency of
opinion as to what “prototyping” actually is [8], this methodology, with its branch of “evolutionary
prototyping”, is best suited to this project. Figure 1.2 best describes the overall methodology used.
Figure 1.2 Evolutionary development illustration. Source [2]
1.4.1 Outline Description and Specification
In an industrial project "outline description" and "specification" come from external requirements.
Here they emerged from the minimum requirements. Often during this phase, coined in [3] as
"requirements engineering", a risk analysis, feasibility study and cost benefit analysis are conducted.
This was avoided owing to the project’s nature. Instead, the schedule described in Section 1.4.1.2
was intended to take into account risks such as “feature creep” [3]. Given that this experimental
project had an iterative method of development, an industry representative was not interviewed: it
was feared that this would have led to an overly expansive approach with too many requirements
(for example, gauging what a person is thinking), which could have led to failure [2, 3, 8]. In terms
of “requirements analysis”, Chapter 2 presents an in-depth
problem and solution review that is a key step in this software process.
1.4.1.1 Research Methods
This project was centred on academia and thus it was essential to study relevant available literature
as opposed to implemented systems since the latter would have been difficult to access and appeared
not to exist. Several research papers are available on gaze and pose estimation, and face and gesture
recognition, and these were the primary source for literature research and background reading. Also,
the research was prototype specific. For example, the first prototype was based on using features
such as the eyes and the face to estimate viewing direction. For this, papers related to feature usage
were read and a few alternative techniques were examined until a suitable solution and final system
were achieved. However, the research was only a means of gaining background knowledge of the
problem. The main drive behind this project was to experiment with state-of-the-art techniques
learnt during the academic year in conjunction with an “out of the box” fresh approach. This was
necessary since the project is entirely experimental and thus far appears not to have been
implemented before. To research the problem itself, papers on billboard surveys, advertising
productivity and survey techniques were reviewed.
1.4.1.2 Schedule
The Gantt chart is shown in Appendix D. The project made considerable headway within the first
few months. The plan was devised around larger milestones, as past experience showed that it was
not possible to predict exactly how long smaller tasks would take. For example, it took longer to
determine how best to segment a face from the background, in a non-model-based way without a
spline curve fitting algorithm, than it did to code the entire final system. This was unanticipated at
the start of the project.
1.4.2 Development and Validation
Two main prototypes were developed, each with several different versions. Figure 1.3 shows the
five stages that outline this report. The first stage deals with experimentation and finding the best
combination of face and eye detection, which runs throughout the report. Stage 2 shows the first
prototype, which works with 13 features and caters to the minimum requirements. This is described
in Chapter 4. It is evaluated using the datasets collected and described in Chapter 3. The evaluation
criteria are mainly true negatives and true positives, also referred to as truth, accuracy and precision
in this report. However, as explained later, algorithm bias and other factors are also taken into
account for comparison purposes. Stage 3 shows two systems. The “spirit-level” prototype,
described in Chapter 5 (part of the final model), evolved from the feature-based approach described
in Chapter 4. It moves beyond the minimum requirements by dividing “not looking” into “left” and
“right”, and it also performs far better, as shown in Chapter 5’s evaluation. The further enhancement
of cross-referencing objects in the billboard with head pose was also successfully developed,
evaluated and combined with the final system described in Chapter 5.
Chapter 6 puts the final model described in Chapter 5 to a “ground truth” test. Several frames of
processed video were captured at 2-second intervals over the duration of the evaluation videos, shot
at various locations. These were manually evaluated and several confusion matrices were produced.
All aspects were tested. The basis for testing only the final model on outdoor footage was primarily
that most texts [2, 3, 8] agree a prototype should be executable, and that intermediate versions need
not be fully functional but should work at a simulated level. Therefore, each prototype version was
evaluated in a controlled environment, as mentioned. Only the final selection was developed further
and evaluated in Chapter 6. To ensure all deadlines were met, it was not possible to evaluate every
frame after processing, which is another reason why 2-second intervals were chosen.
Figure 1.3 Outline of main prototypes and report structure. (Diagram summary: Stage 1:
experimentation with face and eye detection, Chapters 3 to 6. Stage 2: meeting the minimum
requirements with a feature-based approach, Chapter 4. Stage 3: a novel “spirit-level” approach that
addresses the minimum requirements, adds features and performs better than that in Chapter 4;
evaluated and chosen for the final system, Chapter 5. Stage 4: a regional detector for
cross-referencing objects in hoardings with head pose, added to the final system, Chapter 5.
Stage 5: ground-truth evaluation of the final system, Chapter 6.)
The report combines various stages of the development into similar sections for the sake of brevity
and flow. For example, the dataset selection in Chapter 3 was carried out over many days: first,
several photographs were taken of a subject to test the face detection component, and after that a
test and training set was selected. In this report it has all been put into one section. The same has
been done elsewhere, to keep within the report’s page limit. Section 5.5 describes the final model.
Appendix E contains the remaining deliverables other than the report and evaluation.
Chapter 2: Background and Previous Work

A detailed problem analysis and extensive review of state-of-the-art CV techniques follows.
2.1 Literature Review: Background
Advertising expenditure within the U.K. in 2005 stood at £3,027m, with an increase of 4.4%
forecast for 2006; 6.5% of this was spent on outdoor advertisements (hoardings or billboards) [1].
Despite the adspend, research on advertising’s productivity is limited and fragmentary [10].
Long-standing regulatory debates also exist, but only a limited number of academic studies explore
why firms use the medium. Users claim that billboards have unique advantages not offered by other
media and fear that their companies would lose up to 20% of sales if billboards were banned [11].
Outdoor advertising has increased in popularity owing to its creative treatment, which aims to
provide new ways of using this traditional medium [12]. However, a study in [13] reveals that
treatments such as “smart-boards” produced the lowest level of recall.
From another perspective, the Advertising Standards Authority (ASA) prepares compliance reports
by assessing billboards and administers the British Codes of Advertising and Sales Promotion by
visiting boards and often responding to complaints. Between 1st January 2002 and 30th June 2002,
1,577 outdoor advertisements were assessed, among which a few were in breach of the codes and
had to be dealt with. If a gaze estimation system with facial gesture recognition were in place,
inappropriate advertisements could be identified earlier [14].
A non-CV application that attempted to blur the boundaries between web browsing and art making
was Jumboscope [15], based on Kerne’s “Collage Machine” described in [16]. By detecting interest
through motion sensors and touch-screen interaction, the project interactively changed the collages
shown on a large touch-screen display, rating the most popular ones. Eye tracking, a CV technique,
has been used in advertising to gauge interest in internet advertisements and improve web design; in
fact it has become a necessity, giving designers insight into the effectiveness of their websites [17].
The most relevant use of eye tracking is described in [18], where research was conducted to see
whether adolescents attend to the warnings in cigarette advertisements. The test involved 326
adolescents who viewed several cigarette advertisements with mandated and non-mandated health
warnings. It was followed by a recall test hoping to link cognitive processes to eyeball movement,
and showed that the new non-mandated warnings were the most effective.
As is evident, there does not appear to be a system in place that applies state-of-the-art CV
techniques to outdoor advertisements, and if there were one it would be extremely useful. The next
section presents
a detailed survey of solutions that were considered to solve this problem.
2.2 Literature Review: Possible Solutions
In order to develop a system to gauge pedestrian interest in advertisement hoardings, the key was to
understand how to determine gaze so as to answer the question: “are they looking at the hoarding?”
Several other applications require a good estimator of the direction of gaze, and research was
therefore carried out in each area. The most prominent of these areas in the research literature were
human-computer interfaces (HCI) and surveillance. For example, the handicapped can use gaze to
control a mouse cursor in cases where they may only be able to move their faces or eyeballs [19].
Surveillance video can be processed to answer investigative queries, such as whether a person was
being followed [25]. Various approaches have been used, but most combine head pose estimation
with eye location estimates and/or body pose, or sometimes probabilistic local descriptors to find
facial features such as the skin.
2.2.1 Determining Direction of Gaze at Close Range
The most common and basic way to estimate direction of gaze (DoG) is to find the eyes and the
face associated with them in an image, and to extract certain characteristics that are then used. Head
pose estimation and DoG in such systems depend completely on the system’s ability to track certain
facial features reliably, but feature tracking itself is error-prone unless verification and forward
estimation of higher-level information are carried out [19]. To estimate DoG in video streams,
velocity and direction of movement are sometimes used in combination. In some cases just the face
and direction of motion are used, because locating features in small, poor-quality images, where the
face may be only 20 pixels high, is difficult and eyes, for example, cannot be found.
Gee and Cipolla’s work in [20] is heavily based on facial features and uses 3D geometric
relationships between them; it has been applied to paintings to determine where the subject is
looking. They take points such as the nose tip and base, define a facial plane from the far corners of
the eyes and mouth, and from various ratios derive two main methods: one using the 3D
information provided by the position of the nose tip in the image, and another exploiting a planar
skew-symmetry [20]. While this is an excellent method for accurate gaze estimation in large
paintings, it cannot be applied to images of low quality, where it becomes extremely difficult to
track the necessary features accurately. However, their use of ratios for eye planes and other parts
of the face is built upon here, and abstractions of it can be found throughout the project.
As mentioned, significant work has been carried out in HCI to estimate DoG. In [21] Morimoto et
al. base their technique on the theory of spherical optical surfaces, using the Gullstrand model of
the eye to estimate the positions of the centres of the cornea and pupil in 3D. They propose a method
for computing the 3D position of an eye and its gaze direction from a single camera with two near
infra-red light sources. This approach allows free face motion and does not require calibration
before each user session [21]. Zelinsky and Heinzmann in [19] propose a similar method which
reconstructs the eye in 3D to account for loss of detail through illumination and distance variation.
Besides depending on the subject being close, these two systems relied on a "world"-based
coordinate system. This decision was based on Brooks' [48] views against a "sense-model-plan-act
methodology" [48]. Results of a picture-based system were far better than those of the "world"-based
coordinate system, since it avoided the "cumbersome, complex, and slow complete world modelling
approach" [48].
Other techniques include oculography; limbus, pupil and eyelid tracking; a contact-lens method;
corneal and pupil reflection relationships; Purkinje image tracking; artificial neural networks;
morphable models; and geometry [22]. Wang et al. [22] propose their "one circle" algorithm for
estimating DoG, combining projective geometry with anthropometric properties of the eyeball.
Their method differs from others in that it treats the image of the iris contour, correctly, as an
ellipse. Accuracy is improved because the method needs the image of only one eye, so greater
zooming can be applied, with certain mechanisms in place to avoid losing the eye while tracking it.
They rely heavily on estimating the ellipse of the iris contour and the edges of the iris. Their results
were found to be better than those of other non-intrusive approaches, such as the one proposed by
Zelinsky and Heinzmann in [19] [22].
As far as tracking is concerned, especially of the face's movement and flow, edge detection can be
combined with other estimators, as in [23], to understand the importance of motion information in
human-robot attention. However, none of these approaches can successfully be employed in the
task at hand, as they rely significantly on close proximity. Nor would it be enough to implement an
eyeball-rotation tracking system: even though such systems may produce accurate gaze estimates
for a subject looking within the narrow field of view allowed by eye movements alone, they cannot
cope with larger gaze shifts that involve a movement of the face [20].
2.2.2 Determining Direction of Gaze at a Distance
In [24] Voit and Nickel propose a "smart room" neural-network approach to face pose estimation
and horizontal face rotation. Their motivation was to move beyond a subject sitting in one position,
but the system is still confined to a "smart room" fitted with a camera in each of its four corners. It
appears that it could easily be extended to a marketplace or other outdoor area to suit the project at
hand. A good reason for this is that it does not require images to have clearly visible features, which
is beneficial because most head pose estimation techniques do, and are thus unsuited to distance
and the outside world. What can be problematic, however, is a crowd of people: the system would
have to track faces common to the cameras to remain accurate.
Other approaches that tackle detailed feature-based problems include that of Efros, who showed
how to distinguish between human activities such as walking and running. Efros' system compared
the gross properties of motion using a descriptor derived from frame-to-frame optic flow and
performed an exhaustive search over extensive representative data. Although the aim was not to
estimate face pose, it showed how a descriptor invariant to lighting and clothing can be useful [25].
There are many examples where trajectory information alone is used for surveillance purposes,
including work at MIT's AI Lab to monitor urban sites [25]. At the University of Leeds, Prof. David
Hogg and Dr. Hannah Dee use a Markov chain with penalties associated with state transitions; it
returns a score for observed trajectories which essentially encodes how directly a person made his
or her way toward predefined scene exits [25]. Such techniques can identify the direction of motion
and be combined with other information, as in the work of Robertson et al. [25], a remarkable and
major shift from the feature-tracking approaches, which combines head pose and trajectory
information using Bayes' rule to obtain a joint distribution [25].
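The idea of fusing two cues with Bayes' rule can be illustrated with a minimal sketch. Assuming (as a simplification, and with made-up numbers) that head pose and trajectory each yield an independent discrete distribution over 8 compass directions, the joint distribution is their elementwise product, renormalised:

```python
# Hypothetical sketch of Bayesian cue fusion in the spirit of [25]: a
# distribution over 8 discretised directions from head pose and another from
# trajectory are multiplied elementwise and renormalised. The numbers below
# are invented for illustration.

def fuse(pose_dist, traj_dist):
    """Elementwise product of two discrete distributions, renormalised."""
    joint = [p * t for p, t in zip(pose_dist, traj_dist)]
    total = sum(joint)
    return [j / total for j in joint]

# Head pose weakly favours direction 0; trajectory favours it strongly.
pose = [0.30, 0.20, 0.10, 0.05, 0.05, 0.05, 0.10, 0.15]
traj = [0.50, 0.20, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
fused = fuse(pose, traj)
best = max(range(8), key=lambda i: fused[i])
print(best)  # → 0 (the direction both cues agree on)
```

The fused distribution is sharper than either input, which is the benefit the fusion brings when one cue is ambiguous.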
This remarkable approach [25], which may be considered an "explicit" technique (discussed later)
for estimating DoG, caters to situations where the person's face is typically 20 pixels high. It uses a
feature vector based on skin detection to estimate the orientation of the face, which is discretised
into 8 orientations relative to the camera position, serving as a compass. A sampling method returns
a distribution over face poses and the general direction and face pose [25]. The first component, as
stated, is a descriptor based on skin colour. This is extracted for each face in a large training
database and labelled with one of 8 distinct head poses. The labelled database is queried to find
either a nearest-neighbour match for an unseen descriptor, or is sampled non-parametrically to
approximate a distribution over possible face poses. Skin plays the key role because the amount of
skin visible on a person's face gives a good indication of that person's DoG. To obtain this
descriptor, however, they intervene manually to a small degree. A mean-shift tracker is manually
initialised on the face, and a skin-colour histogram in RGB-space with 10 bins is defined over a
hand-selected region of one frame in the current sequence. They then compute, for every pixel in
the face image the tracker produces, the probability that it is drawn from this predefined skin
histogram: each pixel falls into a specific RGB bin and is assigned the corresponding weight, which
can be interpreted as the probability that the pixel comes from the skin model. The weighted images
therefore define their feature vectors for face orientation per frame [25]. For the matching part, they
use a binary tree in which each node divides the set of representative images below it into roughly
equal halves. Such a structure can be searched in roughly log-n time to give an approximate
nearest-neighbour result, though this was not used as it takes longer to compute. Through this they
achieve an 80% success rate. They then, as described, fuse the individual DoG estimates obtained
from direction of motion and head pose using Bayesian fusion [25]. Their approach works very well
on various video streams, and on severely distorted, tiny subjects in footage such as that of football
fields (Figure 2.1).
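The skin-histogram weighting step described above can be sketched in a few lines. This is an illustrative simplification, not the implementation in [25]: the bin quantisation, the patch and the colours below are all invented, and the histogram is assumed to use 10 bins per channel:

```python
# Illustrative sketch of skin-histogram back-projection: a normalised
# histogram is built from a hand-selected skin patch, then every pixel in a
# face window is weighted by the probability its colour falls in a skin bin.
BINS = 10  # an assumption: 10 quantisation bins per colour channel

def bin_of(pixel):
    q = 256 // BINS + 1          # step so that values 0-255 map into 10 bins
    r, g, b = pixel
    return (r // q, g // q, b // q)

def build_histogram(skin_patch):
    """Normalised colour histogram of a hand-selected skin region."""
    hist = {}
    for px in skin_patch:
        key = bin_of(px)
        hist[key] = hist.get(key, 0) + 1
    n = len(skin_patch)
    return {k: v / n for k, v in hist.items()}

def weight_image(face_pixels, hist):
    """Per-pixel skin probability: the basis of the orientation descriptor."""
    return [hist.get(bin_of(px), 0.0) for px in face_pixels]

patch = [(210, 160, 130)] * 8 + [(200, 150, 120)] * 2  # mostly one skin tone
face = [(210, 160, 130), (20, 20, 20), (200, 150, 120)]
weights = weight_image(face, build_histogram(patch))
print(weights)  # → [0.8, 0.0, 0.2]
```

Skin-coloured pixels receive high weights and background pixels receive zero, so the spatial pattern of weights varies with head orientation.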
This is indeed a state-of-the-art technique and it allowed other, similar tactics to be explored. The
algorithms implemented and described in Chapter 5 borrow from this approach but bypass areas
where it [25] cannot be used. It is assumed in [25] that the ratio of skin to non-skin pixels on a
subject's face is an invariant cue for estimating DoG. Theoretically this system should work for the
current problem, since they further combine body motion to compensate for the face kernel
matching portion of their algorithm. However, the discretised faces they show appear likely to cause
problems with bearded and veiled faces. Also, the DoG estimate covers the relatively large area
shown in Figure 2.1, which is clearly wider than a billboard, and therefore does not offer a precise
estimate. Thus a precise estimation method was required that did not assume the availability of
specifically located skin patches on the face.
Figure 2.1 System described in [25] implemented on soccer players. Source [25]
2.2.3 Detecting the Face and Eyes
The first step was to integrate components that detect the face and eyes of subjects in video
streams. Since this model did not need to work in real time and could run in a batch-processing time
frame, the priority was accuracy rather than efficiency; it was essentially the algorithm for DoG that
was the purpose of this project. The face detection problem deals with determining whether there is
any face in an image, and then returning the location and extent of each [26]. Ideally the whole
procedure must perform robustly, invariant to changes in illumination, scale and orientation [26].
This usually relies on independent decisions being made about the presence of a face within an
image, leading to a large number of evaluations, approximately 50,000 in a 320 x 240 image [27].
A very successful and widely used model for face detection is that proposed by Viola and Jones
[28], who essentially use a boosted collection of features to classify image windows, based on the
AdaBoost algorithm of Freund and Schapire [27]. A classifier comprises an interpretable set of one
or more features, each made up of a binary threshold function and a rectangular filter. The
rectangular filter is a linear function of the image: if the filter is made up of two rectangles, the sum
of the pixels in one is subtracted from the sum of the pixels in the other, and if the threshold is
exceeded a positive vote is cast; otherwise a negative one. Weights are assigned to these features in
the final classifier using a confidence-rated AdaBoost procedure: correctly labelled examples are
given a lower weight while incorrectly labelled examples are given a higher weight, and the weight
on each feature is encoded in that feature's vote, i.e. its yes or no acquires a stronger weighting after
training. A cascade of classifiers is then used to preserve efficiency, so that image windows that can
easily be rejected as non-faces are not passed on any further, while those that cannot are sent down
the hierarchy of classifiers. Each classifier further down is trained on the false positives of those
before it, and each has more features. Thus harder problems are dealt with more carefully, and the
resolution of an image window takes longer as it traverses the hierarchy [27, 28].
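The two building blocks described above, the rectangular filter and its thresholded vote, can be sketched concretely. The sketch below also uses the summed-area ("integral") image that makes Viola-Jones rectangle sums cheap; the image, feature position and threshold are invented for illustration:

```python
# Minimal sketch of Viola-Jones building blocks: an integral image, and a
# two-rectangle feature whose vote is the thresholded difference of the
# pixel sums under two adjacent rectangles.

def integral_image(img):
    """img: list of rows. Returns an (h+1) x (w+1) summed-area table."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in a rectangle via four table lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def two_rect_vote(ii, x, y, w, h, threshold):
    """Left half minus right half; positive vote if the threshold is exceeded."""
    left = rect_sum(ii, x, y, w // 2, h)
    right = rect_sum(ii, x + w // 2, y, w // 2, h)
    return 1 if (left - right) > threshold else -1

img = [[9, 9, 1, 1],
       [9, 9, 1, 1]]  # bright left half, dark right half
ii = integral_image(img)
print(two_rect_vote(ii, 0, 0, 4, 2, threshold=10))  # → 1 (strong contrast)
```

In the full detector, AdaBoost selects many such features and weights their votes; the cascade applies cheap strong classifiers first.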
Another comparable approach is that proposed by Rowley et al. The Rowley-Kanade detector uses a
multi-layer neural network trained with face and non-face prototypes at different scales; the use of
non-face appearance allows it to better differentiate the boundaries of the face class. The system
assumes a range of working image sizes starting at 20 x 20 pixels and performs a multi-scale search
on the image, and its tolerance for lateral views is configurable. This process is known to be
computationally expensive [26].
Santana et al. [26] propose a face detection and tracking system which they present as faster than
both the Rowley-Kanade and the Viola-Jones systems for face detection in video streams. They also
group face detectors into two main families: implicit and explicit [26]. Implicit face detectors search
exhaustively for a previously learned pattern at every position and at different scales of an input
image, e.g. the Rowley-Kanade detector and the Viola-Jones detector [28] [26]. Explicit face
detectors increase processing time because they take face knowledge into account explicitly,
combining cues such as colour, motion, facial geometry and appearance; examples are the
approach in [25] and the Gee and Cipolla approach in [20].
Their approach [26] combines implicit and explicit detectors in an advantageous fashion. Their
schema has two main sections: "after no detection" and "after recent detection." In the former, the
OpenCV brute-force detector, based on the Viola-Jones model, is used in combination with a local
context model which they claim works better on low-resolution images, provided the face and
shoulders are visible. This stage assumes frontal faces only, i.e. not profile views, though the face
may still be rotated. After a face is detected by this method, skin-colour detection is performed to
heuristically remove elements that are not part of the face, and an ellipse is fitted to the blob in
order to rotate it to a vertical position. The eyes are then located within the skin blob.
Since the eyes are noticeably darker than their surroundings, this is exploited as a first approach.
Secondly, a Viola-Jones based eye detector is used within the blobs, with a minimum size of 16x12
pixels; this detector scales images up if the blob is smaller than the minimum. Lastly, if all else
fails, a Viola-Jones eye-pair detector is used. This is followed by a normalisation procedure and a
pattern-matching confirmation. In the "after recent detection" section, the position, size, colour
(using a red-green normalised colour space) and patterns of the eyes and whole face are carried
over from the former section. Using a number of approaches centred on areas where faces were
previously detected, their system cascades through several steps and proves to be rather robust and
accurate [26].
They tested their system, the Viola-Jones detector and the Rowley-Kanade system on 26,338
images and showed that theirs outperforms the others: it was 2.5 times faster than the Viola-Jones
detector and 10 times faster than the Rowley-Kanade system. In terms of accuracy, however, the
margin over the Viola-Jones detector is only 2 percent [26].
Apart from the OpenCV brute-force face detector there is the one provided in the free MPT [29]. As
discussed in [26], OpenCV was slower than their system. In [30] Benoit et al. propose a system to
measure a driver's fatigue, stress and other related symptoms that could have severe consequences.
They use mpiSearch, a part of the MPT, which is a black-and-white, real-time, frontal face finder
using a Viola and Jones style approach. While mpiSearch already works close to real time on
320x200 pixel frames, they propose a means of making it 2.7 times faster. Therefore, rather than
trying to build a system like that in [26] or designing a new one from scratch, i.e. re-inventing the
wheel, the literature leads one to use the MPT. It is also wise to perform face detection on each
image rather than track a face, which is a further advantage. They [30] prefer mpiSearch but
acknowledge drawbacks, such as decreasing frame rates, which apparently do not occur in OpenCV.
Another project in which mpiSearch has been used successfully is [31], where a human-robot
interaction method is developed based on spatial aspects, such as a person's proximity, to detect an
interaction partner. They use mpiSearch for face localisation and consider it a robust face detector.
Using it, a robot measures the distance and direction toward a person based on the person's size and
face. In [31] it is stated that a face of 12x2 pixels is the minimum that mpiSearch can detect. Going
further, and as discussed in [26], eye detectors using the Viola-Jones style approach are also
available. In [30], although they use mpiSearch, they adopt another method for detecting the eyes,
because the eye detector in the MPT, called eyefinder, takes too much computing time due to its
spatial feature detection [30]. They instead rely on the fact that the eye region is the only region of
the face with both horizontal and vertical contours: applying an oriented low-pass filter for each
orientation and multiplying the results gives the area with an abundance of both horizontal and
vertical contours, and thus the eye region. With a further six operations per pixel they manage to
point to the centre of the eyes.
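The "both horizontal and vertical contours" idea can be sketched as follows. This is a rough illustration, not the filters used in [30]: central differences stand in for the oriented filters, and the toy image is invented:

```python
# Rough sketch of the idea in [30] (not their exact filters): the eye region
# has both horizontal and vertical contours, so a map of horizontal edge
# strength multiplied by a map of vertical edge strength peaks at the eyes.

def edge_maps(img):
    h, w = len(img), len(img[0])
    horiz = [[0] * w for _ in range(h)]  # strength of horizontal contours
    vert = [[0] * w for _ in range(h)]   # strength of vertical contours
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            horiz[y][x] = abs(img[y + 1][x] - img[y - 1][x])
            vert[y][x] = abs(img[y][x + 1] - img[y][x - 1])
    return horiz, vert

def eye_response(img):
    horiz, vert = edge_maps(img)
    return [[a * b for a, b in zip(hr, vr)] for hr, vr in zip(horiz, vert)]

# Toy "face": a dark 2x2 blob (the eye) on a bright background.
img = [[9, 9, 9, 9, 9],
       [9, 1, 1, 9, 9],
       [9, 1, 1, 9, 9],
       [9, 9, 9, 9, 9],
       [9, 9, 9, 9, 9]]
resp = eye_response(img)
peak, pos = max((resp[y][x], (y, x)) for y in range(5) for x in range(5))
print(peak, pos)  # the response peaks at the blob, where both contours meet
```

Regions with only one contour orientation (e.g. the mouth line or the sides of the nose) score zero in one factor, so the product suppresses them.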
Apart from those discussed, several other techniques exist for frontal, upright faces in images. In
reality, however, natural scenes contain an abundance of rotated or profile faces that are not reliably
detected. Reliable non-upright face detection was first presented in a paper by Rowley et al., who
improved their Rowley-Kanade detector by training two neural network classifiers: one to estimate
the pose of the face in the detection window, and the other a face detector. Faces are detected in
three steps: the pose of the head is estimated, the estimate is used to de-rotate the image window,
and the result is then classified by the second detector. The final detection rate is the product of the
correct classification rates of the two classifiers and is thus affected by their individual errors [27].
Viola and Jones extend their model, as mentioned above, in [28] for non-frontal faces using a two-
stage approach. First the pose of each window is estimated using a decision tree constructed from
features like those described in [9]. In the second stage, one of N pose-specific Viola-Jones
detectors is used to classify the window [27]. Once the specific detectors are trained and available,
an alternative detection process can also be tested [27]: "In this case all N detectors are evaluated
and the union of their detections are reported" [27].
The discussion so far has demonstrated that estimating DoG, especially in outdoor scenes where
subjects are far from the camera, is a major problem that has not been researched enough. It is
relevant not only to estimating interest in advertisement hoardings but also to surveillance, security
and even strategy building in sports training, e.g. by studying footballers' DoG, which helps detect
signs that anticipate an opposing team's coming moves [25]. A number of face and eye detection
techniques have also been discussed.
2.3 Literature Review: Machine Learning
Since no ideal existing solution was found for this problem, a new approach had to be devised, and
an extensive review of machine learning techniques was conducted. Problems which require the
induction of general functions from specific training examples need machine learning [33]. The
DoG component of this model is one such problem, so a number of machine learning techniques
were reviewed. The regression tree, the support vector machine and a probabilistic distribution and
density model [32], the Normal distribution, have been employed. For clustering, the K-means
algorithm has been used with the Euclidean distance measure. While it is not possible to cover all
machine learning techniques and categories [32], a brief justification of these choices follows;
detailed reviews can be found in [9], [32] and [33].
2.3.1 Regression Trees vs. Classification Trees
Numerous learning methods exist and choosing between them is often difficult [33]. Decision tree
learning is a popular method for inductive inference since it is quick and robust to noisy data. It is
well suited to problems where: 1) instances are represented by a small number of disjoint possible
values; 2) the target function has discrete output values, such as yes and no; 3) an if-then-else
solution is required; 4) the training data may contain errors; and 5) attribute values may be missing.
Classification trees are representative of this type of tree [33]. The current problem, however,
requires a tree that caters to instances with continuous values, so a regression tree has been chosen.
Regression trees are specifically designed to approximate real-valued functions rather than to
classify [37, 32, 33]. Built through "binary recursive partitioning" [37], the training data is
iteratively split into partitions so as to minimise the sum of squared deviations from the mean in the
separate parts. Once the deviations equal zero or the maximum specified size is reached, the node is
considered terminal [37]. This is very different from the classification trees ID3 and C4.5, where
the former uses information gain and the latter gain ratio to decide each split and construct the tree
[33]. Another deciding factor was that a regression tree appears to have been used successfully in
[25]. With both categories of tree, choosing the tree's depth to avoid over-fitting remains a problem
and pruning needs to be done; thus a different type of machine learning technique is also required.
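One binary recursive partitioning step can be sketched directly from the description above: among candidate thresholds on an attribute, choose the one minimising the summed squared deviation from the mean in the two parts (the data below is invented):

```python
# Minimal sketch of one "binary recursive partitioning" step [37]: pick the
# threshold on a single attribute that minimises the summed squared
# deviation from the mean in the two resulting partitions.

def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) minimising left+right squared deviation."""
    best = (None, float("inf"))
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (t, total)
    return best

# A step function: the obvious split is between x=2 and x=3.
xs = [1, 2, 3, 4]
ys = [10.0, 10.0, 20.0, 20.0]
print(best_split(xs, ys))  # → (2, 0.0)
```

A full tree applies this step recursively to each partition until the deviations reach zero or a size limit, at which point the node becomes terminal.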
2.3.2 Support Vector Machines vs. Neural Networks
Various regression-type problems can be handled by neural network architectures [37]. Neural
networks are among the most effective learning methods known. They fit well to problems with
noisy data, including data from cameras and microphones. Their accuracy is comparable to decision
trees, but they require longer training periods. They also work well with continuous values, but the
biggest problem, and the main discouraging factor, is the "black magic", so to speak, involved in
classification and training [33, 34]. Having already selected a tree-based learning method, an
instance-based method was still required, for example the K-nearest-neighbour classifier [33]. The
Naïve Bayes classifier was also an option, but it is by default meant only for numerical values and
this could have been a restriction [33].
The vision community treats classifiers as a means to an end, looking for simple, reliable and
effective techniques to get the job done. "The support vector machine (SVM) is such a technique.
This should be the first classifier you think of when you wish to build a classifier from examples
(unless the examples come from a known distribution, which hardly ever happens)" [9]. The SVM
emerged, like neural networks, from early work on perceptrons [34] [32]. With the ability to use
several kernel mapping functions or "kernel tricks", it is suited to both linearly and non-linearly
separable problems. The data to be classified in this case appears to be linearly separable, so a
linear kernel function is adopted with a least-squares method for the separating hyperplane; even if
the data is not linearly separable, the SVM is known to find a decent separation. This also appears
[9, 32] to be the default configuration for the basic SVM, though some texts [35] consider a radial
basis function its default "kernel trick". In essence, an SVM separates two groups of data with a
hyperplane that aims to maximise the distance of each point in the space from that hyperplane [9,
32, 35]. This may not always be possible, and even when it is there is, as with any other classifier, a
chance that the model will over-fit. Overfitting can be addressed by adjusting the support vectors
until the bias toward either one of the groups is closest to zero; this is another reason why the linear
function is adopted with an SVM catering to a binary data configuration. Also, because of the
support vectors, the "black magic" or "voodoo" effect described in [34] is absent, even though the
neural network and the SVM are considered comparable, or the latter superior [9]. This model-
based, rather than pattern-based (e.g. [42]), classifier falls into the predictive category of classifiers
in that it allows us to classify objects of interest given known values of other variables [32], and
thus falls within the boundaries of a fully supervised solution.
An SVM may be regarded as an instance-based learner which expends more effort in classifying
new instances than in learning from the dataset [32, 33]. This is a disadvantage, and when the
training data is large the SVM requires a lot of memory to run; in this case, however, the SVM will
have a small number of training examples. The SVM has another advantageous attribute: it can be
converted into other types of classifier. For example, using a sigmoid kernel function it can be
converted into a two-layer feed-forward neural network [35].
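The hyperplane, margin and support-vector ideas above can be made concrete with a small sketch. The hyperplane here is chosen by hand purely for illustration (it is not learned, and the data is invented); a trained SVM would find such a hyperplane automatically:

```python
# Minimal sketch of the linear SVM decision rule: points are classified by
# the sign of w.x + b, and the training points whose functional margin
# y * (w.x + b) equals 1 are the "support vectors" that pin down the
# separation. The hyperplane (w, b) is hand-picked here, not learned.

def decision(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    return 1 if decision(w, b, x) >= 0 else -1

def support_vectors(w, b, X, y, tol=1e-9):
    """Points lying exactly on the margin (functional margin == 1)."""
    return [x for x, yi in zip(X, y) if abs(yi * decision(w, b, x) - 1) < tol]

# Two linearly separable groups and a separating hyperplane.
X = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0), (5.0, 4.0)]
y = [-1, -1, 1, 1]
w, b = (0.3, 0.4), -1.3

print([classify(w, b, x) for x in X])   # → [-1, -1, 1, 1]
print(support_vectors(w, b, X, y))      # the margin-defining point(s)
```

Only the support vectors constrain the hyperplane; moving any other point (without crossing the margin) leaves the solution unchanged, which is why the learned model is compact.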
There are of course several other types of method, for example the class-conditional category, such
as Bayesian classifiers [32]. Other than the restriction posed by the first-order Bayesian classifier,
Naïve Bayes, mentioned above, these are considered better suited to the language community for
tasks such as the classification of text documents [33]. The EM algorithm is another example in a
similar category, but it is ideally suited to maximising the likelihood score function of a
probabilistic model, often a mixture model, with unobserved variables [33]. This project relied on
observation, so it did not seem suitable.
Having selected a descriptive [32] (one that summarises large data without a notion of
generalisation), inductive, tree-based approach, the regression tree, and an instance-based,
predictive approach, the SVM, a probabilistic approach was still required. This, however, was not
the primary motive for adopting the Normal distribution; we return to this in Section 2.3.4.
2.3.3 Partition-Based vs. Hierarchical Clustering
Just as machine learning is part of data mining, so is clustering [32]. In the vision community,
image pixels often have to be clustered together for segmentation; indeed this is among the most
frequently used methods of image segmentation [9]. Clustering is an unsupervised task, as
mentioned, since the training data does not state precisely what we are trying to learn; it can thus
also be classified as a descriptive [32] method. [38] and [39] provide in-depth discussions of the
possible types of clustering algorithm. For brevity, the division here is into two broad types,
hierarchical and partition-based [9, 32, 38]. The former may be further divided into divisive and
agglomerative methods, both of which are extremely memory-intensive [9, 32]. Agglomerative
clustering is a non-parametric clustering method and has been used successfully in [40]; it returns
the same result every time it is given the same input. Applied to this problem, however, it would
compare each pixel to every other pixel on each iteration, which would be extremely resource-
intensive [41]. Divisive methods suffer from similar problems. Partition-based methods are
therefore often preferred; they use "greedy" iterations to arrive at good overall representations [32].
Among these, the K-means algorithm is very popular, using Euclidean distance as its similarity
measure [9, 32, 38, 39].
K-means is an iterative improvement algorithm which starts from a randomly chosen clustering of
points. The "means" are recalculated on each iteration and the cluster centres continue to shift
while the pixels are grouped by Euclidean distance. The value of "K" determines the random seeds,
or cluster centres, at the start, and also the final number of clusters. Determining the value of "K"
can be a problem, just as predefining the best split and merge is a problem for hierarchical methods
[32, 38]. However, K-means is much faster, and the problem of defining "K" can be overcome:
with text, knowing the different parts of speech can help define the number of clusters [38];
similarly, in this case, the number of clusters can be determined through observation. Although it
seems possible to take a subset of the data and apply the agglomerative measure to it to determine
the ratio of "K" to data size, doing so is beyond the scope of this project. There is another problem
that all centroid-based algorithms face: outliers have a very strong influence on the final decision.
In [38] it is suggested that this be rectified by a medoid approach, i.e. using the median rather than
the mean.
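The K-means loop just described can be sketched in a few lines. For illustration the seeds are taken deterministically from the data rather than at random, and the points are invented:

```python
# Minimal K-means sketch with Euclidean distance: points are assigned to the
# nearest centre and the "means" are recomputed until assignments stabilise.

def dist2(p, q):
    """Squared Euclidean distance (the square root does not change ranking)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seeds=None):
    centres = list(seeds if seeds is not None else points[:k])
    assignment = None
    while True:
        new_assignment = [
            min(range(k), key=lambda c: dist2(p, centres[c])) for p in points
        ]
        if new_assignment == assignment:        # converged
            return centres, assignment
        assignment = new_assignment
        for c in range(k):                       # recompute the "means"
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels = kmeans(points, k=2)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

With these seeds the two natural groups are recovered in two iterations; with unlucky random seeds K-means can settle in a poorer local optimum, which is one reason the choice of K and seeding matters.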
There are a number of other techniques for image segmentation, using eigenvectors, for example,
or probabilistic mixture models such as that implemented in [44] (one of the reasons the Normal
distribution was chosen), but K-means sufficed for this purpose.
2.3.4 Mahalanobis Distance vs. Euclidean Distance
Euclidean distance, as mentioned, is the basic similarity measure used in the K-means clustering
algorithm. Though a number of other measures exist, such as the cosine of the angle between two
vectors for document grouping [39], it is the recommended measure for this algorithm [9, 32, 38,
39, 43].
From [33], in a plane with p1 at (x1, y1) and p2 at (x2, y2), the Euclidean distance is given by:

$\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$   (2.1)

As is evident from Equation 2.1, Euclidean distance does not take into account correlations in
p-space and assigns points simply on the basis of proximity. Mahalanobis distance, on the other
hand, does cater to these correlations and is additionally scale-invariant [32].

$f(x) = \dfrac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$   (2.2)

Equation 2.2, from [32], is the definition of a p-dimensional Normal distribution. The exponent in
this equation,

$(x-\mu)^T \Sigma^{-1} (x-\mu)$   (2.3)

is the scalar value known as the Mahalanobis distance between the data point x and the mean μ
[32], denoted as:

$r^2_{\Sigma}(x, \mu) = (x-\mu)^T \Sigma^{-1} (x-\mu)$   (2.4)
The denominator in Equation 2.2 is simply a normalising constant ensuring that the function is a
proper probability density, i.e. that it integrates to one [32]. Besides this important feature, under
the central limit theorem and fairly broad assumptions, the mean of N independent random
variables often has a Normal distribution [32]. The Normal distribution was therefore chosen as the
probabilistic model, though it was applied only where the data permitted its use.
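The contrast between Equations 2.1 and 2.3 can be illustrated numerically. For simplicity the sketch assumes a diagonal covariance, in which case the Mahalanobis distance reduces to a per-axis-scaled Euclidean distance (the values below are invented):

```python
import math

# Sketch comparing Equations 2.1 and 2.3 in two dimensions: with a diagonal
# covariance, Mahalanobis distance is Euclidean distance with each axis
# rescaled by its variance.

def euclidean(p, q):
    """Equation 2.1 generalised to p dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mahalanobis2_diag(x, mu, variances):
    """(x - mu)^T Sigma^{-1} (x - mu) for a diagonal covariance Sigma."""
    return sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, variances))

mu = (0.0, 0.0)
x = (3.0, 4.0)
print(euclidean(x, mu))                        # 5.0
print(mahalanobis2_diag(x, mu, (1.0, 1.0)))    # 25.0: identity covariance
print(mahalanobis2_diag(x, mu, (1.0, 100.0)))  # 9.16: axis 2 varies widely,
                                               # so a deviation of 4 there
                                               # counts for little
```

This scale-invariance is exactly what Euclidean distance lacks: an axis with large natural spread dominates Equation 2.1 but is discounted in Equation 2.3.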
In this way the SVM, regression tree, K-means, Euclidean Distance and the Normal distribution
were chosen as the experimental classification and clustering techniques.
Chapter 3: Model and Experiment Design
Various decisions and choices are justified and discussed in this chapter.
3.1 Development Environment
Matlab was the chosen environment for this project, since it supports rapid prototyping by
providing several built-in functions, and its image processing toolbox is ideally suited to vision
problems. The regression tree, K-means, SVM and Euclidean distance functions chosen earlier are
already provided or downloadable from the Internet, and matrices, and therefore the frames of
videos, are extremely easy to manipulate [47]. Additionally, an extensive computer vision library
[45] is available for Matlab should further functions be needed. A Normal distribution function
using the Z-score method was not available, so it was implemented using concepts in [36]. A face
and eye detection package that integrates well with Matlab was also required; as discussed in the
literature review, the MPT Beta Version 0.4b was obtained from [29].
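The Z-score method mentioned above can be sketched from first principles (the actual implementation was in Matlab and followed [36]; the sketch below is only an illustration of the idea): a value is standardised against the mean and standard deviation, and its probability is then read off the standard Normal:

```python
import math

# Illustration of the Z-score approach to Normal probabilities: standardise
# a value, then evaluate the standard Normal density or cumulative
# probability at the resulting z.

def z_score(x, mean, std):
    return (x - mean) / std

def standard_normal_pdf(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def standard_normal_cdf(z):
    """P(Z <= z) via the error function (math.erf is in the stdlib)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = z_score(110, mean=100, std=10)
print(z)                                 # 1.0
print(round(standard_normal_cdf(z), 4))  # 0.8413
```

Standardising first means only the one-dimensional standard Normal ever needs to be evaluated, whatever the mean and variance of the data.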
3.2 Dataset
This section describes the images that were used to test the face and eye detection component and
train and test the individual prototypes. The final evaluation test videos are described in Chapter 6.
Figure 3.1 13 poses; 9 in the horizontal sweep separated by approximately 22.5°, 2 above and below the
central camera and 2 in the corners of the room. Source [4]
Initially, work began on random images selected from the internet and from movies. It was also
thought that entire feature films would be ideal for training, as the subjects never look at the
camera. This was not enough, however, and an alternative was required. Among the freely available
image databases, the Pose, Illumination and Expression (PIE) database [4] from Carnegie Mellon
University was selected, as it catered to the minimum requirements and the further enhancements
described in Chapter 1. Though a number of image databases offer many subjects with significant
pose and illumination variation, this dataset provides a much wider range of illumination and pose
variation, augmented with expression variation [4]. Figures 3.1, 3.2 and 3.3 show the pose,
illumination and expression variations respectively; details are provided in [4].
Figure 3.2 Illumination variation illustration. Source [4]
There was a danger of over-fitting caused by optimising the algorithm on the test set. As long as a
model is developed that estimates DoG, however, the problem of over-fitting can be dealt with in
the future.
Figure 3.3 Expression variation illustration. Source [4]
The image set selection from the PIE database was done on the grounds of what the face and eye
detection component detected. In order to test that component a test set was required where the
background was uniform and the angle-sweep was at shorter intervals. The PIE database does not fit this requirement, so a floor plan as shown in Figure 3.4 was devised. A single subject was photographed in up to 70 different face poses. The subject positioned his face at positions 1 to 5 with the face tilted at 0° (straight on), 24° (upward) and -24° (downward). Camera A was positioned approximately 12° above the subject's eye line, and camera B roughly 12° below it. In one round with both cameras and all three face tilts, 30 images were taken with the subject not looking at the camera.
Figure 3.4 Each marking on the wall, from 1 to 9, is approximately 11.25° apart. There are two camera positions, A (12° above the subject) and B (12° below).
The procedure was repeated with the subject's eyes fixed on the lens and the same face poses as before, so that these images could supplement the PIE database if need be. With a few extra free-look sweeps (Appendix E), approximately 70 images were acquired. To cover positions 5 to 9, a horizontal image transformation was applied to all images of face poses 1 to 4, inspired by [46]. A total of 126 images was therefore obtained.
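The source calls this a horizontal image transformation inspired by [46]; assuming it is a left-right mirror, the step can be sketched in Python/NumPy (the project itself used Matlab, and `mirror_pose` is a hypothetical helper name):

```python
import numpy as np

def mirror_pose(image, pose_index, num_positions=9):
    """Flip an image left-to-right so that a face photographed at wall
    position pose_index (1..9, centre at 5) stands in for the mirrored
    position (num_positions + 1 - pose_index)."""
    mirrored = np.fliplr(image)            # reverse the column order
    return mirrored, num_positions + 1 - pose_index

# Toy 2x3 "image": flipping reverses each row; pose 1 maps to pose 9.
img = np.array([[1, 2, 3],
                [4, 5, 6]])
flipped, pose = mirror_pose(img, pose_index=1)
```

Applying this to poses 1 to 4 yields the missing poses 6 to 9, while the centre position 5 maps to itself.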
3.3 Component Analysis and Testing
MPT includes a face detector (mpiSearch), an eye detector (eyefinder), a blink detector and a colour tracker. The component tested here is the face detector, for reasons already discussed. Informal experiments to verify what the literature review stated about mpiSearch showed that, contrary to [26], this version of the MPT does have trouble with 16x12 pixel faces. However, faces 20 pixels high can easily be detected by mpiSearch, which is sufficient for this project.
Informally, it was found that while the eye detector of the MPT, “eyefinder”, is slow, it works at
real-time in areas dictated by the mpiSearch face coordinates.
MpiSearch detected the generalised subset of images shown in Figure 3.5: only 29 of the 126 images were detected, all with head poses between wall positions 3 and 7 (Figure 3.4). This result was obtained using the AdaBoost algorithm built into mpiSearch, without which even fewer faces were
detected. Also, the face tilt detected was between 5° and -5° of camera level. However, informal experiments with the eyefinder suggested that its face finding capabilities were better than mpiSearch's, but it is much slower, as discussed in Section 2.2.3. Attempts to find out from the Machine Perception Laboratory exactly why this difference existed were unsuccessful. Further experiments also showed that mpiSearch focuses on the Viola-Jones approach [28] for frontal faces rather than on other rotations and angles, as discussed in [27].
Even though only frontal faces were detected, mpiSearch was used to begin the project because the limited face-to-torso and eye movement flexibility that allows a subject to gaze at an advertising hoarding usually means a near frontal-face view. This was learnt from studying how people look at hoardings while walking by, an observation backed by video analysis; Appendix E has video clips that provide evidence of this. Also, its limitation to faces at least 20 pixels high was incorporated into the assumption that the number of cameras, from 1 to N, would be chosen according to the size of the area to be covered and their distance from the ground. As this was the most important component for discovering a solution to the DoG problem, it had to be tested in advance. Functions used later, such as the SVM and a blob finding algorithm, were instead tested during development and approved simply on the basis that they worked and gave the desired results.
Figure 3.5 A subset of the images against a cluttered background showing angles detected by mpiSearch.
3.4 Testing and Training Data
The previous section suggests that poses C05, C27 and C29 from Figure 3.1 be selected for the subject looking left, centre (“straight-on”) and right respectively. Another deciding factor was that images of pose C05 have a skin coloured background; for a skin based approach this obstacle becomes an opportunity to make the algorithm invariant to background colour. The
images were selected on the basis of variation in the subjects' physical anthropology, appearance, expression, and illumination. The training data of 73 images, with the breakdown shown as Figure F.1 in Appendix F, was selected to allow enough variation to avoid over-fitting. Some images have repeated subjects with and without spectacles. This data set was used for the feature based and skin
based prototypes described in Chapters 4 and 5 respectively. The test data of 133 images was
selected to include unseen poses not present in the training data. The “looking” images have a few
repeated subjects to test if classifiers are biased towards seen data. The “looking” and “not-looking”
test sets are shown as Figure F.2 and Figure F.3 in Appendix F.
Cross-referencing face pose with objects in the billboard or hoarding also required a dataset since
the dataset for the minimum requirements could fall under the category of “looking” and “not
looking” rather than, for example, “top-left corner”. For this purpose a training and test set was
selected of poses C05 (left), C27 (centre), C29 (right), C09 (down) and C07 (up). Between 10 and 15
subjects of each pose were selected for training and 5 of each for testing. This dataset was carefully
selected to maximise subject appearance and physical anthropology variation. This dataset is shown
as Figure F.4 and Figure F.5 in Appendix F.
Chapter 4: Experimental Feature Based Model
This chapter describes the feature based prototype and the versions of it developed to cater to the minimum requirements.
4.1 Plan and Architecture
The literature reviewed in Section 2.2 offers some interesting and novel techniques to go about the
solution to this problem. Some of those cannot be applied here while others give insight into
possibilities. Two main approaches can be chalked out, a feature based approach, and a skin based
approach. The first step taken was to adopt a feature based approach. This was the simplest way to begin and allowed a detailed study of the human face to be conducted, allowing ideas to emerge. It also avoided problems of segmentation, since, as Robertson et al. claim in [25], skin cannot be represented in any colour space. Figure 4.1 illustrates this first prototype, which caters to the
minimum requirements. Section 4.1.1 describes stage 1 and stage 2 of the diagram and Section 4.1.2
describes stage 3 and how the various machine learning techniques chosen in Section 2.3 were
experimented with. They are evaluated in Section 4.2.
Figure 4.1 Outline of the feature based prototype. All the versions share this architecture. Stage 1 involved the
expansion of face coordinates found by mpiSearch and image processing for eyefinder. The DoG component
first extracted 13 features in stage 2 and a number of classifiers were used in stage 3.
4.1.1 Integrating MPT and Feature Extraction
Figure 4.2 13 picture coordinates with respect to the x and y axes of the face box. A1 = centre of the eye plane on x (NCX) and y (NTY). A2 = mouth or upper lip on y (NBY). B3 = subject's right eye (given by eyefinder) on x (REX) and y (REY). B4 = subject's left eye on x (LEX) and y (LEY). C is the face box drawn from coordinates returned by mpiSearch. E6 (RECY) and E5 (RECX) are the intersections of the right eye, and D8 (LECX) and D9 (LECY) of the left eye, ordinates with the contour ordinates on x and y. F9 (FCY) and F10 (FCX) are approximate ordinates of the centre of the face.
As explained in Section 2.2, a picture based coordinate system is used rather than a “world” based system. The first step was to try to adapt Gee and Cipolla's work in [20] by using the eye points returned by eyefinder as a starting point to identify the tilt of the face. This was stopped in its initial stages because a new approach was required. The line running vertically down the face in Figures 4.2 and 4.3 represents this implementation, and Figure 4.3 shows why it could not be taken further. The image on the extreme right shows the subject's right eye detected much further down than it actually is, which changes the angle of the constructed tilt line. This took place often and, because of the inconsistency of the eyefinder, sometimes showed a completely opposite tilt.
Figure 4.3 The images show how face tilts produce an angle against a possible normal that may be produced
parallel to the y-axis of face box.
It was not possible to use eyeball based systems, but it was possible to use the location of the eye axis centre on the x axis. Figure 4.2 illustrates thirteen coordinates that were introduced over a period of time; they are the building blocks for this prototype and its versions. This seemed an intuitive step to take and brought several issues to light. One of these was that the face detector always centred the face bounding box on the eyes. Because of this, NCX, for example, was often the same regardless of where the person was looking. To cater to this, a novel medoid contouring approach was applied. As shown in Figure 4.2 (right), this resulted in six features. By taking the intersection of the eyes and the contour on the axes of the face bounding box, it was possible to determine where the face lay within the now enlarged bounding box. LECX (read as Left-Eye-Contour-X axis), LECY, RECX and RECY were obtained in this manner. Observation showed that the face contains a larger number of distinct isosurfaces than the rest of the image, and therefore more isolines or contours. Taking a median of the ordinates of these contours gives an approximate centre of the face; FCY and FCX are obtained in this way. A centroid, i.e. using the mean, would be susceptible to outliers. The rest of the features are self explanatory from Figure 4.2.
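The median-of-contour-ordinates idea can be illustrated with a small Python sketch (the project used Matlab; `contour_centre` and the toy points are hypothetical, standing in for the ordinates of the detected isolines):

```python
import statistics

def contour_centre(points):
    """Approximate the face centre (FCX, FCY) as the median of the
    contour point ordinates; unlike a mean-based centroid, the median
    is not dragged toward outlier contour points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return statistics.median(xs), statistics.median(ys)

# Contour points clustered around (11, 11) plus one outlier at x = 90.
pts = [(10, 10), (11, 12), (12, 11), (11, 10), (90, 11)]
fcx, fcy = contour_centre(pts)
# the mean x would be 26.8; the median stays at 11
```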
In this prototype, to speed up processing, mpiSearch defined the region in which eyefinder searched. This was done after various attempts to speed up processing, and it worked well until major problems were identified; Figure 4.4 illustrates them. To get around these problems the face box was expanded with a growth procedure.
Figure 4.4 Extreme right: eyefinder result within original face box. Centre right: eyefinder within grown face
box parameters (grown box not shown). Centre left: best possible face growth of 1 to 20 pixels for either side
depending on the image dimensions. Extreme left: face box and eyes drawn only from the eyefinder, showing that “Centre left” is a near approximation and that the eyefinder's face is a closer fit than that found by mpiSearch (extreme right). Source [49]
The growth procedure was checked on a number of images and then implemented. It grew the face box by up to 20 pixels in each direction or until the image dimensions were reached, which catered to the problem to a large extent. Another problem the growth procedure addressed was that of the eyes not being detected at all. Occasionally the eyes were missed because the eyefinder typically finds a face first and then the eyes; when the search region is already a tightly cropped face, the eyefinder cannot find a face within a face and so detects nothing, or sometimes only one eye.
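A minimal sketch of such a growth procedure (a Python illustration; the `(x1, y1, x2, y2)` box layout and helper name are assumptions):

```python
def grow_face_box(box, img_w, img_h, margin=20):
    """Expand an (x1, y1, x2, y2) face box by up to `margin` pixels on
    every side, clamping growth at the image boundary."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(img_w, x2 + margin), min(img_h, y2 + margin))

# A box near the top-left corner: growth is clipped at the image edge.
grown = grow_face_box((10, 5, 60, 70), img_w=320, img_h=240)
# → (0, 0, 80, 90)
```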
4.1.2 Classification
Training and testing were done in different stages. During training all the parameters were extracted
from the training set and were used to build a regression tree. The tree is shown in Figure 4.5. The
results were interesting and a functional model was created to cater to the minimum requirements.
Then an SVM was used with the same parameters in place of the regression tree; the evaluation in Section 4.2 describes the results. Each classification method was tried with combinations of parameters, and single parameters were also used on their own to classify all the images.
Figure 4.5 Regression tree of the top performing tree based system with 63% accuracy. 121=Yes, 110=No.
Since there was no available Normal distribution function that matched the current need, it was
implemented, but only after checking to see if it would be suitable to employ having already
encountered the problem of the face box being centred on the eyes. For this a non-parametric
density model, the histogram, was used (subset shown in Appendix G, Figure G.1 to Figure G.6).
This was inadequate because histograms do not provide a smooth estimate [32] and so the mean and
standard deviation of each parameter were plotted (Figure 4.6). As shown, the ranges suggested that a Normal or Gaussian distribution could be fitted, and so it was. The training values of each parameter were used to compute the mean and standard deviation for the Z-score formula for the “not looking” and “looking” curves, giving 26 Gaussians in total. A Z-score was computed for each of the 13 features and looked up in an “areas under the standard normal probability distribution” table implemented from [36]. A new vector was created that took a 1 if the feature probability for “yes” was higher, or a 0 if “no” was higher; the value 2 was assigned for equal probabilities. If there were more 0s in this vector than 1s then the face was classified as “not looking”, otherwise as “looking”. This offered a means to further tweak the parameters that caused problems: since they were represented by 0s and 1s, those which said 1 when they should have said 0 were removed.
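The voting scheme can be sketched as follows (a Python illustration, though the project used Matlab; `norm_cdf` stands in for the printed table from [36], and taking the tail area beyond |z| as the class membership probability is an assumption about how the table was used):

```python
import math

def norm_cdf(z):
    """Area under the standard normal curve up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def vote(features, look_params, notlook_params):
    """Per-feature vote: 1 if the 'looking' Gaussian gives the feature
    a higher probability, 0 if 'not looking' does, 2 on a tie; the
    majority of 1s versus 0s decides the label."""
    votes = []
    for x, (mu_l, sd_l), (mu_n, sd_n) in zip(features, look_params, notlook_params):
        p_l = 1.0 - norm_cdf(abs((x - mu_l) / sd_l))   # tail beyond |z|
        p_n = 1.0 - norm_cdf(abs((x - mu_n) / sd_n))
        votes.append(1 if p_l > p_n else 0 if p_n > p_l else 2)
    label = "looking" if votes.count(1) > votes.count(0) else "not looking"
    return label, votes

# Three toy features; both classes share sd = 0.1, means 0.5 vs 0.2.
label, votes = vote([0.50, 0.45, 0.10],
                    [(0.5, 0.1)] * 3, [(0.2, 0.1)] * 3)
# → label "looking", votes [1, 1, 0]
```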
Figure 4.6 Parameters plotted as ranges from μ to ± 1σ. Green = looking, Red = not looking, 1=NBY, 2=
NTY, 3=NCX, 4=LEX, 5= REX, 6=LEY, 7= REY, 8=LECX, 9=LECY, 10= RECX, 11= RECY, 12= FCX,
13=FCY.
4.2 Evaluation
The various versions of this prototype were evaluated by constructing confusion matrices for each. While it was not possible to present all the confusion matrices here, a summary of true positives and negatives, and total positives and negatives, is given; the latter two measures provide a bias estimate. This was especially important for this prototype since the degrees of freedom were quite high, that is, the ratio of parameters to test images was high. The Zero R baseline was used: if, for example, there were fifty positive images and thirty negative images, the baseline becomes the positive images divided by the total number. Confusion matrices for the two top performing versions are given in Tables 4.1 and 4.2.
Truth \ Classification    Looking    Not Looking
Looking                        48             16
Not Looking                    27             25
Table 4.1 Confusion matrix for the top performing regression based version with 63% accuracy.
Truth \ Classification    Looking    Not Looking
Looking                        44             20
Not Looking                    24             28
Table 4.2 Confusion matrix for the top performing SVM version with 62% accuracy.
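As a sketch of how these summary measures follow from a confusion matrix, the Table 4.1 counts give the accuracy, the bias reading (total “looking” versus “not looking” classifications) and the Zero R baseline (a Python illustration; the `(truth, classified)` key layout and the `summarise` helper are assumptions):

```python
def summarise(cm):
    """cm maps (truth, classified) -> count, with "L" = looking and
    "N" = not looking. Returns accuracy, total "looking" and "not
    looking" classifications (the bias reading), and the Zero R
    baseline (always predict the majority class)."""
    total = sum(cm.values())
    correct = cm[("L", "L")] + cm[("N", "N")]
    classified_yes = cm[("L", "L")] + cm[("N", "L")]
    actual_yes = cm[("L", "L")] + cm[("L", "N")]
    baseline = max(actual_yes, total - actual_yes) / total
    return correct / total, classified_yes, total - classified_yes, baseline

# Table 4.1: 64 truly "looking" images (48 classified L, 16 N) and
# 52 truly "not looking" (27 classified L, 25 N).
cm = {("L", "L"): 48, ("L", "N"): 16, ("N", "L"): 27, ("N", "N"): 25}
acc, yes, no, base = summarise(cm)
# → acc ≈ 0.63, yes = 75, no = 41, base ≈ 0.55
```

With 64 positive images out of 116, the computed baseline matches the 55% figure quoted for this prototype.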
Figure 4.7 shows 11 versions of this feature based prototype, constructed through in-depth parameter tweaking and trying up to 5 different combinations for each of the 3 classifier-versions. Beside each accuracy bar in Figure 4.7 are the total positive and total negative bars from the confusion matrices. The difference between these, together with accuracy, is the key to determining the best version, rather than accuracy alone. This was possible since the numbers of “looking” and “not looking” images in the test set were almost equal. “PAT”, or “probabilistic version - all features together”, had the lowest accuracy and the greatest bias toward “no”. The SVM based systems had the lowest bias and the regression based system the highest accuracy.
           SAABT  ST3T  ST3LBT  SAABLBT  SAT  RAABT  RT3T  RT3LBT  RAT2T  RAT  PAT
Yes           59    48      58       69   64     75    66      66     56   72   23
Accuracy      59    53      59       62   61     63    60      56     59   62   42
No            58    58      62       54   58     48    54      44     62   50   69
Figure 4.7 11 selected versions of the feature based prototype plotted with accuracy (precision) and bias.
KEY: PREFIX: P=Probability R=Regression S=SVM, AABT - all above baseline together, T3T - top 3 via
accuracy together, T3LBT- top 3 least bias and above baseline together, AABLBT- all above baseline least
bias together, AT - All features together, AT2T-All rounder top 2.
As the positive, or looking, images are larger in number, the baseline is 55% for this prototype. Both versions have clearly passed the baseline, thus providing plausible solutions to the DoG problem. The matrices shown represent the results of the entire test set on the SVM based version and the regression tree version, both with the least biased parameters above baseline. Note that the test set is much larger than 116 images, but mpiSearch has trouble detecting faces in many of them; this is discussed again later in this section. The result of the top performing probabilistic version is shown in Table 4.3.
Truth \ Classification    Looking    Not Looking    Equally Likely
Looking                        14             46                 1
Not Looking                    16             35                 1
Table 4.3 Confusion matrix for the top performing Gaussian version with 42% accuracy.
Table 4.3 shows a major problem with the probabilistic model: equal probabilities, caused by vectors with equal votes. While a joint probability might have solved this, at this stage the project was more a hunt for the ideal solution than a retreat to a possible solution with such low accuracy. Individual parameters were used to classify images, and that is how it was established which to put into the combinations. Figures 4.8 to 4.10 illustrate the confusion matrix results for each of the different classifier-versions.
           NCX-FCX  NCY-FCY  Combined  NBY  NTY  NCX  LEX  LEY  REX  REY  LECX  LECY  RECX  RECY  FCX  FCY
Yes             56       50        69   41   42   52   39   59   61   72    66    56    36    73   67   72
Accuracy        44       56        47   51   42   56   47   56   62   56    56    59    54    59   53   55
No              29       63        21   63   42   62   56   52   63   37    44    63    77    42   35   35
Figure 4.8 All features using the regression tree.
           NBY  NTY  NCX  LEX  LEY  REX  REY  LECX  LECY  RECX  RECY  FCX  FCY
Yes         61   48   48   61   64   63   60    53    55    77    22   69   59
Accuracy    59   55   43   61   58   60   55    53    51    64    44   63   59
No          56   63   40   62   50   58   56    54    46    48    71   56   58
Figure 4.9 All features using the SVM.
           NBY  NTY  NCX  LEX  LEY  REX  REY  LECX  LECY  RECX  RECY  FCX  FCY
Yes         30   33   48   28   62   56   39    39    31    34    66   36   51
Accuracy    36   46   43   30   50   44   47    47    45    45    56   43   53
No          46   66   40   40   39   46   56    60    68    60    44   57   61
Figure 4.10 All features using Gaussians.
During the experiments it was found that the regression tree misclassified instances it had already seen. This is a good sign, showing that the tree summarises information rather than reconstructing it. It is evident that the SVM and the regression tree are the top performers. However, all the SVM approaches had a lower bias, and from Figure 4.7 one of the top performing SVM versions was only 1% lower than the top performing regression tree version. The top performing single features were also obtained with the SVM: RECX (Right-Eye-Contour-X axis) at 64% accuracy and FCX (Face-Centre-X axis, from the contour) at 63%. All this indicates that the SVM is an excellent classifier to use. It also provides a bias reading after training and can thus be tweaked to bring that reading as close to zero as possible. The regression tree would require pruning, a tedious and delicate process, and the Gaussian based versions suffer from equal probabilities besides poor performance.
Therefore, after 15 different versions, 5 for each classifier-version, the SVM with all the least biased features above the baseline was chosen as the top system from the feature based prototype, to serve as an interest level gauge for any resolution of face that mpiSearch and eyefinder can detect.
Chapter 5: Spirit-Level and Face Kernels
Following is a description of the final model's prototype and its various versions. These were developed to move beyond the minimum requirements catered to by the SVM feature based approach (Chapter 4) and to incorporate the gaze-to-object cross-referencing enhancement.
5.1 Plan and Architecture
Figure 5.1 Outline of the final model. Stage 1 involved the correction of face and eye coordinates using
several ratios and subsampling. The “spirit-level” model is presented in stage 2 which goes further than the
feature based model in classification and accuracy. Stage 3 is the regional interest detector.
The final model can be broken into 3 main stages, as shown in Figure 5.1: the face and eye detection package; the main DoG component, which classifies “not looking” specifically as either “left” or “right”; and the regional interest detector, which operates when the subject's gaze is “straight-on” (or “centre”). Although the feature based prototype did work with “left”, “right” and “straight-on” input and output, it was not specifically designed to do so. The final version of stage 2 is described in Section 5.2; its initial version, and the reasons for the extensive image segmentation techniques developed, are described in Section 5.1.1, with the techniques themselves in Section 5.1.2. Stage 3, the regional interest detection enhancement, is described in Section 5.3. It was not feasible to develop the facial expression and gesture recognition system, since the heads were very small and the project schedule did not allow it. Section 5.4 contains an evaluation of the various versions of the two main components, together with a comparison between the best “spirit-level” version and the SVM feature based approach described in Chapter 4. In Section 5.5 the components are put together and the final model's interface is explained.
Before discussing the new approach it should be noted that mpiSearch was replaced by eyefinder, because mpiSearch had trouble detecting faces while eyefinder detected many more. Running eyefinder within the mpiSearch face box, as discussed in Chapter 4, was prone to precision problems, and what was not discussed earlier is that mpiSearch caused segmentation violations more frequently than eyefinder. The precision issue greatly affected overall performance, as this approach relies on a pin-point accurate segment of the face. The segmentation violations usually accompanied memory mismanagement; although that problem lay with the development environment (and remained unexplained by MathWorks and the Machine Perception Laboratory), the slower eyefinder reduced its frequency. Since the main DoG algorithm was under development and the face detector was only a means to test it, the speed decrease was not thought to be a major problem. The only difficulty introduced by switching was in comparing the two main prototypes: the source data-set remained the same, but the detected sub-sets were now different, because mpiSearch and eyefinder did not detect all the subjects and eyefinder detected more. With confusion matrices in place, however, problems of comparison were minimised, and since the only major difference was the detection of a few more faces the comparison was not greatly affected.
Figure 5.2 A subset of possible pedestrian appearances that this algorithm should cater to. Face segments
shown below each subject.
As discussed in Section 2.2, Robertson et al.'s model in [25] could not be applied as it does not give precise gaze estimation. A model was also sought that moved beyond the one proposed in Chapter 4, with improved performance and estimation quality. With this new prototype it was possible to determine, more precisely than with the feature based model, where the subject was looking. It therefore became possible to tell not only whether people are looking at the hoarding, but whether it should be in a different position, or whether people are paying more attention to another hoarding and, if so, why.
While observing various images of people and video footage it was realised that there is one section or segment of the face that is usually visible, shown in Figure 5.2. As shown, regardless of whether the person has a beard or whether the face is covered, the skin is visible in this segment. Segment “f” of the lady wearing a burqa (i.e., full veil) shows an abundance of cloth. As discussed later, skin pixels were assumed to be those clustered pixels (i.e., encoded with a cluster number) that occur most between the mouth plane and the eye plane; a detailed explanation of these planes may be found in [20]. Therefore, if an eye detector found the eyes, the section could be drawn and the cloth would act in place of the skin in that region. But this requires a face and eye detector superior to eyefinder and mpiSearch, one which does not need a visible face and works from the eyes alone. Nevertheless, all the prototypes and versions developed, especially the final proposed model, are ready to work with any face and eye detection package. This is demonstrated in Chapter 6, where robustness is evaluated. The following sections describe the steps that lead up to the final “spirit-level” approach.
5.1.1 Initial Version
The initial version took a larger section centred on the tip of the nose, treating it as the third quarter from the top of the face coordinates returned by the eyefinder. Its results, however, motivated further refinements and the resulting novel algorithm. Work first began with RGB images that were converted to greyscale and then standardised using the median of the pixels, since the median is less affected by outliers [43, 38]. K-means was also used with a medoid approach [38] for the same reason. Regional clustering was done to obtain a mean skin cluster value and a mean background cluster value over the training images. Besides searching for a way to segment the face from the background, this also served to verify Robertson et al.'s claim in [25] that skin cannot be represented as, or in, any colour space. Results for non-regional clustering are shown in Figure 5.4, with a 25 x 18 pixel face. The face was subsampled, taking every second pixel, to remove extra information, including background detail similar to that of the skin region. Subsampling and region based clustering were intuitive steps and, since they gave good results, they were accepted. Figure 5.5 shows segments obtained from an image that was not subsampled and from the same image subsampled. (This figure is the result of the final model, but it still illustrates the advantages of subsampling.)
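The standardisation and subsampling steps can be sketched together (a Python/NumPy illustration, though the project used Matlab; subtracting the pixel median is an assumed form of the median-based standardisation):

```python
import numpy as np

def standardise_and_subsample(grey):
    """Shift a greyscale image by its median pixel value (the median is
    less affected by outliers than the mean) and keep every second
    pixel in each direction, discarding fine background detail."""
    centred = grey - np.median(grey)
    return centred[::2, ::2]

img = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
small = standardise_and_subsample(img)
# → 2x2 array: rows/columns 0 and 2, shifted by the median 7.5
```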
Figure 5.4 Left: K-means with 3 clusters, lightest coloured region is skin. Right: original image.
Figure 5.5 Subsampled face of 30 pixels high and original with segments shown.
As opposed to Figure 5.4, Figure 5.6 shows the result of applying K-means in a regional way, i.e.
separately to the area between the eye plane and the mouth plane.
Figure 5.6 Left: K-means (medoid) with 2 clusters, red indicating skin. Right: original image.
While this image shows a good segmentation, clustering of this sort was not resource efficient, at
least not in the RGB colour space. Figure 5.7 shows the similarity between the skin coloured
background and the face in a standardised greyscale image converted from an RGB image.
50 100 150 200 250 300
1020304050
50 100 150 200 250 300
1020304050
Figure 5.7 Top: Similarity between face and background toward left of image. Bottom: poor skin and non-
skin segmentation.
The result in Figure 5.6 encouraged a new stance on colour segmentation. Linear colour spaces include CMYK, CIE XYZ and RGB. RGB is the most common of these and was created for practical reasons, using single-wavelength primaries. But in all linear colour spaces the coordinates of a colour do not always encode the properties that matter in most applications, and the individual coordinates do not capture human intuitions about the topology of colours [9]. In other words, the RGB colour space does not represent colour in terms of hue, saturation and value (HSV) as non-linear colour spaces do. It was also observed that standardisation, whether by mean or median, did not give the desired results; some normalisation is necessary since brightness in an RGB image varies with scale out from the origin. If the image were in HSV format, the “value” component could simply be removed, leaving saturation and hue to help segment the various objects in the scene.
Figure 5.8 Bottom left: Normal distribution resultant. Bottom right: Nearest match using Euclidean distance.
Simply removing “value” and using hue and saturation did not work well either. Until then, Euclidean distance had been used as the encoding scheme, with the mean values of the skin and non-skin prototype clusters as the encoding criteria. To try an alternative, Euclidean distance was first used to decide which pixels were skin and which were not, and the standard deviation and mean of the skin and non-skin pixels were then used to compare the areas under the skin and non-skin Gaussians, assigning each pixel to the class with the higher probability. Gaussian smoothing was also applied before segmentation in an attempt to improve the cluster assignment. The difference is shown in Figure 5.8.
An isotropic low-pass Gaussian smoothing was applied to blur images and remove detail and noise [9]. For each pixel it outputs an average of the pixel's neighbourhood weighted towards the central pixels, as opposed to, for example, the mean filter's uniformly weighted average. Thus it provided gentler smoothing and preserved edges [50, 51].
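The two pixel-encoding schemes can be contrasted on a single intensity value (a Python sketch with invented numbers; the real system worked on colour clusters):

```python
import math

def assign_euclidean(pixel, skin_mean, bg_mean):
    """Nearest-cluster-mean labelling (the Euclidean encoding scheme)."""
    return "skin" if abs(pixel - skin_mean) < abs(pixel - bg_mean) else "bg"

def assign_gaussian(pixel, skin_mu, skin_sd, bg_mu, bg_sd):
    """Label by the higher Gaussian density, so a tight background
    cluster can reject a pixel the plain distance measure would keep."""
    def pdf(x, mu, sd):
        return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return "skin" if pdf(pixel, skin_mu, skin_sd) > pdf(pixel, bg_mu, bg_sd) else "bg"

# A pixel nearer the background mean: Euclidean distance labels it
# background, while a wide skin Gaussian against a narrow background
# Gaussian labels it skin.
e = assign_euclidean(120, skin_mean=100, bg_mean=130)
g = assign_gaussian(120, skin_mu=100, skin_sd=30, bg_mu=130, bg_sd=4)
# → e = "bg", g = "skin"
```

This illustrates why the Gaussian variant can give a closer fit than plain distance, at the cost of extra computation and possible ties.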
There was a slight difference between the results of the Euclidean distance measure and the Mahalanobis-style variation (Normal distribution). The latter took longer to compute and was prone to equal probabilities, but it provided a closer fit. Figure 5.9 shows the difference in the cross section.
Figure 5.9 Centre image: The Gaussian pixel encoding scheme draws nearer to the actual face than the
Euclidean distance encoding scheme (bottom) does.
The model was then extended to construct binary feature vectors such as those used by Dance et al. in [52]; this is step 12 in the outline presented in Section 5.2. The face segment was divided into a histogram of 12 bins; the skin pixels in each bin were counted and, if they exceeded the mean count, a 1 was assigned, otherwise a 0.
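The binning step can be sketched as follows (a Python illustration; `binary_vector` is a hypothetical helper taking per-column skin-pixel counts across the face strip):

```python
def binary_vector(col_counts, n_bins=12):
    """Split per-column skin-pixel counts into n_bins equal bins, sum
    the skin pixels in each, and emit 1 where a bin exceeds the mean
    bin count, else 0."""
    width = len(col_counts)
    counts = [sum(col_counts[i * width // n_bins:(i + 1) * width // n_bins])
              for i in range(n_bins)]
    mean = sum(counts) / n_bins
    return [1 if c > mean else 0 for c in counts]

# A centred face: skin in the middle 16 of 24 columns, none at the edges.
vec = binary_vector([0] * 4 + [5] * 16 + [0] * 4)
# → [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], matching the "S" row of Table 5.1
```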
S (straight or centre)   0 0 1 1 1 1 1 1 1 1 0 0
R (right)                0 0 0 0 0 1 1 1 1 1 1 1
L (left)                 1 1 1 0 0 0 0 0 0 0 1 1
Table 5.1 Feature vectors for left, right and centre face poses.
As shown in Table 5.1, the feature vectors for straight and right were well defined, but that for left was not. This was because the pixels of images with a skin-like background behind the face were incorrectly labelled: the background was confused with skin and, due to illumination variation, the actual faces were not considered to have many skin pixels. The “left” training images, where the person looks towards the right of the image plane, were all affected by this. The result adds further weight to Robertson et al.'s claim in [25] that skin cannot be represented in any specific region of colour space. Here the pixels were extracted as intensities rather than colour and represented as two Gaussians, one for skin and one for non-skin, each with one standard deviation from the mean; although no colour was involved, it appears that even a system based on probabilities of varying intensities cannot work.
Testing (in Section 5.4) revealed that, like the feature based approach described in Chapter 4, images from the “not looking” test set with poses C09 and C07 (see Figure 3.1 in Section 3.2) were mistaken for a “straight-on” face. To cater to this a vertical section was tried, as shown in Figure 5.10. This variation counted skin pixels in bins of rows moving vertically along the length of the face centre, incorporating a relationship between the hair and the face, and was to be merged with the feature vector for the third quarter of the face and the background, which had given good accuracy. However, it was not certain how to represent this data (beards, for example) or how to make the sections representative of their classes, and the skin-versus-background confusion propagated here as well. The findings led to the disappointing conclusion that a vertical measure of skin bins, a transpose of this prototype, was highly unlikely to work.
Figure 5.10 Vertical section attempt to cater to face tilt with a centred head pose.
The motivation behind taking a vertical cross section and dividing it into 12 bin-rows was that if a
person is looking straight on, there will be more hair in the top bins than if a person was looking
upwards. Also, if a person tilted his face downward, the number of hair pixels or non-skin pixels
would increase and this would not be classified as a “looking” image. It was also observed that
subjects looking up at a camera would have more neck showing, and so there would be more skin at
the bottom of the feature vector. Changing the width of this extracted portion also did not improve
the performance, as the distinction between left, right and straight-on images was also reduced. The
best possible way to cater to this was considered to be face kernels or a similar system like that
implemented in [25]. These are explored later in Section 5.3.
Returning now to the problem of face segmentation: possibilities existed to tweak the face section
by assuming that pixels in the corners were background. There were potential problems with this. In
an unsupervised environment a self-corrector would not be possible. One could not simply assert
that the pixels in the centre are skin pixels and those in a strip on the extreme left are background
since this algorithm works by expanding from the centre of the face. In the case of the faces looking
right, all the skin pixels were in the right side portion of the feature vector but there was little on the
left and so the 1’s started from the extreme right. Determining how much to grow the segment or
selection and what the number of bins should be was a problem for this vertical section. If the
subject was standing next to someone else then their skin may be included in the region. People
standing behind a subject could also cause a problem. However it is assumed that a person behind
the subject would have different skin pixel intensity values, through differing depth in the field of view and shadows
cast by the subject in front. Therefore growth of the face or head box was restricted to be within a
possible shoulder distance boundary. The number of bins, 12, was chosen keeping in mind the eyefinder's limit of detecting faces no less than approximately 15 pixels wide. A number of techniques were tried to segment the face from the background, and they follow.
5.1.2 Image Segmentation Explored
Figure 5.11 Main image segmentation techniques developed. Column Key: 1 = subject subset, 2 = Euclidean
distance, 3 = 5 regional Gaussians, 4 = 6 regional Gaussians with joint probability, 5 = 6 regional Gaussians in
joint probability represented by regional pixels rather than clustered pixels, 6 = Adaptive Background method,
7 = texture segmentation, 8 = subsampled texture segmentation, 9 = ellipse on texture approach, 10 = final
saturation based segmentation.
It has already been established that there is no known generic scheme that may be stored and applied
to the face and background segmentation problem. While [40] and [52] offer state-of-the-art scene
segmentation and object recognition approaches respectively, it was beyond the scope of this project
to implement them. A number of different image segmentation techniques were explored and they
have been discussed here. Figure 5.11 offers a subset of original images, column 1, and results from
some of these techniques, columns 2 to 10. Column 2 shows the result of applying a regional based
segmentation to each image. This is done by applying K-means and Euclidean distance to the area
around the face and then to the area considered to be the face. Such approximations are done by
calculating ratios, inspired by [20]. For example, since there is a maximum face pan-range for left
and right looking subjects that eyefinder can detect, a multiplier of 4.5 to the length of the eye plane
of that face gives the constant that can be added and subtracted from the x ordinate of the centre of
the eye plane. Identical steps can be used for the height of the face region except that the multiplier
is 2.2 for the eye plane to obtain the y addition constant. Hence an approximate face area was
obtained. There were a total of 3 background clusters and 2 foreground clusters. Figure H.1 in
Appendix H shows the results of this segmentation technique. This was not adequate and so a series of Gaussians was applied, 3 for the background and 2 for the foreground, i.e. as an encoding
scheme in place of Euclidean distance. The result is in column 3 and Figure H.2. The problem of
equal probabilities was the main underlying drawback which was catered to by having an equal
number of Gaussians, i.e. 3 for the background and 3 for the foreground. The results are in Figure
H.3 and column 4. While certain problems of the previous method had been overcome, both methods were inferior to the Euclidean distance method, which itself performed poorly on skin coloured backgrounds.
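The ratio-based face approximation described above, with the quoted multipliers 4.5 and 2.2 applied to the eye-plane length, can be sketched as follows; the eye-coordinate format and the function name are illustrative assumptions, not the project's code:

```python
def face_region(eye_left, eye_right, x_mult=4.5, y_mult=2.2):
    """Approximate a face region from the two eye centres. The eye plane's
    length, scaled by the quoted multipliers, gives the constants added to
    and subtracted from the ordinates of the eye-plane centre."""
    (x1, y1), (x2, y2) = eye_left, eye_right
    eyeplane_length = ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    dx = x_mult * eyeplane_length      # x addition constant
    dy = y_mult * eyeplane_length      # y addition constant
    return (cx - dx, cy - dy, cx + dx, cy + dy)

# Illustrative: eyes 20 pixels apart, centred at (110, 100).
box = face_region((100.0, 100.0), (120.0, 100.0))
```

The returned tuple is an (x_min, y_min, x_max, y_max) bounding region around the eye-plane centre.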
Column 5 and Figure H.4 show the results of taking all the background pixels of an image and using
them separately to define the representative Gaussian for their region and taking the foreground
pixels and treating them in the same way, avoiding clustering entirely. Column 6 shows the result of
an adaptive background Gaussian approach where any pixel considered to be part of the background
alters the shape of the background Gaussian inspired by Stauffer and Grimson’s approach in [44].
Figure H.6 shows its result after a few iterations. As is noticeable the problem of segmentation was
still not solved. From successful work in [51] with the RoSo2 algorithm, texture segmentation was
then implemented. The convolution kernels used in that algorithm were a Robert Cross 1st derivative
and a Sobel pair from [50]. Here, after considerable experimentation, an unusual set of kernels, including the Sobel pair, was settled on, which segmented best, as shown in Column 7 (Figure 5.11) and
Figure H.7 (Appendix H). Using texture is known to be a good segmentation technique. Here, edge
detection was performed in the spatial domain by correlating the image with the kernels in Table
5.2. High and low pass filters were used to manage intensities. This activity in the spatial domain
was believed to give better results though the frequency domain, e.g. Fourier domain, is much faster
for computation [50]. Squaring was done after this stage so that black to white transitions count the
same as white to black transitions [9, 51, 53]. Smoothing was then done to estimate the mean of the squared filter outputs, followed by a threshold acting as a high pass or low
pass filter. Each correlation has its own tweaked standard deviation and Gaussian smoothing
window size for it to function as desired. Finally the K-means clustering algorithm was applied to
summarise the segmentation results.
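The correlate-square-smooth stage of this texture pipeline can be sketched in plain NumPy; a box filter stands in for the per-kernel Gaussian smoothing described above, and the final K-means summarisation is omitted. All names are illustrative:

```python
import numpy as np

def correlate2d(img, kernel):
    """Valid-mode 2-D correlation (no padding), plain NumPy."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def texture_features(img, kernels, smooth=3):
    """Correlate with each kernel, square the response so black-to-white and
    white-to-black transitions count alike, then smooth to estimate the mean
    of the squared output (box filter standing in for the Gaussian)."""
    box = np.ones((smooth, smooth)) / smooth ** 2
    feats = [correlate2d(correlate2d(img, k) ** 2, box) for k in kernels]
    h = min(f.shape[0] for f in feats)
    w = min(f.shape[1] for f in feats)
    # Crop to a common size and stack into per-pixel feature vectors,
    # ready for K-means clustering.
    return np.stack([f[:h, :w] for f in feats], axis=-1)

# Toy image with a vertical step edge, probed with a standard Sobel kernel.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
sobel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
feats = texture_features(img, [sobel])
```

The feature map responds most strongly near the step edge, which is the cue the K-means stage then clusters on.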
Diagonal Filter 1:       +2 -2 +1 / -2 +2  0 / -1 -2 -1
Sobel 2nd derivative:    +2 +1 -1 /  0  0 -2 / -1  0 +1
Sobel 1st derivative:     0 +1 -1 /  0 +2 +1 / +1 +1 -1
Diagonal Filter 2:       +1 +1 +1 / -1 +1 -1 / -1  0 +1
Robert Cross variation:   0 -1 / +1  0
Table 5.2 Edge detection kernels for texture segmentation.
While the results were good, computation was slow, and so an approach was tried where
skin, background and hair images (Appendix E) were pre-processed using this algorithm and stored
as prototypes. Since there were 5 kernels and up to 20 clusters allowed it was hoped that there
would be a set of unique clusters for each image type, but most were identical. This adds another finding to Robertson et al.'s claim in [25] about representing skin: that even texture, which varies for
every object, cannot represent skin in a predefined manner. However this problem was dealt with by
encoding on a regional basis and then removing the most frequent prototype, for example, for skin
from the background set. Finally a small set of prototypes was obtained, separate for each image.
The best result obtained was similar to that achieved earlier.
Having experimented with probabilistic density models, distance measures and texture segmentation, attention turned to the use of Hough circles for this task, described in [9]. Inspired by [24] this
was applied to edges found by taking the gradient magnitude, its Laplacian, zero crossings, and then
points from the zero crossing with high gradient magnitude. This was done by a function found in
[45]. After extensive parameter experimentation to see what standard deviation and zero crossing threshold should be used, it was found that no threshold and a standard deviation of 8 were ideal for normal images (Appendix H: Figure H.12), while smoothing of 15 was ideal for texture
segmented images (Appendix H: Figure H.11). Since the Hough transform did not work an ellipse
was required that would fit around these edge points. A least squares ellipse fitting algorithm [54]
described in [55] was used and modified. An ellipse drawing function [56] was used and modified in
places alongside the ellipse fitting algorithm to visualise the ellipses.
The textured images, since they had their edge points far apart, provided a bias for the ellipse to fit
around well (Figure H.8). The original images required a bounding frame to be made by padding the
images to act as a bias otherwise the ellipse would not fit well. However there was a problem of the
ellipses not fitting well in general for the normal images for which a number of equations had to be
devised through observation. While several face drawing books discuss rule-of-thumb
measurements when getting a human face with the correct proportions down on paper, it was time to
move beyond them as they had already been adopted throughout this project. The ellipse size ratio
was worked out with the following devised formulae where xf and yf are the required values:
face_width=xf × eyeplane_length (5.1)
face_height=yf × eyeplane_length (5.2)
To adjust the face ellipse height, yh in the following equation was required.
face_height = yh × eyeplane_length (5.3)
The equations need to be solved for xf and yf, which determine the width ratio and height ratio, respectively, of the face to the length of the eye plane. Images that had a good fit, such as Figure
5.12 were used to find the missing ratios in the equations. There were other formulae as well that
helped move the ellipse to the left and right to centre it exactly on the face. The following ratios, Rx and Ry, give the position of the centre of the ellipse with respect to an approximate face pose.
Rx = X.centre_training ÷ Skin_Count_Difference_training (5.4.a)
Ry = Y.centre_training ÷ Skin_Count_Difference_training (5.4.b)
Skin_Count_Difference = RightSkinCount − LeftSkinCount (5.5)
X.centre_test = Rx × Skin_Count_Difference_test (5.6.a)
Y.centre_test = Ry × Skin_Count_Difference_test (5.6.b)
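Equations 5.4 to 5.6 amount to learning a ratio on a well-fit training image and reusing it at test time. A sketch with purely illustrative numbers (the values are not from the project's data):

```python
def learn_ratio(centre_training, skin_count_diff_training):
    """Eq. 5.4: ratio of a well-fit training ellipse's centre ordinate to
    that image's skin-count difference (right minus left side, Eq. 5.5)."""
    return centre_training / skin_count_diff_training

def predict_centre(ratio, skin_count_diff_test):
    """Eq. 5.6: apply the learned ratio to a test image's skin-count
    difference to place the ellipse centre."""
    return ratio * skin_count_diff_test

# Illustrative: a training ellipse centred at x = 48 with a skin-count
# difference of 120 gives Rx = 0.4; a test difference of 90 then places
# the ellipse centre at x = 36.
rx = learn_ratio(48.0, 120.0)
cx = predict_centre(rx, 90.0)
```

The same two steps apply unchanged to the y ordinate via Ry.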
The original ellipse constructed is the large outer red ellipse in Figure 5.12. The large outer yellow
ellipse is an inverse of the red one and the centre yellow one is one constructed from parameters of
the original ellipse and the size of the face. The reasoning for using ellipses was twofold: first, evidently, to segment the face from the background; and second, to use the radians to cross-reference objects in hoardings with face pose by feeding them into an SVM for classification.
Figure 5.12 A well fit ellipse used to calculate the ratios.
As for the number of ellipses shown, the small ellipse fits the face; the outer ellipses were to fit around the head, but this did not happen as expected. The ellipses were supposed to fit well around
the edges of the head so that very close regional clustering could have been done to obtain the
perfect segmentation. Though regional clustering was done using the regions provided by the
ellipses, it was not as fruitful as desired. Applying the ratios gave results such as that in Figure H.9,
Appendix H, but the results on the textured based images were better (Figure H.10).
Figure 5.13 Comparison of ratio adjustment of ellipses on texture segmented (top) and RGB (bottom) images.
Figure 5.14 Training image represented as HSV and RGB. As is noticeable S, i.e. saturation shows the
maximum difference.
At this point the texture based system, which was extremely slow, was tested on real data of varying sizes. Unfortunately, none of the low resolution images were segmented as desired. Instead they took
unusual forms. Ratios were applied between image size and convolution settings, but the result of
segmentation was as shown in Column 8 of Figure 5.11. Having obtained a near decent ellipse fit on the normal RGB images, experiments began to improve this so that it might be augmented with the “spirit level” technique to extract the face as well as determine DoG. The various RGB
channels were tried and then HSV. This is when it was realised that saturation, “the property of a colour that varies in passing from red to pink” [9], is the key to differentiating similarly coloured objects. A number of images in saturation form are shown in Figure H.13 in Appendix H.
Saturation segmentation has been shown in Column 10 Figure 5.11. Figure H.14 shows several
segmented faces using images in saturation format, with K-means and Euclidean distance.
5.2 Spirit-Level Approach
So far details have been given about why this approach was introduced, what the motivating factors
were and how saturation with K-means was decided as the main segmentation method. This
approach was combined with the method described in the next Section to form the final system or model. An overview of the final system is given right after the evaluation, so details are not repeated here. The number of bins in the feature vector is now increased to 15 to improve accuracy. In plain English pseudo-code, the steps are:
1. A frame is picked up from video as input.
2. The face and eyes are found.
3. For each face detected the face region is removed using ratios.
4. Gaussian smoothing is done and the image is subsampled to remove detail.
5. The region is converted into HSV space and S is retained.
6. Gaussian smoothing is performed a few times.
7. K-means clustering is done and 4 clusters are formed.
8. A small section from the eyes to the upper lip is analysed for a maximum number of
members of a certain cluster. The mode cluster number is considered as skin.
9. The region is converted into a binary representation with 1 as skin and 0 as non-skin.
10. Skeletonisation is done to break off bits that are usually not skin, making the blob finding smoother. This was implemented from concepts in [58].
11. The largest blob of pixels with 1, i.e. skin pixels, is extracted and the rest is converted to 0
using a blob finding implementation from [59].
12. Ratios are used to find the section shown in Figure 5.5 and this section is converted into a
15 bin binary feature vector.
13. The vector is classified by a few SVMs to suggest if the subject is looking at the board, to
the left or to the right.
14. Annotations are done for any direction the head-pose may be estimated to be in.
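Steps 7 to 9 above can be sketched in a few lines. This is an illustrative NumPy version in which a deterministically initialised one-dimensional K-means stands in for the clustering actually used, and all function names are assumptions rather than the project's code:

```python
import numpy as np

def kmeans_1d(values, k=4, iters=10):
    """Tiny Lloyd's K-means on a flat array of saturation values (step 7),
    with deterministic linspace initialisation for reproducibility."""
    centres = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centres[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centres[c] = values[labels == c].mean()
    return labels

def skin_mask(sat, eye_lip_slice, k=4):
    """Steps 7-9: cluster the saturation channel into k clusters, take the
    modal cluster inside the eyes-to-upper-lip section as skin, and
    binarise with 1 for skin and 0 for non-skin."""
    labels = kmeans_1d(sat.ravel(), k=k).reshape(sat.shape)
    section = labels[eye_lip_slice]
    skin_label = np.bincount(section.ravel()).argmax()  # mode cluster = skin
    return (labels == skin_label).astype(int)

# Toy saturation image: low-saturation face patch on a brighter background,
# with the eyes-to-lip section placed inside the face region.
sat = np.full((8, 8), 0.8)
sat[2:6, 2:6] = 0.2
mask = skin_mask(sat, (slice(2, 6), slice(2, 6)))
```

On the toy image the face patch comes back as the skin blob, ready for the skeletonisation and blob-finding steps that follow.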
Figure 5.15 shows a subset of gaze annotated images that move beyond the “yes or no” model proposed in Chapter 4. The annotation was done as this is the final system.
Figure 5.15 Annotated images from the training set to show the “spirit level” approach working. All red arrows
represent “not looking” and green boxes represent “looking”. The top row shows arrows pointing to the right
of the image plane suggesting the subject is looking left and the centre row shows green boxes suggesting the
subjects are looking straight-on. The last row shows subjects looking towards their right hence the red arrow
pointing to the left of the image plane.
5.3 Cross-Referencing Objects With Pose Using Face Kernels
The solution to the problem of estimating where in the hoarding people were looking was to introduce
face kernels. Between 10 and 15 images of poses C05(left), C27(centre), C29 (right), C09 (down)
and C07(up) (Figure 3.1) were taken and passed through the final algorithm. An extracted face was
treated as a vector and used to train, alongside many others, a number of SVM classifiers. There was one SVM for “centre or horizontal”, one for “centre or vertical”, one for “left or right” and one for “up or down”, totalling 4 SVMs. The “centre or horizontal” SVM, for example, comprises all the C27 (centre) extracted face representations for centre, and a grouping of all C05 (left) and C29 (right) for the horizontal decision. Therefore the support vector groups here were actually a template of
face kernels. Figure 5.16 (left) shows images of pose C05, or “looking left” and their corresponding
extracted faces. Figure 5.16(right) shows how the regional decision is made where the “if”
statements have been illustrated as a decision tree. The kernels are reduced to 20x20 pixels so all of
them are uniform, small and lack significant detail. All the faces were treated as prototypes because
it could happen that a face is not properly segmented and then has nothing to compare itself with.
This is another reason why SVMs are used. This was the motivation against having simply one
representative for each pose (or 3 for each pose such as those shown in Figure 5.17).
As shown these are complete faces, but often incomplete segmentations occur. Therefore this idea
was not developed further. Figure 5.18 shows a frame of annotated video from the final system. A
face is detected and the “spirit level” approach has classified it as “looking at the billboard” shown
by the green box. The system described in this Section, matched it to two SVM’s one for “up or
down” and another for “left or right”, and returned the verdict “up and right” (or region 3). An arrow
has been drawn to show what part of the interest area has been incremented for this purpose. In this
case the 9 will be incremented before moving to the next frame. The hoarding can thus be visualised
more easily. The interest level can be visualised as in Figure 5.19.
Figure 5.16 Left: Pose C05 with all its representative images and their extracted faces which have been
converted into support vectors. Right: SVM classification “if” statements shown as tree. CV and CH are 1 for
centre and 0 for vertical and horizontal respectively. UD and LR give decision according to the various poses.
For example, 5 is C05 and 29 is C29. Thus the region of interest is decided.
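The "if" tree of Figure 5.16 (right) can be illustrated as follows. The region numbering is assumed to follow Figure 5.19 row-major from the top-left, which is consistent with "up and right" mapping to region 3; the function is a sketch of the described decision logic, not the project's code:

```python
def interest_region(cv, ch, ud=None, lr=None):
    """Map the SVM verdicts to a 1-9 hoarding region (Figure 5.19
    numbering, row-major from top-left). cv/ch are 1 for "centre" and
    0 for a vertical/horizontal pose; ud is "up" or "down" and lr is
    "left" or "right" when the corresponding SVM fires."""
    row = 1 if cv == 1 else (0 if ud == "up" else 2)   # up / centre / down
    col = 1 if ch == 1 else (2 if lr == "right" else 0)  # left / centre / right
    return row * 3 + col + 1

# "Up and right" from Figure 5.18 lands in region 3 (top-right);
# a fully central verdict falls back to region 5.
r = interest_region(0, 0, ud="up", lr="right")
```

A per-region counter indexed by this value then accumulates interest over frames, as in Figure 5.18.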
Figure 5.17 3 mean kernel clusters of all the extracted faces in Figure 5.16 (pose C05) in saturation mode.
Figure 5.18 Frame of annotated video showing the final system running and how the regional interest detector
is updated.
1 2 3
4 5 6
7 8 9
Figure 5.19 3 x 3 matrix drawn on an image of a hoarding (Source [57]) to show how it may be segmented
into 9 regions. For example, if regions 7, 8 and 9 are viewed most then interest was mainly shown there.
5.4 Evaluation of the Final Model including Spirit-Level vs. 13 Features
A number of combinations for classification were tried before deciding on one that has several
example support vectors in it. The current model is the best possible combination. The “spirit-level”
model described in Section 5.2 was tested on the test set and results follow. It should be noted again
that the eyefinder was being used as opposed to mpiSearch, so more faces were detected. Also the
filter of taking only the largest face returned by either detection package was removed so that the model may be used in scenes with many faces of varying sizes. The filter had been put in place to channel out eyefinder's and mpiSearch's problem of detecting false faces. Table 5.3 shows the confusion matrix
of the final “spirit-level” approach tested with no filter for the eyefinder and Table 5.5 shows it with
the filter on.
              Classification
Truth         Looking   Not Looking
Looking          56          12
Not Looking      16          44
Table 5.3 Confusion matrix of the “spirit-level” approach with “left” and “right” grouped as not looking.
The confusion matrix includes “left” and “right” as “Not Looking”. The precision, i.e. true positives and negatives, is 78% over a baseline of 53%. The main problem areas were again poses like C09
and C07. By removing those images the accuracy increased to 80%. This model identified “left”
and “right” subjects with an accuracy of 84% over a baseline of 46%.
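As a quick arithmetic check of the quoted figures, reading Table 5.3 with truth along the rows (a sketch for verification, not part of the model):

```python
def accuracy(tp, fn, fp, tn):
    """Overall true-positive-plus-true-negative rate of a 2x2 confusion
    matrix, i.e. the precision figure quoted in the text."""
    return (tp + tn) / (tp + fn + fp + tn)

# Table 5.3: truth "Looking" row gives 56 classified looking and 12 not;
# truth "Not Looking" row gives 16 and 44.
acc = accuracy(tp=56, fn=12, fp=16, tn=44)
baseline = (56 + 12) / 128.0   # majority truth class ("Looking") as Zero-R
```

This reproduces the quoted 78% precision over the 53% baseline.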
              Classification
Truth         Left   Right
Left           19      2
Right           5     19
Table 5.4 Confusion matrix of the “spirit-level” approach for “left” and “right”.
One major problem identified was that the model was biased toward the “right” decision. It could not be tweaked to improve it any further, given the existing fear of over-fitting and over-optimisation.
Other than removing the face filter, eyefinder's DLL file was also rebuilt. Though the build is from the same source, it seemed to make a difference. Previously this model had an 83% accuracy with a baseline of 62% (shown below in Table 5.5), but 103 images were detected, as opposed to the 133 images in the test-set or the 128 that have been detected above.
              Classification
Truth         Looking   Not Looking
Looking          56           8
Not Looking       9          30
Table 5.5 Confusion matrix for the “spirit-level” approach without a face filter and the old eyefinder build.
Using the mean feature vector approach as described in Section 5.1.1, but with this new
segmentation technique, the accuracy was 71% over a 62% baseline (Table 5.6). Raw values before
the mean thresholding were also tried in various combinations but yielded poor results.
              Classification
Truth         Looking   Not Looking
Looking          47          17
Not Looking      13          26
Table 5.6 Confusion matrix of the “spirit-level” approach using the 3 mean vectors.
Figure 5.20 compares the feature based model in Chapter 4 and the final “spirit-level” model. As the test subsets vary, the difference between truth (or precision) and the baseline has been used for the main comparison. Beyond that, the newly proposed model outperforms the previously proposed model in all respects.
Section 5.3's regional interest detector was also tested and the results were rather promising. In this case taking the most frequent class as a baseline is inappropriate, but doing so gives 77% accuracy with a 20% baseline. As described earlier, the “centre or vertical” and “centre or horizontal” SVMs have “up” and “down”, and “left” and “right” support vectors respectively, so in testing they are tested separately. The totals in Table 5.7 are therefore greater than the actual size of the test-set; for example, “horizontal” has 10 correct classifications, which is “left” (5) and “right” (5) added up in the test data.
Figure 5.20 Visualised confusion matrix of the feature based (Table 4.2) and the final “spirit-level” (Table 5.3) approaches. Left chart: true positive, true negative, false positive, false negative and bias counts for the 13 Features and spirit-level models. Right chart: truth and truth-minus-baseline percentages (13 Features: 62% and 7%; spirit-level: 78% and 25%).
Truth vs classification (columns: Left, Right, Centre, Up, Down, Vertical, Horizontal):
Left: 5
Right: 2, 3
Centre: 4, 4, 2
Up: 4
Down: 5
Vertical: 3, 6
Horizontal: 10
Table 5.7 Resultant confusion matrix of tests conducted on the SVMs for the regional interest detector.
The regional interest detector SVMs were biased toward the vertical verdict. From Table 5.7 the “centre or vertical” (CV) SVM trained set is 40% accurate. To tackle this, removing the eye sockets was tried, but informal experiments showed that this did not pronounce the differences between the poses. Nor is the bias due to a larger number of vertical, i.e. up or down, support vectors. This was left alone and the code was made to select the centre region by default, in case no other decision
changes the interest region. Figure 5.21 shows two subjects and how they were classified by the
regional interest detector.
Figure 5.21 Left: The regional interest detector says that this subject is looking to the top left of the billboard.
Right: It said that this subject is looking towards the top centre of the billboard.
5.5 Final Model Combined
The spirit level approach outperformed the previously proposed 13 feature based model and was
chosen as a replacement with its added “left” and “right” feature. The further enhancement also
works well. These two components have been put together as illustrated in Figure 5.22.
Figure 5.22 The final model's architecture. The “DoG COMPONENT” contains both the “spirit-level”
component and the regional interest detector to cross reference objects with pose. The output is a table of
results and annotated video. The source code in Appendix E is structured in this way.
Figure 5.23 shows and explains a snapshot of annotated video. A thorough evaluation of this final model is conducted in the next Chapter, where a precision of 89% is seen.
Figure 5.23 Frame annotation explained. Callouts in the frame: the regional interest detector (gaze is inverted so it represents the billboard being faced by the observer rather than the subject, and is not laterally inverted); the interest duration in seconds, computed as the total “looking” count ÷ 25 fps; the “spirit-level” component's annotation for a subject looking “right” (laterally inverted); the verdict text for looking and not looking, which appears when faces are detected; and the eyefinder face box together with the clip's frame count.
Chapter 6: Evaluation Against Ground Truth
The final model is thoroughly evaluated in this Chapter.
6.1 The Set Up
A detailed evaluation of the final system was carried out to ensure validity as well as verifiability
[2]. 1.5 hours of video footage was filmed from several locations with varying illumination,
elevation, background colour, and crowd trajectory angle to test the model for site installation
variance (Appendix E). Due to computational resource limitations requiring excessive, time-consuming manual intervention, the following decisions were made. From the 1.5 hours of video footage a few clips were created and processed, totalling approximately 4 minutes. Sample variations within these clips are shown in Figure 6.1.
Figure 6.1 Sample frames of the processed video clips. For the evaluation form: frame 1 is from clip “Khu”, 2
is from “Man”, 3 is from “Mich_proed”, 4 is from “sub_fork_proed”, 5 and 6 are from “stud_proed”, 7 is from
“sub_fork_40_60” and 8 is from “Close_up”. Frame 1 catered to a skin coloured background. Frame 7, for
example, represented posters inside tube stations between two path ways, and frame 3 represented a billboard
on a high rise.
It was not considered appropriate to compare the “interest duration” measure given by the model
with a manually timed one. A frame by frame analysis made for a far better evaluation. With 25 frames-per-second (fps) video clips and approximately 4 minutes of film, about 6000 frames would have had to be manually annotated. This was not possible within the time constraint, so a 2 second sampling interval (at 25 fps, 50 frames per 2 seconds) was decided upon after observing how long individuals took to change their DoG from objects of interest. 111 frames were obtained in this way. Due to
computational problems during processing, each interval of 50 frames does not tally with the actual
clip’s frame count. The completed evaluation form in Appendix I shows (from left to right) in each
row, the clip’s name (refer to Figure 6.1 for sample frames), a per clip frame count as on the clip
itself, a manual count from 1 to 5751 frames and classification information. Appendix E contains all
the extracted frames, named with a concatenation of the clip's name and the manual frame count, e.g.
Man_5601.jpg. The main evaluation criterion is precision accompanied by confusion matrix
information. Figure 6.2 provides an outline of the evaluation scheme followed and results follow.
[Flow diagram: MPT:eyefinder (stage 1) outputs True Face & Eyes and False Face & Eyes (stage 2); both feed the DoG component, which gives L, R or S verdicts and an Accept/Reject decision (stages 3 and 4).]
Figure 6.2 Evaluation scheme describing various parts of the model that have been evaluated. DoG represents
the proposed solution as in Chapter 5.
6.2 MPT’s Suitability
Stage 1 in Figure 6.2 shows the eyefinder. Although eyefinder is not the actual DoG model
proposed, it is part of the entire system and is therefore tested. 20 frames were selected from the
captured set, 1 to 501 and 2301 to 2801, and the total faces that lay between face poses C05 and C29
(Figure 3.1) were counted per frame (Appendix I). The first set of frames belonged to “Mich_proed”
and the second set belonged to “sub_fork_proed” (frame 3 and 7 from Figure 6.1 respectively). The
total faces were counted and those faces that appeared to be looking at the screen were also counted.
The faces ranged from 20 to 40 pixels in height. These two separate counts were compared with the faces actually detected correctly (when eyefinder detects a face it also detects eyes). The
results are summarised in Table 6.1.
The first row in Table 6.1 represents a high location. MPT’s eyefinder detected only 17% of the
faces that were looking toward the simulated billboard or appeared close to pose C27 (Figure 3.1)
and only 4% of the faces in total. Frames 5 and 6 in Figure 6.1 illustrate this problem with no
detection or annotation visible. The other location yielded better but unsatisfactory results. The
totals favour eyefinder but a higher elevation installation is the most likely utilisation of this model.
Therefore when a better package is available it must replace MPT. Misclassifications were also a
problem. Figure 6.3 illustrates encountered misclassifications.
Location           Actual Count   True Faces Found   Possible C27's   Actual Detection Precision   C27's Detection Precision
“Mich_proed”            56               2                 12                   4%                        17%
“sub_fork_proed”        36              14                 12                  39%                       100%
Totals                  92              16                 24                  17%                        67%
Table 6.1 Summary of MPT:eyefinder detection rate examination. For an elevated location (“Mich_proed”)
this face and eye detector was not acceptable. For a ground level and close proximity location it performed
better.
Figure 6.3 Frame 3, for example, shows a red arrow pointing to the left of the image plane. This is the
model’s answer to the eyefinder giving it an area of sky and leaves. The model proved to be robust in this way
dismissing most misclassifications. Frame 4 shows a misclassification being declared as “looking” due to very
little illumination which would have differentiated cloth from skin.
6.3 Robustness
Stage 2 in Figure 6.2 is the “False Face & Eyes” branch. The model is essentially a system comprising several SVMs and other components. All systems possess the property of emergence; here the concept of emergence is seen in the system's, or model's, robustness. The model should not be completely dependent on a face and eye detection package's accuracy. For each frame in the evaluation, “True” and “False” face classifications were counted, along with the model's robustness to the false ones. As shown in Appendix I, the total number of faces detected was 85 from 111 frames.
Out of those 85 faces 24 were misclassifications (Figure 6.3) so eyefinder had a 72% correct face
detection precision. The model classified 22 of these misclassifications as “not looking”, i.e., either
looking left or right, and 2 as “looking”. Therefore it is 92% (all figures rounded off) robust to input
from any face and eye detection package. This proves to be an imperative attribute that makes the
project commercially viable.
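The robustness figures above follow directly from the counts in Appendix I. The following is a minimal sketch of the arithmetic (a standalone Python illustration, not the project’s Matlab code; the names are illustrative only):

```python
def percent(numerator, denominator):
    """Rounded percentage, as quoted throughout Chapter 6."""
    return round(100.0 * numerator / denominator)

detections = 85    # faces reported by MPT's eyefinder over 111 frames
false_faces = 24   # misclassifications (sky, leaves, cloth, ...)
rejected = 22      # false faces the model still classified as "not looking"

# eyefinder's correct-face detection precision: (85 - 24) / 85 -> 72%
detector_precision = percent(detections - false_faces, detections)

# the model's robustness to detector errors: 22 / 24 -> 92%
robustness = percent(rejected, false_faces)
```

Note that robustness is measured against the detector’s errors only: it asks how often a false face is prevented from inflating the interest gauge, independently of how often the detector errs.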
6.4 DoG Estimation Ability
The following confusion matrices show how the model fared overall.
                 Classification
Truth            Looking   Not Looking
Looking            15           7
Not Looking         0          39
Table 6.2 Against a 64% baseline, an overall true positive and true negative precision of 89% was
achieved, satisfying the minimum requirements.
Table 6.2 shows that the model has an overall combined (i.e. across all locations) classification
precision of 89% against a baseline of 64%. Individually, positives were classified with 68%
precision and negatives with 100% precision. While the results were slightly biased toward the
negatives, the 89% overall precision exceeded expectations.
In Table 6.2, “not looking” combines the “left” and “right” faces, to measure how the model
dealt with its minimum requirement; verification [2] was thus completed. The left, right and
straight-on feature, described in Chapter 5, was added to complement the basic interest measure
of “looking” and “not looking” already catered for. As shown in Table 6.3, testing these classes
separately also yielded good results.

                 Classification
Truth            Left   Right   Straight
Left               7      12       0
Right              1      19       0
Straight           0       7      15

Table 6.3 Using “Right” for the ZeroR baseline gives 33%. The overall classification accuracy was 67%.
This added feature performed at 67% precision. Individually, classifying images as “left” was 37%
accurate, “right” 95% accurate and “straight-on” 68% accurate. What the overall precision does
not show is that the model was biased toward classifying images as “right”. During training
the SVM had a bias toward “right” classifications and, as discussed in Chapter 5, attempts were
made to mitigate this to a certain level. It was not anticipated then that this would cause problems.
So although there is a validated model performing at a reasonable precision, overfitting is evident.
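The percentages quoted for Tables 6.2 and 6.3 can be reproduced directly from the confusion matrices. A small sketch (standalone Python, illustrative only; each matrix is laid out with one row per ground-truth class, and the per-class rate is what the text calls “precision”):

```python
def row_recall(matrix, i):
    """Per-class rate quoted as 'precision' in the text: correct
    classifications of class i over all true instances of class i."""
    return round(100.0 * matrix[i][i] / sum(matrix[i]))

def overall(matrix):
    """Overall accuracy: diagonal total over all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return round(100.0 * correct / total)

def zero_r(matrix, j):
    """ZeroR baseline: accuracy when always predicting class j,
    i.e. the share of samples whose truth is j."""
    total = sum(sum(row) for row in matrix)
    return round(100.0 * sum(matrix[j]) / total)

# Table 6.2: "looking" vs "not looking" (rows = truth).
t62 = [[15, 7],
       [0, 39]]

# Table 6.3: "left", "right", "straight-on" (rows = truth).
t63 = [[7, 12, 0],
       [1, 19, 0],
       [0, 7, 15]]
```

With these definitions, `overall(t62)` gives 89 against a `zero_r(t62, 1)` baseline of 64, and `overall(t63)` gives 67 against the 33% “Right” baseline, matching the figures in the text.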
Since requirements verification was performed with elevations and locations in mind, a
validation for each location shown in Figure 6.1, derived from Appendix I, is summarised in Table 6.5.
Stage 4 involves evaluating the regional interest feature. Since it was not possible, within the
project’s schedule, to perform a detailed evaluation here by counting negatives and positives for
each region, a binary decision is given instead. Figure 6.4 illustrates how the decision is made,
and Table 6.4 summarises the regional interest gauge evaluation results.
In each matrix below, rows are ground truth and columns are the model’s classifications.

Mich_proed
  Truth       Left  Right  Straight        Truth        Looking  Not Looking
  Left          6     1       0            Looking          5         0
  Right         0     2       0            Not Looking      0         9
  Straight      0     0       5

sub_fork_proed
  Left          0     2       0            Looking          3         3
  Right         1     6       0            Not Looking      0         9
  Straight      0     3       3

sub_fork_40_60
  Left          0     4       0            Looking          1         3
  Right         0     8       0            Not Looking      0        12
  Straight      0     3       1

Khu
  Left          0     0       0            Looking          1         0
  Right         0     0       0            Not Looking      0         0
  Straight      0     0       1

Close_up
  Left          1     0       0            Looking          3         1
  Right         0     1       0            Not Looking      0         2
  Straight      0     1       3

stud_proed
  Left          0     5       0            Looking          2         0
  Right         0     0       0            Not Looking      0         5
  Straight      0     0       2

Man
  Left          0     0       0            Looking          0         0
  Right         0     2       0            Not Looking      0         2
  Straight      0     0       0

Table 6.3 A summary breakdown per location, produced from Appendix I, presented as confusion
matrices. The minimum “looking”/“not looking” feature is shown on the right and the added “left”,
“right” and “straight-on” feature on the left. The name of each clip is given.
Figure 6.4 Frames 1 and 2 show mistakes. In frame 1 the scene is confusing and so the model
classified it as “looking at region 2”. In frame 2 the subject is looking straight toward region 5
rather than region 4 as classified. These two are counted as “False” in the evaluation form.
Frames 3 and 4 correctly identify possible regions of gaze and are therefore counted as “True”.
Regional interest

Location          TRUE   FALSE
Mich_proed          5      0
Khu                 1      0
sub_fork_proed      2      1
Close_up            1      2
sub_fork_40_60      0      1
stud_proed          1      1
Man                 0      0
Table 6.4 Summary of the regional interest detector for each location, showing in particular that
for the elevated position it should cater to (“Mich_proed”) it performed 100% correctly.
Table 6.5 summarises the combined final system’s precision per location, and Figure 6.5 illustrates
these values.
Direction of Gaze Model Features

Location          LRS    Yes or No   Regional
Mich_proed        93%      100%        100%
Khu              100%      100%        100%
sub_fork_proed    60%       80%         67%
Close_up          83%       83%         33%
sub_fork_40_60    56%       81%          0%
stud_proed        29%      100%         50%
Man              100%      100%          0%
Totals            67%       89%         67%

Table 6.5 Precision of all features at all locations. LRS is the “left”, “right” and “straight-on”
component, “Yes or No” is the minimum “looking” or “not looking” component and “Regional”
represents the regional interest detector.
Figure 6.5 Visualisation of Table 6.5. The entire system performs very well on “Mich_proed” and on
the skin-coloured background of “Khu”. For a ground-level application (“sub_fork”) it is also
suitable, but for a close-up with a very large face it is affected by the location of the eyes, which
are found more precisely by MPT’s eyefinder at that range. The regional interest measure has also
proved to work well in a number of settings other than the high-level one. The totals suggest an
all-round satisfactory performance.
6.5 Evaluation Summary
As with any evaluation, it is only possible to reveal the presence of errors, never their absence
[3]. This is the motivation behind the chosen methodology. The evaluation technique was intended
to be as unbiased as possible while remaining convenient: if frames had been selected manually,
preference would automatically have been given to those that favour the model, and if a random
frame grabber had been programmed, time-consuming computational resource problems would have
been faced. However, the scheme had the drawback of giving little insight into the smaller video
clips, such as “Khu” for a skin-coloured background and “Man” for a cluttered background and
single-subject outdoor analysis.
The skin-coloured background was problematic (refer to “Khu.wmv” in Appendix E) only for
robustness; classifications were otherwise accurate owing to the saturation-based segmentation
technique. Beyond this, the major problem was MPT which, though based on the state-of-the-art
Viola-Jones model, is error prone. It has a particularly discouraging inability to detect
dark-skinned subjects except in ideal illumination conditions. Other factors that apply have
already been examined in detail.
Despite component problems, the model performed very well. It is robust and accurate at a number
of sites including the high elevation it was intended for. The overall precision was 89% for the
simple minimum requirements and 67% for the added “left”, “right” and “straight-on” feature. The
“spirit-level” model had achieved a lower accuracy of 78% in the controlled environment. The
“left or right” feature suffered from the SVM’s bias toward “right” classifications, which caused
the decline from the controlled-environment result of 84%, indicating over-fitting and
over-optimisation. However, it is better to have a system biased toward the “not looking” side,
suppressing the interest gauge, than one showing an overly exaggerated result. The regional
interest measure also performed extremely well, though it declined from 77% in the controlled
laboratory test-set environment to 67% accuracy. Though lower, this is still a satisfactory
result.
As demonstrated, a cognitive vision model to gauge interest in advertising hoardings was
successfully developed. Even though precisions are not at 100%, the model can be used not only to
gauge interest in particular advertisements but also to compare advertisements, and in several
other applications.
Chapter 7: Future Work
Possible additions to this project are discussed here.
This project provides a platform for a number of possible extensions. Of the two further
enhancements, expression recognition could not be explored, mainly to keep within the project’s
schedule. Another concern was that heads in footage from high hoardings might be too small for
facial expressions to be exploitable. However, during development it was realised that heads will
usually be big enough to give a rough approximation of the subject’s expression. Though Figure 7.1
was taken from 40 feet above ground level and the image is enlarged, super-sampling with
interpolation methods could be used to clean up tiny distorted faces before classifying their
expressions. The PIE database already acquired and discussed can easily be used for its expression
images. Beyond this, body speed and trajectory, posture, and head gestures including nods could
complement an expression matching system. Such techniques are important because most of the
pedestrian reactions shown in Figure 7.2 are unlikely to occur in practice.
Figure 7.1 Example frame from height showing that expressions in heads 20 pixels high are still
visible; these subjects appear to be smiling.
Figure 7.2 Unlikely and ideal facial expressions for a facial expression recognition model. Source [60]
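Super-sampling a small face patch, as suggested above, amounts to interpolating new pixel values between the existing ones. The following is a minimal pure-Python bilinear interpolation sketch over a hypothetical greyscale patch (the project used Matlab; this standalone illustration is not the project’s code):

```python
def bilinear_upsample(img, factor):
    """Enlarge a 2-D greyscale image (list of rows) by an integer
    factor, blending each output pixel from its four source neighbours."""
    h, w = len(img), len(img[0])
    H, W = h * factor, w * factor
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            # Map the output pixel back into source coordinates.
            sy = y * (h - 1) / (H - 1) if H > 1 else 0.0
            sx = x * (w - 1) / (W - 1) if W > 1 else 0.0
            y0, x0 = int(sy), int(sx)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            fy, fx = sy - y0, sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            out[y][x] = top * (1 - fy) + bot * fy
    return out

# A hypothetical 2x2 patch enlarged 2x: corner values are preserved
# and intermediate pixels are blends of their neighbours.
patch = [[0.0, 1.0],
         [1.0, 0.0]]
big = bilinear_upsample(patch, 2)
```

In practice an expression classifier would be applied to the enlarged patch; smoother kernels (bicubic, Lanczos) give better results on real faces but follow the same pattern.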
The literature review has already covered techniques for incorporating trajectory and velocity
using Markov models and other probabilistic methods [25]. Other interest cues could be developed:
for example, a subject’s head turning through a considerable pan away from the body’s trajectory
for an extended period shows greater interest, and goes beyond the current model. For this,
tracking techniques need to be developed. There are several other possible extensions. Linking
the system to a national identification or driver’s licence database and adding a face
recognition system, to study which demographics show the most interest, would help pricing,
market penetration and inventory control. As a first step, though, PCA should be used with the
saturation-based segmentation technique to extract a perfect head; in this way the regional
detector could also be made more precise. Although non-model approaches were to be tried, now
that this project is complete it would be beneficial to incorporate a model-based approach beyond
the ellipses.
Chapter 8: Conclusion
The report’s conclusion.
The need existed to develop a system that gauges interest in advertisement hoardings, and such a
system was developed. Proposed is a cognitive vision model to gauge interest in advertisement
hoardings. This technique of using pin-point “spirit-level” binary feature vectors combined with a
regional detector is useful in a number of settings beyond media research and advertising. It can
be used in HCI, for which many systems have been devised, in gaze-navigation systems, and in
human-robot interaction such as that used in [23]. The model proved to work very well, as shown
in Chapter 6, with an 89% accuracy exceeding all expectations. During the project’s course several
new techniques were explored and the final model itself is novel. It is hoped that this model and
its documentation will provide a useful platform and study for future gaze estimation research in
multi-resolution video.
References
All citations within the text are listed here.
[1] The Advertising Association, (2006), Advertising Statistics Yearbook 2006, WARC.
[2] Sommerville I, (2000), Software Engineering, 6th Edition, Addison-Wesley.
[3] Sommerville I, (2004), Software Engineering, 7th Edition, Pearson/Addison-Wesley.
[4] Sim T, Baker S, Bsat M, (2002), The CMU Pose, Illumination, and Expression (PIE)
Database, in Proceedings of the International Conference on Automatic Face and Gesture
Recognition.
[5] Checkland P, (1981), Systems Thinking, Systems Practice, Wiley.
[6] The American Heritage Dictionary of the English Language, (2000), 4th Edition, Houghton
Mifflin Company.
[7] WordNet, (2003), version 2.0, Princeton University.
[8] Pomberger G, Blaschek G, (1996), Object orientation and prototyping in software engineering,
translated by Bach R, Prentice Hall.
[9] Forsyth D, Ponce J, (2003), Computer Vision: A Modern Approach, Prentice Hall.
[10] Ramaswami S, Kim J, Bhargava M, (2001) Advertising productivity: developing an agenda for
research, in the International Journal of Advertising, vol. 20, no. 4.
[11] Franke G, Taylor C, (2003) Business perceptions of the role of billboards in the U.S. economy,
in the Journal of Advertising Research, vol. 43, no. 2, June 2003.
[12] McEvoy D, (2002) Outdoor: the Creative Medium, Admap, May 2002, Issue 428
[13] Coleman R, Cunningham A, (2004), Outdoor advertising recall: A comparison of newer
technology and traditional billboards, in ESOMAR, Online and Outdoor Conference, Geneva.
[14] Advertising Standards Authority, (2002), Outdoor Advertising Survey 2002: Compliance
Report, [Online] [Accessed: 7th August 2006] URL: http://www.asa.org.uk/NR/rdonlyres/A5F2E70F-
8661-4D06-8284 5726DFC296AF/0/ASA_Poster_Research.pdf
[15] Kerne A, Jumboscope Group, JumboScope: A Site-Specific Installation and Platform, [online]
[Accessed 11th August 2006] URL: http://www.cs.tufts.edu/colloquia/colloquia.php?event=169
[16] Kerne A, (1997), Collage Machine: Temporality and Indeterminacy in Media Browsing via
Interface Ecology, in the Proceedings of the ACM SIGCHI Conference on Human Factors in
Computing Systems Extended Abstracts, pp. 297-298, March 1997.
[17] Boyland M, Janes I, Barber H, (2004), The full picture, ESOMAR, Technovate 2, Barcelona.
[18] Krugman D, Fox R, Fletcher J, Fischer P, Rojas T, (1994), Do Adolescents Attend to Warnings
in Cigarette Advertising? An Eye-Tracking Approach, in the Journal of Advertising Research,
vol. 34, no. 6, November/December 1994.
[19] Heinzmann J, Zelinsky A, (1998) 3-D facial pose and gaze point estimation using a robust
real-time tracking paradigm, in the Proceedings of the 3rd IEEE International Conference on
Automatic Face and Gesture Recognition. pp.142-147, April 1998.
[20] Gee A, Cipolla R, (1994), Estimating gaze from a single view of a face, in the Proceedings of
the 12th IAPR International Conference on Pattern Recognition, vol. 1, pp. 758-760, October 1994.
[21] Morimoto C, Amir A, Flickner M, (2002) Detecting Eye Position and Gaze from a Single
Camera and 2 Light Sources, in the Proceedings of the 16th International Conference on
Pattern Recognition (ICPR'02). vol. 4. pp. 40314.
[22] Wang J.-G, Sung E, Venkateswarlu R, (2003), Eye gaze estimation from a single image of one
eye, in the Proceedings of the Ninth IEEE International Conference on Computer Vision, vol. 1,
pp. 136-143, 13-16 October 2003.
[23] Nagai Y, (2005), The Role of Motion Information in Learning Human-Robot Joint Attention, in
the Proceedings of the 2005 IEEE International Conference on. pp. 2069- 2074 April 2005.
[24] Voit M, Nickel K, Stiefelhagen R, (2005), Multi-view face pose estimation using neural
networks, in the Proceedings of the 2nd Canadian Conference on Computer and Robot Vision,
pp. 347-352, May 2005.
[25] Robertson N, Reid I, Brady J, (2005), What are you looking at? Gaze estimation in medium-
scale images, in the Proceedings of HAREM 2005.
[26] Castrillon-Santana M, Deniz-Suarez O, Guerra-Artal C, Hernandez-Tejera M, (2005), Realtime
Detection of Faces in Video Streams, in the Proceedings of the 2nd Canadian Conference on
Computer and Robot Vision (CRV'05), pp. 298-305.
[27] Viola P, Jones M, (2003), Fast Multi-View Face Detection, Mitsubishi Electric Research
Labs, Demonstration at the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR'03).
[28] Viola P, Jones M, (2001), Rapid object detection using a boosted cascade of simple features, in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Kauai, HI.
[29] Machine Perception Laboratory. [online] [Accessed 9th June 2006] URL:
http://mplab.ucsd.edu:16080/
[30] Benoit A, Bonnaud L, Caplier A, Ngo P, Lawson L, Trevisan D, Levacic V, Mancas C, Chanel G,
(2005), Multimodal Focus Attention Detection in an Augmented Driver Simulator, in Proceedings
of the eNTERFACE’05 Workshop, Mons, Belgium.
[31] Tasaki T, Komatani K, Ogata T, Okuno H.G, (2005), Spatially mapping of friendliness for
human-robot interaction, in Proceedings of the IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS 2005), pp. 1277-1282.
[32] Hand D, Mannila H, Smyth P, (2001), Principles of Data Mining (Adaptive Computation and
Machine Learning), MIT Press.
[33] Mitchell T, (1997), Machine Learning, McGraw-Hill.
[34] Beal R, Jackson T, (1990) Neural Computing: An Introduction, IOP Publishing Ltd.
[35] DTREG, SVM - Support Vector Machines [online] [Accessed 13th August 2006]
URL:http://www.dtreg.com/svm.htm
[36] Donald S, (2000), Statistics: A First Course, 6th Edition, McGraw-Hill.
[37] Cytel Statistical Software, XLminerTM Online Help, Version 3, [online] [Accessed 13th August
2006] URL: http://www.resample.com/xlminer/help/rtree/rtree_intro.htm
[38] Manning C, Schuetze H, (1999). Foundations of Statistical Natural Language Processing, MIT
Press.
[39] Markert K, Language: Semantic Similarity, [online] [Accessed 13th August 2006] URL:
http://www.comp.leeds.ac.uk/lng/lectures/handout3.pdf
[40] Leibe B, Schiele B, (2003) Interleaved Object Categorization and Segmentation. in
Proceedings of the British Machine Vision Conference (BMVC'03).
[41] Anderson P, (2005), Object Recognition Using An Interest Operator, [online] [Accessed 16th
April 2006] URL: http://www.comp.leeds.ac.uk/fyproj/reports/0405/AndersonP.pdf
[42] Dang M, Choudri S, (2006) Simple Unsupervised Morphology Analysis Algorithm (SUMAA),
in Proceedings of the PASCAL Challenge Workshop, Venice.
[43] Hogg D, (2006), Computer Vision: Lecture slides, University of Leeds, [online] [Accessed 16th
April 2006] URL: http://www.comp.leeds.ac.uk/vsn/
[44] Stauffer C, Grimson W, (1999) Adaptive background mixture models for real-time tracking, in
Computer Vision Pattern Recognition, pp. 246—252.
[45] Vision Group, Leeds University, ftp://smb.csunix.leeds.ac.uk/home/cserv1_a/ssa/vislib/
[46] Pomerleau D, (1989), ALVINN: An autonomous land vehicle in a neural network, in D.S.
Touretzky, editor, Advances in Neural Information Processing Systems, vol. 1, pp. 305-313,
[47] Gonzalez R, Woods R, Eddins S, (2003), Digital Image Processing Using MATLAB, 1st
Edition, Prentice Hall.
[48] Brooks R, (1991), Intelligence Without Reason, in Proceedings of the 12th International
Joint Conference on Artificial Intelligence Sydney Australia, August 1991, pp. 569–595.
[49] UK MSN, [online] [Accessed 16th April 2006] URL: http://uk.msn.com/
[50] Fisher R, Perkins S, Walker A, Wolfart E, The Hypermedia Image Processing Reference,
University of Edinburgh, [online] [Accessed 1st August 2006] Available from URL:
http://homepages.inf.ed.ac.uk/rbf/HIPR2/
[51] Choudri S, (2006), RoSo2 algorithm for Texture Segmentation, Computer Vision: Coursework.
[52] Dance C, Willamowski J, Fan L, Bray C, Csurka G, (2004), Visual categorization with bags of
keypoints. in the Proceedings of ECCV International Workshop on Statistical Learning in
Computer Vision.
[53] Gonzalez, R, Woods R, Eddins S, (2004), Digital Image processing using MATLAB, Upper
Saddle River, NJ, Pearson/Prentice Hall.
[54] Hanselman D, (2005) Ellipsefit.m, University of Maine, Orono, ME 04469 Mastering
MATLAB 7 [online] [Accessed 22nd August 2006] URL:
http://www.mathworks.com/matlabcentral/files/7012/ellipsefit.m
[55] Halif R, Flusser J, (2000), Numerically Stable Direct Least Squares Fitting of Ellipses,
Department of Software Engineering, Charles University, Czech Republic.
[56] Lei Wang, (2003), ellipsedraw.m, [online] [Accessed 22nd August 2006] URL:
http://www.mathworks.com/matlabcentral/files/3224/ellipsedraw.m
[57] Google images, [online] [Accessed 22nd August 2006]URL:
http://images.google.co.uk/images?svnum=10&hl=en&lr=&q=coke+billboard&btnG=Search
[58] Young D, Computer Vision: Lecture slides, University of Sussex, [online] [Accessed 10th July
2006] URL: http://www.cogs.susx.ac.uk/courses/compvis/slides_lec5.pdf
[59] Forbes K, Some Simple Image Processing Tools for Matlab, [online] [Accessed 10th July 2006]
URL: http://www.dip.ee.uct.ac.za/~kforbes/KFtools/KFtools.html
[60] Essa I, Pentland A, (1995), Facial Expression Recognition Using a Dynamic Model and Motion
Energy, in the Proceedings of ICCV.
Appendix A: Reflection
A reflection on experience that should be beneficial to other students in the future.
While it is not possible to cover all the experience gained, some aspects should be useful for others
to learn from. With any problem, the simplest solutions should be exhausted first. The image
segmentation techniques tried tended to be overly complicated, when the solution eventually
settled on was extremely simple to implement. For example, if simple experimentation with RGB and
HSV images had been exhausted earlier, the further enhancement of expression recognition could
have been developed, or more time could have been spent evaluating more frames. Possible additions
to the saturation-based technique include using the mean-shift tracker; without a PCA model, those
two techniques should give fruitful results in symbiosis.
The development environment is a very important aspect of any computer science problem. Matlab
provided rapid prototyping capabilities but has poor memory management. It is essential to keep
such factors in mind, since this project suffered considerably as a result: it took 2 weeks to
successfully process film that was expected to take a few days. In the end, 1.5 hours of video was
reduced to clips of 1 minute each, which took at least half a day each to compute. The alternative
was to use the OpenCV libraries and code in C++, but experience suggested that Matlab was simpler.
The lesson here is that if high-quality frames are to be processed, more than 1 GB of RAM is
required: with 1 GB, the system would crash after every 80 frames (1024x786 pixels) of a 1 minute
AVI clip (approx. 100 MB). The task was therefore divided among 4 computer systems with 2.8 GHz
Intel Pentium 4 processors, each with at least 50 GB of hard disk space for virtual memory. The
higher resolution was essential to detect smaller heads. The computers were limited to 4 because
the Matlab release in use had licence limitations on its image processing toolbox, another problem
to keep in mind while developing systems.
Since this was an idea converted into a project, certain aspects were both pros and cons. There
was no set specification to use as a milestone during development and no existing system to beat;
this invited “feature-creep”, but at the same time prevented it, since there was no representative
from industry, and it allowed development in a very research-oriented style. The aspect governing
its success was a clearly defined, precise set of minimum requirements from the very start and
specific evaluation criteria to measure success by. While still on the subject of “idea to
project”, it is important to define an achievable schedule with large buffers in place; since this
was mainly a self-driven project, such buffers were required. For example, in July the development
finished as planned but the evaluation used up the buffer due to the problems discussed. Lastly,
for the report it is important to document and evaluate everything during development and to keep
code to reproduce results.
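The memory problem described above is essentially an argument for streaming frames through in small batches rather than holding a whole clip in memory. A minimal generator-based sketch of the pattern (standalone Python, illustrative only; the frame source and batch size are hypothetical stand-ins):

```python
def batched(frames, batch_size):
    """Yield frames in fixed-size batches so that at most one batch
    is resident in memory at a time."""
    batch = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# Simulated clip: 200 'frames' processed 40 at a time, instead of
# loading all 200 at once (the real frames were 1024x786 images).
clip = range(200)
processed = 0
for batch in batched(clip, 40):
    processed += len(batch)   # stand-in for the per-frame analysis
```

The same discipline in Matlab (reading a bounded window of frames, processing it, then discarding it before reading the next) would likely have avoided the crashes described above.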
Appendix B: Objectives and Deliverables Form
Objectives and deliverables form recording the initial agreement of minimum requirements, deliverables and other aspects of the project.

AIM AND REQUIREMENTS FORM
COPIED TO SUPERVISOR AND STUDENT
----------------------------------------------------------
Name of the student: Saad CHOUDRI
Email address: scs5sc
Degree programme: MCGS - MSc in Cognitive Systems
Number of credits: 60
Supervisor: dch
Company: (none)

The aim is: Development of Computer Vision concepts and skills through experience.

The project outline is:
Background reading: Reading and understanding face and gesture recognition to understand models such as the Viola and Jones model.
Methodology: Prototype Development
Product: Software implementation
Evaluation of product: Performance evaluation against ground truth.

The minimum requirements are:
1. Integrate off the shelf face and eye detection software
2. Estimate viewing direction from a face and eye detection package augmented with novel algorithms and or, approaches through literature research.
3. Devise a measure of interest from viewing angle and duration of viewing.
4. Evaluate the system documented in a report.
5. Enhancement Optionals: 1) incorporate facial expression into the interest measure 2) Cross reference objects in the display to gauge level of interest in object.

The hardware and software resources are:
1. Video Cameras from Vision Lab
2. Extra disk space as needed on school computers, starting with about 100 MB.
3. Continuous Mat Lab availability and support on school computers.

The foundation modules are:
1. COMP5430M
2. COMP5420M

The project title is: Cognitive Model gauging interests in advertising hoardings.
Appendix C: Marking Scheme and Header Sheet
The marking scheme, header sheet and comments from the interim report are given below.
Appendix D: Gantt Chart and Project Management
Reflection on the project management and updates are given below.

[Gantt chart: original schedule, February to September, with weeks assigned to background reading,
concurrent activities, final evaluation, poster and report, plus free slots for delays. Key:
completion of 3 stages and 1 week off; project submission.]

Table 1 Original schedule giving an overview of the planned weeks. For example, the whole of June
was to take up the concurrent activities and background reading, but only the first 3 weeks of
August were to be used for report writing. The key describes irregularities.

[Gantt chart: updated schedule, February to August, with background reading and concurrent
activities marked “Done” and the final evaluation “Ongoing”. Key: completion of 3 stages and
1 week off; project submission; 18th July progress meeting.]

Table 2 Schedule as of 18th July.
The final evaluation stage threw the project off track, and processing continued until the 10th of
August due to a lack of resources such as RAM. This delayed all the other activities, but the 1st
and 2nd drafts of the entire report were done by the 31st of August, so the buffer zones were very
useful and well planned.
A total of 280 hours of solid productive work was required for the evaluation troubleshooting,
report writing and poster making during August, and the entire project was ready for submission
5 days before the deadline, on the 1st of September. Prior to this, at least 400 hours had been
spent since February, so at least 680 hours in total were spent productively on this project.
Appendix E: CD
All deliverables other than this report (images, videos, etc.) and source code are on the CD enclosed with the printed report handed in on the 6th of September 2006.
The following points explain how to use this CD or Appendix E (E for electronic). A readme.txt file
per section is present explaining the contents. The sections are:
• The folder titled “Deliverables” includes
o Evaluation videos filmed (edited, and as many as could fit on the CD)
o Processed clips (named in accordance with the evaluation form)
o Extracted frames that were evaluated manually.
o Software code with workspace and usage instructions (demonstration arrangeable).
• The folder titled “Dataset” contains
o All the training images, testing images, and extra 70 images taken for
The main DoG Component
The further enhancement
MPT testing
• The folder titled “Miscellaneous” contains
o Histograms etc. to understand the parameters of the 13 feature approach
o Background, skin and hair pictures
Appendix F: Dataset
This appendix contains thumbnails of the training and test images.
Figure F.1 Training images. 34 centre, 19 left and 20 right facing images. Subjects with glasses and without
glasses have been repeated in some areas. Other images have also been included to avoid optimisation.
Figure F.2 68 “looking” images selected on the basis of appearance variation. Some images have been
repeated to see how classifiers behave with previously seen images as well as unseen.
Figure F.3 65 “not looking” test images. Carefully selected images with many unseen poses, varying
appearance, illumination, background and expression, and that are likely to be misclassified.
Figure F.4 Training images for cross referencing objects with pose. 5 different poses used as described.
Figure F.5 Test images for cross referencing objects with pose. Carefully selected subjects with varying
appearance.
Appendix G: 13 Feature Analysis Samples
Histograms and range analysis for the Normal distribution.
L=left, R=right, S=straight or centre, Y=looking, N= not looking
Figure G.1 FCX
Figure G.2 FCY
Figure G.3 REY
This is a subset of the 13 histograms plotted for L, R and S; the rest may be found in Appendix E.
The following are the same histograms, but grouped as yes and no rather than left, right and straight.
Figure G.4 FCX
Figure G.5 FCY
Figure G.6 REY
The rest of the set of histograms are in Appendix E.
Appendix H: Image Segmentation Samples
Screenshots of the various implemented image segmentation techniques on 76 training images.
Figure H.1 Face and hair extraction using K-means and Euclidean distance with 3 clusters for the background
and 2 for the subject.
Figure H.2 Results of encoding using 5 Gaussians. Equal probabilities are the biggest problem.
Figure H.3 6 Gaussians, 3 for face and 3 for background and used for joint probability.
Figure H.4 The foreground and background pixels individually representing 2 Gaussians
Figure H.5 An adaptive background approach using a series of Gaussians.
Figure H.6 After 3 iterations of the adaptive accumulative Gaussian approach from Figure H.5.
Figure H.9 Ellipse fit on RGB images
Figure H.10 Ellipse on texture with adjustments. The ellipse has been drawn on the original images but with
the coordinates of the textured image.
Figure H.11 Textured image from column 7, row D of Figure 5.11, with its edges detected.
Figure H.12 Edge detection on RGB image with bias frame
Figure H.13 Training image subset in saturation method with ellipses to show the colour discrimination.
Figure H.14 Faces segmented using saturation and K-means.
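The saturation-based scheme of Figures H.13 and H.14 can be sketched compactly: compute HSV saturation per pixel, then split the pixels into two clusters with a 1-D K-means. The following standalone sketch assumes RGB values in [0, 1] and hypothetical pixel values; it is illustrative only and differs in detail from the project’s Matlab implementation:

```python
def saturation(r, g, b):
    """HSV saturation: (max - min) / max, with S = 0 for black."""
    mx, mn = max(r, g, b), min(r, g, b)
    return 0.0 if mx == 0 else (mx - mn) / mx

def kmeans_1d(values, iters=20):
    """Two-cluster 1-D K-means; returns 0/1 labels (1 = high cluster)."""
    lo, hi = min(values), max(values)   # simple initialisation
    for _ in range(iters):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, l in zip(values, labels) if l == 0]
        high = [v for v, l in zip(values, labels) if l == 1]
        if low:
            lo = sum(low) / len(low)    # update cluster centres
        if high:
            hi = sum(high) / len(high)
    return labels

# Hypothetical pixels: desaturated background vs saturated skin tones.
pixels = [(0.9, 0.9, 0.9), (0.8, 0.8, 0.7),    # greyish background
          (0.9, 0.6, 0.4), (0.8, 0.5, 0.3)]    # skin-like colours
sats = [saturation(*p) for p in pixels]
labels = kmeans_1d(sats)
```

The high-saturation cluster is taken as foreground (face); on real frames the clustering would run over every pixel of the image rather than this toy list.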
Appendix I: Ground Truth Evaluation Form
Completed evaluation form for the 111 frames described in Chapter 6. Empty cells are equivalent to zeros and were not required to be filled.
KEY: L = Left, R = Right, S = Straight, T = True, F = False, N = Not Looking, L = Looking
Column groups: Frame Information; Correct Classification (L R S); Misclassification (L R S); Region (L R S);
Eyefinder Detection (T F); Precision (T F); CVMGIAH Robustness (N L)
Clip | Frame No. | L R S | L R S | L R S | T F | T F | N L
Mich_proed 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Mich_proed 50 51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 101 101 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 151 151 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
Mich_proed 201 201 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
Mich_proed 251 251 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
Mich_proed 301 301 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 351 351 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 401 401 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 450 451 0 0 0 0 1 0 0 0 0 0 0 1 2 1 1
Mich_proed 501 501 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 551 551 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0
Mich_proed 601 601 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Mich_proed 651 651 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
Mich_proed 701 701 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Mich_proed 750 751 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Mich_proed 801 801 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 851 851 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0
Mich_proed 901 901 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0
Mich_proed 950 951 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0
Mich_proed 1001 1001 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
Mich_proed 516 1051 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 567 1101 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 1174 1151 1 0 0 0 0 0 0 0 1 1 0 2 0 0 0
Mich_proed 1225 1201 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
Mich_proed 1276 1251 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 1327 1301 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 1377 1351 1 0 0 0 0 0 0 0 1 1 0 2 0 0 0
Mich_proed 1428 1401 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
Mich_proed 1478 1451 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
Mich_proed 1528 1501 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 1578 1551 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
Mich_proed 1628 1601 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
Mich_proed 1678 1651 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mich_proed 1728 1701 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
Khu 212 1751 0 0 0 0 0 0 0 0 1 1 1 0 0 0
Khu 204 1801 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Khu 312 1851 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Khu 302 1901 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Khu 412 1951 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Khu 404 2001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_proed 5 2051 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_proed 55 2301 0 1 0 0 2 0 0 0 1 1 0 4 0 0 0
sub_fork_proed 147 2351 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0
sub_fork_proed 198 2401 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_proed 248 2451 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
sub_fork_proed 298 2501 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_proed 348 2601 0 0 0 0 2 0 0 0 0 0 0 2 1 1 0
sub_fork_proed 398 2651 0 0 0 0 2 0 0 1 1 0 1 4 1 1 0
sub_fork_proed 448 2701 0 1 0 0 0 0 0 1 0 0 0 2 0 0 0
sub_fork_proed 498 2751 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_proed 548 2801 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_proed 598 2851 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0
Close_up 1 2901 0 0 0 0 0 0 0 0 1 0 1 1 0
Close_up 50 2951 1 0 0 0 0 0 0 0 0 0 0 1 0
Close_up 101 3001 0 0 0 0 0 0 0 0 1 1 0 1 0
Close_up 151 3051 0 0 0 0 1 0 1 0
Close_up 200 3101 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Close_up 251 3151 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Close_up 301 3201 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Close_up 351 3251 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Close_up 401 3301 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
Close_up 450 3351 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0
sub_fork_40_60 1 3401 0 0 0 0 0 0 1 1 0 1 2 0
sub_fork_40_60 51 3451 0 0 0 0 2 0 1 0 3 0
sub_fork_40_60 101 3501 0 1 0 0 1 0 0 2 0
sub_fork_40_60 151 3551 0 0 0 0 1 0 0 1 0
sub_fork_40_60 201 3601 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_40_60 251 3651 0 0 0 0 2 0 0 2 0
sub_fork_40_60 301 3701 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_40_60 351 3751 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_40_60 401 3801 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_40_60 451 3851 1 0 1 0
sub_fork_40_60 501 3901 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_40_60 551 3951 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_fork_40_60 701 4001 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
sub_fork_40_60 751 4051 0 2 0 0 1 0 0 0 0 0 0 3 1 1
sub_fork_40_60 801 4101 0 1 0 0 0 0 0 0 0 0 1 0
sub_fork_40_60 851 4151 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 1 4201 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 51 4251 1 1 1 1
stud_proed 101 4301 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 151 4351 0 1 0 0 0 0 0 0 0 0 0 1 0
stud_proed 201 4401 0 2 0 0 0 0 0 0 0 0 0 2 0
stud_proed 251 4451 0 0 0 0 0 0 0 1 0 1 1 0
stud_proed 301 4501 0 0 0 0 0 0 0 1 1 0 1 1 1
stud_proed 351 4551 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 401 4601 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
stud_proed 451 4651 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 501 4701 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 551 4751 0 0 0 0 0 0 0 0 0 0 0 0 1 1
stud_proed 601 4801 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 651 4851 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 701 4901 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 751 4951 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 801 5001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 851 5051 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 901 5101 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 951 5151 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 1001 5201 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 1051 5251 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
stud_proed 1151 5301 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 1201 5351 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 1251 5401 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 1301 5451 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 1351 5501 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
stud_proed 1401 5551 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
Man 14 5601 0 0 0 0 0 0 0 0 0 0 0 0 0
Man 65 5651 0 0 0 0 0 0 0 0 0 0 0 0 0
Man 115 5701 0 0 0 0 1 0 0 0 0 0 0 1 1 1
Man 165 5751 0 0 0 0 1 0 0 0 0 0 0 1 0
Totals 7 12 0 1 19 0 0 7 15 10 5 61 24 22 2