
Face Occlusion Detection Using Deep Convolutional Neural Networks

Yizhang Xia* and Bailing Zhang
Department of Computer Science and Software Engineering
Xi'an Jiaotong-Liverpool University, SIP, Suzhou 215123, P. R. China
*[email protected]

Frans Coenen
Department of Computer Science
University of Liverpool, Liverpool L69 3BX, UK
[email protected]

Received 17 November 2015
Accepted 13 September 2016
Published 17 November 2016

With the rise of crimes associated with Automated Teller Machines (ATMs), security reinforcement by surveillance techniques has been a hot topic on the security agenda. As a result, cameras are frequently installed with ATMs so as to capture the facial images of users, the main objective being to support follow-up criminal investigations in the event of an incident. However, in cases of misuse, the user's face is often occluded, so face occlusion detection has become very important for preventing crimes connected with ATM usage. Traditional approaches to the problem typically comprise a succession of steps: localization, segmentation, feature extraction and recognition. This paper proposes an end-to-end facial occlusion detection framework that is robust and effective, combining a region proposal algorithm with Convolutional Neural Networks (CNNs). The framework adopts a coarse-to-fine strategy consisting of two CNNs: the first detects the head within an upper-body image, while the second determines which facial parts of the head image are occluded. In comparison with previous approaches, the use of CNNs is optimal from a system point of view, as the design follows the end-to-end principle and the model operates directly on image pixels. For evaluation purposes, a face occlusion database consisting of over 50,000 images with annotated facial parts was used. Experimental results reveal that the proposed framework is very effective. Using the bespoke face occlusion dataset, the Aleix and Robert (AR) face dataset and the Labeled Faces in the Wild (LFW) database, we achieved 85.61%, 97.58% and 100% accuracies for head detection when the Intersection over Union (IoU) is larger than 0.5, and 94.55%, 98.58% and 95.41% accuracies for occlusion discrimination, respectively.

Keywords: Automated teller machine (ATM); convolutional neural network (CNN); face occlusion detection; multi-task learning (MTL).

*Corresponding author.

International Journal of Pattern Recognition and Artificial Intelligence
Vol. 30, No. 9 (2016) 1660010 (24 pages)
© World Scientific Publishing Company
DOI: 10.1142/S0218001416600107
http://dx.doi.org/10.1142/S0218001416600107

1. Introduction

Automated Teller Machines (ATMs) have always been targets of criminal activity since their widespread introduction in the 1970s. For example, fraudsters can obtain card details and PINs using a wide range of tactics. Among the possible techniques to defend against ATM crime, real-time automatic alarm systems appear to be the most straightforward technical solution for maximizing protection, because surveillance cameras are installed at nearly all ATMs. However, current video surveillance for ATMs requires constant staff monitoring, which has the obvious disadvantage of human error caused by fatigue or distraction.

Face occlusion detection has been studied for several years, with a number of methods published,26,32 many of which aim to reinforce ATM security. The published approaches can be roughly divided into two categories: face or head detection approaches and occlusion classification approaches.

In the first category, the objective is robust face detection in the presence of partial occlusions, with two common practices, namely facial component-based approaches and shape-based approaches. Facial component-based approaches, such as that presented in Ref. 20, detect facial components such as the eyes, nose and mouth, and determine a face area based on the component detection results. For example, the method proposed in Ref. 20 combined seven AdaBoost-based classifiers for the whole face with individual face-part classifiers trained on nonoccluded face sample sets, and used a decision tree and Linear Discriminant Analysis (LDA) to classify nonoccluded faces and various types of occluded faces. Inspired by discriminatively trained part-based models, El-Barkouky et al.7 proposed a Selective Part Model (SPM) to detect faces under certain types of partial occlusion. Gul and Farooq15 applied the Viola-Jones approach,39 with free rectangular features, to detect left half faces, right half faces and holistic faces. AdaBoost-based face detection was also improved upon in Ref. 5 in order to detect partially occluded faces; this, however, only worked for frontal faces with sufficient resolution.

Shape-based approaches3,4,17,21,28 detect faces based on prior knowledge of head, neck and shoulder shapes. In the scenario of ATM video surveillance, Ref. 28 proposed to compute the lower boundary of the head by moving-object edge extraction and head tracking. Motion information was also exploited in Ref. 21 to detect the head and shoulder shape with the aid of B-spline active contouring. Color has also been applied as a major cue to detect the head or face, following appropriate template fitting strategies.3,4,17 These approaches, however, are limited to constrained poses and well-controlled illumination conditions. In Refs. 3, 4 and 17, the head or shoulders were detected using ellipse or "omega" templates, which fail when a face is severely occluded and/or the shapes change severely with different face poses.

Some previously published research has tried to solve the face occlusion detection problem by straightforward classification.22 For example, by separating a face area into upper and lower parts, Principal Component Analysis (PCA) and Support Vector Machine (SVM) were combined to distinguish between normal faces and partially occluded faces in Ref. 22.

Until recently, the most successful approaches to object detection utilized the well-known sliding window paradigm,10 in which a computationally efficient classifier tests for object presence in every candidate image window. The steady increase in complexity of the core classifiers has led to improved detection quality, but at the cost of significantly increased computation time per window.6,12,16,36,41 One approach to overcoming the tension between computational tractability and high detection quality is the use of "detection proposals".8,38 If high object recall can be reached with considerably fewer windows than used by sliding window detectors, significant performance improvements can be achieved. Current top-performing object detectors, when applied to the PASCAL benchmark image datasets9 and ImageNet,31 all use detection proposals.6,11,12,16,36,41 According to Ref. 18, approaches for generating object proposals can be divided into four types: grouping methods, window scoring methods, alternative methods and baseline methods. Grouping methods attempt to generate multiple (possibly overlapping) segments that are likely to correspond to objects, while window scoring methods score each candidate window according to how likely it is to contain an object. Inspired by the success of object proposal approaches in different object detection settings, this paper proposes a face occlusion detection system using the highly ranked object proposal technique EdgeBoxes.48

Over the last several years, there has been increasing interest in deep neural network models for solving various vision problems. One of the most successful deep learning frameworks is the Convolutional Neural Network (CNN) architecture,24 a bio-inspired hierarchical multilayered neural network that can learn visual representations directly from raw images. CNNs possess some key properties, namely translation invariance and spatially local connections (receptive fields). Pre-trained CNN models can be exploited as generic feature extractors for different vision tasks.24 Among the various advantages of deep neural networks over classical machine learning techniques, the most frequently cited examples include the convenience of implementing knowledge transfer, Multi-Task Learning (MTL), attribute learning, multi-label classification and weakly supervised learning.

In this paper, a novel CNN-based approach to face occlusion detection is proposed. A CNN cascade paradigm is adopted, which tackles the occlusion detection problem in a coarse-to-fine manner. The first CNN implements head/shoulder detection by taking a person's upper-body image as input. The second CNN takes the output of the previous CNN as input, and locates and classifies the different facial parts. To facilitate the study of various face occlusion problems, a database covering different kinds of facial occlusion was created. The database consists of over 50,000 images in which four facial parts are demarcated: the two eyes, the nose and the mouth. To the best of our knowledge, this is the first work directed at analyzing how face occlusion can be detected using a multi-task CNN. The approach was verified on our face occlusion dataset, the Aleix and Robert (AR) dataset30 and the Labeled Faces in the Wild (LFW) dataset,19 obtaining 94.55%, 98.58% and 95.41% accuracies, respectively.

The rest of the paper is organized as follows. Section 2 describes the problem with an introduction to our face occlusion dataset. Section 3 gives an overview of the proposed method and elaborates the details of the coarse-to-fine framework. Section 4 reports the experimental results, followed by conclusions in Sec. 5.

2. Face Occlusion Dataset

A normal face image contains two eyes, a nose and a mouth. The geometrical and textural information from these components is critical to the recognition or identification of a person. If any of these components is blocked or covered, the face image is considered to be occluded. Commonly found face occlusions include faces partially covered by a hat, sunglasses, a mask or a muffler. Some examples are given in Fig. 1.

Fig. 1. Face occlusion examples in the scenario of withdrawing cash from an ATM.

To facilitate research on face occlusion detection, a database of face images with different facial regions intentionally covered or occluded was created. The images were taken using the Microsoft webcam studio camera and AMCap 9.20. Some example images are illustrated in Figs. 2 and 3. The video was filmed against a white background. There were 220 people involved, including 140 males and 80 females.

During the photography, a subject was asked to stand in front of the camera in a set of different specified poses: looking right ahead, up and down by roughly 45°, and right and left by roughly 45°. In addition to face images without any occlusion, a subject was also asked to wear sunglasses, a hat (in yellow and black), a white mask and a black helmet, again in five different poses. Each subject had six video clips recorded, each 30 s long at 25 frames per second. The illumination was a normal office lighting condition. Other conditions, for example clothing, make-up, hair style and expression, were not strictly controlled.

Fig. 2. Upper-body images from our created dataset.

Fig. 3. Head images from our created dataset.

Although the created dataset is comprehensive, it cannot currently be made public due to legal considerations. To alleviate this problem, we also utilized the AR face database,30 one of the earliest and most popular benchmark face databases. AR faces have often been used for robust face recognition. The database includes a number of different types of occlusion: faces with sunglasses and faces partially covered by a scarf. The AR face dataset consists of over 3200 frontal face images taken from 126 subjects, with some examples given in Fig. 4.

The LFW dataset19 was also employed to further evaluate our approach. The LFW dataset contains over 13,000 face images of 1680 people collected from the Internet. The faces were detected by the OpenCV implementation of the Viola-Jones39 face detector. The cropped region returned by the detector was then automatically enlarged by a factor of 2.2 in each dimension to capture more of the head, and scaled to a uniform size of 250 × 250. For the evaluation presented in this paper, 1000 images were selected and the heads manually cropped. Occlusions were created using black rectangles and facial landmark localization.46 Some examples are given in Fig. 5.
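To make the preprocessing concrete, the following is a minimal sketch (in Python with scikit-image; the original pipeline used OpenCV) of the crop enlargement and rescaling just described. The border clamping and the corner-coordinate box format are assumptions, as the paper does not specify them.

```python
# Sketch of the LFW preprocessing described above: enlarge the Viola-Jones
# face box by a factor of 2.2 in each dimension about its center, then
# rescale the crop to 250 x 250. Border clamping is an assumption.
import numpy as np
from skimage.transform import resize

def enlarge_and_scale(img, box, factor=2.2, size=250):
    x1, y1, x2, y2 = box                       # detector box (pixel coords)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # box center
    hw, hh = (x2 - x1) * factor / 2.0, (y2 - y1) * factor / 2.0
    x1, x2 = int(max(cx - hw, 0)), int(min(cx + hw, img.shape[1]))
    y1, y2 = int(max(cy - hh, 0)), int(min(cy + hh, img.shape[0]))
    return resize(img[y1:y2, x1:x2], (size, size))
```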

3. System Overview

Though deep neural networks have achieved remarkable performance, it is still difficult to solve many real-world problems with a single CNN model. With current technology, the resolution of the input to a CNN must be relatively small, which causes some details of an image to be lost. To obtain better performance, the common practice of a coarse-to-fine paradigm has been applied in CNN design.29,35 Following the same line of thought, we propose a two-stage CNN for face occlusion detection, as illustrated in Fig. 6. The first CNN detects the head in a person's upper-body image, while the second CNN determines which facial part of the head image is occluded.

Fig. 4. Examples of faces from the AR face database: (a) head and shoulders and (b) head.

Fig. 5. Examples of images from the LFW database with occlusions superimposed.

3.1. Head detection

To identify the location of the head in an image, advantage was taken of Regions with Convolutional Neural Networks (R-CNN),13 the state-of-the-art object detector, which classifies candidate object hypotheses generated by appropriate region proposal algorithms. R-CNN leverages several advances in computer vision and CNNs, including the superb feature expression capability of a pre-trained CNN, the flexibility of fine-tuning for the specific objects to be detected, and the ever-increasing efficiency of object proposal generation schemes.

Among the off-the-shelf object proposal generation algorithms, the EdgeBoxes technique,48 which has attracted much interest in recent years, was chosen. EdgeBoxes is built on a structured edge map to locate object boundaries and find object proposals. The number of edges wholly enclosed by a bounding box is used to rank the likelihood of the box containing an object.

Fig. 6. The flow diagram of the whole system.

The overall convolutional net architecture is shown in Fig. 7. The network consists of three convolution stages followed by three fully connected layers. A convolution stage includes a convolution layer, a nonlinear activation layer, a local response normalization layer and a max-pooling layer. The nonlinear activation and local response normalization layers are not shown in Figs. 7 and 8 as they do not change the data size. Using shorthand notation, the full architecture is C(8,5,1)-A-N-P-C(16,5,1)-A-N-P-C(32,5,1)-A-N-P-FC(1024)-A-FC(128)-A-FC(4)-A, where C(d,f,s) indicates a convolutional layer with d filters of spatial size f × f applied to the input with stride s, A is the nonlinear activation function (the ReLU activation function14), and FC(n) is a fully connected layer with n output nodes. All pooling layers P use max-pooling over nonoverlapping 2 × 2 regions, and all normalization layers N are defined as described in Ref. 24. The final layer is densely connected to a soft-max layer. The structure of the networks and the hyperparameters were empirically initialized based on previous works using CNNs.
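To make the shorthand concrete, a minimal PyTorch sketch of the head detector network follows; the authors' implementation was MATLAB-based, so this is an illustration rather than their code. The 100 × 100 grayscale input follows Sec. 4.2, while the absence of padding and the LRN hyperparameters are assumptions.

```python
# Sketch of C(8,5,1)-A-N-P-C(16,5,1)-A-N-P-C(32,5,1)-A-N-P-
# FC(1024)-A-FC(128)-A-FC(4) with a soft-max applied to the final layer.
import torch
import torch.nn as nn

def conv_stage(c_in, c_out):
    # One convolution stage: convolution, ReLU activation, local response
    # normalization and non-overlapping 2 x 2 max pooling.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=5, stride=1),
        nn.ReLU(inplace=True),
        nn.LocalResponseNorm(size=5),
        nn.MaxPool2d(kernel_size=2, stride=2))

class HeadDetectorCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_stage(1, 8), conv_stage(8, 16), conv_stage(16, 32))
        # Spatial sizes: 100 -> 96 -> 48 -> 44 -> 22 -> 18 -> 9,
        # so the flattened feature vector has 32 * 9 * 9 = 2592 entries.
        self.classifier = nn.Sequential(
            nn.Linear(32 * 9 * 9, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 4))

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

probs = torch.softmax(HeadDetectorCNN()(torch.randn(2, 1, 100, 100)), dim=1)
```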

The well-known overfitting problem has been taken into account in our design with the following considerations. Firstly, we empirically compared a set of different CNN architectures with varying numbers of kernels and selected the one deemed most effective with regard to the trade-off between network complexity and performance. Secondly, overfitting is avoided to a large extent because the CNN size is very moderate, in sharp contrast to some published CNN models for large-scale image classification.24

The adopted CNN uses the shared-weight neural network architecture,25 in which the local receptive field (kernel or filter) is replicated across the entire visual field to form a feature map; this is known as the convolution operation. The sharing of weights reduces the number of free variables and increases the generalization performance of the network. Weights (kernels or filters) are initialized at random and learn to become edge, color or specific pattern detectors.

In deep CNNs, the classical sigmoidal function has been replaced by the Rectified Linear Unit (ReLU) to accelerate training. Recent CNN-based approaches23,24,34,37,42,44 applied the ReLU as the nonlinear activation function for both the convolution layers and the fully connected layers, often with faster training as reported in Ref. 24.

Fig. 7. Architecture of the head detector CNN.

Fig. 8. Architecture of the face occlusion classifier CNN.

Typical pooling functions include average pooling and max pooling. Average pooling takes the arithmetic mean of the elements in each pooling region, while max pooling selects the largest element of the input.

The four layers of convolution, nonlinear activation, pooling and normalization are combined hierarchically to form a convolution stage (block). Generally, an input image is passed through several convolution stages to extract complex descriptive features. At the output of the topmost convolution stage, all of the small feature maps are concatenated into a long vector. Such a vector plays the same role as hand-coded features and is fed to a fully connected layer. A standard fully connected classifier can be either the conventional Multi-Layer Perceptron (MLP) or an SVM.

As the fully connected layers receive the feature vector from the topmost convolution stage, the output layer can generate a probability distribution over the output classes. Toward this purpose, the output of the last fully connected layer is fed to a K-way softmax layer (where K is the number of classes), which is equivalent to multi-class logistic regression.
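As a small worked illustration of this last step, the sketch below shows the K-way softmax as a plain function; the subtraction of the maximum is a standard numerical-stability detail, not something from the paper.

```python
# The K-way softmax: exponentiate the K scores from the last fully
# connected layer and normalize them into a probability distribution.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 0.5, -1.0])))  # three class probabilities, sum 1
```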

3.2. Face occlusion classification

The second-stage CNN takes the output of the head detector as input and performs face occlusion classification. The CNN trains the classifier with implicit, highly discriminative features to differentiate the facial parts and, at the same time, distinguish whether a facial part is occluded. This is aided by a multi-task learning paradigm described in more detail below.

The intuition behind MTL is to jointly learn multiple tasks by exploiting a shared structural representation, improving generalization through the related tasks. Deep neural networks such as CNNs have proven advantageous in MTL due to their powerful representation learning capability and the transferability of knowledge across similar tasks.33,43,45

Inspired by the successes of CNN MTL, we configured the second-stage CNN so that the shared representation learned in the feature layers feeds two independent MLP classifiers in the final layer. Specifically, we jointly trained the facial part (left eye, right eye, nose and mouth) classification and the occlusion/nonocclusion decision simultaneously.

In our experiments, we adopt a multi-task CNN as shown in Fig. 8. The architecture is the same as that of the head detector CNN. When MTL is performed, we minimize a linear combination of the individual task losses43:

    L_joint = Σ_{i=1}^{N} λ_i L_i,    (1)

where N is the total number of tasks, λ_i is the weighting factor for the ith task and L_i is the ith task loss. When one of the λ_i takes the value 1 (and the rest 0), the objective degenerates to classical single-task learning.
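A minimal PyTorch sketch of this joint objective is given below; the per-task weights are not specified in the paper and are left to the caller here for illustration.

```python
# Eq. (1): L_joint = sum_i lambda_i * L_i, with one cross-entropy loss
# per task head (facial-part classification and occlusion decision).
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def joint_loss(task_logits, task_targets, lambdas):
    # task_logits/task_targets: one tensor per task; lambdas: task weights.
    return sum(lam * criterion(logits, target)
               for lam, logits, target in zip(lambdas, task_logits, task_targets))
```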


3.3. Bounding box regression

A bounding-box regression module is employed to improve the detection accuracy. Many bounding boxes generated by the EdgeBoxes algorithm are not close to the object ground truth, yet may be judged as positive samples. Detection can be regarded as a regression problem of finding the location of an object, a formulation that works well for localizing a single object. Based on an error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding-box regression employed in the Deformable Parts Model (DPM),10 we trained a linear regression model to predict a new detection window from the last pooled features of the region proposals produced by EdgeBoxes. This simple approach fixes a large number of mislocalized detections, substantially boosting the accuracy.
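The paper only states that a linear regressor maps the last pooled features to a refined window; the sketch below assumes the standard R-CNN-style target parameterization and a closed-form ridge regression, which is one common way to realize such a regressor.

```python
# Sketch of a linear bounding-box regressor. Boxes are (x_center, y_center,
# w, h); the target offsets follow the usual R-CNN parameterization.
import numpy as np

def bbox_targets(proposals, ground_truth):
    px, py, pw, ph = proposals.T
    gx, gy, gw, gh = ground_truth.T
    return np.stack([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)], axis=1)

def fit_regressor(features, targets, lam=1e3):
    # Closed-form ridge regression: W = (X^T X + lam*I)^(-1) X^T T.
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + lam * np.eye(d),
                           features.T @ targets)

def refine(proposals, features, W):
    tx, ty, tw, th = (features @ W).T
    px, py, pw, ph = proposals.T
    return np.stack([px + pw * tx, py + ph * ty,
                     pw * np.exp(tw), ph * np.exp(th)], axis=1)
```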

3.4. Pre-training

With supervised learning on sufficient annotated training data, CNNs containing millions of parameters have demonstrated competitive performance on visual recognition tasks23,24,34,37,42,44 when starting from a random initialization. However, the CNN architecture is strongly dependent on large amounts of training data for good generalization. When the amount of labeled data is limited, directly training a high-capacity CNN may become problematic. Research40 has shown that an alternative is to compensate for the problem by choosing an optimized starting point, pre-trained by transferring parameters from either supervised or unsupervised learning, as opposed to a random initialization. We first trained the CNN model in supervised mode using ImageNet data and then fine-tuned it on the domain-specific labeled images as the head detector. To be specific, the pre-trained model is designed to recognize objects in natural images; the knowledge leveraged from the source task can reflect common characteristics shared by these two types of images, such as corners and edges.
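As an illustration of this transfer scheme, the sketch below reuses the HeadDetectorCNN class from the Sec. 3.1 sketch: pre-trained weights are loaded from a checkpoint (the file name is a placeholder) and the final layer is replaced before fine-tuning on the domain-specific data.

```python
# Sketch of pre-training followed by fine-tuning: load weights learned on
# the ImageNet subset, swap in a new output layer and continue training.
import torch
import torch.nn as nn

model = HeadDetectorCNN()  # architecture sketched in Sec. 3.1
model.load_state_dict(torch.load("pretrained_imagenet_subset.pth"))  # placeholder
model.classifier[-1] = nn.Linear(128, 2)  # new head/non-head output layer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.01)
```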

4. Experimental Results

The proposed approach was evaluated on our face occlusion dataset, the AR face dataset and the LFW dataset, with pre-training and fine-tuning. For the head detector, the pre-training stage employed 10% of the images from ILSVRC201231 for the classification task. For the occlusion classifier, pre-training was implemented via a face recognition task.

4.1. Implementation details

The experiments were conducted on a Dell Tower 5810 computer with an Intel Xeon E5-1650 v3 CPU and 64 GB of memory. To speed up CNN training, an NVIDIA GeForce GTX TITAN GPU was installed. The programs ran under 64-bit Windows 7 Ultimate with Matlab 2013b, Microsoft Visual Studio 2012 and CUDA 7.0.47

We trained our models using stochastic gradient descent with a batch size of 128 examples and a momentum of 0.01. The learning rate was initialized to 0.01 and adapted during training. More specifically, we monitored the overall loss function: if the loss was not reduced for 5 consecutive epochs, the learning rate was dropped by 50%. We deem the network converged when the loss has stabilized.
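This schedule maps directly onto PyTorch's ReduceLROnPlateau, as the sketch below shows; the paper's own implementation was MATLAB-based, and train_one_epoch, loader and num_epochs are placeholders.

```python
# Halve the learning rate whenever the monitored loss has not improved
# for five consecutive epochs, as described above.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)

for epoch in range(num_epochs):
    epoch_loss = train_one_epoch(model, loader, optimizer)  # assumed helper
    scheduler.step(epoch_loss)
```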

In the training procedure, 10% of the images from ILSVRC2012 are first used to train Net1 from a random initialization. Then, Net1 is fine-tuned on the head detection data from the AR face dataset, our dataset and the LFW dataset. Thirdly, Net2 is initialized from the fine-tuned Net1. Finally, Net2 is fine-tuned on the face occlusion classification data from the AR face dataset, our dataset and the LFW dataset. The test procedure is described in Fig. 6.

4.2. Head detector

As the state-of-the-art object proposal method with regard to the trade-off between speed and recall, the EdgeBoxes method needs its parameters tuned at each desired overlap threshold.48 We estimated three pairs of α and β parameters to evaluate the object proposals corresponding to the head hypotheses, following the standard object proposal evaluation framework. The detection recall is calculated as the ratio of ground-truth bounding boxes covered by EdgeBoxes proposals with an Intersection over Union (IoU) larger than a given threshold.
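For concreteness, a minimal sketch of this recall computation follows; boxes are assumed to be (x1, y1, x2, y2) corner coordinates.

```python
# A ground-truth head box counts as recalled if at least one proposal
# overlaps it with IoU above the threshold.
import numpy as np

def iou(box, boxes):
    # Intersection over Union between one box and an array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def detection_recall(gt_boxes, proposals_per_image, thresh=0.5):
    hits = [np.any(iou(gt, props) > thresh)
            for gt, props in zip(gt_boxes, proposals_per_image)]
    return float(np.mean(hits))
```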

Three useful pairs of α and β values in EdgeBoxes48 were evaluated to provide the head candidates from our dataset, the AR face dataset and the LFW dataset, as shown in Fig. 9. The parameters α and β control the step size of the sliding window search and the NMS threshold, respectively. The plots in Fig. 9 illustrate the algorithm's behavior with varying α and β when the same number of object proposals is generated. More specifically, the three α/β pairs are 0.65/0.55, 0.65/0.75 and 0.85/0.95. They were tested with the maximum number of region proposals fixed at 500 for our database and the AR database, and at 700 for the LFW database. The largest values clearly obtain the best recall: we achieved 97.53% on our face occlusion database, 99.35% on the AR face database and 100% on the LFW database when the IoU is larger than 0.5. In conclusion, as the density of the sampling rate increases, the recall becomes higher but the run time becomes longer.

The second experiment compared different numbers of bounding boxes produced by EdgeBoxes according to their ranking scores. A suite of maximum box counts, 100, 300, 500, 700 and 900, was evaluated with the density of the sampling rate and the NMS threshold fixed at 0.85 and 0.95, respectively, as shown in Fig. 10. To ensure that the face occlusion classification is valid, the IoU between a region and the ground truth must be larger than 0.5. The recall is approximately 100% on our face occlusion database, the AR face database and the LFW database (Fig. 10). The recall does not increase once the maximum number of region proposals exceeds 500 on our dataset and the AR dataset, or 700 on the LFW dataset. In consideration of the tolerance of Net1, α, β and the maximum number of region proposals were set to 0.85, 0.95 and 500 on our dataset and the AR dataset, respectively. Taking account of the complex backgrounds in the LFW dataset, its maximum number of boxes was set to 700, with the other parameters the same as for our dataset and the AR dataset. The same parameters were thus selected for our face occlusion database and the AR face database; however, the recall obtained on our dataset is lower than that on the AR dataset due to the greater variability of our database. The recall curves (Figs. 9 and 10) demonstrate that almost all of the heads are included among the proposals produced by EdgeBoxes.

Fig. 9. Recall of the three useful variants of EdgeBoxes when the maximum number of region proposals is the same.

After obtaining the candidate regions from EdgeBoxes, the regions are classified into head and non-head by the trained CNN module. Before applying the CNN, a hypothesized object proposal is discarded if it can be judged directly as a head or a useless patch by simple rules: on the AR face database, a region is taken as a head if its overlap with the whole image is less than 5%; on our face occlusion database, a region is taken as a head or a useless patch if its overlap with the full image is between 2% and 30%. The features of the remaining normalized regions (100 × 100 gray images) are then extracted via Net1, and the subsequent MLP predicts whether each region is a head region. After the MLP, the positive regions, sorted by the MLP confidence (which indicates the reliability of the detection), are merged into one box by non-maximum suppression with a 0.3 overlap threshold. The performance of the approach was tested on our face occlusion database, the AR face database and the LFW database, yielding accuracies of 56.83%, 73.79% and 88.52%, respectively, at 0.5 IoU (Fig. 10).
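A sketch of this non-maximum suppression step is given below, reusing the iou() helper from the recall sketch above; the 0.3 threshold follows the text.

```python
# Greedy NMS: keep the highest-scoring box, drop boxes overlapping it by
# more than the threshold, and repeat on the remainder.
import numpy as np

def nms(boxes, scores, thresh=0.3):
    order = np.argsort(scores)[::-1]  # indices sorted by MLP confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        order = order[1:][iou(boxes[i], boxes[order[1:]]) <= thresh]
    return boxes[keep]
```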

Fig. 10. Recall for different maximum numbers of region proposals when α is 0.85 and β is 0.95.

The performance can be further improved by a simple bounding-box regression stage. After scoring each object proposal with the MLP, we predict a new bounding box for the detection using a class-specific bounding-box regressor, similar in spirit to the approach used in deformable part models.13 The better head location is regressed from features computed by the CNN (Fig. 11). After regression of the positive detections, the performance improves by 28.78%, 23.79% and 11.48% on our face occlusion database, the AR face database and the LFW database, respectively, at IoU 0.5. The regressor improves the performance because the head is centered in the images; however, it only helps when the classification result from the CNN is correct. Examples of head detection on our face occlusion database, the AR face database and the LFW database are shown in Figs. 12–14, respectively.

As a baseline approach, we created a head detector based on HOG feature extraction1 and SVM classification.2 The parameters for the HOG were set as orientations in the range [0, 360], 40 orientation bins and 3 spatial levels, which means that the inputs to the SVM have a dimension of 840, i.e. (1 + 4 + 16) × 40. For the SVM, we employed C-SVC with a radial basis kernel function.2 The comparative performance is illustrated in Table 1, which demonstrates that our approach is robust and effective in complex scenes. The computational complexity is shown in Table 2. The major part of the computation of our approach comes from the region proposal algorithm, EdgeBoxes.
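A rough sketch of this baseline is shown below; scikit-image's hog() and scikit-learn's SVC are stand-ins for the spatial-pyramid HOG of Ref. 1 and the C-SVC of Ref. 2, and the unsigned gradients of skimage's hog() only approximate the paper's [0, 360] orientation range.

```python
# Three pyramid levels with 1x1, 2x2 and 4x4 cell grids and 40 orientation
# bins give (1 + 4 + 16) * 40 = 840 features per window, as in the text.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def pyramid_hog(img):
    feats = [hog(resize(img, (64, 64)), orientations=40,
                 pixels_per_cell=(64 // n, 64 // n), cells_per_block=(1, 1))
             for n in (1, 2, 4)]
    return np.concatenate(feats)  # 840-dimensional descriptor

clf = SVC(C=1.0, kernel="rbf")  # C-SVC with a radial basis kernel
```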

Fig. 11. Recall of head detection; the red line shows the performance of the head detector without regression and the blue line the performance with regression (color online).

Fig. 12. Examples of head detection results on our face occlusion database; the red rectangle is the ground truth and the green rectangle is the detected location (color online).

Fig. 13. Examples of head detection results on the AR face database; the red rectangle is the ground truth and the green rectangle is the detected location (color online).

Table 1. Accuracy of head detection when IoU is larger than 0.5.

             LFW (%)   AR (%)   Our Dataset (%)
HOG + SVM     91.87     97.44    72.41
Our method   100        97.58    85.61

4.3. Face occlusion classification

After obtaining the head position from the head detector, the face occlusion classifier is used to classify the type of face occlusion.

The AR face dataset30 is a popular dataset for face recognition. It contains over 4000 color images of the faces of 126 people (70 males and 56 females). The images cover frontal-view faces with occlusions (sunglasses and scarf), different facial expressions and illumination conditions (Fig. 4). For the face occlusion classifier, the dataset is categorized into two occlusion conditions: faces with the eyes occluded by sunglasses and faces with the mouth occluded by a scarf. Half of the unoccluded faces were used to pre-train the CNN model, following a general face recognition methodology. More specifically, the pre-training dataset consisted of 31 male and 25 female faces (frontal-view faces with different expressions and illumination).

After pre-training, the fully connected layers were replaced by a new MLP initialized with random connection weights, and fine-tuning was performed on the face occlusion dataset. The CNN with multi-task learning was designed to predict the presence or absence of the different facial parts, thus indicating occlusion or not. The structure of the CNN is illustrated in Fig. 8.

The loss function values during training are illustrated in Fig. 15. After the loss stabilizes, the converged model is applied to the test samples, with the accuracies shown in Table 3.

The accuracy values during training and validation are given in Fig. 16. Cross-validation and early stopping are employed to avoid overfitting. Figure 16 illustrates that the accuracies during training and validation rise continually; early stopping is used to select the appropriately trained model during training and to avoid overfitting.

Fig. 14. Examples of head detection results on the LFW database; the red rectangle is the ground truth and the green rectangle is the detected location (color online).

Table 2. Computational complexity of head detection (fps).

             LFW    AR     Our Created Dataset
HOG + SVM    32.48  24.06  419.68
Our method    0.94   2.02    1.76


The system performance for face occlusion classification was also evaluated using our face occlusion dataset and the LFW dataset. The experimental procedures are similar to those for the AR face dataset. The difference is that there is a variety of occlusions of different facial parts, namely the left eye, right eye, nose and mouth. Accordingly, four MLPs follow the shared layers, each verifying whether one facial part is occluded, as explained in Fig. 8.

The training loss is shown in Fig. 15, and the occlusion accuracies on our dataset and LFW are 94.55% and 95.41%, respectively, with further details provided in Table 4. Our model is a multi-task framework that shares the front layers, and the summed loss changes as the CNN parameters are updated. It can be seen that the loss fluctuates more markedly on our face dataset because it is much more complex than the AR dataset and the LFW dataset.

Fig. 15. Comparison of loss function values during training.

Table 3. Face occlusion classification on the AR dataset.

          Eyes    Mouth   Total
Accuracy  98.58%  100%    98.58%

A face detector combining the Haar feature27 extractor and the Viola-Jones39 classifier is used as the baseline for face occlusion classification. The experimental results are summarized in Table 5, which indicates that our method outperforms Haar-based face detection. The corresponding computational complexity is reported in Table 6, showing that our method is faster than the classical approach.

Table 4. Face occlusion classification accuracy on our face occlusion dataset and LFW.

             Left Eye (%)  Right Eye (%)  Nose (%)  Mouth (%)  Total (%)
Our dataset  98.15         99.07          98.15     99.07      94.55
LFW          97.79         98.91          99.63     99.02      95.41

Fig. 16. Comparison of accuracy values during training and validation.

Table 5. Accuracy of face occlusion classification.

             LFW (%)  AR (%)  Our Dataset (%)
Haar + VJ    45.22    47.98   21.20
Our method   95.41    98.58   94.55

Table 6. Computational complexity of face occlusion classification (ms).

             LFW    AR     Our Created Dataset
Haar + VJ    16.29  58.23  22.35
Our method   13.10  20.67  17.31

Fig. 17. Examples of wrong region proposals.

4.4. Error analysis

Two factors may impact the effectiveness of the proposed approach. Firstly, the region proposal algorithm should be robust with regard to the generated candidate bounding boxes. However, as EdgeBoxes produces candidate regions from edges, it has several potential problems: (a) negative data are generated when a person wears clothing with complex texture (Fig. 17(a)); (b) the head appears larger with certain hairstyles (Fig. 17(b)); and (c) the head is not segmented when a person has long hair and wears a black dust coat (Fig. 17(c)). Secondly, three factors may influence the performance of the face occlusion classifier: illumination variations, occlusions and partial occlusions. Figure 18 further illustrates these difficult situations.

Fig. 18. Examples of erroneously predicted images.

5. Conclusion

This paper proposed an approach for face occlusion detection to enhance surveillance security for ATMs. The coarse-to-fine approach consists of a head detector and a face occlusion classifier. The head detector is implemented with the EdgeBoxes region proposal algorithm, a CNN and an MLP, and achieved detection accuracies of 97.58%, 85.61% and 100% on the AR face database, our face occlusion database and the LFW dataset, respectively. For face occlusion classification, a CNN is applied with a pre-training strategy based on the usual face recognition task, followed by fine-tuning on face occlusion classification based on MTL to verify whether each facial part is occluded. Our approach was evaluated on the AR face dataset, our dataset and the LFW dataset, achieving 98.58%, 94.55% and 95.41% accuracies, respectively. Further work is directed toward improving the model with a more robust and accurate region proposal method, which will render it more practical for real-world applications.

Acknowledgment

The first author thanks Chao Yan and Rongqiang Qian for their valuable help. The constructive comments and suggestions from the anonymous reviewers are much appreciated.

References

1. A. Bosch, A. Zisserman and X. Munoz, Representing shape with a spatial pyramid kernel, 6th ACM Int. Conf. Image and Video Retrieval, New York (2007), pp. 401–408.
2. C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (2011) 27:1–27:27.
3. T. Charoenpong, Face occlusion detection by using ellipse fitting and skin color ratio, Burapha University Int. Conf. (BUU) (2013), pp. 1145–1151.
4. T. Charoenpong, C. Nuthong and U. Watchareeruetai, A new method for occluded face detection from single viewpoint of head, 11th Int. Conf. Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON) (2014), pp. 1–5.
5. J. Chen, S. Shan, S. Yang, X. Chen and W. Gao, Modification of the AdaBoost-based detector for partially occluded faces, 18th Int. Conf. Pattern Recognition 2 (2006) 516–519.
6. R. Cinbis, J. Verbeek and C. Schmid, Segmentation driven object detection with Fisher vectors, IEEE Int. Conf. Computer Vision (ICCV) (2013), pp. 2968–2975.
7. A. El-Barkouky, A. Shalaby, A. Mahmoud and A. Farag, Selective part models for detecting partially occluded faces in the wild, IEEE Int. Conf. Image Processing (ICIP) (2014), pp. 268–272.
8. I. Endres and D. Hoiem, Category-independent object proposals with diverse ranking, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2014) 222–234.
9. M. Everingham, S. Eslami, L. Van Gool, C. Williams, J. Winn and A. Zisserman, The PASCAL visual object classes challenge: A retrospective, Int. J. Comput. Vis. 111 (2015) 98–136.
10. P. Felzenszwalb, R. Girshick, D. McAllester and D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 1627–1645.
11. R. Girshick, Fast R-CNN, IEEE Int. Conf. Computer Vision (ICCV) (2015), pp. 1440–1448.
12. R. Girshick, J. Donahue, T. Darrell and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2014), pp. 580–587.
13. R. Girshick, J. Donahue, T. Darrell and J. Malik, Region-based convolutional networks for accurate object detection and segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2016) 142–158.
14. X. Glorot, A. Bordes and Y. Bengio, Deep sparse rectifier neural networks, J. Mach. Learn. Res. 15 (2011) 315–323.
15. S. Gul and H. Farooq, A machine learning approach to detect occluded faces in unconstrained crowd scene, IEEE 14th Int. Conf. Cognitive Informatics & Cognitive Computing (ICCI*CC) (2015), pp. 149–155.
16. K. He, X. Zhang, S. Ren and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2015) 1904–1916.
17. S. Hongxing, W. Jiagi, S. Peng and Z. Xiaoyang, Facial area forecast and occluded face detection based on the YCbCr elliptical model, Int. Conf. Mechatronic Sciences, Electric Engineering and Computer (MEC) (2013), pp. 1199–1202.
18. J. Hosang, R. Benenson, P. Dollár and B. Schiele, What makes for effective detection proposals?, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2016) 814–830.
19. G. B. Huang, M. Ramesh, T. Berg and E. Learned-Miller, Labeled faces in the wild: A database for studying face recognition in unconstrained environments, Technical Report, University of Massachusetts, Amherst (2007), pp. 7–49.
20. K. Ichikawa, T. Mita, O. Hori and T. Kobayashi, Component-based face detection method for various types of occluded faces, 3rd Int. Symp. Communications, Control and Signal Processing (2008), pp. 538–543.
21. G. Kim, J. K. Suhr, H. G. Jung and J. Kim, Face occlusion detection by using B-spline active contour and skin color information, 11th Int. Conf. Control Automation Robotics Vision (ICARCV) (2010), pp. 627–632.
22. J. Kim, Y. Sung, S. Yoon and B. Park, A new video surveillance system employing occluded face detection, in Innovations in Applied Artificial Intelligence (Springer Berlin Heidelberg, 2005), pp. 65–68.
23. J. Krause, T. Gebru, J. Deng, L.-J. Li and L. Fei-Fei, Learning features and parts for fine-grained recognition, 22nd Int. Conf. Pattern Recognition (ICPR) (2014), pp. 26–33.
24. A. Krizhevsky, I. Sutskever and G. E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2012), pp. 1097–1105.
25. Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature 521 (2015) 436–444.
26. S. Liao, A. K. Jain and S. Z. Li, Partial face recognition: Alignment-free approach, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013) 1193–1205.
27. R. Lienhart and J. Maydt, An extended set of Haar-like features for rapid object detection, Int. Conf. Image Processing 1 (2002) 900–903.
28. D.-T. Lin and M.-J. Liu, Face occlusion detection for automated teller machine surveillance, in Advances in Image and Video Technology, Lecture Notes in Computer Science, Vol. 4319 (Springer Berlin Heidelberg, 2006), pp. 641–651.
29. Z. Liu, P. Luo, X. Wang and X. Tang, Deep learning face attributes in the wild, IEEE Int. Conf. Computer Vision (ICCV) (2015), pp. 3730–3738.
30. A. Martinez and R. Benavente, The AR face database, CVC Technical Report (1998).
31. O. Russakovsky, J. Deng, H. Su et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (2015) 211–252.
32. J. Shermina and V. Vasudevan, Recognition of the face images with occlusion and expression, Int. J. Pattern Recogn. Artif. Intell. 26 (2012) 1256006-1–1256006-26.
33. L. Sijin, L. Zhi-Qiang and A. Chan, Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network, IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW) (2014), pp. 488–495.
34. K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, in Advances in Neural Information Processing Systems (2014), pp. 568–576.
35. Y. Sun, X. Wang and X. Tang, Deep convolutional network cascade for facial point detection, IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Oregon (2013), pp. 3476–3483.
36. C. Szegedy, S. Reed, D. Erhan and D. Anguelov, Scalable, high-quality object detection, arXiv preprint (2014).
37. Y. Taigman, M. Yang, M. Ranzato and L. Wolf, DeepFace: Closing the gap to human-level performance in face verification, IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2014), pp. 1701–1708.
38. K. van de Sande, J. Uijlings, T. Gevers and A. Smeulders, Segmentation as selective search for object recognition, IEEE Int. Conf. Computer Vision (ICCV) (2011), pp. 1879–1886.
39. P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple features, Computer Vision and Pattern Recognition 1 (2001) 511–518.
40. R. Wagner, M. Thom, R. Schweiger, G. Palm and A. Rothermel, Learning convolutional neural networks from few samples, Int. Joint Conf. Neural Networks (IJCNN) (2013), pp. 1–7.
41. X. Wang, M. Yang, S. Zhu and Y. Lin, Regionlets for generic object detection, IEEE Int. Conf. Computer Vision (ICCV) (2013), pp. 17–24.
42. S. Yi, W. Xiaogang and T. Xiaoou, Deep learning face representation from predicting 10,000 classes, IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2014), pp. 1891–1898.
43. C. Zhang and Z. Zhang, Improving multiview face detection with multi-task deep convolutional neural networks, IEEE Winter Conf. Applications of Computer Vision (WACV) (2014), pp. 1036–1041.
44. N. Zhang, M. Paluri, M. Ranzato, T. Darrell and L. Bourdev, PANDA: Pose aligned networks for deep attribute modeling, IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2014), pp. 1637–1644.
45. Z. Zhang, P. Luo, C. Loy and X. Tang, Facial landmark detection by deep multi-task learning, European Conf. Computer Vision (ECCV) (2014), pp. 94–108.
46. E. Zhou, H. Fan, Z. Cao, Y. Jiang and Q. Yin, Extensive facial landmark localization with coarse-to-fine convolutional network cascade, IEEE Int. Conf. Computer Vision Workshops (ICCVW) (2013), pp. 386–391.
47. X. Zhu, K. Li, A. Salah, L. Shi and K. Li, Parallel implementation of MAFFT on CUDA-enabled graphics hardware, IEEE/ACM Trans. Comput. Biol. Bioinformatics 12 (2015) 205–218.
48. C. Zitnick and P. Dollár, Edge boxes: Locating object proposals from edges, European Conf. Computer Vision (ECCV) (2014), pp. 391–405.

Yizhang Xia is a Ph.D. student of Computer Science at The University of Liverpool. He received his Bachelor's degree in Computer Science and Software Engineering from Xi'an Jiaotong-Liverpool University, Suzhou. His research interests include computer vision and deep learning with applications in intelligent transportation systems and biometrics.

Bailing Zhang received his Master's degree in Communication and Electronic System from the South China University of Technology, Guangzhou, China, and his Ph.D. in Electrical and Computer Engineering from the University of Newcastle, NSW, Australia, in 1987 and 1999, respectively. He is currently an Associate Professor in the Department of Computer Science and Software Engineering at Xi'an Jiaotong-Liverpool University, Suzhou, China. He has been a Lecturer in the School of Computer Science and Mathematics at Victoria University, Australia, since 2003. Before 2001, he was a research staff member at the Kent Ridge Digital Labs (KRDL), Singapore. His research interests include computer vision and deep learning with applications in intelligent transportation systems and bioinformatics.

Frans Coenen is a Professor of Computer Science at The University of Liverpool. He has a general background in AI, and has been working in the field of data mining and Knowledge Discovery in Data (KDD) for the last 15 years. He is particularly interested in big data; social network and trend mining; the mining of nonstandard datasets such as graph, image and document collections; and the practical application of data mining in its many forms.
