
Face Occlusion Detection Using Deep Convolutional Neural Networks

Yizhang Xia* and Bailing Zhang
Department of Computer Science and Software Engineering
Xi'an Jiaotong-Liverpool University, SIP, Suzhou 215123, P. R. China
*[email protected]

Frans Coenen
Department of Computer Science
University of Liverpool, Liverpool L69 3BX, UK
[email protected]

Received 17 November 2015
Accepted 13 September 2016
Published 17 November 2016

With the rise of crimes associated with Automated Teller Machines (ATMs), security reinforcement by surveillance techniques has been a hot topic on the security agenda. As a result, cameras are frequently installed with ATMs so as to capture the facial images of users, the main objective being to support follow-up criminal investigations in the event of an incident. However, in cases of misuse, the user's face is often occluded, so face occlusion detection has become very important for preventing crimes connected with ATM usage. Traditional approaches to the problem typically comprise a succession of steps: localization, segmentation, feature extraction and recognition. This paper proposes an end-to-end facial occlusion detection framework that is robust and effective, combining a region proposal algorithm with Convolutional Neural Networks (CNNs). The framework adopts a coarse-to-fine strategy consisting of two CNNs: the first detects the head within an upper-body image, while the second determines which facial parts of the head image are occluded. In comparison with previous approaches, the use of CNNs is optimal from a system point of view, as the design follows the end-to-end principle and the model operates directly on image pixels. For evaluation purposes, a face occlusion database consisting of over 50,000 images with annotated facial parts was used. Experimental results reveal that the proposed framework is very effective. Using the bespoke face occlusion dataset, the Aleix and Robert (AR) face dataset and the Labeled Faces in the Wild (LFW) database, we achieved 85.61%, 97.58% and 100% accuracies for head detection when the Intersection over Union (IoU) is larger than 0.5, and 94.55%, 98.58% and 95.41% accuracies for occlusion discrimination, respectively.

Keywords: Automated teller machine (ATM); convolutional neural network (CNN); face occlusion detection; multi-task learning (MTL).

*Corresponding author.

International Journal of Pattern Recognition and Artificial Intelligence
Vol. 30, No. 9 (2016) 1660010 (24 pages)
© World Scientific Publishing Company
DOI: 10.1142/S0218001416600107
http://dx.doi.org/10.1142/S0218001416600107

1. Introduction

Automated Teller Machines (ATMs) have always been targets of criminal activity since their widespread introduction in the 1970s. For example, fraudsters can obtain card details and PINs using a wide range of tactics. Among the possible techniques to defend against ATM crime, real-time automatic alarm systems appear to be the most straightforward technical solution for maximizing protection, because surveillance cameras are installed at nearly all ATMs. However, current video surveillance for ATMs requires constant staff monitoring, which has the obvious disadvantage of human error caused by fatigue or distraction.

Face occlusion detection has been studied for several years, with a number of methods published,26,32 many of which aim to reinforce ATM security. The published approaches can be roughly divided into two categories: face or head detection approaches and occlusion classification approaches.

In the first category, the objective is robust face detection in the presence of partial occlusions, with two common practices, namely facial component-based approaches and shape-based approaches. Facial component-based approaches, such as that presented in Ref. 20, detect facial components such as the eyes, nose and mouth, and determine a face area based on the component detection results. For example, the method proposed in Ref. 20 combined seven AdaBoost-based classifiers for the whole face with individual face-part classifiers trained on nonoccluded face sample sets, and used a decision tree and Linear Discriminant Analysis (LDA) to classify nonoccluded faces and various types of occluded faces. Inspired by discriminatively trained part-based models, El-Barkouky et al.7 proposed a Selective Part Model (SPM) to detect faces under certain types of partial occlusion. Gul and Farooq15 applied the Viola-Jones approach,39 with free rectangular features, to detect left half faces, right half faces and holistic faces. AdaBoost-based face detection was also improved upon in Ref. 5 in order to detect partially occluded faces; this, however, only worked for frontal faces with sufficient resolution.

Shape-based approaches3,4,17,21,28 detect faces based on prior knowledge of head, neck and shoulder shapes. In the scenario of ATM video surveillance, Ref. 28 proposed to compute the lower boundary of the head by moving-object edge extraction and head tracking. Motion information was also exploited in Ref. 21 to detect the head and shoulder shape with the aid of B-spline active contouring. Color has also been applied as a major cue to detect the head or face, following appropriate template fitting strategies.3,4,17 These approaches, however, are limited to constrained poses and well-controlled illumination conditions. In Refs. 3, 4 and 17, the head or shoulders were detected using ellipse or "omega" templates, which fail when a face is severely occluded and/or the shapes change severely with different face poses.

Some previously published research has tried to solve the face occlusion detection problem by straightforward classification.22 For example, by separating a face area into upper and lower parts, Principal Component Analysis (PCA) and Support Vector Machine (SVM) were combined to distinguish between normal faces and partially occluded faces in Ref. 22.

Until recently, the most successful approaches to object detection utilized the well-known sliding window paradigm,10 in which a computationally efficient classifier tests for object presence in every candidate image window. The steady increase in complexity of the core classifiers has led to improved detection quality, but at the cost of significantly increased computation time per window.6,12,16,36,41 One approach to overcoming the tension between computational tractability and high detection quality is the use of "detection proposals".8,38 If high object recall can be reached with considerably fewer windows than used by sliding window detectors, significant performance improvements can be achieved. Current top-performing object detectors, when applied to the PASCAL benchmark image datasets9 and ImageNet,31 all use detection proposals.6,11,12,16,36,41 According to Ref. 18, approaches for generating object proposals can be divided into four types: grouping methods, window scoring methods, alternative methods and baseline methods. Grouping methods attempt to generate multiple (possibly overlapping) segments that are likely to correspond to objects, while window scoring methods score each candidate window according to how likely it is to contain an object. Inspired by the success of object proposal approaches in different object detection settings, this paper proposes a face occlusion detection system using the highly ranked object proposal technique EdgeBoxes.48

Over the last several years, there has been increasing interest in deep neural network models for solving various vision problems. One of the most successful deep learning frameworks is the Convolutional Neural Network (CNN) architecture,24 a bio-inspired hierarchical multilayered neural network that can learn visual representations directly from raw images. CNNs possess some key properties, namely translation invariance and spatially local connections (receptive fields). Pre-trained CNN models can be exploited as generic feature extractors for different vision tasks.24 Among the various advantages of deep neural networks over classical machine learning techniques, the most frequently cited examples include the convenience of implementing knowledge transfer, Multi-Task Learning (MTL), attribute learning, multi-label classification and weakly supervised learning.

In this paper, a novel CNN-based approach to face occlusion detection is proposed. A CNN cascade paradigm is adopted, which tackles the occlusion detection problem in a coarse-to-fine manner. The first CNN implements head/shoulder detection by taking a person's upper-body image as input. The second CNN takes the output of the previous CNN as input, and locates and classifies the different facial parts. To facilitate the study of various face occlusion problems, a database covering different kinds of facial occlusion was created. The database consists of over 50,000 images in which four facial parts are demarcated: the two eyes, the nose and the mouth. To the best of our knowledge, this is the first work directed at analyzing how face occlusion can be detected using a multi-task CNN. The approach was verified on our face occlusion dataset, the Aleix and Robert (AR) dataset30 and the Labeled Faces in the Wild (LFW) dataset,19 obtaining 94.55%, 98.58% and 95.41% accuracies, respectively.

The rest of the paper is organized as follows. Section 2 describes the problem with an introduction to our face occlusion dataset. Section 3 gives an overview of the proposed method and elaborates the details of the coarse-to-fine framework. Section 4 reports the experimental results, followed by conclusions in Sec. 5.

2. Face Occlusion Dataset

A normal face image contains two eyes, a nose and a mouth. The geometrical and textural information from these components is critical to the recognition or identification of a person. If any of these components is blocked or covered, the face image is considered to be occluded. Commonly found face occlusions include faces partially covered by a hat, sunglasses, a mask or a muffler. Some examples are given in Fig. 1.

Fig. 1. Face occlusion examples in the scenario of withdrawing cash from an ATM.

To facilitate research on face occlusion detection, a database of face images with different facial regions intentionally covered or occluded was created. The images were taken using the Microsoft webcam studio camera and AMCap 9.20. Some example images are illustrated in Figs. 2 and 3. The video was filmed against a white background. There were 220 people involved, including 140 males and 80 females.

During the photography, a subject was asked to stand in front of the camera in a set of different specified poses: looking right ahead, up and down by roughly 45°, and right and left by roughly 45°. In addition to face images without any occlusion, a subject was also asked to wear sunglasses, a hat (in yellow and black), a white mask and a black helmet, again in five different poses. Each subject had six video clips recorded, each 30 s long at 25 frames per second. The illumination was a normal office lighting condition. Other conditions, for example clothing, make-up, hair style and expression, were not strictly controlled.

Fig. 2. Upper-body images from our created dataset.

Fig. 3. Head images from our created dataset.

Although the created dataset is comprehensive, it cannot currently be made public due to legal considerations. To alleviate this problem, we also utilized the AR face database,30 one of the earliest and most popular benchmark face databases. AR faces have often been used for robust face recognition. The database includes a number of different types of occlusion: faces with sunglasses and faces partially covered by a scarf. The AR face dataset consists of over 3200 frontal face images taken from 126 subjects, with some examples given in Fig. 4.

The LFW dataset19 was also employed to further evaluate our approach. The LFW dataset contains over 13,000 face images of 1680 people collected from the Internet. The faces were detected by the OpenCV implementation of the Viola-Jones39 face detector. The cropped region returned by the detector was then automatically enlarged by a factor of 2.2 in each dimension to capture more of the head, and scaled to a uniform size of 250 × 250. For the evaluation presented in this paper, 1000 images were selected and the heads manually cropped. Occlusions were created using black rectangles and facial landmark localization.46 Some examples are given in Fig. 5.
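To make the preprocessing concrete, the following is a minimal sketch (in Python with scikit-image; the original pipeline used OpenCV) of the crop enlargement and rescaling just described. The border clamping and the corner-coordinate box format are assumptions, as the paper does not specify them.

```python
# Sketch of the LFW preprocessing described above: enlarge the Viola-Jones
# face box by a factor of 2.2 in each dimension about its center, then
# rescale the crop to 250 x 250. Border clamping is an assumption.
import numpy as np
from skimage.transform import resize

def enlarge_and_scale(img, box, factor=2.2, size=250):
    x1, y1, x2, y2 = box                       # detector box (pixel coords)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # box center
    hw, hh = (x2 - x1) * factor / 2.0, (y2 - y1) * factor / 2.0
    x1, x2 = int(max(cx - hw, 0)), int(min(cx + hw, img.shape[1]))
    y1, y2 = int(max(cy - hh, 0)), int(min(cy + hh, img.shape[0]))
    return resize(img[y1:y2, x1:x2], (size, size))
```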

3. System Overview

Though deep neural networks have achieved remarkable performance, it is still difficult to solve many real-world problems with a single CNN model. With current technology, the resolution of the input to a CNN must be relatively small, which causes some details of an image to be lost. To obtain better performance, the common practice of a coarse-to-fine paradigm has been applied in CNN design.29,35 Following the same line of thought, we propose a two-stage CNN for face occlusion detection, as illustrated in Fig. 6. The first CNN detects the head in a person's upper-body image, while the second CNN determines which facial part of the head image is occluded.

Fig. 4. Examples of faces from the AR face database: (a) head and shoulders and (b) head.

Fig. 5. Examples of images from the LFW database with occlusions superimposed.

3.1. Head detection

To identify the location of the head in an image, advantage was taken of Regions with Convolutional Neural Networks (R-CNN),13 the state-of-the-art object detector, which classifies candidate object hypotheses generated by appropriate region proposal algorithms. R-CNN leverages several advances in computer vision and CNNs, including the superb feature expression capability of a pre-trained CNN, the flexibility of fine-tuning for the specific objects to be detected, and the ever-increasing efficiency of object proposal generation schemes.

Among the off-the-shelf object proposal generation algorithms, the EdgeBoxes technique,48 which has attracted much interest in recent years, was chosen. EdgeBoxes is built on a structured edge map to locate object boundaries and find object proposals. The number of edges wholly enclosed by a bounding box is used to rank the likelihood of the box containing an object.

Fig. 6. The flow diagram of the whole system.

The overall convolutional net architecture is shown in Fig. 7. The network consists of three convolution stages followed by three fully connected layers. A convolution stage includes a convolution layer, a nonlinear activation layer, a local response normalization layer and a max-pooling layer. The nonlinear activation and local response normalization layers are not shown in Figs. 7 and 8 as they do not change the data size. Using shorthand notation, the full architecture is C(8,5,1)-A-N-P-C(16,5,1)-A-N-P-C(32,5,1)-A-N-P-FC(1024)-A-FC(128)-A-FC(4)-A, where C(d,f,s) indicates a convolutional layer with d filters of spatial size f × f applied to the input with stride s, A is the nonlinear activation function (the ReLU activation function14), and FC(n) is a fully connected layer with n output nodes. All pooling layers P use max-pooling over nonoverlapping 2 × 2 regions, and all normalization layers N are defined as described in Ref. 24. The final layer is densely connected to a soft-max layer. The structure of the networks and the hyperparameters were empirically initialized based on previous works using CNNs.
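To make the shorthand concrete, a minimal PyTorch sketch of the head detector network follows; the authors' implementation was MATLAB-based, so this is an illustration rather than their code. The 100 × 100 grayscale input follows Sec. 4.2, while the absence of padding and the LRN hyperparameters are assumptions.

```python
# Sketch of C(8,5,1)-A-N-P-C(16,5,1)-A-N-P-C(32,5,1)-A-N-P-
# FC(1024)-A-FC(128)-A-FC(4) with a soft-max applied to the final layer.
import torch
import torch.nn as nn

def conv_stage(c_in, c_out):
    # One convolution stage: convolution, ReLU activation, local response
    # normalization and non-overlapping 2 x 2 max pooling.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=5, stride=1),
        nn.ReLU(inplace=True),
        nn.LocalResponseNorm(size=5),
        nn.MaxPool2d(kernel_size=2, stride=2))

class HeadDetectorCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_stage(1, 8), conv_stage(8, 16), conv_stage(16, 32))
        # Spatial sizes: 100 -> 96 -> 48 -> 44 -> 22 -> 18 -> 9,
        # so the flattened feature vector has 32 * 9 * 9 = 2592 entries.
        self.classifier = nn.Sequential(
            nn.Linear(32 * 9 * 9, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 4))

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

probs = torch.softmax(HeadDetectorCNN()(torch.randn(2, 1, 100, 100)), dim=1)
```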

The well-known overfitting problem has been taken into account in our design with the following considerations. Firstly, we empirically compared a set of different CNN architectures with varying numbers of kernels and selected the one deemed most effective with regard to the trade-off between network complexity and performance. Secondly, overfitting is avoided to a large extent because the CNN size is very moderate, in sharp contrast to some published CNN models for large-scale image classification.24

The adopted CNN uses the shared-weight neural network architecture,25 in which the local receptive field (kernel or filter) is replicated across the entire visual field to form a feature map; this is known as the convolution operation. The sharing of weights reduces the number of free variables and increases the generalization performance of the network. Weights (kernels or filters) are initialized at random and learn to become edge, color or specific pattern detectors.

In deep CNNs, the classical sigmoidal function has been replaced by the Rectified Linear Unit (ReLU) to accelerate training. Recent CNN-based approaches23,24,34,37,42,44 applied the ReLU as the nonlinear activation function for both the convolution layers and the fully connected layers, often with faster training as reported in Ref. 24.

Fig. 7. Architecture of the head detector CNN.

Fig. 8. Architecture of the face occlusion classifier CNN.

Typical pooling functions include average pooling and max pooling. Average pooling takes the arithmetic mean of the elements in each pooling region, while max pooling selects the largest element of the input.

The four layers of convolution, nonlinear activation, pooling and normalization are combined hierarchically to form a convolution stage (block). Generally, an input image is passed through several convolution stages to extract complex descriptive features. At the output of the topmost convolution stage, all of the small feature maps are concatenated into a long vector. Such a vector plays the same role as hand-coded features and is fed to a fully connected layer. A standard fully connected classifier can be either the conventional Multi-Layer Perceptron (MLP) or an SVM.

As the fully connected layers receive the feature vector from the topmost convolution stage, the output layer can generate a probability distribution over the output classes. Toward this purpose, the output of the last fully connected layer is fed to a K-way softmax layer (where K is the number of classes), which is equivalent to multi-class logistic regression.
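As a small worked illustration of this last step, the sketch below shows the K-way softmax as a plain function; the subtraction of the maximum is a standard numerical-stability detail, not something from the paper.

```python
# The K-way softmax: exponentiate the K scores from the last fully
# connected layer and normalize them into a probability distribution.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 0.5, -1.0])))  # three class probabilities, sum 1
```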

3.2. Face occlusion classification

The second-stage CNN takes the output of the head detector as input and performs face occlusion classification. The CNN trains the classifier with implicit, highly discriminative features to differentiate the facial parts and, at the same time, distinguish whether a facial part is occluded. This is aided by a multi-task learning paradigm described in more detail below.

The intuition behind MTL is to jointly learn multiple tasks by exploiting a shared structural representation, improving generalization through the related tasks. Deep neural networks such as CNNs have proven advantageous in MTL due to their powerful representation learning capability and the transferability of knowledge across similar tasks.33,43,45

Inspired by the successes of CNN MTL, we configured the second-stage CNN so that the shared representation learned in the feature layers feeds two independent MLP classifiers in the final layer. Specifically, we jointly trained the facial part (left eye, right eye, nose and mouth) classification and the occlusion/nonocclusion decision simultaneously.

In our experiments, we adopt a multi-task CNN as shown in Fig. 8. The architecture is the same as that of the head detector CNN. When MTL is performed, we minimize a linear combination of the individual task losses43:

    L_joint = Σ_{i=1}^{N} λ_i L_i,    (1)

where N is the total number of tasks, λ_i is the weighting factor for the ith task and L_i is the ith task loss. When one of the λ_i takes the value 1 (and the rest 0), the objective degenerates to classical single-task learning.
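A minimal PyTorch sketch of this joint objective is given below; the per-task weights are not specified in the paper and are left to the caller here for illustration.

```python
# Eq. (1): L_joint = sum_i lambda_i * L_i, with one cross-entropy loss
# per task head (facial-part classification and occlusion decision).
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def joint_loss(task_logits, task_targets, lambdas):
    # task_logits/task_targets: one tensor per task; lambdas: task weights.
    return sum(lam * criterion(logits, target)
               for lam, logits, target in zip(lambdas, task_logits, task_targets))
```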


3.3. Bounding box regression

A bounding-box regression module is employed to improve the detection accuracy. Many bounding boxes generated by the EdgeBoxes algorithm are not close to the object ground truth, yet may be judged as positive samples. Detection can be regarded as a regression problem of finding the location of an object, a formulation that works well for localizing a single object. Based on an error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding-box regression employed in the Deformable Parts Model (DPM),10 we trained a linear regression model to predict a new detection window from the last pooled features of the region proposals produced by EdgeBoxes. This simple approach fixes a large number of mislocalized detections, substantially boosting the accuracy.
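The paper only states that a linear regressor maps the last pooled features to a refined window; the sketch below assumes the standard R-CNN-style target parameterization and a closed-form ridge regression, which is one common way to realize such a regressor.

```python
# Sketch of a linear bounding-box regressor. Boxes are (x_center, y_center,
# w, h); the target offsets follow the usual R-CNN parameterization.
import numpy as np

def bbox_targets(proposals, ground_truth):
    px, py, pw, ph = proposals.T
    gx, gy, gw, gh = ground_truth.T
    return np.stack([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)], axis=1)

def fit_regressor(features, targets, lam=1e3):
    # Closed-form ridge regression: W = (X^T X + lam*I)^(-1) X^T T.
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + lam * np.eye(d),
                           features.T @ targets)

def refine(proposals, features, W):
    tx, ty, tw, th = (features @ W).T
    px, py, pw, ph = proposals.T
    return np.stack([px + pw * tx, py + ph * ty,
                     pw * np.exp(tw), ph * np.exp(th)], axis=1)
```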

3.4. Pre-training

With supervised learning on sufficient annotated training data, CNNs containing millions of parameters have demonstrated competitive performance on visual recognition tasks23,24,34,37,42,44 when starting from a random initialization. However, the CNN architecture is strongly dependent on large amounts of training data for good generalization. When the amount of labeled data is limited, directly training a high-capacity CNN may become problematic. Research40 has shown that an alternative is to compensate for the problem by choosing an optimized starting point, pre-trained by transferring parameters from either supervised or unsupervised learning, as opposed to a random initialization. We first trained the CNN model in supervised mode using ImageNet data and then fine-tuned it on the domain-specific labeled images as the head detector. To be specific, the pre-trained model is designed to recognize objects in natural images; the knowledge leveraged from the source task can reflect common characteristics shared by these two types of images, such as corners and edges.
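As an illustration of this transfer scheme, the sketch below reuses the HeadDetectorCNN class from the Sec. 3.1 sketch: pre-trained weights are loaded from a checkpoint (the file name is a placeholder) and the final layer is replaced before fine-tuning on the domain-specific data.

```python
# Sketch of pre-training followed by fine-tuning: load weights learned on
# the ImageNet subset, swap in a new output layer and continue training.
import torch
import torch.nn as nn

model = HeadDetectorCNN()  # architecture sketched in Sec. 3.1
model.load_state_dict(torch.load("pretrained_imagenet_subset.pth"))  # placeholder
model.classifier[-1] = nn.Linear(128, 2)  # new head/non-head output layer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.01)
```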

4. Experimental Results

The proposed approach was evaluated on our face occlusion dataset, the AR face dataset and the LFW dataset, with pre-training and fine-tuning. For the head detector, the pre-training stage employed 10% of the images from ILSVRC201231 for the classification task. For the occlusion classifier, pre-training was implemented via a face recognition task.

4.1. Implementation details

The experiments were conducted on a Dell Tower 5810 computer with an Intel Xeon E5-1650 v3 CPU and 64 GB of memory. To speed up CNN training, an NVIDIA GeForce GTX TITAN GPU was installed. The programs ran under 64-bit Windows 7 Ultimate with Matlab 2013b, Microsoft Visual Studio 2012 and CUDA 7.0.47

We trained our models using stochastic gradient descent with a batch size of 128 examples and a momentum of 0.01. The learning rate was initialized to 0.01 and adapted during training. More specifically, we monitored the overall loss function: if the loss was not reduced for 5 consecutive epochs, the learning rate was dropped by 50%. We deem the network converged when the loss has stabilized.
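This schedule maps directly onto PyTorch's ReduceLROnPlateau, as the sketch below shows; the paper's own implementation was MATLAB-based, and train_one_epoch, loader and num_epochs are placeholders.

```python
# Halve the learning rate whenever the monitored loss has not improved
# for five consecutive epochs, as described above.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)

for epoch in range(num_epochs):
    epoch_loss = train_one_epoch(model, loader, optimizer)  # assumed helper
    scheduler.step(epoch_loss)
```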

In the training procedure, 10% of the images from ILSVRC2012 are first used to train Net1 from a random initialization. Then, Net1 is fine-tuned on the head detection data from the AR face dataset, our dataset and the LFW dataset. Thirdly, Net2 is initialized from the fine-tuned Net1. Finally, Net2 is fine-tuned on the face occlusion classification data from the AR face dataset, our dataset and the LFW dataset. The test procedure is described in Fig. 6.

4.2. Head detector

As the state-of-the-art object proposal method with regard to the trade-off between speed and recall, the EdgeBoxes method needs its parameters tuned at each desired overlap threshold.48 We estimated three pairs of α and β parameters to evaluate the object proposals corresponding to the head hypotheses, following the standard object proposal evaluation framework. The detection recall is calculated as the ratio of ground-truth bounding boxes covered by EdgeBoxes proposals with an Intersection over Union (IoU) larger than a given threshold.
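For concreteness, a minimal sketch of this recall computation follows; boxes are assumed to be (x1, y1, x2, y2) corner coordinates.

```python
# A ground-truth head box counts as recalled if at least one proposal
# overlaps it with IoU above the threshold.
import numpy as np

def iou(box, boxes):
    # Intersection over Union between one box and an array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def detection_recall(gt_boxes, proposals_per_image, thresh=0.5):
    hits = [np.any(iou(gt, props) > thresh)
            for gt, props in zip(gt_boxes, proposals_per_image)]
    return float(np.mean(hits))
```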

Three useful pairs of α and β values in EdgeBoxes48 were evaluated to provide the head candidates from our dataset, the AR face dataset and the LFW dataset, as shown in Fig. 9. The parameters α and β control the step size of the sliding window search and the NMS threshold, respectively. The plots in Fig. 9 illustrate the algorithm's behavior with varying α and β when the same number of object proposals is generated. More specifically, the three α/β pairs are 0.65/0.55, 0.65/0.75 and 0.85/0.95. They were tested with the maximum number of region proposals fixed at 500 for our database and the AR database, and at 700 for the LFW database. The largest values clearly obtain the best recall: we achieved 97.53% on our face occlusion database, 99.35% on the AR face database and 100% on the LFW database when the IoU is larger than 0.5. In conclusion, as the density of the sampling rate increases, the recall becomes higher but the run time becomes longer.

The second experiment compared different numbers of bounding boxes produced by EdgeBoxes according to their ranking scores. A suite of maximum box counts, 100, 300, 500, 700 and 900, was evaluated with the density of the sampling rate and the NMS threshold fixed at 0.85 and 0.95, respectively, as shown in Fig. 10. To ensure that the face occlusion classification is valid, the IoU between a region and the ground truth must be larger than 0.5. The recall is approximately 100% on our face occlusion database, the AR face database and the LFW database (Fig. 10). The recall does not increase once the maximum number of region proposals exceeds 500 on our dataset and the AR dataset, or 700 on the LFW dataset. In consideration of the tolerance of Net1, α, β and the maximum number of region proposals were set to 0.85, 0.95 and 500 on our dataset and the AR dataset, respectively. Taking account of the complex backgrounds in the LFW dataset, its maximum number of boxes was set to 700, with the other parameters the same as for our dataset and the AR dataset. The same parameters were thus selected for our face occlusion database and the AR face database; however, the recall obtained on our dataset is lower than that on the AR dataset due to the greater variability of our database. The recall curves (Figs. 9 and 10) demonstrate that almost all of the heads are included among the proposals produced by EdgeBoxes.

Fig. 9. Recall of the three useful variants of EdgeBoxes when the maximum number of region proposals is the same.

After obtaining the candidate regions from EdgeBoxes, the regions are classified into head and non-head by the trained CNN module. Before applying the CNN, a hypothesized object proposal is discarded if it can be judged directly as a head or a useless patch by simple rules: on the AR face database, a region is taken as a head if its overlap with the whole image is less than 5%; on our face occlusion database, a region is taken as a head or a useless patch if its overlap with the full image is between 2% and 30%. The features of the remaining normalized regions (100 × 100 gray images) are then extracted via Net1, and the subsequent MLP predicts whether each region is a head region. After the MLP, the positive regions, sorted by the MLP confidence (which indicates the reliability of the detection), are merged into one box by non-maximum suppression with a 0.3 overlap threshold. The performance of the approach was tested on our face occlusion database, the AR face database and the LFW database, yielding accuracies of 56.83%, 73.79% and 88.52%, respectively, at 0.5 IoU (Fig. 10).
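A sketch of this non-maximum suppression step is given below, reusing the iou() helper from the recall sketch above; the 0.3 threshold follows the text.

```python
# Greedy NMS: keep the highest-scoring box, drop boxes overlapping it by
# more than the threshold, and repeat on the remainder.
import numpy as np

def nms(boxes, scores, thresh=0.3):
    order = np.argsort(scores)[::-1]  # indices sorted by MLP confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        order = order[1:][iou(boxes[i], boxes[order[1:]]) <= thresh]
    return boxes[keep]
```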

Fig. 10. Recall for different maximum numbers of region proposals when α is 0.85 and β is 0.95.

The performance can be further improved by a simple bounding-box regression stage. After scoring each object proposal with the MLP, we predict a new bounding box for the detection using a class-specific bounding-box regressor, similar in spirit to the approach used in deformable part models.13 The better head location is regressed from features computed by the CNN (Fig. 11). After regression of the positive detections, the performance improves by 28.78%, 23.79% and 11.48% on our face occlusion database, the AR face database and the LFW database, respectively, at IoU 0.5. The regressor improves the performance because the head is centered in the images; however, it only helps when the classification result from the CNN is correct. Examples of head detection on our face occlusion database, the AR face database and the LFW database are shown in Figs. 12–14, respectively.

As a baseline approach, we created a head detector based on HOG feature extraction1 and SVM classification.2 The parameters for the HOG were set as orientations in the range [0, 360], 40 orientation bins and 3 spatial levels, which means that the inputs to the SVM have a dimension of 840, i.e. (1 + 4 + 16) × 40. For the SVM, we employed C-SVC with a radial basis kernel function.2 The comparative performance is illustrated in Table 1, which demonstrates that our approach is robust and effective in complex scenes. The computational complexity is shown in Table 2. The major part of the computation of our approach comes from the region proposal algorithm, EdgeBoxes.
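A rough sketch of this baseline is shown below; scikit-image's hog() and scikit-learn's SVC are stand-ins for the spatial-pyramid HOG of Ref. 1 and the C-SVC of Ref. 2, and the unsigned gradients of skimage's hog() only approximate the paper's [0, 360] orientation range.

```python
# Three pyramid levels with 1x1, 2x2 and 4x4 cell grids and 40 orientation
# bins give (1 + 4 + 16) * 40 = 840 features per window, as in the text.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def pyramid_hog(img):
    feats = [hog(resize(img, (64, 64)), orientations=40,
                 pixels_per_cell=(64 // n, 64 // n), cells_per_block=(1, 1))
             for n in (1, 2, 4)]
    return np.concatenate(feats)  # 840-dimensional descriptor

clf = SVC(C=1.0, kernel="rbf")  # C-SVC with a radial basis kernel
```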

Fig. 11. Recall of head detection; the red line shows the performance of the head detector without regression and the blue line the performance with regression (color online).

Fig. 12. Examples of head detection results on our face occlusion database; the red rectangle is the ground truth and the green rectangle is the detected location (color online).

Fig. 13. Examples of head detection results on the AR face database; the red rectangle is the ground truth and the green rectangle is the detected location (color online).

Table 1. Accuracy of head detection when IoU is larger than 0.5.

             LFW (%)   AR (%)   Our Dataset (%)
HOG + SVM     91.87     97.44    72.41
Our method   100        97.58    85.61

4.3. Face occlusion classification

After obtaining the head position from the head detector, the face occlusion classifier is used to classify the type of face occlusion.

The AR face dataset30 is a popular dataset for face recognition. It contains over 4000 color images of the faces of 126 people (70 males and 56 females). The images cover frontal-view faces with occlusions (sunglasses and scarf), different facial expressions and illumination conditions (Fig. 4). For the face occlusion classifier, the dataset is categorized into two occlusion conditions: faces with the eyes occluded by sunglasses and faces with the mouth occluded by a scarf. Half of the unoccluded faces were used to pre-train the CNN model, following a general face recognition methodology. More specifically, the pre-training dataset consisted of 31 male and 25 female faces (frontal-view faces with different expressions and illumination).

After pre-training, the fully connected layers were replaced by a new MLP initialized with random connection weights, and fine-tuning was performed on the face occlusion dataset. The CNN with multi-task learning was designed to predict the presence or absence of the different facial parts, thus indicating occlusion or not. The structure of the CNN is illustrated in Fig. 8.

The loss function values during training are illustrated in Fig. 15. After the loss stabilizes, the converged model is applied to the test samples, with the accuracies shown in Table 3.

The accuracy values during training and validation are given in Fig. 16. Cross-validation and early stopping are employed to avoid overfitting. Figure 16 illustrates that the accuracies during training and validation rise continually; early stopping is used to select the appropriately trained model during training and to avoid overfitting.

Fig. 14. Examples of head detection results on the LFW database; the red rectangle is the ground truth and the green rectangle is the detected location (color online).

Table 2. Computational complexity of head detection (fps).

             LFW    AR     Our Created Dataset
HOG + SVM    32.48  24.06  419.68
Our method    0.94   2.02    1.76


The system performance for face occlusion classification was also evaluated using our face occlusion dataset and the LFW dataset. The experimental procedures are similar to those for the AR face dataset. The difference is that there is a variety of occlusions of different facial parts, namely the left eye, right eye, nose and mouth. Accordingly, four MLPs follow the shared layers, each verifying whether one facial part is occluded, as explained in Fig. 8.

The training loss is shown in Fig. 15, and the occlusion accuracies on our dataset and LFW are 94.55% and 95.41%, respectively, with further details provided in Table 4. Our model is a multi-task framework that shares the front layers, and the summed loss changes as the CNN parameters are updated. It can be seen that the loss fluctuates more markedly on our face dataset because it is much more complex than the AR dataset and the LFW dataset.

Fig. 15. Comparison of loss function values during training.

Table 3. Face occlusion classification on the AR dataset.

          Eyes    Mouth   Total
Accuracy  98.58%  100%    98.58%

A face detector combining the Haar feature27 extractor and the Viola-Jones39 classifier is used as the baseline for face occlusion classification. The experimental results are summarized in Table 5, which indicates that our method outperforms Haar-based face detection. The corresponding computational complexity is reported in Table 6, showing that our method is faster than the classical approach.

Table 4. Face occlusion classification accuracy on our face occlusion dataset and LFW.

             Left Eye (%)  Right Eye (%)  Nose (%)  Mouth (%)  Total (%)
Our dataset  98.15         99.07          98.15     99.07      94.55
LFW          97.79         98.91          99.63     99.02      95.41

Fig. 16. Comparison of accuracy values during training and validation.

Table 5. Accuracy of face occlusion classification.

             LFW (%)  AR (%)  Our Dataset (%)
Haar + VJ    45.22    47.98   21.20
Our method   95.41    98.58   94.55

Table 6. Computational complexity of face occlusion classification (ms).

             LFW    AR     Our Created Dataset
Haar + VJ    16.29  58.23  22.35
Our method   13.10  20.67  17.31

Fig. 17. Examples of wrong region proposals.

4.4. Error analysis

Two factors may impact the effectiveness of the proposed approach. Firstly, the region proposal algorithm should be robust with regard to the generated candidate bounding boxes. However, as EdgeBoxes produces candidate regions from edges, it has several potential problems: (a) negative data are generated when a person wears clothing with complex texture (Fig. 17(a)); (b) the head appears larger with certain hairstyles (Fig. 17(b)); and (c) the head is not segmented when a person has long hair and wears a black dust coat (Fig. 17(c)). Secondly, three factors may influence the performance of the face occlusion classifier: illumination variations, occlusions and partial occlusions. Figure 18 further illustrates these difficult situations.

Fig. 18. Examples of erroneously predicted images.

5. Conclusion

This paper proposed an approach for face occlusion detection to enhance surveillance security for ATMs. The coarse-to-fine approach consists of a head detector and a face occlusion classifier. The head detector is implemented with the EdgeBoxes region proposal algorithm, a CNN and an MLP, and achieved detection accuracies of 97.58%, 85.61% and 100% on the AR face database, our face occlusion database and the LFW dataset, respectively. For face occlusion classification, a CNN is applied with a pre-training strategy based on the usual face recognition task, followed by fine-tuning on face occlusion classification based on MTL to verify whether each facial part is occluded. Our approach was evaluated on the AR face dataset, our dataset and the LFW dataset, achieving 98.58%, 94.55% and 95.41% accuracies, respectively. Further work is directed toward improving the model with a more robust and accurate region proposal method, which will render it more practical for real-world applications.

Acknowledgment

The first author thanks Chao Yan and Rongqiang Qian for their valuable help. The constructive comments and suggestions from the anonymous reviewers are much appreciated.

References

1. A. Bosch, A. Zisserman and X. Munoz, Representing shape with a spatial pyramid kernel, 6th ACM Int. Conf. Image and Video Retrieval, New York (2007), pp. 401–408.
2. C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (2011) 27:1–27:27.
3. T. Charoenpong, Face occlusion detection by using ellipse fitting and skin color ratio, Burapha University Int. Conf. (BUU) (2013), pp. 1145–1151.
4. T. Charoenpong, C. Nuthong and U. Watchareeruetai, A new method for occluded face detection from single viewpoint of head, 11th Int. Conf. Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON) (2014), pp. 1–5.
5. J. Chen, S. Shan, S. Yang, X. Chen and W. Gao, Modification of the AdaBoost-based detector for partially occluded faces, 18th Int. Conf. Pattern Recognition 2 (2006) 516–519.
6. R. Cinbis, J. Verbeek and C. Schmid, Segmentation driven object detection with Fisher vectors, IEEE Int. Conf. Computer Vision (ICCV) (2013), pp. 2968–2975.
7. A. El-Barkouky, A. Shalaby, A. Mahmoud and A. Farag, Selective part models for detecting partially occluded faces in the wild, IEEE Int. Conf. Image Processing (ICIP) (2014), pp. 268–272.
8. I. Endres and D. Hoiem, Category-independent object proposals with diverse ranking, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2014) 222–234.
9. M. Everingham, S. Eslami, L. Van Gool, C. Williams, J. Winn and A. Zisserman, The PASCAL visual object classes challenge: A retrospective, Int. J. Comput. Vis. 111 (2015) 98–136.
10. P. Felzenszwalb, R. Girshick, D. McAllester and D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 1627–1645.
11. R. Girshick, Fast R-CNN, IEEE Int. Conf. Computer Vision (ICCV) (2015), pp. 1440–1448.
12. R. Girshick, J. Donahue, T. Darrell and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2014), pp. 580–587.
13. R. Girshick, J. Donahue, T. Darrell and J. Malik, Region-based convolutional networks for accurate object detection and segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2016) 142–158.
14. X. Glorot, A. Bordes and Y. Bengio, Deep sparse rectifier neural networks, J. Mach. Learn. Res. 15 (2011) 315–323.
15. S. Gul and H. Farooq, A machine learning approach to detect occluded faces in unconstrained crowd scene, IEEE 14th Int. Conf. Cognitive Informatics & Cognitive Computing (ICCI*CC) (2015), pp. 149–155.
16. K. He, X. Zhang, S. Ren and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2015) 1904–1916.
17. S. Hongxing, W. Jiagi, S. Peng and Z. Xiaoyang, Facial area forecast and occluded face detection based on the YCbCr elliptical model, Int. Conf. Mechatronic Sciences, Electric Engineering and Computer (MEC) (2013), pp. 1199–1202.
18. J. Hosang, R. Benenson, P. Dollár and B. Schiele, What makes for effective detection proposals?, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2016) 814–830.
19. G. B. Huang, M. Ramesh, T. Berg and E. Learned-Miller, Labeled faces in the wild: A database for studying face recognition in unconstrained environments, Technical Report, University of Massachusetts, Amherst (2007), pp. 7–49.
20. K. Ichikawa, T. Mita, O. Hori and T. Kobayashi, Component-based face detection method for various types of occluded faces, 3rd Int. Symp. Communications, Control and Signal Processing (2008), pp. 538–543.
21. G. Kim, J. K. Suhr, H. G. Jung and J. Kim, Face occlusion detection by using B-spline active contour and skin color information, 11th Int. Conf. Control Automation Robotics Vision (ICARCV) (2010), pp. 627–632.
22. J. Kim, Y. Sung, S. Yoon and B. Park, A new video surveillance system employing occluded face detection, in Innovations in Applied Artificial Intelligence (Springer Berlin Heidelberg, 2005), pp. 65–68.
23. J. Krause, T. Gebru, J. Deng, L.-J. Li and L. Fei-Fei, Learning features and parts for fine-grained recognition, 22nd Int. Conf. Pattern Recognition (ICPR) (2014), pp. 26–33.
24. A. Krizhevsky, I. Sutskever and G. E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2012), pp. 1097–1105.
25. Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature 521 (2015) 436–444.
26. S. Liao, A. K. Jain and S. Z. Li, Partial face recognition: Alignment-free approach, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013) 1193–1205.
27. R. Lienhart and J. Maydt, An extended set of Haar-like features for rapid object detection, Int. Conf. Image Processing 1 (2002) 900–903.
28. D.-T. Lin and M.-J. Liu, Face occlusion detection for automated teller machine surveillance, in Advances in Image and Video Technology, Lecture Notes in Computer Science, Vol. 4319 (Springer Berlin Heidelberg, 2006), pp. 641–651.
29. Z. Liu, P. Luo, X. Wang and X. Tang, Deep learning face attributes in the wild, IEEE Int. Conf. Computer Vision (ICCV) (2015), pp. 3730–3738.
30. A. Martinez and R. Benavente, The AR face database, CVC Technical Report (1998).
31. O. Russakovsky, J. Deng, H. Su et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (2015) 211–252.
32. J. Shermina and V. Vasudevan, Recognition of the face images with occlusion and expression, Int. J. Pattern Recogn. Artif. Intell. 26 (2012) 1256006-1–1256006-26.
33. L. Sijin, L. Zhi-Qiang and A. Chan, Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network, IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW) (2014), pp. 488–495.
34. K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, in Advances in Neural Information Processing Systems (2014), pp. 568–576.
35. Y. Sun, X. Wang and X. Tang, Deep convolutional network cascade for facial point detection, IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Oregon (2013), pp. 3476–3483.
36. C. Szegedy, S. Reed, D. Erhan and D. Anguelov, Scalable, high-quality object detection, arXiv preprint (2014).
37. Y. Taigman, M. Yang, M. Ranzato and L. Wolf, DeepFace: Closing the gap to human-level performance in face verification, IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2014), pp. 1701–1708.
38. K. van de Sande, J. Uijlings, T. Gevers and A. Smeulders, Segmentation as selective search for object recognition, IEEE Int. Conf. Computer Vision (ICCV) (2011), pp. 1879–1886.
39. P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple features, Computer Vision and Pattern Recognition 1 (2001) 511–518.
40. R. Wagner, M. Thom, R. Schweiger, G. Palm and A. Rothermel, Learning convolutional neural networks from few samples, Int. Joint Conf. Neural Networks (IJCNN) (2013), pp. 1–7.
41. X. Wang, M. Yang, S. Zhu and Y. Lin, Regionlets for generic object detection, IEEE Int. Conf. Computer Vision (ICCV) (2013), pp. 17–24.
42. S. Yi, W. Xiaogang and T. Xiaoou, Deep learning face representation from predicting 10,000 classes, IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2014), pp. 1891–1898.
43. C. Zhang and Z. Zhang, Improving multiview face detection with multi-task deep convolutional neural networks, IEEE Winter Conf. Applications of Computer Vision (WACV) (2014), pp. 1036–1041.
44. N. Zhang, M. Paluri, M. Ranzato, T. Darrell and L. Bourdev, PANDA: Pose aligned networks for deep attribute modeling, IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2014), pp. 1637–1644.
45. Z. Zhang, P. Luo, C. Loy and X. Tang, Facial landmark detection by deep multi-task learning, European Conf. Computer Vision (ECCV) (2014), pp. 94–108.
46. E. Zhou, H. Fan, Z. Cao, Y. Jiang and Q. Yin, Extensive facial landmark localization with coarse-to-fine convolutional network cascade, IEEE Int. Conf. Computer Vision Workshops (ICCVW) (2013), pp. 386–391.
47. X. Zhu, K. Li, A. Salah, L. Shi and K. Li, Parallel implementation of MAFFT on CUDA-enabled graphics hardware, IEEE/ACM Trans. Comput. Biol. Bioinformatics 12 (2015) 205–218.
48. C. Zitnick and P. Dollár, Edge boxes: Locating object proposals from edges, European Conf. Computer Vision (ECCV) (2014), pp. 391–405.

Yizhang Xia is a Ph.D. student of Computer Science at The University of Liverpool. He received his Bachelor's degree in Computer Science and Software Engineering from Xi'an Jiaotong-Liverpool University, Suzhou. His research interests include computer vision and deep learning with applications in intelligent transportation systems and biometrics.

Bailing Zhang received his Master's degree in Communication and Electronic System from the South China University of Technology, Guangzhou, China, and his Ph.D. in Electrical and Computer Engineering from the University of Newcastle, NSW, Australia, in 1987 and 1999, respectively. He is currently an Associate Professor in the Department of Computer Science and Software Engineering at Xi'an Jiaotong-Liverpool University, Suzhou, China. He has been a Lecturer in the School of Computer Science and Mathematics at Victoria University, Australia, since 2003. Before 2001, he was a research staff member at the Kent Ridge Digital Labs (KRDL), Singapore. His research interests include computer vision and deep learning with applications in intelligent transportation systems and bioinformatics.

Frans Coenen is a Professor of Computer Science at The University of Liverpool. He has a general background in AI, and has been working in the field of data mining and Knowledge Discovery in Data (KDD) for the last 15 years. He is particularly interested in big data; social network and trend mining; the mining of nonstandard datasets such as graph, image and document collections; and the practical application of data mining in its many forms.
