
21st Computer Vision Winter Workshop
Luka Čehovin, Rok Mandeljc, Vitomir Štruc (eds.)
Rimske Toplice, Slovenia, February 3–5, 2016

Real-Time Eye Blink Detection using Facial Landmarks

Tereza Soukupová and Jan Čech
Center for Machine Perception, Department of Cybernetics
Faculty of Electrical Engineering, Czech Technical University in Prague
{soukuter,cechj}@cmp.felk.cvut.cz

Abstract. A real-time algorithm to detect eye blinks in a video sequence from a standard camera is proposed. Recent landmark detectors, trained on in-the-wild datasets, exhibit excellent robustness against head orientation with respect to the camera, varying illumination and facial expressions. We show that the landmarks are detected precisely enough to reliably estimate the level of eye opening. The proposed algorithm therefore estimates the landmark positions and extracts a single scalar quantity – the eye aspect ratio (EAR) – characterizing the eye opening in each frame. Finally, an SVM classifier detects eye blinks as a pattern of EAR values in a short temporal window. The simple algorithm outperforms the state-of-the-art results on two standard datasets.

1. Introduction

Detecting eye blinks is important, for instance, in systems that monitor human operator vigilance, e.g. driver drowsiness [5, 13], in systems that warn a computer user who stares at the screen without blinking for a long time, to prevent dry eye and computer vision syndromes [17, 7, 8], in human-computer interfaces that ease communication for disabled people [15], or for anti-spoofing protection in face recognition systems [11].

Existing methods are either active or passive. Active methods are reliable but use special hardware that is often expensive and intrusive, e.g. infrared cameras and illuminators [2], wearable devices, or glasses with special close-up cameras observing the eyes [10]. Passive systems, in contrast, rely on a standard remote camera only.

Many methods have been proposed to automatically detect eye blinks in a video sequence. Several methods are based on motion estimation in the eye region. Typically, the face and eyes are detected by a Viola-Jones type detector.

Figure 1: Open and closed eyes with landmarks p_i automatically detected by [1]. The eye aspect ratio EAR in Eq. (1) plotted for several frames of a video sequence; a single blink is present.

Next, motion in the eye area is estimated from optical flow, by sparse tracking [7, 8], or by frame-to-frame intensity differencing and adaptive thresholding, and finally a decision is made whether the eyes are covered by the eyelids or not [9, 15]. A different approach is to infer the state of the eye opening from a single image, e.g. by correlation matching with open and closed eye templates [4], by a heuristic horizontal or vertical image intensity projection over the eye region [5, 6], by fitting a parametric model to find the eyelids [18], or by active shape models [14].

A major drawback of the previous approaches is that they usually implicitly impose too strong requirements on the setup in terms of the relative face-camera pose (head orientation), image resolution, illumination, motion dynamics, etc. Especially the heuristic methods that use raw image intensity are likely to be very sensitive, despite their real-time performance.


Nowadays, however, robust real-time facial landmark detectors that capture most of the characteristic points on a human face image, including the eye corners and eyelids, are available, see Fig. 1. Most of the state-of-the-art landmark detectors formulate a regression problem, where a mapping from an image into landmark positions [16] or into another landmark parametrization [1] is learned. These modern landmark detectors are trained on in-the-wild datasets and are thus robust to varying illumination, various facial expressions, and moderate non-frontal head rotations. The average landmark localization error of a state-of-the-art detector is usually below five percent of the inter-ocular distance, and recent methods even run significantly faster than real time [12].

Therefore, we propose a simple but efficient algorithm to detect eye blinks using a recent facial landmark detector. A single scalar quantity that reflects the level of eye opening is derived from the landmarks. Finally, having a per-frame sequence of eye-opening estimates, the eye blinks are found by an SVM classifier that is trained on examples of blinking and non-blinking patterns.

The facial segmentation model presented in [14] is similar to the proposed method. However, that system is based on active shape models with a reported processing time of about 5 seconds per frame for the segmentation, and the eye-opening signal is normalized by statistics estimated from a longer sequence, so the system is usable for offline processing only. The proposed algorithm runs in real time, since the extra cost of computing the eye opening from the landmarks and of the linear SVM is negligible.

The contributions of the paper are:

1. The ability of two state-of-the-art landmark detectors [1, 16] to reliably distinguish between the open and closed eye states is quantitatively demonstrated on a challenging in-the-wild dataset and for various face image resolutions.

2. A novel real-time eye blink detection algorithm that integrates a landmark detector and a classifier is proposed. The evaluation is done on two standard datasets [11, 8], achieving state-of-the-art results.

Figure 2: Example of detected blinks. The plots show the eye aspect ratio EAR in Eq. (1), the results of the EAR thresholding (threshold set to 0.2), the blinks detected by the EAR SVM, and the ground-truth labels over the video sequence. The input image with detected landmarks is shown; the depicted frame is marked by a red line.

The rest of the paper is structured as follows: the algorithm is detailed in Sec. 2, experimental validation and evaluation are presented in Sec. 3, and Sec. 4 concludes the paper.

2. Proposed method

An eye blink is a fast closing and reopening of a human eye. Each individual has a slightly different pattern of blinks, differing in the speed of closing and opening, the degree of squeezing the eye, and the blink duration. An eye blink lasts approximately 100–400 ms.


We propose to exploit state-of-the-art facial landmark detectors to localize the eyes and the eyelid contours. From the landmarks detected in the image, we derive the eye aspect ratio (EAR), which is used as an estimate of the eye-opening state. Since the per-frame EAR alone may not recognize eye blinks correctly, a classifier that takes a larger temporal window around a frame into account is trained.

2.1. Description of features

For every video frame, the eye landmarks are detected, and the eye aspect ratio (EAR) between the height and the width of the eye is computed as

$$\mathrm{EAR} = \frac{\|p_2 - p_6\| + \|p_3 - p_5\|}{2\,\|p_1 - p_4\|}, \qquad (1)$$

where p_1, ..., p_6 are the 2D landmark locations depicted in Fig. 1.

The EAR is mostly constant while an eye is open and approaches zero as the eye closes. It is partially insensitive to the person and to the head pose. The aspect ratio of an open eye has a small variance among individuals, and it is fully invariant to a uniform scaling of the image and to an in-plane rotation of the face. Since eye blinking is performed by both eyes synchronously, the EAR of both eyes is averaged. Examples of the EAR signal over a video sequence are shown in Figs. 1, 2 and 7.
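For illustration, a minimal NumPy sketch of Eq. (1) and of the two-eye averaging; the landmark ordering p_1, ..., p_6 follows Fig. 1, while the function names and the (6, 2) array layout are our own assumptions, not part of the authors' implementation.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Eye aspect ratio (EAR) of Eq. (1) from six 2D eye landmarks.

    eye: array of shape (6, 2) holding p_1..p_6 as ordered in Fig. 1.
    """
    v1 = np.linalg.norm(eye[1] - eye[5])   # ||p2 - p6||, first vertical distance
    v2 = np.linalg.norm(eye[2] - eye[4])   # ||p3 - p5||, second vertical distance
    h = np.linalg.norm(eye[0] - eye[3])    # ||p1 - p4||, horizontal distance
    return (v1 + v2) / (2.0 * h)

def frame_ear(left_eye, right_eye):
    # Blinking is synchronous, so the EARs of both eyes are averaged per frame.
    return 0.5 * (eye_aspect_ratio(left_eye) + eye_aspect_ratio(right_eye))
```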

A similar feature to measure the eye opening was suggested in [9], but it was derived from the eye segmentation in a binary image.

2.2. Classification

It generally does not hold that a low value of the EAR means that the person is blinking. A low value of the EAR may occur when a subject closes his/her eyes intentionally for a longer time, performs a facial expression such as yawning, or when the EAR captures a short random fluctuation of the landmarks.

Therefore, we propose a classifier that takes a larger temporal window around a frame as its input. For the 30 fps videos, we experimentally found that the ±6 frames around a frame where the eye is the most closed during a blink have a significant impact on blink detection. Thus, for each frame, a 13-dimensional feature is gathered by concatenating the EARs of the frame and of its ±6 neighboring frames.
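A minimal sketch of gathering these 13-dimensional features from a per-frame EAR signal; the function name and the convention of skipping the boundary frames are our assumptions.

```python
import numpy as np

def ear_window_features(ear, half_window=6):
    """Concatenate the EAR of each frame with its +/-half_window neighbours.

    ear: 1D array of per-frame (two-eye averaged) EAR values; assumes
    len(ear) >= 2 * half_window + 1. Returns the feature matrix of shape
    (num_centres, 2 * half_window + 1) and the indices of the centre frames;
    frames too close to the sequence boundaries are skipped.
    """
    ear = np.asarray(ear, dtype=float)
    centres = np.arange(half_window, len(ear) - half_window)
    feats = np.stack([ear[c - half_window:c + half_window + 1] for c in centres])
    return feats, centres
```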

This is implemented by a linear SVM classifier (called EAR SVM) trained on manually annotated sequences. Positive examples are collected from the ground-truth blinks, while the negatives are sampled from parts of the videos where no blink occurs, with a 5-frame spacing and a 7-frame margin from the ground-truth blinks. During testing, the classifier is executed in a scanning-window fashion: a 13-dimensional feature is computed and classified by the EAR SVM for each frame, except at the beginning and the end of a video sequence.
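A possible realization of the EAR SVM using scikit-learn's LinearSVC is sketched below; the paper specifies a linear SVM but not a particular library, so the classifier choice, its default parameters, and the helper names are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ear_svm(features, labels):
    """Train the linear 'EAR SVM' on 13-dimensional window features.

    features: array of shape (num_examples, 13); labels: 1 for blink
    centre frames, 0 for sampled non-blink frames.
    """
    svm = LinearSVC(C=1.0)
    svm.fit(features, labels)
    return svm

def detect_blink_frames(svm, features, centres):
    # Scanning-window classification: one decision per centre frame.
    scores = svm.decision_function(features)
    return centres[scores > 0], scores
```

The raw decision scores are kept because the precision-recall curves in Sec. 3.2 are obtained by sweeping a threshold over them.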

3. Experiments

Two types of experiments were carried out: experiments that measure the accuracy of the landmark detectors (Sec. 3.1) and experiments that evaluate the performance of the whole eye blink detection algorithm (Sec. 3.2).

3.1. Accuracy of landmark detectors

To evaluate the accuracy of the tested landmark detectors, we used the 300-VW dataset [19]. It contains 50 videos in which each frame has an associated precise annotation of the facial landmarks. The videos are "in-the-wild", mostly recorded from TV.

The purpose of the following tests is to demonstrate that recent landmark detectors are particularly robust and precise in detecting eyes, i.e. the eye corners and the contours of the eyelids. We therefore prepared a dataset, a subset of the 300-VW, containing sample images with both open and closed eyes. More precisely, using the ground-truth landmark annotation, we sorted the frames of each subject by the eye aspect ratio (EAR in Eq. (1)) and took the 10 frames with the highest ratio (eyes wide open), the 10 frames with the lowest ratio (mostly eyes tightly shut) and 10 frames sampled randomly. Thus we collected 1500 images. Moreover, all the images were later subsampled (successively 10 times by a factor of 0.75) in order to evaluate the accuracy of the tested detectors on small face images.

Two state-of-the-art landmark detectors were tested: Chehra [1] and Intraface [16]. Both run in real time.¹ Samples from the dataset are shown in Fig. 3. Notice that the faces are not always frontal to the camera, the expression is not always neutral, and people are often emotionally speaking or smiling. Sometimes people wear glasses, and hair may occasionally partially occlude one of the eyes. Both detectors perform generally well, but Intraface is more robust to very small face images, sometimes to an impressive extent, as shown in Fig. 3.

¹Intraface runs at 50 Hz on a standard laptop.


Figure 3: Example images from the 300-VW dataset with landmarks obtained by Chehra [1] and Intraface [16]. Original images (left) with inter-ocular distance (IOD) equal to 63 (top) and 53 (bottom) pixels. Images subsampled (right) to IOD equal to 6.3 (top) and 17 (bottom).

Quantitatively, the accuracy of the landmark detection for a face image is measured by the average relative landmark localization error, defined as usual as

$$\epsilon = \frac{100}{\kappa N} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|_2, \qquad (2)$$

where x_i is the ground-truth location of landmark i in the image, \hat{x}_i is the landmark location estimated by a detector, N is the number of landmarks, and the normalization factor κ is the inter-ocular distance (IOD), i.e. the Euclidean distance between the eye centers in the image.
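A direct transcription of Eq. (2) as a small sketch; the array shapes and the function name are our own conventions.

```python
import numpy as np

def localization_error(gt_landmarks, est_landmarks, iod):
    """Average relative landmark localization error of Eq. (2), in % of IOD.

    gt_landmarks, est_landmarks: arrays of shape (N, 2) with ground-truth
    and estimated 2D landmark positions; iod: inter-ocular distance kappa
    in pixels.
    """
    per_landmark = np.linalg.norm(gt_landmarks - est_landmarks, axis=1)
    return 100.0 * per_landmark.mean() / iod
```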

First, a standard cumulative histogram of the average relative landmark localization error ε was calculated, see Fig. 4, for the complete set of 49 landmarks and also for the subset of 12 landmarks of the eyes only, since these landmarks are used in the proposed eye blink detector. The results are calculated for all the original images, which have an average IOD of around 80 px, and also for all "small" face images (including the subsampled ones) having IOD ≤ 50 px. Over all landmarks, Chehra has more occurrences of very small errors (up to 5 percent of the IOD), but Intraface is more robust, having more occurrences of errors below 10 percent of the IOD.

Figure 4: Cumulative histogram of the average localization error of all 49 landmarks (top) and of the 12 landmarks of the eyes (bottom). The histograms are computed for original-resolution images (solid lines) and for the subset of small images (IOD ≤ 50 px).

For the eye landmarks only, Intraface is always more precise than Chehra. As already mentioned, Intraface is also much more robust to small images than Chehra. This behaviour is examined further in the following experiment.

Taking the set of all 15k images, we measured the mean localization error µ as a function of the face image resolution determined by the IOD. More precisely, $\mu = \frac{1}{|S|} \sum_{j \in S} \epsilon_j$, i.e. the average error over the set S of face images having the IOD in a given range. The results are shown in Fig. 5; the plots have error bars of one standard deviation. It is seen that Chehra fails quickly for images with IOD < 20 px. For larger faces, the mean error is comparable, although slightly better for Intraface on the eye landmarks.
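For concreteness, one way such a curve could be computed; the binning into IOD ranges is our assumption, as the paper does not state the exact bin edges.

```python
import numpy as np

def error_vs_iod(errors, iods, bin_edges):
    """Mean and standard deviation of per-image errors within IOD bins.

    errors: per-image localization errors (Eq. (2)); iods: per-image
    inter-ocular distances; bin_edges: increasing edges of the IOD ranges.
    """
    errors, iods = np.asarray(errors), np.asarray(iods)
    means, stds = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        sel = errors[(iods >= lo) & (iods < hi)]
        means.append(sel.mean() if sel.size else np.nan)
        stds.append(sel.std() if sel.size else np.nan)
    return np.array(means), np.array(stds)
```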

The last test is directly related to the eye blink detector. We measured the accuracy of the EAR as a function of the IOD. The mean EAR error is defined as the mean absolute difference between the true and the estimated EAR. The plots are computed for two subsets: closed or closing eyes (average true ratio 0.05 ± 0.05) and open eyes (average true ratio 0.4 ± 0.1).


Figure 5: Landmark localization accuracy as a function of the face image resolution, computed for all landmarks and for the eye landmarks only.

The error is higher for closed eyes, probably because both detectors are more likely to output open eyes in the case of a failure. It is seen that the ratio error for IOD < 20 px causes a major confusion between the open and closed eye states for Chehra; nevertheless, for larger faces the ratio is estimated precisely enough to ensure reliable eye blink detection.

3.2. Eye blink detector evaluation

We evaluate on two standard databases with ground-truth annotations of blinks. The first one is ZJU [11], consisting of 80 short videos of 20 subjects. Each subject has 4 videos: 2 with and 2 without glasses; 3 videos are frontal and 1 is an upward view. The 30 fps videos are of size 320 × 240 px. The average video length is 136 frames, and a video contains about 3.6 blinks on average. The average IOD is 57.4 pixels. In this database, the subjects do not perform any noticeable facial expressions; they look straight into the camera at a close distance, almost do not move, and neither smile nor speak.

Figure 6: Accuracy of the eye-opening ratio as a function of the face image resolution. Top: images with a small true ratio (ρ < 0.15, mostly closing/closed eyes); bottom: images with a higher ratio (ρ > 0.25, open eyes).

A ground-truth blink is defined by its beginning frame, peak frame and ending frame. The second database, Eyeblink8 [8], is more challenging. It consists of 8 long videos of 4 subjects who are smiling, rotating the head naturally, covering the face with their hands, yawning, drinking and looking down, probably at a keyboard. These videos are 5k to 11k frames long, also at 30 fps, with a resolution of 640 × 480 pixels and an average IOD of 62.9 pixels. They contain about 50 blinks per video on average. Each frame belonging to a blink is annotated with the half-open or closed state of the eyes. To be consistent with the ZJU, we consider half blinks, which do not reach the closed state, as full blinks.

Besides testing the proposed EAR SVM methods, which are trained to detect the specific blink pattern, we compare with a simple baseline method that only thresholds the EAR values of Eq. (1). The EAR SVM classifiers are tested with both landmark detectors, Chehra [1] and Intraface [16].
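The paper does not spell out how consecutive below-threshold frames are grouped by the baseline; the sketch below makes the simplest assumption that each maximal run of frames with EAR below the threshold (0.2 in Figs. 2 and 7) is reported as one detected blink.

```python
import numpy as np

def threshold_blinks(ear, threshold=0.2):
    """Baseline detector: each maximal run of frames with EAR < threshold
    is reported as one blink interval (start_frame, end_frame)."""
    below = np.asarray(ear) < threshold
    blinks, start = [], None
    for i, is_below in enumerate(below):
        if is_below and start is None:
            start = i
        elif not is_below and start is not None:
            blinks.append((start, i - 1))
            start = None
    if start is not None:                      # run continuing to the last frame
        blinks.append((start, len(below) - 1))
    return blinks
```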


Figure 7: Example of detected blinks where the EAR thresholding fails while the EAR SVM succeeds. The plots show the eye aspect ratio EAR in Eq. (1), the results of the EAR thresholding (threshold set to 0.2), the blinks detected by the EAR SVM and the ground-truth labels over the video sequence. The input image with detected landmarks is shown; the depicted frame is marked by a red line.

The experiment with the EAR SVM is done in a cross-dataset fashion: the SVM classifier is trained on Eyeblink8 and tested on ZJU, and vice versa.

To evaluate the detector accuracy, the predicted blinks are compared with the ground-truth blinks. The number of true positives is the number of ground-truth blinks that have a non-empty intersection with the detected blinks. The number of false negatives is the number of ground-truth blinks that do not intersect any detected blink. The number of false positives equals the number of detected blinks minus the number of true positives, plus a penalty for detecting overly long blinks. The penalty is counted only for detected blinks more than twice as long as the average blink of length A: every such long blink of length L is counted L/A times as a false positive. The number of all possibly detectable blinks is computed as the number of frames of a video sequence divided by the subject's average blink length, following Drutarovsky and Fogelton [8].
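A sketch of one reading of this counting protocol; the interval representation, the helper names, and the exact interaction of the length penalty with the plain false-positive count are assumptions where the text leaves room for interpretation.

```python
def evaluate_blinks(detected, ground_truth, avg_blink_len):
    """Count TP, FN and FP for interval lists of (start_frame, end_frame).

    avg_blink_len: the subject's average blink length A, in frames.
    """
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    tp = sum(1 for g in ground_truth if any(overlaps(g, d) for d in detected))
    fn = len(ground_truth) - tp
    fp = len(detected) - tp
    # Penalty for overly long detections: a detected blink longer than 2*A
    # is counted L / A times as a false positive.
    for d in detected:
        length = d[1] - d[0] + 1
        if length > 2 * avg_blink_len:
            fp += length / avg_blink_len
    return tp, fn, fp
```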

The ZJU database appears relatively easy; it mostly holds that every eye closing is an eye blink. Consequently, the precision-recall curves of the EAR thresholding and of both EAR SVM classifiers, shown in Fig. 8a, are almost identical. These curves were calculated by sweeping a threshold over the EAR and over the SVM output score, respectively. All our methods outperform the other detectors [9, 8, 5]. The published methods report the precision and the recall for a single operating point only, not the whole precision-recall curve; see Fig. 8a for the comparison.

The precision-recall curves in Fig. 8b show the evaluation on the Eyeblink8 database. We observe that on this challenging database the EAR thresholding lags behind both EAR SVM classifiers. The thresholding fails when a subject smiles (has narrowed eyes, see the example in Fig. 7), shows a side view, or closes his/her eyes for longer than a blink duration. Both SVM detectors perform much better, and the Intraface-based SVM is even slightly better than the Chehra SVM. Both EAR SVM detectors outperform the method by Drutarovsky and Fogelton [8] by a significant margin.

Finally, we measured the dependence of the whole blink detector's accuracy on the average IOD over the dataset. Every frame of the ZJU database was subsampled to 90%, 80%, ..., 10% of its original resolution. Both the Chehra SVM and the Intraface SVM were used for the evaluation. For each resolution, the area under the precision-recall curve (AUC) was computed. The result is shown in Fig. 9. With Chehra landmarks the accuracy remains very high until the average IOD drops to about 30 px, and the detector fails on images with IOD < 20 px. Intraface landmarks are much better at low resolutions. This confirms our previous study of the landmark accuracy in Sec. 3.1.


Figure 8: Precision-recall curves of the EAR thresholding and the EAR SVM classifiers measured on (a) the ZJU and (b) the Eyeblink8 databases. The published results of the methods A - Drutarovsky and Fogelton [8], B - Lee et al. [9] and C - Danisman et al. [5] are depicted as single operating points.

4. Conclusion

A real-time eye blink detection algorithm was presented. We quantitatively demonstrated that regression-based facial landmark detectors are precise enough to reliably estimate the level of eye openness, and that they are robust to low image quality (in particular to low image resolution, to a large extent) and to in-the-wild phenomena such as non-frontal head pose, poor illumination and facial expressions.

Figure 9: Accuracy of the eye blink detector (measured by AUC) as a function of the image resolution (average IOD) when subsampling the ZJU dataset.

State-of-the-art results on two standard datasets were achieved using the robust landmark detector followed by simple eye blink detection based on the SVM. The algorithm runs in real time, since the additional computational cost of the eye blink detection is negligible compared to that of the real-time landmark detectors.

The proposed SVM method, which uses a temporal window of the eye aspect ratio (EAR), outperforms the EAR thresholding. On the other hand, the thresholding is usable as a single-image classifier to detect the eye state when a longer sequence is not available.

One limitation is that a fixed blink duration was assumed for all subjects, although everyone's blinks last differently; the results could be improved by an adaptive approach. Another limitation lies in the eye-opening estimate: the EAR is computed from a 2D image and, while it is fairly insensitive to head orientation, it may lose discriminability under out-of-plane rotations. A solution might be to define the EAR in 3D, using landmark detectors that estimate the 3D pose (position and orientation) of a 3D model of landmarks, e.g. [1, 3].

Acknowledgment

The research was supported by CTU student grant SGS15/155/OHK3/2T/13.


References
[1] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Incremental face alignment in the wild. In Conference on Computer Vision and Pattern Recognition, 2014.
[2] L. M. Bergasa, J. Nuevo, M. A. Sotelo, and M. Vazquez. Real-time system for monitoring driver vigilance. In IEEE Intelligent Vehicles Symposium, 2004.
[3] J. Čech, V. Franc, and J. Matas. A 3D approach to facial landmarks: Detection, refinement, and tracking. In Proc. International Conference on Pattern Recognition, 2014.
[4] M. Chau and M. Betke. Real time eye tracking and blink detection with USB cameras. Technical Report 2005-12, Boston University Computer Science, May 2005.
[5] T. Danisman, I. Bilasco, C. Djeraba, and N. Ihaddadene. Drowsy driver detection system using eye blink patterns. In Machine and Web Intelligence (ICMWI), Oct 2010.
[6] H. Dinh, E. Jovanov, and R. Adhami. Eye blink detection using intensity vertical projection. In International Multi-Conference on Engineering and Technological Innovation, IMETI 2012.
[7] M. Divjak and H. Bischof. Eye blink based fatigue detection for prevention of computer vision syndrome. In IAPR Conference on Machine Vision Applications, 2009.
[8] T. Drutarovsky and A. Fogelton. Eye blink detection using variance of motion vectors. In Computer Vision - ECCV Workshops, 2014.
[9] W. H. Lee, E. C. Lee, and K. E. Park. Blink detection robust to various facial poses. Journal of Neuroscience Methods, Nov. 2010.
[10] Medicton group. The system I4Control. http://www.i4tracking.cz/.
[11] G. Pan, L. Sun, Z. Wu, and S. Lao. Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In ICCV, 2007.
[12] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In Proc. CVPR, 2014.
[13] A. Sahayadhas, K. Sundaraj, and M. Murugappan. Detecting driver drowsiness based on sensors: A review. MDPI open access: Sensors, 2012.
[14] F. M. Sukno, S.-K. Pavani, C. Butakoff, and A. F. Frangi. Automatic assessment of eye blinking patterns through statistical shape models. In ICVS, 2009.
[15] D. Torricelli, M. Goffredo, S. Conforto, and M. Schmid. An adaptive blink detector to initialize and update a view-based remote eye gaze tracking system in a natural scenario. Pattern Recogn. Lett., 30(12):1144–1150, Sept. 2009.
[16] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Proc. CVPR, 2013.
[17] Z. Yan, L. Hu, H. Chen, and F. Lu. Computer vision syndrome: A widely spreading but largely unknown epidemic among computer users. Computers in Human Behaviour, (24):2026–2042, 2008.
[18] F. Yang, X. Yu, J. Huang, P. Yang, and D. Metaxas. Robust eyelid tracking for fatigue detection. In ICIP, 2012.
[19] S. Zafeiriou, G. Tzimiropoulos, and M. Pantic. The 300 videos in the wild (300-VW) facial landmark tracking in-the-wild challenge. In ICCV Workshop, 2015. http://ibug.doc.ic.ac.uk/resources/300-VW/.