Recognition of Facial Expressions in the Presence of Occlusion

Fabrice Bourel, Claude C. Chibelushi, Adrian A. Low
School of Computing, Staffordshire University
Beaconside, Stafford, UK
Abstract
We present a new approach for the recognition of facial expressions from video sequences in the presence of occlusion. Although promising results have been reported in the literature on automatic recognition of facial expressions, most techniques have been assessed using experiments performed in controlled laboratory conditions which do not reflect real-world conditions. The goal of the work presented herein is to develop recognition techniques that will overcome some limitations of current techniques, such as their sensitivity to partial occlusion of the face. The proposed approach is based on a localised representation of facial features, and on data fusion. The experiments show that the proposed approach is robust to partial occlusion of the face.
1 Introduction
BMVC 2001 doi:10.5244/C.15.23

The automatic recognition of facial expressions has been studied with much interest in the past 10 years [9, 22, 24]. An important potential application of automatic facial expression recognition is human-computer interaction [19, 30]. It may also be used as a tool in human behavioral research [22]. This paper presents a new approach for recognising facial expressions, from image sequences, in the presence of partial occlusion. The approach is based on a localised representation of facial expression features, and on fusion of classifier outputs. Data fusion, or sensor fusion, has been used successfully in many fields such as pattern recognition [6, 7, 25] and distributed computing [4, 5]. The ability to handle occluded facial features is important for achieving robust recognition. Despite its importance, the effect of occlusion on automatic facial expression recognition has not received much attention from the research community. As indicated by Pantic [22], no facial expression recognition technique can currently handle partial occlusion. To the knowledge of the authors, the effect of the latter on recognition accuracy has not been assessed either. Causes of occlusion include facial hair, glasses, hands or other objects.
By the nature of their structure, facial models using local extraction of facial information have an advantage over models that consider the whole face. For instance, when a part of the face is occluded, performing optical flow on the whole face [12] may cause problems in representing the extracted data for later classification. Approaches using a deformable model fitted to the whole face [16, 15] may be adapted to handle missing features, if the structure of the model and the classifier both handle partial data in a generic way. Face models based on feature points and a local geometric representation of the face [27, 28] can be adapted easily to achieve robustness to partial occlusion. Occlusion will cause the loss of some model parameters, but the parameters not affected by occlusion may still be used for classification. The classifier also has to be able to handle partial input data, and may need to be retrained.
Although sensitivity to occlusion is still an unresolved problem, many techniques for facial expression recognition have been reported. Black and Yacoob [2] use a local parameterized model of image motion from optical flow analysis. They utilize a planar model for rigid facial motion and an affine-plus-curvature model for non-rigid motion. Otsuka and Ohya [21] estimate the motion of the right eye and the mouth by optical flow. Then, they perform a 2D wavelet transform to extract low-frequency feature vectors. Essa and Pentland [12] first locate the nose, eyes and mouth. Then, from two consecutive normalised frames, a 2D spatio-temporal motion energy representation of facial motion is used as a dynamic face model. Kimura and Yachida [15] fit what they call a 'potential net' on the face. Then, applying a Gaussian filter to an edge-enhanced image, they determine a 'potential field'. The net is then deformed by the elastic force of the potential field. Information encoded in the nodes of the net is used for classification. Wang et al. [28] use 19 facial feature points. The face is represented as a labeled graph model where facial feature points are treated as interconnected nodes. Graphs are compared to templates for recognition using a minimum-distance classification technique. Cohn et al. [8] use feature points that are automatically tracked using a hierarchical optical flow method [18]. The feature vectors, used for the recognition, are created by calculating the displacement of the facial points. The displacement is obtained by subtracting the normalised position of a point in the first frame from its current normalised position. Following Cohn's approach, Lien et al. [17] apply three methods: dense optical flow, feature point tracking, and high-gradient edge extraction. Then, the vectors from each extraction step are vector quantized and used for classification. Tian et al. [27] recently proposed a feature-based method using geometric facial features. The extracted facial regions (mouth, eyes, brows and cheeks) are represented with geometric parameters. Furrows are also detected, using a Canny edge detector, to measure their orientation and quantify their intensity. The geometric parameters are then fed into a neural network for recognition.
Following the facial feature extraction is the classification step. Six basic expression classes (anger, disgust, fear, joy, sadness, surprise), defined by Ekman [10], are often used. Expression analysers can also classify the encountered expression using the FACS system [11]. The FACS system defines 46 action units (AUs) for describing facial expressions. AUs are anatomically related to the contraction of specific muscles. Expressions can be classified either as single AUs or as combinations of AUs [2, 23, 8, 17, 27]. Most approaches for the classification of expressions from image sequences are based on neural networks [23, 27], rule-based models [2, 29], template-based methods [28, 20, 12, 15] using minimum-distance classification, statistical models using discriminant function analysis [8], or probabilistic models such as hidden Markov models [17, 21].
In the context of robust recognition of occluded expressions, using a modular classification and fusion approach may have several advantages over monolithic classification [25]. In the data fusion approach, failure of one or several classifiers does not necessarily degrade the classification process. New classification modules can also be appended easily when other regions of the face are to be examined. In contrast, monolithic classifiers are sensitive to incomplete input data, as would occur in the case of partial occlusion of the face.
This paper presents and assesses a data fusion approach aiming to achieve robustness to partial occlusion. The paper is organised as follows: Section 2 gives an overview of the system. Section 3 and Section 4 describe, respectively, the spatio-temporal model used to represent the expressions, and the classification and data fusion framework. Section 5 presents experiments on the effect of occlusion and discusses the results. We conclude in Section 6 by summarising the findings and suggesting future research directions.
2 Overview of the approach
[Figure 1 block diagram: an image sequence feeds a feature point tracker; motion information is extracted and normalised; six local feature vectors are classified independently by local classifiers (classification of local feature vectors 1 to 6); a fusion/combiner module outputs the class label.]
Figure 1: Architecture of the recognition system.
The framework for automatic recognition of facial expressions, shown in Figure 1, consists of a feature point tracker, a feature extractor, a group of local classifiers, and a fusion module. The feature extractor processes 12 facial feature points which are tracked over an entire image sequence by the robust tracker described in [3] (Figure 2). The 12 facial feature points are set manually on the first frame of the image sequence.
As described in [3], the tracker, based on the Kanade-Lucas tracker [18], is capable of following and recovering any of the 12 facial points lost due to lighting variations, rigid or non-rigid motion, or (to a certain extent) change of head orientation. Automatic recovery, which uses the nostrils as a reference, is based on heuristics exploiting the configuration and visual properties of faces. It was shown in [3] that the enhanced tracker outperforms the Kanade-Lucas tracker in terms of recovering lost or drifting points.
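The per-frame control flow of the tracking scheme (Figure 2) can be sketched as follows. This is a minimal pure-Python sketch under our own assumptions: `track_sequence`, `recover_point`, the frame representation, and the stored nostril offsets are hypothetical stand-ins for the image-based tracker and recovery heuristics of [3], which are not reproduced here.

```python
# Sketch of the per-frame tracking loop of Figure 2 (hypothetical names;
# the actual tracker of [3] builds on Kanade-Lucas and uses image-based
# heuristics we do not reproduce).

def recover_point(name, reference_offsets, nostrils):
    """Re-initialise a lost point from the mid-nostril position plus a
    stored face-configuration offset (stand-in for the paper's heuristics)."""
    nx = (nostrils[0][0] + nostrils[1][0]) / 2.0
    ny = (nostrils[0][1] + nostrils[1][1]) / 2.0
    dx, dy = reference_offsets[name]
    return (nx + dx, ny + dy)

def track_sequence(frames, points, reference_offsets):
    """Run tracking over a sequence; on point loss, invoke automatic recovery."""
    trajectory = []
    for frame in frames:
        updated = {}
        for name, pos in points.items():
            new_pos = frame.get(name)        # stand-in for the KLT update
            if new_pos is None:              # loss detected
                nostrils = (points["nostril_l"], points["nostril_r"])
                new_pos = recover_point(name, reference_offsets, nostrils)
            updated[name] = new_pos
        points = updated                     # tracker update, then next frame
        trajectory.append(dict(points))
    return trajectory
```

A lost point is thus re-seeded from the nostrils, which the paper reports as the most stable reference features.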
When the tracking is complete, independent local spatio-temporal vectors are created from geometrical relations between the feature points (Figure 3). These vectors contain the motion information relative to the start of each sequence. Dynamic information is used in order to achieve shape independence. Shape independence is sought so that the effect of person identity on expression recognition is minimised. In particular, it has been shown in speaker recognition that lip shape conveys identity information [6].
Classification is applied to local facial regions (mouth and eyebrows) instead of global facial information, and the local regions are classified independently. This way, occlusion of a local region will not affect the classification of the other regions. Decision-level fusion [25] of classifier outputs is used. For the purpose of this paper, 4 classes of expression are used out of the 6 defined in [10]; they are the facial expressions for 'anger', 'joy', 'sadness' and 'surprise'.
[Figure 2 flow diagram: each frame is processed by the tracker; if a point loss is detected, automatic point recovery is invoked; if there is no loss, the tracker is updated and the next frame is processed.]
Figure 2: Automatic feature point tracking scheme.
3 Spatio-temporal representation of expressions
Facial point tracking is followed by a determination of the start and end points of the dynamic part of the sequence for each local region. The neutral end-segments are then discarded. Thereafter, the duration of the sequence is normalised to approximately 2.66 seconds by linear interpolation. At 30 frames per second, this corresponds to 80 video frames. The normalisation is applied independently to each local region. As a result of duration normalisation, two expressions performed over different amounts of time will become similar on the normalised temporal scale. Figure 4 illustrates the removal of the neutral segments at the beginning and end of a sequence.

Figure 3: Facial model used to produce the six local feature vectors.

The mouth model is defined using the four feature points shown in Figure 2. From these four points, we extract four parabolic coefficients, which model the shape of the mouth in each frame (Figure 3). The upper face model also consists of four coefficients. Two coefficients model the shape of the eyebrows by measuring their angle of deformation (Figure 3). The other two coefficients measure the distance from the eyebrows to the nostrils (Figure 3). From the eight coefficients, a set of six local feature vectors is created:
{(c1, c4), (c2, c3), (a1), (a2), (d1), (d2)}, where c1 to c4 are the parabolic mouth coefficients, a1 and a2 the eyebrow angles, and d1 and d2 the eyebrow-to-nostril distances (the grouping follows Figure 5). The components of the feature vectors are created by taking the difference between the coefficients for each frame, and the average of the first set of coefficients representing the segment removed from the beginning of the sequence.
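The parabolic mouth coefficients can be illustrated with a small sketch. The paper does not state exactly which four coefficients are computed from the four mouth points; as one plausible reading (an assumption, not the authors' specification), each lip is modelled by a parabola y = a·x² + b·x + c fitted through the two mouth corners and the top (or bottom) point, keeping two coefficients per lip:

```python
# Assumption: each lip is a parabola through three of the four mouth
# points; the curvature a and offset c of each lip give four coefficients.

def parabola_through(p1, p2, p3):
    """Coefficients (a, b, c) of y = a*x^2 + b*x + c through three points
    (closed-form Lagrange interpolation)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3**2 * (y1 - y2) + x2**2 * (y3 - y1) + x1**2 * (y2 - y3)) / denom
    c = (x2 * x3 * (x2 - x3) * y1
         + x3 * x1 * (x3 - x1) * y2
         + x1 * x2 * (x1 - x2) * y3) / denom
    return a, b, c

def mouth_coefficients(left, right, top, bottom):
    """Four mouth-shape coefficients: curvature and offset of each lip."""
    a_up, _, c_up = parabola_through(left, top, right)      # upper lip
    a_lo, _, c_lo = parabola_through(left, bottom, right)   # lower lip
    return (a_up, c_up, a_lo, c_lo)
```

For a symmetric open mouth with corners at (±1, 0), top at (0, 1) and bottom at (0, -1), the two lips yield opposite curvatures, which is the kind of shape information the model encodes.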
[Figure 4 plots: amplitude of a parabolic coefficient against frame number, for (a) a sequence with neutral segments detected at both ends, and (b) the same sequence with the neutral segments removed.]
Figure 4: Removal of the neutral segments at the start and end of a motion sequence. (a) Neutral segments detected. (b) Neutral segments removed.
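The duration normalisation described above can be sketched as follows: after the neutral end-segments are discarded, each coefficient trajectory is resampled onto the 80-frame normalised scale by linear interpolation. This is a pure-Python stand-in; the function and parameter names are ours, not the paper's.

```python
# Resample one coefficient trajectory to a fixed length (80 frames in
# the paper) by linear interpolation.

def resample_linear(values, target_len=80):
    """Linearly interpolate a 1-D trajectory onto target_len samples,
    preserving the first and last values."""
    n = len(values)
    if n == 1:
        return [values[0]] * target_len
    out = []
    for i in range(target_len):
        t = i * (n - 1) / (target_len - 1)   # position on the original time axis
        lo = int(t)
        hi = min(lo + 1, n - 1)
        frac = t - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out
```

Applied independently per local region, this makes two renditions of the same expression, performed at different speeds, comparable sample-by-sample.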
4 Facial expression classification
Data fusion has been used by many researchers to improve the accuracy of pattern classification [6], when several sources of information are independent or complementary. In our system, each local expert classifier outputs a recognition score for each known expression. The scores are then combined to produce a final classification result (Figure 5).
[Figure 5 diagram: the six feature vectors (c1, c4), (c2, c3), a1, a2, d1 and d2 each feed a local classifier; the classifier scores are merged by a combiner, which outputs the class label.]
Figure 5: Expression classification using decision-level fusion.
4.1 Local classifiers
Each local classifier is a rank-weighted k-nearest-neighbour classifier. It produces a weighted cumulative score for each known class of expression. The weighted class score is proportional to the rank of each nearest neighbour belonging to the class. This is an adaptation of the distance-weighted kNN classifier [1]. Therefore, the class of the first nearest neighbour receives the highest score, the class of the second a lower score, and so on until k neighbours are reached. For instance, if k = 4, the class of the first nearest neighbour receives a score of 4, the class of the second neighbour 3, and so on, until the fifth neighbour and all further neighbours, which receive a score of 0.
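As an illustration, the rank-weighted scoring might be implemented as below. This is a sketch under our reading of the paper, with k = 4 inferred from the scoring example; the function name and the use of squared Euclidean distance for ranking are assumptions.

```python
# Rank-weighted k-nearest-neighbour scoring: the class of the r-th
# nearest neighbour (r = 0, 1, ..., k-1) receives k - r points.

def knn_scores(query, train, k=4):
    """Cumulative rank-weighted score per class.

    train: list of (feature_vector, class_label) pairs.
    """
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    ranked = sorted(train, key=lambda sample: dist2(query, sample[0]))
    scores = {}
    for rank, (_, label) in enumerate(ranked[:k]):
        scores[label] = scores.get(label, 0) + (k - rank)
    return scores
```

The score dictionary, rather than a single hard label, is what each local classifier passes on to the combiner.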
4.2 Combining scheme
The combination is implemented as a sum scheme performed at the decision level. It is used here for its simplicity. When all local classifiers are combined, every class of expression gets a final score obtained by summing the class-specific scores from the local classifiers. The unknown pattern is assigned to the class which has the highest score. Other approaches to decision fusion, such as linear combination, Bayesian inference, neural networks and fuzzy logic, are the subject of our ongoing investigations [6, 25, 13].
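The sum scheme itself is only a few lines. The sketch below (with illustrative score dictionaries, not the paper's data) also shows how an occluded region fits in: its local classifier is simply left out of the combination.

```python
# Decision-level sum fusion: per-class scores from the local classifiers
# are summed, and the class with the highest total wins.

def fuse_sum(score_dicts):
    """Sum the per-class scores from several local classifiers and
    return the winning class label together with the totals."""
    totals = {}
    for scores in score_dicts:       # one dict per (non-occluded) local classifier
        for label, s in scores.items():
            totals[label] = totals.get(label, 0) + s
    return max(totals, key=totals.get), totals
```

When a region is occluded, the corresponding score dictionary is omitted from `score_dicts` and the remaining classifiers still yield a decision, which is the modularity argument made above.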
5 Experimental results
[Figure 6 bar chart: percentage of correct recognition for the 'anger', 'joy', 'sadness' and 'surprise' classes, under four conditions: no occlusion, mouth occlusion, upper face occlusion, and left/right half-face occlusion.]
Figure 6: Effect of partial occlusion on facial expression recognition. Error bars show the 99% confidence intervals.
The aim of these experiments is to study the robustness of the proposed recognition approach to the occlusion of face regions. The experiments have been carried out using 100 video sequences, with 25 sequences available for each expression class. Partitioning into test and training sets is based on the leave-one-out test methodology. The confidence limits are estimated under the assumption that recognition rates have a binomial distribution [26]. The 99% confidence levels produce, in the worst case, confidence intervals of ±10.3% for the test sample size and the measured recognition rates. The image sequences are from the Cohn-Kanade Action-Unit-coded Carnegie Mellon University facial expression database [14]. The 30 subjects from the database vary in ethnicity, age and skin colour.
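Such confidence limits can be computed with the usual normal approximation to the binomial, as in standard statistics texts [26]. The sketch below is illustrative only: the exact test sample size behind the quoted ±10.3% interval is not restated here, so the numbers in the usage example are not the paper's.

```python
import math

def binomial_ci_halfwidth(p_hat, n, z=2.576):
    """Approximate half-width of a confidence interval for a recognition
    rate p_hat estimated from n trials (z = 2.576 for the 99% level).
    The interval is widest, i.e. the worst case, at p_hat = 0.5."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)
```

For example, with an illustrative 100 test trials and a rate of 0.5, the 99% half-width is about 0.129; higher rates and larger samples tighten the interval.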
Tests with no occlusion of any facial features serve as the baseline. The effect of occlusion is investigated by simulating missing facial regions. The test set covers occlusion of the mouth, the upper face, or the left/right half of the face. The motivations for these tests are many. For example, when somebody performs an expression such as a smile, he may put his hand or an object in front of his mouth. Facial hair or glasses can also occlude face regions.
The results are shown in Figure 6. Recognition of the 'anger' and 'surprise' expressions is robust to the occlusion of the mouth. This is as expected, because some discriminative information is also embedded in the upper part of the face. Surprisingly, 'joy' is recognised accurately despite the absence of the mouth, which is important in making a smile. This unexpected result could be due to the virtual lack of motion of the eyebrows during a smile, which sets 'joy' apart from the other expression classes under study. However, a problem may arise if a neutral expression is considered as a valid class for recognition: the feature vectors for 'joy' and the neutral expression would be similar. The discriminative features for the 'sadness' expression are mainly conveyed by the mouth, as illustrated by the sensitivity of the recognition of 'sadness' to mouth occlusion.
Both the left and right face sides were found to give the same results on average. They are therefore represented as one single set in Figure 6. The recognition results under occlusion are very close to the baseline results corresponding to no occlusion. This shows that reducing the amount of facial information by half still retains sufficient discriminative information for expression recognition. However, removing the data corresponding to half of the face does not necessarily mean that we remove redundant data. Indeed, expressions can be unilateral as well as bilateral [11].
Overall, the experimental results show that the proposed approach is robust to partial occlusion of the face. However, analysing expressions using geometric features misses the regional information embedded in image intensity, texture, or edges. Hence, ongoing investigations are studying the use of both geometric features and texture. The effect on robustness to partial occlusion, of the granularity of the subdivision of the face into local regions, is also under study.
6 Conclusion
A method for recognising facial expressions in the presence of occlusion has been presented. Facial feature points are automatically tracked over an image sequence, and used to represent a face model made of several regions of interest. Decision-level fusion combines local interpretations of the face model into a global recognition score. It has been experimentally shown that the proposed approach is robust to partial occlusion of the face. It has also been observed that there is an interaction between recognition robustness for some expressions and which region of the face is occluded.
References
[1] T. Bailey and A. K. Jain. A Note on Distance-Weighted k-Nearest Neighbor Rules. IEEE Transactions on Systems, Man, and Cybernetics, 8:311–313, 1978.

[2] M. J. Black and Y. Yacoob. Recognizing Facial Expressions in Image Sequences Using Local Parameterized Models of Image Motion. International Journal of Computer Vision, 25(1):23–48, October 1997.

[3] F. Bourel, C. C. Chibelushi, and A. A. Low. Robust Facial Feature Tracking. Proc. 11th British Machine Vision Conference, Bristol, England, 1:232–241, September 2000.

[4] R. R. Brooks and S. S. Iyengar. Robust Distributed Computing and Sensing Algorithm. IEEE Computer, 29(6):53–60, June 1996.

[5] R. R. Brooks and S. S. Iyengar. Multi-Sensor Fusion: Fundamentals and Applications with Software. Prentice Hall, 1998.

[6] C. C. Chibelushi, F. Deravi, and S. D. Mason. A Review of Speech-Based Bimodal Recognition. IEEE Transactions on Multimedia, accepted for publication.

[7] C. C. Chibelushi, F. Deravi, and S. D. Mason. Adaptive Classifier Integration for Robust Pattern Recognition. IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics, 29(6):902–907, December 1999.
[8] J. Cohn, A. Zlochower, J. Lien, and T. Kanade. Feature-Point Tracking by Optical Flow Discriminates Subtle Differences in Facial Expression. Proc. 3rd IEEE International Conference on Automatic Face and Gesture Recognition, pages 396–401, 1998.

[9] G. Donato, M. S. Bartlett, J. C. Hager, P. Ekman, and T. J. Sejnowski. Classifying Facial Actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(10):974–989, October 1999.

[10] P. Ekman and W. Friesen. Unmasking the Face. Prentice Hall, New Jersey, 1975.

[11] P. Ekman and W. Friesen. The Facial Action Coding System. Consulting Psychologists Press, San Francisco, CA, 1978.

[12] I. A. Essa and A. P. Pentland. Coding, Analysis, Interpretation and Recognition of Facial Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):757–763, July 1997.

[13] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical Pattern Recognition: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, January 2000.

[14] T. Kanade, J. Cohn, and Y. Tian. Comprehensive Database for Facial Expression Analysis. Proc. 4th IEEE International Conference on Automatic Face and Gesture Recognition (FG'2000), France, pages 46–53, March 2000.

[15] S. Kimura and M. Yachida. Facial Expression Recognition and its Degree Estimation. Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pages 295–300, 1997.

[16] A. Lanitis, C. J. Taylor, and T. F. Cootes. Automatic Interpretation and Coding of Face Images Using Flexible Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):743–756, July 1997.

[17] J. J. Lien, T. Kanade, J. F. Cohn, and C. Li. Detection, Tracking, and Classification of Action Units in Facial Expression. Journal of Robotics and Autonomous Systems, 31:131–146, 2000.

[18] B. D. Lucas and T. Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. Proc. International Joint Conference on Artificial Intelligence, pages 674–679, 1981.

[19] B. Myers, J. Hollan, and I. Cruz. Strategic Directions in Human-Computer Interaction. ACM Computing Surveys, 28(4):795–809, December 1996.

[20] N. Oliver, A. P. Pentland, and F. Berard. LAFTER: Lips and Face Real Time Tracker with Facial Expression Recognition. Proc. Computer Vision and Pattern Recognition Conference, San Juan, Puerto Rico, pages 123–129, June 1997.

[21] T. Otsuka and J. Ohya. Recognition of Facial Expressions Using HMM with Continuous Output Probabilities. 5th IEEE International Workshop on Robot and Human Communication, pages 323–328, 1996.
[22] M. Pantic and L. J. M. Rothkrantz. Automatic Analysis of Facial Expressions: The State of the Art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1424–1445, December 2000.

[23] M. Rosenblum, L. S. Davis, and Y. Yacoob. Human Emotion Recognition from Motion Using a Radial Basis Function Network Architecture. IEEE Workshop on Motion of Non-Rigid and Articulated Objects, Austin, Texas, pages 43–49, November 1994.

[24] A. Samal and P. A. Iyengar. Automatic Recognition and Analysis of Human Faces and Facial Expressions: A Survey. Pattern Recognition, 25(1):65–77, January 1992.

[25] M. Sarkar. Modular Pattern Classifiers: A Brief Survey. Proc. IEEE Conference on Systems, Man, and Cybernetics, 4:2878–2883, 2000.

[26] M. R. Spiegel. Schaum's Outline of Theory and Problems of Statistics, 2nd Edition. McGraw-Hill, 1990.

[27] Y. Tian, T. Kanade, and J. F. Cohn. Recognizing Action Units for Facial Expression Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):97–115, February 2001.

[28] M. Wang, Y. Iwai, and M. Yachida. Expression Recognition from Time-Sequential Facial Images by Use of Expression Change Model. Proc. 3rd IEEE International Conference on Automatic Face and Gesture Recognition, pages 324–329, 1998.

[29] Y. Yacoob and L. S. Davis. Recognizing Human Facial Expressions from Long Image Sequences Using Optical-Flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):636–642, June 1996.

[30] J. Ziegler. Interactive Techniques. ACM Computing Surveys, 28(1):185–187, March 1996.