A Neural Network Model for Human Facial Expression Recognition
IJCNN ‘99
Computer Science and Engineering
Gary’s Unbelievable Research Unit (GURU)
Institute for Neural Computation
UC San Diego
Matthew N. Dailey
Garrison W. Cottrell
Why Cognitive Modeling?
• Facial muscle movements can be observed objectively.
• But facial emotion is subjective.
• Machine recognition of emotional facial expressions depends on robustly detecting the muscle movement combinations that humans reliably identify as Fear, Sadness, etc.
• Therefore, robust, practical facial emotion recognition systems should be informed by human perceptual data.
• Strategy:
– Build simple systems that model human psychological data.
– Use the models to guide psychological research.
– Eventually, transfer the knowledge to practical systems.
Motivation
• Our goal: understanding the basis of human perception of facial expression through cognitive modeling.
• There are two classic theories of facial emotion perception: categorical and dimensional.
• Young et al. (1997) facial expression “megamix” experiments using emotion morph stimuli provide evidence partially supporting both theories.
• Our theory: the data on perception of facial affect can largely be explained by the computational requirements of associating facial expressions with emotion labels.
• Our successful facial expression recognition system accounts for both categorical and dimensional perceptual data with the same mechanism.
A Good Cognitive Model Should:
• Be psychologically relevant (i.e. it should be in an area with a lot of real, interesting psychological data).
• Actually be implemented.
• If possible, perform the actual task of interest rather than a cartoon version of it.
• Be simplifying (i.e. it should be constrained to the essential features of the problem at hand).
• Fit the experimental data.
• Make new predictions to guide psychological research.
Motivation
• Our goal: understanding the basis of human perception of facial expression through cognitive modeling.
• There are two classic theories of facial emotion perception: categorical and multidimensional.
• Young et al. (1997) facial expression “megamix” experiments provide evidence partially supporting both theories.
• Our theory: the data on perception of facial affect can largely be explained by the computational requirements of associating facial expressions with emotion labels.
• Our facial expression recognition system accounts for both categorical and multidimensional data with one mechanism.
Dimensional Theories of Emotion
• Multidimensional scaling (MDS) of human similarity judgments usually leads to a two-dimensional emotion “circumplex” with similar expressions closer together.
• Perceptual space is low-dimensional and continuous.
[Figure: emotion circumplex with Anger, Happiness, Fear, Surprise, Sadness, and Disgust; cf. Russell (1980)]
• Predictions for perception:
– Morphs along chords should produce intrusions.
– Morphs across center of the space should travel through Neutral.
Categorical Theories of Emotion
• Categorical perception (e.g. colors in a rainbow):
– Sharp boundaries: a small physical change in stimulus leads to big change in classification at the category boundary.
– Discrimination is better near category boundaries than near prototypes.
– Response times are slower near category boundaries.
– Can be innate (e.g. color categories) or learned (e.g. the boundary between /p/ and /b/ phonemes or the boundary between Clinton and Kennedy in a morph sequence).
• Predictions for perception in facial expression morphs:
– No intrusions of other expressions.
– No difference between chord/bisector morph sequences.
Young et al. (1997) Megamix: Methods
• Created morphs between all possible pairs of the six basic expressions and Neutral. Example:
[Figure: Fear to Sadness morph transition between the Fear prototype and the Sadness prototype, with steps labeled 0%, 10%, 30%, 50%, 70%, 90%, and 100%]
• Presented stimuli in random order to subjects, 6-way or 7-way forced choice (with or without Neutral).
[Figure: identification plot; x-axis: % Fear in Morph (90%, 70%, 50%, 30%, 10%); y-axis: % Identified (0 to 100); curves: %Happy, %Surprise, %Fear, %Sad, %Disgust, %Anger, %Neutral]
Megamix Identification Results
• All transitions have relatively sharp boundaries. Here are 6:
[Figure: identification curves for six morph transitions; axis labels: Happiness, Surprise, Fear, Sadness, Disgust, Anger]
• Only a few very small “intrusions” of unrelated expressions in transitions.
Megamix: More Data Supporting CP
• Subjects showed poor discrimination ability near emotion prototypes and better discrimination ability near transitions.
• Response times were faster near prototypes and slower near transitions ("scallop" shaped):
[Figure: response times for six morph transitions; axis labels: Happiness, Surprise, Fear, Sadness, Disgust, Anger]
Less “Categorical” Data
• Subjects were above chance at detecting the “mixed-in” expression when 30% present.
• Despite seemingly categorical effects in perception of the morphs, the second expression is still detectable.
[Figure: Mixed-In Expression Detection; x-axis: Percent Mixing (10, 30, 50); y-axis: Average Rank Score (-0.5 to 1.5); curves: mixed-in expression (humans), unrelated expressions (humans)]
• Most apparent expression: score = 3.0.
• 2nd/3rd most apparent: score = 2.0/1.0.
• Normalized for response bias.
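The rank scoring above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure from Young et al.: the label order and the function name `rank_scores` are assumptions, and the response-bias normalization is left out because the slides do not specify it.

```python
# Assumed label set and order; the slides only name the six basic expressions.
LABELS = ["Happiness", "Surprise", "Fear", "Sadness", "Disgust", "Anger"]

def rank_scores(ranked_labels):
    """Score one response: 3 for the most apparent expression, 2 for the
    second, 1 for the third, and 0 for every expression not ranked."""
    scores = dict.fromkeys(LABELS, 0.0)
    for points, label in zip((3.0, 2.0, 1.0), ranked_labels):
        scores[label] = points
    return scores

# A subject viewing a Fear/Sadness morph ranks Fear first, Sadness second,
# and Surprise third.
example = rank_scores(["Fear", "Sadness", "Surprise"])
```

Averaging such scores over subjects and stimuli at each mixing level would give the raw (un-normalized) rank scores for the mixed-in and unrelated expressions.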
The Model’s Facial Expression Database
• Ekman and Friesen proposed a quantification of the prototypical muscle movement combinations (Facial Actions) involved in portrayal of happiness, sadness, fear, anger, surprise, and disgust.
– Result: the Pictures of Facial Affect (1976).
– 70% agreement on emotional content by naive human subjects.
• 110 photos, 14 subjects, 7 expressions.
[Figure: actor “JJ” posing Surprised, Happy, Sad, Afraid, Angry, Disgusted, and Neutral]
Model: Expression Transitions
• Young et al. tested their subjects on morphs between pairs of the 6 "basic" expressions and neutral.
• Used Ekman and Friesen actor "JJ."
• We recreated the morph stimuli used in their study with commercial morphing software and the same JJ photos:
[Figure: example JJ morph stimuli; labels: Fear, Surprise, Sadness, Disgust]
Representation: Gabor Jets
• Based on the 2-D Gabor filter (Daugman, 1985).
• 2-D sinusoid wave localized by a Gaussian envelope.
• Combining kernels at multiple spatial frequencies and orientations forms a "jet."
• Good for object recognition (Lades et al. 1993), face recognition (Wiskott et al. 1997), and classification of individual facial actions (Bartlett 1998).
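As a rough illustration of the bullets above, a complex Gabor kernel can be built as a Gaussian envelope multiplied by a complex sinusoid, and a jet as the response magnitudes of a bank of such kernels at one image location. This is only a sketch: the kernel size, wavelengths, and the sigma-to-wavelength ratio are illustrative assumptions, not the parameters used in the model.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Complex 2-D Gabor kernel (size must be odd): the real part is the
    cosine wave and the imaginary part the sine wave, both under the same
    Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Coordinate measured along the direction in which the wave oscillates.
    x_theta = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.exp(1j * 2.0 * np.pi * x_theta / wavelength)
    return envelope * carrier

def gabor_jet(image, row, col, kernels):
    """Response magnitudes of all kernels centered at (row, col).
    Taking the magnitude of the complex response combines the
    cosine/sine quadrature pair into one phase-insensitive value."""
    half = kernels[0].shape[0] // 2
    patch = image[row - half:row + half + 1, col - half:col + half + 1]
    return np.array([np.abs(np.sum(patch * k)) for k in kernels])

# 8 orientations x 5 spatial frequencies = 40 kernels (values are assumed).
orientations = [k * np.pi / 8 for k in range(8)]
wavelengths = [4.0, 5.66, 8.0, 11.3, 16.0]
kernels = [gabor_kernel(25, w, t, sigma=0.5 * w)
           for w in wavelengths for t in orientations]

image = np.random.default_rng(0).random((64, 64))  # stand-in for a face image
jet = gabor_jet(image, 32, 32, kernels)            # 40 magnitudes
```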
Gabor Jet Extraction
[Figure: Gabor jet extraction pipeline]
• Convolution of the image with the real part (cosine) and imaginary part (sine) of each Gabor kernel.
• Combine quadrature pairs to get phase-insensitive Gabor magnitudes.
• Gabor “jet” = 8 orientations, 5 spatial frequencies at one location.
• Extracted jets in a rectangular 29x36 grid, giving a 41760-element pattern vector.
• PCA (top 35 P.C. eigenvectors for non-JJ faces, 69% of variance) reduces this to a 35-element pattern vector.
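The dimensionality reduction step can be sketched with a plain SVD-based PCA. The pattern-vector length follows from the slide (29 x 36 grid locations x 40 magnitudes = 41760), but the data here are random placeholders, the face count of 99 is an assumption, and `fit_pca` and `project` are hypothetical helper names.

```python
import numpy as np

def fit_pca(patterns, n_components):
    """Return the mean, the top principal directions (as rows), and the
    fraction of total variance those directions capture."""
    mean = patterns.mean(axis=0)
    centered = patterns - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values**2
    explained = variance[:n_components].sum() / variance.sum()
    return mean, vt[:n_components], explained

def project(patterns, mean, components):
    """Project pattern vectors onto the retained principal directions."""
    return (patterns - mean) @ components.T

n_features = 29 * 36 * 40                      # 41760 Gabor magnitudes per face
rng = np.random.default_rng(0)
faces = rng.normal(size=(99, n_features))      # placeholder for non-JJ faces

mean, components, explained = fit_pca(faces, 35)
reduced = project(faces, mean, components)     # one 35-element vector per face
```

Fitting the projection on the non-JJ faces only, as the slide states, keeps the test actor out of the preprocessing; a JJ pattern vector would then be passed through `project` with the same `mean` and `components`.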
Feedforward Network Classification
• Pattern vectors are classified independently by several feedforward backpropagation networks. Combining evidence improves generalization accuracy (3% for JJ).
• Individual networks: Train on 70 random faces, reserving remaining 29 for early stopping.
[Figure: training set and holdout set feeding an individual network (softmax, cross entropy) with outputs Happiness, Sadness, Fear, Anger, Surprise, Disgust]
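The per-network setup described above (a random 70/29 split, softmax outputs, cross-entropy error) might look roughly like the following. The helper names and the random seed are assumptions, and the training loop itself is omitted; early stopping would simply keep the weights from the epoch with the lowest holdout cross entropy.

```python
import numpy as np

def split_train_holdout(n_faces, n_train, rng):
    """Randomly partition face indices into a training and a holdout set."""
    order = rng.permutation(n_faces)
    return order[:n_train], order[n_train:]

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(probs, targets):
    """Mean cross entropy; targets are integer class indices."""
    picked = probs[np.arange(len(targets)), targets]
    return -np.mean(np.log(picked))

rng = np.random.default_rng(0)
train_idx, holdout_idx = split_train_holdout(99, 70, rng)  # 70 train, 29 holdout

# Untrained outputs (all logits equal) give the chance-level loss log(6).
uniform = softmax(np.zeros((29, 6)))
chance_loss = cross_entropy(uniform, np.zeros(29, dtype=int))
```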
Ensemble Combination
[Figure: a stimulus passes through preprocessing (Gabor + PCA) into Network 1, Network 2, ..., Network 5; the network outputs are averaged to give the ensemble outputs Happiness, Disgust, Surprise, Anger, Fear, Sadness]
• An experimental subject is modeled by an ensemble of 5 networks with different weights and training sets.
• We combine the outputs of individual networks by averaging their (softmax) outputs.
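Averaging softmax outputs is a one-line operation; the sketch below uses random probability vectors as stand-ins for the five networks' outputs on a single stimulus. The function name `ensemble_output` is hypothetical.

```python
import numpy as np

def ensemble_output(member_outputs):
    """Average the softmax outputs of the member networks.
    member_outputs has shape (n_networks, n_classes)."""
    return np.mean(member_outputs, axis=0)

rng = np.random.default_rng(0)
# Stand-in outputs: 5 networks x 6 expression classes, each row sums to 1.
member_outputs = rng.dirichlet(np.ones(6), size=5)
combined = ensemble_output(member_outputs)
```

Because each member's output sums to 1, the average does too, so the ensemble output can still be read as a distribution over the six labels.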
Measures for Cognitive Modeling
• A college student = one trained-up network ensemble.
• Identification in six-way forced choice experiment = the label on the ensemble’s maximal output.
• Identification response time = the ensemble’s uncertainty, or difference between the maximal output and 1.0.
• Stimulus discrimination ability = dissimilarity = (1 - correlation) between the ensemble’s 6-dimensional output vectors for two stimuli.
• Scoring 1st, 2nd, and 3rd most apparent expressions = recording labels on the largest, 2nd largest, and 3rd largest ensemble outputs.
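Each measure above maps directly onto a small function of the ensemble's 6-dimensional output vector. The sketch below assumes a particular label order and uses made-up output vectors; the function names are hypothetical.

```python
import numpy as np

# Assumed output order; the slides do not fix one.
LABELS = ["Happiness", "Surprise", "Fear", "Sadness", "Disgust", "Anger"]

def identify(output):
    """Forced-choice identification: label on the maximal output."""
    return LABELS[int(np.argmax(output))]

def response_time_proxy(output):
    """Uncertainty: difference between the maximal output and 1.0."""
    return 1.0 - float(np.max(output))

def dissimilarity(output_a, output_b):
    """1 - Pearson correlation between two output vectors."""
    return 1.0 - float(np.corrcoef(output_a, output_b)[0, 1])

def top_three(output):
    """Labels on the largest, 2nd largest, and 3rd largest outputs."""
    order = np.argsort(output)[::-1][:3]
    return [LABELS[i] for i in order]

# Made-up ensemble outputs for a near-prototype Fear face and a
# Fear/Sadness morph close to the category boundary.
near_fear = np.array([0.02, 0.04, 0.85, 0.05, 0.03, 0.01])
boundary = np.array([0.02, 0.04, 0.48, 0.42, 0.03, 0.01])

label = identify(near_fear)
rt_prototype = response_time_proxy(near_fear)
rt_boundary = response_time_proxy(boundary)
ranking = top_three(near_fear)
```

With these example vectors the uncertainty is larger for the boundary morph than for the near-prototype face, which is the pattern the response-time measure is meant to capture.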
Modeling Results
• Trained 50 ensembles of 5 networks on all actors except JJ. 49 ensembles generalized perfectly to JJ prototypes.
• Tested network's response to JJ morph sequences.
• Good quantitative fit: r² = 0.76 with zero fit parameters.
• Small qualitative differences: slightly larger “intrusions,” less variance.
[Figure: model identification curves for six morph transitions; axis labels: Happiness, Surprise, Fear, Sadness, Disgust, Anger]
Model Response Times
• The distance between an ensemble's maximal output and 1.0 is a measure of its uncertainty:
[Figure: model response times for six morph transitions; axis labels: Happiness, Surprise, Fear, Sadness, Disgust, Anger]
• The model RTs show the same scallop-shaped patterns as the data.
Model Discrimination
• Correlation (r) between an ensemble's response to a pair of stimuli models a similarity judgment; 1 - r models dissimilarity / discriminability.
• As in human data, model discriminability is best near the category boundary.
Perception of “Mixed-In” Expression
• Can score and normalize the first, second, and third largest network outputs as for the humans.
• Model scores for mixed-in expression are very close to the human scores.
[Figure: Mixed-In Expression Detection; x-axis: Percent Mixing (10, 30, 50); y-axis: Average Rank Score (-0.5 to 1.5); curves: mixed-in expression (humans), unrelated expressions (humans), mixed-in expression (networks), unrelated expressions (networks)]
Conclusions
• Much of the human perceptual data can be accounted for by a simple feedforward neural network that simply learns to associate expressions with emotional labels.
• The fit is "easy" due to the inherent properties of nonlinear classifiers.
• The minor failings of the model may be due to a lack of training data (there is little between-network variance).
• What have we learned from building this model?
– Neutral classification should be separate from the emotions.
– The fit to human data helps us select one of two systems with equivalent performance on the POFA prototypes.
– Ensemble classification improves accuracy.
Work in Progress
• Robust Neutral/Expressive classification without test set snooping.
• Adding dynamic information to improve performance.
• New experiments exploring the "malleability" of expression category boundaries in humans and in networks.
• Collection of a large public database of emotional facial expression images and video sequences.