Why Cognitive Modeling?

• Facial muscle movements can be observed objectively.

• But facial emotion is subjective.

• Machine recognition of emotional facial expressions depends on robustly detecting the muscle movement combinations that humans reliably identify as Fear, Sadness, etc.

• Therefore, robust, practical facial emotion recognition systems should be informed by human perceptual data.

• Strategy:

– Build simple systems that model human psychological data.

– Use the models to guide psychological research.

– Eventually, transfer the knowledge to practical systems.

Motivation

• Our goal: understanding the basis of human perception of facial expression through cognitive modeling.

• There are two classic theories of facial emotion perception: categorical and dimensional.

• Young et al. (1997) facial expression “megamix” experiments using emotion morph stimuli provide evidence partially supporting both theories.

• Our theory: the data on perception of facial affect can largely be explained by the computational requirements of associating facial expressions with emotion labels.

• Our successful facial expression recognition system accounts for both categorical and dimensional perceptual data with the same mechanism.

A Good Cognitive Model Should:

• Be psychologically relevant (i.e. it should be in an area with a lot of real, interesting psychological data).

• Actually be implemented.

• If possible, perform the actual task of interest rather than a cartoon version of it.

• Be simplifying (i.e. it should be constrained to the essential features of the problem at hand).

• Fit the experimental data.

• Make new predictions to guide psychological research.



• There are two classic theories of facial emotion perception: categorical and multidimensional.

• Young et al. (1997) facial expression “megamix” experiments provide evidence partially supporting both theories.

• Our facial expression recognition system accounts for both categorical and multidimensional data with one mechanism.

Dimensional Theories of Emotion

• Multidimensional scaling (MDS) of human similarity judgments usually leads to a two-dimensional emotion “circumplex” with similar expressions closer together.

• Perceptual space is low-dimensional and continuous.

• Predictions for perception:

– Morphs along chords should produce intrusions.

– Morphs across the center of the space should travel through Neutral.

[Figure: emotion circumplex with the labels Anger, Happiness, Fear, Surprise, Sadness, and Disgust; c.f. Russell (1980)]
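The two dimensional-theory predictions can be illustrated with a toy geometry. The sketch below is only an illustration: the even spacing of the six expressions on a unit circle, their circular order, and the placement of Neutral at the origin are all assumptions, not values from MDS data. A morph is treated as a straight line between two prototypes, and each point is labeled by its nearest prototype.

```python
import numpy as np

# Hypothetical layout: six expressions evenly spaced on a unit circle,
# Neutral at the origin. Order and spacing are assumptions for illustration.
ORDER = ["Happiness", "Surprise", "Fear", "Sadness", "Disgust", "Anger"]
ANGLES = np.deg2rad(np.arange(6) * 60.0)
PROTOTYPES = {n: np.array([np.cos(a), np.sin(a)]) for n, a in zip(ORDER, ANGLES)}
PROTOTYPES["Neutral"] = np.zeros(2)


def nearest_label(point, exclude=()):
    """Return the label of the prototype closest to a point in the 2-D space."""
    candidates = {k: v for k, v in PROTOTYPES.items() if k not in exclude}
    return min(candidates, key=lambda k: np.linalg.norm(candidates[k] - point))


def morph_path(a, b, steps=11):
    """Linearly interpolate between two prototypes (a straight line in the space)."""
    ts = np.linspace(0.0, 1.0, steps)
    return [(1 - t) * PROTOTYPES[a] + t * PROTOTYPES[b] for t in ts]


# Chord morph: Happiness to Fear, which skips over Surprise on the circle.
chord_mid = morph_path("Happiness", "Fear")[5]
# Diameter morph: Happiness to Sadness, which crosses the center of the space.
diameter_mid = morph_path("Happiness", "Sadness")[5]

print(nearest_label(chord_mid, exclude=("Neutral",)))  # intruding expression
print(nearest_label(diameter_mid))                     # label at the center
```

In this toy layout the midpoint of the chord lies closest to the skipped expression (an intrusion in a 6-way choice), and the midpoint of the diameter coincides with Neutral.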

Categorical Theories of Emotion

• Categorical perception (e.g. colors in a rainbow):

– Sharp boundaries: a small physical change in stimulus leads to a big change in classification at the category boundary.

– Discrimination is better near category boundaries than near prototypes.

– Response times are slower near category boundaries.

– Can be innate (e.g. color categories) or learned (e.g. the boundary between /p/ and /b/ phonemes or the boundary between Clinton and Kennedy in a morph sequence).

• Predictions for perception in facial expression morphs:

– No intrusions of other expressions.

– No difference between chord/bisector morph sequences.

Young et al. (1997) Megamix: Methods

• Created morphs between all possible pairs of the six basic expressions and Neutral. Example:

[Figure: morph image sequence from the Fear prototype (0%) through 10%, 30%, 50%, 70%, and 90% morphs to the Sadness prototype (100%)]

• Presented stimuli in random order to subjects, 6-way or 7-way forced choice (with or without Neutral).

[Figure: Fear to Sadness Morph Transition; x-axis: % Fear in Morph (90% to 10%); y-axis: % Identified (0 to 100); series: %Happy, %Surprise, %Fear, %Sad, %Disgust, %Anger, %Neutral]

Megamix Identification Results

• All transitions have relatively sharp boundaries. Here are 6:

[Figure: identification curves across six morph transitions; axis labels: Happiness, Surprise, Fear, Sadness, Disgust, Anger, Happiness]

• Only a few very small “intrusions” of unrelated expressions in transitions.

Megamix: More Data Supporting CP

• Subjects showed poor discrimination ability near emotion prototypes and better discrimination ability near transitions.

• Response times were faster near prototypes and slower near transitions ("scallop" shaped).


Less “Categorical” Data

• Subjects were above chance at detecting the “mixed-in” expression when 30% present.

• Despite seemingly categorical effects in perception of the morphs, the second expression is still detectable.

[Figure: Mixed-In Expression Detection; x-axis: Percent Mixing (10, 30, 50); y-axis: Average Rank Score (-0.5 to 1.5); series: mixed-in expression (humans), unrelated expressions (humans)]

• Most apparent expression: score = 3.0.

• 2nd/3rd most apparent: score = 2.0/1.0.

• Normalized for response bias.

The Model’s Facial Expression Database

• Ekman and Friesen proposed a quantification of the prototypical muscle movement combinations (Facial Actions) involved in portrayal of happiness, sadness, fear, anger, surprise, and disgust.

– Result: the Pictures of Facial Affect (1976).

– 70% agreement on emotional content by naive human subjects.

• 110 photos, 14 subjects, 7 expressions.

[Figure: Actor “JJ” posing Surprised, Happy, Sad, Afraid, Angry, Disgusted, and Neutral]

Model: Expression Transitions

• Young et al. tested their subjects on morphs between pairs of the 6 "basic" expressions and neutral.

• Used Ekman and Friesen actor "JJ."

• We recreated the morph stimuli used in their study with commercial morphing software and the same JJ photos:

[Figure: recreated JJ morph sequences; labels: Fear, Surprise, Sadness, Disgust]

Representation: Gabor Jets

• Based on the 2-D Gabor filter (Daugman, 1985).

• 2-D sinusoid wave localized by a Gaussian envelope.

• Combining kernels at multiple spatial frequencies and orientations forms a "jet."

• Good for object recognition (Lades et al. 1993), face recognition (Wiskott et al. 1997), and classification of individual facial actions (Bartlett 1998).
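As a rough illustration of the bullets above, the following sketch builds a complex Gabor kernel (a sinusoid under a Gaussian envelope) and computes a 40-element jet of magnitudes at one image location. The kernel size, wavelengths, and the `sigma = 0.5 * wavelength` rule are assumptions chosen for the example, not the parameters used in the model, and the random image is a stand-in for a face photo.

```python
import numpy as np


def gabor_kernel(size, wavelength, theta, sigma):
    """Complex 2-D Gabor kernel: a plane wave under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Coordinate along the wave's direction of propagation.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * 2.0 * np.pi * x_rot / wavelength)
    return envelope * carrier


def jet_at(image, row, col, kernels):
    """Gabor magnitudes at one pixel: one value per kernel in the bank."""
    half = kernels[0].shape[0] // 2
    patch = image[row - half:row + half + 1, col - half:col + half + 1]
    # The real (cosine) and imaginary (sine) responses form a quadrature pair;
    # abs() combines them into a phase-insensitive magnitude.
    return np.array([np.abs(np.sum(patch * k)) for k in kernels])


# Bank of 5 spatial frequencies x 8 orientations (parameter values are assumptions).
wavelengths = [4.0, 5.7, 8.0, 11.3, 16.0]
thetas = [i * np.pi / 8 for i in range(8)]
bank = [gabor_kernel(33, w, t, sigma=0.5 * w) for w in wavelengths for t in thetas]

rng = np.random.default_rng(0)
image = rng.random((64, 64))       # stand-in for a grayscale face image
jet = jet_at(image, 32, 32, bank)  # 40 magnitudes at the image center
```

Taking the magnitude discards the phase of the local sinusoid, which makes the representation less sensitive to small shifts of the face.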

Gabor Jet Extraction

• Convolution of the image with each Gabor kernel: real part (cosine) and imaginary part (sine).

• Combine quadrature pairs to get phase-insensitive Gabor magnitudes.

• Gabor “jet” = 8 orientations, 5 spatial frequencies at one location.

• Extracted jets in a rectangular 29x36 grid, giving a 41760-element pattern vector.

• PCA (top 35 P.C. eigenvectors for non-JJ faces; 69% of variance) reduces this to a 35-element pattern vector.
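The dimensionality reduction step can be sketched as follows. This is a minimal example under stated assumptions: the data are random stand-ins, the demo uses a reduced dimension (`demo_dim`) instead of the full 41760 to stay fast, and PCA is computed with a plain SVD, which may differ from the procedure actually used. The key point it shows is that the principal components are fit on the training faces only and then applied to the held-out faces.

```python
import numpy as np

GRID_ROWS, GRID_COLS = 29, 36      # jet grid from the slide
ORIENTATIONS, FREQUENCIES = 8, 5   # filters per jet from the slide
FULL_DIM = GRID_ROWS * GRID_COLS * ORIENTATIONS * FREQUENCIES  # 41760


def fit_pca(train, n_components):
    """Return (mean, components, kept variance fraction) from training data only."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(train - mean, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return mean, vt[:n_components], float(explained[:n_components].sum())


def project(patterns, mean, components):
    """Project patterns onto the retained principal components."""
    return (patterns - mean) @ components.T


# Synthetic stand-in data; a reduced dimension keeps the demo fast.
rng = np.random.default_rng(0)
demo_dim = 2000
train = rng.random((99, demo_dim))     # stand-in for the non-JJ faces
held_out = rng.random((11, demo_dim))  # stand-in for JJ faces (count is an assumption)

mean, components, kept_variance = fit_pca(train, n_components=35)
reduced = project(held_out, mean, components)  # one 35-element vector per face
```

Fitting the components without the test actor keeps the preprocessing from seeing any JJ data before generalization is measured.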

Feedforward Network Classification

• Pattern vectors are classified independently by several feedforward backpropagation networks. Combining evidence improves generalization accuracy (3% for JJ).

• Individual networks: Train on 70 random faces, reserving the remaining 29 for early stopping.

[Figure: individual network (softmax, cross entropy) with a training set and a holdout set; outputs: Happiness, Surprise, Fear, Sadness, Disgust, Anger]

Ensemble Combination

• An experimental subject is modeled by an ensemble of 5 networks with different weights and training sets.

• We combine the outputs of individual networks by averaging their (softmax) outputs.

[Figure: a stimulus passes through preprocessing (Gabor + PCA) into Network 1 through Network 5; averaging their outputs gives the ensemble outputs for Happiness, Surprise, Fear, Sadness, Disgust, and Anger]
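A minimal sketch of the averaging step, assuming each network maps a 35-element pattern vector to six softmax outputs. The networks here are untrained single-layer stand-ins with random weights (the real architecture and training are not reproduced), and the label order is an assumption.

```python
import numpy as np

LABELS = ["Happiness", "Surprise", "Fear", "Sadness", "Disgust", "Anger"]


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def network_output(pattern, weights, bias):
    """One network's softmax output (a single linear layer, for brevity)."""
    return softmax(pattern @ weights + bias)


def ensemble_output(pattern, networks):
    """Average the softmax outputs of all networks in the ensemble."""
    outputs = np.stack([network_output(pattern, w, b) for w, b in networks])
    return outputs.mean(axis=0)


# Five untrained stand-in networks with different random weights.
rng = np.random.default_rng(0)
networks = [(rng.normal(size=(35, 6)), rng.normal(size=6)) for _ in range(5)]
pattern = rng.normal(size=35)  # stand-in for a Gabor + PCA pattern vector
combined = ensemble_output(pattern, networks)
```

Because each softmax output sums to one, their average also sums to one, so the ensemble output can still be read as a distribution over the six labels.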

Measures for Cognitive Modeling

• A college student = one trained-up network ensemble.

• Identification in six-way forced choice experiment = the label on the ensemble’s maximal output.

• Identification response time = the ensemble’s uncertainty, or difference between the maximal output and 1.0.

• Stimulus discrimination ability = dissimilarity = (1 - correlation) between the ensemble’s 6-dimensional output vectors for two stimuli.

• Scoring 1st, 2nd, and 3rd most apparent expressions = recording labels on the largest, 2nd largest, and 3rd largest ensemble outputs.
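The measures above can be written down directly from an ensemble's 6-dimensional output vector. The sketch below assumes a particular label order and uses made-up output vectors; the 3.0/2.0/1.0 rank scores follow the scoring described for the human subjects, and the normalization for response bias is omitted because its details are not given here.

```python
import numpy as np

LABELS = ["Happiness", "Surprise", "Fear", "Sadness", "Disgust", "Anger"]


def identify(output):
    """Forced-choice identification: label on the maximal output."""
    return LABELS[int(np.argmax(output))]


def response_time(output):
    """Model RT: uncertainty, the gap between the maximal output and 1.0."""
    return 1.0 - float(np.max(output))


def dissimilarity(output_a, output_b):
    """Model discriminability: 1 - Pearson correlation of two output vectors."""
    return 1.0 - float(np.corrcoef(output_a, output_b)[0, 1])


def rank_scores(output):
    """Score 3.0 / 2.0 / 1.0 for the three largest outputs, 0.0 elsewhere."""
    order = np.argsort(output)[::-1][:3]
    scores = {label: 0.0 for label in LABELS}
    for points, idx in zip((3.0, 2.0, 1.0), order):
        scores[LABELS[idx]] = points
    return scores


# Made-up ensemble outputs: one near a prototype, one near a Fear/Sadness boundary.
near_prototype = np.array([0.90, 0.04, 0.02, 0.02, 0.01, 0.01])
near_boundary = np.array([0.02, 0.03, 0.48, 0.44, 0.02, 0.01])

print(identify(near_prototype), response_time(near_prototype))
print(identify(near_boundary), response_time(near_boundary))
print(dissimilarity(near_prototype, near_boundary))
print(rank_scores(near_boundary))
```

With these definitions, an output vector split between two labels yields a larger model response time than one dominated by a single label, which is the pattern the response-time measure is meant to capture.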

Modeling Results

• Trained 50 ensembles of 5 networks on all actors except JJ. 49 ensembles generalized perfectly to JJ prototypes.

• Tested the networks' responses to JJ morph sequences.

• Good quantitative fit: r² = 0.76 with zero fit parameters.

• Small qualitative differences: slightly larger “intrusions,” less variance.


Model Response Times

• The distance between an ensemble's maximal output and 1.0 is a measure of its uncertainty.

• The model RTs show the same scallop-shaped patterns as the data.


Model Discrimination

• Correlation (r) between an ensemble's responses to a pair of stimuli models a similarity judgment; 1 - r models dissimilarity / discriminability.

• As in human data, model discriminability is best near the category boundary.

Perception of “Mixed-In” Expression

• Can score and normalize the first, second, and third largest network outputs as for the humans.

• Model scores for mixed-in expression are very close to the human scores.

[Figure: Mixed-In Expression Detection; x-axis: Percent Mixing (10, 30, 50); y-axis: Average Rank Score (-0.5 to 1.5); series: mixed-in expression (humans), unrelated expressions (humans), mixed-in expression (networks), unrelated expressions (networks)]

Conclusions

• Much of the human perceptual data can be accounted for by a simple feedforward neural network that simply learns to associate expressions with emotional labels.

• The fit is "easy" due to the inherent properties of nonlinear classifiers.

• The minor failings of the model may be due to a lack of training data (there is little between-network variance).

• What have we learned from building this model?

– Neutral classification should be separate from the emotions.

– The fit to human data helps us select one of two systems with equivalent performance on the POFA prototypes.

– Ensemble classification improves accuracy.

Work in Progress

• Robust Neutral/Expressive classification without test set snooping.

• Adding dynamic information to improve performance.

• New experiments exploring the "malleability" of expression category boundaries in humans and in networks.

• Collection of a large public database of emotional facial expression images and video sequences.
