EMPATH: A Neural Network that Categorizes Facial Expressions

Matthew N. Dailey 1, Garrison W. Cottrell 1, Curtis Padgett 2, and Ralph Adolphs 3

Abstract

& There are two competing theories of facial expression recognition. Some researchers have suggested that it is an example of "categorical perception." In this view, expression categories are considered to be discrete entities with sharp boundaries, and discrimination of nearby pairs of expressive faces is enhanced near those boundaries. Other researchers, however, suggest that facial expression perception is more graded and that facial expressions are best thought of as points in a continuous, low-dimensional space, where, for instance, "surprise" expressions lie between "happiness" and "fear" expressions due to their perceptual similarity. In this article, we show that a simple yet biologically plausible neural network model, trained to classify facial expressions into six basic emotions, predicts data used to support both of these theories. Without any parameter tuning, the model matches a variety of psychological data on categorization, similarity, reaction times, discrimination, and recognition difficulty, both qualitatively and quantitatively. We thus explain many of the seemingly complex psychological phenomena related to facial expression perception as natural consequences of the tasks' implementations in the brain. &

INTRODUCTION

How do we see emotions in facial expressions? Are they perceived as discrete entities, like islands jutting out of the sea, or are they more continuous, reflecting the structure beneath the surface? We believe that computational models of the process can shed light on these questions. Automatic facial expression analysis is an active area of computer vision research (Lien, Kanade, Cohn, & Li, 2000; Donato, Bartlett, Hager, Ekman, & Sejnowski, 1999; Lyons, Budynek, & Akamatsu, 1999; Rosenblum, Yacoob, & Davis, 1996). However, there has only been limited work in applying computational models to the understanding of human facial expression processing (Calder, Burton, Miller, Young, & Akamatsu, 2001; Lyons, Akamatsu, Kamachi, & Gyoba, 1998). In particular, the relationship between categorization and perception is controversial, and a computational model may help elucidate the connection between them.

Basic Emotions and Discrete Facial Expression Categories

Although the details of his theory have evolved substantially since the 1960s, Ekman remains the most vocal proponent of the idea that emotions are discrete entities. In a recent essay, he outlined his theory of basic emotions and their relationship with facial expressions (Ekman, 1999). "Basic" emotions are distinct families of affective states characterized by different signals, physiology, appraisal mechanisms, and antecedent events. Ekman cites early evidence suggesting that each emotion is accompanied by distinctive physiological changes that prepare an organism to respond appropriately. For instance, blood flow to the hands increases during anger, possibly in preparation for a fight. In addition to physiological changes, according to the theory, each basic emotion family is also accompanied by a fast appraisal mechanism that attends to relevant stimuli and a set of universal antecedent events (e.g., physical or psychological harm normally leads to a state of fear, and loss of a significant other normally leads to a state of sadness). Finally, and most importantly, Ekman believes that emotions evolved to "inform conspecifics, without choice or consideration, about what is occurring: inside the person ..., what most likely occurred ..., and what is most likely to occur next" (p. 47). Thus, every basic emotion family is necessarily accompanied by one (or perhaps a few for some families) distinctive prototypical signals, including a set of facial muscle movements and body movements (e.g., approach or withdrawal). The signals are not entirely automatic; they may be attenuated, masked, or faked in certain circumstances. Furthermore, within emotion families, individual differences and situational context allow for small variations on the emotion's theme.

1 University of California, San Diego; 2 Jet Propulsion Laboratory; 3 University of Iowa

Note: EMPATH stands for "EMotion PATtern recognition using Holons," the name for a system developed by Cottrell and Metcalfe (1991).

© 2002 Massachusetts Institute of Technology. Journal of Cognitive Neuroscience 14:8, pp. 1158–1173

But between families, the physiology, appraisal mechanisms, antecedents, and signals differ in fundamental ways. Based on these criteria, Ekman proposes that there are 15 basic emotion families: amusement, anger, contempt, contentment, disgust, embarrassment, excitement, fear, guilt, pride in achievement, relief, sadness/distress, satisfaction, sensory pleasure, and shame. The two crucial components of the theory, which distinguish it from other theorists' approaches, are that emotions are fundamentally separate from one another and that they evolved to help organisms deal with fundamental life tasks.

On this view, since emotions are distinct and each emotion family is accompanied by a small set of distinctive signals (facial expressions), we might expect subjects' facial expression categorization behavior to exhibit the characteristics of discrete, clear-cut decisions, not smooth, graded, fuzzy categorization. Evidence that facial expressions are perceived as discrete entities, then, would be further evidence for the theory of basic emotions and a deterministic expression–emotion mapping. Indeed, evidence of "categorical perception" (CP) of facial expressions has recently emerged in the literature.

In some domains, it appears that sensory systems adapt to impose discontinuous category boundaries in continuous stimulus spaces. For instance, in a rainbow, we perceive bands of discrete colors even though the light's wavelength varies smoothly. In psychophysical experiments, subjects have difficulty discriminating between two shades of green differing by a small constant wavelength distance but find it easier to distinguish between two stimuli the same distance apart but closer to the green–yellow boundary. This phenomenon is called "categorical perception" (Harnad, 1987). It also occurs in auditory perception of phonemes. When we listen to utterances varying continuously from a "ba" sound to a "pa" sound, we perceive a sudden shift from "ba" to "pa," not a mixture of the two. As with colors, we can also discriminate pairs of equidistant phonemes better when they are closer to the perceived "ba"–"pa" boundary. In general, CP is assessed operationally in terms of two behavioral measures: categorization judgments and discrimination (same/different) judgments. Categorization measures typically use a forced-choice task, for example, selection of the "ba" category or the "pa" category. The stimuli are randomly sampled from smoothly varying continua, such as a step-by-step transition between "ba" and "pa" prototypes. Even though subjects are not told that the data come from such continua, they nevertheless label all stimuli on one side of some boundary as "ba" and all stimuli on the other side as "pa," suggesting a sharp category boundary. For the second behavioral measure of CP, discrimination, subjects are asked to make a "same/different" response to a pair of stimuli that are nearby on the continuum (simultaneous discrimination) or perform a sequential (ABX) discrimination task, in which stimulus "A" is shown, stimulus "B" is shown, then either "A" or "B" is shown, and subjects are asked which of the first two stimuli the third matches. For categorically perceived stimuli, subjects show better discrimination when the two stimuli are near the category boundary defined by their labeling behavior, compared with two stimuli further from the boundary.
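To make the ABX procedure concrete, the following minimal sketch (ours, not from any of the studies cited here) scores a single ABX trial for an observer that compares internal representations; the `encode` function and the stimulus vectors are hypothetical stand-ins:

```python
import numpy as np

def abx_correct(encode, a, b, x, is_x_a):
    """Score one ABX trial: the observer picks whichever of A or B
    is closer (more correlated) to X in representation space."""
    ra, rb, rx = encode(a), encode(b), encode(x)
    sim_a = np.corrcoef(ra, rx)[0, 1]   # similarity of X to A
    sim_b = np.corrcoef(rb, rx)[0, 1]   # similarity of X to B
    chose_a = sim_a > sim_b
    return chose_a == is_x_a

# Hypothetical usage: stimuli as flattened vectors and an identity
# encoder (i.e., discrimination at the raw input level).
rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(size=100)
x = a + 0.1 * rng.normal(size=100)        # X is a noisy copy of A
print(abx_correct(lambda s: s, a, b, x, is_x_a=True))
```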

In some cases, such as the color example, CP is thought to be an innate property of the perceptual system. But in other cases, perceptual discontinuities at category boundaries are clearly acquired through learning. For instance, Beale and Keil (1995) created morph sequences between pairs of famous faces (e.g., Clinton–Kennedy) and unfamiliar faces, then had subjects discriminate or categorize neighboring pairs of faces along the morph sequence. They found that famous face pairs exhibited category effects (increased discrimination near the boundaries) but unfamiliar face pairs did not. Their result showed that CP can be acquired through learning and is not limited to low-level perceptual stimuli.

Etcoff and Magee (1992) were the first to raise the question of whether the perceptual mechanisms responsible for facial expression recognition are actually tuned to emotion categories or whether perception is continuous, with category membership "assigned by higher conceptual and linguistic systems" (p. 229). The authors created caricatures (line drawings) of the Ekman and Friesen (1976) photos and 10-step morphs between pairs of those caricatures. They included happy–sad, angry–sad, and angry–afraid continua as easily discriminated category pairs. Surprised–afraid and angry–disgusted continua were included as less easily discriminated pairs. Happy–neutral and sad–neutral continua were included to test for category boundaries along the dimension of presence or nonpresence of an emotion, and finally, happy–surprised continua were added to include a transition between positive emotions. The authors found that all expressions except surprise were perceived categorically. In an ABX task, morph pairs straddling the 50% category boundary were significantly better discriminated than those closer to the prototypes, and in an identification task, subjects placed sharp boundaries between categories, with significantly nonlinear category membership functions. Etcoff and Magee interpreted these results as evidence for mandatory category assignment: "people cannot help but see the face as showing one or another kind of emotion" (p. 229). Their results therefore pose a serious challenge for advocates of a continuous space of emotions and rough emotion–expression correspondence.

Etcoff and Magee's (1992) provocative results led to further research exploring CP in facial expression recognition. Some of the potential limitations of their study were that the stimuli were line drawings, not image-quality faces; that each subject was only exposed to a single continuum; and that the discrimination results, being from a sequential (ABX) task, might reflect a short-term memory phenomenon rather than a perceptual phenomenon. In view of these limitations, Calder, Young, Perrett, Etcoff, and Rowland (1996) extended and replicated the earlier experiments with image-quality morphs. They produced several continua using the Ekman and Friesen (1976) photos (e.g., Ekman and Friesen prototypes and image-quality morph sequences produced in R. A.'s lab; see Figure 1). The authors first replicated Etcoff and Magee's experiments with new stimuli: image-quality happy–sad, angry–sad, and angry–afraid sequences, each using a different actor. A second experiment followed the same procedure, except that each subject's stimuli included morphs from four different actors. In a third experiment, they had subjects categorize stimuli from three different expression continua employing a single actor ("J. J."; afraid–happy, happy–angry, and angry–afraid sequences). Finally, they had subjects perform a simultaneous discrimination (same/different) task. In the second experiment, for the happy–sad transitions, the authors found that artifacts in morphing between a face with teeth and one without (a "graying" of the teeth as the morph moves away from the happy prototype) helped subjects in the discrimination task. However, on the whole, the results were consistent with CP: sharp category boundaries and enhanced discrimination near the boundaries, regardless of whether there was a single continuum or several continua present in the experiment, and regardless of whether the sequential or simultaneous discrimination task was used.

In the psychological literature on categorization, CP effects are usually taken as evidence (1) that object representations are altered during the course of category learning or (2) that subjects automatically label stimuli even when they are making a purely perceptual discrimination (Goldstone, 2000; Goldstone, Lippa, & Shiffrin, 2001; Pevtzow & Harnad, 1997; Tijsseling & Harnad, 1997). Findings of CP effects in facial expression stimuli raise the possibility that perception of facial expressions is discrete in the same way that category labeling is. Calder et al.'s experiments strengthened the argument for CP of facial expressions, but the authors shy away from Etcoff and Magee's strong interpretation that facial expression category assignment is mandatory. They propose instead that perhaps CP is "an emergent property of population coding in the nervous system," occurring whenever "populations of cells become tuned to distinct categories" (p. 116). In this article, we will show precisely how this can occur, suggesting that Calder et al.'s hypothesis may indeed be correct.

Continuous Emotion Space and Fuzzy Facial Expression Categories

Other theorists hold that the relationship between emotions and facial expressions is not so categorical, discrete, or deterministic as the theory of basic emotions and the evidence for CP suggest. The notion that facial expression perception is discrete is challenged by data showing that similarity judgments of these expressions exhibit a graded, continuous structure.

Schlosberg (1952), following up on the work of his advisor (Woodworth, 1938), found that emotion category ratings and subjects' "errors" (e.g., the likelihood of their labeling a putatively disgusted expression as contempt) could be predicted fairly accurately by arranging the emotion categories around an ellipse whose major axis was pleasantness versus unpleasantness (exemplified by happy and sad expressions) and whose minor axis was attention versus rejection (exemplified by surprise and disgust expressions).

Figure 1. (a) Example prototypical expressions of six basic emotions and a neutral face for actor "J. J." in Ekman and Friesen's POFA (Ekman & Friesen, 1976). (b) Morphs from fear to sadness and happiness to disgust generated from the corresponding prototypes.

More recently, Russell (1980) proposed a structural theory of emotion concepts with two dimensions: pleasure and arousal. Russell and Bullock (1986) then proposed that emotion categories are best thought of as fuzzy sets. A few facial expressions might have a membership of 1.0 (100%) in one particular category, and others might have intermediate degrees of membership in more than one category. On their view, the facial expression confusion data supporting structural theories like Schlosberg's (1952) reflected the overlap of these fuzzy categories. To test this concept, Russell and Bullock had subjects rate a variety of facial expressions for how well they exemplify categories like "excited," "happy," "sleepy," "mad," and so forth. They found that the categories did indeed overlap, with peak levels of membership for Ekman's basic emotion categories occurring at Ekman's prototypical expressions. A similarity structure analysis (multidimensional scaling [MDS]; see Figure 9 for an example) performed on the subjects' ratings produced two dimensions highly correlated with other subjects' pleasure and arousal ratings. When subjects were asked to verify (yes or no) membership of facial expression stimuli in various emotion concept categories, there was a high level of consensus for prototypical expressions and less consensus for boundary cases. Asking subjects to choose exemplars for a set of categories also revealed graded membership functions for the categories. The data thus showed that facial expressions have systematically graded, overlapping membership in various emotion categories. On the basis of this evidence, Russell and Bullock proposed that facial expression interpretation first involves appraising the expression in terms of pleasure and arousal. Then the interpreter may optionally choose a label for the expression. Following Schlosberg, they proposed that finer, more confident judgments require contextual information.

This and additional recent research (Schiano, Ehrlich, Sheridan, & Beck, 2000; Katsikitis, 1997; Carroll & Russell, 1996; Russell, 1980; Russell, Lewicka, & Niit, 1989) suggests that there is a continuous, multidimensional perceptual space underlying facial expression perception, in which some expression categories are more similar to each other than others.

Young et al.'s (1997) "Megamix" Experiments

Young et al. (1997) set out to further test the predictions of multidimensional and categorical accounts of facial expression perception and recognition. They pointed out that 2-D structural theories like Schlosberg's or Russell's predict that some morph transitions between expression pairs should pass near other categories. For instance, if the 2-D representation in Figure 9a adequately characterizes our perception of emotional facial expression stimuli, the midpoint of a morph between happiness and fear should be perceived as a surprised expression. Categorical views of emotional facial expressions, on the other hand, do not necessarily predict confusion along morph transitions; one would expect either sharp transitions between all pairs of categories or perhaps indeterminate regions between categories where no emotion is perceived. Furthermore, if the categorical view is correct, we might expect CP, in which subjects find it difficult to discriminate between members of the same category and easy to discriminate pairs of stimuli near category boundaries, as found in previous studies (Calder et al., 1996; Etcoff & Magee, 1992). To test these contrasting predictions with an exhaustive set of expression transitions, Young et al. constructed image-quality morph sequences between all pairs of the emotional expression prototypes shown in Figure 1a (Figure 1b shows example morph sequences produced in R. A.'s lab).

In Experiment 1, the authors had subjects identify the emotion category in 10%, 30%, 50%, 70%, and 90% morphs between all pairs of the six prototypical expressions in Figure 1a. The stimuli were presented in random order, and subjects were asked to perform a six-way forced-choice identification. Experiment 2 was identical, except that morphs between the emotional expressions and the neutral prototype were added to the pool, and "neutral" was one of the subjects' choices in a seven-way forced-choice procedure. The results of Experiment 2 for 6 of the 21 possible transitions are reprinted in Figure 2. In both experiments, along every morph transition, subjects' modal response to the stimuli abruptly shifted from one emotion to the other, with no indeterminate regions in between. Consistent with categorical theories of facial expressions, the subjects' modal response was always one of the endpoints of the morph sequence, never a nearby category. Subjects' response times (RTs), however, were more consistent with a multidimensional or weak category account of facial expression perception. As distance from the prototypes increased, RTs increased significantly, presumably reflecting increased uncertainty about category membership near category boundaries.

In Experiment 3, Young et al. explored the extent to which subjects could discriminate pairs of stimuli along six transitions: happiness–surprise, surprise–fear, fear–sadness, sadness–disgust, disgust–anger, and anger–happiness. They had subjects do both a sequential discrimination task (ABX) and a simultaneous discrimination task (same–different). Only the data from the sequential experiments are now available, but the authors report a very strong correlation between the two types of data (r = .82). The sequential data are reprinted in Figure 3. The main finding of the experiment was that subjects had significantly better discrimination performance near the category boundaries than near the expression prototypes, a necessary condition to claim CP. Experiment 3 therefore best supports the categorical view of facial expression perception.

Finally, in Experiment 4, Young et al. set out to determine whether subjects could determine what expression is "mixed in" to a faint morph. Again, a strong categorical view of facial expressions would predict that subjects should not be able to discern what expression a given morph sequence is moving toward until the sequence gets near the category boundary. However, according to continuous theories, subjects should be able to determine what prototype a sequence is moving toward fairly early in the sequence. In the experiment, subjects were asked to decide, given a morph or prototype stimulus, the most apparent emotion, the second-most apparent emotion, and the third-most apparent emotion, using a button box with a button for each of the six emotion categories. After correcting for intrinsic confusability of the emotion categories, the authors found that subjects were significantly above chance at detecting the mixed-in emotion at the 30% level. As opposed to the identification and discrimination data from Experiments 1–3, this result best supports continuous, dimensional accounts of facial expression perception.

In summary, Young et al.'s experiments, rather than settling the issue of categorical versus continuous theories of facial expression perception, found evidence supporting both types of theory. The authors argue that a 2-D account of facial expression perception is unable to account for all of the data, but they also argue that the strong, mandatory categorization view is likewise deficient.

Figure 2. Selected results of Young et al.'s (1997) Experiment 2. (a) Human responses (% identification in a seven-way forced-choice task) to six morph sequences: happy–surprised, surprised–afraid, afraid–sad, sad–disgusted, disgusted–angry, and angry–happy. Every transition exhibited a sharp category boundary. (b) Modeling subjects' choice as a random variable distributed according to the pattern of activation at the network's output layer, which entails averaging across the 13 networks' outputs. The model has a correlation (over all 15 transitions) of r = .942 with the human subjects. (c) RTs for the same transitions as in (a). RTs increased significantly with distance from the prototypes (only 6 of the 21 possible transitions are shown). (d) Predictions of the network "uncertainty" model for the same morph transitions.


Until now, despite several years of research on automatic recognition of facial expressions, no computational model has been shown to simultaneously explain all of these seemingly contradictory data. In the next section, we review recent progress in computational modeling of facial expression recognition, then introduce a new model that does succeed in explaining the available data.

Computational Modeling of Facial Expression Recognition

Padgett and colleagues were the first to apply computational models toward an understanding of facial expression perception (Cottrell, Dailey, Padgett, & Adolphs, 2000; Padgett, Cottrell, & Adolphs, 1996; Padgett & Cottrell, 1998). Their system employed linear filtering of regions around the eyes and mouth, followed by a multilayer perceptron for classification into emotion categories. The system achieved good recognition performance and was able to explain some of the results on CP of facial expressions with linear "dissolve" sequences. However, the system was unable to account for the sharp category boundaries humans place along image-quality morph transitions, because the linear filters' responses varied too quickly in the presence of the nonlinear changes in morph transitions. It also would have been incapable of predicting recently observed evidence of holistic effects in facial expression recognition (Calder, Young, Keane, & Dean, 2000).

Lyons et al. (1998) created a database of Japanese female models portraying facial expressions of happiness, surprise, fear, anger, sadness, and disgust. They then had subjects rate the degree to which each face portrayed each basic emotion on a 1–5 scale and used the Euclidean distance between the resulting "semantic rating" vectors for each face pair as a measure of dissimilarity. They then used a system inspired by the Dynamic Link Architecture (Lades et al., 1993) to explain the human similarity matrix. Similarities obtained from their Gabor wavelet-based representation of faces were highly correlated with similarities obtained from the human data, and nonmetric multidimensional scaling (MDS) revealed similar underlying 2-D configurations of the stimuli. The authors suggest in conclusion that the high-level circumplex constructs proposed by authors like Schlosberg and Russell may in part reflect similarity at low levels in the visual system.

Figure 3. Results of Young et al.'s (1997) Experiment 3 for the sequential (ABX) discrimination task. Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels denote which pair of stimuli were being compared (e.g., 70/90 along the transition from sadness to disgust denotes a comparison of a 70% disgust / 30% sadness morph with a 90% disgust / 10% sadness morph). Discrimination was significantly better near the category boundaries than near the prototypes.

In a recent work, Calder et al. (2001) applied a different computational model to the task of understanding human facial expression perception and recognition. The idea of their system, originally proposed by Craw and Cameron (1991), is to first encode the positions of facial features relative to the average face, then warp each face to the average shape (thus preserving texture but removing between-subject face shape variations). The shape information (the positions of the facial features prior to warping) and the shape-free information (the pixel values after warping) can each be submitted to a separate principal components analysis (PCA) for linear dimensionality reduction, producing separate low-dimensional descriptions of the face's shape and texture. Models based on this approach have been successful in explaining psychological effects in face recognition (Hancock, Burton, & Bruce, 1996, 1998). However, warping the features in an expressive face to the average face shape would seem to destroy some of the information crucial to recognition of that expression, so prior to Calder et al.'s work, it was an open question as to whether such a system could support effective classification of facial expressions. The authors applied the Craw and Cameron PCA method to Ekman and Friesen's (1976) Pictures of Facial Affect (POFA), then used linear discriminant analysis to classify the faces into happy, sad, afraid, angry, surprised, and disgusted categories. The authors found that a representation incorporating both the shape information (feature location PCA) and the shape-free information (warped pixel PCA) best supported facial expression classification (83% accuracy using leave-one-image-out classification for the Ekman and Friesen database). The authors went on to compare their system with human data from psychological experiments. They found that their system behaved similarly to humans in a seven-way forced-choice task and that the principal component representation could be used to predict human ratings of pleasure and arousal (the two dimensions of Russell's circumplex).
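To make the shape/shape-free pipeline concrete, here is a minimal sketch assuming the warping step has already been done, so each face is available as a "shape-free" pixel vector plus a vector of pre-warp landmark positions. The dimensions, component counts, and array names are illustrative, not Calder et al.'s actual settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical inputs: shape-free pixel vectors (post-warp) and
# feature-location vectors (pre-warp), one row per face.
texture = np.random.rand(96, 5000)      # warped pixel values
shape = np.random.rand(96, 2 * 34)      # (x, y) of 34 landmark points
labels = np.random.randint(0, 6, 96)    # six emotion categories

# Separate PCAs for the shape and shape-free information, concatenated
tex_pcs = PCA(n_components=30).fit_transform(texture)
shape_pcs = PCA(n_components=10).fit_transform(shape)
combined = np.hstack([tex_pcs, shape_pcs])

# Linear discriminant analysis on the combined representation
lda = LinearDiscriminantAnalysis().fit(combined, labels)
print(lda.score(combined, labels))      # training accuracy (illustrative only)
```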

Taken together, results with the above three computational models of facial expression recognition begin to hint that subjects' performance in psychological experiments can be explained as a simple consequence of category learning and the statistical properties of the stimuli themselves. Although Calder et al.'s system exhibits a surprising amount of similarity to human performance in a forced-choice experiment, it has not been brought to bear in the debate on multidimensional versus categorical perception of facial expressions. In the present article, we show that a new, more biologically plausible computational model not only exhibits more similarity to human forced-choice performance, but also provides a detailed computational account of data supporting both categorical and multidimensional theories of facial expression recognition and perception. Our simulations consist of constructing and training a simple neural network model (Figure 4); the system uses the same stimuli employed in many psychological experiments, performs many of the same tasks, and can be measured in similar ways as human subjects. The model consists of three levels of processing: perceptual analysis, object representation, and categorization. The next section describes the model in some detail.

The Model

The system is a feed-forward network consisting of three layers common to most object recognition models (Riesenhuber & Poggio, 2000). Its input is a 240 × 292 manually aligned grayscale face image from Ekman and Friesen's POFA (Ekman & Friesen, 1976). This data set contains photos of 14 actors portraying expressions that are reliably classified as happy, sad, afraid, angry, surprised, or disgusted by naive observers (70% agreement was the threshold for inclusion in the data set, and the overall agreement is 91.6%). The expressions made by one of those actors, "J. J.," are shown in Figure 1a. The strong agreement across human subjects, together with the use of these photos in a wide variety of psychological experiments exploring emotional expression perception, makes POFA an ideal training set for our model.

The first layer of the model is a set of neurons whose response properties are similar to those of complex cells in the visual cortex. The so-called Gabor filter (Daugman, 1985) is a standard way to model complex cells in visual recognition systems (Lades et al., 1993). Figure 5 shows the spatial "receptive fields" of several filters. The units essentially perform nonlinear edge detection at five scales and eight orientations. As a feature detector, the Gabor filter has the advantage of moderate translation invariance over pixel-based representations. This means that features can "move" in the receptive field without dramatically affecting the detector's response, making the first layer of our model robust to small image changes. With a 29 × 35 grid and 40 filters at each grid location, we obtain 40,600 model neurons in this layer, which we term the "perceptual" layer.
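A minimal sketch of such a perceptual layer appears below. The quadrature-pair magnitude, the particular wavelengths, and the Gaussian widths are our assumptions for illustration; the text above specifies only five scales, eight orientations, and the 29 × 35 grid:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size, wavelength, theta, sigma, phase):
    """One Gabor filter: a sinusoidal grating under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinate
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength + phase)

def perceptual_layer(image, grid_shape=(29, 35), n_orientations=8,
                     wavelengths=(4, 8, 16, 32, 64)):
    """Sample 5 scales x 8 orientations of Gabor magnitude responses on a
    spatial grid, giving 29 * 35 * 40 = 40,600 'model neurons' per image."""
    responses = []
    for wl in wavelengths:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            size = int(2 * wl) + 1
            # quadrature pair -> phase-invariant (complex-cell-like) magnitude
            even = fftconvolve(image, gabor_kernel(size, wl, theta, wl / 2, 0), mode='same')
            odd = fftconvolve(image, gabor_kernel(size, wl, theta, wl / 2, np.pi / 2), mode='same')
            mag = np.hypot(even, odd)
            ys = np.linspace(0, image.shape[0] - 1, grid_shape[0]).astype(int)
            xs = np.linspace(0, image.shape[1] - 1, grid_shape[1]).astype(int)
            responses.append(mag[np.ix_(ys, xs)].ravel())
    return np.concatenate(responses)

vec = perceptual_layer(np.random.rand(292, 240))  # stand-in for a POFA image
print(vec.shape)                                  # (40600,)
```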

In order to extract a small set of informative features from this high-dimensional data, we use the equivalent of an "image compression" network (Cottrell, Munro, & Zipser, 1989). This is a backpropagation network that is trained to simply reproduce its input on its output through a narrow channel of hidden units. In order to solve this problem, the hidden units must extract regularities from the data. Since they are fully connected to the inputs, they usually extract global representations we have termed "holons" in previous work (Cottrell & Metcalfe, 1991). We note that this is a biologically plausible means of dimensionality reduction, in the sense that it is unsupervised and can be learned by simple networks employing Hebbian learning rules (Sanger, 1989). As a shortcut, we compute this network's weights directly via the equivalent statistical technique of PCA. We chose 50 principal components (hidden units) for this layer, based on previous experiments showing this leads to good generalization on POFA. It is important to note that the number of components (the only free parameter in our system) was not tuned to human data. The resulting low-dimensional, object-level representation is specific to the facial expression and identity variations in its input, as is the population of so-called face cells in the inferior temporal cortex (Perrett, Hietanen, Oram, & Benson, 1992). We term this layer of processing the "gestalt" level.

Figure 4. Model schematic.

Figure 5. Example Gabor filters. The real (cosine-shaped) component is shown relative to the size of the face at all five scales and five of the eight orientations used.
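In code, the shortcut amounts to fitting PCA on the training faces' Gabor vectors and projecting each face onto the top 50 components. A sketch, with array names and stand-in data that are illustrative only:

```python
import numpy as np

def fit_pca(X, n_components=50):
    """Fit PCA on rows of X (one 40,600-d Gabor vector per face).
    Returns the mean and the top principal directions."""
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def gestalt_layer(x, mean, components):
    """Project one perceptual-layer vector onto the 50 'holon' axes."""
    return components @ (x - mean)

# Hypothetical usage with random stand-ins for Gabor vectors:
X_train = np.random.rand(96, 40600)          # e.g., 96 training faces
mean, comps = fit_pca(X_train, n_components=50)
code = gestalt_layer(X_train[0], mean, comps)
print(code.shape)                            # (50,)
```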

Although PCA is one of the simplest and most efficient methods for coding a set of stimuli, other methods would probably also work. For the current model, it is only important that (1) the code be sensitive to the dimensions of variance in faces, to facilitate learning of expressions, and (2) the code be low dimensional, to facilitate generalization to novel faces. In a more general object recognition system, a single PCA for all kinds of objects would probably not be appropriate, because the resulting code would not be sparse (a single image normally activates a large number of holistic "units" in a PCA representation, but object-sensitive cells in the visual system appear to respond much more selectively; Logothetis & Sheinberg, 1996). To obtain a good code for a large number of different objects, then, nonnegative matrix factorization (Lee & Seung, 1999) or an independent component mixture model (Lee, Lewicki, & Sejnowski, 2000) might be more appropriate. For the current problem, though, PCA suffices.

The outputs of the gestalt layer are finally categorized into the six "basic" emotions by a simple perceptron with six outputs, one for each emotion. The network is set up so that its outputs can be interpreted as probabilities (they are all positive and sum to 1). However, the system is trained with an "all-or-none" teaching signal. That is, even though only 92% (say) of the human subjects used to vet the POFA database may have responded "happy" to a particular face, the network's teaching signal is 1 for the "happy" unit and 0 for the other five. Thus, the network does not have available to it the confusions that subjects make on the data. We term this layer the "category" level.
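A minimal sketch of such a category level: a single-layer softmax network trained by gradient descent on cross-entropy against one-hot ("all-or-none") targets. The learning rate and epoch count are assumptions, not values from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # for numerical stability
    e = np.exp(z)
    return e / e.sum()               # positive outputs that sum to 1

def train_category_layer(codes, labels, n_classes=6, lr=0.01, epochs=500):
    """codes: (n_faces, 50) gestalt vectors; labels: integer emotion indices.
    Gradient descent on cross-entropy with one-hot teaching signals."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(n_classes, codes.shape[1]))
    b = np.zeros(n_classes)
    for _ in range(epochs):
        for x, y in zip(codes, labels):
            p = softmax(W @ x + b)
            target = np.eye(n_classes)[y]    # all-or-none teaching signal
            grad = p - target                # dE/dz for softmax + cross-entropy
            W -= lr * np.outer(grad, x)
            b -= lr * grad
    return W, b

# Hypothetical usage:
codes = np.random.randn(90, 50)
labels = np.random.randint(0, 6, size=90)
W, b = train_category_layer(codes, labels)
print(softmax(W @ codes[0] + b))     # six 'emotion probabilities'
```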

While this categorization layer is an extremely simplistic model of human category learning and decision-making processes, we argue that the particular form of classifier is unimportant so long as it is sufficiently powerful and reliable to place the gestalt-level representations into emotion categories; we claim that similar results will obtain with any nonlinear classifier.

We should also point out that the system abstracts away many important aspects of visual processing in the brain, such as eye movements, facial expression dynamics, and size and viewpoint invariance. These complicating factors turn out to be irrelevant for our purposes; as we shall see, despite the simplifying assumptions of the model, it nevertheless accounts for a wide variety of data available from controlled behavioral experiments. This suggests that it is a good first-order approximation of processing at the computational level in the visual system.

The next section reports the results of several experiments comparing the model's predictions to human data from identical tasks, with special emphasis on Young et al.'s landmark study of categorical effects in facial expression perception (Young et al., 1997). We find (1) that the model and humans find the same expressions difficult or easy to interpret, (2) that when presented with morphs between pairs of expressions, the model and humans place similar sharp category boundaries between the prototypes, (3) that pairwise similarity ratings derived from the model's gestalt-level representations predict human discrimination ability, (4) that the model and humans are similarly sensitive to mixed-in expressions in morph stimuli, and (5) that MDS analysis produces a similar emotional similarity structure from the model and human data.

RESULTS

Comparison of Model and Human Performance

The connections in the model's final layer were trained to classify images of facial expressions from Ekman and Friesen's POFA database (see Figure 1a) (Ekman & Friesen, 1976), the standard data set used in the vast majority of psychological research on facial expression (see Methods for details). The training signal contained no information aside from the expression most agreed upon by human observers; that is, even if 90% of human observers labeled a particular face "happy" and 10% labeled it "surprised," the network was trained as if 100% had said "happy." Again, there is no information in the training signal concerning the similarity of different expression categories. The first measurement we made was the model's ability to generalize in classifying the expressions of previously unseen subjects from the same database. The model's mean generalization performance was 90.0%, which is not significantly different from human performance on the same stimuli (91.6%; t = 0.587, df = 190, p = .279) (Table 1). Moreover, the rank-order correlation between the model's average accuracy on each category (happiness, surprise, fear, etc.) and the level of human agreement on the same categories was .667 (two-tailed Kendall's tau, p = .044; cf. Table 1). For example, both the model and humans found happy faces easy and fear faces the most difficult to categorize correctly.[1] Since the network had about as much experience with one category as another and was not trained on the human response accuracies, this correlation between the relative difficulties of each category is an "emergent" property of the model. Studies of expression recognition consistently find that fear is one of the most difficult expressions to recognize (Katsikitis, 1997; Matsumoto, 1992; Ekman & Friesen, 1976). Our model suggests that this is simply because the fear expression is "inherently" difficult to distinguish from the other five expressions.
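The rank-order comparison can be approximated from the Table 1 values. Note that scipy's Kendall tau applies the tau-b correction for the ties among the 100.0% entries, so the result may differ slightly from the .667 reported above:

```python
from scipy.stats import kendalltau

# Per-category accuracy (%) for the network and human agreement (Table 1):
# order is happiness, surprise, disgust, anger, sadness, fear
network = [100.0, 100.0, 100.0, 89.1, 83.3, 67.2]
human   = [ 98.7,  92.4,  92.3, 88.9, 89.2, 87.7]

tau, p = kendalltau(network, human)
print(tau, p)   # rank-order agreement between model and human difficulty
```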

How does the network perform the expression recognition task? An examination of the trained network's representation provides some insight. We projected each unit's weight vector back into image space in order to visualize the "ideal" stimulus for each unit in the network (see Methods for details). The results are shown in Figure 6, with and without addition of the average face. In each of the images, each pixel value is the result of applying a regression formula predicting the value of the pixel at that location as a linear function of the 50-element weight vector for the given network output unit. Dark and bright spots indicate the features that excite or inhibit a given output unit, depending on the relative gray values in the region of that feature. Each unit appears to combine evidence for an emotion based upon the presence or absence of a few local features. For instance, for fear, the salient criteria appear to be the eyebrow raise and the eyelid raise, with a smaller contribution of parted lips. The anger unit apparently requires a display in which the eyebrows are not raised. Clearly, the classifier has determined which feature configurations reliably distinguish each expression from the others.
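One reading of this procedure: for every pixel location, fit a linear regression from the training faces' 50-d gestalt codes to that pixel's intensity, then evaluate the fitted map at an output unit's 50-element weight vector. The sketch below follows that reading; the paper's exact procedure is in its Methods, and all shapes and names here are illustrative:

```python
import numpy as np

def pixel_regression(codes, images):
    """Least-squares map from 50-d gestalt codes to pixel intensities.
    codes: (n_faces, 50); images: (n_faces, n_pixels)."""
    A = np.hstack([codes, np.ones((codes.shape[0], 1))])  # add intercept
    coef, *_ = np.linalg.lstsq(A, images, rcond=None)
    return coef                                           # (51, n_pixels)

def ideal_stimulus(weight_vector, coef):
    """Render an output unit's 50-d weight vector as an image."""
    w = np.append(weight_vector, 1.0)
    return w @ coef                  # one predicted intensity per pixel

# Hypothetical shapes: 96 faces, 50-d codes, 240 x 292 pixel images
codes = np.random.randn(96, 50)
images = np.random.rand(96, 240 * 292)
coef = pixel_regression(codes, images)
happy_image = ideal_stimulus(np.random.randn(50), coef).reshape(292, 240)
```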

Comparison on CP Data

Several studies have reported CP of facial expressions using morphs between portrayals of the six basic emotions by POFA actor "J. J.," whose images have been chosen because his expressions are among the easiest to recognize in the database (Young et al., 1997; Calder et al., 1996). Since our model also finds J. J. "easy" (it achieves 100% generalization accuracy on J. J.'s prototypical expressions), we replicated these studies with 13 networks that had not been trained on J. J.[2]

We first compared the model's performance with human data from a six-way forced-choice experiment (Young et al., 1997) on 10%, 30%, 50%, 70%, and 90% morphs we constructed (see Methods) between all pairs of J. J.'s prototypical expressions (see Figure 1b for two such morph sequences). We modeled the subjects' identification choices by letting the outputs of the networks represent the probability mass function for the human subjects' responses. This means that we averaged each of the six outputs of the 13 networks for each face and compared these numbers to the human subjects' response probabilities. Figure 2 compares the networks' forced-choice identification performance with that of humans on the same stimuli. Quantitatively, the model's responses were highly correlated with the human data (r = .942, p < .001), and qualitatively, the model maintained the essential features of the human data: abrupt shifts in classification at emotion category boundaries and few intrusions (identifications of unrelated expressions) along the transitions. Using the same criteria for an intrusion that Young et al. (1997) used, the model predicted 4 intrusions in the 15 morph sequences, compared to 2 out of 15 in the human data. Every one of the intrusions predicted by the model involved fear, which is the least reliably classified expression in the POFA database for humans and networks (Table 1).

Table 1. Error Rates for Networks and Level of Agreement for Humans

Expression   Network Percentage Correct   Human Percentage Correct
Happiness    100.0                        98.7
Surprise     100.0                        92.4
Disgust      100.0                        92.3
Anger         89.1                        88.9
Sadness       83.3                        89.2
Fear          67.2                        87.7
Average       90.0                        91.6

Network generalization to unseen faces compared with human agreement on the same faces (six-way forced choice). Human data are provided with the POFA database (Ekman & Friesen, 1976).

Figure 6. Network representation of each facial expression. (Top row) Approximation of the optimal input-level stimulus for each facial expression category. (Bottom row) The same approximations with the average face subtracted; dark and bright pixels indicate salient features.
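The identification model is therefore just a pointwise comparison between the averaged network output distributions and the human response proportions. A sketch with random stand-ins for both (on the real data, the paper reports r = .942):

```python
import numpy as np
from scipy.stats import pearsonr

# outputs: (13 networks, n_morph_stimuli, 6 emotions) softmax outputs
# human:   (n_morph_stimuli, 6) proportion of subjects choosing each emotion
outputs = np.random.dirichlet(np.ones(6), size=(13, 75))
human = np.random.dirichlet(np.ones(6), size=75)

model_pmf = outputs.mean(axis=0)    # average the 13 networks per stimulus
r, p = pearsonr(model_pmf.ravel(), human.ravel())
print(r, p)
```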

We also compared subjects' RTs to those of the network, in which we model RT as being proportional to uncertainty at the category level. That is, if the network outputs a probability of .9 for happiness, we assume it is more certain, and therefore faster, than when it outputs a probability of .6 for happiness. We thus use the difference between 1 (the maximum possible output) and the largest actual output as our model of the reaction time. As explained earlier, subjects' RTs exhibit a characteristic scalloped shape, with slower RTs near category boundaries. The model also exhibits this pattern, as shown in Figure 2. The reason should be obvious: as a sequence of stimuli approaches a category boundary, the network's output for one category necessarily decreases as the other increases. This results in a longer model RT. The model showed good correlation with human RTs (see Methods) (r = .677, p < .001).
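The RT model is one line. A sketch of the proxy and its correlation with (here, hypothetical) human RTs:

```python
import numpy as np
from scipy.stats import pearsonr

def model_rt(output_pmf):
    """Uncertainty-based RT proxy: 1 minus the winning output."""
    return 1.0 - output_pmf.max(axis=-1)

# Hypothetical usage: correlate with mean human RTs per morph stimulus
model_pmf = np.random.dirichlet(np.ones(6), size=75)
human_rt = np.random.rand(75)
print(pearsonr(model_rt(model_pmf), human_rt))  # paper reports r = .677
```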

The model also provides insight into human discrimination behavior. How can a model that processes one face at a time discriminate faces? The idea is to imagine that the model is shown one face at a time and stores its representation of each face for comparison. We use the correlation (Pearson's r) between representations as a measure of similarity, and then use 1 minus this number as a measure of discriminability. An interesting aspect of our model is that it has several independent levels of processing (see Figure 4), allowing us to determine which level best accounts for a particular phenomenon. In this case, we compute the similarity between two faces at the pixel level (correlation between the raw images), the perceptual level (correlation between the Gabor filter responses), the gestalt level (correlation between the principal components), or the categorization level (correlation between the six outputs). We compared our measure of discriminability with human performance in Young et al.'s (1997) ABX experiment (see Methods for details). We found the following correlations at each processing level: pixel, r = .35, p = .06; perceptual, r = .52, p = .003; gestalt, r = .65, p < .001 (shown in Figure 7b); category, r = .41, p = .02. Crucially, when the gestalt layer and the categorization layer were combined in a multiple regression, the categorization layer's contribution was insignificant (p = .3), showing that the explanatory power of the model rests with the gestalt layer. According to our model, then, human subjects' improved discrimination near category boundaries in this task is best explained as an effect at the level of gestalt-based representations, which were derived in an unsupervised way from Gabor representations via PCA. This is in sharp contrast to the standard explanation of this increased sensitivity, which is that categorization influences perception (Goldstone et al., 2001; Pevtzow & Harnad, 1997). This suggests that the facial expression morph sequences have natural boundaries, possibly because the endpoints are extremes of certain coordinated muscle group movements. In other domains, such as familiar face classification (Beale & Keil, 1995), the boundaries must arise through learning. In such domains, we expect that discrimination would be best explained in a learned feature layer, as in other models (Goldstone, 2000; Tijsseling & Harnad, 1997).

Figure 7. Discrimination of morph stimuli. (a) Percent correct discrimination of pairs of stimuli in an ABX task (Young et al., 1997). Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels to the left and right of each point show which two stimuli were being compared (e.g., 70/90 along the transition from sadness to disgust denotes a comparison of a 70% disgust / 30% sadness morph with a 90% disgust / 10% sadness morph). Note that better performance occurs near category boundaries than near prototypes (highlighted by the vertical lines). (b) The model's discrimination performance at the gestalt representation level. The model discrimination scores are highly correlated with the human subjects' scores (r = .65, p < .001).
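The discriminability measure itself is equally compact. A sketch scoring neighboring morph pairs at the gestalt level (the codes here are random stand-ins):

```python
import numpy as np

def discriminability(rep_a, rep_b):
    """1 - Pearson correlation between two stimulus representations."""
    return 1.0 - np.corrcoef(rep_a, rep_b)[0, 1]

# Hypothetical usage: score neighboring morph pairs at the gestalt level,
# then correlate these scores with human ABX accuracy per pair.
gestalt_codes = np.random.randn(11, 50)     # codes along one morph sequence
scores = [discriminability(gestalt_codes[i], gestalt_codes[i + 1])
          for i in range(10)]
print(scores)
```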

The data above show that the model naturally accounts for the categorical behavior of humans making forced-choice responses to facial expression morph stimuli. In another experiment, Young et al. (1997) found that subjects could reliably detect the expression mixed into a morph, even at the 30% level. Can the model also account for this decidedly noncategorical behavior? In the experiment, subjects were asked to decide, given a morph or prototype stimulus, the most apparent emotion, the second-most apparent emotion, and the third-most apparent emotion (the scores on prototypes were used to normalize for inherent expression similarity). To compare the human responses with the model, we used the top three outputs of the 13 networks and used the same analysis as Young et al. to determine the extent to which the network could detect the mixed-in expression in the morph images (see Methods for details). Figure 8 shows that the model's average sensitivity to the mixed-in expression is almost identical to that of human subjects, even though its behavior seems categorical in forced-choice experiments.
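A sketch of the ranking step, assuming detection is scored by where the mixed-in prototype lands among the ranked outputs; the correction for intrinsic confusability described above is omitted, and the example probabilities are invented:

```python
import numpy as np

def rank_of_mixed_in(output_pmf, mixed_in_idx):
    """Rank (1 = most apparent) of the mixed-in emotion among the
    network's six outputs for one morph stimulus."""
    order = np.argsort(output_pmf)[::-1]      # emotions, most to least active
    return int(np.where(order == mixed_in_idx)[0][0]) + 1

# Hypothetical: a 70% happy / 30% surprised morph, surprise at index 1
pmf = np.array([0.80, 0.12, 0.03, 0.02, 0.02, 0.01])
print(rank_of_mixed_in(pmf, 1))   # 2 -> the faint expression is detected
```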

Comparison of Similarity Structures

We next investigated the similarity structure of the representations that the model produced. As before, we calculated the similarity between pairs of faces at each level of the network by computing the correlation between their representations. In order to evaluate the similarity structure qualitatively, we performed MDS both on the human forced-choice responses published by Ekman and Friesen (1976) and on the network's responses to the same stimuli at each level of processing shown in Figure 4. At the pixel level, we found that the structure present in the MDS configuration was based mainly on identity, and as we moved toward the categorization level, the identity-based structure began to break down and expression-related clusters began to emerge. Unsurprisingly, at the network's categorization layer, the configuration was clearly organized by emotion category (Figure 9). Of more interest, however, is that the "ordering" of facial expressions around the human and network MDS configurations is the same, a result unlikely to have arisen by chance (p = 1/60 = .017). Given that the network was never given any similarity information, this is a remarkable result. It suggests that the human similarity structure is a simple result of the inherent confusability of the categories, not necessarily the result of locality in some underlying psychological emotion space, as dimensional theories (e.g., Russell, 1980; Russell et al., 1989) might predict.

Figure 8. Ability to detect the mixed-in expression in morph sequences, for humans and the model. In the human data (dashed lines), subjects chose the far prototype (the faint second expression in the morph) significantly more often than unrelated expressions when that expression's mix ratio was 30% or greater, even though they consistently identified the 70% expression in forced-choice experiments. The model is almost identical to the humans in its ability to detect the secondary expression in a morph image.

Figure 9. Multidimensional scaling of human and network responses reveals similar dimensions of emotion. Each point represents one of the 96 expressive faces in POFA. (a) 2-D similarity space induced by MDS from the average six-way forced-choice responses of human subjects (Ekman & Friesen, 1976) (stress = 0.218). (b) 2-D similarity space induced by MDS from the average training set responses of networks at their output layers (stress = 0.201). The arrangement of emotions around the circle is the same in both cases. Stress at the pixel level was 0.222; at the perceptual level, 0.245; and at the gestalt level, 0.286.
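A sketch of the MDS step, using scikit-learn's metric MDS on a 1 - correlation dissimilarity matrix; the paper does not specify an MDS implementation, and the response matrix here is a random stand-in:

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical input: a (96, 96) dissimilarity matrix computed as
# 1 - correlation between the responses to each pair of faces.
responses = np.random.dirichlet(np.ones(6), size=96)
dissim = 1.0 - np.corrcoef(responses)

mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
config = mds.fit_transform(dissim)     # one 2-D point per face
print(config.shape, mds.stress_)       # compare stress across levels
```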

To measure the similarity between the human and network category structures quantitatively, we compared the human and model confusion matrices directly. For each pair of expressions, we computed the probability p_ij that humans or networks respond with emotion i when the intended emotion is j, for all i ≠ j, that is, the probability of confusion between two categories. The correlation between networks and humans on the networks' test sets (the stimuli the networks had never seen before) was .661 (p < .001).[3] Thus, there is a close correspondence between human confusion rates and response distributions at the model's categorization level. The system is never instructed to confuse the categories in a way similar to humans; nevertheless, this property emerges naturally from the data.
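The quantitative comparison reduces to correlating the off-diagonal entries of the two 6 × 6 confusion matrices. A sketch with random stand-ins for both matrices (the paper reports r = .661 on the networks' test sets):

```python
import numpy as np
from scipy.stats import pearsonr

def offdiag(confusion):
    """Off-diagonal entries: P(respond i | intended j) for i != j."""
    mask = ~np.eye(confusion.shape[0], dtype=bool)
    return confusion[mask]

# Hypothetical 6x6 confusion matrices (columns = intended emotion)
human = np.random.dirichlet(np.ones(6), size=6).T
model = np.random.dirichlet(np.ones(6), size=6).T

r, p = pearsonr(offdiag(human), offdiag(model))
print(r, p)
```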

DISCUSSION

In this article, we have introduced a computational model that mimics some of the important functions of the visual system. The model is simply a pattern classifier incorporating a biologically plausible representation of visual stimuli. We first engineered the system to provide good classification performance on a small database of reliably recognizable facial expressions. The free parameters of the model, such as the number of principal components used in dimensionality reduction, were optimized to maximize the classifier's generalization performance. In contrast to most models in mathematical psychology, which seek to fit a low number of free parameters to maximize a model's agreement with human data, our model is compared to human data directly, without any tuning.

The results of our comparison of the model's performance with human data were nevertheless remarkable. We first found that the relative levels of difficulty for six basic facial expressions of emotion were highly correlated with the levels at which humans agree on those same emotions. For example, humans are best at classifying happy expressions, because the smile makes the task easy. The network model likewise finds it extremely easy to detect smiles, because they are obvious visual features that discriminate happy expressions from the rest of the faces in its training set. On the other hand, humans usually find that fear expressions are very difficult to classify. Fear expressions are often classified as surprise, for instance. The network model likewise sometimes classifies fear expressions as surprise. Why are fear expressions so difficult for humans to classify? Could it be that true displays of fear are uncommon in our society? Or that portrayals of fear in the popular culture misguide us as to what fearful people really look like? Or simply that fear expressions are perceptually similar to other expressions? Our results suggest that the latter is the case; culture probably has very little to do with the difficulty of fear expressions. We have shown that perceptual similarity alone is sufficient to account for the relative difficulty of facial expressions.

The agreement between human and network similarity structure analyses (MDS) is also somewhat surprising. As pointed out earlier, Russell and colleagues have found that affective adjectives and affective facial expressions seem to share a common similarity structure. One might postulate that the underlying psychological space reflects the physiological similarity of emotional states, and that when we are asked to classify facial expressions, we are likely to classify a given face as any of the emotions close by in that psychophysiological space. However, we have shown that an emotionless machine without any underlying physiology exhibits a similarity structure very much like that of the humans. This is even more remarkable when one considers that the network was not given any indication of the similarity structure in the training signal, which was always all-or-none, rather than, for example, human subject responses, which would reflect subjects' confusions. Since we nevertheless match the human similarity structure, we have shown that the perceptual similarity of the categories corresponds to the psychological similarity structure of facial expressions. Why would this be? We suggest that evolution did not randomly associate facial expressions with emotional states, but that the expression-to-emotion mapping evolved in tandem with the need to communicate emotional states effectively.

The final question we set out to explore with this work is whether facial expressions are represented continuously or as discrete entities. As explained in the Introduction, the best evidence for discrete representations is that subjects appear to place sharp boundaries between facial expression categories and are better able to discriminate pairs of expressions near category boundaries, suggesting that our perception of the faces is influenced by the existence of the categories. As has been found in other modalities (cf. Ellison & Massaro, 1997), we find that it is unnecessary to posit discrete representations to account for the sharp boundaries and high discrimination scores. The network model places boundaries between categories, as it must to obtain good classification accuracy, but the categories are actually fuzzy and overlapping. The model can be thought of as a biologically plausible working implementation of Russell and Bullock's (1986) theory of emotional facial expression categories as fuzzy concepts. Despite the network's fuzzy, overlapping category concepts, the categories appear sharp when a rule such as "respond with the category most strongly activated by the given facial expression" is applied. A more difficult result to explain in terms of fuzzy categories placed in a continuous multidimensional space is the subjects' high discrimination scores near category boundaries. This result seems to call for an influence of the category boundaries on our perception of the expressions; indeed, that is Young et al.'s (1997) tentative interpretation. To the contrary, we have shown that in our model, discrimination scores best agree with human results at a purely perceptual level, where the category labels have no effect. With respect to image space, the low-dimensional PCA representation actually changes faster in boundary regions than in regions near the prototypes it derived from. In the context of our model, then, the seemingly contradictory results in different experiments conducted by Young et al. can be explained as simply tapping different computational levels of processing in a visual system organized much like our model.

We have found that our facial expression recognition model is "sufficient" to explain many aspects of human performance in behavioral tasks, but we have no proof of the "necessity" of any of our particular implementation decisions. In fact, we predict that many similar systems would obtain similar results. For instance, a system beginning with derivative-of-Gaussian edge filters (Marr, 1982) whose responses are combined to produce smooth responses to translations (like complex cells) should exhibit the same behavior. Replacing the PCA dimensionality reduction method with, say, factor analysis or the Infomax algorithm for independent components analysis (Bell & Sejnowski, 1995) should not dramatically affect the results. Finally, a category level using support vector machines (Vapnik, 1995) should likewise produce similar behavior. The point of our simulations is that the category boundaries, discriminability, and similarity structures previously seen as being at odds are, in a sense, "present in the data and tasks themselves" and are easily exposed by any reasonable computational model.

METHODS

Network Details

The first step of processing in the model is to filter the image with a rigid 29 by 35 grid of overlapping 2-D Gabor filters (Daugman, 1985) in quadrature pairs at five scales and eight orientations (some example filters are shown in Figure 5). The quadrature pairs are used to compute a phase-insensitive energy response at each point in the grid. These nonlinear energy responses, or "Gabor magnitudes," are often used as a simplifying model of the spatial responses of complex cells in the early visual system (Lades et al., 1993). Though the model loses some information about precise feature localization when phase information is thrown away, the overlapping receptive fields compensate for this loss (Hinton, McClelland, & Rumelhart, 1986). Each Gabor magnitude is z-scored (linearly transformed to have mean 0 and variance 1 over the training data) so that each filter contributes equally in the next representation layer.
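A minimal Python sketch of this first stage follows (our reconstruction; the wavelengths, kernel size, and sampling details are assumptions, since the paper does not specify the filter parameters at this level of detail):

    import numpy as np
    from scipy.signal import fftconvolve

    def gabor_kernel(wavelength, theta, size=31, sigma=None):
        """Complex 2-D Gabor kernel; its real and imaginary parts form a
        quadrature pair (cosine- and sine-phase filters)."""
        sigma = 0.5 * wavelength if sigma is None else sigma
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)   # coordinate along the grating
        envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
        return envelope * np.exp(2j * np.pi * xr / wavelength)

    def gabor_magnitudes(image, wavelengths=(4, 8, 16, 32, 64), n_orient=8,
                         grid=(35, 29)):
        """Phase-insensitive energy responses sampled on a coarse grid.

        With a 29 x 35 grid and 5 scales x 8 orientations this yields
        35 * 29 * 40 = 40,600 values, matching the model's perceptual layer.
        """
        rows = np.linspace(0, image.shape[0] - 1, grid[0]).astype(int)
        cols = np.linspace(0, image.shape[1] - 1, grid[1]).astype(int)
        feats = []
        for lam in wavelengths:
            for k in range(n_orient):
                kern = gabor_kernel(lam, np.pi * k / n_orient)
                resp = fftconvolve(image, kern, mode="same")
                feats.append(np.abs(resp)[np.ix_(rows, cols)].ravel())
        return np.concatenate(feats)

    # Each of the 40,600 magnitudes is then z-scored (mean 0, variance 1)
    # over the training set before the PCA step.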

The second step of processing in the model is to perform linear dimensionality reduction on the 40,600-element Gabor representation via a PCA of the training set. The actual computation is facilitated by Turk and Pentland's (1991) algebraic trick for "eigenfaces." This produces a 50-element representation, typically accounting for approximately 80% of the variance in the training set's Gabor data.
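The trick exploits the fact that, with far fewer samples N than dimensions D, the top principal components can be recovered from the small N × N Gram matrix. A sketch (our illustration of the standard technique):

    import numpy as np

    def pca_turk_pentland(X, k=50):
        """PCA for N samples of dimension D with N << D (Turk & Pentland, 1991).

        An eigenvector u of the small N x N Gram matrix A A^T with eigenvalue
        lam maps to the eigenvector A^T u (same lam) of the huge D x D matrix
        A^T A; dividing by sqrt(lam) gives it unit norm.
        """
        mean = X.mean(axis=0)
        A = X - mean                            # N x D centered data
        gram = A @ A.T                          # N x N instead of D x D
        vals, vecs = np.linalg.eigh(gram)
        order = np.argsort(vals)[::-1][:k]      # top-k eigenvalues
        vals, vecs = vals[order], vecs[:, order]
        components = (A.T @ vecs) / np.sqrt(np.maximum(vals, 1e-12))  # D x k
        return mean, components

    # project a new Gabor vector g: p = (g - mean) @ components  ->  50 numbers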

The 50-element vector p output by PCA can then be classified by a simple statistical model. We locally z-score (scale to mean 0, variance 1) each input element, then use a single-layer neural network (a generalized linear model) containing six outputs, one for each of the six "basic" emotions: happiness, sadness, fear, anger, surprise, and disgust. Each of the six units in the network computes a weighted sum o_i = Σ_j w_ij p_j of the 50-element input vector, then the "softmax" function y_i = exp(o_i) / Σ_j exp(o_j) is applied to the units' linear activations to obtain a vector of positive values whose sum is 1.0. The network is trained with the relative entropy error function, so that its outputs correspond to the posterior probabilities of the emotion categories given the inputs (Bishop, 1995).
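A minimal sketch of this classifier and its training rule (our illustration; the learning rate, momentum, and weight-decay values are placeholders, and biases are omitted for brevity):

    import numpy as np

    def softmax(o):
        e = np.exp(o - o.max())              # subtract max for numerical stability
        return e / e.sum()

    def train_softmax(P, T, lr=0.1, momentum=0.9, decay=1e-4, epochs=100, seed=0):
        """Single-layer softmax network minimizing relative entropy.

        P: N x 50 z-scored gestalt vectors; T: N x 6 one-hot ("all-or-none")
        targets. Returns a 50 x 6 weight matrix.
        """
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=0.01, size=(P.shape[1], T.shape[1]))
        V = np.zeros_like(W)
        for _ in range(epochs):
            for i in rng.permutation(len(P)):        # stochastic gradient descent
                y = softmax(P[i] @ W)
                grad = np.outer(P[i], y - T[i]) + decay * W   # dE/dW for cross-entropy
                V = momentum * V - lr * grad
                W += V
        return W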

Network Training

We train the expression recognition system using leave-one-out cross-validation and early stopping. A given network is trained to minimize output error on all the images of 12 of the 14 actors in POFA, using stochastic gradient descent, momentum, weight decay, and the relative entropy error function (Bishop, 1995). A thirteenth actor's images are used as a "hold-out" set to determine when to stop training; training is stopped when the error on the hold-out set is minimized. After training is stopped, we evaluate the network's generalization performance on the remaining (fourteenth) actor. We performed training runs with every possible combination of generalization and hold-out sets, for a total of 182 (14 by 13) individual networks.
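The cross-validation bookkeeping can be sketched as follows (an illustration; actor_splits is a hypothetical helper name, not from the paper):

    from itertools import permutations

    def actor_splits(actors):
        """All (train, holdout, test) splits: 14 x 13 = 182 networks.

        For each ordered pair (test, holdout), the remaining 12 actors
        form the training set.
        """
        for test, holdout in permutations(actors, 2):
            train = [a for a in actors if a not in (test, holdout)]
            yield train, holdout, test

    # e.g., for train, holdout, test in actor_splits(range(14)): fit with early
    # stopping monitored on `holdout`, then record generalization on `test`.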

Network Weight Visualization

The idea is to project each unit's weight vector back into image space in order to visualize what the network is sensitive to in an image. But this is not a well-defined task: though PCA is an easily inverted orthonormal transformation, the Gabor magnitude representation, besides being subsampled, throws away important phase information. As a workaround, we assume that each pixel's value is an approximately linear function of the 50-component gestalt-level (Gabor + PCA) representation. We chose one of the 182 trained networks, and for each pixel location, we used regression to find the linear function of the 50-element gestalt-level representation best predicting the pixel's value over the network's training set. Then, to visualize, say, the classifier's representation of happiness, we apply the regression function directly to the happiness unit's weights. An image was constructed from each of the six units' weight vectors using the regression functions learned from that network's training set.
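A compact sketch of this visualization (our illustration; the small ridge term is our addition for numerical stability, where the paper specifies only plain linear regression):

    import numpy as np

    def weight_images(G, pixels, W, ridge=1e-3):
        """Render each category unit's weights as an image via regression.

        G: N x 50 gestalt codes for the training images; pixels: N x D
        flattened images; W: 50 x 6 trained classifier weights. For every
        pixel we fit a linear map from the gestalt code, then apply that
        map to each unit's weight vector instead of to a face code.
        """
        Gc = G - G.mean(axis=0)
        Pc = pixels - pixels.mean(axis=0)
        # B (50 x D) solves Gc @ B ~= Pc in the least-squares sense
        B = np.linalg.solve(Gc.T @ Gc + ridge * np.eye(G.shape[1]), Gc.T @ Pc)
        return W.T @ B   # 6 x D: one "ideal stimulus" image per emotion unit
                         # (add the mean face back for display, as in Figure 6)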

Generation of Morphs

We generated morphs from the original stimuli (Ekman & Friesen, 1976) using the Morph program, version 2.5, from Gryphon Software, as described elsewhere (Jansari, Tranel, & Adolphs, 2000). Briefly, for each of the 15 pairs of expression prototypes, corresponding features between the two images were manually specified. The images were then tessellated and linearly transformed, both with respect to pixel position (a smooth warping) and pixel grayscale value (a smooth fade in luminance). The 10%, 30%, 50%, 70%, and 90% blends (see Figure 1b for examples) were retained for each transformation.
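Only the luminance fade component lends itself to a brief sketch (the geometric warp, which requires the manually specified feature correspondences, is omitted; the array inputs are hypothetical):

    import numpy as np

    def dissolve(a, img_i, img_j):
        """Grayscale cross-fade component of a morph: a = 0 gives prototype i.

        The actual morphs also smoothly warp pixel positions using the
        manually specified correspondences; this shows only the fade.
        """
        return (1.0 - a) * img_i + a * img_j

    # blends retained in the study: a in (0.1, 0.3, 0.5, 0.7, 0.9)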

Multidimensional Scaling

MDS seeks to embed a set of stimuli in a low-dimensional space in such a way that the distances between points in the low-dimensional space are as faithful as possible to ratings of their similarity (Borg & Lingoes, 1987). We performed MDS on the human responses (Ekman & Friesen, 1976) and on the network's responses to the same stimuli at each level of processing in the network. Each analysis requires a distance (dissimilarity) matrix enumerating how dissimilar each pair of stimuli is. For the human data, we formed a six-element vector containing the probability with which humans gave the labels happy, sad, afraid, angry, surprised, and disgusted for each of the 96 nonneutral photographs in POFA. We obtained a 96 by 96 element similarity matrix from these data by computing the correlation r_ij between each pair of six-element vectors. Finally, we converted the resulting similarity matrix into a distance (dissimilarity) matrix with the transform d_ij = (1 − r_ij)/2 to obtain values in the range 0–1. For the network, we measure the similarity between two stimuli i and j as the correlation r_ij between the representations of the two stimuli at each level of the network, corresponding to similarity at different levels of processing: the pixel level (70,080-element image pixel value vectors), the perceptual level (40,600-element Gabor response patterns), the gestalt level (50-element Gabor + PCA patterns), and the network's output (six-element category level). We ran a classical nonmetric MDS algorithm, SSA-1, due to Guttman and Lingoes (Borg & Lingoes, 1987), on each of the four resulting distance matrices described above and then plotted the stimuli according to their position in the resulting 2-D configuration.
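Building the distance matrix is straightforward (our sketch; sklearn.manifold.MDS with metric=False is mentioned only as a modern stand-in for SSA-1, not the algorithm used in the study):

    import numpy as np

    def correlation_distances(R):
        """Distance matrix d_ij = (1 - r_ij) / 2 from per-stimulus vectors.

        R: N x k matrix, one row per stimulus (e.g., the 96 six-element human
        label-probability vectors, or network representations at some layer).
        """
        r = np.corrcoef(R)          # N x N pairwise correlations
        return (1.0 - r) / 2.0      # map [-1, 1] similarities to [0, 1] distances

    # feed the result to a nonmetric MDS routine (e.g., sklearn.manifold.MDS
    # with metric=False and dissimilarity="precomputed") and plot the 2-D layout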

Model RTs

We assume a network's RT is directly proportional to the "uncertainty of its maximal output." That is, we define a network's model RT for stimulus i to be t_i^model = 1 − max_j y_ij (the time scale is arbitrary), where y_ij is the network's output for emotion j on stimulus i. This is similar to the standard approach of equating RT with a network's output error (Seidenberg & McClelland, 1989), except that for morph stimuli there is no predetermined "correct" response. For comparison, the human data available are t_ij^human, the mean RT of subjects responding with emotion j to stimulus i. To compare the model to the humans given these data, we treated each network as an individual subject and, for each morph stimulus, recorded the network's response emotion j and model reaction time t_i^model, then averaged t_i^model over all networks making the same response to the stimulus. The quantitative comparison is the linear fit between network predictions and human RTs for all stimuli and response pairs for which both human and network data were available. Young et al.'s criterion for reporting t_ij^human was that at least 25% of the human subjects responded with emotion j to stimulus i, and for some of these cases, none of the networks responded with emotion j to stimulus i, so the missing human and network data were disregarded.
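The model RT is a one-line computation (our sketch):

    import numpy as np

    def model_rt(outputs):
        """Model RT per stimulus: 1 - max_j y_ij (arbitrary time scale).

        outputs: N x 6 softmax outputs for one network. Also returns the
        response category, so RTs can be averaged over the networks that
        made the same response to each morph stimulus.
        """
        rt = 1.0 - outputs.max(axis=1)
        response = outputs.argmax(axis=1)
        return rt, response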

Model Discrimination Scores

We assume discrimination is more difficult the more similar two stimuli are at some level of processing. We use the same measure of similarity as in the MDS procedure: the correlation r_ij between the network's representations of stimuli i and j (at either the pixel, perceptual, gestalt, or output level). To convert similarity scores to discrimination scores, we used the transform d_ij = 1 − r_ij. This was measured for each of the 30 pairs of J. J. images for which human data were available. The discrimination scores were then averaged over the 13 networks that had not been trained on J. J. and compared to the human data.

Mixed-in Expression Detection

To measure the ability of the model to detect the secondary expression mixed into a morph stimulus, we followed Young et al.'s (1997) methods. For each network's output on a given stimulus, we scored the first, second, and third highest outputs of the network as a 3, 2, and 1, respectively, and assigned the score 0 to the three remaining outputs. For each morph and prototype stimulus, we averaged these score vectors across all 13 networks. Then, for each of the 30 possible combinations of near prototype (i) and far prototype (j), using the 90%, 70%, and 50% morphs moving from expression i to expression j, we subtracted the score vector for prototype i from the score for each of the three morphs. This eliminates the effect of the intrinsic similarity between J. J.'s prototype expressions. Averaging these score vectors across all 30 sequences, we obtain the data plotted in Figure 8.
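The rank-scoring step can be sketched as follows (our illustration):

    import numpy as np

    def rank_scores(outputs):
        """Score the three strongest outputs 3, 2, 1 and the rest 0.

        outputs: N x 6 network outputs; returns N x 6 integer score vectors,
        which are then averaged across networks and baseline-corrected by
        subtracting the near prototype's score vector.
        """
        order = np.argsort(outputs, axis=1)          # ascending
        scores = np.zeros_like(outputs, dtype=int)
        rows = np.arange(len(outputs))
        n = outputs.shape[1]
        for rank, points in zip(range(n - 1, n - 4, -1), (3, 2, 1)):
            scores[rows, order[:, rank]] = points
        return scores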

Acknowledgments

We thank Stevan Harnad, Bill Kristan, Paul Munro, Alice O'Toole, Terry Sejnowski, and Gary's Unbelievable Research Unit (GURU) for helpful comments on previous versions of this manuscript. We also thank Andrew Young for permission to use the human data plotted in Figures 2, 3, 7a, and 8. We are also grateful to Paul Ekman for providing us with the Pictures of Facial Affect. This research was funded by NIMH grant MH57075 to GWC.

Reprint requests should be sent to Garrison W. Cottrell, UCSD Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA, or via e-mail: gary@cs.ucsd.edu.

Notes

1. For comparison, the order of difficulty for the Calder et al. (2001) PCA model is happiness (98%), fear (97%), surprise (92%), anger (86%), sadness (72%), and disgust (72%).
2. We used 13 networks for technical reasons (see Methods for details).
3. The same comparison of our model with the confusion matrix from the Calder et al. (2001) forced-choice experiment is slightly better, .686, whereas their PCA model's correlation with their subjects' confusion data is somewhat lower, .496.

REFERENCES

Beale, J., & Keil, F. (1995). Categorical effects in the perception of faces. Cognition, 57, 217–239.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer.

Calder, A., Young, A., Perrett, D., Etcoff, N., & Rowland, D. (1996). Categorical perception of morphed facial expressions. Visual Cognition, 3, 81–117.

Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179–1208.

Calder, A. J., Young, A. W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 526–551.

Carroll, J. M., & Russell, J. A. (1996). Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70, 205–218.

Cottrell, G. W., Dailey, M. N., Padgett, C., & Adolphs, R. (2000). Is all face processing holistic? The view from UCSD. In M. Wenger & J. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Contexts and challenges (pp. 347–395). Mahwah, NJ: Erlbaum.

Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, gender and emotion recognition using holons. In R. P. Lippman, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564–571). San Mateo: Morgan Kaufmann.

Cottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. E. Sharkey (Ed.), Models of cognition: A review of cognitive science. Norwood, NJ: Ablex.

Craw, I., & Cameron, P. (1991). Parameterising images for recognition and reconstruction. In Proceedings of the British Machine Vision Conference (pp. 367–370). Berlin: Springer.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 974–989.

Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and emotion (chap. 3, pp. 45–60). New York: Wiley.

Ekman, P., & Friesen, W. (1976). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists Press.

Ellison, J. W., & Massaro, D. W. (1997). Featural evaluation, integration, and judgment of facial affect. Journal of Experimental Psychology: Human Perception and Performance, 23, 213–226.

Etcoff, N. L., & Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227–240.

Goldstone, R. L. (2000). A neural network model of concept-influenced segmentation. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.

Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representations through category learning. Cognition, 78, 27–43.

Hancock, P. J. B., Bruce, V., & Burton, A. M. (1998). A comparison of two computer-based face recognition systems with human perception of faces. Vision Research, 38, 2277–2288.

Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory and Cognition, 24, 26–40.

Harnad, S. R. (1987). Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.

Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (vol. 1, chap. 3, pp. 77–109). Cambridge: MIT Press.

Jansari, A., Tranel, D., & Adolphs, R. (2000). A valence-specific lateral bias for discriminating emotional facial expressions in free field. Cognition and Emotion, 14, 341–353.

Katsikitis, M. (1997). The classification of facial expressions of emotion: A multidimensional scaling approach. Perception, 26, 613–626.

Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300–311.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.

Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1–12.

Lien, J. J.-J., Kanade, T., Cohn, J. F., & Li, C.-C. (2000). Detection, tracking, and classification of action units in facial expression. Robotics and Autonomous Systems, 31, 131–146.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.

Lyons, M. J., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (pp. 200–205). Los Alamitos, CA: IEEE Computer Society.

Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1357–1362.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.

Matsumoto, D. (1992). American–Japanese cultural differences in the recognition of universal facial expressions. Journal of Cross-Cultural Psychology, 23, 72–84.

Padgett, C., Cottrell, G., & Adolphs, R. (1996). Categorical perception in facial emotion classification. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Padgett, C., & Cottrell, G. W. (1998). A simple neural network models categorical perception of facial expressions. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 806–807). Mahwah, NJ: Erlbaum.

Perrett, D., Hietanen, J., Oram, M., & Benson, P. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London, B: Biological Sciences, 335, 23–30.

Pevtzow, R., & Harnad, S. R. (1997). Warping similarity space in category learning by human subjects: The role of task difficulty. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 189–195). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199–1204.

Rosenblum, M., Yacoob, Y., & Davis, L. S. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7, 1121–1138.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161–1178.

Russell, J. A., & Bullock, M. (1986). Fuzzy concepts and the perception of emotion in facial expressions. Social Cognition, 4, 309–341.

Russell, J. A., Lewicka, M., & Niit, T. (1989). A cross-cultural study of a circumplex model of affect. Journal of Personality and Social Psychology, 57, 848–856.

Sanger, T. (1989). An optimality principle for unsupervised learning. In D. Touretzky (Ed.), Advances in neural information processing systems (vol. 1, pp. 11–19). San Mateo: Morgan Kaufmann.

Schiano, D. J., Ehrlich, S. M., Sheridan, K. M., & Beck, D. M. (2000). Evidence for continuous rather than categorical perception of facial affect. Abstract presented at the 40th Annual Meeting of the Psychonomic Society.

Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44, 229–237.

Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.

Tijsseling, A., & Harnad, S. R. (1997). Warping similarity space in category learning by backprop nets. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 263–269). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.

Woodworth, R. S. (1938). Experimental psychology. New York: Holt.

Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271–313.



the emotion's theme. But between families, the physiology, appraisal mechanisms, antecedents, and signals differ in fundamental ways. Based on these criteria, Ekman proposes that there are 15 basic emotion families: amusement, anger, contempt, contentment, disgust, embarrassment, excitement, fear, guilt, pride in achievement, relief, sadness/distress, satisfaction, sensory pleasure, and shame. The two crucial components of the theory, which distinguish it from other theorists' approaches, are that emotions are fundamentally separate from one another, and that they evolved to help organisms deal with fundamental life tasks.

On this view, since emotions are distinct and each emotion family is accompanied by a small set of distinctive signals (facial expressions), we might expect subjects' facial expression categorization behavior to exhibit the characteristics of discrete, clear-cut decisions, not smooth, graded, fuzzy categorization. Evidence that facial expressions are perceived as discrete entities, then, would be further evidence for the theory of basic emotions and a deterministic expression–emotion mapping. Indeed, evidence of "categorical perception" (CP) of facial expressions has recently emerged in the literature.

In some domains, it appears that sensory systems adapt to impose discontinuous category boundaries in continuous stimulus spaces. For instance, in a rainbow, we perceive bands of discrete colors even though the light's wavelength varies smoothly. In psychophysical experiments, subjects have difficulty discriminating between two shades of green differing by a small constant wavelength distance, but find it easier to distinguish between two stimuli the same distance apart but closer to the green–yellow boundary. This phenomenon is called "categorical perception" (Harnad, 1987). It also occurs in auditory perception of phonemes: when we listen to utterances varying continuously from a ba sound to a pa sound, we perceive a sudden shift from ba to pa, not a mixture of the two. As with colors, we can also discriminate pairs of equidistant phonemes better when they are closer to the perceived ba–pa boundary. In general, CP is assessed operationally in terms of two behavioral measures: categorization judgments and discrimination (same/different) judgments. Categorization measures typically use a forced-choice task, for example, selection of the ba category or the pa category. The stimuli are randomly sampled from smoothly varying continua, such as a step-by-step transition between ba and pa prototypes. Even though subjects are not told that the data come from such continua, they nevertheless label all stimuli on one side of some boundary as ba and all stimuli on the other side as pa, suggesting a sharp category boundary. For the second behavioral measure of CP, discrimination, subjects are asked to make a "same/different" response to a pair of stimuli that are nearby on the continuum (simultaneous discrimination), or perform a sequential (ABX) discrimination task in which stimulus "A" is shown, stimulus "B" is shown, then either "A" or "B" is shown, and subjects are asked which of the first two stimuli the third matches. For categorically perceived stimuli, subjects show better discrimination when the two stimuli are near the category boundary defined by their labeling behavior, compared with two stimuli further from the boundary.

In some cases, such as the color example, CP is thought to be an innate property of the perceptual system. But in other cases, perceptual discontinuities at category boundaries are clearly acquired through learning. For instance, Beale and Keil (1995) created morph sequences between pairs of famous faces (e.g., Clinton–Kennedy) and unfamiliar faces, then had subjects discriminate or categorize neighboring pairs of faces along the morph sequence. They found that famous face pairs exhibited category effects (increased discrimination near the boundaries) but unfamiliar face pairs did not. Their result showed that CP can be acquired through learning and is not limited to low-level perceptual stimuli.

Etcoff and Magee (1992) were the first to raise the question of whether the perceptual mechanisms responsible for facial expression recognition are actually tuned to emotion categories, or whether perception is continuous, with category membership "assigned by higher conceptual and linguistic systems" (p. 229). The authors created caricatures (line drawings) of the Ekman and Friesen (1976) photos and 10-step morphs between pairs of those caricatures. They included happy–sad, angry–sad, and angry–afraid continua as easily discriminated category pairs; surprised–afraid and angry–disgusted continua were included as less easily discriminated pairs. Happy–neutral and sad–neutral continua were included to test for category boundaries along the dimension of presence or nonpresence of an emotion, and finally, happy–surprised continua were added to include a transition between positive emotions. The authors found that all expressions except surprise were perceived categorically. In an ABX task, morph pairs straddling the 50% category boundary were significantly better discriminated than those closer to the prototypes, and in an identification task, subjects placed sharp boundaries between categories, with significantly nonlinear category membership functions. Etcoff and Magee interpreted these results as evidence for mandatory category assignment: "people cannot help but see the face as showing one or another kind of emotion" (p. 229). Their results therefore pose a serious challenge for advocates of a continuous space of emotions and a rough emotion–expression correspondence.

Etcoff and Magee's (1992) provocative results led to further research exploring CP in facial expression recognition. Some of the potential limitations of their study were that the stimuli were line drawings, not image-quality faces, that each subject was only exposed to a single continuum, and that the discrimination results, being from a sequential (ABX) task, might reflect a short-term memory phenomenon rather than a perceptual phenomenon. In view of these limitations, Calder, Young, Perrett, Etcoff, and Rowland (1996) extended and replicated the earlier experiments with image-quality morphs. They produced several continua using the Ekman and Friesen (1976) photos (e.g., Ekman and Friesen prototypes and image-quality morph sequences produced in R. A.'s lab; see Figure 1). The authors first replicated Etcoff and Magee's experiments with new stimuli: image-quality happy–sad, angry–sad, and angry–afraid sequences, each using a different actor. A second experiment followed the same procedure, except that each subject's stimuli included morphs from four different actors. In a third experiment, they had subjects categorize stimuli from three different expression continua employing a single actor ("J. J."; afraid–happy, happy–angry, and angry–afraid sequences). Finally, they had subjects perform a simultaneous discrimination (same/different) task. In the second experiment, for the happy–sad transitions, the authors found that artifacts in morphing between a face with teeth and one without (a "graying" of the teeth as the morph moves away from the happy prototype) helped subjects in the discrimination task. However, on the whole, the results were consistent with CP: sharp category boundaries and enhanced discrimination near the boundaries, regardless of whether there was a single continuum or several continua present in the experiment, and regardless of whether the sequential or simultaneous discrimination task was used.

In the psychological literature on categorization, CP effects are usually taken as evidence that (1) object representations are altered during the course of category learning, or (2) subjects automatically label stimuli even when they are making a purely perceptual discrimination (Goldstone, 2000; Goldstone, Lippa, & Shiffrin, 2001; Pevtzow & Harnad, 1997; Tijsseling & Harnad, 1997). Findings of CP effects with facial expression stimuli raise the possibility that perception of facial expressions is discrete in the same way that category labeling is. Calder et al.'s experiments strengthened the argument for CP of facial expressions, but the authors shy away from Etcoff and Magee's strong interpretation that facial expression category assignment is mandatory. They propose instead that perhaps CP is "an emergent property of population coding in the nervous system," occurring whenever "populations of cells become tuned to distinct categories" (p. 116). In this article, we will show precisely how this can occur, suggesting that Calder et al.'s hypothesis may indeed be correct.

Continuous Emotion Space and Fuzzy Facial Expression Categories

Other theorists hold that the relationship between emotions and facial expressions is not so categorical, discrete, or deterministic as the theory of basic emotions and the evidence for CP suggest. The notion that facial expression perception is discrete is challenged by data showing that similarity judgments of these expressions exhibit a graded, continuous structure.

Schlosberg (1952), following up on the work of his advisor (Woodworth, 1938), found that emotion category ratings and subjects' "errors" (e.g., the likelihood of their labeling a putatively disgusted expression as contempt) could be predicted fairly accurately by arranging the emotion categories around an ellipse whose major axis was pleasantness versus unpleasantness (exemplified by happy and sad expressions) and whose minor axis was attention versus rejection (exemplified by surprise and disgust expressions).

Figure 1. (a) Example prototypical expressions of six basic emotions and a neutral face for actor "J. J." in Ekman and Friesen's POFA (Ekman & Friesen, 1976). (b) Morphs from fear to sadness and happiness to disgust, generated from the corresponding prototypes.

More recently, Russell (1980) proposed a structural theory of emotion concepts with two dimensions: pleasure and arousal. Russell and Bullock (1986) then proposed that emotion categories are best thought of as fuzzy sets: a few facial expressions might have a membership of 1.0 (100%) in one particular category, and others might have intermediate degrees of membership in more than one category. On their view, the facial expression confusion data supporting structural theories like Schlosberg's (1952) reflected the overlap of these fuzzy categories. To test this concept, Russell and Bullock had subjects rate a variety of facial expressions for how well they exemplify categories like "excited," "happy," "sleepy," "mad," and so forth. They found that the categories did indeed overlap, with peak levels of membership for Ekman's basic emotion categories occurring at Ekman's prototypical expressions. A similarity structure analysis (multidimensional scaling [MDS]; see Figure 9 for an example) performed on the subjects' ratings produced two dimensions highly correlated with other subjects' pleasure and arousal ratings. When subjects were asked to verify (yes or no) membership of facial expression stimuli in various emotion concept categories, there was a high level of consensus for prototypical expressions and less consensus for boundary cases. Asking subjects to choose exemplars for a set of categories also revealed graded membership functions for the categories. The data thus showed that facial expressions have systematically graded, overlapping membership in various emotion categories. On the basis of this evidence, Russell and Bullock proposed that facial expression interpretation first involves appraising the expression in terms of pleasure and arousal; then the interpreter may optionally choose a label for the expression. Following Schlosberg, they proposed that finer, more confident judgments require contextual information.

This and additional recent research (Schiano, Ehrlich, Sheridan, & Beck, 2000; Katsikitis, 1997; Carroll & Russell, 1996; Russell, 1980; Russell, Lewicka, & Niit, 1989) suggests that there is a continuous, multidimensional perceptual space underlying facial expression perception, in which some expression categories are more similar to each other than others.

Young et al.'s (1997) "Megamix" Experiments

Young et al. (1997) set out to further test the predictions of multidimensional and categorical accounts of facial expression perception and recognition. They pointed out that 2-D structural theories like Schlosberg's or Russell's predict that some morph transitions between expression pairs should pass near other categories. For instance, if the 2-D representation in Figure 9a adequately characterizes our perception of emotional facial expression stimuli, the midpoint of a morph between happiness and fear should be perceived as a surprised expression. Categorical views of emotional facial expressions, on the other hand, do not necessarily predict confusion along morph transitions; one would expect either sharp transitions between all pairs of categories, or perhaps indeterminate regions between categories where no emotion is perceived. Furthermore, if the categorical view is correct, we might expect CP, in which subjects find it difficult to discriminate between members of the same category and easy to discriminate pairs of stimuli near category boundaries, as found in previous studies (Calder et al., 1996; Etcoff & Magee, 1992). To test these contrasting predictions with an exhaustive set of expression transitions, Young et al. constructed image-quality morph sequences between all pairs of the emotional expression prototypes shown in Figure 1a (Figure 1b shows example morph sequences produced in R. A.'s lab).

In Experiment 1, the authors had subjects identify the emotion category in 10%, 30%, 50%, 70%, and 90% morphs between all pairs of the six prototypical expressions in Figure 1a. The stimuli were presented in random order, and subjects were asked to perform a six-way forced-choice identification. Experiment 2 was identical, except that morphs between the emotional expressions and the neutral prototype were added to the pool, and "neutral" was one of the subjects' choices in a seven-way forced-choice procedure. The results of Experiment 2 for 6 of the 21 possible transitions are reprinted in Figure 2. In both experiments, along every morph transition, subjects' modal response to the stimuli abruptly shifted from one emotion to the other, with no indeterminate regions in between. Consistent with categorical theories of facial expressions, the subjects' modal response was always one of the endpoints of the morph sequence, never a nearby category. Subjects' response times (RTs), however, were more consistent with a multidimensional or weak category account of facial expression perception: as distance from the prototypes increased, RTs increased significantly, presumably reflecting increased uncertainty about category membership near category boundaries.

In Experiment 3, Young et al. explored the extent to which subjects could discriminate pairs of stimuli along six transitions: happiness–surprise, surprise–fear, fear–sadness, sadness–disgust, disgust–anger, and anger–happiness. They had subjects do both a sequential discrimination task (ABX) and a simultaneous discrimination task (same–different). Only the data from the sequential experiments are now available, but the authors report a very strong correlation between the two types of data (r = .82). The sequential data are reprinted in Figure 3. The main finding of the experiment was that subjects had significantly better discrimination performance near the category boundaries than near the expression prototypes, a necessary condition for a claim of CP. Experiment 3 therefore best supports the categorical view of facial expression perception.

Finally, in Experiment 4, Young et al. set out to determine whether subjects could determine what expression is "mixed in" to a faint morph. Again, a strong categorical view of facial expressions would predict that subjects should not be able to discern what expression a given morph sequence is moving toward until the sequence gets near the category boundary. According to continuous theories, however, subjects should be able to determine what prototype a sequence is moving toward fairly early in the sequence. In the experiment, subjects were asked to decide, given a morph or prototype stimulus, the most apparent emotion, the second-most apparent emotion, and the third-most apparent emotion, using a button box with a button for each of the six emotion categories. After correcting for the intrinsic confusability of the emotion categories, the authors found that subjects were significantly above chance at detecting the mixed-in emotion at the 30% level. As opposed to the identification and discrimination data from Experiments 1–3, this result best supports continuous, dimensional accounts of facial expression perception.

In summary, Young et al.'s experiments, rather than settling the issue of categorical versus continuous theories of facial expression perception, found evidence supporting both types of theory. The authors argue that a 2-D account of facial expression perception is unable to account for all of the data, but they also argue that the strong, mandatory categorization view is likewise deficient.

Figure 2. Selected results of Young et al.'s (1997) Experiment 2. (a) Human responses (% identification in a seven-way forced-choice task) to six morph sequences: happy–surprised, surprised–afraid, afraid–sad, sad–disgusted, disgusted–angry, and angry–happy. Every transition exhibited a sharp category boundary. (b) Modeling subjects' choice as a random variable distributed according to the pattern of activation at the network's output layer, which entails averaging across the 13 networks' outputs. The model has a correlation (over all 15 transitions) of r = .942 with the human subjects. (c) RTs for the same transitions as in (a). RTs increased significantly with distance from the prototype (M = sad; only 6 out of 21 possible transitions are shown). (d) Predictions of the network "uncertainty" model for the same morph transitions.


Until now, despite several years of research on automatic recognition of facial expressions, no computational model has been shown to simultaneously explain all of these seemingly contradictory data. In the next section, we review the recent progress in computational modeling of facial expression recognition, then introduce a new model that does succeed in explaining the available data.

Computational Modeling of Facial Expression Recognition

Padgett and colleagues were the first to apply computational models toward an understanding of facial expression perception (Cottrell, Dailey, Padgett, & Adolphs, 2000; Padgett, Cottrell, & Adolphs, 1996; Padgett & Cottrell, 1998). Their system employed linear filtering of regions around the eyes and mouth, followed by a multilayer perceptron for classification into emotion categories. The system achieved good recognition performance and was able to explain some of the results on CP of facial expressions with linear "dissolve" sequences. However, the system was unable to account for the sharp category boundaries humans place along image-quality morph transitions, because the linear filters' responses varied too quickly in the presence of the nonlinear changes in morph transitions. It also would have been incapable of predicting recently observed evidence of holistic effects in facial expression recognition (Calder, Young, Keane, & Dean, 2000).

Lyons et al. (1998) created a database of Japanese female models portraying facial expressions of happiness, surprise, fear, anger, sadness, and disgust. They then had subjects rate the degree to which each face portrayed each basic emotion on a 1–5 scale and used the Euclidean distance between the resulting "semantic rating" vectors for each face pair as a measure of dissimilarity. They then used a system inspired by the Dynamic Link Architecture (Lades et al., 1993) to explain the human similarity matrix. Similarities obtained from their Gabor wavelet-based representation of faces were highly correlated with similarities obtained from the human data, and nonmetric multidimensional scaling (MDS) revealed similar underlying 2-D configurations of the stimuli. The authors suggest in conclusion that the high-level circumplex constructs proposed by authors like Schlosberg and Russell may in part reflect similarity at low levels in the visual system.

In a recent work, Calder et al. (2001) applied a different computational model to the task of understanding human facial expression perception and recognition. The idea of their system, originally proposed by Craw and Cameron (1991), is to first encode the positions of facial features relative to the average face, then warp each face to the average shape (thus preserving texture but removing between-subject face shape variations). The shape information (the positions of the facial features prior to warping) and the shape-free information (the pixel values after warping) can each be submitted to a separate principal components analysis (PCA) for linear dimensionality reduction, producing separate low-dimensional descriptions of the face's shape and texture. Models based on this approach have been successful in explaining psychological effects in face recognition (Hancock, Burton, & Bruce, 1996, 1998). However, warping the features in an expressive face to the average face shape would seem to destroy some of the information crucial to recognition of that expression, so prior to Calder et al.'s work, it was an open question whether such a system could support effective classification of facial expressions. The authors applied the Craw and Cameron PCA method to Ekman and Friesen's (1976) Pictures of Facial Affect (POFA), then used linear discriminant analysis to classify the faces into happy, sad, afraid, angry, surprised, and disgusted categories. The authors found that a representation incorporating both the shape information (feature location PCA) and the shape-free information (warped pixel PCA) best supported facial expression classification (83% accuracy using leave-one-image-out classification for the Ekman and Friesen database). The authors went on to compare their system with human data from psychological experiments. They found that their system behaved similarly to humans in a seven-way forced-choice task and that the principal component representation could be used to predict human ratings of pleasure and arousal (the two dimensions of Russell's circumplex).

Figure 3. Results of Young et al.'s (1997) Experiment 3 for the sequential (ABX) discrimination task. Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels denote which pair of stimuli were being compared (e.g., 70/90 along the transition from sadness to disgust denotes a comparison of a 70% disgust/30% sadness morph with a 90% disgust/10% sadness morph). Discrimination was significantly better near the category boundaries than near the prototypes.

Taken together, results with the above three computational models of facial expression recognition begin to hint that subjects' performance in psychological experiments can be explained as a simple consequence of category learning and the statistical properties of the stimuli themselves. Although Calder et al.'s system exhibits a surprising amount of similarity to human performance in a forced-choice experiment, it has not been brought to bear on the debate on multidimensional versus CP of facial expressions. In the present article, we show that a new, more biologically plausible computational model not only exhibits more similarity to human forced-choice performance, but also provides a detailed computational account of data supporting both categorical and multidimensional theories of facial expression recognition and perception. Our simulations consist of constructing and training a simple neural network model (Figure 4); the system uses the same stimuli employed in many psychological experiments, performs many of the same tasks, and can be measured in similar ways as human subjects. The model consists of three levels of processing: perceptual analysis, object representation, and categorization. The next section describes the model in some detail.

The Model

The system is a feed-forward network consisting of three layers common to most object recognition models (Riesenhuber & Poggio, 2000). Its input is a 240 by 292 pixel, manually aligned, grayscale face image from Ekman and Friesen's POFA (Ekman & Friesen, 1976). This data set contains photos of 14 actors portraying expressions that are reliably classified as happy, sad, afraid, angry, surprised, or disgusted by naive observers (70% agreement was the threshold for inclusion in the data set, and the overall agreement is 91.6%). The expressions made by one of those actors, "J. J.," are shown in Figure 1a. The strong agreement across human subjects, together with the use of these photos in a wide variety of psychological experiments exploring emotional expression perception, makes POFA an ideal training set for our model.

The first layer of the model is a set of neurons whose response properties are similar to those of complex cells in the visual cortex. The so-called Gabor filter (Daugman, 1985) is a standard way to model complex cells in visual recognition systems (Lades et al., 1993). Figure 5 shows the spatial "receptive fields" of several filters. The units essentially perform nonlinear edge detection at five scales and eight orientations. As a feature detector, the Gabor filter has the advantage of moderate translation invariance over pixel-based representations. This means that features can "move" in the receptive field without dramatically affecting the detector's response, making the first layer of our model robust to small image changes. With a 29 by 35 grid and 40 filters at each grid location, we obtain 40,600 model neurons in this layer, which we term the "perceptual" layer.

In order to extract a small set of informative features from this high-dimensional data, we use the equivalent of an "image compression" network (Cottrell, Munro, & Zipser, 1989). This is a back propagation network that is trained to simply reproduce its input on its output through a narrow channel of hidden units. In order to solve this problem, the hidden units must extract regularities from the data. Since they are fully connected to the inputs, they usually extract global representations, which we have termed "holons" in previous work (Cottrell & Metcalfe, 1991). We note that this is a biologically plausible means of dimensionality reduction, in the sense that it is unsupervised and can be learned by simple networks employing Hebbian learning rules (Sanger, 1989). As a shortcut, we compute this network's weights directly via the equivalent statistical technique of PCA. We chose 50 principal components (hidden units) for this layer based on previous experiments showing this leads to good generalization on POFA. It is important to note that the number of components (the only free parameter in our system) was not tuned to human data. The resulting low-dimensional, object-level representation is specific to the facial expression and identity variations in its input, as is the population of so-called face cells in the inferior temporal cortex (Perrett, Hietanen, Oram, & Benson, 1992). We term this layer of processing the "gestalt" level.

Figure 4. Model schematic.

Figure 5. Example Gabor filters. The real (cosine-shaped) component is shown relative to the size of the face at all five scales and five of the eight orientations used.

Although PCA is one of the simplest and most efficient methods for coding a set of stimuli, other methods would probably also work. For the current model, it is only important that (1) the code be sensitive to the dimensions of variance in faces, to facilitate learning of expressions, and that (2) the code be low dimensional, to facilitate generalization to novel faces. In a more general object recognition system, a single PCA for all kinds of objects would probably not be appropriate, because the resulting code would not be sparse (a single image normally activates a large number of holistic "units" in a PCA representation, but object-sensitive cells in the visual system appear to respond much more selectively; Logothetis & Sheinberg, 1996). To obtain a good code for a large number of different objects, then, nonnegative matrix factorization (Lee & Seung, 1999) or an independent component mixture model (Lee, Lewicki, & Sejnowski, 2000) might be more appropriate. For the current problem, though, PCA suffices.

The outputs of the gestalt layer are finally categorized into the six "basic" emotions by a simple perceptron with six outputs, one for each emotion. The network is set up so that its outputs can be interpreted as probabilities (they are all positive and sum to 1). However, the system is trained with an "all-or-none" teaching signal. That is, even though only 92% (say) of the human subjects used to vet the POFA database may have responded "happy" to a particular face, the network's teaching signal is 1 for the "happy" unit and 0 for the other five. Thus, the network does not have available to it the confusions that subjects make on the data. We term this layer the "category" level.

While this categorization layer is an extremely simplistic model of human category learning and decision-making processes, we argue that the particular form of classifier is unimportant, so long as it is sufficiently powerful and reliable to place the gestalt-level representations into emotion categories; we claim that similar results will obtain with any nonlinear classifier.

We should also point out that the system abstracts away many important aspects of visual processing in the brain, such as eye movements, facial expression dynamics, and size and viewpoint invariance. These complicating factors turn out to be irrelevant for our purposes; as we shall see, despite the simplifying assumptions of the model, it nevertheless accounts for a wide variety of data available from controlled behavioral experiments. This suggests that it is a good first-order approximation of processing at the computational level in the visual system.

The next section reports on the results of several experiments comparing the model's predictions to human data from identical tasks, with special emphasis on Young et al.'s landmark study of categorical effects in facial expression perception (Young et al., 1997). We find (1) that the model and humans find the same expressions difficult or easy to interpret, (2) that when presented with morphs between pairs of expressions, the model and humans place similar, sharp category boundaries between the prototypes, (3) that pairwise similarity ratings derived from the model's gestalt-level representations predict human discrimination ability, (4) that the model and humans are similarly sensitive to mixed-in expressions in morph stimuli, and (5) that MDS analysis produces a similar emotional similarity structure from the model and human data.

RESULTS

Comparison of Model and Human Performance

The connections in the model's final layer were trained to classify images of facial expressions from Ekman and Friesen's POFA database (see Figure 1a) (Ekman & Friesen, 1976), the standard data set used in the vast majority of psychological research on facial expression (see Methods for details). The training signal contained no information aside from the expression most agreed upon by human observers; that is, even if 90% of human observers labeled a particular face "happy" and 10% labeled it "surprised," the network was trained as if 100% had said "happy." Again, there is no information in the training signal concerning the similarity of different expression categories. The first measurement we made was of the model's ability to generalize in classifying the expressions of previously unseen subjects from the same database. The model's mean generalization performance was 90.0%, which is not significantly different from human performance on the same stimuli (91.6%; t = .587, df = 190, p = .279) (Table 1). Moreover, the rank-order correlation between the model's average accuracy on each category (happiness, surprise, fear, etc.) and the level of human agreement on the same categories was .667 (two-tailed Kendall's tau, p = .044; cf. Table 1). For example, both the model and humans found happy faces easy, and fear faces the most difficult, to categorize correctly.¹ Since the network had about as much experience with one category as another and was not trained on the human response accuracies, this correlation between the relative difficulties of each category is an "emergent" property of the model. Studies of expression recognition consistently find that fear is one of the most difficult expressions to recognize (Katsikitis, 1997; Matsumoto, 1992; Ekman & Friesen, 1976). Our model suggests that this is simply because the fear expression is "inherently" difficult to distinguish from the other five expressions.

How does the network perform the expression recognition task? An examination of the trained network's representation provides some insight. We projected each unit's weight vector back into image space in order to visualize the "ideal" stimulus for each unit in the network (see Methods for details). The results are shown in Figure 6, with and without addition of the average face. In each of the images, each pixel value is the result of applying a regression formula predicting the value of the pixel at that location as a linear function of the 50-element weight vector for the given network output unit. Dark and bright spots indicate the features that excite or inhibit a given output unit, depending on the relative gray values in the region of that feature. Each unit appears to combine evidence for an emotion based upon the presence or absence of a few local features. For instance, for fear, the salient criteria appear to be the eyebrow raise and the eyelid raise, with a smaller contribution of parted lips. The anger unit apparently requires a display in which the eyebrows are not raised. Clearly, the classifier has determined which feature configurations reliably distinguish each expression from the others.

Comparison on CP Data

Several studies have reported CP of facial expressions using morphs between portrayals of the six basic emotions by POFA actor "J. J.," whose images have been chosen because his expressions are among the easiest to recognize in the database (Young et al., 1997; Calder et al., 1996). Since our model also finds J. J. "easy" (it achieves 100% generalization accuracy on J. J.'s prototypical expressions), we replicated these studies with 13 networks that had not been trained on J. J.²

We first compared the model's performance with human data from a six-way forced-choice experiment (Young et al., 1997) on 10%, 30%, 50%, 70%, and 90% morphs we constructed (see Methods) between all pairs of J. J.'s prototypical expressions (see Figure 1b for two such morph sequences). We modeled the subjects' identification choices by letting the outputs of the networks represent the probability mass function for the human subjects' responses. This means that we averaged each of the six outputs of the 13 networks for each face and compared these numbers to the human subjects' response probabilities. Figure 2 compares the networks' forced-choice identification performance with that of humans on the same stimuli. Quantitatively, the model's responses were highly correlated with the human data (r = .942, p < .001), and qualitatively, the model maintained the essential features of the human data: abrupt shifts in classification at emotion category boundaries and few intrusions (identifications of unrelated expressions) along the transitions. Using the same criteria for an intrusion that Young et al. (1997) used, the model predicted 4 intrusions in the 15 morph sequences, compared to 2 out of 15 in the human data. Every one of the intrusions predicted by the model involved fear, which is the least reliably classified expression in the POFA database for humans and networks (Table 1).

Table 1. Error Rates for Networks and Level of Agreement for Humans

Expression    Network % Correct    Human % Correct
Happiness     100.0                98.7
Surprise      100.0                92.4
Disgust       100.0                92.3
Anger          89.1                88.9
Sadness        83.3                89.2
Fear           67.2                87.7
Average        90.0                91.6

Network generalization to unseen faces compared with human agreement on the same faces (six-way forced choice). Human data are provided with the POFA database (Ekman & Friesen, 1976).

Figure 6. Network representation of each facial expression. (Top row) Approximation of the optimal input-level stimulus for each facial expression category. (Bottom row) The same approximations with the average face subtracted; dark and bright pixels indicate salient features.

We also compared subjects' RTs to those of the network, in which we model RT as being proportional to uncertainty at the category level. That is, if the network outputs a probability of .9 for happiness, we assume it is more certain, and therefore faster, than when it outputs a probability of .6 for happiness. We thus use the difference between 1 (the maximum possible output) and the largest actual output as our model of the reaction time. As explained earlier, subjects' RTs exhibit a characteristic scalloped shape, with slower RTs near category boundaries. The model also exhibits this pattern, as shown in Figure 2. The reason should be obvious: As a sequence of stimuli approaches a category boundary, the network's output for one category necessarily decreases as the other increases. This results in a longer model RT. The model showed good correlation with human RTs (see Methods) (r = .677, p < .001).
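The RT model amounts to one line of arithmetic. A minimal sketch (ours; the function name is hypothetical) turns a vector of category-level outputs into a model RT in arbitrary time units:

```python
import numpy as np

def model_rt(outputs: np.ndarray) -> float:
    """Model RT = 1 - (largest output); more certainty means a faster response.
    `outputs` is the six-element softmax output vector for one stimulus."""
    return 1.0 - float(np.max(outputs))

print(model_rt(np.array([0.9, 0.02, 0.02, 0.02, 0.02, 0.02])))   # fast: 0.1
print(model_rt(np.array([0.6, 0.3, 0.025, 0.025, 0.025, 0.025])))  # slower: 0.4
```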

The model also provides insight into human discrimination behavior. How can a model that processes one face at a time discriminate faces? The idea is to imagine that the model is shown one face at a time and stores its representation of each face for comparison. We use the correlation (Pearson's r) between representations as a measure of similarity, and then use 1 minus this number as a measure of discriminability. An interesting aspect of our model is that it has several independent levels of processing (see Figure 4), allowing us to determine which level best accounts for a particular phenomenon. In this case, we compute the similarity between two faces at the pixel level (correlation between the raw images), the perceptual level (correlation between the Gabor filter responses), the gestalt level (correlation between the principal components), or the categorization level (correlation between the six outputs). We compared our measure of discriminability with human performance in Young et al.'s (1997) ABX experiment (see Methods for details). We found the following correlations at each processing level: pixel, r = .35, p = .06; perceptual, r = .52, p = .003; gestalt, r = .65, p < .001 (shown in Figure 7b); category, r = .41, p = .02. Crucially, when the gestalt layer and the categorization layer were combined in a multiple regression, the categorization layer's contribution was insignificant (p = .3), showing that the explanatory power of the model rests with the gestalt layer. According to our model, then, human subjects' improved discrimination near category boundaries in this task is best explained as an effect at the level of gestalt-based representations, which were derived in an unsupervised way from Gabor representations via PCA. This is in sharp contrast to the standard explanation of this increased sensitivity, which is that categorization influences perception (Goldstone et al., 2001; Pevtzow & Harnad, 1997). This suggests that the facial expression morph sequences have natural boundaries, possibly because the endpoints are extremes of certain coordinated muscle group movements. In other domains, such as familiar face classification (Beale & Keil, 1995), the boundaries must arise through learning. In such domains, we expect that discrimination would be best explained in a learned feature layer, as in other models (Goldstone, 2000; Tijsseling & Harnad, 1997).

Figure 7. Discrimination of morph stimuli. (a) Percent correct discrimination of pairs of stimuli in an ABX task (Young et al., 1997). Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels to the left and right of each point show which two stimuli were being compared (e.g., 70/90 along the transition from sadness to disgust denotes a comparison of a 70% disgust / 30% sadness morph with a 90% disgust / 10% sadness morph). Note that better performance occurs near category boundaries than near prototypes (highlighted by the vertical lines). (b) The model's discrimination performance at the gestalt representation level. The model discrimination scores are highly correlated with the human subjects' scores (r = .65, p < .001).
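The representation-level discriminability measure described above is easy to state in code. A minimal sketch (ours; names are hypothetical, and any two equal-length representation vectors can stand in):

```python
import numpy as np

def discriminability(rep_a: np.ndarray, rep_b: np.ndarray) -> float:
    """1 - Pearson correlation between two stimulus representations.
    The representations may come from any level: raw pixels, Gabor
    responses, gestalt (PCA) codes, or the six category outputs."""
    r = np.corrcoef(rep_a.ravel(), rep_b.ravel())[0, 1]
    return 1.0 - r

# Hypothetical gestalt-level codes for two neighboring morph stimuli.
rng = np.random.default_rng(0)
face_i = rng.normal(size=50)
face_j = face_i + 0.1 * rng.normal(size=50)  # a perceptually similar face
print(discriminability(face_i, face_j))      # small value: hard to discriminate
```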

The data above show that the model naturally accounts for categorical behavior of humans making forced-choice responses to facial expression morph stimuli. In another experiment, Young et al. (1997) found that subjects could reliably detect the expression mixed into a morph, even at the 30% level. Can the model also account for this decidedly noncategorical behavior? In the experiment, subjects were asked to decide, given a morph or prototype stimulus, the most apparent emotion, the second-most apparent emotion, and the third-most apparent emotion (the scores on prototypes were used to normalize for inherent expression similarity). To compare the human responses with the model, we used the top three outputs of the 13 networks and used the same analysis as Young et al. to determine the extent to which the network could detect the mixed-in expression in the morph images (see Methods for details). Figure 8 shows that the model's average sensitivity to the mixed-in expression is almost identical to that of human subjects, even though its behavior seems categorical in forced-choice experiments.

Comparison of Similarity Structures

We next investigated the similarity structure of the representations that the model produced. As before, we calculated the similarity between pairs of faces at each level of the network by computing the correlation between their representations. In order to evaluate the similarity structure qualitatively, we performed MDS both on the human forced-choice responses published by Ekman and Friesen (1976) and on the network's responses to the same stimuli at each level of processing shown in Figure 4. At the pixel level, we found that the structure present in the MDS configuration was based mainly on identity, and as we moved toward the categorization level, the identity-based structure began to break down and expression-related clusters began to emerge. Unsurprisingly, at the network's categorization layer, the configuration was clearly organized by emotion category (Figure 9). Of more interest, however, is that the "ordering" of facial expressions around the human and network MDS configurations is the same, a result unlikely to have arisen by chance (p = 1/60 = .017). Given that the network was never given any similarity information, this is a remarkable result. It suggests that the human similarity structure is a simple result of the inherent confusability of the categories, not necessarily the result of locality in some underlying psychological emotion space, as dimensional theories (e.g., Russell, 1980; Russell et al., 1989) might predict.

Figure 9. Multidimensional scaling of human and network responses reveals similar dimensions of emotion. Each point represents one of the 96 expressive faces in POFA. (a) 2-D similarity space induced by MDS from the average six-way forced-choice responses of human subjects (Ekman & Friesen, 1976) (stress = 0.218). (b) 2-D similarity space induced by MDS from the average training set responses of networks at their output layers (stress = 0.201). The arrangement of emotions around the circle is the same in both cases. Stress at the pixel level was 0.222; at the perceptual level, 0.245; and at the gestalt level, 0.286.

Figure 8. Ability to detect the mixed-in expression in morph sequences for humans and the model. In the human data (dashed lines), subjects chose the far prototype (the faint second expression in the morph) significantly more often than unrelated expressions when that expression's mix ratio was 30% or greater, even though they consistently identified the 70% expression in forced-choice experiments. The model is almost identical to the humans in its ability to detect the secondary expression in a morph image.

To measure the similarity between the human and network category structures quantitatively, we compared the human and model confusion matrices directly. For each pair of expressions, we computed the probability p_ij that humans or networks respond with emotion i when the intended emotion is j, for all i ≠ j; that is, the probability of confusion between two categories. The correlation between networks and humans on the networks' test sets (the stimuli the networks had never seen before) was .661 (p < .001).³ Thus, there is a close correspondence between human confusion rates and response distributions at the model's categorization level. The system is never instructed to confuse the categories in a way similar to humans; nevertheless, this property emerges naturally from the data.

DISCUSSION

In this article, we have introduced a computational model that mimics some of the important functions of the visual system. The model is simply a pattern classifier incorporating a biologically plausible representation of visual stimuli. We first engineered the system to provide good classification performance on a small database of reliably recognizable facial expressions. The free parameters of the model, such as the number of principal components used in dimensionality reduction, were optimized to maximize the classifier's generalization performance. In contrast to most models in mathematical psychology, which seek to fit a low number of free parameters to maximize a model's agreement with human data, our model is compared to human data directly, without any tuning.

The results of our comparison of the model's performance with human data were nevertheless remarkable. We first found that the relative levels of difficulty for six basic facial expressions of emotion were highly correlated with the levels at which humans agree on those same emotions. For example, humans are best at classifying happy expressions because the smile makes the task easy. The network model likewise finds it extremely easy to detect smiles, because they are obvious visual features that discriminate happy expressions from the rest of the faces in its training set. On the other hand, humans usually find that fear expressions are very difficult to classify. Fear expressions are often classified as surprise, for instance. The network model likewise sometimes classifies fear expressions as surprise. Why are fear expressions so difficult for humans to classify? Could it be that true displays of fear are uncommon in our society? Or that portrayals of fear in the popular culture misguide us as to what fearful people really look like? Or simply that fear expressions are perceptually similar to other expressions? Our results suggest that the latter is the case; culture probably has very little to do with the difficulty of fear expressions. We have shown that perceptual similarity alone is sufficient to account for the relative difficulty of facial expressions.

The agreement between human and network similarity structure analyses (MDS) is also somewhat surprising. As pointed out earlier, Russell and colleagues have found that affective adjectives and affective facial expressions seem to share a common similarity structure. One might postulate that the underlying psychological space reflects the physiological similarity of emotional states, and that when we are asked to classify facial expressions, we are likely to classify a given face as any of the emotions close by in that psychophysiological space. However, we have shown that an emotionless machine without any underlying physiology exhibits a similarity structure very much like that of the humans. This is even more remarkable when one considers that the network was not given any indication of the similarity structure in the training signal, which was always all-or-none, rather than, for example, human subject responses, which would reflect subjects' confusions. Since we nevertheless match the human similarity structure, we have shown that the perceptual similarity of the categories corresponds to the psychological similarity structure of facial expressions. Why would this be? We suggest that evolution did not randomly associate facial expressions with emotional states, but that the expression-to-emotion mapping evolved in tandem with the need to communicate emotional states effectively.

The final question we set out to explore with this work is whether facial expressions are represented continuously or as discrete entities. As explained in the Introduction, the best evidence for discrete representations is that subjects appear to place sharp boundaries between facial expression categories and are better able to discriminate pairs of expressions near category boundaries, suggesting that our perception of the faces is influenced by the existence of the categories. As has been found in other modalities (cf. Ellison & Massaro, 1997), we find that it is unnecessary to posit discrete representations to account for the sharp boundaries and high discrimination scores. The network model places boundaries between categories, as it must to obtain good classification accuracy, but the categories are actually fuzzy and overlapping. The model can be thought of as a biologically plausible working implementation of Russell and Bullock's (1986) theory of emotional facial expression categories as fuzzy concepts. Despite the network's fuzzy, overlapping category concepts, the categories appear sharp when a rule such as "respond with the category most strongly activated by the given facial expression" is applied. A more difficult result to explain in terms of fuzzy categories placed in a continuous multidimensional space is the subjects' high discrimination scores near category boundaries. This result seems to call for an influence of the category boundaries on our perception of the expressions; indeed, that is Young et al.'s (1997) tentative interpretation. To the contrary, we have shown that in our model, discrimination scores best agree with human results at a purely perceptual level, where the category labels have no effect. With respect to image space, the low-dimensional PCA representation actually changes faster in boundary regions than in regions near the prototypes it was derived from. In the context of our model, then, the seemingly contradictory results in different experiments conducted by Young et al. can be explained as simply tapping different computational levels of processing in a visual system organized much like our model.

We have found that our facial expression recognition model is "sufficient" to explain many aspects of human performance in behavioral tasks, but we have no proof of the "necessity" of any of our particular implementation decisions. In fact, we predict that many similar systems would obtain similar results. For instance, a system beginning with derivative-of-Gaussian edge filters (Marr, 1982) whose responses are combined to produce smooth responses to translations (like complex cells) should exhibit the same behavior. Replacing the PCA dimensionality reduction method with, say, factor analysis or the Infomax algorithm for independent components analysis (Bell & Sejnowski, 1995) should not dramatically affect the results. Finally, a category level using Support Vector Machines (Vapnik, 1995) should likewise produce similar behavior. The point of our simulations is that the category boundaries, discriminability, and similarity structures previously seen as being at odds are, in a sense, "present in the data and tasks themselves" and are easily exposed by any reasonable computational model.

METHODS

Network Details

The first step of processing in the model is to filter the image with a rigid 29 by 35 grid of overlapping 2-D Gabor filters (Daugman, 1985) in quadrature pairs, at five scales and eight orientations (some example filters are shown in Figure 5). The quadrature pairs are used to compute a phase-insensitive energy response at each point in the grid. These energy responses, or "Gabor magnitudes," are often used as a simplifying model of the spatial responses of complex cells in the early visual system (Lades et al., 1993). Though the model loses some information about precise feature localization when phase information is thrown away, the overlapping receptive fields compensate for this loss (Hinton, McClelland, & Rumelhart, 1986). Each Gabor magnitude is z-scored (linearly transformed to have mean 0 and variance 1 over the training data) so that each filter contributes equally in the next representation layer.
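A minimal sketch of one such quadrature-pair energy response (our illustration; the specific filter parameters below are placeholders, not the values used in the actual model):

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size: int, wavelength: float, theta: float, sigma: float) -> np.ndarray:
    """Complex 2-D Gabor kernel; real and imaginary parts form a quadrature pair."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotated coordinate
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.exp(1j * 2 * np.pi * xr / wavelength)
    return envelope * carrier

def gabor_magnitudes(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Phase-insensitive energy: modulus of the complex filter response."""
    response = fftconvolve(image, kernel, mode="same")
    return np.abs(response)

image = np.random.rand(292, 240)   # stand-in for an aligned 240-by-292 POFA face
kernel = gabor_kernel(31, wavelength=8.0, theta=np.pi / 4, sigma=4.0)
mags = gabor_magnitudes(image, kernel)
# In the model, magnitudes are sampled on a 29-by-35 grid for 40 filters
# (five scales by eight orientations), then z-scored over the training set.
```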

The second step of processing in the model is to perform linear dimensionality reduction on the 40,600-element Gabor representation via a PCA of the training set. The actual computation is facilitated by Turk and Pentland's (1991) algebraic trick for "eigenfaces." This produces a 50-element representation, typically accounting for approximately 80% of the variance in the training set's Gabor data.
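The trick exploits the fact that, with far fewer training examples than dimensions, the leading eigenvectors of the large covariance matrix can be recovered from the much smaller Gram matrix. A hedged sketch (ours, not the paper's code; the sample count is a placeholder):

```python
import numpy as np

def pca_via_gram(X: np.ndarray, k: int):
    """PCA when n_samples << n_dims (Turk & Pentland's 1991 trick).
    X: (n_samples, n_dims) data matrix, e.g., Gabor-magnitude vectors.
    Returns the mean and the top-k principal axes as rows."""
    mean = X.mean(axis=0)
    Xc = X - mean
    gram = Xc @ Xc.T                      # small (n_samples x n_samples) matrix
    evals, evecs = np.linalg.eigh(gram)   # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]
    axes = Xc.T @ evecs[:, order]         # lift back to the full 40,600-dim space
    axes /= np.linalg.norm(axes, axis=0)  # unit-length principal axes
    return mean, axes.T

X = np.random.rand(168, 40600)            # hypothetical training-set Gabor data
mean, axes = pca_via_gram(X, k=50)
codes = (X - mean) @ axes.T               # 50-element gestalt-level codes
```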

The 50-element vector p output by PCA can then be classified by a simple statistical model. We locally z-score (scale to mean 0, variance 1) each input element, then use a single-layer neural network (a generalized linear model) containing six outputs, one for each of the six "basic" emotions: happiness, sadness, fear, anger, surprise, and disgust. Each of the six units in the network computes a weighted sum o_i = Σ_j w_ij p_j of the 50-element input vector; then the "softmax" function y_i = exp(o_i) / Σ_j exp(o_j) is applied to the units' linear activations to obtain a vector of positive values whose sum is 1.0. The network is trained with the relative entropy error function, so that its outputs correspond to the posterior probabilities of the emotion categories given the inputs (Bishop, 1995).
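A minimal numpy sketch of this generalized linear model and one stochastic gradient step on the relative entropy (cross-entropy) error, under our own naming assumptions (momentum and weight decay, which the training procedure also uses, are omitted for brevity):

```python
import numpy as np

def softmax(o: np.ndarray) -> np.ndarray:
    e = np.exp(o - o.max())          # subtract max for numerical stability
    return e / e.sum()

def train_step(W: np.ndarray, p: np.ndarray, target: np.ndarray, lr: float) -> np.ndarray:
    """One stochastic gradient step on the relative entropy error.
    W: (6, 50) weights; p: z-scored 50-element gestalt code;
    target: one-hot teaching signal over the six emotions."""
    y = softmax(W @ p)               # o_i = sum_j w_ij p_j, then softmax
    grad = np.outer(y - target, p)   # gradient of cross-entropy w.r.t. W
    return W - lr * grad

W = np.zeros((6, 50))
p = np.random.randn(50)
target = np.eye(6)[0]                # "happiness" as the all-or-none target
W = train_step(W, p, target, lr=0.1)
```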

Network Training

We train the expression recognition system using leave-one-out cross-validation and early stopping. A given network is trained to minimize output error on all the images of 12 of the 14 actors in POFA, using stochastic gradient descent, momentum, weight decay, and the relative entropy error function (Bishop, 1995). A thirteenth actor's images are used as a "hold out" set to determine when to stop training. Training is stopped when the error on the hold out set is minimized. After training is stopped, we evaluate the network's generalization performance on the remaining (fourteenth) actor. We performed training runs with every possible combination of generalization and hold out sets, for a total of 182 (14 by 13) individual networks.
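The double leave-one-actor-out loop can be sketched as follows (our skeleton; the two helper functions are trivial stand-ins for the training and scoring routines described above):

```python
from itertools import permutations

ACTORS = [f"actor_{i:02d}" for i in range(14)]  # hypothetical POFA actor IDs

def train_until_holdout_minimum(train_actors, holdout_actor):
    """Stand-in for the real routine: gradient descent with momentum and
    weight decay, stopped when holdout-set error reaches its minimum."""
    return {"trained_on": train_actors, "stopped_by": holdout_actor}

def evaluate(net, test_actor):
    """Stand-in: would return generalization accuracy on the unseen actor."""
    return (net["stopped_by"], test_actor)

results = []
# Every ordered (test, holdout) pair of distinct actors: 14 * 13 = 182 runs.
for test_actor, holdout_actor in permutations(ACTORS, 2):
    train_actors = [a for a in ACTORS if a not in (test_actor, holdout_actor)]
    net = train_until_holdout_minimum(train_actors, holdout_actor)
    results.append(evaluate(net, test_actor))

print(len(results))  # 182 individual networks
```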

Network Weight Visualization

The idea is to project each unit's weight vector back into image space in order to visualize what the network is sensitive to in an image. But this is not a well-defined task: though PCA is an easily inverted orthonormal transformation, the Gabor magnitude representation, besides being subsampled, throws away important phase information. As a workaround, we assume that each pixel's value is an approximately linear function of the 50-component gestalt-level (Gabor + PCA) representation. We chose one of the 182 trained networks, and for each pixel location, we used regression to find the linear function of the 50-element gestalt-level representation best predicting the pixel's value over the network's training set. Then, to visualize, say, the classifier's representation of happiness, we apply the regression function directly to the happiness unit's weights. An image was constructed from each of the six units' weight vectors using the regression functions learned from that network's training set.
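A sketch of this procedure under our own naming assumptions: fit one least-squares regression per pixel from gestalt codes to pixel values (all regressions solved jointly), then apply the regression weights to a unit's 50-element weight vector:

```python
import numpy as np

# Hypothetical training data: gestalt codes and the corresponding face images.
n_images, n_code, n_pixels = 96, 50, 70080   # 70,080 = 240 x 292 pixels
codes = np.random.randn(n_images, n_code)
pixels = np.random.rand(n_images, n_pixels)

# One linear regression per pixel, with a bias term, solved jointly:
# pixels ~ [codes, 1] @ B, where B has shape (51, n_pixels).
design = np.hstack([codes, np.ones((n_images, 1))])
B, *_ = np.linalg.lstsq(design, pixels, rcond=None)

# "Ideal" image for one output unit: apply the regressions to its weights.
w_happiness = np.random.randn(n_code)        # stand-in for learned unit weights
ideal = np.concatenate([w_happiness, [1.0]]) @ B   # bias input 1 adds the
ideal_image = ideal.reshape(292, 240)              # intercept (average face);
# setting the bias input to 0 instead would omit it, roughly corresponding
# to the "average face subtracted" images in Figure 6.
```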

Generation of Morphs

We generated morphs from the original stimuli (Ekman & Friesen, 1976) using the Morph program, version 2.5, from Gryphon Software, as described elsewhere (Jansari, Tranel, & Adolphs, 2000). Briefly, for each of the 15 pairs of expression prototypes, corresponding features between the two images were manually specified. The images were then tessellated and linearly transformed, both with respect to pixel position (a smooth warping) and pixel grayscale value (a smooth fade in luminance). The 10%, 30%, 50%, 70%, and 90% blends (see Figure 1b for examples) were retained for each transformation.

Multidimensional Scaling

MDS seeks to embed a set of stimuli in a low-dimensional space in such a way that the distances between points in the low-dimensional space are as faithful as possible to ratings of their similarity (Borg & Lingoes, 1987). We performed MDS on the human responses (Ekman & Friesen, 1976) and on the network's responses to the same stimuli at each level of processing in the network. Each analysis requires a distance (dissimilarity) matrix enumerating how dissimilar each pair of stimuli is. For the human data, we formed a six-element vector containing the probability with which humans gave the labels happy, sad, afraid, angry, surprised, and disgusted for each of the 96 nonneutral photographs in POFA. We obtained a 96 by 96 element similarity matrix from these data by computing the correlation r_ij between each pair of six-element vectors. Finally, we converted the resulting similarity matrix into a distance (dissimilarity) matrix with the transform d_ij = (1 - r_ij) / 2 to obtain values in the range 0-1. For the network, we measure the similarity between two stimuli i and j as the correlation r_ij between the representations of the two stimuli at each level of the network, corresponding to similarity at different levels of processing: the pixel level (70,080-element image pixel value vectors), the perceptual level (40,600-element Gabor response patterns), the gestalt level (50-element Gabor/PCA patterns), and the network's output (six-element category level). We ran a classical nonmetric MDS algorithm, SSA-1, due to Guttman and Lingoes (Borg & Lingoes, 1987), on each of the four resulting distance matrices described above, and then plotted the stimuli according to their position in the resulting 2-D configuration.
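A sketch of the distance-matrix construction followed by a 2-D embedding. We substitute scikit-learn's nonmetric MDS for the original SSA-1 algorithm, so the configuration will differ in detail; the response distributions below are random placeholders:

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical per-face response distributions over the six emotion labels.
n_faces = 96
responses = np.random.dirichlet(np.ones(6), size=n_faces)

# Similarity = correlation between response vectors; distance = (1 - r) / 2.
r = np.corrcoef(responses)
d = (1.0 - r) / 2.0                    # values in [0, 1]
np.fill_diagonal(d, 0.0)               # zero self-distance

# Nonmetric 2-D embedding of the precomputed dissimilarities.
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
          random_state=0)
config = mds.fit_transform(d)          # (96, 2) stimulus coordinates
print(mds.stress_)
```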

Model RTs

We assume a network's RT is directly proportional to the "uncertainty of its maximal output." That is, we define a network's model RT for stimulus i to be t_i^model = 1 - max_j y_ij (the time scale is arbitrary). Here y_ij is the network's output for emotion j on stimulus i. This is similar to the standard approach of equating RT with a network's output error (Seidenberg & McClelland, 1989), except that for morph stimuli there is no predetermined "correct" response. For comparison, the human data available are t_ij^human, the mean RT of subjects responding with emotion j to stimulus i. To compare the model to the humans given these data, we treated each network as an individual subject and, for each morph stimulus, recorded the network's response emotion j and model reaction time t_i^model, then averaged t_i^model over all networks making the same response to the stimulus. The quantitative comparison is the linear fit between network predictions and human RTs for all stimulus and response pairs for which both human and network data were available. Young et al.'s criterion for reporting t_ij^human was that at least 25% of the human subjects responded with emotion j to stimulus i, and for some of these cases none of the networks responded with emotion j to stimulus i, so the missing human and network data were disregarded.

Model Discrimination Scores

We assume discrimination is more difficult the more similar two stimuli are at some level of processing. We use the same measure of similarity as in the MDS procedure: the correlation r_ij between the network's representations of stimuli i and j (either at the pixel, perceptual, gestalt, or output level). To convert similarity scores to discrimination scores, we used the transform d_ij = 1 - r_ij. This was measured for each of the 30 pairs of J. J. images for which human data were available. The discrimination scores were then averaged over the 13 networks that had not been trained on J. J. and compared to the human data.

Mixed-in Expression Detection

To measure the ability of the model to detect the secondary expression mixed into a morph stimulus, we followed Young et al.'s (1997) methods. For each network's output on a given stimulus, we scored the first, second, and third highest outputs of the network as a 3, 2, and 1, respectively, and assigned the score 0 to the three remaining outputs. For each morph and prototype stimulus, we averaged these score vectors across all 13 networks. Then, for each of the 30 possible combinations of near prototype (i) and far prototype (j), using the 90%, 70%, and 50% morphs moving from expression i to expression j, we subtracted the score vector for prototype i from the score for each of the three morphs. This eliminates the effect of the intrinsic similarity between J. J.'s prototype expressions. Finally, averaging these score vectors across all 30 sequences, we obtain the data plotted in Figure 8.
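The rank-scoring step reduces to a few lines; a sketch under our own naming assumptions (the output vectors are illustrative placeholders):

```python
import numpy as np

def rank_scores(outputs: np.ndarray) -> np.ndarray:
    """Score the three largest outputs 3, 2, 1 and the remaining three 0."""
    scores = np.zeros_like(outputs)
    top3 = np.argsort(outputs)[::-1][:3]   # indices of the three largest outputs
    scores[top3] = [3.0, 2.0, 1.0]
    return scores

# Hypothetical outputs for a 70% morph from expression i toward expression j,
# and for the near prototype i itself.
morph_outputs = np.array([0.55, 0.25, 0.10, 0.05, 0.03, 0.02])
proto_outputs = np.array([0.80, 0.05, 0.08, 0.04, 0.02, 0.01])

# Subtracting the near prototype's score vector removes intrinsic similarity.
detection = rank_scores(morph_outputs) - rank_scores(proto_outputs)
print(detection)
```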

Acknowledgments

We thank Stevan Harnad, Bill Kristan, Paul Munro, Alice O'Toole, Terry Sejnowski, and Gary's Unbelievable Research Unit (GURU) for helpful comments on previous versions of this manuscript. We also thank Andrew Young for permission to use the human data plotted in Figures 2, 3, 7a, and 8. We are also grateful to Paul Ekman for providing us with the Pictures of Facial Affect. This research was funded by NIMH grant MH57075 to GWC.

Reprint requests should be sent to Garrison W. Cottrell, UCSD Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA, or via e-mail: gary@cs.ucsd.edu.

Notes

1. For comparison, the order of difficulty for the Calder et al. (2001) PCA model is happiness (98%), fear (97%), surprise (92%), anger (86%), sadness (72%), and disgust (72%).
2. We used 13 networks for technical reasons (see Methods for details).
3. The same comparison of our model with the confusion matrix from the Calder et al. (2001) forced-choice experiment is slightly better, .686, whereas their PCA model's correlation with their subjects' confusion data is somewhat lower, .496.

REFERENCES

Beale, J., & Keil, F. (1995). Categorical effects in the perception of faces. Cognition, 57, 217–239.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer.

Calder, A., Young, A., Perrett, D., Etcoff, N., & Rowland, D. (1996). Categorical perception of morphed facial expressions. Visual Cognition, 3, 81–117.

Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179–1208.

Calder, A. J., Young, A. W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 526–551.

Carroll, J. M., & Russell, J. A. (1996). Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70, 205–218.

Cottrell, G. W., Dailey, M. N., Padgett, C., & Adolphs, R. (2000). Is all face processing holistic? The view from UCSD. In M. Wenger & J. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Contexts and challenges (pp. 347–395). Mahwah, NJ: Erlbaum.

Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, gender and emotion recognition using holons. In R. P. Lippman, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564–571). San Mateo: Morgan Kaufmann.

Cottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. E. Sharkey (Ed.), Models of cognition: A review of cognitive science. Norwood, NJ: Ablex.

Craw, I., & Cameron, P. (1991). Parameterising images for recognition and reconstruction. In Proceedings of the British Machine Vision Conference (pp. 367–370). Berlin: Springer.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 974–989.

Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and emotion (chap. 3, pp. 45–60). New York: Wiley.

Ekman, P., & Friesen, W. (1976). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists Press.

Ellison, J. W., & Massaro, D. W. (1997). Featural evaluation, integration, and judgment of facial affect. Journal of Experimental Psychology: Human Perception and Performance, 23, 213–226.

Etcoff, N. L., & Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227–240.

Goldstone, R. L. (2000). A neural network model of concept-influenced segmentation. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.

Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representations through category learning. Cognition, 78, 27–43.

Hancock, P. J. B., Bruce, V., & Burton, A. M. (1998). A comparison of two computer-based face recognition systems with human perception of faces. Vision Research, 38, 2277–2288.

Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory and Cognition, 24, 26–40.

Harnad, S. R. (1987). Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.

Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (vol. 1, chap. 3, pp. 77–109). Cambridge: MIT Press.

Jansari, A., Tranel, D., & Adolphs, R. (2000). A valence-specific lateral bias for discriminating emotional facial expressions in free field. Cognition and Emotion, 14, 341–353.

Katsikitis, M. (1997). The classification of facial expressions of emotion: A multidimensional scaling approach. Perception, 26, 613–626.

Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300–311.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.

Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1–12.

Lien, J. J.-J., Kanade, T., Cohn, J. F., & Li, C.-C. (2000). Detection, tracking, and classification of action units in facial expression. Robotics and Autonomous Systems, 31, 131–146.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.

Lyons, M. J., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (pp. 200–205). Los Alamitos, CA: IEEE Computer Society.

Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1357–1362.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.

Matsumoto, D. (1992). American–Japanese cultural differences in the recognition of universal facial expressions. Journal of Cross-Cultural Psychology, 23, 72–84.

Padgett, C., Cottrell, G., & Adolphs, R. (1996). Categorical perception in facial emotion classification. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Padgett, C., & Cottrell, G. W. (1998). A simple neural network models categorical perception of facial expressions. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 806–807). Mahwah, NJ: Erlbaum.

Perrett, D., Hietanen, J., Oram, M., & Benson, P. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London, B: Biological Sciences, 335, 23–30.

Pevtzow, R., & Harnad, S. R. (1997). Warping similarity space in category learning by human subjects: The role of task difficulty. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 189–195). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199–1204.

Rosenblum, M., Yacoob, Y., & Davis, L. S. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7, 1121–1138.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161–1178.

Russell, J. A., & Bullock, M. (1986). Fuzzy concepts and the perception of emotion in facial expressions. Social Cognition, 4, 309–341.

Russell, J. A., Lewicka, M., & Niit, T. (1989). A cross-cultural study of a circumplex model of affect. Journal of Personality and Social Psychology, 57, 848–856.

Sanger, T. (1989). An optimality principle for unsupervised learning. In D. Touretzky (Ed.), Advances in neural information processing systems (vol. 1, pp. 11–19). San Mateo: Morgan Kaufmann.

Schiano, D. J., Ehrlich, S. M., Sheridan, K. M., & Beck, D. M. (2000). Evidence for continuous rather than categorical perception of facial affect. Abstract presented at the 40th Annual Meeting of the Psychonomic Society.

Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44, 229–237.

Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.

Tijsseling, A., & Harnad, S. R. (1997). Warping similarity space in category learning by backprop nets. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 263–269). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.

Woodworth, R. S. (1938). Experimental psychology. New York: Holt.

Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271–313.



to a single continuum, and that the discrimination results, being from a sequential (ABX) task, might reflect a short-term memory phenomenon rather than a perceptual phenomenon. In view of these limitations, Calder, Young, Perrett, Etcoff, and Rowland (1996) extended and replicated the earlier experiments with image-quality morphs. They produced several continua using the Ekman and Friesen (1976) photos (e.g., Ekman and Friesen prototypes and image-quality morph sequences produced in R. A.'s lab; see Figure 1). The authors first replicated Etcoff and Magee's experiments with new stimuli: image-quality happy–sad, angry–sad, and angry–afraid sequences, each using a different actor. A second experiment followed the same procedure, except that each subject's stimuli included morphs from four different actors. In a third experiment, they had subjects categorize stimuli from three different expression continua employing a single actor ("J. J.": afraid–happy, happy–angry, and angry–afraid sequences). Finally, they had subjects perform a simultaneous discrimination (same/different) task. In the second experiment, for the happy–sad transitions, the authors found that artifacts in morphing between a face with teeth and one without (a "graying" of the teeth as the morph moves away from the happy prototype) helped subjects in the discrimination task. However, on the whole, the results were consistent with CP: sharp category boundaries and enhanced discrimination near the boundaries, regardless of whether there was a single continuum or several continua present in the experiment, and regardless of whether the sequential or simultaneous discrimination task was used.

In the psychological literature on categorization, CP effects are usually taken as evidence (1) that object representations are altered during the course of category learning, or (2) that subjects automatically label stimuli even when they are making a purely perceptual discrimination (Goldstone, 2000; Goldstone, Lippa, & Shiffrin, 2001; Pevtzow & Harnad, 1997; Tijsseling & Harnad, 1997). Findings of CP effects in facial expression stimuli raise the possibility that perception of facial expressions is discrete in the same way that category labeling is. Calder et al.'s experiments strengthened the argument for CP of facial expressions, but the authors shy away from Etcoff and Magee's strong interpretation that facial expression category assignment is mandatory. They propose instead that perhaps CP is "an emergent property of population coding in the nervous system," occurring whenever "populations of cells become tuned to distinct categories" (p. 116). In this article, we will show precisely how this can occur, suggesting that Calder et al.'s hypothesis may indeed be correct.

Continuous Emotion Space and Fuzzy Facial Expression Categories

Other theorists hold that the relationship between emotions and facial expressions is not so categorical, discrete, or deterministic as the theory of basic emotions and the evidence for CP suggest. The notion that facial expression perception is discrete is challenged by data showing that similarity judgments of these expressions exhibit a graded, continuous structure.

Schlosberg (1952), following up on the work of his advisor (Woodworth, 1938), found that emotion category ratings and subjects' "errors" (e.g., the likelihood of their labeling a putatively disgusted expression as contempt) could be predicted fairly accurately by arranging the emotion categories around an ellipse whose major axis was pleasantness versus unpleasantness (exemplified by happy and sad expressions) and whose minor axis was attention versus rejection (exemplified by surprise and disgust expressions).

Figure 1. (a) Example prototypical expressions of six basic emotions and a neutral face for actor "J. J." in Ekman and Friesen's POFA (Ekman & Friesen, 1976). (b) Morphs from fear to sadness and happiness to disgust, generated from the corresponding prototypes.

More recently, Russell (1980) proposed a structural theory of emotion concepts with two dimensions: pleasure and arousal. Russell and Bullock (1986) then proposed that emotion categories are best thought of as fuzzy sets. A few facial expressions might have a membership of 1.0 (100%) in one particular category, and others might have intermediate degrees of membership in more than one category. On their view, the facial expression confusion data supporting structural theories like Schlosberg's (1952) reflected the overlap of these fuzzy categories. To test this concept, Russell and Bullock had subjects rate a variety of facial expressions for how well they exemplify categories like "excited," "happy," "sleepy," "mad," and so forth. They found that the categories did indeed overlap, with peak levels of membership for Ekman's basic emotion categories occurring at Ekman's prototypical expressions. A similarity structure analysis (multidimensional scaling [MDS]; see Figure 9 for an example) performed on the subjects' ratings produced two dimensions highly correlated with other subjects' pleasure and arousal ratings. When asked to verify (yes or no) membership of facial expression stimuli in various emotion concept categories, there was a high level of consensus for prototypical expressions and less consensus for boundary cases. Asking subjects to choose exemplars for a set of categories also revealed graded membership functions for the categories. The data thus showed that facial expressions have systematically graded, overlapping membership in various emotion categories. On the basis of this evidence, Russell and Bullock proposed that facial expression interpretation first involves appraising the expression in terms of pleasure and arousal. Then the interpreter may optionally choose a label for the expression. Following Schlosberg, they proposed that finer, more confident judgments require contextual information.

This and additional recent research (Schiano, Ehrlich, Sheridan, & Beck, 2000; Katsikitis, 1997; Carroll & Russell, 1996; Russell, 1980; Russell, Lewicka, & Niit, 1989) suggests that there is a continuous, multidimensional perceptual space underlying facial expression perception, in which some expression categories are more similar to each other than others.

Young et al.'s (1997) "Megamix" Experiments

Young et al. (1997) set out to further test the predictions of multidimensional and categorical accounts of facial expression perception and recognition. They pointed out that 2-D structural theories like Schlosberg's or Russell's predict that some morph transitions between expression pairs should pass near other categories. For instance, if the 2-D representation in Figure 9a adequately characterizes our perception of emotional facial expression stimuli, the midpoint of a morph between happiness and fear should be perceived as a surprised expression. Categorical views of emotional facial expressions, on the other hand, do not necessarily predict confusion along morph transitions; one would expect either sharp transitions between all pairs of categories, or perhaps indeterminate regions between categories where no emotion is perceived. Furthermore, if the categorical view is correct, we might expect CP, in which subjects find it difficult to discriminate between members of the same category and easy to discriminate pairs of stimuli near category boundaries, as found in previous studies (Calder et al., 1996; Etcoff & Magee, 1992). To test these contrasting predictions with an exhaustive set of expression transitions, Young et al. constructed image-quality morph sequences between all pairs of the emotional expression prototypes shown in Figure 1a (Figure 1b shows example morph sequences produced in R. A.'s lab).

In Experiment 1, the authors had subjects identify the emotion category in 10%, 30%, 50%, 70%, and 90% morphs between all pairs of the six prototypical expressions in Figure 1a. The stimuli were presented in random order, and subjects were asked to perform a six-way forced-choice identification. Experiment 2 was identical, except that morphs between the emotional expressions and the neutral prototype were added to the pool, and "neutral" was one of the subjects' choices in a seven-way forced-choice procedure. The results of Experiment 2 for 6 of the 21 possible transitions are reprinted in Figure 2. In both experiments, along every morph transition, subjects' modal response to the stimuli abruptly shifted from one emotion to the other, with no indeterminate regions in between. Consistent with categorical theories of facial expressions, the subjects' modal response was always one of the endpoints of the morph sequence, never a nearby category. Subjects' response times (RTs), however, were more consistent with a multidimensional or weak category account of facial expression perception. As distance from the prototypes increased, RTs increased significantly, presumably reflecting increased uncertainty about category membership near category boundaries.

In Experiment 3, Young et al. explored the extent to which subjects could discriminate pairs of stimuli along six transitions: happiness–surprise, surprise–fear, fear–sadness, sadness–disgust, disgust–anger, and anger–happiness. They had subjects do both a sequential discrimination task (ABX) and a simultaneous discrimination task (same–different). Only the data from the sequential experiments are now available, but the authors report a very strong correlation between the two types of data (r = .82). The sequential data are reprinted in Figure 3. The main finding of the experiment was that subjects had significantly better discrimination performance near the category boundaries than near the expression prototypes, a necessary condition to claim CP. Experiment 3 therefore best supports the categorical view of facial expression perception.

Finally, in Experiment 4, Young et al. set out to determine whether subjects could determine what expression is "mixed in" to a faint morph. Again, a strong categorical view of facial expressions would predict that subjects should not be able to discern what expression a given morph sequence is moving toward until the sequence gets near the category boundary. However, according to continuous theories, subjects should be able to determine what prototype a sequence is moving toward fairly early in the sequence. In the experiment, subjects were asked to decide, given a morph or prototype stimulus, the most apparent emotion, the second-most apparent emotion, and the third-most apparent emotion, using a button box with a button for each of the six emotion categories. After correcting for intrinsic confusability of the emotion categories, the authors found that subjects were significantly above chance at detecting the mixed-in emotion at the 30% level. As opposed to the identification and discrimination data from Experiments 1–3, this result best supports continuous, dimensional accounts of facial expression perception.

In summary, Young et al.'s experiments, rather than settling the issue of categorical versus continuous theories of facial expression perception, found evidence supporting both types of theory. The authors argue that a 2-D account of facial expression perception is unable to account for all of the data, but they also argue that the strong, mandatory categorization view is likewise deficient.

Figure 2. Selected results of Young et al.'s (1997) Experiment 2. (a) Human responses (identification in a seven-way forced-choice task) to six morph sequences: happy–surprised, surprised–afraid, afraid–sad, sad–disgusted, disgusted–angry, and angry–happy. Every transition exhibited a sharp category boundary. (b) Modeling subjects' choice as a random variable distributed according to the pattern of activation at the network's output layer, which entails averaging across the 13 networks' outputs. The model has a correlation (over all 15 transitions) of r = .942 with the human subjects. (c) RTs for the same transitions as in (a). RTs increased significantly with distance from the prototype (M = sad; only 6 out of 21 possible transitions are shown). (d) Predictions of the network "uncertainty" model for the same morph transitions.


Until now, despite several years of research on automatic recognition of facial expressions, no computational model has been shown to simultaneously explain all of these seemingly contradictory data. In the next section, we review the recent progress in computational modeling of facial expression recognition, then introduce a new model that does succeed in explaining the available data.

Computational Modeling of Facial Expression Recognition

Padgett and colleagues were the first to apply computational models toward an understanding of facial expression perception (Cottrell, Dailey, Padgett, & Adolphs, 2000; Padgett, Cottrell, & Adolphs, 1996; Padgett & Cottrell, 1998). Their system employed linear filtering of regions around the eyes and mouth, followed by a multilayer perceptron for classification into emotion categories. The system achieved good recognition performance and was able to explain some of the results on CP of facial expressions with linear "dissolve" sequences. However, the system was unable to account for the sharp category boundaries humans place along image-quality morph transitions, because the linear filters' responses varied too quickly in the presence of the nonlinear changes in morph transitions. It also would have been incapable of predicting recently observed evidence of holistic effects in facial expression recognition (Calder, Young, Keane, & Dean, 2000).

Lyons et al. (1998) created a database of Japanese female models portraying facial expressions of happiness, surprise, fear, anger, sadness, and disgust. They then had subjects rate the degree to which each face portrayed each basic emotion on a 1–5 scale and used the Euclidean distance between the resulting "semantic rating" vectors for each face pair as a measure of dissimilarity. They then used a system inspired by the Dynamic Link Architecture (Lades et al., 1993) to explain the human similarity matrix. Similarities obtained from their Gabor wavelet-based representation of faces were highly correlated with similarities obtained from the human data, and nonmetric multidimensional scaling (MDS) revealed similar underlying 2-D configurations of the stimuli. The authors suggest, in conclusion, that the high-level circumplex constructs proposed by authors like Schlosberg and Russell may in part reflect similarity at low levels in the visual system.

In a recent work, Calder et al. (2001) applied a different computational model to the task of understanding human facial expression perception and recognition. The idea of their system, originally proposed by Craw and Cameron (1991), is to first encode the positions of facial features relative to the average face, then warp each face to the average shape (thus preserving texture but removing between-subject face shape variations). The shape information (the positions of the facial features prior to warping) and the shape-free information (the pixel values after warping) can each be submitted to a separate principal components analysis (PCA) for linear dimensionality reduction, producing separate low-dimensional descriptions of the face's shape and texture. Models based on this approach have been successful in explaining psychological effects in face recognition (Hancock, Burton, & Bruce, 1996, 1998). However, warping the features in an expressive face to the average face shape would seem to destroy some of the information crucial to recognition of that expression, so prior to Calder et al.'s work, it was an open question as to whether such a system could support effective classification of facial expressions. The authors applied the Craw and Cameron PCA method to Ekman and Friesen's (1976) Pictures of Facial Affect (POFA), then used linear discriminant analysis to classify the faces into happy, sad, afraid, angry, surprised, and disgusted categories. The authors found that a representation incorporating both the shape information (feature location PCA) and the shape-free information (warped pixel PCA) best supported facial expression classification (83% accuracy using leave-one-image-out classification for the Ekman and Friesen database). The authors went on to compare their system with human data from psychological experiments. They found that their system behaved similarly to humans in a seven-way forced-choice task and that the principal component representation could be used to predict human ratings of pleasure and arousal (the two dimensions of Russell's circumplex).

Figure 3. Results of Young et al.'s (1997) Experiment 3, for the sequential (ABX) discrimination task. Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels denote which pair of stimuli were being compared (e.g., 70/90 along the transition from sadness to disgust denotes a comparison of a 70% disgust / 30% sadness morph with a 90% disgust / 10% sadness morph). Discrimination was significantly better near the category boundaries than near the prototypes.

Taken together, results with the above three computational models of facial expression recognition begin to hint that subjects' performance in psychological experiments can be explained as a simple consequence of category learning and the statistical properties of the stimuli themselves. Although Calder et al.'s system exhibits a surprising amount of similarity to human performance in a forced-choice experiment, it has not been brought to bear in the debate on multidimensional versus categorical perception of facial expressions. In the present article, we show that a new, more biologically plausible computational model not only exhibits more similarity to human forced-choice performance, but also provides a detailed computational account of data supporting both categorical and multidimensional theories of facial expression recognition and perception. Our simulations consist of constructing and training a simple neural network model (Figure 4); the system uses the same stimuli employed in many psychological experiments, performs many of the same tasks, and can be measured in similar ways as human subjects. The model consists of three levels of processing: perceptual analysis, object representation, and categorization. The next section describes the model in some detail.

The Model

The system is a feed-forward network consisting of three layers common to most object recognition models (Riesenhuber & Poggio, 2000). Its input is a 240 × 292, manually aligned, grayscale face image from Ekman and Friesen's POFA (Ekman & Friesen, 1976). This data set contains photos of 14 actors portraying expressions that are reliably classified as happy, sad, afraid, angry, surprised, or disgusted by naive observers (70% agreement was the threshold for inclusion in the data set, and the overall agreement is 91.6%). The expressions made by one of those actors, "JJ," are shown in Figure 1a. The strong agreement across human subjects, together with the use of these photos in a wide variety of psychological experiments exploring emotional expression perception, makes POFA an ideal training set for our model.

The first layer of the model is a set of neurons whose response properties are similar to those of complex cells in the visual cortex. The so-called Gabor filter (Daugman, 1985) is a standard way to model complex cells in visual recognition systems (Lades et al., 1993). Figure 5 shows the spatial "receptive fields" of several filters. The units essentially perform nonlinear edge detection at five scales and eight orientations. As a feature detector, the Gabor filter has the advantage of moderate translation invariance over pixel-based representations. This means that features can "move" in the receptive field without dramatically affecting the detector's response, making the first layer of our model robust to small image changes. With a 29 × 35 grid and 40 filters at each grid location, we obtain 40,600 model neurons in this layer, which we term the "perceptual" layer.
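To make the perceptual layer concrete, the following Python sketch computes a comparable Gabor magnitude representation. It is an illustration, not the authors' code: the octave spacing of the scales, the envelope-to-wavelength ratio, and the kernel size are assumptions, since the text specifies only the five scales, eight orientations, and 29 × 35 sampling grid.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size, wavelength, theta, sigma):
    """Complex Gabor kernel: a plane wave under a circular Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)          # project onto wave direction
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.exp(1j * 2 * np.pi * xr / wavelength)  # cos + i*sin: a quadrature pair
    return envelope * carrier

def perceptual_layer(img, n_scales=5, n_orients=8, grid=(29, 35)):
    """Phase-insensitive Gabor 'energy' sampled on a coarse grid.

    For a 240 x 292 image this yields 29 * 35 * 40 = 40,600 values;
    in the model, each value is then z-scored over the training set.
    """
    rows = np.linspace(0, img.shape[0] - 1, grid[0]).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, grid[1]).astype(int)
    feats = []
    for s in range(n_scales):
        wavelength = 4.0 * 2**s            # assumed octave spacing of scales
        sigma = 0.65 * wavelength          # assumed envelope/wavelength ratio
        size = int(6 * sigma) | 1          # odd kernel size covering the envelope
        for o in range(n_orients):
            k = gabor_kernel(size, wavelength, o * np.pi / n_orients, sigma)
            resp = fftconvolve(img, k, mode='same')         # complex response map
            feats.append(np.abs(resp)[np.ix_(rows, cols)])  # magnitude = energy
    return np.concatenate([f.ravel() for f in feats])
```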

In order to extract a small set of informative features from this high-dimensional data, we use the equivalent of an "image compression" network (Cottrell, Munro, & Zipser, 1989). This is a backpropagation network that is trained to simply reproduce its input on its output through a narrow channel of hidden units. In order to solve this problem, the hidden units must extract regularities from the data. Since they are fully connected to the inputs, they usually extract global representations we have termed "holons" in previous work (Cottrell & Metcalfe, 1991). We note that this is a biologically plausible means of dimensionality reduction in the sense that it is unsupervised and can be learned by simple networks employing Hebbian learning rules (Sanger, 1989). As a shortcut, we compute this network's weights

Figure 5. Example Gabor filters. The real (cosine-shaped) component is shown relative to the size of the face at all five scales and five of the eight orientations used.

Figure 4. Model schematic.


directly via the equivalent statistical technique of PCA. We chose 50 principal components (hidden units) for this layer based on previous experiments showing this leads to good generalization on POFA. It is important to note that the number of components (the only free parameter in our system) was not tuned to human data. The resulting low-dimensional, object-level representation is specific to the facial expression and identity variations in its input, as is the population of so-called face cells in the inferior temporal cortex (Perrett, Hietanen, Oram, & Benson, 1992). We term this layer of processing the "gestalt" level.

Although PCA is one of the simplest and most efficient methods for coding a set of stimuli, other methods would probably also work. For the current model, it is only important that (1) the code be sensitive to the dimensions of variance in faces, to facilitate learning of expressions, and that (2) the code be low dimensional, to facilitate generalization to novel faces. In a more general object recognition system, a single PCA for all kinds of objects would probably not be appropriate, because the resulting code would not be sparse (a single image normally activates a large number of holistic "units" in a PCA representation, but object-sensitive cells in the visual system appear to respond much more selectively; Logothetis & Sheinberg, 1996). To obtain a good code for a large number of different objects, then, nonnegative matrix factorization (Lee & Seung, 1999) or an independent component mixture model (Lee, Lewicki, & Sejnowski, 2000) might be more appropriate. For the current problem, though, PCA suffices.

The outputs of the gestalt layer are finally categorized into the six "basic" emotions by a simple perceptron with six outputs, one for each emotion. The network is set up so that its outputs can be interpreted as probabilities (they are all positive and sum to 1). However, the system is trained with an "all-or-none" teaching signal. That is, even though only 92% (say) of the human subjects used to vet the POFA database may have responded "happy" to a particular face, the network's teaching signal is 1 for the "happy" unit and 0 for the other five. Thus, the network does not have available to it the confusions that subjects make on the data. We term this layer the "category" level.

While this categorization layer is an extremely simplistic model of human category learning and decision-making processes, we argue that the particular form of classifier is unimportant so long as it is sufficiently powerful and reliable to place the gestalt-level representations into emotion categories; we claim that similar results will obtain with any nonlinear classifier.

We should also point out that the system abstracts away many important aspects of visual processing in the brain, such as eye movements, facial expression dynamics, and size and viewpoint invariance. These complicating factors turn out to be irrelevant for our purposes; as we shall see, despite the simplifying assumptions of the model, it nevertheless accounts for a wide variety of data available from controlled behavioral experiments. This suggests that it is a good first-order approximation of processing at the computational level in the visual system.

The next section reports on the results of several experiments comparing the model's predictions to human data from identical tasks, with special emphasis on Young et al.'s landmark study of categorical effects in facial expression perception (Young et al., 1997). We find (1) that the model and humans find the same expressions difficult or easy to interpret, (2) that when presented with morphs between pairs of expressions, the model and humans place similar sharp category boundaries between the prototypes, (3) that pairwise similarity ratings derived from the model's gestalt-level representations predict human discrimination ability, (4) that the model and humans are similarly sensitive to mixed-in expressions in morph stimuli, and (5) that MDS analysis produces a similar emotional similarity structure from the model and human data.

RESULTS

Comparison of Model and Human Performance

The connections in the model's final layer were trained to classify images of facial expressions from Ekman and Friesen's POFA database (see Figure 1a) (Ekman & Friesen, 1976), the standard data set used in the vast majority of psychological research on facial expression (see Methods for details). The training signal contained no information aside from the expression most agreed upon by human observers; that is, even if 90% of human observers labeled a particular face "happy" and 10% labeled it "surprised," the network was trained as if 100% had said "happy." Again, there is no information in the training signal concerning the similarity of different expression categories. The first measurement we made was the model's ability to generalize in classifying the expressions of previously unseen subjects from the same database. The model's mean generalization performance was 90.0%, which is not significantly different from human performance on the same stimuli (91.6%; t = .587, df = 190, p = .279) (Table 1). Moreover, the rank-order correlation between the model's average accuracy on each category (happiness, surprise, fear, etc.) and the level of human agreement on the same categories was .667 (two-tailed Kendall's tau, p = .044; cf. Table 1). For example, both the model and humans found happy faces easy and fear faces the most difficult to categorize correctly.1 Since the network had about as much experience with one category as another and was not trained on the human response accuracies, this correlation between the relative difficulties of each category is an "emergent" property of the model. Studies of expression recognition consistently find that fear is one of the most difficult expressions to recognize (Katsikitis, 1997; Matsumoto, 1992; Ekman & Friesen, 1976). Our model suggests that this is simply because the fear expression is "inherently" difficult to distinguish from the other five expressions.

How does the network perform the expression recognition task? An examination of the trained network's representation provides some insight. We projected each unit's weight vector back into image space in order to visualize the "ideal" stimulus for each unit in the network (see Methods for details). The results are shown in Figure 6, with and without addition of the average face. In each of the images, each pixel value is the result of applying a regression formula predicting the value of the pixel at that location as a linear function of the 50-element weight vector for the given network output unit. Dark and bright spots indicate the features that excite or inhibit a given output unit, depending on the relative gray values in the region of that feature. Each unit appears to combine evidence for an emotion based upon the presence or absence of a few local features. For instance, for fear, the salient criteria appear to be the eyebrow raise and the eyelid raise, with a smaller contribution of parted lips. The anger unit apparently requires a display in which the eyebrows are not raised. Clearly, the classifier has determined which feature configurations reliably distinguish each expression from the others.

Comparison on CP Data

Several studies have reported CP of facial expressions using morphs between portrayals of the six basic emotions by POFA actor "JJ," whose images have been chosen because his expressions are among the easiest to recognize in the database (Young et al., 1997; Calder et al., 1996). Since our model also finds JJ "easy" (it achieves 100% generalization accuracy on JJ's prototypical expressions), we replicated these studies with 13 networks that had not been trained on JJ.2

We first compared the model's performance with human data from a six-way forced-choice experiment (Young et al., 1997) on 10%, 30%, 50%, 70%, and 90% morphs we constructed (see Methods) between all pairs of JJ's prototypical expressions (see Figure 1b for two such morph sequences). We modeled the subjects' identification choices by letting the outputs of the networks represent the probability mass function for the human subjects' responses. This means that we averaged each of the six outputs of the 13 networks for each face and compared these numbers to the human subjects' response probabilities. Figure 2 compares the networks' forced-choice identification performance with that of humans on the same stimuli. Quantitatively, the model's responses were highly correlated with the human data (r = .942, p < .001), and qualitatively, the model maintained the essential features of the human data: abrupt shifts in classification at emotion category boundaries and few intrusions (identifications of unrelated expressions) along the transitions. Using the same criteria for an intrusion that Young et al. (1997) used, the model predicted 4 intrusions in the 15 morph sequences, compared to 2 out of 15 in the human data. Every one of the intrusions predicted by the model involved fear, which

Table 1. Error Rates for Networks and Level of Agreement for Humans

Expression   Network % Correct   Human % Correct
Happiness    100.0               98.7
Surprise     100.0               92.4
Disgust      100.0               92.3
Anger        89.1                88.9
Sadness      83.3                89.2
Fear         67.2                87.7
Average      90.0                91.6

Note: Network generalization to unseen faces compared with human agreement on the same faces (six-way forced choice). Human data are provided with the POFA database (Ekman & Friesen, 1976).

Figure 6. Network representation of each facial expression. (Top row) Approximation of the optimal input-level stimulus for each facial expression category. (Bottom row) The same approximations with the average face subtracted; dark and bright pixels indicate salient features.


is the least reliably classified expression in the POFA database, for humans and networks (Table 1).

We also compared subjects' RTs to those of the network, in which we model RT as being proportional to uncertainty at the category level. That is, if the network outputs a probability of .9 for happiness, we assume it is more certain, and therefore faster, than when it outputs a probability of .6 for happiness. We thus use the difference between 1 (the maximum possible output) and the largest actual output as our model of the reaction time. As explained earlier, subjects' RTs exhibit a characteristic scalloped shape, with slower RTs near category boundaries. The model also exhibits this pattern, as shown in Figure 2. The reason should be obvious: As a sequence of stimuli approaches a category boundary, the network's output for one category necessarily decreases as the other increases. This results in a longer model RT. The model showed good correlation with human RTs (see Methods) (r = .677, p < .001).

The model also provides insight into human discrimination behavior. How can a model that processes one face at a time discriminate faces? The idea is to imagine that the model is shown one face at a time and stores its representation of each face for comparison. We use the correlation (Pearson's r) between representations as a measure of similarity, and then use 1 minus this number as a measure of discriminability. An interesting aspect of our model is that it has several independent levels of processing (see Figure 4), allowing us to determine which level best accounts for a particular phenomenon. In this case, we compute the similarity between two faces at the pixel level (correlation between the raw images), the perceptual level (correlation between the Gabor filter responses), the gestalt level (correlation between the principal components), or the categorization level (correlation between the six outputs). We compared our measure of discriminability with human performance in Young et al.'s (1997) ABX experiment (see Methods for details). We found the following correlations at each processing level: pixel, r = .35, p = .06; perceptual, r = .52, p = .003; gestalt, r = .65, p < .001 (shown in Figure 7b); category, r = .41, p = .02. Crucially, when the gestalt layer and the categorization layer were combined in a multiple regression, the categorization layer's contribution was insignificant (p = .3), showing that the explanatory power of the model rests with the gestalt layer. According to our model, then, human subjects' improved discrimination near category boundaries in this task is best explained as an effect at the level of gestalt-based representations, which were derived in an unsupervised way from Gabor representations via PCA. This is in sharp contrast to the standard explanation of this increased sensitivity, which is that categorization influences perception (Goldstone et al., 2001; Pevtzow & Harnad, 1997). This suggests that the facial expression morph sequences have natural boundaries, possibly because the endpoints are extremes of certain coordinated muscle group movements. In other domains, such as familiar face classification (Beale & Keil, 1995), the boundaries must arise through learning. In such domains, we expect that discrimination would be best explained in a learned

Figure 7. Discrimination of morph stimuli. (a) Percent correct discrimination of pairs of stimuli in an ABX task (Young et al., 1997). Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels to the left and right of each point show which two stimuli were being compared (e.g., 70/90 along the transition from sadness to disgust denotes a comparison of a 70% disgust / 30% sadness morph with a 90% disgust / 10% sadness morph). Note that better performance occurs near category boundaries than near prototypes (highlighted by the vertical lines). (b) The model's discrimination performance at the gestalt representation level. The model discrimination scores are highly correlated with the human subjects' scores (r = .65, p < .001).


feature layer, as in other models (Goldstone, 2000; Tijsseling & Harnad, 1997).

The data above show that the model naturally accounts for categorical behavior of humans making forced-choice responses to facial expression morph stimuli. In another experiment, Young et al. (1997) found that subjects could reliably detect the expression mixed into a morph, even at the 30% level. Can the model also account for this decidedly noncategorical behavior? In the experiment, subjects were asked to decide, given a morph or prototype stimulus, the most apparent emotion, the second-most apparent emotion, and the third-most apparent emotion (the scores on prototypes were used to normalize for inherent expression similarity). To compare the human responses with the model, we used the top three outputs of the 13 networks and used the same analysis as Young et al. to determine the extent to which the network could detect the mixed-in expression in the morph images (see Methods for details). Figure 8 shows that the model's average sensitivity to the mixed-in expression is almost identical to that of human subjects, even though its behavior seems categorical in forced-choice experiments.

Comparison of Similarity Structures

We next investigated the similarity structure of the representations that the model produced. As before, we calculated the similarity between pairs of faces at each level of the network by computing the correlation between their representations. In order to evaluate the similarity structure qualitatively, we performed MDS both on the human forced-choice responses published by Ekman and Friesen (1976) and on the network's responses to the same stimuli at each level of processing shown in Figure 4. At the pixel level, we found that the structure present in the MDS configuration was based mainly on identity, and as we moved toward the categorization level, the identity-based structure began to break down and expression-related clusters began to emerge. Unsurprisingly, at the network's categorization layer, the configuration was clearly organized by emotion category (Figure 9). Of more interest, however, is that the "ordering" of facial expressions around the human and network MDS configurations is the same, a result unlikely to have arisen by chance

Figure 9. Multidimensional scaling of human and network responses reveals similar dimensions of emotion. Each point represents one of the 96 expressive faces in POFA. (a) 2-D similarity space induced by MDS from the average six-way forced-choice responses of human subjects (Ekman & Friesen, 1976) (stress = 0.218). (b) 2-D similarity space induced by MDS from the average training set responses of networks at their output layers (stress = 0.201). The arrangement of emotions around the circle is the same in both cases. Stress at the pixel level was 0.222; the perceptual level, 0.245; and the gestalt level, 0.286.

Figure 8. Ability to detect the mixed-in expression in morph sequences, for humans and the model. In the human data (dashed lines), subjects chose the far prototype (the faint second expression in the morph) significantly more often than unrelated expressions when that expression's mix ratio was 30% or greater, even though they consistently identified the 70% expression in forced-choice experiments. The model is almost identical to the humans in its ability to detect the secondary expression in a morph image.


(p = 1/60 = .017). Given that the network was never given any similarity information, this is a remarkable result. It suggests that the human similarity structure is a simple result of the inherent confusability of the categories, not necessarily the result of locality in some underlying psychological emotion space, as dimensional theories (e.g., Russell, 1980; Russell et al., 1989) might predict.

To measure the similarity between the human and network category structures quantitatively, we compared the human and model confusion matrices directly. For each pair of expressions, we computed the probability pij that humans or networks respond with emotion i when the intended emotion is j, for all i ≠ j, that is, the probability of confusion between two categories. The correlation between networks and humans on the networks' test sets (the stimuli the networks had never seen before) was .661 (p < .001).3 Thus, there is a close correspondence between human confusion rates and response distributions at the model's categorization level. The system is never instructed to confuse the categories in a way similar to humans; nevertheless, this property emerges naturally from the data.
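The quantitative comparison just described reduces to correlating the off-diagonal entries of two 6 × 6 confusion matrices. A minimal sketch, assuming both matrices are given as arrays with columns indexed by the intended emotion:

```python
import numpy as np

def confusion_correlation(C_model, C_human):
    """Correlate off-diagonal confusion probabilities p(i|j), i != j,
    between model and humans (both 6x6; columns = intended emotion)."""
    mask = ~np.eye(6, dtype=bool)          # keep only the 30 off-diagonal cells
    return np.corrcoef(C_model[mask], C_human[mask])[0, 1]
```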

DISCUSSION

In this article, we have introduced a computational model that mimics some of the important functions of the visual system. The model is simply a pattern classifier incorporating a biologically plausible representation of visual stimuli. We first engineered the system to provide good classification performance on a small database of reliably recognizable facial expressions. The free parameters of the model, such as the number of principal components used in dimensionality reduction, were optimized to maximize the classifier's generalization performance. In contrast to most models in mathematical psychology, which seek to fit a low number of free parameters to maximize a model's agreement with human data, our model is compared to human data directly, without any tuning.

The results of our comparison of the model's performance with human data were nevertheless remarkable. We first found that the relative levels of difficulty for six basic facial expressions of emotion were highly correlated with the levels at which humans agree on those same emotions. For example, humans are best at classifying happy expressions because the smile makes the task easy. The network model likewise finds it extremely easy to detect smiles, because they are obvious visual features that discriminate happy expressions from the rest of the faces in its training set. On the other hand, humans usually find that fear expressions are very difficult to classify. Fear expressions are often classified as surprise, for instance. The network model likewise sometimes classifies fear expressions as surprise. Why are fear expressions so difficult for humans to classify? Could it be that true displays of fear are uncommon in our society? Or that portrayals of fear in the popular culture misguide us as to what fearful people really look like? Or simply that fear expressions are perceptually similar to other expressions? Our results suggest that the latter is the case; culture probably has very little to do with the difficulty of fear expressions. We have shown that perceptual similarity alone is sufficient to account for the relative difficulty of facial expressions.

The agreement between human and network similarity structure analyses (MDS) is also somewhat surprising. As pointed out earlier, Russell and colleagues have found that affective adjectives and affective facial expressions seem to share a common similarity structure. One might postulate that the underlying psychological space reflects the physiological similarity of emotional states, and that when we are asked to classify facial expressions, we are likely to classify a given face as any of the emotions close by in that psychophysiological space. However, we have shown that an emotionless machine without any underlying physiology exhibits a similarity structure very much like that of the humans. This is even more remarkable when one considers that the network was not given any indication of the similarity structure in the training signal, which was always all-or-none, rather than, for example, human subject responses, which would reflect subjects' confusions. Since we nevertheless match the human similarity structure, we have shown that the perceptual similarity of the categories corresponds to the psychological similarity structure of facial expressions. Why would this be? We suggest that evolution did not randomly associate facial expressions with emotional states, but that the expression-to-emotion mapping evolved in tandem with the need to communicate emotional states effectively.

The final question we set out to explore with this work is whether facial expressions are represented continuously or as discrete entities. As explained in the Introduction, the best evidence for discrete representations is that subjects appear to place sharp boundaries between facial expression categories and are better able to discriminate pairs of expressions near category boundaries, suggesting that our perception of the faces is influenced by the existence of the categories. As has been found in other modalities (cf. Ellison & Massaro, 1997), we find that it is unnecessary to posit discrete representations to account for the sharp boundaries and high discrimination scores. The network model places boundaries between categories, as it must to obtain good classification accuracy, but the categories are actually fuzzy and overlapping. The model can be thought of as a biologically plausible working implementation of Russell and Bullock's (1986) theory of emotional facial expression categories as fuzzy concepts. Despite the network's fuzzy, overlapping category concepts, the categories appear sharp when a rule such as "respond with the category most strongly activated by the given facial expression" is applied. A more difficult result to explain in terms of fuzzy categories placed in a continuous multidimensional space is the subjects' high discrimination scores near category boundaries. This result seems to call for an influence of the category boundaries on our perception of the expressions; indeed, that is Young et al.'s (1997) tentative interpretation. To the contrary, we have shown that in our model, discrimination scores best agree with human results at a purely perceptual level, where the category labels have no effect. With respect to image space, the low-dimensional PCA representation actually changes faster in boundary regions than in regions near the prototypes it was derived from. In the context of our model, then, the seemingly contradictory results in different experiments conducted by Young et al. can be explained as simply tapping different computational levels of processing in a visual system organized much like our model.

We have found that our facial expression recognition model is "sufficient" to explain many aspects of human performance in behavioral tasks, but we have no proof of the "necessity" of any of our particular implementation decisions. In fact, we predict that many similar systems would obtain similar results. For instance, a system beginning with derivative-of-Gaussian edge filters (Marr, 1982), whose responses are combined to produce smooth responses to translations (like complex cells), should exhibit the same behavior. Replacing the PCA dimensionality reduction method with, say, factor analysis or the Infomax algorithm for independent components analysis (Bell & Sejnowski, 1995) should not dramatically affect the results. Finally, a category level using Support Vector Machines (Vapnik, 1995) should likewise produce similar behavior. The point of our simulations is that the category boundaries, discriminability, and similarity structures previously seen as being at odds are, in a sense, "present in the data and tasks themselves" and are easily exposed by any reasonable computational model.

METHODS

Network Details

The first step of processing in the model is to filter the image with a rigid 29 × 35 grid of overlapping 2-D Gabor filters (Daugman, 1985), in quadrature pairs, at five scales and eight orientations (some example filters are shown in Figure 5). The quadrature pairs are used to compute a phase-insensitive energy response at each point in the grid. These energy responses, or "Gabor magnitudes," are often used as a simplifying model of the spatial responses of complex cells in the early visual system (Lades et al., 1993). Though the model loses some information about precise feature localization when phase information is thrown away, the overlapping receptive fields compensate for this loss (Hinton, McClelland, & Rumelhart, 1986). Each Gabor magnitude is z-scored (linearly transformed to have mean 0 and variance 1 over the training data) so that each filter contributes equally in the next representation layer.

The second step of processing in the model is to perform linear dimensionality reduction on the 40,600-element Gabor representation via a PCA of the training set. The actual computation is facilitated by Turk and Pentland's (1991) algebraic trick for "eigenfaces." This produces a 50-element representation, typically accounting for approximately 80% of the variance in the training set's Gabor data.
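The trick exploits the fact that with far fewer training images than Gabor dimensions, the leading eigenvectors of the huge covariance matrix can be recovered from the small Gram matrix of the data. A Python sketch of this idea (illustrative only; the array shapes are assumptions):

```python
import numpy as np

def pca_eigenface_trick(X, k=50):
    """PCA via the small n x n Gram matrix (Turk & Pentland, 1991).

    X : (n_samples, n_dims) data matrix with n_samples << n_dims,
        e.g., ~160 training images x 40,600 Gabor magnitudes.
    Returns (components (k, n_dims), projections (n_samples, k), mean).
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    G = Xc @ Xc.T                        # n x n Gram matrix, not d x d covariance
    w, V = np.linalg.eigh(G)             # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]        # keep the k largest
    w, V = w[idx], V[:, idx]
    # Map Gram-matrix eigenvectors to unit-norm covariance eigenvectors.
    comps = (Xc.T @ V / np.sqrt(np.maximum(w, 1e-12))).T
    return comps, Xc @ comps.T, mu
```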

The 50-element vector p output by PCA can then be classified by a simple statistical model. We locally z-score (scale to mean 0, variance 1) each input element, then use a single-layer neural network (a generalized linear model) containing six outputs, one for each of the six "basic" emotions: happiness, sadness, fear, anger, surprise, and disgust. Each of the six units in the network computes a weighted sum $o_i = \sum_j w_{ij} p_j$ of the 50-element input vector; then the "softmax" function $y_i = e^{o_i} / \sum_j e^{o_j}$ is applied to the units' linear activations to obtain a vector of positive values whose sum is 1.0. The network is trained with the relative entropy error function, so that its outputs correspond to the posterior probabilities of the emotion categories given the inputs (Bishop, 1995).
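As an illustration of this category level, here is a compact batch-gradient sketch (not the original implementation; the momentum, weight decay, and early stopping the paper actually uses are omitted, and the learning rate and epoch count are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_category_layer(P, T, lr=0.1, epochs=500, seed=0):
    """Six-way softmax perceptron (generalized linear model).

    P : (n_faces, 50) locally z-scored gestalt vectors
    T : (n_faces, 6) all-or-none (one-hot) teaching signals
    With softmax outputs and the relative entropy error, the error
    gradient reduces to (Y - T), which drives the updates below.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(P.shape[1], 6))
    b = np.zeros(6)
    for _ in range(epochs):
        Y = softmax(P @ W + b)                # o_i, then y_i, as in the text
        W -= lr * P.T @ (Y - T) / len(P)
        b -= lr * (Y - T).mean(axis=0)
    return W, b
```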

Network Training

We train the expression recognition system using leave-one-out cross-validation and early stopping. A given network is trained to minimize output error on all the images of 12 of the 14 actors in POFA, using stochastic gradient descent, momentum, weight decay, and the relative entropy error function (Bishop, 1995). A thirteenth actor's images are used as a "hold out" set to determine when to stop training. Training is stopped when the error on the hold out set is minimized. After training is stopped, we evaluate the network's generalization performance on the remaining (fourteenth) actor. We performed training runs with every possible combination of generalization and hold out sets, for a total of 182 (14 × 13) individual networks.
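Schematically, the partitioning works as below; `train_fn` and `eval_fn` are hypothetical stand-ins for the actual training and evaluation routines, which are not part of the original paper's published code:

```python
def crossvalidate(actors, train_fn, eval_fn):
    """Leave-one-actor-out generalization with a second held-out actor
    for early stopping: 14 test actors x 13 holdout actors = 182 networks.

    train_fn(train_actors, holdout_actor) -> trained network (assumed to
    stop training when holdout error bottoms out);
    eval_fn(net, test_actor) -> generalization accuracy.
    """
    scores = []
    for test in actors:
        for holdout in actors:
            if holdout == test:
                continue
            train = [a for a in actors if a not in (test, holdout)]
            net = train_fn(train, holdout)
            scores.append(eval_fn(net, test))
    return scores   # 182 generalization scores
```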

Network Weight Visualization

The idea is to project each unit's weight vector back into image space in order to visualize what the network is sensitive to in an image. But this is not a well-defined task: though PCA is an easily inverted orthonormal transformation, the Gabor magnitude representation, besides being subsampled, throws away important phase information. As a workaround, we assume that each pixel's value is an approximately linear function of the 50-component gestalt-level (Gabor + PCA) representation. We chose one of the 182 trained networks, and for each pixel location, we used regression to find the linear function of the 50-element gestalt-level representation best predicting the pixel's value over the network's training set. Then, to visualize, say, the classifier's representation of happiness, we apply the regression function directly to the happiness unit's weights. An image was constructed from each of the six units' weight vectors using the regression functions learned from that network's training set.
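In outline, the visualization amounts to fitting one linear regression per pixel and then applying the fitted maps to a unit's weight vector. The sketch below approximates that procedure; the intercept handling, which loosely plays the role of the average face, is an assumption:

```python
import numpy as np

def visualize_unit(P, images, w_unit):
    """Project a category unit's 50-d weight vector into image space.

    P      : (n, 50) gestalt representations of the training images
    images : (n, n_pixels) corresponding grayscale pixel vectors
    w_unit : (50,) weights from the gestalt layer to one emotion unit

    Fits, for every pixel, pixel ~ a . p + c over the training set,
    then applies the fitted maps to the weight vector itself.
    """
    A = np.hstack([P, np.ones((len(P), 1))])            # add intercept column
    coef, *_ = np.linalg.lstsq(A, images, rcond=None)   # (51, n_pixels)
    w = np.append(w_unit, 1.0)                          # include the intercept
    return w @ coef                                     # (n_pixels,) 'ideal' stimulus
```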

Generation of Morphs

We generated morphs from the original stimuli (Ekman & Friesen, 1976) using the Morph program, version 2.5, from Gryphon Software, as described elsewhere (Jansari, Tranel, & Adolphs, 2000). Briefly, for each of the 15 pairs of expression prototypes, corresponding features between the two images were manually specified. The images were then tessellated and linearly transformed, both with respect to pixel position (a smooth warping) and pixel grayscale value (a smooth fade in luminance). The 10%, 30%, 50%, 70%, and 90% blends (see Figure 1b for examples) were retained for each transformation.

Multidimensional Scaling

MDS seeks to embed a set of stimuli in a low-dimensional space in such a way that the distances between points in the low-dimensional space are as faithful as possible to ratings of their similarity (Borg & Lingoes, 1987). We performed MDS on the human responses (Ekman & Friesen, 1976) and on the network's responses to the same stimuli at each level of processing in the network. Each analysis requires a distance (dissimilarity) matrix enumerating how dissimilar each pair of stimuli is. For the human data, we formed a six-element vector containing the probability with which humans gave the labels happy, sad, afraid, angry, surprised, and disgusted for each of the 96 nonneutral photographs in POFA. We obtained a 96 × 96 element similarity matrix from these data by computing the correlation $r_{ij}$ between each pair of six-element vectors. Finally, we converted the resulting similarity matrix into a distance (dissimilarity) matrix with the transform $d_{ij} = (1 - r_{ij})/2$ to obtain values in the range 0–1. For the network, we measure the similarity between two stimuli i and j as the correlation $r_{ij}$ between the representations of the two stimuli at each level of the network, corresponding to similarity at different levels of processing: the pixel level (70,080-element image pixel value vectors), the perceptual level (40,600-element Gabor response patterns), the gestalt level (50-element Gabor/PCA patterns), and the network's output (six-element category level). We ran a classical nonmetric MDS algorithm, SSA-1, due to Guttman and Lingoes (Borg & Lingoes, 1987), on each of the four resulting distance matrices described above, and then plotted the stimuli according to their position in the resulting 2-D configuration.
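Since SSA-1 is not widely available today, a reader reproducing this analysis might substitute another nonmetric MDS routine. The sketch below uses scikit-learn's SMACOF-based implementation as a stand-in, with the distance transform given above:

```python
import numpy as np
from sklearn.manifold import MDS

def mds_embedding(R, seed=0):
    """2-D nonmetric MDS of stimuli from their response vectors.

    R : (n_stimuli, n_features) array, e.g., 96 x 6 response probabilities.
    Distances follow the transform d_ij = (1 - r_ij) / 2, where r_ij is
    the Pearson correlation between the two stimuli's response vectors.
    """
    r = np.corrcoef(R)                  # pairwise correlations between rows
    D = (1.0 - r) / 2.0
    np.fill_diagonal(D, 0.0)
    mds = MDS(n_components=2, metric=False,
              dissimilarity='precomputed', random_state=seed)
    return mds.fit_transform(D)         # (n_stimuli, 2) configuration
```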

Model RTs

We assume a network's RT is directly proportional to the "uncertainty of its maximal output." That is, we define a network's model RT for stimulus i to be $t_i^{\mathrm{model}} = 1 - \max_j y_{ij}$ (the time scale is arbitrary). Here $y_{ij}$ is the network's output for emotion j on stimulus i. This is similar to the standard approach of equating RT with a network's output error (Seidenberg & McClelland, 1989), except that for morph stimuli there is no predetermined "correct" response. For comparison, the human data available are $t_{ij}^{\mathrm{human}}$, the mean RT of subjects responding with emotion j to stimulus i. To compare the model to the humans given these data, we treated each network as an individual subject and, for each morph stimulus, recorded the network's response emotion j and model reaction time $t_i^{\mathrm{model}}$, then averaged $t_i^{\mathrm{model}}$ over all networks making the same response to the stimulus. The quantitative comparison is the linear fit between network predictions and human RTs for all stimulus and response pairs for which both human and network data were available. Young et al.'s criterion for reporting $t_{ij}^{\mathrm{human}}$ was that at least 25% of the human subjects responded with emotion j to stimulus i, and for some of these cases, none of the networks responded with emotion j to stimulus i, so the missing human and network data were disregarded.
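In code, the RT model and the per-response averaging are only a few lines (illustrative; the dictionary layout of the network outputs is an assumption):

```python
import numpy as np
from collections import defaultdict

def model_rt(y):
    """Model RT for one stimulus: 1 minus the maximal output (arbitrary units)."""
    return 1.0 - float(np.max(y))

def mean_rts_by_response(outputs):
    """Average model RTs over all networks making the same response.

    outputs : dict mapping stimulus id -> list of 6-element output
    vectors, one per network. Returns {(stimulus, response): mean RT}.
    """
    acc = defaultdict(list)
    for stim, ys in outputs.items():
        for y in ys:
            acc[(stim, int(np.argmax(y)))].append(model_rt(y))
    return {key: float(np.mean(rts)) for key, rts in acc.items()}
```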

Model Discrimination Scores

We assume discrimination is more difficult the more similar two stimuli are at some level of processing. We use the same measure of similarity as in the MDS procedure: the correlation $r_{ij}$ between the network's representations of stimuli i and j (either at the pixel, perceptual, gestalt, or output level). To convert similarity scores to discrimination scores, we used the transform $d_{ij} = 1 - r_{ij}$. This was measured for each of the 30 pairs of JJ images for which human data were available. The discrimination scores were then averaged over the 13 networks that had not been trained on JJ and compared to the human data.
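The corresponding computation, for any one processing level, is a direct transcription of the transform above:

```python
import numpy as np

def discriminability(rep_i, rep_j):
    """Predicted discriminability of two stimuli at one processing level:
    d_ij = 1 - r_ij, the complement of the Pearson correlation between
    the two stimuli's representation vectors at that level."""
    r = np.corrcoef(rep_i.ravel(), rep_j.ravel())[0, 1]
    return 1.0 - float(r)
```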

Mixed-in Expression Detection

To measure the ability of the model to detect the secondary expression mixed into a morph stimulus, we followed Young et al.'s (1997) methods. For each network's output on a given stimulus, we scored the first, second, and third highest outputs of the networks as a 3, 2, and 1, respectively, and assigned the score 0 to the three remaining outputs. For each morph and prototype stimulus, we averaged these score vectors across all 13 networks. Now, for each of the 30 possible combinations of near prototype (i) and far prototype (j), using the 90%, 70%, and 50% morphs moving from expression i to expression j, we subtracted the score vector for prototype i from the score for each of the three morphs. This eliminates the effect of the intrinsic similarity between JJ's prototype expressions. Now, averaging these score vectors across all 30 sequences, we obtain the data plotted in Figure 8.
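A sketch of this scoring procedure for a single morph stimulus (illustrative; the array layouts are assumptions):

```python
import numpy as np

def rank_scores(y):
    """Score a 6-vector of outputs: 3/2/1 for the three largest, 0 elsewhere."""
    s = np.zeros(6)
    top3 = np.argsort(y)[::-1][:3]      # indices of the three largest outputs
    s[top3] = [3, 2, 1]
    return s

def mixed_in_sensitivity(morph_outputs, proto_outputs):
    """Mean rank score of a morph minus its near prototype's score.

    morph_outputs, proto_outputs : (n_networks, 6) output arrays for one
    morph stimulus and its 100% near-prototype endpoint. Subtracting the
    prototype scores removes the intrinsic similarity between prototypes.
    """
    morph = np.mean([rank_scores(y) for y in morph_outputs], axis=0)
    proto = np.mean([rank_scores(y) for y in proto_outputs], axis=0)
    return morph - proto
```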

Acknowledgments

We thank Stevan Harnad, Bill Kristan, Paul Munro, Alice O'Toole, Terry Sejnowski, and Gary's Unbelievable Research Unit (GURU) for helpful comments on previous versions of this manuscript. We also thank Andrew Young for permission to use the human data plotted in Figures 2, 3, 7a, and 8. We are also grateful to Paul Ekman for providing us with the Pictures of Facial Affect. This research was funded by NIMH grant MH57075 to GWC.

Reprint requests should be sent to Garrison W. Cottrell, UCSD Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA, or via e-mail: gary@cs.ucsd.edu.

Notes

1. For comparison, the order of difficulty for the Calder et al. (2001) PCA model is happiness (98%), fear (97%), surprise (92%), anger (86%), sadness (72%), and disgust (72%).

2. We used 13 networks for technical reasons (see Methods for details).

3. The same comparison of our model with the confusion matrix from the Calder et al. (2001) forced-choice experiment is slightly better, .686, whereas their PCA model's correlation with their subjects' confusion data is somewhat lower, .496.

REFERENCES

Beale, J., & Keil, F. (1995). Categorical effects in the perception of faces. Cognition, 57, 217–239.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer.

Calder, A., Young, A., Perrett, D., Etcoff, N., & Rowland, D. (1996). Categorical perception of morphed facial expressions. Visual Cognition, 3, 81–117.

Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179–1208.

Calder, A. J., Young, A. W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 526–551.

Carroll, J. M., & Russell, J. A. (1996). Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70, 205–218.

Cottrell, G. W., Dailey, M. N., Padgett, C., & Adolphs, R. (2000). Is all face processing holistic? The view from UCSD. In M. Wenger & J. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Context and challenges (pp. 347–395). Mahwah, NJ: Erlbaum.

Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, gender and emotion recognition using holons. In R. P. Lippman, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564–571). San Mateo: Morgan Kaufmann.

Cottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. E. Sharkey (Ed.), Models of cognition: A review of cognitive science. Norwood, NJ.

Craw, I., & Cameron, P. (1991). Parameterising images for recognition and reconstruction. In Proceedings of the British Machine Vision Conference (pp. 367–370). Berlin: Springer.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 974–989.

Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and emotion (chap. 3, pp. 45–60). New York: Wiley.

Ekman, P., & Friesen, W. (1976). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists Press.

Ellison, J. W., & Massaro, D. W. (1997). Featural evaluation, integration, and judgment of facial affect. Journal of Experimental Psychology: Human Perception and Performance, 23, 213–226.

Etcoff, N. L., & Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227–240.

Goldstone, R. L. (2000). A neural network model of concept-influenced segmentation. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.

Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representations through category learning. Cognition, 78, 27–43.

Hancock, P. J. B., Bruce, V., & Burton, A. M. (1998). A comparison of two computer-based face recognition systems with human perception of faces. Vision Research, 38, 2277–2288.

Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory and Cognition, 24, 26–40.

Harnad, S. R. (1987). Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.

Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (vol. 1, chap. 3, pp. 77–109). Cambridge: MIT Press.

Jansari, A., Tranel, D., & Adolphs, R. (2000). A valence-specific lateral bias for discriminating emotional facial expressions in free field. Cognition and Emotion, 14, 341–353.

Katsikitis, M. (1997). The classification of facial expressions of emotion: A multidimensional scaling approach. Perception, 26, 613–626.

Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300–311.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.

Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1–12.

Lien, J. J.-J., Kanade, T., Cohn, J. F., & Li, C.-C. (2000). Detection, tracking, and classification of action units in facial expression. Robotics and Autonomous Systems, 31, 131–146.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.

Lyons, M. J., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (pp. 200–205). Los Alamitos, CA: IEEE Computer Society.

Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1357–1362.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.

Matsumoto, D. (1992). American–Japanese cultural differences in the recognition of universal facial expressions. Journal of Cross-Cultural Psychology, 23, 72–84.

Padgett, C., Cottrell, G., & Adolphs, R. (1996). Categorical perception in facial emotion classification. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Padgett, C., & Cottrell, G. W. (1998). A simple neural network models categorical perception of facial expressions. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 806–807). Mahwah, NJ: Erlbaum.

Perrett, D., Hietanen, J., Oram, M., & Benson, P. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 335, 23–30.

Pevtzow, R., & Harnad, S. R. (1997). Warping similarity space in category learning by human subjects: The role of task difficulty. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 189–195). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199–1204.

Rosenblum, M., Yacoob, Y., & Davis, L. S. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7, 1121–1138.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161–1178.

Russell, J. A., & Bullock, M. (1986). Fuzzy concepts and the perception of emotion in facial expressions. Social Cognition, 4, 309–341.

Russell, J. A., Lewicka, M., & Niit, T. (1989). A cross-cultural study of a circumplex model of affect. Journal of Personality and Social Psychology, 57, 848–856.

Sanger, T. (1989). An optimality principle for unsupervised learning. In D. Touretzky (Ed.), Advances in neural information processing systems (vol. 1, pp. 11–19). San Mateo: Morgan Kaufmann.

Schiano, D. J., Ehrlich, S. M., Sheridan, K. M., & Beck, D. M. (2000). Evidence for continuous rather than categorical perception of facial affect. Abstract presented at the 40th Annual Meeting of the Psychonomic Society.

Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44, 229–237.

Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.

Tijsseling, A., & Harnad, S. R. (1997). Warping similarity space in category learning by backprop nets. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 263–269). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.

Woodworth, R. S. (1938). Experimental psychology. New York: Holt.

Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271–313.


Page 4: EMPATH: A Neural Network that Categorizes Facial Expressionscseweb.ucsd.edu/~mdailey/papers/DaileyEtalCogneuro.pdfconceptual and linguistic systems’ ’ (p. 229). The authors created

axis was pleasantness versus unpleasantness (exemplifiedby happy and sad expressions) and whose minor axiswas attention versus rejection (exemplified by surpriseand disgust expressions)

More recently Russell (1980) proposed a structuraltheory of emotion concepts with two dimensionspleasure and arousal Russell and Bullock (1986) thenproposed that emotion categories are best thought ofas fuzzy sets A few facial expressions might have amembership of 10 (100) in one particular categoryand others might have intermediate degrees of mem-bership in more than one category On their view thefacial expression confusion data supporting structuraltheories like Schlosbergrsquos (1952) reflected the overlapof these fuzzy categories To test this concept Russelland Bullock had subjects rate a variety of facial expres-sions for how well they exemplify categories likelsquolsquoexcitedrsquorsquo lsquolsquohappyrsquorsquo lsquolsquosleepyrsquorsquo lsquolsquomadrsquorsquo and so forth Theyfound that the categories did indeed overlap with peaklevels of membership for Ekmanrsquos basic emotion cate-gories occurring at Ekmanrsquos prototypical expressions Asimilarity structure analysis (multidimensional scaling[MDS]mdashsee Figure 9 for an example) performed onthe subjectsrsquo ratings produced two dimensions highlycorrelated with other subjectsrsquo pleasure and arousalratings When asked to verify (yes or no) membershipof facial expression stimuli in various emotion conceptcategories there was a high level of consensus forprototypical expressions and less consensus for boun-dary cases Asking subjects to choose exemplars for aset of categories also revealed graded membershipfunctions for the categories The data thus showed thatfacial expression categories have systematically gradedoverlapping membership in various emotion categoriesOn the basis of this evidence Russell and Bullockproposed that facial expression interpretation firstinvolves appraising the expression in terms of pleasureand arousal Then the interpreter may optionallychoose a label for the expression Following Schlosbergthey proposed that finer more confident judgmentsrequire contextual information

This and additional recent research (Schiano EhrlichSheridan amp Beck 2000 Katsikitis 1997 Carroll ampRussell 1996 Russell 1980 Russell Lewicka amp Niit1989) suggests that there is a continuous multidimen-sional perceptual space underlying facial expressionperception in which some expression categories aremore similar to each other than others

Young et alrsquos (1997) lsquolsquoMegamixrsquorsquo Experiments

Young et al (1997) set out to further test the predic-tions of multidimensional and categorical accounts offacial expression perception and recognition Theypointed out that 2-D structural theories like Schlos-bergrsquos or Russellrsquos predict that some morph transitionsbetween expression pairs should pass near other

categories For instance if the 2-D representation inFigure 9a adequately characterizes our perception ofemotional facial expression stimuli the midpoint of amorph between happiness and fear should be perceivedas a surprised expression Categorical views of emo-tional facial expressions on the other hand do notnecessarily predict confusion along morph transitionsone would expect either sharp transitions between allpairs of categories or perhaps indeterminate regionsbetween categories where no emotion is perceivedFurthermore if the categorical view is correct we mightexpect CP in which subjects find it difficult to discrim-inate between members of the category and easy todiscriminate pairs of stimuli near category boundariesas found in previous studies (Calder et al 1996 Etcoffamp Magee 1992) To test these contrasting predictionswith an exhaustive set of expression transitions Younget al constructed image-quality morph sequencesbetween all pairs of the emotional expression proto-types shown in Figure 1a (Figure 1b shows examplemorph sequences produced in R Arsquos lab)

In Experiment 1 the authors had subjects identify theemotion category in 10 30 50 70 and 90morphs between all pairs of the six prototypical expres-sions in Figure 1a The stimuli were presented inrandom order and subjects were asked to perform asix-way forced-choice identification Experiment 2 wasidentical except that morphs between the emotionalexpressions and the neutral prototype were added tothe pool and lsquolsquoneutralrsquorsquo was one of the subjectsrsquo choicesin a seven-way forced-choice procedure The results ofExperiment 2 for 6 of the 21 possible transitions arereprinted in Figure 2 In both experiments along everymorph transition subjectsrsquo modal response to the stim-uli abruptly shifted from one emotion to the other withno indeterminate regions or between Consistent withcategorical theories of facial expressions the subjectsrsquomodal response was always one of the endpoints of themorph sequence never a nearby category Subjectsrsquoresponse times (RTs) however were more consistentwith a multidimensional or weak category account offacial expression perception As distance from the pro-totypes increased RTs increased significantly presum-ably reflecting increased uncertainty about categorymembership near category boundaries

In Experiment 3 Young et al explored the extent towhich subjects could discriminate pairs of stimuli alongthe six transitions happinessndashsurprise surprisendashfearfear ndashsadness sadnessndashdisgust disgustndashanger angerndashhappiness They had subjects do both a sequentialdiscrimination task (ABX) and a simultaneous discrim-ination task (samendash different) Only the data fromthe sequential experiments are now available but theauthors report a very strong correlation between the twotypes of data (r = 82) The sequential data are reprintedin Figure 3 The main finding of the experiment wasthat subjects had significantly better discrimination

Dailey et al 1161

performance near the category boundaries than near theexpression prototypes a necessary condition to claimCP Experiment 3 therefore best supports the categoricalview of facial expression perception

Finally in Experiment 4 Young et al set out todetermine whether subjects could determine whatexpression is lsquolsquomixed-inrsquorsquo to a faint morph Again astrong categorical view of facial expressions would pre-dict that subjects should not be able to discern whatexpression a given morph sequence is moving towarduntil the sequence gets near the category boundaryHowever according to continuous theories subjectsshould be able to determine what prototype a sequenceis moving toward fairly early in the sequence In theexperiment subjects were asked to decide given amorph or prototype stimulus the most apparent emo-tion the second-most apparent emotion and the third-

most apparent emotion using a button box with abutton for each of the six emotion categories Aftercorrecting for intrinsic confusability of the emotioncategories the authors found that subjects were sig-nificantly above chance at detecting the mixed-in emo-tion at the 30 level As opposed to the identificationand discrimination data from Experiments 1ndash3 thisresult best supports continuous dimensional accountsof facial expression perception

In summary Young et alrsquos experiments rather thansettling the issue of categorical versus continuous the-ories of facial expression perception found evidencesupporting both types of theory The authors arguethat a 2-D account of facial expression perception isunable to account for all of the data but they alsoargue that the strong mandatory categorization view islikewise deficient

Figure 2 Selected results ofYoung et alrsquos (1997 ) Experi-ment 2 (a) Human responses( identification in a seven-wayforced-choice task) to sixmorph sequences happy ndashsurprised surprised ndashafraidafraidndash sad sadndashdisgusteddisgusted ndashangry and angry ndashhappy Every transition exhib-ited a sharp category boundary(b) Modeling subjectsrsquo choice asa random variable distributedaccording to the pattern ofactivation at the networkrsquos out-put layer which entails aver-aging across the 13 networksrsquooutputs The model has a cor-relation (over all 15 transitions)of r = 942 with the humansubjects (c) RTs for the sametransitions in (a) RTs increasedsignificantly with distance fromthe prototype (M = sad only 6out of 21 possible transitionsare shown) (d) Predictions ofnetwork lsquolsquouncertaintyrsquo rsquo modelfor the same morph transitions

1162 Journal of Cognitive Neuroscience Volume 14 Number 8

Until now despite several years of research on auto-matic recognition of facial expressions no computation-al model has been shown to simultaneously explain allof these seemingly contradictory data In the next sec-tion we review the recent progress in computationalmodeling of facial expression recognition then intro-duce a new model that does succeed in explaining theavailable data

Computational Modeling of Facial ExpressionRecognition

Padgett and colleagues were the first to apply computa-tional models toward an understanding of facial expres-sion perception (Cottrell Dailey Padgett amp Adolphs2000 Padgett Cottrell amp Adolphs 1996 Padgett ampCottrell 1998) Their system employed linear filteringof regions around the eyes and mouth followed by amultilayer perception for classification into emotioncategories The system achieved good recognition per-formance and was able to explain some of the results onCP of facial expressions with linear lsquolsquodissolversquorsquo sequen-ces However the system was unable to account for thesharp category boundaries humans place along image-quality morph transitions because the linear filtersrsquoresponses varied too quickly in the presence of thenonlinear changes in morph transitions It also wouldhave been incapable of predicting recently observedevidence of holistic effects in facial expression recogni-tion (Calder Young Keane amp Dean 2000)

Lyons et al. (1998) created a database of Japanese female models portraying facial expressions of happiness, surprise, fear, anger, sadness, and disgust. They then had subjects rate the degree to which each face portrayed each basic emotion on a 1-5 scale and used Euclidean distance between the resulting "semantic rating" vectors for each face pair as a measure of dissimilarity. They then used a system inspired by the Dynamic Link Architecture (Lades et al., 1993) to explain the human similarity matrix. Similarities obtained from their Gabor wavelet-based representation of faces were highly correlated with similarities obtained from the human data, and nonmetric multidimensional scaling (MDS) revealed similar underlying 2-D configurations of the stimuli. The authors suggest, in conclusion, that the high-level circumplex constructs proposed by authors like Schlosberg and Russell may in part reflect similarity at low levels in the visual system.

In a recent work, Calder et al. (2001) applied a different computational model to the task of understanding human facial expression perception and recognition. The idea of their system, originally proposed by Craw and Cameron (1991), is to first encode the positions of facial features relative to the average face, then warp each face to the average shape (thus preserving texture but removing between-subject face shape variations). The shape information (the positions of the facial features prior to warping) and the shape-free information (the pixel values after warping) can each be submitted to a separate principal components analysis (PCA) for linear dimensionality reduction, producing separate low-dimensional descriptions of the face's shape and texture. Models based on this approach have been successful in explaining psychological effects in face recognition (Hancock, Burton, & Bruce, 1996, 1998). However, warping the features in an expressive face to the average face shape would seem to destroy some of the information crucial to recognition of that expression, so prior to Calder et al.'s work, it was an open question as to whether such a system could support effective classification of facial expressions. The authors applied the Craw and Cameron PCA method to Ekman and Friesen's (1976) Pictures of Facial Affect (POFA), then used linear discriminant analysis to classify the faces into happy, sad, afraid, angry, surprised, and disgusted categories. The authors found that a representation incorporating both the shape information (feature location PCA) and the shape-free information (warped pixel PCA) best supported facial expression classification (83% accuracy

Figure 3. Results of Young et al.'s (1997) Experiment 3 for the sequential (ABX) discrimination task. Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels denote which pair of stimuli were being compared (e.g., 70:90 along the transition from sadness to disgust denotes a comparison of a 70% disgust / 30% sadness morph with a 90% disgust / 10% sadness morph). Discrimination was significantly better near the category boundaries than near the prototypes.


using leave-one-image-out classification for the Ekman and Friesen database). The authors went on to compare their system with human data from psychological experiments. They found that their system behaved similarly to humans in a seven-way forced-choice task and that the principal component representation could be used to predict human ratings of pleasure and arousal (the two dimensions of Russell's circumplex).

Taken together, results with the above three computational models of facial expression recognition begin to hint that subjects' performance in psychological experiments can be explained as a simple consequence of category learning and the statistical properties of the stimuli themselves. Although Calder et al.'s system exhibits a surprising amount of similarity to human performance in a forced-choice experiment, it has not been brought to bear in the debate on multidimensional versus CP of facial expressions. In the present article, we show that a new, more biologically plausible computational model not only exhibits more similarity to human forced-choice performance, but also provides a detailed computational account of data supporting both categorical and multidimensional theories of facial expression recognition and perception. Our simulations consist of constructing and training a simple neural network model (Figure 4); the system uses the same stimuli employed in many psychological experiments, performs many of the same tasks, and can be measured in similar ways as human subjects. The model consists of three levels of processing: perceptual analysis, object representation, and categorization. The next section describes the model in some detail.

The Model

The system is a feed-forward network consisting of three layers common to most object recognition models (Riesenhuber & Poggio, 2000). Its input is a 240 by 292 manually aligned grayscale face image from Ekman and Friesen's POFA (Ekman & Friesen, 1976). This data set contains photos of 14 actors portraying expressions that are reliably classified as happy, sad, afraid, angry, surprised, or disgusted by naive observers (70% agreement was the threshold for inclusion in the data set, and the overall agreement is 91.6%). The expressions made by one of those actors, "J. J.," are shown in Figure 1a. The strong agreement across human subjects, together with the use of these photos in a wide variety of psychological experiments exploring emotional expression perception, makes POFA an ideal training set for our model.

The first layer of the model is a set of neurons whose response properties are similar to those of complex cells in the visual cortex. The so-called Gabor filter (Daugman, 1985) is a standard way to model complex cells in visual recognition systems (Lades et al., 1993). Figure 5 shows the spatial "receptive fields" of several filters. The units essentially perform nonlinear edge detection at five scales and eight orientations. As a feature detector, the Gabor filter has the advantage of moderate translation invariance over pixel-based representations. This means that features can "move" in the receptive field without dramatically affecting the detector's response, making the first layer of our model robust to small image changes. With a 29 by 35 grid and 40 filters at each grid location, we obtain 40,600 model neurons in this layer, which we term the "perceptual" layer.
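
As a rough illustration (not the authors' code), the following minimal Python sketch computes quadrature-pair Gabor energies on a rigid 29 by 35 grid. The text fixes only the grid, scale, and orientation counts; the specific wavelengths, envelope widths, and kernel sizes below are our assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_pair(size, wavelength, theta, sigma):
    """Quadrature (cosine/sine) pair of Gabor kernels at one scale/orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    arg = 2.0 * np.pi * xr / wavelength
    return envelope * np.cos(arg), envelope * np.sin(arg)

def perceptual_layer(image, grid=(29, 35), scales=5, orients=8):
    """Phase-insensitive 'complex cell' energies sampled on a rigid grid."""
    rows = np.linspace(0, image.shape[0] - 1, grid[0]).astype(int)
    cols = np.linspace(0, image.shape[1] - 1, grid[1]).astype(int)
    feats = []
    for s in range(scales):
        wavelength = 4.0 * 2**s              # assumed dyadic scale spacing
        sigma = 0.8 * wavelength             # assumed envelope width
        size = int(6 * sigma) | 1            # odd kernel width
        for o in range(orients):
            even, odd = gabor_pair(size, wavelength, o * np.pi / orients, sigma)
            re = fftconvolve(image, even, mode='same')
            im = fftconvolve(image, odd, mode='same')
            mag = np.hypot(re, im)           # quadrature energy (magnitude)
            feats.append(mag[np.ix_(rows, cols)].ravel())
    return np.concatenate(feats)             # 29 * 35 * 40 = 40,600 values
```

The 40 scale/orientation magnitudes at each of the 1,015 grid points yield the 40,600-element perceptual vector; as described in Methods, each element is then z-scored over the training set.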

In order to extract a small set of informative features from this high-dimensional data, we use the equivalent of an "image compression" network (Cottrell, Munro, & Zipser, 1989). This is a back propagation network that is trained to simply reproduce its input on its output through a narrow channel of hidden units. In order to solve this problem, the hidden units must extract regularities from the data. Since they are fully connected to the inputs, they usually extract global representations we have termed "holons" in previous work (Cottrell & Metcalfe, 1991). We note that this is a biologically plausible means of dimensionality reduction, in the sense that it is unsupervised and can be learned by simple networks employing Hebbian learning rules (Sanger, 1989). As a shortcut, we compute this network's weights

Figure 5. Example Gabor filters. The real (cosine-shaped) component is shown relative to the size of the face at all five scales and five of the eight orientations used.

Figure 4. Model schematic.


directly via the equivalent statistical technique of PCA. We chose 50 principal components (hidden units) for this layer, based on previous experiments showing this leads to good generalization on POFA. It is important to note that the number of components (the only free parameter in our system) was not tuned to human data. The resulting low-dimensional, object-level representation is specific to the facial expression and identity variations in its input, as is the population of so-called face cells in the inferior temporal cortex (Perrett, Hietanen, Oram, & Benson, 1992). We term this layer of processing the "gestalt" level.
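
A minimal sketch of this gestalt layer, assuming each Gabor magnitude is z-scored over the training set and then projected onto the leading 50 principal components (the SVD here stands in for both the autoencoder and the eigenface-style computation given in Methods):

```python
import numpy as np

def gestalt_layer(G_train, k=50):
    """Fit the 'gestalt' code on training Gabor vectors (n x 40,600).
    Returns an encoder mapping one Gabor vector to k numbers."""
    mu = G_train.mean(axis=0)
    sd = G_train.std(axis=0) + 1e-8
    Z = (G_train - mu) / sd                  # z-scored: mean 0, variance 1
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    basis = Vt[:k].T                         # top-k principal directions
    def encode(g):
        return ((g - mu) / sd) @ basis       # 50-element gestalt code
    return encode
```

Because the z-scored columns already have zero mean, the SVD of Z gives the principal components directly.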

Although PCA is one of the simplest and most efficient methods for coding a set of stimuli, other methods would probably also work. For the current model, it is only important that (1) the code be sensitive to the dimensions of variance in faces, to facilitate learning of expressions, and that (2) the code be low dimensional, to facilitate generalization to novel faces. In a more general object recognition system, a single PCA for all kinds of objects would probably not be appropriate, because the resulting code would not be sparse (a single image normally activates a large number of holistic "units" in a PCA representation, but object-sensitive cells in the visual system appear to respond much more selectively; Logothetis & Sheinberg, 1996). To obtain a good code for a large number of different objects, then, nonnegative matrix factorization (Lee & Seung, 1999) or an independent component mixture model (Lee, Lewicki, & Sejnowski, 2000) might be more appropriate. For the current problem, though, PCA suffices.

The outputs of the gestalt layer are finally categorized into the six "basic" emotions by a simple perceptron with six outputs, one for each emotion. The network is set up so that its outputs can be interpreted as probabilities (they are all positive and sum to 1). However, the system is trained with an "all-or-none" teaching signal. That is, even though only 92% (say) of the human subjects used to vet the POFA database may have responded "happy" to a particular face, the network's teaching signal is 1 for the "happy" unit and 0 for the other five. Thus, the network does not have available to it the confusions that subjects make on the data. We term this layer the "category" level.

While this categorization layer is an extremely simplistic model of human category learning and decision-making processes, we argue that the particular form of classifier is unimportant so long as it is sufficiently powerful and reliable to place the gestalt-level representations into emotion categories; we claim that similar results will obtain with any nonlinear classifier.

We should also point out that the system abstracts away many important aspects of visual processing in the brain, such as eye movements, facial expression dynamics, and size and viewpoint invariance. These complicating factors turn out to be irrelevant for our purposes; as we shall see, despite the simplifying assumptions of the model, it nevertheless accounts for a wide variety of data available from controlled behavioral experiments. This suggests that it is a good first-order approximation of processing at the computational level in the visual system.

The next section reports on the results of several experiments comparing the model's predictions to human data from identical tasks, with special emphasis on Young et al.'s landmark study of categorical effects in facial expression perception (Young et al., 1997). We find (1) that the model and humans find the same expressions difficult or easy to interpret, (2) that when presented with morphs between pairs of expressions, the model and humans place similar sharp category boundaries between the prototypes, (3) that pairwise similarity ratings derived from the model's gestalt-level representations predict human discrimination ability, (4) that the model and humans are similarly sensitive to mixed-in expressions in morph stimuli, and (5) that MDS analysis produces a similar emotional similarity structure from the model and human data.

RESULTS

Comparison of Model and Human Performance

The connections in the model's final layer were trained to classify images of facial expressions from Ekman and Friesen's POFA database (see Figure 1a) (Ekman & Friesen, 1976), the standard data set used in the vast majority of psychological research on facial expression (see Methods for details). The training signal contained no information aside from the expression most agreed upon by human observers; that is, even if 90% of human observers labeled a particular face "happy" and 10% labeled it "surprised," the network was trained as if 100% had said "happy." Again, there is no information in the training signal concerning the similarity of different expression categories. The first measurement we made was the model's ability to generalize in classifying the expressions of previously unseen subjects from the same database. The model's mean generalization performance was 90.0%, which is not significantly different from human performance on the same stimuli (91.6%; t = .587, df = 190, p = .279) (Table 1). Moreover, the rank-order correlation between the model's average accuracy on each category (happiness, surprise, fear, etc.) and the level of human agreement on the same categories was .667 (two-tailed Kendall's tau, p = .044; cf. Table 1). For example, both the model and humans found happy faces easy, and fear faces the most difficult, to categorize correctly.¹ Since the network had about as much experience with one category as another and was not trained on the human response accuracies, this correlation between the relative difficulties of each category is an "emergent" property of the


model. Studies of expression recognition consistently find that fear is one of the most difficult expressions to recognize (Katsikitis, 1997; Matsumoto, 1992; Ekman & Friesen, 1976). Our model suggests that this is simply because the fear expression is "inherently" difficult to distinguish from the other five expressions.
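
The reported rank-order correlation can be checked directly from the Table 1 values. The value .667 is consistent with the tau-a variant of Kendall's tau, in which pairs tied on network accuracy still count toward the denominator; the choice of variant is our inference, as the text does not name it.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs;
    tied pairs contribute zero but stay in the denominator."""
    pairs = list(combinations(range(len(x)), 2))
    s = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    return s / len(pairs)

net = [100.0, 100.0, 100.0, 89.1, 83.3, 67.2]   # Table 1, network % correct
hum = [98.7, 92.4, 92.3, 88.9, 89.2, 87.7]      # Table 1, human % correct
print(round(kendall_tau_a(net, hum), 3))         # 10/15 = 0.667
```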

How does the network perform the expression recognition task? An examination of the trained network's representation provides some insight. We projected each unit's weight vector back into image space in order to visualize the "ideal" stimulus for each unit in the network (see Methods for details). The results are shown in Figure 6, with and without addition of the average face. In each of the images, each pixel value is the result of applying a regression formula predicting the value of the pixel at that location as a linear function of the 50-element weight vector for the given network output unit. Dark and bright spots indicate the features that excite or inhibit a given output unit, depending on the relative gray values in the region of that feature. Each unit appears to combine evidence for an emotion based upon the presence or absence of a few local features. For instance, for fear, the salient criteria appear to be the eyebrow raise and the eyelid raise, with a smaller contribution of parted lips. The anger unit apparently requires a display in which the eyebrows are not raised. Clearly, the classifier has determined which feature configurations reliably distinguish each expression from the others.

Comparison on CP Data

Several studies have reported CP of facial expressions using morphs between portrayals of the six basic emotions by POFA actor "J. J.," whose images have been chosen because his expressions are among the easiest to recognize in the database (Young et al., 1997; Calder et al., 1996). Since our model also finds J. J. "easy" (it achieves 100% generalization accuracy on J. J.'s prototypical expressions), we replicated these studies with 13 networks that had not been trained on J. J.²

We first compared the model's performance with human data from a six-way forced-choice experiment (Young et al., 1997) on 10%, 30%, 50%, 70%, and 90% morphs we constructed (see Methods) between all pairs of J. J.'s prototypical expressions (see Figure 1b for two such morph sequences). We modeled the subjects' identification choices by letting the outputs of the networks represent the probability mass function for the human subjects' responses. This means that we averaged each of the six outputs of the 13 networks for each face and compared these numbers to the human subjects' response probabilities; a sketch of this comparison appears after Table 1 below. Figure 2 compares the networks' forced-choice identification performance with that of humans on the same stimuli. Quantitatively, the model's responses were highly correlated with the human data (r = .942, p < .001), and qualitatively, the model maintained the essential features of the human data: abrupt shifts in classification at emotion category boundaries and few intrusions (identifications of unrelated expressions) along the transitions. Using the same criteria for an intrusion that Young et al. (1997) used, the model predicted 4 intrusions in the 15 morph sequences, compared to 2 out of 15 in the human data. Every one of the intrusions predicted by the model involved fear, which

Table 1. Error Rates for Networks and Level of Agreement for Humans

Expression    Network Percentage Correct    Human Percentage Correct
Happiness                100.0                        98.7
Surprise                 100.0                        92.4
Disgust                  100.0                        92.3
Anger                     89.1                        88.9
Sadness                   83.3                        89.2
Fear                      67.2                        87.7
Average                   90.0                        91.6

Network generalization to unseen faces compared with human agreement on the same faces (six-way forced choice). Human data are provided with the POFA database (Ekman & Friesen, 1976).

Figure 6. Network representation of each facial expression. (Top row) Approximation of the optimal input-level stimulus for each facial expression category. (Bottom row) The same approximations with the average face subtracted; dark and bright pixels indicate salient features.


is the least reliably classified expression in the POFA database, for humans and networks (Table 1).
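
The identification comparison just described reduces to averaging the 13 networks' output vectors into one response distribution per morph and correlating the pooled values with the human response proportions. A minimal sketch (array shapes are our assumptions):

```python
import numpy as np

def identification_model(Y, human_probs):
    """Y: 13 x n_stimuli x 6 network outputs on the morph stimuli.
    human_probs: n_stimuli x 6 human response proportions.
    Average the networks into a pmf per stimulus and correlate."""
    model_pmf = Y.mean(axis=0)                               # n_stimuli x 6
    r = np.corrcoef(model_pmf.ravel(), human_probs.ravel())[0, 1]
    return model_pmf, r
```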

We also compared subjects' RTs to those of the network, in which we model RT as being proportional to uncertainty at the category level. That is, if the network outputs a probability of .9 for happiness, we assume it is more certain, and therefore faster, than when it outputs a probability of .6 for happiness. We thus use the difference between 1 (the maximum possible output) and the largest actual output as our model of the reaction time. As explained earlier, subjects' RTs exhibit a characteristic scalloped shape, with slower RTs near category boundaries. The model also exhibits this pattern, as shown in Figure 2. The reason should be obvious: As a sequence of stimuli approaches a category boundary, the network's output for one category necessarily decreases as the other increases. This results in a longer model RT. The model showed good correlation with human RTs (see Methods) (r = .677, p < .001).

The model also provides insight into human discrimination behavior. How can a model that processes one face at a time discriminate faces? The idea is to imagine that the model is shown one face at a time and stores its representation of each face for comparison. We use the correlation (Pearson's r) between representations as a measure of similarity, and then use 1 minus this number as a measure of discriminability. An interesting aspect of our model is that it has several independent levels of processing (see Figure 4), allowing us to determine which level best accounts for a particular phenomenon. In this case, we compute the similarity between two faces at the pixel level (correlation between the raw images), the perceptual level (correlation between the Gabor filter responses), the gestalt level (correlation between the principal components), or the categorization level (correlation between the six outputs). We compared our measure of discriminability with human performance in Young et al.'s (1997) ABX experiment (see Methods for details). We found the following correlations at each processing level: pixel, r = .35, p = .06; perceptual, r = .52, p = .003; gestalt, r = .65, p < .001 (shown in Figure 7b); category, r = .41, p = .02. Crucially, when the gestalt layer and the categorization layer were combined in a multiple regression, the categorization layer's contribution was insignificant (p = .3), showing that the explanatory power of the model rests with the gestalt layer. According to our model, then, human subjects' improved discrimination near category boundaries in this task is best explained as an effect at the level of gestalt-based representations, which were derived in an unsupervised way from Gabor representations via PCA. This is in sharp contrast to the standard explanation of this increased sensitivity, which is that categorization influences perception (Goldstone et al., 2001; Pevtzow & Harnad, 1997). This suggests that the facial expression morph sequences have natural boundaries, possibly because the endpoints are extremes of certain coordinated muscle group movements. In other domains, such as familiar face classification (Beale & Keil, 1995), the boundaries must arise through learning. In such domains, we expect that discrimination would be best explained in a learned

Figure 7. Discrimination of morph stimuli. (a) Percent correct discrimination of pairs of stimuli in an ABX task (Young et al., 1997). Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels to the left and right of each point show which two stimuli were being compared (e.g., 70:90 along the transition from sadness to disgust denotes a comparison of a 70% disgust / 30% sadness morph with a 90% disgust / 10% sadness morph). Note that better performance occurs near category boundaries than near prototypes (highlighted by the vertical lines). (b) The model's discrimination performance at the gestalt representation level. The model discrimination scores are highly correlated with the human subjects' scores (r = .65, p < .001).


feature layer, as in other models (Goldstone, 2000; Tijsseling & Harnad, 1997).

The data above show that the model naturally accounts for categorical behavior of humans making forced-choice responses to facial expression morph stimuli. In another experiment, Young et al. (1997) found that subjects could reliably detect the expression mixed into a morph, even at the 30% level. Can the model also account for this decidedly noncategorical behavior? In the experiment, subjects were asked to decide, given a morph or prototype stimulus, the most apparent emotion, the second-most apparent emotion, and the third-most apparent emotion (the scores on prototypes were used to normalize for inherent expression similarity). To compare the human responses with the model, we used the top three outputs of the 13 networks and used the same analysis as Young et al. to determine the extent to which the network could detect the mixed-in expression in the morph images (see Methods for details). Figure 8 shows that the model's average sensitivity to the mixed-in expression is almost identical to that of human subjects, even though its behavior seems categorical in forced-choice experiments.

Comparison of Similarity Structures

We next investigated the similarity structure of the representations that the model produced. As before, we calculated the similarity between pairs of faces at each level of the network by computing the correlation between their representations. In order to evaluate the similarity structure qualitatively, we performed MDS both on the human forced-choice responses published by Ekman and Friesen (1976) and on the network's responses to the same stimuli at each level of processing shown in Figure 4. At the pixel level, we found that the structure present in the MDS configuration was based mainly on identity, and as we moved toward the categorization level, the identity-based structure began to break down and expression-related clusters began to emerge. Unsurprisingly, at the network's categorization layer, the configuration was clearly organized by emotion category (Figure 9). Of more interest, however, is that the "ordering" of facial expressions around the human and network MDS configurations is the same, a result unlikely to have arisen by chance

Figure 9. Multidimensional scaling of human and network responses reveals similar dimensions of emotion. Each point represents one of the 96 expressive faces in POFA. (a) 2-D similarity space induced by MDS from the average six-way forced-choice responses of human subjects (Ekman & Friesen, 1976) (stress = 0.218). (b) 2-D similarity space induced by MDS from the average training set responses of networks at their output layers (stress = 0.201). The arrangement of emotions around the circle is the same in both cases. Stress at the pixel level was 0.222; the perceptual level, 0.245; and the gestalt level, 0.286.

Figure 8. Ability to detect the mixed-in expression in morph sequences, for humans and the model. In the human data (dashed lines), subjects chose the far prototype (the faint second expression in the morph) significantly more often than unrelated expressions when that expression's mix ratio was 30% or greater, even though they consistently identified the 70% expression in forced-choice experiments. The model is almost identical to the humans in its ability to detect the secondary expression in a morph image.


(there are 5!/2 = 60 distinct circular orderings of six categories, so p = 1/60 = .017). Given that the network was never given any similarity information, this is a remarkable result. It suggests that the human similarity structure is a simple result of the inherent confusability of the categories, not necessarily the result of locality in some underlying psychological emotion space, as dimensional theories (e.g., Russell, 1980; Russell et al., 1989) might predict.

To measure the similarity between the human and network category structures quantitatively, we compared the human and model confusion matrices directly. For each pair of expressions, we computed the probability p_ij that humans or networks respond with emotion i when the intended emotion is j, for all i ≠ j; that is, the probability of confusion between two categories. The correlation between networks and humans on the networks' test sets (the stimuli the networks had never seen before) was .661 (p < .001).³ Thus, there is a close correspondence between human confusion rates and response distributions at the model's categorization level. The system is never instructed to confuse the categories in a way similar to humans; nevertheless, this property emerges naturally from the data.
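
A sketch of this confusion-matrix comparison, assuming 6 by 6 matrices of response probabilities whose columns are indexed by the intended emotion (hypothetical inputs; not the authors' code):

```python
import numpy as np

def confusion_correlation(net_conf, hum_conf):
    """Correlate the off-diagonal entries p(respond i | intended j),
    i != j, of the network and human confusion matrices."""
    off_diag = ~np.eye(net_conf.shape[0], dtype=bool)
    return np.corrcoef(net_conf[off_diag], hum_conf[off_diag])[0, 1]
```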

DISCUSSION

In this article, we have introduced a computational model that mimics some of the important functions of the visual system. The model is simply a pattern classifier incorporating a biologically plausible representation of visual stimuli. We first engineered the system to provide good classification performance on a small database of reliably recognizable facial expressions. The free parameters of the model, such as the number of principal components used in dimensionality reduction, were optimized to maximize the classifier's generalization performance. In contrast to most models in mathematical psychology, which seek to fit a low number of free parameters to maximize a model's agreement with human data, our model is compared to human data directly, without any tuning.

The results of our comparison of the model's performance with human data were nevertheless remarkable. We first found that the relative levels of difficulty for six basic facial expressions of emotion were highly correlated with the levels at which humans agree on those same emotions. For example, humans are best at classifying happy expressions because the smile makes the task easy. The network model likewise finds it extremely easy to detect smiles, because they are obvious visual features that discriminate happy expressions from the rest of the faces in its training set. On the other hand, humans usually find that fear expressions are very difficult to classify. Fear expressions are often classified as surprise, for instance. The network model likewise sometimes classifies fear expressions as surprise. Why are fear expressions so difficult for humans to classify? Could it be that true displays of fear are uncommon in our society? Or that portrayals of fear in the popular culture misguide us as to what fearful people really look like? Or simply that fear expressions are perceptually similar to other expressions? Our results suggest that the latter is the case: culture probably has very little to do with the difficulty of fear expressions. We have shown that perceptual similarity alone is sufficient to account for the relative difficulty of facial expressions.

The agreement between human and network similarity structure analyses (MDS) is also somewhat surprising. As pointed out earlier, Russell and colleagues have found that affective adjectives and affective facial expressions seem to share a common similarity structure. One might postulate that the underlying psychological space reflects the physiological similarity of emotional states, and that when we are asked to classify facial expressions, we are likely to classify a given face as any of the emotions close by in that psychophysiological space. However, we have shown that an emotionless machine without any underlying physiology exhibits a similarity structure very much like that of the humans. This is even more remarkable when one considers that the network was not given any indication of the similarity structure in the training signal, which was always all-or-none, rather than, for example, human subject responses, which would reflect subjects' confusions. Since we nevertheless match the human similarity structure, we have shown that the perceptual similarity of the categories corresponds to the psychological similarity structure of facial expressions. Why would this be? We suggest that evolution did not randomly associate facial expressions with emotional states, but that the expression-to-emotion mapping evolved in tandem with the need to communicate emotional states effectively.

The final question we set out to explore with this work is whether facial expressions are represented continuously or as discrete entities. As explained in the Introduction, the best evidence for discrete representations is that subjects appear to place sharp boundaries between facial expression categories and are better able to discriminate pairs of expressions near category boundaries, suggesting that our perception of the faces is influenced by the existence of the categories. As has been found in other modalities (cf. Ellison & Massaro, 1997), we find that it is unnecessary to posit discrete representations to account for the sharp boundaries and high discrimination scores. The network model places boundaries between categories, as it must to obtain good classification accuracy, but the categories are actually fuzzy and overlapping. The model can be thought of as a biologically plausible working implementation of Russell and Bullock's (1986) theory of emotional facial expression categories as fuzzy concepts. Despite the network's fuzzy, overlapping category concepts, the categories appear


sharp when a rule such as "respond with the category most strongly activated by the given facial expression" is applied. A more difficult result to explain in terms of fuzzy categories placed in a continuous multidimensional space is the subjects' high discrimination scores near category boundaries. This result seems to call for an influence of the category boundaries on our perception of the expressions; indeed, that is Young et al.'s (1997) tentative interpretation. To the contrary, we have shown that in our model, discrimination scores best agree with human results at a purely perceptual level, where the category labels have no effect. With respect to image space, the low-dimensional PCA representation actually changes faster in boundary regions than in regions near the prototypes it was derived from. In the context of our model, then, the seemingly contradictory results in different experiments conducted by Young et al. can be explained as simply tapping different computational levels of processing in a visual system organized much like our model.

We have found that our facial expression recognition model is "sufficient" to explain many aspects of human performance in behavioral tasks, but we have no proof of the "necessity" of any of our particular implementation decisions. In fact, we predict that many similar systems would obtain similar results. For instance, a system beginning with derivative-of-Gaussian edge filters (Marr, 1982), whose responses are combined to produce smooth responses to translations (like complex cells), should exhibit the same behavior. Replacing the PCA dimensionality reduction method with, say, factor analysis or the Infomax algorithm for independent components analysis (Bell & Sejnowski, 1995) should not dramatically affect the results. Finally, a category level using Support Vector Machines (Vapnik, 1995) should likewise produce similar behavior. The point of our simulations is that the category boundaries, discriminability, and similarity structures previously seen as being at odds are, in a sense, "present in the data and tasks themselves" and are easily exposed by any reasonable computational model.

METHODS

Network Details

The first step of processing in the model is to filter the image with a rigid 29 by 35 grid of overlapping 2-D Gabor filters (Daugman, 1985) in quadrature pairs, at five scales and eight orientations (some example filters are shown in Figure 5). The quadrature pairs are used to compute a phase-insensitive energy response at each point in the grid. These nonlinear energy responses, or "Gabor magnitudes," are often used as a simplifying model of the spatial responses of complex cells in the early visual system (Lades et al., 1993). Though the model loses some information about precise feature localization when phase information is thrown away, the overlapping receptive fields compensate for this loss (Hinton, McClelland, & Rumelhart, 1986). Each Gabor magnitude is z-scored (linearly transformed to have mean 0 and variance 1 over the training data) so that each filter contributes equally in the next representation layer.

The second step of processing in the model is to perform linear dimensionality reduction on the 40,600-element Gabor representation via a PCA of the training set. The actual computation is facilitated by Turk and Pentland's (1991) algebraic trick for "eigenfaces." This produces a 50-element representation, typically accounting for approximately 80% of the variance in the training set's Gabor data.
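
A sketch of the Turk and Pentland computation: with only a few hundred training images, one can eigendecompose the small n by n Gram matrix instead of the 40,600 by 40,600 covariance matrix, then map the eigenvectors back to feature space (a minimal illustration, not the authors' code):

```python
import numpy as np

def pca_gram_trick(X, k=50):
    """PCA when samples (n) << features (d), via the n x n Gram matrix
    (Turk & Pentland, 1991). X: n x d Gabor vectors. Returns n x k codes."""
    Xc = X - X.mean(axis=0)                 # center the data
    G = Xc @ Xc.T                           # small n x n Gram matrix
    evals, evecs = np.linalg.eigh(G)        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]
    evals, evecs = evals[order], evecs[:, order]
    # lift Gram eigenvectors to feature-space principal directions
    components = (Xc.T @ evecs) / np.sqrt(np.maximum(evals, 1e-12))
    return Xc @ components                  # projections onto top k PCs
```

If Xc = U S Vᵀ, then G = U S² Uᵀ, so the lifted vectors recover V and the returned projections equal U S, exactly the standard PCA scores.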

The 50-element vector p output by PCA can then be classified by a simple statistical model. We locally z-score (scale to mean 0, variance 1) each input element, then use a single-layer neural network (a generalized linear model) containing six outputs, one for each of the six "basic" emotions: happiness, sadness, fear, anger, surprise, and disgust. Each of the six units in the network computes a weighted sum $o_i = \sum_j w_{ij} p_j$ of the 50-element input vector; then the "softmax" function $y_i = e^{o_i} / \sum_j e^{o_j}$ is applied to the units' linear activations to obtain a vector of positive values whose sum is 1.0. The network is trained with the relative entropy error function, so that its outputs correspond to the posterior probabilities of the emotion categories given the inputs (Bishop, 1995).
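
A minimal implementation of this category level, assuming the inputs are already z-scored and using plain batch gradient descent for brevity (the paper's training uses stochastic gradient descent with momentum and weight decay; bias terms are omitted here):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max(axis=1, keepdims=True))   # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def train_category_level(P, T, lr=0.1, epochs=500, decay=1e-4):
    """P: n x 50 z-scored gestalt codes; T: n x 6 one-hot
    ('all-or-none') teaching signals. Returns the 50 x 6 weights."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(P.shape[1], 6))
    for _ in range(epochs):
        Y = softmax(P @ W)                          # posterior estimates
        # gradient of the relative entropy (cross-entropy) error + decay
        W -= lr * (P.T @ (Y - T) / len(P) + decay * W)
    return W
```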

Network Training

We train the expression recognition system using leave-one-out cross-validation and early stopping. A given network is trained to minimize output error on all the images of 12 of the 14 actors in POFA, using stochastic gradient descent, momentum, weight decay, and the relative entropy error function (Bishop, 1995). A thirteenth actor's images are used as a "hold out" set to determine when to stop training. Training is stopped when the error on the hold out set is minimized. After training is stopped, we evaluate the network's generalization performance on the remaining (fourteenth) actor. We performed training runs with every possible combination of generalization and hold out sets, for a total of 182 (14 by 13) individual networks.
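
The split structure can be written compactly; ordered (test, holdout) pairs over the 14 actors give the 182 runs:

```python
from itertools import permutations

def crossval_splits(actors):
    """Yield (train, holdout, test) splits: 14 * 13 = 182 combinations,
    each training on the 12 remaining actors."""
    for test, holdout in permutations(actors, 2):
        train = [a for a in actors if a not in (test, holdout)]
        yield train, holdout, test
```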

Network Weight Visualization

The idea is to project each unit's weight vector back into image space in order to visualize what the network is sensitive to in an image. But this is not a well-defined task: though PCA is an easily inverted orthonormal transformation, the Gabor magnitude representation, besides being subsampled, throws away important phase information. As a workaround, we assume that each pixel's value is an approximately linear function of


the 50-component gestalt-level (Gabor + PCA) representation. We chose one of the 182 trained networks, and for each pixel location, we used regression to find the linear function of the 50-element gestalt-level representation best predicting the pixel's value over the network's training set. Then, to visualize, say, the classifier's representation of happiness, we apply the regression function directly to the happiness unit's weights. An image was constructed from each of the six units' weight vectors, using the regression functions learned from that network's training set.
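
A sketch of the per-pixel regression, assuming the training images are available as flattened pixel vectors; here the fitted intercept plays the role of the average face added in the top row of Figure 6 (our reading, not a detail the text spells out):

```python
import numpy as np

def visualize_unit(G, X_pixels, w_unit):
    """Approximate the image-space 'ideal stimulus' for one output unit.
    G: n x 50 gestalt codes; X_pixels: n x 70080 pixel vectors;
    w_unit: the unit's 50-element weight vector."""
    G1 = np.hstack([G, np.ones((len(G), 1))])             # add intercept
    coef, *_ = np.linalg.lstsq(G1, X_pixels, rcond=None)  # 51 x 70080
    A, b = coef[:-1], coef[-1]
    # apply the regression directly to the weights; reshape to the
    # 292 x 240 image elsewhere (row/column order is an assumption)
    return w_unit @ A + b
```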

Generation of Morphs

We generated morphs from the original stimuli (Ekman & Friesen, 1976) using the Morph program, version 2.5, from Gryphon Software, as described elsewhere (Jansari, Tranel, & Adolphs, 2000). Briefly, for each of the 15 pairs of expression prototypes, corresponding features between the two images were manually specified. The images were then tessellated and linearly transformed, both with respect to pixel position (a smooth warping) and pixel grayscale value (a smooth fade in luminance). The 10%, 30%, 50%, 70%, and 90% blends (see Figure 1b for examples) were retained for each transformation.

Multidimensional Scaling

MDS seeks to embed a set of stimuli in a low-dimensional space in such a way that the distances between points in the low-dimensional space are as faithful as possible to ratings of their similarity (Borg & Lingoes, 1987). We performed MDS on the human responses (Ekman & Friesen, 1976) and on the network's responses to the same stimuli at each level of processing in the network. Each analysis requires a distance (dissimilarity) matrix enumerating how dissimilar each pair of stimuli is. For the human data, we formed a six-element vector containing the probability with which humans gave the labels happy, sad, afraid, angry, surprised, and disgusted for each of the 96 nonneutral photographs in POFA. We obtained a 96 by 96 element similarity matrix from these data by computing the correlation $r_{ij}$ between each pair of six-element vectors. Finally, we converted the resulting similarity matrix into a distance (dissimilarity) matrix with the transform $d_{ij} = (1 - r_{ij})/2$ to obtain values in the range 0-1. For the network, we measure the similarity between two stimuli i and j as the correlation $r_{ij}$ between the representations of the two stimuli at each level of the network, corresponding to similarity at different levels of processing: the pixel level (70,080-element image pixel value vectors), the perceptual level (40,600-element Gabor response patterns), the gestalt level (50-element Gabor/PCA patterns), and the network's output (six-element category level). We ran a classical nonmetric MDS algorithm, SSA-1, due to Guttman and Lingoes (Borg & Lingoes, 1987), on each of the four resulting distance matrices described above, and then plotted the stimuli according to their position in the resulting 2-D configuration.
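
A sketch of one such analysis, substituting scikit-learn's SMACOF-based nonmetric MDS for SSA-1 (an assumption; stress values will therefore differ in detail from those reported):

```python
import numpy as np
from sklearn.manifold import MDS

def mds_embedding(R):
    """R: 96 x k response vectors (e.g., six-way label probabilities or
    a network layer's responses). Returns a 2-D configuration."""
    C = np.corrcoef(R)                    # pairwise similarities r_ij
    D = (1.0 - C) / 2.0                   # dissimilarities in [0, 1]
    np.fill_diagonal(D, 0.0)
    mds = MDS(n_components=2, metric=False,
              dissimilarity='precomputed', random_state=0)
    return mds.fit_transform(D)
```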

Model RTs

We assume a network's RT is directly proportional to the "uncertainty of its maximal output." That is, we define a network's model RT for stimulus i to be $t_i^{\mathrm{model}} = 1 - \max_j y_{ij}$ (the time scale is arbitrary). Here $y_{ij}$ is the network's output for emotion j on stimulus i. This is similar to the standard approach of equating RT with a network's output error (Seidenberg & McClelland, 1989), except that for morph stimuli, there is no predetermined "correct" response. For comparison, the human data available are $t_{ij}^{\mathrm{human}}$, the mean RT of subjects responding with emotion j to stimulus i. To compare the model to the humans given these data, we treated each network as an individual subject and, for each morph stimulus, recorded the network's response emotion j and model reaction time $t_i^{\mathrm{model}}$, then averaged $t_i^{\mathrm{model}}$ over all networks making the same response to the stimulus. The quantitative comparison is the linear fit between network predictions and human RTs for all stimulus and response pairs for which both human and network data were available. Young et al.'s criterion for reporting $t_{ij}^{\mathrm{human}}$ was that at least 25% of the human subjects responded with emotion j to stimulus i, and for some of these cases, none of the networks responded with emotion j to stimulus i, so the missing human and network data were disregarded.
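
A sketch of the model-RT computation and the averaging over networks that made the same response (array shapes are our assumptions):

```python
import numpy as np

def model_rts(Y):
    """Y: n_networks x n_stimuli x 6 softmax outputs.
    Model RT = 1 - max output; averaged over the networks that gave
    the same top response (arbitrary time scale)."""
    responses = Y.argmax(axis=2)             # each network's chosen emotion
    rts = 1.0 - Y.max(axis=2)                # per-network model RT
    out = {}
    n_nets, n_stim = responses.shape
    for i in range(n_stim):
        for j in np.unique(responses[:, i]):
            sel = responses[:, i] == j
            out[(i, int(j))] = rts[sel, i].mean()
    return out   # keyed by (stimulus, response emotion), like t_ij^human
```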

Model Discrimination Scores

We assume discrimination is more difficult the more similar two stimuli are at some level of processing. We use the same measure of similarity as in the MDS procedure: the correlation $r_{ij}$ between the network's representations of stimuli i and j (either at the pixel, perceptual, gestalt, or output level). To convert similarity scores to discrimination scores, we used the transform $d_{ij} = 1 - r_{ij}$. This was measured for each of the 30 pairs of J. J. images for which human data were available. The discrimination scores were then averaged over the 13 networks that had not been trained on J. J. and compared to the human data.
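
The score for one stimulus pair at one level of processing is a one-liner on the stored representations:

```python
import numpy as np

def discrimination(rep_a, rep_b):
    """d_ij = 1 - r_ij between two stimuli's representations at one
    level (pixel, perceptual, gestalt, or category)."""
    r = np.corrcoef(rep_a.ravel(), rep_b.ravel())[0, 1]
    return 1.0 - r
```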

Mixed-in Expression Detection

To measure the ability of the model to detect the secondary expression mixed into a morph stimulus, we followed Young et al.'s (1997) methods. For each network's output on a given stimulus, we scored the first, second, and third highest outputs of the networks


as a 3, 2, and 1, respectively, and assigned the score 0 to the three remaining outputs. For each morph and prototype stimulus, we averaged these score vectors across all 13 networks. Now, for each of the 30 possible combinations of near prototype (i) and far prototype (j), using the 90%, 70%, and 50% morphs moving from expression i to expression j, we subtracted the score vector for prototype i from the score for each of the three morphs. This eliminates the effect of the intrinsic similarity between J. J.'s prototype expressions. Now, averaging these score vectors across all 30 sequences, we obtain the data plotted in Figure 8.
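
A sketch of this 3-2-1 scoring for a single morph sequence (shapes are our assumptions; the full analysis repeats this over the 30 prototype combinations and averages the results):

```python
import numpy as np

def rank_scores(Y):
    """Score each face's six outputs 3/2/1 for the first/second/third
    largest, 0 elsewhere, then average across networks (first axis)."""
    order = np.argsort(Y, axis=-1)
    s = np.zeros_like(Y)
    np.put_along_axis(s, order[..., -1:], 3.0, axis=-1)
    np.put_along_axis(s, order[..., -2:-1], 2.0, axis=-1)
    np.put_along_axis(s, order[..., -3:-2], 1.0, axis=-1)
    return s.mean(axis=0)

def mixed_in_profile(Y_morphs, Y_proto):
    """Y_morphs: nets x 3 x 6 outputs on the 90/70/50% morphs of one
    sequence; Y_proto: nets x 6 outputs on its near prototype.
    Subtracting the prototype's scores removes intrinsic similarity."""
    return rank_scores(Y_morphs) - rank_scores(Y_proto)
```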

Acknowledgments

We thank Stevan Harnad, Bill Kristan, Paul Munro, Alice O'Toole, Terry Sejnowski, and Gary's Unbelievable Research Unit (GURU) for helpful comments on previous versions of this manuscript. We also thank Andrew Young for permission to use the human data plotted in Figures 2, 3, 7a, and 8. We are also grateful to Paul Ekman for providing us with the Pictures of Facial Affect. This research was funded by NIMH grant MH57075 to GWC.

Reprint requests should be sent to Garrison W. Cottrell, UCSD Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA, or via e-mail: gary@cs.ucsd.edu.

Notes

1. For comparison, the order of difficulty for the Calder et al. (2001) PCA model is happiness (98%), fear (97%), surprise (92%), anger (86%), sadness (72%), and disgust (72%).
2. We used 13 networks for technical reasons (see Methods for details).
3. The same comparison of our model with the confusion matrix from the Calder et al. (2001) forced-choice experiment is slightly better, .686, whereas their PCA model's correlation with their subjects' confusion data is somewhat lower, .496.

REFERENCES

Beale, J., & Keil, F. (1995). Categorical effects in the perception of faces. Cognition, 57, 217-239.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer.

Calder, A., Young, A., Perrett, D., Etcoff, N., & Rowland, D. (1996). Categorical perception of morphed facial expressions. Visual Cognition, 3, 81-117.

Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179-1208.

Calder, A. J., Young, A. W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 526-551.

Carroll, J. M., & Russell, J. A. (1996). Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70, 205-218.

Cottrell, G. W., Dailey, M. N., Padgett, C., & Adolphs, R. (2000). Is all face processing holistic? The view from UCSD. In M. Wenger & J. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Contexts and challenges (pp. 347-395). Mahwah, NJ: Erlbaum.

Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, gender and emotion recognition using holons. In R. P. Lippman, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564-571). San Mateo: Morgan Kaufmann.

Cottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. E. Sharkey (Ed.), Models of cognition: A review of cognitive science. Norwood, NJ.

Craw, I., & Cameron, P. (1991). Parameterising images for recognition and reconstruction. In Proceedings of the British Machine Vision Conference (pp. 367-370). Berlin: Springer.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160-1169.

Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 974-989.

Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and emotion (chap. 3, pp. 45-60). New York: Wiley.

Ekman, P., & Friesen, W. (1976). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists Press.

Ellison, J. W., & Massaro, D. W. (1997). Featural evaluation, integration, and judgment of facial affect. Journal of Experimental Psychology: Human Perception and Performance, 23, 213-226.

Etcoff, N. L., & Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227-240.

Goldstone, R. L. (2000). A neural network model of concept-influenced segmentation. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.

Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representations through category learning. Cognition, 78, 27-43.

Hancock, P. J. B., Bruce, V., & Burton, A. M. (1998). A comparison of two computer-based face recognition systems with human perception of faces. Vision Research, 38, 2277-2288.

Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory and Cognition, 24, 26-40.

Harnad, S. R. (1987). Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.

Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (vol. 1, chap. 3, pp. 77-109). Cambridge: MIT Press.

Jansari, A., Tranel, D., & Adolphs, R. (2000). A valence-specific lateral bias for discriminating emotional facial expressions in free field. Cognition and Emotion, 14, 341-353.

Katsikitis, M. (1997). The classification of facial expressions of emotion: A multidimensional scaling approach. Perception, 26, 613-626.

Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300-311.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788-791.

Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1-12.

Lien, J. J.-J., Kanade, T., Cohn, J. F., & Li, C.-C. (2000). Detection, tracking, and classification of action units in facial expression. Robotics and Autonomous Systems, 31, 131-146.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577-621.

Lyons, M. J., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (pp. 200-205). Los Alamitos, CA: IEEE Computer Society.

Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1357-1362.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.

Matsumoto, D. (1992). American-Japanese cultural differences in the recognition of universal facial expressions. Journal of Cross-Cultural Psychology, 23, 72-84.

Padgett, C., Cottrell, G., & Adolphs, R. (1996). Categorical perception in facial emotion classification. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Padgett, C., & Cottrell, G. W. (1998). A simple neural network models categorical perception of facial expressions. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 806-807). Mahwah, NJ: Erlbaum.

Perrett, D., Hietanen, J., Oram, M., & Benson, P. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London, B: Biological Sciences, 335, 23-30.

Pevtzow, R., & Harnad, S. R. (1997). Warping similarity space in category learning by human subjects: The role of task difficulty. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary workshop on similarity and categorization (pp. 189-195). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199-1204.

Rosenblum, M., Yacoob, Y., & Davis, L. S. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7, 1121-1138.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161-1178.

Russell, J. A., & Bullock, M. (1986). Fuzzy concepts and the perception of emotion in facial expressions. Social Cognition, 4, 309-341.

Russell, J. A., Lewicka, M., & Niit, T. (1989). A cross-cultural study of a circumplex model of affect. Journal of Personality and Social Psychology, 57, 848-856.

Sanger, T. (1989). An optimality principle for unsupervised learning. In D. Touretzky (Ed.), Advances in neural information processing systems (vol. 1, pp. 11-19). San Mateo: Morgan Kaufmann.

Schiano, D. J., Ehrlich, S. M., Sheridan, K. M., & Beck, D. M. (2000). Evidence for continuous rather than categorical perception of facial affect. Abstract presented at the 40th Annual Meeting of the Psychonomic Society.

Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44, 229-237.

Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523-568.

Tijsseling, A., & Harnad, S. R. (1997). Warping similarity space in category learning by backprop nets. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary workshop on similarity and categorization (pp. 263-269). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71-86.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.

Woodworth, R. S. (1938). Experimental psychology. New York: Holt.

Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271-313.

Dailey et al 1173

Page 5: EMPATH: A Neural Network that Categorizes Facial Expressionscseweb.ucsd.edu/~mdailey/papers/DaileyEtalCogneuro.pdfconceptual and linguistic systems’ ’ (p. 229). The authors created

performance near the category boundaries than near theexpression prototypes a necessary condition to claimCP Experiment 3 therefore best supports the categoricalview of facial expression perception

Finally in Experiment 4 Young et al set out todetermine whether subjects could determine whatexpression is lsquolsquomixed-inrsquorsquo to a faint morph Again astrong categorical view of facial expressions would pre-dict that subjects should not be able to discern whatexpression a given morph sequence is moving towarduntil the sequence gets near the category boundaryHowever according to continuous theories subjectsshould be able to determine what prototype a sequenceis moving toward fairly early in the sequence In theexperiment subjects were asked to decide given amorph or prototype stimulus the most apparent emo-tion the second-most apparent emotion and the third-

most apparent emotion using a button box with abutton for each of the six emotion categories Aftercorrecting for intrinsic confusability of the emotioncategories the authors found that subjects were sig-nificantly above chance at detecting the mixed-in emo-tion at the 30 level As opposed to the identificationand discrimination data from Experiments 1ndash3 thisresult best supports continuous dimensional accountsof facial expression perception

In summary Young et alrsquos experiments rather thansettling the issue of categorical versus continuous the-ories of facial expression perception found evidencesupporting both types of theory The authors arguethat a 2-D account of facial expression perception isunable to account for all of the data but they alsoargue that the strong mandatory categorization view islikewise deficient

Figure 2 Selected results ofYoung et alrsquos (1997 ) Experi-ment 2 (a) Human responses( identification in a seven-wayforced-choice task) to sixmorph sequences happy ndashsurprised surprised ndashafraidafraidndash sad sadndashdisgusteddisgusted ndashangry and angry ndashhappy Every transition exhib-ited a sharp category boundary(b) Modeling subjectsrsquo choice asa random variable distributedaccording to the pattern ofactivation at the networkrsquos out-put layer which entails aver-aging across the 13 networksrsquooutputs The model has a cor-relation (over all 15 transitions)of r = 942 with the humansubjects (c) RTs for the sametransitions in (a) RTs increasedsignificantly with distance fromthe prototype (M = sad only 6out of 21 possible transitionsare shown) (d) Predictions ofnetwork lsquolsquouncertaintyrsquo rsquo modelfor the same morph transitions

1162 Journal of Cognitive Neuroscience Volume 14 Number 8

Until now despite several years of research on auto-matic recognition of facial expressions no computation-al model has been shown to simultaneously explain allof these seemingly contradictory data In the next sec-tion we review the recent progress in computationalmodeling of facial expression recognition then intro-duce a new model that does succeed in explaining theavailable data

Computational Modeling of Facial ExpressionRecognition

Padgett and colleagues were the first to apply computa-tional models toward an understanding of facial expres-sion perception (Cottrell Dailey Padgett amp Adolphs2000 Padgett Cottrell amp Adolphs 1996 Padgett ampCottrell 1998) Their system employed linear filteringof regions around the eyes and mouth followed by amultilayer perception for classification into emotioncategories The system achieved good recognition per-formance and was able to explain some of the results onCP of facial expressions with linear lsquolsquodissolversquorsquo sequen-ces However the system was unable to account for thesharp category boundaries humans place along image-quality morph transitions because the linear filtersrsquoresponses varied too quickly in the presence of thenonlinear changes in morph transitions It also wouldhave been incapable of predicting recently observedevidence of holistic effects in facial expression recogni-tion (Calder Young Keane amp Dean 2000)

Lyons et al. (1998) created a database of Japanese female models portraying facial expressions of happiness, surprise, fear, anger, sadness, and disgust. They then had subjects rate the degree to which each face portrayed each basic emotion on a 1-5 scale and used Euclidean distance between the resulting "semantic rating" vectors for each face pair as a measure of dissimilarity. They then used a system inspired by the Dynamic Link Architecture (Lades et al., 1993) to explain the human similarity matrix. Similarities obtained from their Gabor wavelet-based representation of faces were highly correlated with similarities obtained from the human data, and nonmetric multidimensional scaling (MDS) revealed similar underlying 2-D configurations of the stimuli. The authors suggest in conclusion that the high-level circumplex constructs proposed by authors like Schlosberg and Russell may in part reflect similarity at low levels in the visual system.

In a recent work, Calder et al. (2001) applied a different computational model to the task of understanding human facial expression perception and recognition. The idea of their system, originally proposed by Craw and Cameron (1991), is to first encode the positions of facial features relative to the average face, then warp each face to the average shape (thus preserving texture but removing between-subject face shape variations). The shape information (the positions of the facial features prior to warping) and the shape-free information (the pixel values after warping) can each be submitted to a separate principal components analysis (PCA) for linear dimensionality reduction, producing separate low-dimensional descriptions of the face's shape and texture. Models based on this approach have been successful in explaining psychological effects in face recognition (Hancock, Burton, & Bruce, 1996, 1998). However, warping the features in an expressive face to the average face shape would seem to destroy some of the information crucial to recognition of that expression, so prior to Calder et al.'s work, it was an open question as to whether such a system could support effective classification of facial expressions. The authors applied the Craw and Cameron PCA method to Ekman and Friesen's (1976) Pictures of Facial Affect (POFA), then used linear discriminant analysis to classify the faces into happy, sad, afraid, angry, surprised, and disgusted categories. The authors found that a representation incorporating both the shape information (feature location PCA) and the shape-free information (warped pixel PCA) best supported facial expression classification (83% accuracy using leave-one-image-out classification on the Ekman and Friesen database).

Figure 3. Results of Young et al.'s (1997) Experiment 3 for the sequential (ABX) discrimination task. Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels denote which pair of stimuli were being compared (e.g., 70/90 along the transition from sadness to disgust denotes a comparison of a 70% disgust / 30% sadness morph with a 90% disgust / 10% sadness morph). Discrimination was significantly better near the category boundaries than near the prototypes.


The authors went on to compare their system with human data from psychological experiments. They found that their system behaved similarly to humans in a seven-way forced-choice task and that the principal component representation could be used to predict human ratings of pleasure and arousal (the two dimensions of Russell's circumplex).

Taken together, results with the above three computational models of facial expression recognition begin to hint that subjects' performance in psychological experiments can be explained as a simple consequence of category learning and the statistical properties of the stimuli themselves. Although Calder et al.'s system exhibits a surprising amount of similarity to human performance in a forced-choice experiment, it has not been brought to bear in the debate on multidimensional versus CP of facial expressions. In the present article, we show that a new, more biologically plausible computational model not only exhibits more similarity to human forced-choice performance but also provides a detailed computational account of data supporting both categorical and multidimensional theories of facial expression recognition and perception. Our simulations consist of constructing and training a simple neural network model (Figure 4); the system uses the same stimuli employed in many psychological experiments, performs many of the same tasks, and can be measured in similar ways as human subjects. The model consists of three levels of processing: perceptual analysis, object representation, and categorization. The next section describes the model in some detail.

The Model

The system is a feed-forward network consisting of three layers common to most object recognition models (Riesenhuber & Poggio, 2000). Its input is a 240 by 292, manually aligned, grayscale face image from Ekman and Friesen's POFA (Ekman & Friesen, 1976). This data set contains photos of 14 actors portraying expressions that are reliably classified as happy, sad, afraid, angry, surprised, or disgusted by naive observers (70% agreement was the threshold for inclusion in the data set, and the overall agreement is 91.6%). The expressions made by one of those actors, "JJ", are shown in Figure 1a. The strong agreement across human subjects, together with the use of these photos in a wide variety of psychological experiments exploring emotional expression perception, makes POFA an ideal training set for our model.

The first layer of the model is a set of neurons whose response properties are similar to those of complex cells in the visual cortex. The so-called Gabor filter (Daugman, 1985) is a standard way to model complex cells in visual recognition systems (Lades et al., 1993). Figure 5 shows the spatial "receptive fields" of several filters. The units essentially perform nonlinear edge detection at five scales and eight orientations. As a feature detector, the Gabor filter has the advantage of moderate translation invariance over pixel-based representations. This means that features can "move" in the receptive field without dramatically affecting the detector's response, making the first layer of our model robust to small image changes. With a 29 by 35 grid and 40 filters at each grid location, we obtain 40,600 model neurons in this layer, which we term the "perceptual" layer.
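For concreteness, here is a minimal Python/numpy sketch of such a perceptual layer. The 29 × 35 grid and the 5 × 8 = 40 filters per grid point follow the text; the wavelengths, bandwidths, and grid sampling are illustrative assumptions (the exact filter parameters are not given here), and the function names are hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_pair(wavelength, theta, sigma):
    """Quadrature (cosine/sine) Gabor kernels for one scale and orientation."""
    half = int(3 * sigma)
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)            # rotated coordinate
    env = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))   # Gaussian envelope
    arg = 2 * np.pi * xr / wavelength
    return env * np.cos(arg), env * np.sin(arg)

def perceptual_layer(img, grid=(29, 35), n_scales=5, n_orient=8):
    """40 Gabor magnitudes per grid point: 29 * 35 * 40 = 40,600 responses."""
    rows = np.linspace(0, img.shape[0] - 1, grid[0]).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, grid[1]).astype(int)
    feats = []
    for s in range(n_scales):
        lam = 4.0 * 2 ** s                     # assumed dyadic scale spacing
        for o in range(n_orient):
            ck, sk = gabor_pair(lam, o * np.pi / n_orient, sigma=lam / 2)
            re = fftconvolve(img, ck, mode="same")
            im = fftconvolve(img, sk, mode="same")
            mag = np.hypot(re, im)             # phase-insensitive energy
            feats.append(mag[np.ix_(rows, cols)].ravel())
    return np.concatenate(feats)               # shape (40600,)

# e.g., perceptual_layer(np.random.rand(292, 240)) -> 40,600-element vector
```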

In order to extract a small set of informative features from this high-dimensional data, we use the equivalent of an "image compression" network (Cottrell, Munro, & Zipser, 1989). This is a backpropagation network that is trained to simply reproduce its input on its output through a narrow channel of hidden units. In order to solve this problem, the hidden units must extract regularities from the data. Since they are fully connected to the inputs, they usually extract global representations we have termed "holons" in previous work (Cottrell & Metcalfe, 1991). We note that this is a biologically plausible means of dimensionality reduction in the sense that it is unsupervised and can be learned by simple networks employing Hebbian learning rules (Sanger, 1989). As a shortcut, we compute this network's weights directly via the equivalent statistical technique of PCA.

Figure 5. Example Gabor filters. The real (cosine-shaped) component is shown relative to the size of the face, at all five scales and five of the eight orientations used.

Figure 4. Model schematic.


We chose 50 principal components (hidden units) for this layer, based on previous experiments showing this leads to good generalization on POFA. It is important to note that the number of components (the only free parameter in our system) was not tuned to human data. The resulting low-dimensional, object-level representation is specific to the facial expression and identity variations in its input, as is the population of so-called face cells in the inferior temporal cortex (Perrett, Hietanen, Oram, & Benson, 1992). We term this layer of processing the "gestalt" level.
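A minimal sketch of this shortcut, assuming the training set's 40,600-element Gabor vectors are stacked as rows of a matrix X; `fit_pca` and `gestalt_layer` are hypothetical helper names.

```python
import numpy as np

def fit_pca(X, k=50):
    """PCA of data matrix X (n_samples x 40,600), keeping k components.
    Equivalent to the weights a linear autoencoder would learn."""
    mu = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]                  # k principal directions ("holons")

def gestalt_layer(x, mu, components):
    """Project one Gabor vector onto the 50 holon axes."""
    return components @ (x - mu)       # shape (50,)
```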

Although PCA is one of the simplest and most efficient methods for coding a set of stimuli, other methods would probably also work. For the current model, it is only important that (1) the code be sensitive to the dimensions of variance in faces, to facilitate learning of expressions, and that (2) the code be low dimensional, to facilitate generalization to novel faces. In a more general object recognition system, a single PCA for all kinds of objects would probably not be appropriate, because the resulting code would not be sparse (a single image normally activates a large number of holistic "units" in a PCA representation, but object-sensitive cells in the visual system appear to respond much more selectively; Logothetis & Sheinberg, 1996). To obtain a good code for a large number of different objects, then, nonnegative matrix factorization (Lee & Seung, 1999) or an independent component mixture model (Lee, Lewicki, & Sejnowski, 2000) might be more appropriate. For the current problem, though, PCA suffices.

The outputs of the gestalt layer are finally categorized into the six "basic" emotions by a simple perceptron with six outputs, one for each emotion. The network is set up so that its outputs can be interpreted as probabilities (they are all positive and sum to 1). However, the system is trained with an "all-or-none" teaching signal. That is, even though only 92% (say) of the human subjects used to vet the POFA database may have responded "happy" to a particular face, the network's teaching signal is 1 for the "happy" unit and 0 for the other five. Thus, the network does not have available to it the confusions that subjects make on the data. We term this layer the "category" level.
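A sketch of the category level and its all-or-none teaching signal, assuming a simple six-row weight matrix W; the names are illustrative, not the paper's own code.

```python
import numpy as np

def category_layer(p, W):
    """Six linear units followed by softmax: positive outputs summing to 1."""
    o = W @ p                     # six weighted sums of the gestalt vector p
    e = np.exp(o - o.max())       # subtract max for numerical stability
    return e / e.sum()

# All-or-none target: even if only 92% of human subjects said "happy",
# the teaching signal is 1 for the modal label and 0 elsewhere.
EMOTIONS = ["happiness", "sadness", "fear", "anger", "surprise", "disgust"]
target = np.zeros(len(EMOTIONS))
target[EMOTIONS.index("happiness")] = 1.0
```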

While this categorization layer is an extremely simplistic model of human category learning and decision-making processes, we argue that the particular form of classifier is unimportant so long as it is sufficiently powerful and reliable to place the gestalt-level representations into emotion categories; we claim that similar results will obtain with any nonlinear classifier.

We should also point out that the system abstracts away many important aspects of visual processing in the brain, such as eye movements, facial expression dynamics, and size and viewpoint invariance. These complicating factors turn out to be irrelevant for our purposes; as we shall see, despite the simplifying assumptions of the model, it nevertheless accounts for a wide variety of data available from controlled behavioral experiments. This suggests that it is a good first-order approximation of processing at the computational level in the visual system.

The next section reports on the results of several experiments comparing the model's predictions to human data from identical tasks, with special emphasis on Young et al.'s landmark study of categorical effects in facial expression perception (Young et al., 1997). We find (1) that the model and humans find the same expressions difficult or easy to interpret; (2) that, when presented with morphs between pairs of expressions, the model and humans place similar sharp category boundaries between the prototypes; (3) that pairwise similarity ratings derived from the model's gestalt-level representations predict human discrimination ability; (4) that the model and humans are similarly sensitive to mixed-in expressions in morph stimuli; and (5) that MDS analysis produces a similar emotional similarity structure from the model and human data.

RESULTS

Comparison of Model and Human Performance

The connections in the model's final layer were trained to classify images of facial expressions from Ekman and Friesen's POFA database (see Figure 1a) (Ekman & Friesen, 1976), the standard data set used in the vast majority of psychological research on facial expression (see Methods for details). The training signal contained no information aside from the expression most agreed upon by human observers: that is, even if 90% of human observers labeled a particular face "happy" and 10% labeled it "surprised", the network was trained as if 100% had said "happy". Again, there is no information in the training signal concerning the similarity of different expression categories. The first measurement we made was the model's ability to generalize in classifying the expressions of previously unseen subjects from the same database. The model's mean generalization performance was 90.0%, which is not significantly different from human performance on the same stimuli (91.6%; t = .587, df = 190, p = .279) (Table 1). Moreover, the rank-order correlation between the model's average accuracy on each category (happiness, surprise, fear, etc.) and the level of human agreement on the same categories was .667 (two-tailed Kendall's tau, p = .044; cf. Table 1). For example, both the model and humans found happy faces easy and fear faces the most difficult to categorize correctly.[1] Since the network had about as much experience with one category as another and was not trained on the human response accuracies, this correlation between the relative difficulties of each category is an "emergent" property of the model.


Studies of expression recognition consistently find that fear is one of the most difficult expressions to recognize (Katsikitis, 1997; Matsumoto, 1992; Ekman & Friesen, 1976). Our model suggests that this is simply because the fear expression is "inherently" difficult to distinguish from the other five expressions.

How does the network perform the expression recognition task? An examination of the trained network's representation provides some insight. We projected each unit's weight vector back into image space in order to visualize the "ideal" stimulus for each unit in the network (see Methods for details). The results are shown in Figure 6, with and without addition of the average face. In each of the images, each pixel value is the result of applying a regression formula predicting the value of the pixel at that location as a linear function of the 50-element weight vector for the given network output unit. Dark and bright spots indicate the features that excite or inhibit a given output unit, depending on the relative gray values in the region of that feature. Each unit appears to combine evidence for an emotion based upon the presence or absence of a few local features. For instance, for fear, the salient criteria appear to be the eyebrow raise and the eyelid raise, with a smaller contribution of parted lips.

The anger unit apparently requires a display in which the eyebrows are not raised. Clearly, the classifier has determined which feature configurations reliably distinguish each expression from the others.

Comparison on CP Data

Several studies have reported CP of facial expressions using morphs between portrayals of the six basic emotions by POFA actor "JJ", whose images have been chosen because his expressions are among the easiest to recognize in the database (Young et al., 1997; Calder et al., 1996). Since our model also finds JJ "easy" (it achieves 100% generalization accuracy on JJ's prototypical expressions), we replicated these studies with 13 networks that had not been trained on JJ.[2]

We first compared the model's performance with human data from a six-way forced-choice experiment (Young et al., 1997) on 10%, 30%, 50%, 70%, and 90% morphs we constructed (see Methods) between all pairs of JJ's prototypical expressions (see Figure 1b for two such morph sequences). We modeled the subjects' identification choices by letting the outputs of the networks represent the probability mass function for the human subjects' responses. This means that we averaged each of the six outputs of the 13 networks for each face and compared these numbers to the human subjects' response probabilities. Figure 2 compares the networks' forced-choice identification performance with that of humans on the same stimuli. Quantitatively, the model's responses were highly correlated with the human data (r = .942, p < .001), and qualitatively, the model maintained the essential features of the human data: abrupt shifts in classification at emotion category boundaries and few intrusions (identifications of unrelated expressions) along the transitions. Using the same criteria for an intrusion that Young et al. (1997) used, the model predicted 4 intrusions in the 15 morph sequences, compared to 2 out of 15 in the human data. Every one of the intrusions predicted by the model involved fear, which is the least reliably classified expression in the POFA database for humans and networks (Table 1).

Table 1. Error Rates for Networks and Level of Agreement for Humans

Expression    Network Percentage Correct    Human Percentage Correct

Happiness     100.0                          98.7
Surprise      100.0                          92.4
Disgust       100.0                          92.3
Anger          89.1                          88.9
Sadness        83.3                          89.2
Fear           67.2                          87.7
Average        90.0                          91.6

Network generalization to unseen faces compared with human agreement on the same faces (six-way forced choice). Human data are provided with the POFA database (Ekman & Friesen, 1976).

Figure 6. Network representation of each facial expression. (Top row) Approximation of the optimal input-level stimulus for each facial expression category. (Bottom row) The same approximations with the average face subtracted; dark and bright pixels indicate salient features.



We also compared subjects' RTs to those of the network, in which we model RT as being proportional to uncertainty at the category level. That is, if the network outputs a probability of .9 for happiness, we assume it is more certain, and therefore faster, than when it outputs a probability of .6 for happiness. We thus use the difference between 1 (the maximum possible output) and the largest actual output as our model of the reaction time. As explained earlier, subjects' RTs exhibit a characteristic scalloped shape, with slower RTs near category boundaries. The model also exhibits this pattern, as shown in Figure 2. The reason should be obvious: as a sequence of stimuli approaches a category boundary, the network's output for one category necessarily decreases as the other increases. This results in a longer model RT. The model showed good correlation with human RTs (see Methods) (r = .677, p < .001).
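The RT model reduces to a single line; a sketch, with `outputs` standing for the six softmax values for one stimulus:

```python
def model_rt(outputs):
    """Uncertainty-based RT proxy: 1 minus the largest output.
    An output of .9 gives RT .1 (fast); .6 gives RT .4 (slow)."""
    return 1.0 - max(outputs)
```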

The model also provides insight into human discrimination behavior. How can a model that processes one face at a time discriminate faces? The idea is to imagine that the model is shown one face at a time and stores its representation of each face for comparison. We use the correlation (Pearson's r) between representations as a measure of similarity, and then use 1 minus this number as a measure of discriminability. An interesting aspect of our model is that it has several independent levels of processing (see Figure 4), allowing us to determine which level best accounts for a particular phenomenon.

In this case, we compute the similarity between two faces at the pixel level (correlation between the raw images), the perceptual level (correlation between the Gabor filter responses), the gestalt level (correlation between the principal components), or the categorization level (correlation between the six outputs). We compared our measure of discriminability with human performance in Young et al.'s (1997) ABX experiment (see Methods for details). We found the following correlations at each processing level: pixel, r = .35, p = .06; perceptual, r = .52, p = .003; gestalt, r = .65, p < .001 (shown in Figure 7b); category, r = .41, p = .02. Crucially, when the gestalt layer and the categorization layer were combined in a multiple regression, the categorization layer's contribution was insignificant (p = .3), showing that the explanatory power of the model rests with the gestalt layer. According to our model, then, human subjects' improved discrimination near category boundaries in this task is best explained as an effect at the level of gestalt-based representations, which were derived in an unsupervised way from Gabor representations via PCA. This is in sharp contrast to the standard explanation of this increased sensitivity, which is that categorization influences perception (Goldstone et al., 2001; Pevtzow & Harnad, 1997). This suggests that the facial expression morph sequences have natural boundaries, possibly because the endpoints are extremes of certain coordinated muscle group movements. In other domains, such as familiar face classification (Beale & Keil, 1995), the boundaries must arise through learning. In such domains, we expect that discrimination would be best explained in a learned feature layer, as in other models (Goldstone, 2000; Tijsseling & Harnad, 1997).

Figure 7. Discrimination of morph stimuli. (a) Percent correct discrimination of pairs of stimuli in an ABX task (Young et al., 1997). Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels to the left and right of each point show which two stimuli were being compared (e.g., 70/90 along the transition from sadness to disgust denotes a comparison of a 70% disgust / 30% sadness morph with a 90% disgust / 10% sadness morph). Note that better performance occurs near category boundaries than near prototypes (highlighted by the vertical lines). (b) The model's discrimination performance at the gestalt representation level. The model discrimination scores are highly correlated with the human subjects' scores (r = .65, p < .001).


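A sketch of the discriminability measure used above, applicable to any layer's representation of two stimuli (raw pixels, Gabor responses, principal components, or outputs):

```python
import numpy as np

def discriminability(rep_a, rep_b):
    """1 - Pearson r between two stimulus representations."""
    r = np.corrcoef(rep_a.ravel(), rep_b.ravel())[0, 1]
    return 1.0 - r
```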

The data above show that the model naturally accounts for categorical behavior of humans making forced-choice responses to facial expression morph stimuli. In another experiment, Young et al. (1997) found that subjects could reliably detect the expression mixed into a morph, even at the 30% level. Can the model also account for this decidedly noncategorical behavior? In the experiment, subjects were asked to decide, given a morph or prototype stimulus, the most apparent emotion, the second-most apparent emotion, and the third-most apparent emotion (the scores on prototypes were used to normalize for inherent expression similarity). To compare the human responses with the model, we used the top three outputs of the 13 networks and used the same analysis as Young et al. to determine the extent to which the network could detect the mixed-in expression in the morph images (see Methods for details). Figure 8 shows that the model's average sensitivity to the mixed-in expression is almost identical to that of human subjects, even though its behavior seems categorical in forced-choice experiments.

Comparison of Similarity Structures

We next investigated the similarity structure of the representations that the model produced. As before, we calculated the similarity between pairs of faces at each level of the network by computing the correlation between their representations. In order to evaluate the similarity structure qualitatively, we performed MDS both on the human forced-choice responses published by Ekman and Friesen (1976) and on the network's responses to the same stimuli at each level of processing shown in Figure 4.

At the pixel level, we found that the structure present in the MDS configuration was based mainly on identity; as we moved toward the categorization level, the identity-based structure began to break down, and expression-related clusters began to emerge. Unsurprisingly, at the network's categorization layer, the configuration was clearly organized by emotion category (Figure 9). Of more interest, however, is that the "ordering" of facial expressions around the human and network MDS configurations is the same, a result unlikely to have arisen by chance (p = 1/60 ≈ .017).

Figure 9. Multidimensional scaling of human and network responses reveals similar dimensions of emotion. Each point represents one of the 96 expressive faces in POFA. (a) 2-D similarity space induced by MDS from the average six-way forced-choice responses of human subjects (Ekman & Friesen, 1976) (stress = 0.218). (b) 2-D similarity space induced by MDS from the average training set responses of networks at their output layers (stress = 0.201). The arrangement of emotions around the circle is the same in both cases. Stress at the pixel level was 0.222; the perceptual level, 0.245; and the gestalt level, 0.286.

Figure 8. Ability to detect the mixed-in expression in morph sequences, for humans and the model. In the human data (dashed lines), subjects chose the far prototype (the faint second expression in the morph) significantly more often than unrelated expressions when that expression's mix ratio was 30% or greater, even though they consistently identified the 70% expression in forced-choice experiments. The model is almost identical to the humans in its ability to detect the secondary expression in a morph image.


Given that the network was never given any similarity information, this is a remarkable result. It suggests that the human similarity structure is a simple result of the inherent confusability of the categories, not necessarily the result of locality in some underlying psychological emotion space, as dimensional theories (e.g., Russell, 1980; Russell et al., 1989) might predict.

To measure the similarity between the human and network category structures quantitatively, we compared the human and model confusion matrices directly. For each pair of expressions, we computed the probability $p_{ij}$ that humans or networks respond with emotion $i$ when the intended emotion is $j$, for all $i \neq j$, that is, the probability of confusion between two categories. The correlation between networks and humans on the networks' test sets (the stimuli the networks had never seen before) was .661 (p < .001).[3] Thus, there is a close correspondence between human confusion rates and response distributions at the model's categorization level. The system is never instructed to confuse the categories in a way similar to humans; nevertheless, this property emerges naturally from the data.

DISCUSSION

In this article, we have introduced a computational model that mimics some of the important functions of the visual system. The model is simply a pattern classifier incorporating a biologically plausible representation of visual stimuli. We first engineered the system to provide good classification performance on a small database of reliably recognizable facial expressions. The free parameters of the model, such as the number of principal components used in dimensionality reduction, were optimized to maximize the classifier's generalization performance. In contrast to most models in mathematical psychology, which seek to fit a low number of free parameters to maximize a model's agreement with human data, our model is compared to human data directly, without any tuning.

The results of our comparison of the model's performance with human data were nevertheless remarkable. We first found that the relative levels of difficulty for six basic facial expressions of emotion were highly correlated with the levels at which humans agree on those same emotions. For example, humans are best at classifying happy expressions because the smile makes the task easy. The network model likewise finds it extremely easy to detect smiles, because they are obvious visual features that discriminate happy expressions from the rest of the faces in its training set. On the other hand, humans usually find that fear expressions are very difficult to classify. Fear expressions are often classified as surprise, for instance. The network model likewise sometimes classifies fear expressions as surprise. Why are fear expressions so difficult for humans to classify?

Could it be that true displays of fear are uncommon in our society? Or that portrayals of fear in the popular culture misguide us as to what fearful people really look like? Or simply that fear expressions are perceptually similar to other expressions? Our results suggest that the latter is the case: culture probably has very little to do with the difficulty of fear expressions. We have shown that perceptual similarity alone is sufficient to account for the relative difficulty of facial expressions.

The agreement between human and network similarity structure analyses (MDS) is also somewhat surprising. As pointed out earlier, Russell and colleagues have found that affective adjectives and affective facial expressions seem to share a common similarity structure. One might postulate that the underlying psychological space reflects the physiological similarity of emotional states, and that when we are asked to classify facial expressions, we are likely to classify a given face as any of the emotions close by in that psychophysiological space. However, we have shown that an emotionless machine without any underlying physiology exhibits a similarity structure very much like that of the humans. This is even more remarkable when one considers that the network was not given any indication of the similarity structure in the training signal, which was always all-or-none, rather than, for example, human subject responses, which would reflect subjects' confusions. Since we nevertheless match the human similarity structure, we have shown that the perceptual similarity of the categories corresponds to the psychological similarity structure of facial expressions. Why would this be? We suggest that evolution did not randomly associate facial expressions with emotional states, but that the expression-to-emotion mapping evolved in tandem with the need to communicate emotional states effectively.

The final question we set out to explore with this work is whether facial expressions are represented continuously or as discrete entities. As explained in the Introduction, the best evidence for discrete representations is that subjects appear to place sharp boundaries between facial expression categories and are better able to discriminate pairs of expressions near category boundaries, suggesting that our perception of the faces is influenced by the existence of the categories. As has been found in other modalities (cf. Ellison & Massaro, 1997), we find that it is unnecessary to posit discrete representations to account for the sharp boundaries and high discrimination scores. The network model places boundaries between categories, as it must to obtain good classification accuracy, but the categories are actually fuzzy and overlapping. The model can be thought of as a biologically plausible, working implementation of Russell and Bullock's (1986) theory of emotional facial expression categories as fuzzy concepts.


Despite the network's fuzzy, overlapping category concepts, the categories appear sharp when a rule such as "respond with the category most strongly activated by the given facial expression" is applied. A more difficult result to explain in terms of fuzzy categories placed in a continuous multidimensional space is the subjects' high discrimination scores near category boundaries. This result seems to call for an influence of the category boundaries on our perception of the expressions; indeed, that is Young et al.'s (1997) tentative interpretation. To the contrary, we have shown that in our model, discrimination scores best agree with human results at a purely perceptual level, where the category labels have no effect. With respect to image space, the low-dimensional PCA representation actually changes faster in boundary regions than in regions near the prototypes from which it was derived. In the context of our model, then, the seemingly contradictory results in different experiments conducted by Young et al. can be explained as simply tapping different computational levels of processing in a visual system organized much like our model.

We have found that our facial expression recognition model is "sufficient" to explain many aspects of human performance in behavioral tasks, but we have no proof of the "necessity" of any of our particular implementation decisions. In fact, we predict that many similar systems would obtain similar results. For instance, a system beginning with derivative-of-Gaussian edge filters (Marr, 1982), whose responses are combined to produce smooth responses to translations (like complex cells), should exhibit the same behavior. Replacing the PCA dimensionality reduction method with, say, factor analysis or the Infomax algorithm for independent components analysis (Bell & Sejnowski, 1995) should not dramatically affect the results. Finally, a category level using Support Vector Machines (Vapnik, 1995) should likewise produce similar behavior. The point of our simulations is that the category boundaries, discriminability, and similarity structures previously seen as being at odds are, in a sense, "present in the data and tasks themselves" and are easily exposed by any reasonable computational model.

METHODS

Network Details

The first step of processing in the model is to filter the image with a rigid 29 by 35 grid of overlapping 2-D Gabor filters (Daugman, 1985), in quadrature pairs, at five scales and eight orientations (some example filters are shown in Figure 5). The quadrature pairs are used to compute a phase-insensitive energy response at each point in the grid. These energy responses, or "Gabor magnitudes", are often used as a simplifying model of the spatial responses of complex cells in the early visual system (Lades et al., 1993). Though the model loses some information about precise feature localization when phase information is thrown away, the overlapping receptive fields compensate for this loss (Hinton, McClelland, & Rumelhart, 1986).

Each Gabor magnitude is z-scored (linearly transformed to have mean 0 and variance 1 over the training data) so that each filter contributes equally in the next representation layer.
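A sketch of this normalization, assuming the training set's magnitude vectors are stacked as rows of `train_feats`:

```python
import numpy as np

def zscore_params(train_feats):
    """Per-filter mean and standard deviation over the training data."""
    return train_feats.mean(axis=0), train_feats.std(axis=0)

def zscore(feats, mu, sd):
    """Scale each filter output to mean 0, variance 1, so every filter
    contributes equally at the next layer."""
    return (feats - mu) / sd
```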

The second step of processing in the model is to perform linear dimensionality reduction on the 40,600-element Gabor representation via a PCA of the training set. The actual computation is facilitated by Turk and Pentland's (1991) algebraic trick for "eigenfaces". This produces a 50-element representation, typically accounting for approximately 80% of the variance in the training set's Gabor data.
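The trick: with only a few hundred training images but 40,600 features, one can eigendecompose the small n × n Gram matrix instead of the 40,600 × 40,600 covariance matrix. A minimal sketch, with hypothetical names:

```python
import numpy as np

def pca_gram_trick(X, k=50):
    """Eigenfaces-style PCA for n_samples << n_features."""
    mu = X.mean(axis=0)
    Xc = X - mu                        # n x 40,600 centered data
    G = Xc @ Xc.T                      # small n x n Gram matrix
    w, V = np.linalg.eigh(G)           # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:k]      # indices of the k largest
    comps = Xc.T @ V[:, top]           # lift eigenvectors to feature space
    comps /= np.linalg.norm(comps, axis=0)
    return mu, comps.T                 # k principal directions
```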

The 50-element vector $p$ output by PCA can then be classified by a simple statistical model. We locally z-score (scale to mean 0, variance 1) each input element, then use a single-layer neural network (a generalized linear model) containing six outputs, one for each of the six "basic" emotions: happiness, sadness, fear, anger, surprise, and disgust. Each of the six units in the network computes a weighted sum $o_i = \sum_j w_{ij} p_j$ of the 50-element input vector; then the "softmax" function $y_i = e^{o_i} / \sum_j e^{o_j}$ is applied to the units' linear activations to obtain a vector of positive values whose sum is 1.0. The network is trained with the relative entropy error function so that its outputs correspond to the posterior probabilities of the emotion categories given the inputs (Bishop, 1995).

Network Training

We train the expression recognition system using leave-one-out cross-validation and early stopping. A given network is trained to minimize output error on all the images of 12 of the 14 actors in POFA, using stochastic gradient descent, momentum, weight decay, and the relative entropy error function (Bishop, 1995). A thirteenth actor's images are used as a "hold out" set to determine when to stop training: training is stopped when the error on the hold-out set is minimized. After training is stopped, we evaluate the network's generalization performance on the remaining (fourteenth) actor. We performed training runs with every possible combination of generalization and hold-out sets, for a total of 182 (14 by 13) individual networks.
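The split structure is easy to enumerate; a sketch (actor labels hypothetical), with the SGD/early-stopping loop summarized in a comment:

```python
actors = [f"actor{i:02d}" for i in range(14)]

splits = []
for test in actors:                      # generalization actor
    for holdout in actors:               # early-stopping actor
        if holdout == test:
            continue
        train = [a for a in actors if a not in (test, holdout)]
        splits.append((train, holdout, test))

assert len(splits) == 182                # 14 x 13 individual networks
# For each split: train on the 12 actors with SGD + momentum + weight
# decay, stop when hold-out error is minimal, then test on the 14th actor.
```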

Network Weight Visualization

The idea is to project each unit's weight vector back into image space in order to visualize what the network is sensitive to in an image. But this is not a well-defined task: though PCA is an easily inverted orthonormal transformation, the Gabor magnitude representation, besides being subsampled, throws away important phase information. As a workaround, we assume that each pixel's value is an approximately linear function of the 50-component gestalt-level (Gabor + PCA) representation.


We chose one of the 182 trained networks and, for each pixel location, used regression to find the linear function of the 50-element gestalt-level representation best predicting the pixel's value over the network's training set. Then, to visualize, say, the classifier's representation of happiness, we apply the regression function directly to the happiness unit's weights. An image was constructed from each of the six units' weight vectors using the regression functions learned from that network's training set.
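A sketch of this procedure, assuming gestalt codes G (n × 50) and pixel images P (n × 70,080) from the training set; the image shape and helper names are illustrative, and `np.linalg.lstsq` fits all per-pixel regressions at once:

```python
import numpy as np

def fit_pixel_regression(G, P):
    """Least-squares map from 50-dim gestalt codes to pixel values,
    with a bias column; returns a (51 x n_pixels) coefficient matrix."""
    G1 = np.hstack([G, np.ones((G.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(G1, P, rcond=None)
    return coef

def visualize_unit(weights, coef, shape=(292, 240)):
    """Apply the regression directly to one output unit's 50 weights;
    row/column order of the image is an assumption."""
    w1 = np.append(weights, 1.0)
    return (w1 @ coef).reshape(shape)
```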

Generation of Morphs

We generated morphs from the original stimuli (Ekman & Friesen, 1976) using the Morph program, version 2.5, from Gryphon Software, as described elsewhere (Jansari, Tranel, & Adolphs, 2000). Briefly, for each of the 15 pairs of expression prototypes, corresponding features between the two images were manually specified. The images were then tessellated and linearly transformed, both with respect to pixel position (a smooth warping) and pixel grayscale value (a smooth fade in luminance). The 10%, 30%, 50%, 70%, and 90% blends (see Figure 1b for examples) were retained for each transformation.

Multidimensional Scaling

MDS seeks to embed a set of stimuli in a low-dimensional space in such a way that the distances between points in the low-dimensional space are as faithful as possible to ratings of their similarity (Borg & Lingoes, 1987). We performed MDS on the human responses (Ekman & Friesen, 1976) and on the network's responses to the same stimuli at each level of processing in the network. Each analysis requires a distance (dissimilarity) matrix enumerating how dissimilar each pair of stimuli is. For the human data, we formed a six-element vector containing the probability with which humans gave the labels happy, sad, afraid, angry, surprised, and disgusted for each of the 96 nonneutral photographs in POFA. We obtained a 96 by 96 element similarity matrix from these data by computing the correlation $r_{ij}$ between each pair of six-element vectors. Finally, we converted the resulting similarity matrix into a distance (dissimilarity) matrix with the transform $d_{ij} = (1 - r_{ij})/2$ to obtain values in the range 0-1. For the network, we measure the similarity between two stimuli $i$ and $j$ as the correlation $r_{ij}$ between the representations of the two stimuli at each level of the network, corresponding to similarity at different levels of processing: the pixel level (70,080-element image pixel value vectors), the perceptual level (40,600-element Gabor response patterns), the gestalt level (50-element Gabor/PCA patterns), and the network's output (six-element category level).

We ran a classical nonmetric MDS algorithm, SSA-1, due to Guttman and Lingoes (Borg & Lingoes, 1987), on each of the four resulting distance matrices described above, and then plotted the stimuli according to their position in the resulting 2-D configuration.
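A sketch of the distance-matrix construction shared by all four analyses, with R holding one row per face (six response probabilities for the human data, or a network layer's activations):

```python
import numpy as np

def distance_matrix(R):
    """96 x 96 dissimilarities: d_ij = (1 - r_ij) / 2, in [0, 1]."""
    C = np.corrcoef(R)          # Pearson r between all pairs of rows
    return (1.0 - C) / 2.0
```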

Model RTs

We assume a network's RT is directly proportional to the "uncertainty of its maximal output". That is, we define a network's model RT for stimulus $i$ to be $t_i^{\mathrm{model}} = 1 - \max_j y_{ij}$ (the time scale is arbitrary). Here $y_{ij}$ is the network's output for emotion $j$ on stimulus $i$. This is similar to the standard approach of equating RT with a network's output error (Seidenberg & McClelland, 1989), except that for morph stimuli there is no predetermined "correct" response. For comparison, the human data available are $t_{ij}^{\mathrm{human}}$, the mean RT of subjects responding with emotion $j$ to stimulus $i$. To compare the model to the humans given these data, we treated each network as an individual subject and, for each morph stimulus, recorded the network's response emotion $j$ and model reaction time $t_i^{\mathrm{model}}$, then averaged $t_i^{\mathrm{model}}$ over all networks making the same response to the stimulus. The quantitative comparison is the linear fit between network predictions and human RTs for all stimuli and response pairs for which both human and network data were available. Young et al.'s criterion for reporting $t_{ij}^{\mathrm{human}}$ was that at least 25% of the human subjects responded with emotion $j$ to stimulus $i$, and for some of these cases none of the networks responded with emotion $j$ to stimulus $i$, so the missing human and network data were disregarded.

Model Discrimination Scores

We assume discrimination is more difficult the more similar two stimuli are at some level of processing. We use the same measure of similarity as in the MDS procedure: the correlation $r_{ij}$ between the network's representations of stimuli $i$ and $j$ (either at the pixel, perceptual, gestalt, or output level). To convert similarity scores to discrimination scores, we used the transform $d_{ij} = 1 - r_{ij}$. This was measured for each of the 30 pairs of JJ images for which human data were available. The discrimination scores were then averaged over the 13 networks that had not been trained on JJ and compared to the human data.

Mixed-in Expression Detection

To measure the ability of the model to detect the secondary expression mixed into a morph stimulus, we followed Young et al.'s (1997) methods. For each network's output on a given stimulus, we scored the first, second, and third highest outputs of the network as a 3, 2, and 1, respectively, and assigned the score 0 to the three remaining outputs.


For each morph and prototype stimulus, we averaged these score vectors across all 13 networks. Then, for each of the 30 possible combinations of near prototype ($i$) and far prototype ($j$), using the 90%, 70%, and 50% morphs moving from expression $i$ to expression $j$, we subtracted the score vector for prototype $i$ from the score for each of the three morphs. This eliminates the effect of the intrinsic similarity between JJ's prototype expressions. Averaging these score vectors across all 30 sequences, we obtain the data plotted in Figure 8.
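A sketch of the scoring step for a single network and stimulus; averaging across networks and subtracting the near prototype's scores follow the text above.

```python
import numpy as np

def rank_scores(outputs):
    """Score the three largest of the six outputs 3, 2, 1; the rest 0."""
    scores = np.zeros(len(outputs))
    top3 = np.argsort(outputs)[::-1][:3]
    scores[top3] = [3.0, 2.0, 1.0]
    return scores
```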

Acknowledgments

We thank Stevan Harnad, Bill Kristan, Paul Munro, Alice O'Toole, Terry Sejnowski, and Gary's Unbelievable Research Unit (GURU) for helpful comments on previous versions of this manuscript. We also thank Andrew Young for permission to use the human data plotted in Figures 2, 3, 7a, and 8. We are also grateful to Paul Ekman for providing us with the Pictures of Facial Affect. This research was funded by NIMH grant MH57075 to GWC.

Reprint requests should be sent to Garrison W. Cottrell, UCSD Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA, or via e-mail: gary@cs.ucsd.edu.

Notes

1. For comparison, the order of difficulty for the Calder et al. (2001) PCA model is happiness (98%), fear (97%), surprise (92%), anger (86%), sadness (72%), and disgust (72%).
2. We used 13 networks for technical reasons (see Methods for details).
3. The same comparison of our model with the confusion matrix from the Calder et al. (2001) forced-choice experiment is slightly better, .686, whereas their PCA model's correlation with their subjects' confusion data is somewhat lower, .496.

REFERENCES

Beale, J., & Keil, F. (1995). Categorical effects in the perception of faces. Cognition, 57, 217-239.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer.

Calder, A., Young, A., Perrett, D., Etcoff, N., & Rowland, D. (1996). Categorical perception of morphed facial expressions. Visual Cognition, 3, 81-117.

Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179-1208.

Calder, A. J., Young, A. W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 526-551.

Carroll, J. M., & Russell, J. A. (1996). Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70, 205-218.

Cottrell, G. W., Dailey, M. N., Padgett, C., & Adolphs, R. (2000). Is all face processing holistic? The view from UCSD. In M. Wenger & J. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Context and challenges (pp. 347-395). Mahwah, NJ: Erlbaum.

Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, gender and emotion recognition using holons. In R. P. Lippman, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564-571). San Mateo: Morgan Kaufmann.

Cottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. E. Sharkey (Ed.), Models of cognition: A review of cognitive science. Norwood, NJ: Ablex.

Craw, I., & Cameron, P. (1991). Parameterising images for recognition and reconstruction. In Proceedings of the British Machine Vision Conference (pp. 367-370). Berlin: Springer.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160-1169.

Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 974-989.

Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and emotion (chap. 3, pp. 45-60). New York: Wiley.

Ekman, P., & Friesen, W. (1976). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists Press.

Ellison, J. W., & Massaro, D. W. (1997). Featural evaluation, integration, and judgment of facial affect. Journal of Experimental Psychology: Human Perception and Performance, 23, 213-226.

Etcoff, N. L., & Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227-240.

Goldstone, R. L. (2000). A neural network model of concept-influenced segmentation. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.

Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representations through category learning. Cognition, 78, 27-43.

Hancock, P. J. B., Bruce, V., & Burton, A. M. (1998). A comparison of two computer-based face recognition systems with human perception of faces. Vision Research, 38, 2277-2288.

Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory and Cognition, 24, 26-40.

Harnad, S. R. (1987). Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.

Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (vol. 1, chap. 3, pp. 77-109). Cambridge: MIT Press.

Jansari, A., Tranel, D., & Adolphs, R. (2000). A valence-specific lateral bias for discriminating emotional facial expressions in free field. Cognition and Emotion, 14, 341-353.

Katsikitis, M. (1997). The classification of facial expressions of emotion: A multidimensional scaling approach. Perception, 26, 613-626.

Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300-311.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788-791.

Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1-12.

Lien, J. J.-J., Kanade, T., Cohn, J. F., & Li, C.-C. (2000). Detection, tracking, and classification of action units in facial expression. Robotics and Autonomous Systems, 31, 131-146.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577-621.

Lyons, M. J., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (pp. 200-205). Los Alamitos, CA: IEEE Computer Society.

Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1357-1362.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.

Matsumoto, D. (1992). American-Japanese cultural differences in the recognition of universal facial expressions. Journal of Cross-Cultural Psychology, 23, 72-84.

Padgett, C., Cottrell, G., & Adolphs, R. (1996). Categorical perception in facial emotion classification. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Padgett, C., & Cottrell, G. W. (1998). A simple neural network models categorical perception of facial expressions. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 806-807). Mahwah, NJ: Erlbaum.

Perrett, D., Hietanen, J., Oram, M., & Benson, P. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London, B: Biological Sciences, 335, 23-30.

Pevtzow, R., & Harnad, S. R. (1997). Warping similarity space in category learning by human subjects: The role of task difficulty. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 189-195). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199-1204.

Rosenblum, M., Yacoob, Y., & Davis, L. S. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7, 1121-1138.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161-1178.

Russell, J. A., & Bullock, M. (1986). Fuzzy concepts and the perception of emotion in facial expressions. Social Cognition, 4, 309-341.

Russell, J. A., Lewicka, M., & Niit, T. (1989). A cross-cultural study of a circumplex model of affect. Journal of Personality and Social Psychology, 57, 848-856.

Sanger, T. (1989). An optimality principle for unsupervised learning. In D. Touretzky (Ed.), Advances in neural information processing systems (vol. 1, pp. 11-19). San Mateo: Morgan Kaufmann.

Schiano, D. J., Ehrlich, S. M., Sheridan, K. M., & Beck, D. M. (2000). Evidence for continuous rather than categorical perception of facial affect. Abstract presented at the 40th Annual Meeting of the Psychonomic Society.

Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44, 229-237.

Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523-568.

Tijsseling, A., & Harnad, S. R. (1997). Warping similarity space in category learning by backprop nets. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 263-269). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71-86.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.

Woodworth, R. S. (1938). Experimental psychology. New York: Holt.

Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271-313.

Dailey et al 1173

Page 6: EMPATH: A Neural Network that Categorizes Facial Expressionscseweb.ucsd.edu/~mdailey/papers/DaileyEtalCogneuro.pdfconceptual and linguistic systems’ ’ (p. 229). The authors created

Until now despite several years of research on auto-matic recognition of facial expressions no computation-al model has been shown to simultaneously explain allof these seemingly contradictory data In the next sec-tion we review the recent progress in computationalmodeling of facial expression recognition then intro-duce a new model that does succeed in explaining theavailable data

Computational Modeling of Facial ExpressionRecognition

Padgett and colleagues were the first to apply computa-tional models toward an understanding of facial expres-sion perception (Cottrell Dailey Padgett amp Adolphs2000 Padgett Cottrell amp Adolphs 1996 Padgett ampCottrell 1998) Their system employed linear filteringof regions around the eyes and mouth followed by amultilayer perception for classification into emotioncategories The system achieved good recognition per-formance and was able to explain some of the results onCP of facial expressions with linear lsquolsquodissolversquorsquo sequen-ces However the system was unable to account for thesharp category boundaries humans place along image-quality morph transitions because the linear filtersrsquoresponses varied too quickly in the presence of thenonlinear changes in morph transitions It also wouldhave been incapable of predicting recently observedevidence of holistic effects in facial expression recogni-tion (Calder Young Keane amp Dean 2000)

Lyons et al (1998) created a database of Japanesefemale models portraying facial expressions of happi-ness surprise fear anger sadness and disgust Theythen had subjects rate the degree to which each faceportrayed each basic emotion on a 1ndash5 scale and usedEuclidean distance between the resulting lsquolsquosemanticratingrsquorsquo vectors for each face pair as a measure ofdissimilarity They then used a system inspired by theDynamic Link Architecture (Lades et al 1993) to explainthe human similarity matrix Similarities obtained from

their Gabor wavelet-based representation of faces werehighly correlated with similarities obtained from thehuman data and nonmetric multidimensional scaling(MDS) revealed similar underlying 2-D configurationsof the stimuli The authors suggest in conclusion thatthe high-level circumplex constructs proposed byauthors like Schlosberg and Russell may in part reflectsimilarity at low levels in the visual system

In a recent work Calder et al (2001) applied adifferent computational model to the task of under-standing human facial expression perception and rec-ognition The idea of their system originally proposedby Craw and Cameron (1991) is to first encode thepositions of facial features relative to the average facethen warp each face to the average shape (thus pre-serving texture but removing between-subject faceshape variations) The shape information (the positionsof the facial features prior to warping) and the shape-free information (the pixel values after warping) caneach be submitted to a separate principal componentsanalysis (PCA) for linear dimensionality reduction pro-ducing separate low-dimensional descriptions of thefacersquos shape and texture Models based on thisapproach have been successful in explaining psycholog-ical effects in face recognition (Hancock Burton ampBruce 1996 1998) However warping the features inan expressive face to the average face shape wouldseem to destroy some of the information crucial torecognition of that expression so prior to Calderet alrsquos work it was an open question as to whethersuch a system could support effective classification offacial expressions The authors applied the Craw andCameron PCA method to Ekman and Friesenrsquos (1976)Pictures of Facial Affect (POFA) then used linear dis-criminant analysis to classify the faces into happy sadafraid angry surprised and disgusted categories Theauthors found that a representation incorporating boththe shape information (feature location PCA) and theshape-free information (warped pixel PCA) best sup-ported facial expression classification (83 accuracy

Figure 3 Results of Younget alrsquos (1997 ) Experiment 3 forthe sequential (ABX ) discrimi-nation task Each point repre-sents the percentage of timesubjects correctly discriminatedbetween two neighboringmorph stimuli The x-axis labelsdenote which pair of stimuliwere being compared (eg7090 along the transition fromsadness to disgust denotes acomparison of a 70 disgust 30 sadness morph with a 90disgust 10 sadness morph)Discrimination was significantlybetter near the prototypes thannear the category boundaries

Dailey et al 1163

using leave-one-image-out classification for the Ekmanand Friesen database) The authors went on to comparetheir system with human data from psychological exper-iments They found that their system behaved similarlyto humans in a seven-way forced-choice task and thatthe principal component representation could be usedto predict human ratings of pleasure and arousal (thetwo dimensions of Russellrsquos circumplex)

Taken together, results with the above three computational models of facial expression recognition begin to hint that subjects' performance in psychological experiments can be explained as a simple consequence of category learning and the statistical properties of the stimuli themselves. Although Calder et al.'s system exhibits a surprising amount of similarity to human performance in a forced-choice experiment, it has not been brought to bear in the debate on multidimensional versus categorical perception (CP) of facial expressions. In the present article, we show that a new, more biologically plausible computational model not only exhibits more similarity to human forced-choice performance but also provides a detailed computational account of data supporting both categorical and multidimensional theories of facial expression recognition and perception. Our simulations consist of constructing and training a simple neural network model (Figure 4); the system uses the same stimuli employed in many psychological experiments, performs many of the same tasks, and can be measured in ways similar to human subjects. The model consists of three levels of processing: perceptual analysis, object representation, and categorization. The next section describes the model in some detail.

The Model

The system is a feed-forward network consisting of three layers common to most object recognition models (Riesenhuber & Poggio, 2000). Its input is a 240 by 292 manually aligned grayscale face image from Ekman and Friesen's POFA (Ekman & Friesen, 1976). This data set contains photos of 14 actors portraying expressions that are reliably classified as happy, sad, afraid, angry, surprised, or disgusted by naive observers (70% agreement was the threshold for inclusion in the data set, and the overall agreement is 91.6%). The expressions made by one of those actors, "JJ," are shown in Figure 1a. The strong agreement across human subjects, together with the use of these photos in a wide variety of psychological experiments exploring emotional expression perception, makes POFA an ideal training set for our model.

The first layer of the model is a set of neurons whose response properties are similar to those of complex cells in the visual cortex. The so-called Gabor filter (Daugman, 1985) is a standard way to model complex cells in visual recognition systems (Lades et al., 1993). Figure 5 shows the spatial "receptive fields" of several filters. The units essentially perform nonlinear edge detection at five scales and eight orientations. As a feature detector, the Gabor filter has the advantage of moderate translation invariance over pixel-based representations. This means that features can "move" in the receptive field without dramatically affecting the detector's response, making the first layer of our model robust to small image changes. With a 29 by 35 grid and 40 filters at each grid location, we obtain 40,600 model neurons in this layer, which we term the "perceptual" layer.
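The following numpy sketch illustrates one way such a perceptual layer can be computed: a quadrature pair of Gabor filters is applied at each grid point, and the two responses are combined into a phase-insensitive magnitude (see Network Details in Methods). The wavelength spacing, filter sizes, and function names are our illustrative assumptions, not the authors' code.

```python
import numpy as np

def gabor_pair(size, wavelength, theta, sigma):
    """Cosine (real) and sine (imaginary) parts of a Gabor filter."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)      # coordinate along the grating
    env = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    arg = 2 * np.pi * xr / wavelength
    return env * np.cos(arg), env * np.sin(arg)

def perceptual_layer(image, rows=29, cols=35, n_scales=5, n_orients=8):
    """40,600 phase-insensitive Gabor magnitudes on a rows x cols grid."""
    pad = 64                                        # enough margin for the largest filter
    padded = np.pad(image.astype(float), pad, mode="edge")
    ys = np.linspace(pad, pad + image.shape[0] - 1, rows).astype(int)
    xs = np.linspace(pad, pad + image.shape[1] - 1, cols).astype(int)
    feats = []
    for s in range(n_scales):
        lam = 4.0 * 2 ** s                          # assumed dyadic wavelength spacing
        size = int(2 * lam) | 1                     # odd width, about two wavelengths
        h = size // 2
        for o in range(n_orients):
            cos_f, sin_f = gabor_pair(size, lam, np.pi * o / n_orients, lam / 2.0)
            for yc in ys:
                for xc in xs:
                    patch = padded[yc - h:yc + h + 1, xc - h:xc + h + 1]
                    # quadrature energy: sqrt(cos_response^2 + sin_response^2)
                    feats.append(np.hypot((patch * cos_f).sum(), (patch * sin_f).sum()))
    return np.array(feats)  # each element is later z-scored over the training set
```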

Figure 5. Example Gabor filters. The real (cosine-shaped) component is shown relative to the size of the face at all five scales and five of the eight orientations used.

Figure 4. Model schematic.

In order to extract a small set of informative features from this high-dimensional data, we use the equivalent of an "image compression" network (Cottrell, Munro, & Zipser, 1989). This is a backpropagation network that is trained to simply reproduce its input on its output through a narrow channel of hidden units. In order to solve this problem, the hidden units must extract regularities from the data. Since they are fully connected to the inputs, they usually extract global representations we have termed "holons" in previous work (Cottrell & Metcalfe, 1991). We note that this is a biologically plausible means of dimensionality reduction in the sense that it is unsupervised and can be learned by simple networks employing Hebbian learning rules (Sanger, 1989). As a shortcut, we compute this network's weights directly via the equivalent statistical technique of PCA. We chose 50 principal components (hidden units) for this layer based on previous experiments showing this leads to good generalization on POFA. It is important to note that the number of components (the only free parameter in our system) was not tuned to human data. The resulting low-dimensional object-level representation is specific to the facial expression and identity variations in its input, as is the population of so-called face cells in the inferior temporal cortex (Perrett, Hietanen, Oram, & Benson, 1992). We term this layer of processing the "gestalt" level.
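A minimal sketch of this dimensionality reduction step, assuming an SVD route to the principal components (the names are ours); projecting a z-scored Gabor magnitude vector onto the top 50 components yields the 50-element gestalt code described above.

```python
import numpy as np

def fit_pca(train, k=50):
    """train: (n_examples x n_features) array of z-scored Gabor magnitudes."""
    mean = train.mean(axis=0)
    centered = train - mean
    # SVD of the data matrix; rows of vt are unit-length principal components
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def gestalt_layer(x, mean, components):
    """Project one 40,600-element perceptual vector to its 50-element code."""
    return components @ (x - mean)
```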

Although PCA is one of the simplest and most efficient methods for coding a set of stimuli, other methods would probably also work. For the current model, it is only important that (1) the code be sensitive to the dimensions of variance in faces, to facilitate learning of expressions, and that (2) the code be low dimensional, to facilitate generalization to novel faces. In a more general object recognition system, a single PCA for all kinds of objects would probably not be appropriate, because the resulting code would not be sparse (a single image normally activates a large number of holistic "units" in a PCA representation, but object-sensitive cells in the visual system appear to respond much more selectively; Logothetis & Sheinberg, 1996). To obtain a good code for a large number of different objects, then, nonnegative matrix factorization (Lee & Seung, 1999) or an independent component mixture model (Lee, Lewicki, & Sejnowski, 2000) might be more appropriate. For the current problem, though, PCA suffices.

The outputs of the gestalt layer are finally categorized into the six "basic" emotions by a simple perceptron with six outputs, one for each emotion. The network is set up so that its outputs can be interpreted as probabilities (they are all positive and sum to 1). However, the system is trained with an "all-or-none" teaching signal. That is, even though only 92% (say) of the human subjects used to vet the POFA database may have responded "happy" to a particular face, the network's teaching signal is 1 for the "happy" unit and 0 for the other five. Thus, the network does not have available to it the confusions that subjects make on the data. We term this layer the "category" level.
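For illustration, a minimal sketch of how such an all-or-none teaching signal can be built; the emotion ordering and the example agreement proportions are hypothetical.

```python
import numpy as np

EMOTIONS = ["happiness", "sadness", "fear", "anger", "surprise", "disgust"]

def teaching_signal(human_proportions):
    """E.g., {'happiness': .92, 'surprise': .08} -> [1, 0, 0, 0, 0, 0]."""
    target = np.zeros(len(EMOTIONS))
    modal = max(human_proportions, key=human_proportions.get)
    target[EMOTIONS.index(modal)] = 1.0   # human confusions are discarded
    return target
```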

While this categorization layer is an extremely simplistic model of human category learning and decision-making processes, we argue that the particular form of classifier is unimportant so long as it is sufficiently powerful and reliable to place the gestalt-level representations into emotion categories; we claim that similar results will obtain with any nonlinear classifier.

We should also point out that the system abstracts away many important aspects of visual processing in the brain, such as eye movements, facial expression dynamics, and size and viewpoint invariance. These complicating factors turn out to be irrelevant for our purposes; as we shall see, despite the simplifying assumptions of the model, it nevertheless accounts for a wide variety of data available from controlled behavioral experiments. This suggests that it is a good first-order approximation of processing at the computational level in the visual system.

The next section reports on the results of several experiments comparing the model's predictions to human data from identical tasks, with special emphasis on Young et al.'s landmark study of categorical effects in facial expression perception (Young et al., 1997). We find (1) that the model and humans find the same expressions difficult or easy to interpret, (2) that when presented with morphs between pairs of expressions, the model and humans place similarly sharp category boundaries between the prototypes, (3) that pairwise similarity ratings derived from the model's gestalt-level representations predict human discrimination ability, (4) that the model and humans are similarly sensitive to mixed-in expressions in morph stimuli, and (5) that MDS analysis produces a similar emotional similarity structure from the model and human data.

RESULTS

Comparison of Model and Human Performance

The connections in the model's final layer were trained to classify images of facial expressions from Ekman and Friesen's POFA database (see Figure 1a) (Ekman & Friesen, 1976), the standard data set used in the vast majority of psychological research on facial expression (see Methods for details). The training signal contained no information aside from the expression most agreed upon by human observers; that is, even if 90% of human observers labeled a particular face "happy" and 10% labeled it "surprised," the network was trained as if 100% had said "happy." Again, there is no information in the training signal concerning the similarity of different expression categories. The first measurement we made was the model's ability to generalize in classifying the expressions of previously unseen subjects from the same database. The model's mean generalization performance was 90.0%, which is not significantly different from human performance on the same stimuli (91.6%; t = .587, df = 190, p = .279) (Table 1). Moreover, the rank-order correlation between the model's average accuracy on each category (happiness, surprise, fear, etc.) and the level of human agreement on the same categories was .667 (two-tailed Kendall's tau, p = .044; cf. Table 1). For example, both the model and humans found happy faces easy and fear faces the most difficult to categorize correctly.¹ Since the network had about as much experience with one category as another and was not trained on the human response accuracies, this correlation between the relative difficulties of each category is an "emergent" property of the model. Studies of expression recognition consistently find that fear is one of the most difficult expressions to recognize (Katsikitis, 1997; Matsumoto, 1992; Ekman & Friesen, 1976). Our model suggests that this is simply because the fear expression is "inherently" difficult to distinguish from the other five expressions.

How does the network perform the expression recognition task? An examination of the trained network's representation provides some insight. We projected each unit's weight vector back into image space in order to visualize the "ideal" stimulus for each unit in the network (see Methods for details). The results are shown in Figure 6, with and without addition of the average face. In each of the images, each pixel value is the result of applying a regression formula predicting the value of the pixel at that location as a linear function of the 50-element weight vector for the given network output unit. Dark and bright spots indicate the features that excite or inhibit a given output unit, depending on the relative gray values in the region of that feature. Each unit appears to combine evidence for an emotion based upon the presence or absence of a few local features. For instance, for fear, the salient criteria appear to be the eyebrow raise and the eyelid raise, with a smaller contribution of parted lips. The anger unit apparently requires a display in which the eyebrows are not raised. Clearly, the classifier has determined which feature configurations reliably distinguish each expression from the others.

Comparison on CP Data

Several studies have reported CP of facial expressions using morphs between portrayals of the six basic emotions by POFA actor "JJ," whose images have been chosen because his expressions are among the easiest to recognize in the database (Young et al., 1997; Calder et al., 1996). Since our model also finds JJ "easy" (it achieves 100% generalization accuracy on JJ's prototypical expressions), we replicated these studies with 13 networks that had not been trained on JJ.²

We first compared the model's performance with human data from a six-way forced-choice experiment (Young et al., 1997) on 10%, 30%, 50%, 70%, and 90% morphs we constructed (see Methods) between all pairs of JJ's prototypical expressions (see Figure 1b for two such morph sequences). We modeled the subjects' identification choices by letting the outputs of the networks represent the probability mass function for the human subjects' responses. This means that we averaged each of the six outputs of the 13 networks for each face and compared these numbers to the human subjects' response probabilities. Figure 2 compares the networks' forced-choice identification performance with that of humans on the same stimuli. Quantitatively, the model's responses were highly correlated with the human data (r = .942, p < .001), and qualitatively, the model maintained the essential features of the human data: abrupt shifts in classification at emotion category boundaries and few intrusions (identifications of unrelated expressions) along the transitions. Using the same criteria for an intrusion that Young et al. (1997) used, the model predicted 4 intrusions in the 15 morph sequences, compared to 2 out of 15 in the human data. Every one of the intrusions predicted by the model involved fear, which is the least reliably classified expression in the POFA database for humans and networks (Table 1).

Table 1. Error Rates for Networks and Level of Agreement for Humans

Expression    Network Percentage Correct    Human Percentage Correct
Happiness               100.0                         98.7
Surprise                100.0                         92.4
Disgust                 100.0                         92.3
Anger                    89.1                         88.9
Sadness                  83.3                         89.2
Fear                     67.2                         87.7
Average                  90.0                         91.6

Network generalization to unseen faces compared with human agreement on the same faces (six-way forced choice). Human data are provided with the POFA database (Ekman & Friesen, 1976).

Figure 6. Network representation of each facial expression. (Top row) Approximation of the optimal input-level stimulus for each facial expression category. (Bottom row) The same approximations with the average face subtracted; dark and bright pixels indicate salient features.

We also compared subjects' RTs to those of the network, in which we model RT as being proportional to uncertainty at the category level. That is, if the network outputs a probability of .9 for happiness, we assume it is more certain, and therefore faster, than when it outputs a probability of .6 for happiness. We thus use the difference between 1 (the maximum possible output) and the largest actual output as our model of the reaction time. As explained earlier, subjects' RTs exhibit a characteristic scalloped shape, with slower RTs near category boundaries. The model also exhibits this pattern, as shown in Figure 2. The reason should be obvious: as a sequence of stimuli approaches a category boundary, the network's output for one category necessarily decreases as the other increases. This results in a longer model RT. The model showed good correlation with human RTs (see Methods) (r = .677, p < .001).

Figure 7. Discrimination of morph stimuli. (a) Percent correct discrimination of pairs of stimuli in an ABX task (Young et al., 1997). Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels to the left and right of each point show which two stimuli were being compared (e.g., 70/90 along the transition from sadness to disgust denotes a comparison of a 70% disgust / 30% sadness morph with a 90% disgust / 10% sadness morph). Note that better performance occurs near category boundaries than near prototypes (highlighted by the vertical lines). (b) The model's discrimination performance at the gestalt representation level. The model discrimination scores are highly correlated with the human subjects' scores (r = .65, p < .001).

The model also provides insight into human discrimination behavior. How can a model that processes one face at a time discriminate faces? The idea is to imagine that the model is shown one face at a time and stores its representation of each face for comparison. We use the correlation (Pearson's r) between representations as a measure of similarity and then use 1 minus this number as a measure of discriminability. An interesting aspect of our model is that it has several independent levels of processing (see Figure 4), allowing us to determine which level best accounts for a particular phenomenon. In this case, we compute the similarity between two faces at the pixel level (correlation between the raw images), the perceptual level (correlation between the Gabor filter responses), the gestalt level (correlation between the principal components), or the categorization level (correlation between the six outputs). We compared our measure of discriminability with human performance in Young et al.'s (1997) ABX experiment (see Methods for details). We found the following correlations at each processing level: pixel, r = .35, p = .06; perceptual, r = .52, p = .003; gestalt, r = .65, p < .001 (shown in Figure 7b); category, r = .41, p = .02. Crucially, when the gestalt layer and the categorization layer were combined in a multiple regression, the categorization layer's contribution was insignificant (p = .3), showing that the explanatory power of the model rests with the gestalt layer. According to our model, then, human subjects' improved discrimination near category boundaries in this task is best explained as an effect at the level of gestalt-based representations, which were derived in an unsupervised way from Gabor representations via PCA. This is in sharp contrast to the standard explanation of this increased sensitivity, which is that categorization influences perception (Goldstone et al., 2001; Pevtzow & Harnad, 1997). This suggests that the facial expression morph sequences have natural boundaries, possibly because the endpoints are extremes of certain coordinated muscle group movements. In other domains, such as familiar face classification (Beale & Keil, 1995), the boundaries must arise through learning. In such domains, we expect that discrimination would be best explained in a learned feature layer, as in other models (Goldstone, 2000; Tijsseling & Harnad, 1997).
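A sketch of this discriminability measure; the two arguments stand for any of the four level-specific encodings (pixel, perceptual, gestalt, or category), and the function names are our own.

```python
import numpy as np

def discriminability(rep_a, rep_b):
    """1 minus the Pearson correlation between two stimulus representations."""
    r = np.corrcoef(rep_a, rep_b)[0, 1]
    return 1.0 - r

# E.g., compare neighboring morphs at the gestalt level:
# d = discriminability(gestalt_layer(morph70, mean, comps),
#                      gestalt_layer(morph90, mean, comps))
```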

The data above show that the model naturally accounts for categorical behavior of humans making forced-choice responses to facial expression morph stimuli. In another experiment, Young et al. (1997) found that subjects could reliably detect the expression mixed into a morph, even at the 30% level. Can the model also account for this decidedly noncategorical behavior? In the experiment, subjects were asked to decide, given a morph or prototype stimulus, the most apparent emotion, the second-most apparent emotion, and the third-most apparent emotion (the scores on prototypes were used to normalize for inherent expression similarity). To compare the human responses with the model, we used the top three outputs of the 13 networks and used the same analysis as Young et al. to determine the extent to which the network could detect the mixed-in expression in the morph images (see Methods for details). Figure 8 shows that the model's average sensitivity to the mixed-in expression is almost identical to that of human subjects, even though its behavior seems categorical in forced-choice experiments.

Comparison of Similarity Structures

We next investigated the similarity structure of the representations that the model produced. As before, we calculated the similarity between pairs of faces at each level of the network by computing the correlation between their representations. In order to evaluate the similarity structure qualitatively, we performed MDS both on the human forced-choice responses published by Ekman and Friesen (1976) and on the network's responses to the same stimuli at each level of processing shown in Figure 4. At the pixel level, we found that the structure present in the MDS configuration was based mainly on identity, and as we moved toward the categorization level, the identity-based structure began to break down and expression-related clusters began to emerge. Unsurprisingly, at the network's categorization layer, the configuration was clearly organized by emotion category (Figure 9). Of more interest, however, is that the "ordering" of facial expressions around the human and network MDS configurations is the same, a result unlikely to have arisen by chance (p = 1/60 ≈ .017). Given that the network was never given any similarity information, this is a remarkable result. It suggests that the human similarity structure is a simple result of the inherent confusability of the categories, not necessarily the result of locality in some underlying psychological emotion space, as dimensional theories (e.g., Russell, 1980; Russell et al., 1989) might predict.

Figure 9. Multidimensional scaling of human and network responses reveals similar dimensions of emotion. Each point represents one of the 96 expressive faces in POFA. (a) 2-D similarity space induced by MDS from the average six-way forced-choice responses of human subjects (Ekman & Friesen, 1976) (stress = 0.218). (b) 2-D similarity space induced by MDS from the average training set responses of networks at their output layers (stress = 0.201). The arrangement of emotions around the circle is the same in both cases. Stress at the pixel level was 0.222; at the perceptual level, 0.245; and at the gestalt level, 0.286.

Figure 8. Ability to detect the mixed-in expression in morph sequences for humans and the model. In the human data (dashed lines), subjects chose the far prototype (the faint second expression in the morph) significantly more often than unrelated expressions when that expression's mix ratio was 30% or greater, even though they consistently identified the 70% expression in forced-choice experiments. The model is almost identical to the humans in its ability to detect the secondary expression in a morph image.

To measure the similarity between the human and network category structures quantitatively, we compared the human and model confusion matrices directly. For each pair of expressions, we computed the probability p_ij that humans or networks respond with emotion i when the intended emotion is j, for all i ≠ j; that is, the probability of confusion between two categories. The correlation between networks and humans on the networks' test sets (the stimuli the networks had never seen before) was .661 (p < .001).³ Thus, there is a close correspondence between human confusion rates and response distributions at the model's categorization level. The system is never instructed to confuse the categories in a way similar to humans; nevertheless, this property emerges naturally from the data.
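A sketch of this comparison under our naming assumptions: given 6 x 6 confusion matrices for humans and networks, correlate the off-diagonal (i ≠ j) entries.

```python
import numpy as np

def off_diagonal(conf):
    """Flatten the i != j entries of a square confusion matrix."""
    mask = ~np.eye(conf.shape[0], dtype=bool)
    return conf[mask]

def confusion_correlation(human_conf, net_conf):
    """conf[i, j] = P(respond with emotion i | intended emotion j)."""
    return np.corrcoef(off_diagonal(human_conf), off_diagonal(net_conf))[0, 1]
```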

DISCUSSION

In this article, we have introduced a computational model that mimics some of the important functions of the visual system. The model is simply a pattern classifier incorporating a biologically plausible representation of visual stimuli. We first engineered the system to provide good classification performance on a small database of reliably recognizable facial expressions. The free parameters of the model, such as the number of principal components used in dimensionality reduction, were optimized to maximize the classifier's generalization performance. In contrast to most models in mathematical psychology, which seek to fit a low number of free parameters to maximize a model's agreement with human data, our model is compared to human data directly, without any tuning.

The results of our comparison of the model's performance with human data were nevertheless remarkable. We first found that the relative levels of difficulty for six basic facial expressions of emotion were highly correlated with the levels at which humans agree on those same emotions. For example, humans are best at classifying happy expressions because the smile makes the task easy. The network model likewise finds it extremely easy to detect smiles, because they are obvious visual features that discriminate happy expressions from the rest of the faces in its training set. On the other hand, humans usually find that fear expressions are very difficult to classify. Fear expressions are often classified as surprise, for instance. The network model likewise sometimes classifies fear expressions as surprise. Why are fear expressions so difficult for humans to classify? Could it be that true displays of fear are uncommon in our society? Or that portrayals of fear in the popular culture misguide us as to what fearful people really look like? Or simply that fear expressions are perceptually similar to other expressions? Our results suggest that the latter is the case: culture probably has very little to do with the difficulty of fear expressions. We have shown that perceptual similarity alone is sufficient to account for the relative difficulty of facial expressions.

The agreement between human and network similarity structure analyses (MDS) is also somewhat surprising. As pointed out earlier, Russell and colleagues have found that affective adjectives and affective facial expressions seem to share a common similarity structure. One might postulate that the underlying psychological space reflects the physiological similarity of emotional states, and that when we are asked to classify facial expressions, we are likely to classify a given face as any of the emotions close by in that psychophysiological space. However, we have shown that an emotionless machine without any underlying physiology exhibits a similarity structure very much like that of the humans. This is even more remarkable when one considers that the network was not given any indication of the similarity structure in the training signal, which was always all-or-none, rather than, for example, human subject responses, which would reflect subjects' confusions. Since we nevertheless match the human similarity structure, we have shown that the perceptual similarity of the categories corresponds to the psychological similarity structure of facial expressions. Why would this be? We suggest that evolution did not randomly associate facial expressions with emotional states, but that the expression-to-emotion mapping evolved in tandem with the need to communicate emotional states effectively.

The final question we set out to explore with this work is whether facial expressions are represented continuously or as discrete entities. As explained in the Introduction, the best evidence for discrete representations is that subjects appear to place sharp boundaries between facial expression categories and are better able to discriminate pairs of expressions near category boundaries, suggesting that our perception of the faces is influenced by the existence of the categories. As has been found in other modalities (cf. Ellison & Massaro, 1997), we find that it is unnecessary to posit discrete representations to account for the sharp boundaries and high discrimination scores. The network model places boundaries between categories, as it must to obtain good classification accuracy, but the categories are actually fuzzy and overlapping. The model can be thought of as a biologically plausible working implementation of Russell and Bullock's (1986) theory of emotional facial expression categories as fuzzy concepts. Despite the network's fuzzy, overlapping category concepts, the categories appear sharp when a rule such as "respond with the category most strongly activated by the given facial expression" is applied. A more difficult result to explain in terms of fuzzy categories placed in a continuous multidimensional space is the subjects' high discrimination scores near category boundaries. This result seems to call for an influence of the category boundaries on our perception of the expressions; indeed, that is Young et al.'s (1997) tentative interpretation. To the contrary, we have shown that in our model, discrimination scores best agree with human results at a purely perceptual level, where the category labels have no effect. With respect to image space, the low-dimensional PCA representation actually changes faster in boundary regions than in regions near the prototypes it was derived from. In the context of our model, then, the seemingly contradictory results in different experiments conducted by Young et al. can be explained as simply tapping different computational levels of processing in a visual system organized much like our model.

We have found that our facial expression recognition model is "sufficient" to explain many aspects of human performance in behavioral tasks, but we have no proof of the "necessity" of any of our particular implementation decisions. In fact, we predict that many similar systems would obtain similar results. For instance, a system beginning with derivative-of-Gaussian edge filters (Marr, 1982), whose responses are combined to produce smooth responses to translations (like complex cells), should exhibit the same behavior. Replacing the PCA dimensionality reduction method with, say, factor analysis or the Infomax algorithm for independent components analysis (Bell & Sejnowski, 1995) should not dramatically affect the results. Finally, a category level using Support Vector Machines (Vapnik, 1995) should likewise produce similar behavior. The point of our simulations is that the category boundaries, discriminability, and similarity structures previously seen as being at odds are, in a sense, "present in the data and tasks themselves" and are easily exposed by any reasonable computational model.

METHODS

Network Details

The first step of processing in the model is to filter the image with a rigid 29 by 35 grid of overlapping 2-D Gabor filters (Daugman, 1985) in quadrature pairs at five scales and eight orientations (some example filters are shown in Figure 5). The quadrature pairs are used to compute a phase-insensitive energy response at each point in the grid. These energy responses, or "Gabor magnitudes," are often used as a simplifying model of the spatial responses of complex cells in the early visual system (Lades et al., 1993). Though the model loses some information about precise feature localization when phase information is thrown away, the overlapping receptive fields compensate for this loss (Hinton, McClelland, & Rumelhart, 1986). Each Gabor magnitude is z-scored (linearly transformed to have mean 0 and variance 1 over the training data) so that each filter contributes equally in the next representation layer.

The second step of processing in the model is to perform linear dimensionality reduction on the 40,600-element Gabor representation via a PCA of the training set. The actual computation is facilitated by Turk and Pentland's (1991) algebraic trick for "eigenfaces." This produces a 50-element representation typically accounting for approximately 80% of the variance in the training set's Gabor data.
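A sketch of the trick in its usual formulation: when there are far fewer examples than features, one diagonalizes the small (n x n) Gram matrix instead of the huge feature-space covariance matrix and maps its eigenvectors back to feature space. The function and variable names are ours.

```python
import numpy as np

def pca_via_gram(data, k=50):
    """data: (n_examples x n_features) array, here n_features = 40,600."""
    mean = data.mean(axis=0)
    centered = data - mean
    gram = centered @ centered.T                 # small n x n matrix
    evals, evecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]          # top k eigenvectors
    # if G v = lam v, then X^T v is an eigenvector of X^T X with the same lam
    components = centered.T @ evecs[:, order]
    components /= np.linalg.norm(components, axis=0)
    return mean, components.T                    # rows are unit-length components
```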

The 50-element vector p output by PCA can then be classified by a simple statistical model. We locally z-score (scale to mean 0, variance 1) each input element, then use a single-layer neural network (a generalized linear model) containing six outputs, one for each of the six "basic" emotions: happiness, sadness, fear, anger, surprise, and disgust. Each of the six units in the network computes a weighted sum o_i = Σ_j w_ij p_j of the 50-element input vector; then the "softmax" function y_i = e^(o_i) / Σ_j e^(o_j) is applied to the units' linear activations to obtain a vector of positive values whose sum is 1.0. The network is trained with the relative entropy error function so that its outputs correspond to the posterior probabilities of the emotion categories given the inputs (Bishop, 1995).
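As a concrete illustration, a minimal numpy sketch of this generalized linear model and one stochastic gradient step on the relative entropy (cross-entropy) error; with one-hot targets, the weight gradient reduces to the outer product of (y − t) and p. The names are ours, and the momentum and weight decay used in actual training (see Network Training) are omitted for brevity.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())             # subtract max for numerical stability
    return e / e.sum()

def category_level(p, W):
    """p: locally z-scored 50-element gestalt vector; W: 6 x 50 weights.
    Returns six positive outputs summing to 1 (posterior-like values)."""
    return softmax(W @ p)

def sgd_step(p, target, W, lr=0.01):
    """One gradient step; target is the all-or-none teaching signal."""
    y = category_level(p, W)
    W -= lr * np.outer(y - target, p)   # d(error)/dW = (y - t) p^T
    return W
```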

Network Training

We train the expression recognition system using leave-one-out cross-validation and early stopping. A given network is trained to minimize output error on all the images of 12 of the 14 actors in POFA using stochastic gradient descent, momentum, weight decay, and the relative entropy error function (Bishop, 1995). A thirteenth actor's images are used as a "holdout" set to determine when to stop training. Training is stopped when the error on the holdout set is minimized. After training is stopped, we evaluate the network's generalization performance on the remaining (fourteenth) actor. We performed training runs with every possible combination of generalization and holdout sets, for a total of 182 (14 by 13) individual networks.
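A skeleton of this regime; `train_network` and `evaluate` are hypothetical callables standing in for the gradient training and test-set scoring described above.

```python
import numpy as np

def train_all(actors, train_network, evaluate):
    """actors: the 14 POFA actors; returns mean generalization accuracy."""
    scores = []
    for test_actor in actors:                        # generalization actor
        for holdout_actor in actors:
            if holdout_actor == test_actor:
                continue
            train_set = [a for a in actors
                         if a not in (test_actor, holdout_actor)]
            # early stopping: keep the weights minimizing holdout error
            net = train_network(train_set, stop_on=holdout_actor)
            scores.append(evaluate(net, test_actor))
    return np.mean(scores)                           # over 14 * 13 = 182 networks
```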

Network Weight Visualization

The idea is to project each unit's weight vector back into image space in order to visualize what the network is sensitive to in an image. But this is not a well-defined task: though PCA is an easily inverted orthonormal transformation, the Gabor magnitude representation, besides being subsampled, throws away important phase information. As a workaround, we assume that each pixel's value is an approximately linear function of the 50-component gestalt-level (Gabor + PCA) representation. We chose one of the 182 trained networks, and for each pixel location, we used regression to find the linear function of the 50-element gestalt-level representation best predicting the pixel's value over the network's training set. Then, to visualize, say, the classifier's representation of happiness, we apply the regression function directly to the happiness unit's weights. An image was constructed from each of the six units' weight vectors using the regression functions learned from that network's training set.

Generation of Morphs

We generated morphs from the original stimuli (Ekman & Friesen, 1976) using the Morph program, version 2.5, from Gryphon Software, as described elsewhere (Jansari, Tranel, & Adolphs, 2000). Briefly, for each of the 15 pairs of expression prototypes, corresponding features between the two images were manually specified. The images were then tessellated and linearly transformed, both with respect to pixel position (a smooth warping) and pixel grayscale value (a smooth fade in luminance). The 10%, 30%, 50%, 70%, and 90% blends (see Figure 1b for examples) were retained for each transformation.

Multidimensional Scaling

MDS seeks to embed a set of stimuli in a low-dimensional space in such a way that the distances between points in the low-dimensional space are as faithful as possible to ratings of their similarity (Borg & Lingoes, 1987). We performed MDS on the human responses (Ekman & Friesen, 1976) and on the network's responses to the same stimuli at each level of processing in the network. Each analysis requires a distance (dissimilarity) matrix enumerating how dissimilar each pair of stimuli is. For the human data, we formed a six-element vector containing the probability with which humans gave the labels happy, sad, afraid, angry, surprised, and disgusted for each of the 96 nonneutral photographs in POFA. We obtained a 96 by 96 element similarity matrix from these data by computing the correlation r_ij between each pair of six-element vectors. Finally, we converted the resulting similarity matrix into a distance (dissimilarity) matrix with the transform d_ij = (1 − r_ij)/2 to obtain values in the range 0–1. For the network, we measure the similarity between two stimuli i and j as the correlation r_ij between the representations of the two stimuli at each level of the network, corresponding to similarity at different levels of processing: the pixel level (70,080-element image pixel value vectors), the perceptual level (40,600-element Gabor response patterns), the gestalt level (50-element Gabor/PCA patterns), and the network's output (six-element category level). We ran a classical nonmetric MDS algorithm, SSA-1, due to Guttman and Lingoes (Borg & Lingoes, 1987), on each of the four resulting distance matrices described above and then plotted the stimuli according to their positions in the resulting 2-D configurations.
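A sketch of the construction of these dissimilarity matrices; `reps` stands for the (n_stimuli x dimension) array of representations at one level (or the six-element human label-probability vectors).

```python
import numpy as np

def distance_matrix(reps):
    """Pairwise dissimilarities d_ij = (1 - r_ij) / 2 in the range 0-1."""
    r = np.corrcoef(reps)        # n x n matrix of pairwise Pearson correlations
    return (1.0 - r) / 2.0       # r = 1 -> d = 0; r = -1 -> d = 1
```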

Model RTs

We assume a network's RT is directly proportional to the "uncertainty of its maximal output." That is, we define a network's model RT for stimulus i to be t_i^model = 1 − max_j y_ij (the time scale is arbitrary). Here y_ij is the network's output for emotion j on stimulus i. This is similar to the standard approach of equating RT with a network's output error (Seidenberg & McClelland, 1989), except that for morph stimuli, there is no predetermined "correct" response. For comparison, the human data available are t_ij^human, the mean RT of subjects responding with emotion j to stimulus i. To compare the model to the humans given these data, we treated each network as an individual subject and, for each morph stimulus, recorded the network's response emotion j and model reaction time t_i^model, then averaged t_i^model over all networks making the same response to the stimulus. The quantitative comparison is the linear fit between network predictions and human RTs for all stimulus and response pairs for which both human and network data were available. Young et al.'s criterion for reporting t_ij^human was that at least 25% of the human subjects responded with emotion j to stimulus i, and for some of these cases none of the networks responded with emotion j to stimulus i, so the missing human and network data were disregarded.
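A sketch of this RT model and of the averaging over networks that made the same response; array names are our own.

```python
import numpy as np

def model_rt(y):
    """t_model = 1 - max_j y_j: more certain -> faster (arbitrary time scale)."""
    return 1.0 - y.max()

def mean_rt_by_response(outputs):
    """outputs: (n_networks x 6) softmax outputs for one stimulus.
    Returns the mean model RT for each emotion some network chose."""
    responses = outputs.argmax(axis=1)       # each network's chosen emotion
    return {j: np.mean([model_rt(y)
                        for y, r in zip(outputs, responses) if r == j])
            for j in set(responses)}
```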

Model Discrimination Scores

We assume discrimination is more difficult the more similar two stimuli are at some level of processing. We use the same measure of similarity as in the MDS procedure: the correlation r_ij between the network's representations of stimuli i and j (either at the pixel, perceptual, gestalt, or output level). To convert similarity scores to discrimination scores, we used the transform d_ij = 1 − r_ij. This was measured for each of the 30 pairs of JJ images for which human data were available. The discrimination scores were then averaged over the 13 networks that had not been trained on JJ and compared to the human data.

Mixed-in Expression Detection

To measure the ability of the model to detect the secondary expression mixed into a morph stimulus, we followed Young et al.'s (1997) methods. For each network's output on a given stimulus, we scored the first, second, and third highest outputs of the networks as a 3, 2, and 1, respectively, and assigned the score 0 to the three remaining outputs. For each morph and prototype stimulus, we averaged these score vectors across all 13 networks. Now, for each of the 30 possible combinations of near prototype (i) and far prototype (j), using the 90%, 70%, and 50% morphs moving from expression i to expression j, we subtracted the score vector for prototype i from the score for each of the three morphs. This eliminates the effect of the intrinsic similarity between JJ's prototype expressions. Now, averaging these score vectors across all 30 sequences, we obtain the data plotted in Figure 8.
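A sketch of the rank-scoring step; the subtraction of prototype scores described above would then be applied to the network-averaged score vectors. Names are illustrative.

```python
import numpy as np

def rank_scores(y):
    """Score the three largest of six outputs 3, 2, 1; the rest 0."""
    scores = np.zeros_like(y)
    top3 = np.argsort(y)[::-1][:3]   # indices of the three largest outputs
    scores[top3] = [3, 2, 1]
    return scores

# For a morph sequence i -> j: average rank_scores over the 13 networks,
# then subtract the averaged scores of prototype i from each morph's scores.
```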

Acknowledgments

We thank Stevan Harnad, Bill Kristan, Paul Munro, Alice O'Toole, Terry Sejnowski, and Gary's Unbelievable Research Unit (GURU) for helpful comments on previous versions of this manuscript. We also thank Andrew Young for permission to use the human data plotted in Figures 2, 3, 7a, and 8. We are also grateful to Paul Ekman for providing us with the Pictures of Facial Affect. This research was funded by NIMH grant MH57075 to GWC.

Reprint requests should be sent to Garrison W. Cottrell, UCSD Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA, or via e-mail: gary@cs.ucsd.edu.

Notes

1. For comparison, the order of difficulty for the Calder et al. (2001) PCA model is happiness (98%), fear (97%), surprise (92%), anger (86%), sadness (72%), and disgust (72%).
2. We used 13 networks for technical reasons (see Methods for details).
3. The same comparison of our model with the confusion matrix from the Calder et al. (2001) forced-choice experiment is slightly better, .686, whereas their PCA model's correlation with their subjects' confusion data is somewhat lower, .496.

REFERENCES

Beale, J., & Keil, F. (1995). Categorical effects in the perception of faces. Cognition, 57, 217–239.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer.
Calder, A., Young, A., Perrett, D., Etcoff, N., & Rowland, D. (1996). Categorical perception of morphed facial expressions. Visual Cognition, 3, 81–117.
Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179–1208.
Calder, A. J., Young, A. W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 526–551.
Carroll, J. M., & Russell, J. A. (1996). Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70, 205–218.
Cottrell, G. W., Dailey, M. N., Padgett, C., & Adolphs, R. (2000). Is all face processing holistic? The view from UCSD. In M. Wenger & J. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Contexts and challenges (pp. 347–395). Mahwah, NJ: Erlbaum.
Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, gender and emotion recognition using holons. In R. P. Lippman, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564–571). San Mateo: Morgan Kaufmann.
Cottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. E. Sharkey (Ed.), Models of cognition: A review of cognitive science. Norwood, NJ.
Craw, I., & Cameron, P. (1991). Parameterising images for recognition and reconstruction. In Proceedings of the British Machine Vision Conference (pp. 367–370). Berlin: Springer.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.
Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 974–989.
Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and emotion (chap. 3, pp. 45–60). New York: Wiley.
Ekman, P., & Friesen, W. (1976). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists Press.
Ellison, J. W., & Massaro, D. W. (1997). Featural evaluation, integration, and judgment of facial affect. Journal of Experimental Psychology: Human Perception and Performance, 23, 213–226.
Etcoff, N. L., & Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227–240.
Goldstone, R. L. (2000). A neural network model of concept-influenced segmentation. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.
Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representations through category learning. Cognition, 78, 27–43.
Hancock, P. J. B., Bruce, V., & Burton, A. M. (1998). A comparison of two computer-based face recognition systems with human perception of faces. Vision Research, 38, 2277–2288.
Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory and Cognition, 24, 26–40.
Harnad, S. R. (1987). Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (vol. 1, chap. 3, pp. 77–109). Cambridge: MIT Press.
Jansari, A., Tranel, D., & Adolphs, R. (2000). A valence-specific lateral bias for discriminating emotional facial expressions in free field. Cognition and Emotion, 14, 341–353.
Katsikitis, M. (1997). The classification of facial expressions of emotion: A multidimensional scaling approach. Perception, 26, 613–626.
Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300–311.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1–12.
Lien, J. J.-J., Kanade, T., Cohn, J. F., & Li, C.-C. (2000). Detection, tracking, and classification of action units in facial expression. Robotics and Autonomous Systems, 31, 131–146.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.
Lyons, M. J., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (pp. 200–205). Los Alamitos, CA: IEEE Computer Society.
Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1357–1362.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.
Matsumoto, D. (1992). American–Japanese cultural differences in the recognition of universal facial expressions. Journal of Cross-Cultural Psychology, 23, 72–84.
Padgett, C., Cottrell, G., & Adolphs, R. (1996). Categorical perception in facial emotion classification. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Padgett, C., & Cottrell, G. W. (1998). A simple neural network models categorical perception of facial expressions. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 806–807). Mahwah, NJ: Erlbaum.
Perrett, D., Hietanen, J., Oram, M., & Benson, P. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London, B: Biological Sciences, 335, 23–30.
Pevtzow, R., & Harnad, S. R. (1997). Warping similarity space in category learning by human subjects: The role of task difficulty. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 189–195). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.
Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199–1204.
Rosenblum, M., Yacoob, Y., & Davis, L. S. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7, 1121–1138.
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161–1178.
Russell, J. A., & Bullock, M. (1986). Fuzzy concepts and the perception of emotion in facial expressions. Social Cognition, 4, 309–341.
Russell, J. A., Lewicka, M., & Niit, T. (1989). A cross-cultural study of a circumplex model of affect. Journal of Personality and Social Psychology, 57, 848–856.
Sanger, T. (1989). An optimality principle for unsupervised learning. In D. Touretzky (Ed.), Advances in neural information processing systems (vol. 1, pp. 11–19). San Mateo: Morgan Kaufmann.
Schiano, D. J., Ehrlich, S. M., Sheridan, K. M., & Beck, D. M. (2000). Evidence for continuous rather than categorical perception of facial affect. Abstract presented at the 40th Annual Meeting of the Psychonomic Society.
Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44, 229–237.
Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.
Tijsseling, A., & Harnad, S. R. (1997). Warping similarity space in category learning by backprop nets. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 263–269). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
Woodworth, R. S. (1938). Experimental psychology. New York: Holt.
Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271–313.

Dailey et al 1173

Page 7: EMPATH: A Neural Network that Categorizes Facial Expressionscseweb.ucsd.edu/~mdailey/papers/DaileyEtalCogneuro.pdfconceptual and linguistic systems’ ’ (p. 229). The authors created

using leave-one-image-out classification for the Ekmanand Friesen database) The authors went on to comparetheir system with human data from psychological exper-iments They found that their system behaved similarlyto humans in a seven-way forced-choice task and thatthe principal component representation could be usedto predict human ratings of pleasure and arousal (thetwo dimensions of Russellrsquos circumplex)

Taken together results with the above three compu-tational models of facial expression recognition begin tohint that subjectsrsquo performance in psychological experi-ments can be explained as a simple consequence ofcategory learning and the statistical properties of thestimuli themselves Although Calder et alrsquos systemexhibits a surprising amount of similarity to humanperformance in a forced-choice experiment it hasnot been brought to bear in the debate on multi-dimensional versus CP of facial expressions In thepresent article we show that a new more biologicallyplausible computational model not only exhibits moresimilarity to human forced-choice performance butalso provides a detailed computational account of datasupporting both categorical and multidimensional the-ories of facial expression recognition and perceptionOur simulations consist of constructing and training asimple neural network model (Figure 4) the systemuses the same stimuli employed in many psychologicalexperiments performs many of the same tasks andcan be measured in similar ways as human subjectsThe model consists of three levels of processingperceptual analysis object representation and catego-rization The next section describes the model insome detail

The Model

The system is a feed-forward network consisting of threelayers common to most object recognition models(Riesenhuber amp Poggio 2000) Its input is a 240 by292 manually aligned grayscale face image from Ekmanand Friesenrsquos POFA (Ekman amp Friesen 1976) This dataset contains photos of 14 actors portraying expressionsthat are reliably classified as happy sad afraid angrysurprised or disgusted by naive observers (70 agree-ment was the threshold for inclusion in the data setand the overall agreement is 916) The expressionsmade by one of those actors lsquolsquoJ Jrsquorsquo are shown inFigure 1a The strong agreement across human sub-jects together with the use of these photos in a widevariety of psychological experiments exploring emo-tional expression perception makes POFA an idealtraining set for our model

The first layer of the model is a set of neurons whoseresponse properties are similar to those of complex cellsin the visual cortex The so-called Gabor filter (Daug-man 1985) is a standard way to model complex cells invisual recognition systems (Lades et al 1993) Figure 5shows the spatial lsquolsquoreceptive fieldsrsquorsquo of several filters Theunits essentially perform nonlinear edge detection atfive scales and eight orientations As a feature detectorthe Gabor filter has the advantage of moderate trans-lation invariance over pixel-based representations Thismeans that features can lsquolsquomoversquorsquo in the receptive fieldwithout dramatically affecting the detectorrsquos responsemaking the first layer of our model robust to small imagechanges With a 29 by 35 grid and 40 filters at each gridlocation we obtain 40600 model neurons in this layerwhich we term the lsquolsquoperceptualrsquorsquo layer

In order to extract a small set of informative featuresfrom this high-dimensional data we use the equivalentof an lsquolsquoimage compressionrsquorsquo network (Cottrell Munro ampZipser 1989) This is a back propagation network that istrained to simply reproduce its input on its outputthrough a narrow channel of hidden units In order tosolve this problem the hidden units must extract regu-larities from the data Since they are fully connected tothe inputs they usually extract global representationswe have termed lsquolsquoholonsrsquorsquo in previous work (Cottrell ampMetcalfe 1991) We note that this is a biologicallyplausible means of dimensionality reduction in the sensethat it is unsupervised and can be learned by simplenetworks employing Hebbian learning rules (Sanger1989) As a shortcut we compute this networkrsquos weights

Figure 5 Example Gaborfilters The real (cosine shaped)component is shown relative tothe size of the face at all fivescales and five of the eightorientations used

Figure 4 Model schematic

1164 Journal of Cognitive Neuroscience Volume 14 Number 8

directly via the equivalent statistical technique of PCAWe chose 50 principal components (hidden units) forthis layer based on previous experiments showing thisleads to good generalization on POFA It is important tonote that the number of components (the only freeparameter in our system) was not tuned to human dataThe resulting low-dimensional object-level representa-tion is specific to the facial expression and identityvariations in its input as is the population of so-calledface cells in the inferior temporal cortex (Perrett Hieta-nen Oram amp Benson 1992) We term this layer ofprocessing the lsquolsquogestaltrsquorsquo level

Although PCA is one of the simplest and mostefficient methods for coding a set of stimuli othermethods would probably also work For the currentmodel it is only important that (1) the code besensitive to the dimensions of variance in faces tofacilitate learning of expressions and that (2) the codebe low dimensional to facilitate generalization to novelfaces In a more general object recognition system asingle PCA for all kinds of objects would probably notbe appropriate because the resulting code would notbe sparse (a single image normally activates a largenumber of holistic lsquolsquounitsrsquorsquo in a PCA representation butobject-sensitive cells in the visual system appear torespond much more selectively Logothetis amp Shein-berg 1996) To obtain a good code for a large numberof different objects then nonnegative matrix factoriza-tion (Lee amp Seung 1999) or an independent compo-nent mixture model (Lee Lewicki amp Sejnowski 2000)might be more appropriate For the current problemthough PCA suffices

The outputs of the gestalt layer are finally categorizedinto the six lsquolsquobasicrsquorsquo emotions by a simple perceptronwith six outputs one for each emotion The network isset up so that its outputs can be interpreted as proba-bilities (they are all positive and sum to 1) However thesystem is trained with an lsquolsquoall-or-nonersquorsquo teaching signalThat is even though only 92 (say) of the humansubjects used to vet the POFA database may haveresponded lsquolsquohappyrsquorsquo to a particular face the networkrsquosteaching signal is 1 for the lsquolsquohappyrsquorsquo unit and 0 for theother five Thus the network does not have availableto it the confusions that subjects make on the dataWe term this layer the lsquolsquocategoryrsquorsquo level

While this categorization layer is an extremely simplis-tic model of human category learning and decision-making processes we argue that the particular form ofclassifier is unimportant so long as it is sufficientlypowerful and reliable to place the gestalt-level represen-tations into emotion categories we claim that similarresults will obtain with any nonlinear classifier

We should also point out that the system abstractsaway many important aspects of visual processing inthe brain such as eye movements facial expressiondynamics size and viewpoint invariance These com-plicating factors turn out to be irrelevant for our

purposes as we shall see despite the simplifyingassumptions of the model it nevertheless accountsfor a wide variety of data available from controlledbehavioral experiments This suggests that it is a goodfirst-order approximation of processing at the computa-tional level in the visual system

The next section reports on the results of severalexperiments comparing the modelrsquos predictions tohuman data from identical tasks with special emphasison Young et alrsquos landmark study of categorical effects infacial expression perception (Young et al 1997) Wefind (1) that the model and humans find the sameexpressions difficult or easy to interpret (2) that whenpresented with morphs between pairs of expressionsthe model and humans place similar sharp categoryboundaries between the prototypes (3) that pairwisesimilarity ratings derived from the modelrsquos gestalt-levelrepresentations predict human discrimination ability (4)that the model and humans are similarly sensitive tomixed-in expressions in morph stimuli and (5) that MDSanalysis produces a similar emotional similarity structurefrom the model and human data

RESULTS

Comparison of Model and Human Performance

The connections in the model's final layer were trained to classify images of facial expressions from Ekman and Friesen's POFA database (see Figure 1a) (Ekman & Friesen, 1976), the standard data set used in the vast majority of psychological research on facial expression (see Methods for details). The training signal contained no information aside from the expression most agreed upon by human observers; that is, even if 90% of human observers labeled a particular face "happy" and 10% labeled it "surprised," the network was trained as if 100% had said "happy." Again, there is no information in the training signal concerning the similarity of different expression categories. The first measurement we made was the model's ability to generalize in classifying the expressions of previously unseen subjects from the same database. The model's mean generalization performance was 90.0%, which is not significantly different from human performance on the same stimuli (91.6%; t = .587, df = 190, p = .279) (Table 1). Moreover, the rank-order correlation between the model's average accuracy on each category (happiness, surprise, fear, etc.) and the level of human agreement on the same categories was .667 (two-tailed Kendall's tau, p = .044; cf. Table 1). For example, both the model and humans found happy faces easy and fear faces the most difficult to categorize correctly.¹ Since the network had about as much experience with one category as another and was not trained on the human response accuracies, this correlation between the relative difficulties of each category is an "emergent" property of the model. Studies of expression recognition consistently find that fear is one of the most difficult expressions to recognize (Katsikitis, 1997; Matsumoto, 1992; Ekman & Friesen, 1976). Our model suggests that this is simply because the fear expression is "inherently" difficult to distinguish from the other five expressions.
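The rank-order statistic can be verified directly from Table 1; the sketch below computes the simple tau-a form (concordant minus discordant pairs over all 15 pairs of categories), which reproduces the reported .667 on these data:

from itertools import combinations

network = [100.0, 100.0, 100.0, 89.1, 83.3, 67.2]  # Table 1, network column
human = [98.7, 92.4, 92.3, 88.9, 89.2, 87.7]       # Table 1, human column

pairs = list(combinations(range(6), 2))
score = 0
for i, j in pairs:
    product = (network[i] - network[j]) * (human[i] - human[j])
    if product > 0:
        score += 1   # concordant pair
    elif product < 0:
        score -= 1   # discordant pair; tied pairs contribute 0

tau_a = score / len(pairs)
print(round(tau_a, 3))   # 0.667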

How does the network perform the expression recognition task? An examination of the trained network's representation provides some insight. We projected each unit's weight vector back into image space in order to visualize the "ideal" stimulus for each unit in the network (see Methods for details). The results are shown in Figure 6, with and without addition of the average face. In each of the images, each pixel value is the result of applying a regression formula predicting the value of the pixel at that location as a linear function of the 50-element weight vector for the given network output unit. Dark and bright spots indicate the features that excite or inhibit a given output unit, depending on the relative gray values in the region of that feature. Each unit appears to combine evidence for an emotion based upon the presence or absence of a few local features. For instance, for fear, the salient criteria appear to be the eyebrow raise and the eyelid raise, with a smaller contribution of parted lips. The anger unit apparently requires a display in which the eyebrows are not raised. Clearly, the classifier has determined which feature configurations reliably distinguish each expression from the others.

Comparison on CP Data

Several studies have reported CP of facial expressions using morphs between portrayals of the six basic emotions by POFA actor "J. J.," whose images have been chosen because his expressions are among the easiest to recognize in the database (Young et al., 1997; Calder et al., 1996). Since our model also finds J. J. "easy" (it achieves 100% generalization accuracy on J. J.'s prototypical expressions), we replicated these studies with 13 networks that had not been trained on J. J.²

We first compared the model's performance with human data from a six-way forced-choice experiment (Young et al., 1997) on 10%, 30%, 50%, 70%, and 90% morphs we constructed (see Methods) between all pairs of J. J.'s prototypical expressions (see Figure 1b for two such morph sequences). We modeled the subjects' identification choices by letting the outputs of the networks represent the probability mass function for the human subjects' responses. This means that we averaged each of the six outputs of the 13 networks for each face and compared these numbers to the human subjects' response probabilities. Figure 2 compares the networks' forced-choice identification performance with that of humans on the same stimuli. Quantitatively, the model's responses were highly correlated with the human data (r = .942, p < .001), and qualitatively, the model maintained the essential features of the human data: abrupt shifts in classification at emotion category boundaries and few intrusions (identifications of unrelated expressions) along the transitions. Using the same criteria for an intrusion that Young et al. (1997) used, the model predicted 4 intrusions in the 15 morph sequences, compared to 2 out of 15 in the human data. Every one of the intrusions predicted by the model involved fear, which is the least reliably classified expression in the POFA database for humans and networks (Table 1).

Table 1. Error Rates for Networks and Level of Agreement for Humans

Expression    Network Percentage Correct    Human Percentage Correct
Happiness     100.0                         98.7
Surprise      100.0                         92.4
Disgust       100.0                         92.3
Anger          89.1                         88.9
Sadness        83.3                         89.2
Fear           67.2                         87.7
Average        90.0                         91.6

Network generalization to unseen faces compared with human agreement on the same faces (six-way forced choice). Human data are provided with the POFA database (Ekman & Friesen, 1976).

Figure 6. Network representation of each facial expression. (Top row) Approximation of the optimal input-level stimulus for each facial expression category. (Bottom row) The same approximations with the average face subtracted; dark and bright pixels indicate salient features.



We also compared subjects' RTs to those of the network, in which we model RT as being proportional to uncertainty at the category level. That is, if the network outputs a probability of .9 for happiness, we assume it is more certain, and therefore faster, than when it outputs a probability of .6 for happiness. We thus use the difference between 1 (the maximum possible output) and the largest actual output as our model of the reaction time. As explained earlier, subjects' RTs exhibit a characteristic scalloped shape, with slower RTs near category boundaries. The model also exhibits this pattern, as shown in Figure 2. The reason should be obvious: As a sequence of stimuli approaches a category boundary, the network's output for one category necessarily decreases as the other increases. This results in a longer model RT. The model showed good correlation with human RTs (see Methods) (r = .677, p < .001).
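In code, the RT model is a one-liner over a hypothetical output vector:

import numpy as np

outputs = np.array([0.60, 0.20, 0.08, 0.06, 0.04, 0.02])  # softmax outputs
model_rt = 1.0 - outputs.max()   # 0.40; a 0.9-confidence response gives 0.10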

The model also provides insight into human discrimination behavior. How can a model that processes one face at a time discriminate faces? The idea is to imagine that the model is shown one face at a time and stores its representation of each face for comparison. We use the correlation (Pearson's r) between representations as a measure of similarity and then use 1 minus this number as a measure of discriminability. An interesting aspect of our model is that it has several independent levels of processing (see Figure 4), allowing us to determine which level best accounts for a particular phenomenon. In this case, we compute the similarity between two faces at the pixel level (correlation between the raw images), the perceptual level (correlation between the Gabor filter responses), the gestalt level (correlation between the principal components), or the categorization level (correlation between the six outputs). We compared our measure of discriminability with human performance in Young et al.'s (1997) ABX experiment (see Methods for details). We found the following correlations at each processing level: pixel, r = .35, p = .06; perceptual, r = .52, p = .003; gestalt, r = .65, p < .001 (shown in Figure 7b); category, r = .41, p = .02. Crucially, when the gestalt layer and the categorization layer were combined in a multiple regression, the categorization layer's contribution was insignificant (p = .3), showing that the explanatory power of the model rests with the gestalt layer. According to our model, then, human subjects' improved discrimination near category boundaries in this task is best explained as an effect at the level of gestalt-based representations, which were derived in an unsupervised way from Gabor representations via PCA. This is in sharp contrast to the standard explanation of this increased sensitivity, which is that categorization influences perception (Goldstone et al., 2001; Pevtzow & Harnad, 1997). This suggests that the facial expression morph sequences have natural boundaries, possibly because the endpoints are extremes of certain coordinated muscle group movements. In other domains, such as familiar face classification (Beale & Keil, 1995), the boundaries must arise through learning. In such domains, we expect that discrimination would be best explained in a learned feature layer, as in other models (Goldstone, 2000; Tijsseling & Harnad, 1997).

Figure 7. Discrimination of morph stimuli. (a) Percent correct discrimination of pairs of stimuli in an ABX task (Young et al., 1997). Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels to the left and right of each point show which two stimuli were being compared (e.g., 70/90 along the transition from sadness to disgust denotes a comparison of a 70% disgust/30% sadness morph with a 90% disgust/10% sadness morph). Note that better performance occurs near category boundaries than near prototypes (highlighted by the vertical lines). (b) The model's discrimination performance at the gestalt representation level. The model discrimination scores are highly correlated with the human subjects' scores (r = .65, p < .001).
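A sketch of the layer-by-layer discriminability analysis described above, with random placeholder scores standing in for the real representations; statsmodels is our choice here for the multiple regression, not something specified by the model:

import numpy as np
import statsmodels.api as sm

def discriminability(rep_a, rep_b):
    """1 minus the Pearson correlation between two representations."""
    return 1.0 - np.corrcoef(rep_a, rep_b)[0, 1]

rng = np.random.default_rng(1)
n_pairs = 30                                    # neighboring-morph pairs
human = rng.uniform(0.5, 1.0, n_pairs)          # placeholder ABX scores
gestalt = human + rng.normal(0, 0.05, n_pairs)  # placeholder model scores
category = human + rng.normal(0, 0.15, n_pairs)

X = sm.add_constant(np.column_stack([gestalt, category]))
fit = sm.OLS(human, X).fit()
print(fit.pvalues)   # in the paper, the category term is insignificant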

The data above show that the model naturally accounts for categorical behavior of humans making forced-choice responses to facial expression morph stimuli. In another experiment, Young et al. (1997) found that subjects could reliably detect the expression mixed into a morph, even at the 30% level. Can the model also account for this decidedly noncategorical behavior? In the experiment, subjects were asked to decide, given a morph or prototype stimulus, the most apparent emotion, the second-most apparent emotion, and the third-most apparent emotion (the scores on prototypes were used to normalize for inherent expression similarity). To compare the human responses with the model, we used the top three outputs of the 13 networks and used the same analysis as Young et al. to determine the extent to which the network could detect the mixed-in expression in the morph images (see Methods for details). Figure 8 shows that the model's average sensitivity to the mixed-in expression is almost identical to that of human subjects, even though its behavior seems categorical in forced-choice experiments.

Comparison of Similarity Structures

We next investigated the similarity structure of the representations that the model produced. As before, we calculated the similarity between pairs of faces at each level of the network by computing the correlation between their representations. In order to evaluate the similarity structure qualitatively, we performed MDS both on the human forced-choice responses published by Ekman and Friesen (1976) and on the network's responses to the same stimuli at each level of processing shown in Figure 4. At the pixel level, we found that the structure present in the MDS configuration was based mainly on identity, and as we moved toward the categorization level, the identity-based structure began to break down and expression-related clusters began to emerge. Unsurprisingly, at the network's categorization layer, the configuration was clearly organized by emotion category (Figure 9). Of more interest, however, is that the "ordering" of facial expressions around the human and network MDS configurations is the same, a result unlikely to have arisen by chance (p = 1/60 ≈ .017).

Figure 9. Multidimensional scaling of human and network responses reveals similar dimensions of emotion. Each point represents one of the 96 expressive faces in POFA. (a) 2-D similarity space induced by MDS from the average six-way forced-choice responses of human subjects (Ekman & Friesen, 1976) (stress = 0.218). (b) 2-D similarity space induced by MDS from the average training set responses of networks at their output layers (stress = 0.201). The arrangement of emotions around the circle is the same in both cases. Stress at the pixel level was 0.222; the perceptual level, 0.245; and the gestalt level, 0.286.

Figure 8. Ability to detect the mixed-in expression in morph sequences for humans and the model. In the human data (dashed lines), subjects chose the far prototype (the faint second expression in the morph) significantly more often than unrelated expressions when that expression's mix ratio was 30% or greater, even though they consistently identified the 70% expression in forced-choice experiments. The model is almost identical to the humans in its ability to detect the secondary expression in a morph image.


Given that the network was never given any similarity information, this is a remarkable result. It suggests that the human similarity structure is a simple result of the inherent confusability of the categories, not necessarily the result of locality in some underlying psychological emotion space, as dimensional theories (e.g., Russell, 1980; Russell et al., 1989) might predict.

To measure the similarity between the human and network category structures quantitatively, we compared the human and model confusion matrices directly. For each pair of expressions, we computed the probability $p_{ij}$ that humans or networks respond with emotion i when the intended emotion is j, for all $i \neq j$; that is, the probability of confusion between two categories. The correlation between networks and humans on the networks' test sets (the stimuli the networks had never seen before) was .661 (p < .001).³ Thus, there is a close correspondence between human confusion rates and response distributions at the model's categorization level. The system is never instructed to confuse the categories in a way similar to humans; nevertheless, this property emerges naturally from the data.
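A sketch of this comparison, with random placeholder matrices standing in for the human and network confusion data:

import numpy as np

rng = np.random.default_rng(2)
# Row j holds p(respond i | intended j); each row sums to 1.
human_cm = rng.dirichlet(np.ones(6), size=6)
net_cm = rng.dirichlet(np.ones(6), size=6)

off_diag = ~np.eye(6, dtype=bool)   # keep only the i != j confusion entries
r = np.corrcoef(human_cm[off_diag], net_cm[off_diag])[0, 1]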

DISCUSSION

In this article, we have introduced a computational model that mimics some of the important functions of the visual system. The model is simply a pattern classifier incorporating a biologically plausible representation of visual stimuli. We first engineered the system to provide good classification performance on a small database of reliably recognizable facial expressions. The free parameters of the model, such as the number of principal components used in dimensionality reduction, were optimized to maximize the classifier's generalization performance. In contrast to most models in mathematical psychology, which seek to fit a low number of free parameters to maximize a model's agreement with human data, our model is compared to human data directly, without any tuning.

The results of our comparison of the model's performance with human data were nevertheless remarkable. We first found that the relative levels of difficulty for six basic facial expressions of emotion were highly correlated with the levels at which humans agree on those same emotions. For example, humans are best at classifying happy expressions because the smile makes the task easy. The network model likewise finds it extremely easy to detect smiles, because they are obvious visual features that discriminate happy expressions from the rest of the faces in its training set. On the other hand, humans usually find that fear expressions are very difficult to classify. Fear expressions are often classified as surprise, for instance. The network model likewise sometimes classifies fear expressions as surprise. Why are fear expressions so difficult for humans to classify? Could it be that true displays of fear are uncommon in our society? Or that portrayals of fear in the popular culture misguide us as to what fearful people really look like? Or simply that fear expressions are perceptually similar to other expressions? Our results suggest that the latter is the case; culture probably has very little to do with the difficulty of fear expressions. We have shown that perceptual similarity alone is sufficient to account for the relative difficulty of facial expressions.

The agreement between human and network similarity structure analyses (MDS) is also somewhat surprising. As pointed out earlier, Russell and colleagues have found that affective adjectives and affective facial expressions seem to share a common similarity structure. One might postulate that the underlying psychological space reflects the physiological similarity of emotional states, and that when we are asked to classify facial expressions, we are likely to classify a given face as any of the emotions close by in that psychophysiological space. However, we have shown that an emotionless machine without any underlying physiology exhibits a similarity structure very much like that of the humans. This is even more remarkable when one considers that the network was not given any indication of the similarity structure in the training signal, which was always all-or-none, rather than, for example, human subject responses, which would reflect subjects' confusions. Since we nevertheless match the human similarity structure, we have shown that the perceptual similarity of the categories corresponds to the psychological similarity structure of facial expressions. Why would this be? We suggest that evolution did not randomly associate facial expressions with emotional states, but that the expression-to-emotion mapping evolved in tandem with the need to communicate emotional states effectively.

The final question we set out to explore with this work is whether facial expressions are represented continuously or as discrete entities. As explained in the Introduction, the best evidence for discrete representations is that subjects appear to place sharp boundaries between facial expression categories and are better able to discriminate pairs of expressions near category boundaries, suggesting that our perception of the faces is influenced by the existence of the categories. As has been found in other modalities (cf. Ellison & Massaro, 1997), we find that it is unnecessary to posit discrete representations to account for the sharp boundaries and high discrimination scores. The network model places boundaries between categories, as it must to obtain good classification accuracy, but the categories are actually fuzzy and overlapping. The model can be thought of as a biologically plausible working implementation of Russell and Bullock's (1986) theory of emotional facial expression categories as fuzzy concepts. Despite the network's fuzzy, overlapping category concepts, the categories appear sharp when a rule such as "respond with the category most strongly activated by the given facial expression" is applied. A more difficult result to explain in terms of fuzzy categories placed in a continuous multidimensional space is the subjects' high discrimination scores near category boundaries. This result seems to call for an influence of the category boundaries on our perception of the expressions; indeed, that is Young et al.'s (1997) tentative interpretation. To the contrary, we have shown that in our model, discrimination scores best agree with human results at a purely perceptual level, where the category labels have no effect. With respect to image space, the low-dimensional PCA representation actually changes faster in boundary regions than in regions near the prototypes it derived from. In the context of our model, then, the seemingly contradictory results in different experiments conducted by Young et al. can be explained as simply tapping different computational levels of processing in a visual system organized much like our model.

We have found that our facial expression recognition model is "sufficient" to explain many aspects of human performance in behavioral tasks, but we have no proof of the "necessity" of any of our particular implementation decisions. In fact, we predict that many similar systems would obtain similar results. For instance, a system beginning with derivative-of-Gaussian edge filters (Marr, 1982), whose responses are combined to produce smooth responses to translations (like complex cells), should exhibit the same behavior. Replacing the PCA dimensionality reduction method with, say, factor analysis or the Infomax algorithm for independent components analysis (Bell & Sejnowski, 1995) should not dramatically affect the results. Finally, a category level using Support Vector Machines (Vapnik, 1995) should likewise produce similar behavior. The point of our simulations is that the category boundaries, discriminability, and similarity structures previously seen as being at odds are, in a sense, "present in the data and tasks themselves" and are easily exposed by any reasonable computational model.

METHODS

Network Details

The first step of processing in the model is to filter the image with a rigid 29 by 35 grid of overlapping 2-D Gabor filters (Daugman, 1985) in quadrature pairs at five scales and eight orientations (some example filters are shown in Figure 5). The quadrature pairs are used to compute a phase-insensitive energy response at each point in the grid. These energy responses, or "Gabor magnitudes," are often used as a simplifying model of the spatial responses of complex cells in the early visual system (Lades et al., 1993). Though the model loses some information about precise feature localization when phase information is thrown away, the overlapping receptive fields compensate for this loss (Hinton, McClelland, & Rumelhart, 1986). Each Gabor magnitude is z-scored (linearly transformed to have mean 0 and variance 1 over the training data) so that each filter contributes equally in the next representation layer.
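A minimal sketch of this perceptual layer using scikit-image's Gabor filter; the image size, grid spacing, and frequencies below are illustrative assumptions rather than the paper's exact parameters:

import numpy as np
from skimage.filters import gabor

image = np.random.rand(292, 240)   # placeholder grayscale face image

magnitudes = []
for frequency in (0.05, 0.1, 0.2, 0.3, 0.4):   # five scales (illustrative)
    for k in range(8):                          # eight orientations
        real, imag = gabor(image, frequency=frequency, theta=k * np.pi / 8)
        energy = np.sqrt(real**2 + imag**2)     # phase-insensitive magnitude
        magnitudes.append(energy[::10, ::10].ravel())   # coarse spatial grid

vector = np.concatenate(magnitudes)
# z-scoring each element over the training set would follow here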

The second step of processing in the model is to perform linear dimensionality reduction on the 40,600-element Gabor representation via a PCA of the training set. The actual computation is facilitated by Turk and Pentland's (1991) algebraic trick for "eigenfaces." This produces a 50-element representation, typically accounting for approximately 80% of the variance in the training set's Gabor data.
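The trick rests on the fact that, with far fewer images than feature dimensions, the leading eigenvectors of the huge covariance matrix can be recovered from the small Gram matrix of the images; a sketch:

import numpy as np

X = np.random.randn(100, 40600)          # N centered images, N << D
X -= X.mean(axis=0)

gram = X @ X.T                           # N x N instead of D x D
evals, evecs = np.linalg.eigh(gram)      # eigenvalues in ascending order
top = np.argsort(evals)[::-1][:50]

components = X.T @ evecs[:, top]         # lift back to D-dimensional space
components /= np.linalg.norm(components, axis=0)
codes = X @ components                   # 50-element code per image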

The 50-element vector p output by PCA can then be classified by a simple statistical model. We locally z-score (scale to mean 0, variance 1) each input element, then use a single-layer neural network (a generalized linear model) containing six outputs, one for each of the six "basic" emotions: happiness, sadness, fear, anger, surprise, and disgust. Each of the six units in the network computes a weighted sum $o_i = \sum_j w_{ij} p_j$ of the 50-element input vector; then the "softmax" function $y_i = e^{o_i} / \sum_j e^{o_j}$ is applied to the units' linear activations to obtain a vector of positive values whose sum is 1.0. The network is trained with the relative entropy error function so that its outputs correspond to the posterior probabilities of the emotion categories given the inputs (Bishop, 1995).
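Rendered directly in code, the category layer's forward pass is:

import numpy as np

def category_layer(p, W):
    """p: locally z-scored 50-element gestalt vector; W: 6 x 50 weights."""
    o = W @ p                    # six linear activations o_i
    e = np.exp(o - o.max())      # max shift added for numerical stability
    return e / e.sum()           # softmax: positive outputs summing to 1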

Network Training

We train the expression recognition system using leave-one-out cross-validation and early stopping. A given network is trained to minimize output error on all the images of 12 of the 14 actors in POFA using stochastic gradient descent, momentum, weight decay, and the relative entropy error function (Bishop, 1995). A thirteenth actor's images are used as a "hold out" set to determine when to stop training. Training is stopped when the error on the hold out set is minimized. After training is stopped, we evaluate the network's generalization performance on the remaining (fourteenth) actor. We performed training runs with every possible combination of generalization and hold out sets, for a total of 182 (14 by 13) individual networks.
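A sketch of this protocol, with train_network and evaluate as hypothetical helpers standing in for the gradient-descent training and the test-set evaluation:

actors = list(range(14))   # the 14 POFA actors
results = []
for test_actor in actors:
    for holdout_actor in actors:
        if holdout_actor == test_actor:
            continue
        train_actors = [a for a in actors
                        if a not in (test_actor, holdout_actor)]
        # hypothetical helpers: train with early stopping on the hold out
        # actor, then measure generalization on the test actor
        net = train_network(train_actors, early_stop_on=holdout_actor)
        results.append(evaluate(net, test_actor))
assert len(results) == 182   # 14 x 13 networks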

Network Weight Visualization

The idea is to project each unit's weight vector back into image space in order to visualize what the network is sensitive to in an image. But this is not a well-defined task: though PCA is an easily inverted orthonormal transformation, the Gabor magnitude representation, besides being subsampled, throws away important phase information. As a workaround, we assume that each pixel's value is an approximately linear function of the 50-component gestalt-level (Gabor + PCA) representation. We chose one of the 182 trained networks, and for each pixel location, we used regression to find the linear function of the 50-element gestalt-level representation best predicting the pixel's value over the network's training set. Then, to visualize, say, the classifier's representation of happiness, we apply the regression function directly to the happiness unit's weights. An image was constructed from each of the six units' weight vectors using the regression functions learned from that network's training set.
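A sketch of the regression-based visualization, assuming the training set's gestalt codes and raw pixel values are available as arrays:

import numpy as np

def fit_pixel_maps(gestalt, pixels):
    """gestalt: n_images x 50; pixels: n_images x n_pixels.
    Returns a 51 x n_pixels least-squares map (with intercept)."""
    G = np.column_stack([gestalt, np.ones(len(gestalt))])
    coef, *_ = np.linalg.lstsq(G, pixels, rcond=None)
    return coef

def visualize_unit(weights, coef):
    """Apply the per-pixel regression to one unit's 50-element weights."""
    return np.append(weights, 1.0) @ coef   # one gray value per pixel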

Generation of Morphs

We generated morphs from the original stimuli (Ekman & Friesen, 1976) using the Morph program, version 2.5, from Gryphon Software, as described elsewhere (Jansari, Tranel, & Adolphs, 2000). Briefly, for each of the 15 pairs of expression prototypes, corresponding features between the two images were manually specified. The images were then tessellated and linearly transformed, both with respect to pixel position (a smooth warping) and pixel grayscale value (a smooth fade in luminance). The 10%, 30%, 50%, 70%, and 90% blends (see Figure 1b for examples) were retained for each transformation.

Multidimensional Scaling

MDS seeks to embed a set of stimuli in a low-dimensional space in such a way that the distances between points in the low-dimensional space are as faithful as possible to ratings of their similarity (Borg & Lingoes, 1987). We performed MDS on the human responses (Ekman & Friesen, 1976) and on the network's responses to the same stimuli at each level of processing in the network. Each analysis requires a distance (dissimilarity) matrix enumerating how dissimilar each pair of stimuli is. For the human data, we formed a six-element vector containing the probability with which humans gave the labels happy, sad, afraid, angry, surprised, and disgusted for each of the 96 nonneutral photographs in POFA. We obtained a 96 by 96 element similarity matrix from these data by computing the correlation $r_{ij}$ between each pair of six-element vectors. Finally, we converted the resulting similarity matrix into a distance (dissimilarity) matrix with the transform $d_{ij} = (1 - r_{ij})/2$ to obtain values in the range 0–1. For the network, we measure the similarity between two stimuli i and j as the correlation $r_{ij}$ between the representations of the two stimuli at each level of the network, corresponding to similarity at different levels of processing: the pixel level (70,080-element image pixel value vectors), the perceptual level (40,600-element Gabor response patterns), the gestalt level (50-element Gabor + PCA patterns), and the network's output (six-element category level). We ran a classical nonmetric MDS algorithm, SSA-1, due to Guttman and Lingoes (Borg & Lingoes, 1987), on each of the four resulting distance matrices described above and then plotted the stimuli according to their position in the resulting 2-D configuration.
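A sketch of the human-data branch of this analysis, with random placeholder response vectors and scikit-learn's nonmetric MDS standing in for SSA-1 (an assumption of convenience):

import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
responses = rng.dirichlet(np.ones(6), size=96)   # placeholder label probs

r = np.corrcoef(responses)          # 96 x 96 similarities between faces
d = (1.0 - r) / 2.0                 # dissimilarities in the range 0-1
np.fill_diagonal(d, 0.0)

mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
          random_state=0)
coords = mds.fit_transform(d)       # 2-D configuration, one point per face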

Model RTs

We assume a network's RT is directly proportional to the "uncertainty of its maximal output." That is, we define a network's model RT for stimulus i to be $t_i^{\mathrm{model}} = 1 - \max_j y_{ij}$ (the time scale is arbitrary). Here $y_{ij}$ is the network's output for emotion j on stimulus i. This is similar to the standard approach of equating RT with a network's output error (Seidenberg & McClelland, 1989), except that for morph stimuli there is no predetermined "correct" response. For comparison, the human data available are $t_{ij}^{\mathrm{human}}$, the mean RT of subjects responding with emotion j to stimulus i. To compare the model to the humans given these data, we treated each network as an individual subject and, for each morph stimulus, recorded the network's response emotion j and model reaction time $t_i^{\mathrm{model}}$, then averaged $t_i^{\mathrm{model}}$ over all networks making the same response to the stimulus. The quantitative comparison is the linear fit between network predictions and human RTs for all stimuli and response pairs for which both human and network data were available. Young et al.'s criterion for reporting $t_{ij}^{\mathrm{human}}$ was that at least 25% of the human subjects responded with emotion j to stimulus i, and for some of these cases none of the networks responded with emotion j to stimulus i, so the missing human and network data were disregarded.

Model Discrimination Scores

We assume discrimination is more difficult the more similar two stimuli are at some level of processing. We use the same measure of similarity as in the MDS procedure: the correlation $r_{ij}$ between the network's representations of stimuli i and j (either at the pixel, perceptual, gestalt, or output level). To convert similarity scores to discrimination scores, we used the transform $d_{ij} = 1 - r_{ij}$. This was measured for each of the 30 pairs of J. J. images for which human data were available. The discrimination scores were then averaged over the 13 networks that had not been trained on J. J. and compared to the human data.

Mixed-in Expression Detection

To measure the ability of the model to detect the secondary expression mixed into a morph stimulus, we followed Young et al.'s (1997) methods. For each network's output on a given stimulus, we scored the first, second, and third highest outputs of the networks as a 3, 2, and 1, respectively, and assigned the score 0 to the three remaining outputs. For each morph and prototype stimulus, we averaged these score vectors across all 13 networks. Now, for each of the 30 possible combinations of near prototype (i) and far prototype (j), using the 90%, 70%, and 50% morphs moving from expression i to expression j, we subtracted the score vector for prototype i from the score for each of the three morphs. This eliminates the effect of the intrinsic similarity between J. J.'s prototype expressions. Now, averaging these score vectors across all 30 sequences, we obtain the data plotted in Figure 8.
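A sketch of the scoring step, assuming an n_networks by 6 array of outputs per stimulus:

import numpy as np

def mean_rank_scores(outputs):
    """outputs: n_networks x 6. Scores each network's top three outputs
    3/2/1 (0 elsewhere) and averages the score vectors across networks."""
    scores = np.zeros_like(outputs)
    top3 = np.argsort(outputs, axis=1)[:, -3:]   # 3rd-, 2nd-, 1st-highest
    for col, value in zip(range(3), (1.0, 2.0, 3.0)):
        scores[np.arange(len(outputs)), top3[:, col]] = value
    return scores.mean(axis=0)

# Detecting far prototype j in a morph toward j: subtract the near
# prototype's mean score vector from the morph's mean score vector.
# detection_j = mean_rank_scores(morph_out)[j] - mean_rank_scores(proto_out)[j]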

Acknowledgments

We thank Stevan Harnad, Bill Kristan, Paul Munro, Alice O'Toole, Terry Sejnowski, and Gary's Unbelievable Research Unit (GURU) for helpful comments on previous versions of this manuscript. We also thank Andrew Young for permission to use the human data plotted in Figures 2, 3, 7a, and 8. We are also grateful to Paul Ekman for providing us with the Pictures of Facial Affect. This research was funded by NIMH grant MH57075 to GWC.

Reprint requests should be sent to Garrison W. Cottrell, UCSD Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA, or via e-mail: gary@cs.ucsd.edu.

Notes

1. For comparison, the order of difficulty for the Calder et al. (2001) PCA model is happiness (98%), fear (97%), surprise (92%), anger (86%), sadness (72%), and disgust (72%).
2. We used 13 networks for technical reasons (see Methods for details).
3. The same comparison of our model with the confusion matrix from the Calder et al. (2001) forced-choice experiment is slightly better, .686, whereas their PCA model's correlation with their subjects' confusion data is somewhat lower, .496.

REFERENCES

Beale, J., & Keil, F. (1995). Categorical effects in the perception of faces. Cognition, 57, 217–239.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer.

Calder, A., Young, A., Perrett, D., Etcoff, N., & Rowland, D. (1996). Categorical perception of morphed facial expressions. Visual Cognition, 3, 81–117.

Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179–1208.

Calder, A. J., Young, A. W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 526–551.

Carroll, J. M., & Russell, J. A. (1996). Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70, 205–218.

Cottrell, G. W., Dailey, M. N., Padgett, C., & Adolphs, R. (2000). Is all face processing holistic? The view from UCSD. In M. Wenger & J. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Contexts and challenges (pp. 347–395). Mahwah, NJ: Erlbaum.

Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, gender and emotion recognition using holons. In R. P. Lippman, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564–571). San Mateo: Morgan Kaufmann.

Cottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. E. Sharkey (Ed.), Models of cognition: A review of cognitive science. New Jersey: Norwood.

Craw, I., & Cameron, P. (1991). Parameterising images for recognition and reconstruction. In Proceedings of the British Machine Vision Conference (pp. 367–370). Berlin: Springer.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 974–989.

Ekman, P. (1999). Basic emotions. In T. Dagleish & M. Power (Eds.), Handbook of cognition and emotion (chap. 3, pp. 45–60). New York: Wiley.

Ekman, P., & Friesen, W. (1976). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists Press.

Ellison, J. W., & Massaro, D. W. (1997). Featural evaluation, integration, and judgment of facial affect. Journal of Experimental Psychology: Human Perception and Performance, 23, 213–226.

Etcoff, N. L., & Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227–240.

Goldstone, R. L. (2000). A neural network model of concept-influenced segmentation. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.

Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representations through category learning. Cognition, 78, 27–43.

Hancock, P. J. B., Bruce, V., & Burton, A. M. (1998). A comparison of two computer-based face recognition systems with human perception of faces. Vision Research, 38, 2277–2288.

Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory and Cognition, 24, 26–40.

Harnad, S. R. (1987). Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.

Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (vol. 1, chap. 3, pp. 77–109). Cambridge: MIT Press.

Jansari, A., Tranel, D., & Adolphs, R. (2000). A valence-specific lateral bias for discriminating emotional facial expressions in free field. Cognition and Emotion, 14, 341–353.

Katsikitis, M. (1997). The classification of facial expressions of emotion: A multidimensional scaling approach. Perception, 26, 613–626.

Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300–311.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.

Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1–12.

Lien, J. J.-J., Kanade, T., Cohn, J. F., & Li, C.-C. (2000). Detection, tracking, and classification of action units in facial expression. Robotics and Autonomous Systems, 31, 131–146.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.

Lyons, M. J., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (pp. 200–205). Los Alamitos, CA: IEEE Computer Society.

Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1357–1362.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.

Matsumoto, D. (1992). American–Japanese cultural differences in the recognition of universal facial expressions. Journal of Cross-Cultural Psychology, 23, 72–84.

Padgett, C., Cottrell, G., & Adolphs, R. (1996). Categorical perception in facial emotion classification. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Padgett, C., & Cottrell, G. W. (1998). A simple neural network models categorical perception of facial expressions. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 806–807). Mahwah, NJ: Erlbaum.

Perrett, D., Hietanen, J., Oram, M., & Benson, P. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London, B: Biological Sciences, 335, 23–30.

Pevtzow, R., & Harnad, S. R. (1997). Warping similarity space in category learning by human subjects: The role of task difficulty. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 189–195). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199–1204.

Rosenblum, M., Yacoob, Y., & Davis, L. S. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7, 1121–1138.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161–1178.

Russell, J. A., & Bullock, M. (1986). Fuzzy concepts and the perception of emotion in facial expressions. Social Cognition, 4, 309–341.

Russell, J. A., Lewicka, M., & Niit, T. (1989). A cross-cultural study of a circumplex model of affect. Journal of Personality and Social Psychology, 57, 848–856.

Sanger, T. (1989). An optimality principle for unsupervised learning. In D. Touretzky (Ed.), Advances in neural information processing systems (vol. 1, pp. 11–19). San Mateo: Morgan Kaufmann.

Schiano, D. J., Ehrlich, S. M., Sheridan, K. M., & Beck, D. M. (2000). Evidence for continuous rather than categorical perception of facial affect. Abstract presented at the 40th Annual Meeting of the Psychonomic Society.

Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44, 229–237.

Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.

Tijsseling, A., & Harnad, S. R. (1997). Warping similarity space in category learning by backprop nets. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 263–269). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.

Woodworth, R. S. (1938). Experimental psychology. New York: Holt.

Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271–313.


Page 8: EMPATH: A Neural Network that Categorizes Facial Expressionscseweb.ucsd.edu/~mdailey/papers/DaileyEtalCogneuro.pdfconceptual and linguistic systems’ ’ (p. 229). The authors created

directly via the equivalent statistical technique of PCAWe chose 50 principal components (hidden units) forthis layer based on previous experiments showing thisleads to good generalization on POFA It is important tonote that the number of components (the only freeparameter in our system) was not tuned to human dataThe resulting low-dimensional object-level representa-tion is specific to the facial expression and identityvariations in its input as is the population of so-calledface cells in the inferior temporal cortex (Perrett Hieta-nen Oram amp Benson 1992) We term this layer ofprocessing the lsquolsquogestaltrsquorsquo level

Although PCA is one of the simplest and mostefficient methods for coding a set of stimuli othermethods would probably also work For the currentmodel it is only important that (1) the code besensitive to the dimensions of variance in faces tofacilitate learning of expressions and that (2) the codebe low dimensional to facilitate generalization to novelfaces In a more general object recognition system asingle PCA for all kinds of objects would probably notbe appropriate because the resulting code would notbe sparse (a single image normally activates a largenumber of holistic lsquolsquounitsrsquorsquo in a PCA representation butobject-sensitive cells in the visual system appear torespond much more selectively Logothetis amp Shein-berg 1996) To obtain a good code for a large numberof different objects then nonnegative matrix factoriza-tion (Lee amp Seung 1999) or an independent compo-nent mixture model (Lee Lewicki amp Sejnowski 2000)might be more appropriate For the current problemthough PCA suffices

The outputs of the gestalt layer are finally categorizedinto the six lsquolsquobasicrsquorsquo emotions by a simple perceptronwith six outputs one for each emotion The network isset up so that its outputs can be interpreted as proba-bilities (they are all positive and sum to 1) However thesystem is trained with an lsquolsquoall-or-nonersquorsquo teaching signalThat is even though only 92 (say) of the humansubjects used to vet the POFA database may haveresponded lsquolsquohappyrsquorsquo to a particular face the networkrsquosteaching signal is 1 for the lsquolsquohappyrsquorsquo unit and 0 for theother five Thus the network does not have availableto it the confusions that subjects make on the dataWe term this layer the lsquolsquocategoryrsquorsquo level

While this categorization layer is an extremely simplis-tic model of human category learning and decision-making processes we argue that the particular form ofclassifier is unimportant so long as it is sufficientlypowerful and reliable to place the gestalt-level represen-tations into emotion categories we claim that similarresults will obtain with any nonlinear classifier

We should also point out that the system abstractsaway many important aspects of visual processing inthe brain such as eye movements facial expressiondynamics size and viewpoint invariance These com-plicating factors turn out to be irrelevant for our

purposes as we shall see despite the simplifyingassumptions of the model it nevertheless accountsfor a wide variety of data available from controlledbehavioral experiments This suggests that it is a goodfirst-order approximation of processing at the computa-tional level in the visual system

The next section reports on the results of severalexperiments comparing the modelrsquos predictions tohuman data from identical tasks with special emphasison Young et alrsquos landmark study of categorical effects infacial expression perception (Young et al 1997) Wefind (1) that the model and humans find the sameexpressions difficult or easy to interpret (2) that whenpresented with morphs between pairs of expressionsthe model and humans place similar sharp categoryboundaries between the prototypes (3) that pairwisesimilarity ratings derived from the modelrsquos gestalt-levelrepresentations predict human discrimination ability (4)that the model and humans are similarly sensitive tomixed-in expressions in morph stimuli and (5) that MDSanalysis produces a similar emotional similarity structurefrom the model and human data

RESULTS

Comparison of Model and Human Performance

The connections in the modelrsquos final layer were trainedto classify images of facial expressions from Ekman andFriesenrsquos POFA database (see Figure 1a) (Ekman ampFriesen 1976) the standard data set used in the vastmajority of psychological research on facial expression(see Methods for details) The training signal containedno information aside from the expression most agreedupon by human observersmdashthat is even if 90 ofhuman observers labeled a particular face lsquolsquohappyrsquorsquoand 10 labeled it lsquolsquosurprisedrsquorsquo the network was trainedas if 100 had said lsquolsquohappyrsquorsquo Again there is no infor-mation in the training signal concerning the similarity ofdifferent expression categories The first measurementwe made was the modelrsquos ability to generalize in classi-fying the expressions of previously unseen subjects fromthe same database The modelrsquos mean generalizationperformance was 900 which is not significantly differ-ent from human performance on the same stimuli(916 t = 587 df = 190 p = 279) (Table 1) More-over the rank-order correlation between the modelrsquosaverage accuracy on each category (happiness surprisefear etc) and the level of human agreement on thesame categories was 667 (two-tailed Kendallrsquos taup = 044 cf Table 1) For example both the modeland humans found happy faces easy and fear faces themost difficult to categorize correctly1 Since the networkhad about as much experience with one category asanother and was not trained on the human responseaccuracies this correlation between the relative difficul-ties of each category is an lsquolsquoemergentrsquorsquo property of the

Dailey et al 1165

model Studies of expression recognition consistentlyfind that fear is one of the most difficult expressions torecognize (Katsikitis 1997 Matsumoto 1992 Ekman ampFriesen 1976) Our model suggests that this is simplybecause the fear expression is lsquolsquoinherentlyrsquorsquo difficult todistinguish from the other five expressions

How does the network perform the expression rec-ognition task An examination of the trained networkrsquosrepresentation provides some insight We projectedeach unitrsquos weight vector back into image space in orderto visualize the lsquolsquoidealrsquorsquo stimulus for each unit in thenetwork (see Methods for details) The results areshown in Figure 6 with and without addition of theaverage face In each of the images each pixel value isthe result of applying a regression formula predictingthe value of the pixel at that location as a linear functionof the 50-element weight vector for the given networkoutput unit Dark and bright spots indicate the featuresthat excite or inhibit a given output unit depending onthe relative gray values in the region of that featureEach unit appears to combine evidence for an emotionbased upon the presence or absence of a few localfeatures For instance for fear the salient criteria appearto be the eyebrow raise and the eyelid raise with a

smaller contribution of parted lips The anger unitapparently requires a display in which the eyebrowsare not raised Clearly the classifier has determinedwhich feature configurations reliably distinguish eachexpression from the others

Comparison on CP Data

Several studies have reported CP of facial expressionsusing morphs between portrayals of the six basic emo-tions by POFA actor lsquolsquoJ Jrsquorsquo whose images have beenchosen because his expressions are among the easiest torecognize in the database (Young et al 1997 Calderet al 1996) Since our model also finds J J lsquolsquoeasyrsquorsquo(it achieves 100 generalization accuracy on J Jrsquos pro-totypical expressions) we replicated these studies with13 networks that had not been trained on J J2

We first compared the modelrsquos performance withhuman data from a six-way forced-choice experiment(Young et al 1997) on 10 3050 70 and 90morphs we constructed (see Methods) between allpairs of J Jrsquos prototypical expressions (see Figure 1bfor two such morph sequences) We modeled thesubjectsrsquo identification choices by letting the outputsof the networks represent the probability mass functionfor the human subjectsrsquo responses This means that weaveraged each of the six outputs of the 13 networks foreach face and compared these numbers to the humansubjectsrsquo response probabilities Figure 2 compares thenetworksrsquo forced-choice identification performancewith that of humans on the same stimuli Quantita-tively the modelrsquos responses were highly correlatedwith the human data (r = 942 p lt 001) andqualitatively the model maintained the essential fea-tures of the human data abrupt shifts in classificationat emotion category boundaries and few intrusions(identifications of unrelated expressions) along thetransitions Using the same criteria for an intrusionthat Young et al (1997) used the model predicted 4intrusions in the 15 morph sequences compared to 2out of 15 in the human data Every one of theintrusions predicted by the model involved fear which

Table 1 Error Rates for Networks and Level of Agreementfor Humans

ExpressionNetwork Percentage

CorrectHuman Percentage

Correct

Happiness 1000 987

Surprise 1000 924

Disgust 1000 923

Anger 891 889

Sadness 833 892

Fear 672 877

Average 900 916

Network generalization to unseen faces compared with humanagreement on the same faces (six-way forced choice) Human dataare provided with the POFA database (Ekman amp Friesen 1976)

Figure 6 Network representa-tion of each facial expression(Top row) Approximation ofthe optimal input-level stimulusfor each facial expressioncategory (Bottom row) Thesame approximations with theaverage face subtractedmdashdarkand bright pixels indicatesalient features

1166 Journal of Cognitive Neuroscience Volume 14 Number 8

is the least reliably classified expression in the POFAdatabase for humans and networks (Table 1)

We also compared subjectsrsquo RTs to those of the net-work in which we model RT as being proportional touncertainty at the category level That is if the networkoutputs a probability of 9 for happiness we assume it ismore certain and therefore faster than when it outputs aprobability of 6 for happiness We thus use the differ-ence between 1 (the maximum possible output) and thelargest actual output as our model of the reaction timeAs explained earlier subjectsrsquo RTs exhibit a characteristicscalloped shape with slower RTs near category bounda-ries The model also exhibits this pattern as shown inFigure 2 The reason should be obvious As a sequenceof stimuli approaches a category boundary the net-workrsquos output for one category necessarily decreasesas the other increases This results in a longer model RTThe model showed good correlation with human RTs(see Methods) (r = 677 p lt 001)

The model also provides insight into human discrim-ination behavior How can a model that processes oneface at a time discriminate faces The idea is to imaginethat the model is shown one face at a time and stores itsrepresentation of each face for comparison We use thecorrelation (Pearsonrsquos r) between representations as ameasure of similarity and then use 1 minus this numberas a measure of discriminability An interesting aspect ofour model is that it has several independent levels ofprocessing (see Figure 4) allowing us to determinewhich level best accounts for a particular phenomenonIn this case we compute the similarity between two

faces at the pixel level (correlation between the rawimages) the perceptual level (correlation between theGabor filter responses) the gestalt level (correlationbetween the principal components) or the categoriza-tion level (correlation between the six outputs) Wecompared our measure of discriminability with humanperformance in Young et alrsquos (1997) ABX experiment(see Methods for details) We found the followingcorrelations at each processing level Pixel r = 35p = 06 perceptual r = 52 p = 003 gestalt r = 65p lt 001 (shown in Figure 7b) category r = 41 p = 02Crucially when the gestalt layer and the categorizationlayer were combined in a multiple regression the cate-gorization layerrsquos contribution was insignificant ( p = 3)showing that the explanatory power of the model restswith the gestalt layer According to our model thenhuman subjectsrsquo improved discrimination near categoryboundaries in this task is best explained as an effect atthe level of gestalt-based representations which werederived in an unsupervised way from Gabor represen-tations via PCA This is in sharp contrast to the stand-ard explanation of this increased sensitivity which isthat categorization influences perception (Goldstoneet al 2001 Pevtzow amp Harnad 1997) This suggeststhat the facial expression morph sequences have natu-ral boundaries possibly because the endpoints areextremes of certain coordinated muscle group move-ments In other domains such as familiar face classi-fication (Beale amp Keil 1995) the boundaries must arisethrough learning In such domains we expect thatdiscrimination would be best explained in a learned

Figure 7 Discrimination ofmorph stimuli (a) Percent cor-rect discrimination of pairs ofstimuli in an ABX task (Younget al 1997 ) Each point repre-sents the percentage of timesubjects correctly discriminatedbetween two neighboringmorph stimuli The x-axis labelsto the left and right of eachpoint show which two stimuliwere being compared (eg7090 along the transition fromsadness to disgust denotes acomparison of a 70 disgust 30 sadness morph with a 90disgust 10 sadness morph)Note that better performanceoccurs near category bound-aries than near prototypes(highlighted by the verticallines) (b) The modelrsquos discri-mination performance at thegestalt representation levelThe model discriminationscores are highly correlatedwith the human subjectsrsquo scores(r = 65 p lt 001)

Dailey et al 1167

feature layer as in other models (Goldstone 2000Tijsseling amp Harnad 1997)

The data above show that the model naturallyaccounts for categorical behavior of humans makingforced-choice responses to facial expression morphstimuli In another experiment Young et al (1997)found that subjects could reliably detect the expres-sion mixed into a morph even at the 30 level Canthe model also account for this decidedly noncategor-ical behavior In the experiment subjects were askedto decide given a morph or prototype stimulus themost apparent emotion the second-most apparentemotion and the third-most apparent emotion (thescores on prototypes were used to normalize for inher-ent expression similarity) To compare the humanresponses with the model we used the top three out-puts of the 13 networks and used the same analysis asYoung et al to determine the extent to which thenetwork could detect the mixed-in expression in themorph images (see Methods for details) Figure 8 showsthat the modelrsquos average sensitivity to the mixed-inexpression is almost identical to that of human subjectseven though its behavior seems categorical in forced-choice experiments

Comparison of Similarity Structures

We next investigated the similarity structure of the representations that the model produced. As before, we calculated the similarity between pairs of faces at each level of the network by computing the correlation between their representations. In order to evaluate the similarity structure qualitatively, we performed MDS both on the human forced-choice responses published by Ekman and Friesen (1976) and on the network's responses to the same stimuli at each level of processing shown in Figure 4. At the pixel level, we found that the structure present in the MDS configuration was based mainly on identity; as we moved toward the categorization level, the identity-based structure began to break down and expression-related clusters began to emerge. Unsurprisingly, at the network's categorization layer, the configuration was clearly organized by emotion category (Figure 9). Of more interest, however, is that the "ordering" of facial expressions around the human and network MDS configurations is the same, a result unlikely to have arisen by chance (p = 1/60 = .017). Given that the network was never given any similarity information, this is a remarkable result. It suggests that the human similarity structure is a simple result of the inherent confusability of the categories, not necessarily the result of locality in some underlying psychological emotion space, as dimensional theories (e.g., Russell, 1980; Russell et al., 1989) might predict.

Figure 9. Multidimensional scaling of human and network responses reveals similar dimensions of emotion. Each point represents one of the 96 expressive faces in POFA. (a) 2-D similarity space induced by MDS from the average six-way forced-choice responses of human subjects (Ekman & Friesen, 1976) (stress = 0.218). (b) 2-D similarity space induced by MDS from the average training set responses of networks at their output layers (stress = 0.201). The arrangement of emotions around the circle is the same in both cases. Stress at the pixel level was 0.222; the perceptual level, 0.245; and the gestalt level, 0.286.

Figure 8. Ability to detect the mixed-in expression in morph sequences for humans and the model. In the human data (dashed lines), subjects chose the far prototype (the faint second expression in the morph) significantly more often than unrelated expressions when that expression's mix ratio was 30% or greater, even though they consistently identified the 70% expression in forced-choice experiments. The model is almost identical to the humans in its ability to detect the secondary expression in a morph image.

To measure the similarity between the human and network category structures quantitatively, we compared the human and model confusion matrices directly. For each pair of expressions, we computed the probability $p_{ij}$ that humans or networks respond with emotion i when the intended emotion is j, for all i ≠ j, that is, the probability of confusion between two categories. The correlation between networks and humans on the networks' test sets (the stimuli the networks had never seen before) was .661 (p < .001).³ Thus, there is a close correspondence between human confusion rates and response distributions at the model's categorization level. The system is never instructed to confuse the categories in a way similar to humans; nevertheless, this property emerges naturally from the data.
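A sketch of this comparison (ours; the 6 × 6 confusion matrices of response probabilities are assumed inputs):

import numpy as np
from scipy.stats import pearsonr

def confusion_correlation(human_cm, model_cm):
    # human_cm[i, j] and model_cm[i, j] hold the probability of
    # responding with emotion i when the intended emotion is j.
    # Only the off-diagonal entries (i != j), i.e., the actual
    # confusions, enter the comparison.
    human_cm = np.asarray(human_cm, dtype=float)
    model_cm = np.asarray(model_cm, dtype=float)
    off_diag = ~np.eye(human_cm.shape[0], dtype=bool)
    return pearsonr(human_cm[off_diag], model_cm[off_diag])  # (r, p)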

DISCUSSION

In this article, we have introduced a computational model that mimics some of the important functions of the visual system. The model is simply a pattern classifier incorporating a biologically plausible representation of visual stimuli. We first engineered the system to provide good classification performance on a small database of reliably recognizable facial expressions. The free parameters of the model, such as the number of principal components used in dimensionality reduction, were optimized to maximize the classifier's generalization performance. In contrast to most models in mathematical psychology, which seek to fit a low number of free parameters to maximize a model's agreement with human data, our model is compared to human data directly, without any tuning.

The results of our comparison of the model's performance with human data were nevertheless remarkable. We first found that the relative levels of difficulty for six basic facial expressions of emotion were highly correlated with the levels at which humans agree on those same emotions. For example, humans are best at classifying happy expressions because the smile makes the task easy. The network model likewise finds it extremely easy to detect smiles, because they are obvious visual features that discriminate happy expressions from the rest of the faces in its training set. On the other hand, humans usually find that fear expressions are very difficult to classify; fear expressions are often classified as surprise, for instance. The network model likewise sometimes classifies fear expressions as surprise. Why are fear expressions so difficult for humans to classify? Could it be that true displays of fear are uncommon in our society? Or that portrayals of fear in the popular culture misguide us as to what fearful people really look like? Or simply that fear expressions are perceptually similar to other expressions? Our results suggest that the latter is the case: culture probably has very little to do with the difficulty of fear expressions. We have shown that perceptual similarity alone is sufficient to account for the relative difficulty of facial expressions.

The agreement between human and network similarity structure analyses (MDS) is also somewhat surprising. As pointed out earlier, Russell and colleagues have found that affective adjectives and affective facial expressions seem to share a common similarity structure. One might postulate that the underlying psychological space reflects the physiological similarity of emotional states, and that when we are asked to classify facial expressions, we are likely to classify a given face as any of the emotions close by in that psychophysiological space. However, we have shown that an emotionless machine without any underlying physiology exhibits a similarity structure very much like that of the humans. This is even more remarkable when one considers that the network was not given any indication of the similarity structure in the training signal, which was always all-or-none, rather than, for example, human subject responses, which would reflect subjects' confusions. Since we nevertheless match the human similarity structure, we have shown that the perceptual similarity of the categories corresponds to the psychological similarity structure of facial expressions. Why would this be? We suggest that evolution did not randomly associate facial expressions with emotional states, but that the expression-to-emotion mapping evolved in tandem with the need to communicate emotional states effectively.

The final question we set out to explore with this work is whether facial expressions are represented continuously or as discrete entities. As explained in the Introduction, the best evidence for discrete representations is that subjects appear to place sharp boundaries between facial expression categories and are better able to discriminate pairs of expressions near category boundaries, suggesting that our perception of the faces is influenced by the existence of the categories. As has been found in other modalities (cf. Ellison & Massaro, 1997), we find that it is unnecessary to posit discrete representations to account for the sharp boundaries and high discrimination scores. The network model places boundaries between categories, as it must to obtain good classification accuracy, but the categories are actually fuzzy and overlapping. The model can be thought of as a biologically plausible working implementation of Russell and Bullock's (1986) theory of emotional facial expression categories as fuzzy concepts. Despite the network's fuzzy, overlapping category concepts, the categories appear sharp when a rule such as "respond with the category most strongly activated by the given facial expression" is applied. A more difficult result to explain in terms of fuzzy categories placed in a continuous multidimensional space is the subjects' high discrimination scores near category boundaries. This result seems to call for an influence of the category boundaries on our perception of the expressions; indeed, that is Young et al.'s (1997) tentative interpretation. To the contrary, we have shown that in our model, discrimination scores best agree with human results at a purely perceptual level, where the category labels have no effect. With respect to image space, the low-dimensional PCA representation actually changes faster in boundary regions than in regions near the prototypes it was derived from. In the context of our model, then, the seemingly contradictory results in different experiments conducted by Young et al. can be explained as simply tapping different computational levels of processing in a visual system organized much like our model.

We have found that our facial expression recognition model is "sufficient" to explain many aspects of human performance in behavioral tasks, but we have no proof of the "necessity" of any of our particular implementation decisions. In fact, we predict that many similar systems would obtain similar results. For instance, a system beginning with derivative-of-Gaussian edge filters (Marr, 1982), whose responses are combined to produce smooth responses to translations (like complex cells), should exhibit the same behavior. Replacing the PCA dimensionality reduction method with, say, factor analysis or the Infomax algorithm for independent components analysis (Bell & Sejnowski, 1995) should not dramatically affect the results. Finally, a category level using Support Vector Machines (Vapnik, 1995) should likewise produce similar behavior. The point of our simulations is that the category boundaries, discriminability, and similarity structures previously seen as being at odds are, in a sense, "present in the data and tasks themselves" and are easily exposed by any reasonable computational model.

METHODS

Network Details

The first step of processing in the model is to filter the image with a rigid 29 by 35 grid of overlapping 2-D Gabor filters (Daugman, 1985) in quadrature pairs, at five scales and eight orientations (some example filters are shown in Figure 5). The quadrature pairs are used to compute a phase-insensitive energy response at each point in the grid. These linear energy responses, or "Gabor magnitudes," are often used as a simplifying model of the spatial responses of complex cells in the early visual system (Lades et al., 1993). Though the model loses some information about precise feature localization when phase information is thrown away, the overlapping receptive fields compensate for this loss (Hinton, McClelland, & Rumelhart, 1986). Each Gabor magnitude is z-scored (linearly transformed to have mean 0 and variance 1 over the training data) so that each filter contributes equally in the next representation layer.
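As a sketch of this stage (ours, not the original implementation; the Gaussian envelope shape and parameter names are assumptions), the magnitude at one grid point is the norm of the responses of a cosine-phase and a sine-phase filter. A 29 × 35 grid at five scales and eight orientations yields 29 × 35 × 40 = 40,600 magnitudes, each z-scored over the training set.

import numpy as np

def gabor_quadrature_pair(size, wavelength, theta, sigma):
    # Even (cosine-phase) and odd (sine-phase) Gabor filters sharing
    # one Gaussian envelope; theta is the orientation in radians.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = 2.0 * np.pi * x_rot / wavelength
    return envelope * np.cos(carrier), envelope * np.sin(carrier)

def gabor_magnitude(patch, even, odd):
    # Phase-insensitive energy response at one grid point: the norm
    # of the (even, odd) response pair, as for a complex cell.
    return np.hypot(np.sum(patch * even), np.sum(patch * odd))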

The second step of processing in the model is to perform linear dimensionality reduction on the 40,600-element Gabor representation via a PCA of the training set. The actual computation is facilitated by Turk and Pentland's (1991) algebraic trick for "eigenfaces." This produces a 50-element representation, typically accounting for approximately 80% of the variance in the training set's Gabor data.
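A sketch of the reduction using that trick (our code; array shapes are assumptions): with far fewer samples than features, diagonalizing the small n × n Gram matrix yields the same leading principal components as the huge feature-space covariance.

import numpy as np

def pca_reduce(X, k=50):
    # X: n_samples x n_features with n_samples << n_features (here,
    # n x 40,600 z-scored Gabor magnitudes).
    mean = X.mean(axis=0)
    Xc = X - mean
    evals, evecs = np.linalg.eigh(Xc @ Xc.T)       # n x n Gram matrix
    order = np.argsort(evals)[::-1][:k]            # top-k eigenpairs
    evals, evecs = evals[order], evecs[:, order]
    # Map Gram eigenvectors to unit-norm feature-space components.
    basis = (Xc.T @ evecs) / np.sqrt(np.maximum(evals, 1e-12))
    return Xc @ basis, basis, mean                 # scores, basis, mean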

The 50-element vector p output by PCA can then be classified by a simple statistical model. We locally z-score (scale to mean 0, variance 1) each input element, then use a single-layer neural network (a generalized linear model) containing six outputs, one for each of the six "basic" emotions: happiness, sadness, fear, anger, surprise, and disgust. Each of the six units in the network computes a weighted sum $o_i = \sum_j w_{ij} p_j$ of the 50-element input vector; then the "softmax" function $y_i = e^{o_i} / \sum_j e^{o_j}$ is applied to the units' linear activations to obtain a vector of positive values whose sum is 1.0. The network is trained with the relative entropy error function so that its outputs correspond to the posterior probabilities of the emotion categories given the inputs (Bishop, 1995).
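A minimal sketch of this classifier (ours; the learning rate and initialization are assumptions, and the momentum and weight-decay terms used in training are omitted for brevity):

import numpy as np

def softmax(o):
    # y_i = exp(o_i) / sum_j exp(o_j), computed stably.
    e = np.exp(o - np.max(o))
    return e / e.sum()

class EmotionGLM:
    # Single-layer softmax network over the locally z-scored
    # 50-element gestalt vector; one output per basic emotion.
    def __init__(self, n_inputs=50, n_classes=6, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.01, size=(n_classes, n_inputs))
        self.lr = lr

    def forward(self, p):
        return softmax(self.W @ p)

    def train_step(self, p, target):
        # One stochastic-gradient step on the relative-entropy
        # (cross-entropy) error; target is a one-hot 6-vector.
        y = self.forward(p)
        self.W -= self.lr * np.outer(y - target, p)
        return -float(np.sum(target * np.log(y + 1e-12)))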

Network Training

We train the expression recognition system using leave-one-out cross-validation and early stopping. A given network is trained to minimize output error on all the images of 12 of the 14 actors in POFA, using stochastic gradient descent, momentum, weight decay, and the relative entropy error function (Bishop, 1995). A thirteenth actor's images are used as a "hold out" set to determine when to stop training; training is stopped when the error on the hold out set is minimized. After training is stopped, we evaluate the network's generalization performance on the remaining (fourteenth) actor. We performed training runs with every possible combination of generalization and hold out sets, for a total of 182 (14 × 13) individual networks.
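The cross-validation scheme can be sketched as follows (ours; `train_one` is a hypothetical routine standing in for the gradient-descent training with early stopping):

from itertools import permutations

def train_all_networks(actors, train_one):
    # Every ordered (test actor, holdout actor) pair yields one
    # network: the remaining 12 actors form the training set, the
    # holdout actor's error decides when to stop, and the test
    # actor measures generalization. len(actors) == 14 gives
    # 14 * 13 = 182 networks.
    networks = {}
    for test, holdout in permutations(actors, 2):
        train_set = [a for a in actors if a not in (test, holdout)]
        networks[(test, holdout)] = train_one(train_set, holdout)
    return networks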

Network Weight Visualization

The idea is to project each unit's weight vector back into image space in order to visualize what the network is sensitive to in an image. But this is not a well-defined task: though PCA is an easily inverted orthonormal transformation, the Gabor magnitude representation, besides being subsampled, throws away important phase information. As a workaround, we assume that each pixel's value is an approximately linear function of the 50-component gestalt-level (Gabor + PCA) representation. We chose one of the 182 trained networks, and for each pixel location, we used regression to find the linear function of the 50-element gestalt-level representation best predicting the pixel's value over the network's training set. Then, to visualize, say, the classifier's representation of happiness, we apply the regression function directly to the happiness unit's weights. An image was constructed from each of the six units' weight vectors using the regression functions learned from that network's training set.
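A sketch of that workaround (ours; array shapes are assumptions):

import numpy as np

def fit_pixel_regressions(gestalt, images):
    # Least-squares linear map (with bias) from the n x 50 gestalt
    # vectors to the n x n_pixels training images; one regression
    # per pixel location, solved jointly.
    A = np.hstack([gestalt, np.ones((gestalt.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(A, images, rcond=None)
    return coeffs                          # 51 x n_pixels

def visualize_unit(unit_weights, coeffs):
    # Apply the per-pixel regression directly to one output unit's
    # 50-element weight vector to approximate its "ideal" image.
    w = np.append(unit_weights, 1.0)       # bias slot
    return w @ coeffs                      # flattened image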

Generation of Morphs

We generated morphs from the original stimuli (Ekman & Friesen, 1976) using the Morph program, version 2.5, from Gryphon Software, as described elsewhere (Jansari, Tranel, & Adolphs, 2000). Briefly, for each of the 15 pairs of expression prototypes, corresponding features between the two images were manually specified. The images were then tessellated and linearly transformed, both with respect to pixel position (a smooth warping) and pixel grayscale value (a smooth fade in luminance). The 10%, 30%, 50%, 70%, and 90% blends (see Figure 1b for examples) were retained for each transformation.

Multidimensional Scaling

MDS seeks to embed a set of stimuli in a low-dimensional space in such a way that the distances between points in the low-dimensional space are as faithful as possible to ratings of their similarity (Borg & Lingoes, 1987). We performed MDS on the human responses (Ekman & Friesen, 1976) and on the network's responses to the same stimuli at each level of processing in the network. Each analysis requires a distance (dissimilarity) matrix enumerating how dissimilar each pair of stimuli is. For the human data, we formed a six-element vector containing the probability with which humans gave the labels happy, sad, afraid, angry, surprised, and disgusted for each of the 96 nonneutral photographs in POFA. We obtained a 96 by 96 element similarity matrix from these data by computing the correlation $r_{ij}$ between each pair of six-element vectors. Finally, we converted the resulting similarity matrix into a distance (dissimilarity) matrix with the transform $d_{ij} = (1 - r_{ij})/2$ to obtain values in the range 0–1. For the network, we measure the similarity between two stimuli i and j as the correlation $r_{ij}$ between the representations of the two stimuli at each level of the network, corresponding to similarity at different levels of processing: the pixel level (70,080-element image pixel value vectors), the perceptual level (40,600-element Gabor response patterns), the gestalt level (50-element Gabor/PCA patterns), and the network's output (six-element category level). We ran a classical nonmetric MDS algorithm, SSA-1, due to Guttman and Lingoes (Borg & Lingoes, 1987), on each of the four resulting distance matrices described above, and then plotted the stimuli according to their position in the resulting 2-D configuration.
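A sketch of the distance-matrix construction (ours; the embedding step is indicated with a substitute library, scikit-learn, rather than the SSA-1 program used here):

import numpy as np

def dissimilarity_matrix(vectors):
    # vectors: one row per stimulus (label-probability vectors for
    # the human data, layer representations for the network).
    # d_ij = (1 - r_ij) / 2 maps correlations into the range 0-1.
    R = np.corrcoef(np.asarray(vectors, dtype=float))
    return (1.0 - R) / 2.0

# A nonmetric 2-D embedding could then be obtained, for example:
#   from sklearn.manifold import MDS
#   coords = MDS(n_components=2, metric=False,
#                dissimilarity="precomputed").fit_transform(D)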

Model RTs

We assume a network's RT is directly proportional to the "uncertainty of its maximal output." That is, we define a network's model RT for stimulus i to be $t_i^{\mathrm{model}} = 1 - \max_j y_{ij}$ (the time scale is arbitrary), where $y_{ij}$ is the network's output for emotion j on stimulus i. This is similar to the standard approach of equating RT with a network's output error (Seidenberg & McClelland, 1989), except that for morph stimuli there is no predetermined "correct" response. For comparison, the human data available are $t_{ij}^{\mathrm{human}}$, the mean RT of subjects responding with emotion j to stimulus i. To compare the model to the humans given these data, we treated each network as an individual subject and, for each morph stimulus, recorded the network's response emotion j and model reaction time $t_i^{\mathrm{model}}$, then averaged $t_i^{\mathrm{model}}$ over all networks making the same response to the stimulus. The quantitative comparison is the linear fit between network predictions and human RTs for all stimulus and response pairs for which both human and network data were available. Young et al.'s criterion for reporting $t_{ij}^{\mathrm{human}}$ was that at least 25% of the human subjects responded with emotion j to stimulus i; for some of these cases, none of the networks responded with emotion j to stimulus i, so the missing human and network data were disregarded.
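A sketch of the RT model (ours; the data layout is an assumption):

import numpy as np
from collections import defaultdict

def model_rts(outputs_by_stimulus):
    # outputs_by_stimulus[s] is a list of six-element softmax outputs,
    # one per network (each network treated as a subject). The RT for
    # a response is 1 - max output, averaged over the networks that
    # made that same response; the time scale is arbitrary.
    rts = {}
    for s, net_outputs in outputs_by_stimulus.items():
        pooled = defaultdict(list)
        for y in net_outputs:
            y = np.asarray(y, dtype=float)
            pooled[int(np.argmax(y))].append(1.0 - float(y.max()))
        rts[s] = {emotion: float(np.mean(t)) for emotion, t in pooled.items()}
    return rts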

Model Discrimination Scores

We assume discrimination is more difficult the more similar two stimuli are at some level of processing. We use the same measure of similarity as in the MDS procedure: the correlation $r_{ij}$ between the network's representations of stimuli i and j (either at the pixel, perceptual, gestalt, or output level). To convert similarity scores to discrimination scores, we used the transform $d_{ij} = 1 - r_{ij}$. This was measured for each of the 30 pairs of JJ images for which human data were available. The discrimination scores were then averaged over the 13 networks that had not been trained on JJ and compared to the human data.

Mixed-in Expression Detection

To measure the ability of the model to detect the secondary expression mixed into a morph stimulus, we followed Young et al.'s (1997) methods. For each network's output on a given stimulus, we scored the first, second, and third highest outputs of the networks as a 3, 2, and 1, respectively, and assigned the score 0 to the three remaining outputs. For each morph and prototype stimulus, we averaged these score vectors across all 13 networks. Then, for each of the 30 possible combinations of near prototype (i) and far prototype (j), using the 90%, 70%, and 50% morphs moving from expression i to expression j, we subtracted the score vector for prototype i from the score for each of the three morphs. This eliminates the effect of the intrinsic similarity between JJ's prototype expressions. Averaging these score vectors across all 30 sequences, we obtain the data plotted in Figure 8.
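A sketch of the scoring (ours; per-network output lists are assumed inputs):

import numpy as np

def rank_scores(y):
    # Score the three highest of the six outputs 3, 2, and 1;
    # the remaining three outputs score 0.
    scores = np.zeros(len(y))
    top3 = np.argsort(np.asarray(y, dtype=float))[::-1][:3]
    scores[top3] = [3.0, 2.0, 1.0]
    return scores

def detection_score(morph_outputs, near_proto_outputs):
    # Average the rank-score vectors over networks, then subtract
    # the near prototype's averaged scores to remove the intrinsic
    # similarity between the prototype expressions.
    morph = np.mean([rank_scores(y) for y in morph_outputs], axis=0)
    proto = np.mean([rank_scores(y) for y in near_proto_outputs], axis=0)
    return morph - proto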

Acknowledgments

We thank Stevan Harnad, Bill Kristan, Paul Munro, Alice O'Toole, Terry Sejnowski, and Gary's Unbelievable Research Unit (GURU) for helpful comments on previous versions of this manuscript. We also thank Andrew Young for permission to use the human data plotted in Figures 2, 3, 7a, and 8. We are also grateful to Paul Ekman for providing us with the Pictures of Facial Affect. This research was funded by NIMH grant MH57075 to G.W.C.

Reprint requests should be sent to Garrison W. Cottrell, UCSD Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA, or via e-mail: gary@cs.ucsd.edu.

Notes

1. For comparison, the order of difficulty for the Calder et al. (2001) PCA model is happiness (98%), fear (97%), surprise (92%), anger (86%), sadness (72%), and disgust (72%).
2. We used 13 networks for technical reasons (see Methods for details).
3. The same comparison of our model with the confusion matrix from the Calder et al. (2001) forced-choice experiment is slightly better, .686, whereas their PCA model's correlation with their subjects' confusion data is somewhat lower, .496.

REFERENCES

Beale, J., & Keil, F. (1995). Categorical effects in the perception of faces. Cognition, 57, 217–239.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer.
Calder, A., Young, A., Perrett, D., Etcoff, N., & Rowland, D. (1996). Categorical perception of morphed facial expressions. Visual Cognition, 3, 81–117.
Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179–1208.
Calder, A. J., Young, A. W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 526–551.
Carroll, J. M., & Russell, J. A. (1996). Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70, 205–218.
Cottrell, G. W., Dailey, M. N., Padgett, C., & Adolphs, R. (2000). Is all face processing holistic? The view from UCSD. In M. Wenger & J. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Context and challenges (pp. 347–395). Mahwah, NJ: Erlbaum.
Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, gender, and emotion recognition using holons. In R. P. Lippman, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564–571). San Mateo: Morgan Kaufmann.
Cottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. E. Sharkey (Ed.), Models of cognition: A review of cognitive science. Norwood, NJ.
Craw, I., & Cameron, P. (1991). Parameterising images for recognition and reconstruction. In Proceedings of the British Machine Vision Conference (pp. 367–370). Berlin: Springer.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.
Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 974–989.
Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and emotion (chap. 3, pp. 45–60). New York: Wiley.
Ekman, P., & Friesen, W. (1976). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists Press.
Ellison, J. W., & Massaro, D. W. (1997). Featural evaluation, integration, and judgement of facial affect. Journal of Experimental Psychology: Human Perception and Performance, 23, 213–226.
Etcoff, N. L., & Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227–240.
Goldstone, R. L. (2000). A neural network model of concept-influenced segmentation. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.
Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representations through category learning. Cognition, 78, 27–43.
Hancock, P. J. B., Bruce, V., & Burton, A. M. (1998). A comparison of two computer-based face recognition systems with human perception of faces. Vision Research, 38, 2277–2288.
Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory and Cognition, 24, 26–40.
Harnad, S. R. (1987). Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (vol. 1, chap. 3, pp. 77–109). Cambridge: MIT Press.
Jansari, A., Tranel, D., & Adolphs, R. (2000). A valence-specific lateral bias for discriminating emotional facial expressions in free field. Cognition and Emotion, 14, 341–353.
Katsikitis, M. (1997). The classification of facial expressions of emotion: A multidimensional scaling approach. Perception, 26, 613–626.
Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300–311.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1–12.
Lien, J. J.-J., Kanade, T., Cohn, J. F., & Li, C.-C. (2000). Detection, tracking, and classification of action units in facial expression. Robotics and Autonomous Systems, 31, 131–146.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.
Lyons, M. J., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (pp. 200–205). Los Alamitos, CA: IEEE Computer Society.
Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1357–1362.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.
Matsumoto, D. (1992). American–Japanese cultural differences in the recognition of universal facial expressions. Journal of Cross-Cultural Psychology, 23, 72–84.
Padgett, C., Cottrell, G., & Adolphs, R. (1996). Categorical perception in facial emotion classification. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Padgett, C., & Cottrell, G. W. (1998). A simple neural network models categorical perception of facial expressions. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 806–807). Mahwah, NJ: Erlbaum.
Perrett, D., Hietanen, J., Oram, M., & Benson, P. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London, B: Biological Sciences, 335, 23–30.
Pevtzow, R., & Harnad, S. R. (1997). Warping similarity space in category learning by human subjects: The role of task difficulty. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 189–195). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.
Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199–1204.
Rosenblum, M., Yacoob, Y., & Davis, L. S. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7, 1121–1138.
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161–1178.
Russell, J. A., & Bullock, M. (1986). Fuzzy concepts and the perception of emotion in facial expressions. Social Cognition, 4, 309–341.
Russell, J. A., Lewicka, M., & Niit, T. (1989). A cross-cultural study of a circumplex model of affect. Journal of Personality and Social Psychology, 57, 848–856.
Sanger, T. (1989). An optimality principle for unsupervised learning. In D. Touretzky (Ed.), Advances in neural information processing systems (vol. 1, pp. 11–19). San Mateo: Morgan Kaufmann.
Schiano, D. J., Ehrlich, S. M., Sheridan, K. M., & Beck, D. M. (2000). Evidence for continuous rather than categorical perception of facial affect. Abstract presented at the 40th Annual Meeting of the Psychonomic Society.
Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44, 229–237.
Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.
Tijsseling, A., & Harnad, S. R. (1997). Warping similarity space in category learning by backprop nets. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 263–269). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
Woodworth, R. S. (1938). Experimental psychology. New York: Holt.
Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271–313.





Daugman J G (1985) Uncertainty relation for resolution inspace spatial frequency and orientation optimized bytwo-dimensional visual cortical filters Journal of the OpticalSociety of America A 2 1160ndash1169

Donato G Barlett M S Hager J C Ekman P amp SejnowskyT J (1999) Classifying facial actions IEEE Transactions onPattern Analysis and Machine Intelligence 21 974 ndash989

Ekman P (1999) Basic emotions In T Dagleish ampM Power (Eds) Handbook of cognition and emotion(Chap 3 pp 45ndash60) New York Wiley

Ekman P amp Friesen W (1976) Pictures of facial affectPalo Alto CA Consulting Psychologist Press

Ellison J W amp Massaro D W (1997) Featural evaluationintegration and judgement of facial affect Journal ofExperimental Psychology Human Perception andPerformance 23 213 ndash226

Etcoff N L amp Magee J J (1992) Categorical perception offacial expressions Cognition 44 227 ndash240

Goldstone R L (2000) A neural network model ofconcept-influenced segmentation In Proceedings of the22nd Annual Conference of the Cognitive Science SocietyMahwah NJ Erlbaum

Goldstone R L Lippa Y amp Shiffrin R M (2001) Alteringobject representation through category learning Cognition78 27ndash43

Hancock P J B Bruce V amp Burton A M (1998) Acomparison of two computer-based face recognitionsystems with human perception of faces Vision Research38 2277 ndash2288

Hancock P J B Burton A M amp Bruce V (1996) Faceprocessing Human perception and principal componentsanalysis Memory and Cognition 24 26ndash40

Harnad S R (1987) Categorial perception The groundworkof cognition Cambridge Cambridge University Press

Hinton G E McClelland J L amp Rumelhart D E (1986)Distributed representations Parallel distributedprocessing Explorations in the microstructure ofcognition (vol 1 chap 3 pp 77ndash109) CambridgeMIT Press

Jansari A Tranel D amp Adolphs R (2000) A valence-specificlateral bias for discriminating emotional facial expressions infree field Cognition and Emotion 14 341 ndash353

Katsikitis M (1997) The classification of facial expressions ofemotion A multidimensional scaling approach Perception26 613 ndash626

Lades M Vorbruggen J C Buhmann J Lange J von derMalsburg C Wurtz R P amp Konen W (1993) Distortion

1172 Journal of Cognitive Neuroscience Volume 14 Number 8

invariant object recognition in the dynamic link architectureIEEE Transactions on Computers 42 300 ndash311

Lee D D amp Seung H S (1999) Learning the parts ofobjects by non-negative matrix factorization Nature 401788 ndash791

Lee T-W Lewicki M S amp Sejnowski T J (2000) ICA mixturemodels for unsupervised classification of non-Gaussiansources and automatic context switching in blind signalseparation IEEE Transactions on Pattern Recognition andMachine Intelligence 22 1ndash12

Lien J J-J Kanade T Cohn J F amp Li C-C (2000)Detection tracking and classification of action units infacial expression Robotics and Autonomous Systems 31131 ndash146

Logothetis N K amp Sheinberg D L (1996) Visual objectrecognition Annual Review of Neuroscience 19 577 ndash621

Lyons M J Akamatsu S Kamachi M amp Gyoba J (1998)Coding facial expressions with Gabor wavelets Proceedingsof the third IEEE International Conference on AutomaticFace and Gesture Recognition (pp 200 ndash205)Los Alamitos CA IEEE Computer Society

Lyons M J Budynek J amp Akamatsu S (1999) Automaticclassification of single facial images IEEE Transactionson Pattern Analysis and Machine Intelligence 211357 ndash1362

Marr D (1982) Vision A computational investigation intothe human representation and processing of visualinformation San Francisco Freeman

Matsumoto D (1992) AmericanndashJapanese cultural differencesin the recognition of universal facial expressions Journal ofCross-Cultural Psychology 23 72ndash84

Padgett C Cottrell G amp Adolphs R (1996) Categoricalperception in facial emotion classification Proceedingsof the 18th Annual Conference of the Cognitive ScienceSociety Hillsdale NJ Erlbaum

Padgett C amp Cottrell G W (1998) A simple neural networkmodels categorical perception of facial expressionsProceedings of the 20th Annual Conference of theCognitive Science Society (pp 806 ndash807) Mahwah NJErlbaum

Perrett D Hietanen J Oram M amp Benson P (1992)Organization and functions of cells responsive to faces inthe temporal cortex Philosophical Transactions of theRoyal Society of London B Biological Sciences 33523ndash30

Pevtzow R amp Harnad S R (1997) Warping similarity spacein category learning by human subjects The role of taskdifficulty In M Ramscar U Hahn E Cambouropolos amp H

Pain (Eds) Proceedings of SimCat 1997 Interdisciplinaryworkshop on similarity and categorization (pp 189 ndash195)Edinburgh Scotland Department of Artificial IntelligenceEdinburgh University

Riesenhuber M amp Poggio T (2000) Models of objectrecognition Nature Neuroscience 3 1199ndash1204

Rosenblum M Yacoob Y amp Davis L S (1996) Humanexpression recognition from motion using a radial basisfunction network architecture IEEE Transactions onNeural Networks 7 1121 ndash1138

Russell J A (1980) A circumplex model of affect Journal ofPersonality and Social Psychology 39 1161 ndash1178

Russell J A amp Bullock M (1986) Fuzzy concepts and theperception of emotion in facial expressions SocialCognition 4 309 ndash341

Russell J A Lewicka M amp Niit T (1989) A cross-culturalstudy of circumplex model of affect Journal of Personalityand Social Psychology 57 848 ndash856

Sanger T (1989) An optimality principle for unsupervisedlearning In D Touretzky (Ed) Advances in neuralinformation processing systems (vol 1 pp 11ndash19)San Mateo Morgan Kaufmann

Schiano D J Ehrlich S M Sheridan K M Beck D M(2000) Evidence for continuous rather than categoricalperception of facial affect Abstract presented at The 40thAnnual Meeting of the Psychonomic Society

Schlosberg H (1952) The description of facial expressionsin terms of two dimensions Journal of ExperimentalPsychology 44 229 ndash237

Seidenberg M S amp McClelland J L (1989) A distributeddevelopmental model of word recognition and namingPsychological Review 96 523 ndash568

Tijsseling A amp Harnad S R (1997) Warping similarity spacein category learning by backprop nets In M RamscarU Hahn E Cambouropolos amp H Pain (Eds) Proceedingsof SimCat 1997 Interdisciplinary workshop on similarityand categorization (pp 263 ndash269) Edinburgh ScotlandDepartment of Artificial Intelligence Edinburgh University

Turk M amp Pentland A (1991) Eigenfaces for recognitionJournal of Cognitive Neuroscience 3 71ndash86

Vapnik V N (1995) The nature of statistical learning theoryNew York Springer

Woodworth R S (1938) Experimental psychologyNew York Holt

Young A W Rowland D Calder A J Etcoff N L Seth A ampPerrett D I (1997) Facial expression megamix Tests ofdimensional and category accounts of emotion recognitionCognition 63 271ndash313

Dailey et al 1173

Page 10: EMPATH: A Neural Network that Categorizes Facial Expressionscseweb.ucsd.edu/~mdailey/papers/DaileyEtalCogneuro.pdfconceptual and linguistic systems’ ’ (p. 229). The authors created

Fear is the least reliably classified expression in the POFA database for humans and networks (Table 1).

We also compared subjects' RTs to those of the network, in which we model RT as being proportional to uncertainty at the category level. That is, if the network outputs a probability of .9 for happiness, we assume it is more certain, and therefore faster, than when it outputs a probability of .6 for happiness. We thus use the difference between 1 (the maximum possible output) and the largest actual output as our model of the reaction time. As explained earlier, subjects' RTs exhibit a characteristic scalloped shape, with slower RTs near category boundaries. The model also exhibits this pattern, as shown in Figure 2. The reason should be obvious: as a sequence of stimuli approaches a category boundary, the network's output for one category necessarily decreases as the other increases. This results in a longer model RT. The model showed good correlation with human RTs (see Methods) (r = .677, p < .001).
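For concreteness, this RT model amounts to a single line of arithmetic. A minimal Python sketch (ours, not part of the original study), assuming a six-element vector of softmax outputs for each stimulus:

```python
import numpy as np

def model_rt(outputs):
    """Model RT for one stimulus: 1 minus the largest softmax output.

    `outputs` is a network's six-element vector of category
    probabilities for a single face; the time scale is arbitrary.
    """
    return 1.0 - np.asarray(outputs).max()

# A confident response is fast ...
print(model_rt([0.9, 0.02, 0.02, 0.02, 0.02, 0.02]))   # 0.1
# ... an uncertain one near a category boundary is slow.
print(model_rt([0.6, 0.3, 0.025, 0.025, 0.025, 0.025]))  # 0.4
```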

The model also provides insight into human discrimination behavior. How can a model that processes one face at a time discriminate faces? The idea is to imagine that the model is shown one face at a time and stores its representation of each face for comparison. We use the correlation (Pearson's r) between representations as a measure of similarity, and then use 1 minus this number as a measure of discriminability. An interesting aspect of our model is that it has several independent levels of processing (see Figure 4), allowing us to determine which level best accounts for a particular phenomenon. In this case, we compute the similarity between two faces at the pixel level (correlation between the raw images), the perceptual level (correlation between the Gabor filter responses), the gestalt level (correlation between the principal components), or the categorization level (correlation between the six outputs). We compared our measure of discriminability with human performance in Young et al.'s (1997) ABX experiment (see Methods for details). We found the following correlations at each processing level: pixel, r = .35, p = .06; perceptual, r = .52, p = .003; gestalt, r = .65, p < .001 (shown in Figure 7b); category, r = .41, p = .02. Crucially, when the gestalt layer and the categorization layer were combined in a multiple regression, the categorization layer's contribution was insignificant (p = .3), showing that the explanatory power of the model rests with the gestalt layer. According to our model, then, human subjects' improved discrimination near category boundaries in this task is best explained as an effect at the level of gestalt-based representations, which were derived in an unsupervised way from Gabor representations via PCA. This is in sharp contrast to the standard explanation of this increased sensitivity, which is that categorization influences perception (Goldstone et al., 2001; Pevtzow & Harnad, 1997). This suggests that the facial expression morph sequences have natural boundaries, possibly because the endpoints are extremes of certain coordinated muscle group movements. In other domains, such as familiar face classification (Beale & Keil, 1995), the boundaries must arise through learning. In such domains, we expect that discrimination would be best explained in a learned feature layer, as in other models (Goldstone, 2000; Tijsseling & Harnad, 1997).
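The discriminability measure is likewise simple to state in code. The sketch below is illustrative, with random vectors standing in for real representations at each level; the dimensionalities follow the sizes given in Methods:

```python
import numpy as np
from scipy.stats import pearsonr

def discriminability(rep_i, rep_j):
    """Discrimination score for two faces: 1 minus the Pearson
    correlation between their representations at a given level."""
    r, _ = pearsonr(np.ravel(rep_i), np.ravel(rep_j))
    return 1.0 - r

# Placeholder representations of two faces at each level of the model.
levels = {
    "pixel": 70080,       # raw image vector
    "perceptual": 40600,  # Gabor magnitudes
    "gestalt": 50,        # PCA of the Gabor magnitudes
    "category": 6,        # network outputs
}
rng = np.random.default_rng(0)
for name, dim in levels.items():
    face_a, face_b = rng.normal(size=dim), rng.normal(size=dim)
    print(name, discriminability(face_a, face_b))
```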

Figure 7. Discrimination of morph stimuli. (a) Percent correct discrimination of pairs of stimuli in an ABX task (Young et al., 1997). Each point represents the percentage of time subjects correctly discriminated between two neighboring morph stimuli. The x-axis labels to the left and right of each point show which two stimuli were being compared (e.g., 70/90 along the transition from sadness to disgust denotes a comparison of a 70% disgust/30% sadness morph with a 90% disgust/10% sadness morph). Note that better performance occurs near category boundaries than near prototypes (highlighted by the vertical lines). (b) The model's discrimination performance at the gestalt representation level. The model discrimination scores are highly correlated with the human subjects' scores (r = .65, p < .001).



The data above show that the model naturally accounts for categorical behavior of humans making forced-choice responses to facial expression morph stimuli. In another experiment, Young et al. (1997) found that subjects could reliably detect the expression mixed into a morph, even at the 30% level. Can the model also account for this decidedly noncategorical behavior? In the experiment, subjects were asked to decide, given a morph or prototype stimulus, the most apparent emotion, the second-most apparent emotion, and the third-most apparent emotion (the scores on prototypes were used to normalize for inherent expression similarity). To compare the human responses with the model, we used the top three outputs of the 13 networks and used the same analysis as Young et al. to determine the extent to which the network could detect the mixed-in expression in the morph images (see Methods for details). Figure 8 shows that the model's average sensitivity to the mixed-in expression is almost identical to that of human subjects, even though its behavior seems categorical in forced-choice experiments.

Comparison of Similarity Structures

We next investigated the similarity structure of the representations that the model produced. As before, we calculated the similarity between pairs of faces at each level of the network by computing the correlation between their representations. In order to evaluate the similarity structure qualitatively, we performed MDS both on the human forced-choice responses published by Ekman and Friesen (1976) and on the network's responses to the same stimuli at each level of processing shown in Figure 4. At the pixel level, we found that the structure present in the MDS configuration was based mainly on identity; as we moved toward the categorization level, the identity-based structure began to break down and expression-related clusters began to emerge. Unsurprisingly, at the network's categorization layer, the configuration was clearly organized by emotion category (Figure 9). Of more interest, however, is that the "ordering" of facial expressions around the human and network MDS configurations is the same, a result unlikely to have arisen by chance (p = 1/60 = .017).

Figure 9. Multidimensional scaling of human and network responses reveals similar dimensions of emotion. Each point represents one of the 96 expressive faces in POFA. (a) 2-D similarity space induced by MDS from the average six-way forced-choice responses of human subjects (Ekman & Friesen, 1976) (stress = 0.218). (b) 2-D similarity space induced by MDS from the average training set responses of networks at their output layers (stress = 0.201). The arrangement of emotions around the circle is the same in both cases. Stress at the pixel level was 0.222; at the perceptual level, 0.245; and at the gestalt level, 0.286.

Figure 8. Ability to detect the mixed-in expression in morph sequences, for humans and the model. In the human data (dashed lines), subjects chose the far prototype (the faint second expression in the morph) significantly more often than unrelated expressions when that expression's mix ratio was 30% or greater, even though they consistently identified the 70% expression in forced-choice experiments. The model is almost identical to the humans in its ability to detect the secondary expression in a morph image.


Given that the network received no similarity information, this is a remarkable result. It suggests that the human similarity structure is a simple result of the inherent confusability of the categories, not necessarily the result of locality in some underlying psychological emotion space, as dimensional theories (e.g., Russell, 1980; Russell et al., 1989) might predict.

To measure the similarity between the human and network category structures quantitatively, we compared the human and model confusion matrices directly. For each pair of expressions, we computed the probability p_ij that humans or networks respond with emotion i when the intended emotion is j, for all i ≠ j, that is, the probability of confusion between two categories. The correlation between networks and humans on the networks' test sets (the stimuli the networks had never seen before) was .661 (p < .001; see Note 3). Thus, there is a close correspondence between human confusion rates and response distributions at the model's categorization level. The system is never instructed to confuse the categories in a way similar to humans; nevertheless, this property emerges naturally from the data.
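As an illustration, the off-diagonal comparison can be sketched as follows, assuming 6 × 6 confusion matrices of response probabilities for humans and networks (hypothetical inputs):

```python
import numpy as np
from scipy.stats import pearsonr

def offdiag_confusions(conf):
    """Extract p(respond i | intended j) for all i != j from a
    6 x 6 confusion matrix whose columns each sum to 1."""
    conf = np.asarray(conf, dtype=float)
    mask = ~np.eye(conf.shape[0], dtype=bool)
    return conf[mask]

def confusion_correlation(human_conf, network_conf):
    """Pearson correlation between the 30 off-diagonal confusion
    probabilities of the human and model matrices."""
    return pearsonr(offdiag_confusions(human_conf),
                    offdiag_confusions(network_conf))
```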

DISCUSSION

In this article, we have introduced a computational model that mimics some of the important functions of the visual system. The model is simply a pattern classifier incorporating a biologically plausible representation of visual stimuli. We first engineered the system to provide good classification performance on a small database of reliably recognizable facial expressions. The free parameters of the model, such as the number of principal components used in dimensionality reduction, were optimized to maximize the classifier's generalization performance. In contrast to most models in mathematical psychology, which seek to fit a small number of free parameters to maximize a model's agreement with human data, our model is compared to human data directly, without any tuning.

The results of our comparison of the model's performance with human data were nevertheless remarkable. We first found that the relative levels of difficulty for six basic facial expressions of emotion were highly correlated with the levels at which humans agree on those same emotions. For example, humans are best at classifying happy expressions because the smile makes the task easy. The network model likewise finds it extremely easy to detect smiles, because they are obvious visual features that discriminate happy expressions from the rest of the faces in its training set. On the other hand, humans usually find that fear expressions are very difficult to classify; fear expressions are often classified as surprise, for instance. The network model likewise sometimes classifies fear expressions as surprise. Why are fear expressions so difficult for humans to classify? Could it be that true displays of fear are uncommon in our society? Or that portrayals of fear in the popular culture misguide us as to what fearful people really look like? Or simply that fear expressions are perceptually similar to other expressions? Our results suggest that the latter is the case; culture probably has very little to do with the difficulty of fear expressions. We have shown that perceptual similarity alone is sufficient to account for the relative difficulty of facial expressions.

The agreement between human and network similarity structure analyses (MDS) is also somewhat surprising. As pointed out earlier, Russell and colleagues have found that affective adjectives and affective facial expressions seem to share a common similarity structure. One might postulate that the underlying psychological space reflects the physiological similarity of emotional states, and that when we are asked to classify facial expressions, we are likely to classify a given face as any of the emotions close by in that psychophysiological space. However, we have shown that an emotionless machine without any underlying physiology exhibits a similarity structure very much like that of the humans. This is even more remarkable when one considers that the network was not given any indication of the similarity structure in the training signal, which was always all-or-none, rather than, for example, human subject responses, which would reflect subjects' confusions. Since we nevertheless match the human similarity structure, we have shown that the perceptual similarity of the categories corresponds to the psychological similarity structure of facial expressions. Why would this be? We suggest that evolution did not randomly associate facial expressions with emotional states, but that the expression-to-emotion mapping evolved in tandem with the need to communicate emotional states effectively.

The final question we set out to explore with this work is whether facial expressions are represented continuously or as discrete entities. As explained in the Introduction, the best evidence for discrete representations is that subjects appear to place sharp boundaries between facial expression categories and are better able to discriminate pairs of expressions near category boundaries, suggesting that our perception of the faces is influenced by the existence of the categories. As has been found in other modalities (cf. Ellison & Massaro, 1997), we find that it is unnecessary to posit discrete representations to account for the sharp boundaries and high discrimination scores. The network model places boundaries between categories, as it must to obtain good classification accuracy, but the categories are actually fuzzy and overlapping. The model can be thought of as a biologically plausible working implementation of Russell and Bullock's (1986) theory of emotional facial expression categories as fuzzy concepts. Despite the network's fuzzy, overlapping category concepts, the categories appear sharp when a rule such as "respond with the category most strongly activated by the given facial expression" is applied. A more difficult result to explain in terms of fuzzy categories placed in a continuous multidimensional space is the subjects' high discrimination scores near category boundaries. This result seems to call for an influence of the category boundaries on our perception of the expressions; indeed, that is Young et al.'s (1997) tentative interpretation. To the contrary, we have shown that in our model, discrimination scores best agree with human results at a purely perceptual level, where the category labels have no effect. With respect to image space, the low-dimensional PCA representation actually changes faster in boundary regions than in regions near the prototypes it derived from. In the context of our model, then, the seemingly contradictory results in different experiments conducted by Young et al. can be explained as simply tapping different computational levels of processing, in a visual system organized much like our model.

We have found that our facial expression recognition model is "sufficient" to explain many aspects of human performance in behavioral tasks, but we have no proof of the "necessity" of any of our particular implementation decisions. In fact, we predict that many similar systems would obtain similar results. For instance, a system beginning with derivative-of-Gaussian edge filters (Marr, 1982), whose responses are combined to produce smooth responses to translations (like complex cells), should exhibit the same behavior. Replacing the PCA dimensionality reduction method with, say, factor analysis or the Infomax algorithm for independent components analysis (Bell & Sejnowski, 1995) should not dramatically affect the results. Finally, a category level using Support Vector Machines (Vapnik, 1995) should likewise produce similar behavior. The point of our simulations is that the category boundaries, discriminability, and similarity structures previously seen as being at odds are, in a sense, "present in the data and tasks themselves" and are easily exposed by any reasonable computational model.

METHODS

Network Details

The first step of processing in the model is to filter the image with a rigid 29 by 35 grid of overlapping 2-D Gabor filters (Daugman, 1985) in quadrature pairs, at five scales and eight orientations (some example filters are shown in Figure 5). The quadrature pairs are used to compute a phase-insensitive energy response at each point in the grid. These energy responses, or "Gabor magnitudes," are often used as a simplifying model of the spatial responses of complex cells in the early visual system (Lades et al., 1993). Though the model loses some information about precise feature localization when phase information is thrown away, the overlapping receptive fields compensate for this loss (Hinton, McClelland, & Rumelhart, 1986). Each Gabor magnitude is z-scored (linearly transformed to have mean 0 and variance 1 over the training data) so that each filter contributes equally in the next representation layer.
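A rough sketch of this first stage follows. The filter frequencies, bandwidths, and the isotropic Gaussian envelope are our assumptions for illustration; only the grid size, the scale and orientation counts, and the quadrature energy computation are taken from the text:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(freq, theta, sigma, size=31):
    """Complex 2-D Gabor: its real and imaginary parts form a
    quadrature (cosine/sine) pair. Isotropic envelope assumed."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.exp(2j * np.pi * freq * rot)

def gabor_magnitudes(image, scales=5, orientations=8):
    """Phase-insensitive energy responses sampled on a 29 x 35 grid
    (35 rows x 29 columns assumed). |complex response| is the energy
    of the quadrature pair at each grid point."""
    rows = np.linspace(0, image.shape[0] - 1, 35).astype(int)
    cols = np.linspace(0, image.shape[1] - 1, 29).astype(int)
    feats = []
    for s in range(scales):
        freq = 0.25 / (2 ** s)              # assumed one-octave spacing
        for o in range(orientations):
            k = gabor_kernel(freq, np.pi * o / orientations,
                             sigma=2.0 * 2 ** s)
            resp = fftconvolve(image, k, mode="same")
            feats.append(np.abs(resp)[np.ix_(rows, cols)].ravel())
    return np.concatenate(feats)            # 29*35*5*8 = 40,600 values
```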

The second step of processing in the model is to perform linear dimensionality reduction on the 40,600-element Gabor representation via a PCA of the training set. The actual computation is facilitated by Turk and Pentland's (1991) algebraic trick for "eigenfaces." This produces a 50-element representation, typically accounting for approximately 80% of the variance in the training set's Gabor data.
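The Gram-matrix ("eigenfaces") trick can be sketched as follows: with N training patterns of dimension 40,600 and N much smaller than that dimension, one diagonalizes the N × N matrix XXᵀ rather than the full covariance matrix. A minimal version, assuming a matrix of z-scored Gabor vectors:

```python
import numpy as np

def pca_gram_trick(X, k=50):
    """PCA via the N x N Gram matrix (Turk & Pentland, 1991).

    X: (N, D) matrix of z-scored Gabor magnitude vectors, N << D.
    Returns the (D, k) principal basis and the (N, k) projections.
    """
    Xc = X - X.mean(axis=0)
    gram = Xc @ Xc.T                        # (N, N) instead of (D, D)
    evals, evecs = np.linalg.eigh(gram)     # ascending eigenvalues
    order = np.argsort(evals)[::-1][:k]
    evals, evecs = evals[order], evecs[:, order]
    # Map Gram-matrix eigenvectors u to unit-norm data-space axes v:
    # v = X^T u / sqrt(lambda).
    basis = Xc.T @ evecs / np.sqrt(np.maximum(evals, 1e-12))
    return basis, Xc @ basis
```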

The 50-element vector p output by PCA can then be classified by a simple statistical model. We locally z-score (scale to mean 0, variance 1) each input element, then use a single-layer neural network (a generalized linear model) containing six outputs, one for each of the six "basic" emotions: happiness, sadness, fear, anger, surprise, and disgust. Each of the six units in the network computes a weighted sum $o_i = \sum_j w_{ij} p_j$ of the 50-element input vector; then the "softmax" function $y_i = e^{o_i} / \sum_j e^{o_j}$ is applied to the units' linear activations to obtain a vector of positive values whose sum is 1.0. The network is trained with the relative entropy error function, so that its outputs correspond to the posterior probabilities of the emotion categories given the inputs (Bishop, 1995).
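In code, the classifier is a few lines. A sketch, assuming a trained weight matrix W (6 × 50) and bias vector b:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())     # subtract max for numerical stability
    return e / e.sum()

def classify(p, W, b):
    """Single-layer net: six weighted sums of the 50-element
    (locally z-scored) PCA vector, pushed through softmax so the
    outputs are positive and sum to 1."""
    return softmax(W @ p + b)

# Example with placeholder parameters:
rng = np.random.default_rng(0)
probs = classify(rng.normal(size=50), rng.normal(size=(6, 50)),
                 np.zeros(6))
print(probs, probs.sum())       # six posteriors summing to 1.0
```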

Network Training

We train the expression recognition system using leave-one-out cross-validation and early stopping. A given network is trained to minimize output error on all the images of 12 of the 14 actors in POFA, using stochastic gradient descent, momentum, weight decay, and the relative entropy error function (Bishop, 1995). A thirteenth actor's images are used as a "hold-out" set to determine when to stop training; training is stopped when the error on the hold-out set is minimized. After training is stopped, we evaluate the network's generalization performance on the remaining (fourteenth) actor. We performed training runs with every possible combination of generalization and hold-out sets, for a total of 182 (14 × 13) individual networks.
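The cross-validation design can be sketched as a pair of nested loops over actors; `train_network` and `evaluate` below are placeholder stubs standing in for the SGD training with early stopping and the generalization test described above:

```python
def train_network(train_actors, holdout_actor):
    """Stub: SGD with momentum and weight decay, early-stopped when
    relative entropy error on the hold-out actor's images is minimal."""
    return {"trained_on": tuple(train_actors), "holdout": holdout_actor}

def evaluate(net, test_actor):
    """Stub: six-way classification accuracy on the unseen actor."""
    return 0.0

actors = [f"actor{i:02d}" for i in range(14)]  # 14 POFA actors (names hypothetical)
runs = []
for test_actor in actors:                      # generalization actor
    for holdout_actor in actors:               # early-stopping actor
        if holdout_actor == test_actor:
            continue
        train_actors = [a for a in actors
                        if a not in (test_actor, holdout_actor)]
        net = train_network(train_actors, holdout_actor)
        runs.append((test_actor, holdout_actor, evaluate(net, test_actor)))

assert len(runs) == 14 * 13                    # 182 networks in all
```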

Network Weight Visualization

The idea is to project each unit's weight vector back into image space in order to visualize what the network is sensitive to in an image. But this is not a well-defined task: though PCA is an easily inverted orthonormal transformation, the Gabor magnitude representation, besides being subsampled, throws away important phase information. As a workaround, we assume that each pixel's value is an approximately linear function of the 50-component gestalt-level (Gabor + PCA) representation. We chose one of the 182 trained networks, and for each pixel location, we used regression to find the linear function of the 50-element gestalt-level representation best predicting the pixel's value over the network's training set. Then, to visualize, say, the classifier's representation of happiness, we apply the regression function directly to the happiness unit's weights. An image was constructed from each of the six units' weight vectors using the regression functions learned from that network's training set.
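A sketch of this visualization procedure, assuming a matrix of gestalt codes and the corresponding flattened training images; how the bias term is handled when the mapping is applied to a weight vector is our assumption:

```python
import numpy as np

def fit_pixel_decoders(G, images):
    """For every pixel, fit the linear function of the 50-element
    gestalt vectors that best predicts that pixel over the training
    set (ordinary least squares with a bias term).

    G: (N, 50) gestalt codes; images: (N, n_pixels) flattened faces.
    Returns a (51, n_pixels) matrix of regression coefficients.
    """
    A = np.hstack([G, np.ones((G.shape[0], 1))])   # append bias column
    coef, *_ = np.linalg.lstsq(A, images, rcond=None)
    return coef

def visualize_unit(weights, coef):
    """Push one output unit's 50-element weight vector through the
    learned gestalt-to-pixel mapping (bias multiplier set to 1,
    an assumption), yielding one 'image' per emotion unit."""
    return np.append(weights, 1.0) @ coef
```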

Generation of Morphs

We generated morphs from the original stimuli (Ekman & Friesen, 1976) using the Morph program, version 2.5, from Gryphon Software, as described elsewhere (Jansari, Tranel, & Adolphs, 2000). Briefly, for each of the 15 pairs of expression prototypes, corresponding features between the two images were manually specified. The images were then tessellated and linearly transformed, both with respect to pixel position (a smooth warping) and pixel grayscale value (a smooth fade in luminance). The 10%, 30%, 50%, 70%, and 90% blends (see Figure 1b for examples) were retained for each transformation.
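Although the original morphs were produced with commercial software, the warp-and-fade recipe can be outlined as follows; `warp_fn` is a hypothetical piecewise warping routine supplied by the caller:

```python
import numpy as np

def morph_frame(img_a, img_b, pts_a, pts_b, alpha, warp_fn):
    """One frame of a feature-based morph: warp both endpoint images
    toward control points blended by `alpha`, then cross-fade the
    grayscale values. `warp_fn(image, src_pts, dst_pts)` stands in
    for the tessellation-based warp of the commercial program."""
    pts = (1 - alpha) * pts_a + alpha * pts_b       # blended geometry
    warped_a = warp_fn(img_a, pts_a, pts)
    warped_b = warp_fn(img_b, pts_b, pts)
    return (1 - alpha) * warped_a + alpha * warped_b  # blended luminance

# alphas = [0.1, 0.3, 0.5, 0.7, 0.9] reproduce the retained blends.
```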

Multidimensional Scaling

MDS seeks to embed a set of stimuli in a low-dimensional space in such a way that the distances between points in the low-dimensional space are as faithful as possible to ratings of their similarity (Borg & Lingoes, 1987). We performed MDS on the human responses (Ekman & Friesen, 1976) and on the network's responses to the same stimuli at each level of processing in the network. Each analysis requires a distance (dissimilarity) matrix enumerating how dissimilar each pair of stimuli is. For the human data, we formed a six-element vector containing the probability with which humans gave the labels happy, sad, afraid, angry, surprised, and disgusted for each of the 96 nonneutral photographs in POFA. We obtained a 96 by 96 element similarity matrix from these data by computing the correlation $r_{ij}$ between each pair of six-element vectors. Finally, we converted the resulting similarity matrix into a distance (dissimilarity) matrix with the transform $d_{ij} = (1 - r_{ij})/2$ to obtain values in the range 0–1. For the network, we measure the similarity between two stimuli i and j as the correlation $r_{ij}$ between the representations of the two stimuli at each level of the network, corresponding to similarity at different levels of processing: the pixel level (70,080-element image pixel value vectors), the perceptual level (40,600-element Gabor response patterns), the gestalt level (50-element Gabor + PCA patterns), and the network's output (six-element category level). We ran a classical nonmetric MDS algorithm, SSA-1, due to Guttman and Lingoes (Borg & Lingoes, 1987), on each of the four resulting distance matrices described above, and then plotted the stimuli according to their position in the resulting 2-D configuration.
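A sketch of the human-data pipeline, using scikit-learn's nonmetric MDS as a stand-in for SSA-1 (an assumption; the two algorithms differ in detail):

```python
import numpy as np
from sklearn.manifold import MDS

def label_dissimilarities(P):
    """P: (96, 6) matrix; row i holds the probabilities with which a
    face drew the labels happy, sad, afraid, angry, surprised, and
    disgusted. Returns the 96 x 96 dissimilarities (1 - r_ij) / 2."""
    R = np.corrcoef(P)          # pairwise correlations of the rows
    return (1.0 - R) / 2.0

def embed_2d(D, seed=0):
    """2-D nonmetric embedding of a precomputed dissimilarity matrix."""
    mds = MDS(n_components=2, metric=False,
              dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(D)
```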

Model RTs

We assume a network's RT is directly proportional to the "uncertainty of its maximal output." That is, we define a network's model RT for stimulus i to be $t_i^{model} = 1 - \max_j y_{ij}$ (the time scale is arbitrary), where $y_{ij}$ is the network's output for emotion j on stimulus i. This is similar to the standard approach of equating RT with a network's output error (Seidenberg & McClelland, 1989), except that for morph stimuli there is no predetermined "correct" response. For comparison, the human data available are $t_{ij}^{human}$, the mean RT of subjects responding with emotion j to stimulus i. To compare the model to the humans given these data, we treated each network as an individual subject and, for each morph stimulus, recorded the network's response emotion j and model reaction time $t_i^{model}$, then averaged $t_i^{model}$ over all networks making the same response to the stimulus. The quantitative comparison is the linear fit between network predictions and human RTs for all stimulus and response pairs for which both human and network data were available. Young et al.'s criterion for reporting $t_{ij}^{human}$ was that at least 25% of the human subjects responded with emotion j to stimulus i; for some of these cases, none of the networks responded with emotion j to stimulus i, so the missing human and network data were disregarded.
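The averaging and comparison step might look as follows; the dictionary of human RTs keyed by (stimulus, response) pairs is a hypothetical stand-in for Young et al.'s published data:

```python
import numpy as np
from scipy.stats import linregress

def pooled_model_rts(outputs):
    """outputs: (n_networks, n_stimuli, 6) softmax outputs.

    Treat each network as a subject: record its response emotion and
    model RT per stimulus, then average the RTs over the networks that
    made the same response. Returns {(stimulus, emotion): mean RT}.
    """
    rts = {}
    for i in range(outputs.shape[1]):
        responses = outputs[:, i, :].argmax(axis=1)
        times = 1.0 - outputs[:, i, :].max(axis=1)
        for j in np.unique(responses):
            rts[(i, int(j))] = times[responses == j].mean()
    return rts

def fit_to_humans(model_rts, human_rts):
    """Linear fit over the (stimulus, response) pairs present in both
    data sets; pairs missing on either side are disregarded."""
    keys = sorted(set(model_rts) & set(human_rts))
    return linregress([model_rts[k] for k in keys],
                      [human_rts[k] for k in keys])
```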

Model Discrimination Scores

We assume discrimination is more difficult the more similar two stimuli are at some level of processing. We use the same measure of similarity as in the MDS procedure: the correlation $r_{ij}$ between the network's representations of stimuli i and j (either at the pixel, perceptual, gestalt, or output level). To convert similarity scores to discrimination scores, we used the transform $d_{ij} = 1 - r_{ij}$. This was measured for each of the 30 pairs of JJ images for which human data were available. The discrimination scores were then averaged over the 13 networks that had not been trained on JJ and compared to the human data.

Mixed-in Expression Detection

To measure the ability of the model to detect the secondary expression mixed into a morph stimulus, we followed Young et al.'s (1997) methods. For each network's output on a given stimulus, we scored the first-, second-, and third-highest outputs of the network as a 3, 2, and 1, respectively, and assigned the score 0 to the three remaining outputs. For each morph and prototype stimulus, we averaged these score vectors across all 13 networks. Then, for each of the 30 possible combinations of near prototype (i) and far prototype (j), using the 90%, 70%, and 50% morphs moving from expression i to expression j, we subtracted the score vector for prototype i from the score for each of the three morphs. This eliminates the effect of the intrinsic similarity between JJ's prototype expressions. Finally, averaging these score vectors across all 30 sequences, we obtain the data plotted in Figure 8.
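A sketch of this scoring scheme for a single morph/prototype pair, assuming arrays of network outputs:

```python
import numpy as np

def rank_scores(outputs):
    """Score a network's six outputs 3/2/1 for the first-, second-,
    and third-highest responses, and 0 for the rest."""
    scores = np.zeros(6)
    scores[np.argsort(outputs)[::-1][:3]] = [3, 2, 1]
    return scores

def morph_minus_prototype(morph_outputs, proto_outputs):
    """Average the rank scores over networks, then subtract the near
    prototype's score vector from the morph's to remove the intrinsic
    similarity between the prototype expressions.

    morph_outputs, proto_outputs: (n_networks, 6) arrays.
    """
    morph_avg = np.mean([rank_scores(o) for o in morph_outputs], axis=0)
    proto_avg = np.mean([rank_scores(o) for o in proto_outputs], axis=0)
    return morph_avg - proto_avg
```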

Acknowledgments

We thank Stevan Harnad, Bill Kristan, Paul Munro, Alice O'Toole, Terry Sejnowski, and Gary's Unbelievable Research Unit (GURU) for helpful comments on previous versions of this manuscript. We also thank Andrew Young for permission to use the human data plotted in Figures 2, 3, 7a, and 8. We are also grateful to Paul Ekman for providing us with the Pictures of Facial Affect. This research was funded by NIMH grant MH57075 to GWC.

Reprint requests should be sent to Garrison W. Cottrell, UCSD Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA, or via e-mail: gary@cs.ucsd.edu.

Notes

1. For comparison, the order of difficulty for the Calder et al. (2001) PCA model is happiness (98%), fear (97%), surprise (92%), anger (86%), sadness (72%), and disgust (72%).
2. We used 13 networks for technical reasons (see Methods for details).
3. The same comparison of our model with the confusion matrix from the Calder et al. (2001) forced-choice experiment is slightly better, .686, whereas their PCA model's correlation with their subjects' confusion data is somewhat lower, .496.

REFERENCES

Beale, J., & Keil, F. (1995). Categorical effects in the perception of faces. Cognition, 57, 217–239.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer.

Calder, A., Young, A., Perrett, D., Etcoff, N., & Rowland, D. (1996). Categorical perception of morphed facial expressions. Visual Cognition, 3, 81–117.

Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179–1208.

Calder, A. J., Young, A. W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 526–551.

Carroll, J. M., & Russell, J. A. (1996). Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70, 205–218.

Cottrell, G. W., Dailey, M. N., Padgett, C., & Adolphs, R. (2000). Is all face processing holistic? The view from UCSD. In M. Wenger & J. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Contexts and challenges (pp. 347–395). Mahwah, NJ: Erlbaum.

Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, gender and emotion recognition using holons. In R. P. Lippman, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564–571). San Mateo: Morgan Kaufmann.

Cottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. E. Sharkey (Ed.), Models of cognition: A review of cognitive science. New Jersey: Norwood.

Craw, I., & Cameron, P. (1991). Parameterising images for recognition and reconstruction. In Proceedings of the British Machine Vision Conference (pp. 367–370). Berlin: Springer.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 974–989.

Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and emotion (chap. 3, pp. 45–60). New York: Wiley.

Ekman, P., & Friesen, W. (1976). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists Press.

Ellison, J. W., & Massaro, D. W. (1997). Featural evaluation, integration, and judgment of facial affect. Journal of Experimental Psychology: Human Perception and Performance, 23, 213–226.

Etcoff, N. L., & Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227–240.

Goldstone, R. L. (2000). A neural network model of concept-influenced segmentation. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.

Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representation through category learning. Cognition, 78, 27–43.

Hancock, P. J. B., Bruce, V., & Burton, A. M. (1998). A comparison of two computer-based face recognition systems with human perception of faces. Vision Research, 38, 2277–2288.

Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory and Cognition, 24, 26–40.

Harnad, S. R. (1987). Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.

Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (vol. 1, chap. 3, pp. 77–109). Cambridge: MIT Press.

Jansari, A., Tranel, D., & Adolphs, R. (2000). A valence-specific lateral bias for discriminating emotional facial expressions in free field. Cognition and Emotion, 14, 341–353.

Katsikitis, M. (1997). The classification of facial expressions of emotion: A multidimensional scaling approach. Perception, 26, 613–626.

Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300–311.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.

Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1–12.

Lien, J. J.-J., Kanade, T., Cohn, J. F., & Li, C.-C. (2000). Detection, tracking, and classification of action units in facial expression. Robotics and Autonomous Systems, 31, 131–146.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.

Lyons, M. J., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (pp. 200–205). Los Alamitos, CA: IEEE Computer Society.

Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1357–1362.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.

Matsumoto, D. (1992). American–Japanese cultural differences in the recognition of universal facial expressions. Journal of Cross-Cultural Psychology, 23, 72–84.

Padgett, C., Cottrell, G., & Adolphs, R. (1996). Categorical perception in facial emotion classification. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Padgett, C., & Cottrell, G. W. (1998). A simple neural network models categorical perception of facial expressions. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 806–807). Mahwah, NJ: Erlbaum.

Perrett, D., Hietanen, J., Oram, M., & Benson, P. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 335, 23–30.

Pevtzow, R., & Harnad, S. R. (1997). Warping similarity space in category learning by human subjects: The role of task difficulty. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 189–195). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199–1204.

Rosenblum, M., Yacoob, Y., & Davis, L. S. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7, 1121–1138.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161–1178.

Russell, J. A., & Bullock, M. (1986). Fuzzy concepts and the perception of emotion in facial expressions. Social Cognition, 4, 309–341.

Russell, J. A., Lewicka, M., & Niit, T. (1989). A cross-cultural study of a circumplex model of affect. Journal of Personality and Social Psychology, 57, 848–856.

Sanger, T. (1989). An optimality principle for unsupervised learning. In D. Touretzky (Ed.), Advances in neural information processing systems (vol. 1, pp. 11–19). San Mateo: Morgan Kaufmann.

Schiano, D. J., Ehrlich, S. M., Sheridan, K. M., & Beck, D. M. (2000). Evidence for continuous rather than categorical perception of facial affect. Abstract presented at the 40th Annual Meeting of the Psychonomic Society.

Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44, 229–237.

Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.

Tijsseling, A., & Harnad, S. R. (1997). Warping similarity space in category learning by backprop nets. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 263–269). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.

Woodworth, R. S. (1938). Experimental psychology. New York: Holt.

Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271–313.


Page 11: EMPATH: A Neural Network that Categorizes Facial Expressionscseweb.ucsd.edu/~mdailey/papers/DaileyEtalCogneuro.pdfconceptual and linguistic systems’ ’ (p. 229). The authors created

feature layer as in other models (Goldstone 2000Tijsseling amp Harnad 1997)

The data above show that the model naturallyaccounts for categorical behavior of humans makingforced-choice responses to facial expression morphstimuli In another experiment Young et al (1997)found that subjects could reliably detect the expres-sion mixed into a morph even at the 30 level Canthe model also account for this decidedly noncategor-ical behavior In the experiment subjects were askedto decide given a morph or prototype stimulus themost apparent emotion the second-most apparentemotion and the third-most apparent emotion (thescores on prototypes were used to normalize for inher-ent expression similarity) To compare the humanresponses with the model we used the top three out-puts of the 13 networks and used the same analysis asYoung et al to determine the extent to which thenetwork could detect the mixed-in expression in themorph images (see Methods for details) Figure 8 showsthat the modelrsquos average sensitivity to the mixed-inexpression is almost identical to that of human subjectseven though its behavior seems categorical in forced-choice experiments

Comparison of Similarity Structures

We next investigated the similarity structure of therepresentations that the model produced As beforewe calculated the similarity between pairs of faces ateach level of the network by computing the correlationbetween their representations In order to evaluate thesimilarity structure qualitatively we performed MDSboth on the human forced-choice responses publishedby Ekman and Friesen (1976) and on the networkrsquosresponses to the same stimuli at each level of pro-

cessing shown in Figure 4 At the pixel level we foundthat the structure present in the MDS configurationwas based mainly on identity and as we moved towardthe categorization level the identity-based structurebegan to break down and expression-related clustersbegan to emerge Unsurprisingly at the networkrsquos cat-egorization layer the configuration was clearly organ-ized by emotion category (Figure 9) Of more interesthowever is that the lsquolsquoorderingrsquorsquo of facial expressionsaround the human and network MDS configurationsis the same a result unlikely to have arisen by chance

Figure 9 Multidimensional scaling of human and network responsesreveals similar dimensions of emotion Each point represents one ofthe 96 expressive faces in POFA (a) 2-D similarity space induced byMDS from the average six-way forced-choice responses of humansubjects (Ekman amp Friesen 1976) (stress = 0218) (b) 2-D similarityspace induced by MDS from the average training set responses ofnetworks at their output layers (stress = 0201) The arrangement ofemotions around the circle is the same in both cases Stress at the pixellevel was 0222 the perceptual level 0245 and the gestalt level 0286

Figure 8 Ability to detect mixed-in expression in morph sequencesfor humans and the model In the human data (dashed lines) subjectschose the far prototype (the faint second expression in the morph)significantly more often than unrelated expressions when thatexpressionrsquos mix ratio was 30 or greater even though theyconsistently identified the 70 expression in forced-choice experi-ments The model is almost identical to the humans in its ability todetect the secondary expression in a morph image

1168 Journal of Cognitive Neuroscience Volume 14 Number 8

( p = 160 = 017) Given that the network was nevergiven any similarity information this is a remarkableresult It suggests that the human similarity structureis a simple result of the inherent confusability of thecategories not necessarily the result of locality in someunderlying psychological emotion space as dimen-sional theories (eg Russell 1980 Russell et al1989) might predict

To measure the similarity between the human andnetwork category structures quantitatively we com-pared the human and model confusion matrices directlyFor each pair of expressions we computed the proba-bility pi j that humans or networks respond with emotioni when the intended emotion is j for all i 6ˆ j that is theprobability of confusion between two categories Thecorrelation between networks and humans on the net-worksrsquo test sets (the stimuli the networks had neverseen before) was 661 ( p lt 001)3 Thus there is a closecorrespondence between human confusion rates andresponse distributions at the modelrsquos categorizationlevel The system is never instructed to confuse thecategories in a way similar to humans nevertheless thisproperty emerges naturally from the data

DISCUSSION

In this article we have introduced a computationalmodel that mimics some of the important functions ofthe visual system The model is simply a pattern classifierincorporating a biologically plausible representation ofvisual stimuli We first engineered the system to providegood classification performance on a small database ofreliably recognizable facial expressions The free param-eters of the model such as the number of principalcomponents used in dimensionality reduction wereoptimized to maximize the classifierrsquos generalizationperformance In contrast to most models in mathemat-ical psychology which seek to fit a low number of freeparameters to maximize a modelrsquos agreement withhuman data our model is compared to human datadirectly without any tuning

The results of our comparison of the modelrsquos per-formance with human data were nevertheless remark-able We first found that the relative levels of difficultyfor six basic facial expressions of emotion were highlycorrelated with the levels at which humans agree onthose same emotions For example humans are best atclassifying happy expressions because the smile makesthe task easy The network model likewise finds itextremely easy to detect smiles because they areobvious visual features that discriminate happy expres-sions from the rest of the faces in its training set Onthe other hand humans usually find that fear expres-sions are very difficult to classify Fear expressions areoften classified as surprise for instance The networkmodel likewise sometimes classifies fear expressions assurprise Why are fear expressions so difficult for

humans to classify Could it be that true displays offear are uncommon in our society Or that portrayals offear in the popular culture misguide us as to whatfearful people really look like Or simply that fearexpressions are perceptually similar to other expres-sions Our results suggest that the latter is the casemdashculture probably has very little to do with the difficultyof fear expressions We have shown that perceptualsimilarity alone is sufficient to account for the relativedifficulty of facial expressions

The agreement between human and network similar-ity structure analyses (MDS) is also somewhat surprisingAs pointed out earlier Russell and colleagues have foundthat affective adjectives and affective facial expressionsseem to share a common similarity structure One mightpostulate that the underlying psychological spacereflects the physiological similarity of emotional statesand that when we are asked to classify facial expressionswe are likely to classify a given face as any of theemotions close by in that psychophysiological spaceHowever we have shown that an emotionless machinewithout any underlying physiology exhibits a similaritystructure very much like that of the humans This is evenmore remarkable when one considers that the networkwas not given any indication of the similarity structure inthe training signal which was always all-or-none ratherthan for example human subject responses whichwould reflect subjectsrsquo confusions Since we neverthelessmatch the human similarity structure we have shownthat the perceptual similarity of the categories corre-sponds to the psychological similarity structure of facialexpressions Why would this be We suggest that evolu-tion did not randomly associate facial expressions withemotional states but that the expression-to-emotionmapping evolved in tandem with the need to communi-cate emotional states effectively

The final question we set out to explore with thiswork is whether facial expressions are representedcontinuously or as discrete entities As explained inthe Introduction the best evidence for discrete repre-sentations is that subjects appear to place sharp boun-daries between facial expression categories and arebetter able to discriminate pairs of expressions nearcategory boundaries suggesting that our perception ofthe faces is influenced by the existence of the catego-ries As has been found in other modalities (cf Ellisonamp Massaro 1997) we find that it is unnecessary toposit discrete representations to account for the sharpboundaries and high discrimination scores The net-work model places boundaries between categories asit must to obtain good classification accuracy but thecategories are actually fuzzy and overlapping Themodel can be thought of as a biologically plausibleworking implementation of Russell and Bullockrsquos(1986) theory of emotional facial expression categoriesas fuzzy concepts Despite the networkrsquos fuzzy over-lapping category concepts the categories appear

Dailey et al 1169

sharp when a rule such as lsquolsquorespond with the categorymost strongly activated by the given facial expressionrsquorsquois applied A more difficult result to explain in terms offuzzy categories placed in a continuous multidimen-sional space is the subjectsrsquo high discrimination scoresnear category boundaries This result seems to call foran influence of the category boundaries on our per-ception of the expressions indeed that is Younget alrsquos (1997 ) tentative interpretation To the contrarywe have shown that in our model discriminationscores best agree with human results at a purely per-ceptual level where the category labels have no effectWith respect to image space the low-dimensional PCArepresentation actually changes faster in boundaryregions than in regions near the prototypes it derivedfrom In the context of our model then the seeminglycontradictory results in different experiments con-ducted by Young et al can be explained as simplytapping different computational levels of processing ina visual system organized much like our model

We have found that our facial expression recognitionmodel is lsquolsquosufficientrsquorsquo to explain many aspects of humanperformance in behavioral tasks but we have no proofof the lsquolsquonecessityrsquorsquo of any of our particular implementa-tion decisions In fact we predict that many similarsystems would obtain similar results For instance asystem beginning with derivative-of-Gaussian edge filters(Marr 1982) whose responses are combined to producesmooth responses to translations (like complex cells)should exhibit the same behavior Replacing the PCAdimensionality reduction method with say factor anal-ysis or the Infomax algorithm for independent compo-nents analysis (Bell amp Sejnowski 1995) should notdramatically affect the results Finally a category levelusing Support Vector Machines (Vapnik 1995) shouldlikewise produce similar behavior The point of oursimulations is that the category boundaries discrimina-bility and similarity structures previously seen as beingat odds are in a sense lsquolsquopresent in the data and tasksthemselvesrsquorsquo and are easily exposed by any reasonablecomputational model

METHODS

Network Details

The first step of processing in the model is to filter the image with a rigid 29 by 35 grid of overlapping 2-D Gabor filters (Daugman, 1985) in quadrature pairs at five scales and eight orientations (some example filters are shown in Figure 5). The quadrature pairs are used to compute a phase-insensitive energy response at each point in the grid. These linear energy responses, or "Gabor magnitudes," are often used as a simplifying model of the spatial responses of complex cells in the early visual system (Lades et al., 1993). Though the model loses some information about precise feature localization when phase information is thrown away, the overlapping receptive fields compensate for this loss (Hinton, McClelland, & Rumelhart, 1986). Each Gabor magnitude is z-scored (linearly transformed to have mean 0 and variance 1 over the training data) so that each filter contributes equally in the next representation layer.
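
To make this stage concrete, the following is a minimal numpy/scipy sketch of the Gabor magnitude computation. The wavelengths, envelope width, and grid placement are illustrative assumptions, not the exact parameter values used in our simulations:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_pair(size, wavelength, theta):
    """Quadrature pair of 2-D Gabor filters (cosine- and sine-phase)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)            # rotated axis
    env = np.exp(-(x ** 2 + y ** 2) / (2 * (size / 5.0) ** 2))
    return (env * np.cos(2 * np.pi * xr / wavelength),
            env * np.sin(2 * np.pi * xr / wavelength))

def gabor_magnitudes(image, wavelengths=(4, 8, 16, 32, 64),
                     n_orient=8, grid=(35, 29)):
    """Phase-insensitive energy responses sampled on a rigid grid."""
    rows = np.linspace(0, image.shape[0] - 1, grid[0]).astype(int)
    cols = np.linspace(0, image.shape[1] - 1, grid[1]).astype(int)
    feats = []
    for wl in wavelengths:
        for k in range(n_orient):
            even, odd = gabor_pair(int(2 * wl) | 1, wl, k * np.pi / n_orient)
            re = fftconvolve(image, even, mode="same")
            im = fftconvolve(image, odd, mode="same")
            mag = np.hypot(re, im)                        # quadrature energy
            feats.append(mag[np.ix_(rows, cols)].ravel())
    return np.concatenate(feats)    # 35 * 29 * 5 * 8 = 40,600 values
```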

The second step of processing in the model is to perform linear dimensionality reduction on the 40,600-element Gabor representation via a PCA of the training set. The actual computation is facilitated by Turk and Pentland's (1991) algebraic trick for "eigenfaces." This produces a 50-element representation, typically accounting for approximately 80% of the variance in the training set's Gabor data.
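
A sketch of this step follows; the function and variable names are our own, and the data-matrix layout (one image per row) is an assumption:

```python
import numpy as np

def pca_gram_trick(X, k=50):
    """PCA of an n x d data matrix X with n << d, via the n x n Gram matrix
    (Turk & Pentland, 1991) rather than the d x d covariance matrix."""
    mean = X.mean(axis=0)
    Xc = X - mean
    evals, evecs = np.linalg.eigh(Xc @ Xc.T)     # n x n eigenproblem
    order = np.argsort(evals)[::-1][:k]          # top-k components
    comps = Xc.T @ evecs[:, order]               # back to d dimensions
    comps /= np.linalg.norm(comps, axis=0)       # unit-length components
    return mean, comps

# Projection to the 50-element "gestalt" code (names are illustrative):
# mean, comps = pca_gram_trick(train_gabor_matrix, k=50)
# p = (gabor_vector - mean) @ comps
```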

The 50-element vector p output by PCA can then be classified by a simple statistical model. We locally z-score (scale to mean 0, variance 1) each input element, then use a single-layer neural network (a generalized linear model) containing six outputs, one for each of the six "basic" emotions: happiness, sadness, fear, anger, surprise, and disgust. Each of the six units in the network computes a weighted sum $o_i = \sum_j w_{ij} p_j$ of the 50-element input vector; then the "softmax" function $y_i = e^{o_i} / \sum_j e^{o_j}$ is applied to the units' linear activations to obtain a vector of positive values whose sum is 1.0. The network is trained with the relative entropy error function so that its outputs correspond to the posterior probabilities of the emotion categories given the inputs (Bishop, 1995).
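
In code, the classifier and a bare stochastic gradient step might look like the sketch below; momentum and weight decay (used in the actual training procedure, described next) are omitted here, and the learning rate is arbitrary:

```python
import numpy as np

def forward(W, p):
    """o_i = sum_j w_ij p_j, then softmax y_i = exp(o_i) / sum_j exp(o_j)."""
    o = W @ p                          # W is 6 x 50, p is the gestalt code
    e = np.exp(o - o.max())            # shift by max for numerical stability
    return e / e.sum()

def sgd_step(W, p, target, lr=0.1):
    """One stochastic gradient step on the relative entropy error; with
    softmax outputs the gradient reduces to (y - target)."""
    y = forward(W, p)
    W -= lr * np.outer(y - target, p)
    return W
```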

Network Training

We train the expression recognition system using leave-one-out cross-validation and early stopping. A given network is trained to minimize output error on all the images of 12 of the 14 actors in POFA using stochastic gradient descent, momentum, weight decay, and the relative entropy error function (Bishop, 1995). A thirteenth actor's images are used as a "hold out" set to determine when to stop training. Training is stopped when the error on the hold out set is minimized. After training is stopped, we evaluate the network's generalization performance on the remaining (fourteenth) actor. We performed training runs with every possible combination of generalization and hold out sets, for a total of 182 (14 by 13) individual networks.
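
The cross-validation scheme reduces to a few lines; in the sketch below, train_fn is a hypothetical callback standing in for the SGD run just described:

```python
import itertools

def train_all_networks(actors, train_fn):
    """Every ordered (test, holdout) pair of the 14 POFA actors yields one
    network: 14 * 13 = 182 in total. train_fn is a hypothetical callback
    that runs the SGD training described above, stopping when error on the
    holdout actor's images is minimized."""
    networks = {}
    for test, holdout in itertools.permutations(actors, 2):
        train_set = [a for a in actors if a not in (test, holdout)]  # 12 actors
        networks[(test, holdout)] = train_fn(train_set, holdout)
    return networks
```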

Network Weight Visualization

The idea is to project each unit's weight vector back into image space in order to visualize what the network is sensitive to in an image. But this is not a well-defined task: though PCA is an easily inverted orthonormal transformation, the Gabor magnitude representation, besides being subsampled, throws away important phase information. As a workaround, we assume that each pixel's value is an approximately linear function of the 50-component gestalt-level (Gabor + PCA) representation. We chose one of the 182 trained networks, and for each pixel location, we used regression to find the linear function of the 50-element gestalt-level representation best predicting the pixel's value over the network's training set. Then, to visualize, say, the classifier's representation of happiness, we apply the regression function directly to the happiness unit's weights. An image was constructed from each of the six units' weight vectors using the regression functions learned from that network's training set.
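
A sketch of this workaround, assuming the training images' gestalt codes and pixel vectors are available as matrices (whether the regression intercept should also be applied to the weight vectors is a presentational choice; this sketch applies it):

```python
import numpy as np

def weight_images(P, X_pixels, W):
    """P: n x 50 gestalt codes of the training images; X_pixels: n x d pixel
    vectors of the same images; W: 6 x 50 classifier weights. Least squares
    finds, for every pixel, the affine function of the gestalt code that best
    predicts the pixel's value; applying those functions to a unit's weight
    vector renders it as an image."""
    A = np.hstack([P, np.ones((P.shape[0], 1))])          # affine regressors
    coef, *_ = np.linalg.lstsq(A, X_pixels, rcond=None)   # (50 + 1) x d
    Wa = np.hstack([W, np.ones((W.shape[0], 1))])         # apply full map
    return Wa @ coef                                      # one image per unit
```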

Generation of Morphs

We generated morphs from the original stimuli (Ekman & Friesen, 1976) using the Morph program, version 2.5, from Gryphon Software, as described elsewhere (Jansari, Tranel, & Adolphs, 2000). Briefly, for each of the 15 pairs of expression prototypes, corresponding features between the two images were manually specified. The images were then tessellated and linearly transformed, both with respect to pixel position (a smooth warping) and pixel grayscale value (a smooth fade in luminance). The 10%, 30%, 50%, 70%, and 90% blends (see Figure 1b for examples) were retained for each transformation.
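
The warping itself was performed by the commercial Morph program; the sketch below shows only the final combination, assuming backward-warp coordinate fields for the given blend level have already been computed from the hand-marked correspondences:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def blend(img_a, img_b, src_a, src_b, alpha):
    """src_a and src_b are 2 x H x W arrays giving, for every output pixel,
    the (row, col) position to sample in each prototype image, assumed to be
    precomputed for this blend level from the feature correspondences. The
    warp and the grayscale fade are both linear in alpha."""
    warped_a = map_coordinates(img_a, src_a, order=1)     # smooth warp of A
    warped_b = map_coordinates(img_b, src_b, order=1)     # smooth warp of B
    return (1.0 - alpha) * warped_a + alpha * warped_b    # luminance fade
```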

Multidimensional Scaling

MDS seeks to embed a set of stimuli in a low-dimensional space in such a way that the distances between points in the low-dimensional space are as faithful as possible to ratings of their similarity (Borg & Lingoes, 1987). We performed MDS on the human responses (Ekman & Friesen, 1976) and on the network's responses to the same stimuli at each level of processing in the network. Each analysis requires a distance (dissimilarity) matrix enumerating how dissimilar each pair of stimuli is. For the human data, we formed a six-element vector containing the probability with which humans gave the labels happy, sad, afraid, angry, surprised, and disgusted for each of the 96 nonneutral photographs in POFA. We obtained a 96 by 96 element similarity matrix from these data by computing the correlation $r_{ij}$ between each pair of six-element vectors. Finally, we converted the resulting similarity matrix into a distance (dissimilarity) matrix with the transform $d_{ij} = (1 - r_{ij})/2$ to obtain values in the range 0–1. For the network, we measure the similarity between two stimuli i and j as the correlation $r_{ij}$ between the representations of the two stimuli at each level of the network, corresponding to similarity at different levels of processing: the pixel level (70,080-element image pixel value vectors), the perceptual level (40,600-element Gabor response patterns), the gestalt level (50-element Gabor + PCA patterns), and the network's output (six-element category level). We ran a classical nonmetric MDS algorithm, SSA-1, due to Guttman and Lingoes (Borg & Lingoes, 1987), on each of the four resulting distance matrices described above, and then plotted the stimuli according to their position in the resulting 2-D configuration.
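
A sketch of the distance computation follows; scikit-learn's SMACOF-based nonmetric MDS stands in here for the Guttman-Lingoes SSA-1 program we actually used:

```python
import numpy as np
from sklearn.manifold import MDS

def response_distances(R):
    """R: n_stimuli x k response matrix (e.g., the 96 x 6 human label
    probabilities, or a network layer's activations, one row per stimulus).
    Returns d_ij = (1 - r_ij) / 2, where r_ij is the row correlation."""
    return (1.0 - np.corrcoef(R)) / 2.0      # map [-1, 1] onto [0, 1]

def embed_2d(D, seed=0):
    """Nonmetric 2-D embedding of a precomputed distance matrix."""
    mds = MDS(n_components=2, metric=False,
              dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(D)
```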

Model RTs

We assume a network's RT is directly proportional to the "uncertainty of its maximal output." That is, we define a network's model RT for stimulus i to be $t_i^{\mathrm{model}} = 1 - \max_j y_{ij}$ (the time scale is arbitrary). Here $y_{ij}$ is the network's output for emotion j on stimulus i. This is similar to the standard approach of equating RT with a network's output error (Seidenberg & McClelland, 1989), except that for morph stimuli, there is no predetermined "correct" response. For comparison, the human data available are $t_{ij}^{\mathrm{human}}$, the mean RT of subjects responding with emotion j to stimulus i. To compare the model to the humans given these data, we treated each network as an individual subject and, for each morph stimulus, recorded the network's response emotion j and model reaction time $t_i^{\mathrm{model}}$, then averaged $t_i^{\mathrm{model}}$ over all networks making the same response to the stimulus. The quantitative comparison is the linear fit between network predictions and human RTs for all stimuli and response pairs for which both human and network data were available. Young et al.'s criterion for reporting $t_{ij}^{\mathrm{human}}$ was that at least 25% of the human subjects responded with emotion j to stimulus i, and for some of these cases, none of the networks responded with emotion j to stimulus i, so the missing human and network data were disregarded.
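
A minimal sketch of the model RT and its aggregation across networks:

```python
import numpy as np

def model_rt(y):
    """t_model = 1 - max_j y_j for one stimulus (arbitrary time scale)."""
    return 1.0 - float(np.max(y))

def mean_rt_by_response(outputs):
    """outputs: one 6-element output vector per network for a single
    stimulus. Returns the mean model RT over all networks that made the
    same response, keyed by the response emotion's index."""
    by_emotion = {}
    for y in outputs:
        j = int(np.argmax(y))                    # this network's response
        by_emotion.setdefault(j, []).append(model_rt(y))
    return {j: float(np.mean(ts)) for j, ts in by_emotion.items()}
```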

Model Discrimination Scores

We assume discrimination is more difficult the more similar two stimuli are at some level of processing. We use the same measure of similarity as in the MDS procedure: the correlation $r_{ij}$ between the network's representations of stimuli i and j (either at the pixel, perceptual, gestalt, or output level). To convert similarity scores to discrimination scores, we used the transform $d_{ij} = 1 - r_{ij}$. This was measured for each of the 30 pairs of JJ images for which human data were available. The discrimination scores were then averaged over the 13 networks that had not been trained on JJ and compared to the human data.
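
A sketch of the computation for a single network and pair of stimuli:

```python
import numpy as np

def discrimination(rep_i, rep_j):
    """d_ij = 1 - r_ij for one network's representations of stimuli i and j
    at a chosen level (pixel, perceptual, gestalt, or output)."""
    r = np.corrcoef(rep_i.ravel(), rep_j.ravel())[0, 1]
    return 1.0 - r
```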

Mixed-in Expression Detection

To measure the ability of the model to detect the secondary expression mixed into a morph stimulus, we followed Young et al.'s (1997) methods. For each network's output on a given stimulus, we scored the first, second, and third highest outputs of the networks as a 3, 2, and 1, respectively, and assigned the score 0 to the three remaining outputs. For each morph and prototype stimulus, we averaged these score vectors across all 13 networks. Now, for each of the 30 possible combinations of near prototype (i) and far prototype (j), using the 90%, 70%, and 50% morphs moving from expression i to expression j, we subtracted the score vector for prototype i from the score for each of the three morphs. This eliminates the effect of the intrinsic similarity between JJ's prototype expressions. Now, averaging these score vectors across all 30 sequences, we obtain the data plotted in Figure 8.
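
A sketch of the two steps, rank scoring and prototype subtraction:

```python
import numpy as np

def rank_scores(y):
    """3, 2, 1 for a network's three highest outputs; 0 for the rest."""
    scores = np.zeros(len(y))
    scores[np.argsort(y)[::-1][:3]] = [3, 2, 1]
    return scores

def morph_increment(mean_morph_scores, mean_proto_scores):
    """Subtract the near prototype's network-averaged score vector from a
    morph's, removing the intrinsic similarity between the prototypes."""
    return np.asarray(mean_morph_scores) - np.asarray(mean_proto_scores)
```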

Acknowledgments

We thank Stevan Harnad, Bill Kristan, Paul Munro, Alice O'Toole, Terry Sejnowski, and Gary's Unbelievable Research Unit (GURU) for helpful comments on previous versions of this manuscript. We also thank Andrew Young for permission to use the human data plotted in Figures 2, 3, 7a, and 8. We are also grateful to Paul Ekman for providing us with the Pictures of Facial Affect. This research was funded by NIMH grant MH57075 to GWC.

Reprint requests should be sent to Garrison W. Cottrell, UCSD Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA, or via e-mail: gary@cs.ucsd.edu.

Notes

1. For comparison, the order of difficulty for the Calder et al. (2001) PCA model is happiness (98%), fear (97%), surprise (92%), anger (86%), sadness (72%), and disgust (72%).
2. We used 13 networks for technical reasons (see Methods for details).
3. The same comparison of our model with the confusion matrix from the Calder et al. (2001) forced-choice experiment is slightly better, .686, whereas their PCA model's correlation with their subjects' confusion data is somewhat lower, .496.

REFERENCES

Beale, J., & Keil, F. (1995). Categorical effects in the perception of faces. Cognition, 57, 217–239.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer.
Calder, A., Young, A., Perrett, D., Etcoff, N., & Rowland, D. (1996). Categorical perception of morphed facial expressions. Visual Cognition, 3, 81–117.
Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179–1208.
Calder, A. J., Young, A. W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 526–551.
Carroll, J. M., & Russell, J. A. (1996). Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70, 205–218.
Cottrell, G. W., Dailey, M. N., Padgett, C., & Adolphs, R. (2000). Is all face processing holistic? The view from UCSD. In M. Wenger & J. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Context and challenges (pp. 347–395). Mahwah, NJ: Erlbaum.
Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, gender and emotion recognition using holons. In R. P. Lippman, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564–571). San Mateo: Morgan Kaufmann.
Cottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. E. Sharkey (Ed.), Models of cognition: A review of cognitive science. Norwood, NJ.
Craw, I., & Cameron, P. (1991). Parameterising images for recognition and reconstruction. In Proceedings of the British Machine Vision Conference (pp. 367–370). Berlin: Springer.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.
Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 974–989.
Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and emotion (chap. 3, pp. 45–60). New York: Wiley.
Ekman, P., & Friesen, W. (1976). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists Press.
Ellison, J. W., & Massaro, D. W. (1997). Featural evaluation, integration, and judgement of facial affect. Journal of Experimental Psychology: Human Perception and Performance, 23, 213–226.
Etcoff, N. L., & Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227–240.
Goldstone, R. L. (2000). A neural network model of concept-influenced segmentation. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.
Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representation through category learning. Cognition, 78, 27–43.
Hancock, P. J. B., Bruce, V., & Burton, A. M. (1998). A comparison of two computer-based face recognition systems with human perception of faces. Vision Research, 38, 2277–2288.
Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory and Cognition, 24, 26–40.
Harnad, S. R. (1987). Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (vol. 1, chap. 3, pp. 77–109). Cambridge: MIT Press.
Jansari, A., Tranel, D., & Adolphs, R. (2000). A valence-specific lateral bias for discriminating emotional facial expressions in free field. Cognition and Emotion, 14, 341–353.
Katsikitis, M. (1997). The classification of facial expressions of emotion: A multidimensional scaling approach. Perception, 26, 613–626.
Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300–311.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1–12.
Lien, J. J.-J., Kanade, T., Cohn, J. F., & Li, C.-C. (2000). Detection, tracking, and classification of action units in facial expression. Robotics and Autonomous Systems, 31, 131–146.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.
Lyons, M. J., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (pp. 200–205). Los Alamitos, CA: IEEE Computer Society.
Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1357–1362.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.
Matsumoto, D. (1992). American–Japanese cultural differences in the recognition of universal facial expressions. Journal of Cross-Cultural Psychology, 23, 72–84.
Padgett, C., Cottrell, G., & Adolphs, R. (1996). Categorical perception in facial emotion classification. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Padgett, C., & Cottrell, G. W. (1998). A simple neural network models categorical perception of facial expressions. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 806–807). Mahwah, NJ: Erlbaum.
Perrett, D., Hietanen, J., Oram, M., & Benson, P. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London, B: Biological Sciences, 335, 23–30.
Pevtzow, R., & Harnad, S. R. (1997). Warping similarity space in category learning by human subjects: The role of task difficulty. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 189–195). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.
Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199–1204.
Rosenblum, M., Yacoob, Y., & Davis, L. S. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7, 1121–1138.
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161–1178.
Russell, J. A., & Bullock, M. (1986). Fuzzy concepts and the perception of emotion in facial expressions. Social Cognition, 4, 309–341.
Russell, J. A., Lewicka, M., & Niit, T. (1989). A cross-cultural study of a circumplex model of affect. Journal of Personality and Social Psychology, 57, 848–856.
Sanger, T. (1989). An optimality principle for unsupervised learning. In D. Touretzky (Ed.), Advances in neural information processing systems (vol. 1, pp. 11–19). San Mateo: Morgan Kaufmann.
Schiano, D. J., Ehrlich, S. M., Sheridan, K. M., & Beck, D. M. (2000). Evidence for continuous rather than categorical perception of facial affect. Abstract presented at the 40th Annual Meeting of the Psychonomic Society.
Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44, 229–237.
Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.
Tijsseling, A., & Harnad, S. R. (1997). Warping similarity space in category learning by backprop nets. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 263–269). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
Woodworth, R. S. (1938). Experimental psychology. New York: Holt.
Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271–313.


Page 13: EMPATH: A Neural Network that Categorizes Facial Expressionscseweb.ucsd.edu/~mdailey/papers/DaileyEtalCogneuro.pdfconceptual and linguistic systems’ ’ (p. 229). The authors created

sharp when a rule such as lsquolsquorespond with the categorymost strongly activated by the given facial expressionrsquorsquois applied A more difficult result to explain in terms offuzzy categories placed in a continuous multidimen-sional space is the subjectsrsquo high discrimination scoresnear category boundaries This result seems to call foran influence of the category boundaries on our per-ception of the expressions indeed that is Younget alrsquos (1997 ) tentative interpretation To the contrarywe have shown that in our model discriminationscores best agree with human results at a purely per-ceptual level where the category labels have no effectWith respect to image space the low-dimensional PCArepresentation actually changes faster in boundaryregions than in regions near the prototypes it derivedfrom In the context of our model then the seeminglycontradictory results in different experiments con-ducted by Young et al can be explained as simplytapping different computational levels of processing ina visual system organized much like our model

We have found that our facial expression recognitionmodel is lsquolsquosufficientrsquorsquo to explain many aspects of humanperformance in behavioral tasks but we have no proofof the lsquolsquonecessityrsquorsquo of any of our particular implementa-tion decisions In fact we predict that many similarsystems would obtain similar results For instance asystem beginning with derivative-of-Gaussian edge filters(Marr 1982) whose responses are combined to producesmooth responses to translations (like complex cells)should exhibit the same behavior Replacing the PCAdimensionality reduction method with say factor anal-ysis or the Infomax algorithm for independent compo-nents analysis (Bell amp Sejnowski 1995) should notdramatically affect the results Finally a category levelusing Support Vector Machines (Vapnik 1995) shouldlikewise produce similar behavior The point of oursimulations is that the category boundaries discrimina-bility and similarity structures previously seen as beingat odds are in a sense lsquolsquopresent in the data and tasksthemselvesrsquorsquo and are easily exposed by any reasonablecomputational model

METHODS

Network Details

The first step of processing in the model is to filter theimage with a rigid 29 by 35 grid of overlapping 2-D Gaborfilters (Daugman 1985) in quadrature pairs at five scalesand eight orientations (some example filters are shownin Figure 5) The quadrature pairs are used to compute aphase insensitive energy response at each point in thegrid These linear energy responses or lsquolsquoGabor magni-tudesrsquorsquo are often used as a simplifying model of thespatial responses of complex cells in the early visualsystem (Lades et al 1993) Though the model losessome information about precise feature localization

when phase information is thrown away the overlappingreceptive fields compensate for this loss (HintonMcClelland amp Rumelhart 1986) Each Gabor magnitudeis z-scored (linearly transformed to have mean 0 andvariance 1 over the training data) so that each filtercontributes equally in the next representation layer

The second step of processing in the model isto perform linear dimensionality reduction on the40600-element Gabor representation via a PCA of thetraining set The actual computation is facilitated byTurk and Pentlandrsquos (1991) algebraic trick for lsquolsquoeigenfa-cesrsquorsquo This produces a 50-element representation typi-cally accounting for approximately 80 of the variancein the training setrsquos Gabor data

The 50-element vector p output by PCA can thenbe classified by a simple statistical model We locallyz-score (scale to mean 0 variance 1) each input elementthen use a single-layer neural network (a generalizedlinear model) containing six outputs one for each of thesix lsquolsquobasicrsquorsquo emotions happiness sadness fear angersurprise and disgust Each of the six units in thenetwork computes a weighted sum oi =

Pj wi j pj of

the 50-element input vector then the lsquolsquosoftmaxrsquorsquo func-tion yi = eo i

Pj e

o j is applied to the unitsrsquo linearactivations to obtain a vector of positive values whosesum is 10 The network is trained with the relativeentropy error function so that its outputs correspond tothe posterior probabilities of the emotion categoriesgiven the inputs (Bishop 1995)

Network Training

We train the expression recognition system using leave-one-out cross-validation and early stopping A givennetwork is trained to minimize output error on all theimages of 12 of the 14 actors in POFA using stochasticgradient descent momentum weight decay and therelative entropy error function (Bishop 1995) A thir-teenth actorrsquos images are used as a lsquolsquohold outrsquorsquo set todetermine when to stop training Training is stoppedwhen the error on the hold out set is minimized Aftertraining is stopped we evaluate the networkrsquos general-ization performance on the remaining (fourteenth)actor We performed training runs with every possiblecombination of generalization and hold out sets for atotal of 182 (14 by 13) individual networks

Network Weight Visualization

The idea is to project each unitrsquos weight vector back intoimage space in order to visualize what the network issensitive to in an image But this is not a well-definedtask though PCA is an easily inverted orthonormaltransformation the Gabor magnitude representationbesides being subsampled throws away importantphase information As a workaround we assume thateach pixelrsquos value is an approximately linear function of

1170 Journal of Cognitive Neuroscience Volume 14 Number 8

the 50-component gestalt-level (Gabor + PCA) repre-sentation We chose one of the 192 trained networksand for each pixel location we used regression to findthe linear function of the 50-element gestalt-level repre-sentation best predicting the pixelrsquos value over the net-workrsquos training set Then to visualize say the classifierrsquosrepresentation of happiness we apply the regressionfunction directly to happiness unitrsquos weights An imagewas constructed from each of the six unitsrsquo weightvectors using the regression functions learned from thatnetworkrsquos training set

Generation of Morphs

We generated morphs from the original stimuli (Ekmanamp Friesen 1976) using the Morph program version25 from Gryphon Software as described elsewhere( Jansari Tranel amp Adolphs 2000) Briefly for each ofthe 15 pairs of expression prototypes correspondingfeatures between the two images were manuallyspecified The images were then tessellated and line-arly transformed both with respect to pixel position(a smooth warping) and pixel grayscale value(a smooth fade in luminance) The 10 30 5070 and 90 blends (see Figure 1b for examples)were retained for each transformation

Multidimensional Scaling

MDS seeks to embed a set of stimuli in a low-dimen-sional space in such a way that the distances betweenpoints in the low-dimensional space are as faithful aspossible to ratings of their similarity (Borg amp Lingoes1987 ) We performed MDS on the human responses(Ekman amp Friesen 1976) and on the networkrsquosresponses to the same stimuli at each level of pro-cessing in the network Each analysis requires a dis-tance (dissimilarity) matrix enumerating how dissimilareach pair of stimuli is For the human data we formeda six-element vector containing the probability withwhich humans gave the labels happy sad afraid angrysurprised and disgusted for each of the 96 nonneutralphotographs in POFA We obtained a 96 by 96 elementsimilarity matrix from these data by computing thecorrelation rij between each pair of six-element vec-tors Finally we converted the resulting similaritymatrix into a distance (dissimilarity) matrix with thetransform di j = (1 iexcl ri j)2 to obtain values in therange 0ndash1 For the network we measure the similaritybetween two stimuli i and j as the correlation ri j

between the representations of the two stimuli at eachlevel of the network corresponding to similarity atdifferent levels of processing the pixel level (70080-element image pixel value vectors) the perceptuallevel (40600-element Gabor response patterns) thegestalt level (50-element GaborPCA patterns) andthe networkrsquos output (six-element category level) We

ran a classical nonmetric MDS algorithm SSA-1 due toGuttman and Lingoes (Borg amp Lingoes 1987) on eachof the four resulting distance matrices described aboveand then plotted the stimuli according to their posi-tion in the resulting 2-D configuration

Model RTs

We assume a networkrsquos RT is directly proportional to thelsquolsquouncertainty of its maximal outputrsquorsquo That is we definea networkrsquos model RT for stimulus i to be ti

m o d e l =1 iexcl maxj yij (the time scale is arbitrary) Here yi j is thenetworkrsquos output for emotion j on stimulus i This issimilar to the standard approach of equating RT with anetworkrsquos output error (Seidenberg amp McClelland1989) except that for morph stimuli there is nopredetermined lsquolsquocorrectrsquorsquo response For comparisonthe human data available are ti j

h u m a n the mean RT ofsubjects responding with emotion j to stimulus i Tocompare the model to the humans given these datawe treated each network as an individual subject andfor each morph stimulus recorded the networkrsquosresponse emotion j and model reaction time ti

m o d e l

then averaged tim o d e l over all networks making the

same response to the stimulus The quantitative com-parison is the linear fit between network predictionsand human RTs for all stimuli and response pairs forwhich both human and network data were availableYoung et alrsquos criterion for reporting ti j

h u m a n was that atleast 25 of the human subjects responded withemotion j to stimulus i and for some of these casesnone of the networks responded with emotion j tostimulus i so the missing human and network datawere disregarded

Model Discrimination Scores

We assume discrimination is more difficult the moresimilar two stimuli are at some level of processing Weuse the same measure of similarity as in the MDSprocedure The correlation ri j between the networkrsquosrepresentation of stimuli i and j (either at the pixelperceptual gestalt or output level) To convert similar-ity scores to discrimination scores we used the trans-form di j = 1 iexcl ri j This was measured for each of the 30pairs of J J images for which human data were availableThe discrimination scores were then averaged over the13 networks that had not been trained on J J andcompared to the human data

Mixed-in Expression Detection

To measure the ability of the model to detect the secondary expression mixed into a morph stimulus, we followed Young et al.'s (1997) methods. For each network's output on a given stimulus, we scored the first, second, and third highest outputs of the network as a 3, 2, and 1, respectively, and assigned the score 0 to the three remaining outputs. For each morph and prototype stimulus, we averaged these score vectors across all 13 networks. Now, for each of the 30 possible combinations of near prototype (i) and far prototype (j), using the 90%, 70%, and 50% morphs moving from expression i to expression j, we subtracted the score vector for prototype i from the score for each of the three morphs. This eliminates the effect of the intrinsic similarity between JJ's prototype expressions. Now, averaging these score vectors across all 30 sequences, we obtain the data plotted in Figure 8.
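The scoring scheme can be sketched as follows. Array and function names are hypothetical, and only the per-pair subtraction step is shown, not the final averaging over the 30 morph sequences.

```python
import numpy as np

def rank_scores(y):
    """Score one six-element output vector: 3, 2, and 1 for the three
    highest outputs, 0 for the remaining three."""
    scores = np.zeros(6)
    scores[np.argsort(y)[::-1][:3]] = [3, 2, 1]
    return scores

def morph_minus_prototype(morph_outputs, proto_outputs):
    """Average the rank-score vectors across networks for a morph and for
    its near prototype i, then subtract the prototype's profile to remove
    the intrinsic similarity between the prototype expressions.
    Both arguments: (n_networks, 6) arrays of activations."""
    morph = np.mean([rank_scores(y) for y in morph_outputs], axis=0)
    proto = np.mean([rank_scores(y) for y in proto_outputs], axis=0)
    return morph - proto

# Hypothetical activations for 13 networks on one 90% morph and its
# near prototype.
rng = np.random.default_rng(3)
diff = morph_minus_prototype(rng.random((13, 6)), rng.random((13, 6)))
```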

Acknowledgments

We thank Stevan Harnad, Bill Kristan, Paul Munro, Alice O'Toole, Terry Sejnowski, and Gary's Unbelievable Research Unit (GURU) for helpful comments on previous versions of this manuscript. We also thank Andrew Young for permission to use the human data plotted in Figures 2, 3, 7a, and 8. We are also grateful to Paul Ekman for providing us with the Pictures of Facial Affect. This research was funded by NIMH grant MH57075 to GWC.

Reprint requests should be sent to Garrison W. Cottrell, UCSD Computer Science and Engineering, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA, or via e-mail: gary@cs.ucsd.edu.

Notes

1. For comparison, the order of difficulty for the Calder et al. (2001) PCA model is happiness (98%), fear (97%), surprise (92%), anger (86%), sadness (72%), and disgust (72%).

2. We used 13 networks for technical reasons (see Methods for details).

3. The same comparison of our model with the confusion matrix from the Calder et al. (2001) forced-choice experiment is slightly better, .686, whereas their PCA model's correlation with their subjects' confusion data is somewhat lower, .496.

REFERENCES

Beale, J., & Keil, F. (1995). Categorical effects in the perception of faces. Cognition, 57, 217–239.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer.

Calder, A., Young, A., Perrett, D., Etcoff, N., & Rowland, D. (1996). Categorical perception of morphed facial expressions. Visual Cognition, 3, 81–117.

Calder, A. J., Burton, A. M., Miller, P., Young, A. W., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179–1208.

Calder, A. J., Young, A. W., Keane, J., & Dean, M. (2000). Configural information in facial expression perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 526–551.

Carroll, J. M., & Russell, J. A. (1996). Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70, 205–218.

Cottrell, G. W., Dailey, M. N., Padgett, C., & Adolphs, R. (2000). Is all face processing holistic? The view from UCSD. In M. Wenger & J. Townsend (Eds.), Computational, geometric, and process perspectives on facial cognition: Contexts and challenges (pp. 347–395). Mahwah, NJ: Erlbaum.

Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, gender and emotion recognition using holons. In R. P. Lippman, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564–571). San Mateo: Morgan Kaufmann.

Cottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. E. Sharkey (Ed.), Models of cognition: A review of cognitive science. Norwood, NJ.

Craw, I., & Cameron, P. (1991). Parameterising images for recognition and reconstruction. In Proceedings of the British Machine Vision Conference (pp. 367–370). Berlin: Springer.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2, 1160–1169.

Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 974–989.

Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and emotion (chap. 3, pp. 45–60). New York: Wiley.

Ekman, P., & Friesen, W. (1976). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists Press.

Ellison, J. W., & Massaro, D. W. (1997). Featural evaluation, integration, and judgment of facial affect. Journal of Experimental Psychology: Human Perception and Performance, 23, 213–226.

Etcoff, N. L., & Magee, J. J. (1992). Categorical perception of facial expressions. Cognition, 44, 227–240.

Goldstone, R. L. (2000). A neural network model of concept-influenced segmentation. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.

Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representation through category learning. Cognition, 78, 27–43.

Hancock, P. J. B., Bruce, V., & Burton, A. M. (1998). A comparison of two computer-based face recognition systems with human perception of faces. Vision Research, 38, 2277–2288.

Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and principal components analysis. Memory and Cognition, 24, 26–40.

Harnad, S. R. (1987). Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.

Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (vol. 1, chap. 3, pp. 77–109). Cambridge: MIT Press.

Jansari, A., Tranel, D., & Adolphs, R. (2000). A valence-specific lateral bias for discriminating emotional facial expressions in free field. Cognition and Emotion, 14, 341–353.

Katsikitis, M. (1997). The classification of facial expressions of emotion: A multidimensional scaling approach. Perception, 26, 613–626.

Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42, 300–311.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.

Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1–12.

Lien, J. J.-J., Kanade, T., Cohn, J. F., & Li, C.-C. (2000). Detection, tracking, and classification of action units in facial expression. Robotics and Autonomous Systems, 31, 131–146.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.

Lyons, M. J., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (pp. 200–205). Los Alamitos, CA: IEEE Computer Society.

Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 1357–1362.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.

Matsumoto, D. (1992). American–Japanese cultural differences in the recognition of universal facial expressions. Journal of Cross-Cultural Psychology, 23, 72–84.

Padgett, C., Cottrell, G., & Adolphs, R. (1996). Categorical perception in facial emotion classification. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Padgett, C., & Cottrell, G. W. (1998). A simple neural network models categorical perception of facial expressions. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 806–807). Mahwah, NJ: Erlbaum.

Perrett, D., Hietanen, J., Oram, M., & Benson, P. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London, B: Biological Sciences, 335, 23–30.

Pevtzow, R., & Harnad, S. R. (1997). Warping similarity space in category learning by human subjects: The role of task difficulty. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 189–195). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199–1204.

Rosenblum, M., Yacoob, Y., & Davis, L. S. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7, 1121–1138.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161–1178.

Russell, J. A., & Bullock, M. (1986). Fuzzy concepts and the perception of emotion in facial expressions. Social Cognition, 4, 309–341.

Russell, J. A., Lewicka, M., & Niit, T. (1989). A cross-cultural study of a circumplex model of affect. Journal of Personality and Social Psychology, 57, 848–856.

Sanger, T. (1989). An optimality principle for unsupervised learning. In D. Touretzky (Ed.), Advances in neural information processing systems (vol. 1, pp. 11–19). San Mateo: Morgan Kaufmann.

Schiano, D. J., Ehrlich, S. M., Sheridan, K. M., & Beck, D. M. (2000). Evidence for continuous rather than categorical perception of facial affect. Abstract presented at the 40th Annual Meeting of the Psychonomic Society.

Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44, 229–237.

Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.

Tijsseling, A., & Harnad, S. R. (1997). Warping similarity space in category learning by backprop nets. In M. Ramscar, U. Hahn, E. Cambouropolos, & H. Pain (Eds.), Proceedings of SimCat 1997: Interdisciplinary Workshop on Similarity and Categorization (pp. 263–269). Edinburgh, Scotland: Department of Artificial Intelligence, Edinburgh University.

Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.

Woodworth, R. S. (1938). Experimental psychology. New York: Holt.

Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271–313.



