
Event2Mind: Commonsense Inference on Events, Intents, and Reactions

Hannah Rashkin†∗ Maarten Sap†∗ Emily Allaway† Noah A. Smith† Yejin Choi†‡

†Paul G. Allen School of Computer Science & Engineering, University of Washington
‡Allen Institute for Artificial Intelligence
{hrashkin,msap,eallaway,nasmith,yejin}@cs.washington.edu

Abstract

We investigate a new commonsense inference task: given an event described in a short free-form text (“X drinks coffee in the morning”), a system reasons about the likely intents (“X wants to stay awake”) and reactions (“X feels alert”) of the event’s participants. To support this study, we construct a new crowdsourced corpus of 25,000 event phrases covering a diverse range of everyday events and situations. We report baseline performance on this task, demonstrating that neural encoder-decoder models can successfully compose embedding representations of previously unseen events and reason about the likely intents and reactions of the event participants. In addition, we demonstrate how commonsense inference on people’s intents and reactions can help unveil the implicit gender inequality prevalent in modern movie scripts.

1 Introduction

Understanding a narrative requires commonsense reasoning about the mental states of people in relation to events. For example, if “Alex is dragging his feet at work”, pragmatic implications about Alex’s intent are that “Alex wants to avoid doing things” (Figure 1). We can also infer that Alex’s emotional reaction might be feeling “lazy” or “bored”. Furthermore, while not explicitly mentioned, we can infer that people other than Alex are affected by the situation, and these people are likely to feel “frustrated” or “impatient”.

This type of pragmatic inference can potentially be useful for a wide range of NLP applications

∗These two authors contributed equally.

[Figure 1: three example events with inferred mental states of their participants.
PersonX drags PersonX's feet — X's intent: to avoid doing things; X's reaction: lazy, bored; Y's reaction: frustrated, impatient.
PersonX cooks thanksgiving dinner — X's intent: to impress their family; X's reaction: tired, a sense of belonging; Y's reaction: impressed.
PersonX reads PersonY's diary — X's intent: to be nosey, know secrets; X's reaction: guilty, curious; Y's reaction: angry, violated, betrayed.]

Figure 1: Examples of commonsense inference on mental states of event participants. In the third example event, common sense tells us that Y is likely to feel betrayed as a result of X reading their diary.

that require accurate anticipation of people’s intents and emotional reactions, even when they are not explicitly mentioned. For example, an ideal dialogue system should react in empathetic ways by reasoning about the human user’s mental state based on the events the user has experienced, without the user explicitly stating how they are feeling. Similarly, advertisement systems on social media should be able to reason about the emotional reactions of people after events such as mass shootings and remove ads for guns which might increase social distress (Goel and Isaac, 2016). Also, pragmatic inference is a necessary step toward automatic narrative understanding and generation (Tomai and Forbus, 2010; Ding and Riloff, 2016; Ding et al., 2017). However, this type of social commonsense reasoning goes far beyond the widely studied entailment tasks (Bowman et al., 2015; Dagan et al., 2006) and thus falls outside the scope of existing benchmarks.



PersonX’s Intent | Event Phrase | PersonX’s Reaction | Others’ Reactions
to express anger; to vent their frustration; to get PersonY’s full attention | PersonX starts to yell at PersonY | mad; frustrated; annoyed | shocked; humiliated; mad at PersonX
to communicate something without being rude; to let the other person think for themselves; to be subtle | PersonX drops a hint | sly; secretive; frustrated | oblivious; surprised; grateful
to catch the criminal; to be civilized; justice | PersonX reports to the police | anxious; worried; nervous | sad; angry; regret
to wake up; to feel more energized | PersonX drinks a cup of coffee | alert; awake; refreshed | NONE
to be feared; to be taken seriously; to exact revenge | PersonX carries out PersonX’s threat | angry; dangerous; satisfied | sad; afraid; angry
NONE | It starts snowing | NONE | calm; peaceful; cold

Table 1: Example annotations of intent and reactions for 6 event phrases. Each annotator could fill in up to three free-responses for each mental state.

In this paper, we introduce a new task, corpus, and model, supporting commonsense inference on events with a specific focus on modeling stereotypical intents and reactions of people, described in short free-form text. Our study is in a similar spirit to recent efforts of Ding and Riloff (2016) and Zhang et al. (2017), in that we aim to model aspects of commonsense inference via natural language descriptions. Our new contributions are: (1) a new corpus that supports commonsense inference about people’s intents and reactions over a diverse range of everyday events and situations, (2) inference about even those people who are not directly mentioned by the event phrase, and (3) a task formulation that aims to generate the textual descriptions of intents and reactions, instead of classifying their polarities or classifying the inference relations between two given textual descriptions.

Our work establishes baseline performance on this new task, demonstrating that, given the phrase-level inference dataset, neural encoder-decoder models can successfully compose phrasal embeddings for previously unseen events and reason about the mental states of their participants.

Furthermore, in order to showcase the practical implications of commonsense inference on events and people’s mental states, we apply our model to modern movie scripts, which provide new insight into the gender bias in modern films beyond what previous studies have offered (England et al., 2011; Agarwal et al., 2015; Ramakrishna et al., 2017; Sap et al., 2017). The resulting corpus includes around 25,000 event phrases, which combine automatically extracted phrases from stories and blogs with all idiomatic verb phrases listed in the Wiktionary. Our corpus is publicly available.1

2 Dataset

One goal of our investigation is to probe whether it is feasible to build computational models that can perform limited, but well-scoped commonsense inference on short free-form text, which we refer to as event phrases. While there has been much prior research on phrase-level paraphrases (Pavlick et al., 2015) and phrase-level entailment (Dagan et al., 2006), relatively little prior work focused on phrase-level inference that requires pragmatic or commonsense interpretation.

1https://tinyurl.com/event2mind


We scope our study to two distinct types of inference: given a phrase that describes an event, we want to reason about the likely intents and emotional reactions of people who caused or are affected by the event. This complements prior work on more general commonsense inference (Speer and Havasi, 2012; Li et al., 2016; Zhang et al., 2017), by focusing on the causal relations between events and people’s mental states, which are not well covered by most existing resources.

We collect a wide range of phrasal event descriptions from stories, blogs, and Wiktionary idioms. Compared to prior work on phrasal embeddings (Wieting et al., 2015; Pavlick et al., 2015), our work generalizes the phrases by introducing (typed) variables. In particular, we replace words that correspond to entity mentions or pronouns with typed variables such as PersonX or PersonY, as shown in examples in Table 1. More formally, the phrases we extract are a combination of a verb predicate with partially instantiated arguments. We keep specific arguments together with the predicate, if they appear frequently enough (e.g., PersonX eats pasta for dinner). Otherwise, the arguments are replaced with an untyped blank (e.g., PersonX eats ___ for dinner). In our work, only person mentions are replaced with typed variables, leaving other types to future research.

Inference types The first type of pragmatic inference is about intent. We define intent as an explanation of why the agent causes a volitional event to occur (or “none” if the event phrase was unintentional). The intent can be considered a mental pre-condition of an action or an event. For example, if the event phrase is PersonX takes a stab at ___, the annotated intent might be that “PersonX wants to solve a problem”.

The second type of pragmatic inference is about emotional reaction. We define reaction as an explanation of how the mental states of the agent and other people involved in the event would change as a result. The reaction can be considered a mental post-condition of an action or an event. For example, if the event phrase is that PersonX gives PersonY ___ as a gift, PersonX might “feel good about themselves” as a result, and PersonY might “feel grateful” or “feel thankful”.

Source | # Unique Events | # Unique Verbs | Average κ
ROC Story | 13,627 | 639 | 0.57
G. N-grams | 7,066 | 789 | 0.39
Spinn3r | 2,130 | 388 | 0.41
Idioms | 1,916 | 442 | 0.42
Total | 24,716 | 1,333 | 0.45

Table 2: Data and annotation agreement statistics for our new phrasal inference corpus. Each event is annotated by three crowdworkers.

2.1 Event Extraction

We extract phrasal events from three different corpora for broad coverage: the ROC Story training set (Mostafazadeh et al., 2016), the Google Syntactic N-grams (Goldberg and Orwant, 2013), and the Spinn3r corpus (Gordon and Swanson, 2008). We derive events from the set of verb phrases in our corpora, based on syntactic parses (Klein and Manning, 2003). We then replace the predicate subject and other entities with the typed variables (e.g., PersonX, PersonY), and selectively substitute verb arguments with blanks (___). We use frequency thresholds to select events to annotate (for details, see Appendix A.1). Additionally, we supplement the list of events with all 2,000 verb idioms found in Wiktionary, in order to cover events that are less compositional.2 Our final annotation corpus contains nearly 25,000 event phrases, spanning over 1,300 unique verb predicates (Table 2).
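To make the normalization step concrete, the sketch below shows one way person mentions could be mapped to typed variables and infrequent arguments to blanks. It is a minimal reconstruction of the idea only: the helper names, span format, and frequency threshold are our own illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of event-phrase normalization; names and thresholds are illustrative.

def normalize_event(tokens, person_spans, arg_spans, arg_counts, min_arg_freq=5):
    """Map person mentions to PersonX/PersonY/... and rare arguments to blanks."""
    out = list(tokens)
    person_vars = {}                                  # mention text -> typed variable
    for start, end in person_spans:                   # spans found by the parser / NER step
        mention = " ".join(tokens[start:end])
        var = person_vars.setdefault(
            mention, "Person" + chr(ord("X") + len(person_vars)))
        out[start:end] = [var] + [None] * (end - start - 1)
    for start, end in arg_spans:                      # non-person verb arguments
        arg = " ".join(tokens[start:end])
        if arg_counts.get(arg, 0) < min_arg_freq:     # infrequent -> untyped blank
            out[start:end] = ["___"] + [None] * (end - start - 1)
    return " ".join(t for t in out if t is not None)

# e.g. normalize_event("Alex eats sushi for dinner".split(),
#                      person_spans=[(0, 1)], arg_spans=[(2, 3)], arg_counts={})
# -> "PersonX eats ___ for dinner"
```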

2.2 Crowdsourcing

We design an Amazon Mechanical Turk task to annotate the mental pre- and post-conditions of event phrases. A snippet of our MTurk HIT design is shown in Figure 2. For each phrase, we ask three annotators whether the agent of the event, PersonX, intentionally causes the event, and if so, to provide up to three possible textual descriptions of their intents. We then ask annotators to provide up to three possible reactions that PersonX might experience as a result. We also ask annotators to provide up to three possible reactions of other people, when applicable. These other people can be either explicitly mentioned (e.g., “PersonY” in PersonX punches PersonY’s lights out), or only implied (e.g., given the event description PersonX yells at the classroom, we can infer that other people such as “students” in the classroom may be affected by the act of PersonX).

2We compiled the list of idiomatic verb phrases by cross-referencing between Wiktionary’s English idioms category and the Wiktionary English verbs categories.


[Figure 2 shows the intent portion of the annotation interface for the event “PersonX punches PersonY's lights out”: (1) Does this event make sense enough for you to answer questions 2-5, or does it have too many meanings? (2) Does PersonX willingly cause this event? If yes: why? (“Because PersonX wants to (be) ...”, with up to three free-text reasons, described without reusing words from the event.)]

Figure 2: Intent portion of our annotation task. We allow annotators to label events as invalid if the phrase is unintelligible. The full annotation setup is shown in Figure 8 in the appendix.

For quality control, we periodically removed workers with high disagreement rates, at our discretion.

Coreference among Person variables With the typed Person variable setup, events involving multiple people can have multiple meanings depending on coreference interpretation (e.g., PersonX eats PersonY’s lunch has very different mental state implications from PersonX eats PersonX’s lunch). To prune the set of events that will be annotated for intent and reaction, we ran a preliminary annotation to filter out candidate events that have implausible coreferences. In this preliminary task, annotators were shown a combinatorial list of coreferences for an event (e.g., PersonX punches PersonX’s lights out, PersonX punches PersonY’s lights out) and were asked to select only the plausible ones (e.g., PersonX punches PersonY’s lights out). Each set of coreferences was annotated by 3 workers, yielding an overall agreement of κ = 0.4. This annotation excluded 8,406 events with implausible coreference from our set (out of 17,806 events).

2.3 Mental State Descriptions

Our dataset contains nearly 25,000 event phrases, with annotators rating 91% of our extracted events as “valid” (i.e., the event makes sense). Of those events, annotations for the multiple choice portions of the task (whether or not there exists intent/reaction) agree moderately, with an average Cohen’s κ = 0.45 (Table 2). The individual κ scores generally indicate that turkers disagree half as often as if they were randomly selecting answers.

Importantly, this level of agreement is acceptable in our task formulation for two reasons. First, unlike linguistic annotations on syntax or semantics, where experts in the corresponding theory would generally agree on a single correct label, pragmatic interpretations may better be defined as distributions over multiple correct labels (e.g., after PersonX takes a test, PersonX might feel relieved and/or stressed; de Marneffe et al., 2012). Second, because we formulate our task as a conditional language modeling problem, where a distribution over the textual descriptions of intents and reactions is conditioned on the event description, this variation in the labels is expected.

A majority of our events are annotated as willingly caused by the agent (86%, Cohen’s κ = 0.48), and 26% involve other people (κ = 0.41). Most event patterns in our data are fully instantiated, with only 22% containing blanks (___). In our corpus, the intent annotations are slightly longer (3.4 words on average) than the reaction annotations (1.5 words).

3 Models

Given an event phrase, our models aim to generate three entity-specific pragmatic inferences: PersonX’s intent, PersonX’s reaction, and others’ reactions. The general outline of our model architecture is illustrated in Figure 3.

The input to our model is an event pattern described through free-form text with typed variables such as PersonX gives PersonY ___ as a gift. For notation purposes, we describe each event pattern E as a sequence of word embeddings ⟨e1, e2, . . . , en⟩ ∈ R^{n×D}. This input is encoded as a vector hE ∈ R^H that will be used for predicting output. The output of the model is its hypotheses about PersonX’s intent, PersonX’s reaction, and others’ reactions (vi, vx, and vo, respectively).


[Figure 3: the Event2Mind encoder maps the event phrase E = e1…en (e.g., “PersonX punches PersonY’s lights out”) to hE = f(e1…en); three decoders then predict the pre-condition (PersonX’s intent, e.g., vi: start, a, fight) and the post-conditions (PersonX’s reaction, e.g., vx: powerful; others’ reactions, e.g., vo: defensive), each via softmax(W hE + b).]

Figure 3: Overview of the model architecture. From an encoded event, our model predicts intents and reactions in a multitask setting.

We experiment with representing the output in two decoding set-ups: three vectors interpretable as discrete distributions over words and phrases (n-gram re-ranking) or three sequences of words (sequence decoding).

Encoding events The input event phrase E is compressed into an H-dimensional embedding hE via an encoding function f : R^{n×D} → R^H:

hE = f(e1, . . . , en)

We experiment with several ways for defining f, inspired by standard techniques in sentence and phrase classification (Kim, 2014). First, we experiment with max-pooling and mean-pooling over the word vectors {e_i}_{i=1}^{n}. We also consider a convolutional neural network (ConvNet; LeCun et al., 1998), taking the last layer of the network as the encoded version of the event. Lastly, we encode the event phrase with a bi-directional RNN (specifically, a GRU; Cho et al., 2014), concatenating the final hidden states of the forward and backward cells as the encoding: hE = [→h_n; ←h_1]. For hyperparameters and other details, we refer the reader to Appendix B.

Though the event sequences are typically rather short (4.6 tokens on average), our model still benefits from the ConvNet and BiRNN’s ability to compose words.
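As an illustration of two of the encoding functions f described above, the following PyTorch-style sketch implements mean-pooling and a bidirectional GRU that concatenates the final forward and backward hidden states. It is our own reconstruction under stated assumptions (dimensions, class names), not the released code.

```python
import torch
import torch.nn as nn

class MeanPoolEncoder(nn.Module):
    """hE = mean of the word embeddings e1..en."""
    def forward(self, embeddings):                     # (batch, n, D)
        return embeddings.mean(dim=1)                  # (batch, D)

class BiGRUEncoder(nn.Module):
    """hE = [forward h_n ; backward h_1], as in the BiRNN encoder above."""
    def __init__(self, emb_dim=300, hidden_dim=100):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, embeddings):                     # (batch, n, D)
        _, h_last = self.rnn(embeddings)               # (2, batch, H): final state per direction
        return torch.cat([h_last[0], h_last[1]], dim=-1)   # (batch, 2H)
```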

Pragmatic inference decoding We use three decoding modules that take the event phrase embedding hE and output distributions of possible PersonX’s intents (vi), PersonX’s reactions (vx), and others’ reactions (vo). We experiment with two different decoder set-ups.

First, we experiment with n-gram re-ranking, considering the |V| most frequent {1, 2, 3}-grams in our annotations. Each decoder projects the event phrase embedding hE into a |V|-dimensional vector, which is then passed through a softmax function. For instance, the distribution over descriptions of PersonX’s intent is given by:

vi = softmax(WihE + bi)
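In code, the n-gram re-ranking decoders are simply three linear projections of hE followed by a softmax over the candidate n-gram vocabulary. A minimal sketch (class and attribute names are our own):

```python
import torch.nn as nn
import torch.nn.functional as F

class NgramDecoders(nn.Module):
    """Three softmax heads over the candidate {1,2,3}-gram vocabulary:
    PersonX's intent (vi), PersonX's reaction (vx), others' reactions (vo)."""
    def __init__(self, enc_dim, vocab_size):
        super().__init__()
        self.intent = nn.Linear(enc_dim, vocab_size)    # Wi, bi
        self.xreact = nn.Linear(enc_dim, vocab_size)    # Wx, bx
        self.oreact = nn.Linear(enc_dim, vocab_size)    # Wo, bo

    def forward(self, h_event):                         # h_event: (batch, enc_dim)
        return (F.softmax(self.intent(h_event), dim=-1),
                F.softmax(self.xreact(h_event), dim=-1),
                F.softmax(self.oreact(h_event), dim=-1))
```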

Second, we experiment with sequence generation, using RNN decoders to generate the textual description. The event phrase embedding hE is set as the initial state hdec of three decoder RNNs (using GRU cells), which then output the intent/reactions one word at a time (using beam-search at test time). For example, an event’s intent sequence vi = vi^(0) vi^(1) . . . is computed as follows:

vi^(t+1) = softmax(Wi RNN(vi^(t), hi,dec^(t)) + bi)

Training objective We minimize the cross-entropy between the predicted distribution over words and phrases and the one actually observed in our dataset. Further, we employ multi-task learning, simultaneously minimizing the loss for all three decoders at each iteration.
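The multi-task objective amounts to summing the three cross-entropy terms at each training step; a small sketch, assuming each decoder produces logits over its output vocabulary:

```python
import torch.nn.functional as F

def multitask_loss(intent_logits, xreact_logits, oreact_logits,
                   intent_gold, xreact_gold, oreact_gold):
    """Sum of the three cross-entropy terms, minimized jointly at each iteration."""
    return (F.cross_entropy(intent_logits, intent_gold)
            + F.cross_entropy(xreact_logits, xreact_gold)
            + F.cross_entropy(oreact_logits, oreact_gold))
```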

Training details We fix our input embeddings, using 300-dimensional skip-gram word embeddings trained on Google News (Mikolov et al., 2013). For decoding, we consider a vocabulary of size |V| = 14,034 in the n-gram re-ranking setup. For the sequence decoding setup, we only consider the unigrams in V, yielding an output space of 7,110 at each time step.

We randomly divided our set of 24,716 unique events (57,094 annotations) into training/dev./test sets using an 80/10/10% split. Some annotations have multiple responses (i.e., a crowdworker gave multiple possible intents and reactions), in which case we take each of the combinations of their responses as a separate training example.

4 Empirical Results

Table 3 summarizes the performance of different encoding models on the dev and test sets in terms of cross-entropy and recall at 10 predicted intents and reactions. As expected, we see a moderate improvement in recall and cross-entropy when using the more compositional encoder models (ConvNet and BiRNN), in both the n-gram and sequence decoding setups.


Encoding Function | Decoder | Dev Avg Cross-Ent | Dev Recall @10 (%): Intent / XReact / OReact | Test Avg Cross-Ent | Test Recall @10 (%): Intent / XReact / OReact
max-pool | n-gram | 5.75 | 31 / 35 / 68 | 5.14 | 31 / 37 / 67
mean-pool | n-gram | 4.82 | 35 / 39 / 69 | 4.94 | 34 / 40 / 68
ConvNet | n-gram | 4.85 | 36 / 42 / 69 | 4.81 | 37 / 44 / 69
BiRNN 300d | n-gram | 4.78 | 36 / 42 / 68 | 4.74 | 36 / 43 / 69
BiRNN 100d | n-gram | 4.76 | 36 / 41 / 68 | 4.73 | 37 / 43 / 68
mean-pool | sequence | 4.59 | 39 / 36 / 67 | 4.54 | 40 / 38 / 66
ConvNet | sequence | 4.44 | 42 / 39 / 68 | 4.40 | 43 / 40 / 67
BiRNN 100d | sequence | 4.25 | 39 / 38 / 67 | 4.22 | 40 / 40 / 67

Table 3: Average cross-entropy (lower is better) and recall @10 (percentage of times the gold falls within the top 10 decoded; higher is better) on development and test sets for different modeling variations. We show recall values for PersonX’s intent, PersonX’s reaction and others’ reaction (denoted as “Intent”, “XReact”, and “OReact”). Note that because of the two different decoding setups, cross-entropy between n-gram and sequence decoding are not directly comparable.

Additionally, BiRNN models outperform ConvNets on cross-entropy in both decoding setups. Looking at the recall split across intent vs. reaction labels (“Intent”, “XReact”, and “OReact” columns), we see that much of the improvement in using these two models is within the prediction of PersonX’s intents. Note that recall for “OReact” is much higher, since a majority of events do not involve other people.
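Recall @10 here means the fraction of (event, gold annotation) pairs for which the gold falls within the ten most probable decoded outputs; a minimal sketch of that metric as we read it from the table caption:

```python
def recall_at_10(ranked_outputs, gold_annotation):
    """1 if the gold annotation appears among the ten most probable decoded outputs,
    0 otherwise; averaging this over a dataset gives the recall @10 in Table 3."""
    return int(gold_annotation in ranked_outputs[:10])
```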

Human evaluation To further assess the quality of our models, we randomly select 100 events from our test set and ask crowd-workers to rate generated intents and reactions. We present 5 workers with an event’s top 10 most likely intents and reactions according to our model and ask them to select all those that make sense to them. We evaluate each model’s precision @10 by computing the average number of generated responses that make sense to annotators.

Figure 4 summarizes the results of this evaluation. In most cases, the performance is higher for the sequential decoder than for the corresponding n-gram decoder. The biggest gain from using sequence decoders is in intent prediction, possibly because intent explanations are more likely to be longer. The BiRNN and ConvNet encoders consistently have higher precision than mean-pooling, with the BiRNN-seq setup slightly outperforming the other models. Unless otherwise specified, this is the model we employ in further sections.

[Figure 4: bar chart of average precision @10 (0–60%) for Intent, XReact, and OReact, comparing mean-pool, ConvNet, and BiRNN encoders with n-gram and sequence decoders.]

Figure 4: Average precision @10 of each model’s top ten responses in the human evaluation. We show results for various encoder functions (mean-pool, ConvNet, BiRNN-100d) combined with two decoding setups (n-gram re-ranking, sequence generation).

Error analyses We test whether certain types of events are easier for predicting commonsense inference. In Figure 6, we show the performance of the BiRNN 100d model on different portions of the development set, including: Blank events (events containing non-instantiated arguments), 2+ People events (events containing multiple different Person variables), and Idiom events (events coming from the Wiktionary idiom list). Our results show that, while intent prediction performance remains similar for all three sets of events, it is 10% behind intent prediction on the full development set.


[Figure 5: decoded intents, PersonX’s reactions, and others’ reactions at points interpolated between pairs of event embeddings, for events such as PersonX goes to school, PersonX comes home after school, PersonX washes PersonX’s legs, PersonX cuts PersonX’s legs, PersonX takes PersonY to the emergency room, and PersonX punches PersonY’s face.]

Figure 5: Sample predictions from homotopic embeddings (gradual interpolation between Event1 and Event2), selected from the top 10 beam elements decoded in the sequence generation setup. Examples highlight differences captured when ideas are similar (going to and coming from school), when only a single word differs (washes versus cuts), and when two events are unrelated.

[Figure 6: bar chart of recall @10 (%) for Intent, XReact, and OReact on the Blanks, 2+ People, Idioms, and Full Dev subsets, using the BiRNN 100d model.]

Figure 6: Recall @10 (%) on different subsets of the development set for intents, PersonX’s reactions, and other people’s reactions, using the BiRNN 100d model. “Full dev” represents the recall on the entire development dataset.

Additionally, predicting other people’s reactions is more difficult for the model when other people are explicitly mentioned. Unsurprisingly, idioms are particularly difficult for commonsense inference, perhaps due to the difficulty in composing meaning over nonliteral or noncompositional event descriptions.

To further evaluate the geometry of the embedding space, we analyze interpolations between pairs of event phrases (from outside the train set), similar to the homotopic analysis of Bowman et al. (2016). For a handful of event pairs, we decode intents, reactions for PersonX, and reactions for other people from points sampled at equal intervals on the interpolated line between two event phrases. We show examples in Figure 5. The embedding space distinguishes changes from generally positive to generally negative words and is also able to capture small differences between event phrases (such as “washes” versus “cuts”).
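A minimal sketch of this interpolation, assuming an encoder and decoder with the interfaces sketched earlier in Section 3 (names and step count are our own):

```python
import torch

def interpolate_and_decode(encoder, decoder, event1_emb, event2_emb,
                           bos_id, eos_id, steps=5):
    """Decode descriptions from points sampled at equal intervals on the line
    between two event encodings (event*_emb: word-embedding tensors of shape (1, n, D))."""
    h1, h2 = encoder(event1_emb), encoder(event2_emb)
    outputs = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        h = (1 - alpha) * h1 + alpha * h2              # point on the interpolated line
        outputs.append(decoder.greedy_decode(h, bos_id, eos_id))
    return outputs
```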

5 Analyzing Bias via Event2Mind Inference

Through Event2Mind inference, we can attempt to bring to the surface what is implied about people’s behavior and mental states. We employ this inference to analyze implicit bias in modern films. As shown in Figure 7, our model is able to analyze character portrayal beyond what is explicit in text, by performing pragmatic inference on character actions to explain aspects of a character’s mental state. In this section, we use our model’s inference to shed light on gender differences in intents behind and reactions to characters’ actions.

5.1 Processing of Movie Scripts

For our portrayal analyses, we use scene descriptions from 772 movie scripts released by Gorinski and Lapata (2015), assigned to over 21,000 characters as done by Sap et al. (2017). We extract events from the scene descriptions, and generate their 10 most probable intent and reaction sequences using our BiRNN sequence model (as in Figure 7).

We then categorize generated intents and reactions into groups based on LIWC category scores of the generated output (Tausczik and Pennebaker, 2016).3

3We only consider content word categories: ‘Core Drives and Needs’, ‘Personal Concerns’, ‘Biological Processes’, ‘Cognitive Processes’, ‘Social Words’, ‘Affect Words’, ‘Perceptual Processes’. We refer the reader to Tausczik and Pennebaker (2016) or http://liwc.wpengine.com/compare-dictionaries/ for a complete list of category descriptions.


[Figure 7: the Pretty Woman snippet “Vivian sits on her bed, lost in thought. Her bags are packed, ...” maps to the event PersonX sits on PersonX’s bed, lost in thought, with inferred reactions such as worried, sad, upset, and lonely; the Juno snippet “Juno laughs and hugs her father, planting a smooch on his cheek.” maps to PersonX hugs ___, planting a smooch on PersonY’s cheek, with inferred intents such as show affection and show love.]

Figure 7: Two scene description snippets from Juno (2007, top) and Pretty Woman (1990, bottom), augmented with Event2Mind inferences on the characters’ intents and reactions. E.g., our model infers that the event PersonX sits on PersonX’s bed, lost in thought implies that the agent, Vivian, is sad or worried.

The intent and reaction categories are then aggregated for each character, and standardized (zero-mean and unit variance).

We compute correlations with gender for each category of intent or reaction using a logistic regression model, testing significance while using Holm’s correction for multiple comparisons (Holm, 1979).4 To account for the gender skew in scene presence (29.4% of scenes have women), we statistically control for the total number of words in a character’s scene descriptions. Note that the original event phrases are all gender agnostic, as their participants have been replaced by variables (e.g., PersonX). We also find that the types of gender biases uncovered remain similar when we run these analyses on the human annotations or on the generated words and phrases from the BiRNN with the n-gram re-ranking decoding setup.
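A minimal sketch of this correlation analysis, assuming per-character LIWC category scores have already been aggregated and standardized; the function name, library choice (statsmodels), and variable names are our own illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

def gender_association(category_score, gender, n_words):
    """Logistic regression of (binary) gender on one standardized LIWC category score,
    controlling for the number of words in each character's scene descriptions."""
    X = sm.add_constant(np.column_stack([category_score, n_words]))
    fit = sm.Logit(gender, X).fit(disp=0)
    return fit.params[1], fit.pvalues[1]          # coefficient and p-value for the category

# The per-category p-values would then be adjusted with Holm's method, e.g.:
# reject, p_adj, _, _ = multipletests(pvals, alpha=0.001, method="holm")
```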


4Given the data limitation, we represent gender as a binary, but acknowledge that gender is a more complex social construct.

5.2 Revealing Implicit Bias via Explicit Intents and Reactions

Female: intents — AFFILIATION, FRIEND, FAMILY; BODY, SEXUAL, INGEST; SEE, INSIGHT, DISCREP
Male: intents — DEATH, HEALTH, ANGER, NEGEMO; RISK, POWER, ACHIEVE, REWARD, WORK; CAUSE, TENTATIVE‡
Female: reactions — POSEMO, AFFILIATION, FRIEND, REWARD; INGEST, SEXUAL‡, BODY‡
Male: reactions — WORK, ACHIEVE, POWER, HEALTH†
Female: others’ reactions — POSEMO, AFFILIATION, FRIEND; INGEST, SEE, INSIGHT
Male: others’ reactions — ACHIEVE, RISK†; SAD, NEGEMO‡, ANGER†

Table 4: Select LIWC categories correlated with gender. All results are significant when corrected for multiple comparisons at p < 0.001, except †p < 0.05 and ‡p < 0.01.

Our Event2Mind inferences automate portrayal analyses that previously required manual annotations (Behm-Morawitz and Mastro, 2008; Prentice and Carranza, 2002; England et al., 2011). As shown in Table 4, our results indicate a gender bias in the behavior ascribed to characters, consistent with the psychology and gender studies literature (Collins, 2011). Specifically, events with female semantic agents are intended to be helpful to other people (intents involving FRIEND, FAMILY, and AFFILIATION), particularly relating to eating and making food for themselves and others (INGEST, BODY). Events with male agents, on the other hand, are motivated by and result in achievements (ACHIEVE, MONEY, REWARDS, POWER).

Women’s looks and sexuality are also emphasized, as their actions’ intents and reactions are sexual, seen, or felt (SEXUAL, SEE, PERCEPT). Men’s actions, on the other hand, are motivated by violence or fighting (DEATH, ANGER, RISK), with strong negative reactions (SAD, ANGER, NEGATIVE EMOTION).

Our approach decodes nuanced implications into more explicit statements, helping to identify and explain gender bias that is prevalent in modern literature and media. Specifically, our results indicate that modern movies are biased toward portraying female characters as having pro-social attitudes, whereas male characters are portrayed as being competitive or pro-achievement. This is consistent with gender stereotypes that have been studied in movies in both the NLP and psychology literature (Agarwal et al., 2015; Madaan et al., 2017; Prentice and Carranza, 2002; England et al., 2011).

6 Related Work

Prior work has sought formal frameworks for inferring roles and other attributes in relation to events (Baker et al., 1998; Das et al., 2014; Schuler et al., 2009; Hartshorne et al., 2013, inter alia), implicitly connoted by events (Reisinger et al., 2015; White et al., 2016; Greene, 2007; Rashkin et al., 2016), or sentiment polarities of events (Ding and Riloff, 2016; Choi and Wiebe, 2014; Russo et al., 2015; Ding and Riloff, 2018). In addition, recent work has studied the patterns which evoke certain polarities (Reed et al., 2017), the desires which make events affective (Ding et al., 2017), the emotions caused by events (Vu et al., 2014), or, conversely, identifying events or reasoning behind particular emotions (Gui et al., 2017). Compared to this prior literature, our work uniquely learns to model intents and reactions over a diverse set of events, includes inference over event participants not explicitly mentioned in text, and formulates the task as predicting the textual descriptions of the implied commonsense instead of classifying various event attributes.

Previous work in natural language inference has focused on linguistic entailment (Bowman et al., 2015; Bos and Markert, 2005) while ours focuses on commonsense-based inference. There also has been inference or entailment work that is more generation focused: generating, e.g., entailed statements (Zhang et al., 2017; Blouw and Eliasmith, 2018), explanations of causality (Kang et al., 2017), or paraphrases (Dong et al., 2017). Our work also aims at generating inferences from sentences; however, our models infer implicit information about mental states and causality, which has not been studied by most previous systems.

Also related are commonsense knowledge bases (Espinosa and Lieberman, 2005; Speer and Havasi, 2012). Our work complements these existing resources by providing commonsense relations that are relatively less populated in previous work. For instance, ConceptNet contains only 25% of our events, and only 12% have relations that resemble intent and reaction. We present a more detailed comparison with ConceptNet in Appendix C.

7 Conclusion

We introduced a new corpus, task, and model for performing commonsense inference on textually-described everyday events, focusing on stereotypical intents and reactions of people involved in the events. Our corpus supports learning representations over a diverse range of events and reasoning about the likely intents and reactions of previously unseen events. We also demonstrate that such inference can help reveal implicit gender bias in movie scripts.

Acknowledgments

We thank the anonymous reviewers for their insightful comments. We also thank xlab members at the University of Washington, Martha Palmer, Tim O’Gorman, Susan Windisch Brown, Ghazaleh Kazeminejad, as well as other members at the University of Colorado at Boulder, for many helpful comments on our development of the annotation pipeline. This work was supported in part by the National Science Foundation Graduate Research Fellowship Program under grant DGE-1256082, NSF grant IIS-1714566, and the DARPA CwC program through ARO (W911NF-15-1-0543).

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Apoorv Agarwal, Jiehan Zheng, Shruti Kamath, Sriramkumar Balasubramanian, and Shirin Ann Dey. 2015. Key female characters in film have more to talk about besides men: Automating the Bechdel test. In NAACL, pages 830–840, Denver, Colorado. Association for Computational Linguistics.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In COLING-ACL.

Elizabeth Behm-Morawitz and Dana E. Mastro. 2008. Mean girls? The influence of gender portrayals in teen movies on emerging adults’ gender-based attitudes and beliefs. Journalism & Mass Communication Quarterly, 85(1):131–146.

Peter Blouw and Chris Eliasmith. 2018. Using neural networks to generate inferential roles for natural language. Frontiers in Psychology, 8:2335.

Johan Bos and Katja Markert. 2005. Recognising textual entailment with robust logical inference. In MLCW.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In SSST@EMNLP.

Yoonjung Choi and Janyce Wiebe. 2014. +/-EffectWordNet: Sense-level lexicon acquisition for opinion inference. In EMNLP.

Rebecca L. Collins. 2011. Content analysis of gender roles in media: Where are we now and where should we go? Sex Roles, 64(3-4):290–298.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177–190. Springer.

Dipanjan Das, Desai Chen, Andre F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40(1):9–56.

Haibo Ding, Tianyu Jiang, and Ellen Riloff. 2017. Why is an event affective? Classifying affective events based on human needs. In AAAI Workshop on Affective Content Analysis.

Haibo Ding and Ellen Riloff. 2016. Acquiring knowledge of affective events from blogs using label propagation. In AAAI.

Haibo Ding and Ellen Riloff. 2018. Weakly supervised induction of affective events by optimizing semantic consistency. In AAAI.

Li Dong, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. 2017. Learning to paraphrase for question answering. In EMNLP.

Dawn Elizabeth England, Lara Descartes, and Melissa A. Collier-Meek. 2011. Gender role portrayal and the Disney princesses. Sex Roles, 64(7-8):555–567.

Jose H. Espinosa and Henry Lieberman. 2005. EventNet: Inferring temporal relations between commonsense events. In MICAI.

Vindu Goel and Mike Isaac. 2016. Facebook Moves to Ban Private Gun Sales on its Site and Instagram. https://www.nytimes.com/2016/01/30/technology/facebook-gun-sales-ban.html. Accessed: 2018-02-19.

Yoav Goldberg and Jon Orwant. 2013. A dataset of syntactic-ngrams over time from a very large corpus of English books. In *SEM 2013.

Andrew S. Gordon and Reid Swanson. 2008. StoryUpgrade: Finding stories in internet weblogs. In ICWSM.

Philip John Gorinski and Mirella Lapata. 2015. Movie script summarization as graph-based scene extraction. In NAACL, pages 1066–1076.

Stephan Charles Greene. 2007. Spin: Lexical semantics, transitivity, and the identification of implicit sentiment.

Lin Gui, Jiannan Hu, Yulan He, Ruifeng Xu, Qin Lu, and Jiachen Du. 2017. A question answering approach for emotion cause extraction. In EMNLP.

Joshua K. Hartshorne, Claire Bonial, and Martha Palmer. 2013. The VerbCorner project: Toward an empirically-based semantic decomposition of verbs. In EMNLP.

Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pages 65–70.

Dongyeop Kang, Varun Gangal, Ang Lu, Zheng Chen, and Eduard H. Hovy. 2017. Detecting and explaining causes from text for a time series event. In EMNLP.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In ACL.

Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.


Xiang Li, Aynaz Taheri, Lifu Tu, and Kevin Gimpel. 2016. Commonsense knowledge base completion. In ACL.

Nishtha Madaan, Sameep Mehta, Taneea S. Agrawaal, Vrinda Malhotra, Aditi Aggarwal, and Mayank Saxena. 2017. Analyzing gender stereotyping in Bollywood movies. CoRR, abs/1710.04117.

Marie-Catherine de Marneffe, Christopher D. Manning, and Christopher Potts. 2012. Did it happen? The pragmatic complexity of veridicality assessment. Computational Linguistics, 38:301–333.

Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In NAACL.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In ACL.

Deborah A. Prentice and Erica Carranza. 2002. What women and men should be, shouldn’t be, are allowed to be, and don’t have to be: The contents of prescriptive gender stereotypes. Psychology of Women Quarterly, 26(4):269–281.

Anil Ramakrishna, Victor R. Martinez, Nikolaos Malandrakis, Karan Singla, and Shrikanth Narayanan. 2017. Linguistic analysis of differences in portrayal of movie characters. In ACL, pages 1669–1678, Stroudsburg, PA, USA. Association for Computational Linguistics.

Hannah Rashkin, Sameer Singh, and Yejin Choi. 2016. Connotation frames: A data-driven investigation. In ACL.

Lena Reed, JiaQi Wu, Shereen Oraby, Pranav Anand, and Marilyn A. Walker. 2017. Learning lexico-functional patterns for first-person affect. In ACL.

Drew Reisinger, Rachel Rudinger, Francis Ferraro, Craig Harman, Kyle Rawlins, and Benjamin Van Durme. 2015. Semantic proto-roles. TACL, 3:475–488.

Irene Russo, Tommaso Caselli, and Carlo Strapparava. 2015. SemEval-2015 task 9: CLIPEval implicit polarity of events. In SemEval@NAACL-HLT.

Maarten Sap, Marcella Cindy Prasetio, Ari Holtzman, Hannah Rashkin, and Yejin Choi. 2017. Connotation frames of power and agency in modern films. In EMNLP, pages 2329–2334.

Karin Kipper Schuler, Anna Korhonen, and Susan Windisch Brown. 2009. VerbNet overview, extensions, mappings and applications. In NAACL.

Robyn Speer and Catherine Havasi. 2012. Representing general relational knowledge in ConceptNet 5. In LREC.

Yla R. Tausczik and James W. Pennebaker. 2016. The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol.

Emmett Tomai and Ken Forbus. 2010. Using narrative functions as a heuristic for relevance in story understanding. In Proceedings of the Intelligent Narrative Technologies III Workshop, page 9. ACM.

Hoa Trong Vu, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Acquiring a dictionary of emotion-provoking events. In EACL.

Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In EMNLP.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. TACL, 3:345–358.

Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2017. Ordinal common-sense inference. TACL, 5:379–395.


A Appendix

A.1 Event Extraction

We balance the number of content words to ensure that the events are generalizable but still concrete enough to be labelled. We only keep events with at least two and fewer than five content words, defined as words that are not stop words, person tags, or blanks. We count phrasal verbs (such as “get up”) as a single content word. We limit the sets of events to those events that occur most frequently in our corpora, using corpus-specific thresholds.5
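A minimal sketch of this content-word filter; the stop-word list and the phrasal-verb inventory are assumed to come from elsewhere in the pipeline, and the helper name is our own:

```python
def keep_event(tokens, stopwords, phrasal_verbs):
    """Keep events with two to four content words, where content words exclude
    stop words, Person tags, and blanks, and a phrasal verb such as "get up"
    counts as a single content word."""
    text = " ".join(tokens)
    for pv in phrasal_verbs:                          # collapse phrasal verbs to one token
        text = text.replace(pv, pv.replace(" ", "_"))
    content = [t for t in text.split()
               if t.lower() not in stopwords and not t.startswith("Person") and t != "___"]
    return 2 <= len(content) <= 4
```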

A.2 Annotation Setup

Each event was presented to three different raters recruited via Amazon Mechanical Turk. Raters were given the option to say that the event did not make sense (invalid), at which point they were not asked any other questions. If the rater marked the event as valid, they were required to answer the question about how PersonX typically feels after the event. Each rater was paid $0.10 per event. Additionally, we annotated a small number of events where “It” was in the subject (e.g., It rains all day). For these events, we only asked raters to say how other people typically feel after the event (if they marked the event as valid).

B Event2Mind Training Details

In our experiments, we use Adam to train for ten epochs, as implemented in TensorFlow (Abadi et al., 2015).

For baseline models, the dimension of the encoded event embedding is H = 300. For our BiRNN model, we also experimented with an embedding dimension of H = 100.

We define the vocabulary as the tokens appearing in the training data events and annotations at least twice, plus the bigrams and trigrams that appear more than five times. In cases where an annotation for the intent/reaction was left blank (because there was no intent or the event did not affect other people), we treated the annotation as equivalent to the word “none”. Because many of the annotations for intent started with “to” or “to be”, we stripped these two words from the beginning of all intent annotations.
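A sketch of this output-vocabulary construction, with the counts and cutoffs as stated above (helper name and data layout are our own):

```python
from collections import Counter

def build_vocab(token_lists, min_unigram=2, min_ngram=6):
    """Unigrams appearing at least twice in the training events/annotations,
    plus bigrams and trigrams appearing more than five times."""
    unigrams, ngrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        for n in (2, 3):
            ngrams.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return ([w for w, c in unigrams.items() if c >= min_unigram]
            + [g for g, c in ngrams.items() if c >= min_ngram])
```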

5For ROC Story and Spinn3r events, we choose events with frequency at least five and 100, respectively. For Syntactic N-grams, we took the top 10,000 events.

Overlap criterion | % of Event2Mind events
Any node | 25%
All annotations, with select relations | 12%
XIntent, with select relations | 3%
XReact/OReact, with select relations | <1%

Table 5: Event2Mind events overlap with ConceptNet events. While a non-trivial amount are represented in some capacity, few events have intent or reactions.

C Comparison with ConceptNet

We match our events with the event nodes in ConceptNet and find 6 ConceptNet relations that compare to our intent and reaction dimensions. Specifically, we compare MotivatedByGoal, CausesDesire, HasFirstSubevent, and HasSubevent with the ‘XIntent’ annotations, and the ‘XReact’ and ‘OReact’ annotations with the Causes and HasLastSubevent relations. For each ConceptNet event, we then compute unigram overlap between our annotations and their ConceptNet proxy using the 6 relations.

We summarize overlap in Table 5, where we show that 75% of Event2Mind events are not covered in ConceptNet. We also show that while 12% of our events have an edge with one of the 6 relations, the actual overlap between our annotations and the ConceptNet data is very low (<5%). These overlap statistics indicate that our dataset provides new commonsense knowledge that is not covered by previous resources such as ConceptNet.
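A minimal sketch of the unigram-overlap computation between an annotation and its ConceptNet proxy; the example strings are hypothetical, not drawn from the dataset:

```python
def unigram_overlap(annotation, conceptnet_text):
    """Fraction of annotation unigrams that also occur in the ConceptNet proxy text."""
    a = set(annotation.lower().split())
    b = set(conceptnet_text.lower().split())
    return len(a & b) / max(len(a), 1)

# e.g., comparing an XIntent annotation against the tail of a MotivatedByGoal edge:
# unigram_overlap("stay awake", "person wants to stay awake")  -> 1.0
```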


[Figure 8 shows the full annotation interface for an event phrase: (1) Does this event make sense enough for you to answer questions 2-5, or does it have too many meanings? (2) Does PersonX willingly cause this event? If yes, why? (“Because PersonX wants to (be) ...”, up to three reasons, described without reusing words from the event.) (3) How does PersonX typically feel after the event? (“PersonX feels ...”, up to three reactions.) (4) Does this event affect people other than PersonX (e.g., PersonY, people included but not mentioned in the event)? If yes, how do they typically feel after the event? (“They feel ...”, up to three reactions.)]

Figure 8: Main event phrase annotation setup. Each event was annotated by three Amazon Mechanical Turk raters.