
Computational Extraction of Social and Interactional Meaning from Speech
Dan Jurafsky and Mari Ostendorf

Lecture 7: Dialog Acts & Sarcasm (Mari Ostendorf)

Note: Uncredited examples are from the Dialogue & Conversational Agents chapter.

H-H Conversation Dynamics (from Stolcke et al., CL 2000; figure from the Jurafsky book)

Human-Computer Dialog

Greeting (system):               "Welcome to the Communicator..."
Request (user):                  "I wanna go from Denver to ..."
Clarification Question (system): "What time do you want to leave Denver?"
Inform (user):                   "I'd like to leave in the morning ..."
Response (system):               "Eight flight options were returned. Option 1..."

Overview
- Dialog acts
  - Definitions
  - Important special cases
  - Detection
- Role of prosody
- Sarcasm
  - In speech
  - In text


Speech/Dialog/Conversation Acts
- Characterize the purpose of an utterance
- Associated with sentences (or intonational phrases)
- Used for:
  - Determining and controlling the "state" of a conversation in a spoken language system
  - Conversation analysis, e.g. extracting social information
- Many different tag sets, depending on application

Example Dialog Acts

Aside: Speech vs. Text
- Speech/dialog/conversation act inventories were developed when conversations were spoken
- Now conversations can also happen online or via text messaging
- Dialog acts are relevant there too, and researchers are starting to look at this
- Some differences:
  - Text is impoverished relative to speech, so extra punctuation, emoticons, etc., are added
  - Turn-taking & grounding work differently


Special Cases (and their applications)
- Question detection → punctuation prediction
- 4-category general set (statement, question, incomplete, backchannel) → cross-domain training and transfer
- Agreement vs. disagreement → social analysis
- Error corrections (for communication errors) → human-computer dialogs

Questions: Harder than you'd think…

Indirect speech acts: e.g., "Can you pass the salt?" has the form of a Y/N question but functions as a request

Correction Example


Automatic Detection
Two problems:
- Classification given segmentation
- Segmentation (often multiple DAs per turn)
Best treated jointly, but this can be computationally complex – start with the known-segmentation case.

Example turn:
  ok uh let me pull up your profile and I'll be right with you here and you said you wanted to travel next week

Segmentation into 2 DAs:
  1. ok uh let me pull up your profile and I'll be right with you here
  2. and you said you wanted to travel next week

Segmentation into 5 DAs:
  1. ok
  2. uh let me pull up your profile and
  3. I'll be right with you here and
  4. you said you wanted to travel
  5. next week

Looking at Segmentation (from Stolcke et al., CL 2000)

More Segmentation Challenges

A: Ok, so what do you think?
B: Well that's a pretty loaded topic.
A: Absolutely.
B: Well, here in uh – Hang on just a minute, the dog is barking – Ok, here in Oklahoma, we just went through a major educational reform…

A: After all these things, he raises hundreds of millions of dollars. I mean uh the fella
B: but he never stops talking about it.
A: but ok
B: Aren't you supposed to y- I mean
A: well that's a little- the Lord says
B: Does charity mean something if you're constantly using it as a cudgel to beat your enemies over the- I'm better than you. I give money to charity.
A: Well look, now I…

Knowledge Sources for Classification
- Words and grammar
  - "please," "would you" – cue to request
  - Aux inversion – cue to Y/N question
  - "uh-huh," "yeah" – often backchannels
- Prosody
  - Rising final pitch – Y/N question, declarative question
  - Pitch & energy can distinguish a backchannel ("yeah") from agreement; a pitch reset may indicate an incomplete
  - Pitch accent type… (more on this later)
- Conversational structure (context)
  - Answers follow questions

Feature Extraction
- Words
  - N-grams as features
  - DA-dependent n-gram language model score
  - Presence/absence of syntactic constituents
- Prosody (typically with normalization)
  - Speaking rate
  - Mean and variance of log energy
  - Fundamental frequency: mean, variance, overall contour trend, utterance-final contour shape, change in mean across utterance boundaries
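To make these prosodic features concrete, here is a minimal sketch of utterance-level extraction. It assumes the librosa library; the word-count input (for speaking rate) and the particular contour summaries are illustrative choices, not the exact features of the cited systems, and per-speaker normalization is left out.

```python
import numpy as np
import librosa

def prosodic_features(wav_path, n_words, fmin=75.0, fmax=400.0):
    """Utterance-level prosodic features of the kind listed above (a sketch)."""
    y, sr = librosa.load(wav_path, sr=16000)
    duration = len(y) / sr

    # Fundamental frequency via pYIN; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    voiced = ~np.isnan(f0)
    f0v, frames = f0[voiced], np.arange(len(f0))[voiced]

    # Frame-level log energy from RMS.
    log_e = np.log(librosa.feature.rms(y=y)[0] + 1e-8)

    return {
        "speaking_rate": n_words / duration,            # words per second
        "log_energy_mean": float(log_e.mean()),
        "log_energy_var": float(log_e.var()),
        "f0_mean": float(f0v.mean()) if f0v.size else 0.0,
        "f0_var": float(f0v.var()) if f0v.size else 0.0,
        # Overall contour trend: slope (Hz per frame) of a line fit to voiced F0.
        "f0_slope": float(np.polyfit(frames, f0v, 1)[0]) if f0v.size > 1 else 0.0,
        # Final contour shape: mean of the last voiced frames minus the overall mean.
        "f0_final_offset": float(f0v[-5:].mean() - f0v.mean()) if f0v.size >= 5 else 0.0,
    }
```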

Combining Cues with Context

With conversational structure, we need a sequence model. Let:
  d = dialog act sequence d_1, …, d_T
  f = prosody features, w = word/grammar features

Direct model (e.g. conditional random field):
  argmax_d p(d|f,w), where p(d|f,w) = ∏_t p(d_t | f_t, w_t, d_{t-1})

Generative model (e.g. HMM, or hidden event model):
  argmax_d p(f,w|d) p(d), where p(f,w|d) p(d) = ∏_t p(f_t|d_t) p(w_t|d_t) p(d_t|d_{t-1})

Experimental results show only a small gain from context.
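Under the generative factorization, decoding the best DA sequence is standard Viterbi. A minimal sketch, assuming the per-utterance log-likelihoods log p(f_t|d) + log p(w_t|d) have already been computed and are passed in as a T×K matrix (the function and argument names are illustrative):

```python
import numpy as np

def viterbi_dialog_acts(loglik, log_trans, log_init):
    """Decode argmax_d prod_t p(f_t|d_t) p(w_t|d_t) p(d_t|d_{t-1}).

    loglik:    (T, K) array, log p(f_t|d) + log p(w_t|d) per utterance t, DA d
    log_trans: (K, K) array, log p(d_t = j | d_{t-1} = i)
    log_init:  (K,)  array, log p(d_1)
    """
    T, K = loglik.shape
    delta = np.empty((T, K))            # best log score ending in DA k at step t
    back = np.zeros((T, K), dtype=int)  # backpointers

    delta[0] = log_init + loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (K, K): prev DA -> current DA
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + loglik[t]

    # Trace back the best DA sequence.
    d = np.empty(T, dtype=int)
    d[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        d[t] = back[t + 1, d[t + 1]]
    return d
```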

Assuming Independent Segments

No sequence model, but the DA prior (unigram) is still important.

Direct model:
  argmax_{d_t} p(d_t | f_t, w_t)
  - Features can extend beyond the utterance to approximately capture context
  - Need to handle nonhomogeneous cues or make them homogeneous

Generative model:
  argmax_{d_t} p(f_t | d_t) p(w_t | d_t) p(d_t)
  - Can predict d_t using separate w and f classifiers, then do classifier combination
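One common way to realize the classifier-combination idea: if each single-cue classifier outputs a posterior, dividing out one copy of the prior recovers the generative score, since p(f|d) p(w|d) p(d) ∝ p(d|f) p(d|w) / p(d). A sketch in log space (the inputs are hypothetical length-K log-probability vectors over dialog acts):

```python
import numpy as np

def combine_posteriors(log_post_w, log_post_f, log_prior):
    """Combine word- and prosody-based classifiers for one segment.

    Since p(d|w) ∝ p(w|d) p(d) and p(d|f) ∝ p(f|d) p(d), the product
    p(f|d) p(w|d) p(d) is proportional to p(d|f) p(d|w) / p(d); dividing
    out the prior keeps it from being counted twice.
    """
    log_score = log_post_w + log_post_f - log_prior
    return int(np.argmax(log_score))
```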

Some Results (not directly comparable)

- 42 classes (Stolcke et al., CL 2000): hidden-event model, prosody & words (& context);
  42-class accuracy: 62-65% on Switchboard ASR output (68-71% on hand transcripts)
- 4 classes (Margolis et al., DANLP 2009): Liblinear, n-grams + length (no prosody), hand transcripts;
  4-class accuracy: 89% Swbd, 84% MRDA; 4-class avg recall: 85% Swbd, 81% MRDA
- 2 classes (Margolis & Ostendorf, ACL 2011): Liblinear, n-grams + prosody, hand transcripts;
  question F-measure: 0.6 on MRDA (recall = 92%)
- 3 classes (Galley et al., ACL 2004): Maxent, lexical-structural-duration features, hand transcripts;
  3-class accuracy: 86% MRDA

Backchannel "Universals"

What do backchannels have in common across languages? Short length and low energy, NOT the words.

Example:
- English: uh-huh, right, yeah
- Spanish: mmm, sí, ya ("mmm, yes, already")

Experiment (Margolis et al., 2009): cross-language DA classification for English vs. Spanish conversational telephone speech
- Statement, question, incomplete, backchannel
- Use automatic translation in cross-language classification

Spanish vs. English DAs

Backchannels:
- roughly 20% of DAs
- lexical cues are useful within a language, so length is not used much
- length is more important across languages

Questions:
- "<s> es que" often starts a statement in Spanish
- its translation, "<s> is that", indicates a question in English


Prosody

Overall impact is small (from Stolcke et al., CL 2000).

BUT it can be important for some distinctions:
- Oh. (disappointment) vs. Oh! (I get it)
- Yeah: positive vs. negative
- Other examples: right, so, absolutely, ok, thank you, …

Question Detection (from Margolis & Ostendorf, ACL 2011)

Whatever! (Benus, Gravano & Hirschberg, 2007)

Production: the 1st syllable is more likely to carry a pitch accent under a negative interpretation.

Perception: listeners' negativity judgments from prosody on "whatever" alone are similar to their judgments given full context.


Sarcasm

Changing the default (or literal) meaning.

Objectives of sarcasm:
- Make someone else feel bad or stupid
- Display anger or annoyance about something
- Inside joke

Why is it interesting?
- More accurate sentiment detection
- More accurate agreement/disagreement detection
- General understanding of communication strategies

Negative positives in talk shows: "yeah"

- and i don't think you're going to be going back … yeah
- oh yeah
- that's right yeah
- yeah
- yeah but …
- yeah well i well m my understanding is …
- yeah it it it gosh you know is that the standard that prosecutors use the maybe possibly she's telling the truth standard
- yeah i i don't think it was just the radical right
- yeah larry i i want to correct something randi said of course

Negative positives (cont.): "right"

- right
- th that's right
- that's right yeah
- you know what you're right but right
- right but but you you can't say that punching him …
- right but the but the psychiatrists in this case were not just …
- senators are not polling very well right
- then as a columnist who's offering opinions on what i think the right policy is it seems to me…

Yeah, right. (Tepperman et al., 2006)

131 instances of "yeah right" in Switchboard & Fisher; 23% annotated as sarcastic.

Annotation:
- In isolation: very low agreement between human listeners (κ = 0.16)*
- In context: still weak agreement (κ = 0.31)
- Gold standard based on discussion

Observation: laughter is much more frequent around sarcastic versions.

* "Prosody alone is not sufficient to discern whether a speaker is being sarcastic."

Sarcasm Detector

Features:
- Prosody: relative pitch, duration & energy for each word
- Spectral: class-dependent HMM acoustic model score
- Context: laughter, gender, pause, Q/A DA, location in utterance

Classifier: decision tree (WEKA)
- Implicit feature selection in tree training
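A toy version of this setup, using scikit-learn's decision tree in place of WEKA; the feature-vector layout and the two instances below are invented for illustration, not the paper's actual data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-instance features for "yeah right" tokens:
# [rel_pitch, rel_duration, rel_energy, laughter_nearby, is_male, pause_before]
X = np.array([
    [0.9, 1.4, 1.2, 1, 0, 0.3],   # sarcastic-looking instance
    [1.0, 0.8, 0.9, 0, 1, 0.0],   # sincere-looking instance
    # ... more labeled instances ...
])
y = np.array([1, 0])              # 1 = sarcastic, 0 = sincere

# Decision trees do feature selection implicitly: features that never
# improve a split simply do not appear in the learned tree.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[0.95, 1.3, 1.1, 1, 0, 0.2]]))
```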

Results
- Laughter is the most important contextual feature
- Energy seems a little more important than pitch

Let's do our own experiment: listen to male and female productions of "absolutely", "yeah", and "exactly".

Sarcasm in Text

Two studies:
- Davidov, Tsur & Rappoport, 2010 (DTR10)
- Gonzalez-Ibanez, Muresan & Wacholder, 2011 (GIMW11)

Sarcasm in Twitter & Amazon

Twitter examples (DTR10):
- "thank you Janet Jackson for yet another year of Super Bowl classic rock!"
- "He's with his other woman: XBox 360. It's 4:30 fool. Sure I can sleep through the gunfire"
- "Wow GPRS data speeds are blazing fast."

More Twitter examples (GIMW11):
- "@UserName That must suck."
- "I can't express how much I love shopping on black Friday."
- "@UserName that's what I love about Miami. Attention to detail in preserving historic landmarks of the past."
- "@UserName im just loving the positive vibes out of that!"

Amazon examples (DTR10):
- "[I] Love The Cover" (book)
- "Defective by design" (music player)

A common pattern: negative positives.

Twitter #sarcasm Issues

Problems (DTR10):
- Used infrequently
- Used in non-sarcastic cases, e.g. to clarify that a previous tweet was sarcastic ("it was #Sarcasm")
- Used when sarcasm is otherwise ambiguous (a prosody surrogate?), biasing the labels towards the most difficult cases

GIMW11 argues that the non-sarcastic cases are easily filtered out by keeping only tweets with #sarcasm at the end.

DTR10 Study

Data:
- Twitter: 5.9M tweets, unconstrained context
- Amazon: 66k reviews, known product context
- Mechanical Turk annotation: κ = 0.34 on Amazon, κ = 0.41 on Twitter

Features:
- Patterns of high-frequency words with content-word (CW) slots, e.g. "[COMPANY] CW does not CW much"
- Punctuation

Classifier: k-NN, with semi-supervised labeling of training samples
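A rough sketch of the pattern idea: keep high-frequency words literally and abstract everything else to a content-word (CW) slot, then score tweets against learned patterns. The word list and exact-match rule below are simplified stand-ins for the paper's weighted partial matching:

```python
# Simplified stand-in for DTR10 pattern features: high-frequency words are
# kept literally; every other token is abstracted to a "CW" slot.
HIGH_FREQ = {"does", "not", "much", "i", "love", "the", "thank", "you", "for"}

def to_pattern(tokens):
    return tuple(w if w in HIGH_FREQ else "CW" for w in tokens)

def pattern_match(tweet_tokens, pattern):
    """1.0 for an exact match of the abstracted tweet, else 0.0."""
    return 1.0 if to_pattern(tweet_tokens) == pattern else 0.0

# The slide's example pattern "[COMPANY] CW does not CW much":
pattern = ("CW", "CW", "does", "not", "CW", "much")
print(pattern_match("xbox controller does not help much".split(), pattern))  # 1.0
print(pattern_match("i love the cover".split(), pattern))                    # 0.0
```

In the full system, vectors of such pattern-match scores (plus punctuation features) feed the k-NN classifier.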

DTR10 Results

Amazon results for different feature sets on the gold standard:

  Feature set      F-score
  Punctuation      0.28
  Patterns         0.77
  Patts + punc     0.81
  Enriched patts   0.40
  Enriched punct   0.77
  All (SASI)       0.83

SASI results for different evaluation paradigms:

  Setting          F-score
  Amazon – Turk    0.79
  Twitter – Turk   0.83
  Twitter – #Gold  0.55

GIMW11 Study

Data: 2700 tweets, with equal amounts of positive, negative, and sarcastic (no neutral).

Annotation by hashtags: sarcasm/sarcastic; happy/joy/lucky; sadness/angry/frustrated.

Features:
- Unigrams, LIWC classes (grouped), WordNet affect
- Interjections and punctuation, emoticons & ToUser

Classifier: SVM & logistic regression
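A minimal sketch of this kind of pipeline with scikit-learn, using unigram counts and logistic regression (swapping in sklearn.svm.LinearSVC gives the SVM variant). The LIWC and WordNet-Affect features are omitted since those resources are not freely bundled, and the three example tweets are placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data; GIMW11 used 2700 hashtag-labeled tweets.
tweets = [
    "I can't express how much I love shopping on black Friday.",  # sarcastic
    "so happy to see you all this weekend!",                      # positive
    "this traffic is making me furious",                          # negative
]
labels = ["sarcastic", "positive", "negative"]

# Unigram counts feeding a linear classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(tweets, labels)
print(model.predict(["im just loving the positive vibes out of that!"]))
```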

Results

Automatic system accuracy: 57% 3-way (S-P-N), 65% 2-way (S-NS)
- Equal difficulty in separating sarcastic from positive and from negative

Human S-P-N labeling (270-tweet subset): κ = 0.48
- Human "accuracy": 43% unanimous, 63% average

New human S-NS labeling: κ = 0.59
- Human "accuracy": 59% unanimous, 67% average (automatic: 68%)

Accuracies & agreement go up for the subset with emoticons.

Conclusion: humans are not so good at this task either…

Summary

Dialog acts:
- Characterize the purpose of an utterance in conversation
- Useful for punctuation in transcription, social analysis, and dialog management in human-computer interaction
- Detection leverages words, grammar, prosody & context

Prosody:
- Matters for a small subset of DAs, but can matter a lot for those cases
- Is realized in both continuous (range) and symbolic (accent) cues, so it needs contextual normalization

Sarcasm: a difficult task! (for both text and speech)

Topics not covered:
- Joint segmentation and classification
- Semi-supervised learning
- Domain-dependent tag set differences
- etc.