Sentiment Analysis + extras - courses.cs.ut.ee

Sentiment Analysis + extras Natural Language Processing: Lecture 10 09.11.2017 Kairit Sirts

Upload: phunganh

Post on 07-Jul-2018



Sentiment Analysis + extras

Natural Language Processing: Lecture 10

09.11.2017

Kairit Sirts


Homework 2 – POS tagging

Score | Name | Features
97.3 | Faiz | word +/-2 words; prefixes, suffixes; numeric, cap, all caps, first and last word, hyphen, lower
96.6 | Maksym | word and stem +/-1 stems; prefixes, suffixes, last 2 words; cap, P and N punct; word pos, sent len
96.5 | Olha | word/stem -2 words/stems, +1 word/stem; prefixes, suffixes
95.4 | Ian | word and stem -2 words/stems; previous POS tag
94.5 | Andre | stem +1 word -2 words; prefixes, suffixes; cap, punct
94.1 | Alina | word -1 word, stem -2 stems; prefixes, suffixes; contains num, punct; cap, is punct, is num
93.9 | Liza | word; prefixes, suffixes; num, first, last, 2nd last, alpha, cap, all caps, punct, article (a, an, the)
93.4 | Viacheslav | stem; prefixes, suffixes; characters (by frequency)
93.2 | Aytaj | word -2 words; prefixes, suffixes; cap, all caps, lower, contains .
92.9 | Vladyslav | word/stem -2 words/stems; prefixes, suffixes; cap, punct; word length
89.5 | Artem | stem
89.2 | Yevhen, Ivan, Hendrik | stem
79.4 | Yurii | word, stem; bigram of word and previous word


Topics

• Sentiment analysis

• Sentiment lexicons

• Extras
  • Topic modeling
  • More about evaluation
    • Cross-validation
    • Precision and recall in multiclass setting

• Slides partially based on:
  • https://web.stanford.edu/~jurafsky/slp3/slides/7_NB.pdf
  • https://lct-master.org/files/MullenSentimentCourseSlides.pdf
  • https://web.stanford.edu/~jurafsky/slp3/slides/21_SentLex.pdf


Sentiment Analysis


Positive or negative movie review

• unbelievably disappointing

• Full of zany characters and richly applied satire, and some great plot twists

• this is the greatest screwball comedy ever filmed

• It was pathetic. The worst part about it was the boxing scenes.


What is sentiment?

• Sentiment = feelings
  • Attitudes
  • Emotions
  • Opinions

• Subjective impressions, not facts


What is sentiment?

• Generally, a binary opposition in opinions is assumed

• For/against, like/dislike, good/bad etc

• Some sentiment analysis jargon:
  • “Semantic orientation”
  • “Polarity”


What is sentiment analysis

• Using NLP/statistics/machine learning methods to extract, identify or otherwise characterize the sentiment content of a text unit

• Also referred to as:
  • Opinion extraction
  • Opinion mining
  • Sentiment mining
  • Subjectivity analysis


Why sentiment analysis?

• Movie: is this review positive or negative?

• Products: what do people think about the new iPhone?

• Public sentiment: how is consumer confidence? Is despair increasing?

• Politics: what do people think about this candidate or issue?

• Prediction: predict election outcomes or market trends from sentiment

• Customer service: is this customer email satisfied or dissatisfied?

• Marketing: how are people responding to this ad/campaign/product release/news item?


Scherer Typology of Affective States

• Emotion: brief organically synchronized … evaluation of a major event
  • angry, sad, joyful, fearful, ashamed, proud, elated

• Mood: diffuse non-caused low-intensity long-duration change in subjective feeling
  • cheerful, gloomy, irritable, listless, depressed, buoyant

• Interpersonal stances: affective stance toward another person in a specific interaction
  • friendly, flirtatious, distant, cold, warm, supportive, contemptuous

• Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons
  • liking, loving, hating, valuing, desiring

• Personality traits: stable personality dispositions and typical behavior tendencies
  • nervous, anxious, reckless, morose, hostile, jealous


What is sentiment analysis?

• Sentiment analysis is the detection of attitudes: “enduring, affectively colored beliefs, dispositions towards objects or persons”

1. Holder (source) of attitude

2. Target (aspect) of attitude

3. Type of attitude
  • From a set of types
    • Like, love, hate, value, desire, etc.
  • Or (more commonly) simple weighted polarity:
    • positive, negative, neutral, together with strength

4. Text containing the attitude
  • Sentence or entire document


Sentiment analysis

• Simplest task:
  • Is the attitude of this text positive or negative?

• More complex:
  • Rank the attitude of this text from 1 to 5

• Advanced:
  • Detect the target, source, or complex attitude types


Other related tasks

• Information extraction (discarding subjective information)

• Question answering (recognising opinion-oriented questions)

• Summarisation (accounting for multiple viewpoints)

• “Flame” detection

• Identifying child-suitability of videos based on comments

• Bias identification in news sources, fake news

• Identifying (in)appropriate content for ad placement


Applications in Business Intelligence

• Question: “Why aren’t consumers buying our laptop?”

• We know the concrete data: price, specs, competition etc

• We want to know subjective data:
  • “the design is tacky”
  • “customer service was condescending”

• Misperceptions are also important, e.g. “updated drivers aren’t available” (even though they really are)


Challenges in Sentiment Analysis

• People express opinions in complex ways

• In opinion texts, lexical content alone can be misleading

• Negations and topic changes are common, at both the text level and the sentence level

• Rhetorical devices such as sarcasm, irony, implication etc


A letter to a hardware store

“Dear <hardware store>

Yesterday I had occasion to visit <your competitor>. They had an excellent selection, friendly and helpful salespeople, and the lowest prices in town.

You guys suck.

Sincerely,”


Amazon.com 5 star

“The characters are so real and handled so carefully, that being trapped inside the Overlook is no longer just a freaky experience. You run along with them, filled with dread, from all the horrible personifications of evil inside the hotel's awful walls. There were several times where I actually dropped the book and was too scared to pick it back up. Intellectually, you know it's not real. It's just a bunch of letters and words grouped together on pages. Still, whenever I go into the bathroom late at night, I have to pull back the shower curtain just to make sure.”


Amazon.com 1 star

“The original Star Wars trilogy was a defining part of my childhood. Born as I was in 1971, I was just the right age to fall headlong into this amazing new world Lucas created. I was one of those kids that showed up early at toy stores [...] anxiously awaiting each subsequent installment of the series. I'm so glad that by my late 20s, the old thrill had faded, or else I would have been EXTREMELY upset over Episode I: The Phantom Menace... perhaps the biggest let-down in film history.”


Pitchfork.com (0.0 out of 10)

“Ten years on from Exile, Liz has finally managed to achieve what seems to have been her goal ever since the possibility of commercial success first presented itself to her: to release an album that could have just as easily been made by anybody else.”


Amazon.com 1 star

“It took a couple of goes to get into it, but once the story hooked me, I found it difficult to put the book down -- except for those moments when I had to stop and shriek at my friends, "SPARKLY VAMPIRES!" or "VAMPIRE BASEBALL!" or "WHY IS BELLA SO STUPID?" These moments came increasingly often as I reached the climactic chapters, until I simply reached the point where I had to stop and flail around laughing.”


What to classify

• There are many possibilities for what we might want to classify:
  • Users
  • Texts
  • Sentences / paragraphs / chunks of text
  • Predetermined descriptive phrases
    • <ADJ NOUN>, <NOUN NOUN>, <ADV ADJ>, etc.
  • Words
  • Tweets


SA as document classification

• Extract features from text
  • Ngrams
  • POS tags, parse structures
  • Negation words, emoticons, exclamation marks etc.
  • Distributed features
  • Sentiment lexicons

• Build a classifier
  • Naïve Bayes
  • Logistic regression
  • SVM
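The recipe above (extract features, then build a classifier) can be sketched in a few lines. This is a minimal illustration, assuming unigram bag-of-words features and a Naïve Bayes classifier with add-one smoothing; the function names and the four training sentences are invented for the example and are not from the lecture.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()

def train_nb(docs):
    """docs: list of (text, label). Returns label counts, per-label word counts, vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log posterior under add-one smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration only.
TRAIN = [
    ("great plot and great characters", "pos"),
    ("the greatest comedy ever", "pos"),
    ("unbelievably disappointing", "neg"),
    ("it was pathetic and boring", "neg"),
]
model = train_nb(TRAIN)
print(predict_nb(model, "great comedy"))
print(predict_nb(model, "pathetic plot and boring characters"))
```

Richer features from the list (POS tags, negation marking, lexicon counts) would enter as additional tokens or counts; the classifier itself would not need to change.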


Sentiment Lexicons


Polarity keywords

• There seems to be some relation between positive words and positive reviews

• Can we come up with a set of keywords by hand to identify polarity?

• Results from (Pang et al., 2002)

Source | Proposed word list | Accuracy | Ties
Human 1 | Pos: dazzling, brilliant, phenomenal, excellent, fantastic; Neg: suck, terrible, awful, unwatchable, hideous | 58% | 75%
Human 2 | Pos: gripping, mesmerizing, riveting, spectacular, cool, awesome, thrilling, badass, excellent, moving, exciting; Neg: bad, cliched, sucks, boring, stupid, slow | 64% | 39%
Human 3 + stats | Pos: love, wonderful, best, great, superb, still, beautiful; Neg: bad, worst, stupid, waste, boring, ?, ! | 69% | 16%


Sentiment/Affective Lexicons

• The General Inquirer

• LIWC - Linguistic Inquiry and Word Count

• MPQA Subjectivity Cues Lexicon

• Bing Liu Opinion Lexicon

• SentiWordNet


The General Inquirer

P. J. Stone et al., 1966. The General Inquirer: A Computer Approach to Content Analysis.

• Home page: http://www.wjh.harvard.edu/~inquirer

• List of Categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm

• Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls

• Categories:
  • Positiv (1915 words) and Negativ (2291 words)
  • Strong vs Weak, Active vs Passive, Overstated versus Understated
  • Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc.

• Free for Research Use


LIWC - Linguistic Inquiry and Word Count

Pennebaker et al., 2007. Linguistic Inquiry and Word Count: LIWC

• Home page: http://www.liwc.net/

• 2300 words, >70 classes

• Affective Processes
  • negative emotion (bad, weird, hate, problem, tough)
  • positive emotion (love, nice, sweet)

• Cognitive Processes
  • Tentative (maybe, perhaps, guess), Inhibition (block, constraint)

• Pronouns, Negation (no, never), Quantifiers (few, many)

• Academic version $90, 30-day rental $10


MPQA Subjectivity Cues Lexicon

T. Wilson et al., 2005. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis.
Riloff and Wiebe, 2003. Learning extraction patterns for subjective expressions.

• Home page: http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/

• 6885 words
  • 2718 positive
  • 4912 negative

• Each word annotated for intensity (strong, weak)

• GNU GPL


Bing Liu Opinion Lexicon

M. Hu and B. Liu, 2004. Mining and Summarizing Customer Reviews.

• Bing Liu's Page on Opinion Mining

• http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar

• 6786 words
  • 2006 positive
  • 4783 negative


SentiWordNet

S. Baccianella et al., 2010. SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining.

• Home page: http://sentiwordnet.isti.cnr.it/

• All WordNet synsets automatically annotated for degrees of positivity, negativity, and neutrality/objectiveness

• [estimable(J,3)] “may be computed or estimated”
  • Pos 0 Neg 0 Obj 1

• [estimable(J,1)] “deserving of respect or high regard”
  • Pos .75 Neg 0 Obj .25


Semi-supervised Sentiment Lexicon Extraction

• Use a small amount of information
  • A few labeled examples
  • A few hand-built patterns

• Then bootstrap a lexicon


Hatzivassiloglou and McKeown intuition for identifying word polarity

V. Hatzivassiloglou and K. R. McKeown, 1997. Predicting the Semantic Orientation of Adjectives.

• Adjectives conjoined by “and” have the same polarity
  • fair and legitimate, corrupt and brutal
  • *fair and brutal, *corrupt and legitimate

• Adjectives conjoined by “but” do not
  • fair but brutal


Hatzivassiloglou and McKeown 1997: Step 1

• Label a seed set of 1336 adjectives (all occurring more than 20 times in a 21-million-word WSJ corpus)
  • 657 positive: adequate central clever famous intelligent remarkable reputed sensitive slender thriving…
  • 679 negative: contagious drunken ignorant lanky listless primitive strident troublesome unresolved unsuspecting…


Hatzivassiloglou and McKeown 1997: Step 2

• Expand seed set to conjoined adjectives

  • Examples of conjoined pairs: nice, helpful; nice, classy

Hatzivassiloglou and McKeown 1997: Step 3

• A supervised classifier assigns a “polarity similarity” to each word pair, resulting in a graph with words as nodes and similarities as edge weights


Hatzivassiloglou and McKeown 1997: Step 4

• Cluster the graph to partition it into two groups (positive and negative)


Output Polarity Lexicon

• Positive
  • bold decisive disturbing generous good honest important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty…

• Negative
  • ambiguous cautious cynical evasive harmful hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful…


Turney Algorithm

Turney, 2002. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews.

1. Extract a phrasal lexicon from reviews

2. Learn polarity of each phrase

3. Rate a review by the average polarity of its phrases


Extract two-word phrases with adjectives and adverbs

First word | Second word | Third word (not extracted)
JJ | NN or NNS | anything
RB, RBR, RBS | JJ | not NN nor NNS
JJ | JJ | not NN nor NNS
NN or NNS | JJ | not NN nor NNS
RB, RBR, RBS | VB, VBD, VBN, VBG | anything
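The five patterns can be applied to POS-tagged text with a simple sliding window. The following is a sketch only: it assumes Penn Treebank tags supplied as (word, tag) pairs, and the function name and the hand-tagged example sentence are made up for illustration.

```python
ADVERBS = {"RB", "RBR", "RBS"}
NOUNS = {"NN", "NNS"}
VERBS = {"VB", "VBD", "VBN", "VBG"}

def extract_phrases(tagged):
    """tagged: list of (word, POS) pairs. Returns two-word phrases matching the five patterns."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        # Tag of the third word; None when the phrase ends the sentence.
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        third_not_noun = t3 not in NOUNS
        match = (
            (t1 == "JJ" and t2 in NOUNS)
            or (t1 in ADVERBS and t2 == "JJ" and third_not_noun)
            or (t1 == "JJ" and t2 == "JJ" and third_not_noun)
            or (t1 in NOUNS and t2 == "JJ" and third_not_noun)
            or (t1 in ADVERBS and t2 in VERBS)
        )
        if match:
            phrases.append(f"{w1} {w2}")
    return phrases

# Toy sentence with tags assigned by hand, for illustration only.
SENT = [("the", "DT"), ("service", "NN"), ("was", "VBD"), ("very", "RB"),
        ("friendly", "JJ"), ("and", "CC"), ("low", "JJ"), ("fees", "NNS")]
print(extract_phrases(SENT))
```

In practice the tags would come from a POS tagger rather than being written by hand.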


How to measure polarity of a phrase?

• Positive phrases co-occur more with “excellent”

• Negative phrases co-occur more with “poor”

• But how to measure co-occurrence?


PMI - Pointwise Mutual Information

• How much more do events x and y co-occur than if they were independent?

• PMI between words: How much more do two words co-occur than if they were independent?
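Written out, the standard definition of PMI for two events, and the same quantity for two words, is:

```latex
\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
\qquad
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}
```

The value is 0 when the two words are independent, positive when they co-occur more often than chance, and negative when they co-occur less often.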


How to estimate Pointwise Mutual Information?

• Query search engine
  • Turney used Altavista because it had a NEAR operator

• P(word) estimated by hits(word)

• P(word1, word2) estimated by hits(word1 NEAR word2)


Does word occur more with “poor” or “excellent”?
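Turney (2002) answers this question with the difference PMI(phrase, “excellent”) − PMI(phrase, “poor”). When probabilities are estimated from hit counts, the phrase's own count and the corpus size cancel, leaving log2 of hits(phrase NEAR excellent)·hits(poor) over hits(phrase NEAR poor)·hits(excellent). The sketch below computes that ratio and then averages phrase scores to rate a review; all hit counts are invented, and the small smoothing constant (used to avoid division by zero) is treated here as an adjustable assumption.

```python
import math

def polarity(near_excellent, near_poor, hits_excellent, hits_poor, smooth=0.01):
    """PMI(phrase, 'excellent') - PMI(phrase, 'poor'), computed from hit counts.

    The phrase's own hit count and the corpus size cancel in the difference.
    `smooth` is a small constant that avoids division by zero.
    """
    return math.log2(((near_excellent + smooth) * hits_poor)
                     / ((near_poor + smooth) * hits_excellent))

def rate_review(phrase_polarities):
    """Average polarity of a review's phrases; a positive average suggests a positive review."""
    return sum(phrase_polarities) / len(phrase_polarities)

# Hypothetical hit counts, invented for illustration.
HITS_EXCELLENT, HITS_POOR = 1_000_000, 1_000_000
scores = [
    polarity(800, 100, HITS_EXCELLENT, HITS_POOR),  # phrase seen mostly near "excellent"
    polarity(50, 400, HITS_EXCELLENT, HITS_POOR),   # phrase seen mostly near "poor"
]
print(scores, rate_review(scores))
```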


Phrases from a positive review


Phrases from a negative review


Results of Turney algorithm

• 410 reviews from Epinions
  • 170 (41%) negative
  • 240 (59%) positive

• Majority class baseline: 59%

• Turney algorithm: 74%

• Phrases rather than words

• Learns domain-specific information


Using WordNet to learn polarity

S.M. Kim and E. Hovy, 2004. Determining the sentiment of opinions.
M. Hu and B. Liu, 2004. Mining and summarizing customer reviews.

• WordNet: online thesaurus

• Create positive (“good”) and negative (“terrible”) seed words

• Find synonyms and antonyms
  • Positive set: add synonyms of positive words (“well”) and antonyms of negative words
  • Negative set: add synonyms of negative words (“awful”) and antonyms of positive words (“evil”)

• Repeat, following chains of synonyms

• Filter
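The bootstrapping loop above can be sketched as follows. To keep the example self-contained it uses two tiny hand-made dictionaries as a stand-in for WordNet (their entries are invented), and it interprets the filter step as dropping words that end up in both sets, which is only one possible filter.

```python
# Toy thesaurus standing in for WordNet; entries are invented for illustration.
SYNONYMS = {
    "good": {"well", "fine"},
    "fine": {"excellent"},
    "terrible": {"awful", "dreadful"},
}
ANTONYMS = {
    "good": {"evil", "bad"},
    "terrible": {"wonderful"},
}

def expand(pos_seeds, neg_seeds, rounds=2):
    """Grow positive/negative sets by following synonym and antonym links."""
    pos, neg = set(pos_seeds), set(neg_seeds)
    for _ in range(rounds):
        new_pos, new_neg = set(), set()
        for w in pos:
            new_pos |= SYNONYMS.get(w, set())  # synonyms keep polarity
            new_neg |= ANTONYMS.get(w, set())  # antonyms flip polarity
        for w in neg:
            new_neg |= SYNONYMS.get(w, set())
            new_pos |= ANTONYMS.get(w, set())
        pos |= new_pos
        neg |= new_neg
    # Filter step (assumed form): drop words that ended up in both sets.
    ambiguous = pos & neg
    return pos - ambiguous, neg - ambiguous

pos, neg = expand({"good"}, {"terrible"})
print(sorted(pos))
print(sorted(neg))
```

The second round is what picks up "excellent", which is only reachable through the chain good → fine → excellent.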


Summary on semi-supervised lexicon learning

• Advantages:
  • Can be domain-specific
  • Can be more robust (more words)

Intuition

• Start with a seed set of words (‘good’, ‘poor’)

• Find other words that have similar polarity:
  • Using “and” and “but”
  • Using words that occur nearby in the same document
  • Using WordNet synonyms and antonyms


Using lexicon to detect document sentiment

Simplest unsupervised method

• Count the words with positive sentiment in a document

• Count the words with negative sentiment

• Choose whichever polarity (positive or negative) has the higher count
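A minimal sketch of this counting method, assuming a tiny hand-made lexicon (a real system would load one of the lexicons listed earlier) and a crude whitespace tokenizer:

```python
# Tiny hand-made lexicon for illustration only.
POSITIVE = {"great", "excellent", "love", "wonderful", "best"}
NEGATIVE = {"bad", "worst", "boring", "stupid", "waste"}

def lexicon_sentiment(text):
    """Return 'positive', 'negative' or 'tie' by comparing lexicon hit counts."""
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    n_pos = sum(t in POSITIVE for t in tokens)
    n_neg = sum(t in NEGATIVE for t in tokens)
    if n_pos > n_neg:
        return "positive"
    if n_neg > n_pos:
        return "negative"
    return "tie"

print(lexicon_sentiment("The best film, I love it!"))
print(lexicon_sentiment("Boring and stupid, but great music."))
print(lexicon_sentiment("You guys suck."))
```

The last call returns a tie because none of its words are in the lexicon, which illustrates why small word lists produce many ties.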


Using lexicon to detect document sentiment

Simplest supervised method

• Build a classifier
  • Predict sentiment (or emotion, or personality) given features
  • Use “counts of lexicon categories” as features
  • Sample features:
    • LIWC category “cognition” had count of 7
    • NRC Emotion category “anticipation” had count of 2

• Baseline
  • Instead use counts of all the words and bigrams in the training set
  • This is hard to beat
  • But only works if the training and test sets are very similar
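The “counts of lexicon categories” features can be computed as below. The word-to-category mapping is a hypothetical miniature (the real LIWC and NRC lists are far larger and assign categories differently); the resulting dictionary would be passed to any classifier as a feature vector.

```python
from collections import Counter

# Hypothetical mini-lexicon mapping words to category names, for illustration only.
CATEGORY_LEXICON = {
    "think": ["cognition"],
    "know": ["cognition"],
    "maybe": ["cognition", "tentative"],
    "hope": ["anticipation"],
    "soon": ["anticipation"],
}

def category_counts(text):
    """Count how many tokens of the text fall into each lexicon category."""
    counts = Counter()
    for tok in text.lower().split():
        for cat in CATEGORY_LEXICON.get(tok, []):
            counts[cat] += 1
    return dict(counts)

feats = category_counts("i think maybe we know soon i hope")
print(feats)
```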


Conclusion on Sentiment Analysis

• Sentiment analysis is about detecting the polarity of subjective opinions

• Related tasks, all related to subjective aspects:
  • Emotion detection, emotion intensity detection
  • Irony, sarcasm, humor detection
  • Personality analysis

• In the simplest case SA is a document classification task, trying to predict the binary sentiment: positive or negative

• Task or domain-specific sentiment lexicons are useful in SA


LDA topic modeling


LDA topic modeling

• LDA – Latent Dirichlet Allocation

• Unsupervised document clustering into topics or themes

• Each document will be represented as a mixture of K topics

• The topic vectors can be used as document representations in a downstream classification task

Figure source: David Blei, 2012. Probabilistic Topic Models

Intuition of LDA

• Each topic is a distribution over words in a fixed vocabulary
  • The genetics topic has words about genetics with high probability
  • The evolutionary biology topic has words about evolutionary biology with high probability

• Each document has a distribution over topics
  • In a two-topic situation (genetics and evolutionary biology) a document may be represented, for instance, by the vector [0.7, 0.3]
  • This is a mixture of topics


Generative story

LDA is a generative model, thus it has a generative story

• For each topic, generate a distribution over words

• For each document, generate a distribution over topics

• For each word position in the document:
  • Sample a topic from the document’s topic distribution
  • Sample a word from that topic’s word distribution
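The generative story can be run forward to produce synthetic documents. The sketch below follows the three steps literally, assuming symmetric Dirichlet priors (sampled by normalizing Gamma draws); the hyperparameter values, function names and vocabulary are illustrative choices, and this is the generative direction only, not a training algorithm.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Symmetric Dirichlet sample obtained by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, beta=0.5, seed=0):
    rng = random.Random(seed)
    # For each topic, generate a distribution over words.
    topics = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    docs, doc_topic_dists = [], []
    for _ in range(n_docs):
        # For each document, generate a distribution over topics.
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # sample a topic
            w = rng.choices(vocab, weights=topics[z])[0]        # sample a word from that topic
            words.append(w)
        docs.append(words)
        doc_topic_dists.append(theta)
    return docs, doc_topic_dists, topics

VOCAB = ["gene", "dna", "genome", "species", "evolution", "fossil"]
docs, thetas, topics = generate_corpus(VOCAB, n_topics=2, n_docs=3, doc_len=8)
for doc, theta in zip(docs, thetas):
    print([round(p, 2) for p in theta], " ".join(doc))
```

Training reverses this process: given only the words, it infers the topic-word and document-topic distributions that plausibly generated them.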


LDA model – plate diagram

[Plate diagram with labelled components: hyperparameters, topic-word distributions, document-topic distributions, topic index for each word, words. Source: David Blei, 2012. Probabilistic Topic Models]

Training an LDA model

• Learn the topic-word distributions

• For each document find the document-topic distributions

• Training equals inference because the model is unsupervised
  • By training the model on a text corpus we also obtain the topic vectors for these documents

Two main methods (both are beyond the scope of our course)

• Gibbs sampling

• Variational Bayes


LDA fit to 17000 articles from Science

Figure source: David Blei, 2012. Probabilistic Topic Models

Tools for training topic models

• Gensim

• Mallet

• Stanford Topic Modeling Toolkit

• jLDADMM

• R topic modeling package


Evaluation


Evaluation protocols

• Train-dev-test: 80/10/10, 70/20/10, 70/15/15
  • Train on training set
  • Tune hyper-parameters on development set
  • Test the best model on test set

• What if we can’t afford to split the data?
  • Cross-validation
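A train/dev/test split can be produced by shuffling once and slicing. This is a small sketch assuming an 80/10/10 split by default; the function name and the fixed seed are choices made for the example.

```python
import random

def train_dev_test_split(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the items once, then slice off the dev and test portions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_dev = round(len(items) * dev_frac)
    n_test = round(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test

train, dev, test = train_dev_test_split(range(100))
print(len(train), len(dev), len(test))
```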


K-fold cross-validation

• Break up data into 5 folds

• For each fold:
  • Choose the fold as a temporary test set
  • Train on the remaining 4 folds, evaluate on the test fold

• Report average performance of the 5 runs
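The procedure can be sketched with plain index arithmetic. Here `evaluate` is a hypothetical callback that trains on the given training indices and returns a score on the test indices; the data is assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) for each of the k folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size

def cross_validate(n_items, k, evaluate):
    """Average evaluate(train_idx, test_idx) over the k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_items, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

With 10 items and 5 folds, every run tests on 2 items and trains on the other 8.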

[Figure: five iterations; in each iteration a different fold serves as the test set and the remaining folds as the training set]

Cross-validation

• Most commonly used: 10-fold CV and 5-fold CV
  • Extreme case is LOOCV – leave-one-out cross-validation

• Stratified K-fold CV – the label distribution is preserved in each fold.
  • If the dataset contains 10 positive items and 30 negative items and the data is divided into 5 folds, how many positive and how many negative items will be in each fold?

• The performance on different folds can vary a lot, especially when folds are small

• If possible, separate the test set and tune the hyperparameters using CV on the train set
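The stratification question can be checked with a short sketch. It reads the slide's example as 10 positive and 30 negative items split into 5 folds and deals each class out round-robin; the function name is made up, and real implementations usually shuffle within each class first.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign item indices to k folds so each fold keeps the overall label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)  # deal each class out round-robin
    return folds

labels = ["pos"] * 10 + ["neg"] * 30
folds = stratified_folds(labels, 5)
for fold in folds:
    n_pos = sum(labels[i] == "pos" for i in fold)
    print(n_pos, len(fold) - n_pos)
```

Each fold ends up with 2 positive and 6 negative items, the same 1:3 ratio as the full dataset.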


Aggregating CV results

• Micro-averaging
  • Sum TP, FP and FN counts from all folds and then compute precision, recall and F-score

• Macro-averaging
  • Compute precision, recall and F-score for each fold and then average.
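The two aggregation schemes give different numbers whenever folds differ in size or difficulty. A small sketch, using invented (TP, FP, FN) counts for three folds:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when a denominator is zero)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def micro_average(fold_counts):
    """Sum the counts over all folds, then compute the scores once."""
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    return prf(tp, fp, fn)

def macro_average(fold_counts):
    """Compute the scores per fold, then average them."""
    scores = [prf(*c) for c in fold_counts]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

# Hypothetical (TP, FP, FN) counts for three folds.
FOLDS = [(8, 2, 2), (1, 1, 3), (5, 0, 5)]
print("micro:", micro_average(FOLDS))
print("macro:", macro_average(FOLDS))
```

Micro-averaging weights every item equally, so large folds dominate; macro-averaging weights every fold equally.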


Precision and recall in multiclass classification

• Contingency table

Figure 6.5: https://web.stanford.edu/~jurafsky/slp3/6.pdf

Aggregating precision and recall

Figure 6.6: https://web.stanford.edu/~jurafsky/slp3/6.pdf

Conclusion for Extras

• LDA topic modeling – unsupervised method for clustering documents based on themes or topics

• It can provide useful features for document classification

• Use cross-validation when you can’t afford to split the data into train/dev/test

• Split data into train/test and perform CV with train set to choose features and hyperparameters

• With multiple classes evaluate precision and recall for every class and then aggregate with micro- or macro-averaging
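For the multiclass case, per-class counts can be read off the gold and predicted labels and then aggregated either way. The sketch below shows this for precision only, on six invented predictions over three classes; recall follows the same pattern with FN in place of FP.

```python
from collections import Counter

def per_class_counts(gold, pred):
    """Return {label: (TP, FP, FN)} for every label seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[g] += 1  # gold class gets a false negative
    labels = sorted(set(gold) | set(pred))
    return {label: (tp[label], fp[label], fn[label]) for label in labels}

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

# Invented gold labels and predictions, for illustration only.
gold = ["pos", "pos", "neg", "neu", "neg", "pos"]
pred = ["pos", "neg", "neg", "neu", "pos", "pos"]
counts = per_class_counts(gold, pred)
macro_p = sum(precision(tp, fp) for tp, fp, _ in counts.values()) / len(counts)
micro_p = precision(sum(c[0] for c in counts.values()),
                    sum(c[1] for c in counts.values()))
print(counts, macro_p, micro_p)
```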
