
Post on 18-Jan-2016


Natural Language Processing

TJTSD66: Advanced Topics in Social Media

Dr. WANG, Shuaiqiang @ CS & IS, JYU

Email: shuaiqiang.wang@jyu.fi

Homepage: http://users.jyu.fi/~swang/

(Social Media Mining)


Part I: What is Natural Language Processing?


NLP Example: Question Answering


NLP Example: Information Extraction


NLP Example: Sentiment Analysis

• size and weight
  – nice and compact to carry!
  – since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!
  – the camera feels flimsy, is plastic and very light in weight, you have to be very delicate in the handling of this camera

Attributes: zoom, affordability, size and weight, flash, ease of use


NLP Example: Machine Translation


Language Technology


Part II: Preprocessing Techniques in NLP


Normalization

• Information Retrieval: indexed text and query terms must have the same form.
• am, is, are → be
• U.S.A. → USA
• car, cars, car’s, cars’ → car
• automate(s), automatic, automation → automat
• Caution: US vs. us!!! (case folding would merge two different words)
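The normalization rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LEMMAS` table and the crude plural rule are hypothetical stand-ins for a real lemmatizer or stemmer, and the `normalize` function is not part of the original slides.

```python
import re

# Hypothetical lookup table for irregular forms (assumption, not from the slides).
LEMMAS = {"am": "be", "is": "be", "are": "be"}

def normalize(token: str) -> str:
    """Map a token to a normalized form using a few hand-written rules."""
    # U.S.A. -> USA: drop the periods inside dotted acronyms.
    if re.fullmatch(r"(?:[A-Za-z]\.){2,}", token):
        return token.replace(".", "")
    # Keep all-caps tokens unchanged so that "US" is not folded into "us".
    if token.isupper() and len(token) > 1:
        return token
    t = token.lower()
    if t in LEMMAS:
        return LEMMAS[t]
    # car's, cars' -> car: strip a possessive ending.
    t = re.sub(r"(?:['’]s|s['’])$", "", t)
    # cars -> car: very crude plural stripping, for illustration only.
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]
    return t

print([normalize(t) for t in ["am", "U.S.A.", "cars'", "car's", "cars", "US", "us"]])
```

The all-caps check comes before lower-casing on purpose; reversing the two steps would reproduce the "US vs. us" problem.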


Sentence Segmentation

• "!" and "?" are relatively unambiguous
• Period “.” is quite ambiguous
  – Sentence boundary
  – Abbreviations like Inc. or Dr.
  – Numbers like .02% or 4.3

• Build a binary classifier for “.”
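As a rough sketch of the decision the classifier has to make, the snippet below uses two hand-written features (is the token a known abbreviation, and is the next token capitalized) instead of a learned model. The abbreviation list and the function name `split_sentences` are assumptions made for illustration.

```python
# Hypothetical abbreviation list; a real system would learn or load a larger one.
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "etc."}

def split_sentences(text: str) -> list[str]:
    """Split whitespace-tokenized text into sentences with simple rules."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith(("!", "?")):
            boundary = True                      # relatively unambiguous
        elif tok.endswith("."):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Binary decision for ".": not an abbreviation, and followed by
            # a capitalized token (or by the end of the text).
            boundary = tok.lower() not in ABBREVIATIONS and (nxt == "" or nxt[0].isupper())
        else:
            boundary = False                     # covers 4.3 and .02%
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith paid 4.3 dollars. It rose .02% today! Really?"))
```

A trained classifier would replace the hand-written `boundary` expression with a prediction from features like these.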


Sentence Segmentation: Decision Tree


Word Tokenization

• English
  – “San Francisco”, “New York” as a token
• Chinese and Japanese
  – No spaces between words
  – 莎拉波娃现在居住在美国东南部的佛罗里达 (Sharapova now lives in Florida, in the southeastern United States)
  – 莎拉波娃 / 现在 / 居住 / 在 / 美国 / 东南部 / 的 / 佛罗里达
  – Also called Word Segmentation in Chinese (中文分词)
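The slides do not name a segmentation algorithm. One common baseline is forward maximum matching: at each position, take the longest dictionary word that matches. The sketch below assumes a tiny hypothetical dictionary that happens to cover the example sentence.

```python
# Toy dictionary (assumption); real segmenters use large lexicons or learned models.
DICTIONARY = {"莎拉波娃", "现在", "居住", "在", "美国", "东南部", "的", "佛罗里达"}

def max_match(text: str, dictionary: set[str]) -> list[str]:
    """Greedy forward maximum matching; falls back to single characters."""
    max_len = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                words.append(candidate)
                i += length
                break
    return words

print(" / ".join(max_match("莎拉波娃现在居住在美国东南部的佛罗里达", DICTIONARY)))
```

Greedy matching works reasonably for Chinese but fails badly on unsegmented English, which is one reason statistical segmenters are preferred in practice.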


Part III: Language Modeling


Probabilistic Language Models

• Assign a probability to a sentence
  – Machine Translation
    • P(high winds tonight) > P(large winds tonight)
  – Spell Correction
    • The office is about fifteen minuets from my house
    • P(about fifteen minutes from) > P(about fifteen minuets from)
  – Speech Recognition
    • P(I saw a van) >> P(eyes awe of an)
  – Summarization, question-answering, etc.


Probabilistic Language Modeling

• Goal: compute the probability of a sentence or sequence of words:
  – P(W) = P(w1,w2,w3,w4,w5…wn)
• Related task: probability of an upcoming word:
  – P(w5|w1,w2,w3,w4)
• A model that computes either of these, P(W) or P(wn|w1,w2…wn-1), is called a language model


How to Compute P(W)

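• Apply the chain rule of probability:

  $P(w_1, w_2, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$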


Example

• P(“its water is so transparent”) = P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)

• P(the|its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that) ???

• No! We’ll never see enough data for estimation.


Markov Assumption

• Simplifying assumption:
  – P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe
  – P(the | its water is so transparent that) ≈ P(the | transparent that)


Markov Assumption

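• Each factor is approximated by conditioning on only the previous k words:

  $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \ldots, w_{i-1})$

• So the probability of the whole sequence becomes:

  $P(w_1, \ldots, w_n) \approx \prod_{i} P(w_i \mid w_{i-k}, \ldots, w_{i-1})$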


N-gram Models

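• Unigram model: $P(w_1, \ldots, w_n) \approx \prod_{i} P(w_i)$
• Bigram model: $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$, estimated from counts as $P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$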
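A bigram model under the Markov assumption can be sketched as follows: each conditional probability is estimated as count(previous word, word) / count(previous word), and a sentence probability is the product of these terms. The two-sentence corpus is a toy assumption (only the first sentence comes from the slides), and the `<s>` / `</s>` padding symbols are a common convention rather than something the slides specify.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_mle(w, prev, unigrams, bigrams):
    """Maximum-likelihood estimate of P(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(sentence, unigrams, bigrams):
    """Product of bigram probabilities over the padded sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        prob *= p_mle(w, prev, unigrams, bigrams)
    return prob

# Toy corpus (assumption): the second sentence is made up for illustration.
unigrams, bigrams = train_bigram(["its water is so transparent", "its water is clear"])
print(sentence_prob("its water is so transparent", unigrams, bigrams))
print(sentence_prob("its water is so clear", unigrams, bigrams))
```

The second sentence gets probability zero because the bigram "so clear" never occurs in the toy corpus, which is exactly the problem smoothing addresses.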


Smoothing

• Intuition
  – A word that does NOT occur in the corpus does NOT necessarily have ZERO probability!
  – Maybe the corpus is simply too small
• Smoothing
  – Steal probability mass to generalize better


Add-one (Laplace) Smoothing

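• MLE estimate:

  $P_{MLE}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$

• Add-1 estimate (pretend every bigram was seen once more than it was; V is the vocabulary size):

  $P_{Add-1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$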
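A minimal sketch of add-one smoothing for bigrams is shown below: every bigram count is incremented by 1 and the denominator grows by the vocabulary size V. The corpus is a toy assumption, and `p_add_one` is an illustrative name.

```python
from collections import Counter

# Toy corpus (assumption); sentences are padded with <s> and </s>.
corpus = ["its water is so transparent", "its water is clear"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # vocabulary size, including the padding symbols

def p_add_one(w, prev):
    """Add-one (Laplace) estimate of P(w | prev)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add_one("transparent", "so"))  # seen bigram
print(p_add_one("clear", "so"))        # unseen bigram, now non-zero
```

Unseen bigrams receive a small non-zero probability, at the cost of taking mass away from the bigrams that were actually observed.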


Berkeley Restaurant Corpus


Laplace-Smoothed Bigrams

 


Reconstituted Counts

 


Compare with Raw Counts


Disadvantage

• Add-1 estimation is a blunt instrument
• So add-1 isn’t used for N-grams:
  – We’ll see better methods
• But add-1 is used to smooth other NLP models
  – For text classification
  – In domains where the number of zeros isn’t so huge.


Good Turing Smoothing

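Add-1 Smoothing:

$P_{Add-1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$

Add-k Smoothing (more general):

$P_{Add-k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV} = \frac{c(w_{i-1}, w_i) + m \times \frac{1}{V}}{c(w_{i-1}) + m}$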


Unigram Prior Smoothing

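• Let $m = kV$
• $\frac{1}{V}$: the probability of each word under the uniform distribution
• The word distribution $P(w_i)$ can instead be estimated from unigram prior knowledge, so the smoothing formula can be rewritten as:

  $P_{UnigramPrior}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\,P(w_i)}{c(w_{i-1}) + m}$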
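The relationship between add-k and unigram-prior smoothing can be checked numerically. In the sketch below the function names and the example counts are assumptions; the point is that setting m = kV and using the uniform prior 1/V makes the two estimates coincide.

```python
def p_add_k(c_bigram, c_prev, V, k):
    """Add-k estimate: add k to every bigram count."""
    return (c_bigram + k) / (c_prev + k * V)

def p_unigram_prior(c_bigram, c_prev, m, p_w):
    """Unigram-prior estimate: add m pseudo-counts distributed according to P(w)."""
    return (c_bigram + m * p_w) / (c_prev + m)

# Hypothetical counts for illustration.
V, k = 1000, 0.5
m = k * V
print(p_add_k(3, 10, V, k))
print(p_unigram_prior(3, 10, m, 1 / V))    # same value: uniform prior
print(p_unigram_prior(3, 10, m, 0.02))     # a frequent word gets more mass
```

Replacing the uniform 1/V with an estimated unigram probability is what lets frequent words receive more of the redistributed mass than rare ones.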


Jelinek-Mercer Smoothing

• Let

• Then

  $= p(w_i \mid w_{i-1})$

• Bigram interpolated model:

  $p_{interp}(w_i \mid w_{i-1}) = (1-\lambda)\,p(w_i \mid w_{i-1}) + \lambda\,p(w_i)$
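The interpolated model is a weighted mix of the bigram and unigram estimates. The sketch below also checks one way to connect it to the unigram-prior formula: choosing λ = m / (c(w_{i-1}) + m) makes the two agree. That choice of λ is an algebraic observation added here for illustration, not something recovered from the slide, and the counts are hypothetical.

```python
def p_interp(p_bigram, p_unigram, lam):
    """Jelinek-Mercer interpolation of bigram and unigram estimates."""
    return (1 - lam) * p_bigram + lam * p_unigram

# Hypothetical counts for illustration.
c_bigram, c_prev, m, p_w = 3, 10, 500, 0.001

lam = m / (c_prev + m)
interpolated = p_interp(c_bigram / c_prev, p_w, lam)
unigram_prior = (c_bigram + m * p_w) / (c_prev + m)
print(interpolated, unigram_prior)
```

In practice λ is usually tuned on held-out data rather than derived from counts.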


Part IV: Sentiment Analysis


Positive or Negative in Movie Reviews?

• Unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• This is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.


Google Product Search


Twitter sentiment vs. Gallup Poll of Consumer Confidence


Twitter Sentiment

• Johan Bollen, Huina Mao, Xiaojun Zeng. 2011. Twitter mood predicts the stock market, Journal of Computational Science 2:1, 1-8. 10.1016/j.jocs.2010.12.007.


Other Names of Sentiment Analysis

• Opinion extraction
• Opinion mining
• Sentiment mining
• Subjectivity analysis


Why sentiment analysis?

• Movie: is this review positive or negative?

• Products: what do people think about the new iPhone?

• Public sentiment: how is consumer confidence? Is despair increasing?

• Politics: what do people think about this candidate or issue?

• Prediction: predict election outcomes or market trends from sentiment


Sentiment Analysis

• Simplest task:
  – Is the attitude of this text positive or negative?
• More complex:
  – Rank the attitude of this text from 1 to 5
• Advanced:
  – Detect the target, source, or complex attitude types


Sentiment Classification for Movie Reviews

• Polarity detection:
  – Is an IMDB movie review positive or negative?
• Data: Polarity Data 2.0:
  – http://www.cs.cornell.edu/people/pabo/movie-review-data

• Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP, 79-86.

• Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL, 271-278.


IMDB data in the Pang and Lee database

• when _star wars_ came out some twenty years ago, the image of traveling throughout the stars has become a commonplace image. […]

• when han solo goes light speed, the stars change to bright lines, going towards the viewer in lines that converge at an invisible point.

• cool.

• _october sky_ offers a much simpler image--that of a single white dot, traveling horizontally across the night sky. […]

• "snake eyes" is the most aggravating kind of movie: the kind that shows so much potential then becomes unbelievably disappointing.

• it's not just because this is a brian depalma film, and since he's a great director and one who’s films are always greeted with at least some fanfare.

• and it's not even because this was a film starring nicolas cage and since he gives a brauvara performance, this film is hardly worth his talents.


Baseline Algorithm

• Tokenization
• Feature Extraction
• Classification using different classifiers
  – Naïve Bayes
  – MaxEnt
  – SVM


Sentiment Tokenization Issues

• Deal with HTML and XML markup
• Twitter mark-up (names, hash tags)
• Capitalization (preserve for words in all caps)
• Phone numbers, dates
• Emoticons
• Useful code:
  – Christopher Potts sentiment tokenizer
  – Brendan O’Connor twitter tokenizer
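The issues listed above can be illustrated with a small regular-expression tokenizer. This is a simplified sketch written for these notes, not the Potts or O'Connor tokenizer; the patterns cover only a handful of emoticons and ignore phone numbers and dates.

```python
import re

TOKEN_RE = re.compile(r"""
    <[^>]+>                 # HTML/XML tags
  | [:;=][-']?[)(DPp]       # a few emoticons such as :-) ;D :(
  | [@\#]\w+                # Twitter user names and hash tags
  | \w+(?:['’]\w+)?         # words, optionally with an internal apostrophe
  | \S                      # any other non-space character
""", re.VERBOSE)

def tokenize(text: str) -> list[str]:
    """Tokenize text, lower-casing words unless they are written in all caps."""
    tokens = TOKEN_RE.findall(text)
    return [t.lower() if t[0].isalpha() and not t.isupper() else t for t in tokens]

print(tokenize("<b>GREAT</b> movie :-) @bob #win Didn't like it"))
```

Keeping emoticons and all-caps words intact preserves signals that a plain whitespace-and-lowercase tokenizer would destroy.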


Extracting Features

• How to handle negation
  – I didn’t like this movie! vs
  – I really like this movie!
• Which words to use?
  – Only adjectives
  – All words
    • All words turns out to work better, at least on this data


Negations

• Add NOT_ to every word between negation and following punctuation:
  – didn’t like this movie , but I
  – didn’t NOT_like NOT_this NOT_movie but I

• Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting Market Sentiment from Stock Message Boards. In APFA.

• Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP, 79-86.
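The NOT_ marking rule can be sketched as a single pass over the tokens. The negation and punctuation patterns below are assumptions chosen for illustration; the slides only give the example, not an exact word list.

```python
import re

# Hypothetical negation cues and clause-ending punctuation.
NEGATION = re.compile(r"^(?:not|no|never|\w+n['’]t)$", re.IGNORECASE)
PUNCT = re.compile(r"^[.,:;!?]$")

def mark_negation(tokens: list[str]) -> list[str]:
    """Prefix NOT_ to every token between a negation cue and the next punctuation."""
    out, negated = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negated = False          # punctuation closes the negation scope
            out.append(tok)
        elif negated:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if NEGATION.match(tok):
                negated = True       # start marking from the next token
    return out

print(mark_negation("didn't like this movie , but I".split()))
```

Unlike the slide's example, this sketch keeps the comma in the output; dropping punctuation afterwards is a separate design choice.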


Boolean feature for Multinomial Naïve Bayes

• Intuition:
  – For sentiment (and probably for other text classification domains)
  – Word occurrence may matter more than word frequency
    • The occurrence of the word fantastic tells us a lot
    • The fact that it occurs 5 times may not tell us much more.
  – Boolean Multinomial Naïve Bayes
    • Clips all the word counts in each document at 1
  – Other possibility: log(freq(w))


Boolean Multinomial Naïve Bayes

• From training corpus, extract Vocabulary
• Calculate P(cj) terms
• Calculate P(wk|cj) terms
• Make predictions:

  $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(w_i \mid c_j)$
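The four steps above can be sketched in pure Python. This is a minimal illustration under stated assumptions: the training documents are invented, add-one smoothing is used for P(w|c), and log probabilities replace the raw product to avoid underflow.

```python
import math
from collections import Counter, defaultdict

def train_boolean_nb(docs):
    """docs: list of (tokens, label). Returns log priors, log likelihoods, vocabulary."""
    vocab = {w for tokens, _ in docs for w in tokens}
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(set(tokens))   # clip counts at 1 per document
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                       for w in vocab}           # add-one smoothing
    return log_prior, log_like, vocab

def predict(tokens, log_prior, log_like, vocab):
    """Return the class maximizing log P(c) + sum of log P(w | c) over distinct known words."""
    feats = set(tokens) & vocab
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in feats) for c in log_prior}
    return max(scores, key=scores.get)

# Invented toy training set for illustration.
train = [("great great plot".split(), "pos"), ("fantastic acting".split(), "pos"),
         ("pathetic boring plot".split(), "neg"), ("worst boring scenes".split(), "neg")]
model = train_boolean_nb(train)
print(predict("fantastic plot".split(), *model))
print(predict("boring scenes".split(), *model))
```

The `set(tokens)` calls are what make this the Boolean variant: "great great" contributes a single count.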


Other issues in Classification

• MaxEnt and SVM tend to do better than Naïve Bayes

• Subtlety: makes reviews hard to classify
  – Perfume review in Perfumes: the Guide
    • “If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut.”


Thwarted Expectations and Ordering Effects

• “This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can’t hold up.”

• Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishburne is not so good either, I was surprised.


Sentiment Lexicons

• The General Inquirer, 1966.
  – Home page: http://www.wjh.harvard.edu/~inquirer
  – List of Categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm
  – Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
  – Categories:
    • Positive (1915 words) and Negative (2291 words)
    • Strong vs. Weak, Active vs. Passive, Overstated vs. Understated
    • Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc.
  – Free for Research Use


Sentiment Lexicons

• LIWC (Linguistic Inquiry and Word Count), 2007
  – Home page: http://www.liwc.net/
  – 2300 words, >70 classes
  – Affective Processes
    • negative emotion (bad, weird, hate, problem, tough)
    • positive emotion (love, nice, sweet)
  – Cognitive Processes
    • Tentative (maybe, perhaps, guess), Inhibition (block, constraint)
  – Pronouns, Negation (no, never), Quantifiers (few, many)
  – $30 or $90 fee


Sentiment Lexicons

• MPQA Subjectivity Cues Lexicon (EMNLP’03, 05)
  – Home page: http://www.cs.pitt.edu/mpqa/subj_lexicon.html
  – 6885 words from 8221 lemmas
    • 2718 positive and 4912 negative
  – Each word annotated for intensity (strong, weak)
  – GNU GPL


Sentiment Lexicons

• Bing Liu Opinion Lexicon (KDD-2004)
  – Bing Liu's Page on Opinion Mining
  – http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
  – 6786 words: 2006 positive and 4783 negative


Sentiment Lexicons

• SentiWordNet (LREC-2010)
  – Home page: http://sentiwordnet.isti.cnr.it/
  – All WordNet synsets automatically annotated for degrees of positivity, negativity, and neutrality/objectiveness


Disagreements between Polarity Lexicons

Christopher Potts, Sentiment Tutorial, 2011


Learning Sentiment Lexicons

• Semi-supervised learning of lexicons
  – Use a small amount of information
    • A few labeled examples
    • A few hand-built patterns
  – To bootstrap a lexicon


Intuition for identifying word polarity (ACL’97)

• Adjectives conjoined by “and” have same polarity
  – Fair and legitimate, corrupt and brutal
  – *fair and brutal, *corrupt and legitimate
• Adjectives conjoined by “but” do not
  – fair but brutal


Step 1

• Label seed set of 1336 adjectives (all occurring more than 20 times in a 21 million word WSJ corpus)
  – 657 positive
    • adequate central clever famous intelligent remarkable reputed sensitive slender thriving…
  – 679 negative
    • contagious drunken ignorant lanky listless primitive strident troublesome unresolved unsuspecting…


Step 2

• Expand seed set to conjoined adjectives


Step 3

• Supervised classifier assigns “polarity similarity” to each word pair, resulting in graph:


Step 4

• Clustering for partitioning the graph into two
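Steps 1 to 4 can be approximated with a much simpler sketch: extract "and"/"but" links between known adjectives, then propagate polarity outward from a seed. Note the substitution: the method on the slides uses a supervised classifier and graph clustering, whereas this illustration uses plain breadth-first label propagation. The text, adjective list, and seed are invented.

```python
import re
from collections import defaultdict, deque

def conjunction_links(text, adjectives):
    """Return (a, b, same_polarity) for adjective pairs joined by 'and' or 'but'."""
    links = []
    for a, conj, b in re.findall(r"\b(\w+) (and|but) (\w+)\b", text.lower()):
        if a in adjectives and b in adjectives:
            links.append((a, b, conj == "and"))
    return links

def propagate(seeds, links):
    """Spread +1/-1 labels from seed words along same/opposite polarity links."""
    graph = defaultdict(list)
    for a, b, same in links:
        graph[a].append((b, same))
        graph[b].append((a, same))
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for neighbor, same in graph[word]:
            if neighbor not in polarity:
                polarity[neighbor] = polarity[word] if same else -polarity[word]
                queue.append(neighbor)
    return polarity

# Invented example text built from the slide's adjective pairs.
text = ("The ruling was fair and legitimate. The regime was corrupt and brutal. "
        "He was fair but brutal.")
adjectives = {"fair", "legitimate", "corrupt", "brutal"}
print(propagate({"fair": 1}, conjunction_links(text, adjectives)))
```

Propagation trusts every link equally, which is why the original method adds a classifier and clustering to cope with noisy or contradictory evidence.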


Output Polarity Lexicon

• Positive
  – bold decisive disturbing generous good honest important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty…
• Negative
  – ambiguous cautious cynical evasive harmful hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful…



Using WordNet to learn polarity

• WordNet: online thesaurus (covered in later lecture).
• Create positive (“good”) and negative seed-words (“terrible”)
• Find Synonyms and Antonyms
  – Positive Set: Add synonyms of positive words (“well”) and antonyms of negative words
  – Negative Set: Add synonyms of negative words (“awful”) and antonyms of positive words (“evil”)
• Repeat, following chains of synonyms
• Filter
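The bootstrapping loop above can be sketched without WordNet itself by using a small hand-made thesaurus. The `SYNONYMS` and `ANTONYMS` tables are hypothetical stand-ins for WordNet lookups, and the final filter (dropping words that end up in both sets) is one simple interpretation of the "Filter" step.

```python
# Hypothetical miniature thesaurus standing in for WordNet lookups.
SYNONYMS = {"good": {"well"}, "terrible": {"awful"}, "awful": {"dreadful"}}
ANTONYMS = {"good": {"evil"}}

def expand(pos_seeds, neg_seeds, synonyms, antonyms, rounds=2):
    """Grow positive and negative word sets by following synonym/antonym links."""
    pos, neg = set(pos_seeds), set(neg_seeds)
    for _ in range(rounds):
        new_pos = ({s for w in pos for s in synonyms.get(w, ())}
                   | {a for w in neg for a in antonyms.get(w, ())})
        new_neg = ({s for w in neg for s in synonyms.get(w, ())}
                   | {a for w in pos for a in antonyms.get(w, ())})
        pos |= new_pos
        neg |= new_neg
    ambiguous = pos & neg            # filter: drop words claimed by both sets
    return pos - ambiguous, neg - ambiguous

positive, negative = expand({"good"}, {"terrible"}, SYNONYMS, ANTONYMS)
print(sorted(positive), sorted(negative))
```

Each round follows the synonym chains one step further, so "dreadful" is only reached in the second round via "awful".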


Turney Algorithm (ACL’02)

1. Extract a phrasal lexicon from reviews
2. Learn polarity of each phrase
3. Rate a review by the average polarity of its phrases
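Steps 2 and 3 can be sketched as follows. Turney's paper scores a phrase by how much more strongly it co-occurs with "excellent" than with "poor" (a difference of pointwise mutual information values); the hit counts, phrase scores, and function names below are hypothetical, and the small `eps` is only there to avoid division by zero.

```python
import math

def phrase_polarity(near_excellent, near_poor, hits_excellent, hits_poor, eps=0.01):
    """Semantic orientation: PMI(phrase, 'excellent') - PMI(phrase, 'poor')."""
    return math.log2(((near_excellent + eps) * hits_poor)
                     / ((near_poor + eps) * hits_excellent))

def rate_review(phrases, polarity):
    """Average the polarity of the known phrases and threshold at zero."""
    scores = [polarity[p] for p in phrases if p in polarity]
    average = sum(scores) / len(scores) if scores else 0.0
    return "recommended" if average > 0 else "not recommended"

# Hypothetical co-occurrence counts for two phrases.
polarity = {
    "low fees": phrase_polarity(8, 2, 100, 100),
    "unethical practices": phrase_polarity(1, 20, 100, 100),
}
print(polarity)
print(rate_review(["low fees", "unethical practices"], polarity))
```

Averaging means a single strongly negative phrase can outweigh several mildly positive ones, as in this example.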


Summary on Learning Lexicons

• Advantages:
  – Can be domain-specific
  – Can be more robust (more words)
• Intuition
  – Start with a seed set of words (‘good’, ‘poor’)
  – Find other words that have similar polarity:
    • Using “and” and “but”
    • Using words that occur nearby in the same document
    • Using WordNet synonyms and antonyms


Any Questions?