Happy, Sad, Indifferent … Quantifying Text Sentiment in R
Rajarshi Guha
CT R Users Group, May 2012
Preamble
• https://github.com/rajarshi/ctrug-tweet
• Focus is on using R to perform this task
• Won't comment on validity, rigor, utility, … of sentiment analysis methods
• Some of the example data is available freely, other parts available on request
Getting Twitter Data
• Based on a collaboration with Prof. Debs Ghosh (UConn), studying obesity & social media
• Accessing Twitter is easy using many languages
  – We obtained tweets via a PHP client running over an extended period of time
  – Ended up with 108,164 tweets
• Won't focus on accessing Twitter data from R
  – Very straightforward with twitteR
Cleaning Text
• Load in the tweet data; get rid of URLs, HTML escape codes, punctuation, etc.
```r
d <- read.csv('pizza-unique.csv', colClasses='character', comment='', header=TRUE)
d$geox <- as.numeric(d$geox)
d$geoy <- as.numeric(d$geoy)

remove.urls <- function(x) gsub("http.*$", "", gsub('http.*\\s', ' ', x))
remove.html <- function(x) gsub('"', '', x)

d$text <- remove.urls(d$text)
d$text <- remove.html(d$text)
d$text <- gsub("@", "FOOBAZ", d$text)        # protect @-mentions
d$text <- gsub("[[:punct:]]+", " ", d$text)  # strip punctuation
d$text <- gsub("FOOBAZ", "@", d$text)        # restore @-mentions
d$text <- gsub("[[:space:]]+", ' ', d$text)  # collapse whitespace
d$text <- tolower(d$text)
```
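As a quick sanity check, the cleaning steps can be applied to a single invented sample tweet (the text below is made up for illustration):

```r
# Invented sample tweet to sanity-check the cleaning steps above
remove.urls <- function(x) gsub("http.*$", "", gsub('http.*\\s', ' ', x))

x <- 'Loving this #pizza!! so good :) http://t.co/abc'
x <- remove.urls(x)
x <- gsub("@", "FOOBAZ", x)        # protect @-mentions
x <- gsub("[[:punct:]]+", " ", x)  # strip punctuation
x <- gsub("FOOBAZ", "@", x)        # restore @-mentions
x <- gsub("[[:space:]]+", " ", x)  # collapse whitespace
x <- tolower(x)
trimws(x)   # "loving this pizza so good"
```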
Quantifying Sentiment
• Based on identifying words with positive or negative connotations
• Fundamentally based on looking up words from a dictionary
• If a tweet has more positive words than negative words, the tweet is positive
• More sophisticated scoring schemes are possible
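The word-counting scheme just described fits in a few lines of R; here is a minimal sketch, where the tiny word lists are invented stand-ins for a real opinion lexicon:

```r
# Tiny invented word lists standing in for a real opinion lexicon
pos.words <- c("good", "great", "love", "tasty")
neg.words <- c("bad", "awful", "hate", "soggy")

# Count positive minus negative words; > 0 means a positive tweet
score.tweet <- function(tweet) {
  words <- strsplit(tolower(tweet), "\\s+")[[1]]
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

score.tweet("i love this tasty pizza")   # 2
score.tweet("soggy crust i hate it")     # -2
```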
Better Dictionaries?
• SentiWordNet
  – Derived from WordNet; each term is assigned a positivity and negativity score
  – 206K terms
  – Converted to simple CSV for easy import into R
• Ideally, should perform POS tagging

[Figure: proportion of terms scored negative, neutral, and positive, broken down by part of speech (adjective, adverb, noun, verb)]
Scoring Tweets
• Given a scoring function, we can process the tweets
  – Perfect use case for parallel processing
  – Easily switch out the scoring function
```r
library(parallel)  # for mclapply

swn <- read.csv('sentinet_r.csv', header=TRUE, as.is=TRUE)

swn.match <- function(w) {
  tmp <- subset(swn, Term == w)
  if (nrow(tmp) >= 1) return(tmp[1, c(3, 4)])
  else return(c(0, 0))
}

score.swn <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  cs <- colSums(do.call('rbind', lapply(words, function(z) swn.match(z))))
  return(cs[1] - cs[2])
}

scores <- mclapply(d$text, score.swn)
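To see score.swn in action without the full 206K-term CSV, a toy stand-in for the swn table works; the terms and scores below are invented, and the table is laid out so that columns 3 and 4 hold the positivity and negativity scores, as score.swn assumes:

```r
# Toy three-term stand-in for the SentiWordNet table (scores invented)
swn <- data.frame(POS = c("a", "a", "n"),
                  Term = c("good", "bad", "pizza"),
                  PosScore = c(0.75, 0.00, 0.10),
                  NegScore = c(0.00, 0.65, 0.00),
                  stringsAsFactors = FALSE)

swn.match <- function(w) {
  tmp <- subset(swn, Term == w)
  if (nrow(tmp) >= 1) return(tmp[1, c(3, 4)])
  else return(c(0, 0))   # unmatched words contribute nothing
}

score.swn <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  cs <- colSums(do.call('rbind', lapply(words, function(z) swn.match(z))))
  return(cs[1] - cs[2])
}

score.swn("good pizza today")   # 0.75 + 0.10 - 0 = 0.85
```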
Profiling Makes Me Happy
• 6052 sec with 24 cores (subset-based score.swn)
• Rprof() is a good way to identify bottlenecks*
• 461 sec with 24 cores (match-based score.swn.2)

```r
# Original: one subset() scan of the dictionary per word
swn.match <- function(w) {
  tmp <- subset(swn, Term == w)
  if (nrow(tmp) >= 1) return(tmp[1, c(3, 4)])
  else return(c(0, 0))
}

score.swn <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  cs <- colSums(do.call('rbind', lapply(words, function(z) swn.match(z))))
  return(cs[1] - cs[2])
}

# Faster: a single vectorized match() over all words in the tweet
score.swn.2 <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  rows <- match(words, swn$Term)
  rows <- rows[!is.na(rows)]
  cs <- colSums(swn[rows, c(3, 4)])
  return(cs[1] - cs[2])
}
```

* overkill for this example
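The speed-up comes from replacing a per-word subset() scan with one vectorized match() against the Term column; both approaches give the same score. A minimal check (toy table with invented terms and scores; score columns are referenced by name here rather than the slide's positional c(3,4)):

```r
# Toy stand-in dictionary (invented terms and scores)
swn <- data.frame(Term = c("good", "bad", "pizza"),
                  PosScore = c(0.75, 0.00, 0.10),
                  NegScore = c(0.00, 0.65, 0.00),
                  stringsAsFactors = FALSE)

score.slow <- function(words) {   # one subset() scan per word
  cs <- colSums(do.call(rbind, lapply(words, function(w) {
    tmp <- subset(swn, Term == w)
    if (nrow(tmp) >= 1) tmp[1, c("PosScore", "NegScore")] else c(0, 0)
  })))
  cs[1] - cs[2]
}

score.fast <- function(words) {   # one vectorized match() for all words
  rows <- match(words, swn$Term)
  rows <- rows[!is.na(rows)]
  cs <- colSums(swn[rows, c("PosScore", "NegScore")])
  cs[1] - cs[2]
}

words <- c("good", "pizza", "xyzzy")
score.slow(words)   # 0.85
score.fast(words)   # 0.85
```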
Looking at the Scores

```r
library(ggplot2)

d$swn <- unlist(scores.swn)
d$breen <- unlist(scores.breen)
tmp <- rbind(data.frame(Method='SWN', Scores=d$swn),
             data.frame(Method='Breen', Scores=d$breen))
ggplot(tmp, aes(x=Scores, fill=Method)) +
  geom_density(alpha=0.25) +
  xlab("Sentiment Scores")
```
[Figure: overlaid density curves of sentiment scores (−6 to 6) for the SWN and Breen methods]
• Bulk of the tweets are neutral
• Similar behavior from either scoring function
Sentiment & Time of Day
• Group tweets by hour and evaluate how the proportions of positive, negative, etc. vary

```r
library(reshape)   # melt with the variable_name= argument

tmp <- d
tmp$hour <- strptime(d$time, format='%a, %d %b %Y %H:%M')$hour
tmp <- subset(tmp, !is.na(swn))
tmp$status <- sapply(tmp$swn, function(x) {
  if (x > 0) return("Positive")
  else if (x < 0) return("Negative")
  else return("Neutral")
})
tmp <- data.frame(do.call('rbind',
                          by(tmp, tmp$hour, function(x) table(x$status))))
tmp$Hour <- factor(rownames(tmp), levels=0:23)
tmp <- melt(tmp, id='Hour', variable_name='Sentiment')
ggplot(tmp, aes(x=Hour, y=value, fill=Sentiment)) +
  geom_bar(position='fill') +
  xlab("") + ylab("Proportion")
```
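The hour bucketing relies on strptime pulling the hour field out of the timestamp string; a minimal check on an invented timestamp in the same format as the slide's:

```r
# Check the hour extraction on an invented timestamp in the slide's format
Sys.setlocale("LC_TIME", "C")   # English day/month names for strptime
stamp <- "Tue, 01 May 2012 17:42"
hour <- strptime(stamp, format = "%a, %d %b %Y %H:%M")$hour
hour   # 17
```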
Sentiment & Time of Day

[Figure: stacked bar chart of the proportion of Negative, Neutral, and Positive tweets for each hour of the day (0–23)]
Contradictions?
• Tweets that are negative according to one score but positive according to another

```r
subset(d, swn < -2 & breen > 1)
```

"i m trying to get some legit food right now like pizza or chicken not this shi7y ass school lunch"
"24 i like reading 25 i hate hopsin 26 i love chips salsa 27 i love chevys 28 i was a thug in middle school 29 i love pizza"
"@naturesempwm had a raw pizza 4 lunch today but i was not impressed with the dried out not fresh vegetable spring roll i bought threw out"
Sentiment and Geography
• What's the spatial distribution of tweet sentiment?
• Extract tweets located in the CONUS (~500)
• Visualize the direction and strength of sentiments
• Correlate with other socio-economic factors?

[Figure: map of tweets, with legend showing swn score (−1 to 2) and abs(swn) (0.0 to 2.0)]
Other Considerations
• Should take into account negation
  – Scan for negation terms and adjust score appropriately
• Oblivious to sarcasm
• Sentiment scores should probably be modified by context
• Lots of M/L opportunities
  – Spatial analysis
  – Topic modeling / clustering
  – Predictive models
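The negation adjustment mentioned above can be sketched as a simple polarity flip whenever a negation term immediately precedes a sentiment word (all word lists here are invented for illustration):

```r
# Invented word lists for illustration
negations <- c("not", "no", "never", "don't")
pos.words <- c("good", "great", "love")
neg.words <- c("bad", "awful", "hate")

# Flip a word's polarity when the previous token is a negation term
score.negated <- function(tweet) {
  words <- strsplit(tolower(tweet), "\\s+")[[1]]
  s <- 0
  for (i in seq_along(words)) {
    w <- if (words[i] %in% pos.words) 1
         else if (words[i] %in% neg.words) -1
         else 0
    if (i > 1 && words[i - 1] %in% negations) w <- -w
    s <- s + w
  }
  s
}

score.negated("this pizza is good")      # 1
score.negated("this pizza is not good")  # -1
```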