Quantifying Text Sentiment in R

Happy, Sad, Indifferent … Quantifying Text Sentiment in R. Rajarshi Guha, CT R Users Group, May 2012


TRANSCRIPT

Page 1: Quantifying Text Sentiment in R

Happy, Sad, Indifferent … Quantifying Text Sentiment in R

Rajarshi Guha

CT R Users Group, May 2012

Page 2: Quantifying Text Sentiment in R

Preamble

• https://github.com/rajarshi/ctrug-tweet

• Focus is on using R to perform this task

• Won't comment on the validity, rigor, utility, … of sentiment analysis methods

• Some of the example data is available freely; other parts are available on request

Page 3: Quantifying Text Sentiment in R

Getting Twitter Data

• Based on a collaboration with Prof. Debs Ghosh (UConn), studying obesity & social media

• Accessing Twitter is easy from many languages
  – We obtained tweets via a PHP client running over an extended period of time
  – Ended up with 108,164 tweets

• Won't focus on accessing Twitter data from R
  – Very straightforward with twitteR

Page 4: Quantifying Text Sentiment in R

Cleaning Text

• Load the tweet data; get rid of URLs, HTML escape codes, punctuation, etc.

d <- read.csv('pizza-unique.csv', colClasses='character',
              comment.char='', header=TRUE)
d$geox <- as.numeric(d$geox)
d$geoy <- as.numeric(d$geoy)

remove.urls <- function(x) gsub("http.*$", "", gsub('http.*\\s', ' ', x))
remove.html <- function(x) gsub('&quot;', '', x)

d$text <- remove.urls(d$text)
d$text <- remove.html(d$text)
d$text <- gsub("@", "FOOBAZ", d$text)       # protect @mentions from the punctuation strip
d$text <- gsub("[[:punct:]]+", " ", d$text)
d$text <- gsub("FOOBAZ", "@", d$text)
d$text <- gsub("[[:space:]]+", " ", d$text)
d$text <- tolower(d$text)
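To make the pipeline concrete, here is a minimal sketch (not from the slides) applying the same cleaning steps to a single made-up tweet:

```r
# Illustrative only: the cleaning steps above applied to a made-up tweet
remove.urls <- function(x) gsub("http.*$", "", gsub('http.*\\s', ' ', x))
remove.html <- function(x) gsub('&quot;', '', x)

tw <- "Loved the pizza!! &quot;best ever&quot; http://example.com/xyz"
tw <- remove.html(remove.urls(tw))
tw <- gsub("@", "FOOBAZ", tw)            # protect @mentions from the punctuation strip
tw <- gsub("[[:punct:]]+", " ", tw)
tw <- gsub("FOOBAZ", "@", tw)
tw <- gsub("[[:space:]]+", " ", tw)
cleaned <- tolower(tw)
cleaned
```

The @/FOOBAZ swap preserves @mentions through the punctuation strip, since @ is itself a punctuation character.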

Page 5: Quantifying Text Sentiment in R

Quantifying Sentiment

• Based on identifying words with positive or negative connotations

• Fundamentally based on looking up words in a dictionary

• If a tweet has more positive words than negative words, the tweet is positive

• More sophisticated scoring schemes are possible
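A minimal sketch of this counting scheme, with made-up positive/negative word lists standing in for a real lexicon:

```r
# Made-up word lists; a real analysis would use a sentiment dictionary
pos.words <- c("love", "good", "happy", "great")
neg.words <- c("hate", "bad", "sad", "awful")

score.simple <- function(tweet) {
  words <- strsplit(tolower(tweet), "\\s+")[[1]]
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

score.simple("i love pizza but hate the bad crust")  # → -1 (1 positive, 2 negative)
```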

Page 6: Quantifying Text Sentiment in R

[Figure: bar chart of the proportion of negative, neutral, and positive sentiment terms for adjectives, adverbs, nouns, and verbs]

Better Dictionaries?

• SentiWordNet
  – Derived from WordNet; each term is assigned a positivity and a negativity score
  – 206K terms
  – Converted to a simple CSV for easy import into R

• Ideally, should perform POS tagging

Page 7: Quantifying Text Sentiment in R

Scoring Tweets

• Given a scoring function, we can process the tweets
  – Perfect use case for parallel processing
  – Easily switch out the scoring function

swn <- read.csv('sentinet_r.csv', header=TRUE, as.is=TRUE)

swn.match <- function(w) {
  tmp <- subset(swn, Term == w)
  if (nrow(tmp) >= 1) return(tmp[1, c(3, 4)])
  else return(c(0, 0))
}

score.swn <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  cs <- colSums(do.call('rbind',
                        lapply(words, function(z) swn.match(z))))
  return(cs[1] - cs[2])
}

scores <- mclapply(d$text, score.swn)
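mclapply here comes from the multicore package (since absorbed into R's built-in parallel package). A self-contained sketch with a placeholder scorer, assuming a Unix-like system for forking:

```r
# Sketch of parallel scoring; mclapply forks worker processes on Unix-alikes
library(parallel)

tweets <- c("i love pizza", "i hate cold pizza", "pizza for lunch")
score.stub <- function(tweet) length(strsplit(tweet, "\\s+")[[1]])  # placeholder: word count

n.cores <- if (.Platform$OS.type == "windows") 1 else 2  # mclapply requires mc.cores = 1 on Windows
scores <- mclapply(tweets, score.stub, mc.cores = n.cores)
unlist(scores)  # → 3 4 3
```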

Page 8: Quantifying Text Sentiment in R
Page 9: Quantifying Text Sentiment in R

Profiling Makes Me Happy

• 6052 sec with 24 cores (subset-based score.swn)

• Rprof() is a good way to identify bottlenecks*

• 461 sec with 24 cores (match-based score.swn.2)

# Original: one subset() scan of the dictionary per word
swn.match <- function(w) {
  tmp <- subset(swn, Term == w)
  if (nrow(tmp) >= 1) return(tmp[1, c(3, 4)])
  else return(c(0, 0))
}
score.swn <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  cs <- colSums(do.call('rbind',
                        lapply(words, function(z) swn.match(z))))
  return(cs[1] - cs[2])
}

# Optimized: a single vectorized match() over all words
score.swn.2 <- function(tweet) {
  words <- strsplit(tweet, "\\s+")[[1]]
  rows <- match(words, swn$Term)
  rows <- rows[!is.na(rows)]
  cs <- colSums(swn[rows, c(3, 4)])
  return(cs[1] - cs[2])
}

* overkill for this example
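A hedged illustration of why score.swn.2 wins, using a made-up data frame shaped like the SentiWordNet CSV: subset() rescans every Term once per word, while match() is a single vectorized lookup.

```r
# Made-up dictionary mimicking the SentiWordNet CSV columns
set.seed(1)
swn <- data.frame(Term = sprintf("w%05d", 1:50000),
                  PosScore = runif(50000),
                  NegScore = runif(50000),
                  stringsAsFactors = FALSE)
words <- sprintf("w%05d", sample(50000, 20))

# subset(): scans all 50,000 Terms once per word
slow <- system.time(for (w in words) subset(swn, Term == w))["elapsed"]

# match(): one vectorized lookup for all words at once
fast <- system.time({ rows <- match(words, swn$Term)
                      swn[rows, c(2, 3)] })["elapsed"]
```

On any realistic dictionary size the match() variant should be no slower, and usually much faster, than the subset() loop.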

Page 10: Quantifying Text Sentiment in R

Looking at the Scores

d$swn <- unlist(scores.swn)
d$breen <- unlist(scores.breen)

tmp <- rbind(data.frame(Method='SWN', Scores=d$swn),
             data.frame(Method='Breen', Scores=d$breen))
ggplot(tmp, aes(x=Scores, fill=Method)) +
  geom_density(alpha=0.25) +
  xlab("Sentiment Scores")

[Figure: overlaid density plots of the SWN and Breen sentiment scores; scores range roughly from −6 to 6, with both densities peaking near 0]

• Bulk of the tweets are neutral

• Similar behavior from either scoring function

Page 11: Quantifying Text Sentiment in R

Sentiment & Time of Day

• Group tweets by hour and evaluate how the proportions of positive, negative, and neutral tweets vary

tmp <- d
tmp$hour <- strptime(d$time, format='%a, %d %b %Y %H:%M')$hour

tmp <- subset(tmp, !is.na(swn))
tmp$status <- sapply(tmp$swn, function(x) {
  if (x > 0) return("Positive")
  else if (x < 0) return("Negative")
  else return("Neutral")
})

tmp <- data.frame(do.call('rbind',
                  by(tmp, tmp$hour, function(x) table(x$status))))
tmp$Hour <- factor(rownames(tmp), levels=0:23)
tmp <- melt(tmp, id='Hour', variable_name='Sentiment')
ggplot(tmp, aes(x=Hour, y=value, fill=Sentiment)) +
  geom_bar(position='fill') +
  xlab("") + ylab("Proportion")

Page 12: Quantifying Text Sentiment in R

Sentiment & Time of Day

[Figure: stacked bar chart of the proportion of Negative, Neutral, and Positive tweets for each hour of the day (0–23)]

Page 13: Quantifying Text Sentiment in R

Contradictions?

• Tweets that are negative according to one score but positive according to the other

subset(d, swn < -2 & breen > 1)

"i m trying to get some legit food right now like pizza or chicken not this shi7y ass school lunch"

"24 i like reading 25 i hate hopsin 26 i love chips salsa 27 i love chevys 28 i was a thug in middle school 29 i love pizza"

"@naturesempwm had a raw pizza 4 lunch today but i was not impressed with the dried out not fresh vegetable spring roll i bought threw out"

Page 14: Quantifying Text Sentiment in R

[Figure: tweets plotted by location; color indicates the swn score (−1 to 2), size indicates abs(swn) (0.0–2.0)]

Sentiment and Geography

• What's the spatial distribution of tweet sentiment?

• Extract tweets located in the CONUS (~500)

• Visualize the direction and strength of sentiments

• Correlate with other socio-economic factors?
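The talk doesn't show the extraction step; one simple sketch, assuming geox/geoy hold longitude/latitude and using a rough CONUS bounding box:

```r
# Illustrative CONUS filter; geox/geoy assumed to be longitude/latitude
d <- data.frame(geox = c(-73.7, -0.1, -122.4, 151.2),
                geoy = c(41.6, 51.5, 37.8, -33.9),
                text = c("hartford", "london", "sf", "sydney"),
                stringsAsFactors = FALSE)

# rough continental-US bounding box (lon −125..−66, lat 24..50)
conus <- subset(d, geox > -125 & geox < -66 & geoy > 24 & geoy < 50)
conus$text  # → "hartford" "sf"
```

A bounding box will admit a little of Canada and Mexico; a polygon test against state boundaries (e.g. with the maps or sp packages) would be more precise.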

Page 15: Quantifying Text Sentiment in R

Other Considerations

• Should take negation into account
  – Scan for negation terms and adjust the score appropriately

• Oblivious to sarcasm

• Sentiment scores should probably be modified by context

• Lots of M/L opportunities
  – Spatial analysis
  – Topic modeling / clustering
  – Predictive models
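A minimal sketch of such a negation adjustment (the word lists and the one-word negation window are illustrative choices, not from the talk):

```r
# Illustrative negation handling: a word's score contribution is flipped
# when it immediately follows a negation term (all word lists made up)
pos.words <- c("love", "good", "great")
neg.words <- c("hate", "bad", "awful")
negators  <- c("not", "no", "never", "dont")

score.negated <- function(tweet) {
  words <- strsplit(tolower(tweet), "\\s+")[[1]]
  base  <- as.integer(words %in% pos.words) - as.integer(words %in% neg.words)
  # TRUE wherever the preceding word is a negator
  flip  <- c(FALSE, head(words, -1) %in% negators)
  sum(ifelse(flip, -base, base))
}

score.negated("pizza is not good")  # → -1 ("good" is flipped)
score.negated("i love pizza")       # → 1
```

A one-word window is crude; real systems often flip everything up to the next punctuation mark, which would require keeping punctuation through the cleaning step.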