TRANSCRIPT
social data science: Text as Data
Sebastian Barfort, August 18, 2016
University of Copenhagen, Department of Economics
text all over the place
The widespread use of the Internet has led to an astronomical amount of digitized textual data accumulating every second through email, websites, and social media. The analysis of blog sites and social media posts can give new insights into human behaviors and opinions. At the same time, large-scale efforts to digitize previously published articles, books, and government documents have been underway, providing exciting opportunities for social scientists.
Imai (2016).
We need to learn how to think about and work with these kinds of new data
resources
Names (selective): Will Lowe, Justin Grimmer, Kenneth Benoit, Margaret E. Roberts, Sven-Oliver Proksch, Suresh Naidu
R packages: tm, quanteda, stm, stringr, tidytext
is your favorite professor biased?
Jelveh, Zubin, Bruce Kogut, and Suresh Naidu. “Detecting Latent Ideology in Expert Text: Evidence From Academic Papers in Economics”. EMNLP. 2014.
Previous work on extracting ideology from text has focused on domains where expression of political views is expected, but it’s unclear if current technology can work in domains where displays of ideology are considered inappropriate. We present a supervised ensemble n-gram model for ideology extraction with topic adjustments and apply it to one such domain: research papers written by academic economists. We show economists’ political leanings can be correctly predicted, that our predictions generalize to new domains, and that they correlate with public policy-relevant research findings. We also present evidence that unsupervised models can underperform in domains where ideological expression is discouraged.
validation
We emphasize that the complexity of language implies that automated content analysis methods will never replace careful and close reading of texts. Rather, the methods that we profile here are best thought of as amplifying and augmenting careful reading and thoughtful analysis. Further, automated content methods are incorrect models of language. This means that the performance of any one method on a new data set cannot be guaranteed, and therefore validation is essential when applying automated content methods.
Grimmer and Stewart (2013).
bag of words
Bag of Words: The ordering and grammar of words does not inform the analysis.
It is easy to construct sample sentences where word order fundamentally changes the meaning of the sentence, but for most common tasks like measuring sentiment, topic modeling, etc., word order does not seem to matter (Grimmer and Stewart 2013)
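A minimal illustration of the assumption in base R (the two sentences are made up): their meanings differ only through word order, so a bag-of-words representation cannot tell them apart.

```r
# Two sentences whose meaning differs only through word order:
docs = c("the dog chased the cat",
         "the cat chased the dog")

# Under the bag-of-words assumption each document is just a
# table of word counts; all ordering information is discarded:
bags = lapply(strsplit(docs, " "), table)

# Both documents yield identical word counts, so any bag-of-words
# model treats them as the same document.
bags[[1]]
```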
pre-processing
Stemming: Dimensionality reduction. Removes the ends of words to reduce the total number of unique words in the data.
Ex: family, families, families’, etc. all become famili.
Stop words: Words that do not convey meaning but primarily serve grammatical purposes.
Uncommon Words: Typically, words that appear very often or very rarely are excluded.
Also typically discard punctuation (although not always!), capitalization, etc.
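The stemming and stop-word steps above can be sketched in R (a minimal sketch assuming the SnowballC and tm packages; the word list is illustrative):

```r
library("SnowballC")  # Porter stemmer (wordStem)
library("tm")         # stop word lists (stopwords)

words = c("family", "families", "running", "the", "and", "prices")

# Stemming: family and families both reduce to "famili"
stemmed = wordStem(words)

# Stop words: drop purely grammatical words like "the", "and"
kept = stemmed[!stemmed %in% stopwords("en")]
```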
Classifying Documents into Known Categories
introduction
Inferring and assigning text to categories is perhaps the most common use of content analysis in the social sciences
Ex: Classifying ads as positive/negative, deciding whether legislation is about the environment, etc.
Two broad approaches:
Dictionary Methods: Use the relative frequency of key words to measure the presence of a category in a given text
Supervised Learning: Build on and extend familiar manual coding tasks using algorithms
dictionary methods
Perhaps the simplest and most intuitive automated text classification method
Use the rate at which key words appear in a text to classify documents into categories or to measure the extent to which documents belong to a particular category
Dictionary: a list of words that classifies a particular collection of words
Note: For dictionary methods to work well, the scores attached to each word must closely align with how the words are used in a particular context
Dictionaries are rarely validated
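The idea can be sketched in base R (the word lists and scoring rule here are made up for illustration, not a validated dictionary):

```r
# Hypothetical dictionary: words signaling each category
positive = c("good", "great", "excellent", "amazing")
negative = c("bad", "terrible", "awful", "rude")

# Score a text by the net rate at which dictionary words appear
dictionary_score = function(text) {
  words = strsplit(tolower(text), "[^a-z']+")[[1]]
  (sum(words %in% positive) - sum(words %in% negative)) / length(words)
}

dictionary_score("The food was great and the service excellent")  # positive
dictionary_score("Terrible food and rude service")                # negative
```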
supervised learning methods
Dictionary methods require that we are able to identify, a priori, words that separate classes
This can be wrong and/or inefficient
Supervised learning models are designed to automate the hand coding of documents
Supervised learning models: Human coders categorize a set of documents by hand. The algorithm then “learns” how to sort the documents into categories using these training data and applies its predictions to new unlabeled texts
approach
1. Construct a training set
2. Apply the supervised learning method using cross-validation
3. Decide on the “best” model and classify the remaining documents
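The workflow can be sketched with base R's glm on made-up hand-coded data (a toy illustration, not a real classifier; the single feature and labels are invented):

```r
# 1. Construct a training set of hand-coded documents
#    (feature: count of positive words; label: 1 = coded "positive")
coded = data.frame(
  n_positive_words = c(5, 4, 6, 0, 1, 0),
  label            = c(1, 1, 1, 0, 0, 0)
)

# 2. Fit a classifier on the training data (cross-validation over
#    candidate models would go here; the toy data separate
#    perfectly, so glm's separation warning is suppressed)
fit = suppressWarnings(
  glm(label ~ n_positive_words, data = coded, family = binomial)
)

# 3. Classify the remaining, unlabeled documents
unlabeled = data.frame(n_positive_words = c(5, 1))
pred = as.numeric(predict(fit, unlabeled, type = "response") > 0.5)
```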
Classification with Unknown Categories
introduction
Supervised and dictionary methods assume a well-defined set of categories
Often, this set of categories is difficult to derive beforehand
It must be discovered from the text itself
Unsupervised Learning: Try to learn underlying features of text without explicitly imposing categories of interest
1. Estimate a set of categories
2. Assign documents (or parts of documents) to those categories
Often: topic-models
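The two steps can be sketched with LDA (a minimal sketch assuming the tm and topicmodels packages; the four toy documents are made up):

```r
library("tm")
library("topicmodels")

texts = c("tax budget spending deficit economy",
          "budget tax deficit spending economy",
          "war troops military defense security",
          "military war defense troops security")
dtm = DocumentTermMatrix(Corpus(VectorSource(texts)))

# 1. Estimate a set of k = 2 categories (topics) from the text
lda = LDA(dtm, k = 2, control = list(seed = 1))

# 2. Assign each document to its most likely topic
topics(lda)
```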
Measuring Latent Features in Texts
introduction
Can we locate actors (politicians, newspapers, researchers) in an ideological space using text data?
Assumption: Ideological dominance. Actors’ ideological preferences determine what they discuss in texts.
Wordscores: Supervised learning approach. Special case of a dictionary method.
Wordfish: Unsupervised learning approach. Discover words that distinguish locations on a policy scale.
wordscore
1. Select reference texts that define the positions in the policy space (e.g. a conservative and a liberal politician)
2. Use training data to determine the relative frequency of words. This creates a measure of how well various words separate the categories
3. Use these word scores to scale remaining texts.
Disadvantage: Conflates policy dominance with stylistic differences
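The three steps can be sketched with quanteda (a sketch assuming the quanteda and quanteda.textmodels packages; the reference texts and the reference scores −1/+1 are invented):

```r
library("quanteda")
library("quanteda.textmodels")

txts = c(left  = "welfare equality public healthcare spending",
         right = "tax cuts free markets deregulation",
         new   = "tax cuts welfare healthcare")

# 1.-2. Fit wordscores on the two reference texts; NA marks the
#       text to be scaled, which is not used for training
dfmat = dfm(tokens(txts))
ws = textmodel_wordscores(dfmat, y = c(-1, 1, NA))

# 3. Use the estimated word scores to scale the remaining text
predict(ws, newdata = dfmat)["new"]
```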
Predicting Yelp Reviews
introduction
David Robinson: Does sentiment analysis work? A tidy analysis of Yelp reviews
Sentiment analysis is often used by companies to quantify general social media opinion (for example, using tweets about several brands to compare customer satisfaction).
One of the simplest and most common sentiment analysis methods is to classify words as “positive” or “negative”, then to average the values of each word to categorize the entire document.
Can we use this approach to predict Yelp reviews?
data
Can be downloaded from here
library("readr")
library("dplyr")

infile = "../nopub/yelp_academic_dataset_review.json"
review_lines = read_lines(infile,
                          n_max = 50000,
                          progress = FALSE)
from json to data frame
library("stringr")
library("jsonlite")

reviews_combined = str_c("[",
                         str_c(review_lines,
                               collapse = ", "),
                         "]")

reviews = fromJSON(reviews_combined) %>%
  flatten() %>%
  tbl_df()
tidy text data
Right now, there is one row for each review.
Remember the bag of words assumption: predictors are at the word, not sentence, level
-> We need to tidy the data
library("tidytext")
review_words = reviews %>%
  select(review_id, business_id, stars, text) %>%
  unnest_tokens(word, text)
review_words %>% dim
## [1] 5930037 4
remove stop words
review_words = review_words %>%
  filter(!word %in% stop_words$word) %>%
  filter(str_detect(word, "^[a-z']+$"))
review_id business_id stars word
Ya85v4eqdd6k9Od8HbQjyA  5UmKMjUEUNdYWqANhGckJw  4  hoagie
Ya85v4eqdd6k9Od8HbQjyA  5UmKMjUEUNdYWqANhGckJw  4  institution
Ya85v4eqdd6k9Od8HbQjyA  5UmKMjUEUNdYWqANhGckJw  4  walking
Ya85v4eqdd6k9Od8HbQjyA  5UmKMjUEUNdYWqANhGckJw  4  throwback
Ya85v4eqdd6k9Od8HbQjyA  5UmKMjUEUNdYWqANhGckJw  4  ago
afinn
AFINN = sentiments %>%
  filter(lexicon == "AFINN") %>%
  select(word, afinn_score = score)
word afinn_score
abandon    -2
abandoned  -2
abandons   -2
abducted   -2
abduction  -2
reviews_sentiment = review_words %>%
  inner_join(AFINN, by = "word") %>%
  group_by(review_id, stars) %>%
  summarize(sentiment = mean(afinn_score))
review_id stars sentiment
__-r0eC3hZlaejvuliC8zQ  5   4.0000000
__77nP3Nf1wsGz5HPs2hdw  5   1.6000000
__DK9Vsmyoo0zJQhIl5cbg  1  -2.1000000
__ELCJ0wzDM2QNRfVUq26Q  5   3.5000000
__esH_kgJZeS8k3i6HaG7Q  5   0.2142857
df.review = reviews_sentiment %>%
  group_by(stars) %>%
  summarise(m.sentiment = mean(sentiment))
[Figure: mean sentiment score (AFINN) by number of stars on Yelp]
word frequency
review_words_counted = review_words %>%
  count(review_id, business_id, stars, word) %>%
  ungroup()
review_id business_id stars word n
__-r0eC3hZlaejvuliC8zQ  qemCNgjeYGcFsRwxW9x4xw  5  amazing    1
__-r0eC3hZlaejvuliC8zQ  qemCNgjeYGcFsRwxW9x4xw  5  attentive  1
__-r0eC3hZlaejvuliC8zQ  qemCNgjeYGcFsRwxW9x4xw  5  breakfast  1
__-r0eC3hZlaejvuliC8zQ  qemCNgjeYGcFsRwxW9x4xw  5  cheese     1
__-r0eC3hZlaejvuliC8zQ  qemCNgjeYGcFsRwxW9x4xw  5  chili      1
word_summaries = review_words_counted %>%
  group_by(word) %>%
  summarize(businesses = n_distinct(business_id),
            reviews = n(),
            uses = sum(n),
            average_stars = mean(stars)) %>%
  ungroup() %>%
  arrange(reviews)
[Figure: histogram of words by the number of reviews they appear in (log scale)]
word_summaries_filtered = word_summaries %>%
  filter(reviews >= 200, businesses >= 10)
word businesses reviews uses average_stars
lives    162  200  205  3.785000
regret   162  200  206  3.915000
bloody    80  201  246  3.621891
courses   84  201  242  3.800995
crowds   130  201  208  3.766169
[Figure: histogram of words by number of reviews, after filtering]
positive words
df.0 = word_summaries_filtered %>%
  arrange(-average_stars)
word businesses reviews uses average_stars
gem         272   504   509  4.482143
superb      171   250   253  4.460000
incredible  268   519   554  4.458574
amazing     927  3696  4240  4.391775
highly      736  1660  1729  4.388554
negative words
df.1 = word_summaries_filtered %>%
  arrange(average_stars)
word businesses reviews uses average_stars
refused     178   205   226  1.604878
worst       667  1215  1321  1.650206
disgusting  252   321   347  1.735202
rude        576  1005  1162  1.833831
horrible    582   938  1043  1.835821
[Figure: average stars vs. number of reviews per word; positive words such as gem, superb, and amazing score highest, negative words such as refused, worst, and disgusting score lowest]
statistical analysis
[Figure: distribution of review sentiment by star rating (1-5)]
cross validation
library("purrr")
library("modelr")
gen_crossv = function(pol,
                      data = reviews_sentiment){
  data %>%
    crossv_mc(200) %>%
    mutate(mod = map(train,
                     ~ lm(stars ~ poly(sentiment, pol),
                          data = .)),
           rmse.test = map2_dbl(mod, test, rmse),
           rmse.train = map2_dbl(mod, train, rmse))
}
set.seed(3000)
df.cv = 1:8 %>%
  map_df(gen_crossv, .id = "degree")
Text Analysis of Donald Trump’s Tweets
file = paste0("http://varianceexplained.org/",
              "files/",
              "trump_tweets_df.rda")
load(url(file))
question
David Robinson: Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half
Who writes Donald Trump’s tweets?
They are written from two different devices: an iPhone and anAndroid
Can we examine quantitatively whether a tweet was written by Donald Trump himself or by someone on his staff?
library("tidyr")
tweets = trump_tweets_df %>%
  select(id, statusSource, text, created) %>%
  extract(statusSource,
          "source", "Twitter for (.*?)<") %>%
  filter(source %in% c("iPhone", "Android"))
id source text created
762669882571980801  Android  My economic policy speech will be carried live at 12:15 P.M. Enjoy!  2016-08-08 15:20:44
762641595439190016  iPhone   Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8  2016-08-08 13:28:20
762439658911338496  iPhone   #ICYMI: “Will Media Apologize to Trump?” https://t.co/ia7rKBmioA  2016-08-08 00:05:54
762425371874557952  Android  Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky!  2016-08-07 23:09:08
762400869858115588  Android  The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest!  2016-08-07 21:31:46
library("tidytext")
library("stringr")
reg = "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words = tweets %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text =
           str_replace_all(text,
                           "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex",
                pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
id source created word
676494179216805888  iPhone  2015-12-14 20:09:15  record
676494179216805888  iPhone  2015-12-14 20:09:15  health
676494179216805888  iPhone  2015-12-14 20:09:15  #makeamericagreatagain
676494179216805888  iPhone  2015-12-14 20:09:15  #trump2016
676509769562251264  iPhone  2015-12-14 21:11:12  accolade
[Figure: most common words in the tweets: trump, bad, cruz, america, people, #makeamericagreatagain, clinton, crooked, #trump2016, hillary]
android_iphone_ratios = tweet_words %>%
  count(word, source) %>%
  filter(sum(n) >= 5) %>%
  spread(source, n, fill = 0) %>%
  ungroup() %>%
  mutate_each(funs((. + 1) / sum(. + 1)), -word) %>%
  mutate(logratio = log2(Android / iPhone))
[Figure: words with the largest Android/iPhone log ratios; campaign hashtags (#makeamericagreatagain, #trump2016, #americafirst, #crookedhillary) are most iPhone-typical, while words like joke, strong, weak, crazy, and badly are most Android-typical]
sentiment analysis
nrc = sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)
word sentiment
abacus     trust
abandon    fear
abandon    negative
abandon    sadness
abandoned  anger
[Figure: words most typical of Android vs. iPhone within each NRC sentiment category (anger, anticipation, disgust, fear, joy, sadness, surprise, trust), measured by the Android/iPhone log ratio]
your turn
Continue working with these data in groups
Think of interesting patterns you can explore in the data
What about the timing of tweets? Could everything be included in one large model?