Social media analysis with NLP
Michael Miller Yoder
21 November 2019
1
Overview
1. Motivation: language in social context
2. Examples of NLP approaches to modeling identity
Experiment 1: Effects of self-presentation on interaction in social media
Experiment 2: Portrayal of characters and relationships in narrative (fanfiction)
5
language embedded in social context
6
What types of social contexts is language used in?
7
What types of social contexts?
8
9
10
11
12
For NLP, what is language?
13
14
[Timeline, 1990 to 2010s: statistical machine learning NLP, then neural NLP. Marked: Penn Treebank (news text, 1987-1989) and BERT.]
18
19
SOCIAL
20
language
speakers audience
situations purposes
21
Penn Treebank
1987-1989
credit: Amir Zeldes, [Zeldes & Simonson 2016]
Typical rates in the secondary market : 8.65 % one month ; 8.65 % three months ; 8.55 % six months. BANKERS ACCEPTANCES : 8.52 % 30 days ; 8.37 % 60 days ; 8.15 % 90 days ; 7.98 % 120 days ; 7.92 % 150 days ; 7.80 % 180 days.
22
language is always embedded in social context
23
“Language is by and about people”
—Noah Smith, ACL 2017
https://homes.cs.washington.edu/~nasmith/slides/acl-8-1-17.pdf
NLP + social science: applications
24
hate speech detection
community norms
NLP + social science: applications
25
fairness and bias
Garg et al. 2017
media framing
https://criticalmediareview.wordpress.com/2015/10/19/what-is-media-framing/
NLP + social science: applications
26
dialectal NLP tools
Garg et al. 2017
www.tes.com
Overview
1. Motivation: language in social context
2. Examples of NLP approaches to modeling identity
Experiment 1: Effects of self-presentation on interaction in social media
Experiment 2: Portrayal of characters and relationships in narrative (fanfiction)
27
28
29
30
Models of identity
identity
31
Critical identity approaches
“identity is the product rather than the source of linguistic and other semiotic practices … is social and cultural rather than primarily internal”
sociolinguistics
[Bucholtz and Hall 2005]
33
identity
society, culture
Critical identity approaches
“As a shifting and contextual phenomenon, gender does not denote a substantive being”
gender studies
[Butler 1990]
35
(changing) identity
society, culture
Critical identity approaches
“race and sex become grounded in experiences that actually represent only a subset of a much more complex phenomenon.”
critical race theory
[Crenshaw 1989]
36
(intersectional) identity
Critical identity approaches
“people have multiple identities connected not to their ‘internal states’ but to their performances in society”
discourse analysis
[Gee 2000]
37
identities
Computational identity approaches
“classify latent user attributes, including gender, age, regional origin, and political orientation solely from Twitter user language”
computer science
[Rao et al. 2010]
38
identity
Computational identity approaches
“Inferring latent attributes of online users has many applications in public health, politics, and marketing”
computational linguistics
[Ardehaly and Culotta 2015]
39
identity
Computational identity approaches
“a [deep neural network] can be used to identify sexual orientation from facial images”
computer vision
[Wang and Kosinski 2018]
40
identity
Can we investigate the production of identity in language with computational models?
41
Avoid naturalizing structures of identity and further marginalizing those who don’t fit them (Butler 1990)
Discover how notions of identity are being reinforced/challenged/reinvented
42
[Diagram: language + social data; machine learning, y = f(x); ?]
1. Self-presentation effects on social media
43
Qinlan Shen, CMU Language Technologies Institute
Alex Coda, CMU Language Technologies Institute
Carolyn P. Rosé, CMU Language Technologies Institute
Yunseok Jang, U Michigan Computer Science & Eng
Yale Song, Microsoft Research
Kapil Thadani, Yahoo Research
Explicit identity positioning
● Working identity definition: “social positioning of self and other” [Bucholtz & Hall 2010]
● How does the social positioning of self affect interaction on social media?
● Tumblr as a site with particular identity implications, as well as social interaction
44
45
46
Lyca / 25
Self-presentation on Tumblr
47
● Explicit social positioning: blog descriptions!
● Well these are messy
● "List descriptions"
○ max | 18yo | she/they | girl with dreams | twerfs don't follow
○ andre | 22 | he/him | mexican ✨trans | too many fandoms
○ hey! annie, she/hers, love me, infj
What effects of similarities and differences in self-positioning do we see on content propagation in Tumblr?
49
self-positioning: blog descriptions; content propagation: reblogging
Reblog prediction
● Reblog "opportunity": two followees of the same follower post at a similar time, and the follower reblogs one of the posts
● Learning to rank pairwise formulation
[Diagram: a follower and two followees, each followee with a post made at a similar time; one post is reblogged]
[Example blog descriptions in the diagram: "she/her", "25 | nyc", "reylo fan"]
52
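To make the pairwise learning-to-rank formulation concrete, here is a minimal sketch. It assumes each reblog opportunity is a pair of feature vectors (reblogged post, non-reblogged post) and fits a linear scorer with a pairwise logistic loss. The slides do not say which ranking model was used, so the loss, the optimizer, and the names `train_pairwise_ranker` and `score` are illustrative assumptions.

```python
import math

def train_pairwise_ranker(pairs, n_features, epochs=200, lr=0.1):
    """Fit weights w so that w.x_pos > w.x_neg for each (x_pos, x_neg) pair.

    Minimizes the pairwise logistic loss log(1 + exp(-w.(x_pos - x_neg)))
    with plain batch gradient descent. This is an assumed, simplified model.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for x_pos, x_neg in pairs:
            diff = [p - n for p, n in zip(x_pos, x_neg)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Derivative of log(1 + exp(-margin)) w.r.t. margin is -1 / (1 + exp(margin)).
            scale = -1.0 / (1.0 + math.exp(margin))
            for i, di in enumerate(diff):
                grad[i] += scale * di
        w = [wi - lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w

def score(w, x):
    """Linear ranking score for one candidate post."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy opportunities. Hypothetical features: [label match with follower, scaled like count].
pairs = [
    ([1.0, 0.2], [0.0, 0.3]),
    ([1.0, 0.5], [0.0, 0.1]),
    ([1.0, 0.1], [0.0, 0.4]),
]
weights = train_pairwise_ranker(pairs, n_features=2)
```

In this toy data the reblogged post always has a label match, so the learned weight on that feature comes out positive and every reblogged post outscores its paired alternative.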
Levels of identity abstraction
● Identity categories: dimensions of personal characteristics
○ age, gender, personality type
● Identity labels:
○ 17, trans man, infj
53
Identity category extraction
● Manually grouped most popular common n-grams into 11 categories
● Refined list with manual annotation of 1000 blog descriptions
● Regular expressions to extract features such as "girl", "ravenclaw", "25" to represent users
Identity category
age
ethnicity/nationality
fandoms
gender
interests
location
personality type
pronouns
relationship status
sexual orientation
zodiac sign
54
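The regular-expression step could look roughly like the sketch below. The patterns are hypothetical and heavily abbreviated: the actual pattern lists behind the 11 categories are not shown in the slides, and `CATEGORY_PATTERNS` and `extract_identity_labels` are names invented for this illustration.

```python
import re

# Hypothetical, abbreviated patterns for a few of the categories.
# Each pattern has exactly one capturing group holding the label.
CATEGORY_PATTERNS = {
    "age": re.compile(r"\b(1[3-9]|[2-9]\d)(?:\s?yo)?\b"),
    "pronouns": re.compile(r"\b((?:she|he|they)\s*/\s*(?:her|hers|him|his|them|they))\b"),
    "gender": re.compile(r"\b(girl|boy|woman|man|nonbinary|trans)\b"),
    "personality type": re.compile(r"\b([ie][ns][ft][jp])\b"),
    "fandoms": re.compile(r"\b(ravenclaw|gryffindor|hufflepuff|slytherin)\b"),
}

def extract_identity_labels(description):
    """Return {category: [labels]} for every category found in a blog description."""
    text = description.lower()
    labels = {}
    for category, pattern in CATEGORY_PATTERNS.items():
        found = [match.group(1) for match in pattern.finditer(text)]
        if found:
            labels[category] = found
    return labels

example = extract_identity_labels(
    "max | 18yo | she/they | girl with dreams | twerfs don't follow"
)
```

A user is then represented by the categories they provide (the dictionary keys) and the specific labels within each category (the values).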
Data
● Sampled 1000 users who have blog descriptions and minimum 10 reblogs
● Pair each reblog with up to 5 posts not reblogged, posted within 30 minutes of the paired reblog
Number of sampled users: 1000
Total reblog opportunities: 712,670
Timeframe: June - Nov 2018
55
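The pairing step can be sketched as follows. This assumes posts arrive as `(post_id, timestamp)` tuples, that "within 30 minutes" means either before or after the reblogged post, and that the first five qualifying posts are kept. The slides do not specify how the (up to) five posts were chosen, and `build_reblog_opportunities` is a hypothetical name.

```python
from datetime import datetime, timedelta

def build_reblog_opportunities(reblogs, candidate_posts, window_minutes=30, max_negatives=5):
    """Pair each reblogged post with up to `max_negatives` non-reblogged posts
    whose timestamps fall within `window_minutes` of it.

    reblogs, candidate_posts: lists of (post_id, timestamp) tuples.
    Returns a list of (reblogged_id, not_reblogged_id) pairs.
    """
    window = timedelta(minutes=window_minutes)
    reblogged_ids = {post_id for post_id, _ in reblogs}
    opportunities = []
    for reblog_id, reblog_time in reblogs:
        nearby = [
            post_id
            for post_id, posted_at in candidate_posts
            if post_id not in reblogged_ids and abs(posted_at - reblog_time) <= window
        ]
        # Assumption: keep the first qualifying posts rather than a random sample.
        for negative_id in nearby[:max_negatives]:
            opportunities.append((reblog_id, negative_id))
    return opportunities

noon = datetime(2018, 6, 1, 12, 0)
reblogs = [("r1", noon)]
candidates = [(f"p{i}", noon + timedelta(minutes=3 * i)) for i in range(1, 8)]
candidates.append(("late", noon + timedelta(hours=1)))
toy_opportunities = build_reblog_opportunities(reblogs, candidates)
```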
Features
● Baseline features:
○ Post hashtags
○ Number of likes, reblogs, comments
○ Post type (text, photo, quote, video, audio, chat, link, answer)
● Category alignment features:
○ Category match
○ Category mismatch: one user provides the category, the other does not
● Label alignment features:
○ Label match
○ Label mismatch
○ Specific label interaction count
56
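One way the category and label alignment features could be computed for a follower/followee pair is sketched below. The feature names are invented for illustration, and the definition of label mismatch (both users provide the category but share no label) is an assumption; the slides only define category mismatch explicitly.

```python
def alignment_features(follower, followee):
    """Compute alignment features between two users.

    follower, followee: dicts mapping identity category -> set of labels.
    Returns a dict of feature name -> value.
    """
    features = {}
    for category in sorted(set(follower) | set(followee)):
        if category in follower and category in followee:
            features[f"category_match:{category}"] = 1
            shared = follower[category] & followee[category]
            features[f"label_match:{category}"] = int(bool(shared))
            # Assumed definition: both give the category, no label in common.
            features[f"label_mismatch:{category}"] = int(not shared)
            # Count each specific follower-label / followee-label combination.
            for a in sorted(follower[category]):
                for b in sorted(followee[category]):
                    key = f"label_interaction:{a}|{b}"
                    features[key] = features.get(key, 0) + 1
        else:
            # One user provides the category, the other does not.
            features[f"category_mismatch:{category}"] = 1
    return features

follower = {"pronouns": {"she/her"}, "interests": {"anime"}}
followee = {"interests": {"design"}, "age": {"25"}}
toy_features = alignment_features(follower, followee)
```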
Is there an effect?
57
What is the nature of this effect?
● Generally positive coefficients were learned for category and label match features, negative for mismatches
● Specific interaction features between labels were sometimes the most informative
58
What is the nature of this effect?
59
Features | Likelihood of reblogging
Follower: presents pronouns; Followee: does not | ↓
Race/ethnicity label alignment | ↑
Nationality label alignment | none
Follower: cisgender; Followee: cisgender | ↑
What is the nature of this effect?
60
Features | Likelihood of reblogging
Similar ages (20 and 21, e.g.) | ↑
Follower: anime; Followee: design | ↑
Follower: gaming; Followee: manga | ↑
Follower: memes; Followee: history | ↓
Conclusion
● Evidence for an association between explicit, self-presented identity information and content propagation
○ Most studies use only content and network features to predict content propagation [Naveed et al. 2011, Zhang et al. 2016, Vosoughi et al. 2018]
● Users who presented labels that indicated shared interests or shared values were more likely to share each other’s content
61
2. Changes in portrayal of characters in narrative
62
Qinlan Shen
Luke Breitfeller
Carolyn P. Rosé
James Fiacco
Shefali Garg
Ethan Xuanyue Yang
Huiming Jin
Hariharan Muralidharan
Motivation
● Examine how others’ identity is positioned in narrative
● Can computational models capture basic changes in narrative portrayal of characters’ identity?
● Fanfiction: fiction created by fans of TV shows, movies, books, comics, etc
63
[Discourse Processes, in submission]
64
Can we capture changes in character and relationship framing in fanfiction with word embedding-based methods?
65
66
Methods
● Word embeddings [Mikolov et al. 2013a] for social questions
○ Stereotypes and bias in corpora [Garg et al. 2018]
○ Framing by different social groups [An et al. 2018]
● Can word embeddings capture social framing of relationships in fanfiction?
67
Hypotheses
1. Focusing on text that is relevant to characterization provides a stronger signal for learning shifts in relationship portrayal
2. Differences between canon and fanfiction vector representations in embedding space can represent changes in relationship portrayal
Data
68
Harry Potter stories, Archive of Our Own
>179k stories (as of 2018)
Characters
● Harry Potter
● Hermione Granger
● Draco Malfoy
● Ron Weasley
● Ginny Weasley
Pairings by popularity
● Draco/Harry
● Hermione/Ron
● Draco/Hermione
● Ginny/Harry
● Harry/Hermione
● Harry/Ron
Prediction task
69
● Does the relationship match canon in being romantic/not romantic?
● True if
○ romantic in canon and romantic in fanfiction or
○ not romantic in canon and not romantic in fanfiction
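The label definition above amounts to checking that canon and fanfiction agree. A minimal sketch follows, assuming a lookup of which pairings are romantic in canon. That lookup is filled in here from general knowledge of the books rather than from the slides, and `CANON_ROMANTIC` and `matches_canon` are hypothetical names.

```python
# Assumption: of the six pairings studied, these two are romantic in canon.
CANON_ROMANTIC = {
    frozenset({"Hermione", "Ron"}),
    frozenset({"Ginny", "Harry"}),
}

def matches_canon(pairing, romantic_in_fanfiction):
    """True when the fanfiction agrees with canon on romantic vs. not romantic."""
    romantic_in_canon = frozenset(pairing) in CANON_ROMANTIC
    return romantic_in_canon == romantic_in_fanfiction
```

Using a `frozenset` for the pairing makes the check independent of the order in which the two characters are listed.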
Text extraction
70
github.com/michaelmilleryoder/fanfiction-nlp
Based on BookNLP [Bamman et al. 2014]
Relationship representations
71
Harry wept at the sight of Hermione in the garden.
Ron looked down at his shoe. Troll bogeys. He would have to tell Harry about this.
Harry Hermione Harry Ron
● Weighted average of word embeddings in a 10-word window around character name mentions
72
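A rough sketch of the windowed average described above. Several details are assumptions, since the slide does not give them: the window is taken as 10 tokens on each side of a mention, weights fall off as the inverse of the distance to the mention, and other character names are skipped as context words. `mention_context_vector` is a hypothetical name, and the embeddings are toy two-dimensional vectors.

```python
def mention_context_vector(tokens, embeddings, names, window=10):
    """Weighted average of context-word embeddings around character mentions.

    tokens: list of lowercased tokens from a story.
    embeddings: dict mapping word -> list of floats (all the same length).
    names: set of character-name tokens for the pairing of interest.
    Returns None when no context word with an embedding is found.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for i, token in enumerate(tokens):
        if token not in names:
            continue
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            word = tokens[j]
            if j == i or word in names or word not in embeddings:
                continue
            weight = 1.0 / abs(j - i)  # assumed inverse-distance weighting
            total = [t + weight * v for t, v in zip(total, embeddings[word])]
            weight_sum += weight
    if weight_sum == 0.0:
        return None
    return [t / weight_sum for t in total]

tokens = "harry wept at the sight of hermione".split()
toy_embeddings = {"wept": [1.0, 0.0], "sight": [0.0, 1.0]}
pair_vector = mention_context_vector(tokens, toy_embeddings, {"harry", "hermione"})
```

The same function can be run separately over canon text and over a fanfiction story, giving two vectors for a pairing that can then be compared in embedding space.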
Visualization
● Track changes in contextualized embeddings for character names across fics
○ Train RNN-based language model and take final hidden state as contextualized word representation [Peters et al. 2018]
73
Visualization
Hermione sat in the front of the classroom. She...
Fleur whistled softly. "Hermione! Come here...
[ 0.34 0.72 0.21 … ]
[ 0.89 0.06 0.53 … ]
74
75
76
Canon vector is close to the center of the fanfiction vectors: harry
Canon vector is on the edge of fanfiction vectors: draco, remus, sirius
Conclusion
77
● Word embedding approaches can capture types of character framing
○ See evidence of differences in characterization, relationships
● Differences often match known fanfiction trends
Conclusion
78
Computational models of identity in language
● Assumption: identity is not only reflected, but also constructed, in language
● Computational techniques to analyze and model the presentation of identity in discourse
79
Computational models of identity in language
● Shift focus from predicting latent user attributes from language to exploring how people are positioning themselves and others in language
● Enables exploring the effects of the choice of self-presentation (Experiment 1)
● Acknowledges that identities can be framed and represented in varied, changing ways in narrative (Experiment 2)
80
language embedded in social context
81
Thank you!
82
draco canon vector is on the edge of fanfiction vectors
83
84
Representation for Ron
Differences when cast in a canon relationship vs. when excluded
Data
85
● For each character pairing, sampled stories with at least 5 paragraphs with both characters mentioned
● Balanced dataset across 6 pairings
● Each instance is a particular pairing in a story
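The sampling described above could be sketched as follows. It assumes each story is available as a list of paragraphs with the set of characters mentioned in each, and that balancing is done by downsampling every pairing to the size of the smallest one. The slides do not say how balance was achieved, and `sample_balanced_instances` is a hypothetical name.

```python
import random

def sample_balanced_instances(stories, pairings, min_paragraphs=5, seed=0):
    """Sample (story_id, pairing) instances, the same number for every pairing.

    stories: dict mapping story_id -> list of paragraphs, where each paragraph
             is the set of character names mentioned in it.
    pairings: list of (character_a, character_b) tuples.
    """
    rng = random.Random(seed)
    eligible = {}
    for pairing in pairings:
        a, b = pairing
        eligible[pairing] = [
            story_id
            for story_id, paragraphs in stories.items()
            if sum(1 for p in paragraphs if a in p and b in p) >= min_paragraphs
        ]
    # Assumed balancing strategy: downsample to the rarest pairing.
    per_pairing = min(len(ids) for ids in eligible.values())
    instances = []
    for pairing in pairings:
        for story_id in rng.sample(eligible[pairing], per_pairing):
            instances.append((story_id, pairing))
    return instances

toy_stories = {
    "s1": [{"Harry", "Draco"}] * 5,
    "s2": [{"Harry", "Draco"}] * 6,
    "s3": [{"Hermione", "Ron"}] * 5,
    "s4": [{"Hermione", "Ron"}] * 4,
}
toy_pairings = [("Draco", "Harry"), ("Hermione", "Ron")]
toy_instances = sample_balanced_instances(toy_stories, toy_pairings)
```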
Interaction on Tumblr
● How does the social positioning of self affect interaction on social media?
● Primary form of interaction on Tumblr: "reblogging" [Xu et al. 2014]
● Reblogging as content propagation; most studies use only content and network features to predict content propagation [Naveed et al. 2011, Zhang et al. 2016, Vosoughi et al. 2018]
86
Identity category annotation
87
Prediction tasks
88
● Canon: does the relationship match canon in being romantic/not romantic?
● Auxiliary tasks to test whether the model is simply capturing something else
○ Romantic: is the relationship romantic?
○ M/M: is the relationship between 2 males? (Regardless of whether it's romantic.)