mining the social web - pdfs.semanticscholar.org€¦ · personalized web search - personalized...

45
Mining the Social Web Asmelash Teka Hadgu [email protected] L3S Research Center May 02, 2017

Upload: others

Post on 02-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Mining the Social Web

Asmelash Teka [email protected]

L3S Research CenterMay 02, 2017

Page 2: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Outline

2

Methods & Applications: Selected papers

Hands On: Doing Web Science

Introduction: The big picture

Page 3: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Definition from Web Science Conference1

Web Science is the emergent science of the people, organizations, applications, and of policies that shape and are shaped by the Web

Studying the online world to understand the offline world

3

What is Web Science?

http://websci16.org

Page 4: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Web 2.0 describes World Wide Web sites that emphasize user-generated content, usability, and interoperability.

The social web is a set of social relations that link people through the World Wide Web.

4

The Social Web

https://makeawebsitehub.com/social-media-sites

Page 5: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

5

Application Domains

Page 6: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

● User Identification● Network Analysis● Content Analysis● Privacy Issues

6

Methods and Techniques

Page 7: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Problem: How can we automatically construct user profiles?

7

User Identification

Page 8: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Authoritative users extraction - Discovering expert users for a target topic.

Personalized web search - Personalized social media posts retrieval.

User recommendation - Suggesting new interesting users to a target user.

Preprocessing step for many interesting downstream analysis.

8

User Identification Applications

Page 9: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Challenges:

Labels - Generating labelled data to train Models

Feature Engineering - What are good features

Modeling - Mostly off the shelf Predictive models

9

Machine Learning for User Identification [1,2,3]

Page 10: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Baseline

Bag-of-words (BOW) model with weighting (e.g., TF-IDF)

Each user is represented as a bag of the words in their tweets.

These words are then used as features to train a classifier.

10

Feature Engineering

Page 11: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Feature engineering is key in a machine learning task.

● Profile features● Tweeting behaviour features● Linguistic content● Social network features

11

Feature Engineering

Page 12: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Ideally Profile information should be sufficient. In reality it does not contain enough quality information to be directly used for user classification.

- Length of name, Number of alphanumeric chars.- Capitalization forms in user name- Use of avatar picture- Number of followers, friends- Regular expression matches in bio:

(I |i)(m|am| m|[0 − 9] + (yo|yearold)man|woman|boy |girl

12

Profile Features

Page 13: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

A set of statistics capturing the way users interact with the micro-blogging service.

- Number of tweets of a user.- Number and fraction of retweets of a user.- Ave. number of hashtags and URLs per tweet.- Ave. time and std between tweets.

13

Tweeting Behaviour Features

Page 14: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Linguistic content contains the user’s lexical usage and the main topics of interest to the user.

- List of socio-linguistic words such as emoticons, ellipses etc.- Prototypical words, hashtags etc. Given n classes, each class ci containing

seed users Si . Each word w from seed users is assigned a score:

(1)

where |w, Si| is the number of times the word w is issued by all users for class ci .

- Latent Dirichlet Allocation (LDA) to generate topics used as features for classification.

- Sentiment words14

Linguistic Content Features

Page 15: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

These features contain the social connections between a user and those one follows, replies to or whose messages they retweet.

Prototypical ‘friend’ accounts are generated by exploring the social network of users in the training set by bootstrapping as in Eq. (1)

- Number of prototypical friends, percentage number of prototypical friend - Prototypical replied users, Prototypical retweeted users

15

Social Network Features

Page 16: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Manually building ground truth data and validating result not easy.

● Combine multiple data sources. e.g. scraping websites that contain users in each class.

● Using key terms in profile of users. Prone to systematic bias and care should be taken on features.

● Using crowdsourcing services such as Amazonmechanical turk1 and CrowdFlower2

16

Ground-Truth and Validation

1. https://www.mturk.com 2. http://www.crowdflower.com/

Page 17: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Identifying Computer Science researchers on Twitter

17

A Complete Pipeline

Page 18: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Goal: study how social media shapes the networked public sphere and facilitate communication between communities with different political orientations.

18

Network Analysis [4]

Framework:

● Data gathering● Identifying political content● Political communication networks● Network analysis

Page 19: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Identifying Political Content Political communication - any tweet containing at least one politically relevant hashtag.

- Political hashtags bootstrapped from seed hashtags (e.g. #p2 and #tcot) using co-occurrence patterns.

- Let S be set of tweets containing seed hashtag and T set of tweets containing another hashtag. Jaccard coefficient between S and T :

if σ is big the two hashtags are deemed to be related.

19

Identifying Political Content

Page 20: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Political communication networks are generated from the tweets containing politically relevant hashtags where nodes represent users.

- In the retweet network an edge runs from a node A to a node B if B retweets content originally broadcast by A.

- In the mention network an edge runs from A to B if A mentions B in a tweet.

20

Communication Networks

Page 21: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

To establish structure, community detection using a label propagation method for two communities

● Label propagation - Assign an initial arbitrary cluster membership to each node and then iteratively update each node’s label according to the label that is shared by most of its neighbors.

● Modularity to measure segregation. Retweet and mention networks have values of 0.48 and 0.17, respectively.

21

Cluster Analysis

Page 22: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Political retweet network (left) and mention network(right)

22

Polarized Clusters

Page 23: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Do clusters have similar content?

● Associate each user with a profile vector of hashtags in their tweets, weighted by frequency.

● Cosine similarity among users.

23

Content Homogeneity

clusters Retweet Mention

A ↔A 0.31 0.31

B ↔B 0.20 0.22

A ↔B 0.13 0.26

Page 24: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Do clusters in the retweet network correspond to groups of users of similar political alignment?

Qualitative content analysis from social science. One author annotates 1,000 random users as ‘left’ or ‘right’. Another user annotates 200 random users from the 1,000 users above.

Inter annotator agreement measured using Cohen’s Kappa

where P(α) is observed rate of agreement between annotators and P( ε) is expected rate of random agreement given relative

24

Political Polarization

Page 25: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

To establish structure, community detection using a label propagation method for two communities

● Label propagation - Assign an initial arbitrary cluster membership to each node and then iteratively update each node’s label according to the label that is shared by most of its neighbors.

● Modularity to measure segregation. Retweet and mention networks have values of 0.48 and 0.17, respectively.

25

Cluster Analysis

Page 26: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Political Hashtag Trends, PHT, an analysis tool for political polarization of Twitter hashtags in the US

26

Content Analysis [5]

https://www.politicalhashtagtrends.sandbox.yahoo.com

Page 27: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Start with a set of seed political users such as @BarackObama and @MittRomney whose political leaning is known.

Collect users that retweet seed users’ tweets.

Get their tweets using Twitter REST API1

27

Data Set

https://dev.twitter.com/rest/reference/get/statuses/user_timeline

Page 28: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Named entity recognition to limit analysis to the U.S.

28

Location Filtering

Page 29: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Dataset validation against Web Directories

29

Evaluating Data Quality

Precision = 0.98, 0.93 for Wefollow and Twellow respectively.

Manual inspection for disagreements:

“greatest environmentalist. Also, despise republicans”

Page 30: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Look into co-occurrence with seed political hashtags (#p2, #tcot,#gop, #ows) and (‘obama’, ‘romney’, ‘politic’,‘liberal’, ‘conservative’, ‘democ’, or ‘republic’)

Volume filtering to avoid rare noisy hashtags.

30

Detecting Political Hashtags

Page 31: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Trending - currently popular. Having a higher volume than expected.

Examples:

#obamagotosama: 01 May 2011 to 08 May 2011.

#ows: 25 Sep. 2011 to 2 Oct. 2011.

Non-trending hashtags: #vote, #democracy.

31

Identifying Trending Hashtags

Page 32: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Leaning - tendency to prefer or agree with a particular set of beliefs, opinions etc.

Using Voting approach: Let vL denote the aggregated user volume of h in w for the left leaning. Let VL denote the total left user volume of all hashtags in w. Similarly for vR and VR .

where a leaning of 1.0 is fully left and 0.0 is fully right.

32

Assigning Leaning to Hashtags

Page 33: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Who wants to get fired?

33

Privacy Issues[6]

http://fireme.l3s.uni-hannover.de/fireme.php

Page 34: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Problem Definition: Study posting behaviour of individuals who tweet they hate their job or boss.

Observation:

Many users are not aware of their audience.

Build an alerting system to address the danger of public online data.

34

Privacy Issues

Page 35: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Collect ‘haters’ - users mentioning sentences like: ‘I hate my job’, ‘I hate my boss’, ‘I hate the worst job’ . . .

Collect ‘lovers’ - Users posting positive messages like: ‘I love my job’, ‘my bosss is the best’.

Crawled users for one week, June 18 - June 26, 2012. 21,851

haters and 44,710 lovers.

Randomly select 10,000 users from each group. Get the latest 200 tweets of each user.

35

Data Set

Page 36: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Characterising groups:

● lovers are more connected. Three times as many followers, and 20% more friends.

● haters are more active in terms of tweeting speed - twice as many tweets per day.

● haters link to their facebook profile in their bio.

36

Data Analysis

Page 37: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Shows user, her hate tweet and FireMeter! Score

FireMeter! score: fraction of hate tweets in the latest 100 tweets of a user 37

FireMe! Features

FireMe! Provides a leaderboard

Page 38: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Sent out over 4000 alert messages to haters. ‘What are you going to do about it?’

Feedback:

42% delete hate tweet

18% change privacy setting

40% don’t care

Overall 60% users concerned about their personal data.

38

Alert Impact

Page 39: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Majority of users deleted the compromising tweets.

Tone of alert messages (neutral, aggressive, action) is important.

39

Validation and more analysis...

Page 40: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Most popular social Media websites have Officially supported APIs.

Mining the Social Web (2nd edition)1 an excellent freely available book

40

Official APIs

1. https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition

Page 41: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Wikipedia / DBPedia

http://wiki.dbpedia.org/downloads-2016-04

News from GDELT

http://www.gdeltproject.org/data.html#rawdatafiles

DBLP Monthly snapshots

http://dblp.dagstuhl.de/xml/release/ 41

Data dumps

Page 42: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

- In unexpected events or crisis situation, social media sites such as Twitter are used for informal communication

- Leveraging social media to support help organizations by filtering relevant content, marking rumors etc. in real time.

In the K3 project, the L3S team is responsible for social media analysis e.g. for information filtering and for identifying rumors.

- content analysis

42

K3: Social Media for Crisis Communication

http://www.independent.co.uk/news/world/europe/berlin-attack-latest-tunisian-man-arrested-christmas-market-anis-amri-a7498811.html

Page 43: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

Dashboard of social feeds about flood in Germany in June 201343

Sover! The Social Web Observer

Page 44: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

References

1. Pennacchiotti, Marco, and Ana-Maria Popescu. "Democrats, republicans and starbucks afficionados: user classification in twitter." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.

2. Rao, Delip, et al. "Classifying latent user attributes in twitter." Proceedings of the 2nd international workshop on Search and mining user-generated contents. ACM, 2010.

3. Hadgu, Asmelash Teka, and Robert Jäschke. "Identifying and analyzing researchers on twitter." Proceedings of the 2014 ACM conference on Web science. ACM, 2014.

4. Conover, Michael, et al. "Political Polarization on Twitter." ICWSM 133 (2011): 89-96.5. Weber, Ingmar, Venkata Rama Kiran Garimella, and Asmelash Teka. "Political hashtag trends." European Conference on Information

Retrieval. Springer Berlin Heidelberg, 2013.6. Kawase, Ricardo, et al. "Who wants to get fired?." Proceedings of the 5th Annual ACM Web Science Conference. ACM, 2013.

44

Page 45: Mining the Social Web - pdfs.semanticscholar.org€¦ · Personalized web search - Personalized social media posts retrieval. ... Linguistic content contains the user’s lexical

● Current State-of-the-art methods for classification are based on deep learning.

● Key difference is in representation learning - instead of hand crafting features, layers of neural networks, learn representations (features) from raw inputs.

● Bag of words -> words in semantic spaces e.g., word2vec

45

Deep Learning