University of California Santa Barbara
Department of Computer Science
Master Degree in Computer Science
Master Project Report
An Analysis of Credibility in Microblogs
Author
Byungkyu Kang
Perm. 4985065
Committee in charge
Tobias Höllerer, Chair
Matthew Turk, Xifeng Yan and John O’Donovan
June 7th 2012
Abstract
User-generated microblog content is being produced at an ever-increasing rate, making
it difficult to distinguish credible and newsworthy information from a
mass of “background noise”. In contrast to the information flow paradigm of traditional
media, which has a few-to-many relation between information producers and consumers,
microblogs support a many-to-many flow paradigm, which raises the challenge of as-
sessing the credibility of information providers based on a relatively small window of
information.
To address this challenge, we present an assessment of information flow, credibility
and provenance in Twitter, through a set of two different experiments. We first provide
credibility assessment models (social, content and hybrid models) derived from sampled
Twitter datasets and evaluate them in terms of predictive accuracy on a set of crowd-
sourced credibility ratings on crawled tweets. Next, we present an in-depth analysis on
the utility and distribution of predictive features across a diverse set of Twitter contexts.
Results of the first experiment show that our model based on social features (e.g.,
the ratio of friends to followers) outperforms the content-based and hybrid models, with
88% accuracy in predicting tweets assessed as credible, using a J48 decision tree algorithm.
Our second experiment reveals higher feature usage in microblogs commenting on unrest
or emergency situations, and differing feature distributions across various other topics.
Table of Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Downside of User-generated Content . . . . . . . . . . . . . . . . . . . . 1
1.3 Project Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Organization of This Report . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Related Work 5
2.1 Credibility of Information Sources . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Credibility and Trust on the Web . . . . . . . . . . . . . . . . . . 6
2.1.2 Information Credibility on Twitter . . . . . . . . . . . . . . . . . 7
3 Twitter 11
3.1 Twitter Jargon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Common Twitter Terms . . . . . . . . . . . . . . . . . . . . . 12
3.1.2 Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Follower and Followee . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Usage of Twitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Noise Data in Twitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Experiment I: Credibility 17
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Credibility Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.1 Definition of Credibility . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.2 Modeling Credibility . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.1 Data Mining and Preliminary Analysis . . . . . . . . . . . . . . . 23
4.4 User Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.5.1 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.5.2 Predicting Credibility . . . . . . . . . . . . . . . . . . . . . . . . 31
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Experiment II: Feature Analytics 37
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2 Data Gathering and Processing . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.1 Online Evaluation of Credibility . . . . . . . . . . . . . . . . . . 39
5.2.2 Retweet Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.3 Dyadic Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Feature Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.4 Feature Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4.1 Credibility Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4.2 Analysis across Multiple Corpora . . . . . . . . . . . . . . . . . . 45
5.4.3 Feature Distribution in Retweet Chains . . . . . . . . . . . . . . 48
5.4.4 Feature Distribution in Dyadic Pairs . . . . . . . . . . . . . . . . 49
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6 Conclusion 51
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Bibliography iii
List of Figures
3.1 Distribution of content in Twitter (excerpt from [1] and mashable.com
infographic, 2010) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Depiction of conceptual graphs of following action and information flow
in Twitter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1 Illustration of the scope of crawled data for each of 7 current topics. . . 23
4.2 Word cloud showing origin of tweets in the Libya data set. . . . . . . . . 25
4.3 Word cloud showing distribution of popular terms in the Libya data set.
(This visualization is generated using Wordle, http://www.wordle.net/) 26
4.4 Three graphs showing (a) gender distribution, (b) familiarity with mi-
croblogging services and (c) age distribution of the participants of the
online user survey in this experiment. (from left to right, respectively) . . 27
4.5 Plot showing perceived credibility in each of four Twitter contexts (Neg-
ative, Positive, True, Null). . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Friend to Follower patterns across four of our topic-specific data sets. All
other sets exhibited a similar distribution. . . . . . . . . . . . . . . . . . 30
4.7 Histogram of average credibility rating from the online survey across num-
ber of followers for the tweeters (binned). . . . . . . . . . . . . . . . . . 31
4.8 Plot showing a sample of feature distributions for the content-based model
on 3000 labeled tweets from the Libya dataset. . . . . . . . . . . . . . . . 32
4.9 Comparisons of each feature used in computing the social credibility
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.10 Comparisons of link and no-link contexts across four different groups in
the online user survey. . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.11 Comparison of predictive accuracy over the manually collected credibility
ratings for each model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.12 Statistical results from a J48 tree learner for the best performing credi-
bility prediction strategy (Feature Hybrid) . . . . . . . . . . . . . . . . 35
5.1 Data crawling procedure used for gathering 8 datasets from the Twitter
social graph, consisting of over 20 million users and 200 million tweets in
total. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Analysis of the distribution of selected content-based features based on
two credibility contexts. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3 Analysis of the distribution of selected content-based features across a
diverse set of Twitter topics. . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Two different charts showing the average occurrence score of (a) each
feature across all dataset and (b) all features per topic. . . . . . . . . . . 47
5.5 Distribution of predictive features in three contexts formed around length
of retweet chains. a) Non-retweeted content b) Short chains (1-3 hops)
c) Long chains (≥4 hops). . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.6 Distribution of predictive features dyadic pairs. Tweets were selected for
this group if they occurred in a pairwise conversation between two users
in which more than two messages were exchanged, as measured by the
mention and retweet metadata in Twitter API. . . . . . . . . . . . . . . 49
List of Tables
3.1 Common technical abbreviations on Twitter . . . . . . . . . . . . . . . . 13
4.1 Three predictive models for identifying credible information from Kang
et al. [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Overview of 7 topic-specific data collections mined from both the Twitter
REST and streaming APIs. . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1 Context classes in our Evaluation. . . . . . . . . . . . . . . . . . . . . . 38
5.2 Overview of 7 topic-specific data collections mined from the Twitter
streaming API. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3 Examples of 8 topic-specific tweets from the crawled sets. . . . . . . . . 40
5.4 The set of Twitter features analyzed in our evaluation. Features are
grouped into three classes: a) Social, b) Content-based and c) Behavioral.
Each is a representative subset for the larger feature sets in each class. . 42
Chapter 1
Introduction
1.1 Background
The recent success of social networks such as Twitter1, LinkedIn2 and Facebook3
has accelerated the massive adoption of user-generated content (UGC); consequently,
end users on the web have become a major source of information, providing news, media
production, blogging and forum comments, for example. Information flow has shifted
from a few dedicated professionals to a set of distributed general end users on the
web. As the number of unverified information sources increases, the growing uncertainty
about the information quality (or credibility) of sources means that we need better ways
to filter out non-credible data for consumers.
1.2 Downside of User-generated Content
The sudden increase of user-generated information brings a new challenge for today's
web application designers: how to detect trustworthy and accurate information. For
example, one can easily expect more than half of the results of an arbitrary web search
to contain user-generated content. Pieces of content from Wikipedia, Blogger, Blogspot,
Facebook, Twitter and floods of other services are good examples of this. Therefore,
the untold amount of user-generated content [3]4 yields
1http://www.twitter.com
2http://www.linkedin.com
3http://www.facebook.com
4The volume of activity on Twitter has increased at an extraordinary rate, to an average of 140
million tweets per day as of March 2011, according to the TwitterBlog [3].
the need to efficiently analyze and immediately understand big chunks of social data,
and find trustworthy information sources amongst them. Without a smart strategy for
filtering out unnecessary data scattered on the web, there would be too much noisy data
for end users to manually filter. Moreover, information seekers become more vulnerable
to spam or malicious information generated by anonymous users.
Thus, this project tackles the two problem statements below:
Problem Statement 1 How well can we assess the credibility of standalone content
in microblogs?
Problem Statement 2 Can we leverage the social graph to improve credibility assess-
ment of microblog content?
From a macroscopic perspective, we study the question “What types of features and
structures in microblogs can we utilize in order to evaluate the credibility of information
sources?” This question is applied throughout the studies covered in this project. However,
we also scrutinize two different scientific approaches: (1) social and content-based credibility
modeling in topic-specific settings, and (2) in-depth analysis of feature clusters in microblogs.
1.3 Project Overview
This report presents two different experiments focusing on information credibility.
Each consists of an individual experimental procedure and an evaluation of one of the
approaches discussed above.
1: Modeling Topic-specific Credibility in Twitter In this study, we assumed
that both social structural metadata and raw content can be used as important features
for deriving quantitative credibility scores. Our initial expectation was that the hybrid
model, which combines social and content features, would outperform the two individual
models in predicting credibility scores. Interestingly, however, our experiment with a
real-world Twitter dataset shows that the social model performs best, followed by the
hybrid and content models respectively. By carefully observing candidate features
obtained from the data crawler, 5 social features and 19 content features were chosen
for the exploratory data analysis. As a result, the social model predicted credibility
at the tweet level with 88.17% accuracy, achieved using the J48 (C4.5) decision tree
classification algorithm.
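To make the social model concrete, the sketch below derives a few social features from hypothetical profile metadata and applies a single hand-written, decision-tree-style split. This is only an illustration of the idea: the actual experiment learned a full J48 (C4.5) tree over 5 social features, and the field names and thresholds here are invented.

```python
# Illustrative sketch, NOT the report's trained J48 model: derive simple
# social features from a profile dict and apply one decision-stump split.
# Keys (followers, friends, statuses, account_age_days) are hypothetical
# names, not exact Twitter API fields; thresholds are made up.

def social_features(profile):
    """Derive simple social features from a user-profile dict."""
    followers = profile["followers"]
    friends = profile["friends"]
    return {
        "friend_follower_ratio": friends / max(followers, 1),
        "followers": followers,
        "tweets_per_day": profile["statuses"] / max(profile["account_age_days"], 1),
    }

def predict_credible(feats, ratio_threshold=2.0, min_followers=50):
    """Toy one-level rule: accounts following far more users than follow
    them back, or with very few followers, are flagged as less credible."""
    if feats["friend_follower_ratio"] > ratio_threshold:
        return False
    return feats["followers"] >= min_followers

feats = social_features(
    {"followers": 1200, "friends": 300, "statuses": 5000, "account_age_days": 400}
)
print(predict_credible(feats))  # a well-followed account passes the rule
```

A learned tree simply stacks many such threshold splits, choosing features and cut points by information gain rather than by hand.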
2: Context-based Analysis of Feature Utility As the second phase of our cred-
ibility exploration, we focused on a detailed analysis of feature distribution in mi-
croblogs. We scrutinized more than a hundred features categorized into three different
contexts: credibility, chain length, and dyadic interaction. We also conducted a compar-
ison of feature distributions across 8 different topic-specific tweet datasets. The result
of this analysis shows that the frequency of feature occurrences in general tends to
increase in social emergency situations such as natural disasters (#earthquake) or civil
revolutions (#libya or #egypt). Momentary social events (#superbowl) or national
elections (#romney) also demonstrate increased usage of external information sources
such as URLs or mentions of news archives.
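Counting how often such features occur per topic can be sketched with simple regular expressions. The sample tweets and patterns below are illustrative only; the actual analysis covered over a hundred features across the 8 crawled datasets.

```python
import re
from collections import Counter

# Minimal sketch of feature-occurrence counting: tally URLs, @-mentions
# and hashtags in each topic-specific tweet set. Sample tweets invented.

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")

def feature_counts(tweets):
    counts = Counter()
    for t in tweets:
        counts["url"] += len(URL.findall(t))
        counts["mention"] += len(MENTION.findall(t))
        counts["hashtag"] += len(HASHTAG.findall(t))
    return counts

corpora = {
    "#earthquake": ["Magnitude 6 reported http://example.com #earthquake",
                    "RT @newswire: aftershocks continue #earthquake"],
    "#superbowl": ["What a game! #superbowl"],
}
for topic, tweets in corpora.items():
    print(topic, dict(feature_counts(tweets)))
```

Normalizing these raw tallies by the number of tweets per corpus gives the per-topic occurrence scores compared in the second experiment.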
1.4 Organization of This Report
This report is organized into six parts. In Chapters 1 and 2, we introduce our work and
outline relevant literature, discussing a number of different approaches in the field of
information credibility. Next, in Chapter 3, we explain the general characteristics of
Twitter, one of the leading social network microblogging services. To be specific, we
discuss Twitter from two different perspectives: its directed-graph structure, including
newly introduced features such as lists, and its rapidly growing public popularity, with
some real-world practices via third-party applications. Next, we present two individual
experiments in Chapters 4 and 5, including their experimental setups. Both experiments
rely on the Python-based data crawler we designed. We also describe the refined dataset
for each project. We particularly detail the experimental setup of the first project,
followed by a report on several self-evaluation tests as well as the user survey that we
conducted, since the original dataset from the first project is used in the next project.
Finally, in Chapter 6, we conclude with an overall discussion of our work, its limitations,
and avenues for future work.
Chapter 2
Related Work
Recently, microblogging services have become one of the most rapidly emerging web
technologies, and their popularity is clear from usage statistics along with the abun-
dance of different applications. For example, the vast majority of major news companies,
such as the New York Times and the Guardian, have equipped every page of
their websites with social plugin buttons linked to Facebook, Twitter and Google Plus.
Since microblogging services are built on top of a social network structure, any content
produced by any user can be easily shared and disseminated through socially connected
edges, potentially coming from remote strangers. Furthermore, because of their short
message lengths (e.g. 140 characters for Twitter), they are mobile-device friendly and
have many third-party applications that provide various types of services.
The main area of research related to our work is the assessment of information
credibility in microblogs [2, 4–6]. In this chapter, we give an overview of existing credi-
bility assessment systems. According to the literature, credibility can be measured by
a number of mathematical models based on the content of information or the social
status of users [2, 4].
2.1 Credibility of Information Sources
Credibility of an information source is important when it comes to sharing infor-
mation with other people. The use of microblogging in business, research
and politics continues to increase rapidly, and therefore the importance of accuracy and
trustworthiness of such information cannot be overstated. Since Twitter was pub-
licly released in August 2006 as a free social networking service, a large body of
research has focused on evaluating the credibility of content or its producers. In this
discussion of related research, we focus on a few of the most relevant works. Our analysis
falls into three broad categories: research in the general area of trust and credibility on
the web, research in the micro-blog domain, and finally an overview of relevant works
from the field of recommender systems, particularly those works that have guided the
models and methods presented in this project report.
2.1.1 Credibility and Trust on the Web
A good overview of research on trust on the social web is presented by Artz et al. [7],
which encompasses more than a hundred related articles. They adopt two different cat-
egories from Olmedilla et al. [8] for measuring trust in information: policy-based trust
and reputation-based trust. In addition to these two categories, general trust models
and trust in information resources are discussed. A number of recent studies on in-
formation credibility adopt reputation-based methods in social network and microblog
settings, such as [2, 10, 33, 81]. Here, a policy is expressed as a trust or credibility model
comprised of existing social metadata, such as the number of followers and followees or
the age of an account in Twitter, underlying the social network structure. Like the litera-
ture mentioned above [2, 10, 33, 81], our approach to measuring credibility falls into the
category of trust in information resources.
Another interesting study by Fogg et al. [9], on which elements on the web affect
people’s perception of credibility, is worth mentioning here. They conducted a large-
scale user survey with over 1,400 participants from the U.S. and Europe, evaluating 51
different web site elements. They identified five types of elements which increase people’s
credibility perceptions–‘real-world feel’, ‘ease of use’, ‘expertise’, ‘trustworthiness’, and
‘tailoring’–and two types of elements which damage credibility–‘commercial implica-
tions’ and ‘amateurism’. A few of these elements seem applicable to Twitter-like
microblogging settings, namely ‘expertise’, ‘trustworthiness’, ‘commercial implications’
and ‘amateurism’, which have nothing to do with the layout or design of content. This is
because microblogging content is mostly viewed as a simple list of short messages
without any intricate or detailed design.
Policy-based Trust In a policy-based trust system, a particular set of conditions
is necessary to obtain trust. When these conditions are met, one can have expected
outcomes from the system. In this case, the policies frequently involve the exchange
or verification of credentials, which are pieces of information issued by one entity. Here, a
credential generally contains a description of the qualities, features or credibility of another
entity. For example, the credential of a certification means its holder has been verified
by the issuing institution as having a certain ability or education.
This process connects the holder with the issuer of that credential. Such a policy-based
system can be used to verify the credibility or quality of an individual entity even if another
entity has no interaction with or reference to that entity (the credential holder). Since the
credential issuer (or a third party) is trusted in a network or domain, it can perform
the required verification process as an authority on behalf of the credential holder [7].
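The issuer–holder–verifier relationship above can be sketched in a few lines. This is a deliberately simplified toy: the issuer name, claim format, and shared-key HMAC scheme are illustrative assumptions, whereas real credential systems use public-key certificates.

```python
import hashlib
import hmac

# Toy policy-based trust: a trusted issuer signs a credential describing
# a holder; any verifier that trusts the issuer can check the credential
# without having interacted with the holder. Shared-key HMAC stands in
# for a real public-key signature purely for illustration.

ISSUER_KEY = b"issuer-secret"  # in practice, the issuer's private signing key

def issue_credential(holder, quality):
    claim = f"{holder}:{quality}".encode()
    sig = hmac.new(ISSUER_KEY, claim, hashlib.sha256).hexdigest()
    return claim, sig

def verify_credential(claim, sig):
    expected = hmac.new(ISSUER_KEY, claim, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

claim, sig = issue_credential("alice", "certified-journalist")
print(verify_credential(claim, sig))          # a valid credential verifies
print(verify_credential(b"bob:expert", sig))  # a tampered claim does not
```

The key point mirrors the prose: trust in the issuer, not in the holder, is what makes the verification meaningful.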
Reputation-based Trust Reputation can be assessed in a network or a social com-
munity based on the history of interactions with or observations of an entity, either
directly by the evaluator (personal experience) or as reported by others (recommen-
dations or third-party verification). In the latter case, it is similar to policy-based
trust due to the involvement of a third-party verification process. However, the type of
history and the mode of interaction or observation vary depending on a variety of factors.
In a social network system, for example, other entities can directly evaluate a specific
entity through a quantitative rating system. Moreover, these ratings can be time-variant,
or simply accumulate regardless of the time passed. Recursive problems of trust can
also occur when referencing information from others. Although both credentials and
reputation involve the transfer of trust from one entity to another, each technique has its
distinct problems, which have stimulated the existing research in trust. There is a large
body of literature on reputation-based trust and credibility analysis, such as [10–12] (an
empirical analysis of eBay’s reputation system) and [7] (reputation-based analysis on
the semantic web).
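The time-variant versus accumulative distinction above can be made concrete with a small aggregation sketch. This is not any cited system's formula; the exponential half-life decay and the sample ratings are illustrative assumptions.

```python
import math

# Minimal reputation aggregation sketch: combine ratings of an entity
# into one score. With half_life_days=None the score is a plain average
# (accumulative regardless of time); otherwise each rating's weight
# decays exponentially with its age (the time-variant case).

def reputation(ratings, now, half_life_days=None):
    """ratings: list of (value, timestamp_in_days) pairs."""
    if not ratings:
        return 0.0
    if half_life_days is None:
        return sum(v for v, _ in ratings) / len(ratings)
    weights = [math.exp(-math.log(2) * (now - t) / half_life_days)
               for _, t in ratings]
    return sum(v * w for (v, _), w in zip(ratings, weights)) / sum(weights)

ratings = [(5.0, 0), (1.0, 90)]  # an old 5-star and a recent 1-star rating
print(reputation(ratings, now=100))                    # plain average: 3.0
print(reputation(ratings, now=100, half_life_days=30)) # recent rating dominates
```

Under decay, the recent negative rating pulls the score well below the plain average, capturing why time-variant reputation reacts faster to behavior changes.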
2.1.2 Information Credibility on Twitter
Research on information credibility in Twitter has been popular, since most of the
information in Twitter, such as tweets and user profiles, is public and the usage of
the service is explosive [citation for statistics]. For instance, from Kochen and Poole’s
experiments [13] and Milgram’s famous small-worlds experiment [14], trust has been
shown to play an important role in the social dynamics of a network. With social web
APIs, researchers now have many orders of magnitude more data at their fingertips, and
they can experiment with and evaluate new concepts far more easily. This is evident across
a variety of fields, for example, social web search [10], semantic web [7, 15, 16], online
auctions [11,12,17], personality and behavior prediction [18,19], political predictions [20]
and many others.
Recently, a large number of research efforts [21–23] have focused on Twitter, particularly
on the perception of credibility in microblogs [24], its retweeting mechanism [25, 26], or
distinguishing elite users–celebrities, bloggers and representatives of media out-
lets–from ordinary users by exploiting the recently introduced ‘list’ feature of Twitter [27].
Other research on Twitter at the community level has also been conducted [28–30]. In
addition, Caverlee et al. [31] analyzed credibility-based links to identify web-spam-
ming links in Twitter. Moreover, Canini et al. [4] used a Bayesian approach to
evaluate tweet features, in contrast to Sridar et al. [32], who focus on non-Bayesian
models for learning beliefs derived from social influence weights.
Measurement of credibility from content Several studies have conducted credibility
assessments on various information sources. For example, Wei et al. [33] evaluated
information credibility on an e-learning social network, and Suzuki et al. [34] conducted
an experiment to assess the credibility scores of content on social network services such as
Facebook1 and LinkedIn2, using Wikipedia for messages. Agichtein et al. [35] also pro-
posed a general classification framework for finding high-quality user-generated
content (UGC) in social media, focusing on ‘Yahoo! Answers3’. Juffinger et al. [36] de-
veloped an algorithm to evaluate blog-level (blog-unit) credibility scores based on com-
parison to a set of blogs, analyzing the blog entries and author profile information of each
blog. For assessing information credibility in microblogs, specifically Twitter, [2, 4–6]
proposed credibility models exploiting a wealth of social and content features via the
Twitter API.
1http://www.facebook.com/
2http://www.linkedin.com/
3http://answers.yahoo.com/
Real-time event detection (social sensors) A number of studies have focused on
the role of social media in detecting major social events in real time. For example,
Lotan et al. [37] conducted an empirical study on the networked production and dis-
semination of news on Twitter during snapshots of the 2011 Tunisian and Egyptian revo-
lutions, differentiating between authors’ user types and analyzing patterns of sourcing
and routing information among them. Their analytical results are followed by a discus-
sion of how social network and microblogging services such as Twitter play a key role
in amplifying and spreading timely information across the world. In addition to this
literature, many other papers [38–41] study the role of Twitter as a social sensor, and
a discussion of all these works is beyond the scope of this project report.
Chapter 3
Twitter
In this project, we focus on Twitter for the two different experiments in Chapters 4
and 5 to evaluate our predictive models. Twitter is currently one of the most popular
social network services on the web [21]. Most social network services, such as Facebook,
support only a ‘bi-directional’ relationship between two users: the relationship between
two persons can be established only once both entities accept each other as friends. In
Twitter, however, a single user can complete a relationship, and this action is called
‘following’. Once a user follows another user or account, that particular simplex
(uni-directional) channel becomes activated, and the user receives all the content
published by the account at the opposite end (the one being followed). This mechanism
is identical to the act of subscribing to a conventional online publisher, and is
illustrated in Figure 3.2.
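The uni-directional structure just described can be sketched as a directed graph of follower-to-followee edges, with tweets flowing in the opposite direction. The class and account names below are illustrative, not part of any Twitter API.

```python
# Small sketch of Twitter's uni-directional 'following' structure as a
# directed graph: edges point follower -> followee, while tweets flow
# the other way, from an author to everyone following that author.

class TwitterGraph:
    def __init__(self):
        self.follows = {}  # user -> set of accounts that user follows

    def follow(self, follower, followee):
        # One-sided: no confirmation from the followee is needed.
        self.follows.setdefault(follower, set()).add(followee)

    def followers_of(self, user):
        """Accounts who receive tweets published by `user`."""
        return {u for u, fs in self.follows.items() if user in fs}

    def timeline_sources(self, user):
        """Accounts whose tweets this user receives (their followees)."""
        return self.follows.get(user, set())

g = TwitterGraph()
g.follow("alice", "bbcnews")
g.follow("bob", "bbcnews")
print(g.followers_of("bbcnews"))      # a tweet by bbcnews reaches these users
print(g.timeline_sources("alice"))    # alice subscribes to these accounts
```

Note the asymmetry: bbcnews gains followers without following anyone back, which is exactly the subscription semantics described above.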
Figure 3.1: Distribution of content in Twitter (excerpt from [1] and mashable.com infographic, 2010)
3.1 Twitter Jargon
When Twitter opened its service to the public in 2006, it was a simple microblogging
service which only served tweet postings of 140 characters between users. Popular terms
such as “retweet” or “mention” did not exist from the beginning. The majority of users
only used Twitter as a simple discussion channel or personal journal. People scribbled
about their ordinary lives or jotted down flashes of ideas on Twitter. Those sin-
gle lines of text were named tweets. However, as people discovered the convenience of
uni-directional relationships for sharing information, Twitter users coined the term
‘retweeting’ for an information-sharing mechanism which exploits Twitter’s social
network structure. In this perspective, Twitter can be viewed as a self-evolving
communication medium.
3.1.1 Common Twitter Terms
The definitions below can be found on the official Twitter web page.1
Tweets Any message of 140 characters or fewer posted to Twitter.
Mentions A Tweet containing another user’s Twitter username, preceded by the “@”
symbol, like this: Hello @NeonGolden! What’s up?
Replies A Tweet that begins with another user’s username and is in reply to one of
their Tweets, like this: @NeonGolden I can’t believe you thought that movie was cheesy!
Direct Messages (DM) A personal message sent directly to someone who follows
you or sent directly to you from someone you follow.
Retweet (RT) A retweet is a re-posting of someone else’s Tweet. Twitter’s retweet
feature helps you and others quickly share that Tweet with all of your followers. Some-
times people type “RT” at the beginning of a Tweet to indicate that they are re-posting
someone else’s content. This isn’t an official Twitter command or feature, but signifies
that one is quoting another user’s Tweet.
1Twitter Basics, https://support.twitter.com/groups/31-twitter-basics#topic 109
3.1.2 Abbreviations
We briefly summarize some of the most useful technical abbreviations on Twitter2 in
Table 3.1 below.
Abbreviation Description
MT Modified tweet. This means the tweet you’re looking at is a paraphrase of a tweet originally written by someone else.
RT Retweet. The tweet you’re looking at was forwarded to you by another user.
DM Direct message. A direct message is a message only you and the person who sent it can read. IMPORTANT: To DM someone, all you need to type is D username message.
PRT Partial retweet. The tweet you’re looking at is the truncated version of someone else’s tweet.
HT Hat tip. This is a way of attributing a link to another Twitter user.
CC Carbon copy. Works the same way as in email.
Table 3.1: Common technical abbreviations on Twitter
3.2 Follower and Followee
Figure 3.2: Depiction of conceptual graphs of following action and information flow in Twitter.
2Twitter Abbreviations You Need To Know, http://articles.businessinsider.com/2010-08-02/tech/30060587 1 tweet-abbreviations-twitter-user#ixzz1ffeSQHCM
From the perspective of social network structure, we can divide all entities in Twitter
into two categories: followers and followees. Interestingly, while the term ‘followee’
is used in a number of studies of Twitter, it is not an official term used by Twitter;
currently, ‘following’ is used on the official Twitter website. We
provide brief definitions of the two categories below. The following action and information
flow through connected edges can be seen in Figure 3.2 below.
Follower A user can follow other users on Twitter unless an account has enabled
private settings. This action can be understood as subscribing to another user’s tweets:
a new connection is established once a user follows the other, and counter-following
or confirmation by the followee is not required. This also means that
the following user receives every tweet from the accounts the user follows.
Followee Here, we use ‘followee’ instead of ‘following’, since it makes the structure
of the Twitter network easier to understand in terms of the relationships between users.
Suppose your account follows a handful of users on Twitter. These users become your
followees, and, at the same time, you are one of their followers. You can also have your
own followers, who subscribe to the tweets you publish on Twitter.
3.3 Usage of Twitter
Twitter has been used for a variety of purposes since the service was offered to the
public. Although the Twitter webpage only prompts users about daily personal
events, a number of different usages have been identified in previous literature. Java
et al. [30] and Naaman et al. [22] grouped Twitter usage into several categories such as
daily conversations, information sharing/seeking, self-promotion and so on. Westman
and Freund [42] conducted a genre analysis on tweets using co-occurrence percentages
for genre features and identified values for 5 out of 6 genre facets: who, what, where,
when, why, and how. As the number of accounts with a business purpose increases in
microblogging services, a few studies have started to look at this new research field. For
example, Popescu and Jain [43] embarked on an initial exploration of business Twitter
accounts, trying to identify business accounts as well as deals and events in Twitter.
3.4 Noise Data in Twitter
We discussed the transition in Twitter usage with its increasing popularity in the
previous section. As a result, we can witness ever more frequent spam, rumors and false
information on our Twitter timelines. For instance, during election season, a large num-
ber of rumors or biased opinions about a specific party or candidate arise within the
overall topic of elections [44, 45]. Moreover, critical events such as natural disasters cause
immediate responses online from a massive number of people living around the epicen-
ter of the event, seeking to convey the atmosphere of the real-time situation. In this
process, word-of-mouth news or malicious propaganda affects urgent information flow
alongside factual information sources. A good example of studies dealing with this
problem is Mendoza et al. [39]. They thoroughly investigated the behavior of Twit-
ter users during the 2010 earthquake in Chile. Although this is a post-hoc analysis of
an emergency situation, their time-variant analysis of propagation following information
dissemination shows that false rumors tend to be questioned much more, and their
dissemination speed follows a downward curve in contrast to that of confirmed truths.
Chapter 4
Experiment I: Credibility
Modeling Topic-specific Credibility in Twitter
As mentioned in previous chapters, three individual experiments were conducted to evaluate our predictive credibility assessment model and topic-similarity model, as well as our feature distribution analysis. In this chapter, we detail the motivation, experimental setup, evaluation and discussion of our study of topic-specific credibility in Twitter.
4.1 Motivation
The rationale for why the credibility of an information source matters for user-provided content has been discussed in the previous chapters, along with a number of related works. We now present our first experiment, which evaluates credibility based on three different models (the social, content and hybrid models), and its results. These models are developed from a prior feature analysis of tweet content and user profile information.
4.2 Credibility Models
This experiment focuses on two different sets of features, social features and content features, in order to model credibility in microblogs. In total, three credibility models, including a combination of both feature sets named the ‘hybrid model’, are introduced
in this section. Before providing detailed information of the credibility models, we first
present the definition of credibility that we use in this experiment.
4.2.1 Definition of Credibility
For the purpose of our discussion about information credibility, we first define two types of “credibility” within a specific topic of interest. The two definitions of credibility we discuss here are as follows:
Definition 1 Tweet-Level Credibility: A degree of believability that can be assigned to a tweet about a target topic, i.e. an indication that the tweet contains believable information.
Definition 2 Social Credibility: The expected believability imparted on a user as a result
of their standing in the social network with regard to a specific topic domain, based on
any and all available metadata.
Castillo et al. [5] also used a notion of credibility similar to our “tweet-level credibility”, except for the topic-level constraint that we impose. Tweet-level (content-level) and social-level credibility are complementary, so that an individual tweet bears the social credibility of its author, and vice versa.
4.2.2 Modeling Credibility
In recommender systems, the scope of a single recommendation is traditionally based on a user’s interests. In other words, recommendations produced by the system must be personalized in order to satisfy an individual’s expected interest in something. Traditional recommendation strategies adopted by [46–48] compute a personalized set of recommendations for a target user by harnessing the user’s profile information together with her item preferences. However, since we are not targeting an individual user’s preferences but focusing on a topic-specific setting, a whole group of people who share a common topic of interest can benefit from our algorithm. Since the majority of newsworthy and reusable information shared among users on social networks is topic-specific, and social links in the network are also established based upon topics of interest, we focus on predicting credible information within an individual topic.
Credibility Model   Description
Social Model        A weighted combination of positive credibility indicators
                    from the underlying social network.
Content Model       A probabilistic language-based approach identifying
                    patterns of terms and other tweet properties that tend to
                    lead to positive feedback such as retweeting and/or
                    credible user ratings.
Hybrid Model        A combination of the above, firstly by simple weighting,
                    and secondly through cascading / filtering of output.

Table 4.1: Three predictive models for identifying credible information, from Kang et al. [2]
Now, we define the domain of our experiment, followed by the three different credibility models. To evaluate the utility of our models for predicting information credibility, we trained several classifiers using 5,000 manually rated tweets from a user evaluation. The detailed process and results of the analysis follow in the evaluation section.
Definition 3 The Twitter domain can be represented as a quintuple (U, F_o, F_e, T, X), where F_o and F_e are two U × U matrices representing binary mappings f ∈ {F_o, F_e} : U × U → {0, 1} between users in U (termed the “follower” and “following” groups, respectively), T is the set of tweets, distributed over U, and X is the set of topics in T.
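A minimal sketch of this quintuple as a data structure, assuming sparse representations for the two U × U matrices; the sample users, tweet and topic below are invented for illustration:

```python
from dataclasses import dataclass

# Sketch of the (U, Fo, Fe, T, X) domain of Definition 3.
# Fo[(u, v)] = 1 means v follows u; Fe[(u, v)] = 1 means u follows v.
@dataclass
class TwitterDomain:
    U: set    # users
    Fo: dict  # follower matrix, sparse: {(u, v): 1}
    Fe: dict  # followee matrix, sparse: {(u, v): 1}
    T: dict   # tweets distributed over U: {u: [tweet, ...]}
    X: set    # topics occurring in T

dom = TwitterDomain(
    U={"u1", "u2"},
    Fo={("u1", "u2"): 1},   # u2 follows u1
    Fe={("u2", "u1"): 1},   # equivalently, u1 is in u2's followee set
    T={"u1": ["#libya ceasefire reported"], "u2": []},
    X={"libya"},
)
# Missing entries default to 0, i.e. the binary mapping into {0, 1}.
print(dom.Fo.get(("u1", "u2"), 0))
```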
As can be seen in the above definition, our domain encompasses both content and social metadata. We will show detailed information about the social metadata in the next section. We now propose the following three approaches for identifying credible information sources, along with a brief definition of each model, enumerated in Table 4.1.
Social Model
The web of connected user accounts on Twitter enhances information flow and helps users to easily disseminate their own published information as well as other users’ content via the “retweet” action. However, the short length of postings makes the task of identifying credible information more difficult: publishing activity is frequent, and users tend to post newsworthy information across multiple statuses with shortened URLs pointing to external sources, due to the 140-character constraint. That is, the timeline of each user in Twitter changes dynamically. This
ephemeral nature is also enhanced by the ease of using content metadata such as the hashtag (#topic), mention (@screen name) and retweet (RT @screen name). We believe that our social model can help deal with this characteristic of dynamic content in Twitter by leveraging social features embedded in the network structure itself. For instance, observing the ratio of followers to followees tells us the type of a specific user account. Celebrities or public accounts of news archives are apt to have follower counts up to three orders of magnitude greater than their followee counts. As an illustration of a celebrity in Twitter, consider the Internet pioneer Tim Berners-Lee: as of June 1, 2012, his Twitter account follows 108 users and has 69,663 followers. In this sense, the underlying social features in our social model distinguish celebrities and spammers from general user accounts in Twitter, grouping the texture of information provenance into several categories. Our social model attempts to alleviate these problems by weighting various credibility indicators within a target topic.
Since a majority of the postings in Twitter are “retweets”, we first identify the credibility of information by measuring the retweet ratio of the information source. Equation 4.1 shows a weighted credibility based on the deviation of a user u ∈ U’s retweet rate RT_u from the average retweet rate RT_x in a specific topic x ∈ X. In our experiment, a log–log scale is used to soften the imbalanced distribution of values and enclose large outliers in the data. Notation has been left out for simplicity.
Cred_{RT}(u, x) = \left| RT_u - RT_x \right|    (4.1)
Equation 4.2 measures the utility of a user in disseminating information by combining the number of retweets and the number of followings F_e. Since the number of followings can be considered a potential number of retweets, the deviation between the user space and the topic space becomes a utility metric.
Utility_{RT}(u, x) = \left| \frac{RT_{u,x} \times F_{u,e}}{t_{u,x}} - \frac{RT_x \times F_{x,e}}{t_x} \right|    (4.2)
Information flux can also be a good metric for identifying active information sources in a social graph structure. In this sense, we believe that the network topology functions as a good indicator of the credibility of a user. Equation 4.3 computes a social credibility score as the deviation in the number of user u’s followers (information subscribers) from the average number of followers in the topic space. This is normalized by the number of tweets.
Cred_{social}(u) = \left| \frac{F_o(u)}{t_u} - \frac{\bar{F_o}}{\bar{t}} \right|    (4.3)
As we discussed at the beginning of this subsection, we can now also weight Equation 4.3 by factoring in the ratio of friends to followers as a difference from the norm for a given topic. For example, an information-gathering agent for a direct marketing company is likely to follow many profiles but have few followers. Equation 4.4 describes the social balance of a user u as the ratio of follower (F_o) to following (F_e) group size.
Balance_{social}(u) = \frac{F_o(u)}{F_e(u)} - \frac{\bar{F_o}}{\bar{F_e}}    (4.4)
We also take social connections within a specific topic space into account when identifying credibility. This can be expressed as follows, by taking the mean value of followers in a particular topic space.
Cred_{social}(u, x) = \left| \frac{F_o(u, x)}{t_{u,x}} - \frac{\bar{F}_{o,x}}{\bar{t}_x} \right|    (4.5)
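The five components above (Equations 4.1–4.5) can be sketched as plain functions; the per-user and per-topic aggregates (retweet rates, follower/followee counts, tweet counts) are assumed to be precomputed from the crawl, and all parameter names here are ours, not from the thesis implementation:

```python
def cred_rt(rt_u, rt_x):
    """Eq. 4.1: deviation of a user's retweet rate from the topic average."""
    return abs(rt_u - rt_x)

def utility_rt(rt_ux, fe_u, t_ux, rt_x, fe_x, t_x):
    """Eq. 4.2: dissemination utility as a user-space vs. topic-space deviation."""
    return abs((rt_ux * fe_u) / t_ux - (rt_x * fe_x) / t_x)

def cred_social(fo_u, t_u, fo_avg, t_avg):
    """Eq. 4.3: follower count per tweet vs. the average, absolute deviation."""
    return abs(fo_u / t_u - fo_avg / t_avg)

def balance_social(fo_u, fe_u, fo_avg, fe_avg):
    """Eq. 4.4: follower/followee ratio as a difference from the norm."""
    return fo_u / fe_u - fo_avg / fe_avg

def cred_social_topic(fo_ux, t_ux, fo_x, t_x):
    """Eq. 4.5: Eq. 4.3 restricted to a single topic space x."""
    return abs(fo_ux / t_ux - fo_x / t_x)

# Illustrative numbers only: a user with 69,663 followers and 108 followees
# (cf. the Tim Berners-Lee example) is heavily follower-dominant.
print(balance_social(69663, 108, 500, 500))
```

A strongly positive balance suggests a celebrity-like account, while a strongly negative one suggests an information-gathering agent.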
Content-based Model
Since Twitter only allows short content, it is especially difficult to make credibility judgements about tweets. However, we believe that this characteristic also helps us find certain patterns in the text, such as punctuation or words matching particular lexicons. One of our assumptions is that habitual expressions or particular vocabularies are highly likely to appear in specific types of information. For example, a personal conversation contains emotional words more frequently than a news article does. Accordingly, our second model focuses on the text of tweets in order to find linguistic patterns in Twitter content.
In the content-based model, we assigned 12 numeric and 7 binary features as indicators of information credibility. These feature values are later used to predict the manually annotated credibility scores from our real-world user survey. Approximately half of these features are taken from a 2011 study by Castillo et al. [5]. However, our features are applied to individual tweets, and this approach differs from their work: Castillo et al. defined a much larger set, in which 10 tweets are bound together to make a single input document, in (1) a pool of previously selected factual tweets and (2) a multiple-topic setting. In this model, we included content-specific features only, because we focus purely on a content-based setting.
Numeric Indicators:
1. Positive Sentiment Factor : Number of positive words (matching our lexicon)
2. Negative Sentiment Factor : Number of negative words
3. Sentiment Polarity : Sum of sentiment words with intensifier weighting (x2) (’very’, ’ex-
tremely’ etc)
4. Number of intensifiers: ’very’, ’extremely’ etc., based on our lexicon.
5. Number of swearwords: Simple count, based on lexicon.
6. Number of popular topic-specific terms: Simple count, based on lexicon.
7. Number of Uppercase Chars: Simple Count
8. Number of Urls: Simple Count
9. Number of Topics: Number of topics ’#’ (All have at least 1)
10. Number of Mentions: Number of users mentioned with ’@’
11. Length of Tweet (Chars): simple count.
12. Length of Tweet (Words): simple count.
Binary Indicators:
1. Is Only Urls: No text, only links.
2. Is a Retweet : From metadata
3. Has a Question Mark : ’?’ or contains any of Who/What/Where/Why/When/How
4. Has an Exclamation Mark : ’!’
5. Has multiple Questions/Exclamations: ’??’ ’???’ ’!!’ ’!!!’ etc.
6. Has a positive emoticon: :) :-) ;-) ;) etc.
7. Has a negative emoticon: :( :-( ;-( ;( etc.
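A sketch of how several of these indicators might be computed for a single tweet; the lexicons here are toy placeholders, and the tokenization rules of the actual implementation may differ:

```python
import re

POSITIVE = {"great", "good", "happy"}      # placeholder lexicons
NEGATIVE = {"bad", "awful", "sad"}
INTENSIFIERS = {"very", "extremely"}

def content_features(tweet):
    """Compute a subset of the numeric and binary content indicators."""
    words = re.findall(r"[a-z']+", tweet.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return {
        "sentiment_pos": pos,
        "sentiment_neg": neg,
        "sentiment_score": pos - neg,          # cf. footnote 6 below
        "num_intensifiers": sum(w in INTENSIFIERS for w in words),
        "num_uppercase": sum(c.isupper() for c in tweet),
        "num_urls": len(re.findall(r"https?://\S+", tweet)),
        "num_topics": tweet.count("#"),
        "num_mentions": tweet.count("@"),
        "len_chars": len(tweet),
        "len_words": len(tweet.split()),
        "is_retweet": tweet.startswith("RT @"),
        "has_question": "?" in tweet,
        "has_exclamation": "!" in tweet,
        "has_pos_emoticon": any(e in tweet for e in (":)", ":-)", ";)", ";-)")),
        "has_neg_emoticon": any(e in tweet for e in (":(", ":-(", ";(", ";-(")),
    }

f = content_features("RT @bbc: Very good news from #libya! http://t.co/x")
print(f["num_topics"], f["is_retweet"], f["sentiment_score"])
```

Each tweet thus becomes one feature vector, matching the tweet-level setting described above.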
Hybrid Model
We have discussed two different models: the “social model”, which harnesses user metadata in a social graph, and the “content-based model”, which discovers linguistic patterns in the tweet text itself. Our initial intuition was that the integrated set of all the features we have used would yield a better predictive outcome for identifying credible/non-credible provenance of information. Therefore, we propose this integrated feature set as a third, “hybrid” model to maximize the predictive ability of both social and content-based features. This exhaustive approach has a higher granularity of fingerprint information, in order to accurately predict desired credible information and its source.
4.3 Experiment Setup
So far, we have presented the definition of credibility, the domain of Twitter and three different credibility models. In this section, we incorporate these models into a real-world system to recommend credible sources of information in microblogs. To conduct our exploratory experiment, we first collect 8 different topic-specific data sets from Twitter and conduct a brief statistical analysis on the data to understand how each set of information varies across different topic spaces. Next, we describe the procedure of the online survey performed to collect manually annotated credibility scores from real-world users, which serve as “ground truth” for the evaluation of our models. In the following paragraphs, we describe how our crawler works and present our methodology in detail.
4.3.1 Data Mining and Preliminary Analysis
Figure 4.1: Illustration of the scope of crawled data for each of 7 current topics.
In order to evaluate our credibility models in a topic-specific setting, we first crawled tweet texts and the authors’ profile information, as well as their network structure in Twitter. To be specific, we developed our crawler program using both the Twitter REST and Streaming
APIs through the “python-twitter” wrapper1. The primary reason for using Python as the platform for our crawling task was that Python supports an environment well suited to natural language processing, together with brevity of programming expression.
1python-twitter : A Python wrapper around the Twitter API. http://code.google.com/p/python-twitter/
The major challenge in this task was the rate-limiting policy for tweets and
network information imposed by the Twitter API. For example, additional tweets or a list of followers/followings of a particular user can be fetched at a rate of 150 queries per hour, with 200 tweets per query. In contrast, seed tweets and their user profile information were easily obtainable through the Twitter Streaming API in real time2. To overcome the time constraint stated above, we ran our crawler for 2 months on a cluster of 12 machines using 14 different Twitter authentications. Through the Twitter API, tweet texts along with content and social metadata are returned in JSON format, and are stored in a relational database on our server after a series of simple parsing steps. For example, we can parse the location or language information of a user using parameters such as ‘geocode’, ‘lang’ or ‘locale’ from the ‘GET search’ method of the Twitter API. A conceptual diagram is depicted in Figure 4.1.
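The multi-credential workaround can be sketched as a round-robin token pool that spaces each credential’s queries to the documented rate; `fetch_followers` is a hypothetical stand-in for the real python-twitter call:

```python
import time
from itertools import cycle

QUERIES_PER_HOUR = 150   # REST API rate limit at the time, per credential

class TokenPool:
    """Rotate across several authenticated accounts to raise throughput."""
    def __init__(self, tokens):
        self._tokens = cycle(tokens)
        self._next_ok = {t: 0.0 for t in tokens}   # earliest reuse time

    def acquire(self):
        token = next(self._tokens)
        wait = self._next_ok[token] - time.monotonic()
        if wait > 0:
            time.sleep(wait)   # respect the per-token query rate
        self._next_ok[token] = time.monotonic() + 3600 / QUERIES_PER_HOUR
        return token

def crawl(user_ids, pool, fetch_followers):
    """fetch_followers(token, uid) is a placeholder for the real API call."""
    network = {}
    for uid in user_ids:
        network[uid] = fetch_followers(pool.acquire(), uid)
    return network

# Toy run with a fake fetcher (no network access):
pool = TokenPool(["auth1", "auth2"])
net = crawl(["u1", "u2"], pool, lambda tok, uid: [uid + "_f1"])
print(net)
```

With 14 credentials, the aggregate query rate scales roughly linearly, which is consistent with the cluster setup described above.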
We carefully chose 8 different topics for our experiment. The main criteria are as follows: (1) trending topics that yield a comparatively sufficient amount of postings within a given time (e.g. 100 statuses in a minute); (2) topics that allow us to obtain significant interconnections in the underlying social graph. This rich network structure will be utilized in the following chapter. After we collected the seed tweets and their author profiles, a further data collection process was performed for each topic to gather network structures (including followers and followings of the seed authors) originating from the original set of authors, as well as all the tweets posted by those authors. As illustrated in Table 4.2, the network information of the authors is significantly larger than the core authors and seed tweets from the first crawling pass. In this experiment, we primarily focus on the “Libya” dataset, since it has the highest interconnectivity in its social graph. In addition, a simple visualization of the physical information sources for the collected tweets in the Libya data set is shown in Figure 4.2, and a word cloud visualization of the same data is shown in Figure 4.3.
2The Streaming API roughly provides from 0.5% to 5% of real-time statuses. A larger amount of tweets can be obtained by registering for the “Firehose” service in Twitter. This stream of information is delivered through a UDP network connection.
Set Name      Core Authors   Core Tweets   Fo and Fe (with overlap)   Fo and Fe (distinct)
Libya         37k            126k          94m                        28m
Facebook      433k           217k          62m                        37m
Obama         162k           358k          24m                        5m
Japanquake    67k            131k          25m                        4m
LondonRiots   26k            52k           30m                        4m
Hurricane     32k            116k          35m                        5m
Egypt         49k            217k          73m                        36m

Table 4.2: Overview of 7 topic-specific data collections mined from both the Twitter REST and Streaming APIs.
Figure 4.2: Word cloud showing origin of tweets in the Libya data set.
Figure 4.3: Word cloud showing the distribution of popular terms in the Libya data set. (This visualization is generated using Wordle, http://www.wordle.net/)
4.4 User Survey
To collect credibility assessments on our dataset from real users and evaluate our prediction models, we conducted an online user survey. For this survey, a set of tweets from the “Libya” dataset is used. In total, 145 participants were asked to rate their impression of the credibility of 40 tweets on a Likert scale ranging from 1 to 5. An individual tweet is shown to only one participant and is not shown to others again. Our respondents also had the option to choose “can not answer”, which is scaled as ‘0’ alongside the other 5 ratings. The tweets marked with the “can not answer” option were all discarded in order to avoid unwanted bias in our results. The survey population comprises 39% female and 61% male participants, varying in age from 19 to 56 with an average age of 26. Detailed statistics of our participants can be seen in Figure 4.4.
In this experiment, we were also interested in the effects of various contexts, including those from Castillo et al. [5], on perceived credibility. Accordingly, we examined the effect on the perceived credibility scores of the respondents by providing four different source contexts (the independent variable). Each context is presented as an individual group in the survey, showing 10 tweet texts with or without additional contextual information. Group 1 provided only the tweet text in each question. Group 2 provided statistically poor social metadata for the tweeter (e.g. low numbers of followers, followees and retweets) along with the raw text of each tweet. Group 3 provided statistically good, and socially strong, properties of the tweeter (e.g. significantly higher numbers of followers,
Figure 4.4: Three graphs showing (a) gender distribution, (b) familiarity with microblogging services and (c) age distribution of the participants of the online user survey in this experiment. (from left to right, respectively)
Figure 4.5: Plot showing perceived credibility in each of four Twitter contexts (Negative,Positive, True, Null).
followees and retweets) with the raw tweet text in each question. Lastly, group 4 was shown the true context, i.e. the tweeter’s real numbers of followers, followees and retweets along with the tweet text. Except for the final group (true context), the other three groups are used only for evaluating the effect on perceived credibility scores. As can be seen in Figure 4.5, the average perceived credibility score in the positive context is ranked the highest. In this figure, the positive context (2.34) shows an increase of 26% in the average perceived credibility score over the negative context (1.58). Clearly, this result implies that context in Twitter does have an effect on perceived credibility.
4.5 Evaluation
So far, we have presented our experiment setup and described an online user survey to collect ground-truth credibility ratings. Now we evaluate our credibility models by predicting the manually annotated credibility ratings of tweets in the crawled dataset. The evaluation is done by comparing the prediction accuracy of each model. Since we could only collect a sufficient amount of data for the “Libya” dataset, due to the rate limitation of the Twitter API, we focus on that dataset for most of the following evaluation.
4.5.1 Data Analysis
Before we discuss the results of each prediction model, we analyze some interesting patterns that we found in our crawled dataset from a broader perspective. A thorough analysis of all the features we computed in this experiment is left for the next chapter as a separate experiment. Through this macroscopic analysis of the crawled dataset, we arrive at a high-level understanding of the Twitter domain, as well as additional insight into some anomalies in the feature distributions. We focus on a selection of the most influential features in the predictive models, such as Fo, Fe, links and retweets. These features were identified mainly using best-first feature analysis in WEKA3.
Followings and Followers Illustrations of ‘followings to followers’ patterns across five of our topic-specific datasets are shown in Figure 4.6. The first plot (a) in Figure 4.6 shows the number of followers Fo against the number of following users Fe over the 37,000 core users in the Libya dataset. In this graph, we labeled four typical areas with shaded boxes: ‘suspicious zone’, ‘cold start zone’, ‘low credibility zone’ and ‘celebrity zone’. For example, both the ‘celebrity zone’ and the middle area at the bottom of this plot, which represent accounts having a reasonable number of followings, presumably contain most of the credible users. Intuitively, celebrities in Twitter tend to have a reasonable number of followings with a significantly large number of followers, whereas accounts following an unreasonable number of users are highly likely to be automated agents (substantially large Fo and Fe) or information collectors (substantially large Fe and only a small Fo). The ‘cold start zone’, which has small numbers of Fe and Fo, represents new or irregular users; since users in this area do not have sufficient social information, they are considered low-credibility accounts. We found another interesting pattern along the threshold line on the following axis (2,000 followings) in all 5 graphs. In fact, this limitation is imposed by Twitter in order to prevent system strain and limit abuse4. However, this line on the log–log scale can also be seen as the long tail of a power-law distribution, as defined by Barabasi in his Nature paper [49]. In general, we found similar distributions across all of the topics in Table 4.2.
3www.cs.waikato.ac.nz/ml/weka/
Credibility Distribution across Followers Figure 4.7 shows a histogram of the average credibility score for tweets in the online user survey compared with the number of followers. In this plot, credibility scores for individual tweets are binned with regard to the corresponding number of followers. A pattern of monotonic increase in credibility rating is observed up to a follower count of approximately 1,500; however, there is a sharp decrease after 100,000 followers. This result is reflected in the “balance” component of our social model, and these outliers (followers ≥ 1,500) are penalized by the weighting of socially balanced users in Equation 4.4.
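The binning behind Figure 4.7 can be sketched as follows, averaging ratings within order-of-magnitude follower-count bins; the sample pairs are invented for illustration:

```python
import math
from collections import defaultdict

def bin_credibility(samples):
    """samples: (num_followers, rating) pairs -> {log10 bin: mean rating}."""
    sums = defaultdict(lambda: [0.0, 0])
    for followers, rating in samples:
        b = int(math.log10(max(followers, 1)))   # bin 0: 1-9, bin 1: 10-99, ...
        sums[b][0] += rating
        sums[b][1] += 1
    return {b: s / n for b, (s, n) in sorted(sums.items())}

data = [(5, 1), (120, 2), (900, 3), (1500, 4), (200000, 2)]
print(bin_credibility(data))
```

Logarithmic bins match the heavy-tailed follower distribution discussed above, so each bin holds a comparable share of accounts.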
Links and Other Credibility Features We now briefly discuss other interesting discoveries from a number of feature distribution analyses. The feature distribution graph in Figure 4.8 and the feature correlation graph in Figure 4.9 show an overview of a subset of the features that we used in this experiment (colored by credibility rating5). The external-link feature, which is equivalent to the presence of URLs in a tweet, exhibits a slight correlation with credibility, and occurred more frequently in younger profiles. The presence of links in tweets also shows an interesting correlation with sentiment (the number of occurrences of positive or negative sentiment keywords): if the sentiment score6 is polarized in either direction, links rarely occur, and users tend to assess credibility more highly. For example, retweets including links have an average credibility
4Twitter’s technical follow limits. https://support.twitter.com/articles/66885-follow-limits-i-can-t-follow-people
5red: credible tweets, blue: non-credible tweets (binarized)
6sentiment score = sentiment pos − sentiment neg
Figure 4.6: Friend-to-follower patterns across five of our topic-specific data sets: (a) #Libya, (b) #JapanQuake, (c) #Hurricane, (d) #EnoughIsEnough, (e) #Facebook. All other sets exhibited a similar distribution.
Figure 4.7: Histogram of average credibility rating from the online survey across the number of followers of the tweeters (binned).
rating of 1.4589, while retweets without a link have a score of 1.3702, showing a relative increase of 6.47%. Figure 4.10 shows the distribution of credibility in the two contexts of link existence across the four different groups in our online survey. A more detailed analysis across all of the features is covered in Chapter 5.
4.5.2 Predicting Credibility
To evaluate our credibility models (the social, content-based and hybrid models), we performed a total of 3 prediction tests with a set of manually evaluated tweets from the online survey. Each individual test represented one of our credibility models. After we computed all of the individual components of each credibility model, a set of weighted features was obtained. The three feature sets were loaded as input files to the WEKA7 machine learning toolkit. The objective of this evaluation was to accurately predict our manually labeled “ground-truth” data from the survey. First, we performed a preliminary evaluation by applying a variety of classification algorithms, such as Bayesian classifiers and a number of decision tree algorithms supported by WEKA. Afterwards, we chose the J48 (C4.5) decision tree algorithm since (1) this algorithm showed the best performance in our prediction task and (2) it allows us to compare our results with Castillo et al.’s similar evaluation in [5].
7www.cs.waikato.ac.nz/ml/weka/
Figure 4.8: Plot showing a sample of feature distributions for the content-based model on 3,000 labeled tweets from the Libya dataset.
Figure 4.9: Comparisons of each feature used in computing the social credibility model.
Figure 4.10: Comparisons of link and no-link contexts across the four different groups in the online user survey.
Figure 4.11: Comparison of predictive accuracy over the manually collected credibility ratings for each model.
After filtering out the tweets rated 0 (“can not answer”) or 3 (the median score, with possible ambiguity) from the ‘true’ context group, predictions were performed on a training set of 591 tweets with annotated credibility scores. 10-fold cross-validation was applied, and training sets were kept separate from test sets. The four remaining credibility ratings (1, 2, 4, 5) were split into two classes (1 and 2 for the ‘non-credible’ class, 4 and 5 for the ‘credible’ class), and the J48 algorithm classified each test instance into one of those two classes. In our evaluation, the prediction process is performed at the tweet level, and this is one of the major differences compared to the evaluation in Castillo et al. [5]. Class instances were evenly distributed in the training sets.
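The evaluation protocol can be approximated outside WEKA; the sketch below uses a depth-1 decision stump in place of J48 and synthetic feature/rating data, so it illustrates only the rating binarization and the 10-fold procedure, not the reported numbers:

```python
import random

random.seed(0)

# Made-up stand-in for the rated tweets: one social feature per tweet and a
# Likert rating in {1, 2, 4, 5} (ratings 0 and 3 already filtered out).
ratings = [random.choice([1, 2, 4, 5]) for _ in range(200)]
feature = [r + random.gauss(0, 1.0) for r in ratings]   # loosely informative
labels = [1 if r >= 4 else 0 for r in ratings]          # 1,2 -> 0; 4,5 -> 1

def train_stump(xs, ys):
    """Pick the threshold minimizing training error (a depth-1 'tree')."""
    best = (None, 0)
    for t in sorted(set(xs)):
        acc = sum((x > t) == y for x, y in zip(xs, ys)) / len(ys)
        if acc > best[1]:
            best = (t, acc)
    return best[0]

def cross_validate(xs, ys, k=10):
    """Plain k-fold CV: train on k-1 folds, test on the held-out fold."""
    n = len(xs)
    correct = 0
    for i in range(k):
        test = set(range(i * n // k, (i + 1) * n // k))
        tr_x = [x for j, x in enumerate(xs) if j not in test]
        tr_y = [y for j, y in enumerate(ys) if j not in test]
        t = train_stump(tr_x, tr_y)
        correct += sum((xs[j] > t) == ys[j] for j in test)
    return correct / n

print(round(cross_validate(feature, labels), 3))
```

A full J48 tree generalizes the stump by splitting recursively on the feature with the highest gain ratio; the cross-validation bookkeeping is the same.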
The results of the evaluation are shown in Figure 4.11. In this experiment, both the content-based and hybrid models performed fairly at the prediction task; however, the social model showed significantly higher performance than the other two models, with an accuracy of 88.17%. In other words, there was a remarkable improvement of 11% over the next best performer (the hybrid model). At the beginning of this experiment, our expectation was that the more features used, the more accurate our predictions would be. The results therefore seem counter-intuitive, since the social model significantly outperforms the hybrid and content-based methods.
According to this result, the features from the content-based model presumably have a somewhat negative effect on the performance of the hybrid model. Since the content-based model displayed the lowest performance among the three credibility models, the fact that the hybrid model performed next best also supports this supposition. As we mentioned in the previous section, a closer analysis across all of the features of the content-based model is covered in the next chapter. We also assume that the short length of tweet text contributed to the relatively poor performance (67%) of the content-based model in this prediction task. An overview of the statistical output from the J48 learning process is provided in Figure 4.12 for our best performing method, showing a correct classification of 902 of the 1,023 instances, yielding 88.172% accuracy. In conclusion, our results in this experiment indicate that the underlying network and dynamics of information flow are better indicators of credibility than text content.
Figure 4.12: Statistical results from a J48 tree learner for the best performing credibility prediction strategy (Feature Hybrid)
4.6 Discussion
In this experiment, we conducted an evaluation of our three computational models for predicting topic-specific credibility on a set of crawled tweets. However, we believe that this is not an exhaustive study of credibility models and, therefore, that better prediction strategies remain to be found. According to the results of this study, the content-based model seems to have a rather negative effect, compared with the features from the social model, when predicting credible information. Because of this, we need to find other underlying content-based features in microblogs that offset the interfering noisy features found here. As we will discuss further in the next experiment, we need to scrutinize the various complex correlations between different features and align them with our desired patterns in order to maximize predictive performance. When this goal is achieved, we will be able to generalize our model to the microblog domain with stable and robust performance in predicting credibility.
Chapter 5
Experiment II: Feature Analytics
An Analysis of Feature Distributions in Twitter
In Chapter 4, we discussed credibility assessment models built from two pre-defined feature groups, in addition to a combined model, in terms of predictive accuracy over a set of manually collected credibility ratings on crawled tweets. Now, we take a closer look at feature distributions in Twitter in order to determine the utility of individual features for predicting the credibility of an information source. For example, the utility of an individual feature can be measured in terms of predictive accuracy, and also in terms of the presence of that feature in a specific topic domain. This experiment is based on (1) the context of credibility, (2) multiple topics, (3) dyadic pairs of tweets (conversation tweets) and (4) retweet chain lengths. We start by providing a brief motivation for this project and describing the dataset used in this experiment. Next, a context-based evaluation of a set of features is presented for predicting manually provided credibility ratings on a collection of tweets, followed by an evaluation of the feature distribution across 8 diverse topics crawled from Twitter. Finally, an analysis of feature distributions across dyadic pairs of tweets and retweet chains of various lengths is described. For ease of comparison, we use the previously evaluated 5,025 tweets (on page 26) from the user survey.
# of
Class            Description                                        Contexts

Diverse Topics   Diverse topics in Twitter; e.g. #Romney,           8 different topics (see Table 5.2)
                 #Facebook
Credibility      Manually provided assessments of tweets            Credible or non-credible
Chain length     Mined retweet chains, classified based on length   Long, short, no chain
Dyadic pairs     Mined interpersonal interactions, classified       Dyadic or non-dyadic

Table 5.1: Context classes in our evaluation.
5.1 Motivation
Credibility models can leverage many features for predicting the trustworthiness of an information source. However, important issues arise when it comes to specific contexts, circumstances or scenarios: each context requires a different set of features in order to make the optimal credibility assessment.
To answer the question “which features are best for determining content credibility?”, we propose a method to study credibility-related features along various dimensions. The objective of this preliminary study is to analyze the distributions of the features across multiple contexts, and to confirm whether a cluster of features would improve the performance of credibility assessment. We also study which features tend to provide similar or distinct types of information. We can use these categories to reduce the dimensionality of the feature space and study the contribution of each category to the ground truth on credibility.
5.2 Data Gathering and Processing
We gathered 8 different datasets based on specific topics. These sampled tweets represent several categories of topic that are frequently discussed in Twitter, such as “revolution” (#Libya, #Egypt), “natural disaster” (#Earthquake), “movement” (#EnoughIsEnough), “politics” (#Romney), “sports: big match” (#Superbowl) and “daily chatter” (#Love and #Facebook). This set was chosen because our study focuses on evaluating the distribution of multiple credibility features in various contexts (see Table 5.1).
These eight topic-specific datasets are listed in Table 5.2; the table gives the size of each subset of each dataset.

Set              Core       Core     Fo and Fe      Fo and Fe
Name             Tweeters   Tweets   (overlapped)   (distinct)
Libya            37K        126K     94M            28M
EnoughIsEnough   85K        129K     13M            4M
Egypt            49K        217K     73M            36M
Earthquake       67K        131K     15M            5M
Superbowl        191K       227K     N/A            N/A
Romney           226K       705K     N/A            N/A
Facebook         433K       217K     62M            37M
Love             312K       227K     21M            7M

Table 5.2: Overview of 8 topic-specific data collections mined from the Twitter streaming API.

Table 5.3 shows a set of sample tweets from each topic, illustrating the content diversity of our sampled dataset. Our crawling process is also described as pseudocode in Figure 5.1. We first evaluate feature distributions in the diverse-topic context as a general analysis, and then focus on the “Libya” dataset in three specific contexts in Section 5.4.
5.2.1 Online Evaluation of Credibility
Our first specific context on the “Libya” dataset is “credibility”. In order to classify and evaluate this context, we conducted an online evaluation in two phases and obtained a set of crowd-sourced reference credibility ratings on our sampled tweets. This online evaluation was performed on Amazon Mechanical Turk. 700 participants were asked to rate their impression of the credibility of 30 tweets on a Likert scale; tweets shown to one participant were not shown again to another respondent. To maintain the objectivity of our evaluation, we let participants select “0”, representing “cannot answer”, in addition to the scores ranging from 1 to 5, and discarded the tweets marked 0. (This step was not visible to the respondents and was processed after closing the online evaluation.) We also asked each participant 4 general questions and 7 pre-test questions. The general questions are used to understand the demographics of our respondents and their familiarity with microblogging services; the pre-test questions are used to filter out unreliable respondents and thereby validate the results of the evaluation.
Demographic statistics of our evaluation show that participants were 61% male and 39% female, ranging in age from 19 to 56 (median 28). Participants were generally familiar with Twitter (4 out of 5 rating on average). In total, 10,851
Set Name        Tweet

Libya           #Libya: Muammar #Gaddafi’s base taken http://t.co/UKvSn7Jk #drumit
Superbowl       RT @mashable: The Giants may have won the #SuperBowl, but Madonna won the Google search competition - http://t.co/YRqErdkg
Romney          #Romney outlines economic plan - cut taxes across the board to boost economy, but won’t add to the deficit. #GOP2012 http://t.co/AjbDWLKy
Love            You deserve better friends #love you.
Facebook        I posted 15 photos on Facebook in the album “21.09.2011. MC at UCLA Royce Hall” http://t.co/9WQQcfRS
Enoughisenough  i Mean come on Now #EnoughIsEnough
Egypt           #Egypt #July8 Egypt Mubarak-era minister jailed for corruption Albany Times Union http://t.co/5ucF7kDA #Feb17
Earthquake      A light intensity #earthquake, of magnitude 4.3 on the Richter Scale, occurred here at 8.51 p.m.

Table 5.3: Examples of 8 topic-specific tweets from the crawled sets.
procedure crawl(topicsList, tweetsList)
    store = ∅
    for all topic ∈ topicsList do
        store ← topic
        topicTag = topic.getTopicTag()
        for all tweet ∈ tweetsList do
            if tweetContains(topicTag) then
                store ← getRelevantTweet()
            end if
        end for
        for all tweet ∈ store.getTweets() do
            store ← getUsers()
        end for
        for all user ∈ store.getUsers() do
            store ← getTweets()
        end for
        for all user ∈ store.getUsers() do
            store ← getFollowers()
            store ← getFollowings()
        end for
    end for
end procedure

Figure 5.1: Data crawling procedure used for gathering 8 datasets from the Twitter social graph, consisting of over 20 million users and 200 million tweets in total.
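The procedure in Figure 5.1 can be sketched in Python. Note that `TwitterClient` and all of its fields below are hypothetical in-memory stand-ins for the real Twitter API, chosen only to mirror the calls named in the pseudocode; this is a sketch, not the crawler actually used.

```python
class TwitterClient:
    """Hypothetical stand-in for the Twitter API, serving tweets,
    timelines and social links from in-memory dictionaries."""
    def __init__(self, tweets, timelines, followers, followings):
        self.tweets = tweets          # list of {"text": ..., "user": ...}
        self.timelines = timelines    # user -> list of that user's tweets
        self.followers = followers    # user -> list of follower names
        self.followings = followings  # user -> list of followed names

def crawl(topics, client):
    """For each topic tag, store the matching tweets, their authors,
    each author's timeline, and each author's social links,
    mirroring the four loops of Figure 5.1."""
    store = {}
    for tag in topics:
        matching = [t for t in client.tweets if tag in t["text"]]
        users = {t["user"] for t in matching}
        store[tag] = {
            "tweets": matching,
            "users": users,
            "timelines": {u: client.timelines.get(u, []) for u in users},
            "followers": {u: client.followers.get(u, []) for u in users},
            "followings": {u: client.followings.get(u, []) for u in users},
        }
    return store
```

Given a populated client, `crawl(["#Libya"], client)` returns one store entry per topic, holding the tweets, the core tweeters and their social neighborhoods.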
tweets were rated with an annotated credibility score on the Libya topic collection.
5.2.2 Retweet Chains
Additional crawling process was performed on the Libya dataset from Table 5.2 to
understand the influence of retweet activity on feature distribution. In this process, we
collected retweet chains (up to 15 hops away) from our core tweets. In total 36,768
chains were obtained, ranging from 1 to 15 hops in length with an average length of
3.5492. We classified the chains into three contexts shown in Table 5.1, (1) long chains,
having 4 or greater than 4 hops, (2) short chains having less than 4, and (3) no chains.
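The classification above reduces to a single cutoff on hop count; a minimal sketch (the function name is ours):

```python
def classify_chain(hops):
    """Classify a retweet chain by its length in hops, using the
    cutoff of 4 described above: 0 hops means the tweet was never
    retweeted, 1-3 hops is a short chain, 4 or more is a long chain."""
    if hops == 0:
        return "no chain"
    return "long" if hops >= 4 else "short"
```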
5.2.3 Dyadic Pairs
Conversational tweets can be distinguished from general tweets or retweets in the Twitter domain; this type of tweet is marked by a mention tag (@) followed by a screen name. In order to analyze whether dyadic communications have distinctive feature distribution patterns in Twitter, this context is computed on our “Libya” dataset by tracking the “@” mention tag. Groups of tweets forming a pairwise conversation with at least two messages are selected from the entire set of tweets. As shown in Table 5.1, this context contains two simple classes: (1) dyadic, and (2) non-dyadic.
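The mention-based selection above can be sketched with a regular expression. The grouping key (an unordered pair of screen names) and the function name are our own assumptions, and the screen-name pattern is simplified.

```python
import re
from collections import defaultdict

# Twitter screen names are up to 15 word characters (simplified pattern).
MENTION = re.compile(r"@(\w{1,15})")

def dyadic_pairs(tweets):
    """Group conversational tweets by the unordered pair of users
    involved, keeping only pairs that exchanged at least two messages.
    `tweets` is an iterable of (author, text) pairs."""
    conversations = defaultdict(list)
    for author, text in tweets:
        m = MENTION.search(text)
        if m:
            key = frozenset((author, m.group(1)))
            conversations[key].append((author, text))
    return {k: v for k, v in conversations.items() if len(v) >= 2}
```

Tweets without a mention tag, and one-off mentions that never receive a reply, fall into the non-dyadic class.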
5.3 Feature Set
In the Twitter domain, a variety of features for predicting the credibility of a user or a piece of information can be extracted, depending on one’s criteria for defining “features”. For example, features can be derived from semantic, psychological or social perspectives in various contexts. Recent studies [2, 5, 39, 50] investigate features in Twitter by classifying them into social and content-based classes; features can also be categorized into content-based, network-based or social types. In this experiment, we group features into three classes: social, content and behavioral. The set of features used in this study is shown in Table 5.4.
Social Features The social class represents features derived from properties of users or the social network structure in the microblog. For instance, the age of an
Name                                                   %Present   Average score   Class
age 100.00 610.64 Social
listed count 100.00 11.82 Social
status count 100.00 554.49 Social
status rt count 100.00 10.17 Social
favourites count 100.00 57.96 Social
followers 100.00 295.15 Social
followings 100.00 315.03 Social
fofe ratio 100.00 5.81 Social
char 100.00 120.55 Content
word 100.00 18.69 Content
question 7.95 0.10 Content
excl 10.10 0.15 Content
uppercase 10.23 11.27 Content
pronoun 92.84 4.22 Content
smile 42.24 0.02 Content
frown 1.81 0.43 Content
url 14.17 0.42 Content
retweet 8.71 0.74 Content
sentiment pos 71.51 1.53 Content
sentiment neg 59.07 1.23 Content
sentiment 74.20 0.29 Content
num hashtag 42.09 0.83 Content
num mention 19.25 0.25 Content
tweet type 100.00 1.10 Content
ellipsis 2.11 0.29 Content
news 5.13 2.03 Content
average balance of conversation 100.00 0.32 Behavioral
average number of friends in timeline 100.00 2086.28 Behavioral
average spacing between statuses in seconds in timeline 100.00 21959.07 Behavioral
average text length in timeline 100.00 104.52 Behavioral
average general response time 100.00 3.27 Behavioral
average number of messages per conversation 100.00 4.34 Behavioral
average trust value in conversation 100.00 0.10 Behavioral
fraction of statuses in timeline that are retweets 100.00 0.55 Behavioral
Table 5.4: The set of Twitter features analyzed in our evaluation. Features are grouped into three classes: a) Social, b) Content-based and c) Behavioral. Each is a representative subset of the larger feature set in each class.
account represents profile information, and the numbers of followers and followings contain social information about a user. These features can be obtained through the Twitter API, since plentiful social metadata exist alongside the original content.
Content-based Features Content-based features are extracted by applying simple natural language processing algorithms. In Twitter, content-based features can be less rich than those of web pages or other information sources, since Twitter limits each message to 140 characters. We use a number of complex content-based features as well as basic ones such as the numbers of exclamation marks, question marks and words; the numbers of URLs, hashtags and mention tags are also content-based features. The sentiment score is an example of a more advanced feature in the content class: positive and negative sentiment factors are computed by comparing each word with lexicons of keywords. Our content-based feature set also includes a news feature, computed through comparison with a popular news archive. Features in the content class are independent of other factors such as social or profile information; they depend only on the content itself.
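Many of the basic content features in Table 5.4 can be extracted with a few regular expressions. The sketch below covers a subset; the exact extraction rules used in the study are not specified, so the patterns here are illustrative assumptions.

```python
import re

def content_features(text):
    """Extract a subset of the content-based features in Table 5.4
    from a single tweet's text. The rules are simplified sketches."""
    return {
        "char": len(text),
        "word": len(text.split()),
        "question": text.count("?"),
        "excl": text.count("!"),
        "uppercase": sum(c.isupper() for c in text),
        "url": len(re.findall(r"https?://\S+", text)),
        "retweet": int(text.startswith("RT ")),
        "num_hashtag": len(re.findall(r"#\w+", text)),
        "num_mention": len(re.findall(r"@\w+", text)),
        "smile": len(re.findall(r"[:;]-?\)", text)),
        "frown": len(re.findall(r":-?\(", text)),
        "ellipsis": int(text.rstrip().endswith("...")),
    }
```

Applied to the sample tweets of Table 5.3, such a function yields one feature vector per tweet, ready for the per-topic aggregation described below.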
Behavioral Features Behavioral features represent the dynamics of information flow in the Twitter domain. In Table 5.4, we provide a subset of a large class of features that focus on dissemination and interpersonal communication in the microblog. These behavioral features expose a number of factors affecting the dynamics of information, such as a user’s propagation energy, the balance of a conversation and a rich set of communication statistics. The behavioral features used in this experiment are from Adali et al. [50].
5.4 Feature Analysis
So far, we have discussed our data collection process, the dataset used in this experiment and the features, grouped into three categories. In this section, we present the results of our feature analysis in four different contexts. We begin with an evaluation in the context of credibility by comparing two different classes: credible and non-credible sets of tweets.
5.4.1 Credibility Analysis
As an extension of our first experiment in Chapter 4, we now provide an analysis of a set of features in the context of credibility. Since the distribution of content-based features for identifying credible information sources was not fully described in our first experiment, we provide the distribution in two different contexts: credible tweets and non-credible tweets. Before discussing the results, let us describe how we evaluate this context.
Credible and Non-credible Tweets To analyze feature distributions in the context of credible information in Twitter, we need to categorize a set of tweets into two classes. To obtain these two separate datasets (credible and non-credible tweets), we followed the same process used in the first experiment in Chapter 4, except that this time the online evaluation on Amazon Mechanical Turk was performed in two phases. This data gathering process was discussed earlier in Section 5.2.
Figure 5.2: Analysis of the distribution of selected content-based features based on two credibility contexts.
Figure 5.2 shows the mean distributions of features per tweet in the credible and non-credible contexts. From the online evaluation, we collected 100K tweets rated from 1 to 5 on a Likert scale. Tweets with scores of 1 or 2 were considered not credible and tweets with scores of 4 or 5 credible; tweets with a score of 3 were discarded in order to reduce the possibility of ambiguity.
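The binarization of Likert ratings just described can be sketched as follows (the function name is ours):

```python
def binarize(rating):
    """Map a 1-5 Likert credibility rating to a class label,
    discarding the ambiguous midpoint as described above."""
    if rating in (1, 2):
        return "not credible"
    if rating in (4, 5):
        return "credible"
    return None  # score 3 (or 0, "cannot answer") is discarded
```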
In Figure 5.2, the distributions of 18 content-based feature occurrences are shown across the two classes. Overall, none of the features in this figure presents a significant difference between the credible and non-credible classes. However,
the features that occurred less frequently, such as ‘question’, ‘exclamation’ and ‘smile’, show higher frequency in the credible class. The ‘ellipsis’ feature also occurred more frequently in the credible class. Presumably, an ellipsis (“...” at the end of the text) indicates that the tweet has been truncated and that more content exists in the external information source linked through the included URL.
We found another interesting result in the distribution of the sentiment features (‘sentiment’, ‘sentiment pos’ and ‘sentiment neg’). The ‘sentiment’ feature is computed by subtracting ‘sentiment neg’ (the negative sentiment score) from ‘sentiment pos’ (the positive sentiment score), and both ‘sentiment pos’ and ‘sentiment neg’ are computed using correlations with positive and negative lexicons. In the distribution of the ‘sentiment’ feature (the overall sentiment score), the credible class appears more frequently. However, for both ‘sentiment pos’ and ‘sentiment neg’ it can be clearly seen that the non-credible class (the negative credibility context) has more frequent occurrences of either kind of sentiment. This seems to imply that positive sentiment is not necessarily associated with credibility.
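The lexicon-based scoring just described can be sketched directly; the word lists below are tiny placeholders for whatever lexicons the real feature extraction used.

```python
# Illustrative lexicons only; the study's actual word lists are not given.
POSITIVE = {"won", "love", "great", "good"}
NEGATIVE = {"jailed", "corruption", "bad", "spam"}

def sentiment_scores(text):
    """Compute sentiment_pos and sentiment_neg as lexicon match counts,
    and the overall sentiment as their difference, as described above."""
    words = [w.strip(".,!?:;#@").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return {"sentiment_pos": pos, "sentiment_neg": neg,
            "sentiment": pos - neg}
```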
The ‘news’ feature is computed using correlations with the top 300 news accounts (by number of followers) listed on the Twibes1 website. As expected, this feature occurs more frequently in the credible context. All in all, the total score for the credible context was 0.145. Since the total score of 0.148 for the non-credible context is very close to that of the credible context, the simple presence of features does not necessarily increase credibility.
5.4.2 Analysis across Multiple Corpora
In the previous evaluation, we analyzed feature distributions in the credible and non-credible contexts. We now examine possible variation of the feature distribution across multiple topics in Twitter. Since the context of multiple topics must be distinguished from the context of credibility, we treat the type of topic as an independent variable and do not consider the credibility context simultaneously. The same approach is applied to the two following contexts as well.
1 www.twibes.com

Figure 5.3: Analysis of the distribution of selected content-based features across a diverse set of Twitter topics.

(a) A plot of the average occurrence of each feature per tweet across all collected data. (b) Overview of the distribution of all features per topic.

Figure 5.4: Two charts showing the average occurrence score of (a) each feature across all datasets and (b) all features per topic.

To analyze the distribution of content-based features across a number of topics in Twitter, we use the 8 topic-specific datasets from Table 5.2. The distribution across our topic collection is illustrated in Figure 5.3, which presents the average occurrence score of each feature per tweet on the x-axis. All average occurrence scores are normalized by the maximum number of occurrences of each feature, so that they range from 0 to 1; this allows us to compare the relative frequency of occurrence between features. Clearly, this graph exhibits significant variance in occurrence for most features across different topics.
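The normalization described above is a simple division by the per-feature maximum; a minimal sketch:

```python
def normalize(occurrences):
    """Normalize one feature's average occurrence scores across topics
    by the maximum observed value, mapping them into [0, 1]."""
    peak = max(occurrences.values())
    return {topic: score / peak for topic, score in occurrences.items()}
```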
We found interesting distribution patterns across multiple topics. As can be seen in Figure 5.4 (b), which represents the average distribution of all features across the independent topics, the topics concerning crisis or emergency (#Earthquake, #Egypt and #Libya)2 display a significant increase in average occurrence score compared to the other topics. For example, the #Libya topic has a score of 0.14 while #Superbowl, a non-crisis event, scored 0.09: a relative increase of roughly 55% in feature occurrence for the crisis situation. On the contrary, run-of-the-mill topics such as #Facebook (0.06) and #Love (0.07) have the lowest scores in frequency of feature occurrence. This result implies that we can expect better performance when applying feature-based predictive models to emergency or crisis topics, since those topics tend to have a richer set of features with more frequent occurrence. However, this tendency cannot be assumed for every feature, since some features show the opposite pattern in frequency of occurrence; for example,

2 These datasets were crawled during the 2011 revolutions in Libya and Egypt.
‘num mention’ in Figure 5.3 shows significantly lower frequency in the #Libya, #Egypt and #Earthquake topics than in #Love and #Superbowl. In general, we found significantly different feature distributions across different topics. From this result, we conclude that the type of topic has a more significant effect on the identification of credible information than any individual feature.
5.4.3 Feature Distribution in Retweet Chains
Figure 5.5: Distribution of predictive features in three contexts formed around the length of retweet chains: a) non-retweeted content, b) short chains (1-3 hops), c) long chains (≥4 hops).
The third context in this experiment is designed to examine the behavioral aspects of the Twitter domain by looking closely at the feature distribution across retweet chains of different lengths in the social graph. In fact, this is a combined evaluation of social and content features rather than one relying on content alone.
Why do we want to observe this specific context? Suppose a user has multiple tweets retweeted by a few users she follows. Can we say all of those tweets are retweeted because their content is credible? We cannot, since there are counterexamples: it is easy to find spam, and tweets bearing noisy content without any newsworthy information, that also get retweeted in Twitter. To find possible evidence toward this question, we designed and explored this context. In particular, we focus on the feature distribution across the set of tweets found at the ends of ‘long chains’, and those that had been propagated, but only in ‘short chains’. The method used to distinguish long and short chains is described in Section 5.2 “Retweet Chains”.
A set of non-propagated tweets was randomly collected as the ‘no chain’ class for this comparison. Figure 5.5 shows the feature distribution for this context. Notably, the URL feature occurs 50% more frequently in the longer chains than in the short-chain or no-chain contexts. This may imply that tweets with links (URLs) pointing to external information sources tend to propagate through longer chains. Similarly, tweets involved in long retweet chains tend to contain longer content in terms of words and characters.
5.4.4 Feature Distribution in Dyadic Pairs
Figure 5.6: Distribution of predictive features in dyadic pairs. Tweets were selected for this group if they occurred in a pairwise conversation between two users in which more than two messages were exchanged, as measured by the mention and retweet metadata in the Twitter API.
The final context in this experiment also explores the behavioral side of Twitter. We call this context “dyadic pairs of tweets”, since conversational tweets in Twitter are exchanged between a pair of users by means of the ‘@mention’ or ‘@reply’ tags. These two tags are designed to let users address another user, and are distinct from the ‘retweet’ mechanism: where ‘retweet’ activity broadcasts information to a set of users, ‘mention’ activity carries dyadic communication between two users. The details of the dyadic dataset are explained in Section 5.2 “Dyadic Pairs”. Figure 5.6 shows the feature distributions across the two classes of this context. In the context of dyadic communication, the feature distribution shows lower variance than in the other three contexts. We can see a slight increase in both the ‘char’ and ‘word’ features for the non-dyadic class. A
significant decrease is shown in the ‘uppercase’ feature for the dyadic context, presumably reflecting a distinctive text pattern in conversational tweets.
5.5 Discussion
In this chapter, we provided an in-depth analysis of the distribution of the salient features in Twitter that can be used to find newsworthy [39] and credible [2] information. The experiment covered four individual contexts: credibility, diverse corpora, retweet chain length and dyadic communication. In this analysis, we have shown that, in general, feature distributions vary significantly across topics, and that the frequency of occurrence of most features tends to increase in emergency or unrest situations. We carefully conclude that while some features may help in predicting credible information, the degree of utility of each feature can vary significantly with context, both in terms of the frequency with which an individual feature occurs and the way in which it is used. Furthermore, it remains extremely difficult to predict credible information in Twitter because of the complex behavior of the features and their time-variant characteristics in microblogs. Since the pattern of information in the microblog domain changes rapidly with contemporary trends and issues, our next study will focus on classifying the topic space with more advanced natural language processing methodology, to address the challenge of finding credible content in this rapidly changing domain.
Chapter 6
Conclusion
6.1 Conclusion
As with most information consumption and evaluation on social networks and the contemporary web, the rapid information flow in Twitter limits our ability to accurately identify credibility. With the information overload phenomenon on the Internet, finding reliable and newsworthy information has become an essential task. In this project, as this forum grows ever more popular, we focused on the credibility of information provenance and information flow in Twitter. Through two different experiments, we provided and evaluated three computational models for finding credible information, and explored feature distributions in four different contexts across social, content-based and behavioral features.
In the first experiment, we provided credibility assessment models (social, content and hybrid models) derived from 8 collections of tweets about trending topics, including the associated user profile information for each tweeter as provided by the Twitter API. Next, an automated evaluation of the predictive accuracy of each model was performed, predicting over a collection of manually assessed tweets from the “Libya” dataset; this set of annotated tweets was collected in an online user survey. Results showed that the hybrid feature combination model outperformed both the social and content-based models, achieving a predictive accuracy of 89%, compared with 69% for the social model and 78% for the next best performing hybrid (weighted strategy).
The first experiment was followed by an in-depth analysis of the distribution of the salient features in Twitter. The goal of this experiment was to find the best set of features for finding interesting, newsworthy and credible information. The analysis focused on feature distributions in four distinct contexts: diverse topics, credibility levels, retweet chains and dyadic interactions. The second result showed that, in general (across 8 datasets), feature usage tends to increase in emergency situations or situations of unrest. From this result, we concluded that the utility of each feature can vary significantly with context, both in terms of the occurrence of a particular feature and the manner in which it is used. Due to the size and rapid evolution of microblogs such as Twitter, it was exceedingly challenging to fully understand the subtle links between feature presence/usage and truly credible information.
Follow-up research will focus on combining the predictive ability of different features with both their distributions and particular usages, to gain a more in-depth knowledge of the complex interactions that occur in the Twitter space and to provide insight for future credibility prediction models. We will also focus on extracting and classifying semantics from both social and content features for topic-specific domains in microblogs.
Bibliography
[1] P. Analytics, “2009 twitter study at http://pearanalytics.com/,” Mar. 2009.
[2] B. Kang, J. O’Donovan, and T. Hollerer, “Modeling topic specific credibility on
twitter,” in Proceedings of the 2012 ACM international conference on Intelligent
User Interfaces, IUI ’12, (New York, NY, USA), pp. 179–188, ACM, 2012.
[3] Twitter, “Twitter blog. “#numbers,” march 14, 2011 at http://blog.twitter.com/,”
Mar. 2011.
[4] K. Canini, B. Suh, and P. Pirolli, “Finding credible information sources in social
networks based on content and social structure,” in Privacy, security, risk and trust
(passat), 2011 ieee third international conference on and 2011 ieee third interna-
tional conference on social computing (socialcom), pp. 1 –8, oct. 2011.
[5] C. Castillo, M. Mendoza, and B. Poblete, “Information credibility on twitter,” in
Proceedings of the 20th international conference on World wide web, WWW ’11,
(New York, NY, USA), pp. 675–684, ACM, 2011.
[6] L. Yang, T. Sun, M. Zhang, and Q. Mei, “We know what you #tag: does the dual
role affect hashtag adoption?,” in Proceedings of the 21st international conference
on World Wide Web, WWW ’12, (New York, NY, USA), pp. 261–270, ACM, 2012.
[7] D. Artz and Y. Gil, “A survey of trust in computer science and the semantic web,”
Web Semant., vol. 5, pp. 58–71, June 2007.
[8] D. Olmedilla, O. F. Rana, B. Matthews, and W. Nejdl, “Security and trust issues
in semantic grids,” in In Proceedings of the Dagstuhl Seminar, Semantic Grid:
The Convergence of Technologies, Volume 05271. 2005. [PD05] [PPI04] Panteli,
pp. 191–200.
iii
iv BIBLIOGRAPHY
[9] B. J. Fogg, J. Marshall, O. Laraki, A. Osipovich, C. Varma, N. Fang, J. Paul,
A. Rangnekar, J. Shon, P. Swani, and M. Treinen, “What makes web sites credible?:
a report on a large quantitative study,” in Proceedings of the SIGCHI conference on
Human factors in computing systems, CHI ’01, (New York, NY, USA), pp. 61–68,
ACM, 2001.
[10] K. McNally, M. P. O’Mahony, B. Smyth, M. Coyle, and P. Briggs, “Towards a
reputation-based model of social web search,” in Proceedings of the 15th interna-
tional conference on Intelligent user interfaces, IUI ’10, (New York, NY, USA),
pp. 179–188, ACM, 2010.
[11] D. Houser and J. C. Wooders, “Reputation in Auctions: Theory, and Evidence
from eBay,” Journal of Economics and Management Strategy, Vol. 15, pp. 353-
369, Summer 2006.
[12] P. Resnick and R. Zeckhauser, “Trust among strangers in Internet transactions:
Empirical analysis of eBay’s reputation system,” in The Economics of the Internet
and E-Commerce (M. R. Baye, ed.), vol. 11 of Advances in Applied Microeconomics,
pp. 127–157, Elsevier Science, 2002.
[13] I. de Sola Pool and M. Kochen, “Contacts and influence,” Social Networks, vol. 1,
no. 1, pp. 5–51, 1978.
[14] S. Milgram, “The small world problem,” Psychology Today, vol. 1, pp. 61–67, May
1967.
[15] J. Golbeck, Computing with Social Trust. Human-Computer Interaction Series,
Springer, 2008.
[16] H. Zhao, W. Kallander, T. Gbedema, H. Johnson, and F. Wu, “Read what you
trust: An open wiki model enhanced by social context,” in Privacy, security, risk
and trust (passat), 2011 ieee third international conference on and 2011 ieee third
international conference on social computing (socialcom), pp. 370 –379, oct. 2011.
[17] J. O’Donovan, B. Smyth, V. Evrim, and D. McLeod, “Extracting and visualizing
trust relationships from online auction feedback comments,” in Proceedings of the
20th international joint conference on Artifical intelligence, IJCAI’07, (San Fran-
cisco, CA, USA), pp. 2826–2831, Morgan Kaufmann Publishers Inc., 2007.
BIBLIOGRAPHY v
[18] J. Golbeck, C. Robles, M. Edmondson, and K. Turner, “Predicting personality from
twitter,” in Privacy, security, risk and trust (passat), 2011 ieee third international
conference on and 2011 ieee third international conference on social computing
(socialcom), pp. 149 –156, oct. 2011.
[19] S. Adali, R. Escriva, M. K. Goldberg, M. Hayvanovych, M. Magdon-ismail, B. K.
Szymanski, W. A. Wallace, and G. T. Williams, “Measuring behavioral trust in
social networks.”
[20] J. Golbeck and D. Hansen, “Computing political preference among twitter follow-
ers,” in Proceedings of the 2011 annual conference on Human factors in computing
systems, CHI ’11, (New York, NY, USA), pp. 1105–1108, ACM, 2011.
[21] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a
news media?,” in WWW ’10: Proceedings of the 19th international conference on
World wide web, (New York, NY, USA), pp. 591–600, ACM, 2010.
[22] M. Naaman, J. Boase, and C.-H. Lai, “Is it really about me?: message content in
social awareness streams,” in Proceedings of the 2010 ACM conference on Computer
supported cooperative work, CSCW ’10, (New York, NY, USA), pp. 189–192, ACM,
2010.
[23] B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury, “Twitter power: Tweets as
electronic word of mouth,” J. Am. Soc. Inf. Sci. Technol., vol. 60, pp. 2169–2188,
Nov. 2009.
[24] M. R. Morris, S. Counts, A. Roseway, A. Hoff, and J. Schwarz, “Tweeting is believ-
ing?: understanding microblog credibility perceptions,” in Proceedings of the ACM
2012 conference on Computer Supported Cooperative Work, CSCW ’12, (New York,
NY, USA), pp. 441–450, ACM, 2012.
[25] B. Suh, L. Hong, P. Pirolli, and E. H. Chi, “Want to be retweeted? large scale
analytics on factors impacting retweet in twitter network,” in Proceedings of the
2010 IEEE Second International Conference on Social Computing, SOCIALCOM
’10, (Washington, DC, USA), pp. 177–184, IEEE Computer Society, 2010.
[26] D. Zarrella, “The science of retweets,” http://danzarrella.com/the-science-of-
retweets-report.html, http://danzarrella.com/science-of-retweets.pdf, Sept. 2009.
[27] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts, “Who says what to whom
on twitter,” in Proceedings of the 20th international conference on World wide web,
WWW ’11, (New York, NY, USA), pp. 705–714, ACM, 2011.
[28] D. Zhao and M. B. Rosson, “How and why people twitter: the role that micro-
blogging plays in informal communication at work,” in Proceedings of the ACM
2009 international conference on Supporting group work, GROUP ’09, (New York,
NY, USA), pp. 243–252, ACM, 2009.
[29] G. Convertino, S. Kairam, L. Hong, B. Suh, and E. H. Chi, “Designing a cross-
channel information management tool for workers in enterprise task forces,” in
Proceedings of the International Conference on Advanced Visual Interfaces, AVI
’10, (New York, NY, USA), pp. 103–110, ACM, 2010.
[30] A. Java, X. Song, T. Finin, and B. Tseng, “Why we twitter: understand-
ing microblogging usage and communities,” in Proceedings of the 9th WebKDD
and 1st SNA-KDD 2007 workshop on Web mining and social network analysis,
WebKDD/SNA-KDD ’07, (New York, NY, USA), pp. 56–65, ACM, 2007.
[31] J. Caverlee and L. Liu, “Countering web spam with credibility-based link analy-
sis,” in Proceedings of the twenty-sixth annual ACM symposium on Principles of
distributed computing, PODC ’07, (New York, NY, USA), pp. 157–166, ACM, 2007.
[32] U. Sridhar and S. Mandyam, “Information sources driving social influences: A new
model for belief learning in social networks,” in International Conference on Ad-
vances in Social Network Analysis and Mining (ASONAM), pp. 321–327, 2011.
[33] W. Wei, J. Lee, and I. King, “Measuring credibility of users in an e-learning envi-
ronment,” in Proceedings of the 16th international conference on World Wide Web,
WWW ’07, (New York, NY, USA), pp. 1279–1280, ACM, 2007.
[34] Y. Suzuki and A. Nadamoto, “Credibility assessment using wikipedia for messages
on social network services,” in Proceedings of the 2011 IEEE Ninth International
Conference on Dependable, Autonomic and Secure Computing, DASC ’11, (Wash-
ington, DC, USA), pp. 887–894, IEEE Computer Society, 2011.
[35] E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, “Finding high-
quality content in social media,” in Proceedings of the international conference on
Web search and web data mining, WSDM ’08, (New York, NY, USA), pp. 183–194,
ACM, 2008.
[36] A. Juffinger, M. Granitzer, and E. Lex, “Blog credibility ranking by exploiting
verified content,” in Proceedings of the 3rd workshop on Information credibility on
the web, WICOW ’09, (New York, NY, USA), pp. 51–58, ACM, 2009.
[37] G. Lotan, E. Graeff, M. Ananny, D. Gaffney, I. Pearce, and D. Boyd, “The revo-
lutions were Tweeted: Information flows during the 2011 Tunisian and Egyptian
revolutions,” International Journal of Communication, vol. 5, 2011.
[38] A. Gupta and P. Kumaraguru, “Credibility ranking of tweets during high impact
events,” in Proceedings of the 1st Workshop on Privacy and Security in Online
Social Media, PSOSM ’12, (New York, NY, USA), pp. 2:2–2:8, ACM, 2012.
[39] M. Mendoza, B. Poblete, and C. Castillo, “Twitter under crisis: can we trust what
we RT?,” in Proceedings of the First Workshop on Social Media Analytics, SOMA
’10, (New York, NY, USA), pp. 71–79, ACM, 2010.
[40] A. Iyengar, T. Finin, and A. Joshi, “Content-based prediction of temporal bound-
aries for events in Twitter,” in Proceedings of the Third IEEE International Con-
ference on Social Computing, IEEE Computer Society, October 2011.
[41] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes twitter users: real-
time event detection by social sensors,” in Proceedings of the 19th international
conference on World wide web, WWW ’10, (New York, NY, USA), pp. 851–860,
ACM, 2010.
[42] S. Westman and L. Freund, “Information interaction in 140 characters or less:
genres on twitter,” in Proceedings of the third symposium on Information interaction
in context, IIiX ’10, (New York, NY, USA), pp. 323–328, ACM, 2010.
[43] A.-M. Popescu and A. Jain, “Understanding the functions of business accounts on
twitter,” in Proceedings of the 20th international conference companion on World
wide web, WWW ’11, (New York, NY, USA), pp. 107–108, ACM, 2011.
[44] R. K. Garrett, “Troubling consequences of online political rumoring,” Human Com-
munication Research, vol. 37, no. 2, pp. 255–274, 2011.
[45] D. Gayo-Avello, “‘I wanted to predict elections with Twitter and all I got was this
lousy paper’: a balanced survey on election prediction using Twitter data,” CoRR,
vol. abs/1204.6441, 2012.
[46] P. Melville, R. J. Mooney, and R. Nagarajan, “Content-boosted collaborative filter-
ing for improved recommendations,” in Eighteenth national conference on Artificial
intelligence, (Menlo Park, CA, USA), pp. 187–192, American Association for Arti-
ficial Intelligence, 2002.
[47] J. L. Herlocker, J. A. Konstan, and J. Riedl, “Explaining collaborative filtering
recommendations,” in Proceedings of the 2000 ACM conference on Computer sup-
ported cooperative work, CSCW ’00, (New York, NY, USA), pp. 241–250, ACM,
2000.
[48] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “Grouplens: an
open architecture for collaborative filtering of netnews,” in Proceedings of the 1994
ACM conference on Computer supported cooperative work, CSCW ’94, (New York,
NY, USA), pp. 175–186, ACM, 1994.
[49] A.-L. Barabasi, “The origin of bursts and heavy tails in human dynamics,” Nature,
vol. 435, p. 207, 2005.
[50] S. Adali, F. Sisenda, and M. Magdon-Ismail, “Actions speak as loud as words:
predicting relationships from social behavior data,” in WWW, pp. 689–698, 2012.
[51] S. Y. Rieh and D. R. Danielson, “Credibility: A multidisciplinary framework,”
Annual Review of Information Science and Technology, vol. 41, no. 1, pp. 307–364,
2007.
[52] K. Ericsson, The Cambridge Handbook of Expertise And Expert Performance. Cam-
bridge Handbooks in Psychology, Cambridge University Press, 2006.
[53] M. Cha, H. Haddadi, F. Benevenuto, and K. Gummadi, “Measuring user influence
in twitter: The million follower fallacy,” 4th International AAAI Conference on
Weblogs and Social Media (ICWSM), 2010.
[54] Y. Duan, L. Jiang, T. Qin, M. Zhou, and H.-Y. Shum, “An empirical study on
learning to rank of tweets,” in COLING’10, pp. 295–303, 2010.
[55] J. Chen, R. Nairn, L. Nelson, M. Bernstein, and E. Chi, “Short and tweet: experi-
ments on recommending content from information streams,” in Proceedings of the
28th international conference on Human factors in computing systems, CHI ’10,
(New York, NY, USA), pp. 1185–1194, ACM, 2010.
[56] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi, “Eddi:
interactive topic-based browsing of social status streams,” in Proceedings of the
23rd annual ACM symposium on User interface software and technology, UIST
’10, (New York, NY, USA), pp. 303–312, ACM, 2010.
[57] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach.
Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[58] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi, “Learning to classify short and sparse
text & web with hidden topics from large-scale data collections,” in Proceedings of
the 17th international conference on World Wide Web, WWW ’08, (New York,
NY, USA), pp. 91–100, ACM, 2008.
[59] J. Weng, E.-P. Lim, J. Jiang, and Q. He, “Twitterrank: finding topic-sensitive
influential twitterers,” in Proceedings of the third ACM international conference on
Web search and data mining, WSDM ’10, (New York, NY, USA), pp. 261–270,
ACM, 2010.
[60] A. Ritter, C. Cherry, and B. Dolan, “Unsupervised modeling of twitter conver-
sations,” in Human Language Technologies: The 2010 Annual Conference of the
North American Chapter of the Association for Computational Linguistics, HLT
’10, (Stroudsburg, PA, USA), pp. 172–180, Association for Computational Linguis-
tics, 2010.
[61] D. Ramage, S. T. Dumais, and D. J. Liebling, “Characterizing microblogs with
topic models,” in ICWSM, 2010.
[62] D. Gayo-Avello, “Nepotistic relationships in twitter and their impact on rank pres-
tige algorithms,” CoRR, vol. abs/1004.0816, 2010.
[63] D. Talbot, “How Google ranks tweets,” http://www.technologyreview.com/web/24353/,
Jan. 2010.
[64] A. Ghosh and P. McAfee, “Incentivizing high-quality user-generated content,” in
Proceedings of the 20th international conference on World wide web, WWW ’11,
(New York, NY, USA), pp. 137–146, ACM, 2011.
[65] R. Xiang, J. Neville, and M. Rogati, “Modeling relationship strength in online social
networks,” in Proceedings of the 19th international conference on World wide web,
WWW ’10, (New York, NY, USA), pp. 981–990, ACM, 2010.
[66] X. Wang, F. Wei, X. Liu, M. Zhou, and M. Zhang, “Topic sentiment analysis in
twitter: a graph-based hashtag sentiment classification approach,” in Proceedings
of the 20th ACM international conference on Information and knowledge manage-
ment, CIKM ’11, (New York, NY, USA), pp. 1031–1040, ACM, 2011.
[67] A. Bifet and E. Frank, “Sentiment knowledge discovery in twitter streaming data,”
in Proceedings of the 13th international conference on Discovery science, DS’10,
(Berlin, Heidelberg), pp. 1–15, Springer-Verlag, 2010.
[68] O. Phelan, K. McCarthy, M. Bennett, and B. Smyth, “Terms of a feather: Content-
based news recommendation and discovery using twitter,” in Clough et al. [73],
pp. 448–459.
[69] G. Jeh and J. Widom, “Simrank: a measure of structural-context similarity,” in
Proceedings of the eighth ACM SIGKDD international conference on Knowledge
discovery and data mining, KDD ’02, (New York, NY, USA), pp. 538–543, ACM,
2002.
[70] S. Vargas and P. Castells, “Rank and relevance in novelty and diversity metrics
for recommender systems,” in Proceedings of the 2011 ACM Conference on Recom-
mender Systems, RecSys 2011, Chicago, IL, USA, October 23-27, 2011, pp. 109–
116, 2011.
[71] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, “Diversifying search results,”
in Proceedings of the Second ACM International Conference on Web Search and
Data Mining, WSDM ’09, (New York, NY, USA), pp. 5–14, ACM, 2009.
[72] J. Hannon, K. McCarthy, and B. Smyth, “Finding useful users on twitter: Twit-
tomender the followee recommender,” in Clough et al. [73], pp. 784–787.
[73] P. Clough, C. Foley, C. Gurrin, G. J. F. Jones, W. Kraaij, H. Lee, and V. Mur-
dock, eds., Advances in Information Retrieval - 33rd European Conference on IR
Research, ECIR 2011, Dublin, Ireland, April 18-21, 2011. Proceedings, vol. 6611 of
Lecture Notes in Computer Science, Springer, 2011.
[74] K. Puniyani, J. Eisenstein, S. Cohen, and E. P. Xing, “Social links from latent
topics in microblogs,” in Proceedings of the NAACL HLT 2010 Workshop on Com-
putational Linguistics in a World of Social Media, WSA ’10, (Stroudsburg, PA,
USA), pp. 19–20, Association for Computational Linguistics, 2010.
[75] D. Ramage, S. Dumais, and D. Liebling, “Characterizing microblogs with topic
models,” in ICWSM, 2010.
[76] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach.
Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[77] O. Phelan, K. McCarthy, M. Bennett, and B. Smyth, “Terms of a feather: content-
based news recommendation and discovery using twitter,” in Proceedings of the
33rd European conference on Advances in information retrieval, ECIR’11, (Berlin,
Heidelberg), pp. 448–459, Springer-Verlag, 2011.
[78] B. Smyth, P. Briggs, M. Coyle, and M. O’Mahony, “Google shared. a case-study in
social search,” in Proceedings of the 17th International Conference on User Model-
ing, Adaptation, and Personalization: formerly UM and AH, UMAP ’09, (Berlin,
Heidelberg), pp. 283–294, Springer-Verlag, 2009.
[79] K. McNally, M. P. O’Mahony, B. Smyth, M. Coyle, and P. Briggs, “Social and
collaborative web search: an evaluation study,” in Proceedings of the 16th inter-
national conference on Intelligent user interfaces, IUI ’11, (New York, NY, USA),
pp. 387–390, ACM, 2011.
[80] J. O’Donovan, M. Schaal, B. Kang, T. Hollerer, and S. Barry, “A network-based
analysis of topic similarity in microblogs,” in Proceedings of the N-th international
conference on Recommender Systems, RecSys ’12, (New York, NY, USA), pp. 0–
0, ACM, 2012.
[81] D. H. McKnight and C. J. Kacmar, “Factors and effects of information credibility,”
in Proceedings of the ninth international conference on Electronic commerce, ICEC
’07, (New York, NY, USA), pp. 423–432, ACM, 2007.
[82] K. Tanaka, H. Ohshima, A. Jatowt, S. Nakamura, Y. Yamamoto, K. Sumiya,
R. Lee, D. Kitayama, T. Yumoto, Y. Kawai, J. Zhang, S. Nakajima, and Y. Inagaki,
“Evaluating credibility of web information,” in Proceedings of the 4th International
Conference on Ubiquitous Information Management and Communication, ICUIMC
’10, (New York, NY, USA), pp. 23:1–23:10, ACM, 2010.
[83] K. Puniyani, J. Eisenstein, S. Cohen, and E. P. Xing, “Social links from latent
topics in microblogs,” in Proceedings of the NAACL HLT 2010 Workshop on Com-
putational Linguistics in a World of Social Media, WSA ’10, (Stroudsburg, PA,
USA), pp. 19–20, Association for Computational Linguistics, 2010.
[84] B. Gretarsson, J. O’Donovan, S. Bostandjiev, T. Hollerer, A. Asuncion, D. New-
man, and P. Smyth, “Topicnets: Visual analysis of large text corpora with topic
modeling,” ACM Trans. Intell. Syst. Technol., vol. 3, pp. 23:1–23:26, Feb. 2012.