Method for Measuring Twitter Content Influence
Euijong Lee
Dept. of Computer and Radio
Communications Engineering
Korea University
Seoul, Republic of Korea
Email: [email protected]
Jeong-Dong Kim
Dept. of Computer and Radio
Communications Engineering
Korea University
Seoul, Republic of Korea
Email: [email protected]
Doo-Kwon Baik
Graduate School of Convergence IT
Korea University
Seoul, Republic of Korea
Email: [email protected]
Abstract—Twitter is a microblogging website with specific
characteristics not found in other social network services. This
platform contains a good deal of valuable content, and users can
access this content using Twitter search. However, Twitter search
returns only time-descending ordered content including
keywords. Thus, we propose a linear-time method of measuring
the influence of Twitter content considering not only time, but
also characteristics of each Twitter account. In analyzing these
characteristics, we have found that the number of retweets can
measure shareability, while the number of followers held by the
content author can measure spreadability. We perform
experiments using real Twitter data for proving the effectiveness
of the proposed method. We demonstrate that this proposed
method is effective at finding up-to-date content. Further, in
comparing our method with analysis via PageRank, we
demonstrate that our method is more effective at accurately
measuring influence.
Keywords-Twitter; Contents Search; Retweet; Follower; Contents Influence
I. INTRODUCTION
Twitter is a microblogging website on which content is created as posts of 140 characters or fewer (called tweets). The service has a well-defined markup vocabulary. Unlike other social networking services, Twitter supports one-sided relationships between users. If a user wants to view another user’s content in real time, the user can add that other user to their social network list without the other user’s approval. This behavior is referred to as ‘following’, and the user who follows is called a ‘follower’. The main mechanism for sharing content on Twitter is ‘retweeting’. A retweet keeps the information of the original content while also including the opinion of the user who retweeted it. Although these simple mechanisms restrict the types of content that can be created, they have driven the growth of Twitter: they make it easy to share information and to write content succinctly.
In 2012, about 140 million tweets a day were created on Twitter [1]. At this volume, users face problems in finding useful information on the network [5, 8]. Twitter provides a content search service, through which users can search for content by entering relevant keywords and receive the matching content in time-descending order (see figure 1). Twitter also provides advanced search offering four categories of options: ‘words’, ‘people’, ‘places’, and ‘other’. The ‘words’ category provides options related to specific keywords, the ‘people’ category provides options related to specific accounts, the ‘places’ category provides options related to specific locations, and the ‘other’ category provides options limiting results to those including positive or negative emoticons or to questions, as well as an option to include otherwise-excluded retweets in the results (see figure 2). Still, advanced search also returns content only in time-descending order, with no attention to relevancy. Up-to-date information is important on Twitter; however, Twitter search needs to consider not only time but other characteristics as well [3, 5, 8]. We focus on the problem of improving Twitter search, and in this research we propose a method of measuring the influence of individual Twitter content using characteristics of individual Twitter accounts in content search.
II. RELATED WORKS
The most representative studies on Twitter analyze characteristics like follower count, followee count, and retweet count [2-5]. There exists considerable research on Twitter search based on these characteristics.
[2] attempted to measure user influence of Twitter accounts, finding three contributory factors: number of followers, number of retweets, and number of mentions. A large number of followers implies a large audience for the user. More specifically, it means a large audience receives a user’s content directly. Number of followers, then, measures a user’s ability to spread information. Number of retweets demonstrates the ability of the user to make content with pass-along value; a retweet means that there is worth to be shared. Number of mentions demonstrates the ability of the user to relate to others in a conversation; it shows advertisement value.
[3] researched the possibility of Twitter serving as a news medium. They found that the number of retweets is the chief measure of user influence, and the number of followers is the chief measure of user popularity. Further, they found that trends on Twitter match news media trends fairly closely (about 85%); as a result, they claimed that Twitter plays the role not only of a social networking service but also of a news medium.
[5] researched differences between microblog search and web search by analyzing search log information. They found that users using Twitter search want to find ‘timely information’, ‘social information’ and ‘topical information’. Queries in Twitter searches are shorter than web searches, but words are longer, and Twitter search users use site-specific grammar (for example, using ‘@’ or ‘#’). [8] suggested that Twitter search should reflect characteristics of the users. In surveying users, they found that users of Twitter search most often want to find events, trending topics, or specific people.
In this research, we use characteristics of Twitter extracted in previous research to produce a method for measuring the influence of a single post that can be used via Twitter search and mining.
III. CHARACTERISTICS OF TWITTER AND A METHOD OF MEASURING CONTENT INFLUENCE
We want to solve the problem that Twitter content search only reflects time of posting. We chose characteristics of Twitter and created a method for measuring the influence of
content using these characteristics. In Ⅲ.A, we describe the
specific characteristics of Twitter that are used in our proposed
method. In Ⅲ.B, we explain our proposed method.
A. Characteristics of Twitter
Content influence is a value that measures to what degree a piece of content contains meaningful information for users. We choose three factors for measuring influence. The first factor is the spreadability of content. On Twitter, content is delivered to the author’s followers; thus, the number of followers shows how many users received that content. Previous research showed that the number of followers measures the popularity of the author [3, 5]. Popularity of this kind corresponds to the ability to spread information, and therefore we use the number of the author’s followers as spreadability. The second factor is the value of the content’s information. This value can be measured by the number of times the content is shared, because valuable content will be shared with others. On Twitter, the retweet mechanism is the main method for sharing content, and therefore we use retweet count to measure this factor. The last factor is the currency of the information. Newer content contains more valuable information than older content. Twitter is sensitive to up-to-date information [3, 8], and Twitter search shows results in time-descending order. This factor must be handled with care in our proposed method because, as we pointed out, ordering by time alone is the problem with Twitter search. However, currency is a very important factor in news media, so we keep it and reflect it using the time information of the content.
B. Method for Measuring Content Influence
We use these three characteristics of Twitter in our proposed method: follower number as spreadability of content, retweet count as the value of content information, and time written as currency of information. The proposed method using these factors is as follows.
$$I(C_i) = \alpha \log(RT_i + 1) + \beta \log(F_i + 1) + \gamma \log\frac{k}{NT - WT_i} \qquad (1)$$
※ α + β + γ = 1
Here, $C_i$ is the $i$th piece of content, and $I(C_i)$ is the influence of $C_i$. The term $\alpha \log(RT_i + 1)$ represents the shareability of $C_i$, where $RT_i$ is the retweet count of $C_i$. We take the logarithm to normalize the value, and we add 1 so that the output is always defined: without it, a retweet count of 0 would give negative infinity. $F_i$ is the number of followers of the author of $C_i$, so $\beta \log(F_i + 1)$ represents the spreadability of $C_i$; it takes a logarithm for the same reason as the previous factor. The term $\gamma \log\frac{k}{NT - WT_i}$ represents currency influence, where $NT$ means ‘now time’ and $WT_i$ means ‘written time of $C_i$’. We calculate the time interval as a real number of hours, in which the difference in hours is the integer part and the difference in minutes is the fractional part. This term also takes a logarithm for the same reason as the previous two factors. We take the reciprocal of the interval to ensure that up-to-date content makes a larger contribution than out-of-date content. Taking a reciprocal, however, has the disadvantage that the currency factor becomes much smaller than the other factors. To correct this, we multiply by a constant $k$. In the equation $y = \log\frac{k}{x}$, $y$ is 0 when $x$ equals $k$, positive when $x$ is smaller than $k$, and negative when $x$ is larger than $k$. Therefore, if a user wants to find posts from the last $k$ hours, they need only fix the value of $k$ in the equation. Finally, α, β and γ are mediators, that is, weights that adjust the strength of each factor in the equation. We require them to sum to 1 in order to prevent one factor from being far more influential than the others.
Figure 1. Results of Twitter search
Figure 2. Twitter Advanced Search
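As a minimal sketch of equation (1), the following Python function computes the influence score of a single post. The function and parameter names are illustrative and do not come from the paper. Two details are assumptions inferred from the scores reported in Table 6 rather than stated in the text: the logarithm is taken in base 10, and the interval NT − WT_i is expressed in hours.

```python
import math


def influence(retweets, followers, hours_since_post, alpha, beta, gamma, k):
    """Score one post following equation (1).

    Assumptions (not stated in the paper): base-10 logarithms and an
    interval NT - WT_i given in hours.
    """
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("alpha + beta + gamma must equal 1")

    # Shareability (retweets) and spreadability (author followers).
    score = alpha * math.log10(retweets + 1) + beta * math.log10(followers + 1)

    # Currency term; skipped when gamma is 0 so that the 'without time'
    # setting (gamma = 0, k = 0) stays well defined.
    if gamma != 0:
        if hours_since_post <= 0 or k <= 0:
            raise ValueError("hours_since_post and k must be positive")
        score += gamma * math.log10(k / hours_since_post)
    return score


# A post with 19 retweets and 142715 followers, written 9 minutes
# (0.15 hours) before the query time, with a 5-hour window.
with_time = influence(19, 142715, 0.15, alpha=0.16, beta=0.24, gamma=0.6, k=5)
without_time = influence(19, 142715, 0.15, alpha=0.4, beta=0.6, gamma=0.0, k=0)
print(round(with_time, 3), round(without_time, 3))
```

Under these assumptions the two calls give values close to the scores listed for C8 in Table 6 (2.358965 with time and 3.613095 without). Each post is scored independently of the others, which is why the method runs in time linear in the number of search results.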
IV. EXPERIMENTS AND EVALUATION
We crawled Korean Twitter contents and set up experiments using that data. We describe the crawled dataset and experiments in Ⅳ.A, and then show the result of an experiment for content influence using our method in Ⅳ.B.
A. Dataset for Experiments
We used crawled Korean Twitter content and the user relations (followee information) of Korean users from July 1 to July 31, 2012. The content dataset is about 93.1 GB, and the user information dataset is 12.3 GB. We saved content data in JSON format and user data as plain text. Content data includes various pieces of information about a single piece of content: author, retweet count, time written, etc. (see figure 3). User data consists of user-followee relationships (see figure 4).
For our experiments, we extracted related content in two different domains: smartphones (Galaxy series and iPhone) and Psy (a Korean pop singer). We chose the first subject because the Galaxy series and the iPhone are presently the most popular smartphone models, and the Galaxy S3 was issued in July 2012. We chose the second subject because Psy released a new album, “Gangnam Style”, which has become a popular topic worldwide. We collected posts about smartphones (especially the Samsung Galaxy series and iPhone) and Psy using related Korean and English words: 12 for smartphones, and 5 for Psy (see table 1).
TABLE 1. RELATED WORDS IN THE TEST DATA SET
| Topic | Related Korean and English words |
| --- | --- |
| Galaxy Series | 갤럭시, 겔럭시, 겔스, 갤스, 갤노트, 겔노트, 겔놋, 갤놋, galaxy (not case sensitive) |
| iPhone | 아이폰, iphone, I-phone (not case sensitive) |
| Psy | 싸이, 강남 스타일, 강남스타일, Psy, Gangnam Style (not case sensitive) |
The number of collected smartphone-related posts is 207,022, and the number of collected Psy-related posts is 118,521. To make the experiments effective, we kept only content that had been retweeted 10 or more times and deleted content for which we could not obtain a user profile. In the end, we selected 558 posts about smartphones and 421 about Psy.
After this reduction, we classified posts by the identity of the author into one of four categories: public user, personal user, bot, and unclassified. A public user is a spokesperson for a specific community or group, for example ‘samsung’, ‘SBS8news’ and ‘YTN24’: ‘samsung’ is the ID of the SAMSUNG Corporation, and ‘SBS8news’ and ‘YTN24’ belong to broadcasting companies in the Republic of Korea. A personal user is an individual’s account. A bot is a program that produces automated content, generally called a Twitter bot. Finally, unclassified refers to any ID whose category could not be determined. Content classified by user identity is presented in table 2.
TABLE 2. CLASSIFIED DATA SET BY USER’S IDENTITY
| Topic | Public | Personal | Bot | Unclassified |
| --- | --- | --- | --- | --- |
| Smart phone | 205 | 298 | 18 | 37 |
| Psy | 115 | 172 | 36 | 98 |
We also classified posts according to their relation to the keywords, sorting them into one of four categories: direct, indirect, private, and unrelated. A direct post is directly related to the topic. An indirect post does not contain direct information about the topic, but does have relevant information such as a related event, a music video, and so on. A private post contains personal information. Finally, an unrelated post contains information unrelated to the topic. Classification of relation to keywords is presented in table 3, with examples of each content type in table 4.
TABLE 3. CLASSIFIED DATA SET BY INFORMATION OF CONTENTS
| Topic | Direct | Indirect | Private | Unrelated |
| --- | --- | --- | --- | --- |
| Smart phone | 166 | 157 | 130 | 105 |
| Psy | 129 | 67 | 36 | 189 |
Figure 3. Example of contents files
Figure 4. Example of relationship files
TABLE 4. EXAMPLE OF CONTENTS DIVIDED BY INFORMATION
B. Experiments
Before our experiments on the applicability of our proposed method, we first had to demonstrate that retweet in-degree and follower in-degree differ, because if the two were linearly correlated, we would not need to use both factors. We thus first performed an experiment on the independence of retweet and follower in-degree, and then performed our experiment on the effectiveness of the proposed method.
1) Independence of retweet count and follower number: First, we used the Pearson correlation coefficient as a measure of correlation between two data sets (see equation (2)).
$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \qquad (2)$$
The Pearson correlation coefficient has a value between −1 and 1. If the absolute value of r is close to 1, there is a significant linear relationship, but if the absolute value of r is close to 0, there is no linear relationship between data sets. After performing the correlation test, we found that the coefficient of follower and retweet in-degree was close to 0, and therefore, follower and retweet in-degree had no linear relationship. (See table 5)
TABLE 5. RESULT OF PEARSON CORRELATION COEFFICIENT
| Topic | r |
| --- | --- |
| Smart-phone | -0.0348392 |
| Psy | 0.17547095 |
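Equation (2) can be sketched directly in Python. The function name and the toy input lists are illustrative only; they are not the paper's data, for which X would be the retweet counts and Y the author follower counts of the selected posts.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences (equation (2))."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the root sums of squared deviations.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when one sequence is constant")
    return cov / (spread_x * spread_y)


# Toy data: a perfectly linear pair and a pair with no linear relationship.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [1, -1, -1, 1]))
```

A value near 0, as in the second call and in Table 5, only rules out a linear relationship; it does not by itself show that the two quantities are statistically independent.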
2) Influence with retweet and followers: Next, we performed experiments to measure the level of shareability and spreadability, and the interrelation of both. Before our experiments, we created a modified PageRank based on the follower-followee relationship for comparison with the proposed method [10], substituting followee for out-degree and follower for in-degree. The initial value of the modified PageRank is the number of followers, and the d-value is 0.85, as in previous research (see equation (3)) [10]. The number of smartphone domain authors is 312 (312 nodes, 12376 edges), and the number of Psy domain authors is 306 (306 nodes, 3004 edges). We calculated content PageRank by applying this equation recursively about three times.
$$PR(i) = \frac{1 - d}{\text{TotalNum of User}} + d \left( \sum_{j=1}^{\text{followerNum of } i} \frac{PR(j)}{\text{followerNum of } j} \right) \qquad (3)$$
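The comparison baseline can be sketched as a short PageRank-style iteration over the follower graph. This is an illustration under stated assumptions, not the authors' implementation: the function name and the input format (a dict mapping each user to the set of users they follow) are hypothetical, and each follower's score is divided by that follower's followee count (out-degree), as in standard PageRank and as the substitution described in the text suggests, whereas equation (3) as printed labels that denominator followerNum. The initial value (follower count), d = 0.85 and three iterations follow the description above.

```python
def modified_pagerank(follows, d=0.85, iterations=3):
    """PageRank-style scores on a follower graph.

    follows: dict mapping a user to the set of users they follow
             (hypothetical input format).
    Assumption: PR(j) is divided by the followee count of j.
    """
    users = set(follows)
    for followees in follows.values():
        users |= set(followees)
    n = len(users)

    # Invert the graph: followers[i] is the set of users who follow i.
    followers = {u: set() for u in users}
    for j, followees in follows.items():
        for i in followees:
            followers[i].add(j)

    # Initial value: number of followers, as described in the text.
    pr = {u: float(len(followers[u])) for u in users}

    for _ in range(iterations):
        # Every j in followers[i] follows at least i, so len(follows[j]) >= 1.
        pr = {
            i: (1 - d) / n
            + d * sum(pr[j] / len(follows[j]) for j in followers[i])
            for i in users
        }
    return pr


# Toy graph: a and b follow c, and c follows a.
scores = modified_pagerank({"a": {"c"}, "b": {"c"}, "c": {"a"}})
print(sorted(scores, key=scores.get, reverse=True))
```

Unlike equation (1), every pass here visits all edges of the graph, and several passes are needed before the scores are usable.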
For these experiments, we assumed that users want to find direct and indirect information [8]. In other words, users do not want to find private and unrelated information. Based on this assumption, we calculated the F-measure, with results as shown in figures 5 and 6 (in this experiment we do not consider time influence; therefore, γ is zero).
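The paper does not state how the ranked results were cut off when computing the F-measure. As one possible reading, the sketch below ranks posts by score, treats the top n as retrieved, and counts posts labeled direct or indirect as relevant. The function name, the top-n cutoff and the balanced F1 form are all assumptions made for illustration.

```python
RELEVANT_LABELS = {"direct", "indirect"}


def f_measure_at_n(scored_posts, n):
    """F1 of the top-n posts by score.

    scored_posts: list of (score, label) pairs, where label is one of
    'direct', 'indirect', 'private', 'unrelated'.
    Assumption: the top-n cutoff is illustrative, not taken from the paper.
    """
    ranked = sorted(scored_posts, key=lambda post: post[0], reverse=True)
    retrieved = ranked[:n]

    hits = sum(1 for _, label in retrieved if label in RELEVANT_LABELS)
    total_relevant = sum(1 for _, label in scored_posts if label in RELEVANT_LABELS)
    if hits == 0:
        return 0.0

    precision = hits / len(retrieved)
    recall = hits / total_relevant
    return 2 * precision * recall / (precision + recall)


# Toy example with four labeled posts.
posts = [(3.0, "direct"), (2.5, "private"), (2.0, "indirect"), (1.0, "unrelated")]
print(f_measure_at_n(posts, 2), f_measure_at_n(posts, 3))
```

Sweeping β from 0 to 1 (with α = 1 − β and γ = 0), rescoring the posts and recomputing this value at each step would produce a curve of the kind plotted in figures 5 and 6.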
| Type of relationship | Language | Contents |
| --- | --- | --- |
| Direct | Korean | 갤럭시 S3' 발화 사건, 단순 해프닝으로 종결 |
| Direct | English | Galaxy S3 burn-in incident ended as a mere happening |
| Indirect | Korean | [대한민국 올림픽 선수단 응원 이벤트] 응원 메세지 보내고, 금메달 개수를 맞추세요. 갤럭시 S3의 주인공이 되실 수 있습니다. |
| Indirect | English | [Korean Olympic team cheering event] Send a cheering message and guess the number of Olympic gold medals. You could become the owner of a Galaxy S3. |
| Private | Korean | RT))여수엑스포에서 갤럭시노트 화이트 잃어 버렷어요 엑스포에 분실 신고 했으니까 거기에 맡겨주세요 제발제발 부탁드립니다. |
| Private | English | RT)) I lost my white Galaxy Note at the Yeosu Expo. If you find my phone, please leave it at the lost-and-found center. |
| Unrelated | Korean | 갤력시 익스프레스의 미국투어 다큐 <반드시 크게 들을 것> |
| Unrelated | English | The America tour documentary of Galaxy Express <Listen loudly> |
Figure 5. Result of Smart-phone Domain
Figure 6. Result of The Psy Domain
In figures 5 and 6, the x-axis shows the value of the mediator term β, and the y-axis shows the value of the F-measure. Both results show that our proposed method is more effective than the modified PageRank. In the smartphone domain, the greatest F-measure value occurs when α (shareability) is zero, indicating that β (spreadability) may be more important than α. However, as the other values in the smartphone domain and in the Psy domain show (see figures 5 and 6), the impact of shareability on measuring the influence of content is greatest when α is between 0.5 and 0.6. Therefore, we find that retweet count and author follower number are both useful in measuring the influence of content.
As a note, the result for the Psy domain is worse than that for the smartphone domain because there is considerable garbage data in that dataset. The garbage data arises because there is a social network service, ‘Cyworld’, whose ‘Cy’ is pronounced the same as ‘Psy’ in Korean and is written with the same Korean characters, giving many false positives in the Psy domain. Nevertheless, despite the garbage data in the experiment set, the proposed method is still more effective than the comparison function.
Through this experiment, we have found that retweet count and follower number are both useful for measuring the influence of Twitter content, without the recursive calculation needed in a method such as PageRank. Instead, the proposed method needs only linear time if the search system already knows the required information about the content (retweet count and author follower number).
3) Influence with time: The influence of Twitter content should reflect not only the content itself but also how up-to-date the information is [5]. In this section, we examine the influence of time on Twitter content, specifically within a window of k hours (the meaning of ‘k’ is detailed in Ⅲ.B). For this experiment, we focused on a burn-in incident that occurred in the middle of July 2012. Burn-in is a screen failure caused by displaying the same image for an extended period, which leaves an afterimage on the screen. The Galaxy S3 had a burn-in problem, but its manual contained a note stating that the company was not responsible for burn-in failure. The company later corrected that sentence and announced a new policy of providing service for burn-in failure. We assume that a user wanted to find information about the burn-in problem and the Galaxy S3 on July 11 at 18:00, and that the user wanted up-to-date information from the 5 hours before that time. We extracted content containing the Korean keywords for ‘Galaxy’ and ‘image-remain’ and calculated each score using our proposed method, both with time (α=0.16, β=0.24, γ=0.6, k=5) and without time (α=0.4, β=0.6, γ=0, k=0). The result of this experiment is shown in table 6.
The results show that if a user has many followers and high retweet counts, then time influence is not important (see C1-C5, C7-C8). However, if the user has few followers and very low retweet counts, then the influence score is very small even if the content has up-to-date information about the topic (see C6). If content influence includes time, this result changes. C6 has the second smallest score in the result without time influence even though it has up-to-date information. When the measurement of content influence considers time, however, C6 gets a higher score than the older contents (C1-C4), even though they all have a higher retweet count and most have more followers. This experiment shows that our proposed method reflects time effectively, together with shareability and spreadability, within the k-hour window.
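As a worked check of equation (1) on the C6 row of Table 6 (12 retweets, 1622 followers, written at 16:35:05, query time 18:00), the arithmetic below assumes base-10 logarithms and an interval measured in hours with seconds dropped. Both are assumptions inferred from the reported scores, not stated in the text.

```latex
\begin{align*}
NT - WT_6 &= 18{:}00 - 16{:}35 = 1\,\text{h}\ 25\,\text{min} \approx 1.4167\ \text{h} \\
I(C_6)\ \text{with time} &= 0.16\log_{10}(12+1) + 0.24\log_{10}(1622+1) + 0.6\log_{10}\frac{5}{1.4167} \\
&\approx 0.1782 + 0.7705 + 0.3286 = 1.2773 \\
I(C_6)\ \text{without time} &= 0.4\log_{10}(12+1) + 0.6\log_{10}(1622+1) \\
&\approx 0.4456 + 1.9262 = 2.3718
\end{align*}
```

Both values agree with the scores listed for C6 in Table 6 (1.277328 and 2.371768). The positive currency term of about 0.33 is what lifts C6 above the older posts, whose intervals exceed k = 5 hours and therefore contribute a negative term.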
TABLE 6. RESULT OF EXPERIMENT WITH INFLUENCE OF TIME
| Content# | Content | Written time | Retweet | Follower | Score with time | Score without time |
| --- | --- | --- | --- | --- | --- | --- |
| C1 | No AS for burn-in | 7.10 12:36:38 | 27 | 872 | 0.475762 | 2.343471 |
| C2 | No AS for burn-in | 7.11 07:33:25 | 34 | 106320 | 1.261351 | 3.633598 |
| C3 | No AS for burn-in | 7.11 08:05:23 | 21 | 120437 | 1.255733 | 3.585427 |
| C4 | No AS for burn-in | 7.11 09:01:40 | 13 | 12733 | 1.015891 | 2.921430 |
| C5 | Offer AS for burn-in | 7.11 16:19:09 | 64 | 122936 | 1.795269 | 3.778974 |
| C6 | Offer AS for burn-in | 7.11 16:35:05 | 12 | 1622 | 1.277328 | 2.371768 |
| C7 | Offer AS for burn-in | 7.11 17:42:14 | 34 | 135214 | 2.211606 | 3.696242 |
| C8 | Offer AS for burn-in | 7.11 17:51:59 | 19 | 142715 | 2.358965 | 3.613095 |
V. CONCLUSION AND FUTURE WORKS
In this research, we aimed to improve Twitter content search, which presently returns results only in time-descending order. To solve this problem, we analyzed characteristics of Twitter content and proposed a method for measuring the influence of each post. We used crawled Korean Twitter content and user relation data from July 1 to July 31, 2012, extracting content on two subjects: smartphones and the Korean pop singer Psy. We refined this extracted content, classifying it by author’s identity (public, personal, bot, and unclassified) and type of information (direct, indirect, private, and unrelated). We then performed experiments using our proposed method, comparing it with PageRank to demonstrate that our approach is more effective and more computationally efficient in measuring the influence of Twitter content, running in linear time on search results given the associated information. We also performed experiments with our proposed method incorporating time influence, showing that it is effective when a post comes from a user with fewer followers and a lower retweet count than other users but carries more up-to-date information.
In the future, we would like to incorporate other information about the Twitter account, such as friendship of users, the feeling expressed in a post, and the closeness between users. Through this, we hope to create a more advanced search and mining method for not only Twitter, but other social network services as well.
ACKNOWLEDGMENT
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (2011-0025588), and the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning (2012M3C4A7033346). The corresponding author is Doo-Kwon Baik.
REFERENCES
[1] Twitter, #numbers [Online]. Available: https://blog.twitter.com/2011/numbers, 2011.
[2] Cha, M., Haddadi, H., Benevenuto, F., & Gummadi, P. K., “Measuring User Influence in Twitter: The Million Follower Fallacy”, ICWSM, pp10-17, 2010.
[3] Kwak, H., Lee, C., Park, H., & Moon, S, “What is Twitter, a social network or a news media?”, In Proceedings of the 19th international conference on World wide web, pp. 591-600, 2010.
[4] Weng, J., Lim, E. P., Jiang, J., & He, Q., “Twitterrank: finding topic-sensitive influential twitterers”, In Proceedings of the third ACM international conference on Web search and data mining, pp. 261-270, 2010.
[5] Teevan, Jaime, Daniel Ramage, Meredith Ringel Morris. "#TwitterSearch: a comparison of microblog search and web search.", Proceedings of the fourth ACM international conference on Web search and data mining. pp. 25-44, 2011.
[6] Horowitz, Damon, Sepandar D. Kamvar. "The anatomy of a large-scale social search engine.", Proceedings of the 19th international conference on World wide web. pp. 431-440, 2010.
[7] Carmel, D., Zwerdling, N., Guy, I., Ofek-Koifman, S., Har'El, N., Ronen, I., Chernov S., "Personalized social search based on the user's social network.", Proceedings of the 18th ACM conference on Information and knowledge management. pp. 1227 – 1236, 2009.
[8] Golovchinsky, Gene, and Miles Efron. "Making sense of twitter search.",2010.
[9] M. Oussalah, F. Bhat, K. Challis, T. Schnier, “A software architecture for Twitter collection, search and geolocation services”, Knowledge-Based Systems, Vol 37, pp. 105-120, 2012
[10] Brin, Sergey, Lawrence Page. "The anatomy of a large-scale hypertextual Web search engine." Computer networks and ISDN systems pp. 107-117, 1998.
[11] Carmel, D., Zwerdling, N., Guy, I., Ofek-Koifman, S., Har'El, N., Ronen, I., Chernov, S., “Personalized social search based on the user's social network”, 18th ACM conference on Information and knowledge management, pp. 1227-1236, 2009.
[12] A. Java, X. Song, T. Finin and B. Tseng, “Why We Twitter : Understanding Microblogging Usage and Communities”, In Proc of 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, 2007.
[13] Kristina Lerman and Rumi Ghosh, “Information Contagion: An Empirical Study of the Spread of News on Digg and Twitter Social Networks”, ICWSM, 2010.
[14] N. J. Belkin, “Some(what) grand challenges for information retrieval”, SIGIR Forum, 42(1):p47–54, 2008.
[15] M. J. Carman, M. Baillie, and F. Crestani, “Tag data and personalized information retrieval”, In Proc. of the CIKM workshop on Search in social media, pp. 27–34. ACM, 2008.
[16] D. Carmel, N Zwerdling, I. Guy, S. Ofek-Koifman, N. Har'el, I. Ronen, E. Uziel, S. Yogev and S. Chernov,“Personalized Social Search based on the User’s Social Network”, In Proc of CIKM '09, 1227-1236