
Method for Measuring Twitter Content Influence

Euijong Lee

Dept. of Computer and Radio

Communications Engineering

Korea University

Seoul, Republic of Korea

Email: [email protected]

Jeong-Dong Kim

Dept. of Computer and Radio

Communications Engineering

Korea University



Doo-Kwon Baik

Graduate School of Convergence IT

Korea University



Abstract—Twitter is a microblogging website with specific characteristics not found in other social network services. The platform contains a good deal of valuable content, which users can access through Twitter search. However, Twitter search returns only content that includes the query keywords, in time-descending order. We therefore propose a linear-time method of measuring the influence of Twitter content that considers not only time but also characteristics of each Twitter account. In analyzing these characteristics, we found that the number of retweets can measure shareability, while the number of followers of the content author can measure spreadability. We perform experiments on real Twitter data to evaluate the proposed method and demonstrate that it is effective at finding up-to-date content. Further, in a comparison with analysis via PageRank, we demonstrate that our method measures influence more accurately.

Keywords-Twitter; Content Search; Retweet; Follower; Content Influence

I. INTRODUCTION

Twitter is a microblogging website on which content is created only as posts of 140 characters or fewer (called tweets). The service has a well-defined markup vocabulary. Unlike other social networking services, Twitter supports one-sided relationships between users. If one user wants to view another user’s content in real time, that user can add the other user to his or her social network list without the other user’s approval. This behavior is referred to as ‘following’ the other user, and the user who follows is called the ‘follower’. The main mechanism for sharing content on Twitter is ‘retweeting’. A retweet keeps the information of the original content and can also include the opinion of the user who retweeted it. These simple mechanisms restrict the types of content that can be created, but they have driven the growth of Twitter: they make it easy to share information and to write content succinctly.

In 2012, about 140 million tweets a day were created on Twitter [1]. At this volume, users face problems in finding useful information on the social network [5, 8]. Twitter does provide a content search service, through which users can search for content by entering relevant keywords and receive the matching content in time-descending order (see figure 1). Twitter also provides an advanced search offering four categories of options: ‘words’, ‘people’, ‘places’, and ‘other’. The ‘words’ category provides options related to specific keywords, the ‘people’ category provides options related to specific accounts, the ‘places’ category provides options related to specific locations, and the ‘other’ category provides options limiting results to those including positive or negative emoticons or to questions, as well as an option to include otherwise-excluded retweets in the results (see figure 2). Still, advanced search also returns content only in time-descending order, with no attention to relevance. Up-to-date information is important in Twitter; however, Twitter search needs to consider not only time but other characteristics as well [3, 5, 8]. We focus on the problem of improving Twitter search, and in this research, we propose a method of measuring the influence of individual Twitter content in content search using characteristics of individual Twitter accounts.

II. RELATED WORKS

The most representative studies on Twitter analyze characteristics like follower count, followee count, and retweet count [2-5]. There exists considerable research on Twitter search based on these characteristics.

[2] attempted to measure the user influence of Twitter accounts, finding three contributory factors: number of followers, number of retweets, and number of mentions. A large number of followers implies a large audience for the user; more specifically, it means that a large audience receives the user’s content directly. The number of followers, then, measures a user’s ability to spread information. The number of retweets demonstrates the user’s ability to create content with pass-along value; a retweet indicates that the content was considered worth sharing. The number of mentions demonstrates the user’s ability to engage others in a conversation; it shows advertisement value.

[3] researched the possibility of Twitter serving as a news medium. They found that the number of retweets is the chief measure of user influence, and the number of followers is the chief measure of user popularity. Further, they found that trends on Twitter match news media trends fairly closely (about 85%); as a result, they claimed that Twitter serves not only as a social networking service but also as a news medium.


[5] researched differences between microblog search and web search by analyzing search log information. They found that users of Twitter search want to find ‘timely information’, ‘social information’ and ‘topical information’. Twitter search queries are shorter than web search queries, but their words are longer, and Twitter search users use site-specific syntax (for example, ‘@’ or ‘#’). [8] suggested that Twitter search should reflect the characteristics of its users. In a survey, they found that users of Twitter search most often want to find events, trending topics, or specific people.

In this research, we use characteristics of Twitter extracted in previous research to produce a method for measuring the influence of a single post that can be used via Twitter search and mining.

III. CHARACTERISTICS OF TWITTER AND A METHOD OF MEASURING CONTENT INFLUENCE

We want to solve the problem that Twitter content search reflects only the time of posting. We chose characteristics of Twitter and created a method for measuring the influence of content using these characteristics. In Ⅲ.A, we describe the specific characteristics of Twitter that are used in our proposed method. In Ⅲ.B, we explain our proposed method.

A. Characteristics of Twitter

Content influence is a value that measures to what degree a piece of content contains meaningful information for users. We choose three factors for measuring influence. The first factor is the spreadability of content. In Twitter, content is delivered to the author’s followers; thus, the number of followers shows how many users received that content. Previous research showed that the number of followers measures the popularity of the author [3, 5]. We take popularity to indicate a similar ability to spread information, and therefore, we use the number of the author’s followers as spreadability. The second factor is the value of the content’s information. This value can be measured by the number of times the content is shared, because valuable content will be shared with others. In Twitter, the retweet mechanism is the main method for sharing content, and therefore, we use retweet count to measure this factor. The last factor is the currency of the information. Newer content contains more valuable information than older content. Twitter is sensitive to up-to-date information [3, 8], and Twitter search shows results in time-descending order. This factor must be used with care in our proposed method, because, as we pointed out, ordering by time alone is the problem with Twitter search. However, it is a very important factor in news media, so we retain it and reflect it using the time information of the content.

B. Method for Measuring Content Influence

We use these three characteristics of Twitter in our proposed method: follower number as spreadability of content, retweet count as the value of content information, and time written as currency of information. The proposed method using these factors is as follows.

$$ I(C_i) = \alpha \log(RT_i + 1) + \beta \log(F_i + 1) + \gamma \log\frac{k}{NT - WT_i} \tag{1} $$

※ α + β + γ = 1

Here, $C_i$ is the ith piece of content, and $I(C_i)$ is the influence of content $C_i$. In this equation, $\alpha \log(RT_i + 1)$ represents the shareability of $C_i$, where $RT_i$ is the retweet count of $C_i$. We take the logarithm to normalize the value, and we add 1 so that the output is always defined: without it, a retweet count of 0 would give negative infinity. $F_i$ is the number of followers of the author of $C_i$, and therefore $\beta \log(F_i + 1)$ represents the spreadability of $C_i$. It takes the logarithm for the same reason as the previous factor. $\gamma \log\frac{k}{NT - WT_i}$ represents the currency influence, where $NT$ means ‘now time’ and $WT_i$ means ‘written time of $C_i$’. We calculate the time interval as a real number of hours, in which the difference in hours is the integer part and the difference in minutes is the fractional part. It also takes the logarithm for the same reason as the previous two factors. We take the reciprocal of the interval to ensure that up-to-date content makes a larger contribution than out-of-date content. Taking a reciprocal, however, has the disadvantage that the currency factor becomes much smaller than the other factors. To correct this, we multiply by a constant $k$. In the equation $y = \log\frac{k}{x}$, y is 0 when x equals k, and y takes a negative value when x is larger than k. Therefore, if a user wants to find information posted in the last k hours, they need only set the value of k in the equation. Finally, α, β and γ are mediators (weights) that adjust the strength of each factor in the equation. We constrain these terms to sum to 1 in order to prevent one factor from becoming far more influential than the others.

Figure 1. Results of Twitter search

Figure 2. Twitter Advanced Search
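As an illustration, equation (1) can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function name `influence` and its parameter names are hypothetical, the base-10 logarithm is an assumption (the text does not state the base), and the elapsed time is passed in as hours with minutes as the fractional part.

```python
import math


def influence(retweets, followers, elapsed_hours, alpha, beta, gamma, k):
    """Score one post with equation (1).

    Assumptions (not stated in the text): base-10 logarithms, and
    elapsed_hours = NT - WT_i expressed in hours (must be > 0 when gamma > 0).
    """
    shareability = alpha * math.log10(retweets + 1)    # retweet term
    spreadability = beta * math.log10(followers + 1)   # follower term
    # The currency term is skipped when gamma is 0, so k may then be 0.
    currency = gamma * math.log10(k / elapsed_hours) if gamma else 0.0
    return shareability + spreadability + currency


# Post C8 from Table 6: written at 17:51, scored at 18:00 (9 minutes later).
with_time = influence(19, 142715, 9 / 60, alpha=0.16, beta=0.24, gamma=0.6, k=5)
without_time = influence(19, 142715, 9 / 60, alpha=0.4, beta=0.6, gamma=0, k=0)
print(round(with_time, 4), round(without_time, 4))
```

Under these assumptions the two values agree closely with the scores reported for C8 in Table 6 (2.358965 with time and 3.613095 without), which suggests that base 10 and minute-level time resolution are reasonable readings of the method.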

IV. EXPERIMENTS AND EVALUATION

We crawled Korean Twitter contents and set up

experiments using that data. We describe the crawled dataset

and experiments in Ⅳ .A, and then show the result of an

experiment for content influence using our method in Ⅳ.B.

A. Dataset for Experiments

We used crawled Korean Twitter content and user relations (followee information) of Korean users from July 1, 2012 to July 31, 2012. The content dataset is about 93.1 GB, and the user information dataset is 12.3 GB. We saved content data in JSON format and user data as plain text. Content data includes various pieces of information about a single piece of content: author, retweet count, time written, etc. (see figure 3). User data consists of user-followee relationships (see figure 4).

For our experiments, we extracted related content in two different domains: smartphones (Galaxy series and iPhone) and Psy (a Korean pop singer). We chose the first subject because the Galaxy series and the iPhone were the most popular smartphone models at the time, and the Galaxy S3 was released in July 2012. We chose the second subject because Psy released a new album, “Gangnam Style”, which became a popular topic worldwide. We collected posts about smartphones (especially the Samsung Galaxy series and iPhone) and Psy using related Korean and English words: 12 for smartphones, and 5 for Psy (see table 1).

TABLE 1. RELATED WORDS IN THE TEST DATA SET

| Topic | Related Korean and English words (not case sensitive) |
|---|---|
| Galaxy Series | 갤럭시, 겔럭시, 겔스, 갤스, 갤노트, 겔노트, 겔놋, 갤놋, galaxy |
| iPhone | 아이폰, iphone, I-phone |
| Psy | 싸이, 강남 스타일, 강남스타일, Psy, Gangnam Style |
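The keyword-based collection described above can be sketched as a simple filter. This is an illustrative sketch only: the text does not say how matching was implemented, so plain case-insensitive substring matching, the `KEYWORDS` dictionary layout, and the `matching_topics` helper are all assumptions; the keyword strings themselves are taken from Table 1.

```python
# Keywords copied from Table 1; the dictionary layout is hypothetical.
KEYWORDS = {
    "Galaxy Series": ["갤럭시", "겔럭시", "겔스", "갤스", "갤노트", "겔노트",
                      "겔놋", "갤놋", "galaxy"],
    "iPhone": ["아이폰", "iphone", "I-phone"],
    "Psy": ["싸이", "강남 스타일", "강남스타일", "Psy", "Gangnam Style"],
}


def matching_topics(text):
    """Return the topics whose keywords occur in text, ignoring case."""
    lowered = text.lower()
    return [
        topic
        for topic, words in KEYWORDS.items()
        if any(word.lower() in lowered for word in words)
    ]


print(matching_topics("GALAXY S3 vs iPhone"))
print(matching_topics("싸이월드"))  # the Korean spelling of Cyworld
```

A plain substring match like this one also matches ‘싸이월드’ (Cyworld), which is one way the false positives reported for the Psy domain could arise.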

We collected 207,022 smartphone-related posts and 118,521 Psy-related posts. For effective experimentation, we kept only content that had been retweeted 10 or more times and deleted content for which we could not obtain a user profile. In the end, we selected 558 posts about smartphones and 421 about Psy.

After this reduction, we classified posts by the nature of the author’s identity into one of four categories: public user, personal user, bot, and unclassified. A public user is a spokesperson of a specific community or group, for example ‘samsung’, ‘SBS8news’ and ‘YTN24’: ‘samsung’ is the ID of the SAMSUNG Corporation, and ‘SBS8news’ and ‘YTN24’ are broadcasting companies in the Republic of Korea. A personal user is the account of an individual. A bot is a program that produces automated content, generally called a Twitter bot. Finally, unclassified refers to any ID whose category could not be determined. The contents as classified by user identity are presented in table 2.

TABLE 2. CLASSIFIED DATA SET BY USER’S IDENTITY

| Topic | Public | Personal | Bot | Unclassified |
|---|---|---|---|---|
| Smart phone | 205 | 298 | 18 | 37 |
| Psy | 115 | 172 | 36 | 98 |

We also classified posts according to their relation to the keywords, sorting them into one of four categories: direct, indirect, private, and unrelated. A direct post is directly related to the topic. An indirect post does not contain direct information about the topic, but does have relevant information such as a related event, a music video, and so on. A private post contains personal information. Finally, an unrelated post contains information unrelated to the topic. The classification by relation to keywords is presented in table 3, with examples of each content type in table 4.

TABLE 3. CLASSIFIED DATA SET BY INFORMATION OF CONTENTS

| Topic | Direct | Indirect | Private | Unrelated |
|---|---|---|---|---|
| Smart phone | 166 | 157 | 130 | 105 |
| Psy | 129 | 67 | 36 | 189 |

Figure 3. Example of contents files

Figure 4. Example of relationship files

TABLE 4. EXAMPLE OF CONTENTS DIVIDED BY INFORMATION

B. Experiments

Before our experiments on the applicability of our proposed method, we first had to demonstrate that retweet in-degree and follower in-degree differ, because if the two are linearly correlated, we do not need to use both factors. We thus first performed an experiment on the independence of retweet and follower in-degree, and then performed our experiment on the effectiveness of the proposed method.

1) Independence of retweet count and follower number: First, we used the Pearson correlation coefficient as a measure of correlation between the two data sets (see equation (2)).

$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{2} $$

The Pearson correlation coefficient has a value between −1 and 1. If the absolute value of r is close to 1, there is a strong linear relationship, but if the absolute value of r is close to 0, there is no linear relationship between the data sets. After performing the correlation test, we found that the coefficient between follower and retweet in-degree was close to 0 in both domains (about −0.035 for smartphones and 0.175 for Psy), and therefore, follower and retweet in-degree had no strong linear relationship (see table 5).

TABLE 5. RESULT OF PEARSON CORRELATION COEFFICIENT

| Topic | r |
|---|---|
| Smart-phone | -0.0348392 |
| Psy | 0.17547095 |
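For reference, the Pearson coefficient of equation (2) can be computed directly from two lists of counts. The sketch below is only an illustration: the function name `pearson_r` is hypothetical, and the sample numbers are made up rather than drawn from the crawled data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up numbers for illustration only; they are not from the crawled data.
retweet_counts = [12, 34, 21, 13, 64]
follower_counts = [1600, 900, 120000, 50, 3000]
print(pearson_r(retweet_counts, follower_counts))
```

A value near 0 for such a pair of lists would support keeping both retweet count and follower number as separate factors.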

2) Influence with retweet and followers: Next, we performed experiments to measure the level of shareability and spreadability, and the interrelation of the two. Before our experiments, we created a modified PageRank based on the follower-followee relationship for comparison with the proposed method [10], substituting followee for out-degree and follower for in-degree. The initial value of the modified PageRank is the number of followers, and the d-value is 0.85, as in previous research (see equation 3) [10]. The smartphone domain has 312 authors (312 nodes, 12376 edges), and the Psy domain has 306 authors (306 nodes, 3004 edges). We calculated the PageRank of the content by applying this equation recursively about three times.

$$ PR(i) = \frac{1 - d}{\text{TotalNum of User}} + d \left( \sum_{j=1}^{\text{followerNum of } i} \frac{PR(j)}{\text{followerNum of } j} \right) \tag{3} $$
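The comparison baseline can be sketched as a short fixed-iteration PageRank. This is a hedged sketch rather than the authors' code: the function name `modified_pagerank` and the input layout are hypothetical. Note one explicit assumption: equation (3) as printed divides by the follower count of j, whereas this sketch divides by the number of accounts that j follows, which is the out-degree under the followee/follower substitution described in the text.

```python
def modified_pagerank(followers, d=0.85, iterations=3):
    """Iterate a follower-based PageRank a fixed number of times.

    followers maps each user to the list of users who follow them
    (the in-edges). Every user must appear as a key.
    """
    total_users = len(followers)
    # Out-degree of a user = how many of the listed users they follow.
    followee_count = {user: 0 for user in followers}
    for fans in followers.values():
        for fan in fans:
            followee_count[fan] += 1
    # Initial value: the number of followers, as described in the text.
    rank = {user: float(len(fans)) for user, fans in followers.items()}
    for _ in range(iterations):
        # The comprehension reads the previous ranks, so updates are synchronous.
        rank = {
            user: (1 - d) / total_users
            + d * sum(rank[fan] / followee_count[fan] for fan in fans)
            for user, fans in followers.items()
        }
    return rank


# Hypothetical three-user graph: b and c follow a, and c also follows b.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": []}
print(modified_pagerank(toy_graph))
```

Each iteration revisits every edge of the follower graph, which is the repeated cost that the proposed per-post score avoids.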

For these experiments, we assumed that users want to find direct and indirect information [8]. In other words, users do not want to find private and unrelated information. Based on this assumption, we calculated the F-measure, with the results shown in figures 5 and 6 (in this experiment we do not consider time influence; therefore, γ is zero).
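The F-measure combines precision and recall into a single number. The sketch below assumes the balanced form (the harmonic mean, often written F1); the text does not say which variant was used or how the set of returned posts was cut off, so the `f_measure` helper, its arguments, and the post IDs are all hypothetical.

```python
def f_measure(retrieved, relevant):
    """Balanced F-measure (F1) of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical post IDs: four posts returned, three labelled direct or indirect.
returned_posts = ["p1", "p2", "p3", "p4"]
wanted_posts = ["p2", "p3", "p5"]
print(f_measure(returned_posts, wanted_posts))
```

Here the relevant set stands for the posts labelled direct or indirect, following the assumption stated in the text.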

| Type of relationship | Korean | English |
|---|---|---|
| Direct | 갤럭시 S3' 발화 사건, 단순 해프닝으로 종결 | Galaxy S3 burning incident ends as a mere happening |
| Indirect | [대한민국 올림픽 선수단 응원 이벤트] 응원 메세지 보내고, 금메달 개수를 맞추세요. 갤럭시 S3의 주인공이 되실 수 있습니다. | [Event to cheer on the Korean Olympic team] Send a cheering message and guess the number of Olympic gold medals. You could become the owner of a Galaxy S3. |
| Private | RT))여수엑스포에서 갤럭시노트 화이트 잃어 버렷어요 엑스포에 분실 신고 했으니까 거기에 맡겨주세요 제발제발 부탁드립니다. | RT))I lost my white Galaxy Note at the Yeosu Expo. If you find my phone, please leave it at the lost-and-found center. |
| Unrelated | 갤력시 익스프레스의 미국투어 다큐 <반드시 크게 들을 것> | The America tour documentary of Galaxy Express <Listen loudly> |

Figure 5. Result of Smart-phone Domain

Figure 6. Result of The Psy Domain


In figures 5 and 6, the x-axis shows the value of the mediator term β, and the y-axis shows the value of the F-measure. Both results show that our proposed method is more effective than the modified PageRank. In the smartphone domain, the greatest F-measure value occurs when α (shareability) is zero, indicating that β (spreadability) may be more important than α. However, as the other values in the smartphone domain and in the Psy domain show (see figures 5 and 6), the impact of shareability on measuring the influence of content is greatest when α is between 0.5 and 0.6. Therefore, we find that retweet count and author follower number are both useful in measuring the influence of content.

As a note, the result for the Psy domain is worse than that for the smartphone domain because there is considerable garbage data in that dataset. The garbage data arises because there is a social network service, ‘Cyworld’, whose ‘Cy’ is pronounced the same as ‘Psy’ in Korean and is written with the same Korean characters, giving many false positives in the Psy domain. Nevertheless, despite the garbage data in the experiment set, the proposed method is still more effective than the comparison function.

Through this experiment, we have found that retweet count and follower number are both useful for measuring the influence of Twitter content, and that they can be used without the recursive calculation needed in a method such as PageRank. Instead, the proposed method needs only linear time if the search system already holds the required information about each piece of content.

3) Influence with time: The influence of Twitter content should reflect not only the influence of the content itself but also how up-to-date the information is [5]. In this section, we examine the influence of time on Twitter content, and more specifically within a specific window of k hours (the meaning of ‘k’ is detailed in Ⅲ.B). For this experiment, we focused on a burn-in incident that occurred in the middle of July 2012. Burn-in is a screen failure caused by displaying the same image for an extended period, which leaves an afterimage on the screen. The Galaxy S3 had a burn-in problem, but its manual contained a note stating that the company was not responsible for burn-in failure. However, the company corrected that sentence and announced a new policy under which it would provide service for burn-in failure. We assume that a user wanted to find information about the burn-in problem and the Galaxy S3 on July 11 at 18:00 and that the user wanted up-to-date information from the 5 hours before that time. We extracted content containing the Korean keywords for ‘Galaxy’ and ‘image-remain’, and calculated each score using our proposed method both with time (α=0.16, β=0.24, γ=0.6, k=5) and without time (α=0.4, β=0.6, γ=0, k=0). The result of this experiment is shown in table 6.

The results show that if a user has many followers and high retweet counts, then time influence is not important (see C1-C5, C7-C8). However, if a user has few followers and very few retweets, then the influence score is very small even if the content is up-to-date information about the topic (see C6). Once content influence includes time, this result changes. Without time influence, C6 has the second smallest score even though it has up-to-date information. When the measurement of content influence considers time, however, C6 gets a higher score than the older contents (C1-C4) even though they all have higher retweet counts and most have more followers. This experiment shows that our proposed method effectively reflects time together with shareability and spreadability within a k-hour window.
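As a worked check of the C6 row, the score with time can be recomputed from equation (1) using the values listed in Table 6. This assumes base-10 logarithms and an elapsed time of 1 hour 25 minutes (18:00 minus 16:35, with seconds dropped); neither detail is stated explicitly in the text, although both reproduce the tabulated score.

```latex
\begin{aligned}
I(C_6) &= 0.16\,\log_{10}(12 + 1) + 0.24\,\log_{10}(1622 + 1) + 0.6\,\log_{10}\frac{5}{1 + 25/60} \\
       &\approx 0.16(1.1139) + 0.24(3.2103) + 0.6(0.5477) \\
       &\approx 0.1782 + 0.7705 + 0.3286 \approx 1.2773
\end{aligned}
```

The time term contributes about 0.33 for C6, whereas for a post written more than 5 hours before the query time the same term is negative; this is what lifts C6 above C1-C4 once time is included.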

TABLE 6. RESULT OF EXPERIMENT WITH INFLUENCE OF TIME

| Content# | Content | Written time | Retweet | Follower | Score with time | Score without time |
|---|---|---|---|---|---|---|
| C1 | No AS for burn-in | 7.10 12:36:38 | 27 | 872 | 0.475762 | 2.343471 |
| C2 | No AS for burn-in | 7.11 07:33:25 | 34 | 106320 | 1.261351 | 3.633598 |
| C3 | No AS for burn-in | 7.11 08:05:23 | 21 | 120437 | 1.255733 | 3.585427 |
| C4 | No AS for burn-in | 7.11 09:01:40 | 13 | 12733 | 1.015891 | 2.921430 |
| C5 | Offer AS for burn-in | 7.11 16:19:09 | 64 | 122936 | 1.795269 | 3.778974 |
| C6 | Offer AS for burn-in | 7.11 16:35:05 | 12 | 1622 | 1.277328 | 2.371768 |
| C7 | Offer AS for burn-in | 7.11 17:42:14 | 34 | 135214 | 2.211606 | 3.696242 |
| C8 | Offer AS for burn-in | 7.11 17:51:59 | 19 | 142715 | 2.358965 | 3.613095 |

V. CONCLUSION AND FUTURE WORK

In this research, we aimed to improve Twitter content search, which presently returns results only in time-descending order. To solve this problem, we analyzed characteristics of Twitter content and proposed a method for measuring the influence of each post. We used crawled Korean Twitter content and user relation data from July 1, 2012 to July 31, 2012, extracting content on two subjects: smartphones and the Korean pop singer Psy. We refined this extracted content, classifying it by author’s identity (public, personal, bot, and unclassified) and type of information (direct, indirect, private, and unrelated). We then performed experiments using our proposed method, comparing it with PageRank to demonstrate that our approach is more effective and more computationally efficient in measuring the influence of Twitter content, running in linear time on search results given the associated information. We also performed experiments with our proposed method incorporating time influence, showing that it is effective when a post comes from a user with fewer followers and a lower retweet count than other users but contains more up-to-date information.

In the future, we would like to incorporate other information about the Twitter account, such as friendship between users, the sentiment expressed in a post, and the closeness between users. Through this, we hope to create a more advanced search and mining method for not only Twitter but other social network services as well.

ACKNOWLEDGMENT

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (2011-0025588), and by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning (2012M3C4A7033346). The corresponding author is Doo-Kwon Baik.

REFERENCES

[1] Twitter, #numbers [Online]. Available: https://blog.twitter.com/2011/numbers, 2011.

[2] Cha, M., Haddadi, H., Benevenuto, F., & Gummadi, P. K., “Measuring User Influence in Twitter: The Million Follower Fallacy”, ICWSM, pp. 10-17, 2010.

[3] Kwak, H., Lee, C., Park, H., & Moon, S., “What is Twitter, a social network or a news media?”, In Proceedings of the 19th international conference on World wide web, pp. 591-600, 2010.

[4] Weng, J., Lim, E. P., Jiang, J., & He, Q., “Twitterrank: finding topic-sensitive influential twitterers”, In Proceedings of the third ACM international conference on Web search and data mining, pp. 261-270, 2010.

[5] Teevan, Jaime, Daniel Ramage, Meredith Ringel Morris. "#TwitterSearch: a comparison of microblog search and web search.", Proceedings of the fourth ACM international conference on Web search and data mining. pp. 25-44, 2011.

[6] Horowitz, Damon, Sepandar D. Kamvar. "The anatomy of a large-scale social search engine.", Proceedings of the 19th international conference on World wide web. pp. 431-440, 2010.

[7] Carmel, D., Zwerdling, N., Guy, I., Ofek-Koifman, S., Har'El, N., Ronen, I., Chernov S., "Personalized social search based on the user's social network.", Proceedings of the 18th ACM conference on Information and knowledge management. pp. 1227 – 1236, 2009.

[8] Golovchinsky, Gene, and Miles Efron. "Making sense of twitter search.",2010.

[9] M. Oussalah, F. Bhat, K. Challis, T. Schnier, “A software architecture for Twitter collection, search and geolocation services”, Knowledge-Based Systems, Vol 37, pp. 105-120, 2012

[10] Brin, Sergey, Lawrence Page. "The anatomy of a large-scale hypertextual Web search engine." Computer networks and ISDN systems pp. 107-117, 1998.

[11] Carmel, D., Zwerdling, N., Guy, I., Ofek-Koifman, S., Har'El, N., Ronen, I., Chernov, S., “Personalized social search based on the user's social network”, 18th ACM conference on Information and knowledge management, pp. 1227-1236, 2009.

[12] A. Java, X. Song, T. Finin and B. Tseng, “Why We Twitter : Understanding Microblogging Usage and Communities”, In Proc of 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, 2007.

[13] Kristina Lerman and Rumi Ghosh, “Information Contagion: An Empirical Study of the Spread of News on Digg and Twitter Social Networks”, ICWSM, 2010.

[14] N. J. Belkin, “Some(what) grand challenges for information retrieval”, SIGIR Forum, 42(1), pp. 47-54, 2008.

[15] M. J. Carman, M. Baillie, and F. Crestani, “Tag data and personalized information retrieval”, In Proc. of the CIKM workshop on Search in social media, pp. 27-34, ACM, 2008.

[16] D. Carmel, N. Zwerdling, I. Guy, S. Ofek-Koifman, N. Har'el, I. Ronen, E. Uziel, S. Yogev and S. Chernov, “Personalized Social Search based on the User’s Social Network”, In Proc. of CIKM '09, pp. 1227-1236.
