[ieee 2013 ieee recent advances in intelligent computational systems (raics) - trivandrum, india...



Page 1: [IEEE 2013 IEEE Recent Advances in Intelligent Computational Systems (RAICS) - Trivandrum, India (2013.12.19-2013.12.21)] 2013 IEEE Recent Advances in Intelligent Computational Systems

Detecting Influential Users using Spread of Communications

Saptaditya Maiti
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Email: [email protected]

Deba P. Mandal
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Email: [email protected]

Pabitra Mitra
Dept. of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India
Email: [email protected]

Abstract: This article discusses the detection of the most influential users in an online social network. We observe that a communication by an influential user is likely to reach many more users than one made by a user with less influence in the network. Based on this observation, we have formulated a method using the spread of communications (i.e., the number of users a communication reaches). We have verified the method on three datasets downloaded from ‘Twitter’, and its results are found to be the best among existing methods on these datasets.

Index Terms: online social network, influential users detection, spread of communications, Twitter.

I. INTRODUCTION

A social network is an online service, platform or site that focuses on building and reflecting social relations in a community formed by people with common interests and/or activities. It allows users to share ideas, activities, events and interests within a community. The main types of social networking services are those that contain category places (such as former school year or classmates), means to connect with friends (usually with self-description pages), and a recommendation system linked to trust.

At present there exist many online social networks, and among them ‘Facebook’ [1], ‘Twitter’ [2] and ‘MySpace’ [3] are the most popular. The number of users of these networks has been increasing at an unprecedented rate during the last few years [4]. Each network has a certain graph structure based on the services it intends to provide to its users. On analysis we find that basic services such as communicating with other users and forming communities are available on almost all of the sites, under different names and with the same or slightly different structures. For example, ‘like’ and ‘share’ in ‘Facebook’ correspond to ‘favorite’ and ‘retweet’ in ‘Twitter’. For users to be connected, however, ‘Facebook’ requires mutual ‘friendship’ (both users have to accept the friendship), whereas in ‘Twitter’ one follows a user and that user does not necessarily need to follow back.

Social network mining is a process of gathering meaningful and useful information from online social network data. The most common use of social network mining is gauging customer opinion to support marketing and customer service activities. Another important use of social network mining is to find the influential users through whom an organization can propagate its agenda to a much larger audience.

The present article is concerned with the identification of the most influential nodes/users in an online social network. As social networking sites differ in structure and goal, we have considered only ‘Twitter’ in describing the proposed method. ‘Twitter’ is a social networking and micro-blogging site. It allows users to communicate and stay connected through the exchange of short messages (a maximum of 140 characters), called tweets. Twitter creates several interesting social network structures. The most obvious network is the one created by the ‘follows’ and ‘followed by’ relationships, which generate a directional tree structure of ties, where the directionality of a tie is important (i.e., who is following whom). Unlike most other online social networking sites (such as Facebook), following on Twitter is not a mutual relationship: any user can follow you, and you do not have to follow back. Twitter users follow someone mostly because they are interested in the topics that user publishes in tweets, and they follow back if they share similar topics of interest. When a user posts (tweets) a message, it reaches all of that user's followers; if any of them considers it important or interesting, they re-post (retweet) it to their own followers, and so on. A large number of users can thereby potentially be reached by a particular message.

Efforts have been made to identify influential users considering both their importance as hubs within their community and the quality and topical relevance of their communications. Some of these efforts are: (Balkundi & Kilduff 2005 [5]; Bar-Ilan & Peritz 2009 [6]; Bongwon Suh et al. 2010 [8]; Boyd et al. 2010 [9]; Cha et al. 2010 [10]; Gayo-Avello 2010a [11]; Gayo-Avello 2010b [12]; Gruhl et al. 2004 [13]; Nagarajan et al. 2010 [14]; Nagle & Singh 2009 [15]; Pal & Counts 2011 [16]; Romero et al. 2011 [17]; Sakaki & Matsuo 2010 [18]; Sousa et al. 2010 [19]; Welch et al. 2011 [20]; Yamaguchi et al. 2010 [21]; Ye & Wu 2010 [22]; Kwak et al. 2010 [23]). Most of these studies are based on follower count, tweet count and mention count, co-follower rate (ratio between followers and followings), frequency of tweets/updates, whom one's followers follow, and topical authority. Centrality measures such as indegree/outdegree, eigenvector, betweenness, closeness and PageRank (Page et al. 1999 [24]) have also been used to evaluate node importance. Nevertheless, evaluating node importance with a single metric can be considered incomplete and limited, as it cannot capture the specific differences among nodes.

2013 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 978-1-4799-2178-2/13/$31.00 ©2013 IEEE

Any communication of a user reaches all of that user's followers, and some communications are retweeted. When a communication comes from an influential user, it is expected to be retweeted by many of the followers, so it also reaches the followers of the followers who retweeted it, and so on. That is, the spread of a user's communications can also signify the influence of the user in the network. We have proposed a method for determining the influence of the users in a network based on the spread of their communications.

We have implemented the proposed method on three datasets on the topics #euro12, #wimbledon and #london2012, downloaded from Twitter using the ‘Twitter search API’ [25]. We have verified the results using Mean Average Precision (MAP) [26], Kendall rank correlation [27] and Spearman rank correlation [28]. It is shown in [22] that retweet count is the most effective measure for finding the most influential users. Therefore, the results are compared with those obtained by retweet count. The spread of communications performs better than retweet count on all three datasets, and on that basis it is found to be the best among existing methods in detecting influential users in a social network.

The rest of the paper is organized as follows. In Section II the proposed method is described. The experimental results of the method are presented in Section III. Section IV concludes the paper.

II. PROPOSED METHOD

We have proposed a method to determine the influential users in an online social network based on the spread of communications made by the users. In this section, we first explain the concept of spread of communications and then propose our method.

Spread of Communication

The spread of a communication is the count of the users who received the communication. When a communication (tweet) is made by a user, all of the user's followers receive it; some of them retweet it to their own followers, and the rest take no further action after receiving it. This process continues through the next levels until a level is reached at which no follower retweets it. We explain this with the example shown in Fig. 1. In the figure, solid circles represent the users who retweeted a post and solid lines represent the flow of the post. Thus the post spreads among many users (beyond the author's own followers) by virtue of retweeting.

One can observe that, in a social network, the spread of a communication made by an influential user is usually higher than that of a communication made by a less influential user. That is, the influence of a user increases with the spread of the user's communications. This motivated us to use spread in determining the influence of a user.

Suppose a communication $c$ is made by user $S(c)$, $F(S(c))$ is the set of followers of $S(c)$, and $R(c)$ is the set of retweets of $c$ after its receipt. The spread of $c$, denoted $SP(c)$, can then be written as

$$SP(c) = |F(S(c))| + \sum_{c' \in R(c)} SP(c') \tag{1}$$

where $|F(S(c))|$ is the total number of followers of $S(c)$. Note that the spread of a communication includes the follower counts of the user who tweeted it as well as of the users who retweeted it.
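As a worked illustration of Eqn. (1), consider a small hypothetical cascade (the users and follower counts below are invented for the example and do not come from the datasets). User A with 100 followers posts $c$; followers B (50 followers) and C (20 followers) retweet it as $r_B$ and $r_C$; and D (10 followers), a follower of C, retweets $r_C$ as $r_D$. Evaluating from the deepest retweet upward:

$$
\begin{aligned}
SP(r_D) &= 10 \\
SP(r_C) &= 20 + SP(r_D) = 30 \\
SP(r_B) &= 50 \\
SP(c)   &= 100 + SP(r_B) + SP(r_C) = 180
\end{aligned}
$$

Because Eqn. (1) adds follower counts, a user who follows more than one of A, B, C and D is counted once for each of them.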

An algorithm for Eqn. (1) is provided below for a communication $c$ having $n_r$ retweets denoted $r_j$, $j = 1, 2, ..., n_r$. The spread $SP(r_j)$ of each retweet is obtained by applying the same algorithm to $r_j$.

Algorithm: Calculation of spread of a communication c
(a) $SP(c) = |F(S(c))|$;
(b) If $n_r = 0$, go to (g);
(c) $j = 1$;
(d) $SP(c) \mathrel{+}= SP(r_j)$;
(e) $j \mathrel{+}= 1$;
(f) If $j \le n_r$, go to (d);
(g) STOP.
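The algorithm can be sketched in Python as a short recursion. This is a minimal illustration, not the authors' implementation: the `Communication` class and its field names are assumptions introduced here, and the cascade is assumed to be available as a tree of retweets.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Communication:
    """A tweet or retweet (hypothetical structure, not from the paper)."""
    sender_followers: int                      # |F(S(c))|
    retweets: List["Communication"] = field(default_factory=list)  # R(c)


def spread(c: Communication) -> int:
    """SP(c) = |F(S(c))| + sum of SP(c') over the direct retweets c' of c."""
    total = c.sender_followers
    for r in c.retweets:
        total += spread(r)  # each retweet is treated as a communication itself
    return total


# Invented example: a tweet by a user with 100 followers, retweeted by users
# with 50 and 20 followers; the second retweet is retweeted again by a user
# with 10 followers.
deepest = Communication(10)
tweet = Communication(100, [Communication(50), Communication(20, [deepest])])
print(spread(tweet))  # 180
```

For very deep cascades an explicit stack would avoid Python's recursion limit; the recursive form is kept here because it mirrors Eqn. (1) directly.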

Influential user determination using spread

Let us consider a user $U$ who made $N$ communications $c_1, c_2, ..., c_N$ in a social network. Using the spread of the communications made by $U$ (Eqn. (1)), the influence value $IV_S(U)$ is determined as

$$IV_S(U) = \sum_{i=1}^{N} SP(c_i). \tag{2}$$

The influence value of each user in the network is determined using Eqn. (2). The users are ranked according to their influence values, and the top-ranking users are taken as the most influential ones.
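A minimal sketch of Eqn. (2) and the ranking step follows. It assumes the spread of every communication has already been computed; the user names, the spread values and the alphabetical tie-break are illustrative assumptions, since the paper does not say how ties are handled.

```python
from typing import Dict, List, Tuple


def influence_values(spreads_by_user: Dict[str, List[int]]) -> Dict[str, int]:
    """IV_S(U): sum of SP(c_i) over the communications c_1..c_N made by U."""
    return {user: sum(spreads) for user, spreads in spreads_by_user.items()}


def rank_users(iv: Dict[str, int]) -> List[Tuple[str, int]]:
    """Sort users by decreasing influence value.

    Ties are broken alphabetically by user name (an assumption made here).
    """
    return sorted(iv.items(), key=lambda item: (-item[1], item[0]))


# Invented spread values per user.
spreads_by_user = {"alice": [180, 40], "bob": [300], "carol": [15, 5, 10]}
ranking = rank_users(influence_values(spreads_by_user))
print(ranking)  # [('bob', 300), ('alice', 220), ('carol', 30)]
```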

III. IMPLEMENTATION AND RESULTS

The proposed method of determining the influential users in a social network using the spread of communications is described in the previous section. The implementation and the experimental results of the method are discussed here.

A ranked list of all the users is generated according to their influence values from Eqn. (2). In order to evaluate the reliability/stability of the method, we divided the data into two parts according to posting time and compared the ranked lists of users generated by the method on the two parts. The similarity of the lists from the two parts signifies the reliability of the method.

To measure the similarity of the lists, we use Spearman rank correlation [28], Kendall rank correlation [27] and Mean Average Precision [26], as these are popular measures for evaluating ranked lists.

Spearman’s rank correlation coefficient ($\rho$) [28]:

$$\rho = 1 - \frac{6 \sum (a_i - b_i)^2}{n(n^2 - 1)} \tag{3}$$

where $a$ and $b$ are two ranking lists of $n$ users.

Kendall Tau rank correlation coefficient ($\tau$) [27]:

$$\tau = \frac{n_c - n_d}{0.5\, n(n - 1)} \tag{4}$$


Fig. 1. Example: flow of a communication in ‘Twitter’. (Figure legend: nodes reacting to a communication; nodes not reacting to a communication; flow of a communication; connections where the communication does not reach. The example spans Level 1 to Level 5.)

TABLE I
DESCRIPTION OF DATASETS USED FOR EXPERIMENTS

Set    Topic        Part    Time                          Users      Tweets
Set-1  #euro12      Part-1  24 June-27 June, 2012         15,523     25,075
Set-1  #euro12      Part-2  28 June-2 July, 2012          15,803     24,069
Set-2  #wimbledon   Part-1  02 July-08 July, 2012         62,286     116,130
Set-2  #wimbledon   Part-2  09 July-10 July, 2012         63,532     90,684
Set-3  #london2012  Part-1  16 July-02 August, 2012       1,064,184  3,613,754
Set-3  #london2012  Part-2  03 August-22 August, 2012     1,066,198  3,486,210

where $n_c$ is the number of concordant pairs and $n_d$ is the number of discordant pairs. Given two items $i$ and $j$, if $a_i > a_j$ and $b_i > b_j$ (or $a_i < a_j$ and $b_i < b_j$), $i$ and $j$ are a concordant pair; otherwise $i$ and $j$ are a discordant pair.
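Both coefficients can be computed directly from the definitions in Eqns. (3) and (4). The sketch below assumes that `a` and `b` hold the ranks of the same `n` users in the same order and that there are no tied ranks; the function names and the sample ranks are illustrative.

```python
from itertools import combinations
from typing import Sequence


def spearman_rho(a: Sequence[int], b: Sequence[int]) -> float:
    """Eqn. (3): rho = 1 - 6 * sum((a_i - b_i)^2) / (n * (n^2 - 1))."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(a: Sequence[int], b: Sequence[int]) -> float:
    """Eqn. (4): tau = (n_c - n_d) / (0.5 * n * (n - 1))."""
    n = len(a)
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both lists order i and j the same way;
        # following the text, every other pair is counted as discordant.
        if (a[i] - a[j]) * (b[i] - b[j]) > 0:
            n_c += 1
        else:
            n_d += 1
    return (n_c - n_d) / (0.5 * n * (n - 1))


# Invented ranks of five users in two lists.
a = [1, 2, 3, 4, 5]
b = [2, 1, 3, 5, 4]
print(spearman_rho(a, b), kendall_tau(a, b))
```

On this example the squared rank differences sum to 4 and two of the ten pairs are discordant, so the two coefficients come out at about 0.8 and 0.6 respectively.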

Mean Average Precision [26]:

$$MAP = \frac{\sum_{i=1}^{n} AvPr(u_i)}{n} \tag{5}$$

where, for the $n$ users in list $a$, $AvPr(u_i)$ is the average precision of the rank of user $u_i$ in list $b$.
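The text does not spell out how $AvPr(u_i)$ is computed, so the sketch below falls back on the standard average precision from information retrieval [26] and should be read as one plausible interpretation rather than the authors' exact procedure. It treats the top users of list a as the relevant set and list b as the ranking being scored; all names and sample data are illustrative.

```python
from typing import Iterable, Sequence


def average_precision(relevant: Iterable[str], ranked: Sequence[str]) -> float:
    """Standard average precision of `ranked` against the set `relevant`.

    Precision is taken at every rank where a relevant user appears and the
    values are averaged over the number of relevant users.
    """
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = 0
    total = 0.0
    for rank, user in enumerate(ranked, start=1):
        if user in relevant_set:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant_set)


# Invented example: top-3 users of part 1 scored against the part-2 ranking.
top_part1 = ["u1", "u2", "u3"]
ranking_part2 = ["u2", "x", "u1", "u3", "y"]
ap = average_precision(top_part1, ranking_part2)
print(round(ap, 4))  # 0.8056
```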

As we are interested in finding the top-level influential users, we report results for the top 100 and top 1000 users when evaluating the ranking lists.

It should be noted that the two lists coming from the two parts of a set do not contain exactly the same users (one list may contain only some of the users in the other list's top 100 or 1000). We have therefore matched the lists to make them comparable.

In order to evaluate the performance of the method, we need to compare it with other metrics on different datasets. In the literature [22], retweet count is found to be the most effective measure for finding the most influential users, so our results are compared here with those of retweet count.

A. Dataset Collection

In order to assess the effectiveness of the proposed method, we collected data from ‘Twitter’. We crawled ‘Twitter’ with the help of the Twitter Search API [25]. We chose the topics #euro12, #wimbledon and #london2012 and downloaded the data during the Euro Cup 2012, Wimbledon 2012 and the London Olympics 2012, respectively. We divided each dataset into two parts according to the date of the tweets. The details of the datasets are provided in Table I.

B. Results

We have implemented our method on the aforementioned datasets. The spread of a communication is calculated using Eqn. (1). This is done for the communications made by each of the users. The influence value of each user is then obtained for both parts of each set using Eqn. (2), and the users are ranked according to their influence values.

C. Experimental Verification

As stated earlier, Mean Average Precision (MAP) [26], Kendall rank correlation [27] and Spearman rank correlation [28] are widely used in the literature, so we have taken them as evaluation measures in the present article. The lists of users for both parts are matched as explained earlier. Retweet count is found to be the most useful measure for finding the influential users [22], and we have compared the performance of the proposed method with it. We therefore create the list of the top 100 and top 1000 influential users from part-1, together with their ranks in part-2, according to retweet count as well. Users ranked below the top 100 have a negligible effect on the MAP; it is therefore determined for the top 100 users only.
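The matching of the two lists could be done along the following lines. This is a sketch under assumptions: the paper does not state how users missing from part-2 are handled, so here they are simply dropped and the remaining users are re-ranked from 1 within each part. The function name and sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def match_top_k(
    part1_ranking: Sequence[str], part2_ranking: Sequence[str], k: int
) -> Tuple[List[str], List[int], List[int]]:
    """Pair the ranks of the top-k users of part 1 with their ranks in part 2.

    Users absent from part 2 are dropped (an assumption). Ranks are then
    renumbered 1..m within each part so both lists cover the same m users.
    """
    position2 = {user: i for i, user in enumerate(part2_ranking)}
    common = [u for u in part1_ranking[:k] if u in position2]
    ranks1 = list(range(1, len(common) + 1))
    order2 = sorted(common, key=lambda u: position2[u])
    rank_in_2 = {u: r for r, u in enumerate(order2, start=1)}
    ranks2 = [rank_in_2[u] for u in common]
    return common, ranks1, ranks2


# Invented rankings, best user first.
part1 = ["a", "b", "c", "d"]
part2 = ["c", "a", "e", "d", "b"]
users, ranks1, ranks2 = match_top_k(part1, part2, k=3)
print(users, ranks1, ranks2)  # ['a', 'b', 'c'] [1, 2, 3] [2, 3, 1]
```

The two rank lists returned this way have equal length and no ties, which is what the Spearman and Kendall formulas in Eqns. (3) and (4) expect.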

The lists obtained above are now compared to evaluate the method. The results for Spearman rank correlation, Kendall rank correlation and Mean Average Precision are shown in Tables II, III and IV respectively. For better visualization, the results are also shown in Figures 2, 3 and 4 respectively.

TABLE II
SPEARMAN RANK CORRELATION COEFFICIENT OF RT COUNT AND SPREAD COUNT

Topic       No. of Users  RT Count  Spread Count
euro12      100           0.30798   0.36985
euro12      1000          0.12006   0.46923
wimbledon   100           0.14437   0.65236
wimbledon   1000          0.38258   0.65310
london2012  100           0.29626   0.64812
london2012  1000          0.33987   0.65310

TABLE III
KENDALL RANK CORRELATION COEFFICIENT OF RT COUNT AND SPREAD COUNT

Topic       No. of Users  RT Count  Spread Count
euro12      100           0.22182   0.25939
euro12      1000          0.13347   0.32802
wimbledon   100           0.39354   0.47717
wimbledon   1000          0.32202   0.47221
london2012  100           0.19758   0.47434
london2012  1000          0.23778   0.48612

TABLE IV
MEAN AVERAGE PRECISION OF RT COUNT AND SPREAD COUNT

Topic       RT Count  Spread Count
euro12      0.06800   0.21819
wimbledon   0.24748   0.57398
london2012  0.18104   0.63360

From the tables and the figures we see that the spread of communications has performed better than retweet count on every measure for all the sets considered. Therefore, we conclude that this factor is quite effective in finding the influential users in an online social network.

IV. CONCLUSIONS

This article discussed the determination of the most influential nodes in a social networking site. The spread of communications has been introduced here to determine the influential users. It refers to how far a communication has reached, and it is measured through the retweets of a communication by the followers of a user.

Fig. 2. Spearman Rank Correlation Coefficient of RT Count and Spread Count

Fig. 3. Kendall Rank Correlation Coefficient of RT Count and Spread Count

We have considered only ‘Twitter’ in describing the proposed method. Although social networking sites differ in structure and goal, the concept of the method is applicable to most other social networking sites as well. In such cases some of the steps of the method need to be modified slightly, depending on the structure of the site.

We have implemented the method on three datasets collected from ‘Twitter’. The performance of our method is found to be better than that of the retweet count method on all three datasets. As shown in [22], retweet count is the leading method for detecting influential users. Therefore, the proposed method is the best among existing methods for identifying the most influential users.

REFERENCES

[1] Facebook, http://www.facebook.com
[2] Twitter, http://twitter.com
[3] MySpace, http://www.myspace.com
[4] Kim, W., Jeong, O. & Lee, S.W., 2010. On social Web sites. Information Systems, 35(2), p.215-236.
[5] Balkundi, P. & Kilduff, M., 2005. The ties that lead: A social network approach to leadership. The Leadership Quarterly, 16(6), p.941-961.
[6] Bar-Ilan, J. & Peritz, B.C., 2009. A method for measuring the evolution of a topic on the Web: The case of informetrics. Journal of the American Society for Information Science and Technology, 60(9), p.1730-1740.
[7] Bonacich, P., 2007. Some unique properties of eigenvector centrality. Social Networks, 29(4), p.555-564.


Fig. 4. MAP of RT Count and Spread Count

[8] Bongwon Suh et al., 2010. Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network. In Social Computing (SocialCom), 2010 IEEE Second International Conference on. p. 177-184.
[9] Boyd, D., Golder, S. & Lotan, G., 2010. Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter. In Hawaii International Conference on System Sciences. Los Alamitos, CA, USA: IEEE Computer Society, p. 110.
[10] Cha, M., Haddadi, H. & Gummadi, P.K., 2010. Measuring User Influence in Twitter: The Million Follower Fallacy. In International Conference on Weblogs and Social Media.
[11] Gayo-Avello, D., 2010a. Detecting Important Nodes to Community Structure Using the Spectrum of the Graph. Cornell University Library.
[12] Gayo-Avello, D., 2010b. Nepotistic relationships in twitter and their impact on rank prestige algorithms. Arxiv preprint arXiv:1004.0816.
[13] Gruhl, D. et al., 2004. Information diffusion through blogspace. In Proceedings of the 13th international conference on World Wide Web. New York, NY, USA: ACM, p. 491-501.
[14] Nagarajan, M., Purohit, H. & Sheth, A., 2010. A Qualitative Examination of Topical Tweet and Retweet Practices. In ICWSM 2010. International AAAI Conference on Weblogs and Social Media. Washington, DC.
[15] Nagle, F. & Singh, L., 2009. Can Friends Be Trusted? Exploring Privacy in Online Social Networks. In 2009 International Conference on Advances in Social Network Analysis and Mining (ASONAM). Athens, Greece, p. 312-315.
[16] Pal, A. & Counts, S., 2011. Identifying topical authorities in microblogs. In Proceedings of the fourth ACM international conference on Web search and data mining. Hong Kong, China: ACM, p. 45-54.
[17] Romero, D.M. et al., 2011. Influence and passivity in social media. In Proceedings of the 20th international conference companion on World wide web - WWW 11. Hyderabad, India, p. 113.
[18] Sakaki, T. & Matsuo, Y., 2010. How to Become Famous in the Microblog World. 2010.
[19] Sousa, D., Sarmento, L. & Mendes Rodrigues, E., 2010. Characterization of the twitter @replies network. In Proceedings of the 2nd international workshop on Search and mining user-generated contents - SMUC 10. Toronto, ON, Canada, p. 63.
[20] Welch, M.J. et al., 2011. Topical semantics of twitter links. In Proceedings of the fourth ACM international conference on Web search and data mining. Hong Kong, China: ACM, p. 327-336.
[21] Yamaguchi, Y. et al., 2010. TURank: Twitter User Ranking Based on User-Tweet Graph Analysis. In Web Information Systems Engineering - WISE 2010. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, p. 240-253.
[22] Ye, S. & Wu, S., 2010. Measuring Message Propagation and Social Influence on Twitter.com. In Social Informatics. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, p. 216-231.
[23] Kwak, H. et al., 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web. Raleigh, North Carolina, USA: ACM, p. 591-600.
[24] Page, L. et al., 1999. The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab.
[25] Kevin Makice. 2009. Twitter API: Up and Running: Learn how to Build Applications with the Twitter API (1st ed.). O’Reilly Media, Inc.
[26] C. D. Manning, P. Raghavan, H. Schutze, An Introduction to Information Retrieval. Cambridge University Press, New York, 2008.
[27] Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2), 81-93 (June 1938).
[28] C. Spearman, The proof and measurement of association between two things. The American Journal of Psychology 15, 72-101 (1904).
