influence twiter.pdf

Upload: emanuel-robles

Post on 03-Jun-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/12/2019 influence twiter.pdf

    1/9

    Influence and Passivity in Social Media

    Daniel M. RomeroCornell UniversityCenter for Applied

    MathematicsIthaca, New York, USA

    [email protected]

    Wojciech GalubaDistributed Information

    Systems Lab

    EPFLLausanne, Switzerland

    [email protected]

    Sitaram AsurSocial Computing Lab

    HP LabsPalo Alto, California, [email protected]

    Bernardo A. HubermanSocial Computing Lab

    HP LabsPalo Alto, California, USA

    [email protected]

    ABSTRACT

    The ever-increasing amount of information flowing throughSocial Media forces the members of these networks to com-

    pete for attention and influence by relying on other peopleto spread their message. A large study of information propa-gation within Twitter reveals that the majority of users actas passive information consumers and do not forward thecontent to the network. Therefore, in order for individualsto become influential they must not only obtain attentionand thus be popular, but also overcome user passivity. Wepropose an algorithm that determines the influence and pas-sivity of users based on their information forwarding activ-ity. An evaluation performed with a 2.5 million user datasetshows that our influence measure is a good predictor of URLclicks, outperforming several other measures that do not ex-plicitly take user passivity into account. We demonstratethat high popularity does not necessarily imply high influ-

    ence and vice-versa.

    1. INTRODUCTIONThe explosive growth of Social Media has provided mil-

    lions of people the opportunity to create and share contenton a scale barely imaginable a few years ago. Massive partic-ipation in these social networks is reflected in the countlessnumber of opinions, news and product reviews that are con-stantly posted and discussed in social sites such as Facebook,Digg and Twitter, to name a few. Given this widespreadgeneration and consumption of content, it is natural to tar-get ones messages to highly connected people who will prop-agate them further in the social network. This is particularlythe case in Twitter, which is one of the fastest growing so-

    cial networks on the Internet, and thus the focus of advertis-

    Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

    ing companies and celebrities eager to exploit this vast newmedium. As a result, ideas, opinions, and products competewith all other content for the scarce attention of the user

    community. In spite of the seemingly chaotic fashion withwhich all these interactions take place, certain topics man-age to get an inordinate amount of attention, thus bubblingto the top in terms of popularity and contributing to newtrends and to the public agenda of the community. Howthis happens in a world where crowdsourcing dominates isstill an unresolved problem, but there is considerable consen-sus on the fact that two aspects of information transmissionseem to be important in determining which content receivesattention.

    One aspect is the popularity and status of given membersof these social networks, which is measured by the level ofattention they receive in the form of followers who createlinks to their accounts to automatically receive the contentthey generate. The other aspect is the influence that theseindividuals wield, which is determined by the actual propa-gation of their content through the network. This influenceis determined by many factors, such as the novelty and reso-nance of their messages with those of their followers and thequality and frequency of the content they generate. Equallyimportant is the passivity of members of the network whichprovides a barrier to propagation that is often hard to over-come. Thus gaining knowledge of the identity of influentialand least passive people in a network can be extremely usefulfrom the perspectives of viral marketing, propagating onespoint of view, as well as setting which topics dominate thepublic agenda.

    In this paper, we analyze the propagation of web linkson Twitter over time to understand how attention to given

    users and their influence is determined. We devise a gen-eral model for influence using the concept of passivity in asocial network and develop an efficient algorithm similar tothe HITS algorithm [14] to quantify the influence of all theusers in the network. Our influence measure utilizes boththe structural properties of the network as well as the dif-fusion behavior among users. The influence of a user thusdepends not only on the size of the influenced audience, butalso on their passivity. This differentiates our measure ofinfluence from earlier ones, which were primarily based onindividual statistical properties such as the number of fol-

  • 8/12/2019 influence twiter.pdf

    2/9

    lowers or retweets [7].We have shown through extensive evaluation that this in-

    fluence model outperforms other measures of influence suchas PageRank, H-index, the number of followers and the num-ber of retweets. In addition, it has good predictive propertiesin that it can forecast in advance the upper bound on thenumber of clicks a URL can get. We have also presentedcase studies showing the top influential users uncovered byour algorithm. An important conclusion from the results

    is that the correlation between popularity and influence isquite weak, with the most influential users not necessarilybeing the ones with the highest popularity. Additionally,when we considered nodes with high passivity, we found themajority of them to be spammers and robot users. Thisdemonstrates the applicability of our algorithm to automaticuser categorization and filtering of online content.

    2. RELATED WORKThe study of information and influence propagation in

    social networks has been particularly active for a numberof years in fields as disparate as sociology, communication,marketing, political science and physics. Earlier work fo-

    cused on the effects that scale-free networks and the affinityof their members for certain topics had on the propagationof information [6]. Others discussed the presence of keyinfluentials [12, 11, 8, 5, 10] in a social network, defined asthose who are responsible for the overall information dis-semination in the network. This research highlighted thevalue of highly connected individuals as key elements in thepropagation of information through the network.

    Huberman et al. [2] studied the social interactions onTwitter to reveal that the driving process for usage is asparse hidden network underlying the friends and followers,while most of the links represent meaningless interactions.Jansen et al. [3] have examined Twitter as a mechanismfor word-of-mouth advertising. They considered particularbrands and products and examined the structure of the post-ings and the change in sentiments. Galuba et al. [4] proposea propagation model that predicts, which users will tweetabout which URLs based on the history of past user activ-ity.

    There have also been earlier studies focused on social in-fluence and propagation. Agarwal et al. [8] have examinedthe problem of identifying influential bloggers in the blogo-sphere. They discovered that the most influential bloggerswere not necessarily the most active. Aral et al [9] havedistinguished the effects of homophily from influence as mo-tivators for propagation. As to the study of influence withinTwitter, Cha et al. [7] have performed a comparison of threedifferent measures of influence - indegree, retweets and usermentions. They discovered that while retweets and men-

    tions correlated well with each other, the indegree of usersdid not correlate well with the other two measures. Basedon this, they hypothesized that the number of followers maynot a good measure of influence. On the other hand, Wenget al [5] have proposed a topic-sensitive PageRank measurefor influence in Twitter. Their measure is based on the factthat they observed high reciprocity among follower relation-ships in their dataset, which they attributed to homophily.However, other work [7] has shown that the reciprocity islow overall in Twitter and contradicted the assumptions ofthis work.

    3. TWITTER

    3.1 Background on TwitterTwitter is an extremely popular online microblogging ser-

    vice, that has gained a very large user base, consisting ofmore than 105 million users (as of April 2010). The Twittergraph is a directed social network, where each user choosesto follow certain other users. Each user submits periodicstatus updates, known astweets, that consist of short mes-sages limited in size to 140 characters. These updates typi-cally consist of personal information about the users, newsor links to content such as images, video and articles. Theposts made by a user are automatically displayed on theusers profile page, as well as shown to his followers.

    Aretweetis a post originally made by one user that is for-warded by another user. Retweets are useful for propagatinginteresting posts and links through the Twitter community.

    Twitter has attracted lots of attention from corporationsfor the immense potential it provides for viral marketing.Due to its huge reach, Twitter is increasingly used by newsorganizations to disseminate news updates, which are thenfiltered and commented on by the Twitter community. Anumber of businesses and organizations are using Twitter

    or similar micro-blogging services to advertise products anddisseminate information to stockholders.

    3.2 DatasetTwitter provides a Search API for extracting tweets con-

    taining particular keywords. To obtain the dataset for thisstudy, we continuously queried the Twitter Search API fora period of 300 hours starting on 10 Sep 2009 for all tweetscontaining the string http. This allowed us to acquire acomplete stream of all the tweets that contain URLs. Weestimated the 22 million we accumulated to be 1/15th of theentire Twitter activity at that time. From each of the ac-cumulated tweets, we extracted the URL mentions. Each ofthe unique 15 million URLs in the dataset was then checkedfor valid formatting and the URLs shortened via the services

    such as bit.ly or tinyurl.com were expanded into theiroriginal form by following the HTTP redirects. For eachencountered unique user ID, we queried the Twitter APIfor metadata about that user and in particular the usersfollowers and followees. The end result was a dataset oftimestamped URL mentions together with the complete so-cial graph for the users concerned.

    User graph. The user graph contains those users whosetweets appeared in the stream, i.e., users that during the300 hour observation period posted at least one public tweetcontaining a URL. The graph does not contain any users whodo not mention any URLs in their tweets or users that havechosen to make their Twitter stream private.

    For each newly encountered user ID, the list of followedusers was only fetched once. Our dataset does not capturethe changes occurring in the user graph over the observationperiod.

    4. THE IP ALGORITHMEvidence for passivity. The users that receive informa-

    tion from other users may never see it or choose to ignoreit. We have quantified the degree to which this occurs onTwitter (Fig. 4). An average Twitter user retweets only onein 318 URLs, which is a relatively low value. The retweetingrates vary widely across the users and the small number of

  • 8/12/2019 influence twiter.pdf

    3/9

    1 0

    - 8

    1 0

    - 7

    1 0

    - 6

    1 0

    - 5

    1 0

    - 4

    1 0

    - 3

    1 0

    - 2

    1 0

    - 1

    1 0

    0

    r e t w e e t i n g r a t e

    1 0

    0

    1 0

    1

    1 0

    2

    1 0

    3

    1 0

    4

    1 0

    5

    1 0

    6

    #

    u

    s

    e

    r

    s

    a u d i e n c e r e t w e e t i n g r a t e

    u s e r r e t w e e t i n g r a t e

    Figure 1: Evidence for the Twitter user passivity.We measure passivity by two metrics: 1. the userretweeting rate and 2. the audience retweeting rate.The user retweeting rate is the ratio between thenumber of URLs that user i decides to retweet tothe total number of URLs user i received from thefollowed users. The audience retweeting rate is theratio between the number of user is URLs that wereretweeted by is followers to the number of times afollower of i received a URL from i.

    the most active users play an important role in spreading theinformation in Twitter. This suggests that the level of userpassivity should be taken into account for the informationspread models to be accurate.

    Assumptions. Twitter is used by many people as a toolfor spreading their ideas, knowledge, or opinions to others.An interesting and important question is whether it is pos-sible to identify those users who are very good at spreadingtheir content, not only to those who choose to follow them,but to a larger part of the network. It is often fairly easyto obtain information about the pairwise influence relation-ships between users. In Twitter, for example, one can mea-sure how much influence user A has on user B by countingthe number of times B retweeted A. However, it is not veryclear how to use the pairwise influence information to accu-rately obtain information about the relative influence eachuser has on the whole network. To answer this question,we design an algorithm (IP) that assigns a relativeinfluencescore and a passivity score to every user. The passivity ofa user is a measure of how difficult it is for other users toinfluence him. Since we found evidence that users on Twit-ter are generally passive, the algorithm takes into accountthe passivity of all the people influenced by a user, whendetermining the users influence. In other words, we assumethat the influence of a user depends on both the quantity

    and the quality of the audience she influences. In general,our model makes the following assumptions:

    1. A users influence score depends on the number of peo-ple she influences as well as their passivity.

    2. A users influence score depends on how dedicated thepeople she influences are. Dedication is measured bythe amount of attention a user pays to a given one ascompared to everyone else.

    3. A users passivity score depends on the influence of

    those who shes exposed to but not influenced by.

    4. A users passivity score depends on how much she re-jects other users influence compared to everyone else.

    Operation. The algorithm iteratively computes both thepassivity and influence scores simultaneously in the followingway:

    Given a weighted directed graph G = (N,E,W) withnodes N, arcs E, and arc weights W, where the weightswij on arc e = (i, j) represent the ratio of influence that iexerts on j to the total influence that i attempted to exerton j, the IP algorithm outputs a function I : N [0, 1],which represents the nodes relative influence on the net-work, and a function P : N [0, 1] which represents thenodes relative passivity.

    For every arc e = (i, j) E, we define theacceptance rate

    by uij = wi,j

    k:(k,j)E

    wkj. This value represents the amount

    of influence that user j accepted from user i normalizedby the total influence accepted by j from all users in thenetwork. The acceptance rate can be viewed as the dedi-cation or loyalty user j has to user i. On the other hand,

    for every e = (j, i) E we define the rejection rate byvji =

    1 wji

    k:(j,k)E

    (1 wjk). Since the value 1wji is amount

    of influence that user i rejected from j, then the value vjirepresents the influence that useri rejected from userj nor-malized by the total influence rejected fromj by all users inthe network.

    The algorithm is based on the following operations:

    Ii

    j:(i,j)E

    uijPj (1)

    Pi

    j:(j,i)E

    vjiIj (2)

    Each term on the right hand side of the above operationscorresponds to one of the listed assumptions. In operation1, the term Pj corresponds to assumption 1 and the termuij corresponds to assumption 2. In operation 2, the termIjcorresponds to assumption 3 and the termvji correspondsto assumption 4. TheInfluence-Passivity algorithm (Algo-rithm 1) takes the graphG as the input and computes theinfluence and passivity for each node in m iterations.

    The IP algorithm is similar to the HITS algorithm forfinding authoritative web pages and hubs that link to them[14]. The passivity score corresponds to the authority score,and the influence corresponds to hub score. However, IP isdifferent from HITS in that it operates on a weighted graphand it takes into account other properties of the networksuch as those referred to as acceptance rate and rejectionrate.

    Generating the input graph. There are many ways ofdefining the influence graph G = (N,E,W). We constructit by taking into account retweets and the follower graphin the following way: The nodes are users who tweeted atleast 3 URLs. The arc (i, j) exists if userj retweeted a URLposted by user i at least once. The arce = (i, j) has weight

    we= SijQij

    whereQi is the number of URLs thati mentioned

    andSij is the number of URLs mentioned byiand retweetedbyj .

  • 8/12/2019 influence twiter.pdf

    4/9

    Algorithm 1: The Influence-Passivity (IP) algorithm

    I0 (1, 1, . . . , 1) R|N|;

    P0 (1, 1, . . . , 1) R|N|;

    for i= 1 to m doUpdate Pi using operation (2) and the values Ii1;Update Ii using operation (1) and the values Pi;for j= 1 to |N| do

    Ij = IjkN

    Ik;

    Pj = Pj

    kN

    Pk;

    end

    endReturn (Im, Pm);

    5. EVALUATION

    5.1 Computations

    Based on the obtained dataset (3.2) we generate theweighted graph using the method described in 4. Thegraph consists of approximately 450k nodes and 1 millionarcs with mean weight of 0.07, and we use it to computethe PageRank, influence and passivity values for each node.The Influence-Passivity algorithm (Algorithm1) convergesto the final values in tens of iterations (Fig. 2).

    PageRank. The PageRank algorithm has been widelyused to rank web pages as well as people based on theirauthority and influence [13, 5]. In order to compare it withthe results from the IP algorithm, we compute PageRankon the weighted graph G= (N,E,W) with a small change.First, since the arcs e = (i, j) E indicate that user iexerts some influence on user j then we invert all the arcsbefore running PageRank on the graph while leaving the

    weights intact. In other words, we generate a new graphG = (N, E,W) where N = N, E ={(i, j) : (j, i) E},and for each (i, j) E we define w ij =wji. This generatesa new graph G analogous to G but where the influencedusers point to their influencers. Second, since the graphG is weighted we assume that when the the random surferof the PageRank algorithm is currently at the node i, she

    chooses to visit nodej next with probability wij

    k:(i,k)E

    wik

    .

    The Hirsch Index. The Hirsch index (or H-index) isused in the scientific community in order to measure theproductivity and impact of a scientist. A scientist has indexhif he has published harticles which have been cited at least

    htimes each. It has been shown that the H-index is a goodindicator of whether a scientist has had high achievementssuch as getting the Nobel prize [16]. Analogously, in Twitter,a user has index h ifh of his URL posts have been retweetedat least h times each.

    5.2 Influence as a correlate of attentionAny measure of influence is necessarily a subjective one.

    However, in this case, a good measure of influence shouldhave a high predictive power on how well the URLs men-tioned by the influential users attract attention and propa-

    0 1 0 2 0 3 0 4 0 5 0

    i t e r a t i o n

    1 0

    - 3

    1 0

    - 2

    1 0

    - 1

    1 0

    0

    1 0

    1

    c

    h

    a

    n

    g

    e

    i

    n

    v

    a

    l

    u

    e

    s

    Figure 2: IP-algorithm convergence. In each itera-tion we measure the sum of all the absolute changesof the computed influence and passivity values sincethe previous iteration

    gate in the social network. We would expect the URLs that

    highly influential users propagate to attract a lot of atten-tion and user clicks. Thus, a viable estimator of attention isthe number of times a URL has been accessed.

    Click data. Bit.ly is a URL shortening service that foreach shortened URL keeps track of how many times it hasbeen accessed. There are 3.2M unique Bit.ly URLs in thetweets from our dataset. We have queried the Bit.ly API forthe number of clicks the service has registered on each URL.

    A URL my be shortened by a user who has a Bit.ly ac-count. Each such shortening is a assigned a unique per-userBit.ly URL. To account for that we took the global clicksnumber returned by the API instead of theuser clicksnum-bers. The global clicks number sums the clicks across allthe Bit.ly shortenings of a given URL and across all theusers.

    URL traffic Prediction. Using the URL click data, wetake several different user attributes and test how well theycan predict the attention the URLs posted by the users re-ceive (Fig. 3). It is important to note that none of the influ-ence measures are capable of predicting the exact number ofclicks. The main reason for this is that the amount of atten-tion a URL gets is not only a function of the influence of theusers mentioning it, but also of many other factors includingthe virality of the URL itself and more importantly, whetherthe URL was mentioned anywhere outside of Twitter, whichis likely to be the biggest source of unpredictability in theclick data.

    The wide range of factors potentially affecting the Bit.lyclicks may prevent us from predicting their number accu-

    rately. However, the upper bound on that number can toa large degree be predicted. To eliminate the outlier cases,we examined how the 99.9th percentile of the clicks variedas the measure of influence increased.

    Number of followers. The most readily available andoften used by the Twitterers measure of influence is the num-ber of followers a user has. As the Figure 3(a) shows, thenumber of followers of an average poster of a given URL is arelatively weak predictor of the maximum number of clicksthat the URL can receive, with an R2 value of 0.59.

    Number of retweets. When users post URLs, their

  • 8/12/2019 influence twiter.pdf

    5/9

    1 0

    1

    1 0

    0

    1 0

    1

    1 0

    2

    1 0

    3

    1 0

    4

    1 0

    5

    1 0

    6

    a v e r a g e # o f f o l l o w e r s o f t h e p o s t i n g u s e r s

    1 0

    0

    1 0

    1

    1 0

    2

    1 0

    3

    1 0

    4

    1 0

    5

    1 0

    6

    1 0

    7

    #

    b

    i

    t

    .

    l

    y

    c

    l

    i

    c

    k

    s

    R

    2

    = 0 . 5 9

    1

    1 0

    1 0 0

    1 0 0 0

    1 0 0 0 0

    #

    u

    r

    l

    s

    (a) Average number of followers vs. number of clickson URLs

    1 0

    3

    1 0

    2

    1 0

    1

    1 0

    0

    1 0

    1

    1 0

    2

    1 0

    3

    1 0

    4

    a v e r a g e # t i m e s p o s t i n g u s e r s w e r e r e t w e e t e d

    1 0

    0

    1 0

    1

    1 0

    2

    1 0

    3

    1 0

    4

    1 0

    5

    1 0

    6

    1 0

    7

    #

    b

    i

    t

    .

    l

    y

    c

    l

    i

    c

    k

    s

    R

    2

    = 0 . 0 2

    1

    1 0

    1 0 0

    1 0 0 0

    1 0 0 0 0

    #

    u

    r

    l

    s

    (b) Average number of times users were retweeted vs.number of clicks on URLs

    1 0

    7

    1 0

    6

    1 0

    5

    1 0

    4

    1 0

    3

    1 0

    2

    a v e r a g e P a g e R a n k o f t h e p o s t i n g u s e r s

    1 0

    0

    1 0

    1

    1 0

    2

    1 0

    3

    1 0

    4

    1 0

    5

    1 0

    6

    1 0

    7

    #

    b

    i

    t

    .

    l

    y

    c

    l

    i

    c

    k

    s

    R

    2

    = 0 . 8 4

    1

    1 0

    1 0 0

    1 0 0 0

    1 e 4

    1 e + 0 5

    #

    u

    r

    l

    s

    (c) Average user PageRank vs. number of clicks onURLs

    1 0

    0

    1 0

    1

    1 0

    2

    a v e r a g e H - i n d e x o f t h e p o s t i n g u s e r s

    1 0

    0

    1 0

    1

    1 0

    2

    1 0

    3

    1 0

    4

    1 0

    5

    1 0

    6

    1 0

    7

    #

    b

    i

    t

    .

    l

    y

    c

    l

    i

    c

    k

    s

    R

    2

    = 0 . 0 5

    1

    1 0

    1 0 0

    1 0 0 0

    1 e 4

    1 e + 0 5

    #

    u

    r

    l

    s

    (d) Average user H-index vs. number of clicks onURLs

    1 0

    9

    1 0

    8

    1 0

    7

    1 0

    6

    1 0

    5

    1 0

    4

    1 0

    3

    1 0

    2

    a v e r a g e I P - i n f l u e n c e o f t h e p o s t i n g u s e r s

    1 0

    0

    1 0

    1

    1 0

    2

    1 0

    3

    1 0

    4

    1 0

    5

    1 0

    6

    1 0

    7

    #

    b

    i

    t

    .

    l

    y

    c

    l

    i

    c

    k

    s

    R

    2

    = 0 . 9 5

    1 0

    1 0 0

    1 0 0 0

    1 0 0 0 0

    #

    u

    r

    l

    s

    (e) Average user IP-influence vs. number of clicks on

    URLs, using the retweet graph as input

    1 0

    9

    1 0

    8

    1 0

    7

    1 0

    6

    1 0

    5

    1 0

    4

    1 0

    3

    a v e r a g e I P - i n f l u e n c e o f t h e p o s t i n g u s e r s

    1 0

    0

    1 0

    1

    1 0

    2

    1 0

    3

    1 0

    4

    1 0

    5

    1 0

    6

    1 0

    7

    #

    b

    i

    t

    .

    l

    y

    c

    l

    i

    c

    k

    s

    R

    2

    = 0 . 8 7

    1 0

    1 0 0

    1 0 0 0

    1 0 0 0 0

    #

    u

    r

    l

    s

    (f) Average user IP-influence vs. number of clicks on

    URLs, using the co-mention graph as input

    Figure 3: We consider several user attributes: the number of followers, the number of times a user has beenretweeted, the users PageRank, H-index and IP-influence. For each of the 3.2M Bit.ly URLs we computethe average value of a users attribute among all the users that mentioned that URL. This value becomes thex coordinate of the URL-point; the y coordinate is the number of clicks on the Bit.ly URL. The density ofthe URL-points is then plotted for each of the four user attributes. The solid line in each figure representsthe 99.9th percentile of Bit.ly clicks at a given attribute value. The dotted line is the linear regression fit forthe solid line with the fits R2 and slope displayed beside it.

  • 8/12/2019 influence twiter.pdf

    6/9

    posts might be retweeted by other users. Each retweet ex-plicitly credits the original poster of the URL (or the userfrom whom the retweeting user heard about the URL). Thenumber of times a user has been credited in a retweet hasbeen assumed to be a good measure of influence [7]. How-ever, Figure 3(b) shows that the number of times a user hasbeen retweeted in the past is an extremely poor predictorof the maximum number of clicks the URLs posted by thatuser can get.

    The Hirsch Index. Figure 3(d) shows that despite thefact that in the scientific community the H-index is used as agood predictor of scientific achievements, in Twitter, it hasvery low correlation with URL popularity (R2 of 0.05). Thismay reflect the fact that attention in the scientific commu-nity plays a symmetric role, since those who pay attentionto the work of others also seek it from the same commu-nity. Thus, citations play a strategic role in the successfulpublishing of papers, since the expectation of authors is thatreferees and authors will demand attention to their work andthose of their colleagues. Within Social Media such symme-try does not exist and thus the decision to forward a messageto the network lacks this particularly strategic value.

    PageRank. Figure 3(c) shows that the average PageR-

    ank of those who tweet a certain URL is a much betterpredictor of the URLs traffic than the average number offollowers, retweets, or Hirsch index. The reason for the im-provement could be explained by the fact that PageRanktakes into account structural properties of the graph as op-posed to individual measures of the users. However, figure3(c) also shows that IP influence is a better indicator ofURL popularity than PageRank. One of the main differ-ences between the IP algorithm and PageRank is that theIP algorithm takes into account the passivity of the people auser influences and PageRank does not. This suggests thatinfluencing users who are difficult to influence, as opposedto simply influencing many users, has a positive impact onthe eventual popularity of the message that a user tweets.

    IP-Influence score. As we can see in Figure 3(e), the

    average IP-influence of those who tweeted a certain URL candetermine the maximum number of clicks that a URL willget with good accuracy, achieving an R2 score of 0.95. Sincethe URL clicks are never considered by the IP algorithm tocompute the users influence, the fact that we find a veryclear connection between average IP-influence and the even-tual popularity of the URLs (measured by clicks) serves asan unbiased evaluation of the algorithm and demonstratesthe utility of IP-influence. For example, as we can see inFigure 3(e), given a group of users having very large aver-age IP-influence scores who post a URL we can estimate,with 99.9% certainty, that this URL will not receive morethan 100, 000 clicks. On the other hand, if a group of userswith very low average IP-influence score post the same URLwe can estimate, with 99.9% certainty that the URL will notreceive more than 100 clicks.

    Furthermore, figure 4 shows that a users IP-influence isnot well correlated with the number of followers she has.This reveals interesting implications about the relationshipbetween a persons popularity and the influence she has onother people. In particular, it shows that having many fol-lowers on Twitter does not directly imply the power to in-fluence them to click on a URL.

    In the above experiments, we have used the average num-ber of followers, retweets, PageRank, H-Index, and IP-influence

    1 0

    0

    1 0

    1

    1 0

    2

    1 0

    3

    1 0

    4

    1 0

    5

    1 0

    6

    # o f f o l l o w e r s

    1 0

    9

    1 0

    8

    1 0

    7

    1 0

    6

    1 0

    5

    1 0

    4

    1 0

    3

    1 0

    2

    I

    P

    -

    i

    n

    f

    l

    u

    e

    n

    c

    e

    1 0

    1 0 0

    1 0 0 0

    #

    u

    s

    e

    r

    s

    Figure 4: For each user we place a user-point withIP-influence as they coordinate and thex coordinateset to the number of users followers. The density ofuser-points is represented in grayscale. The corre-lation between IP-influence and #followers is 0.44.

    1 0

    9

    1 0

    8

    1 0

    7

    1 0

    6

    1 0

    5

    1 0

    4

    1 0

    3

    1 0

    2

    I P - i n f l u e n c e ( c o - m e n t i o n g r a p h )

    1 0

    9

    1 0

    8

    1 0

    7

    1 0

    6

    1 0

    5

    1 0

    4

    1 0

    3

    I

    P

    -

    i

    n

    f

    l

    u

    e

    n

    c

    e

    (

    r

    e

    t

    w

    e

    e

    t

    g

    r

    a

    p

    h

    )

    1 0

    1 0 0

    1 0 0 0

    #

    u

    s

    e

    r

    s

    Figure 5: The correlation between the IP-influencevalues computed based on two inputs: the co-mention influence graph and the retweet influencegraph. The correlation between the two influencevalues is 0.06.

    of the users who posted a URL to predict the URLs traf-fic. We examined other choices such as using the maximumnumber instead of the average, and obtained similar results.

    6. IP ALGORITHM ADAPTABILITYAs mentioned earlier (4) there are many ways of defining

    a social graph in which the edges indicate pairwise influ-

    ence. We have so far been using the graph based on whichuser retweeted which user (retweet influence graph). How-ever, the explicit signals of influence such as retweets arenot always available. One way of overcoming this obstacleis to use other, possibly weaker, signals of influence. In thecase of Twitter, we can define an influence graph based onmentions of URLs without regard of actual retweeting in thefollowing way.

    The co-mention graph. The nodes of the co-mentioninfluence graphare users who tweeted at least three URLs.The edge (i, j) exists if userj follows user i and j mentioned

  • 8/12/2019 influence twiter.pdf

    7/9

    at least one URL thati had previously mentioned. The edge

    e= (i, j) has weight we = SijFij+Sij

    where Fij is the number

    of URLs that i mentioned and j never did and S is thenumber of URLs mentioned by j and previously mentionedbyi.

    The resulting graph has the disadvantage that the edgesare based on a much less explicit notion of influence thanwhen based on retweets. Therefore the graph could have

    edges between users who do not influence each other. Onthe other hand, the retweeting conventions on Twitter arenot uniform and therefore sometimes users who repost aURL do not necessarily credit the correct source of the URLwith a retweet [15]. Hence, the influence graph based onretweets has potentially missing edges.

    Since the IP algorithm has the flexibility of allowing anyinfluence graph as input, we can compute the influence scoresof the users based on the co-mention influence graph andcompare with the results obtained from the retweet influ-ence graph. As we can see in Figure 3(f), we find that theretweet graph yields influence scores that are better at pre-dicting the maximum number of clicks a URL will obtainthan the co-mention influence graph. Nevertheless, Figure3(f) shows that the influence values obtained from the co-

    mention influence graph are still better at predicting URLtraffic than other measures such as PageRank, number offollowers, H-index or the total number of times a user hasbeen retweeted. Furthermore, Figure 5 shows that the influ-ence score based on both graphs do not correlate well, whichsuggests that considering explicit vs. implicit signals of in-fluence can change the outcome of the IP algorithm, whileat the same time maintaining its predictive value. In gen-eral, we find that the explicitness of the signal provided bythe retweets yields slightly better results when it comes topredicting URL traffic, however, the influence scores basedon co-mentions may surface a different set of p otentially in-fluential users.

    7. CASE STUDIESAs we mentioned earlier, one important application of the

    IP algorithm is ranking users by their relative influence. Inthis section, we present a series of rankings of Twitter usersbased on the influence, passivity, and number of followers.

    The most influential. Table 1 shows the users with themost IP-influence in the network. We constrain the numberof URLs posted to 10 to obtain this list, which is dominatedby news services from politics, technology, and Social Media.These users post many links which are forwarded by otherusers, causing their influence to be high.

    The most passive. Table 2 shows the users with themost IP-passivity in the network. Passive users are thosewho follow many people, but retweet a very small percent-

    age of the information they consume. Interestingly, robotaccounts (which automatically aggregate keywords or spe-cific content from any user on the network), suspended ac-counts (which are likely to be spammers), and users whopost extremely often are among the users with the mostIP-passivity. Since robots attend to all existing tweetsand only retweet certain ones, the percentage of informa-tion they forward from other users is actually very small.This explains why the IP-algorithm assigns them such highpassivity scores. This also highlights a new application ofthe IP-algorithm: automatic identification of robot users in-

    mashable Social Media Bloggerjokoanwar Film Directorgoogle Googleaplusk Actor Ashton Kutchersyfy Science Fiction Channelsmashingmag Online Developer Magazinemichellemalkin Conservative Commentatortheonion News Satire Organizationrww Tech/Social Media Blogger

    breakingnews News Aggregator

    Table 1: Users with the most IP-influence (with atleast 10 URLs posted in the period)

    cluding aggregators and spammers.

    redscarebot Keyword Aggregatordrunk bot Suspendedtea robot Keyword Aggregatorcondos Listing Aggregatorwootboot Suspendedraybeckerman Attorney

    hashphotography Keyword Aggregatorcharlieandsandy Suspendedms defy Suspendedrpattinsonbot Keyword Aggregator

    Table 2: Users with the most IP-passivity

    The least influential with many followers. We havedemonstrated that the amount of attention a person getsmay not be a good indicator of the influence they have inspreading their message. In order to make this point moreexplicit, we show, in Table 3, some examples of users whoare followed by many people but have relatively low influ-ence. These users are very popular and have the attentionof millions of people but are not able to spread their mes-sage very far. In most cases, their messages are consumedby their followers but not considered important enough toforward to others.

    The most influential with few followers. We arealso able identify users with very low number of followersbut high influence. Table 4 shows the users with the mostinfluence who rank less that 100, 000thin number of follow-ers. We find that during the data collection period someof the users in this category ran very successful retweetingcontests where users who retweeted their URLs would havethe chance of winning a prize. Moreover, there is a group ofusers who post from twitdraw.com, a website where peoplecan make drawings and post them on Twitter. Even though

    these users dont have many followers, their drawings are ofvery high quality and spread throughout Twitter reachingmany people. Other interesting users such as local politi-cians and political cartoonists are also found in the list. TheIP-influence measure surfaces interesting content posted byusers who would otherwise be buried by popularity rankingssuch as number of followers.

    The most influential news sources. We are also ableto choose users belonging to a particular topic and identifythe most influential ones. We present the list for the top20 most influential news sources over the period from June

  • 8/12/2019 influence twiter.pdf

    8/9

    User name Category Rank by # followers Rank by IP-influencethatkevinsmith Screen Writer 33 1000nprpolitics Political News 41 525eonline TV Channel 42 1008marthastewart Television Host 43 1169nba Sports 64 1041davidgregory Journalist 106 3630nfl Sports 110 2244cbsnews News Channel 114 2278

    jdickerson Journalist 147 4408newsweek News Magazine 148 756

    Table 3: Users with many followers and low relative influence

    User name Category Rank by # followers Rank by IP-influencecashcycle Retweet Contest 153286 13mobiliens Retweet Contest 293455 70jadermattos Twitdraw 227934 134jaum Twitdraw 404385 143robmillerusmc Congressional Candidate 147803 145sitekulite Twitdraw 423917 149jesse sublett Musician 385265 151cyberaurora Tech News Website 446207 163viveraxo Twitdraw 458279 165fireflower Political Cartoons 452832 195

    Table 4: Users with very few followers but high relative influence

    15th through July 22nd 2010 in Table 5. We find that on-line news sources such as Mashable and The Onion havehigh influence among users. And this ranking, once againillustrates the fact that one does not need to have a largenumber of followers to be influential. The Big Picture, aphoto blog for the Boston Globe has only 23,666 followers,but it is very influential, possibly due to the high qualitypictures that are on display.

    8. DISCUSSIONInfluence as predictor of attention. As we demon-

    strated in 5, the IP-influence of the users is an accuratepredictor of the upper bound on the total number of clicksthey can get on the URLs they post. The input to the influ-ence algorithm is a weighted graph, where the arc weightsrepresent the influence of one user over another. This graphcan be derived from the user activity in many ways, evenin cases where explicit feedback in the form of retweets orlikes is not available (6).Topic-based and group-based influence. The Influence-Passivity algorithm can be run on a subpgraph of the full

    graph or on the subset of the user activity data. For ex-ample, if only users tweeting about a certain topic are partof the graph, the IP-influence determines the most influen-tial users in that topic. It is an open question whether theIP algorithm would be equally accurate at different graphscales.Content ranking. The predictive power of IP-influencecan be used for content filtering and ranking in order toreveal content that is most likely to receive attention basedon which users mentioned that content early on. Similarly,as in the case of users, this can be computed on a per-topic

    or per-user-group basis.Content filtering. We have observed from our passivityexperiments that highly passive users tend to be primarilyrobots or spammers. This leads to an interesting extensionof this work to perform content filtering, limiting the tweetsto influential users and thereby reducing spam in Twitterfeeds.Influence dynamics. We have computed the influencemeasures over a fixed 300-hour window. However, the So-cial Media are a rapidly changing, real-time communicationplatform. There are several implications of this. First, theIP algorithm would need to be modified to take into ac-count the tweet timestamps. Second, the IP-influence it-self changes over time, which brings a number of interestingquestions about the dynamics of influence and attention.In particular, whether users with spikes of IP-influence areoverall more influential than users who can sustain their IP-influence over time is an open question.

    9. CONCLUSIONGiven the mushrooming popularity of Social Media, vast

    efforts are devoted by individuals, governments and enter-prises to getting attention to their ideas, policies, products,and commentary through social networks. But the verylarge scale of the networks underlying Social Media makesit hard for any of these topics to get enough attention inorder to rise to the most trending ones. Given this con-straint, there has been a natural shift on the part of thecontent generators towards targeting those individuals thatare perceived as influential because of their large numberof followers. This study shows that the correlation betweenpopularity and influence is weaker than it might be expected.

  • 8/12/2019 influence twiter.pdf

    9/9

    User name Category Numb er of followers Rank by IP-influencemashable Pete Cashmore 2037840 59cnnbrk CNN Breaking News 3224475 71big picture The Big Picture 23666 92theonion The Onion 2289939 116time TIME.com 2111832 143breakingnews Breaking News 1795976 147bbcbreaking BBC Breaking News 509756 168espn ESPN 572577 187

    harvardbiz Harvard Business Rev 219039 227gizmodo Gizmodo 111025 237techcrunch TechCrunch 1402254 319wired Wired 547187 322wsj Wall Street Journal 366133 358smashingmag Smashing Magazine 224333 360pitchforkmedia Pitchfork 1494896 384rollingstone Rolling Stone 133999 436whitehouse The White House 1794544 448cnn CNN 1196719 473tweetmeme TweetMeme 52386 515peoplemag People magazine 2099081 565

    Table 5: Top 20 Influential News Media Sources

    This is a reflection of the fact that for information to prop-agate in a network, individuals need to forward it to theother members, thus having to actively engage rather thanpassively read it and rarely act on it. Moreover, since ourmeasure of influence is not specific to Twitter it is applicableto many other social networks. This opens the possibilityof discovering influential individuals within a network whichcan on average have a further reach than others in the samemedium, regardless of their popularity.

    10. REFERENCES

    [1] Jure Leskovec, Lada A. Adamic and Bernardo A.Huberman. The dynamics of viral marketing. InProceedings of the 7th ACM Conference on ElectronicCommerce, 2006.

    [2] Bernardo A. Huberman, Daniel M. Romero, and FangWu. Social networks that matter: Twitter under themicroscope. First Monday, 14(1), Jan 2009.

    [3] B. Jansen, M. Zhang, K. Sobel, and A. Chowdury.Twitter power: Tweets as electronic word of mouth.Journal of the American Society for InformationScience and Technology, 2009.

    [4] Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty,Zoran Despotovic, Wolfgang Kellerer Outtweeting theTwitterers - Predicting Information Cascades in

    Microblogs 3rd Workshop on Online Social Networks,WOSN, 2010)

    [5] Jianshu Weng and Ee-Peng Lim and Jing Jiang and QiHe. TwitterRank: finding topic-sensitive influentialtwitterers.WSDM, 2010.

    [6] Fang Wu, Bernardo A. Huberman, Lada Adamic andJosh Tyler. Information Flow in Social Groups.PhysicaA, Vol 337, 327-335, 2004.

    [7] Meeyoung Cha and Hamed Haddadi and FabricioBenevenuto and Krishna P. Gummadi. Measuring UserInfluence in Twitter: The Million Follower Fallacy. 4th

    International AAAI Conference on Weblogs and SocialMedia (ICWSM), 2010.

    [8] Nitin Agarwal and Huan Liu and Lei Tang and PhilipS. Yu. Identifying the Influential Bloggers in aCommunity. WSDM, 2008.

    [9] Sinan Aral, Lev Muchnik and Arun Sundararajan.Distinguishing influence-based contagion fromhomophily-driven diffusion in dynamic networks.Proceedings of the National Academy of Sciences, Vol.106 (51), pp. 21544-21549, 2009.

    [10] Duncan J. Watts and Peter Sheridan Dodds.Influentials, Networks, and Public Opinion Formation.

    Journal of Consumer Research, Vol. 34 (4), pp.441-458, 2007.

    [11] Amit Goyal, Francesco Bonchi and Laks V.S.Lakshmanan. Learning Influence Probabilities In SocialNetworks.WSDM, 2010.

    [12] P. Domingos and M. Richardson. Mining the networkvalue of customers. SIGKDD, 2001.

    [13] S. Brin and L. Page. The anatomy of a large-scalehypertextual Web search engine.Computer Networksand ISDN Systems, Vol 30, 1-7, 1998.

    [14] Jon Kleinberg. Authoritative sources in a hyperlinkedenvironment. Journal of the ACM46 (5): 604 -632,1999.

    [15] Boyd Danah, Scott Golder, and Gilad Lotan. Tweet,

    Tweet, Retweet: Conversational Aspects of Retweetingon Twitter. HICSS-43. IEEE 2010.

    [16] Jorge E. Hirsch. An index to quantify an individualsscientific research output. Proceedings of the NationalAcademy of Sciences 102(46): 16569 -16572, 2005.