
Graph based User Interest Modeling in Twitter
CS 224W: Final Project, Group 41

Aparna [email protected]

Raghav [email protected]

1 Motivation

The interest graph is a comparatively recent phenomenon in social media, building on the lines of the Knowledge Graph1 and the Social Graph2. Simply put, an interest graph is a representation of the relationship between people and the things that they are interested in. Companies like Netflix and Facebook are already leveraging interest graphs to power their recommendations. Additionally, these interest graphs fuel interest-based behavioral profiling for advertisement targeting and audience analytics. One social network that inherently presents itself as an interest graph is Twitter. In this project, we explore a graph based approach to user-interest modeling in Twitter.

2 Related Work

In the light of the current push towards the development of an interest graph, identification and characterization of user interest in social media is very relevant. Most works focusing on identifying user interest on Twitter use text analysis techniques like Bag of Words, TF-IDF or Latent Dirichlet Allocation [Michelson and Macskassy (2010)]. This paper proposes a method to discover Twitter users' topics of interest based on the entities in their tweets. Another common method used to extract user interests is to use a knowledge corpus, as in [Lim and Datta (2013)], where the authors infer the user's topics of interest from the domains of the celebrities followed by the user. Taking a cue from these works, we hypothesize that, since Twitter as a social medium is primarily based on who we follow, the properties of the network structure of a user are potentially indicative of his/her interest profile.

1 www.google.com/insidesearch/features/search/knowledge.html
2 www.facebook.com/about/graphsearch

3 Problem Definition

The high-level goal of the project is to study the following two questions:

1. How well can we gauge the interest profile of a user based on his network structure?

2. How predictive are different graph based features in identifying the interests of a user?

The intuition behind this is the following: a user interested in baseball will follow his favorite team, say @SFGiants, while one interested in technology will follow @TechCrunch. Thus, the network of a user represents his/her interests. Additionally, the more a user (say, a huge fan of Star Trek) is interested in a topic, the higher will be his interaction with @StarTrekMovie.

Thus, the primary goal of this project is to model the interest profile of a Twitter user using the network structure of the Twitter follower-friend graph. Concretely, we explore various graph features to capture the level of influence exerted by the followers and friends, and use this to predict the interests of the user. Additionally, we use a machine learning approach to build a supervised model to predict the interest profile of a user.

4 Data

4.1 Data Collection

We obtained Twitter social graph data from the internet3. The dataset contains 41.7 million nodes and 1.47 billion edges. A dataset of tweets was obtained from the SNAP datasets. This data contains 476 million tweets, including about 50 million hashtags and 180 million URL entries4. Since the social graph dataset references users by a numeric user-id and the tweets dataset does so through the user's screen name, a user-id to screen name map obtained from 3 was used to link the two datasets.

3 http://an.kaist.ac.kr/traces/WWW2010.html
4 http://snap.stanford.edu/data/bigdata/twitter7/


4.2 Data pre-processing

Since we use data from different sources, a considerable amount of pre-processing was required to get the data into a usable form. We use MongoDB to store all data required for the project. We start with the tweets dataset and use the user-id : name map to insert the data into a db collection with the following schema: {user id, name, [list of tweets]}. From this, we subset users who have a minimum of 50 tweets. Then, we use the Twitter graph data to obtain the list of followers and followees for these users and form our dataset.
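The following is a minimal sketch of this pre-processing step, assuming pymongo; the database, collection and field names are illustrative rather than the project's actual schema.

```python
# Sketch only: pymongo-based pre-processing with illustrative names.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["twitter_interest"]     # hypothetical database name
users = db["users"]                 # documents: {user_id, name, tweets: [...]}

def insert_user(user_id, name, tweets):
    """Insert one user document built from the user-id : name map and the tweets dataset."""
    users.insert_one({"user_id": user_id, "name": name, "tweets": tweets})

def users_with_min_tweets(min_tweets=50):
    """Ids of users having at least `min_tweets` tweets
    (the 'tweets.49 exists' query checks the array has >= 50 elements)."""
    query = {"tweets.%d" % (min_tweets - 1): {"$exists": True}}
    return [doc["user_id"] for doc in users.find(query, {"user_id": 1})]
```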

5 Mathematical Model

In this section, we describe the mathematical formulations of our project and define the terms that are used in the rest of the report.

5.1 Identifying relevant subgraphs

As discussed in the data description, the Twitter graph we deal with contains 41.7 million nodes and 1.47 billion edges. Such a large graph clearly cannot be dealt with in its entirety due to memory and processing limits. The natural alternative is to consider only neighborhoods that relate to the node in focus. In fact, this fits our bill perfectly, as only the nodes that are closest to the node in focus can possibly influence its interest profile. Let Ga denote the subgraph for a node a, containing all nodes and edges that influence a. We used two methods of arriving at the subgraph, denoted G1 and G2:

• G1(a) = {Φ(a) ∪Ψ(a)}

• G2(a) = {G1(a) ∪Ψ(G1(a)) ∪ Φ(G1(a))}

where

• Φ(a) denotes the set of followers of a

• Ψ(a) denotes the set of friends of a (those whom a follows).

The subgraph G1 is useful in quantifying the effects of a's immediate neighbors on its interest profile, while G2 is useful in accounting for the cascading effect. For example, the influence the nodes in G1 can have on a can be modeled as a function of their own neighbors using the subgraph G2. These methods of subgraph construction capture our requirements well and hence, other methods were not explored.
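As an illustration, the two subgraph definitions can be realized as set operations over follower/friend adjacency maps; the sketch below assumes `followers[x]` and `friends[x]` are sets of user-ids and is not the project's code.

```python
# Sketch only: G1(a) and G2(a) built from illustrative adjacency maps.
def g1(a, followers, friends):
    """G1(a) = Phi(a) union Psi(a): the followers and friends of a."""
    return followers.get(a, set()) | friends.get(a, set())

def g2(a, followers, friends):
    """G2(a) = G1(a) plus the followers and friends of every node in G1(a)."""
    base = g1(a, followers, friends)
    nodes = set(base)
    for b in base:
        nodes |= followers.get(b, set())
        nodes |= friends.get(b, set())
    return nodes
```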

5.2 Interest Profile

The interest profile of a user, I(a), is represented as a probability distribution of his interest over the set of categories. Let K be the total number of categories; then I(a) is a K-dimensional vector, [pk], where pk denotes the probability that the user is interested in category k relative to the other categories. Here, k = 1, 2, ..., K.

5.3 Interest Influence Model Framework

We hypothesize that the interest profile of a user can be derived from the interest profiles of his followers and from those of his friends, using appropriate weighting schemes. In particular, we propose two models for obtaining a user's interest profile:

$$I^{\Phi}_{w}(a) = \frac{1}{\sum_{b \in \Phi(a)} w(a,b)} \sum_{b \in \Phi(a)} w(a,b)\, I(b)$$

$$I^{\Psi}_{w}(a) = \frac{1}{\sum_{b \in \Psi(a)} w(a,b)} \sum_{b \in \Psi(a)} w(a,b)\, I(b)$$

where,

• I^Φ_w(a) denotes the predicted interest profile for user a based on Φ(a) (the set of followers of a) using the weighting scheme w,

• I^Ψ_w(a) denotes the predicted interest profile for user a based on Ψ(a) (the set of friends of a) using the weighting scheme w, and

• w(a, b) is the weighting factor that captures the influence node b exerts on node a.

This represents the general framework, wherein we use different weighting schemes and obtain different predictions for the interest profile of a user. The various weighting schemes implemented are discussed in Section 6.2.
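A minimal sketch of this weighted-average prediction is given below, assuming numpy; `interest[b]` and the `weight` callable are illustrative placeholders for I(b) and w(a, b).

```python
# Sketch only: weighted average of neighbor interest profiles.
import numpy as np

def predict_interest(a, neighbors, interest, weight, K=5):
    """neighbors  : Phi(a) or Psi(a), an iterable of user-ids
       interest[b]: length-K interest profile I(b) of user b
       weight(a,b): the chosen weighting scheme w(a, b)"""
    total = np.zeros(K)
    norm = 0.0
    for b in neighbors:
        w = weight(a, b)
        total += w * np.asarray(interest[b], dtype=float)
        norm += w
    # fall back to a uniform profile if no neighbor carries any weight
    return total / norm if norm > 0 else np.full(K, 1.0 / K)
```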

5.4 Composite Feature Model

The weighting schemes w(a, b) are defined (as explained in 6.2) to capture a particular aspect of the relationship between a and b. Thus, a number of interest profile predictions can be obtained by using different weighting schemes. Since each of these predictions uses a single aspect of the relationship, we propose a composite feature model that takes into account all aspects by combining the different predictions in an optimal proportion.

To design such a composite feature model, we require a way of effectively combining the results from the different weighting schemes. All weighting schemes are not expected to (and, from our results, do not) reflect the interest profile equally well. Moreover, different weighting schemes work well for different categories. Thus, we assign a category-wise quality score to each weighting scheme. This quality score is based on how well the weighting scheme identifies the true interest level of the user for that category.

To obtain these quality scores, we formulate a category-wise regression problem. For this regression problem, the independent variables (features / x values) are the probability predictions obtained under the different weighting schemes for that category, and the dependent variable (predicted variable / y value) is the true interest probability for the user for that category. Solving this regression (curve fitting) problem gives the coefficients for the different features, which represent the quality score of each weighting scheme.

Using these quality scores, we obtain the interest prediction from the composite model as follows:

$$I^{c}(a) = [\, i^{c}_{k}(a) \,] \qquad (1)$$

where $i^{c}_{k}(a)$ is the composite predicted interest level for category k, given by

$$i^{c}_{k}(a) = \frac{1}{\sum_{w \in W} Q_{kw}} \sum_{w \in W} Q_{kw}\, i_{kw}$$

where

• W - set of all weighting schemes

• Qkw is the quality score for weighting scheme w for category k

• ikw is the interest level for category k predicted using the weighting scheme w

Ic(a) is normalized to obtain a probability distribution.
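A sketch of this combination step is shown below (numpy assumed); `predictions[w]` and `quality[w]` are illustrative stand-ins for the per-scheme profiles and the quality scores Qkw.

```python
# Sketch only: category-wise quality-score-weighted combination, then normalization.
import numpy as np

def composite_profile(predictions, quality):
    """predictions[w]: length-K profile predicted under weighting scheme w
       quality[w]    : length-K category-wise quality scores Q_kw"""
    K = len(next(iter(predictions.values())))
    num = np.zeros(K)
    den = np.zeros(K)
    for w, profile in predictions.items():
        q = np.asarray(quality[w], dtype=float)
        num += q * np.asarray(profile, dtype=float)
        den += q
    ic = num / den        # i^c_k(a) for each category k
    return ic / ic.sum()  # normalize to a probability distribution
```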

5.5 Machine Learning Model

In addition to the graph based interest influence approach described above, we use a machine learning based predictive model to predict the interests of a user using graph features. We use a multi-label, multi-class supervised learning model, where each user is represented by a vector of features obtained from his graph and the variable to predict is the list of all interest categories for the user.

6 Experimental Details

6.1 Obtaining interest profiles

The primary entity of the study is the interest profile of a user. Since the focus of our project is to conduct graph based interest modeling, we adopt a simple approach to obtain the interest profile and do not explore other options. We use a set of 5 categories - Sports, Entertainment, Food, Health and Technology - and map a user's tweets to a probability distribution over these topics. We use a word list for each topic and increment the count for a topic if a word of that topic appears in the user's tweet. The distribution thus obtained is normalized to get the probability distribution over categories for that user, which represents the interest profile I(a) for user a.

For example, the tweet "I love the new iPhone" contains the word "iPhone", which is part of the word list for the topic "Technology", and hence will output the distribution [0, 0, 0, 0, 1], with Technology being the fifth category. We obtain these topic-wise word lists by merging the lists from the following sources5 6.
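The word-list matching could look like the sketch below; the word lists shown are illustrative fragments, not the merged lists from the cited sources.

```python
# Sketch only: word-list matching over a user's tweets (word lists are illustrative fragments).
import numpy as np

CATEGORIES = ["Sports", "Entertainment", "Food", "Health", "Technology"]
WORDLISTS = {
    "Sports": {"baseball", "soccer", "inning"},
    "Technology": {"iphone", "android", "startup"},
    # remaining categories omitted for brevity
}

def interest_profile(tweets):
    """Increment a topic's count whenever one of its words appears in a tweet,
    then normalize to a probability distribution over the categories."""
    counts = np.zeros(len(CATEGORIES))
    for tweet in tweets:
        words = set(tweet.lower().split())
        for k, cat in enumerate(CATEGORIES):
            if words & WORDLISTS.get(cat, set()):
                counts[k] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts
```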

6.2 Feature Engineering

As described in 5.3, the predicted interest profile of a user is obtained using a weighted average of the interest profiles of the user's followers/friends. Also, as previously described, we implement a variety of weighting schemes to incorporate different aspects (features) of the relationship between the user and his network.

The features used are listed below. For each feature, we give the motivation for using it and the weighting scheme w used.

6.2.1 Baseline Weighting

All weights w(a, b) are defined to be 1. Thus, the baseline method takes a simple average of all the interest profiles.

6.2.2 Retweet Weighting

A user highly interested in (say) Technology would not only follow a user (say @TechCrunch) but is also likely to interact more with that user. Interactions in Twitter can be captured in terms of ReTweets and Mentions. Thus, we use a Retweet weighting scheme where

w(a, b) = number of times user a has retweeted a tweet of user b.

5 http://www.ourcommunity.com.au/tech/tech article.jsp?articleId=74
6 http://www.enchantedlearning.com/wordlist/


Since this weighting scheme can potentially have zeros for a lot of users, we use a smoothened measure:

w(a, b) = 1 + number of times user a has retweeted a tweet of user b.

6.2.3 Mentions Weighting

Similar to the ReTweet weighting scheme, we use mentions to obtain Mentions Weighting, where

w(a, b) = number of times user a has mentioned b in his tweets.

The smoothened measure is

w(a, b) = 1 + number of times user a has mentioned b in his tweets.

6.2.4 Mutual Followers Weighting

Another edge based feature is the number of nodes the two nodes are mutually connected to. Since the Twitter graph is directed, we use two such measures, based on in-edges and out-edges.

w(a, b) = |Φ(a) ∩ Φ(b)|

The smoothened measure is w(a, b) = 1 + |Φ(a) ∩ Φ(b)|.

6.2.5 Mutual Friends Weighting

This is similar to mutual followers weighting:

w(a, b) = |Ψ(a) ∩ Ψ(b)|

The smoothened measure is w(a, b) = 1 + |Ψ(a) ∩ Ψ(b)|.

All features described above are edge-based measures. Next, we list the node based features we develop. The intuition behind these features is that different nodes have different influencing capabilities and thus, the properties of the relevant nodes for a user influence the interest profile of the user. We implement three such features, as given below.

6.2.6 Follower bias

This measures the importance or popularity of the person:

w(a, b) = |Φ(b)|

6.2.7 Friend bias

This measures the relevance of the person:

w(a, b) = |Ψ(b)|

6.2.8 Tweets bias

This measures the level of activity of the person:

w(a, b) = Number of tweets of user b

We use all the weighting schemes described above to obtain both I^Φ_w(a) and I^Ψ_w(a).
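For illustration, the weighting schemes above can be expressed as simple look-up functions; the data structures (`retweets`, `mentions`, `followers`, `friends`, `tweet_counts`) are assumptions, not the project's actual code.

```python
# Sketch only: the weighting schemes as look-up functions over illustrative data structures.
def w_baseline(a, b):
    return 1.0

def w_retweet(a, b, retweets):
    """Smoothed retweet weighting: 1 + #times a retweeted b."""
    return 1.0 + retweets.get((a, b), 0)

def w_mentions(a, b, mentions):
    """Smoothed mentions weighting: 1 + #times a mentioned b."""
    return 1.0 + mentions.get((a, b), 0)

def w_mutual_followers(a, b, followers):
    """Smoothed mutual followers weighting: 1 + |Phi(a) ∩ Phi(b)|."""
    return 1.0 + len(followers.get(a, set()) & followers.get(b, set()))

def w_mutual_friends(a, b, friends):
    """Smoothed mutual friends weighting: 1 + |Psi(a) ∩ Psi(b)|."""
    return 1.0 + len(friends.get(a, set()) & friends.get(b, set()))

def w_follower_bias(a, b, followers):
    """Node-based feature: popularity of b, |Phi(b)|."""
    return float(len(followers.get(b, set())))

def w_friend_bias(a, b, friends):
    """Node-based feature: relevance of b, |Psi(b)|."""
    return float(len(friends.get(b, set())))

def w_tweets_bias(a, b, tweet_counts):
    """Node-based feature: activity level of b (number of tweets)."""
    return float(tweet_counts.get(b, 0))
```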

6.3 Deriving topics of interest from interest profile

Once we obtain the interest profile using these methods, we derive the topics of interest in two ways (both are sketched in code after the list):

• Threshold: Select all topics whose probability (component in the interest profile) is more than a threshold value. We used a threshold of 1/K, where K is the number of categories: Topics = {k | Ik(a) > 1/K}.

• ChooseTopM: Select the M topics with the highest probability values. In our experiments, we used a value of M = 2, since we deal with only 5 categories.
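The two selection rules can be sketched as follows (numpy assumed; names are illustrative).

```python
# Sketch only: the two topic-selection rules.
import numpy as np

def topics_threshold(profile, K=5):
    """Threshold: all categories with probability above 1/K."""
    return [k for k, p in enumerate(profile) if p > 1.0 / K]

def topics_top_m(profile, M=2):
    """ChooseTopM: the M categories with the highest probability."""
    return list(np.argsort(profile)[::-1][:M])
```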

6.4 Error metrics

To quantify the performance of our models, we use three metrics - Precision, Recall and F1 score - computed between the true labels and the predicted labels.
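For a single user, these set-based metrics can be computed as in the sketch below (an illustrative helper, not the project's evaluation code).

```python
# Sketch only: precision, recall and F1 between true and predicted topic sets for one user.
def precision_recall_f1(true_topics, predicted_topics):
    true_set, pred_set = set(true_topics), set(predicted_topics)
    overlap = len(true_set & pred_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1
```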

6.5 Composite Feature Model

As described in 5.4, we use a regression model to obtain the category-wise quality scores for each weighting scheme. In particular, we used ridge regression, where the loss function is the linear least squares function and the regularization is given by the L2-norm. We identified the regularization coefficient using leave-one-out cross validation. The regression model was then used to arrive at the quality scores. These quality scores were then used in conjunction with the interest profiles obtained from the different weighting schemes to arrive at the predicted interest profile under the composite model.
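A sketch of this step using scikit-learn is shown below; with its default `cv=None`, `RidgeCV` selects the regularization coefficient via efficient leave-one-out cross validation. The alpha grid and the shapes of X and y are assumptions.

```python
# Sketch only: per-category ridge regression to obtain quality scores Q_kw.
from sklearn.linear_model import RidgeCV

def quality_scores_for_category(X, y, alphas=(0.1, 1.0, 10.0)):
    """X: (n_users, n_schemes) predicted probabilities for one category, one column per scheme.
       y: (n_users,) true interest probability for that category.
       Returns one coefficient (quality score) per weighting scheme."""
    model = RidgeCV(alphas=alphas)   # cv=None -> efficient leave-one-out CV over the alpha grid
    model.fit(X, y)
    return model.coef_
```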

6.6 Machine Learning Model

For the machine learning based prediction task explained in 5.5, we used a Random Forest based classifier and validated its performance using leave-one-out cross validation. The error metric used is the F1 score. Additionally, we used a feature selection algorithm based on the feature importances calculated by the Random Forest classifier to identify the optimal subset of features. The optimal set of features was then used for the prediction task.
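A sketch of this pipeline using scikit-learn is given below; the feature matrix X, the multi-label indicator matrix Y, and the hyperparameters are illustrative.

```python
# Sketch only: Random Forest on graph features with importance-based feature selection.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

def fit_interest_classifier(X, Y):
    """X: (n_users, n_graph_features) graph-based features.
       Y: (n_users, K) multi-label 0/1 indicators of interest categories."""
    base = RandomForestClassifier(n_estimators=100, random_state=0)
    selector = SelectFromModel(base).fit(X, Y)   # keep features above mean importance
    X_sel = selector.transform(X)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_sel, Y)
    return selector, model
```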


7 Results and Analysis

Figures 1, 2 and 3 show the results (Precision, Recall and F1) obtained from the interest influence model for the different feature weighting schemes. All results presented here are obtained by averaging the results over 100 users. We observe that, in general, the values obtained using the ChooseTopM method are higher than those obtained using the Threshold method. This is attributed to the fact that the ChooseTopM method is agnostic to the probability values and simply chooses the topics corresponding to the top M values. Since we use only a small number of categories in our analysis, it is quite easy for the model to perform better under the ChooseTopM method. Contrast this with the Threshold method, which chooses only values that are higher than a threshold. Hence, this method is a better indicator of the actual performance of our approaches, given that we use a simple tweet-based approach to identify the interest profiles. Going forward, we use only the Threshold method for analysis.

In the Threshold method, we observe that the follower network gives slightly higher performance than the friend network in terms of precision, recall and F1. While all weighting schemes give comparable performance, only a few weighting schemes give clearly better results. This is partially due to the fact that we use limited data and a basic approach to topic identification (in terms of text matching), and primarily due to the fact that each feature captures only one aspect of the interest influence.

This leads us to the composite model, where we combine the features as explained previously. The results obtained using the composite model for both the friend network and the follower network are shown in Table 1. The composite model gives superior performance compared to the base model in terms of F1 score for both the friend and the follower network.

Table 2 shows the results obtained using the machine learning approach. The F1 score obtained here is significantly higher than that of the interest influence model and the composite model. This indicates that a supervised model that learns from the graph based features leverages the training effectively and gives higher performance. Thus, using the graph features in conjunction with a machine learning approach is attractive. Moreover, it is important to note that the dataset used consisted of only 100 users; a larger dataset would be expected to further improve the performance.

8 Discussion

The results obtained using the graph based approaches were encouragingly positive. We dug deeper into the results to understand how the performance varies across users. In particular, we wanted to identify whether some graph property of the user was indicative of the performance. Our analysis suggests that the clustering coefficient of a user is indeed representative of how well the graph based model predicts his/her interest profile. Users with a high clustering coefficient obtained high F1 scores, while those with lower clustering coefficient values got low F1 scores. Some representative values are shown in Table 3.

9 Conclusion and Future Work

In this project, we have explored various graph features for the task of predicting the interest profile of a user based on his/her friends and followers. The techniques devised are promising and can be extended to improve the performance. We focused only on the network based features in this study and hence used simple procedures for the other aspects. For example, we have used the tweets of the followers/friends of the user to extract the interest profiles of the followers/friends. We could extend this to include the biographies of the followers/friends obtained from their Twitter user profiles. We have used a naive method to extract the probability distribution of a user's tweets over the different categories, where we directly compare the words in the user's tweets to the wordlists for each category. This can be improved to check for words within an edit-distance of 1, to take into account common spelling errors. This becomes particularly important in Twitter, where the language used is informal and error-prone. Moreover, we use a set of 5 pre-defined categories; this can be increased to include more categories.

References

Kwan Hui Lim and Amitava Datta. 2013. Interest classification of Twitter users using Wikipedia. In Proceedings of the 9th International Symposium on Open Collaboration, WikiSym '13, pages 22:1–22:2, New York, NY, USA. ACM.


Figure 1: Interest Influence Model: Precision

Figure 2: Interest Influence Model: Recall

Figure 3: Interest Influence Model: F1

                               Precision   Recall    F1
Baseline - Friend network      0.5325      0.6128    0.5439
Composite - Friend network     0.5966      0.7174    0.6174
Baseline - Follower network    0.5528      0.6992    0.5882
Composite - Follower network   0.6498      0.6908    0.6300

Table 1: Composite Model: Results

Precision   Recall    F1
0.8079      0.7232    0.7277

Table 2: Classifier Model: Results


Clustering coefficient    F1
0.024                     0
0.08                      0.5
0.41                      0.5
1.56                      0.6667

Table 3: Clustering coefficient values and F1 scores for sample users

Matthew Michelson and Sofus A. Macskassy. 2010. Discovering users' topics of interest on Twitter: A first look. In Proceedings of the Fourth Workshop on Analytics for Noisy Unstructured Text Data, AND '10, pages 73–80, New York, NY, USA. ACM.