Human Mobility (with mobile devices)

Bruno Gonçalves

www.bgoncalves.com @bgoncalves


Mobility



GPS-enabled Smartphones




Social Media


Coding


Anatomy of a Tweet https://github.com/bmtgoncalves/Mining-Georeferenced-Data


Anatomy of a Tweet

[u'contributors', u'truncated', u'text', u'in_reply_to_status_id', u'id', u'favorite_count', u'source', u'retweeted', u'coordinates', u'entities', u'in_reply_to_screen_name', u'in_reply_to_user_id', u'retweet_count', u'id_str', u'favorited', u'user', u'geo', u'in_reply_to_user_id_str', u'possibly_sensitive', u'lang', u'created_at', u'in_reply_to_status_id_str', u'place', u'metadata']



Anatomy of a Tweet

Keys of the u'user' object:

[u'follow_request_sent', u'profile_use_background_image', u'default_profile_image', u'id', u'profile_background_image_url_https', u'verified', u'profile_text_color', u'profile_image_url_https', u'profile_sidebar_fill_color', u'entities', u'followers_count', u'profile_sidebar_border_color', u'id_str', u'profile_background_color', u'listed_count', u'is_translation_enabled', u'utc_offset', u'statuses_count', u'description', u'friends_count', u'location', u'profile_link_color', u'profile_image_url', u'following', u'geo_enabled', u'profile_banner_url', u'profile_background_image_url', u'screen_name', u'lang', u'profile_background_tile', u'favourites_count', u'name', u'notifications', u'url', u'created_at', u'contributors_enabled', u'time_zone', u'protected', u'default_profile', u'is_translator']



Anatomy of a Tweet

Keys of u'coordinates' (u'geo' has the same keys):

[u'type', u'coordinates']

Keys of u'entities':

[u'symbols', u'user_mentions', u'hashtags', u'urls']

u'source':

u'<a href="http://foursquare.com" rel="nofollow"> foursquare</a>'

u'text':

u"I'm at Terminal Rodovi\xe1rio de Feira de Santana (Feira de Santana, BA) http://t.co/WirvdHwYMq"

Entry of u'entities'[u'urls'] for the link in the text:

{u'display_url': u'4sq.com/1k5MeYF', u'expanded_url': u'http://4sq.com/1k5MeYF', u'indices': [70, 92], u'url': u'http://t.co/WirvdHwYMq'}
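A minimal sketch of how these fields can be read once a tweet has been parsed into a Python dict. The text, source, and URL entity are the values shown on the slide; the coordinates are a made-up placeholder, and summarize_tweet is a hypothetical helper name, not part of any library.

```python
def summarize_tweet(tweet):
    """Pull the text, posting app, expanded links and GPS point out of a tweet dict."""
    coords = tweet.get("coordinates")  # None when the tweet carries no exact point
    lon, lat = coords["coordinates"] if coords else (None, None)  # GeoJSON order: lon, lat
    return {
        "text": tweet["text"],
        "source": tweet["source"],
        "urls": [u["expanded_url"] for u in tweet["entities"]["urls"]],
        "lon": lon,
        "lat": lat,
    }


tweet = {
    "text": "I'm at Terminal Rodovi\xe1rio de Feira de Santana "
            "(Feira de Santana, BA) http://t.co/WirvdHwYMq",
    "source": '<a href="http://foursquare.com" rel="nofollow"> foursquare</a>',
    "entities": {
        "symbols": [],
        "user_mentions": [],
        "hashtags": [],
        "urls": [{
            "display_url": "4sq.com/1k5MeYF",
            "expanded_url": "http://4sq.com/1k5MeYF",
            "indices": [70, 92],
            "url": "http://t.co/WirvdHwYMq",
        }],
    },
    # Hypothetical [lon, lat] placeholder, not taken from the slide
    "coordinates": {"type": "Point", "coordinates": [-38.96, -12.26]},
}

summary = summarize_tweet(tweet)

# The 'indices' pair marks where the short link sits inside the text
start, end = tweet["entities"]["urls"][0]["indices"]
short_link = tweet["text"][start:end]
```

Note that u'coordinates' stores the point as [longitude, latitude], while the older u'geo' field uses [latitude, longitude].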


API Basics https://dev.twitter.com/docs

• The twitter module provides the OAuth interface. We just need to supply the right credentials.

• Best to keep the credentials in a dict and parametrize our calls with the dict key. This way we can switch between different accounts easily.

• twitter.Twitter(auth=auth) takes an OAuth instance as argument and returns a Twitter object that we can use to interact with the API

• Twitter methods mimic the API structure

• Four basic types of objects:

  • Tweets

  • Users

  • Entities

  • Places
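The credentials dict described above could look like the sketch below. The layout of twitter_accounts.py is not shown in the slides, so the key names are simply the ones the later scripts read, the values are placeholders, and oauth_args and make_api are hypothetical helper names.

```python
# Hypothetical contents of twitter_accounts.py: one entry per registered app.
accounts = {
    "social": {
        "api_key": "YOUR_API_KEY",
        "api_secret": "YOUR_API_SECRET",
        "token": "YOUR_TOKEN",
        "token_secret": "YOUR_TOKEN_SECRET",
    },
}


def oauth_args(name, accounts=accounts):
    """Return the credentials in the order twitter.oauth.OAuth takes them."""
    app = accounts[name]
    return (app["token"], app["token_secret"], app["api_key"], app["api_secret"])


def make_api(name):
    """Build a Twitter API object for the named account."""
    import twitter  # imported lazily so the helpers above work without the package
    return twitter.Twitter(auth=twitter.oauth.OAuth(*oauth_args(name)))
```

Switching accounts then only means changing the key passed to make_api.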



User Timeline https://dev.twitter.com/docs/api/1.1/get/statuses/user_timeline

• .statuses.user_timeline() returns a set of tweets posted by a single user

• Important options:

  • include_rts='true' to include retweets by this user

  • count=200 number of tweets to return in each call

  • trim_user='true' to leave out the user information (saves bandwidth and processing time)

  • max_id=1234 to include only tweets with an id lower than or equal to 1234

• Returns at most 200 tweets in each call. We can get all of a user's tweets (up to 3200) with multiple calls using max_id


User Timeline https://dev.twitter.com/docs/api/1.1/get/statuses/user_timeline

import twitter
from twitter_accounts import accounts

app = accounts["social"]

auth = twitter.oauth.OAuth(app["token"],
                           app["token_secret"],
                           app["api_key"],
                           app["api_secret"])

twitter_api = twitter.Twitter(auth=auth)
screen_name = "bgoncalves"

args = {"count": 200, "trim_user": "true", "include_rts": "true"}

# First page: the most recent tweets (at most 200)
tweets = twitter_api.statuses.user_timeline(screen_name=screen_name, **args)
tweets_new = tweets

# Page backwards until an empty page signals the end of the timeline
while len(tweets_new) > 0:
    max_id = tweets[-1]["id"] - 1  # strictly older than the oldest tweet so far
    tweets_new = twitter_api.statuses.user_timeline(screen_name=screen_name,
                                                    max_id=max_id, **args)
    tweets += tweets_new

print("Found %d tweets" % len(tweets))

timeline_twitter.py


Streaming Geocoded data

• The Streaming API provides real-time data, subject to filters

• Use a TwitterStream instead of a Twitter object (twitter.TwitterStream(auth=twitter_api.auth))

• .statuses.filter(track=q) returns tweets that match the query q in real time

• Returns a generator that you can iterate over

• .statuses.filter(locations=bb) returns tweets that occur within the bounding box bb in real time

• bb is a comma-separated string of two lon,lat points: the southwest corner followed by the northeast corner

  • -180,-90,180,90 - World

  • -74,40,-73,41 - NYC
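Because the order of the four numbers is easy to get wrong, a small helper can build and check the bounding-box string. This is only a sketch: bounding_box and in_box are hypothetical names, and boxes that cross the 180° meridian are not handled.

```python
def bounding_box(west, south, east, north):
    """Format a box as the 'lon,lat,lon,lat' string passed as locations=."""
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitude out of range")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitude out of range, or south is above north")
    return ",".join("%g" % value for value in (west, south, east, north))


def in_box(lon, lat, box):
    """Check whether a point falls inside a box string (assumes west <= east)."""
    west, south, east, north = (float(value) for value in box.split(","))
    return west <= lon <= east and south <= lat <= north


NYC = bounding_box(-74, 40, -73, 41)
WORLD = bounding_box(-180, -90, 180, 90)
```

in_box is handy for re-checking the exact GPS point of each streamed tweet on the client side.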



Streaming Geocoded data

import gzip
import json

import twitter
from twitter_accounts import accounts

app = accounts["social"]

auth = twitter.oauth.OAuth(app["token"],
                           app["token_secret"],
                           app["api_key"],
                           app["api_secret"])

stream_api = twitter.TwitterStream(auth=auth)

query = "-74,40,-73,41"  # NYC

stream_results = stream_api.statuses.filter(locations=query)

tweet_count = 0
fp = gzip.open("NYC.json.gz", "at")  # append in text mode (Python 3)

for tweet in stream_results:
    try:
        print(tweet_count + 1, tweet["id"])
        fp.write(json.dumps(tweet) + "\n")  # one JSON object per line
        tweet_count += 1
    except KeyError:
        # Not a tweet (for example a delete or limit notice): skip it
        pass

location_twitter.py
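Once some data has been collected, the GPS points can be read back for mapping. The sketch below assumes the file holds one JSON-encoded tweet per line; read_points is a hypothetical helper name.

```python
import gzip
import json


def read_points(path):
    """Yield (tweet id, lon, lat) for every geotagged tweet in a .json.gz file."""
    with gzip.open(path, "rt") as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue
            tweet = json.loads(line)
            coords = tweet.get("coordinates")
            if coords:  # empty unless the tweet carries an exact GPS point
                lon, lat = coords["coordinates"]
                yield tweet["id"], lon, lat


# Usage, assuming the collection script has already produced the file:
# for tweet_id, lon, lat in read_points("NYC.json.gz"):
#     print(tweet_id, lon, lat)
```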



GPS Coordinates


World Population


Biases


Market Penetration PLoS One 8, E61981 (2013)


Age Distribution PLoS One 10, e0115545 (2015)


Demographics ICWSM’11, 375 (2011)


(a) Normal representation (b) Area cartogram representation

Figure 2: Per-county over- and underrepresentation of U.S. population in Twitter, relative to the median per-county representation rate of 0.324%, presented in both (a) a normal layout and (b) an area cartogram based on the 2000 Census population. Blue colors indicate underrepresentation, while red colors represent overrepresentation. The intensity of the color corresponds to the log of the over- or underrepresentation rate. Clear trends are visible, such as the underrepresentation of mid-west and overrepresentation of populous counties.

less than 95% predictive (e.g., the name Avery was observed to correspond to male babies only 56.8% of the time; it was therefore removed). The result is a list of 5,836 names that we use to infer gender.

Limitations

Clearly, this approach to detecting gender is subject to a number of potential limitations. First, users may misrepresent their name, leading to an incorrect gender inference. Second, there may be differences in choosing to reveal one’s name between genders, leading us to believe that fewer users of one gender are present. Third, the name lists above may cover different fractions of the male and female populations.

Gender of Twitter users

We first determine the number of the 3,279,425 U.S.-based users who we could infer a gender for, based on their name and the list previously described. We do so by comparing the first word of their self-reported name to the gender list. We observe that there exists a match for 64.2% of the users. Moreover, we find a strong bias towards male users: Fully 71.8% of the users who we find a name match for had a male name.

[Plot: fraction of joining users who are male (y-axis, 0 to 1) against join date (x-axis, 2007-01 to 2009-07)]

Figure 3: Gender of joining users over time, binned into groups of 10,000 joining users (note that the join rate increases substantially). The bias towards male users is observed to be decreasing over time.

To further explore this trend, we examine the historic gender bias. To do so, we use the join date of each user (available in the user’s profile). Figure 3 plots the average fraction of joining users who are male over time. From this plot, it is clear that while the male gender bias was significantly stronger among the early Twitter adopters, the bias is becoming reduced over time.

Race/ethnicity

Detecting race/ethnicity using last names

Again, since we have very limited information available on each Twitter user, we resort to inferring race/ethnicity using self-reported last name. We examine the last name of users, and correlate the last name with data from the U.S. 2000 Census (U.S. Census 2000). In more detail, for each last name with over 100 individuals in the U.S. during the 2000 Census, the Census releases the distribution of race/ethnicity for that last name. For example, the last name “Myers” was observed to correspond to Caucasians 86% of the time, African-Americans 9.7%, Asians 0.4%, and Hispanics 1.4%.

Race/ethnicity distribution of Twitter users

We first determined the number of U.S.-based users for whom we could infer the race/ethnicity by comparing the last word of their self-reported name to the U.S. Census last name list. We observed that we found a match for 71.8% of the users. We then determined the distribution of race/ethnicity in each county by taking the race/ethnicity distribution in the Census list, weighted by the frequency of each name occurring in Twitter users in that county.[1] Due to the large amount of ambiguity in the last name-to-race/ethnicity list (in particular, the last name list is more than 95% predictive for only 18.5% of the users), we are unable to compare the Twitter race/ethnicity distribution directly to the race/ethnicity distribution in the U.S. Census. However, we are able to make relative comparisons between Twitter users in different geographic regions, allowing us to explore geographic trends in the race/ethnicity distribution. Thus, we examine the per-county race/ethnicity distribution of Twitter users.

[1] This is effectively the census.model approach discussed in prior work (Chang et al. 2010).

[Color scale: Undersampling (blue) to Oversampling (red)]

(a) Caucasian (non-hispanic) (b) African-American (c) Asian or Pacific Islander (d) Hispanic

Figure 4: Per-county area cartograms of Twitter over- and undersampling rates of Caucasian, African-American, Asian, and Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity are shown. Blue regions correspond to undersampling; red regions to oversampling.

In order to account for the uneven distribution of race/ethnicity across the U.S., we examine the per-county race/ethnicity distribution relative to the distribution from the overall U.S. Census. For example, if we observed that 25% of Twitter users in a county were predicted to be Hispanic, and the 2000 U.S. Census counted 23% of people in that county as being Hispanic, we would consider Twitter to be oversampling the Hispanic users in that county. Figure 4 plots the per-county race/ethnicity distribution, relative to the 2000 U.S. Census, per all counties in which we observed more than 500 Twitter users with identifiable last names. A number of geographic trends are visible, such as the undersampling of Hispanic users in the southwest; the undersampling of African-American users in the south and mid-west; and the oversampling of Caucasian users in many major cities.
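The name-weighted county distribution and the over/undersampling comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the "myers" entry uses the figures quoted in the excerpt, the "garcia" entry and the county counts are hypothetical, and taking the log of the ratio mirrors the color scale of Figure 2.

```python
import math

# Surname -> race/ethnicity fractions.
SURNAMES = {
    # Figures quoted in the excerpt for "Myers"
    "myers": {"caucasian": 0.86, "african_american": 0.097,
              "asian": 0.004, "hispanic": 0.014},
    # Hypothetical entry, for illustration only
    "garcia": {"caucasian": 0.05, "african_american": 0.005,
               "asian": 0.015, "hispanic": 0.92},
}


def county_distribution(users_per_surname, table=SURNAMES):
    """Average the surname distributions, weighted by users carrying each name."""
    totals, matched = {}, 0
    for surname, n_users in users_per_surname.items():
        dist = table.get(surname.lower())
        if dist is None:  # surname not in the list: user cannot be classified
            continue
        matched += n_users
        for group, fraction in dist.items():
            totals[group] = totals.get(group, 0.0) + n_users * fraction
    return {g: t / matched for g, t in totals.items()} if matched else {}


def log_sampling_rate(twitter_fraction, census_fraction):
    """Positive values mean oversampling, negative values undersampling."""
    return math.log(twitter_fraction / census_fraction)


# Hypothetical county: 30 users named Myers, 10 named Garcia, 5 unmatched
county = county_distribution({"Myers": 30, "Garcia": 10, "Zzz": 5})
example = log_sampling_rate(0.25, 0.23)  # the 25% vs 23% case from the text
```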

Related work

A few other studies have examined the demographics of social network users. For example, recent studies have examined the ethnicity of Facebook users (Chang et al. 2010), general demographics of Facebook users (Corbett 2010), and differences in online behavior on Facebook and MySpace by gender (Strayhorn 2009). However, studies of general social networking sites are able to leverage the broad nature of the profiles available; in contrast, on Twitter, users self-report only a minimal set of information, making calculating demographics significantly more difficult.

Conclusion

Twitter has received significant research interest lately as a means for understanding, monitoring, and even predicting real-world phenomena. However, most existing work does not address the sampling bias, simply applying machine learning and data mining algorithms without an understanding of the Twitter user population. In this paper, we took a first look at the user population themselves, and examined the population along the axes of geography, gender, and race/ethnicity. Overall, we found that Twitter users significantly overrepresent the densely populated regions of the U.S., are predominantly male, and represent a highly non-random sample of the overall race/ethnicity distribution.

Going forward, our study sets the foundation for future work upon Twitter data. Existing approaches could immediately use our analysis to improve predictions or measurements. By enabling post-hoc corrections, our work is a first step towards turning Twitter into a tool that can make inferences about the population as a whole. More nuanced analyses on the biases in the Twitter population will enhance the ability for Twitter to be used as a sophisticated inference tool.

Acknowledgements

We thank Fabricio Benevento and Meeyoung Cha for their assistance in gathering the Twitter data used in this study. We also thank Jim Bagrow for valuable discussions and his collection of geographic data from Google Maps. This research was supported in part by NSF grant IIS-0964465 and an Amazon Web Services in Education Grant.

References

Asur, S., and Huberman, B. 2010. Predicting the future with social media. http://arxiv.org/abs/1003.5699.
Bollen, J.; Mao, H.; and Zeng, X.-J. 2010. Twitter mood predicts the stock market. In ICWSM.
Cha, M.; Haddadi, H.; Benevenuto, F.; and Gummadi, K. 2010. Measuring user influence in twitter: The million follower fallacy. In ICWSM.
Chang, J.; Rosenn, I.; Backstrom, L.; and Marlow, C. 2010. epluribus: Ethnicity on social networks. In ICWSM.
Corbett, P. 2010. Facebook demographics and statistics report 2010. http://www.istrategylabs.com/2010/01/facebook-demographics-and-statistics-report-2010-145-growth-in-1-year.
Gastner, M. T., and Newman, M. E. J. 2004. Diffusion-based method for producing density-equalizing maps. PNAS 101.
O’Connor, B.; Balasubramanyan, R.; Routledge, B.; and Smith, N. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In ICWSM.
Social Security Administration. 2010. Most popular baby names. http://www.ssa.gov/oact/babynames.
Strayhorn, T. 2009. Sex differences in use of facebook and myspace among first-year college students. Stud. Affairs 10(2).
U.S. Census. 2000. Genealogy data: Frequently occurring surnames from census. http://www.census.gov/genealogy/www/data/2000surnames.


Language and Geography PLoS One 8, E61981 (2013)

Spanish

English

Geography


Multilayer Network


Information Layer(s)

Social Layer(s)


Link Function ICWSM’11, 89 (2011)

Agreement Discussion


generated by two-point joint probability functions $P_{kk'}(w, w')$, and among those, initially only the ones that are degree independent given by functions of the type $P(w, w')$.

In order to construct weighted networks along these lines, we use the so-called Barabási-Albert (BA) model [22], where new nodes entering the network connect to old ones with a probability proportional to their degree [23]. The networks generated by this model are scale-free [their degree distribution reads as $P_k(k) \sim k^{-3}$], have no degree-degree correlations, and their clustering coefficient (probability of finding triangles) tends to zero when the system size tends to infinity. All this makes them ideal null models to test correlations between edge weights. Once the network is grown, a joint probability distribution for the link weights $P(w, w')$ and an algorithm for weight assignation are needed. With the function $P(w, w')$ one can calculate the weight distribution $P(w) = \int dw' \, P(w, w')$, and the conditional probability of having a weight $w'$ provided that a neighboring link has a weight $w$, $P(w'|w) = P(w, w') / P(w)$. We start by choosing an edge at random and giving it a weight obtained from $P(w)$. Then we move to the nodes at its extremes and assign weights to the neighboring links. To do this, we follow a recursive method: if the edge from which the node is accessed has a weight $w_0$, the rest, $w_1, \ldots, w_{k-1}$, are obtained from the conditional distributions $P(w_i|w_{i-1})$. The recursion is necessary to increase the variability in case of anticorrelation (see below). If any of the links $j$ already have a weight, it remains without change and its value affects the subsequent edges $j+1, \ldots, k-1$. We repeat this process until all the edges of the network have a weight assigned [24].

For $P(w, w')$, we have considered different possibilities but here we will focus only on the following three:

$$P_+(w, w') = \frac{X_+}{(w + w')^{2+\alpha}},$$

$$P_U(w, w') = \frac{X_U}{(w w')^{1+\alpha}},$$

$$P_-(w, w') = \frac{X_-}{(w w' + 1)^{1+\alpha}}, \qquad (2)$$

where $X_+ = 2^{\alpha} \alpha (1+\alpha)$, $X_U = \alpha^2$, and $X_- = \alpha^2 / {}_2F_1(\alpha, \alpha, 1+\alpha, -1)$ are the normalization factors for the distributions on the domain of weights $(1, \infty)$, and ${}_2F_1(\,)$ is the Gauss hypergeometric function [25]. Without losing generality, we have chosen these particular functional forms due to their analytical and numerical tractability. The distributions generated by Eqs. (2) asymptotically decay as $P(w) \sim w^{-1-\alpha}$. The reason to use power-law decaying distributions is that empirical networks commonly show very wide weight distributions that in a first approach can be modeled as power laws (see Fig. 6 and Refs. [3–5,26]). We name the functions as + (positively correlated), − (anticorrelated), and U (uncorrelated) because the average weight $\langle w \rangle(w_0) = \int dw \, w \, P(w|w_0)$, obtained with the conditional probabilities from a certain seed $w_0$, grows as $\langle w \rangle_+(w_0) = (1 + \alpha + w_0)/\alpha$, decreases as $\langle w \rangle_-(w_0) = (\alpha + 1/w_0)/(\alpha - 1)$, and remains constant, $\langle w \rangle_U = \alpha/(\alpha - 1)$, respectively. This means that in + networks the links of each node tend to be relatively uniform in the weights [see Fig. 1(a)], with separate areas of the graph concentrating the strong or the weak links, while in the negative case, links with high and low weights are heavily mixed.

From a numerical point of view, we have checked how the variables to measure vary with the network size $N$. In the following, most results are shown for $N = 10^5$, which is big enough to avoid significant finite size effects. For each value of the exponent $\alpha$ [from Eqs. (2)] and for each type of correlation, we have averaged over more than 600 realizations. Note that we use $\alpha$ as a control parameter for the strength of the correlations. For high values of $\alpha$, $P(w)$ decays very fast and the correlations become negligible; all links have almost the same weight. When $\alpha$ decreases however, the higher moments of $P(w)$ diverge and one would expect the correlations to be more prominent.

III. MEASURES OF WEIGHT CORRELATIONS

After a look at the sketch of Fig. 1, the first estimator to consider in order to estimate weight correlations is the standard deviation of the weights of the links arriving at each node. If the weights are relatively homogeneous, the standard deviation will be lower compared with its counterpart in a randomized instance of the graph. The opposite will happen if the correlations are negative as in Fig. 1(b). More specifically, for a generic node of the network $i$, $\sigma_w(i)$ can be defined as

$$\sigma_w^2(i) = \sum_{j \in \nu(i)} (w_{ij} - \langle w \rangle_i)^2, \qquad (3)$$

where $\nu(i)$ is the set of neighbors of $i$ and $\langle w \rangle_i$ is the mean value of the weight of the links arriving at $i$. Once the deviation is calculated for each node, an average can be taken over the full network getting $\langle \sigma_w \rangle = (1/N) \sum_i \sigma_w(i)$. Then to evaluate the effects of weight correlations, it is necessary to compare the value of $\langle \sigma_w \rangle_{\mathrm{org}}$ obtained for the original network with that measured on uncorrelated graphs. It is, of course, important that the statistical properties of such uncorrelated

FIG. 1. (Color online) Two possible cases in networks with correlations in the link weight: (a) positively correlated nets and (b) anticorrelated networks. The width of the line of the links represents the value of the weight.

JOSÉ J. RAMASCO AND BRUNO GONÇALVES PHYSICAL REVIEW E 76, 066106 (2007)
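The per-node deviation and its network average can be computed directly from a weighted edge list. The sketch below follows the sum in Eq. (3) as it appears in the excerpt (no extra normalization), treats the graph as undirected, and averages over the nodes that have at least one link; the function names and the toy star graph are assumptions for illustration.

```python
import math
from collections import defaultdict


def sigma_w(edges):
    """Per-node sigma_w(i): square root of the summed squared weight deviations."""
    incident = defaultdict(list)
    for i, j, weight in edges:  # undirected: the weight counts for both ends
        incident[i].append(weight)
        incident[j].append(weight)
    result = {}
    for node, weights in incident.items():
        mean = sum(weights) / len(weights)  # <w>_i
        result[node] = math.sqrt(sum((w - mean) ** 2 for w in weights))
    return result


def mean_sigma_w(edges):
    """Network average <sigma_w> over all nodes that appear in the edge list."""
    per_node = sigma_w(edges)
    return sum(per_node.values()) / len(per_node)


# Hypothetical star graph: one hub linked to three leaves with different weights
star = [("hub", "a", 1.0), ("hub", "b", 2.0), ("hub", "c", 3.0)]
per_node = sigma_w(star)
average = mean_sigma_w(star)
```

To judge whether the weights are correlated, the same average would then be recomputed after shuffling the weights among the links and compared with the original value.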

[Diagram: three users A, B, and C linked by mention and retweet edges; each node is labeled with its in-degree k_in, out-degree k_out, in-strength s_in, and out-strength s_out]

Figure 2: Example of a meme diffusion network involving three users mentioning and retweeting each other. The values of various node statistics are shown next to each node. The strength s refers to weighted degree, k stands for degree.

Observing a retweet at node B provides implicit confirmation that information from A appeared in B’s Twitter feed, while a mention of B originating at node A explicitly confirms that A’s message appeared in B’s Twitter feed. This may or may not be noticed by B, therefore mention edges are less reliable indicators of information flow compared to retweet edges.

Retweet and reply/mention information parsed from the text can be ambiguous, as in the case when a tweet is marked as being a ‘retweet’ of multiple people. Rather, we rely on Twitter metadata, which designates users replied to or retweeted by each message. Thus, while the text of a tweet may contain several mentions, we only draw an edge to the user explicitly designated as the mentioned user by the metadata. In so doing, we may miss retweets that do not use the explicit retweet feature and thus are not captured in the metadata. Note that this is separate from our use of mentions as memes (§ 3.1), which we parse from the text of the tweet.
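The node statistics named in the figure caption (degree k and strength s, in and out) can be computed from a list of weighted, directed edges. This is a minimal sketch: the edge list is hypothetical rather than read off the figure, each (source, target) pair is assumed to appear once, and edges are assumed to point in the direction of information flow.

```python
from collections import defaultdict


def node_statistics(edges):
    """Compute k_in, k_out, s_in, s_out for each node of a directed, weighted graph."""
    stats = defaultdict(lambda: {"k_in": 0, "k_out": 0, "s_in": 0, "s_out": 0})
    for source, target, weight in edges:
        stats[source]["k_out"] += 1       # one more outgoing link
        stats[source]["s_out"] += weight  # weight = number of retweets/mentions
        stats[target]["k_in"] += 1
        stats[target]["s_in"] += weight
    return dict(stats)


# Hypothetical edges (source, target, weight)
edges = [("A", "B", 1), ("A", "C", 1), ("B", "C", 2), ("C", "A", 1)]
stats = node_statistics(edges)
```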

4 System Architecture

We implemented a system based on the data representation described above to automatically monitor the data stream from Twitter, detect relevant memes, collect the tweets that match themes of interest, and produce basic statistical features relative to patterns of diffusion. These features are then passed to our meme classifier and/or visualized. We called this system “Truthy.” The different stages that lead to the identification of the truthy memes are described in the following subsections. A screenshot of the meme overview page of our website (truthy.indiana.edu) is shown in Fig. 3. Upon clicking on any meme, the user is taken to another page with more detailed statistics about that meme. They are also given an opportunity to label the meme as ‘truthy;’ the idea is to crowdsource the identification of truthy memes, as an input to the classifier described in § 5.

4.1 Data Collection

To collect meme diffusion data we rely on whitelisted access to the Twitter ‘Gardenhose’ streaming API (dev.twitter.com/pages/streaming_api). The Gardenhose provides detailed data on a sample of the Twitter corpus at a rate that varied between roughly 4 million tweets per day near the beginning of our study, to around 8 million tweets per day at the time of this writing. While the process of sampling edges (tweets between users) from a network to investigate structural properties has been shown to produce suboptimal approximations of true network characteristics (Leskovec and Faloutsos 2006), we find that the analyses described below are able to produce accurate classifications of truthy memes even in light of this shortcoming.

Figure 3: Screenshot of the Meme Overview page of our website, displaying a number of vital statistics about tracked memes. Users can then select a particular meme for more detailed information.

4.2 Meme Detection

A second component of our system is devoted to scanning the collected tweets in real time. The task of this meme detection component is to determine which of the collected tweets are to be stored in our database for further analysis. Our goal is to collect only tweets (a) with content related to U.S. politics, and (b) of sufficiently general interest in that context. Political relevance is determined by matching against a manually compiled list of keywords. We consider a meme to be of general interest if the number of tweets with that meme observed in a sliding window of time exceeds a given threshold. We implemented a filtering step for each of these criteria, described elsewhere (Ratkiewicz et al. 2011).

Our system has tracked a total of approximately 305 million tweets collected from September 14 until October 27, 2010. Of these, 1.2 million contain one or more of our political keywords; the meme filtering step further reduced this number to 600,000. Note that this number of tweets does not directly correspond to the number of tracked memes, as each tweet might contribute to several memes.
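The sliding-window criterion for "general interest" can be illustrated with a small counter. The excerpt defers the actual filter to other work, so this is only a sketch of the idea: the class name, the window length, and the threshold are assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class MemeWindow:
    """Count tweets per meme inside a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window = window_seconds
        self.threshold = threshold
        self.times = defaultdict(deque)  # meme -> timestamps still in the window

    def observe(self, meme, timestamp):
        """Record one tweet; return True if the meme now exceeds the threshold."""
        queue = self.times[meme]
        queue.append(timestamp)
        # Drop tweets that have fallen out of the window
        while queue and queue[0] <= timestamp - self.window:
            queue.popleft()
        return len(queue) > self.threshold


# Hypothetical settings: a one-hour window and a threshold of two tweets
window = MemeWindow(window_seconds=3600, threshold=2)
flags = [window.observe("#meme", t) for t in (0, 10, 20, 4000)]
```

In this toy run the meme crosses the threshold on its third tweet and drops below it again once the early tweets leave the window.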

4.3 Network Analysis

To characterize the structure of each meme’s diffusion network we compute several statistics based on the topology of the largest connected component of the retweet/mention












The Strength of Weak Ties (1973)

• Interviews to find out how individuals learned about job opportunities

• Mostly from acquaintances or friends of friends

• “It is argued that the degree of overlap of two individuals' friendship networks varies directly with the strength of their tie to one another”
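The "degree of overlap" in the quote can be measured on a network as the share of contacts that two people have in common. The sketch below uses a Jaccard-style ratio, which is one common choice rather than a definition taken from the slides; the toy friendship graph and the function name are hypothetical.

```python
def neighborhood_overlap(adjacency, a, b):
    """Fraction of the other contacts of a and b that the two share."""
    contacts_a = adjacency[a] - {b}
    contacts_b = adjacency[b] - {a}
    union = contacts_a | contacts_b
    return len(contacts_a & contacts_b) / len(union) if union else 0.0


# Hypothetical friendship graph stored as a dict of neighbor sets
friends = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"ann", "bob", "eve"},
    "eve": {"dan"},
}

strong_pair = neighborhood_overlap(friends, "ann", "bob")  # all other contacts shared
weak_pair = neighborhood_overlap(friends, "dan", "eve")    # no contacts shared
```

Under the hypothesis in the quote, pairs with high overlap should also be the pairs with the strongest ties (for example, the most calls or messages).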















The Strength of Ties (1973)

the system. Within this new framework we study a family of informa-tion propagation processes, namely the rumour spreading model38,39.We tackle the case in which the dynamics of contacts and the spread-ing process are acting on the same time-scale. Interestingly, both insynthetic and real time-varying networks we find that memory ham-pers the rumour spreading process. Strong ties have an importantrole in the early cessation of the rumor diffusion by favouring inter-actions among agents already aware of the gossip. The celebratedGranovetter conjecture that spreading is mostly supported by weakties40, goes along with a negative effect of strong ties. In other words,while favouring locally the rumor spreading, strong ties have anactive role in confining the process for a time sufficient to itscessation.

ResultsWe focus on a prototypical large scale communication networkwhere mobile phone users are nodes and the calls among them links.

The common analysis framework for such systems neglects thetemporal nature of the connections in favour of time-aggregatedrepresentations. In these representations, the degree k of a nodeindicates the total number of contacted individuals, while the weightof a link w (the strength of the tie) the total number of calls betweenthe pair of connected nodes. The distributions of these quantities areshown in Fig. 1.a, and b. Interestingly, they are characterized byheavy-tailed distributions. Although, the study of the time-aggre-gated network provides basic information about its structure, it can-not inform us on the processes driving its dynamics. This intuition isclearly exemplified in Fig. 2.a and b. These figures show two snap-shots of the network at different times covering few hours of calls in atown. The two plots capture dynamical interaction patterns not vis-ible from the aggregated network representation (Fig. 2.c).

Here we aim to study and identify the mechanisms driving theevolution, and dynamics of the egocentric networks (egonets) of theglobal network. Egonets were thoroughly investigated earlier in psy-chology and sociology41–43. Some other characteristics have beenrecently mapped out with the availability of large-scale data44–48.We tackle this problem from a different angle focusing on the activityrate, a, that allows describing the network evolution beyond simplestatic measures. It is defined as the probability of any given node to beinvolved in an interaction at each unit time. The activity distributionis also heavy-tailed (see Fig. 1.c), but contrary to degree and weight, isa time invariant property of individuals23. It does not change by usingdifferent time aggregation scales23,25. This quantity is the basic ingre-dient of the activity-driven modelling framework23. Here we extendthis approach by identifying, and modelling another crucial com-ponent: the memory of each agent. We encode this ingredient in asimple non-Markovian reinforcing mechanism that allows to repro-duce with great accuracy the empirical data.

Egocentric network dynamics. In general, social networks are characterized by two types of links. The first class describes strong ties that identify time-repeated and frequent interactions among specific couples of agents. The second class characterizes weak ties among agents that are activated only occasionally. It is natural to assume that strong ties are the first to appear in the system, while weak ties are incrementally added to the egonet of each agent [1]. This intuition has been recently confirmed [49] in a large-scale dataset and indicates a particular egocentric network evolution. In order to quantify it, we measure the probability, p(n), that the next communication event of an agent having n social ties will occur via the establishment of a new (n + 1)th link. We calculate these probabilities in the MPC dataset, averaging them for users with the same degree k at the end of the observation time. We therefore

Figure 1 | Distributions of the characteristic measures of the aggregated MPC network and activity-driven networks. In panels (a) and (d) we plot the degree distributions. In panels (b) and (e) we plot the weight distributions. Finally, in panels (c) and (f) we plot the activity distributions. In each figure grey symbols show the original distributions, while coloured symbols show the same distributions after logarithmic binning. Measured quantities in MPC sequences were recorded for 182 days (see Methods). In panels (d), (e), and (f) solid lines show the distributions induced by the reinforced process, while dashed lines denote results of the original memoryless process. Model calculations were performed with parameters $N = 10^6$, $\epsilon = 10^{-4}$ and $T = 10^4$.
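The logarithmic binning mentioned in the caption can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `log_binned_pdf` and the `bins_per_decade` parameter are assumptions chosen for the example.

```python
import math

def log_binned_pdf(values, bins_per_decade=5):
    """Estimate a probability density on logarithmically spaced bins.

    Returns (centers, density), where centers are geometric bin midpoints
    and density is count / (sample size * linear bin width).
    """
    data = [v for v in values if v > 0]  # log bins need positive values
    lo = math.log10(min(data))
    hi = math.log10(max(data))
    n_bins = max(1, math.ceil((hi - lo) * bins_per_decade))
    width = (hi - lo) / n_bins or 1.0  # fall back when all values are equal
    counts = [0] * n_bins
    for v in data:
        # clamp so that the maximum lands in the last bin
        idx = min(int((math.log10(v) - lo) / width), n_bins - 1)
        counts[idx] += 1
    centers, density = [], []
    for i, c in enumerate(counts):
        left = 10 ** (lo + i * width)
        right = 10 ** (lo + (i + 1) * width)
        centers.append(math.sqrt(left * right))
        density.append(c / (len(data) * (right - left)))
    return centers, density

centers, density = log_binned_pdf([1, 10, 100, 1000], bins_per_decade=1)
```

Dividing by the linear bin width is what keeps a heavy-tailed distribution from looking artificially flat in the sparse tail, where bins are wide.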

Figure 2 | Dynamics of the MPC network. Panels (a) and (b) show calls within 3 hours between people in the same town in two different time windows. Panel (c) presents the total weighted social network structure, which was recorded by aggregating interactions during 6 months. Node size and colors describe the activity of users, while link width and color represent weight.

www.nature.com/scientificreports

SCIENTIFIC REPORTS | 4 : 4001 | DOI: 10.1038/srep04001


Neighborhood Overlap PNAS 104, 7333 (2007)

conversation typically represents a one-to-one communication. The tie strength distribution is broad (Fig. 1B), however, decaying with an exponent $\gamma_w = 1.9$, so that although the majority of ties correspond to a few minutes of airtime, a small fraction of users spend hours chatting with each other. This finding is rather unexpected, given that fat-tailed tie strength distributions have been observed mainly in networks characterized by global transport processes, such as the number of passengers carried by the airline transportation network (11), the reaction fluxes in metabolic networks (12), or packet transfer on the Internet (13), in which case the individual fluxes are determined by the global network topology. An important feature of such global flow processes is local conservation: All passengers arriving to an airport need to be transported away, each molecule created by a reaction needs to be consumed by some other reaction, or each packet arriving to a router needs to be sent to other routers. Although the main purpose of the phone is information transfer between two individuals, such local conservation that constrains or drives the tie strengths are largely absent, making any relationship between the topology of the MCG and local tie strengths less than obvious.

Complex networks often organize themselves according to a global efficiency principle, meaning that the tie strengths are optimized to maximize the overall flow in the network (13, 14). In this case the weight of a link should correlate with its betweenness centrality, which is proportional to the number of shortest paths between all pairs of nodes passing through it (refs. 13, 15, and 16, and S. Valverde and R. V. Sole, unpublished work). Another possibility is that the strength of a particular tie depends only on the nature of the relationship between two individuals and is thus independent of the network surrounding the tie (dyadic hypothesis). Finally, the much studied strength of weak ties hypothesis (17–19) states that the strength of a tie between A and B increases with the overlap of their friendship circles, resulting in the importance of weak ties in connecting communities. The hypothesis leads to high betweenness centrality for weak links, which can be seen as the mirror image of the global efficiency principle.

In Fig. 2A, we show the network in the vicinity of a randomly selected individual, where the link color corresponds to the strength of each tie. It appears from this figure that the network consists of small local clusters, typically grouped around a high-degree individual. Consistent with the strength of weak ties hypothesis, the majority of the strong ties are found within the clusters, indicating that users spend most of their on-air time talking to members of their immediate circle of friends. In contrast, most links connecting different communities are visibly

Axes: (A) P(k) vs. degree k; (B) P(w) vs. link weight w (s); (C) overlap configurations for nodes $v_i$ and $v_j$ with $O_{ij}$ = 0, 1/3, 2/3, 1; (D) $\langle O \rangle_w$, $\langle O \rangle_b$ vs. $P_{cum}(w)$, $P_{cum}(b)$.

Fig. 1. Characterizing the large-scale structure and the tie strengths of the mobile call graph. (A and B) Vertex degree (A) and tie strength distribution (B). Each distribution was fitted with $P(x) = a(x + x_0)^{-\gamma_x} \exp(-x/x_c)$, shown as a blue curve, where x corresponds to either k or w. The parameter values for the fits are $k_0 = 10.9$, $\gamma_k = 8.4$, $k_c = \infty$ (A, degree), and $w_0 = 280$, $\gamma_w = 1.9$, $w_c = 3.45 \times 10^5$ (B, weight). (C) Illustration of the overlap between two nodes, $v_i$ and $v_j$, its value being shown for four local network configurations. (D) In the real network, the overlap $\langle O \rangle_w$ (blue circles) increases as a function of cumulative tie strength $P_{cum}(w)$, representing the fraction of links with tie strength smaller than w. The dyadic hypothesis is tested by randomly permuting the weights, which removes the coupling between $\langle O \rangle_w$ and w (red squares). The overlap $\langle O \rangle_b$ decreases as a function of cumulative link betweenness centrality b (black diamonds).
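The overlap illustrated in panel (C) is not defined in the excerpt shown on this slide. The sketch below uses the definition given in the Onnela et al. paper, $O_{ij} = n_{ij} / ((k_i - 1) + (k_j - 1) - n_{ij})$, where $n_{ij}$ is the number of common neighbors of the linked nodes i and j. The adjacency-dictionary layout and the function name are assumptions made for the example.

```python
from fractions import Fraction

def neighborhood_overlap(adj, i, j):
    """Overlap of the neighborhoods of two linked nodes i and j.

    adj: dict mapping each node to the set of its neighbors (hypothetical
    layout). Uses O_ij = n_ij / ((k_i - 1) + (k_j - 1) - n_ij).
    """
    n_ij = len(adj[i] & adj[j])  # common neighbors
    denom = (len(adj[i]) - 1) + (len(adj[j]) - 1) - n_ij
    # two nodes whose only neighbor is each other share nothing
    return Fraction(n_ij, denom) if denom > 0 else Fraction(0)

# i and j are linked; they share neighbor a, and each has one private neighbor
adj = {
    "i": {"j", "a", "b"},
    "j": {"i", "a", "c"},
    "a": {"i", "j"},
    "b": {"i"},
    "c": {"j"},
}
overlap = neighborhood_overlap(adj, "i", "j")
```

Using `Fraction` keeps values such as 1/3 and 2/3 exact, matching the way the panel labels them.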

Fig. 2. The structure of the MCG around a randomly chosen individual. Each link represents mutual calls between the two users, and all nodes are shown that are at distance less than six from the selected user, marked by a circle in the center. (A) The real tie strengths, observed in the call logs, defined as the aggregate call duration in minutes (see color bar). (B) The dyadic hypothesis suggests that the tie strength depends only on the relationship between the two individuals. To illustrate the tie strength distribution in this case, we randomly permuted tie strengths for the sample in A. (C) The weight of the links assigned on the basis of their betweenness centrality $b_{ij}$ values for the sample in A, as suggested by the global efficiency principle. In this case, the links connecting communities have high $b_{ij}$ values (red), whereas the links within the communities have low $b_{ij}$ values (green).

Onnela et al. | PNAS | May 1, 2007 | vol. 104 | no. 18 | 7333 | APPLIED PHYSICAL SCIENCES

weight

betweenness

reshuffled


Strong Ties have higher overlaps PNAS 104, 7333 (2007)






Real Randomized

Betweenness


Network Structure PLoS One 7, e29358 (2012)

a mention involves some effort and addresses only single targeted users.

2.3 Internal links

According to Granovetter’s theory, one could expect the internal connections inside a group to bear closer relations. Mechanisms such as homophily [43], cognitive balance [44,45] or triadic closure [12] favor this kind of structural configuration. Unfortunately, we have no means to measure the closeness of a user-user relation in a sociological sense in our Twitter dataset. However, we can verify whether the link has been used for mentions, whether the interchange has been reciprocated or whether it has happened more than once. We define the fraction $f_p^i$ of links with interaction i in position p with respect to the groups of size s as

$$f_p^i(s) = \frac{n_p^i(s)}{N^i}, \qquad (1)$$

where $n_p^i(s)$ is the number of links with that type of interaction in position p with respect to the groups of size s and $N^i$ is the total number of links with interaction i. The fraction $f_{\mathrm{internal}}^i(s)$ reveals an interesting pattern as a function of the group size, as can be seen in Figure 3A. Note that the fraction of links in the follower network (black curve) is taken as the reference for comparison. Links with mentions are more abundant as internal links than the baseline follower relations for groups of size up to 150 users. This particular value is reminiscent of the quantity known as the Dunbar number [46], the cognitive limit to the number of people with whom each person can have a close relationship, which has recently been discussed in the context of Twitter [47]. Although we have identified larger groups, the density of mentions is similar to the density of links in the follower network. In addition, the distribution of the number of times that a link is used (intensity) for mentions is wide, which allows for a systematic study of the dependence of intensity and position (see Figure 3B). The more intense (or reciprocated) a link with mentions is, the more likely it becomes to find this link as internal (Figure 3C). This corresponds

Figure 1. Groups and links. (A) Sample of Twitter network: nodes represent users and links, interactions. The follower connections are plotted as gray arrows, mentions in red, and retweets in green. The width of the arrows is proportional to the number of times that the link has been used for mentions. We display three groups (yellow, purple and turquoise) and a user (blue star) belonging to two groups. (B) Different types of links depending on their position with respect to the groups’ structure: internal, between groups, intermediary links and no-group links.
doi:10.1371/journal.pone.0029358.g001

The Strength of Intermediary Ties in Social Media

PLoS ONE | www.plosone.org 3 January 2012 | Volume 7 | Issue 1 | e29358

“People whose networks bridge the structural holes between groups have an advantage in detecting and developing rewarding opportunities. Information arbitrage is their advantage. They are able to see early, see more broadly, and translate information across groups.”

AJS Volume 110 Number 2 (September 2004): 349–99

© 2004 by The University of Chicago. All rights reserved. 0002-9602/2004/11002-0004$10.00

Structural Holes and Good Ideas1

Ronald S. Burt
University of Chicago

This article outlines the mechanism by which brokerage provides social capital. Opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with alternative ways of thinking and behaving. Brokerage across the structural holes between groups provides a vision of options otherwise unseen, which is the mechanism by which brokerage becomes social capital. I review evidence consistent with the hypothesis, then look at the networks around managers in a large American electronics company. The organization is rife with structural holes, and brokerage has its expected correlates. Compensation, positive performance evaluations, promotions, and good ideas are disproportionately in the hands of people whose networks span structural holes. The between-group brokers are more likely to express ideas, less likely to have ideas dismissed, and more likely to have ideas evaluated as valuable. I close with implications for creativity and structural change.

The hypothesis in this article is that people who stand near the holes in a social structure are at higher risk of having good ideas. The argument is that opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with alter-

1 Portions of this material were presented as the 2003 Coleman Lecture at the University of Chicago, at the Harvard-MIT workshop on economic sociology, in workshops at the University of California at Berkeley, the University of Chicago, the University of Kentucky, the Russell Sage Foundation, the Stanford Graduate School of Business, the University of Texas at Dallas, Universiteit Utrecht, and the “Social Aspects of Rationality” conference at the 2003 meetings of the American Sociological Association. I am grateful to Christina Hardy for her assistance on the manuscript and to several colleagues for comments affecting the final text: William Barnett, James Baron, Jonathan Bendor, Jack Birner, Matthew Bothner, Frank Dobbin, Chip Heath, Rachel Kranton, Rakesh Khurana, Jeffrey Pfeffer, Joel Podolny, Holly Raider, James Rauch, Don Ronchi, Ezra Zuckerman, and two AJS reviewers. I am especially grateful to Peter Marsden for his comments as discussant at the Coleman Lecture. Direct correspondence to Ron Burt, Graduate School of Business, University of Chicago, Chicago, Illinois 60637. E-mail: [email protected]


Network Structure PLoS One 7, e29358 (2012)

to Granovetter’s expectation that the stronger the tie is, the higher the number of mutual contacts of both parties it has and the higher the chance that the parties belong to the same group.

2.4 Links between groups

The next question to consider is the characteristics of links between groups. These links occur mainly between groups containing fewer than 200 users (Figure 4A–C). However, their frequency depends on the quality of the links (whether they bear mentions or retweets). While links with mentions are less abundant than the baseline, those with retweets are slightly more abundant. According to the strength of weak ties theory [12,14–16], weak links are typically connections between persons not sharing neighbors, being important to keep the network connected and for information diffusion. We investigate whether the links between groups play a similar role in the online network as information transmitters. The actions most related to information diffusion are retweets [24], which show a slight preference for occurring on between-group links (Figures 4B and 4C). This preference is enhanced when the similarity between connected groups is taken into account. We define the similarity between two groups, A and B, in terms of the Jaccard index of their connections:

$$\mathrm{similarity}(A,B) = \frac{|\cap \text{ links of } A \text{ and } B|}{|\cup \text{ links of } A \text{ and } B|}. \qquad (2)$$

The similarity is the overlap between the groups’ connections and it estimates the network proximity of the groups. The general pattern is that links with mentions more likely occur between close groups and retweets occur between groups with medium similarity (Figure 4D). Mentions, as personal messages, are typically exchanged between users with similar environments, which is what the strength of weak ties theory predicts. Links with retweets are related to information transfer, and the similarity of the groups between which they take place should be small according to Granovetter’s theory. The results show that the links most likely to attract retweets are those connecting groups that are neither too close nor too far. This can be explained with Aral’s theory about the trade-off between diversity and bandwidth: if the two groups are too close there is not enough diversity in the information, while if the groups are too far the communication is poor. These trends are not dependent on the size of the considered groups (see Figs. S6–S14 and Table S1 in the Supplementary Information).
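The Jaccard index used for the group similarity is simple to compute once each group's connections are available as a set. The sketch below assumes, for illustration only, that "links of A" is represented as the set of users that members of group A are connected to; the paper's exact bookkeeping may differ.

```python
def group_similarity(links_a, links_b):
    """Jaccard index of two groups' connection sets.

    links_a, links_b: iterables of hashable link identifiers (here, the
    users each group connects to; a hypothetical representation).
    """
    a, b = set(links_a), set(links_b)
    union = a | b
    # two groups with no connections at all are treated as dissimilar
    return len(a & b) / len(union) if union else 0.0

# two groups sharing two of their four distinct contacts
similarity = group_similarity({"u1", "u2", "u3"}, {"u2", "u3", "u4"})
```

A value of 0 means the groups have no connections in common and a value of 1 means identical connection sets, so "medium similarity" in the text corresponds to intermediate values of this index.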

2.5 Intermediary links

The communication between groups can take place in two ways: the information can propagate by means of links between groups or by passing through an intermediary user belonging to more than one group. We have defined as intermediary the links connecting a pair of users sharing a common group and with at least one of the users belonging also to a different group (see Fig. 1B). These users and their links have a high potential to pass information from one group to another in an efficient way [13]. Several previous works pointed to the existence of special users in Twitter regarding the communication in the network [28,48]. In order to estimate the efficiency of the different types of links as attractors of mentions and retweets, we measure a ratio $r_p^i$ for links in position p and for interaction i defined as

$$r_p^i = \frac{n_p^i}{N_p}, \qquad (3)$$

where, as before, $n_p^i$ is the number of links with the interaction i in position p and $N_p$ is the total number of links in that position. The bar plot with the values of $r_p^i$ is displayed in Figure 5A. The efficiency of the different types of links can thus be compared for the attraction of mentions (red bars) and retweets (green bars).
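The two normalizations used in the paper, by interaction type in Eq. (1) and by position in Eq. (3), can be compared side by side in a small sketch. The input layout (one `(position, interactions)` pair per follower link) and the function name are assumptions, and the group-size dependence of Eq. (1) is ignored here for brevity.

```python
from collections import Counter

def link_ratios(links):
    """Compute f (share of interaction i found in position p) and
    r (share of links in position p that carry interaction i).

    links: iterable of (position, interactions) pairs, where interactions
    is a set of labels such as {"mention", "retweet"} (hypothetical layout).
    """
    n = Counter()      # n[(i, p)]: links in position p carrying interaction i
    n_pos = Counter()  # N_p: all links in position p
    n_int = Counter()  # N^i: all links carrying interaction i
    for position, interactions in links:
        n_pos[position] += 1
        for i in interactions:
            n[(i, position)] += 1
            n_int[i] += 1
    f = {key: count / n_int[key[0]] for key, count in n.items()}
    r = {key: count / n_pos[key[1]] for key, count in n.items()}
    return f, r

links = [
    ("internal", {"mention"}),
    ("internal", set()),
    ("between", {"mention", "retweet"}),
    ("intermediary", {"retweet"}),
]
f, r = link_ratios(links)
```

The two ratios answer different questions: f asks where a given kind of interaction tends to sit, while r asks how likely a link in a given position is to attract that interaction.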

Figure 2. Group and link statistics. (A) Size distribution of the groups. (B) Distribution of the number of groups to which each user is assigned. (C) Percentage of links of different types, e.g. follower links (black bars), links with mentions (red bars) or retweets (green bars), located in particular topological positions with respect to the detected groups.
doi:10.1371/journal.pone.0029358.g002












Groups PLoS One 7, e29358 (2012)

information has been removed from the database before the analysis, which has been performed using anonymized data.

4.2 Description of the dataset

The data analyzed in this paper was collected in a two-step process: the first stage corresponds to the collection of the follower network (followers and followees), while the second consists in the retrieval of the user activity from the stream of Twitter (plain tweets, mentions and retweets). In the first stage, the directed unweighted network is obtained from the information on the followers and followees of each user. The data was collected using a breadth-first search technique: starting from several seeds, followers and followees of the seeds were retrieved. Then the same procedure was repeated for the newly discovered users, obtaining a so-called snowball sampling of the follower network. The procedure is stopped after several steps, when the number of newly discovered users in the n-th breadth is small compared with the total number of users already discovered in the (n−1)-th step. The process was run in November 2008, gathering information for a total of 2408534 users. Due to the internal exploration of the network, one can anticipate that this method tends to detect the users with the highest in- or out-degree that belong to the largest connected cluster of the network.
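The breadth-first snowball procedure can be sketched as below. Everything specific here is an assumption made for illustration: `get_neighbors` is a hypothetical callback standing in for whatever returns a user's followers and followees, and `max_depth` and `min_growth` are hypothetical knobs for the "several steps" and "small compared with" conditions, which the paper does not quantify.

```python
def snowball_sample(seeds, get_neighbors, max_depth=3, min_growth=0.05):
    """Breadth-first snowball sampling of a follower network.

    seeds: initial users; get_neighbors(user) returns the user's followers
    and followees (hypothetical callback). Stops after max_depth layers or
    when a layer adds fewer than min_growth * (users known so far).
    """
    discovered = set(seeds)
    frontier = list(discovered)
    for _ in range(max_depth):
        new_users = set()
        for user in frontier:
            for other in get_neighbors(user):
                if other not in discovered:
                    new_users.add(other)
        if not new_users:
            break  # network exhausted
        known_before = len(discovered)
        discovered |= new_users
        frontier = list(new_users)  # only expand the newest layer next
        if len(new_users) < min_growth * known_before:
            break  # the latest layer was small relative to what we had
    return discovered

# toy follower graph used only to exercise the function
toy_graph = {1: [2, 3], 2: [4], 3: [], 4: [5], 5: []}
sample = snowball_sample([1], lambda u: toy_graph.get(u, []), max_depth=2, min_growth=0.0)
```

Because each layer is reached through existing links, well-connected users are found early, which is consistent with the bias towards high-degree users noted in the text.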

The second stage consists in searching for all the tweets of the users found in the follower network for a period of time from November 20 to December 11. The activity dataset was constructed from these gathered tweets. The tweets containing usernames with a ‘@username’ functional syntax were used for the

Figure 4. Group-group activity. (A) Distribution of the number of links in the follower network between groups as a function of the size of the groups. (B) Fractions f of links of the different types (follower, with mentions and with retweets) as a function of the size of the group at the link origin, and (C) at the targeted group. (D) Frequency of between-group links as a function of the group-group similarity for the different types of links. In the inset, ratio between the frequency of links with retweets and with mentions.
doi:10.1371/journal.pone.0029358.g004

Figure 5. Intermediary links. (A) Ratio r between the number of links with mentions or retweets and the number of follower links. (B) Distribution of the links in the follower network (black curve), those with mentions (red curve) and retweets (green curve) as a function of the number of non-shared groups of the users connected by the link. Inset, ratios between these distributions and the follower network.
doi:10.1371/journal.pone.0029358.g005















Geography


Multilayer Network


Information Layer(s)

Social Layer(s)

Geographical Layer(s)


Twitter Follower Distance Social Networks 34, 73 (2012)

Y. Takhteyev et al. / Social Networks 34 (2012) 73–81

Fig. 1. Histogram of physical distances between egos and alters. The graph shows the number of ties by distance, in 200 km bins (for example, New York–London ties, at 5590 km, are counted towards the 5400 km bin). The total number of ties in each of the two simulations is the same as in the observed data. Based on 1259 dyads.

Fig. 1 shows the distribution of distances between egos and their alters, comparing it to two simulated baselines and showing that distance also has an effect on non-local ties. The observed distribution is shown as the thick solid line. When analyzing the distribution of tie lengths, it is again important to consider the uneven distribution of the users’ locations around the globe. If ties were formed by picking random points on the surface of the planet (with full disregard for uneven distribution of land mass and population), we would expect a symmetric distribution on the range from 0 to 20,000 km, with a peak at 10,000 km, represented by the smooth thin line in Fig. 1 (labeled “simulation 1”). Twitter users, however, are not distributed evenly around the globe. (Nor is human population in general.) This uneven distribution substantially skews the expected distributions of distances between egos and alters towards shorter ties. Further, since the users are concentrated in a few clusters, we can expect the distribution to peak at values corresponding to distances between major clusters.

This distribution is demonstrated by the second simulation (Fig. 1, medium line, labeled “simulation 2”), in which egos, located where they are in our sample, form ties among each other at random. The graph shows a substantial number in the very first bin (0–200 km), followed by a decline in bins representing longer distances. The count goes up, however, as we approach bins that include distances that span the two coasts of the United States, with a particularly sharp peak for the 3800–4000 km bin, which catches the distance between New York and Los Angeles. This peak is followed by another valley, corresponding to not-quite-transatlantic distances, and then a rise as we reach Europe. The simulation shows another large peak corresponding to the distance between New York and São Paulo, followed by one matching the distance between New York and Tokyo. We see relatively few ties longer than 12,000 km, since the antipodal points of all major clusters fall in the ocean.

Compared with this baseline, the observed distribution of tie lengths shows a clear surplus of ties for distances up to 1000 km, a somewhat mixed record from there to 5000 km and a consistent deficit of ties at greater distances. We note, though, that the peak in the number of ties at the New York–Los Angeles distance is actually higher than we would expect if ties were formed randomly. On the other hand, several other expected peaks remain unrealized. In particular, we observe no peaks at the values corresponding to the distances between New York and São Paulo, New York and Tokyo, and Tokyo and São Paulo.

For network comparison we created a “distance” network in which the weight of edges was set to the natural logarithm of the great-circle distance between the two clusters, calculated using the standard haversine formula. The comparison of this network to the network of Twitter ties for the top 25 clusters shows a correlation of −0.45, with p < 0.001 (Table 4). We note that our dependent network (“Twitter”) is based only on ties that connect users in different clusters, omitting the 39 percent of the ties that fall within clusters. Therefore, the correlation with the distance network cannot be explained simply by the large number of local ties, but rather shows a further constraining effect of distance on non-local ties.
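The haversine distance, the 200 km binning used in Fig. 1, and the logged edge weight can be sketched together. The Earth radius of 6371 km and the city-center coordinates are assumptions chosen for the example; the paper does not state which values it used, so the computed distance differs slightly from the 5590 km quoted in the caption.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed, not from the paper)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def distance_bin(d_km, width_km=200):
    """Lower edge of the histogram bin that a distance falls into."""
    return int(d_km // width_km) * width_km

# approximate city-center coordinates (assumed for illustration)
new_york = (40.7128, -74.0060)
london = (51.5074, -0.1278)

d = haversine_km(*new_york, *london)
bin_start = distance_bin(d)   # bin used in the distance histogram
edge_weight = math.log(d)     # logged distance used as the edge weight
```

Taking the logarithm compresses the long tail of intercontinental distances, so a few very long ties do not dominate the correlation with the Twitter network.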

5.2. Air travel

To investigate the effect of the ease of travel on Twitter ties we obtained a dataset showing the number of direct flights between pairs of 3023 airports on five different days in 2008 and 2009 (Mendelsohn, unpublished data). We assigned those flights to pairs of clusters by matching each cluster to the airports located within 100 km from its center. We then constructed a network by giving each pair of clusters a weight based on the natural logarithm of the observed number of flights between the airports assigned to each of them.
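A rough sketch of this construction is given below. The input layouts, the function names, the 6371 km Earth radius, and the choice to treat the network as undirected (summing flights in both directions) are all assumptions for illustration; the flight dataset itself is unpublished and its format is not described.

```python
import math
from collections import Counter

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    (lat1, lon1), (lat2, lon2) = p, q
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

def flight_network(clusters, airports, flights, max_km=100.0):
    """Weight cluster pairs by the log of the flights between their airports.

    clusters, airports: dict name -> (lat, lon); flights: iterable of
    (origin_airport, destination_airport, count) (hypothetical layouts).
    """
    # each airport serves every cluster whose center is within max_km
    served_by = {
        code: [c for c, center in clusters.items() if haversine_km(center, pos) <= max_km]
        for code, pos in airports.items()
    }
    totals = Counter()
    for origin, dest, count in flights:
        for c1 in served_by.get(origin, []):
            for c2 in served_by.get(dest, []):
                if c1 != c2:
                    totals[frozenset((c1, c2))] += count
    return {pair: math.log(n) for pair, n in totals.items() if n > 0}

clusters = {"NYC": (40.71, -74.01), "LA": (34.05, -118.24)}
airports = {"JFK": (40.64, -73.78), "LAX": (33.94, -118.41), "ORD": (41.97, -87.91)}
flights = [("JFK", "LAX", 10), ("LAX", "JFK", 10), ("JFK", "ORD", 5)]
weights = flight_network(clusters, airports, flights)
```

Airports farther than 100 km from every cluster center, like the third airport in the toy data, simply drop out of the network.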

Comparing the air travel network with the network of Twitter ties shows a correlation of 0.51 for the top 25 clusters, with p < 0.001 (Table 4). The network of flights is thus a better predictor of non-local Twitter ties than physical distance. One interpretation of the predictive power of flight frequency is that frequent flights facilitate travel, which allows for formation of face-to-face ties and increases the likelihood of Twitter connections. (This may, for example, include the fact that when people travel or move they may continue to follow people back home.) Another interpretation suggests that flight connections themselves reflect the structure of the world city system, and that Twitter ties are influenced by this structure. Our data does not allow us to disambiguate

Table 4. QAP correlations, top 25 clusters. Distance, the number of Twitter ties, and the number of flights are logged.

            Twitter   Flights   Language   Domestic
Distance     −0.448    −0.817     −0.617     −0.720
Domestic      0.440     0.723      0.709
Language      0.418     0.637
Flights       0.510

All p-values are ≤0.005.


Locality Social Networks 34, 73 (2012)


Table 5. Top countries.

Columns:
(1) Country
(2) Share of egos (%) [a]
(3) Share of egos (%) for egos in dyads [b]
(4) Share of alters (%) [c]
(5) Percentage of domestic ties [d]
(6) Percentage of domestic ties among non-local ties [d]
(7) Following foreign alters / being followed from abroad
(8) Country named explicitly (% of egos)

(1)            (2)    (3)    (4)    (5)    (6)    (7)    (8)
USA           48.5   45.7   54.5   91.6   89.3    0.3    8.1
Brazil        10.6   12.1   10.5   83.5   72.5    4.9   55.4
UK             7.6    8.3    7.6   50.6   33.3    1.2   45.3
Japan          5.5    6.5    6.3   92.1   86.0    1.4   25.0
Canada         3.7    3.8    2.9   33.3   23.1    1.6   58.5
Australia      2.7    2.7    1.9   50.0   32.0    2.2   69.7
Indonesia      2.6    1.8    1.2   60.0   25.0    7.0   83.3
Germany        2.1    1.8    1.3   62.9   58.8    3.2   58.6
Netherlands    1.4    1.4    1.2   66.7   22.2    1.5   54.3
Mexico         1.2    1.3    0.7   44.0    8.3    7.0   56.7

[a] Out of the 2852 egos located at the level of country or better.
[b] Out of the egos included in 1953 dyads with both parties located at the level of country or better.
[c] Out of the 1953 alters located at the level of country or better.
[d] The number of ties with the ego and the alter in the given country as a share of all ties for egos in that country.

between those two interpretations. We also note that top Twitter clusters intersect only to an extent with Alderson and Beckfield’s (2004) ranking of world cities based on multinational corporations’ branch headquarters. (Of Alderson and Beckfield’s top 25 cities by in-degree or “prestige,” 13 appear in the top 25 Twitter clusters ranked by in-degree centrality, with another 6 appearing in the top 100.)

5.3. National borders

Of the ties that were matched to countries, 75 percent connect users in the same country. This prevalence of domestic ties is partly explained by the high frequency of local connections, since all local ties are domestic. Looking at just the non-local ties (i.e., ties between users in different clusters), we find that the share of domestic ties is lower but still substantial: 63 percent.

As with distance, the high frequency of domestic ties can be partly explained by the concentration of users in a small number of countries, with nearly half of them in the United States. The share of domestic ties, however, substantially exceeds what we would expect if users formed connections randomly while being distributed as they are now, which would result in only 26 percent of the ties being domestic. Further, the surplus of domestic connections holds for all major countries, including those that account for just a small fraction of the egos, as shown in Table 5. (The effect is somewhat reduced for countries that have only one major cluster, since in those cases removing local ties means removing the majority of the domestic ties.) The table also shows that the share of domestic ties is generally higher for non-English-speaking countries (as long as they have several clusters), yet even the English-speaking countries show a higher share of domestic ties than would be expected from their share of egos. A comparison between the network of Twitter ties between the top 25 clusters and the “domestic” network (where edges were set to 1 for domestic ties and 0 for international) shows a correlation of 0.44, with p < 0.001 (Table 4).
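The 26 percent baseline can be roughly checked from the country shares in Table 5. The sketch below is a back-of-the-envelope estimate under a simplifying assumption that is not the paper's stated method: both ends of a random tie are drawn independently from the ego shares in column 2, and only the ten listed countries (about 86 percent of egos) are included.

```python
# Share of egos (%) per country, copied from column 2 of Table 5
ego_share_percent = {
    "USA": 48.5, "Brazil": 10.6, "UK": 7.6, "Japan": 5.5, "Canada": 3.7,
    "Australia": 2.7, "Indonesia": 2.6, "Germany": 2.1, "Netherlands": 1.4,
    "Mexico": 1.2,
}

# Probability that two independently drawn users share a country:
# the sum over countries of the squared share (assumed random mixing)
expected_domestic = sum((share / 100) ** 2 for share in ego_share_percent.values())

print(f"expected domestic share under random mixing: {expected_domestic:.1%}")
```

The sum is dominated by the United States term, which is why the concentration of users in one country alone accounts for most of the random-mixing baseline; the unlisted countries each hold too few egos to change the total much.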

The substantial share of the United States in the sample warrants a comparison with other countries. The share of domestic ties is lower for egos located outside the United States: 62 percent of all ties and 42 percent of non-local ties. However, the share of domestic ties is higher for pairs where both parties are located outside the United States: 80 percent of all ties and 65 percent of non-local ones. In other words, Twitter users outside of the US have a somewhat more international orientation than American users, but only in the sense that they tend to follow users in the US.

It is also important to note the differences in the pattern of outgoing ties (following) and incoming ties (being followed). As column 7 in Table 5 shows, the majority of the US’s international

Table 6. The most common languages. Based on 2852 egos.

Language      % of egos
English       72.5
Portuguese    10.1
Japanese       5.4
Spanish        3.1
Indonesian     1.8
German         1.7
Dutch          1.0
Chinese        0.9
Korean         0.4
Swedish        0.4
Russian        0.4

ties are incoming: Twitter users in the United States are often followed from abroad, with over three incoming ties for each outgoing tie. For some of the other countries, on the other hand, international ties are overwhelmingly outgoing. For example, the ratio of incoming ties to outgoing is nearly one to five for users in Brazil, who actively follow foreign accounts, but receive little attention in return.

The more domestic orientation of the American users also reflects itself in how they describe their locations. When coding the locations we noted whether the country was stated explicitly or implied (e.g., “São Paulo, Brasil” vs. just “São Paulo”). As shown in column 8 of Table 5, only eight percent of US location descriptions explicitly name the country, compared to, for example, 55 percent of locations in Brazil. This may suggest that American users of Twitter either see their audience as exclusively domestic (even though it is not), expect foreign users to know the names of American cities, or simply do not think about Twitter users abroad. The United States is closely followed by Japan, where only 25 percent of location descriptions identify the country explicitly. However, in the case of Japan, this may be explained by the fact that in the overwhelming majority of cases, locations in Japan were identified in Japanese (using kanji or kana), which makes them intelligible only to people who know Japanese and would be familiar with Japanese cities. Additionally, Japanese users are followed almost exclusively by others in Japan: ties from foreign egos account for a relatively small fraction (10 percent) of the ties received by Japanese alters. Note that the Brazilian users have proportionally even fewer incoming foreign ties. This does not, however, stop them from identifying their country explicitly.

5.4. Language

A large majority of egos (62 percent) and alters (68 percent) are located in countries where English is the dominant language.

Y. Takhteyev et al. / Social Networks 34 (2012) 73– 81 77

accounts, by randomly drawing an account from among those “followed” by each of those egos. We then coded the locations of the alters using the same procedure as we did for the egos, removing those pairs where the alter could not be assigned to a country. In the end, we obtained a sample of 1953 ego-alter pairs with both the ego and the alter assigned to a country, including 1259 pairs with “specific” locations for both parties (Table 1).

4.4. Aggregating nearby locations

Since specific locations vary substantially in precision and since users can often choose between a range of specific names for the same place (e.g., “Palo Alto” vs. “Silicon Valley” vs. “SF Bay”), we aggregated nearby locations within each country, by assigning a set of coordinates (obtained from Google Maps) to each location smaller than 25,000 km² and then merging nearby locations within each country by replacing their coordinates with a weighted average of the coordinates of the merged locations. This reduced our location descriptions to a set of 386 regional clusters, which are comparable in size to metropolitan areas. We labeled each cluster with the most common name associated with it in our sample. For example, the cluster centered on Manhattan is referred to as “New York.”
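The merging step above can be illustrated with a small sketch. The text does not state the merge radius or the order in which locations were merged, so the greedy single pass, the `radius_km` value, and the sample coordinates below are all assumptions made for illustration; only the weighted-average rule comes from the description.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def merge_nearby(locations, radius_km=50.0):
    """Greedily merge (lat, lon, weight) points closer than radius_km.

    Each cluster is stored as [lat, lon, weight]; its coordinates are the
    weighted average of the merged points. Plain averaging of longitudes is
    only reasonable away from the antimeridian.
    """
    clusters = []
    for lat, lon, w in locations:
        for c in clusters:
            if haversine_km(lat, lon, c[0], c[1]) <= radius_km:
                total = c[2] + w
                c[0] = (c[0] * c[2] + lat * w) / total
                c[1] = (c[1] * c[2] + lon * w) / total
                c[2] = total
                break
        else:
            clusters.append([lat, lon, w])
    return clusters


# Three Bay Area place names and one distant city (approximate coordinates,
# hypothetical user counts as weights).
points = [
    (37.44, -122.14, 3),  # "Palo Alto"
    (37.39, -122.08, 2),  # "Silicon Valley"
    (37.77, -122.42, 5),  # "San Francisco"
    (40.71, -74.01, 8),   # "New York"
]
clusters = merge_nearby(points, radius_km=60.0)
print(len(clusters), "clusters")
```

A greedy pass depends on input order; a hierarchical or density-based clustering would be a reasonable alternative if order independence matters.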

5. Analysis

In this section we analyze the factors affecting the formation of Twitter ties. We first look at the effect of each variable identified earlier based on theoretical considerations: the actual physical distance, the frequency of air travel, national boundaries, and language differences. In addition to presenting the descriptive statistics demonstrating the effects of each variable and investigating the nature of such effects, we correlated the effects using the Quadratic Assignment Procedure (QAP, Krackhardt, 1987; Butts, 2007). In the last subsection we also examined the relationship between the variables using QAP regression (Double Dekker Semi-partialling MRQAP). All statistical calculations were done using UCINet 6.277 (Borgatti et al., 2002).

For correlation and regression analysis we used networks with nodes representing the 25 largest regional clusters of users (see previous section). The edges of each network were then assigned weights based on an operationalization of the corresponding variable. For the dependent variable network the weight of the edges represented the natural logarithm of the number of Twitter ties between users in the two clusters. The weights for the edges in the independent variable networks are described below, when we discuss each variable. We have found that the network of 386 Twitter clusters was extremely sparse, since the number of ties in the sample was small relative to the number of nodes. As a result, more than 99 percent of cluster pairs had zero Twitter connections between them, leading to low correlation (between 0.05 and 0.1) with the comparison networks, with the only exception being the network of airline connections.⁶ For this reason, we limited our correlation and regression analysis to the ties between just the 25 largest clusters, which allowed for a much denser Twitter network (an average of 0.76 ties per pair).
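To make the idea of a QAP correlation concrete, here is a minimal sketch in plain Python. It is not the UCINet procedure used in the paper and does not cover MRQAP regression; it only shows the core permutation idea: correlate the off-diagonal cells of two networks, then relabel the nodes of one network at random many times to see how often chance alone gives a correlation at least as large. The function names and the toy matrices are hypothetical.

```python
import math
import random

def offdiag_corr(a, b):
    """Pearson correlation over the off-diagonal cells of two square matrices."""
    n = len(a)
    xs = [a[i][j] for i in range(n) for j in range(n) if i != j]
    ys = [b[i][j] for i in range(n) for j in range(n) if i != j]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def qap_test(a, b, n_perm=2000, seed=0):
    """Observed correlation and one-sided permutation p-value.

    Rows and columns of `b` are relabelled with the same random permutation,
    which keeps each network's structure but breaks the node correspondence.
    """
    rng = random.Random(seed)
    n = len(a)
    observed = offdiag_corr(a, b)
    hits = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        b_perm = [[b[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if offdiag_corr(a, b_perm) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy example: two symmetric 6-node networks, the second a noisy copy of the first.
rng = random.Random(1)
n = 6
a = [[0.0] * n for _ in range(n)]
b = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        a[i][j] = a[j][i] = rng.random()
        b[i][j] = b[j][i] = a[i][j] + 0.1 * rng.random()

r, p = qap_test(a, b, n_perm=500)
print(f"r = {r:.2f}, p = {p:.3f}")
```

Permuting node labels, rather than individual cells, is what respects the dependence between cells that share a row or column.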

5.1. Physical distance

The use of Twitter is concentrated in the United States, which accounts for 49 percent of our sample of egos, 54 percent of the alters, and 6 of the 10 largest clusters (Table 3). At the same time,

6 Note that the airline network was very sparse, much like the Twitter networks. The other networks, by comparison, had non-zero values for all pairs.

Table 3. Top clusters.

Rank  Cluster (a)            Egos (%) (b)  Egos in dyads (%) (c)  Alters (%) (d)  Locality (e)
 1    “New York”                 8.5            8.3                 10.2           54.3
 2    “Los Angeles, CA”          5.1            5.6                 10.4           53.3
 3    “ ” (Tokyo)                4.1            4.8                  5.0           62.9
 4    “London”                   3.6            3.3                  4.9           48.8
 5    “São Paulo”                3.5            3.0                  3.6           78.4
 6    “San Francisco”            2.8            2.7                  4.1           41.2
 7    “New Jersey” (f)           2.5            2.8                  2.1           20.0
 8    “Chicago”                  2.2            2.0                  1.7           32.0
 9    “Washington, DC”           2.1            2.8                  2.6           34.3
10    “Manchester, UK”           1.9            2.0                  1.1           30.8
11    “Atlanta”                  1.7            2.1                  2.1           46.2
12    “San Diego”                1.5            1.5                  1.1           26.3
13    “Toronto, Canada”          1.3            1.1                  1.5           42.9
14    “Seattle”                  1.3            1.4                  1.2           58.8
15    “Houston”                  1.2            1.2                  1.0           40.0
16    “Dallas, Texas”            1.2            1.0                  1.4           61.5
17    “Rio de Janeiro”           1.2            1.0                  1.1           30.8
18    “Boston, MA”               1.2            1.2                  1.1           20.0
19    “Amsterdam”                1.1            1.1                  0.9           50.0
20    “Jakarta, Indonesia”       1.1            0.6                  0.3           42.9
21    “Austin, TX”               1.0            1.0                  1.3           50.0
22    “Sydney”                   0.9            1.0                  0.8           38.5
23    “Orlando, Forida”          0.9            1.0                  0.6           16.7
24    “Phoenix, AZ”              0.8            0.7                  0.6           11.1
25    “ ” (Hyogo) (g)            0.8            1.0                  1.0           25.0

(a) Each cluster is labeled with the name most frequently used for locations assigned to the cluster.
(b) Out of the 2167 egos located with precision of <25,000 km².
(c) Out of the 1259 egos included in dyads with both parties located with precision of <25,000 km².
(d) Out of the 1259 alters included in dyads with both parties located with precision of <25,000 km².
(e) Defined as the share of local ties among all ties for egos in a cluster.
(f) Centered between Philadelphia and Trenton, NJ and includes all locations identified as just “New Jersey”.
(g) Centered near the boundary between Hyogo and Osaka prefectures, in the Kansai region of Japan.

over half of the egos are in other countries, as are 4 of the 10 largest clusters: Tokyo, São Paulo, and two clusters in the United Kingdom. In this sense, Twitter users are distributed quite widely around the globe. In addition to the relative concentration of users in certain countries, however, we also observe a very substantial concentration of users in a relatively small number of specific local clusters. 25 clusters account for 54 and 61 percent of the egos and alters respectively. This level of concentration exceeds the general concentration of the population in major urban agglomerations.⁷

Being in the same cluster also has a strong effect on the formation of ties: 39 percent of the ties between egos and alters fall within the same regional cluster. The large share of in-cluster ties can be partly explained by the substantial degree of clustering: when users are concentrated in a handful of places, a large share of ties would be local even if ties were formed randomly, disregarding location. The share of local (in-cluster) ties, however, is substantially higher than what we would expect just due to clustering. Considering the distribution of egos in our sample, only two percent of the ties would be local if the ties were formed randomly. (An average user’s cluster accounts for two percent of the total number of egos.)

7 For example, the New York cluster in our sample accounts for 17 percent of US-based egos, while the New York Metropolitan Area (which exceeds the size of our “New York” cluster) accounts for only 6 percent of the United States population. For the two main clusters located outside North America and Europe, the degree of concentration is even more substantial: the São Paulo cluster accounts for 37 percent of egos located in Brazil, while Tokyo accounts for 64 percent of those located in Japan.


World Population


Population Heterogeneity Social Networks 34, 82 (2012)

• Bernoulli process to generate adjacency matrix given a distance matrix between nodes

• Above some density threshold, the network is naturally connected.

C.T. Butts et al. / Social Networks 34 (2012) 82–100 85

Fig. 2. Effects of increasing order on connectivity and cohesion for random graphs of fixed expected density. Left panel shows probability of connectivity and biconnectivity by order and density. Right panel shows fraction belonging to each k-core (and no higher) and mean core number (dotted line) at 1% expected density, by graph order.

core number (black line) rises steadily, growing approximately linearly with N. As with connectivity, these behaviors may be altered somewhat by spatial clustering. They provide a useful intuition, however, for the “baseline” impact of in-filling on local structure.

Fig. 3. Emergence of local connectivity on an uneven population density surface. Where the threshold population density for an approximately uniform region of area ! is ω(!) (such that !ω(!) ≫ ln N/N), local connectivity emerges where dN/dx exceeds the threshold for intervals of appropriate area. Similar intervals of lower population remain locally disconnected.

Taking the strong relationship of mean core number with N at constant network density together with the in-filling principle, we would expect that regions of a given shape and area having higher local population densities will exhibit higher mean core numbers than equivalent regions of low population density (again, in the sense of Fig. 3). Since the scaling of mean core number with N is roughly linear (in contrast with the threshold behavior of connectivity), we also expect that variation in mean core number across the population surface will be much smoother than variation in local connectivity. At the same time, membership in cores of a particular order is likely to be rare until a particular population density threshold is reached (with that threshold varying depending on the order of the core in question). For phenomena expected to emerge only in subgroups of a particular level of cohesion, then, we may still expect qualitative shifts in behavior between high and low density regions. For phenomena assumed to vary quantitatively with local cohesion, direct proportionality to the local population seems a reasonable first guess.

Although considerable insight can be gleaned from studying stylized situations, the arguments of this section omit several potentially consequential factors. As noted earlier, clustering due to spatial effects should tend to increase the N needed for connectivity, relative to a homogeneous Bernoulli model with the same expected density. We can bound this effect as follows. Let G be the spatial Bernoulli graph on the vertices of region A associated with SIF F. Define $\delta' = \min_{\ell,\ell' \in A} F(D_{\ell\ell'})$, and let G′ be a homogeneous Bernoulli graph with parameter $\delta'$ on the same vertex set. Intuitively, every pair in A is adjacent with probability at least $\delta'$, and G′ can be thought of as supplying a “lower bound” on the true graph G.³ Now, define the “residual” Bernoulli graph R, with parameter matrix $\Phi$ such that $\Phi_{ij} = 1 - (1 - F(D_{ij}))/(1 - \delta')$. Let G′ ∪ R represent the union of G′ and R, i.e. the random graph in which an edge appears iff it appears in G′, R, or both. Since all edges of G′ ∪ R are independent and occur with probability $1 - (1 - \delta')(1 - \Phi_{ij}) = F(D_{ij})$, it follows that G ∼ G′ ∪ R.
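The decomposition above can be checked numerically. The sketch below assumes a small, made-up distance matrix and uses the power-law SIF with the “social friendship” parameters quoted elsewhere in these notes; the variable names `delta` and `phi` are stand-ins for the homogeneous parameter and the residual parameter matrix. It verifies that an edge present in G′ or R (or both) has exactly the probability the SIF assigns to that pair.

```python
import math

def sif(d, theta=(0.533, 0.032, 2.788)):
    """Power-law spatial interaction function F(d) = t1 / (1 + t2 * d) ** t3."""
    t1, t2, t3 = theta
    return t1 / (1.0 + t2 * d) ** t3


# Hypothetical pairwise distances between four vertices (arbitrary units).
D = [
    [0, 10, 40, 90],
    [10, 0, 30, 80],
    [40, 30, 0, 50],
    [90, 80, 50, 0],
]
n = len(D)
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

# The homogeneous parameter is the smallest edge probability over all pairs.
delta = min(sif(D[i][j]) for i, j in pairs)

# Residual edge probabilities: 1 - (1 - F(D_ij)) / (1 - delta).
phi = {(i, j): 1 - (1 - sif(D[i][j])) / (1 - delta) for i, j in pairs}

# An edge is missing from the union only if it is missing from both graphs,
# so its probability is 1 - (1 - delta) * (1 - phi_ij), which should equal F(D_ij).
for i, j in pairs:
    union_p = 1 - (1 - delta) * (1 - phi[(i, j)])
    assert math.isclose(union_p, sif(D[i][j]))
    assert 0.0 <= phi[(i, j)] <= 1.0

print(f"homogeneous parameter: {delta:.4f}")
```

The most distant pair gets a residual probability of zero, which is why the homogeneous graph acts as a floor for every pair.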

This “decomposition” of the spatial Bernoulli graph G into a homogeneous Bernoulli graph G′ and a residual graph R can provide us with a rigorous tool for understanding the behavior of G in more complex settings. For instance, if some draw g′ from G′ is connected, then g′ ∪ R is connected; thus, the probability of connectivity under G′ is a lower bound on the probability of connectivity under G. More generally, let z be any graph statistic such that z(x ∪ y) ≥ z(x) for graphs x and y having the same vertex set. Then clearly E z(G) ≥ E z(G′) and, for any given value $z_o$ of z, Pr(z(G) ≥ $z_o$) ≥ Pr(z(G′) ≥ $z_o$). Since many statistics of interest (e.g.,

3 E.g., we can generate G and G′ using the same random inputs, such that every edge in G′ is also in G; see Butts (2010) for a general treatment of this approach.

$$P(A = a \mid D) = \prod_{\{i,j\}} B\big(A_{ij} = a_{ij} \mid F(D_{ij}, \theta)\big)$$

! (�)


Vertex Placement Social Networks 34, 82 (2012)

Fig. 5. Comparison of uniform and quasi-random vertex placement, Quay County, NM MSA. Lines indicate census block boundaries, with artificial elevation shown via vertex color. Insets provide detail of 2 km × 2 km portion of Tucumcari, NM. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

data from (Festinger et al., 1950)), and can be thought of as a locally sparse relation with a fairly long tail (declining as approximately $d^{-2.8}$ for large distances). The second is from a “face-to-face interaction” relation based on data from Freeman et al. (1988), which acts as a locally dense relation that attenuates very quickly with distance (approximately $d^{-6.4}$). Both are of the general power law form $F(x) = \theta_1/(1 + \theta_2 x)^{\theta_3}$, with parameter vectors (0.533, 0.032, 2.788) in the case of the social friendship model, and (0.859, 0.035, 6.437) in the case of the model for face-to-face interaction. While we do not assume that all networks – or, indeed, that all types of friendship or interpersonal interaction – follow one of these two forms, we take these as plausible examples of the types of SIFs one is likely to see from proximate relations such as those from which the functions were derived. Similarities and differences in the networks formed by such functions thus provide us with some sense of the range of behaviors one might observe in similar settings.

In computing distances for purposes of the SIF, we employ the Euclidean distances between individual surface positions in the projected geometry, plus a small correction to account for the effects of the built environment (where applicable). Specifically, differences in artificial elevation are added to the base Euclidean distance for those within 25 m of one another, while the sums of individuals’ artificial elevation distances are added for pairs whose surface positions differ by more than 25 m. This simulates a basic feature of travel within the built environment, namely movement within a building for those who are otherwise proximate, versus movement down to ground level, over to the second position, and up for those with distant surface locations. While one could employ more complex schemes (including explicit adjustment for roadways, local obstructions, etc.), this would require more detailed knowledge of household position, built environment, and other aspects of local geography than were available for our test regions. As a practical matter, experimentation with reasonable alternatives to the approach employed here did not produce substantively different results. By way of explanation, it should be noted that different notions of distance tend to be very strongly correlated, even at fairly small scales. For instance, a comparison of Euclidean distances for the cases used here with Manhattan distances (the so-called “city block” metric sometimes suggested as an alternative for urban environments) produced a median correlation of approximately 0.99 under both uniform and quasi-random microdistribution models; even considering only very proximate points (within 100 m of each other in Euclidean space), the median correlation is still approximately 0.98 for both microdistribution models.⁵ While the impact of alternative distance models on network structure at smaller spatial scales is an interesting and potentially productive target for further research, our experiments suggest that the results reported here are not sensitive to reasonable variations in how distance is defined.
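The elevation correction described above is easy to misread, so a small sketch may help. The function name and the `(x, y, elevation)` tuple layout are hypothetical, and treating a surface distance of exactly 25 m as “within 25 m” is an assumption; the two rules themselves follow the description in the text.

```python
import math

def effective_distance(p, q, near_m=25.0):
    """Distance between two people given as (x, y, elevation) in metres.

    Within `near_m` of each other on the surface, the elevation difference is
    added (movement inside one building); otherwise both elevations are added
    (down to ground level, across, and up again).
    """
    surface = math.hypot(p[0] - q[0], p[1] - q[1])
    if surface <= near_m:
        return surface + abs(p[2] - q[2])
    return surface + p[2] + q[2]


# Same building: 5 m apart on the surface, floors at 9 m and 3 m.
print(effective_distance((0, 0, 9), (3, 4, 3)))
# Different buildings: 50 m apart on the surface, same floor heights.
print(effective_distance((0, 0, 9), (30, 40, 3)))
```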

3.1.4. Network simulation

Given the above, simulation proceeds as follows.⁶ For every location, we generate a population microdistribution using each placement model (uniform and quasi-random), subsequently generating 25 networks from each microdistribution using the two spatial models. The resulting set of 1600 networks is the primary basis for our subsequent comparative analysis. In addition to this larger sample, an additional single network was drawn in each condition. This smaller set of 64 networks was retained for within-network analysis.

3.2. Results

Our analysis of the simulated networks includes an examination of both within and between location heterogeneity. Here, we

5 Since Manhattan distance is affected by the choice of coordinates, we also considered the correlation of Euclidean distance against Manhattan distance on a randomly rotated axis set. The resulting median correlations were nearly identical (approximately 0.99 in the unconstrained case, and 0.97 within 100 m).

6 Simulation and analysis was performed using the statnet and sna libraries for R (Handcock et al., 2008; Butts, 2008) and the R spatial tools (Bivand et al., 2008), along with additional functions created by the authors.


Friendship probability Social Networks 34, 82 (2012)

• Probability that two people are friends as a function of distance:

$$F(d) = \frac{\theta_1}{(1 + \theta_2 d)^{\theta_3}}$$

• with $(\theta_1, \theta_2, \theta_3) = (0.533, 0.032, 2.788)$ for “social friendships” and $(0.859, 0.035, 6.437)$ for “face-to-face interactions”.
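As a rough illustration of how this friendship probability drives the Bernoulli process from the earlier slide, the sketch below evaluates F(d) for the two parameter sets and draws one random graph in which each pair is linked independently with probability F(d). The point layout, the distance units, and the function names are assumptions; the original simulations were run in R with statnet and sna, not with this code.

```python
import math
import random

FRIENDSHIP = (0.533, 0.032, 2.788)    # "social friendships"
FACE_TO_FACE = (0.859, 0.035, 6.437)  # "face-to-face interactions"

def sif(d, theta):
    """F(d) = theta1 / (1 + theta2 * d) ** theta3."""
    t1, t2, t3 = theta
    return t1 / (1.0 + t2 * d) ** t3


def sample_spatial_bernoulli(points, theta, seed=None):
    """Draw one graph: each pair {i, j} is linked independently with
    probability F(d_ij), where d_ij is the Euclidean distance."""
    rng = random.Random(seed)
    edges = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if rng.random() < sif(d, theta):
                edges.add((i, j))
    return edges


rng = random.Random(42)
# 200 hypothetical people placed uniformly in a 1000 x 1000 square
# (the distance unit is an assumption here).
people = [(rng.uniform(0, 1000), rng.uniform(0, 1000)) for _ in range(200)]
friends = sample_spatial_bernoulli(people, FRIENDSHIP, seed=1)
contacts = sample_spatial_bernoulli(people, FACE_TO_FACE, seed=1)
print(len(friends), "friendship ties;", len(contacts), "face-to-face ties")
```

The double loop is quadratic in the number of people, which is fine for a demonstration but would need spatial indexing for city-sized populations.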


Social Network Properties Social Networks 34, 82 (2012)

Fig. 12. Marginal degree distributions by location, SIF, and placement model. Friendship model distributions are shown in blue, interaction model distributions in black; solid lines indicate uniform placement, with quasi-random placement in dotted lines. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

the Poisson preferred in only 1. The Waring cases seem to be associated with the Hartford, CT and Cookeville, TN MSAs under the Friendship SIF, and may reflect particularly high levels of heterogeneity in these locations. These possible exceptions aside, the vast majority of cases can be seen to be well-approximated by distributions of the same form, despite differing in population size and land area by several orders of magnitude.

Turning to core number, we note in the marginal distributions of Fig. 13 the same combination of family resemblance and difference in detail seen earlier in Fig. 12. As before, we attempt to assess the presence of a common underlying distributional form by fitting models to each distribution, selecting that chosen by the AICC. The results of this process are shown in Table 6. Once again, we find that the negative binomial is overwhelmingly preferred, with the geometric distribution (itself a special case of the negative binomial) favored in two cases (both Hartford, CT Interaction SIF models) and the Waring favored in one. For core number, as for degree, then, the unique pattern of variation in each individual population surface nevertheless combines to generate a consistent family of marginal distributions.

Table 6. AICC selected models for core number distribution, by location, SIF, and placement model.

Location             Freeman/uniform  Festinger/uniform  Freeman/quasi  Festinger/quasi
Bristol Bay, AK      NB               NB                 NB             NB
Golden Valley, MT    NB               NB                 NB             NB
Esmeralda, NV        NB               NB                 NB             NB
Yakutat, AK          NB               NB                 NB             NB
Choctaw, MS          NB               NB                 NB             NB
Cheyenne, NE         NB               NB                 NB             NB
Quay, NM             NB               NB                 NB             NB
White Pine, NV       NB               NB                 NB             NB
Lawrence, KS         NB               NB                 NB             NB
Cookeville, TN       NB               NB                 W              NB
Idaho Falls, ID      NB               NB                 NB             NB
Navajo, AZ           NB               NB                 NB             NB
Honolulu, HI         NB               NB                 NB             NB
Hartford, CT         G                NB                 G              NB
Rochester, NY        NB               NB                 NB             NB
Salt Lake City, UT   NB               NB                 NB             NB

G, Geometric; NB, Negative Binomial; P, Poisson; W, Waring; Y, Yule.


Co-occurrences and Social Ties PNAS 107, 22436 (2010)

• Geotagged Flickr Photos

• Divide the world into a grid

• Count the number of cells in which two individuals were observed within a given time interval

randomly selected Flickr users have a 0.0134% chance of having a social tie, but when two users have multiple spatio-temporal co-occurrences, this probability grows significantly. For example, two people have almost a 60% chance (nearly 5,000 times the baseline probability) of having a social tie on Flickr when they have five co-occurrences at a temporal range of a day in distinct cells of side length equal to 1 latitude-longitude degree (about 80 km on a side at the mid latitudes). Moreover, this number is likely an underestimate of the true probability, because many Flickr users choose to keep their contact list private or do not use the social networking features of the site at all (and hence those social ties are missing from our ground truth data). Even with just three co-occurrences for this value of s and t, the probability is roughly 5%, which is more than 300 times greater than the prior probability of having a social tie in our dataset.

The dependence of the probability on the cell size s is more subtle: Because the co-occurrences are required to be in distinct cells, it is possible for k co-occurrences at a small value of s to all take place inside the same cell at a larger value of s. As a result, k co-occurrences in distinct 1° cells may be more or less informative than k co-occurrences in distinct 0.01° cells, because the latter may all take place close together. (For example, three co-occurrences that each take place within 0.01° of each other in New York City represent closer spatial proximity, but the fact that there are three of them may be less significant because they all take place within the same city; on the other hand, three co-occurrences that each take place within 1° of each other at points spread out across the United States represent less spatial proximity per co-occurrence, but collectively they may be more significant because they are taking place far apart from each other.) The presence of these counteracting forces is borne out in Fig. 2, in which we see that the probabilities of friendship do not necessarily increase as the cell size decreases.

In Fig. 3 we correct for this effect by counting at most one co-occurrence in any 1° cell, regardless of the value of s; this forces the total possible number of co-occurrences between two people to be 180 × 360 = 64,800 regardless of the spatial cell size s. With this correction in place, the probability of a social tie grows monotonically as the cell size s decreases; for example, with k = 3 and t equal to a day, the probability increases from about 5% for s = 1° to over 80% for s = 0.001°.

Another source of subtlety arises from the fact that the area of the spatial cells varies significantly over the surface of the globe, because degrees of longitude become closer together as one traverses the globe from the equator to the poles. To address this issue, we also performed our analysis using equal-area partitionings of the globe computed via HEALPix (4). We found that the results did not differ significantly, and hence in what follows we use the conceptually simpler cells measured in degrees.
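How much the cells shrink with latitude can be estimated with a short calculation. The sketch below assumes a spherical Earth and the common approximation of about 111.32 km per degree of latitude; the function name is hypothetical. At 45° latitude the east-west extent of a 1° cell comes out a little under 80 km, in line with the “about 80 km on a side at the mid latitudes” quoted earlier.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude

def cell_size_km(lat_deg, s_deg=1.0):
    """Approximate (east-west, north-south) extent in km of an s x s degree
    cell centred at latitude lat_deg, on a spherical Earth."""
    ns = s_deg * KM_PER_DEGREE
    ew = s_deg * KM_PER_DEGREE * math.cos(math.radians(lat_deg))
    return ew, ns


for lat in (0, 45, 60):
    ew, ns = cell_size_km(lat)
    print(f"lat {lat:2d}: {ew:6.1f} km east-west x {ns:6.1f} km north-south")
```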

A Model of Spatio-Temporal Co-occurrences. The fact that a very small number of co-occurrences can lead to orders-of-magnitude greater probabilities of a social tie suggests the need for a deeper investigation of the underlying phenomenon. We show that the basic effect is a robust one, in that it can arise even on very simple models of social networks, provided we have an appropriate probabilistic model for how activity is correlated across social ties. We begin with a simple model, followed by a richer one that matches the observed data more closely.

To formulate the simpler model, we suppose that the world is divided into N geographic cells (like those pictured in Fig. 1). There are M people, each having one social tie, so that the social network consists of M/2 disjoint edges. Each day, each pair of friends chooses to visit a place jointly with probability β and independently with probability 1 − β; in either case the choice of location(s) is made uniformly at random. Using Bayes’ Law, the probability that two people are friends (event F) given that they visit exactly the same cells on k consecutive days (event $C_k$) is

$$P(F \mid C_k) = \frac{P(F)\,P(C_k \mid F)}{P(C_k)}.$$

The prior probability that two people are friends, $P(F)$, is $\frac{1}{M-1}$, while the likelihood function $P(C_k \mid F)$ in the numerator is $p_1^k$, where $p_1$ is the probability of two friends being at the same place on a given day,

$$p_1 = \beta + \frac{1 - \beta}{N}.$$

The prior probability on observing k co-occurrences of two random people is

$$P(C_k) = P(C_k \mid F)\,P(F) + P(C_k \mid \bar{F})\,P(\bar{F}) = p_1^k \cdot \frac{1}{M-1} + p_2^k \cdot \frac{M-2}{M-1},$$

where $\bar{F}$ denotes the event that the two people are not friends, and $p_2 = \frac{1}{N}$ is the probability of a co-occurrence between two nonfriends. By substituting and simplifying into the Bayes’ Law equation, we have,

$$P(F \mid C_k) = \frac{p_1^k}{p_1^k + p_2^k\,(M - 2)}.$$

Fig. 4A presents a plot of this probability as a function of k (with parameters M = 7,500, N = 100, β = 0.05), showing a strong resemblance to the observed t = 1, s = 1 plot of Fig. 2D. Note that with M large and k small, this function simplifies to an exponential distribution,

$$P(F \mid C_k) \approx \frac{p_1^k}{M p_2^k} = \frac{1}{M}\, e^{k \log \frac{p_1}{p_2}} = \frac{1}{M}\, e^{k \log(\beta(N-1)+1)},$$

which explains the near-linear curve in the semilog plot in Fig. 4A, in which N and β jointly control the growth rate of the exponential function, and M controls the probability at k = 0.
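The simple model above is small enough to evaluate directly. The sketch below computes the posterior probability of friendship after k co-occurrences, together with the large-M, small-k approximation, using the parameter values quoted for Fig. 4A (M = 7,500, N = 100, β = 0.05). The function names are hypothetical; the formulas are those of the derivation, with p1 the same-cell probability for friends and p2 = 1/N the chance probability for nonfriends.

```python
def p_friend_given_cooccurrences(k, M, N, beta):
    """P(F | C_k) = p1**k / (p1**k + p2**k * (M - 2))."""
    p1 = beta + (1.0 - beta) / N  # friends share a cell on a given day
    p2 = 1.0 / N                  # non-friends share a cell by chance
    return p1 ** k / (p1 ** k + p2 ** k * (M - 2))


def approx(k, M, N, beta):
    """Large-M, small-k approximation (1/M) * (beta * (N - 1) + 1) ** k."""
    return (beta * (N - 1) + 1.0) ** k / M


M, N, beta = 7500, 100, 0.05  # parameters quoted for Fig. 4A
for k in range(0, 6):
    exact = p_friend_given_cooccurrences(k, M, N, beta)
    print(k, f"{exact:.4g}", f"{approx(k, M, N, beta):.4g}")
```

At k = 0 the exact expression reduces to the prior 1/(M − 1), and the approximation drifts away from the exact value as k grows and the posterior approaches 1.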

While this basic probabilistic model explains the major features of Fig. 2, it is too simple to capture all of the details, including the rapid probability increase between k = 0 and k = 1. To model the significance of a single co-occurrence, we take into account the principle of homophily: the fact that people connected by a social tie are more likely to engage in related activities, due to their inherent similarity, even when they are choosing independently. For example, two people who know each other are more likely to live close together and hence to visit places that are near each of them. To incorporate this notion, we extend the model to give each individual an attribute that is shared across social ties. As before, we assume that there are M people, each with exactly one social tie. The N geographic cells are arranged in a grid, and each pair of friends (A, B) has a randomly chosen “home” cell, drawn from the two-dimensional empirical distribution of Flickr photographs (used here as a proxy because we do not know actual

Fig. 1. Illustration of how spatio-temporal co-occurrences are counted, for some sample time-stamped observations of individuals A and B. The world is divided into discrete cells of size s × s, and we count the number of cells k in which the two individuals have been observed within a time threshold of t days; in this case, k = 3 when t is 2.
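The counting rule in this caption can be sketched in a few lines. The observation format `(lat, lon, day)`, the function names, and the sample data are hypothetical, and treating a gap of exactly t days as a match is an assumption; the rule itself (count distinct s × s degree cells in which both people were seen within t days of each other) follows the caption.

```python
import math
from collections import defaultdict

def cell_of(lat, lon, s):
    """Index of the s x s degree grid cell containing (lat, lon)."""
    return (math.floor(lat / s), math.floor(lon / s))


def count_cooccurrences(obs_a, obs_b, s, t):
    """Number of distinct cells where A and B were both observed within t days.

    obs_a and obs_b are lists of (lat, lon, day) tuples, with `day` a number.
    """
    days_a = defaultdict(list)
    days_b = defaultdict(list)
    for lat, lon, day in obs_a:
        days_a[cell_of(lat, lon, s)].append(day)
    for lat, lon, day in obs_b:
        days_b[cell_of(lat, lon, s)].append(day)

    k = 0
    for cell in days_a.keys() & days_b.keys():
        if any(abs(da - db) <= t for da in days_a[cell] for db in days_b[cell]):
            k += 1
    return k


# Made-up observations: (latitude, longitude, day of month).
person_a = [(40.70, -74.00, 1), (48.85, 2.35, 3), (35.68, 139.69, 6), (51.50, -0.12, 8)]
person_b = [(40.75, -73.98, 1), (48.86, 2.29, 2), (35.60, 139.70, 7), (51.51, -0.10, 1)]

print(count_cooccurrences(person_a, person_b, s=1.0, t=2))
```

Each cell contributes at most once, no matter how many times the two people overlapped there.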

Crandall et al. PNAS ∣ December 28, 2010 ∣ vol. 107 ∣ no. 52 ∣ 22437



Co-occurrences and Social Ties PNAS 107, 22436 (2010)

home cities of Flickr users), which approximately follows a power law with exponent 2.45. When A or B chooses a place on a given day, they sample from a distribution D(A,B), which is peaked around the home cell and decays with distance according to another power law distribution (with exponent γ) (5, 6). On each day, each person independently decides whether to visit a cell, with probability α, or to do nothing (and hence not be observed that day). If two friends each choose to visit a cell (an event with probability α²), then with probability β they visit the same cell, and with probability 1 − β their selections are independent. In all cases, they select cells from the distribution D(A,B).

The probabilities of friendship as a function of co-occurrence produced by this model (Fig. 4B) qualitatively approximate the distributions observed in the actual Flickr data (Fig. 2D) across the five time ranges we study (1 day, 7 days, 14 days, 28 days, and 1 year). (In contrast, multiple simplifications of this model that we investigated, including sampling home cells independent of the social network and substituting uniform or Gaussian distributions for the home cell and travel distributions, did not match the empirical observations well.) The values for the model parameters (M = 7,500, N = 64,800, α = 0.29, β = 0.12, γ = 1.8) were found by minimizing the Kolmogorov–Smirnov statistics between the distributions predicted by the model and those observed in the data, across all five time ranges, using a brute-force search over a grid of quantized parameter values. Better quantitative fits to the model are possible if the parameters are adjusted for each of the five temporal distributions separately; for example setting α = 0.55 and β = 0.05 gives a very good fit for the distribution corresponding to temporal range 1. A better fit for all time periods with a single set of model parameters could likely be achieved by explicitly modeling correlation of user activities across time, instead of assuming that all decisions are made on a day-by-day basis as our model currently does.

The analyses from these models thus indicate how very few co-occurrences can lead to a sharp increase in the probability of a social tie, even with an extremely simple underlying network

[Fig. 2 plot panels A–E. x-axis: # of contemporaneous events (0 to 20); y-axis: probability of friendship, on a linear scale (upper plots) and a log scale (lower plots); curves for time ranges of 1 day, 7 days, 14 days, 28 days, and 1 year; cell sizes s = 0.001°, 0.01°, 0.1°, 1.0°, and 10.0°.]

Fig. 2. The probability that two Flickr users have formed a social contact on the site, as a function of the number of times they have taken pictures at approximately the same place and time. A shows the probabilities for a spatial cell size of s = 0.001° (or around 80 meters in the middle latitudes) for time periods ranging from a day to a year. The lower plot is the same as the upper plot but with a log scale on the y-axis. Also shown are plots for cell size s equal to (B) 0.01°, (C) 0.1°, (D) 1.0°, and (E) 10.0°.
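Counting how often two users were at "approximately the same place and time" can be sketched as below. This is one plausible reading of the caption, not the paper's exact procedure: positions are binned into cells of s degrees, and a cell counts once if both users photographed it within `t_days` of each other. The function names and the tuple layout are assumptions made for the example.

```python
import math


def cell_of(lat, lon, s):
    """Index of the s-by-s degree grid cell containing (lat, lon)."""
    return (math.floor(lat / s), math.floor(lon / s))


def co_occurrences(photos_a, photos_b, s, t_days):
    """Count the distinct cells in which users A and B each took a photo
    within t_days of one another.

    Each photo is a (lat, lon, day) tuple, with day measured in days
    since an arbitrary origin."""
    days_a = {}
    for lat, lon, day in photos_a:
        days_a.setdefault(cell_of(lat, lon, s), []).append(day)
    shared = set()
    for lat, lon, day in photos_b:
        cell = cell_of(lat, lon, s)
        if any(abs(day - d) <= t_days for d in days_a.get(cell, [])):
            shared.add(cell)
    return len(shared)


# Invented sample data: two nearby photos on the same day, and two
# photos at the same spot twenty days apart.
alice = [(40.7128, -74.0060, 0), (48.8566, 2.3522, 10)]
bob = [(40.7130, -74.0055, 0.5), (48.8566, 2.3522, 30)]
```

With s = 0.01 the pair shares one cell for a 1-day window and two cells for a 28-day window, which mirrors how the curves in the figure are separated by time range.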

22438 ∣ www.pnas.org/cgi/doi/10.1073/pnas.1006155107 Crandall et al.


Human Mobility Nature 453, 779 (2008)

with $r_g^0 = 5.8$ km, $\beta_r = 1.65 \pm 0.15$ and $\kappa = 350$ km (Fig. 1d, see Supplementary Information for statistical validation). Levy flights are characterized by a high degree of intrinsic heterogeneity, raising the possibility that equation (2) could emerge from an ensemble of identical agents, each following a Levy trajectory. Therefore, we determined $P(r_g)$ for an ensemble of agents following a random walk (RW), Levy flight (LF) or truncated Levy flight (TLF) (Fig. 1d; refs 8, 12, 13). We found that an ensemble of Levy agents display a significant degree of heterogeneity in $r_g$; however, this was not sufficient to explain the truncated power-law distribution $P(r_g)$ exhibited by the mobile phone users. Taken together, Fig. 1c and d suggest that the difference in the range of typical mobility patterns of individuals ($r_g$) has a strong impact on the truncated Levy behaviour seen in equation (1), ruling out hypothesis A.

If individual trajectories are described by an LF or TLF, then the radius of gyration should increase with time as $r_g(t) \sim t^{3/(2+\beta)}$ (ref. 21), whereas, for an RW, $r_g(t) \sim t^{1/2}$; that is, the longer we observe a user, the higher the chance that she/he will travel to areas not visited before. To check the validity of these predictions, we measured the time dependence of the radius of gyration for users whose gyration radius would be considered small ($r_g(T) \le 3$ km), medium ($20 < r_g(T) \le 30$ km) or large ($r_g(T) > 100$ km) at the end of our observation period ($T = 6$ months). The results indicate that the time dependence of the average radius of gyration of mobile phone users is better approximated by a logarithmic increase, not only a manifestly slower dependence than the one predicted by a power law but also one that may appear similar to a saturation process (Fig. 2a and Supplementary Fig. 4).
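The radius of gyration used throughout this passage can be computed as the root-mean-square distance of a user's recorded positions from their centre of mass. The sketch below assumes positions already projected to planar (x, y) coordinates in km; the function names are illustrative and the projection step is left out.

```python
import math


def radius_of_gyration(positions):
    """Root-mean-square distance of the recorded positions from their
    centre of mass. positions is a list of (x, y) pairs in km."""
    n = len(positions)
    cx = sum(x for x, _ in positions) / n
    cy = sum(y for _, y in positions) / n
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in positions) / n
    return math.sqrt(mean_sq)


def gyration_over_time(positions):
    """r_g after each successive record, to inspect how it grows with
    the length of the observation window."""
    return [radius_of_gyration(positions[:k])
            for k in range(1, len(positions) + 1)]
```

Plotting the output of `gyration_over_time` against the record index is one simple way to check whether the growth looks like a power law or closer to the logarithmic increase reported in the text.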

In Fig. 2b, we chose users with similar asymptotic $r_g(T)$ after $T = 6$ months, and measured the jump size distribution $P(\Delta r \mid r_g)$ for each group. As the inset of Fig. 2b shows, users with small $r_g$ travel mostly over small distances, whereas those with large $r_g$ tend to display a combination of many small and a few larger jump sizes. Once we rescaled the distributions with $r_g$ (Fig. 2b), we found that the data collapsed into a single curve, suggesting that a single jump size distribution characterizes all users, independent of their $r_g$. This indicates that

$$P(\Delta r \mid r_g) \sim r_g^{-\alpha} F(\Delta r / r_g),$$

where $\alpha \approx 1.2 \pm 0.1$ and $F(x)$ is an $r_g$-independent function with asymptotic behaviour, that is, $F(x) \sim x^{-\alpha}$ for $x < 1$ and $F(x)$ rapidly decreases for $x \gg 1$. Therefore, the travel patterns of individual users may be approximated by a Levy flight up to a distance characterized by $r_g$. Most important, however, is the fact that the individual trajectories are bounded beyond $r_g$; thus, large displacements, which are the source of the distinct and anomalous nature of Levy flights, are statistically absent. To understand the relationship between the different exponents, we note that the measured probability distributions are related

Figure 1 | Basic human mobility patterns. a, Week-long trajectory of 40 mobile phone users indicates that most individuals travel only over short distances, but a few regularly move over hundreds of kilometres. b, The detailed trajectory of a single user. The different phone towers are shown as green dots, and the Voronoi lattice in grey marks the approximate reception area of each tower. The data set studied by us records only the identity of the closest tower to a mobile user; thus, we cannot identify the position of a user within a Voronoi cell. The trajectory of the user shown in b is constructed from 186 two-hourly reports, during which the user visited a total of 12 different locations (tower vicinities). Among these, the user is found on 96 and 67 occasions in the two most preferred locations; the frequency of visits for each location is shown as a vertical bar. The circle represents the radius of gyration centred in the trajectory’s centre of mass. c, Probability density function $P(\Delta r)$ of travel distances obtained for the two studied data sets $D_1$ and $D_2$. The solid line indicates a truncated power law for which the parameters are provided in the text (see equation (1)). d, The distribution $P(r_g)$ of the radius of gyration measured for the users, where $r_g(T)$ was measured after $T = 6$ months of observation. The solid line represents a similar truncated power-law fit (see equation (2)). The dotted, dashed and dot-dashed curves show $P(r_g)$ obtained from the standard null models (RW, LF and TLF, respectively), where for the TLF we used the same step size distribution as the one measured for the mobile phone users.

LETTERS NATURE | Vol 453 | 5 June 2008

780 Nature Publishing Group ©2008

$$P(\Delta r) = (\Delta r + \Delta r_0)^{-\beta} \exp(-\Delta r / \kappa)$$
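A truncated power law of the form $(x + x_0)^{-\beta} \exp(-x/\kappa)$ behaves like a plain power law for $x \ll \kappa$ and is cut off exponentially beyond $\kappa$. The sketch below evaluates the unnormalised form with the radius-of-gyration parameters quoted on the excerpted page ($r_g^0 = 5.8$ km, $\beta_r = 1.65$, $\kappa = 350$ km). The function name is illustrative, and assuming that equation (2) has this same functional form is a reading of the caption's "similar truncated power-law fit".

```python
import math


def truncated_power_law(x, x0, beta, kappa):
    """Unnormalised (x + x0)^(-beta) * exp(-x / kappa)."""
    return (x + x0) ** (-beta) * math.exp(-x / kappa)


# Parameters quoted in the text for the radius of gyration, in km.
R0, BETA_R, KAPPA = 5.8, 1.65, 350.0

for r_g in (1, 10, 100, 1000):
    full = truncated_power_law(r_g, R0, BETA_R, KAPPA)
    cutoff = math.exp(-r_g / KAPPA)  # share of the power law that survives
    print(f"r_g = {r_g:5d} km  P ~ {full:.3e}  cutoff factor = {cutoff:.3f}")
```

The cutoff factor stays close to 1 at a few kilometres and drops sharply near and beyond 350 km, which is the sense in which very large displacements are suppressed.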

Cell Phones


Human Mobility Nature 453, 779 (2008)

Received 19 December 2007; accepted 27 March 2008.

1. Horner, M. W. & O’Kelly, M. E. S. Embedding economies of scale concepts for hub networks design. J. Transp. Geogr. 9, 255–265 (2001).
2. Kitamura, R., Chen, C., Pendyala, R. M. & Narayaran, R. Micro-simulation of daily activity-travel patterns for travel demand forecasting. Transportation 27, 25–51 (2000).
3. Colizza, V., Barrat, A., Barthelemy, M., Valleron, A.-J. & Vespignani, A. Modeling the worldwide spread of pandemic influenza: baseline case and containment interventions. PLoS Medicine 4, 95–110 (2007).
4. Eubank, S. et al. Controlling epidemics in realistic urban social networks. Nature 429, 180–184 (2004).
5. Hufnagel, L., Brockmann, D. & Geisel, T. Forecast and control of epidemics in a globalized world. Proc. Natl Acad. Sci. USA 101, 15124–15129 (2004).
6. Kleinberg, J. The wireless epidemic. Nature 449, 287–288 (2007).
7. Brockmann, D. D., Hufnagel, L. & Geisel, T. The scaling laws of human travel. Nature 439, 462–465 (2006).
8. Havlin, S. & Ben-Avraham, D. Diffusion in disordered media. Adv. Phys. 51, 187–292 (2002).
9. Viswanathan, G. M. et al. Levy flight search patterns of wandering albatrosses. Nature 381, 413–415 (1996).
10. Ramos-Fernandez, G. et al. Levy walk patterns in the foraging movements of spider monkeys (Ateles geoffroyi). Behav. Ecol. Sociobiol. 273, 1743–1750 (2004).
11. Sims, D. W. et al. Scaling laws of marine predator search behaviour. Nature 451, 1098–1102 (2008).
12. Klafter, J., Shlesinger, M. F. & Zumofen, G. Beyond brownian motion. Phys. Today 49, 33–39 (1996).
13. Mantegna, R. N. & Stanley, H. E. Stochastic process with ultraslow convergence to a gaussian: the truncated Levy flight. Phys. Rev. Lett. 73, 2946–2949 (1994).
14. Edwards, A. M. et al. Revisiting Levy flight search patterns of wandering albatrosses, bumblebees and deer. Nature 449, 1044–1049 (2007).
15. Sohn, T. et al. in Proc. 8th Int. Conf. UbiComp 2006 212–224 (Springer, Berlin, 2006).
16. Onnela, J.-P. et al. Structure and tie strengths in mobile communication networks. Proc. Natl Acad. Sci. USA 104, 7332–7336 (2007).
17. Gonzalez, M. C. & Barabasi, A.-L. Complex networks: from data to models. Nature Physics 3, 224–225 (2007).
18. Palla, G., Barabasi, A.-L. & Vicsek, T. Quantifying social group evolution. Nature 446, 664–667 (2007).
19. Hidalgo, C. A. & Rodriguez-Sickert, C. The dynamics of a mobile phone network. Physica A 387, 3017–3024 (2008).
20. Barabasi, A.-L. The origin of bursts and heavy tails in human dynamics. Nature 435, 207–211 (2005).
21. Redner, S. A Guide to First-Passage Processes (Cambridge Univ. Press, Cambridge, UK, 2001).
22. Condamin, S., Benichou, O., Tejedor, V. & Klafter, J. First-passage times in complex scale-invariant media. Nature 450, 77–80 (2007).
23. Schlich, R. & Axhausen, K. W. Habitual travel behaviour: evidence from a six-week travel diary. Transportation 30, 13–36 (2003).
24. Eagle, N. & Pentland, A. Eigenbehaviours: identifying structure in routine. Behav. Ecol. Sociobiol. (in the press).
25. Yook, S.-H., Jeong, H. & Barabasi, A. L. Modeling the Internet’s large-scale topology. Proc. Natl Acad. Sci. USA 99, 13382–13386 (2002).
26. Caldarelli, G. Scale-Free Networks: Complex Webs in Nature and Technology (Oxford Univ. Press, New York, 2007).
27. Dorogovtsev, S. N. & Mendes, J. F. F. Evolution of Networks: From Biological Nets to the Internet and WWW (Oxford Univ. Press, New York, 2003).
28. Song, C. M., Havlin, S. & Makse, H. A. Self-similarity of complex networks. Nature 433, 392–395 (2005).
29. Gonzalez, M. C., Lind, P. G. & Herrmann, H. J. A system of mobile agents to model social networks. Phys. Rev. Lett. 96, 088702 (2006).
30. Cecconi, F., Marsili, M., Banavar, J. R. & Maritan, A. Diffusion, peer pressure, and tailed distributions. Phys. Rev. Lett. 89, 088102 (2002).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements We thank D. Brockmann, T. Geisel, J. Park, S. Redner, Z. Toroczkai, A. Vespignani and P. Wang for discussions and comments on the manuscript. This work was supported by the James S. McDonnell Foundation 21st Century Initiative in Studying Complex Systems, the National Science Foundation within the DDDAS (CNS-0540348), ITR (DMR-0426737) and IIS-0513650 programs, and the US Office of Naval Research Award N00014-07-C. Data analysis was performed on the Notre Dame Biocomplexity Cluster supported in part by the NSF MRI grant number DBI-0420980. C.A.H. acknowledges support from the Kellogg Institute at Notre Dame.

Author Information Reprints and permissions information is available at www.nature.com/reprints. Correspondence and requests for materials should be addressed to A.-L.B. ([email protected]).

[Figure 3, panels a–d. Axis labels: x (km), y (km), x/σ_x, y/σ_y, σ_y/σ_x, r_g (km), Φ̃(x/σ_x, 0). Colour scale from Φ_min to Φ_max. Groups: r_g ≤ 3 km, 20 km < r_g < 30 km, r_g > 100 km.]

Figure 3 | The shape of human trajectories. a, The probability density function $\Phi(x, y)$ of finding a mobile phone user in a location (x, y) in the user’s intrinsic reference frame (see Supplementary Information for details). The three plots, from left to right, were generated for 10,000 users with: $r_g \le 3$, $20 < r_g \le 30$ and $r_g > 100$ km. The trajectories become more anisotropic as $r_g$ increases. b, After scaling each position with $\sigma_x$ and $\sigma_y$, the resulting $\tilde{\Phi}(x/\sigma_x, y/\sigma_y)$ has approximately the same shape for each group. c, The change in the shape of $\Phi(x, y)$ can be quantified calculating the isotropy ratio $S \equiv \sigma_y/\sigma_x$ as a function of $r_g$, which decreases as $S \sim r_g^{-0.12}$ (solid line). Error bars represent the standard error. d, $\tilde{\Phi}(x/\sigma_x, 0)$ representing the x-axis cross-section of the rescaled distribution $\tilde{\Phi}(x/\sigma_x, y/\sigma_y)$ shown in b.


Cell Phones


Privacy Sci Rep 3, 1376 (2013)

function fits the data better than other two-parameter functions such as $\alpha - \exp(\lambda x)$, a stretched exponential $\alpha - \exp(x^{\beta})$, or a standard linear function $\alpha - \beta x$ (see Table S1). Both estimators for $\alpha$ and $\beta$ are highly significant ($p < 0.001$) (ref. 32), and the mean pseudo-$R^2$ is 0.98 for the $I_{p=4}$ case and the $I_{p=10}$ case. The fit is good at all levels of spatial and temporal aggregation [Fig. S3A–B].

The power-law dependency of $\mathcal{E}$ means that, on average, each time the spatial or temporal resolution of the traces is divided by two, their uniqueness decreases by a constant factor $\sim (2)^{-\beta}$. This implies that privacy is increasingly hard to gain by lowering the resolution of a dataset.

Fig. 2B shows that, as expected, $\mathcal{E}$ increases with $p$. The mitigating effect of $p$ on $\mathcal{E}$ is mediated by the exponent $\beta$ which decays linearly with $p$: $\beta = 0.157 - 0.007p$ [Fig. 4E]. The dependence of $\beta$ on $p$ implies that a few additional points might be all that is needed to identify an individual in a dataset with a lower resolution. In fact, given four points, a two-fold decrease in spatial or temporal resolution makes it 9.3% less likely to identify an individual, while given ten points, the same two-fold decrease results in a reduction of only 6.2% (see Table S1).
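The arithmetic behind these percentages can be checked with a short sketch that combines the linear fit of the exponent with the factor $(2)^{-\beta}$ quoted above. The function names are illustrative. Under the linear fit the reductions come out at roughly 8.6% for four points and 5.9% for ten, close to but not identical to the 9.3% and 6.2% quoted from Table S1, which presumably rely on the exponents fitted separately for each case.

```python
def beta(p):
    """Linear fit of the exponent quoted in the text (Fig. 4E)."""
    return 0.157 - 0.007 * p


def uniqueness_factor(p, coarsening=2.0):
    """Multiplicative change in uniqueness when the resolution is made
    `coarsening` times coarser, using the factor (2)^(-beta) from the text."""
    return coarsening ** (-beta(p))


for p in (4, 10):
    drop = 100 * (1 - uniqueness_factor(p))
    print(f"p = {p:2d}: beta = {beta(p):.3f}, uniqueness drops by about {drop:.1f}%")
```

The point the numbers make is qualitative: the more points an attacker knows, the smaller the exponent, and the less protection each halving of resolution buys.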

Because of the functional dependency of $\mathcal{E}$ on $p$ through the exponent $\beta$, mobility datasets are likely to be re-identifiable using information on only a few outside locations.

Discussion

Our ability to generalize these results to other mobility datasets depends on the sensitivity of our analysis to extensions of the data to larger populations, or geographies. An increase in population density will tend to decrease $\mathcal{E}$. Yet, it will also be accompanied by an increase in the number of antennas, businesses or WiFi hotspots used for localizations. These effects run opposite to each other, and therefore, suggest that our results should generalize to higher population densities.

Extensions of the geographical range of observation are also unlikely to affect the results as human mobility is known to be highly circumscribed. In fact, 94% of the individuals move within an average radius of less than 100 km (ref. 17). This implies that geographical extensions of the dataset will stay locally equivalent to our observations, making the results robust to changes in geographical range.

From an inference perspective, it is worth noticing that the spatio-temporal points do not equally increase the likelihood of uniquely identifying a trace. Furthermore, the information added by a point is highly dependent on the points already known. The amount of information gained by knowing one more point can be defined as the reduction of the cardinality of $S(I_p)$ associated with this extra point. The larger the decrease, the more useful the piece of information is. Intuitively, a point on the MIT campus at 3AM is more likely to make a trace unique than a point in downtown Boston on a Friday evening.

This study is likely to underestimate $\mathcal{E}$, and therefore the ease of re-identification, as the spatio-temporal points are drawn at random from users’ mobility traces. Our $I_p$ are thus subject to the user’s spatial and temporal distributions. Spatially, it has been shown that the uncertainty of a typical user’s whereabouts measured by its

Figure 2 | (A) $I_{p=2}$ means that the information available to the attacker consists of two 7am-8am spatio-temporal points (I and II). In this case, the target was in zone I between 9am and 10am and in zone II between 12pm and 1pm. In this example, the traces of two anonymized users (red and green) are compatible with the constraints defined by $I_{p=2}$. The subset $S(I_{p=2})$ contains more than one trace and is therefore not unique. However, the green trace would be uniquely characterized if a third point, zone III between 3pm and 4pm, is added ($I_{p=3}$). (B) The uniqueness of traces with respect to the number $p$ of given spatio-temporal points ($I_p$). The green bars represent the fraction of unique traces, i.e. $|S(I_p)| = 1$. The blue bars represent the fraction of $|S(I_p)| \le 2$. Therefore knowing as few as four spatio-temporal points taken at random ($I_{p=4}$) is enough to uniquely characterize 95% of the traces amongst 1.5 M users. (C) Box-plot of the minimum number of spatio-temporal points needed to uniquely characterize every trace on the non-aggregated database. At most eleven points are enough to uniquely characterize all considered traces.
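The set $S(I_p)$ and the uniqueness estimate in panel B can be illustrated with a toy version of the red/green example from the caption. This is a sketch under simplifying assumptions: a trace is a set of (zone, hour) points, the fourth zone "IV" for the red user is invented, and `uniqueness` estimates the fraction of draws with $|S(I_p)| = 1$ by random sampling rather than reproducing the paper's procedure.

```python
import random


def compatible_traces(traces, known_points):
    """S(I_p): the users whose trace contains every known point.

    traces maps a user id to a set of (zone, hour) points."""
    return {user for user, trace in traces.items() if known_points <= trace}


def uniqueness(traces, p, trials=1000, seed=0):
    """Estimate the fraction of draws in which p random points taken from
    a user's own trace single that user out, i.e. |S(I_p)| = 1."""
    rng = random.Random(seed)
    eligible = [u for u, trace in traces.items() if len(trace) >= p]
    unique = 0
    for _ in range(trials):
        user = rng.choice(eligible)
        known = set(rng.sample(sorted(traces[user]), p))
        if len(compatible_traces(traces, known)) == 1:
            unique += 1
    return unique / trials


# Toy traces modelled on the caption: both users share zones I and II,
# and only the third point tells them apart.
traces = {
    "red": {("I", 9), ("II", 12), ("IV", 15)},
    "green": {("I", 9), ("II", 12), ("III", 15)},
}
```

Here `compatible_traces(traces, {("I", 9), ("II", 12)})` returns both users, while adding `("III", 15)` leaves only the green trace, matching the caption's example.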

[Figure 3, panels A–C. Axis labels: Number of interactions; Median inter-interactions time per user [h]; Number of antennas; Inhabitants; Probability density function.]

Figure 3 | (A) Probability density function of the amount of recorded spatio-temporal points per user during a month. (B) Probability density function of the median inter-interaction time with the service. (C) The number of antennas per region is correlated with its population ($R^2 = 0.6426$). These plots strongly emphasize the discrete character of our dataset and its similarities with datasets such as the one collected by smartphone apps.






Cell Phones


Privacy Sci Rep 3, 1376 (2013)

entropy is 1.74, less than two locations (ref. 18). This makes our random choices of points likely to pick the user’s top locations (typically “home” and “office”). Temporally, the distribution of calls during the week is far from uniform [Fig. S1] which makes our random choice more likely to pick a point at 4PM than at 3AM. However, even in this case, the traces we considered that are most difficult to identify can be uniquely identified knowing only 11 locations [Fig. 2C].

For the purpose of re-identification, more sophisticated approaches could collect points that are more likely to reduce the uncertainty, exploit irregularities in an individual’s behaviour, or implicitly take into account information such as home and workplace or travels abroad (refs 29, 33). Such approaches are likely to reduce the number of locations required to identify an individual, vis-a-vis the average uniqueness of traces.

We showed that the uniqueness of human mobility traces is high, thereby emphasizing the importance of the idiosyncrasy of human movements for individual privacy. Indeed, this uniqueness means that little outside information is needed to re-identify the trace of a targeted individual even in a sparse, large-scale, and coarse mobility dataset. Given the amount of information that can be inferred from mobility data, as well as the potentially large number of simply anonymized mobility datasets available, this is a growing concern. We further showed that while $\mathcal{E} \sim (v h)^{\beta}$, $\beta \sim -p/100$. Together, these determine the uniqueness of human mobility traces given the traces’ resolution and the available outside information. These results should inform future thinking in the collection, use, and protection of mobility data. Going forward, the importance of location data will only increase (ref. 34) and knowing the bounds of individual’s privacy will

[Figure 4, panels A–E. Axis labels: Temporal resolution [h]; Spatial resolution [v]; Normalized uniqueness of traces; p; β. Legends: 1 to 13 cells; 1 to 13 hours. Colour scale: Uniqueness of traces. Panel E fit: β = 0.157 − 0.007p.]

Figure 4 | Uniqueness of traces [$\mathcal{E}$] when we lower the resolution of the dataset with (A) $p = 4$ and (D) $p = 10$ points. It is easier to attack a dataset that is coarse on one dimension and fine along another than a medium-grained dataset along both dimensions. Given four spatio-temporal points, more than 60% of the traces are uniquely characterized in a dataset with an $h = 15$-hours temporal resolution while less than 40% of the traces are uniquely characterized in a dataset with a temporal resolution of $h = 7$ hours and with clusters of $v = 7$ antennas. The region covered by an antenna ranges from 0.15 km² in urban areas to 15 km² in rural areas. (B–C) When lowering the temporal or the spatial resolution of the dataset, the uniqueness of traces decreases as a power function $\mathcal{E} = \alpha - x^{\beta}$. (E) While $\mathcal{E}$ decreases according to a power function, its exponent $\beta$ decreases linearly with the number of points $p$. Accordingly, a few additional points might be all that is needed to identify an individual in a dataset with a lower resolution.



Cell Phones


Gravity Law of Commuting PNAS 106, 21484 (2009)

US county commuting network

[Diagram: two nodes i and j joined by a link of weight w_ij.]

each node i: subpopulation (census area)
each link (ij): interaction between subpopulations i and j
weight w_ij: number of people commuting from i to j per unit time
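The network defined on this slide is a directed, weighted graph, and a dictionary of dictionaries is one simple way to hold it. The sketch below is illustrative only: the county labels and commuter counts are invented, and the helper names are not taken from any library.

```python
from collections import defaultdict

# w[i][j] = number of people commuting from i to j per unit time.
# County labels and counts below are invented for illustration.
commuting = defaultdict(dict)


def add_flow(w, origin, destination, commuters):
    """Record a directed, weighted link (origin -> destination)."""
    w[origin][destination] = w[origin].get(destination, 0) + commuters


def out_commuters(w, i):
    """Total number of people leaving subpopulation i."""
    return sum(w.get(i, {}).values())


def in_commuters(w, j):
    """Total number of people arriving in subpopulation j."""
    return sum(flows.get(j, 0) for flows in w.values())


add_flow(commuting, "A", "B", 1200)
add_flow(commuting, "A", "C", 300)
add_flow(commuting, "B", "A", 450)
```

Keeping the two directions separate matters because commuting flows are generally asymmetric: w_ij and w_ji are different quantities.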


Gravity Law of Commuting PNAS 106, 21484 (2009)

[Figure, panels A–F. Axis labels: Distance (km); Population of origin; Population of destination; w(D) / w(M); w(D) / (N_i N_j), with the exponents lost in extraction.]

136 D. Balcan et al. / Journal of Computational Science 1 (2010) 132–145

Table 1
Commuting networks in each continent. Number of countries (N), number of administrative units (V) and inter-links between them (E) are summarized.

Continent        N     V         E
Europe           17    65,880    4,490,650
North America    2     6,986     182,255
Latin America    5     4,301     102,117
Asia             4     4,355     380,385
Oceania          2     746       30,679
Total            30    82,268    5,186,186

commuting. This allows us to deal with units that are self-similar across the world with respect to mobility, as they emerge from the tessellation rather than from country-specific administrative boundaries. We have therefore mapped the different levels of commuting data into the geographical census areas formed by the Voronoi-like tessellation procedure described above. The mapped commuting flows can be seen as a second transport network connecting subpopulations that are geographically close. This second network can be overlaid on the WAN in a multi-scale fashion to simulate realistic scenarios for disease spreading. The network exhibits important variability in the number of commuters on each connection as well as in the total number of commuters per geographical census area. Because the census areas are statistically homogeneous, we can also extract a general statistical law that allows for the synthetic generation of commuting networks in countries where real data are not available. A full account of the commuting data obtained across different continents and their statistical analysis can be found in Ref. [2].

3.3. Disease model

Each geographical census area corresponds to a subpopulation in the metapopulation model. The infection dynamics within each subpopulation is governed by a disease-specific compartmental model in which we assume homogeneous mixing in the population. Although the model can use any compartmental structure, for the sake of clarity we will carry on our discussion by using the explicit example of a typical influenza-like illness (ILI) where we consider a Susceptible-Latent-Infectious-Recovered (SLIR) compartmental scheme. In Fig. 3, a diagram of the compartmental structure with transitions between compartments is shown. The contagion process, i.e., generation of new infections, is the only transition mechanism which is altered by short-range mobility, whereas all the other transitions between compartments are spontaneous and remain unaffected by the commuting. The rate at which a susceptible individual in subpopulation j acquires the infection, the so-called force of infection $\lambda_j$, is determined by interactions with infectious persons either in the home subpopulation j or in its neighboring subpopulations on the commuting network. In general, the force of infection is assumed to follow the mass action principle for which the infection rate is $\lambda = \beta I / N$ where $\beta$ is the infection transmission rate and $I / N$ is the density of infected individuals in the population. In the case of asymptomatic individuals the force of infection is usually reduced by a factor $r_\beta$. In the case of multiple interacting subpopulations and different classes of infectives the force of infection will be the sum of different contributions as reported in Section 4.3.

Table 2
Transitions between compartments and their rates.

Transition          Type           Rate
S_j → L_j           Contagion      λ_j
L_j → I_j^a         Spontaneous    ε p_a
L_j → I_j^t         Spontaneous    ε (1 − p_a) p_t
L_j → I_j^nt        Spontaneous    ε (1 − p_a)(1 − p_t)
I_j^a → R_j         Spontaneous    μ
I_j^t → R_j         Spontaneous    μ
I_j^nt → R_j        Spontaneous    μ

Given the force of infection $\lambda_j$ in subpopulation j, each person in the susceptible compartment ($S_j$) contracts the infection with probability $\lambda_j \Delta t$ and enters the latent compartment ($L_j$), where $\Delta t$ is the time interval considered. Latent individuals exit the compartment with probability $\varepsilon \Delta t$, and transit to asymptomatic infectious compartment ($I_j^a$) with probability $p_a$ or, with the complementary probability $1 - p_a$, become symptomatic infectious. Infectious persons with symptoms are further divided between those who can travel ($I_j^t$), probability $p_t$, and those who are travel-restricted ($I_j^{nt}$) with probability $1 - p_t$. All the infectious persons permanently recover with probability $\mu \Delta t$, entering the recovered compartment ($R_j$) in the next time step. All transitions and corresponding rates are summarized in Table 2 and in Fig. 3.
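The transition rules above can be sketched as one stochastic time step for a single, isolated subpopulation. This is a minimal illustration under stated assumptions, not the paper's implementation: mobility and coupling between subpopulations are ignored, the force of infection is taken as $\beta (I^t + I^{nt} + r_\beta I^a) / N$ (one reading of the reduction factor for asymptomatic carriers), and each individual is drawn separately, which gives the same distribution as sampling the counts from binomial or multinomial draws. All names and parameter values are illustrative.

```python
import random


def slir_step(state, beta, r_beta, eps, mu, p_a, p_t, dt, rng):
    """Advance one subpopulation by one time step of length dt.

    state maps the compartment names S, L, Ia, It, Int, R to head counts.
    Assumes the population is not empty."""
    n = sum(state.values())
    # Force of infection: asymptomatic carriers count with weight r_beta.
    lam = beta * (state["It"] + state["Int"] + r_beta * state["Ia"]) / n
    new = dict(state)

    # S -> L with probability lam * dt per susceptible individual.
    infected = sum(rng.random() < lam * dt for _ in range(state["S"]))
    new["S"] -= infected
    new["L"] += infected

    # L -> Ia / It / Int: leave with probability eps * dt, then branch.
    for _ in range(state["L"]):
        if rng.random() < eps * dt:
            new["L"] -= 1
            if rng.random() < p_a:
                new["Ia"] += 1      # asymptomatic
            elif rng.random() < p_t:
                new["It"] += 1      # symptomatic, still travelling
            else:
                new["Int"] += 1     # symptomatic, travel-restricted

    # Ia, It, Int -> R with probability mu * dt each.
    for comp in ("Ia", "It", "Int"):
        recovered = sum(rng.random() < mu * dt for _ in range(state[comp]))
        new[comp] -= recovered
        new["R"] += recovered
    return new
```

Every transition is drawn from the counts at the start of the step, so an individual infected during a step cannot also leave the latent compartment in that same step, and the total population is conserved.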

4. Epidemic and mobility dynamics

Once the mobility data layers and the disease dynamics have been defined, the number of individuals in each compartment [m] and subpopulation j follows a discrete and stochastic dynamical equation that reads as

$$X_j^{[m]}(t + \Delta t) - X_j^{[m]}(t) = \Delta X_j^{[m]} + \Omega_j([m]) \qquad (1)$$

where the term $\Delta X_j^{[m]}$ represents the change due to the compartment transitions induced by the disease dynamics and the transport operator $\Omega_j([m])$ represents the variations due to the traveling and mobility of individuals. The latter operator takes into account the long-range airline mobility and sets the minimal time scale of integration at 1 day. The mobility due to the commuting flows is

Fig. 3. Compartmental structure of the epidemic model within each subpopulation. A susceptible individual in contact with a symptomatic or asymptomatic infectious person contracts the infection at rate $\beta$ or $r_\beta \beta$, respectively, and enters the latent compartment where he is infected but not yet infectious. At the end of the latency period $\varepsilon^{-1}$, each latent individual becomes infectious, entering the symptomatic compartments with probability $1 - p_a$ or becoming asymptomatic with probability $p_a$. The symptomatic cases are further divided between those who are allowed to travel (with probability $p_t$) and those who would stop traveling when ill (with probability $1 - p_t$). Infectious individuals recover permanently with rate $\mu$. All transition processes are modeled through multinomial processes.


Mobility and Social Networks

and for their dependence on the distance. The error Err of this null model is between 0.66 and 0.76 for the three countries, around twice the error of the TF model (see Figure 6).

The linking model (L model) is a simplified version of the TF model, without random mobility and with the box size $d \to 0$. Agents move to visit their contacts with probability $p_v$, whereas with probability $1 - p_v$ they do not perform any action. In this version of the model, users can connect only by random connections or when two of them coincide while visiting a common friend, which leads to triadic closure. These two processes do not depend on the distances between the users. A thorough description can be obtained with a mean-field approach (see the corresponding section). The results of the L model are shown in Figure 2. Due to the triangle closing mechanism, this null model creates networks with a considerable level of clustering. However, it does not reproduce the distance dependencies of $P_l(d)$, $R(d)$, $J_f(d)$ and $C(d)$. The error Err of the L model is also around twice as high as the error of the TF model (see Figure 6).

The geography and the structure are coupled in the TF model through the random mobility. Changes in the underlying mobility mechanism affect the quality of the results. The lowest Err values are obtained with the power-law distribution in the jump lengths, while normal or uniformly distributed jumps yield worse results (e.g., for the US the TF model has Err lower by 0.5 and 1.5 than the TF-normal and the TF-uniform models, respectively, as shown in Figure 6).

Simplified models that neglect either geography or network structure perform considerably worse than the TF model in reproducing the properties of real networks. Likewise, non-realistic assumptions on the human mobility mechanism yield worse results than the default TF model. To conclude, the coupling of geography and structure through a realistic mobility mechanism produces networks with significantly more realistic geographic and structural properties.

Sensitivity of the TF Model to the Parameters and its Modifications

The results presented so far have been obtained at the optimal values of $p_v$ and $p_c$. The question remains, however, of how robust these results are to changes in the values of the parameters. In Figure 7, we report the effect of varying $p_v$ while $p_c$ is kept constant at its optimal value. The linking probability $P_l(d)$ loses its power-law shape for very low values of $p_v$, marking the limit in which random mobility is the main mechanism for the agents' traveling, to the detriment of friend visits. In this case, most of the links are created due to encounters occurring in nearby locations or are

Figure 4. Simulation results: mobility and social networks. Mobility (upper row) and ego networks (lower row) of 20 random users (different colors) for the instances of the TF model yielding the lowest error Err (see Figure 3). Mobility network shows mobility patterns of individual users throughout entire simulation. Ego network shows the social connections at the end of the simulation. doi:10.1371/journal.pone.0092196.g004

Coupling Mobility and Interactions in Social Media

PLOS ONE | www.plosone.org 6 March 2014 | Volume 9 | Issue 3 | e92196



PLoS One 9, E92196 (2014)


Geo-Social Properties

that has also an edge between $i$ and $k$, forming a triangle. Note that a triangle consists of 3 triads centered on different nodes. The effect of the distance on the clustering coefficient can be incorporated by measuring the distances from each central node $j$ to two neighbors $i$ and $k$ forming a triad, $d = d_{ij} + d_{jk}$, and calculating the network clustering restricted to triads with distance $d$. This new function $C(d)$ is the probability of closing a triangle given the distance $d$ in a triad

$$C(d) = \frac{\Delta(d)}{\Lambda(d)}, \qquad (2)$$

where $\Lambda(d)$ and $\Delta(d)$ are the numbers of triads and closed triads for the distance $d$, respectively. The value of the global clustering coefficient $C$ can be recovered by averaging $C(d)$ over $d$. In the datasets, we observe a drop in $C(d)$ followed by a plateau, which is best visible for the US networks (Figure 2E).

Given a triangle, several configurations are possible if there is diversity in the edge lengths. The triangle can be equilateral if all the edges have the same length, isosceles if two have the same length and the other is smaller, etc. We estimate the dominant shapes of the triangles in the network by measuring the disparity $D$ defined as:

$$D = 6 \left[ \frac{d_1^2 + d_2^2 + d_3^2}{(d_1 + d_2 + d_3)^2} - \frac{1}{3} \right], \qquad (3)$$

where $d_1$, $d_2$ and $d_3$ are the geographical distances between the locations of the users forming the triangle. The disparity takes values between 0 and 1 as the shape of the triangle passes from equilateral to isosceles, where one edge is much smaller than the other two. $D$ shows a distribution with two maxima in the online social networks (Figure 2F), for low and high values. The two most common geometries of the triangles are: i) all 3 users are at a similar distance, ii) 2 users are close to each other, while the third one is distant. Since most edges correspond to small distances, this means that most triangles are constituted by three users that are all close to each other geographically. However, the stretched isosceles configuration is also relatively common.

Summarizing, we have defined the following metrics in order to characterize the network structure and its relation to geographical distance:

• $P_l(d)$: Probability of linking at a distance $d$ (Figure 2A).

• $P(k)$: Degree distribution (Figure 2B).

• $R(d)$: The probability of reciprocation conditional on a link at a distance $d$ (Figure 2C).

• $J_f(d)$: Average overlap as a function of the distance (Figure 2D).

• $C(d)$: Clustering coefficient as a function of the triad distance (Figure 2E).

• $P(D)$: Distribution of distance disparity for the triangles' edges (Figure 2F).

We will use these metrics in the coming sections to estimate the ability of a model to produce social networks comparable with those obtained from the empirical datasets.
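Two of the metrics above, the triangle disparity of Eq. (3) and the distance-dependent clustering of Eq. (2), can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the graph is an undirected adjacency dictionary, `dist` is any user-supplied distance function, and the fixed-width binning of $d = d_{ij} + d_{jk}$ is a simplification chosen here, not necessarily the binning used in the paper.

```python
from collections import defaultdict
from itertools import combinations


def disparity(d1, d2, d3):
    """Eq. (3): 0 for an equilateral triangle, 1 when one edge shrinks to zero."""
    total = d1 + d2 + d3
    return 6.0 * ((d1 ** 2 + d2 ** 2 + d3 ** 2) / total ** 2 - 1.0 / 3.0)


def clustering_by_distance(neighbors, dist, bin_width):
    """Eq. (2): closed triads / triads, per bin of d = d_ij + d_jk.

    neighbors: dict mapping each node to the set of its (undirected) neighbors
    dist:      function (u, v) -> distance between the two users
    """
    triads = defaultdict(int)
    closed = defaultdict(int)
    for j, nbrs in neighbors.items():
        # Every pair of neighbors (i, k) of j forms a triad centered on j.
        for i, k in combinations(sorted(nbrs), 2):
            b = int((dist(i, j) + dist(j, k)) // bin_width)
            triads[b] += 1
            if k in neighbors[i]:
                closed[b] += 1
    return {b: closed[b] / triads[b] for b in triads}
```

For example, three mutually connected users at positions 0, 1 and 2 on a line, plus a fourth user at position 10 linked only to the user at 2, give fully closed short-distance triads and open long-distance ones.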

Model Calibration

Next, we will find a compromise between the different metrics and search for the parameter values for which a given model best fits simultaneously the various statistical properties. To do so, we define an overall error Err to quantify the difference between the networks generated with the model and the empirical ones. The parameters of the model are then explored to find the values that minimize Err. We measure the error $\mathrm{Err}[X]$ for each property $X$ and take the average over all the properties

Figure 2. Network geo-social properties. Various statistical network properties are plotted for the data obtained from Twitter (red squares), Gowalla (blue diamonds), Brightkite (green triangles) and the null models (dashed lines), for the US (for the UK and Germany, see Figures S1 and S2). The spatial model (magenta), based on geography, matches well the data in $P_l(d)$, but yields near-zero values for $R(d)$, $J_f(d)$ and $C(d)$. The linking model (cyan), based on triadic closure, produces enough clustering, but it does not reproduce the distance dependencies of $P_l(d)$, $R(d)$, $J_f(d)$ and $C(d)$. doi:10.1371/journal.pone.0092196.g002



[Figure panels highlighted on the slide: Prob of a Link, Reciprocity, Clustering, Triangle Disparity, shown alongside Eqs. (2) and (3).]

PLoS One 9, E92196 (2014)


Geo-Social Model

[Diagram, one update step of user u: starting position of user u; then either visit a random neighbour or jump to a new location; new position of u; detect all encounters e in the box of u; created new social links.]

PLoS One 9, E92196 (2014)
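The update step outlined on this slide can be sketched as follows. This is a hedged illustration only: the function names, the square-grid boxes, the rule that every encounter in the box becomes a link, and the Pareto jump length with exponent `alpha` are all assumptions introduced here, since the slide gives only the sequence of steps. The roles of `p_v` (visit an existing contact) and `p_c` (random connection) follow the paper text.

```python
import math
import random


def box_of(point, box_size):
    """Index of the square grid cell containing a point (x, y)."""
    return (math.floor(point[0] / box_size), math.floor(point[1] / box_size))


def power_law_jump(point, rng, alpha=1.5):
    """Hypothetical jump: uniform direction, Pareto-distributed length."""
    length = rng.paretovariate(alpha)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return (point[0] + length * math.cos(angle), point[1] + length * math.sin(angle))


def tf_step(u, pos, friends, p_v, p_c, box_size, rng, jump=power_law_jump):
    """Update agent u in place and return the set of newly created links."""
    # 1. Move: visit a random existing contact, or jump to a new location.
    if friends[u] and rng.random() < p_v:
        pos[u] = pos[rng.choice(sorted(friends[u]))]
    else:
        pos[u] = jump(pos[u], rng)
    # 2. Detect all encounters in the box of u (assumed: each one becomes a link).
    cell = box_of(pos[u], box_size)
    candidates = {v for v in pos if v != u and box_of(pos[v], box_size) == cell}
    # 3. With probability p_c, also connect to one agent chosen at random.
    others = sorted(v for v in pos if v != u)
    if others and rng.random() < p_c:
        candidates.add(rng.choice(others))
    created = candidates - friends[u]
    friends[u] |= created
    return created
```

Here `pos` maps each agent to an `(x, y)` tuple and `friends` maps each agent to the set of agents it links to; passing a different `jump` function is one way to compare power-law, normal and uniform jump lengths.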


Model Fitting

$$\mathrm{Err} = \frac{1}{8} \left\{ \mathrm{Err}[P_l(d)] + \mathrm{Err}[P(k)] + \mathrm{Err}[R(d)] + \mathrm{Err}[J_f(d)] + \mathrm{Err}[C(d)] + \mathrm{Err}[P(D)] + \mathrm{Err}[N_c] + \mathrm{Err}[C_{avg}] \right\}, \qquad (4)$$

where $N_c$ is the total number of nodes in connected components of the network and $C_{avg}$ is the undirected local clustering coefficient averaged over the $N_c$ connected nodes. The local clustering coefficient of a node $i$ is defined as the ratio between the number of closed triads centered on node $i$ and the total number of triads centered on that node.

The properties $X$ integrating Err can be scalars, functions or distributions and encompass different orders of magnitude. We define the error of a property $X$ as

$$\mathrm{Err}[X] = \frac{\sum_{i=1}^{n} \left| y_i^X - f_i^X \right|}{\sum_{i=1}^{n} \left| y_i^X \right|}, \qquad (5)$$

where $y_i^X$ is the $i$-th observed value of the property $X$ and $f_i^X$ is the corresponding $i$-th value of the property obtained by the model. In the case of a distribution, $i$ runs over the $n$ measured bins, while for a scalar (such as the number of nodes or the clustering coefficient) the sum has only one term.

We perform a Latin square sampling of the parameter space of $p_v$ and $p_c$ as shown in Figure 3 in order to find the minimum value of Err. The parameter space is covered uniformly in a linear scale for $p_v$ and in a logarithmic one for $p_c$. For all the countries, the minimum value of the error is obtained for $p_v$ in the interval $(0.05, 0.3)$ and $p_c$ in the range $(5 \cdot 10^{-3}, 5 \cdot 10^{-2})$. The values of Err found at the minimum are 0.30 for the US, 0.18 for the UK and 0.39 for Germany. For simplicity, we focus on the Twitter networks only, although similar results are obtained for the other datasets.
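The error measure of Eqs. (4) and (5) is simple enough to write down directly. The sketch below assumes each property is given as a list of values (a one-element list for a scalar) and that observed and simulated properties are stored in dictionaries keyed by hypothetical property names; with the eight properties of Eq. (4) the average reproduces the 1/8 prefactor.

```python
def property_error(observed, model):
    """Eq. (5): summed absolute deviation, normalized by the observed values."""
    if len(observed) != len(model):
        raise ValueError("observed and model must have the same number of bins")
    numerator = sum(abs(y - f) for y, f in zip(observed, model))
    denominator = sum(abs(y) for y in observed)
    return numerator / denominator


def overall_error(observed_props, model_props):
    """Eq. (4): plain average of the per-property errors."""
    errors = [property_error(observed_props[name], model_props[name])
              for name in observed_props]
    return sum(errors) / len(errors)
```

Normalizing by the observed values makes the per-property errors comparable even though the properties span different orders of magnitude, which is the reason the text gives for this definition.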

Results

Simulations for the Optimal Parameters

An example with the displacements between the consecutive locations and the ego networks for a sample of individuals, as generated by the TF model, is displayed in Figure 4. The parameters of the model are set to the ones that correspond to the minimum of the error Err. As shown, the agents tend to stay close to their original positions. Occasional long jumps occur due to visits to friends who live far away. In this range of parameters and simulation times, the main mechanism for generating long distance connections is random linking (controlled by $p_c$). Agents typically return to their original positions because this is where most of their contacts live. The frequency of the long distance jumps and connections varies for the three countries due to the different spatial distribution of the user populations. In the ego networks, the presence of multiple triangles with long distance edges can be observed.

The geo-social properties of the networks generated by the TF model are shown in Figure 5 for the US and in Figures S3 and S4 for the UK and Germany, respectively. Additionally, we show how each of the introduced properties contributes to the total error of the model in Table S1. The model is able to reproduce the trends in the probability $P_l(d)$, the reciprocity $R(d)$, the social overlap $J_f(d)$ and the disparity distribution $P(D)$ with good accuracy. The difficulties encountered with the degree distribution $P(k)$ and the clustering as a function of the distance $C(d)$ are not unexpected, since the model does not incorporate mechanisms to explicitly enhance the heterogeneity in the agents' contacts nor favor any specific dependence of the clustering on the distance. We have tested variants of the TF model in which connections are created using the preferential attachment rule. The overall fitting error for these variants of the model is not lower than for the basic TF model, as we show in Appendix S1.

Insights of the TF Model

In this section we explore two null models uncoupling mobility and social interactions to help us interpret the mechanisms acting in the TF model. The first null model, the spatial model (S model), is based solely on the geography and consists of randomly connecting pairs of users with a probability depending on the distance, but does not take network structure into account. The second null model, the linking model (L model), in contrast, is based only on random linking and triadic closure, and it is equivalent to the TF model without the mobility. We consider the two uncoupled null models and compare their results with those of the TF model. In this way, we demonstrate the importance of the coupling through a realistic mobility mechanism to reproduce the empirical networks.

The spatial model (S model) consists of randomly connecting pairs of users with a probability that decays as a power-law of the distance between them (suggested in [41]). The exponent of the power-law is fixed at $-0.7$ following Figure 2A. The results of the S model are shown in the panels of Figure 2. While it is set to match $P_l(d)$, other properties such as $P(k)$, $R(d)$, $J_f(d)$, $C(d)$ or $P(D)$ are not well reproduced. The S model fails to account for the high level of clustering and reciprocity in the empirical networks

Figure 3. Fitting the TF model. Values of the error Err when $p_v$ and $p_c$ are changed. The minimum error for each of the plots is marked with a red rectangle. doi:10.1371/journal.pone.0092196.g003
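The spatial null model (S model) described in the text, which links pairs of users with a probability that decays as a power-law of their distance with exponent $-0.7$, can be sketched as below. The prefactor `scale`, the short-distance cutoff `d_min`, and the use of planar Euclidean distance are assumptions made here for illustration; the source does not state how the probability is normalized.

```python
import math
import random
from itertools import combinations


def spatial_model(positions, scale, rng, exponent=-0.7, d_min=1.0):
    """Link each pair (u, v) with probability min(1, scale * d**exponent).

    positions: dict mapping each user to an (x, y) tuple
    scale, d_min: hypothetical normalization and cutoff, not from the source
    """
    links = set()
    for u, v in combinations(sorted(positions), 2):
        d = max(d_min, math.dist(positions[u], positions[v]))
        if rng.random() < min(1.0, scale * d ** exponent):
            links.add((u, v))
    return links
```

Because each pair is drawn independently, nothing in this model favors closing triangles or reciprocating links, which is consistent with the near-zero clustering and reciprocity the text reports for the S model.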



[Axis labels: Prob. to Make a New Friend; Prob. to Visit an Old Friend.]

PLoS One 9, E92196 (2014)


random connections, and so the distribution of triangle disparity $P(D)$ loses its bimodal shape. Furthermore, the friend visits provide opportunities to reciprocate the connections. This is why for extremely low values of $p_v$, the reciprocity $R(d)$ is close to zero. Towards the other limit, i.e., $p_v \to 1$, the social overlap $J_f(d)$ and the triangle-closing probability $C(d)$ steadily increase. In this limit, the linking probability $P_l(d)$, the reciprocity $R(d)$ and the distribution of triangle disparity $P(D)$ recover the shapes they have at the optimum.

Figure 8 explores the impact of varying $p_c$ while $p_v$ is fixed to its optimal value. The effect of $p_c$ on $J_f(d)$ and $C(d)$ is the opposite to that of $p_v$: these metrics decrease at all distances with increasing $p_c$. The reason for this is that visits to friends are the main forces behind the creation of new triads and the subsequent closure of triangles. Note that the more connections are created randomly (higher $p_c$), the fewer links will be a result of friend visits. We will expose and describe in detail the interplay between these two mechanisms in the mean-field calculations.

A possible variation of the TF model consists of eliminating friend visits or random connections (i.e., setting $p_v$ or $p_c$ to 0). This prevents the model from producing networks with characteristics comparable to the real ones in all the cases, leading to an increase in Err of around 0.5. Interestingly, the model results are quite robust to variations in the update rules, the random connection mechanism, the connecting rules in each agent neighborhood and the variants in the way users visit friends. These variations lead to changes in Err smaller than 0.1. A detailed discussion of the results with different model variants is included in Appendix S1.

Mean-field Approach

In this section, we consider the L model, introduced earlier in this section, to gain some analytical insights on the mechanisms ruling the final network structure. Although this model is a simplified version of the TF model, the results of the simulations yield a relatively low value of Err (Figure 6, and Figures S9 and S10 in Appendix S1). We write the equations for the time evolution of the properties of the network and solve them numerically. Among all the properties, we focus on the average

Figure 6. Comparison of different models. The minimal values of the error Err for the TF model, the two null models: spatial (S model) or linking (L model), and the TF model with normally or uniformly distributed travel distances. doi:10.1371/journal.pone.0092196.g006

Figure 5. Geo-social properties of the model networks. Various statistical properties are plotted for the networks obtained from Twitter data (red squares) and from simulation of the TF model (black line) for the US. Corresponding results for the UK and Germany can be found in Figures S3 and S4. doi:10.1371/journal.pone.0092196.g005













Model Results

[Figure panels: Prob of a Link, Reciprocity, Clustering, Triangle Disparity.]

PLoS One 9, E92196 (2014)


Human Diffusion J. R. Soc. Interface 12, 20150473 (2015)


[Figure panels: (a) Starting from Paris; (b) Starting from New York.]


Residents and Tourists J. R. Soc. Interface 12, 20150473 (2015)


[Figure residue, four panels (a) to (d). Axis labels recovered: Coverage, R~ (symbol garbled), Proportion of Non-Local Users. Legend: Local, Non-Local.
Cities listed in panel (c), Coverage axis 125 to 155: New York, Chicago, San Francisco, Shanghai, Dallas, Berlin, Paris, Saint Petersburg, Beijing, Moscow.
Cities listed in panel (d), Coverage axis 325 to 345: Houston, Barcelona, Brussels, Detroit, Lima, Istanbul, Rome, Moscow, Paris, Lisbon.]


City Communities J. R. Soc. Interface 12, 20150473 (2015)

[Bar chart residue. Cities: Los Angeles, San Francisco, Miami, Singapore, Tokyo, Paris, London, New York. Axes: Weighted Betweenness (×10²), Weighted degree.]


[Map labels: Angkor Wat, Forbidden City, Corcovado, Eiffel Tower, Giza, Golden Pavilion, Grand Canyon, Hagia Sophia, Iguazu Falls, Kukulcan, London Tower, Machu Picchu, Mount Fuji, Niagara Falls, Taj Mahal, Pisa Tower, Times Square, Zocalo, Saint Basil's Cathedral, Alhambra.]


Tourism EPJ Data Science 5, 12 (2016)


Touristic Sites

[Figure residue, three ranking panels (a) to (c).
(a) Radius, axis 0.4 to 0.6: Times Square, Niagara Falls, Angkor Wat, Grand Canyon, Machu Picchu, Giza, Forbidden City, Eiffel Tower, Pisa Tower, Taj Mahal.
(b) Coverage (cell), axis 80 to 120: Iguazu Falls, Giza, Times Square, Machu Picchu, Forbidden City, Niagara Falls, Eiffel Tower, Taj Mahal, Grand Canyon, Pisa Tower.
(c) Coverage (country), axis 20 to 32: London Tower, Times Square, Hagia Sophia, Machu Picchu, Angkor Wat, Forbidden City, Pisa Tower, Eiffel Tower, Giza, Taj Mahal.]

EPJ Data Science 5, 12 (2016)





Discussion

• Online Social Networks generate unprecedented amounts of data on Human Behavior
• The massification of GPS-enabled devices allows us to observe Geographical variations
• Human mobility is an intrinsically multi-scale process
• Twitter is a good source of geolocated data, but it has many biases that must be considered
• Different types of links serve different social and information diffusion functions
• The strength of ties provides important clues to the social structure
• Colocation increases the likelihood of friendship
• Mobility and Social Structure mutually influence each other
• Mobility is a proxy for the centrality of a city or touristic locale


References

PLoS One 10, e0115545 (2015)

Sci Rep 3, 1376 (2013)

Social Networks 34, 73 (2012)

Social Networks 34, 82 (2012)

PLoS One 7, e29358 (2012)

ICWSM’11, 375 (2011)

PNAS 107, 22436 (2010)

Nature 453, 779 (2008)

PNAS 104, 7333 (2007)


J. R. Soc. Interface 12, 20150473 (2015)

PLoS One 9, E92196 (2014)

PLoS One 8, E61981 (2013)

ICWSM’11, 89 (2011)

PNAS 106, 21484 (2009)

www.bgoncalves.com@bgoncalves Jun 20-22


CompleNet 2017
Dubrovnik, Croatia, March/April
