efficient and scalable socware detection in online social - usenix

16
Efficient and Scalable Socware Detection in Online Social Networks Md Sazzadur Rahman, Ting-Kai Huang, Harsha V. Madhyastha, Michalis Faloutsos Department of Computer Science and Engineering University of California, Riverside AbstractOnline social networks (OSNs) have become the new vec- tor for cybercrime, and hackers are finding new ways to propagate spam and malware on these platforms, which we refer to as socware. As we show here, socware cannot be identified with existing security mechanisms (e.g., URL blacklists), because it exploits different weaknesses and of- ten has different intentions. In this paper, we present MyPageKeeper, a Facebook ap- plication that we have developed to protect Facebook users from socware. Here, we present results from the perspective of over 12K users who have installed MyPageKeeper and their roughly 2.4 million friends. Our work makes three main contributions. First, to enable protection of users at scale, we design an efficient socware detection method which takes advantage of the social context of posts. We find that our classifier is both accurate (97% of posts flagged by it are indeed socware and it incorrectly flags only 0.005% of benign posts) and efficient (it requires 46 ms on average to classify a post). Second, we show that socware significantly differs from traditional email spam or web-based malware. For example, website blacklists identify only 3% of the posts flagged by MyPageKeeper, while 26% of flagged posts point to malicious apps and pages hosted on Facebook (which no current antivirus or blacklist is designed to detect). Third, we quantify the prevalence of socware by analyzing roughly 40 million posts over four months; 49% of our users were exposed to at least one socware post in this period. Finally, we identify a new type of parasitic behavior, which we refer to as “Like-as-a-Service”, whose goal is to artificially boost the number of “Likes” of a Facebook page. 1 Introduction As online social networks (OSNs) are becoming the new epicenter of the web, hackers are expanding their territory to these services [8]. Anyone using Facebook or Twitter is likely to be familiar with what we call here socware 1 : fake, annoying, possibly damaging posts from friends of the potential victim. The propagation of socware takes the form of postings and communications between 1 We find the introduction of the term socware necessary because, as we elaborate later in Section 2, the types of intent associated with socware encompasses more than traditional phishing and malware. friends on OSNs. Users are enticed into visiting suspi- cious websites or installing apps with the lure of false re- wards (e.g., free iPads in memory of Steve Jobs [30]), and they unwittingly send the post to their friends, thus en- abling a viral spreading. This is exactly where the power of socware lies: posts come with the implicit endorse- ment of the sending friend. Beyond this being a nuisance, socware also enables cyber-crime, with several Facebook scams resulting in loss of real money for users [11, 12]. Defenses against email spam are insufficient for identi- fying socware since reputation-based filtering [29, 28, 51] is insufficient to detect socware received from friends and, as we show later, the keywords used in email spam significantly differ from those used in socware. We also find that URL blacklists designed to detect phishing and malware on the web do not suffice, e.g., because a large fraction of socware (26% in our dataset) points to suspi- cious applications hosted on Facebook. Finally, though Facebook has its own mechanisms for detecting and re- moving malware [52], they seem to be less aggressive either due to what they define as malware or due to com- putational limitations. In this paper, we present the design and implemen- tation of a Facebook application, MyPageKeeper [24], that we develop specifically for the purpose of protect- ing Facebook users from socware. For any subscribing user of MyPageKeeper, whenever socware appears in that user’s wall or news feed, we seek to detect the socware soon thereafter and alert the user (hopefully before she views the post). Until October 2011, MyPageKeeper had been installed by more than 12K Facebook users (since its launch in June 2011). By monitoring the news feeds of these users, we also observe posts on the walls of the 2.4 million friends of these users. In this paper, we eval- uate MyPageKeeper using a dataset of over 40 million posts that it inspected during the four month period from June to October 2011. The key contributions of our work can be grouped into three main thrusts. a. Designing an accurate, efficient, and scalable de- tection method. In order to operate MyPageKeeper at 1

Upload: others

Post on 11-Feb-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Efficient and Scalable Socware Detection in Online SocialNetworks

Md Sazzadur Rahman Ting-Kai Huang Harsha V Madhyastha Michalis FaloutsosDepartment of Computer Science and Engineering

University of California Riverside

AbstractmdashOnline social networks (OSNs) have become the new vec-

tor for cybercrime and hackers are finding new ways topropagate spam and malware on these platforms whichwe refer to as socware As we show here socware cannotbe identified with existing security mechanisms (eg URLblacklists) because it exploits different weaknesses and of-ten has different intentions

In this paper we present MyPageKeeper a Facebook ap-plication that we have developed to protect Facebook usersfrom socware Here we present results from the perspectiveof over 12K users who have installed MyPageKeeper andtheir roughly 24 million friends Our work makes threemain contributions First to enable protection of users atscale we design an efficient socware detection method whichtakes advantage of the social context of posts We find thatour classifier is both accurate (97 of posts flagged by itare indeed socware and it incorrectly flags only 0005 ofbenign posts) and efficient (it requires 46 ms on average toclassify a post) Second we show that socware significantlydiffers from traditional email spam or web-based malwareFor example website blacklists identify only 3 of the postsflagged by MyPageKeeper while 26 of flagged posts pointto malicious apps and pages hosted on Facebook (which nocurrent antivirus or blacklist is designed to detect) Thirdwe quantify the prevalence of socware by analyzing roughly40 million posts over four months 49 of our users wereexposed to at least one socware post in this period Finallywe identify a new type of parasitic behavior which we referto as ldquoLike-as-a-Servicerdquo whose goal is to artificially boostthe number of ldquoLikesrdquo of a Facebook page

1 IntroductionAs online social networks (OSNs) are becoming the newepicenter of the web hackers are expanding their territoryto these services [8] Anyone using Facebook or Twitteris likely to be familiar with what we call here socware1fake annoying possibly damaging posts from friendsof the potential victim The propagation of socwaretakes the form of postings and communications between

1 We find the introduction of the term socware necessary becauseas we elaborate later in Section 2 the types of intent associated withsocware encompasses more than traditional phishing and malware

friends on OSNs Users are enticed into visiting suspi-cious websites or installing apps with the lure of false re-wards (eg free iPads in memory of Steve Jobs [30]) andthey unwittingly send the post to their friends thus en-abling a viral spreading This is exactly where the powerof socware lies posts come with the implicit endorse-ment of the sending friend Beyond this being a nuisancesocware also enables cyber-crime with several Facebookscams resulting in loss of real money for users [11 12]

Defenses against email spam are insufficient for identi-fying socware since reputation-based filtering [29 28 51]is insufficient to detect socware received from friendsand as we show later the keywords used in email spamsignificantly differ from those used in socware We alsofind that URL blacklists designed to detect phishing andmalware on the web do not suffice eg because a largefraction of socware (26 in our dataset) points to suspi-cious applications hosted on Facebook Finally thoughFacebook has its own mechanisms for detecting and re-moving malware [52] they seem to be less aggressiveeither due to what they define as malware or due to com-putational limitations

In this paper we present the design and implemen-tation of a Facebook application MyPageKeeper [24]that we develop specifically for the purpose of protect-ing Facebook users from socware For any subscribinguser of MyPageKeeper whenever socware appears in thatuserrsquos wall or news feed we seek to detect the socwaresoon thereafter and alert the user (hopefully before sheviews the post) Until October 2011 MyPageKeeper hadbeen installed by more than 12K Facebook users (sinceits launch in June 2011) By monitoring the news feedsof these users we also observe posts on the walls of the24 million friends of these users In this paper we eval-uate MyPageKeeper using a dataset of over 40 millionposts that it inspected during the four month period fromJune to October 2011

The key contributions of our work can be grouped intothree main thrusts

a Designing an accurate efficient and scalable de-tection method In order to operate MyPageKeeper at

1

scale but at low cost the distinguishing characteristic ofour approach is our strident focus on efficiency Prior so-lutions for detecting spam and malware on OSNs (whichwe describe in detail later) rely on information obtainedeither by crawling the URLs included in posts or by per-forming DNS resolution on these URLs In contrast oursocware classifier relies solely on the social context asso-ciated with each post (eg the number of walls and newsfeeds in which posts with the same embedded URL areobserved and the similarity of text descriptions acrossthese posts) Note that this approach means that we donot even resolve shortened URLs (eg using services likebitly) into the full URLs that they represent This ap-proach maximizes the rate at which we can classify poststhus reducing the cost of resources required to support agiven population of users

We employ a Machine Learning classification moduleusing Support Vector Machines on a carefully selectedset of such features that are readily available from theobserved posts 97 of posts flagged by our classifier areindeed socware and it incorrectly flags only 0005 ofbenign posts Furthermore it requires an average of only46 ms to classify a post

b Socware is a new kind of malware We show thatsocware is significantly different than traditional emailspam or web-based malware First URL blacklists can-not detect socware effectively These blacklists identifyonly 3 of the malicious posts that MyPageKeeper flagsThe inability of website blacklists to identify socwareis partly due to the fact that a significant fraction ofsocware is hosted on popular blogging domains such asblogspotcom and on Facebook itself Specifically26 of the flagged posts point to Facebook apps or pagesMoreover we also observe a low overlap between thekeywords associated with email based spam and those wefind in socware

c Quantifying socware prevalence and intentionWe find that 49 of our users were exposed to at leastone socware post in four months We also identify a newtype of parasitic behavior which we refer to as ldquoLike-as-a-Servicerdquo Its goal is to artificially boost the number ofldquoLikesrdquo of a Facebook page With the lure of games andrewards several Facebook apps push users to Like theFacebook pages of say a store or a product thus artifi-cially inflating their reputation on Facebook This furtherconfirms the difference between socware and other formsof malware propagation

2 Socware on FacebookIn this section we provide relevant background aboutFacebook and we describe typical characteristics ofsocware found on Facebook

21 The Facebook terminology

Facebook is the largest online social network today withover 900 million registered users roughly half of whomvisit the site daily Here we discuss some standard Face-book terminology relevant to our work

bull Post A post represents the basic unit of informationshared on Facebook Typical posts either contain onlytext (status updates) a URL with an associated text de-scription or a photoalbum shared by a user In ourwork we focus on posts that contain URLs

bull Wall A Facebook userrsquos wall is a page where friends ofthe user can post messages to the user Such messagesare called wall posts Other than to the user herselfposts on a userrsquos wall are visible to other users on Face-book determined by the userrsquos privacy settings Typi-cally a userrsquos wall is made visible to the userrsquos friendsand in some cases to friends of friends

bull News feed A Facebook userrsquos news feed page is a sum-mary of the social activity of the userrsquos friends on Face-book For example a userrsquos news feed contains poststhat one of the userrsquos friends may have shared with allof her friends Facebook continually updates the newsfeed of every user and the content of a userrsquos news feeddepends on when it is queried

bull Like Like is a Facebook widget that is associated withan object such as a post a page or an app If a userclicks the Like widget associated with an object theobject will appear in the news feed of the userrsquos friendsand thus allow information about the object to spreadacross Facebook Moreover the number of Likes (iethe number of users who have clicked the Like widget)received by an object also represents the reputation orpopularity of the object

bull Application Facebook allows third-party developers tocreate their own applications that Facebook users canadd Every time a user visits an applicationrsquos page onFacebook Facebook dynamically loads the content ofthe application from a URL called the canvas URLpointing to the application server provided by the ap-plicationrsquos developer Since content of an application isdynamically loaded every time a user visits the appli-cationrsquos page on Facebook the application developerenjoys great control over content shown in the applica-tion page The Facebook platform uses OAuth 20 [2]for user authentication application authorization andapplication authentication Here application authoriza-tion ensures that the users grant precise data (eg emailaddress) and capabilities (eg ability to post on theuserrsquos wall) to the applications they choose to add andapplication authentication ensures that a user grants ac-cess to her data to the correct application

2

22 SocwareWe start by defining the meaning of socware We de-scribe typical characteristics of socware and elaborate onhow socware distinguishes itself from traditional emailspam and web malware

What is socware Our intention is to use the termsocware to encompass all criminal and parasitic behav-ior in an OSN including anything that annoys hurts ormakes money off of the user In the context of this paperwe consider a Facebook post as malicious if it satisfiesone of the following conditions (1) the post spreads mal-ware and compromises the device of the user (2) the webpage pointed to by the post requires the user to give awaypersonal information (3) the post promises false rewards(eg free products) (4) the post is made on a userrsquos be-half without the userrsquos knowledge (typically by havingpreviously lured the user into providing relevant permis-sions to a rogue Facebook app) (5) the web page pointedto by the post requires the user to carry out tasks (eg fillout surveys) that help profit the owner of that website or(6) the post causes the user to artificially inflate the repu-tation of the page (eg by forcing the user to lsquoLikersquo thepage) While the first two criteria are typical malware andphishing the latter four are distinctive of socware

Disclaimer As with email spam there can be someambiguity in the definition of socware a post consid-ered as annoying by one user may be considered usefulby another user In practice our ultimate litmus test isthe opinion of MyPageKeeperrsquos users if most of themreport a post as annoying we will flag it as such

How does socware work Socware appears in a Face-book userrsquos wall or news feed typically in the form of apost which contains two parts First the post contains aURL 2 usually obfuscated with a URL shortening ser-vice to a webpage that hosts either malicious or spamcontent Second socware posts typically contain a catchytext message (eg ldquotwo free Southwest ticketsrdquo) that en-tice users to click on the URL included in the post Op-tionally socware posts also contain a thumbnail screen-shot of the landing page of the URL also used to enticethe user to click on the link For example a purportedimage of Osamarsquos corpse is included in a post that claimsto point to a video of his death

The operation of most socware epidemics can be asso-ciated with two distinct mechanisms

a Propagation mechanism Once a user follows theembedded URL to the target website the post tries topropagate itself through that user For this the user isoften asked to complete several steps in order to obtain

2We leave the identification of socware posts that do not contain anURL for future work The propagation of socware is harder in suchcases since the user needs to perform a more laborious operation (egenter an URL into the browserrsquos address bar) than simply clicking onthe embedded URL

AppName

Application Message MonthlyActiveUsers

FreePhoneCalls

Irsquom making a Free Call with the FreePhone Call Facebook App Irsquoll neverpay for a phone call again Make yourfree call at URL

435392

The App Check if a friend has deleted you URL 35216The App Check if a friend has deleted you URL 25778

Table 1 Three rogue Facebook applications identified by My-PageKeeper

Page Name Message to persuade lsquoLikersquo No ofLikes

Clif Bar Hey there Looking for a clif builderrsquoscoupon Just like us by clicking the but-ton above thanks

79919

FarmVille Bonus You canrsquot claim you you havenrsquot clickedon the like button

94907

Courtesy Chevrolet Like our page to play and have a changeto win

86287

Greggs The Bakers Like us to claim your voucher 288039Mobilink Infinity Like us for big infinite fun 26105

Table 2 Top five pages identified by MyPageKeeper that per-suade users to lsquoLikersquo them

the fake reward (eg ldquoFree Facebook T-shirtrdquo) Thesesteps involve ldquoLikingrdquo or ldquosharingrdquo the post or postingthe socware on the userrsquos wall Thus the cycle contin-ues with the friends of that user who see the post in theirnews feed In contrast users seldom forward email spamto their friends

b Exploitation mechanism The exploitation oftenstarts after the propagation phase The hacker attemptseither to learn the userrsquos private information via a phish-ing attack [9] to spread malware to user devices or tomake money by ldquoforcingrdquo a particular user action or re-sponse such as completing a survey for which the hackergets paid [19]

Where is socware hosted Socware can be broadlyclassified into two categories based on the infrastructurethat hosts them

a Socware hosted outside Facebook In this cate-gory URLs point to a domain hosted outside FacebookSince the URL points to a landing page outside the OSNhackers can directly launch the different kinds of attacksmentioned above once a user visits the URL in a socwarepost Though several URL blacklists should be able toflag such URLs the process of updating these blacklists istoo slow to keep up with the viral propagation of socwareon OSNs [44]

b Socware hosted on Facebook A significant frac-tion of socware is hosted on Facebook itself the embed-ded URL points to a Facebook page or application Natu-rally current blacklists and reputation-based schemes failto flag such URLs Such URLs typically point to the fol-lowing types of Facebook objectsbull Malicious Facebook applications Rogue applica-

3

tions post catchy messages (eg ldquoCheck who deletedyou from your profilerdquo) on the walls of users with alink pointing to the installation page of the applicationTable 1 lists three such socware-spreading applicationsin our data Users are conned into installing the ap-plication to their profile and granting several permis-sions to it The application then not only gets accessto that userrsquos personal information (such as email ad-dress home town and high school) but also gains theability to post on the victimrsquos wall As before posts ona userrsquos wall typically appear on the news feeds of theuserrsquos friends and the propagation cycle repeats Cre-ating such applications has become easy with ready touse toolkits starting at $25 [18]

bull Malicious Facebook events Sometimes hackers cre-ate Facebook events that contain a malicious link Onesuch event is the lsquoGet a free Facebook T-Shirt (Spon-sored by Reebok)rsquo scam This event page states that500000 users will get a free T-shirt from Facebook Tobe one among those 500000 users a user must attendthe event invite her friends to join and enter her ship-ping address

bull Malicious Facebook pages Another approach takenby hackers to spread socware is to create a Facebookpage and post spam links on the page [27] We alsoidentified a trend in aggressive marketing by compa-nies that force users to click ldquoLikerdquo on their Facebookpages to spread their pages as well as increase the rep-utation of the page Table 2 lists the top five such Face-book pages along with the message on the page andthe number of Likes received by these pages

3 MyPageKeeper ArchitectureTo identify socware and protect Facebook users from itwe develop MyPageKeeper MyPageKeeper is a Face-book application that continually checks the walls andnews feeds of subscribed users identifies socware postsand alerts the users We present our goals in designingMyPageKeeper and then describe the system architec-ture and implementation details

31 GoalsWe design MyPageKeeper with the following three pri-mary goals in mind

1 Accuracy Our foremost goal is to ensure accurateidentification of socware We are faced with the obvi-ous trade-off between missing malware posts (false neg-atives) and ldquocrying wolfrdquo too often (false positives) Al-though one could argue that minimizing false negatives ismore important users would abandon overly sensitive de-tection methods as recognized by the developers of Face-bookrsquos Immune System [52]

2 Scalability Our end goal is to have MyPageKeeperprovide protection from socware for all users on Face-

book not just for a select few Therefore the system mustbe scalable to easily handle increased load imposed by agrowth in the number of subscribed users

3 Efficiency Finally we seek to minimize our costsin operating MyPageKeeper The period between whena post first becomes visible to a user and the time it ischecked by MyPageKeeper represents the window of vul-nerability when the user is exposed to potential socwareTo minimize the resources necessary to keep this win-dow of vulnerability short MyPageKeeperrsquos techniquesfor classification of posts must be efficient

32 MyPageKeeper componentsMyPageKeeper consists of six functional modules

a User authorization module We obtain a userrsquosauthorization to check her wall and news feed through aFacebook application which we have developed Oncea user installs the MyPageKeeper application we obtainthe necessary credentials to access the posts of that userFor alerting the user we also request permission to accessthe userrsquos email address and to post on the userrsquos walland news feed Figure 1(a) shows how a Facebook userauthorizes an application

b Crawling module MyPageKeeper periodicallycollects the posts in every userrsquos wall and news feed Asmentioned previously we currently focus only on poststhat contain a URL Apart from the URL each post com-prises several other pieces of information such as a textmessage associated with the post the user who made thepost number of comments and Likes on the post and thetime when the post was created

c Feature extraction module To classify a post My-PageKeeper evaluates every embedded URL in the postOur key novelty lies in considering only the social con-text (eg the text message in the post and the numberof Likes on it) for the classification of the URL and therelated post Furthermore we use the fact that we are ob-serving more than one user which can help us detect anepidemic spread We discuss these features in more detaillater in Section 33

d Classification module The classification moduleuses a Machine Learning classifier based on Support Vec-tor Machines but also utilizes several local and externalwhitelists and blacklists that help speed up the processand increase the overall accuracy The classification mod-ule receives a URL and the related social context featuresextracted in the previous step Since the classification isour key contribution we discuss this in more detail inSection 33 If a URL is classified as socware all postscontaining the URL are labeled as such

e Notification module The notification module no-tifies all users who have socware posts in their wall ornews feed The user can currently specify the notificationmechanism which can be a combination of emailing the

4

User

Facebook Servers

Application Server

4 Generate and share

access token

1 App add request

2 Return permission setrequired by the app

3 Allow permission set

Crawler DB

Facebook DB

WLFeatureExtractor

URL

benign unknown

No SVM Classifier

benign Unknown

Malicious

Notification

BLNo

Yes Yes

Crawling Facebook posts Classifying Facebook posts

(a) (b)Figure 1 (a) Application installation process on Facebook and (b) architecture of MyPageKeeper

user or posting a comment on the suspect posts In thefuture we will consider allowing our system to removethe malicious post automatically but this can create lia-bilities in the case of false positives

f User feedback module Finally to improve My-PageKeeperrsquos ability to detect socware we leverage ouruser community We allow users to report suspiciousposts through a convenient user-interface In such a re-port the user can optionally describe the reason why sheconsiders the post as socware

33 Identification of socwareThe key novelty of MyPageKeeper lies in the classifica-tion module (summarized in Figure 1(b)) As describedearlier the input to the classification module is a URLand the related social context features extracted from theposts that contain the URL Our classification algorithmoperates in two phases with the expectation that URLsand related posts that make it through either phase with-out a match are likely benign and are treated as such

Using whitelists and blacklists To improve the effi-ciency and accuracy of our classifier we use lists of URLsand domains in the following two steps First MyPage-Keeper matches every URL against a whitelist of popularreputable domains We currently use a whitelist compris-ing the top 70 domains listed by Quantcast excluding do-mains that host user-contributed content (eg OSNs andblogging sites) Any URL that matches this whitelist isdeemed safe and it is not processed further

Second all the URLs that remain are then matchedwith several URL blacklists that list domains and URLsthat have been identified as responsible for spam phish-ing or malware Again the need to minimize classi-fication latency forces us to only use blacklists that wecan download and match against locally Such blacklistsinclude those from Googlersquos Safe Browsing API [17]Malware Patrol [23] PhishTank [26] APWG [1] Spam-Cop [28] joewein [20] and Escrow Fraud [7] Queryingblacklists that are hosted externally such as SURBL [31]URIBL [33] and WOT [34] will introduce significant la-tency and increase MyPageKeeperrsquos latency in detectingsocware thus inflating the window of vulnerability AnyURL that matches any of the blacklists that we use is clas-sified as socware

Using machine learning with social context fea-tures All URLs that do not match the whitelist or anyof the blacklists are evaluated using a Support VectorMachines (SVM) based classifier SVM is widely andsuccessfully used for binary classification in security andother disciplines [49 46] [32] We train our system witha batch of manually labeled data that we gathered overseveral months prior to the launch of MyPageKeeper Forevery input URL and post the classifier outputs a binarydecision to indicate whether it is malicious or not OurSVM classifier uses the following features

Spam keyword score Presence of spam keywords in apost provides a strong indication that the post is spamSome examples of such spam keywords are lsquoFREErsquolsquoHurryrsquo lsquoDealrsquo and lsquoShockedrsquo To compile a list of suchkeywords that are distinctive to socware our intuitionis to identify those keywords that 1) occur frequently insocware posts and 2) appear with a greater frequency insocware as compared to their frequency in benign posts

We compile such a list of keywords by comparinga dataset of manually identified socware posts witha dataset of posts that contain URLs that match ourwhitelist (we discuss how to maintain this list of key-words in Section 7) We transform posts in either datasetto a bag of words with their frequency of occurrenceWe then compute the likelihood ratio p1p2 for eachkeyword where p1 = p(word|socwarepost) and p2 =p(word|benignpost) The likelihood ratio of a key-word indicates the bias of the keyword appearing morein socware than in benign posts In our current imple-mentation of MyPageKeeper we have found that the useof the 6 keywords with the highest likelihood ratio val-ues among the 100 most frequently occurring keywordsin socware is sufficient to accurately detect socware

Thereafter to classify a URL MyPageKeeper searchesall posts that contain the URL for the presence of thesespam keywords and computes a spam keyword score asthe ratio of the number of occurrences of spam keywordsacross these posts to the number of posts

Message similarity If a post is part of a spam cam-paign it usually contains a text message that is similarto the text in other posts containing the same URL (egbecause users propagate the post by simply sharing it)On the other hand when different users share the same

5

popular URL they are likely to include different text de-scriptions in their posts Therefore greater similarity inthe text messages across all posts containing a URL por-tends a higher probability that the URL leads to spam Tocapture this intuition for each URL we compute a mes-sage similarity score that captures the variance in the textmessages across all posts that contain the URL For eachpost MyPageKeeper sums the ASCII values of the char-acters in the text message in the post and then computesthe standard deviation of this sum across all the posts thatcontain the URL If the text descriptions in all posts aresimilar the standard deviation will be low

News feed post and wall post count The more suc-cessful a spam campaign the greater the number of wallsand news feeds in which posts corresponding to the cam-paign will be seen Therefore for each URL MyPage-Keeper computes counts of the number of wall posts andthe number of news feed posts which contained the URL

Like and comment count Facebook users can lsquoLikersquoany post to indicate their interest or approval Users canalso post comments to follow up on the post again indi-cating their interest Users are unlikely to lsquoLikersquo postspointing to socware or comment on such posts sincethey add little value Therefore for every URL My-PageKeeper computes counts of the number of Likes andnumber of comments seen across all posts that containthe URL

URL obfuscation Hackers often try to spread mali-cious links in an obfuscated form eg by shortening itwith a URL shortening service such as bitly or googlWe store a binary feature with every URL that indicateswhether the URL has been shortened or not we maintaina list of URL shorteners

Note that none of the above features by themselves areconclusive evidence of socware and other features couldpotentially further enhance the classifier (eg we can ac-count for spam keywords such as lsquofreersquo included in URLssuch as httpnfljerseyfreecom) However as we showlater in our evaluation the features that we currently con-sider yield high classification accuracy in combination

34 Implementing MyPageKeeperWe provide some details on MyPageKeeperrsquos implemen-tation

Facebook application First we implement the My-PageKeeper Facebook application using FBML [14]We implement our application server using Apache(web server) Django (web framework) and Postgres(database) Once a user installs the MyPageKeeper app inher profile Facebook generates a secret access token andforwards the token to our application server which wethen save in a database This token is used by the crawlerto crawl the walls and news feeds of subscribed users us-ing the Facebook open-graph API If any user deactivates

Data Total distinct URLsMyPageKeeper users 12456 -

Friends of MyPageKeeper users 2370272 -News feed posts 38764575 29522732

Wall posts 1760737 1532055User reports 679 333

Table 3 Summary of MyPageKeeper data

MyPageKeeper from their profile Facebook disables thistoken and notifies our application server whereupon westop crawling that userrsquos wall and news feed

Crawler instances and frequency We run a set ofcrawlers in Amazon EC2 instances to periodically crawlthe walls and news feeds of MyPageKeeperrsquos users Theset of users are partitioned across the crawlers In ourcurrent instantiation we run one crawler process for ev-ery 1000 users Thus as more users subscribe to My-PageKeeper we can easily scale the task of crawling theirwalls and news feeds by instantiating more EC2 instancesfor the task Our Python-based crawlers use the open-graph API incorporating usersrsquo secret access tokens tocrawl posts from Facebook Once the data is received inJSON format the crawlers parse the data and save it in alocal Postgres database

Currently as a tradeoff between timeliness of detectionand resource costs on EC2 we instantiate MyPageKeeperto crawl every userrsquos wall and news feed once every twohours Every couple of hours all of our crawlers start upand each crawler fetches new posts that were previouslynot seen for the users assigned to it Once all crawlerscomplete execution the data from their local databases ismigrated to a central database

Checker instances Checker modules are used to clas-sify every post as socware or benign Every two hoursthe central scheduler forks an appropriate number ofchecker modules determined by the number of new URLscrawled since the last round of checking Thus the iden-tification of socware is also scalable since each checkermodule runs on a subset of the pool of URLs Eachchecker evaluates the URLs it receives as inputmdashusinga combination of whitelists blacklists and a classifiermdashand saves the results in a database We use the libsvm [41]library for SVM based classification Once all checkermodules complete execution notifiers are invoked to no-tify all users who have posts either on their wall or intheir news feed that contain URLs that have been flaggedas socware

4 EvaluationNext we evaluate MyPageKeeper from three perspec-tives First we evaluate the accuracy with which it clas-sifies socware Second we determine the contribution ofMyPageKeeperrsquos social context based classifier in iden-tifying socware compared to the URL blacklists that ituses Lastly we compare MyPageKeeperrsquos efficiency

6

Feature F-ScoreURL obfuscated 0300378

Spam keyword score 0262220 of news feed posts 0173836

Message similarity score 0131733 of Likes 0039895

of wall posts 0019857 of comments 0006367

Table 4 Feature scores used by MyPageKeeperrsquos classifierAlternative source of postsFlagged by blacklist 18923Flagged by onfbme 2102

Content deleted by Facebook 3918Blacklisted app 1290Blacklisted IP 5827

Domain is deleted 247Points to app install 4658

Spamming app 6547Manually verified 14876

True positives 58388 (97)Unknown 1803 (3)

Total 60191

Table 5 Validation of socware flagged by MyPageKeeper clas-sifier

with alternative approaches that would either crawl everyURL or at least resolve short URLs in order to identifysocware

Table 3 summarizes the dataset of Facebook posts onwhich we conduct our evaluation This data is obtainedduring MyPageKeeperrsquos operation over a four month pe-riod from 20th June to 19th October 2011 MyPage-Keeper had over 12K users during this period who hadaround 237M friends in union Our data comprises 387million and 17 million posts that contain URLs fromthe news feeds and walls of these 12K users We con-sider only those posts that contain URLs since MyPage-Keeper currently checks only such posts Overall these40 million posts contained around 30 million uniqueURLs In addition we received 679 reports of socwarefrom 533 distinct MyPageKeeper users during the fourmonth period with 333 distinct URLs across these re-ports Though it is hard to make any general claims withregard to representativeness of our data we find that sev-eral user metrics (eg the male-to-female ratio and thedistribution of users across age groups) closely match thatof the Facebook user base at-large

41 AccuracyAs previously mentioned MyPageKeeper first matchesevery URL to be checked against a whitelist If no matchis found it checks the URL with a set of locally queriableURL blacklists Finally MyPageKeeper applies its socialcontext based classifier learned using the SVM modelIn this process we assume URL information providedby whitelists and blacklists to be ground truth ie clas-sification provided by them need not be independentlyvalidated Therefore we focus here on validating the

App name Description of postsSendible Social Media Manage-

ment6687

iRazoo Search amp win 18534Loot 4Loot lets you win all

sorts of Loot whilesearching the web

1891

Table 6 Top three spamming applications in our dataset

socware flagged by MyPageKeeperrsquos classifier based onsocial context features

We trained MyPageKeeperrsquos classifier using a manu-ally verified dataset of URLs that contain 2500 positivesamples and 5000 negative samples of socware posts wegathered these samples over several months while devel-oping MyPageKeeper Table 4 shows the importance ofthe various features in the SVM classifier learned Dur-ing the course of MyPageKeeperrsquos operation over fourmonths we applied the classifier to check 753516 uniqueURLs these are URLs that do not match the whitelist orany of the blacklists Of these URLs the classifier iden-tified 4972 URLs seen across 60191 posts as instancesof socware

It is important to note that when MyPageKeeper seesa URL in multiple posts over time the values of the fea-tures associated with the URL may change every timeit appears eg the message similarity score associatedwith the URL can change However once MyPage-Keeper classifies a URL as socware during any of its oc-currences it flags all previously seen posts that containthe URL and notifies the corresponding users Thereforein evaluating MyPageKeeperrsquos classifier URL blacklistsor MyPageKeeper as a whole we consider here that atechnique classified a particular URL as socware if thatURL was flagged by that technique upon any of theURLrsquos occurrences Correspondingly we consider a URLto have not been classified as socware if it was not iden-tified as such during any of its occurrences

Checking the validity of socware identified by My-PageKeeperrsquos classifier is not straightforward since thereis no ground truth for what represents socware and whatdoes not However here we attempt to evaluate the pos-itive samples of socware identified by MyPageKeeperrsquosclassifier using a combination of a host of complemen-tary techniques (we later discuss in Section 7 the valida-tion of posts that are deemed safe by MyPageKeeper) Todo so we use an instrumented Firefox browser to crawlthe 4972 URLs flagged by MyPageKeeper at the end ofthe four month period of MyPageKeeper operation Forevery URL that we crawl we record the landing URLthe IP address and other whois information of the land-ing domain and contents of the landing page To verifythe reputation of every URL we then apply several tech-niques in the order summarized in Table 5bull Blacklisted URLs First we check if any of the URLs

or the corresponding landing URLs are found in any

7

URL blacklists Note that though we use blacklistsin the operation of MyPageKeeper itself we use onlythose that can be stored and queried locally There-fore here we use for validation other external black-lists for which we have to issue remote queries Fur-ther even for blacklists used in MyPageKeeper theymay not identify some instances of socware when theyinitially appear because blacklists have been found tolag in keeping up with the viral propagation of spamon OSNs [44] Hence we check if a URL identified assocware by MyPageKeeperrsquos classifier appeared in anyof the blacklists used by MyPageKeeper at a later pointin time even though it did not appear in any of thoseblacklists initially when MyPageKeeper spotted postscontaining that URL

bull Flagged by fbme URL shortener Many URLs postedon Facebook are shortened using Facebookrsquos URLshortener fbme When Facebook determines any linkshortened using their service to be unsafe the corre-sponding shortened URL thereafter redirects to Face-bookrsquos home pagemdashfacebookcomhomephpmdashinstead of the actual landing page Of the URLs flaggedby MyPageKeeperrsquos classifier we check if those short-ened using Facebookrsquos URL shortening service redirectto Facebookrsquos home page

bull Content deleted from Facebook If Facebook deter-mines any URL hosted under the facebookcom do-main to be unsafe (eg the page for a spamming Face-book application) it thereafter redirects that URL tofacebookcom4oh4php We use this as anothersource of information to validate URLs flagged by My-PageKeeperrsquos classifier

bull Blacklisted apps If the URLs in posts made by a Face-book app are flagged due to any of the above reasonswe consider that app to be malicious and declare allother URLs posted by it as unsafe thus helping val-idate some of the URLs declared as socware by My-PageKeeperrsquos classifier

bull Blacklisted IPs For every URL flagged by any of theabove techniques we record the IP address when thatURL is crawled and blacklist that IP Of the URLsflagged by MyPageKeeperrsquos classifier we then con-sider those that lead to one of these blacklisted IP ad-dresses as correctly classified

bull Domain deleted Malicious domains are often deletedonce they are caught serving malicious content There-fore we deem MyPageKeeperrsquos positive classificationof a URL to be correct if the domain for that URL nolonger exists when we attempt to crawl it

bull Obfuscation of app installation page Posts made byFacebook applications to attract users to install themtypically include an un-shortened URL pointing to aFacebook page that contains information about the ap-

Source () of URLs () of posts Overlap withclassifier (of URLs)

Google SBA2 221 (68) 378 (04) 0Phishtank 12 (04) 435 (05) 1Malware Norm 69 (21) 154 (02) 0Joewein 240 (74) 652 (07) 11APWG 56 (17) 569 (06) 0Spamcop 232 (71) 921 (10) 0All blacklists 830 (256) 3104 (34) 12MyPageKeeperclassifier

2405 (744) 89389 (966)

Table 7 Comparison of contribution made by blacklists andclassifier to MyPageKeeperrsquos identification of socware duringthe four month period of operation

plication Once a user visits this page she can readthe applicationrsquos description and then click on a link onthis page if she decides to install it However postsfrom some surreptitious applications contained short-ened URLs that directly take the user to a page wherethey request the user to grant permissions (eg to poston the userrsquos wall) and install the application We havefound all instances of such applications to be spammingapplications Therefore if any of the URLs flagged byMyPageKeeperrsquos classifier is a shortened URL that di-rectly points to the installation page for a Facebook appwe declare that classification correct

bull Spamming app From our dataset we manually identi-fied several Facebook applications that try to spread onFacebook by promising free money to users and makeposts that point to the application page Once installedby a user such applications periodically post on theuserrsquos wall (without requesting the userrsquos authorizationfor each post) in an attempt to further propagate by at-tracting that userrsquos friends Table 6 shows some suchapplications that frequently appear in our dataset AnyURLs classified as socware by MyPageKeeperrsquos classi-fier that happen to be posted by one of these manuallyidentified spamming apps are deemed correct

bull Manual analysis Finally over the operation of My-PageKeeper during the four months we periodicallyverified a subset of URLs flagged by the classifierThese provide an additional source of validation

In all the union of the above techniques validatesthat 58388 out of 60191 posts declared as socware bythe MyPageKeeper classifier are indeed so Therefore97 of the socware identified by MyPageKeeperrsquos clas-sifier are true positives On the other hand the 1803posts incorrectly classified as socware constitute less than0005 of the over 40 million posts in our dataset Notethat though all of the above techniques could be foldedinto MyPageKeeper itself to help identify socware wedo not do so because all of these techniques require us tocrawl a URL in order to evaluate it we cannot afford thelatency of crawling

8

0

10

20

30

40

50

MPK

Resolver

Craw

ler

Th

rou

gh

pu

t (U

RL

sec

)

(a) Throughput

10-2

10-1

100

101

102

MPK

Resolver

Craw

ler

Lat

ency

(se

c)

(b) Latency

Figure 2 Comparison of MyPageKeeperrsquos throughput and la-tency in classifying URLs with a short URL resolver and acrawler-based approach The height of the box shows the me-dian with the whiskers representing 5th and 95th percentiles

42 Comparison with blacklistsThough we see that the identification of socware by My-PageKeeperrsquos classifier is accurate the next logical ques-tion is what is the classifierrsquos contribution to MyPage-Keeper in comparison with URL blacklists Table 7 pro-vides a breakdown of the URLs and posts classified assocware by MyPageKeeper during the four month periodunder consideration There are two main takeaways fromthis table First we see that the classifier finds 744 ofsocware URLs and 966 of socware posts identified byMyPageKeeper Thus the classifier accounts for a largemajority of socware identified by MyPageKeeper and isthus critical to the systemrsquos operation Second there isvery little overlap between the URLs flagged by black-lists and those flagged by the classifier The typically lowfrequency of occurrence of URLs that match blacklistsis another reason that the classifierrsquos share of identifiedsocware posts is significantly greater than its correspond-ing share of flagged URLs

43 EfficiencyBeyond accuracy it is critical that MyPageKeeperrsquos iden-tification of socware be efficient so as to minimize thecosts that we need to bear in order to keep the delay inidentifying socware and alerting users low The match-ing of a URL against a whitelist or a local set of blacklistsincurs minimal computational overhead In addition wefind that execution of the classifier also imposes minimaldelay per URL verified

To demonstrate the efficiency of MyPageKeeper wecompare the rate at which it classifies URLs with the clas-sification throughput that two other alternative classes ofapproaches would be able to sustain Our first point ofcomparison is an approach that relies only on locally que-riable URL whitelists and blacklists but resolves all short-ened URLs into the corresponding complete URL Oursecond alternative crawls URLs to evaluate them eg us-ing the content on the page or the IP address of the targetwebsite Figure 2(a) compares the throughput of classi-fying URLs with the three approaches using data from

40

50

60

70

80

90

100

100

101

102

103

o

f users

of email notifications sent

Figure 3 49 of MyPageKeeperrsquos 12456 users were notifiedof socware at least once in four months of MyPageKeeperrsquos op-eration

0

5

10

15

20

25

30

35

40

101

102

103

104

Mean

of

noti

ficati

ons

of friends

Figure 4 Correlation between vulnerability and social degreeof exposed users

two weeks of MyPageKeeperrsquos execution We see thatthe throughput with MyPageKeeper is almost an order ofmagnitude greater than the alternatives with all three ap-proaches using the same set of resources on EC2 As wesee in Figure 2(b) MyPageKeeperrsquos better performancestems from its lower execution latency to check an URLthe median classification latency with MyPageKeeper is48 ms compared to a median of 426 ms when resolvingshort URLs and 19 seconds when crawling URLs Thuswe are able to significantly reduce MyPageKeeperrsquos clas-sification latency compared to approaches that need toresolve short URLs or crawl target web pages by keep-ing all of its computation local

Furthermore a crawler-based approach will be signif-icantly more expensive than MyPageKeeper Thomaset al [54] found that crawler-based classification of 15million URLs per day using cloud infrastructure resultsin an expense of $800day Therefore we estimate thatit would cost approximately $15 millionyear to handleFacebookrsquos workload 1 million URLs are shared every20 minutes on Facebook [35] Since MyPageKeeperrsquosclassification latency is 40 times less than a crawler-basedapproach we estimate that the expense incurred with My-PageKeeper would be at least 40 times lower than a sys-tem that classifies URLs by crawling them

5 Analysis of SocwareThus far we described how MyPageKeeper detectssocware efficiently at scale In this section we analyzethe socware that we have found during MyPageKeeperrsquosoperation to throw light on characteristics of socware onFacebook

9

0

500

1000

1500

2000

2500

3000

3500

061

8

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

o

f noti

ficati

ons

Figure 5 No of socware notifications per day On 11th July19th Sep and 3rd Oct socware was observed in large scale

0

20

40

60

80

100

100

101

102

103

o

f li

nks

Active duration (days)

Figure 6 Active-time of socware links 20 of socware linkswere observed more than 10 days apart

51 Prevalence of socware49 of MyPageKeeperrsquos users were exposed tosocware within four months First we analyze theprevalence of socware on Facebook To do so we de-fine that a user was exposed to a particular socware postif that post appeared in her wall or news feed As shownin Figure 3 49 of MyPageKeeperrsquos users were exposedto at least one socware post during the four month pe-riod we consider here Though this already indicates thewide reach of socware on Facebook we stress that 49is only a lower-bound due to a couple of reasons Firstmany of MyPageKeeperrsquos users subscribed to our appli-cation at some time in the midst of the four month periodand therefore we miss socware that they were potentiallyexposed to prior to them subscribing to MyPageKeeperSecond Facebook itself detects and removes posts thatit considers as spam or pointing to malware [36 38 52]All the socware detected by MyPageKeeper is after suchfiltering by Facebook

Given that some users are exposed to more socwarethan others we analyze if the social degree of a user hasany impact on the probability of a user being exposed tosocware Figure 4 shows the number of socware notifi-cations received by MyPageKeeper users as a function ofthe number of friends they have on Facebook We binusers with the number of friends within 10 of each otherand plot the average number of notifications per bin weconsider here only those users who were subscribed toMyPageKeeper for at least three months We see that theprobability of users being exposed to socware is largelyindependent of their social degree This indicates thatwhether a user is more likely to be exposed to socwareis not simply a function of how many friends she has but

Shortening service of socware URLsbitly 219

tinyurlcom 188googl 51

tco 316tinycc 16owly 11

onfbme 10isgd 07jmp 04

0rzcom 03All shortened URLs 54

Table 8 Top URL shortening services in our socware dataset

Domain Name of URLs of postsfacebookcom 207 263blogspotcom 63 87miessassinfo 19 32shurulburultk 18 12tomodayinfo 08 013

Table 9 Top two-level domains in our socware dataset

likely depends on the susceptibility of those friends to be-coming victims of scams and helping propagate them

We also find that socware on Facebook is prevalentover time Figure 5 shows the number of socware no-tifications sent per day by MyPageKeeper to its usersWe see a consistently large number of notifications go-ing out daily with noticeable spikes on a few days On11th July 2011 a scam that conned users to completesurveys with the pretext of fake free products went vi-ral and posts pointing to the scam appeared 4056 timeson the walls and news feeds of MyPageKeeperrsquos usersTwo other scams that promised lsquoFree Facebook shoesrsquoand conned users to fill out surveys also caused MyPage-Keeper to send out a large number of notifications on thatday On 19th Sep 2011 different variants of the lsquoFace-book Free T-Shirtrsquo scam [9] were spreading on Facebookand was spotted 2040 times by MyPageKeeper On 3rd

Oct 2011 a video scam was spreading on Facebook andMyPageKeeper observed it in 1739 posts

We next analyze the prevalence and impact of socwarefrom the perspective of individual socware links Foreach link we define its ldquoactive-timerdquo as the differencebetween the first and last times of its occurrence in ourdataset Figure 6 shows that we did not see 60 ofsocware links beyond one day Subsequent posts con-taining these links may have been filtered by Facebookonce it recognized their spammy or malicious nature orour dataset may miss those posts due to MyPageKeeperrsquoslimited view into Facebookrsquos 850 million users Fur-ther we do not attempt any clustering of links into cam-paigns here However even with these caveats 20 ofsocware links were seen in multiple posts separated byat least 10 days suggesting that a significant fractionof socware eludes Facebookrsquos detection mechanisms andlasts on Facebook for significant durations

10

0 5

10 15 20 25 30 35 40 45

Disabled

App inst

Deleted

LaaSEvent

App prof

o

f U

RL

s

Figure 7 Breakdown of socware links when crawledin Nov 2011 that originally point to web pages in thefacebookcom domain

52 Domain name characteristics20 of socware links are hosted inside FacebookIn the next section of our analysis we focus on thedomain-level characteristics of socware links First Ta-ble 8 shows the top ten URL shortening services usedin socware links observed by MyPageKeeper In allshortened URLs account for 54 of socware links in ourdataset Our design of MyPageKeeperrsquos classifier to relysolely on social context and to not resolve short URLshence makes a significant difference (as previously seenin the comparison of classification latency)

Further we find it surprising that a large fractionof socware links (46) are not shortened given thatshortening of URLs enables spammers to obfuscatethem On further investigation we find that many Face-book scams such as lsquofree iPhonersquo and lsquofree NFL jer-seyrsquo use domain names that clearly state the messageof the scam eg httpiphonefree5com and httpnfljerseyfreecom These URLs are more likely to elicithigher click-through rates compared to shortened URLsOn the other hand most of the shortened URLs were usedby malicious or spam applications (eg lsquoThe Apprsquo lsquoPro-file Stalkerrsquo) that generate shortened URLs pointing totheir applicationrsquos installation page We find that 89of shortened URLs in our dataset of socware links wereposted by Facebook applications

Next based on our crawl of the socware links in ourdataset we inspect the top two-level domains found onthe landing pages pointed to by these links First asshown in Table 9 we find that a large fraction of socware(over 20 of URLs and 26 of posts) is hosted on Face-book itself Second a sizeable fraction of socware usessites such as blogspotcom and wordpresscomthat enable the spammers to easily create a large numberof URLs without going through the hassle of registeringnew domains Further all of these domains are of goodrepute and are unlikely to be flagged by traditional web-site blacklists

53 Analysis of socware hosted in FacebookHackers use numerous channels in Facebook tospread socware Given the large fraction of socware

0

20

40

60

80

100

0 20 40 60 80 100

o

f U

RL

s

of clicks originated from Facebook

Figure 8 For most socware links shortened with bitly orgoogl a large majority of the clicks came from Facebook

hosted on Facebook itself we next analyze this subsetof socware First in early November 2011 we crawledevery socware link in our dataset that had pointed to alanding page in the facebookcom domain at the timewhen MyPageKeeper had initially classified that link assocware Figure 7 presents a breakdown of the resultsof this crawl If Facebook disables a URL it redirectsus to facebookcomhomephp Similarly if crawl-ing a URL points us to facebookcom4oh4 it im-plies that Facebook has deleted the content at that URLTherefore as seen in Figure 7 a large fraction of socwarelinks that were originally pointing to Facebook have nowbeen deactivated However we also see that a significantfraction of these linksmdashover 40mdashwere still live Fur-ther the figure shows that spammers use several differentchannels such as applications events and pages to prop-agate their scams on Facebook In the figure lsquoApp instrsquoand lsquoApp profrsquo refer to the installation and profile pagesof Facebook applications and lsquoLaaSrsquo refers to campaignsintended to increase the number of Likes on a Facebookpage (described in detail in Section 6)

In our dataset we see 257 distinct socware links short-ened with the bitly and googlURL shorteners thatpoint to landing pages in the facebookcom domainUsing the APIs [5 16] offered by these URL shorten-ing services we computed the number of clicks recordedfor these 257 links in two casesmdash1) where the Refer-rer was Facebook and 2) where the Referrer was anyother domain Figure 8 shows that Facebook is the dom-inant platform from which most of these links receivedmost of their clicks 80 of links received over 70 oftheir clicks from Facebook This seems to indicate thatmost socware hosted on Facebook is propagated solelyon Facebook and tailored for that platform

54 Comparison of socware to email spamSocware keywords exhibit little (10) overlap withspam email keywords As we saw earlier in Section 4spam keyword score is a key feature in MyPageKeeperrsquosclassifier Therefore in the final section of our analysiswe investigate the overlap in lsquospam keywordsrsquo that weobserve in socware on Facebook with those seen in an-other medium targeted by spammers specifically email

11

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

scale but at low cost the distinguishing characteristic ofour approach is our strident focus on efficiency Prior so-lutions for detecting spam and malware on OSNs (whichwe describe in detail later) rely on information obtainedeither by crawling the URLs included in posts or by per-forming DNS resolution on these URLs In contrast oursocware classifier relies solely on the social context asso-ciated with each post (eg the number of walls and newsfeeds in which posts with the same embedded URL areobserved and the similarity of text descriptions acrossthese posts) Note that this approach means that we donot even resolve shortened URLs (eg using services likebitly) into the full URLs that they represent This ap-proach maximizes the rate at which we can classify poststhus reducing the cost of resources required to support agiven population of users

We employ a Machine Learning classification moduleusing Support Vector Machines on a carefully selectedset of such features that are readily available from theobserved posts 97 of posts flagged by our classifier areindeed socware and it incorrectly flags only 0005 ofbenign posts Furthermore it requires an average of only46 ms to classify a post

b Socware is a new kind of malware We show thatsocware is significantly different than traditional emailspam or web-based malware First URL blacklists can-not detect socware effectively These blacklists identifyonly 3 of the malicious posts that MyPageKeeper flagsThe inability of website blacklists to identify socwareis partly due to the fact that a significant fraction ofsocware is hosted on popular blogging domains such asblogspotcom and on Facebook itself Specifically26 of the flagged posts point to Facebook apps or pagesMoreover we also observe a low overlap between thekeywords associated with email based spam and those wefind in socware

c Quantifying socware prevalence and intentionWe find that 49 of our users were exposed to at leastone socware post in four months We also identify a newtype of parasitic behavior which we refer to as ldquoLike-as-a-Servicerdquo Its goal is to artificially boost the number ofldquoLikesrdquo of a Facebook page With the lure of games andrewards several Facebook apps push users to Like theFacebook pages of say a store or a product thus artifi-cially inflating their reputation on Facebook This furtherconfirms the difference between socware and other formsof malware propagation

2 Socware on FacebookIn this section we provide relevant background aboutFacebook and we describe typical characteristics ofsocware found on Facebook

21 The Facebook terminology

Facebook is the largest online social network today withover 900 million registered users roughly half of whomvisit the site daily Here we discuss some standard Face-book terminology relevant to our work

bull Post A post represents the basic unit of informationshared on Facebook Typical posts either contain onlytext (status updates) a URL with an associated text de-scription or a photoalbum shared by a user In ourwork we focus on posts that contain URLs

bull Wall A Facebook userrsquos wall is a page where friends ofthe user can post messages to the user Such messagesare called wall posts Other than to the user herselfposts on a userrsquos wall are visible to other users on Face-book determined by the userrsquos privacy settings Typi-cally a userrsquos wall is made visible to the userrsquos friendsand in some cases to friends of friends

bull News feed A Facebook userrsquos news feed page is a sum-mary of the social activity of the userrsquos friends on Face-book For example a userrsquos news feed contains poststhat one of the userrsquos friends may have shared with allof her friends Facebook continually updates the newsfeed of every user and the content of a userrsquos news feeddepends on when it is queried

bull Like Like is a Facebook widget that is associated withan object such as a post a page or an app If a userclicks the Like widget associated with an object theobject will appear in the news feed of the userrsquos friendsand thus allow information about the object to spreadacross Facebook Moreover the number of Likes (iethe number of users who have clicked the Like widget)received by an object also represents the reputation orpopularity of the object

bull Application Facebook allows third-party developers tocreate their own applications that Facebook users canadd Every time a user visits an applicationrsquos page onFacebook Facebook dynamically loads the content ofthe application from a URL called the canvas URLpointing to the application server provided by the ap-plicationrsquos developer Since content of an application isdynamically loaded every time a user visits the appli-cationrsquos page on Facebook the application developerenjoys great control over content shown in the applica-tion page The Facebook platform uses OAuth 20 [2]for user authentication application authorization andapplication authentication Here application authoriza-tion ensures that the users grant precise data (eg emailaddress) and capabilities (eg ability to post on theuserrsquos wall) to the applications they choose to add andapplication authentication ensures that a user grants ac-cess to her data to the correct application

2

22 SocwareWe start by defining the meaning of socware We de-scribe typical characteristics of socware and elaborate onhow socware distinguishes itself from traditional emailspam and web malware

What is socware Our intention is to use the termsocware to encompass all criminal and parasitic behav-ior in an OSN including anything that annoys hurts ormakes money off of the user In the context of this paperwe consider a Facebook post as malicious if it satisfiesone of the following conditions (1) the post spreads mal-ware and compromises the device of the user (2) the webpage pointed to by the post requires the user to give awaypersonal information (3) the post promises false rewards(eg free products) (4) the post is made on a userrsquos be-half without the userrsquos knowledge (typically by havingpreviously lured the user into providing relevant permis-sions to a rogue Facebook app) (5) the web page pointedto by the post requires the user to carry out tasks (eg fillout surveys) that help profit the owner of that website or(6) the post causes the user to artificially inflate the repu-tation of the page (eg by forcing the user to lsquoLikersquo thepage) While the first two criteria are typical malware andphishing the latter four are distinctive of socware

Disclaimer As with email spam there can be someambiguity in the definition of socware a post consid-ered as annoying by one user may be considered usefulby another user In practice our ultimate litmus test isthe opinion of MyPageKeeperrsquos users if most of themreport a post as annoying we will flag it as such

How does socware work Socware appears in a Face-book userrsquos wall or news feed typically in the form of apost which contains two parts First the post contains aURL 2 usually obfuscated with a URL shortening ser-vice to a webpage that hosts either malicious or spamcontent Second socware posts typically contain a catchytext message (eg ldquotwo free Southwest ticketsrdquo) that en-tice users to click on the URL included in the post Op-tionally socware posts also contain a thumbnail screen-shot of the landing page of the URL also used to enticethe user to click on the link For example a purportedimage of Osamarsquos corpse is included in a post that claimsto point to a video of his death

The operation of most socware epidemics can be asso-ciated with two distinct mechanisms

a Propagation mechanism Once a user follows theembedded URL to the target website the post tries topropagate itself through that user For this the user isoften asked to complete several steps in order to obtain

2We leave the identification of socware posts that do not contain anURL for future work The propagation of socware is harder in suchcases since the user needs to perform a more laborious operation (egenter an URL into the browserrsquos address bar) than simply clicking onthe embedded URL

AppName

Application Message MonthlyActiveUsers

FreePhoneCalls

Irsquom making a Free Call with the FreePhone Call Facebook App Irsquoll neverpay for a phone call again Make yourfree call at URL

435392

The App Check if a friend has deleted you URL 35216The App Check if a friend has deleted you URL 25778

Table 1 Three rogue Facebook applications identified by My-PageKeeper

Page Name Message to persuade lsquoLikersquo No ofLikes

Clif Bar Hey there Looking for a clif builderrsquoscoupon Just like us by clicking the but-ton above thanks

79919

FarmVille Bonus You canrsquot claim you you havenrsquot clickedon the like button

94907

Courtesy Chevrolet Like our page to play and have a changeto win

86287

Greggs The Bakers Like us to claim your voucher 288039Mobilink Infinity Like us for big infinite fun 26105

Table 2 Top five pages identified by MyPageKeeper that per-suade users to lsquoLikersquo them

the fake reward (eg ldquoFree Facebook T-shirtrdquo) Thesesteps involve ldquoLikingrdquo or ldquosharingrdquo the post or postingthe socware on the userrsquos wall Thus the cycle contin-ues with the friends of that user who see the post in theirnews feed In contrast users seldom forward email spamto their friends

b Exploitation mechanism The exploitation oftenstarts after the propagation phase The hacker attemptseither to learn the userrsquos private information via a phish-ing attack [9] to spread malware to user devices or tomake money by ldquoforcingrdquo a particular user action or re-sponse such as completing a survey for which the hackergets paid [19]

Where is socware hosted Socware can be broadlyclassified into two categories based on the infrastructurethat hosts them

a Socware hosted outside Facebook In this cate-gory URLs point to a domain hosted outside FacebookSince the URL points to a landing page outside the OSNhackers can directly launch the different kinds of attacksmentioned above once a user visits the URL in a socwarepost Though several URL blacklists should be able toflag such URLs the process of updating these blacklists istoo slow to keep up with the viral propagation of socwareon OSNs [44]

b Socware hosted on Facebook A significant frac-tion of socware is hosted on Facebook itself the embed-ded URL points to a Facebook page or application Natu-rally current blacklists and reputation-based schemes failto flag such URLs Such URLs typically point to the fol-lowing types of Facebook objectsbull Malicious Facebook applications Rogue applica-

3

tions post catchy messages (eg ldquoCheck who deletedyou from your profilerdquo) on the walls of users with alink pointing to the installation page of the applicationTable 1 lists three such socware-spreading applicationsin our data Users are conned into installing the ap-plication to their profile and granting several permis-sions to it The application then not only gets accessto that userrsquos personal information (such as email ad-dress home town and high school) but also gains theability to post on the victimrsquos wall As before posts ona userrsquos wall typically appear on the news feeds of theuserrsquos friends and the propagation cycle repeats Cre-ating such applications has become easy with ready touse toolkits starting at $25 [18]

bull Malicious Facebook events Sometimes hackers cre-ate Facebook events that contain a malicious link Onesuch event is the lsquoGet a free Facebook T-Shirt (Spon-sored by Reebok)rsquo scam This event page states that500000 users will get a free T-shirt from Facebook Tobe one among those 500000 users a user must attendthe event invite her friends to join and enter her ship-ping address

bull Malicious Facebook pages Another approach takenby hackers to spread socware is to create a Facebookpage and post spam links on the page [27] We alsoidentified a trend in aggressive marketing by compa-nies that force users to click ldquoLikerdquo on their Facebookpages to spread their pages as well as increase the rep-utation of the page Table 2 lists the top five such Face-book pages along with the message on the page andthe number of Likes received by these pages

3 MyPageKeeper ArchitectureTo identify socware and protect Facebook users from itwe develop MyPageKeeper MyPageKeeper is a Face-book application that continually checks the walls andnews feeds of subscribed users identifies socware postsand alerts the users We present our goals in designingMyPageKeeper and then describe the system architec-ture and implementation details

31 GoalsWe design MyPageKeeper with the following three pri-mary goals in mind

1 Accuracy Our foremost goal is to ensure accurateidentification of socware We are faced with the obvi-ous trade-off between missing malware posts (false neg-atives) and ldquocrying wolfrdquo too often (false positives) Al-though one could argue that minimizing false negatives ismore important users would abandon overly sensitive de-tection methods as recognized by the developers of Face-bookrsquos Immune System [52]

2 Scalability Our end goal is to have MyPageKeeperprovide protection from socware for all users on Face-

book not just for a select few Therefore the system mustbe scalable to easily handle increased load imposed by agrowth in the number of subscribed users

3 Efficiency Finally we seek to minimize our costsin operating MyPageKeeper The period between whena post first becomes visible to a user and the time it ischecked by MyPageKeeper represents the window of vul-nerability when the user is exposed to potential socwareTo minimize the resources necessary to keep this win-dow of vulnerability short MyPageKeeperrsquos techniquesfor classification of posts must be efficient

32 MyPageKeeper componentsMyPageKeeper consists of six functional modules

a User authorization module We obtain a userrsquosauthorization to check her wall and news feed through aFacebook application which we have developed Oncea user installs the MyPageKeeper application we obtainthe necessary credentials to access the posts of that userFor alerting the user we also request permission to accessthe userrsquos email address and to post on the userrsquos walland news feed Figure 1(a) shows how a Facebook userauthorizes an application

b Crawling module MyPageKeeper periodicallycollects the posts in every userrsquos wall and news feed Asmentioned previously we currently focus only on poststhat contain a URL Apart from the URL each post com-prises several other pieces of information such as a textmessage associated with the post the user who made thepost number of comments and Likes on the post and thetime when the post was created

c Feature extraction module To classify a post My-PageKeeper evaluates every embedded URL in the postOur key novelty lies in considering only the social con-text (eg the text message in the post and the numberof Likes on it) for the classification of the URL and therelated post Furthermore we use the fact that we are ob-serving more than one user which can help us detect anepidemic spread We discuss these features in more detaillater in Section 33

d Classification module The classification moduleuses a Machine Learning classifier based on Support Vec-tor Machines but also utilizes several local and externalwhitelists and blacklists that help speed up the processand increase the overall accuracy The classification mod-ule receives a URL and the related social context featuresextracted in the previous step Since the classification isour key contribution we discuss this in more detail inSection 33 If a URL is classified as socware all postscontaining the URL are labeled as such

e Notification module The notification module no-tifies all users who have socware posts in their wall ornews feed The user can currently specify the notificationmechanism which can be a combination of emailing the

4

User

Facebook Servers

Application Server

4 Generate and share

access token

1 App add request

2 Return permission setrequired by the app

3 Allow permission set

Crawler DB

Facebook DB

WLFeatureExtractor

URL

benign unknown

No SVM Classifier

benign Unknown

Malicious

Notification

BLNo

Yes Yes

Crawling Facebook posts Classifying Facebook posts

(a) (b)Figure 1 (a) Application installation process on Facebook and (b) architecture of MyPageKeeper

user or posting a comment on the suspect posts In thefuture we will consider allowing our system to removethe malicious post automatically but this can create lia-bilities in the case of false positives

f User feedback module Finally to improve My-PageKeeperrsquos ability to detect socware we leverage ouruser community We allow users to report suspiciousposts through a convenient user-interface In such a re-port the user can optionally describe the reason why sheconsiders the post as socware

33 Identification of socwareThe key novelty of MyPageKeeper lies in the classifica-tion module (summarized in Figure 1(b)) As describedearlier the input to the classification module is a URLand the related social context features extracted from theposts that contain the URL Our classification algorithmoperates in two phases with the expectation that URLsand related posts that make it through either phase with-out a match are likely benign and are treated as such

Using whitelists and blacklists To improve the effi-ciency and accuracy of our classifier we use lists of URLsand domains in the following two steps First MyPage-Keeper matches every URL against a whitelist of popularreputable domains We currently use a whitelist compris-ing the top 70 domains listed by Quantcast excluding do-mains that host user-contributed content (eg OSNs andblogging sites) Any URL that matches this whitelist isdeemed safe and it is not processed further

Second all the URLs that remain are then matchedwith several URL blacklists that list domains and URLsthat have been identified as responsible for spam phish-ing or malware Again the need to minimize classi-fication latency forces us to only use blacklists that wecan download and match against locally Such blacklistsinclude those from Googlersquos Safe Browsing API [17]Malware Patrol [23] PhishTank [26] APWG [1] Spam-Cop [28] joewein [20] and Escrow Fraud [7] Queryingblacklists that are hosted externally such as SURBL [31]URIBL [33] and WOT [34] will introduce significant la-tency and increase MyPageKeeperrsquos latency in detectingsocware thus inflating the window of vulnerability AnyURL that matches any of the blacklists that we use is clas-sified as socware

Using machine learning with social context fea-tures All URLs that do not match the whitelist or anyof the blacklists are evaluated using a Support VectorMachines (SVM) based classifier SVM is widely andsuccessfully used for binary classification in security andother disciplines [49 46] [32] We train our system witha batch of manually labeled data that we gathered overseveral months prior to the launch of MyPageKeeper Forevery input URL and post the classifier outputs a binarydecision to indicate whether it is malicious or not OurSVM classifier uses the following features

Spam keyword score Presence of spam keywords in apost provides a strong indication that the post is spamSome examples of such spam keywords are lsquoFREErsquolsquoHurryrsquo lsquoDealrsquo and lsquoShockedrsquo To compile a list of suchkeywords that are distinctive to socware our intuitionis to identify those keywords that 1) occur frequently insocware posts and 2) appear with a greater frequency insocware as compared to their frequency in benign posts

We compile such a list of keywords by comparinga dataset of manually identified socware posts witha dataset of posts that contain URLs that match ourwhitelist (we discuss how to maintain this list of key-words in Section 7) We transform posts in either datasetto a bag of words with their frequency of occurrenceWe then compute the likelihood ratio p1p2 for eachkeyword where p1 = p(word|socwarepost) and p2 =p(word|benignpost) The likelihood ratio of a key-word indicates the bias of the keyword appearing morein socware than in benign posts In our current imple-mentation of MyPageKeeper we have found that the useof the 6 keywords with the highest likelihood ratio val-ues among the 100 most frequently occurring keywordsin socware is sufficient to accurately detect socware

Thereafter to classify a URL MyPageKeeper searchesall posts that contain the URL for the presence of thesespam keywords and computes a spam keyword score asthe ratio of the number of occurrences of spam keywordsacross these posts to the number of posts

Message similarity If a post is part of a spam cam-paign it usually contains a text message that is similarto the text in other posts containing the same URL (egbecause users propagate the post by simply sharing it)On the other hand when different users share the same

5

popular URL they are likely to include different text de-scriptions in their posts Therefore greater similarity inthe text messages across all posts containing a URL por-tends a higher probability that the URL leads to spam Tocapture this intuition for each URL we compute a mes-sage similarity score that captures the variance in the textmessages across all posts that contain the URL For eachpost MyPageKeeper sums the ASCII values of the char-acters in the text message in the post and then computesthe standard deviation of this sum across all the posts thatcontain the URL If the text descriptions in all posts aresimilar the standard deviation will be low

News feed post and wall post count The more suc-cessful a spam campaign the greater the number of wallsand news feeds in which posts corresponding to the cam-paign will be seen Therefore for each URL MyPage-Keeper computes counts of the number of wall posts andthe number of news feed posts which contained the URL

Like and comment count Facebook users can lsquoLikersquoany post to indicate their interest or approval Users canalso post comments to follow up on the post again indi-cating their interest Users are unlikely to lsquoLikersquo postspointing to socware or comment on such posts sincethey add little value Therefore for every URL My-PageKeeper computes counts of the number of Likes andnumber of comments seen across all posts that containthe URL

URL obfuscation Hackers often try to spread mali-cious links in an obfuscated form eg by shortening itwith a URL shortening service such as bitly or googlWe store a binary feature with every URL that indicateswhether the URL has been shortened or not we maintaina list of URL shorteners

Note that none of the above features by themselves areconclusive evidence of socware and other features couldpotentially further enhance the classifier (eg we can ac-count for spam keywords such as lsquofreersquo included in URLssuch as httpnfljerseyfreecom) However as we showlater in our evaluation the features that we currently con-sider yield high classification accuracy in combination

34 Implementing MyPageKeeperWe provide some details on MyPageKeeperrsquos implemen-tation

Facebook application First we implement the My-PageKeeper Facebook application using FBML [14]We implement our application server using Apache(web server) Django (web framework) and Postgres(database) Once a user installs the MyPageKeeper app inher profile Facebook generates a secret access token andforwards the token to our application server which wethen save in a database This token is used by the crawlerto crawl the walls and news feeds of subscribed users us-ing the Facebook open-graph API If any user deactivates

Data Total distinct URLsMyPageKeeper users 12456 -

Friends of MyPageKeeper users 2370272 -News feed posts 38764575 29522732

Wall posts 1760737 1532055User reports 679 333

Table 3 Summary of MyPageKeeper data

MyPageKeeper from their profile Facebook disables thistoken and notifies our application server whereupon westop crawling that userrsquos wall and news feed

Crawler instances and frequency We run a set ofcrawlers in Amazon EC2 instances to periodically crawlthe walls and news feeds of MyPageKeeperrsquos users Theset of users are partitioned across the crawlers In ourcurrent instantiation we run one crawler process for ev-ery 1000 users Thus as more users subscribe to My-PageKeeper we can easily scale the task of crawling theirwalls and news feeds by instantiating more EC2 instancesfor the task Our Python-based crawlers use the open-graph API incorporating usersrsquo secret access tokens tocrawl posts from Facebook Once the data is received inJSON format the crawlers parse the data and save it in alocal Postgres database

Currently as a tradeoff between timeliness of detectionand resource costs on EC2 we instantiate MyPageKeeperto crawl every userrsquos wall and news feed once every twohours Every couple of hours all of our crawlers start upand each crawler fetches new posts that were previouslynot seen for the users assigned to it Once all crawlerscomplete execution the data from their local databases ismigrated to a central database

Checker instances Checker modules are used to clas-sify every post as socware or benign Every two hoursthe central scheduler forks an appropriate number ofchecker modules determined by the number of new URLscrawled since the last round of checking Thus the iden-tification of socware is also scalable since each checkermodule runs on a subset of the pool of URLs Eachchecker evaluates the URLs it receives as inputmdashusinga combination of whitelists blacklists and a classifiermdashand saves the results in a database We use the libsvm [41]library for SVM based classification Once all checkermodules complete execution notifiers are invoked to no-tify all users who have posts either on their wall or intheir news feed that contain URLs that have been flaggedas socware

4 EvaluationNext we evaluate MyPageKeeper from three perspec-tives First we evaluate the accuracy with which it clas-sifies socware Second we determine the contribution ofMyPageKeeperrsquos social context based classifier in iden-tifying socware compared to the URL blacklists that ituses Lastly we compare MyPageKeeperrsquos efficiency

6

Feature F-ScoreURL obfuscated 0300378

Spam keyword score 0262220 of news feed posts 0173836

Message similarity score 0131733 of Likes 0039895

of wall posts 0019857 of comments 0006367

Table 4 Feature scores used by MyPageKeeperrsquos classifierAlternative source of postsFlagged by blacklist 18923Flagged by onfbme 2102

Content deleted by Facebook 3918Blacklisted app 1290Blacklisted IP 5827

Domain is deleted 247Points to app install 4658

Spamming app 6547Manually verified 14876

True positives 58388 (97)Unknown 1803 (3)

Total 60191

Table 5 Validation of socware flagged by MyPageKeeper clas-sifier

with alternative approaches that would either crawl everyURL or at least resolve short URLs in order to identifysocware

Table 3 summarizes the dataset of Facebook posts onwhich we conduct our evaluation This data is obtainedduring MyPageKeeperrsquos operation over a four month pe-riod from 20th June to 19th October 2011 MyPage-Keeper had over 12K users during this period who hadaround 237M friends in union Our data comprises 387million and 17 million posts that contain URLs fromthe news feeds and walls of these 12K users We con-sider only those posts that contain URLs since MyPage-Keeper currently checks only such posts Overall these40 million posts contained around 30 million uniqueURLs In addition we received 679 reports of socwarefrom 533 distinct MyPageKeeper users during the fourmonth period with 333 distinct URLs across these re-ports Though it is hard to make any general claims withregard to representativeness of our data we find that sev-eral user metrics (eg the male-to-female ratio and thedistribution of users across age groups) closely match thatof the Facebook user base at-large

41 AccuracyAs previously mentioned MyPageKeeper first matchesevery URL to be checked against a whitelist If no matchis found it checks the URL with a set of locally queriableURL blacklists Finally MyPageKeeper applies its socialcontext based classifier learned using the SVM modelIn this process we assume URL information providedby whitelists and blacklists to be ground truth ie clas-sification provided by them need not be independentlyvalidated Therefore we focus here on validating the

App name Description of postsSendible Social Media Manage-

ment6687

iRazoo Search amp win 18534Loot 4Loot lets you win all

sorts of Loot whilesearching the web

1891

Table 6 Top three spamming applications in our dataset

socware flagged by MyPageKeeperrsquos classifier based onsocial context features

We trained MyPageKeeperrsquos classifier using a manu-ally verified dataset of URLs that contain 2500 positivesamples and 5000 negative samples of socware posts wegathered these samples over several months while devel-oping MyPageKeeper Table 4 shows the importance ofthe various features in the SVM classifier learned Dur-ing the course of MyPageKeeperrsquos operation over fourmonths we applied the classifier to check 753516 uniqueURLs these are URLs that do not match the whitelist orany of the blacklists Of these URLs the classifier iden-tified 4972 URLs seen across 60191 posts as instancesof socware

It is important to note that when MyPageKeeper seesa URL in multiple posts over time the values of the fea-tures associated with the URL may change every timeit appears eg the message similarity score associatedwith the URL can change However once MyPage-Keeper classifies a URL as socware during any of its oc-currences it flags all previously seen posts that containthe URL and notifies the corresponding users Thereforein evaluating MyPageKeeperrsquos classifier URL blacklistsor MyPageKeeper as a whole we consider here that atechnique classified a particular URL as socware if thatURL was flagged by that technique upon any of theURLrsquos occurrences Correspondingly we consider a URLto have not been classified as socware if it was not iden-tified as such during any of its occurrences

Checking the validity of socware identified by My-PageKeeperrsquos classifier is not straightforward since thereis no ground truth for what represents socware and whatdoes not However here we attempt to evaluate the pos-itive samples of socware identified by MyPageKeeperrsquosclassifier using a combination of a host of complemen-tary techniques (we later discuss in Section 7 the valida-tion of posts that are deemed safe by MyPageKeeper) Todo so we use an instrumented Firefox browser to crawlthe 4972 URLs flagged by MyPageKeeper at the end ofthe four month period of MyPageKeeper operation Forevery URL that we crawl we record the landing URLthe IP address and other whois information of the land-ing domain and contents of the landing page To verifythe reputation of every URL we then apply several tech-niques in the order summarized in Table 5bull Blacklisted URLs First we check if any of the URLs

or the corresponding landing URLs are found in any

7

URL blacklists Note that though we use blacklistsin the operation of MyPageKeeper itself we use onlythose that can be stored and queried locally There-fore here we use for validation other external black-lists for which we have to issue remote queries Fur-ther even for blacklists used in MyPageKeeper theymay not identify some instances of socware when theyinitially appear because blacklists have been found tolag in keeping up with the viral propagation of spamon OSNs [44] Hence we check if a URL identified assocware by MyPageKeeperrsquos classifier appeared in anyof the blacklists used by MyPageKeeper at a later pointin time even though it did not appear in any of thoseblacklists initially when MyPageKeeper spotted postscontaining that URL

bull Flagged by fbme URL shortener Many URLs postedon Facebook are shortened using Facebookrsquos URLshortener fbme When Facebook determines any linkshortened using their service to be unsafe the corre-sponding shortened URL thereafter redirects to Face-bookrsquos home pagemdashfacebookcomhomephpmdashinstead of the actual landing page Of the URLs flaggedby MyPageKeeperrsquos classifier we check if those short-ened using Facebookrsquos URL shortening service redirectto Facebookrsquos home page

bull Content deleted from Facebook If Facebook deter-mines any URL hosted under the facebookcom do-main to be unsafe (eg the page for a spamming Face-book application) it thereafter redirects that URL tofacebookcom4oh4php We use this as anothersource of information to validate URLs flagged by My-PageKeeperrsquos classifier

bull Blacklisted apps If the URLs in posts made by a Face-book app are flagged due to any of the above reasonswe consider that app to be malicious and declare allother URLs posted by it as unsafe thus helping val-idate some of the URLs declared as socware by My-PageKeeperrsquos classifier

bull Blacklisted IPs For every URL flagged by any of theabove techniques we record the IP address when thatURL is crawled and blacklist that IP Of the URLsflagged by MyPageKeeperrsquos classifier we then con-sider those that lead to one of these blacklisted IP ad-dresses as correctly classified

bull Domain deleted Malicious domains are often deletedonce they are caught serving malicious content There-fore we deem MyPageKeeperrsquos positive classificationof a URL to be correct if the domain for that URL nolonger exists when we attempt to crawl it

bull Obfuscation of app installation page Posts made byFacebook applications to attract users to install themtypically include an un-shortened URL pointing to aFacebook page that contains information about the ap-

Source () of URLs () of posts Overlap withclassifier (of URLs)

Google SBA2 221 (68) 378 (04) 0Phishtank 12 (04) 435 (05) 1Malware Norm 69 (21) 154 (02) 0Joewein 240 (74) 652 (07) 11APWG 56 (17) 569 (06) 0Spamcop 232 (71) 921 (10) 0All blacklists 830 (256) 3104 (34) 12MyPageKeeperclassifier

2405 (744) 89389 (966)

Table 7 Comparison of contribution made by blacklists andclassifier to MyPageKeeperrsquos identification of socware duringthe four month period of operation

plication Once a user visits this page she can readthe applicationrsquos description and then click on a link onthis page if she decides to install it However postsfrom some surreptitious applications contained short-ened URLs that directly take the user to a page wherethey request the user to grant permissions (eg to poston the userrsquos wall) and install the application We havefound all instances of such applications to be spammingapplications Therefore if any of the URLs flagged byMyPageKeeperrsquos classifier is a shortened URL that di-rectly points to the installation page for a Facebook appwe declare that classification correct

bull Spamming app From our dataset we manually identi-fied several Facebook applications that try to spread onFacebook by promising free money to users and makeposts that point to the application page Once installedby a user such applications periodically post on theuserrsquos wall (without requesting the userrsquos authorizationfor each post) in an attempt to further propagate by at-tracting that userrsquos friends Table 6 shows some suchapplications that frequently appear in our dataset AnyURLs classified as socware by MyPageKeeperrsquos classi-fier that happen to be posted by one of these manuallyidentified spamming apps are deemed correct

bull Manual analysis Finally over the operation of My-PageKeeper during the four months we periodicallyverified a subset of URLs flagged by the classifierThese provide an additional source of validation

In all the union of the above techniques validatesthat 58388 out of 60191 posts declared as socware bythe MyPageKeeper classifier are indeed so Therefore97 of the socware identified by MyPageKeeperrsquos clas-sifier are true positives On the other hand the 1803posts incorrectly classified as socware constitute less than0005 of the over 40 million posts in our dataset Notethat though all of the above techniques could be foldedinto MyPageKeeper itself to help identify socware wedo not do so because all of these techniques require us tocrawl a URL in order to evaluate it we cannot afford thelatency of crawling

8

0

10

20

30

40

50

MPK

Resolver

Craw

ler

Th

rou

gh

pu

t (U

RL

sec

)

(a) Throughput

10-2

10-1

100

101

102

MPK

Resolver

Craw

ler

Lat

ency

(se

c)

(b) Latency

Figure 2 Comparison of MyPageKeeperrsquos throughput and la-tency in classifying URLs with a short URL resolver and acrawler-based approach The height of the box shows the me-dian with the whiskers representing 5th and 95th percentiles

42 Comparison with blacklistsThough we see that the identification of socware by My-PageKeeperrsquos classifier is accurate the next logical ques-tion is what is the classifierrsquos contribution to MyPage-Keeper in comparison with URL blacklists Table 7 pro-vides a breakdown of the URLs and posts classified assocware by MyPageKeeper during the four month periodunder consideration There are two main takeaways fromthis table First we see that the classifier finds 744 ofsocware URLs and 966 of socware posts identified byMyPageKeeper Thus the classifier accounts for a largemajority of socware identified by MyPageKeeper and isthus critical to the systemrsquos operation Second there isvery little overlap between the URLs flagged by black-lists and those flagged by the classifier The typically lowfrequency of occurrence of URLs that match blacklistsis another reason that the classifierrsquos share of identifiedsocware posts is significantly greater than its correspond-ing share of flagged URLs

43 EfficiencyBeyond accuracy it is critical that MyPageKeeperrsquos iden-tification of socware be efficient so as to minimize thecosts that we need to bear in order to keep the delay inidentifying socware and alerting users low The match-ing of a URL against a whitelist or a local set of blacklistsincurs minimal computational overhead In addition wefind that execution of the classifier also imposes minimaldelay per URL verified

To demonstrate the efficiency of MyPageKeeper wecompare the rate at which it classifies URLs with the clas-sification throughput that two other alternative classes ofapproaches would be able to sustain Our first point ofcomparison is an approach that relies only on locally que-riable URL whitelists and blacklists but resolves all short-ened URLs into the corresponding complete URL Oursecond alternative crawls URLs to evaluate them eg us-ing the content on the page or the IP address of the targetwebsite Figure 2(a) compares the throughput of classi-fying URLs with the three approaches using data from

40

50

60

70

80

90

100

100

101

102

103

o

f users

of email notifications sent

Figure 3 49 of MyPageKeeperrsquos 12456 users were notifiedof socware at least once in four months of MyPageKeeperrsquos op-eration

0

5

10

15

20

25

30

35

40

101

102

103

104

Mean

of

noti

ficati

ons

of friends

Figure 4 Correlation between vulnerability and social degreeof exposed users

two weeks of MyPageKeeperrsquos execution We see thatthe throughput with MyPageKeeper is almost an order ofmagnitude greater than the alternatives with all three ap-proaches using the same set of resources on EC2 As wesee in Figure 2(b) MyPageKeeperrsquos better performancestems from its lower execution latency to check an URLthe median classification latency with MyPageKeeper is48 ms compared to a median of 426 ms when resolvingshort URLs and 19 seconds when crawling URLs Thuswe are able to significantly reduce MyPageKeeperrsquos clas-sification latency compared to approaches that need toresolve short URLs or crawl target web pages by keep-ing all of its computation local

Furthermore a crawler-based approach will be signif-icantly more expensive than MyPageKeeper Thomaset al [54] found that crawler-based classification of 15million URLs per day using cloud infrastructure resultsin an expense of $800day Therefore we estimate thatit would cost approximately $15 millionyear to handleFacebookrsquos workload 1 million URLs are shared every20 minutes on Facebook [35] Since MyPageKeeperrsquosclassification latency is 40 times less than a crawler-basedapproach we estimate that the expense incurred with My-PageKeeper would be at least 40 times lower than a sys-tem that classifies URLs by crawling them

5 Analysis of SocwareThus far we described how MyPageKeeper detectssocware efficiently at scale In this section we analyzethe socware that we have found during MyPageKeeperrsquosoperation to throw light on characteristics of socware onFacebook

9

0

500

1000

1500

2000

2500

3000

3500

061

8

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

o

f noti

ficati

ons

Figure 5 No of socware notifications per day On 11th July19th Sep and 3rd Oct socware was observed in large scale

0

20

40

60

80

100

100

101

102

103

o

f li

nks

Active duration (days)

Figure 6 Active-time of socware links 20 of socware linkswere observed more than 10 days apart

51 Prevalence of socware49 of MyPageKeeperrsquos users were exposed tosocware within four months First we analyze theprevalence of socware on Facebook To do so we de-fine that a user was exposed to a particular socware postif that post appeared in her wall or news feed As shownin Figure 3 49 of MyPageKeeperrsquos users were exposedto at least one socware post during the four month pe-riod we consider here Though this already indicates thewide reach of socware on Facebook we stress that 49is only a lower-bound due to a couple of reasons Firstmany of MyPageKeeperrsquos users subscribed to our appli-cation at some time in the midst of the four month periodand therefore we miss socware that they were potentiallyexposed to prior to them subscribing to MyPageKeeperSecond Facebook itself detects and removes posts thatit considers as spam or pointing to malware [36 38 52]All the socware detected by MyPageKeeper is after suchfiltering by Facebook

Given that some users are exposed to more socwarethan others we analyze if the social degree of a user hasany impact on the probability of a user being exposed tosocware Figure 4 shows the number of socware notifi-cations received by MyPageKeeper users as a function ofthe number of friends they have on Facebook We binusers with the number of friends within 10 of each otherand plot the average number of notifications per bin weconsider here only those users who were subscribed toMyPageKeeper for at least three months We see that theprobability of users being exposed to socware is largelyindependent of their social degree This indicates thatwhether a user is more likely to be exposed to socwareis not simply a function of how many friends she has but

Shortening service of socware URLsbitly 219

tinyurlcom 188googl 51

tco 316tinycc 16owly 11

onfbme 10isgd 07jmp 04

0rzcom 03All shortened URLs 54

Table 8 Top URL shortening services in our socware dataset

Domain Name of URLs of postsfacebookcom 207 263blogspotcom 63 87miessassinfo 19 32shurulburultk 18 12tomodayinfo 08 013

Table 9 Top two-level domains in our socware dataset

likely depends on the susceptibility of those friends to be-coming victims of scams and helping propagate them

We also find that socware on Facebook is prevalentover time Figure 5 shows the number of socware no-tifications sent per day by MyPageKeeper to its usersWe see a consistently large number of notifications go-ing out daily with noticeable spikes on a few days On11th July 2011 a scam that conned users to completesurveys with the pretext of fake free products went vi-ral and posts pointing to the scam appeared 4056 timeson the walls and news feeds of MyPageKeeperrsquos usersTwo other scams that promised lsquoFree Facebook shoesrsquoand conned users to fill out surveys also caused MyPage-Keeper to send out a large number of notifications on thatday On 19th Sep 2011 different variants of the lsquoFace-book Free T-Shirtrsquo scam [9] were spreading on Facebookand was spotted 2040 times by MyPageKeeper On 3rd

Oct 2011 a video scam was spreading on Facebook andMyPageKeeper observed it in 1739 posts

We next analyze the prevalence and impact of socwarefrom the perspective of individual socware links Foreach link we define its ldquoactive-timerdquo as the differencebetween the first and last times of its occurrence in ourdataset Figure 6 shows that we did not see 60 ofsocware links beyond one day Subsequent posts con-taining these links may have been filtered by Facebookonce it recognized their spammy or malicious nature orour dataset may miss those posts due to MyPageKeeperrsquoslimited view into Facebookrsquos 850 million users Fur-ther we do not attempt any clustering of links into cam-paigns here However even with these caveats 20 ofsocware links were seen in multiple posts separated byat least 10 days suggesting that a significant fractionof socware eludes Facebookrsquos detection mechanisms andlasts on Facebook for significant durations

10

0 5

10 15 20 25 30 35 40 45

Disabled

App inst

Deleted

LaaSEvent

App prof

o

f U

RL

s

Figure 7 Breakdown of socware links when crawledin Nov 2011 that originally point to web pages in thefacebookcom domain

52 Domain name characteristics20 of socware links are hosted inside FacebookIn the next section of our analysis we focus on thedomain-level characteristics of socware links First Ta-ble 8 shows the top ten URL shortening services usedin socware links observed by MyPageKeeper In allshortened URLs account for 54 of socware links in ourdataset Our design of MyPageKeeperrsquos classifier to relysolely on social context and to not resolve short URLshence makes a significant difference (as previously seenin the comparison of classification latency)

Further we find it surprising that a large fractionof socware links (46) are not shortened given thatshortening of URLs enables spammers to obfuscatethem On further investigation we find that many Face-book scams such as lsquofree iPhonersquo and lsquofree NFL jer-seyrsquo use domain names that clearly state the messageof the scam eg httpiphonefree5com and httpnfljerseyfreecom These URLs are more likely to elicithigher click-through rates compared to shortened URLsOn the other hand most of the shortened URLs were usedby malicious or spam applications (eg lsquoThe Apprsquo lsquoPro-file Stalkerrsquo) that generate shortened URLs pointing totheir applicationrsquos installation page We find that 89of shortened URLs in our dataset of socware links wereposted by Facebook applications

Next based on our crawl of the socware links in ourdataset we inspect the top two-level domains found onthe landing pages pointed to by these links First asshown in Table 9 we find that a large fraction of socware(over 20 of URLs and 26 of posts) is hosted on Face-book itself Second a sizeable fraction of socware usessites such as blogspotcom and wordpresscomthat enable the spammers to easily create a large numberof URLs without going through the hassle of registeringnew domains Further all of these domains are of goodrepute and are unlikely to be flagged by traditional web-site blacklists

53 Analysis of socware hosted in FacebookHackers use numerous channels in Facebook tospread socware Given the large fraction of socware

0

20

40

60

80

100

0 20 40 60 80 100

o

f U

RL

s

of clicks originated from Facebook

Figure 8 For most socware links shortened with bitly orgoogl a large majority of the clicks came from Facebook

hosted on Facebook itself we next analyze this subsetof socware First in early November 2011 we crawledevery socware link in our dataset that had pointed to alanding page in the facebookcom domain at the timewhen MyPageKeeper had initially classified that link assocware Figure 7 presents a breakdown of the resultsof this crawl If Facebook disables a URL it redirectsus to facebookcomhomephp Similarly if crawl-ing a URL points us to facebookcom4oh4 it im-plies that Facebook has deleted the content at that URLTherefore as seen in Figure 7 a large fraction of socwarelinks that were originally pointing to Facebook have nowbeen deactivated However we also see that a significantfraction of these linksmdashover 40mdashwere still live Fur-ther the figure shows that spammers use several differentchannels such as applications events and pages to prop-agate their scams on Facebook In the figure lsquoApp instrsquoand lsquoApp profrsquo refer to the installation and profile pagesof Facebook applications and lsquoLaaSrsquo refers to campaignsintended to increase the number of Likes on a Facebookpage (described in detail in Section 6)

In our dataset we see 257 distinct socware links short-ened with the bitly and googlURL shorteners thatpoint to landing pages in the facebookcom domainUsing the APIs [5 16] offered by these URL shorten-ing services we computed the number of clicks recordedfor these 257 links in two casesmdash1) where the Refer-rer was Facebook and 2) where the Referrer was anyother domain Figure 8 shows that Facebook is the dom-inant platform from which most of these links receivedmost of their clicks 80 of links received over 70 oftheir clicks from Facebook This seems to indicate thatmost socware hosted on Facebook is propagated solelyon Facebook and tailored for that platform

54 Comparison of socware to email spamSocware keywords exhibit little (10) overlap withspam email keywords As we saw earlier in Section 4spam keyword score is a key feature in MyPageKeeperrsquosclassifier Therefore in the final section of our analysiswe investigate the overlap in lsquospam keywordsrsquo that weobserve in socware on Facebook with those seen in an-other medium targeted by spammers specifically email

11

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

22 SocwareWe start by defining the meaning of socware We de-scribe typical characteristics of socware and elaborate onhow socware distinguishes itself from traditional emailspam and web malware

What is socware Our intention is to use the termsocware to encompass all criminal and parasitic behav-ior in an OSN including anything that annoys hurts ormakes money off of the user In the context of this paperwe consider a Facebook post as malicious if it satisfiesone of the following conditions (1) the post spreads mal-ware and compromises the device of the user (2) the webpage pointed to by the post requires the user to give awaypersonal information (3) the post promises false rewards(eg free products) (4) the post is made on a userrsquos be-half without the userrsquos knowledge (typically by havingpreviously lured the user into providing relevant permis-sions to a rogue Facebook app) (5) the web page pointedto by the post requires the user to carry out tasks (eg fillout surveys) that help profit the owner of that website or(6) the post causes the user to artificially inflate the repu-tation of the page (eg by forcing the user to lsquoLikersquo thepage) While the first two criteria are typical malware andphishing the latter four are distinctive of socware

Disclaimer As with email spam there can be someambiguity in the definition of socware a post consid-ered as annoying by one user may be considered usefulby another user In practice our ultimate litmus test isthe opinion of MyPageKeeperrsquos users if most of themreport a post as annoying we will flag it as such

How does socware work Socware appears in a Face-book userrsquos wall or news feed typically in the form of apost which contains two parts First the post contains aURL 2 usually obfuscated with a URL shortening ser-vice to a webpage that hosts either malicious or spamcontent Second socware posts typically contain a catchytext message (eg ldquotwo free Southwest ticketsrdquo) that en-tice users to click on the URL included in the post Op-tionally socware posts also contain a thumbnail screen-shot of the landing page of the URL also used to enticethe user to click on the link For example a purportedimage of Osamarsquos corpse is included in a post that claimsto point to a video of his death

The operation of most socware epidemics can be asso-ciated with two distinct mechanisms

a Propagation mechanism Once a user follows theembedded URL to the target website the post tries topropagate itself through that user For this the user isoften asked to complete several steps in order to obtain

2We leave the identification of socware posts that do not contain anURL for future work The propagation of socware is harder in suchcases since the user needs to perform a more laborious operation (egenter an URL into the browserrsquos address bar) than simply clicking onthe embedded URL

AppName

Application Message MonthlyActiveUsers

FreePhoneCalls

Irsquom making a Free Call with the FreePhone Call Facebook App Irsquoll neverpay for a phone call again Make yourfree call at URL

435392

The App Check if a friend has deleted you URL 35216The App Check if a friend has deleted you URL 25778

Table 1 Three rogue Facebook applications identified by My-PageKeeper

Page Name Message to persuade lsquoLikersquo No ofLikes

Clif Bar Hey there Looking for a clif builderrsquoscoupon Just like us by clicking the but-ton above thanks

79919

FarmVille Bonus You canrsquot claim you you havenrsquot clickedon the like button

94907

Courtesy Chevrolet Like our page to play and have a changeto win

86287

Greggs The Bakers Like us to claim your voucher 288039Mobilink Infinity Like us for big infinite fun 26105

Table 2 Top five pages identified by MyPageKeeper that per-suade users to lsquoLikersquo them

the fake reward (eg ldquoFree Facebook T-shirtrdquo) Thesesteps involve ldquoLikingrdquo or ldquosharingrdquo the post or postingthe socware on the userrsquos wall Thus the cycle contin-ues with the friends of that user who see the post in theirnews feed In contrast users seldom forward email spamto their friends

b Exploitation mechanism The exploitation oftenstarts after the propagation phase The hacker attemptseither to learn the userrsquos private information via a phish-ing attack [9] to spread malware to user devices or tomake money by ldquoforcingrdquo a particular user action or re-sponse such as completing a survey for which the hackergets paid [19]

Where is socware hosted Socware can be broadlyclassified into two categories based on the infrastructurethat hosts them

a Socware hosted outside Facebook In this cate-gory URLs point to a domain hosted outside FacebookSince the URL points to a landing page outside the OSNhackers can directly launch the different kinds of attacksmentioned above once a user visits the URL in a socwarepost Though several URL blacklists should be able toflag such URLs the process of updating these blacklists istoo slow to keep up with the viral propagation of socwareon OSNs [44]

b Socware hosted on Facebook A significant frac-tion of socware is hosted on Facebook itself the embed-ded URL points to a Facebook page or application Natu-rally current blacklists and reputation-based schemes failto flag such URLs Such URLs typically point to the fol-lowing types of Facebook objectsbull Malicious Facebook applications Rogue applica-

3

tions post catchy messages (eg ldquoCheck who deletedyou from your profilerdquo) on the walls of users with alink pointing to the installation page of the applicationTable 1 lists three such socware-spreading applicationsin our data Users are conned into installing the ap-plication to their profile and granting several permis-sions to it The application then not only gets accessto that userrsquos personal information (such as email ad-dress home town and high school) but also gains theability to post on the victimrsquos wall As before posts ona userrsquos wall typically appear on the news feeds of theuserrsquos friends and the propagation cycle repeats Cre-ating such applications has become easy with ready touse toolkits starting at $25 [18]

bull Malicious Facebook events Sometimes hackers cre-ate Facebook events that contain a malicious link Onesuch event is the lsquoGet a free Facebook T-Shirt (Spon-sored by Reebok)rsquo scam This event page states that500000 users will get a free T-shirt from Facebook Tobe one among those 500000 users a user must attendthe event invite her friends to join and enter her ship-ping address

bull Malicious Facebook pages Another approach takenby hackers to spread socware is to create a Facebookpage and post spam links on the page [27] We alsoidentified a trend in aggressive marketing by compa-nies that force users to click ldquoLikerdquo on their Facebookpages to spread their pages as well as increase the rep-utation of the page Table 2 lists the top five such Face-book pages along with the message on the page andthe number of Likes received by these pages

3 MyPageKeeper ArchitectureTo identify socware and protect Facebook users from itwe develop MyPageKeeper MyPageKeeper is a Face-book application that continually checks the walls andnews feeds of subscribed users identifies socware postsand alerts the users We present our goals in designingMyPageKeeper and then describe the system architec-ture and implementation details

31 GoalsWe design MyPageKeeper with the following three pri-mary goals in mind

1 Accuracy Our foremost goal is to ensure accurateidentification of socware We are faced with the obvi-ous trade-off between missing malware posts (false neg-atives) and ldquocrying wolfrdquo too often (false positives) Al-though one could argue that minimizing false negatives ismore important users would abandon overly sensitive de-tection methods as recognized by the developers of Face-bookrsquos Immune System [52]

2 Scalability Our end goal is to have MyPageKeeperprovide protection from socware for all users on Face-

book not just for a select few Therefore the system mustbe scalable to easily handle increased load imposed by agrowth in the number of subscribed users

3 Efficiency Finally we seek to minimize our costsin operating MyPageKeeper The period between whena post first becomes visible to a user and the time it ischecked by MyPageKeeper represents the window of vul-nerability when the user is exposed to potential socwareTo minimize the resources necessary to keep this win-dow of vulnerability short MyPageKeeperrsquos techniquesfor classification of posts must be efficient

32 MyPageKeeper componentsMyPageKeeper consists of six functional modules

a User authorization module We obtain a userrsquosauthorization to check her wall and news feed through aFacebook application which we have developed Oncea user installs the MyPageKeeper application we obtainthe necessary credentials to access the posts of that userFor alerting the user we also request permission to accessthe userrsquos email address and to post on the userrsquos walland news feed Figure 1(a) shows how a Facebook userauthorizes an application

b Crawling module MyPageKeeper periodicallycollects the posts in every userrsquos wall and news feed Asmentioned previously we currently focus only on poststhat contain a URL Apart from the URL each post com-prises several other pieces of information such as a textmessage associated with the post the user who made thepost number of comments and Likes on the post and thetime when the post was created

c Feature extraction module To classify a post My-PageKeeper evaluates every embedded URL in the postOur key novelty lies in considering only the social con-text (eg the text message in the post and the numberof Likes on it) for the classification of the URL and therelated post Furthermore we use the fact that we are ob-serving more than one user which can help us detect anepidemic spread We discuss these features in more detaillater in Section 33

d Classification module The classification moduleuses a Machine Learning classifier based on Support Vec-tor Machines but also utilizes several local and externalwhitelists and blacklists that help speed up the processand increase the overall accuracy The classification mod-ule receives a URL and the related social context featuresextracted in the previous step Since the classification isour key contribution we discuss this in more detail inSection 33 If a URL is classified as socware all postscontaining the URL are labeled as such

e Notification module The notification module no-tifies all users who have socware posts in their wall ornews feed The user can currently specify the notificationmechanism which can be a combination of emailing the

4

User

Facebook Servers

Application Server

4 Generate and share

access token

1 App add request

2 Return permission setrequired by the app

3 Allow permission set

Crawler DB

Facebook DB

WLFeatureExtractor

URL

benign unknown

No SVM Classifier

benign Unknown

Malicious

Notification

BLNo

Yes Yes

Crawling Facebook posts Classifying Facebook posts

(a) (b)Figure 1 (a) Application installation process on Facebook and (b) architecture of MyPageKeeper

user or posting a comment on the suspect posts In thefuture we will consider allowing our system to removethe malicious post automatically but this can create lia-bilities in the case of false positives

f User feedback module Finally to improve My-PageKeeperrsquos ability to detect socware we leverage ouruser community We allow users to report suspiciousposts through a convenient user-interface In such a re-port the user can optionally describe the reason why sheconsiders the post as socware

33 Identification of socwareThe key novelty of MyPageKeeper lies in the classifica-tion module (summarized in Figure 1(b)) As describedearlier the input to the classification module is a URLand the related social context features extracted from theposts that contain the URL Our classification algorithmoperates in two phases with the expectation that URLsand related posts that make it through either phase with-out a match are likely benign and are treated as such

Using whitelists and blacklists To improve the effi-ciency and accuracy of our classifier we use lists of URLsand domains in the following two steps First MyPage-Keeper matches every URL against a whitelist of popularreputable domains We currently use a whitelist compris-ing the top 70 domains listed by Quantcast excluding do-mains that host user-contributed content (eg OSNs andblogging sites) Any URL that matches this whitelist isdeemed safe and it is not processed further

Second all the URLs that remain are then matchedwith several URL blacklists that list domains and URLsthat have been identified as responsible for spam phish-ing or malware Again the need to minimize classi-fication latency forces us to only use blacklists that wecan download and match against locally Such blacklistsinclude those from Googlersquos Safe Browsing API [17]Malware Patrol [23] PhishTank [26] APWG [1] Spam-Cop [28] joewein [20] and Escrow Fraud [7] Queryingblacklists that are hosted externally such as SURBL [31]URIBL [33] and WOT [34] will introduce significant la-tency and increase MyPageKeeperrsquos latency in detectingsocware thus inflating the window of vulnerability AnyURL that matches any of the blacklists that we use is clas-sified as socware

Using machine learning with social context fea-tures All URLs that do not match the whitelist or anyof the blacklists are evaluated using a Support VectorMachines (SVM) based classifier SVM is widely andsuccessfully used for binary classification in security andother disciplines [49 46] [32] We train our system witha batch of manually labeled data that we gathered overseveral months prior to the launch of MyPageKeeper Forevery input URL and post the classifier outputs a binarydecision to indicate whether it is malicious or not OurSVM classifier uses the following features

Spam keyword score Presence of spam keywords in apost provides a strong indication that the post is spamSome examples of such spam keywords are lsquoFREErsquolsquoHurryrsquo lsquoDealrsquo and lsquoShockedrsquo To compile a list of suchkeywords that are distinctive to socware our intuitionis to identify those keywords that 1) occur frequently insocware posts and 2) appear with a greater frequency insocware as compared to their frequency in benign posts

We compile such a list of keywords by comparinga dataset of manually identified socware posts witha dataset of posts that contain URLs that match ourwhitelist (we discuss how to maintain this list of key-words in Section 7) We transform posts in either datasetto a bag of words with their frequency of occurrenceWe then compute the likelihood ratio p1p2 for eachkeyword where p1 = p(word|socwarepost) and p2 =p(word|benignpost) The likelihood ratio of a key-word indicates the bias of the keyword appearing morein socware than in benign posts In our current imple-mentation of MyPageKeeper we have found that the useof the 6 keywords with the highest likelihood ratio val-ues among the 100 most frequently occurring keywordsin socware is sufficient to accurately detect socware

Thereafter to classify a URL MyPageKeeper searchesall posts that contain the URL for the presence of thesespam keywords and computes a spam keyword score asthe ratio of the number of occurrences of spam keywordsacross these posts to the number of posts

Message similarity If a post is part of a spam cam-paign it usually contains a text message that is similarto the text in other posts containing the same URL (egbecause users propagate the post by simply sharing it)On the other hand when different users share the same

5

popular URL they are likely to include different text de-scriptions in their posts Therefore greater similarity inthe text messages across all posts containing a URL por-tends a higher probability that the URL leads to spam Tocapture this intuition for each URL we compute a mes-sage similarity score that captures the variance in the textmessages across all posts that contain the URL For eachpost MyPageKeeper sums the ASCII values of the char-acters in the text message in the post and then computesthe standard deviation of this sum across all the posts thatcontain the URL If the text descriptions in all posts aresimilar the standard deviation will be low

News feed post and wall post count The more suc-cessful a spam campaign the greater the number of wallsand news feeds in which posts corresponding to the cam-paign will be seen Therefore for each URL MyPage-Keeper computes counts of the number of wall posts andthe number of news feed posts which contained the URL

Like and comment count Facebook users can lsquoLikersquoany post to indicate their interest or approval Users canalso post comments to follow up on the post again indi-cating their interest Users are unlikely to lsquoLikersquo postspointing to socware or comment on such posts sincethey add little value Therefore for every URL My-PageKeeper computes counts of the number of Likes andnumber of comments seen across all posts that containthe URL

URL obfuscation Hackers often try to spread mali-cious links in an obfuscated form eg by shortening itwith a URL shortening service such as bitly or googlWe store a binary feature with every URL that indicateswhether the URL has been shortened or not we maintaina list of URL shorteners

Note that none of the above features by themselves areconclusive evidence of socware and other features couldpotentially further enhance the classifier (eg we can ac-count for spam keywords such as lsquofreersquo included in URLssuch as httpnfljerseyfreecom) However as we showlater in our evaluation the features that we currently con-sider yield high classification accuracy in combination

34 Implementing MyPageKeeperWe provide some details on MyPageKeeperrsquos implemen-tation

Facebook application First we implement the My-PageKeeper Facebook application using FBML [14]We implement our application server using Apache(web server) Django (web framework) and Postgres(database) Once a user installs the MyPageKeeper app inher profile Facebook generates a secret access token andforwards the token to our application server which wethen save in a database This token is used by the crawlerto crawl the walls and news feeds of subscribed users us-ing the Facebook open-graph API If any user deactivates

Data Total distinct URLsMyPageKeeper users 12456 -

Friends of MyPageKeeper users 2370272 -News feed posts 38764575 29522732

Wall posts 1760737 1532055User reports 679 333

Table 3 Summary of MyPageKeeper data

MyPageKeeper from their profile Facebook disables thistoken and notifies our application server whereupon westop crawling that userrsquos wall and news feed

Crawler instances and frequency We run a set ofcrawlers in Amazon EC2 instances to periodically crawlthe walls and news feeds of MyPageKeeperrsquos users Theset of users are partitioned across the crawlers In ourcurrent instantiation we run one crawler process for ev-ery 1000 users Thus as more users subscribe to My-PageKeeper we can easily scale the task of crawling theirwalls and news feeds by instantiating more EC2 instancesfor the task Our Python-based crawlers use the open-graph API incorporating usersrsquo secret access tokens tocrawl posts from Facebook Once the data is received inJSON format the crawlers parse the data and save it in alocal Postgres database

Currently as a tradeoff between timeliness of detectionand resource costs on EC2 we instantiate MyPageKeeperto crawl every userrsquos wall and news feed once every twohours Every couple of hours all of our crawlers start upand each crawler fetches new posts that were previouslynot seen for the users assigned to it Once all crawlerscomplete execution the data from their local databases ismigrated to a central database

Checker instances Checker modules are used to clas-sify every post as socware or benign Every two hoursthe central scheduler forks an appropriate number ofchecker modules determined by the number of new URLscrawled since the last round of checking Thus the iden-tification of socware is also scalable since each checkermodule runs on a subset of the pool of URLs Eachchecker evaluates the URLs it receives as inputmdashusinga combination of whitelists blacklists and a classifiermdashand saves the results in a database We use the libsvm [41]library for SVM based classification Once all checkermodules complete execution notifiers are invoked to no-tify all users who have posts either on their wall or intheir news feed that contain URLs that have been flaggedas socware

4 EvaluationNext we evaluate MyPageKeeper from three perspec-tives First we evaluate the accuracy with which it clas-sifies socware Second we determine the contribution ofMyPageKeeperrsquos social context based classifier in iden-tifying socware compared to the URL blacklists that ituses Lastly we compare MyPageKeeperrsquos efficiency

6

Feature F-ScoreURL obfuscated 0300378

Spam keyword score 0262220 of news feed posts 0173836

Message similarity score 0131733 of Likes 0039895

of wall posts 0019857 of comments 0006367

Table 4 Feature scores used by MyPageKeeperrsquos classifierAlternative source of postsFlagged by blacklist 18923Flagged by onfbme 2102

Content deleted by Facebook 3918Blacklisted app 1290Blacklisted IP 5827

Domain is deleted 247Points to app install 4658

Spamming app 6547Manually verified 14876

True positives 58388 (97)Unknown 1803 (3)

Total 60191

Table 5 Validation of socware flagged by MyPageKeeper clas-sifier

with alternative approaches that would either crawl everyURL or at least resolve short URLs in order to identifysocware

Table 3 summarizes the dataset of Facebook posts onwhich we conduct our evaluation This data is obtainedduring MyPageKeeperrsquos operation over a four month pe-riod from 20th June to 19th October 2011 MyPage-Keeper had over 12K users during this period who hadaround 237M friends in union Our data comprises 387million and 17 million posts that contain URLs fromthe news feeds and walls of these 12K users We con-sider only those posts that contain URLs since MyPage-Keeper currently checks only such posts Overall these40 million posts contained around 30 million uniqueURLs In addition we received 679 reports of socwarefrom 533 distinct MyPageKeeper users during the fourmonth period with 333 distinct URLs across these re-ports Though it is hard to make any general claims withregard to representativeness of our data we find that sev-eral user metrics (eg the male-to-female ratio and thedistribution of users across age groups) closely match thatof the Facebook user base at-large

41 AccuracyAs previously mentioned MyPageKeeper first matchesevery URL to be checked against a whitelist If no matchis found it checks the URL with a set of locally queriableURL blacklists Finally MyPageKeeper applies its socialcontext based classifier learned using the SVM modelIn this process we assume URL information providedby whitelists and blacklists to be ground truth ie clas-sification provided by them need not be independentlyvalidated Therefore we focus here on validating the

App name Description of postsSendible Social Media Manage-

ment6687

iRazoo Search amp win 18534Loot 4Loot lets you win all

sorts of Loot whilesearching the web

1891

Table 6 Top three spamming applications in our dataset

socware flagged by MyPageKeeperrsquos classifier based onsocial context features

We trained MyPageKeeperrsquos classifier using a manu-ally verified dataset of URLs that contain 2500 positivesamples and 5000 negative samples of socware posts wegathered these samples over several months while devel-oping MyPageKeeper Table 4 shows the importance ofthe various features in the SVM classifier learned Dur-ing the course of MyPageKeeperrsquos operation over fourmonths we applied the classifier to check 753516 uniqueURLs these are URLs that do not match the whitelist orany of the blacklists Of these URLs the classifier iden-tified 4972 URLs seen across 60191 posts as instancesof socware

It is important to note that when MyPageKeeper seesa URL in multiple posts over time the values of the fea-tures associated with the URL may change every timeit appears eg the message similarity score associatedwith the URL can change However once MyPage-Keeper classifies a URL as socware during any of its oc-currences it flags all previously seen posts that containthe URL and notifies the corresponding users Thereforein evaluating MyPageKeeperrsquos classifier URL blacklistsor MyPageKeeper as a whole we consider here that atechnique classified a particular URL as socware if thatURL was flagged by that technique upon any of theURLrsquos occurrences Correspondingly we consider a URLto have not been classified as socware if it was not iden-tified as such during any of its occurrences

Checking the validity of socware identified by My-PageKeeperrsquos classifier is not straightforward since thereis no ground truth for what represents socware and whatdoes not However here we attempt to evaluate the pos-itive samples of socware identified by MyPageKeeperrsquosclassifier using a combination of a host of complemen-tary techniques (we later discuss in Section 7 the valida-tion of posts that are deemed safe by MyPageKeeper) Todo so we use an instrumented Firefox browser to crawlthe 4972 URLs flagged by MyPageKeeper at the end ofthe four month period of MyPageKeeper operation Forevery URL that we crawl we record the landing URLthe IP address and other whois information of the land-ing domain and contents of the landing page To verifythe reputation of every URL we then apply several tech-niques in the order summarized in Table 5bull Blacklisted URLs First we check if any of the URLs

or the corresponding landing URLs are found in any

7

URL blacklists Note that though we use blacklistsin the operation of MyPageKeeper itself we use onlythose that can be stored and queried locally There-fore here we use for validation other external black-lists for which we have to issue remote queries Fur-ther even for blacklists used in MyPageKeeper theymay not identify some instances of socware when theyinitially appear because blacklists have been found tolag in keeping up with the viral propagation of spamon OSNs [44] Hence we check if a URL identified assocware by MyPageKeeperrsquos classifier appeared in anyof the blacklists used by MyPageKeeper at a later pointin time even though it did not appear in any of thoseblacklists initially when MyPageKeeper spotted postscontaining that URL

bull Flagged by fbme URL shortener Many URLs postedon Facebook are shortened using Facebookrsquos URLshortener fbme When Facebook determines any linkshortened using their service to be unsafe the corre-sponding shortened URL thereafter redirects to Face-bookrsquos home pagemdashfacebookcomhomephpmdashinstead of the actual landing page Of the URLs flaggedby MyPageKeeperrsquos classifier we check if those short-ened using Facebookrsquos URL shortening service redirectto Facebookrsquos home page

bull Content deleted from Facebook If Facebook deter-mines any URL hosted under the facebookcom do-main to be unsafe (eg the page for a spamming Face-book application) it thereafter redirects that URL tofacebookcom4oh4php We use this as anothersource of information to validate URLs flagged by My-PageKeeperrsquos classifier

bull Blacklisted apps If the URLs in posts made by a Face-book app are flagged due to any of the above reasonswe consider that app to be malicious and declare allother URLs posted by it as unsafe thus helping val-idate some of the URLs declared as socware by My-PageKeeperrsquos classifier

bull Blacklisted IPs For every URL flagged by any of theabove techniques we record the IP address when thatURL is crawled and blacklist that IP Of the URLsflagged by MyPageKeeperrsquos classifier we then con-sider those that lead to one of these blacklisted IP ad-dresses as correctly classified

bull Domain deleted Malicious domains are often deletedonce they are caught serving malicious content There-fore we deem MyPageKeeperrsquos positive classificationof a URL to be correct if the domain for that URL nolonger exists when we attempt to crawl it

bull Obfuscation of app installation page Posts made byFacebook applications to attract users to install themtypically include an un-shortened URL pointing to aFacebook page that contains information about the ap-

Source () of URLs () of posts Overlap withclassifier (of URLs)

Google SBA2 221 (68) 378 (04) 0Phishtank 12 (04) 435 (05) 1Malware Norm 69 (21) 154 (02) 0Joewein 240 (74) 652 (07) 11APWG 56 (17) 569 (06) 0Spamcop 232 (71) 921 (10) 0All blacklists 830 (256) 3104 (34) 12MyPageKeeperclassifier

2405 (744) 89389 (966)

Table 7 Comparison of contribution made by blacklists andclassifier to MyPageKeeperrsquos identification of socware duringthe four month period of operation

plication Once a user visits this page she can readthe applicationrsquos description and then click on a link onthis page if she decides to install it However postsfrom some surreptitious applications contained short-ened URLs that directly take the user to a page wherethey request the user to grant permissions (eg to poston the userrsquos wall) and install the application We havefound all instances of such applications to be spammingapplications Therefore if any of the URLs flagged byMyPageKeeperrsquos classifier is a shortened URL that di-rectly points to the installation page for a Facebook appwe declare that classification correct

bull Spamming app From our dataset we manually identi-fied several Facebook applications that try to spread onFacebook by promising free money to users and makeposts that point to the application page Once installedby a user such applications periodically post on theuserrsquos wall (without requesting the userrsquos authorizationfor each post) in an attempt to further propagate by at-tracting that userrsquos friends Table 6 shows some suchapplications that frequently appear in our dataset AnyURLs classified as socware by MyPageKeeperrsquos classi-fier that happen to be posted by one of these manuallyidentified spamming apps are deemed correct

bull Manual analysis Finally over the operation of My-PageKeeper during the four months we periodicallyverified a subset of URLs flagged by the classifierThese provide an additional source of validation

In all the union of the above techniques validatesthat 58388 out of 60191 posts declared as socware bythe MyPageKeeper classifier are indeed so Therefore97 of the socware identified by MyPageKeeperrsquos clas-sifier are true positives On the other hand the 1803posts incorrectly classified as socware constitute less than0005 of the over 40 million posts in our dataset Notethat though all of the above techniques could be foldedinto MyPageKeeper itself to help identify socware wedo not do so because all of these techniques require us tocrawl a URL in order to evaluate it we cannot afford thelatency of crawling

8

0

10

20

30

40

50

MPK

Resolver

Craw

ler

Th

rou

gh

pu

t (U

RL

sec

)

(a) Throughput

10-2

10-1

100

101

102

MPK

Resolver

Craw

ler

Lat

ency

(se

c)

(b) Latency

Figure 2 Comparison of MyPageKeeperrsquos throughput and la-tency in classifying URLs with a short URL resolver and acrawler-based approach The height of the box shows the me-dian with the whiskers representing 5th and 95th percentiles

42 Comparison with blacklistsThough we see that the identification of socware by My-PageKeeperrsquos classifier is accurate the next logical ques-tion is what is the classifierrsquos contribution to MyPage-Keeper in comparison with URL blacklists Table 7 pro-vides a breakdown of the URLs and posts classified assocware by MyPageKeeper during the four month periodunder consideration There are two main takeaways fromthis table First we see that the classifier finds 744 ofsocware URLs and 966 of socware posts identified byMyPageKeeper Thus the classifier accounts for a largemajority of socware identified by MyPageKeeper and isthus critical to the systemrsquos operation Second there isvery little overlap between the URLs flagged by black-lists and those flagged by the classifier The typically lowfrequency of occurrence of URLs that match blacklistsis another reason that the classifierrsquos share of identifiedsocware posts is significantly greater than its correspond-ing share of flagged URLs

43 EfficiencyBeyond accuracy it is critical that MyPageKeeperrsquos iden-tification of socware be efficient so as to minimize thecosts that we need to bear in order to keep the delay inidentifying socware and alerting users low The match-ing of a URL against a whitelist or a local set of blacklistsincurs minimal computational overhead In addition wefind that execution of the classifier also imposes minimaldelay per URL verified

To demonstrate the efficiency of MyPageKeeper wecompare the rate at which it classifies URLs with the clas-sification throughput that two other alternative classes ofapproaches would be able to sustain Our first point ofcomparison is an approach that relies only on locally que-riable URL whitelists and blacklists but resolves all short-ened URLs into the corresponding complete URL Oursecond alternative crawls URLs to evaluate them eg us-ing the content on the page or the IP address of the targetwebsite Figure 2(a) compares the throughput of classi-fying URLs with the three approaches using data from

40

50

60

70

80

90

100

100

101

102

103

o

f users

of email notifications sent

Figure 3 49 of MyPageKeeperrsquos 12456 users were notifiedof socware at least once in four months of MyPageKeeperrsquos op-eration

0

5

10

15

20

25

30

35

40

101

102

103

104

Mean

of

noti

ficati

ons

of friends

Figure 4 Correlation between vulnerability and social degreeof exposed users

two weeks of MyPageKeeperrsquos execution We see thatthe throughput with MyPageKeeper is almost an order ofmagnitude greater than the alternatives with all three ap-proaches using the same set of resources on EC2 As wesee in Figure 2(b) MyPageKeeperrsquos better performancestems from its lower execution latency to check an URLthe median classification latency with MyPageKeeper is48 ms compared to a median of 426 ms when resolvingshort URLs and 19 seconds when crawling URLs Thuswe are able to significantly reduce MyPageKeeperrsquos clas-sification latency compared to approaches that need toresolve short URLs or crawl target web pages by keep-ing all of its computation local

Furthermore a crawler-based approach will be signif-icantly more expensive than MyPageKeeper Thomaset al [54] found that crawler-based classification of 15million URLs per day using cloud infrastructure resultsin an expense of $800day Therefore we estimate thatit would cost approximately $15 millionyear to handleFacebookrsquos workload 1 million URLs are shared every20 minutes on Facebook [35] Since MyPageKeeperrsquosclassification latency is 40 times less than a crawler-basedapproach we estimate that the expense incurred with My-PageKeeper would be at least 40 times lower than a sys-tem that classifies URLs by crawling them

5 Analysis of SocwareThus far we described how MyPageKeeper detectssocware efficiently at scale In this section we analyzethe socware that we have found during MyPageKeeperrsquosoperation to throw light on characteristics of socware onFacebook

9

0

500

1000

1500

2000

2500

3000

3500

061

8

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

o

f noti

ficati

ons

Figure 5 No of socware notifications per day On 11th July19th Sep and 3rd Oct socware was observed in large scale

0

20

40

60

80

100

100

101

102

103

o

f li

nks

Active duration (days)

Figure 6 Active-time of socware links 20 of socware linkswere observed more than 10 days apart

51 Prevalence of socware49 of MyPageKeeperrsquos users were exposed tosocware within four months First we analyze theprevalence of socware on Facebook To do so we de-fine that a user was exposed to a particular socware postif that post appeared in her wall or news feed As shownin Figure 3 49 of MyPageKeeperrsquos users were exposedto at least one socware post during the four month pe-riod we consider here Though this already indicates thewide reach of socware on Facebook we stress that 49is only a lower-bound due to a couple of reasons Firstmany of MyPageKeeperrsquos users subscribed to our appli-cation at some time in the midst of the four month periodand therefore we miss socware that they were potentiallyexposed to prior to them subscribing to MyPageKeeperSecond Facebook itself detects and removes posts thatit considers as spam or pointing to malware [36 38 52]All the socware detected by MyPageKeeper is after suchfiltering by Facebook

Given that some users are exposed to more socwarethan others we analyze if the social degree of a user hasany impact on the probability of a user being exposed tosocware Figure 4 shows the number of socware notifi-cations received by MyPageKeeper users as a function ofthe number of friends they have on Facebook We binusers with the number of friends within 10 of each otherand plot the average number of notifications per bin weconsider here only those users who were subscribed toMyPageKeeper for at least three months We see that theprobability of users being exposed to socware is largelyindependent of their social degree This indicates thatwhether a user is more likely to be exposed to socwareis not simply a function of how many friends she has but

Shortening service of socware URLsbitly 219

tinyurlcom 188googl 51

tco 316tinycc 16owly 11

onfbme 10isgd 07jmp 04

0rzcom 03All shortened URLs 54

Table 8 Top URL shortening services in our socware dataset

Domain Name of URLs of postsfacebookcom 207 263blogspotcom 63 87miessassinfo 19 32shurulburultk 18 12tomodayinfo 08 013

Table 9 Top two-level domains in our socware dataset

likely depends on the susceptibility of those friends to be-coming victims of scams and helping propagate them

We also find that socware on Facebook is prevalentover time Figure 5 shows the number of socware no-tifications sent per day by MyPageKeeper to its usersWe see a consistently large number of notifications go-ing out daily with noticeable spikes on a few days On11th July 2011 a scam that conned users to completesurveys with the pretext of fake free products went vi-ral and posts pointing to the scam appeared 4056 timeson the walls and news feeds of MyPageKeeperrsquos usersTwo other scams that promised lsquoFree Facebook shoesrsquoand conned users to fill out surveys also caused MyPage-Keeper to send out a large number of notifications on thatday On 19th Sep 2011 different variants of the lsquoFace-book Free T-Shirtrsquo scam [9] were spreading on Facebookand was spotted 2040 times by MyPageKeeper On 3rd

Oct 2011 a video scam was spreading on Facebook andMyPageKeeper observed it in 1739 posts

We next analyze the prevalence and impact of socwarefrom the perspective of individual socware links Foreach link we define its ldquoactive-timerdquo as the differencebetween the first and last times of its occurrence in ourdataset Figure 6 shows that we did not see 60 ofsocware links beyond one day Subsequent posts con-taining these links may have been filtered by Facebookonce it recognized their spammy or malicious nature orour dataset may miss those posts due to MyPageKeeperrsquoslimited view into Facebookrsquos 850 million users Fur-ther we do not attempt any clustering of links into cam-paigns here However even with these caveats 20 ofsocware links were seen in multiple posts separated byat least 10 days suggesting that a significant fractionof socware eludes Facebookrsquos detection mechanisms andlasts on Facebook for significant durations

10

0 5

10 15 20 25 30 35 40 45

Disabled

App inst

Deleted

LaaSEvent

App prof

o

f U

RL

s

Figure 7 Breakdown of socware links when crawledin Nov 2011 that originally point to web pages in thefacebookcom domain

52 Domain name characteristics20 of socware links are hosted inside FacebookIn the next section of our analysis we focus on thedomain-level characteristics of socware links First Ta-ble 8 shows the top ten URL shortening services usedin socware links observed by MyPageKeeper In allshortened URLs account for 54 of socware links in ourdataset Our design of MyPageKeeperrsquos classifier to relysolely on social context and to not resolve short URLshence makes a significant difference (as previously seenin the comparison of classification latency)

Further we find it surprising that a large fractionof socware links (46) are not shortened given thatshortening of URLs enables spammers to obfuscatethem On further investigation we find that many Face-book scams such as lsquofree iPhonersquo and lsquofree NFL jer-seyrsquo use domain names that clearly state the messageof the scam eg httpiphonefree5com and httpnfljerseyfreecom These URLs are more likely to elicithigher click-through rates compared to shortened URLsOn the other hand most of the shortened URLs were usedby malicious or spam applications (eg lsquoThe Apprsquo lsquoPro-file Stalkerrsquo) that generate shortened URLs pointing totheir applicationrsquos installation page We find that 89of shortened URLs in our dataset of socware links wereposted by Facebook applications

Next based on our crawl of the socware links in ourdataset we inspect the top two-level domains found onthe landing pages pointed to by these links First asshown in Table 9 we find that a large fraction of socware(over 20 of URLs and 26 of posts) is hosted on Face-book itself Second a sizeable fraction of socware usessites such as blogspotcom and wordpresscomthat enable the spammers to easily create a large numberof URLs without going through the hassle of registeringnew domains Further all of these domains are of goodrepute and are unlikely to be flagged by traditional web-site blacklists

53 Analysis of socware hosted in FacebookHackers use numerous channels in Facebook tospread socware Given the large fraction of socware

0

20

40

60

80

100

0 20 40 60 80 100

o

f U

RL

s

of clicks originated from Facebook

Figure 8 For most socware links shortened with bitly orgoogl a large majority of the clicks came from Facebook

hosted on Facebook itself we next analyze this subsetof socware First in early November 2011 we crawledevery socware link in our dataset that had pointed to alanding page in the facebookcom domain at the timewhen MyPageKeeper had initially classified that link assocware Figure 7 presents a breakdown of the resultsof this crawl If Facebook disables a URL it redirectsus to facebookcomhomephp Similarly if crawl-ing a URL points us to facebookcom4oh4 it im-plies that Facebook has deleted the content at that URLTherefore as seen in Figure 7 a large fraction of socwarelinks that were originally pointing to Facebook have nowbeen deactivated However we also see that a significantfraction of these linksmdashover 40mdashwere still live Fur-ther the figure shows that spammers use several differentchannels such as applications events and pages to prop-agate their scams on Facebook In the figure lsquoApp instrsquoand lsquoApp profrsquo refer to the installation and profile pagesof Facebook applications and lsquoLaaSrsquo refers to campaignsintended to increase the number of Likes on a Facebookpage (described in detail in Section 6)

In our dataset we see 257 distinct socware links short-ened with the bitly and googlURL shorteners thatpoint to landing pages in the facebookcom domainUsing the APIs [5 16] offered by these URL shorten-ing services we computed the number of clicks recordedfor these 257 links in two casesmdash1) where the Refer-rer was Facebook and 2) where the Referrer was anyother domain Figure 8 shows that Facebook is the dom-inant platform from which most of these links receivedmost of their clicks 80 of links received over 70 oftheir clicks from Facebook This seems to indicate thatmost socware hosted on Facebook is propagated solelyon Facebook and tailored for that platform

54 Comparison of socware to email spamSocware keywords exhibit little (10) overlap withspam email keywords As we saw earlier in Section 4spam keyword score is a key feature in MyPageKeeperrsquosclassifier Therefore in the final section of our analysiswe investigate the overlap in lsquospam keywordsrsquo that weobserve in socware on Facebook with those seen in an-other medium targeted by spammers specifically email

11

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

tions post catchy messages (eg ldquoCheck who deletedyou from your profilerdquo) on the walls of users with alink pointing to the installation page of the applicationTable 1 lists three such socware-spreading applicationsin our data Users are conned into installing the ap-plication to their profile and granting several permis-sions to it The application then not only gets accessto that userrsquos personal information (such as email ad-dress home town and high school) but also gains theability to post on the victimrsquos wall As before posts ona userrsquos wall typically appear on the news feeds of theuserrsquos friends and the propagation cycle repeats Cre-ating such applications has become easy with ready touse toolkits starting at $25 [18]

bull Malicious Facebook events Sometimes hackers cre-ate Facebook events that contain a malicious link Onesuch event is the lsquoGet a free Facebook T-Shirt (Spon-sored by Reebok)rsquo scam This event page states that500000 users will get a free T-shirt from Facebook Tobe one among those 500000 users a user must attendthe event invite her friends to join and enter her ship-ping address

bull Malicious Facebook pages Another approach takenby hackers to spread socware is to create a Facebookpage and post spam links on the page [27] We alsoidentified a trend in aggressive marketing by compa-nies that force users to click ldquoLikerdquo on their Facebookpages to spread their pages as well as increase the rep-utation of the page Table 2 lists the top five such Face-book pages along with the message on the page andthe number of Likes received by these pages

3 MyPageKeeper ArchitectureTo identify socware and protect Facebook users from itwe develop MyPageKeeper MyPageKeeper is a Face-book application that continually checks the walls andnews feeds of subscribed users identifies socware postsand alerts the users We present our goals in designingMyPageKeeper and then describe the system architec-ture and implementation details

31 GoalsWe design MyPageKeeper with the following three pri-mary goals in mind

1 Accuracy Our foremost goal is to ensure accurateidentification of socware We are faced with the obvi-ous trade-off between missing malware posts (false neg-atives) and ldquocrying wolfrdquo too often (false positives) Al-though one could argue that minimizing false negatives ismore important users would abandon overly sensitive de-tection methods as recognized by the developers of Face-bookrsquos Immune System [52]

2 Scalability Our end goal is to have MyPageKeeperprovide protection from socware for all users on Face-

book not just for a select few Therefore the system mustbe scalable to easily handle increased load imposed by agrowth in the number of subscribed users

3 Efficiency Finally we seek to minimize our costsin operating MyPageKeeper The period between whena post first becomes visible to a user and the time it ischecked by MyPageKeeper represents the window of vul-nerability when the user is exposed to potential socwareTo minimize the resources necessary to keep this win-dow of vulnerability short MyPageKeeperrsquos techniquesfor classification of posts must be efficient

32 MyPageKeeper componentsMyPageKeeper consists of six functional modules

a User authorization module We obtain a userrsquosauthorization to check her wall and news feed through aFacebook application which we have developed Oncea user installs the MyPageKeeper application we obtainthe necessary credentials to access the posts of that userFor alerting the user we also request permission to accessthe userrsquos email address and to post on the userrsquos walland news feed Figure 1(a) shows how a Facebook userauthorizes an application

b Crawling module MyPageKeeper periodicallycollects the posts in every userrsquos wall and news feed Asmentioned previously we currently focus only on poststhat contain a URL Apart from the URL each post com-prises several other pieces of information such as a textmessage associated with the post the user who made thepost number of comments and Likes on the post and thetime when the post was created

c Feature extraction module To classify a post My-PageKeeper evaluates every embedded URL in the postOur key novelty lies in considering only the social con-text (eg the text message in the post and the numberof Likes on it) for the classification of the URL and therelated post Furthermore we use the fact that we are ob-serving more than one user which can help us detect anepidemic spread We discuss these features in more detaillater in Section 33

d Classification module The classification moduleuses a Machine Learning classifier based on Support Vec-tor Machines but also utilizes several local and externalwhitelists and blacklists that help speed up the processand increase the overall accuracy The classification mod-ule receives a URL and the related social context featuresextracted in the previous step Since the classification isour key contribution we discuss this in more detail inSection 33 If a URL is classified as socware all postscontaining the URL are labeled as such

e Notification module The notification module no-tifies all users who have socware posts in their wall ornews feed The user can currently specify the notificationmechanism which can be a combination of emailing the

4

User

Facebook Servers

Application Server

4 Generate and share

access token

1 App add request

2 Return permission setrequired by the app

3 Allow permission set

Crawler DB

Facebook DB

WLFeatureExtractor

URL

benign unknown

No SVM Classifier

benign Unknown

Malicious

Notification

BLNo

Yes Yes

Crawling Facebook posts Classifying Facebook posts

(a) (b)Figure 1 (a) Application installation process on Facebook and (b) architecture of MyPageKeeper

user or posting a comment on the suspect posts In thefuture we will consider allowing our system to removethe malicious post automatically but this can create lia-bilities in the case of false positives

f User feedback module Finally to improve My-PageKeeperrsquos ability to detect socware we leverage ouruser community We allow users to report suspiciousposts through a convenient user-interface In such a re-port the user can optionally describe the reason why sheconsiders the post as socware

33 Identification of socwareThe key novelty of MyPageKeeper lies in the classifica-tion module (summarized in Figure 1(b)) As describedearlier the input to the classification module is a URLand the related social context features extracted from theposts that contain the URL Our classification algorithmoperates in two phases with the expectation that URLsand related posts that make it through either phase with-out a match are likely benign and are treated as such

Using whitelists and blacklists To improve the effi-ciency and accuracy of our classifier we use lists of URLsand domains in the following two steps First MyPage-Keeper matches every URL against a whitelist of popularreputable domains We currently use a whitelist compris-ing the top 70 domains listed by Quantcast excluding do-mains that host user-contributed content (eg OSNs andblogging sites) Any URL that matches this whitelist isdeemed safe and it is not processed further

Second all the URLs that remain are then matchedwith several URL blacklists that list domains and URLsthat have been identified as responsible for spam phish-ing or malware Again the need to minimize classi-fication latency forces us to only use blacklists that wecan download and match against locally Such blacklistsinclude those from Googlersquos Safe Browsing API [17]Malware Patrol [23] PhishTank [26] APWG [1] Spam-Cop [28] joewein [20] and Escrow Fraud [7] Queryingblacklists that are hosted externally such as SURBL [31]URIBL [33] and WOT [34] will introduce significant la-tency and increase MyPageKeeperrsquos latency in detectingsocware thus inflating the window of vulnerability AnyURL that matches any of the blacklists that we use is clas-sified as socware

Using machine learning with social context fea-tures All URLs that do not match the whitelist or anyof the blacklists are evaluated using a Support VectorMachines (SVM) based classifier SVM is widely andsuccessfully used for binary classification in security andother disciplines [49 46] [32] We train our system witha batch of manually labeled data that we gathered overseveral months prior to the launch of MyPageKeeper Forevery input URL and post the classifier outputs a binarydecision to indicate whether it is malicious or not OurSVM classifier uses the following features

Spam keyword score Presence of spam keywords in apost provides a strong indication that the post is spamSome examples of such spam keywords are lsquoFREErsquolsquoHurryrsquo lsquoDealrsquo and lsquoShockedrsquo To compile a list of suchkeywords that are distinctive to socware our intuitionis to identify those keywords that 1) occur frequently insocware posts and 2) appear with a greater frequency insocware as compared to their frequency in benign posts

We compile such a list of keywords by comparinga dataset of manually identified socware posts witha dataset of posts that contain URLs that match ourwhitelist (we discuss how to maintain this list of key-words in Section 7) We transform posts in either datasetto a bag of words with their frequency of occurrenceWe then compute the likelihood ratio p1p2 for eachkeyword where p1 = p(word|socwarepost) and p2 =p(word|benignpost) The likelihood ratio of a key-word indicates the bias of the keyword appearing morein socware than in benign posts In our current imple-mentation of MyPageKeeper we have found that the useof the 6 keywords with the highest likelihood ratio val-ues among the 100 most frequently occurring keywordsin socware is sufficient to accurately detect socware

Thereafter to classify a URL MyPageKeeper searchesall posts that contain the URL for the presence of thesespam keywords and computes a spam keyword score asthe ratio of the number of occurrences of spam keywordsacross these posts to the number of posts

Message similarity If a post is part of a spam cam-paign it usually contains a text message that is similarto the text in other posts containing the same URL (egbecause users propagate the post by simply sharing it)On the other hand when different users share the same

5

popular URL they are likely to include different text de-scriptions in their posts Therefore greater similarity inthe text messages across all posts containing a URL por-tends a higher probability that the URL leads to spam Tocapture this intuition for each URL we compute a mes-sage similarity score that captures the variance in the textmessages across all posts that contain the URL For eachpost MyPageKeeper sums the ASCII values of the char-acters in the text message in the post and then computesthe standard deviation of this sum across all the posts thatcontain the URL If the text descriptions in all posts aresimilar the standard deviation will be low

News feed post and wall post count The more suc-cessful a spam campaign the greater the number of wallsand news feeds in which posts corresponding to the cam-paign will be seen Therefore for each URL MyPage-Keeper computes counts of the number of wall posts andthe number of news feed posts which contained the URL

Like and comment count Facebook users can lsquoLikersquoany post to indicate their interest or approval Users canalso post comments to follow up on the post again indi-cating their interest Users are unlikely to lsquoLikersquo postspointing to socware or comment on such posts sincethey add little value Therefore for every URL My-PageKeeper computes counts of the number of Likes andnumber of comments seen across all posts that containthe URL

URL obfuscation Hackers often try to spread mali-cious links in an obfuscated form eg by shortening itwith a URL shortening service such as bitly or googlWe store a binary feature with every URL that indicateswhether the URL has been shortened or not we maintaina list of URL shorteners

Note that none of the above features by themselves areconclusive evidence of socware and other features couldpotentially further enhance the classifier (eg we can ac-count for spam keywords such as lsquofreersquo included in URLssuch as httpnfljerseyfreecom) However as we showlater in our evaluation the features that we currently con-sider yield high classification accuracy in combination

34 Implementing MyPageKeeperWe provide some details on MyPageKeeperrsquos implemen-tation

Facebook application First we implement the My-PageKeeper Facebook application using FBML [14]We implement our application server using Apache(web server) Django (web framework) and Postgres(database) Once a user installs the MyPageKeeper app inher profile Facebook generates a secret access token andforwards the token to our application server which wethen save in a database This token is used by the crawlerto crawl the walls and news feeds of subscribed users us-ing the Facebook open-graph API If any user deactivates

Data Total distinct URLsMyPageKeeper users 12456 -

Friends of MyPageKeeper users 2370272 -News feed posts 38764575 29522732

Wall posts 1760737 1532055User reports 679 333

Table 3 Summary of MyPageKeeper data

MyPageKeeper from their profile Facebook disables thistoken and notifies our application server whereupon westop crawling that userrsquos wall and news feed

Crawler instances and frequency We run a set ofcrawlers in Amazon EC2 instances to periodically crawlthe walls and news feeds of MyPageKeeperrsquos users Theset of users are partitioned across the crawlers In ourcurrent instantiation we run one crawler process for ev-ery 1000 users Thus as more users subscribe to My-PageKeeper we can easily scale the task of crawling theirwalls and news feeds by instantiating more EC2 instancesfor the task Our Python-based crawlers use the open-graph API incorporating usersrsquo secret access tokens tocrawl posts from Facebook Once the data is received inJSON format the crawlers parse the data and save it in alocal Postgres database

Currently as a tradeoff between timeliness of detectionand resource costs on EC2 we instantiate MyPageKeeperto crawl every userrsquos wall and news feed once every twohours Every couple of hours all of our crawlers start upand each crawler fetches new posts that were previouslynot seen for the users assigned to it Once all crawlerscomplete execution the data from their local databases ismigrated to a central database

Checker instances Checker modules are used to clas-sify every post as socware or benign Every two hoursthe central scheduler forks an appropriate number ofchecker modules determined by the number of new URLscrawled since the last round of checking Thus the iden-tification of socware is also scalable since each checkermodule runs on a subset of the pool of URLs Eachchecker evaluates the URLs it receives as inputmdashusinga combination of whitelists blacklists and a classifiermdashand saves the results in a database We use the libsvm [41]library for SVM based classification Once all checkermodules complete execution notifiers are invoked to no-tify all users who have posts either on their wall or intheir news feed that contain URLs that have been flaggedas socware

4 EvaluationNext we evaluate MyPageKeeper from three perspec-tives First we evaluate the accuracy with which it clas-sifies socware Second we determine the contribution ofMyPageKeeperrsquos social context based classifier in iden-tifying socware compared to the URL blacklists that ituses Lastly we compare MyPageKeeperrsquos efficiency

6

Feature F-ScoreURL obfuscated 0300378

Spam keyword score 0262220 of news feed posts 0173836

Message similarity score 0131733 of Likes 0039895

of wall posts 0019857 of comments 0006367

Table 4 Feature scores used by MyPageKeeperrsquos classifierAlternative source of postsFlagged by blacklist 18923Flagged by onfbme 2102

Content deleted by Facebook 3918Blacklisted app 1290Blacklisted IP 5827

Domain is deleted 247Points to app install 4658

Spamming app 6547Manually verified 14876

True positives 58388 (97)Unknown 1803 (3)

Total 60191

Table 5 Validation of socware flagged by MyPageKeeper clas-sifier

with alternative approaches that would either crawl everyURL or at least resolve short URLs in order to identifysocware

Table 3 summarizes the dataset of Facebook posts onwhich we conduct our evaluation This data is obtainedduring MyPageKeeperrsquos operation over a four month pe-riod from 20th June to 19th October 2011 MyPage-Keeper had over 12K users during this period who hadaround 237M friends in union Our data comprises 387million and 17 million posts that contain URLs fromthe news feeds and walls of these 12K users We con-sider only those posts that contain URLs since MyPage-Keeper currently checks only such posts Overall these40 million posts contained around 30 million uniqueURLs In addition we received 679 reports of socwarefrom 533 distinct MyPageKeeper users during the fourmonth period with 333 distinct URLs across these re-ports Though it is hard to make any general claims withregard to representativeness of our data we find that sev-eral user metrics (eg the male-to-female ratio and thedistribution of users across age groups) closely match thatof the Facebook user base at-large

41 AccuracyAs previously mentioned MyPageKeeper first matchesevery URL to be checked against a whitelist If no matchis found it checks the URL with a set of locally queriableURL blacklists Finally MyPageKeeper applies its socialcontext based classifier learned using the SVM modelIn this process we assume URL information providedby whitelists and blacklists to be ground truth ie clas-sification provided by them need not be independentlyvalidated Therefore we focus here on validating the

App name Description of postsSendible Social Media Manage-

ment6687

iRazoo Search amp win 18534Loot 4Loot lets you win all

sorts of Loot whilesearching the web

1891

Table 6 Top three spamming applications in our dataset

socware flagged by MyPageKeeperrsquos classifier based onsocial context features

We trained MyPageKeeperrsquos classifier using a manu-ally verified dataset of URLs that contain 2500 positivesamples and 5000 negative samples of socware posts wegathered these samples over several months while devel-oping MyPageKeeper Table 4 shows the importance ofthe various features in the SVM classifier learned Dur-ing the course of MyPageKeeperrsquos operation over fourmonths we applied the classifier to check 753516 uniqueURLs these are URLs that do not match the whitelist orany of the blacklists Of these URLs the classifier iden-tified 4972 URLs seen across 60191 posts as instancesof socware

It is important to note that when MyPageKeeper seesa URL in multiple posts over time the values of the fea-tures associated with the URL may change every timeit appears eg the message similarity score associatedwith the URL can change However once MyPage-Keeper classifies a URL as socware during any of its oc-currences it flags all previously seen posts that containthe URL and notifies the corresponding users Thereforein evaluating MyPageKeeperrsquos classifier URL blacklistsor MyPageKeeper as a whole we consider here that atechnique classified a particular URL as socware if thatURL was flagged by that technique upon any of theURLrsquos occurrences Correspondingly we consider a URLto have not been classified as socware if it was not iden-tified as such during any of its occurrences

Checking the validity of socware identified by My-PageKeeperrsquos classifier is not straightforward since thereis no ground truth for what represents socware and whatdoes not However here we attempt to evaluate the pos-itive samples of socware identified by MyPageKeeperrsquosclassifier using a combination of a host of complemen-tary techniques (we later discuss in Section 7 the valida-tion of posts that are deemed safe by MyPageKeeper) Todo so we use an instrumented Firefox browser to crawlthe 4972 URLs flagged by MyPageKeeper at the end ofthe four month period of MyPageKeeper operation Forevery URL that we crawl we record the landing URLthe IP address and other whois information of the land-ing domain and contents of the landing page To verifythe reputation of every URL we then apply several tech-niques in the order summarized in Table 5bull Blacklisted URLs First we check if any of the URLs

or the corresponding landing URLs are found in any

7

URL blacklists Note that though we use blacklistsin the operation of MyPageKeeper itself we use onlythose that can be stored and queried locally There-fore here we use for validation other external black-lists for which we have to issue remote queries Fur-ther even for blacklists used in MyPageKeeper theymay not identify some instances of socware when theyinitially appear because blacklists have been found tolag in keeping up with the viral propagation of spamon OSNs [44] Hence we check if a URL identified assocware by MyPageKeeperrsquos classifier appeared in anyof the blacklists used by MyPageKeeper at a later pointin time even though it did not appear in any of thoseblacklists initially when MyPageKeeper spotted postscontaining that URL

bull Flagged by fbme URL shortener Many URLs postedon Facebook are shortened using Facebookrsquos URLshortener fbme When Facebook determines any linkshortened using their service to be unsafe the corre-sponding shortened URL thereafter redirects to Face-bookrsquos home pagemdashfacebookcomhomephpmdashinstead of the actual landing page Of the URLs flaggedby MyPageKeeperrsquos classifier we check if those short-ened using Facebookrsquos URL shortening service redirectto Facebookrsquos home page

bull Content deleted from Facebook If Facebook deter-mines any URL hosted under the facebookcom do-main to be unsafe (eg the page for a spamming Face-book application) it thereafter redirects that URL tofacebookcom4oh4php We use this as anothersource of information to validate URLs flagged by My-PageKeeperrsquos classifier

bull Blacklisted apps If the URLs in posts made by a Face-book app are flagged due to any of the above reasonswe consider that app to be malicious and declare allother URLs posted by it as unsafe thus helping val-idate some of the URLs declared as socware by My-PageKeeperrsquos classifier

bull Blacklisted IPs For every URL flagged by any of theabove techniques we record the IP address when thatURL is crawled and blacklist that IP Of the URLsflagged by MyPageKeeperrsquos classifier we then con-sider those that lead to one of these blacklisted IP ad-dresses as correctly classified

bull Domain deleted Malicious domains are often deletedonce they are caught serving malicious content There-fore we deem MyPageKeeperrsquos positive classificationof a URL to be correct if the domain for that URL nolonger exists when we attempt to crawl it

bull Obfuscation of app installation page Posts made byFacebook applications to attract users to install themtypically include an un-shortened URL pointing to aFacebook page that contains information about the ap-

Source () of URLs () of posts Overlap withclassifier (of URLs)

Google SBA2 221 (68) 378 (04) 0Phishtank 12 (04) 435 (05) 1Malware Norm 69 (21) 154 (02) 0Joewein 240 (74) 652 (07) 11APWG 56 (17) 569 (06) 0Spamcop 232 (71) 921 (10) 0All blacklists 830 (256) 3104 (34) 12MyPageKeeperclassifier

2405 (744) 89389 (966)

Table 7 Comparison of contribution made by blacklists andclassifier to MyPageKeeperrsquos identification of socware duringthe four month period of operation

plication Once a user visits this page she can readthe applicationrsquos description and then click on a link onthis page if she decides to install it However postsfrom some surreptitious applications contained short-ened URLs that directly take the user to a page wherethey request the user to grant permissions (eg to poston the userrsquos wall) and install the application We havefound all instances of such applications to be spammingapplications Therefore if any of the URLs flagged byMyPageKeeperrsquos classifier is a shortened URL that di-rectly points to the installation page for a Facebook appwe declare that classification correct

bull Spamming app From our dataset we manually identi-fied several Facebook applications that try to spread onFacebook by promising free money to users and makeposts that point to the application page Once installedby a user such applications periodically post on theuserrsquos wall (without requesting the userrsquos authorizationfor each post) in an attempt to further propagate by at-tracting that userrsquos friends Table 6 shows some suchapplications that frequently appear in our dataset AnyURLs classified as socware by MyPageKeeperrsquos classi-fier that happen to be posted by one of these manuallyidentified spamming apps are deemed correct

bull Manual analysis Finally over the operation of My-PageKeeper during the four months we periodicallyverified a subset of URLs flagged by the classifierThese provide an additional source of validation

In all the union of the above techniques validatesthat 58388 out of 60191 posts declared as socware bythe MyPageKeeper classifier are indeed so Therefore97 of the socware identified by MyPageKeeperrsquos clas-sifier are true positives On the other hand the 1803posts incorrectly classified as socware constitute less than0005 of the over 40 million posts in our dataset Notethat though all of the above techniques could be foldedinto MyPageKeeper itself to help identify socware wedo not do so because all of these techniques require us tocrawl a URL in order to evaluate it we cannot afford thelatency of crawling

8

0

10

20

30

40

50

MPK

Resolver

Craw

ler

Th

rou

gh

pu

t (U

RL

sec

)

(a) Throughput

10-2

10-1

100

101

102

MPK

Resolver

Craw

ler

Lat

ency

(se

c)

(b) Latency

Figure 2 Comparison of MyPageKeeperrsquos throughput and la-tency in classifying URLs with a short URL resolver and acrawler-based approach The height of the box shows the me-dian with the whiskers representing 5th and 95th percentiles

42 Comparison with blacklistsThough we see that the identification of socware by My-PageKeeperrsquos classifier is accurate the next logical ques-tion is what is the classifierrsquos contribution to MyPage-Keeper in comparison with URL blacklists Table 7 pro-vides a breakdown of the URLs and posts classified assocware by MyPageKeeper during the four month periodunder consideration There are two main takeaways fromthis table First we see that the classifier finds 744 ofsocware URLs and 966 of socware posts identified byMyPageKeeper Thus the classifier accounts for a largemajority of socware identified by MyPageKeeper and isthus critical to the systemrsquos operation Second there isvery little overlap between the URLs flagged by black-lists and those flagged by the classifier The typically lowfrequency of occurrence of URLs that match blacklistsis another reason that the classifierrsquos share of identifiedsocware posts is significantly greater than its correspond-ing share of flagged URLs

43 EfficiencyBeyond accuracy it is critical that MyPageKeeperrsquos iden-tification of socware be efficient so as to minimize thecosts that we need to bear in order to keep the delay inidentifying socware and alerting users low The match-ing of a URL against a whitelist or a local set of blacklistsincurs minimal computational overhead In addition wefind that execution of the classifier also imposes minimaldelay per URL verified

To demonstrate the efficiency of MyPageKeeper wecompare the rate at which it classifies URLs with the clas-sification throughput that two other alternative classes ofapproaches would be able to sustain Our first point ofcomparison is an approach that relies only on locally que-riable URL whitelists and blacklists but resolves all short-ened URLs into the corresponding complete URL Oursecond alternative crawls URLs to evaluate them eg us-ing the content on the page or the IP address of the targetwebsite Figure 2(a) compares the throughput of classi-fying URLs with the three approaches using data from

40

50

60

70

80

90

100

100

101

102

103

o

f users

of email notifications sent

Figure 3 49 of MyPageKeeperrsquos 12456 users were notifiedof socware at least once in four months of MyPageKeeperrsquos op-eration

0

5

10

15

20

25

30

35

40

101

102

103

104

Mean

of

noti

ficati

ons

of friends

Figure 4 Correlation between vulnerability and social degreeof exposed users

two weeks of MyPageKeeperrsquos execution We see thatthe throughput with MyPageKeeper is almost an order ofmagnitude greater than the alternatives with all three ap-proaches using the same set of resources on EC2 As wesee in Figure 2(b) MyPageKeeperrsquos better performancestems from its lower execution latency to check an URLthe median classification latency with MyPageKeeper is48 ms compared to a median of 426 ms when resolvingshort URLs and 19 seconds when crawling URLs Thuswe are able to significantly reduce MyPageKeeperrsquos clas-sification latency compared to approaches that need toresolve short URLs or crawl target web pages by keep-ing all of its computation local

Furthermore a crawler-based approach will be signif-icantly more expensive than MyPageKeeper Thomaset al [54] found that crawler-based classification of 15million URLs per day using cloud infrastructure resultsin an expense of $800day Therefore we estimate thatit would cost approximately $15 millionyear to handleFacebookrsquos workload 1 million URLs are shared every20 minutes on Facebook [35] Since MyPageKeeperrsquosclassification latency is 40 times less than a crawler-basedapproach we estimate that the expense incurred with My-PageKeeper would be at least 40 times lower than a sys-tem that classifies URLs by crawling them

5 Analysis of SocwareThus far we described how MyPageKeeper detectssocware efficiently at scale In this section we analyzethe socware that we have found during MyPageKeeperrsquosoperation to throw light on characteristics of socware onFacebook

9

0

500

1000

1500

2000

2500

3000

3500

061

8

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

o

f noti

ficati

ons

Figure 5 No of socware notifications per day On 11th July19th Sep and 3rd Oct socware was observed in large scale

0

20

40

60

80

100

100

101

102

103

o

f li

nks

Active duration (days)

Figure 6 Active-time of socware links 20 of socware linkswere observed more than 10 days apart

51 Prevalence of socware49 of MyPageKeeperrsquos users were exposed tosocware within four months First we analyze theprevalence of socware on Facebook To do so we de-fine that a user was exposed to a particular socware postif that post appeared in her wall or news feed As shownin Figure 3 49 of MyPageKeeperrsquos users were exposedto at least one socware post during the four month pe-riod we consider here Though this already indicates thewide reach of socware on Facebook we stress that 49is only a lower-bound due to a couple of reasons Firstmany of MyPageKeeperrsquos users subscribed to our appli-cation at some time in the midst of the four month periodand therefore we miss socware that they were potentiallyexposed to prior to them subscribing to MyPageKeeperSecond Facebook itself detects and removes posts thatit considers as spam or pointing to malware [36 38 52]All the socware detected by MyPageKeeper is after suchfiltering by Facebook

Given that some users are exposed to more socwarethan others we analyze if the social degree of a user hasany impact on the probability of a user being exposed tosocware Figure 4 shows the number of socware notifi-cations received by MyPageKeeper users as a function ofthe number of friends they have on Facebook We binusers with the number of friends within 10 of each otherand plot the average number of notifications per bin weconsider here only those users who were subscribed toMyPageKeeper for at least three months We see that theprobability of users being exposed to socware is largelyindependent of their social degree This indicates thatwhether a user is more likely to be exposed to socwareis not simply a function of how many friends she has but

Shortening service of socware URLsbitly 219

tinyurlcom 188googl 51

tco 316tinycc 16owly 11

onfbme 10isgd 07jmp 04

0rzcom 03All shortened URLs 54

Table 8 Top URL shortening services in our socware dataset

Domain Name of URLs of postsfacebookcom 207 263blogspotcom 63 87miessassinfo 19 32shurulburultk 18 12tomodayinfo 08 013

Table 9 Top two-level domains in our socware dataset

likely depends on the susceptibility of those friends to be-coming victims of scams and helping propagate them

We also find that socware on Facebook is prevalentover time Figure 5 shows the number of socware no-tifications sent per day by MyPageKeeper to its usersWe see a consistently large number of notifications go-ing out daily with noticeable spikes on a few days On11th July 2011 a scam that conned users to completesurveys with the pretext of fake free products went vi-ral and posts pointing to the scam appeared 4056 timeson the walls and news feeds of MyPageKeeperrsquos usersTwo other scams that promised lsquoFree Facebook shoesrsquoand conned users to fill out surveys also caused MyPage-Keeper to send out a large number of notifications on thatday On 19th Sep 2011 different variants of the lsquoFace-book Free T-Shirtrsquo scam [9] were spreading on Facebookand was spotted 2040 times by MyPageKeeper On 3rd

Oct 2011 a video scam was spreading on Facebook andMyPageKeeper observed it in 1739 posts

We next analyze the prevalence and impact of socwarefrom the perspective of individual socware links Foreach link we define its ldquoactive-timerdquo as the differencebetween the first and last times of its occurrence in ourdataset Figure 6 shows that we did not see 60 ofsocware links beyond one day Subsequent posts con-taining these links may have been filtered by Facebookonce it recognized their spammy or malicious nature orour dataset may miss those posts due to MyPageKeeperrsquoslimited view into Facebookrsquos 850 million users Fur-ther we do not attempt any clustering of links into cam-paigns here However even with these caveats 20 ofsocware links were seen in multiple posts separated byat least 10 days suggesting that a significant fractionof socware eludes Facebookrsquos detection mechanisms andlasts on Facebook for significant durations

10

0 5

10 15 20 25 30 35 40 45

Disabled

App inst

Deleted

LaaSEvent

App prof

o

f U

RL

s

Figure 7 Breakdown of socware links when crawledin Nov 2011 that originally point to web pages in thefacebookcom domain

52 Domain name characteristics20 of socware links are hosted inside FacebookIn the next section of our analysis we focus on thedomain-level characteristics of socware links First Ta-ble 8 shows the top ten URL shortening services usedin socware links observed by MyPageKeeper In allshortened URLs account for 54 of socware links in ourdataset Our design of MyPageKeeperrsquos classifier to relysolely on social context and to not resolve short URLshence makes a significant difference (as previously seenin the comparison of classification latency)

Further we find it surprising that a large fractionof socware links (46) are not shortened given thatshortening of URLs enables spammers to obfuscatethem On further investigation we find that many Face-book scams such as lsquofree iPhonersquo and lsquofree NFL jer-seyrsquo use domain names that clearly state the messageof the scam eg httpiphonefree5com and httpnfljerseyfreecom These URLs are more likely to elicithigher click-through rates compared to shortened URLsOn the other hand most of the shortened URLs were usedby malicious or spam applications (eg lsquoThe Apprsquo lsquoPro-file Stalkerrsquo) that generate shortened URLs pointing totheir applicationrsquos installation page We find that 89of shortened URLs in our dataset of socware links wereposted by Facebook applications

Next based on our crawl of the socware links in ourdataset we inspect the top two-level domains found onthe landing pages pointed to by these links First asshown in Table 9 we find that a large fraction of socware(over 20 of URLs and 26 of posts) is hosted on Face-book itself Second a sizeable fraction of socware usessites such as blogspotcom and wordpresscomthat enable the spammers to easily create a large numberof URLs without going through the hassle of registeringnew domains Further all of these domains are of goodrepute and are unlikely to be flagged by traditional web-site blacklists

53 Analysis of socware hosted in FacebookHackers use numerous channels in Facebook tospread socware Given the large fraction of socware

0

20

40

60

80

100

0 20 40 60 80 100

o

f U

RL

s

of clicks originated from Facebook

Figure 8 For most socware links shortened with bitly orgoogl a large majority of the clicks came from Facebook

hosted on Facebook itself we next analyze this subsetof socware First in early November 2011 we crawledevery socware link in our dataset that had pointed to alanding page in the facebookcom domain at the timewhen MyPageKeeper had initially classified that link assocware Figure 7 presents a breakdown of the resultsof this crawl If Facebook disables a URL it redirectsus to facebookcomhomephp Similarly if crawl-ing a URL points us to facebookcom4oh4 it im-plies that Facebook has deleted the content at that URLTherefore as seen in Figure 7 a large fraction of socwarelinks that were originally pointing to Facebook have nowbeen deactivated However we also see that a significantfraction of these linksmdashover 40mdashwere still live Fur-ther the figure shows that spammers use several differentchannels such as applications events and pages to prop-agate their scams on Facebook In the figure lsquoApp instrsquoand lsquoApp profrsquo refer to the installation and profile pagesof Facebook applications and lsquoLaaSrsquo refers to campaignsintended to increase the number of Likes on a Facebookpage (described in detail in Section 6)

In our dataset we see 257 distinct socware links short-ened with the bitly and googlURL shorteners thatpoint to landing pages in the facebookcom domainUsing the APIs [5 16] offered by these URL shorten-ing services we computed the number of clicks recordedfor these 257 links in two casesmdash1) where the Refer-rer was Facebook and 2) where the Referrer was anyother domain Figure 8 shows that Facebook is the dom-inant platform from which most of these links receivedmost of their clicks 80 of links received over 70 oftheir clicks from Facebook This seems to indicate thatmost socware hosted on Facebook is propagated solelyon Facebook and tailored for that platform

54 Comparison of socware to email spamSocware keywords exhibit little (10) overlap withspam email keywords As we saw earlier in Section 4spam keyword score is a key feature in MyPageKeeperrsquosclassifier Therefore in the final section of our analysiswe investigate the overlap in lsquospam keywordsrsquo that weobserve in socware on Facebook with those seen in an-other medium targeted by spammers specifically email

11

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

User

Facebook Servers

Application Server

4 Generate and share

access token

1 App add request

2 Return permission setrequired by the app

3 Allow permission set

Crawler DB

Facebook DB

WLFeatureExtractor

URL

benign unknown

No SVM Classifier

benign Unknown

Malicious

Notification

BLNo

Yes Yes

Crawling Facebook posts Classifying Facebook posts

(a) (b)Figure 1 (a) Application installation process on Facebook and (b) architecture of MyPageKeeper

user or posting a comment on the suspect posts In thefuture we will consider allowing our system to removethe malicious post automatically but this can create lia-bilities in the case of false positives

f User feedback module Finally to improve My-PageKeeperrsquos ability to detect socware we leverage ouruser community We allow users to report suspiciousposts through a convenient user-interface In such a re-port the user can optionally describe the reason why sheconsiders the post as socware

33 Identification of socwareThe key novelty of MyPageKeeper lies in the classifica-tion module (summarized in Figure 1(b)) As describedearlier the input to the classification module is a URLand the related social context features extracted from theposts that contain the URL Our classification algorithmoperates in two phases with the expectation that URLsand related posts that make it through either phase with-out a match are likely benign and are treated as such

Using whitelists and blacklists To improve the effi-ciency and accuracy of our classifier we use lists of URLsand domains in the following two steps First MyPage-Keeper matches every URL against a whitelist of popularreputable domains We currently use a whitelist compris-ing the top 70 domains listed by Quantcast excluding do-mains that host user-contributed content (eg OSNs andblogging sites) Any URL that matches this whitelist isdeemed safe and it is not processed further

Second all the URLs that remain are then matchedwith several URL blacklists that list domains and URLsthat have been identified as responsible for spam phish-ing or malware Again the need to minimize classi-fication latency forces us to only use blacklists that wecan download and match against locally Such blacklistsinclude those from Googlersquos Safe Browsing API [17]Malware Patrol [23] PhishTank [26] APWG [1] Spam-Cop [28] joewein [20] and Escrow Fraud [7] Queryingblacklists that are hosted externally such as SURBL [31]URIBL [33] and WOT [34] will introduce significant la-tency and increase MyPageKeeperrsquos latency in detectingsocware thus inflating the window of vulnerability AnyURL that matches any of the blacklists that we use is clas-sified as socware

Using machine learning with social context fea-tures All URLs that do not match the whitelist or anyof the blacklists are evaluated using a Support VectorMachines (SVM) based classifier SVM is widely andsuccessfully used for binary classification in security andother disciplines [49 46] [32] We train our system witha batch of manually labeled data that we gathered overseveral months prior to the launch of MyPageKeeper Forevery input URL and post the classifier outputs a binarydecision to indicate whether it is malicious or not OurSVM classifier uses the following features

Spam keyword score Presence of spam keywords in apost provides a strong indication that the post is spamSome examples of such spam keywords are lsquoFREErsquolsquoHurryrsquo lsquoDealrsquo and lsquoShockedrsquo To compile a list of suchkeywords that are distinctive to socware our intuitionis to identify those keywords that 1) occur frequently insocware posts and 2) appear with a greater frequency insocware as compared to their frequency in benign posts

We compile such a list of keywords by comparinga dataset of manually identified socware posts witha dataset of posts that contain URLs that match ourwhitelist (we discuss how to maintain this list of key-words in Section 7) We transform posts in either datasetto a bag of words with their frequency of occurrenceWe then compute the likelihood ratio p1p2 for eachkeyword where p1 = p(word|socwarepost) and p2 =p(word|benignpost) The likelihood ratio of a key-word indicates the bias of the keyword appearing morein socware than in benign posts In our current imple-mentation of MyPageKeeper we have found that the useof the 6 keywords with the highest likelihood ratio val-ues among the 100 most frequently occurring keywordsin socware is sufficient to accurately detect socware

Thereafter to classify a URL MyPageKeeper searchesall posts that contain the URL for the presence of thesespam keywords and computes a spam keyword score asthe ratio of the number of occurrences of spam keywordsacross these posts to the number of posts

Message similarity If a post is part of a spam cam-paign it usually contains a text message that is similarto the text in other posts containing the same URL (egbecause users propagate the post by simply sharing it)On the other hand when different users share the same

5

popular URL they are likely to include different text de-scriptions in their posts Therefore greater similarity inthe text messages across all posts containing a URL por-tends a higher probability that the URL leads to spam Tocapture this intuition for each URL we compute a mes-sage similarity score that captures the variance in the textmessages across all posts that contain the URL For eachpost MyPageKeeper sums the ASCII values of the char-acters in the text message in the post and then computesthe standard deviation of this sum across all the posts thatcontain the URL If the text descriptions in all posts aresimilar the standard deviation will be low

News feed post and wall post count The more suc-cessful a spam campaign the greater the number of wallsand news feeds in which posts corresponding to the cam-paign will be seen Therefore for each URL MyPage-Keeper computes counts of the number of wall posts andthe number of news feed posts which contained the URL

Like and comment count Facebook users can lsquoLikersquoany post to indicate their interest or approval Users canalso post comments to follow up on the post again indi-cating their interest Users are unlikely to lsquoLikersquo postspointing to socware or comment on such posts sincethey add little value Therefore for every URL My-PageKeeper computes counts of the number of Likes andnumber of comments seen across all posts that containthe URL

URL obfuscation Hackers often try to spread mali-cious links in an obfuscated form eg by shortening itwith a URL shortening service such as bitly or googlWe store a binary feature with every URL that indicateswhether the URL has been shortened or not we maintaina list of URL shorteners

Note that none of the above features by themselves areconclusive evidence of socware and other features couldpotentially further enhance the classifier (eg we can ac-count for spam keywords such as lsquofreersquo included in URLssuch as httpnfljerseyfreecom) However as we showlater in our evaluation the features that we currently con-sider yield high classification accuracy in combination

34 Implementing MyPageKeeperWe provide some details on MyPageKeeperrsquos implemen-tation

Facebook application First we implement the My-PageKeeper Facebook application using FBML [14]We implement our application server using Apache(web server) Django (web framework) and Postgres(database) Once a user installs the MyPageKeeper app inher profile Facebook generates a secret access token andforwards the token to our application server which wethen save in a database This token is used by the crawlerto crawl the walls and news feeds of subscribed users us-ing the Facebook open-graph API If any user deactivates

Data Total distinct URLsMyPageKeeper users 12456 -

Friends of MyPageKeeper users 2370272 -News feed posts 38764575 29522732

Wall posts 1760737 1532055User reports 679 333

Table 3 Summary of MyPageKeeper data

MyPageKeeper from their profile Facebook disables thistoken and notifies our application server whereupon westop crawling that userrsquos wall and news feed

Crawler instances and frequency We run a set ofcrawlers in Amazon EC2 instances to periodically crawlthe walls and news feeds of MyPageKeeperrsquos users Theset of users are partitioned across the crawlers In ourcurrent instantiation we run one crawler process for ev-ery 1000 users Thus as more users subscribe to My-PageKeeper we can easily scale the task of crawling theirwalls and news feeds by instantiating more EC2 instancesfor the task Our Python-based crawlers use the open-graph API incorporating usersrsquo secret access tokens tocrawl posts from Facebook Once the data is received inJSON format the crawlers parse the data and save it in alocal Postgres database

Currently as a tradeoff between timeliness of detectionand resource costs on EC2 we instantiate MyPageKeeperto crawl every userrsquos wall and news feed once every twohours Every couple of hours all of our crawlers start upand each crawler fetches new posts that were previouslynot seen for the users assigned to it Once all crawlerscomplete execution the data from their local databases ismigrated to a central database

Checker instances Checker modules are used to clas-sify every post as socware or benign Every two hoursthe central scheduler forks an appropriate number ofchecker modules determined by the number of new URLscrawled since the last round of checking Thus the iden-tification of socware is also scalable since each checkermodule runs on a subset of the pool of URLs Eachchecker evaluates the URLs it receives as inputmdashusinga combination of whitelists blacklists and a classifiermdashand saves the results in a database We use the libsvm [41]library for SVM based classification Once all checkermodules complete execution notifiers are invoked to no-tify all users who have posts either on their wall or intheir news feed that contain URLs that have been flaggedas socware

4 EvaluationNext we evaluate MyPageKeeper from three perspec-tives First we evaluate the accuracy with which it clas-sifies socware Second we determine the contribution ofMyPageKeeperrsquos social context based classifier in iden-tifying socware compared to the URL blacklists that ituses Lastly we compare MyPageKeeperrsquos efficiency

6

Feature F-ScoreURL obfuscated 0300378

Spam keyword score 0262220 of news feed posts 0173836

Message similarity score 0131733 of Likes 0039895

of wall posts 0019857 of comments 0006367

Table 4 Feature scores used by MyPageKeeperrsquos classifierAlternative source of postsFlagged by blacklist 18923Flagged by onfbme 2102

Content deleted by Facebook 3918Blacklisted app 1290Blacklisted IP 5827

Domain is deleted 247Points to app install 4658

Spamming app 6547Manually verified 14876

True positives 58388 (97)Unknown 1803 (3)

Total 60191

Table 5 Validation of socware flagged by MyPageKeeper clas-sifier

with alternative approaches that would either crawl everyURL or at least resolve short URLs in order to identifysocware

Table 3 summarizes the dataset of Facebook posts onwhich we conduct our evaluation This data is obtainedduring MyPageKeeperrsquos operation over a four month pe-riod from 20th June to 19th October 2011 MyPage-Keeper had over 12K users during this period who hadaround 237M friends in union Our data comprises 387million and 17 million posts that contain URLs fromthe news feeds and walls of these 12K users We con-sider only those posts that contain URLs since MyPage-Keeper currently checks only such posts Overall these40 million posts contained around 30 million uniqueURLs In addition we received 679 reports of socwarefrom 533 distinct MyPageKeeper users during the fourmonth period with 333 distinct URLs across these re-ports Though it is hard to make any general claims withregard to representativeness of our data we find that sev-eral user metrics (eg the male-to-female ratio and thedistribution of users across age groups) closely match thatof the Facebook user base at-large

41 AccuracyAs previously mentioned MyPageKeeper first matchesevery URL to be checked against a whitelist If no matchis found it checks the URL with a set of locally queriableURL blacklists Finally MyPageKeeper applies its socialcontext based classifier learned using the SVM modelIn this process we assume URL information providedby whitelists and blacklists to be ground truth ie clas-sification provided by them need not be independentlyvalidated Therefore we focus here on validating the

App name Description of postsSendible Social Media Manage-

ment6687

iRazoo Search amp win 18534Loot 4Loot lets you win all

sorts of Loot whilesearching the web

1891

Table 6 Top three spamming applications in our dataset

socware flagged by MyPageKeeperrsquos classifier based onsocial context features

We trained MyPageKeeperrsquos classifier using a manu-ally verified dataset of URLs that contain 2500 positivesamples and 5000 negative samples of socware posts wegathered these samples over several months while devel-oping MyPageKeeper Table 4 shows the importance ofthe various features in the SVM classifier learned Dur-ing the course of MyPageKeeperrsquos operation over fourmonths we applied the classifier to check 753516 uniqueURLs these are URLs that do not match the whitelist orany of the blacklists Of these URLs the classifier iden-tified 4972 URLs seen across 60191 posts as instancesof socware

It is important to note that when MyPageKeeper seesa URL in multiple posts over time the values of the fea-tures associated with the URL may change every timeit appears eg the message similarity score associatedwith the URL can change However once MyPage-Keeper classifies a URL as socware during any of its oc-currences it flags all previously seen posts that containthe URL and notifies the corresponding users Thereforein evaluating MyPageKeeperrsquos classifier URL blacklistsor MyPageKeeper as a whole we consider here that atechnique classified a particular URL as socware if thatURL was flagged by that technique upon any of theURLrsquos occurrences Correspondingly we consider a URLto have not been classified as socware if it was not iden-tified as such during any of its occurrences

Checking the validity of socware identified by My-PageKeeperrsquos classifier is not straightforward since thereis no ground truth for what represents socware and whatdoes not However here we attempt to evaluate the pos-itive samples of socware identified by MyPageKeeperrsquosclassifier using a combination of a host of complemen-tary techniques (we later discuss in Section 7 the valida-tion of posts that are deemed safe by MyPageKeeper) Todo so we use an instrumented Firefox browser to crawlthe 4972 URLs flagged by MyPageKeeper at the end ofthe four month period of MyPageKeeper operation Forevery URL that we crawl we record the landing URLthe IP address and other whois information of the land-ing domain and contents of the landing page To verifythe reputation of every URL we then apply several tech-niques in the order summarized in Table 5bull Blacklisted URLs First we check if any of the URLs

or the corresponding landing URLs are found in any

7

URL blacklists Note that though we use blacklistsin the operation of MyPageKeeper itself we use onlythose that can be stored and queried locally There-fore here we use for validation other external black-lists for which we have to issue remote queries Fur-ther even for blacklists used in MyPageKeeper theymay not identify some instances of socware when theyinitially appear because blacklists have been found tolag in keeping up with the viral propagation of spamon OSNs [44] Hence we check if a URL identified assocware by MyPageKeeperrsquos classifier appeared in anyof the blacklists used by MyPageKeeper at a later pointin time even though it did not appear in any of thoseblacklists initially when MyPageKeeper spotted postscontaining that URL

bull Flagged by fbme URL shortener Many URLs postedon Facebook are shortened using Facebookrsquos URLshortener fbme When Facebook determines any linkshortened using their service to be unsafe the corre-sponding shortened URL thereafter redirects to Face-bookrsquos home pagemdashfacebookcomhomephpmdashinstead of the actual landing page Of the URLs flaggedby MyPageKeeperrsquos classifier we check if those short-ened using Facebookrsquos URL shortening service redirectto Facebookrsquos home page

bull Content deleted from Facebook If Facebook deter-mines any URL hosted under the facebookcom do-main to be unsafe (eg the page for a spamming Face-book application) it thereafter redirects that URL tofacebookcom4oh4php We use this as anothersource of information to validate URLs flagged by My-PageKeeperrsquos classifier

bull Blacklisted apps If the URLs in posts made by a Face-book app are flagged due to any of the above reasonswe consider that app to be malicious and declare allother URLs posted by it as unsafe thus helping val-idate some of the URLs declared as socware by My-PageKeeperrsquos classifier

bull Blacklisted IPs For every URL flagged by any of theabove techniques we record the IP address when thatURL is crawled and blacklist that IP Of the URLsflagged by MyPageKeeperrsquos classifier we then con-sider those that lead to one of these blacklisted IP ad-dresses as correctly classified

bull Domain deleted Malicious domains are often deletedonce they are caught serving malicious content There-fore we deem MyPageKeeperrsquos positive classificationof a URL to be correct if the domain for that URL nolonger exists when we attempt to crawl it

bull Obfuscation of app installation page Posts made byFacebook applications to attract users to install themtypically include an un-shortened URL pointing to aFacebook page that contains information about the ap-

Source () of URLs () of posts Overlap withclassifier (of URLs)

Google SBA2 221 (68) 378 (04) 0Phishtank 12 (04) 435 (05) 1Malware Norm 69 (21) 154 (02) 0Joewein 240 (74) 652 (07) 11APWG 56 (17) 569 (06) 0Spamcop 232 (71) 921 (10) 0All blacklists 830 (256) 3104 (34) 12MyPageKeeperclassifier

2405 (744) 89389 (966)

Table 7 Comparison of contribution made by blacklists andclassifier to MyPageKeeperrsquos identification of socware duringthe four month period of operation

plication Once a user visits this page she can readthe applicationrsquos description and then click on a link onthis page if she decides to install it However postsfrom some surreptitious applications contained short-ened URLs that directly take the user to a page wherethey request the user to grant permissions (eg to poston the userrsquos wall) and install the application We havefound all instances of such applications to be spammingapplications Therefore if any of the URLs flagged byMyPageKeeperrsquos classifier is a shortened URL that di-rectly points to the installation page for a Facebook appwe declare that classification correct

bull Spamming app From our dataset we manually identi-fied several Facebook applications that try to spread onFacebook by promising free money to users and makeposts that point to the application page Once installedby a user such applications periodically post on theuserrsquos wall (without requesting the userrsquos authorizationfor each post) in an attempt to further propagate by at-tracting that userrsquos friends Table 6 shows some suchapplications that frequently appear in our dataset AnyURLs classified as socware by MyPageKeeperrsquos classi-fier that happen to be posted by one of these manuallyidentified spamming apps are deemed correct

bull Manual analysis Finally over the operation of My-PageKeeper during the four months we periodicallyverified a subset of URLs flagged by the classifierThese provide an additional source of validation

In all the union of the above techniques validatesthat 58388 out of 60191 posts declared as socware bythe MyPageKeeper classifier are indeed so Therefore97 of the socware identified by MyPageKeeperrsquos clas-sifier are true positives On the other hand the 1803posts incorrectly classified as socware constitute less than0005 of the over 40 million posts in our dataset Notethat though all of the above techniques could be foldedinto MyPageKeeper itself to help identify socware wedo not do so because all of these techniques require us tocrawl a URL in order to evaluate it we cannot afford thelatency of crawling

8

0

10

20

30

40

50

MPK

Resolver

Craw

ler

Th

rou

gh

pu

t (U

RL

sec

)

(a) Throughput

10-2

10-1

100

101

102

MPK

Resolver

Craw

ler

Lat

ency

(se

c)

(b) Latency

Figure 2 Comparison of MyPageKeeperrsquos throughput and la-tency in classifying URLs with a short URL resolver and acrawler-based approach The height of the box shows the me-dian with the whiskers representing 5th and 95th percentiles

42 Comparison with blacklistsThough we see that the identification of socware by My-PageKeeperrsquos classifier is accurate the next logical ques-tion is what is the classifierrsquos contribution to MyPage-Keeper in comparison with URL blacklists Table 7 pro-vides a breakdown of the URLs and posts classified assocware by MyPageKeeper during the four month periodunder consideration There are two main takeaways fromthis table First we see that the classifier finds 744 ofsocware URLs and 966 of socware posts identified byMyPageKeeper Thus the classifier accounts for a largemajority of socware identified by MyPageKeeper and isthus critical to the systemrsquos operation Second there isvery little overlap between the URLs flagged by black-lists and those flagged by the classifier The typically lowfrequency of occurrence of URLs that match blacklistsis another reason that the classifierrsquos share of identifiedsocware posts is significantly greater than its correspond-ing share of flagged URLs

43 EfficiencyBeyond accuracy it is critical that MyPageKeeperrsquos iden-tification of socware be efficient so as to minimize thecosts that we need to bear in order to keep the delay inidentifying socware and alerting users low The match-ing of a URL against a whitelist or a local set of blacklistsincurs minimal computational overhead In addition wefind that execution of the classifier also imposes minimaldelay per URL verified

To demonstrate the efficiency of MyPageKeeper wecompare the rate at which it classifies URLs with the clas-sification throughput that two other alternative classes ofapproaches would be able to sustain Our first point ofcomparison is an approach that relies only on locally que-riable URL whitelists and blacklists but resolves all short-ened URLs into the corresponding complete URL Oursecond alternative crawls URLs to evaluate them eg us-ing the content on the page or the IP address of the targetwebsite Figure 2(a) compares the throughput of classi-fying URLs with the three approaches using data from

40

50

60

70

80

90

100

100

101

102

103

o

f users

of email notifications sent

Figure 3 49 of MyPageKeeperrsquos 12456 users were notifiedof socware at least once in four months of MyPageKeeperrsquos op-eration

0

5

10

15

20

25

30

35

40

101

102

103

104

Mean

of

noti

ficati

ons

of friends

Figure 4 Correlation between vulnerability and social degreeof exposed users

two weeks of MyPageKeeperrsquos execution We see thatthe throughput with MyPageKeeper is almost an order ofmagnitude greater than the alternatives with all three ap-proaches using the same set of resources on EC2 As wesee in Figure 2(b) MyPageKeeperrsquos better performancestems from its lower execution latency to check an URLthe median classification latency with MyPageKeeper is48 ms compared to a median of 426 ms when resolvingshort URLs and 19 seconds when crawling URLs Thuswe are able to significantly reduce MyPageKeeperrsquos clas-sification latency compared to approaches that need toresolve short URLs or crawl target web pages by keep-ing all of its computation local

Furthermore a crawler-based approach will be signif-icantly more expensive than MyPageKeeper Thomaset al [54] found that crawler-based classification of 15million URLs per day using cloud infrastructure resultsin an expense of $800day Therefore we estimate thatit would cost approximately $15 millionyear to handleFacebookrsquos workload 1 million URLs are shared every20 minutes on Facebook [35] Since MyPageKeeperrsquosclassification latency is 40 times less than a crawler-basedapproach we estimate that the expense incurred with My-PageKeeper would be at least 40 times lower than a sys-tem that classifies URLs by crawling them

5 Analysis of SocwareThus far we described how MyPageKeeper detectssocware efficiently at scale In this section we analyzethe socware that we have found during MyPageKeeperrsquosoperation to throw light on characteristics of socware onFacebook

9

0

500

1000

1500

2000

2500

3000

3500

061

8

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

o

f noti

ficati

ons

Figure 5 No of socware notifications per day On 11th July19th Sep and 3rd Oct socware was observed in large scale

0

20

40

60

80

100

100

101

102

103

o

f li

nks

Active duration (days)

Figure 6 Active-time of socware links 20 of socware linkswere observed more than 10 days apart

51 Prevalence of socware49 of MyPageKeeperrsquos users were exposed tosocware within four months First we analyze theprevalence of socware on Facebook To do so we de-fine that a user was exposed to a particular socware postif that post appeared in her wall or news feed As shownin Figure 3 49 of MyPageKeeperrsquos users were exposedto at least one socware post during the four month pe-riod we consider here Though this already indicates thewide reach of socware on Facebook we stress that 49is only a lower-bound due to a couple of reasons Firstmany of MyPageKeeperrsquos users subscribed to our appli-cation at some time in the midst of the four month periodand therefore we miss socware that they were potentiallyexposed to prior to them subscribing to MyPageKeeperSecond Facebook itself detects and removes posts thatit considers as spam or pointing to malware [36 38 52]All the socware detected by MyPageKeeper is after suchfiltering by Facebook

Given that some users are exposed to more socwarethan others we analyze if the social degree of a user hasany impact on the probability of a user being exposed tosocware Figure 4 shows the number of socware notifi-cations received by MyPageKeeper users as a function ofthe number of friends they have on Facebook We binusers with the number of friends within 10 of each otherand plot the average number of notifications per bin weconsider here only those users who were subscribed toMyPageKeeper for at least three months We see that theprobability of users being exposed to socware is largelyindependent of their social degree This indicates thatwhether a user is more likely to be exposed to socwareis not simply a function of how many friends she has but

Shortening service of socware URLsbitly 219

tinyurlcom 188googl 51

tco 316tinycc 16owly 11

onfbme 10isgd 07jmp 04

0rzcom 03All shortened URLs 54

Table 8 Top URL shortening services in our socware dataset

Domain Name of URLs of postsfacebookcom 207 263blogspotcom 63 87miessassinfo 19 32shurulburultk 18 12tomodayinfo 08 013

Table 9 Top two-level domains in our socware dataset

likely depends on the susceptibility of those friends to be-coming victims of scams and helping propagate them

We also find that socware on Facebook is prevalentover time Figure 5 shows the number of socware no-tifications sent per day by MyPageKeeper to its usersWe see a consistently large number of notifications go-ing out daily with noticeable spikes on a few days On11th July 2011 a scam that conned users to completesurveys with the pretext of fake free products went vi-ral and posts pointing to the scam appeared 4056 timeson the walls and news feeds of MyPageKeeperrsquos usersTwo other scams that promised lsquoFree Facebook shoesrsquoand conned users to fill out surveys also caused MyPage-Keeper to send out a large number of notifications on thatday On 19th Sep 2011 different variants of the lsquoFace-book Free T-Shirtrsquo scam [9] were spreading on Facebookand was spotted 2040 times by MyPageKeeper On 3rd

Oct 2011 a video scam was spreading on Facebook andMyPageKeeper observed it in 1739 posts

We next analyze the prevalence and impact of socwarefrom the perspective of individual socware links Foreach link we define its ldquoactive-timerdquo as the differencebetween the first and last times of its occurrence in ourdataset Figure 6 shows that we did not see 60 ofsocware links beyond one day Subsequent posts con-taining these links may have been filtered by Facebookonce it recognized their spammy or malicious nature orour dataset may miss those posts due to MyPageKeeperrsquoslimited view into Facebookrsquos 850 million users Fur-ther we do not attempt any clustering of links into cam-paigns here However even with these caveats 20 ofsocware links were seen in multiple posts separated byat least 10 days suggesting that a significant fractionof socware eludes Facebookrsquos detection mechanisms andlasts on Facebook for significant durations

10

0 5

10 15 20 25 30 35 40 45

Disabled

App inst

Deleted

LaaSEvent

App prof

o

f U

RL

s

Figure 7 Breakdown of socware links when crawledin Nov 2011 that originally point to web pages in thefacebookcom domain

52 Domain name characteristics20 of socware links are hosted inside FacebookIn the next section of our analysis we focus on thedomain-level characteristics of socware links First Ta-ble 8 shows the top ten URL shortening services usedin socware links observed by MyPageKeeper In allshortened URLs account for 54 of socware links in ourdataset Our design of MyPageKeeperrsquos classifier to relysolely on social context and to not resolve short URLshence makes a significant difference (as previously seenin the comparison of classification latency)

Further we find it surprising that a large fractionof socware links (46) are not shortened given thatshortening of URLs enables spammers to obfuscatethem On further investigation we find that many Face-book scams such as lsquofree iPhonersquo and lsquofree NFL jer-seyrsquo use domain names that clearly state the messageof the scam eg httpiphonefree5com and httpnfljerseyfreecom These URLs are more likely to elicithigher click-through rates compared to shortened URLsOn the other hand most of the shortened URLs were usedby malicious or spam applications (eg lsquoThe Apprsquo lsquoPro-file Stalkerrsquo) that generate shortened URLs pointing totheir applicationrsquos installation page We find that 89of shortened URLs in our dataset of socware links wereposted by Facebook applications

Next based on our crawl of the socware links in ourdataset we inspect the top two-level domains found onthe landing pages pointed to by these links First asshown in Table 9 we find that a large fraction of socware(over 20 of URLs and 26 of posts) is hosted on Face-book itself Second a sizeable fraction of socware usessites such as blogspotcom and wordpresscomthat enable the spammers to easily create a large numberof URLs without going through the hassle of registeringnew domains Further all of these domains are of goodrepute and are unlikely to be flagged by traditional web-site blacklists

53 Analysis of socware hosted in FacebookHackers use numerous channels in Facebook tospread socware Given the large fraction of socware

0

20

40

60

80

100

0 20 40 60 80 100

o

f U

RL

s

of clicks originated from Facebook

Figure 8 For most socware links shortened with bitly orgoogl a large majority of the clicks came from Facebook

hosted on Facebook itself we next analyze this subsetof socware First in early November 2011 we crawledevery socware link in our dataset that had pointed to alanding page in the facebookcom domain at the timewhen MyPageKeeper had initially classified that link assocware Figure 7 presents a breakdown of the resultsof this crawl If Facebook disables a URL it redirectsus to facebookcomhomephp Similarly if crawl-ing a URL points us to facebookcom4oh4 it im-plies that Facebook has deleted the content at that URLTherefore as seen in Figure 7 a large fraction of socwarelinks that were originally pointing to Facebook have nowbeen deactivated However we also see that a significantfraction of these linksmdashover 40mdashwere still live Fur-ther the figure shows that spammers use several differentchannels such as applications events and pages to prop-agate their scams on Facebook In the figure lsquoApp instrsquoand lsquoApp profrsquo refer to the installation and profile pagesof Facebook applications and lsquoLaaSrsquo refers to campaignsintended to increase the number of Likes on a Facebookpage (described in detail in Section 6)

In our dataset we see 257 distinct socware links short-ened with the bitly and googlURL shorteners thatpoint to landing pages in the facebookcom domainUsing the APIs [5 16] offered by these URL shorten-ing services we computed the number of clicks recordedfor these 257 links in two casesmdash1) where the Refer-rer was Facebook and 2) where the Referrer was anyother domain Figure 8 shows that Facebook is the dom-inant platform from which most of these links receivedmost of their clicks 80 of links received over 70 oftheir clicks from Facebook This seems to indicate thatmost socware hosted on Facebook is propagated solelyon Facebook and tailored for that platform

54 Comparison of socware to email spamSocware keywords exhibit little (10) overlap withspam email keywords As we saw earlier in Section 4spam keyword score is a key feature in MyPageKeeperrsquosclassifier Therefore in the final section of our analysiswe investigate the overlap in lsquospam keywordsrsquo that weobserve in socware on Facebook with those seen in an-other medium targeted by spammers specifically email

11

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

popular URL they are likely to include different text de-scriptions in their posts Therefore greater similarity inthe text messages across all posts containing a URL por-tends a higher probability that the URL leads to spam Tocapture this intuition for each URL we compute a mes-sage similarity score that captures the variance in the textmessages across all posts that contain the URL For eachpost MyPageKeeper sums the ASCII values of the char-acters in the text message in the post and then computesthe standard deviation of this sum across all the posts thatcontain the URL If the text descriptions in all posts aresimilar the standard deviation will be low

News feed post and wall post count The more suc-cessful a spam campaign the greater the number of wallsand news feeds in which posts corresponding to the cam-paign will be seen Therefore for each URL MyPage-Keeper computes counts of the number of wall posts andthe number of news feed posts which contained the URL

Like and comment count Facebook users can lsquoLikersquoany post to indicate their interest or approval Users canalso post comments to follow up on the post again indi-cating their interest Users are unlikely to lsquoLikersquo postspointing to socware or comment on such posts sincethey add little value Therefore for every URL My-PageKeeper computes counts of the number of Likes andnumber of comments seen across all posts that containthe URL

URL obfuscation Hackers often try to spread mali-cious links in an obfuscated form eg by shortening itwith a URL shortening service such as bitly or googlWe store a binary feature with every URL that indicateswhether the URL has been shortened or not we maintaina list of URL shorteners

Note that none of the above features by themselves areconclusive evidence of socware and other features couldpotentially further enhance the classifier (eg we can ac-count for spam keywords such as lsquofreersquo included in URLssuch as httpnfljerseyfreecom) However as we showlater in our evaluation the features that we currently con-sider yield high classification accuracy in combination

34 Implementing MyPageKeeperWe provide some details on MyPageKeeperrsquos implemen-tation

Facebook application First we implement the My-PageKeeper Facebook application using FBML [14]We implement our application server using Apache(web server) Django (web framework) and Postgres(database) Once a user installs the MyPageKeeper app inher profile Facebook generates a secret access token andforwards the token to our application server which wethen save in a database This token is used by the crawlerto crawl the walls and news feeds of subscribed users us-ing the Facebook open-graph API If any user deactivates

Data Total distinct URLsMyPageKeeper users 12456 -

Friends of MyPageKeeper users 2370272 -News feed posts 38764575 29522732

Wall posts 1760737 1532055User reports 679 333

Table 3 Summary of MyPageKeeper data

MyPageKeeper from their profile Facebook disables thistoken and notifies our application server whereupon westop crawling that userrsquos wall and news feed

Crawler instances and frequency We run a set ofcrawlers in Amazon EC2 instances to periodically crawlthe walls and news feeds of MyPageKeeperrsquos users Theset of users are partitioned across the crawlers In ourcurrent instantiation we run one crawler process for ev-ery 1000 users Thus as more users subscribe to My-PageKeeper we can easily scale the task of crawling theirwalls and news feeds by instantiating more EC2 instancesfor the task Our Python-based crawlers use the open-graph API incorporating usersrsquo secret access tokens tocrawl posts from Facebook Once the data is received inJSON format the crawlers parse the data and save it in alocal Postgres database

Currently as a tradeoff between timeliness of detectionand resource costs on EC2 we instantiate MyPageKeeperto crawl every userrsquos wall and news feed once every twohours Every couple of hours all of our crawlers start upand each crawler fetches new posts that were previouslynot seen for the users assigned to it Once all crawlerscomplete execution the data from their local databases ismigrated to a central database

Checker instances Checker modules are used to clas-sify every post as socware or benign Every two hoursthe central scheduler forks an appropriate number ofchecker modules determined by the number of new URLscrawled since the last round of checking Thus the iden-tification of socware is also scalable since each checkermodule runs on a subset of the pool of URLs Eachchecker evaluates the URLs it receives as inputmdashusinga combination of whitelists blacklists and a classifiermdashand saves the results in a database We use the libsvm [41]library for SVM based classification Once all checkermodules complete execution notifiers are invoked to no-tify all users who have posts either on their wall or intheir news feed that contain URLs that have been flaggedas socware

4 EvaluationNext we evaluate MyPageKeeper from three perspec-tives First we evaluate the accuracy with which it clas-sifies socware Second we determine the contribution ofMyPageKeeperrsquos social context based classifier in iden-tifying socware compared to the URL blacklists that ituses Lastly we compare MyPageKeeperrsquos efficiency

6

Feature F-ScoreURL obfuscated 0300378

Spam keyword score 0262220 of news feed posts 0173836

Message similarity score 0131733 of Likes 0039895

of wall posts 0019857 of comments 0006367

Table 4 Feature scores used by MyPageKeeperrsquos classifierAlternative source of postsFlagged by blacklist 18923Flagged by onfbme 2102

Content deleted by Facebook 3918Blacklisted app 1290Blacklisted IP 5827

Domain is deleted 247Points to app install 4658

Spamming app 6547Manually verified 14876

True positives 58388 (97)Unknown 1803 (3)

Total 60191

Table 5 Validation of socware flagged by MyPageKeeper clas-sifier

with alternative approaches that would either crawl everyURL or at least resolve short URLs in order to identifysocware

Table 3 summarizes the dataset of Facebook posts onwhich we conduct our evaluation This data is obtainedduring MyPageKeeperrsquos operation over a four month pe-riod from 20th June to 19th October 2011 MyPage-Keeper had over 12K users during this period who hadaround 237M friends in union Our data comprises 387million and 17 million posts that contain URLs fromthe news feeds and walls of these 12K users We con-sider only those posts that contain URLs since MyPage-Keeper currently checks only such posts Overall these40 million posts contained around 30 million uniqueURLs In addition we received 679 reports of socwarefrom 533 distinct MyPageKeeper users during the fourmonth period with 333 distinct URLs across these re-ports Though it is hard to make any general claims withregard to representativeness of our data we find that sev-eral user metrics (eg the male-to-female ratio and thedistribution of users across age groups) closely match thatof the Facebook user base at-large

41 AccuracyAs previously mentioned MyPageKeeper first matchesevery URL to be checked against a whitelist If no matchis found it checks the URL with a set of locally queriableURL blacklists Finally MyPageKeeper applies its socialcontext based classifier learned using the SVM modelIn this process we assume URL information providedby whitelists and blacklists to be ground truth ie clas-sification provided by them need not be independentlyvalidated Therefore we focus here on validating the

App name Description of postsSendible Social Media Manage-

ment6687

iRazoo Search amp win 18534Loot 4Loot lets you win all

sorts of Loot whilesearching the web

1891

Table 6 Top three spamming applications in our dataset

socware flagged by MyPageKeeperrsquos classifier based onsocial context features

We trained MyPageKeeperrsquos classifier using a manu-ally verified dataset of URLs that contain 2500 positivesamples and 5000 negative samples of socware posts wegathered these samples over several months while devel-oping MyPageKeeper Table 4 shows the importance ofthe various features in the SVM classifier learned Dur-ing the course of MyPageKeeperrsquos operation over fourmonths we applied the classifier to check 753516 uniqueURLs these are URLs that do not match the whitelist orany of the blacklists Of these URLs the classifier iden-tified 4972 URLs seen across 60191 posts as instancesof socware

It is important to note that when MyPageKeeper seesa URL in multiple posts over time the values of the fea-tures associated with the URL may change every timeit appears eg the message similarity score associatedwith the URL can change However once MyPage-Keeper classifies a URL as socware during any of its oc-currences it flags all previously seen posts that containthe URL and notifies the corresponding users Thereforein evaluating MyPageKeeperrsquos classifier URL blacklistsor MyPageKeeper as a whole we consider here that atechnique classified a particular URL as socware if thatURL was flagged by that technique upon any of theURLrsquos occurrences Correspondingly we consider a URLto have not been classified as socware if it was not iden-tified as such during any of its occurrences

Checking the validity of socware identified by My-PageKeeperrsquos classifier is not straightforward since thereis no ground truth for what represents socware and whatdoes not However here we attempt to evaluate the pos-itive samples of socware identified by MyPageKeeperrsquosclassifier using a combination of a host of complemen-tary techniques (we later discuss in Section 7 the valida-tion of posts that are deemed safe by MyPageKeeper) Todo so we use an instrumented Firefox browser to crawlthe 4972 URLs flagged by MyPageKeeper at the end ofthe four month period of MyPageKeeper operation Forevery URL that we crawl we record the landing URLthe IP address and other whois information of the land-ing domain and contents of the landing page To verifythe reputation of every URL we then apply several tech-niques in the order summarized in Table 5bull Blacklisted URLs First we check if any of the URLs

or the corresponding landing URLs are found in any

7

URL blacklists Note that though we use blacklistsin the operation of MyPageKeeper itself we use onlythose that can be stored and queried locally There-fore here we use for validation other external black-lists for which we have to issue remote queries Fur-ther even for blacklists used in MyPageKeeper theymay not identify some instances of socware when theyinitially appear because blacklists have been found tolag in keeping up with the viral propagation of spamon OSNs [44] Hence we check if a URL identified assocware by MyPageKeeperrsquos classifier appeared in anyof the blacklists used by MyPageKeeper at a later pointin time even though it did not appear in any of thoseblacklists initially when MyPageKeeper spotted postscontaining that URL

bull Flagged by fbme URL shortener Many URLs postedon Facebook are shortened using Facebookrsquos URLshortener fbme When Facebook determines any linkshortened using their service to be unsafe the corre-sponding shortened URL thereafter redirects to Face-bookrsquos home pagemdashfacebookcomhomephpmdashinstead of the actual landing page Of the URLs flaggedby MyPageKeeperrsquos classifier we check if those short-ened using Facebookrsquos URL shortening service redirectto Facebookrsquos home page

bull Content deleted from Facebook If Facebook deter-mines any URL hosted under the facebookcom do-main to be unsafe (eg the page for a spamming Face-book application) it thereafter redirects that URL tofacebookcom4oh4php We use this as anothersource of information to validate URLs flagged by My-PageKeeperrsquos classifier

bull Blacklisted apps If the URLs in posts made by a Face-book app are flagged due to any of the above reasonswe consider that app to be malicious and declare allother URLs posted by it as unsafe thus helping val-idate some of the URLs declared as socware by My-PageKeeperrsquos classifier

bull Blacklisted IPs For every URL flagged by any of theabove techniques we record the IP address when thatURL is crawled and blacklist that IP Of the URLsflagged by MyPageKeeperrsquos classifier we then con-sider those that lead to one of these blacklisted IP ad-dresses as correctly classified

bull Domain deleted Malicious domains are often deletedonce they are caught serving malicious content There-fore we deem MyPageKeeperrsquos positive classificationof a URL to be correct if the domain for that URL nolonger exists when we attempt to crawl it

bull Obfuscation of app installation page Posts made byFacebook applications to attract users to install themtypically include an un-shortened URL pointing to aFacebook page that contains information about the ap-

Source () of URLs () of posts Overlap withclassifier (of URLs)

Google SBA2 221 (68) 378 (04) 0Phishtank 12 (04) 435 (05) 1Malware Norm 69 (21) 154 (02) 0Joewein 240 (74) 652 (07) 11APWG 56 (17) 569 (06) 0Spamcop 232 (71) 921 (10) 0All blacklists 830 (256) 3104 (34) 12MyPageKeeperclassifier

2405 (744) 89389 (966)

Table 7 Comparison of contribution made by blacklists andclassifier to MyPageKeeperrsquos identification of socware duringthe four month period of operation

plication Once a user visits this page she can readthe applicationrsquos description and then click on a link onthis page if she decides to install it However postsfrom some surreptitious applications contained short-ened URLs that directly take the user to a page wherethey request the user to grant permissions (eg to poston the userrsquos wall) and install the application We havefound all instances of such applications to be spammingapplications Therefore if any of the URLs flagged byMyPageKeeperrsquos classifier is a shortened URL that di-rectly points to the installation page for a Facebook appwe declare that classification correct

bull Spamming app From our dataset we manually identi-fied several Facebook applications that try to spread onFacebook by promising free money to users and makeposts that point to the application page Once installedby a user such applications periodically post on theuserrsquos wall (without requesting the userrsquos authorizationfor each post) in an attempt to further propagate by at-tracting that userrsquos friends Table 6 shows some suchapplications that frequently appear in our dataset AnyURLs classified as socware by MyPageKeeperrsquos classi-fier that happen to be posted by one of these manuallyidentified spamming apps are deemed correct

bull Manual analysis Finally over the operation of My-PageKeeper during the four months we periodicallyverified a subset of URLs flagged by the classifierThese provide an additional source of validation

In all the union of the above techniques validatesthat 58388 out of 60191 posts declared as socware bythe MyPageKeeper classifier are indeed so Therefore97 of the socware identified by MyPageKeeperrsquos clas-sifier are true positives On the other hand the 1803posts incorrectly classified as socware constitute less than0005 of the over 40 million posts in our dataset Notethat though all of the above techniques could be foldedinto MyPageKeeper itself to help identify socware wedo not do so because all of these techniques require us tocrawl a URL in order to evaluate it we cannot afford thelatency of crawling

8

0

10

20

30

40

50

MPK

Resolver

Craw

ler

Th

rou

gh

pu

t (U

RL

sec

)

(a) Throughput

10-2

10-1

100

101

102

MPK

Resolver

Craw

ler

Lat

ency

(se

c)

(b) Latency

Figure 2 Comparison of MyPageKeeperrsquos throughput and la-tency in classifying URLs with a short URL resolver and acrawler-based approach The height of the box shows the me-dian with the whiskers representing 5th and 95th percentiles

42 Comparison with blacklistsThough we see that the identification of socware by My-PageKeeperrsquos classifier is accurate the next logical ques-tion is what is the classifierrsquos contribution to MyPage-Keeper in comparison with URL blacklists Table 7 pro-vides a breakdown of the URLs and posts classified assocware by MyPageKeeper during the four month periodunder consideration There are two main takeaways fromthis table First we see that the classifier finds 744 ofsocware URLs and 966 of socware posts identified byMyPageKeeper Thus the classifier accounts for a largemajority of socware identified by MyPageKeeper and isthus critical to the systemrsquos operation Second there isvery little overlap between the URLs flagged by black-lists and those flagged by the classifier The typically lowfrequency of occurrence of URLs that match blacklistsis another reason that the classifierrsquos share of identifiedsocware posts is significantly greater than its correspond-ing share of flagged URLs

43 EfficiencyBeyond accuracy it is critical that MyPageKeeperrsquos iden-tification of socware be efficient so as to minimize thecosts that we need to bear in order to keep the delay inidentifying socware and alerting users low The match-ing of a URL against a whitelist or a local set of blacklistsincurs minimal computational overhead In addition wefind that execution of the classifier also imposes minimaldelay per URL verified

To demonstrate the efficiency of MyPageKeeper wecompare the rate at which it classifies URLs with the clas-sification throughput that two other alternative classes ofapproaches would be able to sustain Our first point ofcomparison is an approach that relies only on locally que-riable URL whitelists and blacklists but resolves all short-ened URLs into the corresponding complete URL Oursecond alternative crawls URLs to evaluate them eg us-ing the content on the page or the IP address of the targetwebsite Figure 2(a) compares the throughput of classi-fying URLs with the three approaches using data from

40

50

60

70

80

90

100

100

101

102

103

o

f users

of email notifications sent

Figure 3 49 of MyPageKeeperrsquos 12456 users were notifiedof socware at least once in four months of MyPageKeeperrsquos op-eration

0

5

10

15

20

25

30

35

40

101

102

103

104

Mean

of

noti

ficati

ons

of friends

Figure 4 Correlation between vulnerability and social degreeof exposed users

two weeks of MyPageKeeperrsquos execution We see thatthe throughput with MyPageKeeper is almost an order ofmagnitude greater than the alternatives with all three ap-proaches using the same set of resources on EC2 As wesee in Figure 2(b) MyPageKeeperrsquos better performancestems from its lower execution latency to check an URLthe median classification latency with MyPageKeeper is48 ms compared to a median of 426 ms when resolvingshort URLs and 19 seconds when crawling URLs Thuswe are able to significantly reduce MyPageKeeperrsquos clas-sification latency compared to approaches that need toresolve short URLs or crawl target web pages by keep-ing all of its computation local

Furthermore a crawler-based approach will be signif-icantly more expensive than MyPageKeeper Thomaset al [54] found that crawler-based classification of 15million URLs per day using cloud infrastructure resultsin an expense of $800day Therefore we estimate thatit would cost approximately $15 millionyear to handleFacebookrsquos workload 1 million URLs are shared every20 minutes on Facebook [35] Since MyPageKeeperrsquosclassification latency is 40 times less than a crawler-basedapproach we estimate that the expense incurred with My-PageKeeper would be at least 40 times lower than a sys-tem that classifies URLs by crawling them

5 Analysis of SocwareThus far we described how MyPageKeeper detectssocware efficiently at scale In this section we analyzethe socware that we have found during MyPageKeeperrsquosoperation to throw light on characteristics of socware onFacebook

9

0

500

1000

1500

2000

2500

3000

3500

061

8

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

o

f noti

ficati

ons

Figure 5 No of socware notifications per day On 11th July19th Sep and 3rd Oct socware was observed in large scale

0

20

40

60

80

100

100

101

102

103

o

f li

nks

Active duration (days)

Figure 6 Active-time of socware links 20 of socware linkswere observed more than 10 days apart

51 Prevalence of socware49 of MyPageKeeperrsquos users were exposed tosocware within four months First we analyze theprevalence of socware on Facebook To do so we de-fine that a user was exposed to a particular socware postif that post appeared in her wall or news feed As shownin Figure 3 49 of MyPageKeeperrsquos users were exposedto at least one socware post during the four month pe-riod we consider here Though this already indicates thewide reach of socware on Facebook we stress that 49is only a lower-bound due to a couple of reasons Firstmany of MyPageKeeperrsquos users subscribed to our appli-cation at some time in the midst of the four month periodand therefore we miss socware that they were potentiallyexposed to prior to them subscribing to MyPageKeeperSecond Facebook itself detects and removes posts thatit considers as spam or pointing to malware [36 38 52]All the socware detected by MyPageKeeper is after suchfiltering by Facebook

Given that some users are exposed to more socwarethan others we analyze if the social degree of a user hasany impact on the probability of a user being exposed tosocware Figure 4 shows the number of socware notifi-cations received by MyPageKeeper users as a function ofthe number of friends they have on Facebook We binusers with the number of friends within 10 of each otherand plot the average number of notifications per bin weconsider here only those users who were subscribed toMyPageKeeper for at least three months We see that theprobability of users being exposed to socware is largelyindependent of their social degree This indicates thatwhether a user is more likely to be exposed to socwareis not simply a function of how many friends she has but

Shortening service of socware URLsbitly 219

tinyurlcom 188googl 51

tco 316tinycc 16owly 11

onfbme 10isgd 07jmp 04

0rzcom 03All shortened URLs 54

Table 8 Top URL shortening services in our socware dataset

Domain Name of URLs of postsfacebookcom 207 263blogspotcom 63 87miessassinfo 19 32shurulburultk 18 12tomodayinfo 08 013

Table 9 Top two-level domains in our socware dataset

likely depends on the susceptibility of those friends to be-coming victims of scams and helping propagate them

We also find that socware on Facebook is prevalentover time Figure 5 shows the number of socware no-tifications sent per day by MyPageKeeper to its usersWe see a consistently large number of notifications go-ing out daily with noticeable spikes on a few days On11th July 2011 a scam that conned users to completesurveys with the pretext of fake free products went vi-ral and posts pointing to the scam appeared 4056 timeson the walls and news feeds of MyPageKeeperrsquos usersTwo other scams that promised lsquoFree Facebook shoesrsquoand conned users to fill out surveys also caused MyPage-Keeper to send out a large number of notifications on thatday On 19th Sep 2011 different variants of the lsquoFace-book Free T-Shirtrsquo scam [9] were spreading on Facebookand was spotted 2040 times by MyPageKeeper On 3rd

Oct 2011 a video scam was spreading on Facebook andMyPageKeeper observed it in 1739 posts

We next analyze the prevalence and impact of socwarefrom the perspective of individual socware links Foreach link we define its ldquoactive-timerdquo as the differencebetween the first and last times of its occurrence in ourdataset Figure 6 shows that we did not see 60 ofsocware links beyond one day Subsequent posts con-taining these links may have been filtered by Facebookonce it recognized their spammy or malicious nature orour dataset may miss those posts due to MyPageKeeperrsquoslimited view into Facebookrsquos 850 million users Fur-ther we do not attempt any clustering of links into cam-paigns here However even with these caveats 20 ofsocware links were seen in multiple posts separated byat least 10 days suggesting that a significant fractionof socware eludes Facebookrsquos detection mechanisms andlasts on Facebook for significant durations

10

0 5

10 15 20 25 30 35 40 45

Disabled

App inst

Deleted

LaaSEvent

App prof

o

f U

RL

s

Figure 7 Breakdown of socware links when crawledin Nov 2011 that originally point to web pages in thefacebookcom domain

52 Domain name characteristics20 of socware links are hosted inside FacebookIn the next section of our analysis we focus on thedomain-level characteristics of socware links First Ta-ble 8 shows the top ten URL shortening services usedin socware links observed by MyPageKeeper In allshortened URLs account for 54 of socware links in ourdataset Our design of MyPageKeeperrsquos classifier to relysolely on social context and to not resolve short URLshence makes a significant difference (as previously seenin the comparison of classification latency)

Further we find it surprising that a large fractionof socware links (46) are not shortened given thatshortening of URLs enables spammers to obfuscatethem On further investigation we find that many Face-book scams such as lsquofree iPhonersquo and lsquofree NFL jer-seyrsquo use domain names that clearly state the messageof the scam eg httpiphonefree5com and httpnfljerseyfreecom These URLs are more likely to elicithigher click-through rates compared to shortened URLsOn the other hand most of the shortened URLs were usedby malicious or spam applications (eg lsquoThe Apprsquo lsquoPro-file Stalkerrsquo) that generate shortened URLs pointing totheir applicationrsquos installation page We find that 89of shortened URLs in our dataset of socware links wereposted by Facebook applications

Next based on our crawl of the socware links in ourdataset we inspect the top two-level domains found onthe landing pages pointed to by these links First asshown in Table 9 we find that a large fraction of socware(over 20 of URLs and 26 of posts) is hosted on Face-book itself Second a sizeable fraction of socware usessites such as blogspotcom and wordpresscomthat enable the spammers to easily create a large numberof URLs without going through the hassle of registeringnew domains Further all of these domains are of goodrepute and are unlikely to be flagged by traditional web-site blacklists

53 Analysis of socware hosted in FacebookHackers use numerous channels in Facebook tospread socware Given the large fraction of socware

0

20

40

60

80

100

0 20 40 60 80 100

o

f U

RL

s

of clicks originated from Facebook

Figure 8 For most socware links shortened with bitly orgoogl a large majority of the clicks came from Facebook

hosted on Facebook itself we next analyze this subsetof socware First in early November 2011 we crawledevery socware link in our dataset that had pointed to alanding page in the facebookcom domain at the timewhen MyPageKeeper had initially classified that link assocware Figure 7 presents a breakdown of the resultsof this crawl If Facebook disables a URL it redirectsus to facebookcomhomephp Similarly if crawl-ing a URL points us to facebookcom4oh4 it im-plies that Facebook has deleted the content at that URLTherefore as seen in Figure 7 a large fraction of socwarelinks that were originally pointing to Facebook have nowbeen deactivated However we also see that a significantfraction of these linksmdashover 40mdashwere still live Fur-ther the figure shows that spammers use several differentchannels such as applications events and pages to prop-agate their scams on Facebook In the figure lsquoApp instrsquoand lsquoApp profrsquo refer to the installation and profile pagesof Facebook applications and lsquoLaaSrsquo refers to campaignsintended to increase the number of Likes on a Facebookpage (described in detail in Section 6)

In our dataset we see 257 distinct socware links short-ened with the bitly and googlURL shorteners thatpoint to landing pages in the facebookcom domainUsing the APIs [5 16] offered by these URL shorten-ing services we computed the number of clicks recordedfor these 257 links in two casesmdash1) where the Refer-rer was Facebook and 2) where the Referrer was anyother domain Figure 8 shows that Facebook is the dom-inant platform from which most of these links receivedmost of their clicks 80 of links received over 70 oftheir clicks from Facebook This seems to indicate thatmost socware hosted on Facebook is propagated solelyon Facebook and tailored for that platform

54 Comparison of socware to email spamSocware keywords exhibit little (10) overlap withspam email keywords As we saw earlier in Section 4spam keyword score is a key feature in MyPageKeeperrsquosclassifier Therefore in the final section of our analysiswe investigate the overlap in lsquospam keywordsrsquo that weobserve in socware on Facebook with those seen in an-other medium targeted by spammers specifically email

11

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

Feature F-ScoreURL obfuscated 0300378

Spam keyword score 0262220 of news feed posts 0173836

Message similarity score 0131733 of Likes 0039895

of wall posts 0019857 of comments 0006367

Table 4 Feature scores used by MyPageKeeperrsquos classifierAlternative source of postsFlagged by blacklist 18923Flagged by onfbme 2102

Content deleted by Facebook 3918Blacklisted app 1290Blacklisted IP 5827

Domain is deleted 247Points to app install 4658

Spamming app 6547Manually verified 14876

True positives 58388 (97)Unknown 1803 (3)

Total 60191

Table 5 Validation of socware flagged by MyPageKeeper clas-sifier

with alternative approaches that would either crawl everyURL or at least resolve short URLs in order to identifysocware

Table 3 summarizes the dataset of Facebook posts onwhich we conduct our evaluation This data is obtainedduring MyPageKeeperrsquos operation over a four month pe-riod from 20th June to 19th October 2011 MyPage-Keeper had over 12K users during this period who hadaround 237M friends in union Our data comprises 387million and 17 million posts that contain URLs fromthe news feeds and walls of these 12K users We con-sider only those posts that contain URLs since MyPage-Keeper currently checks only such posts Overall these40 million posts contained around 30 million uniqueURLs In addition we received 679 reports of socwarefrom 533 distinct MyPageKeeper users during the fourmonth period with 333 distinct URLs across these re-ports Though it is hard to make any general claims withregard to representativeness of our data we find that sev-eral user metrics (eg the male-to-female ratio and thedistribution of users across age groups) closely match thatof the Facebook user base at-large

41 AccuracyAs previously mentioned MyPageKeeper first matchesevery URL to be checked against a whitelist If no matchis found it checks the URL with a set of locally queriableURL blacklists Finally MyPageKeeper applies its socialcontext based classifier learned using the SVM modelIn this process we assume URL information providedby whitelists and blacklists to be ground truth ie clas-sification provided by them need not be independentlyvalidated Therefore we focus here on validating the

App name Description of postsSendible Social Media Manage-

ment6687

iRazoo Search amp win 18534Loot 4Loot lets you win all

sorts of Loot whilesearching the web

1891

Table 6 Top three spamming applications in our dataset

socware flagged by MyPageKeeperrsquos classifier based onsocial context features

We trained MyPageKeeperrsquos classifier using a manu-ally verified dataset of URLs that contain 2500 positivesamples and 5000 negative samples of socware posts wegathered these samples over several months while devel-oping MyPageKeeper Table 4 shows the importance ofthe various features in the SVM classifier learned Dur-ing the course of MyPageKeeperrsquos operation over fourmonths we applied the classifier to check 753516 uniqueURLs these are URLs that do not match the whitelist orany of the blacklists Of these URLs the classifier iden-tified 4972 URLs seen across 60191 posts as instancesof socware

It is important to note that when MyPageKeeper seesa URL in multiple posts over time the values of the fea-tures associated with the URL may change every timeit appears eg the message similarity score associatedwith the URL can change However once MyPage-Keeper classifies a URL as socware during any of its oc-currences it flags all previously seen posts that containthe URL and notifies the corresponding users Thereforein evaluating MyPageKeeperrsquos classifier URL blacklistsor MyPageKeeper as a whole we consider here that atechnique classified a particular URL as socware if thatURL was flagged by that technique upon any of theURLrsquos occurrences Correspondingly we consider a URLto have not been classified as socware if it was not iden-tified as such during any of its occurrences

Checking the validity of socware identified by My-PageKeeperrsquos classifier is not straightforward since thereis no ground truth for what represents socware and whatdoes not However here we attempt to evaluate the pos-itive samples of socware identified by MyPageKeeperrsquosclassifier using a combination of a host of complemen-tary techniques (we later discuss in Section 7 the valida-tion of posts that are deemed safe by MyPageKeeper) Todo so we use an instrumented Firefox browser to crawlthe 4972 URLs flagged by MyPageKeeper at the end ofthe four month period of MyPageKeeper operation Forevery URL that we crawl we record the landing URLthe IP address and other whois information of the land-ing domain and contents of the landing page To verifythe reputation of every URL we then apply several tech-niques in the order summarized in Table 5bull Blacklisted URLs First we check if any of the URLs

or the corresponding landing URLs are found in any

7

URL blacklists Note that though we use blacklistsin the operation of MyPageKeeper itself we use onlythose that can be stored and queried locally There-fore here we use for validation other external black-lists for which we have to issue remote queries Fur-ther even for blacklists used in MyPageKeeper theymay not identify some instances of socware when theyinitially appear because blacklists have been found tolag in keeping up with the viral propagation of spamon OSNs [44] Hence we check if a URL identified assocware by MyPageKeeperrsquos classifier appeared in anyof the blacklists used by MyPageKeeper at a later pointin time even though it did not appear in any of thoseblacklists initially when MyPageKeeper spotted postscontaining that URL

bull Flagged by fbme URL shortener Many URLs postedon Facebook are shortened using Facebookrsquos URLshortener fbme When Facebook determines any linkshortened using their service to be unsafe the corre-sponding shortened URL thereafter redirects to Face-bookrsquos home pagemdashfacebookcomhomephpmdashinstead of the actual landing page Of the URLs flaggedby MyPageKeeperrsquos classifier we check if those short-ened using Facebookrsquos URL shortening service redirectto Facebookrsquos home page

bull Content deleted from Facebook If Facebook deter-mines any URL hosted under the facebookcom do-main to be unsafe (eg the page for a spamming Face-book application) it thereafter redirects that URL tofacebookcom4oh4php We use this as anothersource of information to validate URLs flagged by My-PageKeeperrsquos classifier

bull Blacklisted apps If the URLs in posts made by a Face-book app are flagged due to any of the above reasonswe consider that app to be malicious and declare allother URLs posted by it as unsafe thus helping val-idate some of the URLs declared as socware by My-PageKeeperrsquos classifier

bull Blacklisted IPs For every URL flagged by any of theabove techniques we record the IP address when thatURL is crawled and blacklist that IP Of the URLsflagged by MyPageKeeperrsquos classifier we then con-sider those that lead to one of these blacklisted IP ad-dresses as correctly classified

bull Domain deleted Malicious domains are often deletedonce they are caught serving malicious content There-fore we deem MyPageKeeperrsquos positive classificationof a URL to be correct if the domain for that URL nolonger exists when we attempt to crawl it

bull Obfuscation of app installation page Posts made byFacebook applications to attract users to install themtypically include an un-shortened URL pointing to aFacebook page that contains information about the ap-

Source () of URLs () of posts Overlap withclassifier (of URLs)

Google SBA2 221 (68) 378 (04) 0Phishtank 12 (04) 435 (05) 1Malware Norm 69 (21) 154 (02) 0Joewein 240 (74) 652 (07) 11APWG 56 (17) 569 (06) 0Spamcop 232 (71) 921 (10) 0All blacklists 830 (256) 3104 (34) 12MyPageKeeperclassifier

2405 (744) 89389 (966)

Table 7 Comparison of contribution made by blacklists andclassifier to MyPageKeeperrsquos identification of socware duringthe four month period of operation

plication Once a user visits this page she can readthe applicationrsquos description and then click on a link onthis page if she decides to install it However postsfrom some surreptitious applications contained short-ened URLs that directly take the user to a page wherethey request the user to grant permissions (eg to poston the userrsquos wall) and install the application We havefound all instances of such applications to be spammingapplications Therefore if any of the URLs flagged byMyPageKeeperrsquos classifier is a shortened URL that di-rectly points to the installation page for a Facebook appwe declare that classification correct

bull Spamming app From our dataset we manually identi-fied several Facebook applications that try to spread onFacebook by promising free money to users and makeposts that point to the application page Once installedby a user such applications periodically post on theuserrsquos wall (without requesting the userrsquos authorizationfor each post) in an attempt to further propagate by at-tracting that userrsquos friends Table 6 shows some suchapplications that frequently appear in our dataset AnyURLs classified as socware by MyPageKeeperrsquos classi-fier that happen to be posted by one of these manuallyidentified spamming apps are deemed correct

bull Manual analysis Finally over the operation of My-PageKeeper during the four months we periodicallyverified a subset of URLs flagged by the classifierThese provide an additional source of validation

In all the union of the above techniques validatesthat 58388 out of 60191 posts declared as socware bythe MyPageKeeper classifier are indeed so Therefore97 of the socware identified by MyPageKeeperrsquos clas-sifier are true positives On the other hand the 1803posts incorrectly classified as socware constitute less than0005 of the over 40 million posts in our dataset Notethat though all of the above techniques could be foldedinto MyPageKeeper itself to help identify socware wedo not do so because all of these techniques require us tocrawl a URL in order to evaluate it we cannot afford thelatency of crawling

8

0

10

20

30

40

50

MPK

Resolver

Craw

ler

Th

rou

gh

pu

t (U

RL

sec

)

(a) Throughput

10-2

10-1

100

101

102

MPK

Resolver

Craw

ler

Lat

ency

(se

c)

(b) Latency

Figure 2 Comparison of MyPageKeeperrsquos throughput and la-tency in classifying URLs with a short URL resolver and acrawler-based approach The height of the box shows the me-dian with the whiskers representing 5th and 95th percentiles

42 Comparison with blacklistsThough we see that the identification of socware by My-PageKeeperrsquos classifier is accurate the next logical ques-tion is what is the classifierrsquos contribution to MyPage-Keeper in comparison with URL blacklists Table 7 pro-vides a breakdown of the URLs and posts classified assocware by MyPageKeeper during the four month periodunder consideration There are two main takeaways fromthis table First we see that the classifier finds 744 ofsocware URLs and 966 of socware posts identified byMyPageKeeper Thus the classifier accounts for a largemajority of socware identified by MyPageKeeper and isthus critical to the systemrsquos operation Second there isvery little overlap between the URLs flagged by black-lists and those flagged by the classifier The typically lowfrequency of occurrence of URLs that match blacklistsis another reason that the classifierrsquos share of identifiedsocware posts is significantly greater than its correspond-ing share of flagged URLs

43 EfficiencyBeyond accuracy it is critical that MyPageKeeperrsquos iden-tification of socware be efficient so as to minimize thecosts that we need to bear in order to keep the delay inidentifying socware and alerting users low The match-ing of a URL against a whitelist or a local set of blacklistsincurs minimal computational overhead In addition wefind that execution of the classifier also imposes minimaldelay per URL verified

To demonstrate the efficiency of MyPageKeeper wecompare the rate at which it classifies URLs with the clas-sification throughput that two other alternative classes ofapproaches would be able to sustain Our first point ofcomparison is an approach that relies only on locally que-riable URL whitelists and blacklists but resolves all short-ened URLs into the corresponding complete URL Oursecond alternative crawls URLs to evaluate them eg us-ing the content on the page or the IP address of the targetwebsite Figure 2(a) compares the throughput of classi-fying URLs with the three approaches using data from

40

50

60

70

80

90

100

100

101

102

103

o

f users

of email notifications sent

Figure 3 49 of MyPageKeeperrsquos 12456 users were notifiedof socware at least once in four months of MyPageKeeperrsquos op-eration

0

5

10

15

20

25

30

35

40

101

102

103

104

Mean

of

noti

ficati

ons

of friends

Figure 4 Correlation between vulnerability and social degreeof exposed users

two weeks of MyPageKeeperrsquos execution We see thatthe throughput with MyPageKeeper is almost an order ofmagnitude greater than the alternatives with all three ap-proaches using the same set of resources on EC2 As wesee in Figure 2(b) MyPageKeeperrsquos better performancestems from its lower execution latency to check an URLthe median classification latency with MyPageKeeper is48 ms compared to a median of 426 ms when resolvingshort URLs and 19 seconds when crawling URLs Thuswe are able to significantly reduce MyPageKeeperrsquos clas-sification latency compared to approaches that need toresolve short URLs or crawl target web pages by keep-ing all of its computation local

Furthermore a crawler-based approach will be signif-icantly more expensive than MyPageKeeper Thomaset al [54] found that crawler-based classification of 15million URLs per day using cloud infrastructure resultsin an expense of $800day Therefore we estimate thatit would cost approximately $15 millionyear to handleFacebookrsquos workload 1 million URLs are shared every20 minutes on Facebook [35] Since MyPageKeeperrsquosclassification latency is 40 times less than a crawler-basedapproach we estimate that the expense incurred with My-PageKeeper would be at least 40 times lower than a sys-tem that classifies URLs by crawling them

5 Analysis of SocwareThus far we described how MyPageKeeper detectssocware efficiently at scale In this section we analyzethe socware that we have found during MyPageKeeperrsquosoperation to throw light on characteristics of socware onFacebook

9

0

500

1000

1500

2000

2500

3000

3500

061

8

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

o

f noti

ficati

ons

Figure 5 No of socware notifications per day On 11th July19th Sep and 3rd Oct socware was observed in large scale

0

20

40

60

80

100

100

101

102

103

o

f li

nks

Active duration (days)

Figure 6 Active-time of socware links 20 of socware linkswere observed more than 10 days apart

51 Prevalence of socware49 of MyPageKeeperrsquos users were exposed tosocware within four months First we analyze theprevalence of socware on Facebook To do so we de-fine that a user was exposed to a particular socware postif that post appeared in her wall or news feed As shownin Figure 3 49 of MyPageKeeperrsquos users were exposedto at least one socware post during the four month pe-riod we consider here Though this already indicates thewide reach of socware on Facebook we stress that 49is only a lower-bound due to a couple of reasons Firstmany of MyPageKeeperrsquos users subscribed to our appli-cation at some time in the midst of the four month periodand therefore we miss socware that they were potentiallyexposed to prior to them subscribing to MyPageKeeperSecond Facebook itself detects and removes posts thatit considers as spam or pointing to malware [36 38 52]All the socware detected by MyPageKeeper is after suchfiltering by Facebook

Given that some users are exposed to more socwarethan others we analyze if the social degree of a user hasany impact on the probability of a user being exposed tosocware Figure 4 shows the number of socware notifi-cations received by MyPageKeeper users as a function ofthe number of friends they have on Facebook We binusers with the number of friends within 10 of each otherand plot the average number of notifications per bin weconsider here only those users who were subscribed toMyPageKeeper for at least three months We see that theprobability of users being exposed to socware is largelyindependent of their social degree This indicates thatwhether a user is more likely to be exposed to socwareis not simply a function of how many friends she has but

Shortening service of socware URLsbitly 219

tinyurlcom 188googl 51

tco 316tinycc 16owly 11

onfbme 10isgd 07jmp 04

0rzcom 03All shortened URLs 54

Table 8 Top URL shortening services in our socware dataset

Domain Name of URLs of postsfacebookcom 207 263blogspotcom 63 87miessassinfo 19 32shurulburultk 18 12tomodayinfo 08 013

Table 9 Top two-level domains in our socware dataset

likely depends on the susceptibility of those friends to be-coming victims of scams and helping propagate them

We also find that socware on Facebook is prevalentover time Figure 5 shows the number of socware no-tifications sent per day by MyPageKeeper to its usersWe see a consistently large number of notifications go-ing out daily with noticeable spikes on a few days On11th July 2011 a scam that conned users to completesurveys with the pretext of fake free products went vi-ral and posts pointing to the scam appeared 4056 timeson the walls and news feeds of MyPageKeeperrsquos usersTwo other scams that promised lsquoFree Facebook shoesrsquoand conned users to fill out surveys also caused MyPage-Keeper to send out a large number of notifications on thatday On 19th Sep 2011 different variants of the lsquoFace-book Free T-Shirtrsquo scam [9] were spreading on Facebookand was spotted 2040 times by MyPageKeeper On 3rd

Oct 2011 a video scam was spreading on Facebook andMyPageKeeper observed it in 1739 posts

We next analyze the prevalence and impact of socwarefrom the perspective of individual socware links Foreach link we define its ldquoactive-timerdquo as the differencebetween the first and last times of its occurrence in ourdataset Figure 6 shows that we did not see 60 ofsocware links beyond one day Subsequent posts con-taining these links may have been filtered by Facebookonce it recognized their spammy or malicious nature orour dataset may miss those posts due to MyPageKeeperrsquoslimited view into Facebookrsquos 850 million users Fur-ther we do not attempt any clustering of links into cam-paigns here However even with these caveats 20 ofsocware links were seen in multiple posts separated byat least 10 days suggesting that a significant fractionof socware eludes Facebookrsquos detection mechanisms andlasts on Facebook for significant durations

10

0 5

10 15 20 25 30 35 40 45

Disabled

App inst

Deleted

LaaSEvent

App prof

o

f U

RL

s

Figure 7 Breakdown of socware links when crawledin Nov 2011 that originally point to web pages in thefacebookcom domain

52 Domain name characteristics20 of socware links are hosted inside FacebookIn the next section of our analysis we focus on thedomain-level characteristics of socware links First Ta-ble 8 shows the top ten URL shortening services usedin socware links observed by MyPageKeeper In allshortened URLs account for 54 of socware links in ourdataset Our design of MyPageKeeperrsquos classifier to relysolely on social context and to not resolve short URLshence makes a significant difference (as previously seenin the comparison of classification latency)

Further we find it surprising that a large fractionof socware links (46) are not shortened given thatshortening of URLs enables spammers to obfuscatethem On further investigation we find that many Face-book scams such as lsquofree iPhonersquo and lsquofree NFL jer-seyrsquo use domain names that clearly state the messageof the scam eg httpiphonefree5com and httpnfljerseyfreecom These URLs are more likely to elicithigher click-through rates compared to shortened URLsOn the other hand most of the shortened URLs were usedby malicious or spam applications (eg lsquoThe Apprsquo lsquoPro-file Stalkerrsquo) that generate shortened URLs pointing totheir applicationrsquos installation page We find that 89of shortened URLs in our dataset of socware links wereposted by Facebook applications

Next based on our crawl of the socware links in ourdataset we inspect the top two-level domains found onthe landing pages pointed to by these links First asshown in Table 9 we find that a large fraction of socware(over 20 of URLs and 26 of posts) is hosted on Face-book itself Second a sizeable fraction of socware usessites such as blogspotcom and wordpresscomthat enable the spammers to easily create a large numberof URLs without going through the hassle of registeringnew domains Further all of these domains are of goodrepute and are unlikely to be flagged by traditional web-site blacklists

53 Analysis of socware hosted in FacebookHackers use numerous channels in Facebook tospread socware Given the large fraction of socware

0

20

40

60

80

100

0 20 40 60 80 100

o

f U

RL

s

of clicks originated from Facebook

Figure 8 For most socware links shortened with bitly orgoogl a large majority of the clicks came from Facebook

hosted on Facebook itself we next analyze this subsetof socware First in early November 2011 we crawledevery socware link in our dataset that had pointed to alanding page in the facebookcom domain at the timewhen MyPageKeeper had initially classified that link assocware Figure 7 presents a breakdown of the resultsof this crawl If Facebook disables a URL it redirectsus to facebookcomhomephp Similarly if crawl-ing a URL points us to facebookcom4oh4 it im-plies that Facebook has deleted the content at that URLTherefore as seen in Figure 7 a large fraction of socwarelinks that were originally pointing to Facebook have nowbeen deactivated However we also see that a significantfraction of these linksmdashover 40mdashwere still live Fur-ther the figure shows that spammers use several differentchannels such as applications events and pages to prop-agate their scams on Facebook In the figure lsquoApp instrsquoand lsquoApp profrsquo refer to the installation and profile pagesof Facebook applications and lsquoLaaSrsquo refers to campaignsintended to increase the number of Likes on a Facebookpage (described in detail in Section 6)

In our dataset we see 257 distinct socware links short-ened with the bitly and googlURL shorteners thatpoint to landing pages in the facebookcom domainUsing the APIs [5 16] offered by these URL shorten-ing services we computed the number of clicks recordedfor these 257 links in two casesmdash1) where the Refer-rer was Facebook and 2) where the Referrer was anyother domain Figure 8 shows that Facebook is the dom-inant platform from which most of these links receivedmost of their clicks 80 of links received over 70 oftheir clicks from Facebook This seems to indicate thatmost socware hosted on Facebook is propagated solelyon Facebook and tailored for that platform

54 Comparison of socware to email spamSocware keywords exhibit little (10) overlap withspam email keywords As we saw earlier in Section 4spam keyword score is a key feature in MyPageKeeperrsquosclassifier Therefore in the final section of our analysiswe investigate the overlap in lsquospam keywordsrsquo that weobserve in socware on Facebook with those seen in an-other medium targeted by spammers specifically email

11

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

URL blacklists Note that though we use blacklistsin the operation of MyPageKeeper itself we use onlythose that can be stored and queried locally There-fore here we use for validation other external black-lists for which we have to issue remote queries Fur-ther even for blacklists used in MyPageKeeper theymay not identify some instances of socware when theyinitially appear because blacklists have been found tolag in keeping up with the viral propagation of spamon OSNs [44] Hence we check if a URL identified assocware by MyPageKeeperrsquos classifier appeared in anyof the blacklists used by MyPageKeeper at a later pointin time even though it did not appear in any of thoseblacklists initially when MyPageKeeper spotted postscontaining that URL

bull Flagged by fbme URL shortener Many URLs postedon Facebook are shortened using Facebookrsquos URLshortener fbme When Facebook determines any linkshortened using their service to be unsafe the corre-sponding shortened URL thereafter redirects to Face-bookrsquos home pagemdashfacebookcomhomephpmdashinstead of the actual landing page Of the URLs flaggedby MyPageKeeperrsquos classifier we check if those short-ened using Facebookrsquos URL shortening service redirectto Facebookrsquos home page

bull Content deleted from Facebook If Facebook deter-mines any URL hosted under the facebookcom do-main to be unsafe (eg the page for a spamming Face-book application) it thereafter redirects that URL tofacebookcom4oh4php We use this as anothersource of information to validate URLs flagged by My-PageKeeperrsquos classifier

bull Blacklisted apps If the URLs in posts made by a Face-book app are flagged due to any of the above reasonswe consider that app to be malicious and declare allother URLs posted by it as unsafe thus helping val-idate some of the URLs declared as socware by My-PageKeeperrsquos classifier

bull Blacklisted IPs For every URL flagged by any of theabove techniques we record the IP address when thatURL is crawled and blacklist that IP Of the URLsflagged by MyPageKeeperrsquos classifier we then con-sider those that lead to one of these blacklisted IP ad-dresses as correctly classified

bull Domain deleted Malicious domains are often deletedonce they are caught serving malicious content There-fore we deem MyPageKeeperrsquos positive classificationof a URL to be correct if the domain for that URL nolonger exists when we attempt to crawl it

bull Obfuscation of app installation page Posts made byFacebook applications to attract users to install themtypically include an un-shortened URL pointing to aFacebook page that contains information about the ap-

Source () of URLs () of posts Overlap withclassifier (of URLs)

Google SBA2 221 (68) 378 (04) 0Phishtank 12 (04) 435 (05) 1Malware Norm 69 (21) 154 (02) 0Joewein 240 (74) 652 (07) 11APWG 56 (17) 569 (06) 0Spamcop 232 (71) 921 (10) 0All blacklists 830 (256) 3104 (34) 12MyPageKeeperclassifier

2405 (744) 89389 (966)

Table 7 Comparison of contribution made by blacklists andclassifier to MyPageKeeperrsquos identification of socware duringthe four month period of operation

plication Once a user visits this page she can readthe applicationrsquos description and then click on a link onthis page if she decides to install it However postsfrom some surreptitious applications contained short-ened URLs that directly take the user to a page wherethey request the user to grant permissions (eg to poston the userrsquos wall) and install the application We havefound all instances of such applications to be spammingapplications Therefore if any of the URLs flagged byMyPageKeeperrsquos classifier is a shortened URL that di-rectly points to the installation page for a Facebook appwe declare that classification correct

bull Spamming app From our dataset we manually identi-fied several Facebook applications that try to spread onFacebook by promising free money to users and makeposts that point to the application page Once installedby a user such applications periodically post on theuserrsquos wall (without requesting the userrsquos authorizationfor each post) in an attempt to further propagate by at-tracting that userrsquos friends Table 6 shows some suchapplications that frequently appear in our dataset AnyURLs classified as socware by MyPageKeeperrsquos classi-fier that happen to be posted by one of these manuallyidentified spamming apps are deemed correct

bull Manual analysis Finally over the operation of My-PageKeeper during the four months we periodicallyverified a subset of URLs flagged by the classifierThese provide an additional source of validation

In all the union of the above techniques validatesthat 58388 out of 60191 posts declared as socware bythe MyPageKeeper classifier are indeed so Therefore97 of the socware identified by MyPageKeeperrsquos clas-sifier are true positives On the other hand the 1803posts incorrectly classified as socware constitute less than0005 of the over 40 million posts in our dataset Notethat though all of the above techniques could be foldedinto MyPageKeeper itself to help identify socware wedo not do so because all of these techniques require us tocrawl a URL in order to evaluate it we cannot afford thelatency of crawling

8

0

10

20

30

40

50

MPK

Resolver

Craw

ler

Th

rou

gh

pu

t (U

RL

sec

)

(a) Throughput

10-2

10-1

100

101

102

MPK

Resolver

Craw

ler

Lat

ency

(se

c)

(b) Latency

Figure 2 Comparison of MyPageKeeperrsquos throughput and la-tency in classifying URLs with a short URL resolver and acrawler-based approach The height of the box shows the me-dian with the whiskers representing 5th and 95th percentiles

42 Comparison with blacklistsThough we see that the identification of socware by My-PageKeeperrsquos classifier is accurate the next logical ques-tion is what is the classifierrsquos contribution to MyPage-Keeper in comparison with URL blacklists Table 7 pro-vides a breakdown of the URLs and posts classified assocware by MyPageKeeper during the four month periodunder consideration There are two main takeaways fromthis table First we see that the classifier finds 744 ofsocware URLs and 966 of socware posts identified byMyPageKeeper Thus the classifier accounts for a largemajority of socware identified by MyPageKeeper and isthus critical to the systemrsquos operation Second there isvery little overlap between the URLs flagged by black-lists and those flagged by the classifier The typically lowfrequency of occurrence of URLs that match blacklistsis another reason that the classifierrsquos share of identifiedsocware posts is significantly greater than its correspond-ing share of flagged URLs

43 EfficiencyBeyond accuracy it is critical that MyPageKeeperrsquos iden-tification of socware be efficient so as to minimize thecosts that we need to bear in order to keep the delay inidentifying socware and alerting users low The match-ing of a URL against a whitelist or a local set of blacklistsincurs minimal computational overhead In addition wefind that execution of the classifier also imposes minimaldelay per URL verified

To demonstrate the efficiency of MyPageKeeper wecompare the rate at which it classifies URLs with the clas-sification throughput that two other alternative classes ofapproaches would be able to sustain Our first point ofcomparison is an approach that relies only on locally que-riable URL whitelists and blacklists but resolves all short-ened URLs into the corresponding complete URL Oursecond alternative crawls URLs to evaluate them eg us-ing the content on the page or the IP address of the targetwebsite Figure 2(a) compares the throughput of classi-fying URLs with the three approaches using data from

40

50

60

70

80

90

100

100

101

102

103

o

f users

of email notifications sent

Figure 3 49 of MyPageKeeperrsquos 12456 users were notifiedof socware at least once in four months of MyPageKeeperrsquos op-eration

0

5

10

15

20

25

30

35

40

101

102

103

104

Mean

of

noti

ficati

ons

of friends

Figure 4 Correlation between vulnerability and social degreeof exposed users

two weeks of MyPageKeeperrsquos execution We see thatthe throughput with MyPageKeeper is almost an order ofmagnitude greater than the alternatives with all three ap-proaches using the same set of resources on EC2 As wesee in Figure 2(b) MyPageKeeperrsquos better performancestems from its lower execution latency to check an URLthe median classification latency with MyPageKeeper is48 ms compared to a median of 426 ms when resolvingshort URLs and 19 seconds when crawling URLs Thuswe are able to significantly reduce MyPageKeeperrsquos clas-sification latency compared to approaches that need toresolve short URLs or crawl target web pages by keep-ing all of its computation local

Furthermore a crawler-based approach will be signif-icantly more expensive than MyPageKeeper Thomaset al [54] found that crawler-based classification of 15million URLs per day using cloud infrastructure resultsin an expense of $800day Therefore we estimate thatit would cost approximately $15 millionyear to handleFacebookrsquos workload 1 million URLs are shared every20 minutes on Facebook [35] Since MyPageKeeperrsquosclassification latency is 40 times less than a crawler-basedapproach we estimate that the expense incurred with My-PageKeeper would be at least 40 times lower than a sys-tem that classifies URLs by crawling them

5 Analysis of SocwareThus far we described how MyPageKeeper detectssocware efficiently at scale In this section we analyzethe socware that we have found during MyPageKeeperrsquosoperation to throw light on characteristics of socware onFacebook

9

0

500

1000

1500

2000

2500

3000

3500

061

8

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

o

f noti

ficati

ons

Figure 5 No of socware notifications per day On 11th July19th Sep and 3rd Oct socware was observed in large scale

0

20

40

60

80

100

100

101

102

103

o

f li

nks

Active duration (days)

Figure 6 Active-time of socware links 20 of socware linkswere observed more than 10 days apart

51 Prevalence of socware49 of MyPageKeeperrsquos users were exposed tosocware within four months First we analyze theprevalence of socware on Facebook To do so we de-fine that a user was exposed to a particular socware postif that post appeared in her wall or news feed As shownin Figure 3 49 of MyPageKeeperrsquos users were exposedto at least one socware post during the four month pe-riod we consider here Though this already indicates thewide reach of socware on Facebook we stress that 49is only a lower-bound due to a couple of reasons Firstmany of MyPageKeeperrsquos users subscribed to our appli-cation at some time in the midst of the four month periodand therefore we miss socware that they were potentiallyexposed to prior to them subscribing to MyPageKeeperSecond Facebook itself detects and removes posts thatit considers as spam or pointing to malware [36 38 52]All the socware detected by MyPageKeeper is after suchfiltering by Facebook

Given that some users are exposed to more socwarethan others we analyze if the social degree of a user hasany impact on the probability of a user being exposed tosocware Figure 4 shows the number of socware notifi-cations received by MyPageKeeper users as a function ofthe number of friends they have on Facebook We binusers with the number of friends within 10 of each otherand plot the average number of notifications per bin weconsider here only those users who were subscribed toMyPageKeeper for at least three months We see that theprobability of users being exposed to socware is largelyindependent of their social degree This indicates thatwhether a user is more likely to be exposed to socwareis not simply a function of how many friends she has but

Shortening service of socware URLsbitly 219

tinyurlcom 188googl 51

tco 316tinycc 16owly 11

onfbme 10isgd 07jmp 04

0rzcom 03All shortened URLs 54

Table 8 Top URL shortening services in our socware dataset

Domain Name of URLs of postsfacebookcom 207 263blogspotcom 63 87miessassinfo 19 32shurulburultk 18 12tomodayinfo 08 013

Table 9 Top two-level domains in our socware dataset

likely depends on the susceptibility of those friends to be-coming victims of scams and helping propagate them

We also find that socware on Facebook is prevalentover time Figure 5 shows the number of socware no-tifications sent per day by MyPageKeeper to its usersWe see a consistently large number of notifications go-ing out daily with noticeable spikes on a few days On11th July 2011 a scam that conned users to completesurveys with the pretext of fake free products went vi-ral and posts pointing to the scam appeared 4056 timeson the walls and news feeds of MyPageKeeperrsquos usersTwo other scams that promised lsquoFree Facebook shoesrsquoand conned users to fill out surveys also caused MyPage-Keeper to send out a large number of notifications on thatday On 19th Sep 2011 different variants of the lsquoFace-book Free T-Shirtrsquo scam [9] were spreading on Facebookand was spotted 2040 times by MyPageKeeper On 3rd

Oct 2011 a video scam was spreading on Facebook andMyPageKeeper observed it in 1739 posts

We next analyze the prevalence and impact of socwarefrom the perspective of individual socware links Foreach link we define its ldquoactive-timerdquo as the differencebetween the first and last times of its occurrence in ourdataset Figure 6 shows that we did not see 60 ofsocware links beyond one day Subsequent posts con-taining these links may have been filtered by Facebookonce it recognized their spammy or malicious nature orour dataset may miss those posts due to MyPageKeeperrsquoslimited view into Facebookrsquos 850 million users Fur-ther we do not attempt any clustering of links into cam-paigns here However even with these caveats 20 ofsocware links were seen in multiple posts separated byat least 10 days suggesting that a significant fractionof socware eludes Facebookrsquos detection mechanisms andlasts on Facebook for significant durations

10

0 5

10 15 20 25 30 35 40 45

Disabled

App inst

Deleted

LaaSEvent

App prof

o

f U

RL

s

Figure 7 Breakdown of socware links when crawledin Nov 2011 that originally point to web pages in thefacebookcom domain

52 Domain name characteristics20 of socware links are hosted inside FacebookIn the next section of our analysis we focus on thedomain-level characteristics of socware links First Ta-ble 8 shows the top ten URL shortening services usedin socware links observed by MyPageKeeper In allshortened URLs account for 54 of socware links in ourdataset Our design of MyPageKeeperrsquos classifier to relysolely on social context and to not resolve short URLshence makes a significant difference (as previously seenin the comparison of classification latency)

Further we find it surprising that a large fractionof socware links (46) are not shortened given thatshortening of URLs enables spammers to obfuscatethem On further investigation we find that many Face-book scams such as lsquofree iPhonersquo and lsquofree NFL jer-seyrsquo use domain names that clearly state the messageof the scam eg httpiphonefree5com and httpnfljerseyfreecom These URLs are more likely to elicithigher click-through rates compared to shortened URLsOn the other hand most of the shortened URLs were usedby malicious or spam applications (eg lsquoThe Apprsquo lsquoPro-file Stalkerrsquo) that generate shortened URLs pointing totheir applicationrsquos installation page We find that 89of shortened URLs in our dataset of socware links wereposted by Facebook applications

Next based on our crawl of the socware links in ourdataset we inspect the top two-level domains found onthe landing pages pointed to by these links First asshown in Table 9 we find that a large fraction of socware(over 20 of URLs and 26 of posts) is hosted on Face-book itself Second a sizeable fraction of socware usessites such as blogspotcom and wordpresscomthat enable the spammers to easily create a large numberof URLs without going through the hassle of registeringnew domains Further all of these domains are of goodrepute and are unlikely to be flagged by traditional web-site blacklists

53 Analysis of socware hosted in FacebookHackers use numerous channels in Facebook tospread socware Given the large fraction of socware

0

20

40

60

80

100

0 20 40 60 80 100

o

f U

RL

s

of clicks originated from Facebook

Figure 8 For most socware links shortened with bitly orgoogl a large majority of the clicks came from Facebook

hosted on Facebook itself we next analyze this subsetof socware First in early November 2011 we crawledevery socware link in our dataset that had pointed to alanding page in the facebookcom domain at the timewhen MyPageKeeper had initially classified that link assocware Figure 7 presents a breakdown of the resultsof this crawl If Facebook disables a URL it redirectsus to facebookcomhomephp Similarly if crawl-ing a URL points us to facebookcom4oh4 it im-plies that Facebook has deleted the content at that URLTherefore as seen in Figure 7 a large fraction of socwarelinks that were originally pointing to Facebook have nowbeen deactivated However we also see that a significantfraction of these linksmdashover 40mdashwere still live Fur-ther the figure shows that spammers use several differentchannels such as applications events and pages to prop-agate their scams on Facebook In the figure lsquoApp instrsquoand lsquoApp profrsquo refer to the installation and profile pagesof Facebook applications and lsquoLaaSrsquo refers to campaignsintended to increase the number of Likes on a Facebookpage (described in detail in Section 6)

In our dataset we see 257 distinct socware links short-ened with the bitly and googlURL shorteners thatpoint to landing pages in the facebookcom domainUsing the APIs [5 16] offered by these URL shorten-ing services we computed the number of clicks recordedfor these 257 links in two casesmdash1) where the Refer-rer was Facebook and 2) where the Referrer was anyother domain Figure 8 shows that Facebook is the dom-inant platform from which most of these links receivedmost of their clicks 80 of links received over 70 oftheir clicks from Facebook This seems to indicate thatmost socware hosted on Facebook is propagated solelyon Facebook and tailored for that platform

54 Comparison of socware to email spamSocware keywords exhibit little (10) overlap withspam email keywords As we saw earlier in Section 4spam keyword score is a key feature in MyPageKeeperrsquosclassifier Therefore in the final section of our analysiswe investigate the overlap in lsquospam keywordsrsquo that weobserve in socware on Facebook with those seen in an-other medium targeted by spammers specifically email

11

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

0

10

20

30

40

50

MPK

Resolver

Craw

ler

Th

rou

gh

pu

t (U

RL

sec

)

(a) Throughput

10-2

10-1

100

101

102

MPK

Resolver

Craw

ler

Lat

ency

(se

c)

(b) Latency

Figure 2 Comparison of MyPageKeeperrsquos throughput and la-tency in classifying URLs with a short URL resolver and acrawler-based approach The height of the box shows the me-dian with the whiskers representing 5th and 95th percentiles

42 Comparison with blacklistsThough we see that the identification of socware by My-PageKeeperrsquos classifier is accurate the next logical ques-tion is what is the classifierrsquos contribution to MyPage-Keeper in comparison with URL blacklists Table 7 pro-vides a breakdown of the URLs and posts classified assocware by MyPageKeeper during the four month periodunder consideration There are two main takeaways fromthis table First we see that the classifier finds 744 ofsocware URLs and 966 of socware posts identified byMyPageKeeper Thus the classifier accounts for a largemajority of socware identified by MyPageKeeper and isthus critical to the systemrsquos operation Second there isvery little overlap between the URLs flagged by black-lists and those flagged by the classifier The typically lowfrequency of occurrence of URLs that match blacklistsis another reason that the classifierrsquos share of identifiedsocware posts is significantly greater than its correspond-ing share of flagged URLs

43 EfficiencyBeyond accuracy it is critical that MyPageKeeperrsquos iden-tification of socware be efficient so as to minimize thecosts that we need to bear in order to keep the delay inidentifying socware and alerting users low The match-ing of a URL against a whitelist or a local set of blacklistsincurs minimal computational overhead In addition wefind that execution of the classifier also imposes minimaldelay per URL verified

To demonstrate the efficiency of MyPageKeeper wecompare the rate at which it classifies URLs with the clas-sification throughput that two other alternative classes ofapproaches would be able to sustain Our first point ofcomparison is an approach that relies only on locally que-riable URL whitelists and blacklists but resolves all short-ened URLs into the corresponding complete URL Oursecond alternative crawls URLs to evaluate them eg us-ing the content on the page or the IP address of the targetwebsite Figure 2(a) compares the throughput of classi-fying URLs with the three approaches using data from

40

50

60

70

80

90

100

100

101

102

103

o

f users

of email notifications sent

Figure 3 49 of MyPageKeeperrsquos 12456 users were notifiedof socware at least once in four months of MyPageKeeperrsquos op-eration

0

5

10

15

20

25

30

35

40

101

102

103

104

Mean

of

noti

ficati

ons

of friends

Figure 4 Correlation between vulnerability and social degreeof exposed users

two weeks of MyPageKeeperrsquos execution We see thatthe throughput with MyPageKeeper is almost an order ofmagnitude greater than the alternatives with all three ap-proaches using the same set of resources on EC2 As wesee in Figure 2(b) MyPageKeeperrsquos better performancestems from its lower execution latency to check an URLthe median classification latency with MyPageKeeper is48 ms compared to a median of 426 ms when resolvingshort URLs and 19 seconds when crawling URLs Thuswe are able to significantly reduce MyPageKeeperrsquos clas-sification latency compared to approaches that need toresolve short URLs or crawl target web pages by keep-ing all of its computation local

Furthermore a crawler-based approach will be signif-icantly more expensive than MyPageKeeper Thomaset al [54] found that crawler-based classification of 15million URLs per day using cloud infrastructure resultsin an expense of $800day Therefore we estimate thatit would cost approximately $15 millionyear to handleFacebookrsquos workload 1 million URLs are shared every20 minutes on Facebook [35] Since MyPageKeeperrsquosclassification latency is 40 times less than a crawler-basedapproach we estimate that the expense incurred with My-PageKeeper would be at least 40 times lower than a sys-tem that classifies URLs by crawling them

5 Analysis of SocwareThus far we described how MyPageKeeper detectssocware efficiently at scale In this section we analyzethe socware that we have found during MyPageKeeperrsquosoperation to throw light on characteristics of socware onFacebook

9

0

500

1000

1500

2000

2500

3000

3500

061

8

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

o

f noti

ficati

ons

Figure 5 No of socware notifications per day On 11th July19th Sep and 3rd Oct socware was observed in large scale

0

20

40

60

80

100

100

101

102

103

o

f li

nks

Active duration (days)

Figure 6 Active-time of socware links 20 of socware linkswere observed more than 10 days apart

51 Prevalence of socware49 of MyPageKeeperrsquos users were exposed tosocware within four months First we analyze theprevalence of socware on Facebook To do so we de-fine that a user was exposed to a particular socware postif that post appeared in her wall or news feed As shownin Figure 3 49 of MyPageKeeperrsquos users were exposedto at least one socware post during the four month pe-riod we consider here Though this already indicates thewide reach of socware on Facebook we stress that 49is only a lower-bound due to a couple of reasons Firstmany of MyPageKeeperrsquos users subscribed to our appli-cation at some time in the midst of the four month periodand therefore we miss socware that they were potentiallyexposed to prior to them subscribing to MyPageKeeperSecond Facebook itself detects and removes posts thatit considers as spam or pointing to malware [36 38 52]All the socware detected by MyPageKeeper is after suchfiltering by Facebook

Given that some users are exposed to more socwarethan others we analyze if the social degree of a user hasany impact on the probability of a user being exposed tosocware Figure 4 shows the number of socware notifi-cations received by MyPageKeeper users as a function ofthe number of friends they have on Facebook We binusers with the number of friends within 10 of each otherand plot the average number of notifications per bin weconsider here only those users who were subscribed toMyPageKeeper for at least three months We see that theprobability of users being exposed to socware is largelyindependent of their social degree This indicates thatwhether a user is more likely to be exposed to socwareis not simply a function of how many friends she has but

Shortening service of socware URLsbitly 219

tinyurlcom 188googl 51

tco 316tinycc 16owly 11

onfbme 10isgd 07jmp 04

0rzcom 03All shortened URLs 54

Table 8 Top URL shortening services in our socware dataset

Domain Name of URLs of postsfacebookcom 207 263blogspotcom 63 87miessassinfo 19 32shurulburultk 18 12tomodayinfo 08 013

Table 9 Top two-level domains in our socware dataset

likely depends on the susceptibility of those friends to be-coming victims of scams and helping propagate them

We also find that socware on Facebook is prevalentover time Figure 5 shows the number of socware no-tifications sent per day by MyPageKeeper to its usersWe see a consistently large number of notifications go-ing out daily with noticeable spikes on a few days On11th July 2011 a scam that conned users to completesurveys with the pretext of fake free products went vi-ral and posts pointing to the scam appeared 4056 timeson the walls and news feeds of MyPageKeeperrsquos usersTwo other scams that promised lsquoFree Facebook shoesrsquoand conned users to fill out surveys also caused MyPage-Keeper to send out a large number of notifications on thatday On 19th Sep 2011 different variants of the lsquoFace-book Free T-Shirtrsquo scam [9] were spreading on Facebookand was spotted 2040 times by MyPageKeeper On 3rd

Oct 2011 a video scam was spreading on Facebook andMyPageKeeper observed it in 1739 posts

We next analyze the prevalence and impact of socwarefrom the perspective of individual socware links Foreach link we define its ldquoactive-timerdquo as the differencebetween the first and last times of its occurrence in ourdataset Figure 6 shows that we did not see 60 ofsocware links beyond one day Subsequent posts con-taining these links may have been filtered by Facebookonce it recognized their spammy or malicious nature orour dataset may miss those posts due to MyPageKeeperrsquoslimited view into Facebookrsquos 850 million users Fur-ther we do not attempt any clustering of links into cam-paigns here However even with these caveats 20 ofsocware links were seen in multiple posts separated byat least 10 days suggesting that a significant fractionof socware eludes Facebookrsquos detection mechanisms andlasts on Facebook for significant durations

10

0 5

10 15 20 25 30 35 40 45

Disabled

App inst

Deleted

LaaSEvent

App prof

o

f U

RL

s

Figure 7 Breakdown of socware links when crawledin Nov 2011 that originally point to web pages in thefacebookcom domain

52 Domain name characteristics20 of socware links are hosted inside FacebookIn the next section of our analysis we focus on thedomain-level characteristics of socware links First Ta-ble 8 shows the top ten URL shortening services usedin socware links observed by MyPageKeeper In allshortened URLs account for 54 of socware links in ourdataset Our design of MyPageKeeperrsquos classifier to relysolely on social context and to not resolve short URLshence makes a significant difference (as previously seenin the comparison of classification latency)

Further we find it surprising that a large fractionof socware links (46) are not shortened given thatshortening of URLs enables spammers to obfuscatethem On further investigation we find that many Face-book scams such as lsquofree iPhonersquo and lsquofree NFL jer-seyrsquo use domain names that clearly state the messageof the scam eg httpiphonefree5com and httpnfljerseyfreecom These URLs are more likely to elicithigher click-through rates compared to shortened URLsOn the other hand most of the shortened URLs were usedby malicious or spam applications (eg lsquoThe Apprsquo lsquoPro-file Stalkerrsquo) that generate shortened URLs pointing totheir applicationrsquos installation page We find that 89of shortened URLs in our dataset of socware links wereposted by Facebook applications

Next based on our crawl of the socware links in ourdataset we inspect the top two-level domains found onthe landing pages pointed to by these links First asshown in Table 9 we find that a large fraction of socware(over 20 of URLs and 26 of posts) is hosted on Face-book itself Second a sizeable fraction of socware usessites such as blogspotcom and wordpresscomthat enable the spammers to easily create a large numberof URLs without going through the hassle of registeringnew domains Further all of these domains are of goodrepute and are unlikely to be flagged by traditional web-site blacklists

53 Analysis of socware hosted in FacebookHackers use numerous channels in Facebook tospread socware Given the large fraction of socware

0

20

40

60

80

100

0 20 40 60 80 100

o

f U

RL

s

of clicks originated from Facebook

Figure 8 For most socware links shortened with bitly orgoogl a large majority of the clicks came from Facebook

hosted on Facebook itself we next analyze this subsetof socware First in early November 2011 we crawledevery socware link in our dataset that had pointed to alanding page in the facebookcom domain at the timewhen MyPageKeeper had initially classified that link assocware Figure 7 presents a breakdown of the resultsof this crawl If Facebook disables a URL it redirectsus to facebookcomhomephp Similarly if crawl-ing a URL points us to facebookcom4oh4 it im-plies that Facebook has deleted the content at that URLTherefore as seen in Figure 7 a large fraction of socwarelinks that were originally pointing to Facebook have nowbeen deactivated However we also see that a significantfraction of these linksmdashover 40mdashwere still live Fur-ther the figure shows that spammers use several differentchannels such as applications events and pages to prop-agate their scams on Facebook In the figure lsquoApp instrsquoand lsquoApp profrsquo refer to the installation and profile pagesof Facebook applications and lsquoLaaSrsquo refers to campaignsintended to increase the number of Likes on a Facebookpage (described in detail in Section 6)

In our dataset we see 257 distinct socware links short-ened with the bitly and googlURL shorteners thatpoint to landing pages in the facebookcom domainUsing the APIs [5 16] offered by these URL shorten-ing services we computed the number of clicks recordedfor these 257 links in two casesmdash1) where the Refer-rer was Facebook and 2) where the Referrer was anyother domain Figure 8 shows that Facebook is the dom-inant platform from which most of these links receivedmost of their clicks 80 of links received over 70 oftheir clicks from Facebook This seems to indicate thatmost socware hosted on Facebook is propagated solelyon Facebook and tailored for that platform

54 Comparison of socware to email spamSocware keywords exhibit little (10) overlap withspam email keywords As we saw earlier in Section 4spam keyword score is a key feature in MyPageKeeperrsquosclassifier Therefore in the final section of our analysiswe investigate the overlap in lsquospam keywordsrsquo that weobserve in socware on Facebook with those seen in an-other medium targeted by spammers specifically email

11

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

0

500

1000

1500

2000

2500

3000

3500

061

8

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

o

f noti

ficati

ons

Figure 5 No of socware notifications per day On 11th July19th Sep and 3rd Oct socware was observed in large scale

0

20

40

60

80

100

100

101

102

103

o

f li

nks

Active duration (days)

Figure 6 Active-time of socware links 20 of socware linkswere observed more than 10 days apart

51 Prevalence of socware49 of MyPageKeeperrsquos users were exposed tosocware within four months First we analyze theprevalence of socware on Facebook To do so we de-fine that a user was exposed to a particular socware postif that post appeared in her wall or news feed As shownin Figure 3 49 of MyPageKeeperrsquos users were exposedto at least one socware post during the four month pe-riod we consider here Though this already indicates thewide reach of socware on Facebook we stress that 49is only a lower-bound due to a couple of reasons Firstmany of MyPageKeeperrsquos users subscribed to our appli-cation at some time in the midst of the four month periodand therefore we miss socware that they were potentiallyexposed to prior to them subscribing to MyPageKeeperSecond Facebook itself detects and removes posts thatit considers as spam or pointing to malware [36 38 52]All the socware detected by MyPageKeeper is after suchfiltering by Facebook

Given that some users are exposed to more socwarethan others we analyze if the social degree of a user hasany impact on the probability of a user being exposed tosocware Figure 4 shows the number of socware notifi-cations received by MyPageKeeper users as a function ofthe number of friends they have on Facebook We binusers with the number of friends within 10 of each otherand plot the average number of notifications per bin weconsider here only those users who were subscribed toMyPageKeeper for at least three months We see that theprobability of users being exposed to socware is largelyindependent of their social degree This indicates thatwhether a user is more likely to be exposed to socwareis not simply a function of how many friends she has but

Shortening service of socware URLsbitly 219

tinyurlcom 188googl 51

tco 316tinycc 16owly 11

onfbme 10isgd 07jmp 04

0rzcom 03All shortened URLs 54

Table 8 Top URL shortening services in our socware dataset

Domain Name of URLs of postsfacebookcom 207 263blogspotcom 63 87miessassinfo 19 32shurulburultk 18 12tomodayinfo 08 013

Table 9 Top two-level domains in our socware dataset

likely depends on the susceptibility of those friends to be-coming victims of scams and helping propagate them

We also find that socware on Facebook is prevalentover time Figure 5 shows the number of socware no-tifications sent per day by MyPageKeeper to its usersWe see a consistently large number of notifications go-ing out daily with noticeable spikes on a few days On11th July 2011 a scam that conned users to completesurveys with the pretext of fake free products went vi-ral and posts pointing to the scam appeared 4056 timeson the walls and news feeds of MyPageKeeperrsquos usersTwo other scams that promised lsquoFree Facebook shoesrsquoand conned users to fill out surveys also caused MyPage-Keeper to send out a large number of notifications on thatday On 19th Sep 2011 different variants of the lsquoFace-book Free T-Shirtrsquo scam [9] were spreading on Facebookand was spotted 2040 times by MyPageKeeper On 3rd

Oct 2011 a video scam was spreading on Facebook andMyPageKeeper observed it in 1739 posts

We next analyze the prevalence and impact of socwarefrom the perspective of individual socware links Foreach link we define its ldquoactive-timerdquo as the differencebetween the first and last times of its occurrence in ourdataset Figure 6 shows that we did not see 60 ofsocware links beyond one day Subsequent posts con-taining these links may have been filtered by Facebookonce it recognized their spammy or malicious nature orour dataset may miss those posts due to MyPageKeeperrsquoslimited view into Facebookrsquos 850 million users Fur-ther we do not attempt any clustering of links into cam-paigns here However even with these caveats 20 ofsocware links were seen in multiple posts separated byat least 10 days suggesting that a significant fractionof socware eludes Facebookrsquos detection mechanisms andlasts on Facebook for significant durations

10

0 5

10 15 20 25 30 35 40 45

Disabled

App inst

Deleted

LaaSEvent

App prof

o

f U

RL

s

Figure 7 Breakdown of socware links when crawledin Nov 2011 that originally point to web pages in thefacebookcom domain

52 Domain name characteristics20 of socware links are hosted inside FacebookIn the next section of our analysis we focus on thedomain-level characteristics of socware links First Ta-ble 8 shows the top ten URL shortening services usedin socware links observed by MyPageKeeper In allshortened URLs account for 54 of socware links in ourdataset Our design of MyPageKeeperrsquos classifier to relysolely on social context and to not resolve short URLshence makes a significant difference (as previously seenin the comparison of classification latency)

Further we find it surprising that a large fractionof socware links (46) are not shortened given thatshortening of URLs enables spammers to obfuscatethem On further investigation we find that many Face-book scams such as lsquofree iPhonersquo and lsquofree NFL jer-seyrsquo use domain names that clearly state the messageof the scam eg httpiphonefree5com and httpnfljerseyfreecom These URLs are more likely to elicithigher click-through rates compared to shortened URLsOn the other hand most of the shortened URLs were usedby malicious or spam applications (eg lsquoThe Apprsquo lsquoPro-file Stalkerrsquo) that generate shortened URLs pointing totheir applicationrsquos installation page We find that 89of shortened URLs in our dataset of socware links wereposted by Facebook applications

Next based on our crawl of the socware links in ourdataset we inspect the top two-level domains found onthe landing pages pointed to by these links First asshown in Table 9 we find that a large fraction of socware(over 20 of URLs and 26 of posts) is hosted on Face-book itself Second a sizeable fraction of socware usessites such as blogspotcom and wordpresscomthat enable the spammers to easily create a large numberof URLs without going through the hassle of registeringnew domains Further all of these domains are of goodrepute and are unlikely to be flagged by traditional web-site blacklists

53 Analysis of socware hosted in FacebookHackers use numerous channels in Facebook tospread socware Given the large fraction of socware

0

20

40

60

80

100

0 20 40 60 80 100

o

f U

RL

s

of clicks originated from Facebook

Figure 8 For most socware links shortened with bitly orgoogl a large majority of the clicks came from Facebook

hosted on Facebook itself we next analyze this subsetof socware First in early November 2011 we crawledevery socware link in our dataset that had pointed to alanding page in the facebookcom domain at the timewhen MyPageKeeper had initially classified that link assocware Figure 7 presents a breakdown of the resultsof this crawl If Facebook disables a URL it redirectsus to facebookcomhomephp Similarly if crawl-ing a URL points us to facebookcom4oh4 it im-plies that Facebook has deleted the content at that URLTherefore as seen in Figure 7 a large fraction of socwarelinks that were originally pointing to Facebook have nowbeen deactivated However we also see that a significantfraction of these linksmdashover 40mdashwere still live Fur-ther the figure shows that spammers use several differentchannels such as applications events and pages to prop-agate their scams on Facebook In the figure lsquoApp instrsquoand lsquoApp profrsquo refer to the installation and profile pagesof Facebook applications and lsquoLaaSrsquo refers to campaignsintended to increase the number of Likes on a Facebookpage (described in detail in Section 6)

In our dataset we see 257 distinct socware links short-ened with the bitly and googlURL shorteners thatpoint to landing pages in the facebookcom domainUsing the APIs [5 16] offered by these URL shorten-ing services we computed the number of clicks recordedfor these 257 links in two casesmdash1) where the Refer-rer was Facebook and 2) where the Referrer was anyother domain Figure 8 shows that Facebook is the dom-inant platform from which most of these links receivedmost of their clicks 80 of links received over 70 oftheir clicks from Facebook This seems to indicate thatmost socware hosted on Facebook is propagated solelyon Facebook and tailored for that platform

54 Comparison of socware to email spamSocware keywords exhibit little (10) overlap withspam email keywords As we saw earlier in Section 4spam keyword score is a key feature in MyPageKeeperrsquosclassifier Therefore in the final section of our analysiswe investigate the overlap in lsquospam keywordsrsquo that weobserve in socware on Facebook with those seen in an-other medium targeted by spammers specifically email

11

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

0 5

10 15 20 25 30 35 40 45

Disabled

App inst

Deleted

LaaSEvent

App prof

o

f U

RL

s

Figure 7 Breakdown of socware links when crawledin Nov 2011 that originally point to web pages in thefacebookcom domain

52 Domain name characteristics20 of socware links are hosted inside FacebookIn the next section of our analysis we focus on thedomain-level characteristics of socware links First Ta-ble 8 shows the top ten URL shortening services usedin socware links observed by MyPageKeeper In allshortened URLs account for 54 of socware links in ourdataset Our design of MyPageKeeperrsquos classifier to relysolely on social context and to not resolve short URLshence makes a significant difference (as previously seenin the comparison of classification latency)

Further we find it surprising that a large fractionof socware links (46) are not shortened given thatshortening of URLs enables spammers to obfuscatethem On further investigation we find that many Face-book scams such as lsquofree iPhonersquo and lsquofree NFL jer-seyrsquo use domain names that clearly state the messageof the scam eg httpiphonefree5com and httpnfljerseyfreecom These URLs are more likely to elicithigher click-through rates compared to shortened URLsOn the other hand most of the shortened URLs were usedby malicious or spam applications (eg lsquoThe Apprsquo lsquoPro-file Stalkerrsquo) that generate shortened URLs pointing totheir applicationrsquos installation page We find that 89of shortened URLs in our dataset of socware links wereposted by Facebook applications

Next based on our crawl of the socware links in ourdataset we inspect the top two-level domains found onthe landing pages pointed to by these links First asshown in Table 9 we find that a large fraction of socware(over 20 of URLs and 26 of posts) is hosted on Face-book itself Second a sizeable fraction of socware usessites such as blogspotcom and wordpresscomthat enable the spammers to easily create a large numberof URLs without going through the hassle of registeringnew domains Further all of these domains are of goodrepute and are unlikely to be flagged by traditional web-site blacklists

53 Analysis of socware hosted in FacebookHackers use numerous channels in Facebook tospread socware Given the large fraction of socware

0

20

40

60

80

100

0 20 40 60 80 100

o

f U

RL

s

of clicks originated from Facebook

Figure 8 For most socware links shortened with bitly orgoogl a large majority of the clicks came from Facebook

hosted on Facebook itself we next analyze this subsetof socware First in early November 2011 we crawledevery socware link in our dataset that had pointed to alanding page in the facebookcom domain at the timewhen MyPageKeeper had initially classified that link assocware Figure 7 presents a breakdown of the resultsof this crawl If Facebook disables a URL it redirectsus to facebookcomhomephp Similarly if crawl-ing a URL points us to facebookcom4oh4 it im-plies that Facebook has deleted the content at that URLTherefore as seen in Figure 7 a large fraction of socwarelinks that were originally pointing to Facebook have nowbeen deactivated However we also see that a significantfraction of these linksmdashover 40mdashwere still live Fur-ther the figure shows that spammers use several differentchannels such as applications events and pages to prop-agate their scams on Facebook In the figure lsquoApp instrsquoand lsquoApp profrsquo refer to the installation and profile pagesof Facebook applications and lsquoLaaSrsquo refers to campaignsintended to increase the number of Likes on a Facebookpage (described in detail in Section 6)

In our dataset we see 257 distinct socware links short-ened with the bitly and googlURL shorteners thatpoint to landing pages in the facebookcom domainUsing the APIs [5 16] offered by these URL shorten-ing services we computed the number of clicks recordedfor these 257 links in two casesmdash1) where the Refer-rer was Facebook and 2) where the Referrer was anyother domain Figure 8 shows that Facebook is the dom-inant platform from which most of these links receivedmost of their clicks 80 of links received over 70 oftheir clicks from Facebook This seems to indicate thatmost socware hosted on Facebook is propagated solelyon Facebook and tailored for that platform

54 Comparison of socware to email spamSocware keywords exhibit little (10) overlap withspam email keywords As we saw earlier in Section 4spam keyword score is a key feature in MyPageKeeperrsquosclassifier Therefore in the final section of our analysiswe investigate the overlap in lsquospam keywordsrsquo that weobserve in socware on Facebook with those seen in an-other medium targeted by spammers specifically email

11

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

Socware Likelihood Spam email Likelihoodword ratio word ratiofree 121 money 115lt 3 infin price 266

iphone infin free 008awesome 313 account 96

win 243 stock 97wow 908 address 52hurry 368 bank 564omg 3323 pills infin

amazing 49 viagra infindeal 19 watch 19

Table 10 Top keywords from socware posts and spam emails

0

2

4

6

8

10

12

0 1 2 3 4 5 6 7 8

o

f S

pam

Key

word

s

Log odds ratio of spam keywords

Figure 9 Overlap of keywords between email and Facebook

We investigate whether spammers use similar keywordson Facebook as they use in email spam

To perform this analysis we collected over 17000spam emails from [50] For Facebook spam we use92493 socware posts collected by MyPageKeeper Wetransform posts in either dataset to a bag of words withtheir frequency of occurrence Similar to [54] we thencompute the log odds ratio for each keyword to deter-mine its overlap in Facebook socware and spam emailHere the log odds ratio for a keyword is defined byratio = |log(p1q2p2q1)| where pi is the likelihood ofthat keyword appearing in set i and qi = 1 minus pi A valueof 0 for the log odds ratio indicates that the keyword isequally likely to appear in both datasets whereas an infi-nite ratio indicates that the keyword appears in only oneof the datasets In Figure 9 (infinite values are omitted)we see only a 10 overlap in spam keywords betweenemail and Facebook This indicates that Facebook spamsignificantly differs from traditional email spam

Further Table 10 shows the likelihood ratio (definedearlier in Section 33) for the top keywords in eitherdataset The higher the likelihood ratio of a socwarekeyword the stronger the bias of the keyword appear-ing more in Facebook socware than in email spam aninfinite ratio implies the keyword exclusively appears inFacebook socware The word lsquoomgrsquo is 332 times morelikely to be used in Facebook socware than in email spamOn the other hand words such as lsquopillsrsquo and lsquoviagrarsquo arerestricted solely to email spam

6 Like-as-a-ServiceFacebook has now become the premier online destinationon the Internet Over 900 million users half of whomvisit the site daily spend over 4 hours on the site every

Like 10 Click Like to receive free iPad

Product x

wwwfacebookcompageX

a) User visits a product pageIt lures user to click `Like

Like 11 Install the appto play and win

Product x

wwwfacebookcompageX

b) Like count increases by 1Now spammer lures user to

install the LaaS app

Click here to win

My wall

Click here to winClick here to win

wwwfacebookcomhome

c) LaaS app gets permissionto spam users wall anytime

Figure 10 A representation of how a Like-as-a-service Face-book application collects Likes for its clientrsquos page and gainsaccess to the userrsquos wall for spamming Dotted region of thepage is controlled by the spammer

0

200

400

600

800

070

2

071

6

073

0

081

3

082

7

091

0

092

4

100

8

102

2

0

10

20

30

40

o

f new

s fe

ed p

ost

s

o

f w

all

post

snewsfeedwallposts

Figure 11 Timeline of posts made by the Games LaaS Face-book application seen on usersrsquo walls and news feeds

month [10] To leverage user activity on Facebook anincreasingly large number of businesses have Facebookpages associated with their products However attractingusers to their page is a challenge for any business Oneway of doing so is to make users who visit a Facebookpage click the lsquoLikersquo button on the page A large numberof Likes has two significant implications First the num-ber of Likes associated with a page has begun to representthe reputation associated with a page eg a higher num-ber of Likes improves the pagersquos rank in Bing [3] Sec-ond a link to the product page appears in the news feedof the friends of the user who clicked Like on the pagethus enabling the link to the page to spread on Facebook

Based on our view of Facebook socware throughthe MyPageKeeper lens we see an emerging Like-as-a-Service 3 market to help businesses attract usersto their pages We identify several Facebook apps(eg lsquoGamesrsquo [15] lsquoFanOfferrsquo [13] and lsquoLatest Promo-tionsrsquo [21]) which are hired by the owners of Facebookpages to help increase the number of Likes on their pagesThese applications which offer Likes as a service pre-sumably get paid on a lsquoPay-per-Likersquo model by the own-ers of Facebook pages that make use of their services

Figure 10 shows how a Like-as-a-Service (LaaS) ap-plication typically works First a customer of the LaaSapplication integrates the application into their Facebookpage When users visit the page the LaaS application

3 Note that lsquoLike-as-a-Servicelsquo differs from lsquoLikejackinglsquo [22]where users are tricked into clicking the Like button without them re-alizing they are doing so eg by enticing the user to click on a Flashvideo within which the Like button is hidden

12

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

Page Name Application Message No of LikesRaging Bid Just got a better score on Raging Bidrsquos Bouncing Balls contest and I am now in 12297th place I am getting

closer to winning a Sony Bravia 3D HDTV Who thinks they can beat my score Click here to try URL168815

wwwWalkerToyotacom DAILY CONTEST UPDATE I am currently in 7573rd place in Walker Toyotarsquos Tetris contest There is stillplenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Click here to tryURL

136212

Chip Banks Chevrolet Buick DAILY CONTEST UPDATE I am currently in 310th place in Chip Banks Chevrolet Buickrsquos Gem Swap IIcontest There is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score thanme Click here to try URL

2190

Casey Jamerson DAILY CONTEST UPDATE I am currently in 6234th place in Casey Jamerson Musicrsquos Gem Swap II contestThere is still plenty of time to try and win a 16GB iPad2 Who thinks they can get a better score than me Clickhere to try URL

47496

Tara Gray DAILY CONTEST UPDATE I am currently in 10213th place in Tara Grayrsquos Gem Swap II contest There is stillplenty of time to try and win a Burma Ruby Ring Who thinks they can get a better score than me Click here totry URL

231035

Table 11 Five example Facebook pages integrated with the Games LaaS application to spam usersrsquo walls for propagation

82 84 86 88 90 92 94 96 98

100

100

101

102

103

104

o

f U

RL

s

of likescomments

Likes (LaaS)Comments (LaaS)

Likes (Benign)Comments (Benign)

Figure 12 of Likes and comments associated with URLsposted by the Games Facebook app

entices the user to click Like on page Typically the re-ward promised to the user in return for his Like is that theuser can play some games on the page or have a chanceof winning free products However once the user clicksLike on the page to access the promised reward the LaaSapplication then demands that the user add the applicationto his profile in order to proceed further In the processof getting the user to add the LaaS application the appli-cation requests the user to grant permission for it to poston the userrsquos wall Once the application obtains such per-missions it periodically spams the userrsquos wall with poststhat contain links to the Facebook page of the customerwho enrolled the LaaS application for its services Theseposts will appear in the news feeds of the unsuspectinguserrsquos friends who in turn may visit the Facebook pageand go through the same cycle again The LaaS appli-cation thus enables the Facebook pages of its customersto accumulate Likes and increase their reputation eventhough users are clicking Like on these pages with thepromise of false rewards rather than because they like theproducts advertised on the page

Here we analyze the activity of one such LaaSapplicationmdashGames [15] Figure 11 shows that postsmade by this application appear regularly in the wallsand news feeds of MyPageKeeperrsquos users Even with oursmall sample of roughly 12K users from Facebookrsquos totalpopulation of over 850 million users we see that 40 usershave posts made by Games on their walls which impliesthat these users have installed the application and grantedit permission to make posts on their wall at any time We

also see that the number of users who installed Gamessignificantly rose around mid-September 2011 Furtherfrom the news feeds of MyPageKeeper users we see thatGames posted links to as many as 700 Facebook pageson a single day each link points to the Facebook page ofa different customer of this LaaS application Table 11shows the posts made by Games for some of its cus-tomers the variation in text messages across these postsand the large number of Likes garnered by the Facebookpages of these customers

We next analyze the Likes and comments received by721 URLs posted by the Games app As shown in Fig-ure 12 we see that over 95 of these URLs have lessthan 100 Likes and less than 100 comments this frac-tion is significantly lesser on a dataset of randomly cho-sen 721 URLs from benign posts However over 20of the URLs posted by the Games app do receive Likesand comments thus enabling them to propagate on Face-book Real users may be unknowingly helping to spread-ing spam in these cases such users have been previouslyreferred to as creepers [52]

7 DiscussionClient-based solution An alternative to MyPage-Keeperrsquos server-side detection of socware would be toidentify socware on client machines In such an ap-proach a client-side tool can classify a post at the instantwhen the user accesses the post However we choosenot to use such an approach for multiple reasons Firsta server-side solution is more amenable to adoption itis easier to convince users to add an app to their Face-book profile than to convince them to download and in-stall an application or browser extension on their ma-chines Second users can access Facebook from a rangeof browsers and even from different device types (egmobile phones) Developing and maintaining client-side tools for all of these platforms is onerous Finallyand most importantly many of the features used by oursocware classifier (eg message similarity score) funda-mentally depend on aggregating information across users

13

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

Therefore a view of Facebook from the perspective of asingle client may be insufficient to identify socware ac-curately

Estimating false negatives While we evaluated theaccuracy of socware identified by MyPageKeeper bycross-validating with other techniques evaluating the ac-curacy of MyPageKeeperrsquos classifier in cases where it de-clares a URL safe is much harder Not only do we lackground truth but since the highly common case is thata Facebook post is benign manual verification of a ran-domly chosen subset of the classifierrsquos negative outputs isinsufficient

We therefore evaluate whether MyPageKeeperrsquos clas-sifier misses any socware by using data from user-reported samples of socware As shown in Table 3533 distinct MyPageKeeper users have submitted 679such reports and we have received 333 unique URLsacross these reports Based on manual verificationwe find that 296 of these 333 URLs indeed point tospam or malware The remaining 37 URLs point tosites like surveymonkeycom (fill out surveys) andclixsensecom (get paid to view advertisements)which though abused by spammers have legitimate usesas well We suspect that our users did come acrosssocware but reported the URL of the landing page ratherthan the URL that they originally found in a socware post

Of the 296 instances of true socware reported by usersMyPageKeeperrsquos classifier flagged all but 17 of them in-dependently of users reporting them to us This trans-lates into a false negative rate of 5 for the classi-fier However 16 of these 17 URLs had been found tomatch against one of the URL blacklists used by MyPage-Keeper Thus the false negative rate for the whole My-PageKeeper system which combines blacklists and theclassifier to detect socware is 03

Arms race with spammers Though our current tech-niques seem to suffice to accurately identify socwareon Facebook we speculate here on how spammers mayevolve socware given the knowledge of how MyPage-Keeper works One option for spammers to evade My-PageKeeper is to use different shortened URLs for asingle malicious landing URL In such cases MyPage-Keeper would consider every posted shortened URLseperately even though they are all part of the same cam-paign Thus if any of these shortened URLs does notappear on the wallsnews feeds of several users MyPage-Keeper may fail to flag it Another option for socware toevade MyPageKeeper is for spammers to slow down itsrate of propagation as we found in Section 42 MyPage-Keeper sometimes misses socware which is observedonly a few times in our dataset However slowing downa socware epidemic makes it likely that it will be flaggedby other techniques such as URL blacklists Moreoverspammers may often be unable to control how fast a

socware epidemic spreads In the case where an epidemicspreads by luring users into installing a Facebook app thespammer can control how often the app posts spam onthe userrsquos wall However in cases where users are askedto lsquoLikersquo or lsquoSharersquo a post to access a fake reward thesocware is self-propagating and its viral spread cannot becontrolled by spammers

Another option is for spammers to change the key-words that they use in socware posts thus affecting thespam keyword score used by MyPageKeeperrsquos classifierThough spammers are constrained in their choice of key-words by the need to attract users some of the keywordsmay evolve over time as popular colloquial expressions(eg lsquoOMGrsquo) change To evaluate MyPageKeeperrsquos abil-ity to cope with such change we identified the top key-words (those with high likelihood ratio compared to be-nign posts among frequently occurring keywords) dis-tinctive to user-reported socware posts We find that thespam keywords that we use in MyPageKeeperrsquos classifier(identified from manually identified samples of socware)match those computed here Though this captures dataonly across four months MyPageKeeper can similarly re-compute the set of spam keywords over time

8 Related WorkMotivated by the increasing presence of spam and mal-ware on OSNs there have been several recent related ef-forts Here we contrast our work with these prior efforts

Studies of spam on OSNs Gao et al [43] analyzedposts on the walls of 35 million Facebook users andshowed that 10 of links posted on Facebook walls arespam with a large majority pointing to phishing sitesThey also presented techniques to identify compromisedaccounts and spam campaigns In a similar study on Twit-ter Grier et al [44] showed that at least 8 of linksposted on Twitter are spam while 86 of the involvedaccounts are compromised In contrast to this studyThomas et al [55] show that the majority of suspendedaccounts in Twitter are created by spammers as opposedto compromised users All of these efforts however fo-cus on post-mortem analysis of historical OSN data andare not applicable to MyPageKeeperrsquos goal of identify-ing socware soon after it appears on a userrsquos wall or newsfeed

Detecting spam accounts Benevenuto et al [39] andYang et al [57] developed techniques to identify accountsof spammers on Twitter Others have proposed a honey-pot based approach [53 47] to detect spam accounts onOSNs Yardi et al [58] analyzed behavioral patternsamong spam accounts in Twitter Instead of focusing onaccounts created by spammers MyPageKeeper enablessocware detection on the walls and news feeds of legiti-mate Facebook users

14

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

Real-time spam detection in OSNs Thomas etal [54] developed Monarch a real-time system thatcrawls URLs submitted from services such as Twitter todetermine whether a URL directs to spam Monarch re-lies on the network and domain level properties of URLsas well as the content of the web pages obtained whenURLs are crawled Interestingly Monarchrsquos classifica-tion accuracy is shown to be independent of the socialcontext on Twitter MyPageKeeper distinguishes itselffrom Monarch in several waysmdash1) we study socware onFacebook which we see significantly differs in its char-acteristics from traditional spam messages 2) to makeMyPageKeeper efficient our socware classifier operateswithout crawling of links found in posts and 3) we findthat the use of social context based features is crucial toefficient detection of socware In another study Gao etal [42] perform online spam filtering on OSNs using in-cremental clustering Their technique however relies onhaving the whole social graph as input and so is usableonly by the OSN provider MyPageKeeper instead reliesonly on the view of the OSN as seen by MyPageKeeperrsquosusers Lee et al [48] built Warningbird a system to detectsuspicious URLs in Twitter their system however relieson following the HTTP redirection chains of URLs thusmaking their approach less efficient than MyPageKeeper

Wang et al [56] propose a unified spam detectionframework that works across all OSNs but they do nothave an implementation of such a system in practiceStein et al [52] describe Facebookrsquos Immune System(FIS) a scalable real-time adversarial learning systemdeployed in Facebook to protect users from maliciousactivities However Stein et al provide only a high-level overview about threats to the Facebook graph anddo not provide any analysis of the system Similarlyother Facebook applications [6 25 4] that defend usersagainst spam and malware are proprietary with no detailsavailable about how they work Abu-Nimeh et al [37]analyze the URLs flagged by one of these applicationsDefenseio but they do not discuss Defenseiorsquos classifica-tion techniques and their analysis is restricted to that ofthe hosting infrastructure (country and ASN) underlyingFacebook spam To the best of our knowledge we arethe first to provide classification of socware on Facebookthat relies solely on social context based features thusenabling MyPageKeeper to efficiently detect socware atscale

Social context based email spam Jagatic et al [45]discuss how email phishing attacks can be launched byusing publicly available personal information (eg birth-day) from social networks and Brown et al [40] analyzedsuch email spam seen in practice However due to revi-sions in Facebookrsquos privacy policy over the last couple ofyears only a userrsquos friends have access to such informa-tion from the userrsquos profile thus making such email spam

no longer possible Further MyPageKeeper focuses onspam propagated on Facebook rather than via email

9 ConclusionsFacebook is becoming the new epicenter of the weband we showed that hackers are adapting to this changeby designing new types of malware suited to this plat-form which we call socware In this paper we pre-sented the design and implementation of MyPageKeepera Facebook application that can accurately and efficientlyidentify socware at scale Using data from over 12KFacebook users we found that the reach of socware iswidespread and that a significant fraction of socware ishosted on Facebook itself We also showed that existingdefenses such as URL blacklists are ill-suited for identi-fying socware and that socware significantly differs fromemail spam Finally we identified a new trend in ag-gressive marketing of Facebook pages using ldquoLike-as-a-Servicerdquo applications that spam users to make moneybased on a ldquoPay-per-Likerdquo model

References[1] Anti-phishing working group httpwww

antiphishingorg[2] Application authentication flow using oauth

20 httpdevelopersfacebookcomdocsauthentication

[3] Bing gets friendlier with Facebook httpwwwtechnologyreviewcomweb37585

[4] Bitdefender Safego httpwwwfacebookcombitdefendersafego

[5] bitly API httpcodegooglecompbitly-apiwikiApiDocumentation

[6] Defensio Social Web Security httpwwwfacebookcomappsapplicationphpid=177000755670

[7] Escrow-fraud httpescrow-fraudcom[8] Experts Facebook crime is on the

rise httpwwwzdnetcomblogfacebookexperts-facebook-crime-is-on-the-rise2632

[9] Facebook birthday T-shirt scam steals secret mobileemail addresses httpbitlyKvax0t

[10] Facebook is the webrsquos ultimate timesink httpmashablecom20100216facebook-nielsen-stats

[11] Facebook Phishing Scam Costs Victims Thousandsof Dollars httpbitlyLE9EPm

[12] Facebook scam involves money transfers to thePhilippines httpbitlyLE9LdP

[13] Fan Offer httpswwwfacebookcomappsapplicationphpid=107611949261673

[14] FBML- Facebook Markup Language httpsdevelopersfacebookcomdocsreferencefbml

15

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions

[15] Games httpswwwfacebookcomappsapplicationphpid=121297667915814

[16] googl API httpcodegooglecomapisurlshortenerv1getting startedhtml

[17] Google Safe Browsing API httpcodegooglecomapissafebrowsing

[18] Hackers selling $25 toolkit to create maliciousFacebook apps httpzdnetM2WNe1

[19] How to spot a Facebook Survey ScamhttpfacecrookscomSafety-CenterScam-WatchHow-to-spot-a-Facebook-Survey-Scamhtml

[20] Joewein Fighting spam and scams on the Internethttpwwwjoeweinnet

[21] Latest Promotions httpswwwfacebookcomappsapplicationphpid=174789949246851

[22] Likejacking takes off on Facebookhttpwwwreadwritewebcomarchiveslikejacking takes off on facebookphp

[23] MalwarePatrol- Malware is everywhere httpwwwmalwarecombr

[24] MyPageKeeper httpswwwfacebookcomappsapplicationphpid=167087893342260

[25] Norton Safe Web httpwwwfacebookcomappsapplicationphpid=310877173418

[26] Phishtank httpwwwphishtankcom[27] Rihanna video scam rdquohttpwwwvirteaconcom

201111sick-i-just-hate-rihanna-after-watchinghtmlrdquo

[28] Spamcop httpwwwspamcopnet[29] Spamhaus httpwwwspamhausorgsblindex

lasso[30] Steve Jobs death scams are just the greedy exploit-

ing the gullible httpbitlyM67Zme[31] SURBL httpwwwsurblorg[32] SVM Tutorials httpsvmsorgtutorials[33] URIBL httpwwwuriblcom[34] Web-of-trust httpwwwmywotcom[35] What 20 Minutes On Facebook Looks Like http

tcrnchKytqzB[36] Facebook becomes partner with Web

of Trust (WOT) httpswwwfacebookcomnotesfacebook-securitykeeping-you-safe-from-scams-and-spam10150174826745766 May 2011

[37] S Abu-Nimeh T M Chen and O Alzubi Mali-cious and spam posts in online social networks InIEEE Computer Society 2011

[38] F becomes partner with WebSense httpwwwthetechheraldcomarticlephp2011397675Facebook-implements-malicious-link-scanning-serviceOct 2011

[39] F Benevenuto G Magno T Rodrigues andV Almeida Detecting spammers on Twitter InCEAS 2010

[40] G Brown T Howe M Ihbe A Prakash andK Borders Social networks and context-awarespam In ACM CSCW 2008

[41] C-C Chang and C-J Lin LIBSVM A library forsupport vector machines ACM Transactions on In-telligent Systems and Technology 2 2011

[42] H Gao Y Chen K Lee D Palsetia and A Choud-hary Towards online spam filtering in social net-works 2012

[43] H Gao J Hu C Wilson Z Li Y Chen and B YZhao Detecting and characterizing social spamcampaigns In IMC 2010

[44] C Grier K Thomas V Paxson and M Zhangspam The underground on 140 characters or lessIn CCS 2010

[45] T N Jagatic N A Johnson M Jakobsson andF Menczer Social phishing Commun ACM 2007

[46] A Le A Markopoulou and M FaloutsosPhishdef Url names say it all In Infocom 2010

[47] K Lee J Caverlee and S Webb Uncovering socialspammers social honeypots + machine learning InSIGIR 2010

[48] S Lee and J Kim Warningbird Detecting suspi-cious urls in twitter stream 2012

[49] J Ma L K Saul S Savage and G M VoelkerBeyond blacklists learning to detect malicious websites from suspicious urls In KDD 2009

[50] V Metsis I Androutsopoulos and G PaliourasSpam filtering with naive bayes ndash which naivebayes In CEAS 2006

[51] Z Qian Z M Mao Y Xie and F Yu On network-level clusters for spam detection In NDSS 2010

[52] T Stein E Chen and K Mangla Facebook im-mune system In SNS 2011

[53] G Stringhini C Kruegel and G Vigna Detectingspammers on social networks In ACSAC 2010

[54] K Thomas C Grier J Ma V Paxson and D SongDesign and Evaluation of a Real-Time URL SpamFiltering Service In Proceedings of the IEEE Sym-posium on Security and Privacy 2011

[55] K Thomas C Grier V Paxson and D Song Sus-pended accounts in retrospect An analysis of twit-ter spam In IMC 2011

[56] D Wang D Irani and C Pu A social-spam detec-tion framework In CEAS 2011

[57] C Yang R Harkreader and G Gu Die free orlive hard empirical evaluation and new design forfighting evolving twitter spammers In RAID 2011

[58] S Yardi D Romero G Schoenebeck et al De-tecting spam in a twitter network First Monday2009

16

  • Introduction
  • Socware on Facebook
    • The Facebook terminology
    • Socware
      • MyPageKeeper Architecture
        • Goals
        • MyPageKeeper components
        • Identification of socware
        • Implementing MyPageKeeper
          • Evaluation
            • Accuracy
            • Comparison with blacklists
            • Efficiency
              • Analysis of Socware
                • Prevalence of socware
                • Domain name characteristics
                • Analysis of socware hosted in Facebook
                • Comparison of socware to email spam
                  • Like-as-a-Service
                  • Discussion
                  • Related Work
                  • Conclusions