Andrew G. West, Jian Chang, Krishna Venkatasubramanian, Oleg Sokolsky, and Insup Lee
CEAS `11 – September 1, 2011
Link Spamming Wikipedia for Profit
Overview/Outline
• How do wikis/Wikipedia prevent link spam?
• How common is wiki/Wikipedia link spam?
• How “successful” are the attack vectors?
• Might there be more effective ones? (yes)
• How would one defend against them?
Defining Link Spam
• Any violation of external link policy [2]
  – Commercial
  – Non-notable sources: fan pages, blogs, etc.
• Two dimensions
  – Destination (URL)
  – Presentation
• HTML nofollow
Link Spam Example
Motivations
• Spam not uncommon in collaborative/UGC apps (surveyed in [9,12])
• Wikipedia/wikis are unique:
  – Edit-anywhere (no append-only semantics)
  – Global editing (not network limited)
  – Community-driven mitigation
  – Extremely high traffic (#7 on Alexa)
• Potential for traffic/profit (e.g., Amazon [14])
STATUS QUO OF DEFENSE MECHANISMS
Single-link Mitigation
Assume a “clean” account adds a “new” spam link:
Aggregate Mitigation
• Problematic URLs: URL blacklists [3]
  – Manually maintained; local + global versions; ≈17k entries combined
• Malicious accounts: warning system [10]
  – 4 warnings without consequence; the 5th blocks the account
• Malicious collectives: either Sybil “sock-puppets” or actual collectives
  – Manual signature detection or IP correlation
• Unauthorized bots: bots can be very fast
  – Rate-limits; CAPTCHAs [15]; special software
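As a toy illustration (not Wikipedia's actual implementation), the warning-system escalation above reduces to a per-account counter:

```python
# Sketch of the escalating warning policy: four warnings carry no
# technical consequence; the fifth incident blocks the account.
# All names here are illustrative.

class WarningTracker:
    MAX_WARNINGS = 4  # warnings issued before a block

    def __init__(self):
        self.warnings = {}   # account -> warning count
        self.blocked = set()

    def report_incident(self, account):
        """Record one spam incident; return the action taken."""
        if account in self.blocked:
            return "already-blocked"
        count = self.warnings.get(account, 0)
        if count < self.MAX_WARNINGS:
            self.warnings[account] = count + 1
            return f"warning-{count + 1}"
        self.blocked.add(account)
        return "blocked"

tracker = WarningTracker()
actions = [tracker.report_incident("spammer") for _ in range(5)]
# actions == ['warning-1', 'warning-2', 'warning-3', 'warning-4', 'blocked']
```

Note that the first four incidents cost the attacker nothing, which is exactly the latency the attack model later exploits.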
TAKEAWAY:
* Only humans can catch “new” instances
* Aggregate mechanisms must wait for atomic instances to compound before they can take effect
HUMAN LATENCY!
STATUS QUO OF WIKIPEDIA SPAMMING
Corpus Creation
• “Spam” edits are those that:
  1. Added exactly one external link
  2. Made no changes outside the context of that link
  3. Were “rolled back” (expedited admin. undo)
• Edits meeting (1 && 2 && !3) = “ham”
• Edits meeting (3) = “damaging”
Two months in mid-2010: 7.4 mil. edits

SPAM: ≈4,700   |   HAM: ≈182,000   |   DAMAGE: ≈204,000
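The corpus-labeling rules above can be sketched as a small function (field names are illustrative, not the study's actual pipeline):

```python
# Labeling per the corpus rules: "spam" if the edit added exactly one
# external link (1), changed nothing else (2), and was rolled back (3);
# "ham" if (1) and (2) hold but it was not rolled back; any other
# rolled-back edit is "damaging".

def label_edit(links_added, other_changes, rolled_back):
    one_link_only = (links_added == 1) and not other_changes
    if one_link_only and rolled_back:
        return "spam"
    if one_link_only and not rolled_back:
        return "ham"
    if rolled_back:
        return "damaging"
    return "unlabeled"

print(label_edit(1, False, True))   # spam
print(label_edit(1, False, False))  # ham
print(label_edit(0, True, True))    # damaging
```

The key design choice is using rollback by a privileged user as ground truth, avoiding manual annotation of 7.4 million edits.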
Corpus Example (1)
Because the link was the ONLY change made, the privileged user’s decision to roll back that edit speaks DIRECTLY to the link’s inappropriateness.
Corpus Example (2)
Spam Genres
TAKEAWAY: Spam is:
• Categorically diverse
• “Subtlety”: info-adjacent services
• Not monetarily driven?
Spam by ODP/DMOZ category
Spam Placement
TAKEAWAY:
• Conventions followed
• Subtlety for persistence?
Bad Domains + Blacklist
TAKEAWAYS:
• Wiki spammers ≠ email spammers
[Venn diagram: the sets of email-spam URLs and wiki-spam URLs have an empty intersection]
• Domain statistics don’t suggest max. utility
  – Only 2 of the 25 worst domains were blacklisted
  – Only 14 domains appear 10+ times in {SPAM}
Spam Perpetrators
• 57% of spam added by non-registered users
  – Yet we will show registered accounts are beneficial
• Worst users map onto worst domains
Geo-locating spammers
• Dedicated spam accounts; most blocked
Spam Life/Impact
• Spam lifespan
  – 19 minutes at median
  – 85 secs. for damage
  – Reason for the difference?
• Spam page views
  – Proxy for “link views”; metric of choice
  – 6.05 views per spam link
Broadening Search
• Maybe our corpus just missed something?
  – Archives show some abuse (but non-automated)
  – Deleted revisions; media coverage
• SUMMARY: Status-quo strategies unsuccessful
  – ≈6 views/link not likely to be profitable
  – Patrollers un-fooled by subtle strategies which seem to aim for “link persistence”
  – Cause or effect of unsophisticated strategies?
A NOVEL ATTACK MODEL (inspired by [15])
Attack Summary
MODEL: Abandon deception; aggressively exploit the latency of the human detection process.

Attack characterized by 4 vectors:
1. Target high-traffic pages
2. Autonomous attainment of privileged accounts; mechanized operation thereof
3. Prominent link placement/style
4. Distributed attack
Popular Pages (1)
• Imagine 85 seconds on these pages!
• Why not just protect these somehow?
  – Next: account-level vulnerabilities
Popular Pages (2)
Privileged Accounts
• Becoming autoconfirmed
  – Outsource the CAPTCHA solve [15]
  – Requires 10 good edits (or warnings/block)
  – Non-vetted namespaces; helpful bots; thesaurus attacks
• Conduct campaigns via the API [1] at high speed
  – “Anti-bot” software found ineffective
Prominent Placement
<p style="font-size:5em;font-weight:bolder">[http://www.example.com Example link]</p>
Distributed Attack
Two notions of “distributed”:
1. Need IP-agility to avoid IP (range) blocks
  – What spammer doesn’t?
  – Use open proxies, an existing botnet, etc.
2. There are many sites one can target
  – Wiki language editions; WMF sister-sites
  – Universal API [1] into MediaWiki installs
MODEL EFFECTIVENESS & DEFENSE STRATEGIES
User Responses
• Administrative response
  – Expected flow to campaign termination
  – Very conservative example:
    • 1 min. account survival = 70 links placed
    • Top 70 articles @ 1 min. each = 2,100 active views
• Reader response
  – Sources of link exposure:
    • Active views: link in the default version
    • Inactive views: version histories and watchlisters
    • Content scrapers and mashup apps
  – Click-through desensitization (email spam? [13])
Economics
• Cost of campaigns (about $1 marginal)
  – Affiliate programs; 50% commissions [13]
  – CAPTCHA per account; $1 per thousand [15]
  – Domain names; $1–$2 each
  – Minimal labor costs (< 100 LOC)
• Expected return on investment; extrapolate from “male enhancement pharmacy” study [13]
  – 2,100 exposures → 20 click-throughs → $5.20 gross
  – Affiliate fees: $5.20 → $2.60 net >> $1 marginal
• Why not seen live? Naivety? Scale?
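Those figures can be checked as a worked calculation (all numbers taken from the slide; the click-through extrapolation follows [13]):

```python
# Back-of-envelope campaign economics, using the slide's own numbers.

links_placed = 70            # ~1 minute of API edits before a block
active_views = 2100          # top-70 articles exposed for ~1 minute each
click_throughs = 20          # extrapolated from the pharmacy study [13]
gross = 5.20                 # gross revenue from those click-throughs
affiliate_share = 0.50       # spammer keeps 50% after affiliate fees
marginal_cost = 1.00         # CAPTCHA solves, domains, etc., per campaign

net = gross * affiliate_share
print(f"net ${net:.2f} vs. marginal cost ${marginal_cost:.2f}")
# net $2.60 vs. marginal cost $1.00 -> the campaign clears its cost
```

Even under these deliberately conservative assumptions the campaign is profitable, which is the paper's argument for taking the attack model seriously.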
Defense Strategies (1)
• Ethical issues; WMF notification
• Focus on technical defenses (vs. sociological aspects)
1. Require explicit approval
  – Prevent edits from going live until vetted
  – Controversial “Flagged Revisions” proposal
2. Privilege configuration
  – Edit count is a poor metric (see [8])
  – No human can do 70 edits/min. – maybe 5 edits/min.?
  – Tool-expedited users should have separate status
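The edit-rate restriction suggested in point 2 could be enforced with a sliding-window limiter; a minimal sketch (purely illustrative, not MediaWiki's code):

```python
# Per-account edit rate limit (e.g., ~5 edits/min for ordinary
# accounts) using a sliding one-minute window of timestamps.

from collections import deque

class EditRateLimiter:
    def __init__(self, max_edits=5, window_secs=60):
        self.max_edits = max_edits
        self.window_secs = window_secs
        self.history = {}  # account -> deque of edit timestamps

    def allow_edit(self, account, now):
        """Return True iff the account may edit at time `now` (seconds)."""
        q = self.history.setdefault(account, deque())
        while q and now - q[0] >= self.window_secs:
            q.popleft()                 # drop edits outside the window
        if len(q) >= self.max_edits:
            return False                # over the per-minute budget
        q.append(now)
        return True

limiter = EditRateLimiter()
results = [limiter.allow_edit("bot", t) for t in range(10)]
# A 70-edit/min bot is throttled after its first 5 edits.
```

Under such a cap, a one-minute account lifetime yields at most 5 links instead of 70, cutting the attack's exposure arithmetic by an order of magnitude.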
Defense Strategies (2)
3. Autonomous signature-driven detection [19]
  – Human latency gone (dwindling workforce [11])
  – Machine-learning classifier over:
    • Wikipedia metadata [5] (URL-addition rates, editor permissions)
    • Landing-site analysis [7] (commercial intent, SEO)
    • Third-party data (Alexa web crawler, Google Safe Browsing)
  – Implemented and operational on English Wikipedia
    • Offline analysis: 66% status-quo spam catch rate at a 0.5% FP rate
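A minimal sketch of such a signature-driven scorer, with invented feature names and hand-set weights standing in for what the deployed classifier learns from labeled rollback data:

```python
# Toy linear (logistic) scorer over the feature classes named above.
# Feature names and weights are assumptions for illustration only.

import math

WEIGHTS = {
    "editor_is_anonymous": 1.2,      # metadata: editor permissions
    "url_addition_rate": 2.0,        # metadata: URLs added per edit
    "landing_site_commercial": 1.5,  # landing-site commercial intent
    "domain_is_obscure": 0.8,        # third-party crawler data
}
BIAS = -3.0

def spam_score(features):
    """Logistic score in [0, 1]; higher = more likely link spam."""
    z = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

benign = spam_score({"editor_is_anonymous": 0})
suspect = spam_score({"editor_is_anonymous": 1, "url_addition_rate": 1.0,
                      "landing_site_commercial": 1.0,
                      "domain_is_obscure": 1.0})
print(f"benign={benign:.3f}  suspect={suspect:.3f}")
```

The point is architectural: because the score is computed the moment an edit arrives, detection no longer waits on the human latency the attack exploits.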
[Architecture diagram: edits stream from Wikipedia (IRC #enwiki feed and the wiki API) into an anti-spam scoring algorithm. Bot logic: if the score exceeds a threshold, REVERT autonomously; otherwise the scored edit enters a maintained edit queue. STiki clients fetch, display, and let humans classify queued edits, prioritized from “likely vandalism” down to “likely innocent”.]
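The bot-logic/edit-queue split in the pipeline above reduces to a threshold test plus a priority queue; a sketch with illustrative scores and threshold:

```python
# Confident cases are reverted autonomously; the rest are queued,
# most suspicious first, for human review in the STiki client.

import heapq

THRESHOLD = 0.95  # illustrative; the real cutoff is tuned for low FP-rate

def route_edit(edit_id, score, queue):
    """Auto-revert high-confidence spam; enqueue the rest for review."""
    if score > THRESHOLD:
        return f"REVERT {edit_id}"
    heapq.heappush(queue, (-score, edit_id))  # negate: max-score first
    return f"QUEUED {edit_id}"

queue = []
r1 = route_edit("e1", 0.99, queue)   # auto-reverted
r2 = route_edit("e2", 0.60, queue)   # queued for review
r3 = route_edit("e3", 0.80, queue)   # queued for review
_, next_edit = heapq.heappop(queue)  # reviewer sees "e3" first
print(r1, r2, r3, next_edit)
```

Prioritizing the queue by score means human effort is spent where the classifier is least certain but most suspicious, rather than on a chronological backlog.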
References (1)
[01] MediaWiki API. http://en.wikipedia.org/w/api.php
[02] Wikipedia: External links. http://en.wikipedia.org/wiki/WP:EL
[03] Wikipedia spam blacklists. http://en.wikipedia.org/wiki/WP:BLACKLIST
[04] WikiProject Spam. http://en.wikipedia.org/wiki/WP:WPSPAM
[05] B. Adler, et al. Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In CICLing 2011.
[06] J. Antin and C. Cheshire. Readers are not free-riders: Reading as a form of participation on Wikipedia. In CSCW 2010.
[07] H. Dai, et al. Detecting online commercial intention (OCI). In WWW 2006.
[08] P. K.-F. Fong and R. P. Biuk-Aghai. What did they do? Deriving high-level edit histories in wikis. In WikiSym 2010.
[09] H. Gao, et al. Detecting and characterizing social spam campaigns. In CCS 2010.
[10] R. S. Geiger and D. Ribes. The work of sustaining order in Wikipedia: The banning of a vandal. In CSCW 2010.
References (2)
[11] E. Goldman. Wikipedia’s labor squeeze and its consequences. Journal of Telecomm. and High Tech. Law, 8, 2009.
[12] P. Heymann, et al. Fighting spam on social web sites: A survey of approaches and future challenges. IEEE Internet Computing, 11(6):36–45, 2007.
[13] C. Kanich, et al. Spamalytics: An empirical analysis of spam marketing conversion. In CCS 2008.
[14] C. McCarthy. Amazon adds Wikipedia to book-shopping. http://news.cnet.com/8301-13577_3-20024297-36.html, 2010.
[15] M. Motoyama, et al. Re: CAPTCHAs - Understanding CAPTCHA-solving services in an economic context. In USENIX Security 2010.
[16] R. Priedhorsky, et al. Creating, destroying, and restoring value in Wikipedia. In GROUP 2007, the ACM Conference on Supporting Group Work.
[17] Y. Shin, et al. The nuts and bolts of a forum spam automator. In LEET 2011.
[18] B. E. Ur and V. Ganapathy. Evaluating attack amplification in online social networks. In W2SP 2009, Web 2.0 Security and Privacy.
[19] A. G. West, et al. Autonomous link spam detection in purely collaborative environments. In WikiSym 2011.