Andrew G. West, Jian Chang, Krishna Venkatasubramanian, Oleg Sokolsky, and Insup Lee
CEAS `11 – September 1, 2011
Link Spamming Wikipedia for Profit
Overview/Outline
• How do wikis/Wikipedia prevent link spam?
• How common is wiki/Wikipedia link spam?
• How “successful” are the attack vectors?
• Might there be more effective ones? (yes)
• How would one defend against them?
Defining Link Spam
• Any violation of external link policy [2]
  – Commercial
  – Non-notable sources: fan pages, blogs, etc.
• Two dimensions
  – Destination (URL)
  – Presentation
• HTML nofollow
Link Spam Example
Motivations
• Spam not uncommon in collaborative/UGC apps (surveyed in [9,12])
• Wikipedia/wikis are unique:
  – Edit-anywhere (no append-only semantics)
  – Global editing (not network limited)
  – Community-driven mitigation
  – Extremely high traffic (#7 on Alexa)
• Potential for traffic/profit (e.g., Amazon [14])
STATUS QUO OF DEFENSE MECHANISMS
Single-link Mitigation
Assume a “clean” account adds a “new” spam link:
Aggregate Mitigation
• Problematic URLs: URL blacklists [3]
  – Manually maintained; local + global versions; ≈17k entries combined
• Malicious accounts: warning system [10]
  – 4 warnings without consequence; the 5th blocks the account
• Malicious collectives: either Sybil “sock-puppets” or actual collectives
  – Manual signature detection or IP correlation
• Unauthorized bots: bots can be very fast
  – Rate-limits; CAPTCHAs [15]; special software
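As a toy illustration (not Wikipedia's actual implementation), the warning-system escalation above reduces to a per-account counter:

```python
# Sketch of the escalating warning policy: four warnings carry no
# technical consequence; the fifth incident blocks the account.
# All names here are illustrative.

class WarningTracker:
    MAX_WARNINGS = 4  # warnings issued before a block

    def __init__(self):
        self.warnings = {}   # account -> warning count
        self.blocked = set()

    def report_incident(self, account):
        """Record one spam incident; return the action taken."""
        if account in self.blocked:
            return "already-blocked"
        count = self.warnings.get(account, 0)
        if count < self.MAX_WARNINGS:
            self.warnings[account] = count + 1
            return f"warning-{count + 1}"
        self.blocked.add(account)
        return "blocked"

tracker = WarningTracker()
actions = [tracker.report_incident("spammer") for _ in range(5)]
# actions == ['warning-1', 'warning-2', 'warning-3', 'warning-4', 'blocked']
```

Note that the first four incidents cost the attacker nothing, which is exactly the latency the attack model later exploits.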
TAKEAWAY:
* Only humans can catch “new” instances
* Aggregate mechanisms must wait for atomic instances to compound before they can take effect
HUMAN LATENCY!
STATUS QUO OF WIKIPEDIA SPAMMING
Corpus Creation
• “Spam” edits are those that:
  1. Added exactly one external link
  2. Made no changes outside the context of that link
  3. Were “rolled back” (expedited admin. undo)
• Edits meeting (1 && 2 && !3) = “ham”
• Edits meeting (3) = “damaging”
Two months in mid-2010: 7.4 mil. edits

SPAM: ≈4,700   |   HAM: ≈182,000   |   DAMAGE: ≈204,000
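The corpus-labeling rules above can be sketched as a small function (field names are illustrative, not the study's actual pipeline):

```python
# Labeling per the corpus rules: "spam" if the edit added exactly one
# external link (1), changed nothing else (2), and was rolled back (3);
# "ham" if (1) and (2) hold but it was not rolled back; any other
# rolled-back edit is "damaging".

def label_edit(links_added, other_changes, rolled_back):
    one_link_only = (links_added == 1) and not other_changes
    if one_link_only and rolled_back:
        return "spam"
    if one_link_only and not rolled_back:
        return "ham"
    if rolled_back:
        return "damaging"
    return "unlabeled"

print(label_edit(1, False, True))   # spam
print(label_edit(1, False, False))  # ham
print(label_edit(0, True, True))    # damaging
```

The key design choice is using rollback by a privileged user as ground truth, avoiding manual annotation of 7.4 million edits.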
Corpus Example (1)
Because the link was the ONLY change made, the privileged user’s decision to roll back that edit speaks DIRECTLY to the link’s inappropriateness.
Corpus Example (2)
Spam Genres
TAKEAWAY: Spam is:
• Categorically diverse
• “Subtlety”: info-adjacent services
• Not monetarily driven?
Spam by ODP/DMOZ category
Spam Placement
TAKEAWAY:
• Conventions followed
• Subtlety for persistence?
Bad Domains + Blacklist
TAKEAWAYS:
• Wiki spammers ≠ email spammers
[Venn diagram: the sets of email-spam URLs and wiki-spam URLs have an empty intersection]
• Domain statistics don’t suggest max. utility
  – Only 2 of the 25 worst domains were blacklisted
  – Only 14 domains appear 10+ times in {SPAM}
Spam Perpetrators
• 57% of spam added by non-registered users
  – Yet we will show registered accounts are beneficial
• Worst users map onto worst domains
Geo-locating spammers
• Dedicated spam accounts; most blocked
Spam Life/Impact
• Spam lifespan
  – 19 minutes at median
  – 85 secs. for damage
  – Reason for the difference?
• Spam page views
  – Proxy for “link views”; metric of choice
  – 6.05 views per spam link
Broadening Search
• Maybe our corpus just missed something?
  – Archives show some abuse (but non-automated)
  – Deleted revisions; media coverage
• SUMMARY: Status-quo strategies unsuccessful
  – ≈6 views/link not likely to be profitable
  – Patrollers un-fooled by subtle strategies which seem to aim for “link persistence”
  – Cause or effect of unsophisticated strategies?
A NOVEL ATTACK MODEL (inspired by [15])
Attack Summary
MODEL: Abandon deception; aggressively exploit the latency of the human detection process.

Attack characterized by 4 vectors:
1. Target high-traffic pages
2. Autonomous attainment of privileged accounts; mechanized operation thereof
3. Prominent link placement/style
4. Distributed attack
Popular Pages (1)
• Imagine 85 seconds on these pages!
• Why not just protect these somehow?
  – Next: account-level vulnerabilities
Popular Pages (2)
Privileged Accounts
• Becoming autoconfirmed
  – Outsource the CAPTCHA solve [15]
  – Requires 10 good edits (or warnings/block)
  – Non-vetted namespaces; helpful bots; thesaurus attacks
• Conduct campaigns via the API [1] at high speed
  – “Anti-bot” software found ineffective
Prominent Placement
<p style="font-size:5em;font-weight:bolder">[http://www.example.com Example link]</p>
Distributed Attack
Two notions of “distributed”:
1. Need IP-agility to avoid IP (range) blocks
  – What spammer doesn’t?
  – Use open proxies, an existing botnet, etc.
2. There are many sites one can target
  – Wiki language editions; WMF sister-sites
  – Universal API [1] into MediaWiki installs
MODEL EFFECTIVENESS & DEFENSE STRATEGIES
User Responses
• Administrative response
  – Expected flow to campaign termination
  – Very conservative example:
    • 1 min. account survival = 70 links placed
    • Top 70 articles @ 1 min. each = 2,100 active views
• Reader response
  – Sources of link exposure:
    • Active views: link in the default version
    • Inactive views: version histories and watchlisters
    • Content scrapers and mashup apps
  – Click-through desensitization (email spam? [13])
Economics
• Cost of campaigns (about $1 marginal)
  – Affiliate programs; 50% commissions [13]
  – CAPTCHA per account; $1 per thousand [15]
  – Domain names; $1–$2 each
  – Minimal labor costs (< 100 LOC)
• Expected return on investment; extrapolate from “male enhancement pharmacy” study [13]
  – 2,100 exposures → 20 click-throughs → $5.20 gross
  – Affiliate fees: $5.20 → $2.60 net >> $1 marginal
• Why not seen live? Naivety? Scale?
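Those figures can be checked as a worked calculation (all numbers taken from the slide; the click-through extrapolation follows [13]):

```python
# Back-of-envelope campaign economics, using the slide's own numbers.

links_placed = 70            # ~1 minute of API edits before a block
active_views = 2100          # top-70 articles exposed for ~1 minute each
click_throughs = 20          # extrapolated from the pharmacy study [13]
gross = 5.20                 # gross revenue from those click-throughs
affiliate_share = 0.50       # spammer keeps 50% after affiliate fees
marginal_cost = 1.00         # CAPTCHA solves, domains, etc., per campaign

net = gross * affiliate_share
print(f"net ${net:.2f} vs. marginal cost ${marginal_cost:.2f}")
# net $2.60 vs. marginal cost $1.00 -> the campaign clears its cost
```

Even under these deliberately conservative assumptions the campaign is profitable, which is the paper's argument for taking the attack model seriously.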
Defense Strategies (1)
• Ethical issues; WMF notification
• Focus on technical defenses (vs. sociological aspects)
1. Require explicit approval
  – Prevent edits from going live until vetted
  – Controversial “Flagged Revisions” proposal
2. Privilege configuration
  – Edit count is a poor metric (see [8])
  – No human can do 70 edits/min. – maybe 5 edits/min.?
  – Tool-expedited users should have separate status
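The edit-rate restriction suggested in point 2 could be enforced with a sliding-window limiter; a minimal sketch (purely illustrative, not MediaWiki's code):

```python
# Per-account edit rate limit (e.g., ~5 edits/min for ordinary
# accounts) using a sliding one-minute window of timestamps.

from collections import deque

class EditRateLimiter:
    def __init__(self, max_edits=5, window_secs=60):
        self.max_edits = max_edits
        self.window_secs = window_secs
        self.history = {}  # account -> deque of edit timestamps

    def allow_edit(self, account, now):
        """Return True iff the account may edit at time `now` (seconds)."""
        q = self.history.setdefault(account, deque())
        while q and now - q[0] >= self.window_secs:
            q.popleft()                 # drop edits outside the window
        if len(q) >= self.max_edits:
            return False                # over the per-minute budget
        q.append(now)
        return True

limiter = EditRateLimiter()
results = [limiter.allow_edit("bot", t) for t in range(10)]
# A 70-edit/min bot is throttled after its first 5 edits.
```

Under such a cap, a one-minute account lifetime yields at most 5 links instead of 70, cutting the attack's exposure arithmetic by an order of magnitude.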
Defense Strategies (2)
3. Autonomous signature-driven detection [19]
  – Human latency gone (dwindling workforce [11])
  – Machine-learning classifier over:
    • Wikipedia metadata [5] (URL-addition rates, editor permissions)
    • Landing-site analysis [7] (commercial intent, SEO)
    • Third-party data (Alexa web crawler, Google Safe Browsing)
  – Implemented and operational on English Wikipedia
    • Offline analysis: 66% status-quo spam catch rate at a 0.5% FP rate
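A minimal sketch of such a signature-driven scorer, with invented feature names and hand-set weights standing in for what the deployed classifier learns from labeled rollback data:

```python
# Toy linear (logistic) scorer over the feature classes named above.
# Feature names and weights are assumptions for illustration only.

import math

WEIGHTS = {
    "editor_is_anonymous": 1.2,      # metadata: editor permissions
    "url_addition_rate": 2.0,        # metadata: URLs added per edit
    "landing_site_commercial": 1.5,  # landing-site commercial intent
    "domain_is_obscure": 0.8,        # third-party crawler data
}
BIAS = -3.0

def spam_score(features):
    """Logistic score in [0, 1]; higher = more likely link spam."""
    z = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

benign = spam_score({"editor_is_anonymous": 0})
suspect = spam_score({"editor_is_anonymous": 1, "url_addition_rate": 1.0,
                      "landing_site_commercial": 1.0,
                      "domain_is_obscure": 1.0})
print(f"benign={benign:.3f}  suspect={suspect:.3f}")
```

The point is architectural: because the score is computed the moment an edit arrives, detection no longer waits on the human latency the attack exploits.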
[Architecture diagram: edits stream from Wikipedia (IRC #enwiki feed and the wiki API) into an anti-spam scoring algorithm. Bot logic: if the score exceeds a threshold, REVERT autonomously; otherwise the scored edit enters a maintained edit queue. STiki clients fetch, display, and let humans classify queued edits, prioritized from “likely vandalism” down to “likely innocent”.]
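The bot-logic/edit-queue split in the pipeline above reduces to a threshold test plus a priority queue; a sketch with illustrative scores and threshold:

```python
# Confident cases are reverted autonomously; the rest are queued,
# most suspicious first, for human review in the STiki client.

import heapq

THRESHOLD = 0.95  # illustrative; the real cutoff is tuned for low FP-rate

def route_edit(edit_id, score, queue):
    """Auto-revert high-confidence spam; enqueue the rest for review."""
    if score > THRESHOLD:
        return f"REVERT {edit_id}"
    heapq.heappush(queue, (-score, edit_id))  # negate: max-score first
    return f"QUEUED {edit_id}"

queue = []
r1 = route_edit("e1", 0.99, queue)   # auto-reverted
r2 = route_edit("e2", 0.60, queue)   # queued for review
r3 = route_edit("e3", 0.80, queue)   # queued for review
_, next_edit = heapq.heappop(queue)  # reviewer sees "e3" first
print(r1, r2, r3, next_edit)
```

Prioritizing the queue by score means human effort is spent where the classifier is least certain but most suspicious, rather than on a chronological backlog.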
References (1)
[01] MediaWiki API. http://en.wikipedia.org/w/api.php
[02] Wikipedia: External links. http://en.wikipedia.org/wiki/WP:EL
[03] Wikipedia spam blacklists. http://en.wikipedia.org/wiki/WP:BLACKLIST
[04] WikiProject Spam. http://en.wikipedia.org/wiki/WP:WPSPAM
[05] B. Adler, et al. Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In CICLing 2011.
[06] J. Antin and C. Cheshire. Readers are not free-riders: Reading as a form of participation on Wikipedia. In CSCW 2010.
[07] H. Dai, et al. Detecting online commercial intention (OCI). In WWW 2006.
[08] P. K.-F. Fong and R. P. Biuk-Aghai. What did they do? Deriving high-level edit histories in wikis. In WikiSym 2010.
[09] H. Gao, et al. Detecting and characterizing social spam campaigns. In CCS 2010.
[10] R. S. Geiger and D. Ribes. The work of sustaining order in Wikipedia: The banning of a vandal. In CSCW 2010.
References (2)
[11] E. Goldman. Wikipedia’s labor squeeze and its consequences. Journal of Telecomm. and High Tech. Law, 8, 2009.
[12] P. Heymann, et al. Fighting spam on social web sites: A survey of approaches and future challenges. IEEE Internet Computing, 11(6):36–45, 2007.
[13] C. Kanich, et al. Spamalytics: An empirical analysis of spam marketing conversion. In CCS 2008.
[14] C. McCarthy. Amazon adds Wikipedia to book-shopping. http://news.cnet.com/8301-13577_3-20024297-36.html, 2010.
[15] M. Motoyama, et al. Re: CAPTCHAs - Understanding CAPTCHA-solving services in an economic context. In USENIX Security 2010.
[16] R. Priedhorsky, et al. Creating, destroying, and restoring value in Wikipedia. In GROUP 2007, the ACM Conference on Supporting Group Work.
[17] Y. Shin, et al. The nuts and bolts of a forum spam automator. In LEET 2011.
[18] B. E. Ur and V. Ganapathy. Evaluating attack amplification in online social networks. In W2SP 2009, Web 2.0 Security and Privacy.
[19] A. G. West, et al. Autonomous link spam detection in purely collaborative environments. In WikiSym 2011.