Andrew G. West, Jian Chang, Krishna Venkatasubramanian, Oleg Sokolsky, and Insup Lee CEAS `11 – September 1, 2011 Link Spamming Wikipedia for Profit



Overview/Outline

• How do wikis/Wikipedia prevent link spam?

• How common is wiki/Wikipedia link spam?

• How “successful” are the attack vectors?

• Might there be more effective ones? (yes)

• How would one defend against them?


Defining Link Spam

• Any violation of external link policy [2]
  – Commercial
  – Non-notable sources: fan pages, blogs, etc.
• Two dimensions
  – Destination (URL)
  – Presentation
• HTML nofollow

[Figure: Link spam example]


Motivations

Spam is not uncommon in collaborative/UGC apps (surveyed in [9,12])

• Wikipedia/wikis are unique:
  – Edit-anywhere (no append-only semantics)
  – Global editing (not network limited)
  – Community-driven mitigation
  – Extremely high traffic (#7 in Alexa)
• Potential for traffic/profit (e.g., Amazon [14])


STATUS QUO OF DEFENSE MECHANISMS


Single-link Mitigation

Assume a “clean” account adds a “new” spam link:


Aggregate Mitigation

Problematic URLs
• URL blacklists [3]
  – Manually maintained
  – Local + global versions
  – ≈17k entries in combination

Malicious Accounts
• Warning system [10]
  – 4 warnings without consequence
  – 5th blocks account

Malicious Collectives
• Either Sybil “sock-puppets” or actual collectives
• Manual signature detection or IP correlation

Unauthorized Bots
• Bots can be very fast
  1. Rate-limits
  2. CAPTCHAs [15]
  3. Special software
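Two of the aggregate mechanisms on this slide, the URL blacklist and the warning system, can be sketched in a few lines. This is only an illustration under stated assumptions: the blacklist patterns, the `WarningLedger` class, and its method names are hypothetical, and the sketch is not MediaWiki's actual implementation. The only detail taken from the slide is the rule that four warnings carry no consequence and the fifth offence blocks the account.

```python
import re
from collections import defaultdict

# Hypothetical blacklist entries, for illustration only.
BLACKLIST = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"cheap-pills\.example", r"casino\.example")
]

WARNINGS_BEFORE_BLOCK = 4  # from the slide: 4 warnings, the 5th offence blocks


def is_blacklisted(url: str) -> bool:
    """Return True if any blacklist pattern matches the URL."""
    return any(pattern.search(url) for pattern in BLACKLIST)


class WarningLedger:
    """Tracks offences per account and applies the warn-then-block rule."""

    def __init__(self) -> None:
        self.offences = defaultdict(int)
        self.blocked = set()

    def record_offence(self, account: str) -> str:
        """Register one human-confirmed offence; return 'warned' or 'blocked'."""
        self.offences[account] += 1
        if self.offences[account] > WARNINGS_BEFORE_BLOCK:
            self.blocked.add(account)
            return "blocked"
        return "warned"
```

Both mechanisms only react to URLs or accounts that a human has already flagged at least once, which is the latency the deck goes on to discuss.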


TAKEAWAY:

* Only humans can catch “new” instances

* Aggregate mechanisms must wait for atomic instances to compound before they can take effect

HUMAN LATENCY!


STATUS QUO OF WIKIPEDIA SPAMMING


Corpus Creation

• “Spam” edits are those that:
  1. Added exactly one external link
  2. Made no changes outside the context of that link
  3. Were “rolled back” (expedited admin. undo)
• Edits meeting: if(1 && 2 && !3) = “Ham”
• Edits meeting: if(3) = “Damaging”

Two months in mid-2010: 7.4 mil. edits

SPAM      HAM         DAMAGE
≈4,700    ≈182,000    ≈204,000
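The three labeling conditions can be written as a small predicate. The sketch below is a minimal illustration: the `Edit` record and its field names are hypothetical stand-ins for conditions 1 to 3. The slide does not say whether the sets are disjoint (a spam edit also satisfies condition 3), so this version simply reports every label whose condition holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical edit record; the flags mirror conditions 1-3 on the slide."""

    adds_exactly_one_link: bool  # condition 1
    no_other_changes: bool       # condition 2
    rolled_back: bool            # condition 3


def corpus_labels(edit: Edit) -> set[str]:
    """Return every corpus label whose condition the edit satisfies."""
    labels = set()
    link_only = edit.adds_exactly_one_link and edit.no_other_changes
    if link_only and edit.rolled_back:
        labels.add("spam")      # 1 && 2 && 3
    if link_only and not edit.rolled_back:
        labels.add("ham")       # 1 && 2 && !3
    if edit.rolled_back:
        labels.add("damaging")  # 3, regardless of 1 and 2
    return labels
```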


Corpus Example (1)

Because the link was the ONLY change made, the privileged user’s decision to roll back that edit speaks DIRECTLY to that link’s inappropriateness.


Corpus Example (2)


Spam Genres

TAKEAWAY: Spam is:
• Categorically diverse
• “Subtlety”: info.-adjacent services
• Not monetarily driven?

[Figure: Spam by ODP/DMOZ category]


Spam Placement

TAKEAWAY:
• Conventions followed
• Subtlety for persistence?


Bad Domains + Blacklist

TAKEAWAYS:
• Wiki spammers ≠ email spammers
  [Figure: Email spam URLs ∩ Wiki spam URLs = ø]
• Domain statistics don’t suggest max. utility
  – Only 2 of 25 worst were blacklisted
  – Only 14 domains appear 10+ times in {SPAM}
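A statistic such as "domains that appear 10+ times in {SPAM}" can be computed with a simple frequency count over the corpus URLs. The sketch below is an assumption about how one might do it, not the authors' tooling. It groups by full hostname, so `www.spam.example` and `spam.example` count separately, and the function names are illustrative.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_counts(urls):
    """Count how often each hostname appears in an iterable of URLs."""
    hosts = (urlsplit(url).hostname for url in urls)
    # Strings without a parseable host yield None and are skipped.
    return Counter(host for host in hosts if host)


def frequent_domains(urls, min_count=10):
    """Return the hostnames that appear at least min_count times."""
    return {host for host, n in domain_counts(urls).items() if n >= min_count}
```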


Spam Perpetrators

• 57% of spam added by non-registered users
  – Yet we will show that registered accounts are beneficial
• Worst users map onto worst domains
• Dedicated spam accounts; most blocked

[Figure: Geo-locating spammers]


Spam Life/Impact

Spam lifespan
• 19 minutes at median
• 85 secs. for damage
• Reason for difference

Spam page views
• Proxy for “link views”
• Metric of choice
• 6.05 views per spam link


Broadening Search

• Maybe our corpus just missed something?
  – Archives show some abuse (but non-automated)
  – Deleted revisions; media coverage
• SUMMARY: Status quo strategies unsuccessful
  – ≈ 6 views/link not likely to be profitable
  – Patrollers un-fooled by subtle strategies which seem to aim for “link persistence”
  – Cause or effect of unsophisticated strategies?


A NOVEL ATTACK MODEL (inspired by [15])


Attack Summary

MODEL: Abandon deception, aggressively exploit latency of human detection process.

Attack characterized by 4 vectors:
1. Target high-traffic pages
2. Autonomous attainment of privileged accounts; mechanized operation thereof
3. Prominent link placement/style
4. Distributed


Popular Pages (1)


• Imagine 85 seconds on these pages!
• Why not just protect these somehow?
  – Next: Account-level vulnerabilities


Popular Pages (2)


Privileged Accounts

• Becoming autoconfirmed
  – Outsource the CAPTCHA solve [15]
  – Requires 10 good edits (or warnings/block)
  – Non-vetted namespaces; helpful bots; thesaurus attacks
• Conduct campaigns via API [1] at high speed
  – “Anti-bot” software found ineffective


Prominent Placement


<p style="font-size:5em;font-weight:bolder">[http://www.example.com Example link]</p>


Distributed Attack

Two notions of “distributed”:

1. Need IP-agility to avoid IP (range) blocks
  – What spammer doesn’t?
  – Use open proxies, existing botnet, etc.
2. There are many sites one can target
  – Wiki language editions; WMF sister-sites
  – Universal API [1] into MediaWiki installs


MODEL EFFECTIVENESS & DEFENSE STRATEGIES


User Responses

• Administrative response
  – Expected flow to campaign termination
  – Very conservative example:
    • 1 min. account survival = 70 links placed
    • Top 70 articles @ 1 min. each = 2,100 active views
• Reader response
  – Sources of link exposure
    • Active views: Link in default version
    • Inactive views: Version histories and watchlisters
    • Content scrapers and mashup apps.
  – Click-through desensitization (email spam? [13])
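The "very conservative example" is a product of three quantities: links placed, views per minute per article, and minutes each link stays live. The slide gives the first and last, plus the total, so the average view rate can be backed out. The helper names below are illustrative, and the uniform view rate across the 70 articles is a simplifying assumption.

```python
def active_views(links_placed: int, views_per_minute: float, minutes_live: float) -> float:
    """Expected active views if every link stays live for minutes_live."""
    return links_placed * views_per_minute * minutes_live


def implied_view_rate(total_views: float, links_placed: int, minutes_live: float) -> float:
    """Average views per minute per article implied by a total view count."""
    return total_views / (links_placed * minutes_live)
```

With the slide's figures, `implied_view_rate(2100, 70, 1.0)` gives 30 views per minute per article on average.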


Economics

• Cost of campaigns (about $1 marginal)
  – Affiliate programs; 50% commissions [13]
  – CAPTCHA per account; $1 per thousand [15]
  – Domain names; $1-$2 each
  – Minimal labor costs (< 100 LOC)
• Expected return-on-investment; extrapolate from “male enhancement pharmacy” study [13]
  – 2100 exposures -> 20 click-throughs -> $5.20 gross
  – Affiliate fees: $5.20 -> $2.60 net >> $1 marginal
• Why not seen live? Naivety? Scale?
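The return-on-investment chain above is a short piece of arithmetic, reproduced here so each step is explicit. The function and parameter names are illustrative. The click rate (20 out of 2100) and the gross revenue per click ($5.20 / 20 = $0.26) are derived from the slide's totals rather than stated on it, so they should be read as assumptions.

```python
def campaign_return(exposures, click_rate, gross_per_click, commission, marginal_cost):
    """Walk the exposure -> click -> gross -> net chain for one campaign."""
    clicks = exposures * click_rate
    gross = clicks * gross_per_click
    net = gross * commission  # the affiliate keeps only the commission share
    return {
        "clicks": clicks,
        "gross": gross,
        "net": net,
        "profit": net - marginal_cost,
    }


# Slide figures: 2100 exposures, 20 click-throughs, $5.20 gross,
# 50% commission, about $1 marginal cost.
slide_example = campaign_return(
    exposures=2100,
    click_rate=20 / 2100,      # derived from the slide, not stated on it
    gross_per_click=5.20 / 20,  # derived from the slide, not stated on it
    commission=0.5,
    marginal_cost=1.0,
)
```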


Defense Strategies (1)

• Ethical issues; WMF notification
• Focus on technical defense (sociological aspects)

1. Require explicit approval
  – Prevent from going live until vetted
  – Controversial “Flagged Revisions” proposal
2. Privilege configuration
  – Edit count is a poor metric (see [8])
  – No human can do 70 edits/min.; maybe 5 edits/min.?
  – Tool-expedited users should have separate status
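The "maybe 5 edits/min.?" suggestion amounts to a per-account rate limit. One common way to enforce such a cap is a sliding-window counter, sketched below. The class name, the default window, and the choice of a sliding window are assumptions for illustration; the slide does not specify a mechanism.

```python
from collections import defaultdict, deque


class EditRateLimiter:
    """Allows at most max_edits per account within any window_seconds span."""

    def __init__(self, max_edits: int = 5, window_seconds: float = 60.0) -> None:
        self.max_edits = max_edits
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # account -> timestamps of accepted edits

    def allow(self, account: str, now: float) -> bool:
        """Return True and record the edit if the account is under its limit."""
        timestamps = self._recent[account]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_edits:
            return False
        timestamps.append(now)
        return True
```

A separate, higher limit for tool-expedited users would fit the slide's last point; here that would simply be a second limiter instance with a different `max_edits`.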


Defense Strategies (2)

3. Autonomous signature-driven detection [19]
  – Human latency gone (dwindling workforce [11])
  – Machine-learning classifier over:
    • Wikipedia metadata [5] (URL addition rates, editor permissions)
    • Landing-site analysis [7] (Commercial intent, SEO)
    • Third-party data (Alexa web crawler, Google Safe Browsing)
  – Implemented and operational on English Wikipedia
    • Offline analysis: 66% status-quo spam catch-rate at 0.5% FP-rate

[Figure: STiki architecture. Labels: Wikipedia; Wiki-API; IRC #enwiki#; STiki Services (Fetch Edit, Classify, Scoring, Maintain); Anti-spam algorithm; Edit Queue (entries from “Likely vandalism” to “Likely innocent”); STiki Client (Display). Bot logic: if (score > thresh): REVERT; else branch not legible.]
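The bot logic in the figure reverts an edit when its classifier score exceeds a threshold. The sketch below illustrates that decision. The `else` branch is not legible in the figure, so sending lower-scoring edits to a priority queue for human review is an assumption here, as are all class and function names. It is not STiki's actual code.

```python
import heapq
import itertools


class ReviewQueue:
    """Queue of edits awaiting human review, highest score first."""

    def __init__(self) -> None:
        self._heap = []
        self._order = itertools.count()  # tie-breaker preserves insertion order

    def push(self, edit_id: int, score: float) -> None:
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._order), edit_id))

    def pop(self) -> int:
        """Remove and return the id of the most suspicious queued edit."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def route_edit(edit_id: int, score: float, thresh: float, queue: ReviewQueue) -> str:
    """Revert above the threshold; otherwise queue for review (assumed branch)."""
    if score > thresh:
        return "REVERT"
    queue.push(edit_id, score)
    return "QUEUED"
```

Raising `thresh` trades catch rate for fewer false positives, which is the same trade-off the offline analysis on the slide quantifies.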


References (1)

[01] MediaWiki API. http://en.wikipedia.org/w/api.php
[02] Wikipedia: External links. http://en.wikipedia.org/wiki/WP:EL
[03] Wikipedia spam blacklists. http://en.wikipedia.org/wiki/WP:BLACKLIST
[04] WikiProject spam. http://en.wikipedia.org/wiki/WP:WPSPAM
[05] B. Adler, et al. Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In CICLing 2011.
[06] J. Antin and C. Cheshire. Readers are not free-riders: Reading as a form of participation on Wikipedia. In CSCW 2010.
[07] H. Dai, et al. Detecting online commercial intention (OCI). In WWW 2006.
[08] P. K.-F. Fong and R. P. Biuk-Aghai. What did they do? Deriving high-level edit histories in wikis. In WikiSym 2010.
[09] H. Gao, et al. Detecting and characterizing social spam campaigns. In CCS 2010.
[10] R. S. Geiger and D. Ribes. The work of sustaining order in Wikipedia: The banning of a vandal. In CSCW 2010.


References (2)

[11] E. Goldman. Wikipedia’s labor squeeze and its consequences. Journal of Telecomm. and High Tech. Law, 8, 2009.
[12] P. Heymann, et al. Fighting spam on social web sites: A survey of approaches and future challenges. IEEE Internet Comp., 11(6):36–45, 2007.
[13] C. Kanich, et al. Spamalytics: An empirical market analysis of spam marketing conversion. In CCS 2008.
[14] C. McCarthy. Amazon adds Wikipedia to book-shopping. http://news.cnet.com/8301-13577_3-20024297-36.html, 2010.
[15] M. Motoyama, et al. Re: CAPTCHAs - Understanding CAPTCHA-solving services in an economic context. In USENIX Security 2010.
[16] R. Priedhorsky, et al. Creating, destroying, and restoring value in Wikipedia. In GROUP 2007, the ACM Conference on Supporting Group Work.
[17] Y. Shin, et al. The nuts and bolts of a forum spam automator. In LEET 2011.
[18] B. E. Ur and V. Ganapathy. Evaluating attack amplification in online social networks. In W2SP 2009, Web 2.0 Security and Privacy.
[19] A. G. West, et al. Autonomous link spam detection in purely collaborative environments. In WikiSym 2011.