Using Topology to Identify Spam (SIGIR 2007)
TRANSCRIPT
![Page 1: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/1.jpg)
Web Spam Detection
C. Castillo, D. Donato, A. Gionis, V. Murdock, F. Silvestri
Spam on the Web
Detecting Web Spam
Link-Based Detection
Content-Based Detection
Using Links and Contents
Using the Web Topology
Conclusions
Know your Neighbors: Web Spam Detection Using the Web Topology
Carlos Castillo (1), Debora Donato (1), Aristides Gionis (1), Vanessa Murdock (1), Fabrizio Silvestri (2)
(1) Yahoo! Research Barcelona – Catalunya, Spain
(2) ISTI-CNR – Pisa, Italy
ACM SIGIR, 25 July 2007, Amsterdam
![Page 2: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/2.jpg)
1 Spam on the Web
2 Detecting Web Spam
3 Link-Based Detection
4 Content-Based Detection
5 Using Links and Contents
6 Using the Web Topology
7 Conclusions
![Page 3: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/3.jpg)
1 Spam on the Web
2 Detecting Web Spam
3 Link-Based Detection
4 Content-Based Detection
5 Using Links and Contents
6 Using the Web Topology
7 Conclusions
![Page 4: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/4.jpg)
What is on the Web?
![Page 5: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/5.jpg)
What is on the Web [2.0]?
![Page 6: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/6.jpg)
What else is on the Web?
Source: www.milliondollarhomepage.com
![Page 7: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/7.jpg)
What’s happening on the Web?
There is a fierce competition
for your attention
![Page 8: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/8.jpg)
What’s happening on the Web?
Search engines are to some extent
arbiters of this competition
and they must watch it closely, otherwise ...
![Page 9: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/9.jpg)
Some cheating occurs
1986 FIFA World Cup, Argentina vs England
![Page 10: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/10.jpg)
Simple web spam
![Page 11: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/11.jpg)
Hidden text
![Page 12: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/12.jpg)
Made for advertising
![Page 13: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/13.jpg)
Search engine?
![Page 14: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/14.jpg)
Fake search engine
![Page 15: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/15.jpg)
“Normal” content in link farms
![Page 16: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/16.jpg)
There are many attempts at cheating on the Web
Most of these are spam:
1,630,000 results for “free mp3 hilton viagra” in SE1
1,760,000 results for “credit vicodin loan” in SE2
1,320,000 results for “porn mortgage” in SE3
![Page 17: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/17.jpg)
Costs
X Costs for users: lower precision for some queries
X Costs for search engines: wasted storage space, network resources, and processing cycles
X Costs for the publishers: resources invested in cheating and not in improving their contents
![Page 18: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/18.jpg)
Cheating on the Web
Z Link spam
Z Content spam
Spam-oriented blogging
Comment/forum/Wiki spam
Malicious cloaking
Click fraud ×2
Malicious tagging
. . . more?
Adversarial relationship
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
![Page 28: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/28.jpg)
1 Spam on the Web
2 Detecting Web Spam
3 Link-Based Detection
4 Content-Based Detection
5 Using Links and Contents
6 Using the Web Topology
7 Conclusions
![Page 29: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/29.jpg)
Research on Web spam detection
Web spam detection techniques
![Page 30: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/30.jpg)
Spam, damn spam and statistics
[Fetterly et al., 2004] propose to study statistical distributions: “in a number of these distributions, outlier values are associated with web spam”
![Page 31: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/31.jpg)
Machine learning training
![Page 32: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/32.jpg)
Machine learning
![Page 33: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/33.jpg)
Challenges
Scalability
+ Machine Learning Challenges:
Instances are not really independent (graph)
Training set is relatively small
+ Information Retrieval Challenges:
It is hard to find out which features are relevant
Features can be aggregated in content units: page/host/domain
Features can be propagated through the graph
![Page 42: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/42.jpg)
Training data
X It is hard for search engines to provide labeled data
X Even if they do, it will not reflect a consensus on what is Web Spam
V Public Web Spam collection built by a group of volunteers: http://www.yr-bcn.es/webspam/
![Page 45: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/45.jpg)
1 Spam on the Web
2 Detecting Web Spam
3 Link-Based Detection
4 Content-Based Detection
5 Using Links and Contents
6 Using the Web Topology
7 Conclusions
![Page 46: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/46.jpg)
“Link farms”
Web
Link farm
Spam page
Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
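The idea of looking for nodes that share their out-links can be illustrated with a minimal sketch. This is a simplification that only groups nodes whose out-link sets are exactly identical; it is not the method of [Gibson et al., 2005]. The function name `groups_sharing_out_links`, the `min_size` parameter, and the toy graph are all hypothetical.

```python
from collections import defaultdict


def groups_sharing_out_links(out_links, min_size=2):
    """Group nodes whose sets of out-links are exactly identical.

    out_links maps each node to the list of nodes it links to.
    Exact matching is an assumption of this sketch; a real detector
    would also need to tolerate near-identical link sets.
    """
    groups = defaultdict(list)
    for node, dests in out_links.items():
        if dests:  # nodes without out-links carry no signal here
            groups[frozenset(dests)].append(node)
    return [sorted(nodes) for nodes in groups.values() if len(nodes) >= min_size]


# Toy graph: a, b and c all point only at the same target t.
toy_graph = {
    "a": ["t"],
    "b": ["t"],
    "c": ["t"],
    "x": ["y", "z"],
    "t": [],
}
suspicious = groups_sharing_out_links(toy_graph)
```

Here `suspicious` contains the single group made of `a`, `b`, and `c`, the three pages that exist only to link to `t`.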
![Page 47: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/47.jpg)
Handling large graphs
Memory large enough to hold some data per node
Disk large enough to hold some data per edge
A small number of passes over the data
![Page 48: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/48.jpg)
Semi-streaming model
for node : 1 . . . N do
    INITIALIZE-MEM(node)
end for
for distance : 1 . . . d do {Iteration step}
    for src : 1 . . . N do {Follow links in the graph}
        for all links from src to dest do
            COMPUTE(src, dest)
        end for
    end for
    NORMALIZE
end for
POST-PROCESS
return Something
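As one possible concrete instance of this template, the sketch below fills in the abstract steps with a PageRank computation. PageRank is chosen here only for illustration. The names `semi_streaming_pagerank` and `read_edges` are hypothetical: `read_edges` is assumed to be a callable that returns a fresh sequential iterator over `(src, dest)` pairs, standing in for a sequential read of the edge file from disk. Only per-node arrays are kept in memory, and the sketch spends one extra pass on counting out-degrees.

```python
def semi_streaming_pagerank(n, read_edges, d=20, alpha=0.85):
    """PageRank in the semi-streaming style: per-node data in memory,
    edges read sequentially, one pass per iteration.

    Assumes nodes are numbered 0..n-1.
    """
    # Extra pass (an assumption of this sketch): count out-degrees.
    out_deg = [0] * n
    for src, _dest in read_edges():
        out_deg[src] += 1

    # INITIALIZE-MEM: one number per node.
    rank = [1.0 / n] * n

    for _distance in range(d):  # iteration step
        acc = [0.0] * n
        for src, dest in read_edges():  # follow links in the graph
            # COMPUTE(src, dest): push a share of src's score to dest.
            acc[dest] += rank[src] / out_deg[src]
        rank = [(1.0 - alpha) / n + alpha * a for a in acc]
        # NORMALIZE: rescale so that the scores sum to one.
        total = sum(rank)
        rank = [r / total for r in rank]

    # POST-PROCESS is a no-op in this sketch.
    return rank


# Toy usage: a three-node cycle, where all scores should be equal.
cycle = [(0, 1), (1, 2), (2, 0)]
cycle_rank = semi_streaming_pagerank(3, lambda: iter(cycle))
```

The memory footprint is a few arrays of length N, while the edge list is never held in memory, which matches the per-node memory and per-edge disk budget described for large graphs.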
![Page 51: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/51.jpg)
Link-based features
Degree-related measures
PageRank
TrustRank [Gyongyi et al., 2004]
Truncated PageRank [Becchetti et al., 2006]
Estimation of supporters [Becchetti et al., 2006]
140 features per host (2 pages per host)
![Page 52: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/52.jpg)
Degree-Based
[Figure: two histograms of degree-based features, each comparing the Normal and Spam classes]
![Page 53: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/53.jpg)
TrustRank
[Gyongyi et al., 2004]
![Page 54: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/54.jpg)
TrustRank / PageRank
[Figure: histogram of the TrustRank / PageRank ratio, comparing the Normal and Spam classes]
![Page 55: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/55.jpg)
Truncated PageRank
Proposed in [Becchetti et al., 2006]. Idea: reduce the direct contribution of the first levels of links:
$$\text{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases}$$
V No extra reading of the graph after PageRank
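A minimal sketch of how such a truncated damping function could be applied is shown below. It accumulates the contributions of paths of length t only when t > T. Several choices here are assumptions and not taken from the slides: the constant C is set to (1 - α) / α^(T+1) so that the damping values sum to one, the score mass of nodes without out-links is simply dropped, and the function name `truncated_pagerank` and its parameters are hypothetical.

```python
def truncated_pagerank(out_links, T, alpha=0.85, max_iter=50):
    """Sum the contributions of paths longer than T links.

    out_links[v] lists the nodes that node v links to (nodes are 0..n-1).
    """
    n = len(out_links)
    # Assumption: choose C so that the sum of C * alpha**t over t > T is 1.
    C = (1.0 - alpha) / alpha ** (T + 1)

    # contrib[v] is the weight reaching v over paths of length exactly t,
    # already multiplied by C * alpha**t / n. It starts at t = 0.
    contrib = [C / n] * n
    score = [0.0] * n

    for t in range(1, max_iter + 1):
        nxt = [0.0] * n
        for src, dests in enumerate(out_links):
            if not dests:
                continue  # dangling node: its mass is dropped in this sketch
            share = alpha * contrib[src] / len(dests)
            for dest in dests:
                nxt[dest] += share
        contrib = nxt
        if t > T:  # damping(t) = 0 for t <= T, so early levels are skipped
            for v in range(n):
                score[v] += contrib[v]
    return score


# Toy usage: in the chain 0 -> 1 -> 2 no path is longer than two links,
# so with T = 2 every node scores zero.
chain = [[1], [2], []]
chain_scores = truncated_pagerank(chain, T=2)
```

Because the per-level contributions are the same quantities a PageRank iteration produces, the truncated score can be accumulated in the same passes over the graph.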
![Page 57: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/57.jpg)
Hop-plot
![Page 58: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/58.jpg)
High and low-ranked pages are different
Figure: Number of nodes (x 10^4) at each distance (1 to 20), for pages in the top 0%-10%, top 40%-50%, and top 60%-70%.
Areas below the curves are equal if we are in the same strongly-connected component
![Page 60: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/60.jpg)
Probabilistic counting
Figure: Pages hold bit vectors (e.g. 100010, 110000, 000110, 000011); propagation of bits using the “OR” operation toward the target page; count bits set to estimate supporters.
[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
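A rough sketch of the bit-propagation idea, in the style of plain Flajolet-Martin counting rather than the improved estimator of [Becchetti et al., 2006]. All function names, the 32-bit mask width, and the averaging estimator are assumptions made for illustration.

```python
import random

PHI = 0.77351  # Flajolet-Martin correction constant


def random_fm_mask(rng, bits=32):
    """Return a mask with a single bit set; bit i is chosen with
    probability 2^-(i+1), and the last bit takes the remainder."""
    i = 0
    while i < bits - 1 and rng.random() < 0.5:
        i += 1
    return 1 << i


def propagate(in_links, masks, d):
    """After d rounds, the mask of x is the OR of the masks of every
    node that reaches x in at most d links (x included).
    in_links maps every node to the list of nodes linking to it."""
    for _ in range(d):
        new = dict(masks)
        for x, sources in in_links.items():
            for s in sources:
                new[x] |= masks[s]
        masks = new
    return masks


def lowest_zero_bit(mask):
    """Position of the least significant bit that is not set."""
    r = 0
    while mask & (1 << r):
        r += 1
    return r


def estimate_supporters(mask_list):
    """Estimate from several independent sketches of the same node."""
    r_avg = sum(lowest_zero_bit(m) for m in mask_list) / len(mask_list)
    return 2 ** r_avg / PHI
```

A single mask gives a very noisy estimate, so in practice several independent masks per node (for example, drawn with `random_fm_mask(random.Random(seed))` for different seeds) would be propagated and combined.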
![Page 62: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/62.jpg)
Bottleneck number
$$b_d(x) = \min_{j \le d} \left\{ \frac{|N_j(x)|}{|N_{j-1}(x)|} \right\}.$$
Minimum rate of growth of the neighbors of x up to a certain distance.
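The definition can be sketched with a breadth-first search over in-links. This assumes that N_j(x) is the set of nodes that reach x in at most j links, with N_0(x) = {x}; the slide does not spell this out, and the function name and graph representation are hypothetical.

```python
def bottleneck_number(in_links, x, d):
    """in_links: dict node -> list of nodes linking to it; d >= 1."""
    seen = {x}          # N_j(x), grown one level per iteration
    frontier = {x}
    sizes = [1]         # sizes[j] = |N_j(x)|
    for _ in range(d):
        frontier = {u for v in frontier
                    for u in in_links.get(v, ())
                    if u not in seen}
        seen |= frontier
        sizes.append(len(seen))
    # Smallest growth ratio between consecutive neighborhood sizes.
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))
```

For a host with three direct supporters of which only one has a supporter of its own, the sizes are 1, 4, 5, so the value is 4.0 at d = 1 and 1.25 at d = 2.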
![Page 63: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/63.jpg)
Bottleneck number: spam
![Page 64: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/64.jpg)
Bottleneck number: normal
![Page 65: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/65.jpg)
Bottleneck number
$$b_d(x) = \min_{j \le d} \left\{ \frac{|N_j(x)|}{|N_{j-1}(x)|} \right\}.$$
Figure: Histogram of the bottleneck number (x-axis from 4.52 down to 1.11) for the Normal and Spam classes.
![Page 66: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/66.jpg)
![Page 67: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/67.jpg)
Content-Based Features
Most of the features reported in [Ntoulas et al., 2006]
Number of words in the page and title
Average word length
Fraction of anchor text
Fraction of visible text
Compression rate
Corpus precision and corpus recall
Query precision and query recall
Independent trigram likelihood
Entropy of trigrams
96 features per host
![Page 68: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/68.jpg)
Content-based features (entropy related)
$T = \{(w_1, p_1), \ldots, (w_k, p_k)\}$ the set of trigrams in a page, where trigram $w_i$ has frequency $p_i$
Features:
Entropy of trigrams $H = -\sum_{w_i \in T} p_i \log p_i$
Also, compression rate, as measured by bzip
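The two features can be sketched as follows. The function names are hypothetical, word trigrams and the natural logarithm are assumed (the slide fixes neither), and the compression rate is taken here to mean uncompressed size divided by bzip2-compressed size.

```python
import bz2
import math
from collections import Counter


def trigram_entropy(words):
    """H = -sum p_i log p_i over the relative frequencies of word trigrams."""
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    total = len(trigrams)
    counts = Counter(trigrams)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def compression_rate(text):
    """Assumed definition: raw byte length / bzip2-compressed byte length."""
    raw = text.encode("utf-8")
    return len(raw) / len(bz2.compress(raw))
```

Highly repetitive text (a typical sign of keyword stuffing) yields low trigram entropy and a high compression rate under these definitions.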
![Page 69: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/69.jpg)
Content-based features (related to popular keywords)
F set of most frequent terms in the collection
Q set of most frequent terms in a query log
P set of terms in a page
Features:
Corpus “precision” |P ∩ F|/|P|
Corpus “recall” |P ∩ F|/|F|
Query “precision” |P ∩ Q|/|P|
Query “recall” |P ∩ Q|/|Q|
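These four ratios translate directly into set operations. A small sketch, where the function name, the dictionary keys, and the zero returned for empty sets are assumptions:

```python
def keyword_features(page_terms, frequent_corpus_terms, frequent_query_terms):
    """P = terms in the page, F = frequent corpus terms, Q = frequent query terms."""
    P = set(page_terms)
    F = set(frequent_corpus_terms)
    Q = set(frequent_query_terms)

    def ratio(numerator, denominator):
        # Assumption: return 0.0 when the denominator set is empty.
        return len(numerator) / len(denominator) if denominator else 0.0

    return {
        "corpus_precision": ratio(P & F, P),
        "corpus_recall": ratio(P & F, F),
        "query_precision": ratio(P & Q, P),
        "query_recall": ratio(P & Q, Q),
    }
```

For example, a page with terms {cheap, pills, buy, now} measured against F = {the, buy, now} has corpus precision 2/4 and corpus recall 2/3.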
![Page 70: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/70.jpg)
Average word length
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
![Page 71: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/71.jpg)
Corpus precision
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
![Page 72: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/72.jpg)
Query precision
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
![Page 73: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/73.jpg)
![Page 74: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/74.jpg)
Cost-sensitive decision tree with bagging
Bagging of 10 decision trees, asymmetrical costs.

| Cost ratio | 1 | 10 | 20 | 30 | 50 |
|---|---|---|---|---|---|
| True positive rate | 65.8% | 66.7% | 71.1% | 78.7% | 84.1% |
| False positive rate | 2.8% | 3.4% | 4.5% | 5.7% | 8.6% |
| F-Measure | 0.712 | 0.703 | 0.704 | 0.723 | 0.692 |
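One way to see why a larger cost ratio raises both the true and the false positive rate is the minimum-expected-cost decision rule. The sketch below is only an illustration of that rule applied to bagged votes; it assumes the cost ratio R is the cost of missing a spam host divided by the cost of flagging a normal one, and it is not the training procedure used in the talk. Function names are hypothetical.

```python
def bagged_spam_probability(votes):
    """votes: one 0/1 prediction per tree in the bag (1 = spam)."""
    return sum(votes) / len(votes)


def cost_sensitive_label(p_spam, cost_ratio):
    """Label as spam when the expected cost of saying 'normal'
    (cost_ratio * p_spam) exceeds the expected cost of saying 'spam'
    (1 - p_spam), i.e. when p_spam > 1 / (1 + cost_ratio)."""
    if cost_ratio * p_spam > (1 - p_spam):
        return "spam"
    return "normal"
```

With a single spam vote out of 10 trees, the host is labelled normal at R = 1 but spam at R = 10, which matches the direction of the trend in the table.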
![Page 75: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/75.jpg)
Link- and content-based features
Link-based and content-based

| | Both | Link-only | Content-only |
|---|---|---|---|
| True positive rate | 78.7% | 79.4% | 64.9% |
| False positive rate | 5.7% | 9.0% | 3.7% |
| F-Measure | 0.723 | 0.659 | 0.683 |
![Page 76: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/76.jpg)
![Page 77: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/77.jpg)
Hypothesis
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]
![Page 79: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/79.jpg)
![Page 80: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/80.jpg)
Topological dependencies: in-links
Histogram of fraction of spam hosts in the in-links
0 = no in-link comes from spam hosts
1 = all of the in-links come from spam hosts
Figure: Histogram over the fraction of spam hosts in the in-links (0.0 to 1.0); series: in-links of non-spam, in-links of spam.
![Page 81: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/81.jpg)
Topological dependencies: out-links
Histogram of fraction of spam hosts in the out-links
0 = none of the out-links points to spam hosts
1 = all of the out-links point to spam hosts
Figure: Histogram over the fraction of spam hosts in the out-links (0.0 to 1.0); series: out-links of non-spam, out-links of spam.
![Page 82: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/82.jpg)
Idea 1: Clustering
Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
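The relabelling step can be sketched as below, assuming the initial per-host predictions and a host-to-cluster assignment are already available (the clustering algorithm itself is outside this sketch). Names are hypothetical, and ties are broken by whichever label is seen first in the cluster.

```python
from collections import Counter, defaultdict


def smooth_by_cluster(predicted, cluster_of):
    """predicted: host -> 'spam' / 'normal' from the base classifier.
    cluster_of: host -> cluster id from any graph clustering."""
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    final = {}
    for hosts in members.values():
        # Majority label inside the cluster.
        majority = Counter(predicted[h] for h in hosts).most_common(1)[0][0]
        for h in hosts:
            final[h] = majority
    return final
```

A lone "spam" prediction inside a cluster of hosts predicted "normal" is flipped to "normal", and the reverse happens in a mostly-spam cluster.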
![Page 83: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/83.jpg)
Idea 1: Clustering (cont.)
Initial prediction:
![Page 84: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/84.jpg)
Idea 1: Clustering (cont.)
Clustering:
![Page 85: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/85.jpg)
Idea 1: Clustering (cont.)
Final prediction:
![Page 86: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/86.jpg)
Idea 1: Clustering – Results

| | Baseline | Clustering |
|---|---|---|
| Without bagging | | |
| True positive rate | 75.6% | 74.5% |
| False positive rate | 8.5% | 6.8% |
| F-Measure | 0.646 | 0.673 |
| With bagging | | |
| True positive rate | 78.7% | 76.9% |
| False positive rate | 5.7% | 5.0% |
| F-Measure | 0.723 | 0.728 |

✓ Reduces error rate
![Page 87: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/87.jpg)
Idea 2: Propagate the label
Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
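A minimal sketch of a random walk with restart, assuming the restart distribution is the classifier's spamicity normalised to sum to 1. The function name, the restart probability of 0.15, the fixed iteration count, and the treatment of dangling nodes are illustrative assumptions.

```python
def propagate_spamicity(out_links, spamicity, restart=0.15, iterations=50):
    """out_links: dict mapping every node to a list of successors.
    spamicity: node -> classifier score in [0, 1] (not all zero)."""
    total = sum(spamicity.values())
    restart_dist = {v: s / total for v, s in spamicity.items()}
    score = dict(restart_dist)
    for _ in range(iterations):
        # With probability `restart`, jump back to a node drawn
        # from the spamicity-based distribution.
        nxt = {v: restart * restart_dist[v] for v in score}
        for u, succ in out_links.items():
            walk_mass = (1 - restart) * score[u]
            if succ:
                for v in succ:
                    nxt[v] += walk_mass / len(succ)
            else:
                # Assumption: a dangling node also restarts.
                for v in nxt:
                    nxt[v] += walk_mass * restart_dist[v]
        score = nxt
    return score
```

Passing the graph as given propagates along out-links; passing the reversed graph propagates backwards. A threshold on the resulting scores then gives the final labels.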
![Page 88: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/88.jpg)
Idea 2: Propagate the label (cont.)
Initial prediction:
![Page 89: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/89.jpg)
Idea 2: Propagate the label (cont.)
Propagation:
![Page 90: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/90.jpg)
Idea 2: Propagate the label (cont.)
Final prediction, applying a threshold:
![Page 91: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/91.jpg)
Idea 2: Propagate the label – Results

| | Baseline | Fwds. | Backwds. | Both |
|---|---|---|---|---|
| Classifier without bagging | | | | |
| True positive rate | 75.6% | 70.9% | 69.4% | 71.4% |
| False positive rate | 8.5% | 6.1% | 5.8% | 5.8% |
| F-Measure | 0.646 | 0.665 | 0.664 | 0.676 |
| Classifier with bagging | | | | |
| True positive rate | 78.7% | 76.5% | 75.0% | 75.2% |
| False positive rate | 5.7% | 5.4% | 4.3% | 4.7% |
| F-Measure | 0.723 | 0.716 | 0.733 | 0.724 |
![Page 92: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/92.jpg)
Idea 3: Stacked graphical learning
Meta-learning scheme [Cohen and Kou, 2006]
Derive initial predictions
Generate an additional attribute for each object by combining predictions on neighbors in the graph
Append additional attribute in the data and retrain
![Page 93: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/93.jpg)
Idea 3: Stacked graphical learning (cont.)
Let p(x) ∈ [0..1] be the prediction of a classification algorithm for a host x using k features
Let N(x) be the set of pages related to x (in some way)
Compute
$$f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$$
Add f(x) as an extra feature for instance x and learn a new model with k + 1 features
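The feature computation can be sketched as follows; retraining on the extended vectors is left to whatever classifier is in use. Function names are hypothetical, and returning 0.0 for a host with no related hosts is an assumption, since the slide does not cover that case.

```python
def stacked_feature(predictions, neighbors):
    """predictions: host -> p(x) in [0, 1] from the first-pass classifier.
    neighbors: host -> list of related hosts N(x) (in-links, out-links, or both)."""
    f = {}
    for x in predictions:
        related = neighbors.get(x, [])
        if related:
            f[x] = sum(predictions[g] for g in related) / len(related)
        else:
            f[x] = 0.0  # assumption: no neighbors means no evidence
    return f


def append_feature(features, extra):
    """features: host -> list of k values; returns lists of k + 1 values."""
    return {x: vec + [extra[x]] for x, vec in features.items()}
```

Running the two functions again on the predictions of the retrained model gives a second pass of the same scheme.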
![Page 94: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/94.jpg)
Idea 3: Stacked graphical learning (cont.)
Initial prediction:
![Page 95: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/95.jpg)
Idea 3: Stacked graphical learning (cont.)
Computation of new feature:
![Page 96: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/96.jpg)
Idea 3: Stacked graphical learning (cont.)
New prediction with k + 1 features:
![Page 97: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/97.jpg)
Idea 3: Stacked graphical learning - Results

| | Baseline | Avg. of in | Avg. of out | Avg. of both |
|---|---|---|---|---|
| True positive rate | 78.7% | 84.4% | 78.3% | 85.2% |
| False positive rate | 5.7% | 6.7% | 4.8% | 6.1% |
| F-Measure | 0.723 | 0.733 | 0.742 | 0.750 |

✓ Increases detection rate
![Page 98: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/98.jpg)
Idea 3: Stacked graphical learning x2
And repeat ...

| | Baseline | First pass | Second pass |
|---|---|---|---|
| True positive rate | 78.7% | 85.2% | 88.4% |
| False positive rate | 5.7% | 6.1% | 6.3% |
| F-Measure | 0.723 | 0.750 | 0.763 |

✓ Significant improvement over the baseline
![Page 99: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/99.jpg)
![Page 100: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/100.jpg)
Concluding remarks
✓ Considering content-based and link-based attributes improves the accuracy of the classifier
✓ Considering the links among pages improves the accuracy
![Page 102: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/102.jpg)
Web Spam Dataset: http://www.yr-bcn.es/webspam/
Web Spam Challenge I & II: http://webspam.lip6.fr/
AIRWeb Workshop: http://airweb.cse.lehigh.edu/
GraphLab at ECML/PKDD: http://graphlab.lip6.fr/
Newsletter: [email protected]
Thank you!
![Page 104: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/104.jpg)
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.
Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
![Page 105: Using Topology to Identify Spam (SIGIR 2007)](https://reader034.vdocuments.site/reader034/viewer/2022042614/55530d4db4c9054e3f8b4f34/html5/thumbnails/105.jpg)
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.