Link Analysis for Web Information Retrieval
Talk from February 2008 @ FADOC, Complutense University, Madrid
![Page 1: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/1.jpg)
Link Analysis for Web Information Retrieval
C. Castillo
Hypothesis
Levels of link analysis
Ranking
Web spam
... detection
... links
... contents
... both
Summary
Link Analysis for Web Information Retrieval
With Applications to Adversarial IR
Carlos Castillo (1)
With: R. Baeza-Yates (1,3), L. Becchetti (2), P. Boldi (5), D. Donato (1), A. Gionis (1), S. Leonardi (2), V. Murdock (1), M. Santini (5), F. Silvestri (4), S. Vigna (5)
1. Yahoo! Research Barcelona – Catalunya, Spain
2. Università di Roma “La Sapienza” – Rome, Italy
3. Yahoo! Research Santiago – Chile
4. ISTI-CNR – Pisa, Italy
5. Università degli Studi di Milano – Milan, Italy
![Page 2: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/2.jpg)
When you have a hammer
![Page 3: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/3.jpg)
Everything looks like a graph!
![Page 4: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/4.jpg)
1 Hypothesis
2 Levels of link analysis
3 Ranking
4 Web spam
5 ... detection
6 ... links
7 ... contents
8 ... both
9 Summary
![Page 5: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/5.jpg)
Links are not placed at random
Topical locality hypothesis
Link endorsement hypothesis
![Page 6: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/6.jpg)
![Page 7: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/7.jpg)
Topical locality hypothesis
“We found that pages are significantly more likely to be related topically to pages to which they are linked, as opposed to other pages selected at random or other nearby pages.” [Davison, 2000]
![Page 8: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/8.jpg)
[Figure: average text similarity (y-axis, 0.2 to 0.7) versus link distance (x-axis, 1 to 5)]
[Baeza-Yates et al., 2006], data from UK 2006
![Page 9: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/9.jpg)
Link similarity cases
Link (geodesic) distance
Co-citation
Bibliographic coupling
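The last two measures can be sketched on a toy graph. This is a minimal illustration in Python, assuming the usual definitions: the co-citation of two pages is the number of pages that link to both, and their bibliographic coupling is the number of pages that both link to. The graph and the function names are invented for the example.

```python
# Toy directed graph as an adjacency dict: page -> set of pages it links to.
# Pages and links are invented for illustration.
out_links = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"e"},
    "d": {"e"},
    "e": set(),
}

def in_links(graph):
    """Invert the graph: page -> set of pages that link to it."""
    inv = {u: set() for u in graph}
    for u, targets in graph.items():
        for v in targets:
            inv[v].add(u)
    return inv

def co_citation(graph, u, v):
    """Number of pages that link to both u and v."""
    inv = in_links(graph)
    return len(inv[u] & inv[v])

def bibliographic_coupling(graph, u, v):
    """Number of pages that both u and v link to."""
    return len(graph[u] & graph[v])

print(co_citation(out_links, "c", "d"))             # 2 (a and b link to both)
print(bibliographic_coupling(out_links, "a", "b"))  # 2 (both link to c and d)
```

Both counts are symmetric in the two pages, so they behave as similarity scores rather than as directed relations.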
![Page 10: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/10.jpg)
Co-citation
![Page 11: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/11.jpg)
Bibliographic coupling
![Page 12: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/12.jpg)
(Both can be generalized)
(Both co-citation and bibliographic coupling can be generalized. E.g., SimRank [Jeh and Widom, 2002] generalizes the idea of co-citation to several levels.)
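A rough sketch of the SimRank recursion may help here. It assumes the standard formulation: a page is fully similar to itself, and two different pages are similar to the extent that the pages linking to them are similar, damped by a constant C. The value C = 0.8, the fixed iteration count, and the toy graph are assumptions made for the example.

```python
def simrank(out_links, C=0.8, iterations=10):
    """Naive SimRank: all-pairs similarity based on in-link neighbours."""
    nodes = sorted(out_links)
    inv = {u: [] for u in nodes}          # page -> pages linking to it
    for u in nodes:
        for v in out_links[u]:
            inv[v].append(u)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif not inv[a] or not inv[b]:
                    new[(a, b)] = 0.0     # no in-links: nothing to compare
                else:
                    total = sum(sim[(i, j)] for i in inv[a] for j in inv[b])
                    new[(a, b)] = C * total / (len(inv[a]) * len(inv[b]))
        sim = new
    return sim

# Toy graph: a and b both link to c and d.
toy = {"a": ["c", "d"], "b": ["c", "d"], "c": [], "d": []}
scores = simrank(toy)
print(scores[("c", "d")])
```

With a single level this reduces to a normalized co-citation count; further iterations let similarity propagate from pages that are themselves only indirectly co-cited.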
![Page 13: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/13.jpg)
Link endorsement hypothesis
Links are assumed to be endorsements (votes, positive opinions) [Li, 1998]
But they can represent:
Disagreement
Self citations
Nepotism
Citations to methodological documents
etc.
![Page 14: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/14.jpg)
![Page 15: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/15.jpg)
Furthermore
They measure quantity, not quality (e.g., “Stop the numbers game!” in Communications of the ACM a few months ago)
Self-citations are frequent
In some topics there is more linking
Citations go from newer to older
New documents get few citations [Baeza-Yates et al., 2002]
Many of the citations are irrelevant
![Page 16: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/16.jpg)
![Page 17: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/17.jpg)
![Page 18: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/18.jpg)
![Page 19: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/19.jpg)
![Page 20: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/20.jpg)
![Page 21: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/21.jpg)
Nevertheless
Both the topical locality hypothesis and the link endorsement hypothesis are meaningful on the Web
Analogy with economics
Think of the hypotheses requiring many buyers/sellers, zero transaction costs, perfect information, etc. in economic sciences
![Page 22: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/22.jpg)
![Page 23: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/23.jpg)
![Page 24: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/24.jpg)
![Page 25: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/25.jpg)
How to find meaningful patterns?
Several levels of analysis:
Macroscopic view: overall structure
Microscopic view: nodes
Mesoscopic view: regions
![Page 26: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/26.jpg)
![Page 27: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/27.jpg)
![Page 28: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/28.jpg)
Macroscopic view, e.g. Bow-tie
[Broder et al., 2000]
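One way to see the bow-tie on a small graph is sketched below. The approach assumes a page known to lie in the central strongly connected component (the hypothetical `core_seed` argument): pages reachable from it and able to reach it form the core, pages that only reach it form IN, and pages only reachable from it form OUT. Everything else (tendrils, tubes, disconnected pieces) is lumped into a single "OTHER" group for brevity.

```python
from collections import deque

def reachable(graph, start):
    """Breadth-first search: set of nodes reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(out_links, core_seed):
    """Split pages into CORE / IN / OUT / OTHER relative to core_seed."""
    inv = {u: set() for u in out_links}   # reversed graph
    for u, targets in out_links.items():
        for v in targets:
            inv[v].add(u)
    forward = reachable(out_links, core_seed)
    backward = reachable(inv, core_seed)
    core = forward & backward
    return {
        "CORE": core,
        "IN": backward - core,
        "OUT": forward - core,
        "OTHER": set(out_links) - forward - backward,
    }

# Invented toy Web: in1 -> core (c1 <-> c2) -> out1, plus an isolated page.
web = {
    "in1": {"c1"},
    "c1": {"c2"},
    "c2": {"c1", "out1"},
    "out1": set(),
    "island": set(),
}
parts = bow_tie(web, "c1")
print(parts)
```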
![Page 29: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/29.jpg)
![Page 30: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/30.jpg)
Macroscopic view, e.g. Jellyfish
[Tauro et al., 2001] - Internet Autonomous Systems (AS) topology
![Page 31: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/31.jpg)
Macroscopic view, e.g. Jellyfish
![Page 32: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/32.jpg)
Microscopic view, e.g. Degree
[Barabasi, 2002] and others
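At the microscopic level, the basic quantity is simply how many pages have each in-degree. A minimal sketch, with an invented toy graph and function name:

```python
from collections import Counter

def in_degree_distribution(out_links):
    """Map in-degree -> number of pages with that in-degree."""
    indeg = Counter({u: 0 for u in out_links})   # pages with no in-links count too
    for targets in out_links.values():
        for v in targets:
            indeg[v] += 1
    return Counter(indeg.values())

toy = {"a": ["d"], "b": ["d"], "c": ["d", "a"], "d": []}
dist = in_degree_distribution(toy)
print(sorted(dist.items()))
```

On real Web graphs this histogram is usually plotted on log-log axes, where a power law shows up as an approximately straight line.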
![Page 33: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/33.jpg)
“While entirely of human design, the emerging network appears to have more in common with a cell or an ecological system than with a Swiss watch.” [Barabasi, 2002]
![Page 34: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/34.jpg)
Other scale-free networks
Power grid designs
Sexual partners in humans
Collaboration of movie actors in films
Citations in scientific publications
Protein interactions
![Page 35: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/35.jpg)
Microscopic view, e.g. Degree
[Figure panels: Greece, Chile, Spain, Korea]
[Baeza-Yates et al., 2007] - compares this distribution in 8 countries ... guess what the result is?
![Page 36: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/36.jpg)
Mesoscopic view, e.g. Hop-plot
![Page 37: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/37.jpg)
Mesoscopic view, e.g. Hop-plot
![Page 38: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/38.jpg)
Mesoscopic view, e.g. Hop-plot
[Figure: four panels of frequency (y-axis, 0.0 to 0.3) versus distance (x-axis, 5 to 30): .it (40M pages), .uk (18M pages), .eu.int (800K pages), synthetic graph (100K pages)]
[Baeza-Yates et al., 2006]
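A hop-plot of this kind can be approximated on a small graph by running a breadth-first search from every page and counting how often each distance occurs. The sketch below assumes "frequency" means the fraction of connected ordered pairs at each distance; that reading, the function name, and the toy graph are assumptions. At Web scale, exact all-pairs search is not feasible and the distribution would have to be estimated.

```python
from collections import Counter, deque

def hop_plot(out_links):
    """Fraction of reachable ordered pairs (u, v), u != v, at each distance."""
    counts = Counter()
    for source in out_links:
        dist = {source: 0}
        queue = deque([source])
        while queue:                       # breadth-first search from source
            u = queue.popleft()
            for v in out_links[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target, d in dist.items():
            if target != source:
                counts[d] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: c / total for d, c in sorted(counts.items())}

# Invented toy graph: a chain a -> b -> c -> d.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
plot = hop_plot(chain)
print(plot)
```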
![Page 39: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/39.jpg)
![Page 40: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/40.jpg)
Models
Preferential attachment
Copy model
Hybrid models
![Page 41: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/41.jpg)
![Page 42: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/42.jpg)
![Page 43: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/43.jpg)
Preferential attachment
“A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected.” [Barabasi and Albert, 1999]
“rich get richer”
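The two mechanisms in the quote can be simulated in a few lines. This is a simplified sketch, not the exact model of the paper: it assumes each new vertex adds a single link, and it picks the target with probability proportional to current degree by sampling uniformly from a list of link endpoints. The function name and parameters are invented for the example.

```python
import random
from collections import Counter

def preferential_attachment(n, seed=0):
    """Grow a graph to n nodes; each new node links to one existing node
    chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    edges = [(1, 0)]          # start from two nodes joined by one link
    endpoints = [0, 1]        # each node listed once per incident link
    for new in range(2, n):
        target = rng.choice(endpoints)   # degree-proportional choice
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges

edges = preferential_attachment(10_000)
degree = Counter(node for edge in edges for node in edge)
print("largest degree:", max(degree.values()))
print("nodes with degree 1:", sum(1 for d in degree.values() if d == 1))
```

Most nodes end up with very few links while a handful of early nodes collect many, which is the "rich get richer" effect.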
![Page 44: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/44.jpg)
![Page 45: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/45.jpg)
![Page 46: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/46.jpg)
Counting in-links does not work
“With a simple program, huge numbers of pages can be created easily, artificially inflating citation counts. Because the Web environment contains profit seeking ventures, attention getting strategies evolve in response to search engine algorithms. For this reason, any evaluation strategy which counts replicable features of web pages is prone to manipulation” [Page et al., 1998]
![Page 47: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/47.jpg)
PageRank: simplified version
$$\mathrm{PageRank}'(u) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}'(v)}{|\Gamma^+(v)|}$$
Γ−(·): in-links
Γ+(·): out-links
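The simplified formula can be iterated directly. The sketch below assumes a uniform starting score, a fixed number of iterations, and a toy graph in which every page has at least one out-link; all three are choices made for the example.

```python
def simplified_pagerank(out_links, iterations=100):
    """Iterate PageRank'(u) = sum over in-links v of PageRank'(v) / outdeg(v)."""
    n = len(out_links)
    score = {u: 1.0 / n for u in out_links}
    for _ in range(iterations):
        new = {u: 0.0 for u in out_links}
        for v, targets in out_links.items():
            for u in targets:              # v passes an equal share to each out-link
                new[u] += score[v] / len(targets)
        score = new
    return score

# Invented toy graph: every page has at least one out-link.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = simplified_pagerank(toy)
print(scores)
```

On this graph the scores settle at 0.4, 0.2 and 0.4 for a, b and c, and their sum stays at 1 because no page lacks out-links.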
![Page 48: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/48.jpg)
Iterations with pseudo-PageRank
![Page 49: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/49.jpg)
Iterations with pseudo-PageRank
![Page 50: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/50.jpg)
So far, so good, but ...
The Web includes many pages with no out-links; these will accumulate all of the score
We would like Web pages to accumulate ranking
We add random jumps (teleportation)
![Page 51: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/51.jpg)
PageRank
$$\mathrm{PageRank}(u) = \frac{\epsilon}{N} + (1 - \epsilon) \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}(v)}{|\Gamma^+(v)|}$$
Γ−(·): in-links
Γ+(·): out-links
N: number of pages
ε/N: jump to a random page with probability ε ≈ 0.15
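A sketch of this iteration follows. The formula above does not say what happens to the score of pages with no out-links; the code assumes it is spread uniformly over all pages, which is one common convention and not necessarily the one used in the talk. The function name, iteration count and toy graph are also assumptions.

```python
def pagerank(out_links, epsilon=0.15, iterations=100):
    """PageRank with random jumps of probability epsilon."""
    n = len(out_links)
    score = {u: 1.0 / n for u in out_links}
    for _ in range(iterations):
        # Score held by pages without out-links, spread uniformly (assumption).
        dangling = sum(score[v] for v, t in out_links.items() if not t)
        new = {u: epsilon / n + (1 - epsilon) * dangling / n for u in out_links}
        for v, targets in out_links.items():
            for u in targets:
                new[u] += (1 - epsilon) * score[v] / len(targets)
        score = new
    return score

# Invented toy graph: a -> b -> c, where c has no out-links.
toy = {"a": ["b"], "b": ["c"], "c": []}
scores = pagerank(toy)
print(scores)
```

Thanks to the random jumps and the dangling-page handling, the scores keep summing to 1 and page c does not absorb everything.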
![Page 52: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/52.jpg)
HITS
Two scores per page: “hub score” and “authority score”.
![Page 53: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/53.jpg)
![Page 54: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/54.jpg)
![Page 55: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/55.jpg)
Iterations
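Initialize:
$$\mathrm{hub}(u, 0) = \mathrm{auth}(u, 0) = 1$$
Iterate:
$$\mathrm{hub}(u, t) = \sum_{v \in \Gamma^+(u)} \frac{\mathrm{auth}(v, t-1)}{|\Gamma^-(v)|}$$
$$\mathrm{auth}(u, t) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{hub}(v, t-1)}{|\Gamma^+(v)|}$$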
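The update rules can be tried out on a toy graph. The sketch assumes every score starts at the positive value 1, since scores that all start at zero would stay at zero under these updates; the function name, iteration count and graph are likewise invented for the example.

```python
def hub_authority_scores(out_links, iterations=20):
    """Degree-normalized hub/authority iteration as written above."""
    inv = {u: [] for u in out_links}      # page -> pages linking to it
    for u, targets in out_links.items():
        for v in targets:
            inv[v].append(u)
    hub = {u: 1.0 for u in out_links}     # positive start (assumption)
    auth = {u: 1.0 for u in out_links}
    for _ in range(iterations):
        # Both updates read the scores from the previous step.
        new_hub = {u: sum(auth[v] / len(inv[v]) for v in out_links[u])
                   for u in out_links}
        new_auth = {u: sum(hub[v] / len(out_links[v]) for v in inv[u])
                    for u in out_links}
        hub, auth = new_hub, new_auth
    return hub, auth

# Invented toy graph: h1 links to x and y, h2 links to x only.
toy = {"h1": ["x", "y"], "h2": ["x"], "x": [], "y": []}
hub, auth = hub_authority_scores(toy)
print(hub)
print(auth)
```

Page h1 ends up the better hub and x the better authority, while pages without out-links get hub score 0 and pages without in-links get authority score 0.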
![Page 56: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/56.jpg)
![Page 57: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/57.jpg)
What is on the Web?
![Page 58: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/58.jpg)
What is on the Web [2.0]?
![Page 59: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/59.jpg)
What else is on the Web?
“The sum of all human knowledge plus porn” – Robert Gilbert
Source: www.milliondollarhomepage.com
![Page 60: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/60.jpg)
What’s happening on the Web?
There is a fierce competition
for your attention
![Page 61: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/61.jpg)
What’s happening on the Web?
Search engines are to some extent
arbiters of this competition
and they must watch it closely, otherwise ...
![Page 62: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/62.jpg)
Some cheating occurs
1986 FIFA World Cup, Argentina vs England
![Page 63: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/63.jpg)
Simple web spam
![Page 64: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/64.jpg)
Hidden text
![Page 65: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/65.jpg)
Made for advertising
![Page 66: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/66.jpg)
Search engine?
![Page 67: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/67.jpg)
Fake search engine
![Page 68: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/68.jpg)
“Normal” content in link farms
![Page 69: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/69.jpg)
“Normal” content in link farms
![Page 70: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/70.jpg)
Cloaking
![Page 71: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/71.jpg)
Redirection
![Page 72: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/72.jpg)
Redirects using Javascript
Simple redirect
<script>
document.location="http://www.topsearch10.com/";
</script>
“Hidden” redirect
<script>
var1=24; var2=var1;
if(var1==var2) {
  document.location="http://www.topsearch10.com/";
}
</script>
![Page 73: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/73.jpg)
Problem: obfuscated code
Obfuscated redirect
<script>
var a1="win",a2="dow",a3="loca",a4="tion.",
    a5="replace",a6="('http://www.top10search.com/')";
var i,str="";
for(i=1;i<=6;i++) {
  str += eval("a"+i);
}
eval(str);
</script>
![Page 74: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/74.jpg)
Problem: really obfuscated code
Encoded javascript
<script>
var s = "%5CBE0D%5C%05GDHJ BDE%16...%04%0E";
var e = '', i;
eval(unescape('s%eDunescape%28s%29%3Bfor...%3B'));
</script>
More examples: [Chellapilla and Maykov, 2007]
![Page 75: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/75.jpg)
There are many attempts at cheating on the Web
Most of these are spam:
1,630,000 results for “free mp3 hilton viagra” in SE1
1,760,000 results for “credit vicodin loan” in SE2
1,320,000 results for “porn mortgage” in SE3
![Page 76: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/76.jpg)
Costs
Costs for users: lower precision for some queries
Costs for search engines: wasted storage space, network resources, and processing cycles
Costs for the publishers: resources invested in cheating and not in improving their contents
![Page 77: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/77.jpg)
Adversarial IR Issues on the Web
Link spam
Content spam
Cloaking
Comment/forum/wiki spam
Spam-oriented blogging
Click fraud ×2
Reverse engineering of ranking algorithms
Web content filtering
Advertisement blocking
Stealth crawling
Malicious tagging
. . . more?
![Page 89: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/89.jpg)
Opportunities for Web spam
Spamdexing:
Keyword stuffing
Link farms
Spam blogs (splogs)
Cloaking
Adversarial relationship
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
![Page 92: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/92.jpg)
Motivation
[Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages:
“in a number of these distributions, outlier values are associated with web spam”
![Page 93: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/93.jpg)
Machine Learning
![Page 94: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/94.jpg)
Training of a Decision Tree
![Page 95: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/95.jpg)
Decision Tree (error = 15%)
![Page 96: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/96.jpg)
Decision Tree (error = 15% → 12%)
![Page 97: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/97.jpg)
Machine Learning (cont.)
![Page 98: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/98.jpg)
Feature Extraction
![Page 99: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/99.jpg)
Challenges: Machine Learning
Machine Learning Challenges:
Instances are not really independent (graph)
Learning with few examples
Scalability
![Page 102: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/102.jpg)
Challenges: Information Retrieval
Information Retrieval Challenges:
Feature extraction: which features?
Feature aggregation: page/host/domain
Feature propagation (graph)
Recall/precision tradeoffs
Scalability
![Page 107: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/107.jpg)
Challenges: Data
Data is difficult to collect
Data is expensive to label
Labels are sparse
Humans do not always agree
![Page 108: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/108.jpg)
Agreement
![Page 109: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/109.jpg)
Results
Labels
Label             Frequency  Percentage
Normal            4,046      61.75%
Borderline        709        10.82%
Spam              1,447      22.08%
Can not classify  350        5.34%
Agreement
Category    Kappa  Interpretation
normal      0.62   Substantial agreement
spam        0.63   Substantial agreement
borderline  0.11   Slight agreement
global      0.56   Moderate agreement
Reference collection [Castillo et al., 2006]
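The numbers in the two tables can be re-derived with a short Python sketch. The label percentages follow directly from the frequencies. The kappa function below is plain two-rater Cohen's kappa, used here only as an assumption to illustrate the idea (the slide does not say which kappa variant was computed), and the interpretation bands are the commonly used Landis and Koch scale, which matches the wording on the slide.

```python
LABEL_COUNTS = {"Normal": 4046, "Borderline": 709, "Spam": 1447, "Can not classify": 350}


def percentages(counts):
    """Share of each label, in percent, rounded to two decimals."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    p_expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)


def interpret(kappa):
    """Landis and Koch interpretation bands."""
    if kappa < 0:
        return "Poor"
    for upper, name in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                        (0.80, "Substantial"), (1.00, "Almost perfect")]:
        if kappa <= upper:
            return name + " agreement"
    return "Almost perfect agreement"


print(percentages(LABEL_COUNTS))
# Hypothetical judgments from two assessors on four hosts.
print(cohen_kappa(["spam", "spam", "normal", "normal"],
                  ["spam", "normal", "normal", "normal"]))
```

The low kappa for the borderline category suggests that assessors rarely agree on which hosts are borderline, even when they agree on clear spam and clear non-spam.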
![Page 111: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/111.jpg)
Topological spam: link farms
Single-level farms can be detected by searching for groups of nodes that share their out-links [Gibson et al., 2005]
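The idea of grouping nodes by shared out-links can be sketched in a few lines of Python. This is an exact-match toy version for small graphs and does not reproduce the method of [Gibson et al., 2005]; the function name and the `min_group` and `min_links` thresholds are assumptions made for illustration.

```python
from collections import defaultdict


def groups_sharing_out_links(out_links, min_group=3, min_links=2):
    """Return groups of nodes whose sets of out-links are identical.

    out_links: dict mapping node -> iterable of link targets.
    Groups smaller than min_group, or with fewer than min_links targets, are ignored.
    """
    by_targets = defaultdict(list)
    for node, targets in out_links.items():
        key = frozenset(targets)
        if len(key) >= min_links:
            by_targets[key].append(node)
    return [sorted(nodes) for nodes in by_targets.values() if len(nodes) >= min_group]


graph = {
    "f1": ["t1", "t2"],   # three farm pages boosting the same targets
    "f2": ["t2", "t1"],
    "f3": ["t1", "t2"],
    "n1": ["t1", "x"],
    "n2": ["y"],
}
print(groups_sharing_out_links(graph))
```

Exact matching is brittle against farms that vary their links slightly, so an approximate similarity measure would probably be needed in practice.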
![Page 113: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/113.jpg)
Handling large graphs
For large graphs, random access is not possible.
Large graphs do not fit in main memory
Streaming model of computation
![Page 116: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/116.jpg)
Semi-streaming model
Memory size enough to hold some data per-node
Disk size enough to hold some data per-edge
A small number of passes over the data
![Page 117: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/117.jpg)
Restriction
Semi-streaming model: graph on disk
for node : 1 ... N do
  INITIALIZE-MEM(node)
end for
for distance : 1 ... d do {Iteration step}
  for src : 1 ... N do {Follow links in the graph}
    for all links from src to dest do
      COMPUTE(src,dest)
    end for
  end for
  NORMALIZE
end for
POST-PROCESS
return Something
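One way to instantiate this skeleton is PageRank, sketched below in Python. Only per-node arrays are kept in memory, and the edges are read sequentially once per pass. The `edge_stream` callable (which stands in for a sequential read of the graph from disk), the damping value, and the handling of nodes without out-links are assumptions of this sketch, not details from the talk.

```python
def semi_streaming_pagerank(edge_stream, n, passes=20, alpha=0.85):
    """PageRank with O(n) memory and one sequential edge scan per pass.

    edge_stream: zero-argument callable returning a fresh iterator of
                 (src, dest) pairs, with node ids in 0..n-1.
    """
    # INITIALIZE-MEM: per-node out-degree and initial score.
    out_degree = [0] * n
    for src, _ in edge_stream():
        out_degree[src] += 1
    rank = [1.0 / n] * n

    for _ in range(passes):                      # iteration step
        incoming = [0.0] * n
        for src, dest in edge_stream():          # follow links in the graph
            incoming[dest] += rank[src] / out_degree[src]   # COMPUTE(src, dest)
        # NORMALIZE: redistribute the score of nodes without out-links
        # and apply the random jump.
        dangling = sum(rank[v] for v in range(n) if out_degree[v] == 0)
        rank = [(1 - alpha) / n + alpha * (incoming[v] + dangling / n)
                for v in range(n)]
    return rank


edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
scores = semi_streaming_pagerank(lambda: iter(edges), n=3)
print(scores)
```

The same loop structure would carry over to other per-node computations by changing what is accumulated in the inner loop.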
![Page 120: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/120.jpg)
Link-Based Features
Degree-related measures
PageRank
TrustRank [Gyongyi et al., 2004]
Truncated PageRank [Becchetti et al., 2006]
Estimation of supporters [Becchetti et al., 2006]
140 features per host (2 pages per host)
![Page 121: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/121.jpg)
Degree-Based
Figure: Histograms of two degree-based features, normal vs. spam.
![Page 122: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/122.jpg)
TrustRank
TrustRank [Gyongyi et al., 2004]
A node with high PageRank, but far away from a core set of “trusted nodes”, is suspicious
Start from a set of trusted nodes, then do a random walk, returning to the set of trusted nodes with probability 1 − α at each step
Trusted nodes: data from http://www.dmoz.org/
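The random walk described above can be sketched in Python as a PageRank variant whose jump always lands on a trusted node. This is a minimal illustration under stated assumptions, not the implementation from [Gyongyi et al., 2004]: the graph representation, the value α = 0.85, the fixed iteration count, and sending the score of dead-end nodes back to the trusted set are all choices made here.

```python
def trustrank(out_links, trusted, alpha=0.85, iterations=50):
    """Random walk that returns to the trusted set with probability 1 - alpha.

    out_links: dict node -> list of nodes it links to (every node is a key).
    trusted:   non-empty set of trusted seed nodes.
    """
    nodes = list(out_links)
    seed = {v: (1.0 / len(trusted) if v in trusted else 0.0) for v in nodes}
    score = dict(seed)
    for _ in range(iterations):
        nxt = {v: (1 - alpha) * seed[v] for v in nodes}   # jump back to trusted nodes
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = alpha * score[src] / len(targets)
                for dest in targets:
                    nxt[dest] += share
            else:
                # Assumption: a dead end also returns its score to the trusted set.
                for t in trusted:
                    nxt[t] += alpha * score[src] / len(trusted)
        score = nxt
    return score


web = {
    "a": ["b"],          # trusted page
    "b": ["a"],          # reachable from the trusted page
    "s1": ["s2", "a"],   # farm pages link to each other and to a good page,
    "s2": ["s1"],        # but nothing trusted links back to them
}
trust = trustrank(web, trusted={"a"})
print(trust)
```

Linking out to a good page does not help the farm: trust only flows along links that start at, or are reachable from, the trusted set.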
![Page 124: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/124.jpg)
TrustRank Idea
![Page 125: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/125.jpg)
TrustRank / PageRank
Figure: Histogram of TrustRank / PageRank, normal vs. spam.
![Page 126: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/126.jpg)
High and low-ranked pages are different
Figure: Number of nodes (×10^4) at each distance (1 to 20), with one curve each for Top 0%−10%, Top 40%−50%, and Top 60%−70%.
Areas below the curves are equal if we are in the same strongly-connected component
![Page 128: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/128.jpg)
Probabilistic counting
Figure: Bit vectors are propagated along links towards the target page using the “OR” operation; the bits set at the target page are counted to estimate its supporters.
[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
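The propagation step can be sketched in Python using the classical Flajolet and Martin estimator, in the spirit of ANF. It is not the refined estimator of [Becchetti et al., 2006]; the function names, the 32-bit masks, the number of trials, and the fact that each node counts itself are assumptions of this sketch.

```python
import random

PHI = 0.77351  # correction constant from Flajolet and Martin


def random_mask(rng, bits=32):
    """Set exactly one bit; bit i is chosen with probability 2^-(i+1)."""
    for i in range(bits):
        if rng.random() < 0.5:
            return 1 << i
    return 1 << (bits - 1)


def propagate(edges, masks, distance):
    """After `distance` passes, each node holds the OR of the masks of all
    nodes that can reach it in at most `distance` links (itself included)."""
    current = list(masks)
    for _ in range(distance):
        nxt = list(current)
        for src, dest in edges:          # one sequential pass over the links
            nxt[dest] |= current[src]
        current = nxt
    return current


def lowest_zero_bit(mask):
    r = 0
    while mask & (1 << r):
        r += 1
    return r


def estimate_supporters(edges, n, distance, trials=64, seed=0):
    """Average the lowest-zero-bit position over several independent trials."""
    rng = random.Random(seed)
    totals = [0] * n
    for _ in range(trials):
        masks = [random_mask(rng) for _ in range(n)]
        for v, m in enumerate(propagate(edges, masks, distance)):
            totals[v] += lowest_zero_bit(m)
    return [2 ** (t / trials) / PHI for t in totals]


# A target page (node 0) with 200 pages linking to it.
star = [(i, 0) for i in range(1, 201)]
estimates = estimate_supporters(star, n=201, distance=1)
print(round(estimates[0]), round(estimates[1]))
```

Each pass touches every edge once, so the estimate for all nodes is obtained without random access to the graph.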
![Page 130: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/130.jpg)
Bottleneck number
$b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$. Minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and they have smaller bottleneck numbers than non-spam pages.
Figure: Histogram of the bottleneck number, normal vs. spam.
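A small Python sketch of the bottleneck number follows. It assumes that N_j(x) is the set of nodes that can reach x in at most j links, with N_0(x) = {x}; that reading, the dictionary-based graph, and the function name are assumptions, and the exact breadth-first search used here would be replaced by an approximate count on a large graph.

```python
def bottleneck_number(in_links, x, d):
    """min over j = 1..d of |N_j(x)| / |N_{j-1}(x)|, computed by BFS on in-links.

    in_links: dict node -> list of nodes that link to it.
    """
    seen = {x}
    frontier = [x]
    sizes = [1]                       # |N_0(x)| = 1
    for _ in range(d):
        nxt = []
        for v in frontier:
            for u in in_links.get(v, ()):
                if u not in seen:
                    seen.add(u)
                    nxt.append(u)
        frontier = nxt
        sizes.append(len(seen))       # |N_j(x)|
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))


# A page whose neighborhood keeps growing with distance.
open_web = {"x": ["a", "b"], "a": ["c", "d", "e", "f"]}
# A page supported only by a closed cluster of two pages.
farm = {"x": ["a", "b"], "a": ["b"], "b": ["a"]}

print(bottleneck_number(open_web, "x", 2))
print(bottleneck_number(farm, "x", 2))
```

In the closed cluster the neighborhood stops growing after one step, so the ratio drops to 1, which is the behaviour the slide expects for spam.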
![Page 132: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/132.jpg)
Content-Based Features
Most of these are reported in [Ntoulas et al., 2006]:
Number of words in the page and in the title
Average word length
Fraction of anchor text
Fraction of visible text
Compression rate
From [Castillo et al., 2007]:
Corpus precision and corpus recall
Query precision and query recall
Independent trigram likelihood
Entropy of trigrams
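Three of the features listed above can be sketched in a few lines. The definitions here are assumptions made for illustration: words are lowercase `\w+` tokens, the compression rate is the uncompressed size divided by the zlib-compressed size, and trigram entropy is the Shannon entropy of the page's word-trigram distribution. The cited papers may tokenize and normalize differently.

```python
import math
import re
import zlib
from collections import Counter


def words(text):
    """Lowercase word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def average_word_length(text):
    tokens = words(text)
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def compression_rate(text):
    """Uncompressed size over compressed size; repetitive pages score high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw)) if raw else 0.0


def trigram_entropy(text):
    """Shannon entropy, in bits, of the distribution of word trigrams."""
    tokens = words(text)
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    total = sum(trigrams.values())
    return -sum(c / total * math.log2(c / total) for c in trigrams.values())
```

A page built by repeating the same phrase would show a high compression rate and a low trigram entropy under these definitions.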
![Page 133: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/133.jpg)
Average word length
Figure: Histogram of the average word length (x-axis from 3.0 to 7.5) in non-spam vs. spam pages for k = 500.
![Page 134: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/134.jpg)
Corpus precision
Figure: Histogram of the corpus precision (x-axis from 0.0 to 0.7) in non-spam vs. spam pages.
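The slides do not define corpus precision and recall, so the sketch below rests on an assumed definition: take the k most frequent terms of the corpus as "popular" terms, let precision be the fraction of a page's word occurrences that are popular terms, and let recall be the fraction of popular terms that occur in the page. Query precision and recall would follow the same pattern with popular query terms. All names are hypothetical.

```python
from collections import Counter


def popular_terms(corpus_pages, k, stopwords=frozenset()):
    """The k most frequent non-stopword terms over all pages (lists of tokens)."""
    counts = Counter(
        w for page in corpus_pages for w in page if w not in stopwords
    )
    return {w for w, _ in counts.most_common(k)}


def corpus_precision(page, popular):
    """Fraction of the page's word occurrences that are popular terms."""
    return sum(w in popular for w in page) / len(page) if page else 0.0


def corpus_recall(page, popular):
    """Fraction of the popular terms that occur at least once in the page."""
    return len(popular & set(page)) / len(popular) if popular else 0.0
```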
![Page 135: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/135.jpg)
Query precision
Figure: Histogram of the query precision (x-axis from 0.0 to 0.6) in non-spam vs. spam pages for k = 500.
![Page 136: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/136.jpg)
1 Hypothesis · 2 Levels of link analysis · 3 Ranking · 4 Web spam · 5 ... detection · 6 ... links · 7 ... contents · 8 ... both · 9 Summary
![Page 137: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/137.jpg)
General hypothesis
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages
Ideas for exploiting this: clustering, propagation, stacked learning
![Page 138: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/138.jpg)
[Castillo et al., 2007]
![Page 139: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/139.jpg)
Topological dependencies: in-links
Histogram of fraction of spam hosts in the in-links
0 = no in-link comes from spam hosts
1 = all of the in-links come from spam hosts
Figure: histogram with two series, in-links of non-spam hosts and in-links of spam hosts (x-axis: fraction of spam hosts, 0.0 to 1.0).
![Page 140: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/140.jpg)
Topological dependencies: out-links
Histogram of fraction of spam hosts in the out-links
0 = none of the out-links points to spam hosts
1 = all of the out-links point to spam hosts
Figure: histogram with two series, out-links of non-spam hosts and out-links of spam hosts (x-axis: fraction of spam hosts, 0.0 to 1.0).
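The quantity plotted in the two histograms above can be computed per host as follows. This is a small sketch: the `is_spam` label dictionary is hypothetical, and skipping unlabelled neighbors is an assumption.

```python
def spam_fraction(neighbours, is_spam):
    """Fraction of labelled neighbours that are spam, or None if none are labelled.

    Pass a host's in-link sources for the in-link histogram, or its out-link
    targets for the out-link histogram.
    """
    labelled = [n for n in neighbours if n in is_spam]
    if not labelled:
        return None
    return sum(is_spam[n] for n in labelled) / len(labelled)
```

A value of 0 means no labelled neighbor is spam, and a value of 1 means all of them are.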
![Page 141: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/141.jpg)
Idea 1: Clustering
Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
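The voting step can be sketched as below, assuming the classifier output and the cluster assignment are already available as dictionaries. The names are hypothetical, and leaving tied clusters unchanged is an assumption; the clustering algorithm itself is out of scope here.

```python
from collections import defaultdict


def smooth_by_cluster(predicted_spam, cluster_of):
    """Relabel every host with the majority label of its cluster."""
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    final = dict(predicted_spam)
    for hosts in members.values():
        spam_votes = sum(predicted_spam[h] for h in hosts)
        if 2 * spam_votes > len(hosts):
            label = True
        elif 2 * spam_votes < len(hosts):
            label = False
        else:
            continue  # tie: keep each host's own prediction (assumption)
        for h in hosts:
            final[h] = label
    return final
```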
![Page 142: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/142.jpg)
Idea 1: Clustering (cont.)
Initial prediction:
![Page 143: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/143.jpg)
Idea 1: Clustering (cont.)
Clustering:
![Page 144: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/144.jpg)
Idea 1: Clustering (cont.)
Final prediction:
![Page 145: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/145.jpg)
Idea 1: Clustering – Results
| | Baseline | Clustering |
|---|---|---|
| **Without bagging** | | |
| True positive rate | 75.6% | 74.5% |
| False positive rate | 8.5% | 6.8% |
| F-Measure | 0.646 | 0.673 |
| **With bagging** | | |
| True positive rate | 78.7% | 76.9% |
| False positive rate | 5.7% | 5.0% |
| F-Measure | 0.723 | 0.728 |

✓ Reduces error rate
![Page 146: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/146.jpg)
Idea 2: Propagate the label
Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
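A minimal sketch of the propagation step, assuming a power-iteration random walk with restart in which the restart distribution is the normalized spamicity. The damping value `alpha`, the iteration count, the handling of hosts without out-links, and the final threshold are illustrative assumptions. Passing the reversed graph would propagate backwards instead of forwards.

```python
def random_walk_with_restart(out_links, spamicity, alpha=0.85, iterations=50):
    """Propagate spamicity along links; every link target must be a key of `spamicity`."""
    nodes = list(spamicity)
    total = sum(spamicity.values())  # must be positive
    restart = {n: spamicity[n] / total for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            targets = out_links.get(n, [])
            if targets:
                share = alpha * score[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Host without out-links: hand its mass back to the restart
                # distribution (assumption).
                for m in nodes:
                    nxt[m] += alpha * score[n] * restart[m]
        score = nxt
    return score


def classify(score, threshold):
    """Final prediction: hosts whose propagated score exceeds the threshold."""
    return {n: s > threshold for n, s in score.items()}
```

The scores sum to one, so a sensible threshold depends on the number of hosts and would be tuned on labelled data.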
![Page 147: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/147.jpg)
Idea 2: Propagate the label (cont.)
Initial prediction:
![Page 148: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/148.jpg)
Idea 2: Propagate the label (cont.)
Propagation:
![Page 149: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/149.jpg)
Idea 2: Propagate the label (cont.)
Final prediction, applying a threshold:
![Page 150: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/150.jpg)
Idea 2: Propagate the label – Results
| | Baseline | Fwds. | Backwds. | Both |
|---|---|---|---|---|
| **Classifier without bagging** | | | | |
| True positive rate | 75.6% | 70.9% | 69.4% | 71.4% |
| False positive rate | 8.5% | 6.1% | 5.8% | 5.8% |
| F-Measure | 0.646 | 0.665 | 0.664 | 0.676 |
| **Classifier with bagging** | | | | |
| True positive rate | 78.7% | 76.5% | 75.0% | 75.2% |
| False positive rate | 5.7% | 5.4% | 4.3% | 4.7% |
| F-Measure | 0.723 | 0.716 | 0.733 | 0.724 |
![Page 151: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/151.jpg)
Idea 3: Stacked graphical learning
Meta-learning scheme [Cohen and Kou, 2006]
Derive initial predictions
Generate an additional attribute for each object by combining predictions on neighbors in the graph
Append additional attribute in the data and retrain
![Page 152: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/152.jpg)
Idea 3: Stacked graphical learning (cont.)
Let p(x) ∈ [0..1] be the prediction of a classification algorithm for a host x using k features
Let N(x) be the set of pages related to x (in some way)
Compute
$$f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$$
Add f(x) as an extra feature for instance x and learn a new model with k + 1 features
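The extra feature f(x) can be sketched as below. The dictionaries `p`, `neighbours` and `features` are hypothetical inputs, and returning 0.0 for a host with no related pages is an assumption, since the formula is undefined in that case. Retraining on the extended rows is left to whatever classifier produced p.

```python
def neighbour_average(p, neighbours, x):
    """f(x): mean prediction over N(x), the pages related to x."""
    related = neighbours.get(x, [])
    if not related:
        return 0.0  # N(x) is empty: fall back to 0.0 (assumption)
    return sum(p[g] for g in related) / len(related)


def add_stacked_feature(features, p, neighbours):
    """Return new feature rows with f(x) appended, giving k + 1 features per host."""
    return {
        x: row + [neighbour_average(p, neighbours, x)]
        for x, row in features.items()
    }
```

Running the same two functions again on the predictions of the retrained model gives a second pass.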
![Page 153: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/153.jpg)
Idea 3: Stacked graphical learning (cont.)
Initial prediction:
![Page 154: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/154.jpg)
Idea 3: Stacked graphical learning (cont.)
Computation of new feature:
![Page 155: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/155.jpg)
Idea 3: Stacked graphical learning (cont.)
New prediction with k + 1 features:
![Page 156: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/156.jpg)
Idea 3: Stacked graphical learning - Results
| | Baseline | Avg. of in | Avg. of out | Avg. of both |
|---|---|---|---|---|
| True positive rate | 78.7% | 84.4% | 78.3% | 85.2% |
| False positive rate | 5.7% | 6.7% | 4.8% | 6.1% |
| F-Measure | 0.723 | 0.733 | 0.742 | 0.750 |

✓ Increases detection rate
![Page 157: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/157.jpg)
Idea 3: Stacked graphical learning x2
And repeat ...
| | Baseline | First pass | Second pass |
|---|---|---|---|
| True positive rate | 78.7% | 85.2% | 88.4% |
| False positive rate | 5.7% | 6.1% | 6.3% |
| F-Measure | 0.723 | 0.750 | 0.763 |

✓ Significant improvement over the baseline
![Page 158: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/158.jpg)
1 Hypothesis · 2 Levels of link analysis · 3 Ranking · 4 Web spam · 5 ... detection · 6 ... links · 7 ... contents · 8 ... both · 9 Summary
![Page 159: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/159.jpg)
Concluding remarks
Hypothesis: topical locality + link endorsement
Primitives: similarity, ranking, propagation, etc.
Application to Web spam
![Page 162: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/162.jpg)
Thank you!
![Page 163: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/163.jpg)
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.
Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2007). Characterization of national web domains. ACM Transactions on Internet Technology, 7(2).
Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the Chilean web structure. Comput. Networks, 50(10):1464–1473.
Baeza-Yates, R., Saint-Jean, F., and Castillo, C. (2002). Web structure, dynamics and page quality. In Proceedings of String Processing and Information Retrieval (SPIRE), volume 2476 of Lecture Notes in Computer Science, Lisbon, Portugal. Springer.
Barabasi, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.
Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509–512.
![Page 164: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/164.jpg)
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. (2006). A reference collection for web spam. SIGIR Forum, 40(2):11–24.
Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. (2007). Know your neighbors: Web spam detection using the web topology. In Proceedings of SIGIR, Amsterdam, Netherlands. ACM.
Chellapilla, K. and Maykov, A. (2007). A taxonomy of JavaScript redirection spam. In AIRWeb ’07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 81–88, New York, NY, USA. ACM Press.
![Page 165: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/165.jpg)
Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.
Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
![Page 166: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/166.jpg)
Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Jeh, G. and Widom, J. (2002). SimRank: a measure of structural-context similarity. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, New York, NY, USA. ACM Press.
Li, Y. (1998). Toward a qualitative search engine. IEEE Internet Computing.
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project.
![Page 167: Link Analysis for Web Information Retrieval](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ba8b3b4c905b3618b51c9/html5/thumbnails/167.jpg)
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2001). A simple conceptual model for the Internet topology. In Global Internet, San Antonio, Texas, USA. IEEE CS Press.