link analysis

Link Analysis Rajendra Akerkar Vestlandsforsking, Norway


Post on 07-Dec-2014


Page 1: Link analysis

Link Analysis

Rajendra Akerkar, Vestlandsforsking, Norway

Page 2: Link analysis

Objectives

• To review common approaches to link analysis
• To calculate the popularity of a site based on link analysis
• To model human judgments indirectly

Page 3: Link analysis

Outline

1. Early Approaches to Link Analysis
2. Hubs and Authorities: HITS
3. PageRank
4. Stability
5. Probabilistic Link Analysis
6. Limitations of Link Analysis

Page 4: Link analysis

Early Approaches

Basic Assumptions
• Hyperlinks contain information about the human judgment of a site
• The more incoming links to a site, the more it is judged important

Bray (1996)
• The visibility of a site is measured by the number of other sites pointing to it
• The luminosity of a site is measured by the number of other sites to which it points
• Limitation: failure to capture the relative importance of different parent (child) sites
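A minimal sketch of the two measures, assuming the Web graph is given as a list of (source, target) pairs. The function name and the toy graph are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict


def visibility_and_luminosity(edges):
    """Count distinct in-links (visibility) and out-links (luminosity) per site."""
    visibility = defaultdict(int)
    luminosity = defaultdict(int)
    for source, target in set(edges):  # set() drops repeated links
        luminosity[source] += 1
        visibility[target] += 1
    return dict(visibility), dict(luminosity)


# Hypothetical toy graph: sites 2, 3 and 4 link to site 1; site 1 links to site 5.
edges = [(2, 1), (3, 1), (4, 1), (1, 5)]
visibility, luminosity = visibility_and_luminosity(edges)
print(visibility[1], luminosity[1])  # 3 1
```

Both counts treat every linking site equally, which is exactly the limitation noted above.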

Page 5: Link analysis

Early Approaches

Mark (1988)
• To calculate the score S of a document at vertex v:

  $S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$

  - v: a vertex in the hypertext graph G = (V, E)
  - S(v): the global score
  - s(v): the score if the document is isolated
  - ch[v]: children of the document at vertex v

• Limitations:
  - Requires G to be a directed acyclic graph (DAG)
  - If v has a single link to w, S(v) > S(w)
  - If v has a long path to w and s(v) < s(w), then S(v) > S(w), which is unreasonable
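The recursion can be sketched as follows, assuming the graph is a DAG stored as a dictionary of child lists. The names `global_scores`, `children` and `local_score`, and the toy values, are hypothetical.

```python
from functools import lru_cache


def global_scores(children, local_score):
    """S(v) = s(v) + mean of S(w) over the children w of v (DAG only)."""

    @lru_cache(maxsize=None)
    def S(v):
        kids = children.get(v, [])
        if not kids:
            return local_score[v]  # isolated or leaf document
        return local_score[v] + sum(S(w) for w in kids) / len(kids)

    return {v: S(v) for v in local_score}


# Hypothetical DAG: a -> b, a -> c, b -> c
children = {"a": ["b", "c"], "b": ["c"], "c": []}
local_score = {"a": 1.0, "b": 2.0, "c": 4.0}
scores = global_scores(children, local_score)
print(scores)  # {'a': 6.0, 'b': 6.0, 'c': 4.0}
```

Here b has a single link to c and ends up with S(b) > S(c), although its own score s(b) is lower. A cycle in the graph would make the recursion never terminate, which is why a DAG is required.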

Page 6: Link analysis

Early Approaches

Marchiori (1997)
• Hyper information should complement textual information to obtain the overall information:

  $S(v) = s(v) + h(v)$

  - S(v): overall information
  - s(v): textual information
  - h(v): hyper information

• $h(v) = \sum_{w \in ch[v]} F^{r(v, w)} S(w)$
  - F: a fading constant, F ∈ (0, 1)
  - r(v, w): the rank of w after sorting the children of v by S(w)

• A remedy for the previous approach (Mark 1988)
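A small sketch of the fading sum, under two assumptions the slide does not state: children are sorted by decreasing S(w), and ranks start at 1. The graph is assumed acyclic so the recursion terminates. All names and values are illustrative.

```python
def overall_information(children, textual, F=0.5):
    """S(v) = s(v) + sum over children of F**rank * S(w), ranks by decreasing S(w)."""
    memo = {}

    def S(v):
        if v not in memo:
            ranked = sorted((S(w) for w in children.get(v, [])), reverse=True)
            hyper = sum(F ** rank * score
                        for rank, score in enumerate(ranked, start=1))
            memo[v] = textual[v] + hyper
        return memo[v]

    return {v: S(v) for v in textual}


# Hypothetical graph: a links to b and c
children = {"a": ["b", "c"], "b": [], "c": []}
textual = {"a": 1.0, "b": 2.0, "c": 4.0}
info = overall_information(children, textual, F=0.5)
print(info["a"])  # 1.0 + 0.5 * 4.0 + 0.25 * 2.0 = 3.5
```

Because each additional child is faded by a further factor F, a page gains less from a long list of links than from one strong link.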

Page 7: Link analysis

HITS: Kleinberg's Algorithm

• HITS: Hypertext Induced Topic Selection
• For each vertex v ∈ V in a subgraph of interest:
  - a(v): the authority of v
  - h(v): the hubness of v
• A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites.
• Hubness shows the importance of a site as a pointer to other sites. A good hub is a site that links to many authoritative sites.

Page 8: Link analysis

Authority and Hubness

[Figure: node 1 with parents 2, 3, 4 and children 5, 6, 7]

a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)

Page 9: Link analysis

Authority and Hubness Convergence

• Recursive dependency:

  $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$

  $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$

• Using linear algebra, we can prove that a(v) and h(v) converge (when the vectors are normalized at each step)

Page 10: Link analysis

HITS Example

Find a base subgraph:
• Start with a root set R = {1, 2, 3, 4}
• {1, 2, 3, 4}: nodes relevant to the topic
• Expand the root set R to include all the children and a fixed number of parents of nodes in R
• The result is a new set S (the base subgraph)

Page 11: Link analysis

HITS Example

BaseSubgraph(R, d)
  S ← R
  for each v in R
    do S ← S ∪ ch[v]
       P ← pa[v]
       if |P| > d
         then P ← arbitrary subset of P having size d
       S ← S ∪ P
  return S
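The BaseSubgraph procedure can be sketched in Python as below. This assumes the graph is available as two dictionaries of child and parent lists; the random sampling (with a fixed seed) stands in for the "arbitrary subset" step, and all names are illustrative.

```python
import random


def base_subgraph(root, children, parents, d, rng=None):
    """Expand root set with all children and at most d parents per root node."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch repeatable
    base = set(root)
    for v in root:
        base |= set(children.get(v, ()))
        pa = list(parents.get(v, ()))
        if len(pa) > d:
            pa = rng.sample(pa, d)  # arbitrary subset of size d
        base |= set(pa)
    return base


# Hypothetical graph: node 1 has children 5, 6 and parents 7, 8, 9.
children = {1: [5, 6]}
parents = {1: [7, 8, 9]}
S = base_subgraph([1], children, parents, d=2)
print(len(S))  # 5: the root, two children, two of the three parents
```

Capping the number of parents keeps very popular pages from pulling thousands of in-linking pages into the base set.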

Page 12: Link analysis

HITS Example

Hubs and authorities: two n-dimensional vectors a and h

HubsAuthorities(G)
  1 ← [1, …, 1] ∈ R^|V|
  a_0 ← h_0 ← 1
  t ← 0
  repeat
    t ← t + 1
    for each v in V
      do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
         h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
    a_t ← a_t / ||a_t||
    h_t ← h_t / ||h_t||
  until ||a_t − a_{t-1}|| + ||h_t − h_{t-1}|| < ε
  return (a_t, h_t)
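A runnable sketch of the iteration, assuming Euclidean norms, an edge-list input, and an iteration cap as a safety net (the cap is an addition not on the slide). Authority is summed over parents and hubness over children, as in the recursive dependency given earlier. The toy graph is hypothetical.

```python
import math


def hubs_authorities(nodes, edges, eps=1e-10, max_iter=1000):
    """Iterate a(v) = sum of h over parents, h(v) = sum of a over children."""
    parents = {v: [] for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)

    def normalize(x):
        norm = math.sqrt(sum(val * val for val in x.values()))
        return {v: val / norm for v, val in x.items()} if norm > 0 else x

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Both updates use the previous iterate, as in the pseudocode.
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = normalize({v: sum(a[w] for w in children[v]) for v in nodes})
        change = (math.sqrt(sum((new_a[v] - a[v]) ** 2 for v in nodes))
                  + math.sqrt(sum((new_h[v] - h[v]) ** 2 for v in nodes)))
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h


# Hypothetical graph: hub 1 links to 3 and 4, hub 2 links to 3.
nodes = [1, 2, 3, 4]
edges = [(1, 3), (1, 4), (2, 3)]
a, h = hubs_authorities(nodes, edges)
print(a[3] > a[4], h[1] > h[2])  # True True
```

Page 3 is cited by both hubs and receives the larger authority weight; page 1 links to both authorities and receives the larger hub weight. On graphs whose dominant eigenvalue is repeated, the result can depend on the starting vector, so the example uses a graph where it is unique.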

Page 13: Link analysis

HITS Example Results

[Figure: authority and hubness weights for nodes 1 to 15]

Page 14: Link analysis

HITS Improvements

Bharat and Henzinger (1998)
• HITS problems
  1) A document can contain many identical links to the same document on another host
  2) Links are generated automatically (e.g. messages posted on newsgroups)
• Solutions
  1) Assign weights to identical multiple edges that are inversely proportional to their multiplicity
  2) Prune irrelevant nodes, or regulate the influence of a node with a relevance weight

Page 15: Link analysis

PageRank

• Introduced by Page et al. (1998)
  - The weight is assigned by the rank of parents
• Difference with HITS
  - HITS takes hubness and authority weights
  - The rank of a page is proportional to its parents' rank, but inversely proportional to its parents' outdegree

Page 16: Link analysis

Matrix Notation

Adjacency Matrix

A = [matrix not preserved in the transcript]

* http://www.kusatro.kyoto-u.com

Page 17: Link analysis

Matrix Notation

• r = αBr = Mr
  - α: eigenvalue
  - r: eigenvector of B
• Eigenvalue problem for a matrix A: $A x = \lambda x$, i.e. $(A - \lambda I) x = 0$
• B = [matrix not preserved in the transcript]
• Finding PageRank: find the eigenvector of B with an associated eigenvalue α

Page 18: Link analysis

Matrix Notation

PageRank: the eigenvector (a column of P) associated with the maximum eigenvalue

$B = P D P^{-1}$

D: diagonal matrix of eigenvalues {λ1, …, λn}

P: regular (invertible) matrix whose columns are the eigenvectors

PageRank r1 = normalized [vector values not preserved in the transcript]

Page 19: Link analysis

Matrix Notation

• Confirm the result
  - The rank reflects the number of inlinks from highly ranked pages
  - The relative ranks of pages 5 and 2, and of pages 6 and 7, are hard to explain
• Interesting topic
  - How do you get your homepage ranked highly?

Page 20: Link analysis

Markov Chain Notation

• Random surfer model
  - Description of a random walk through the Web graph
  - Interpreted as a transition matrix with the asymptotic probability that a surfer is currently browsing that page

  $r_t = M r_{t-1}$

  M: transition matrix for a first-order Markov chain (stochastic)

• Does it converge to some sensible solution (as t → ∞) regardless of the initial ranks?
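The update r_t = M r_{t-1} can be sketched without building M explicitly, assuming each page spreads its rank equally over its outlinks. The function name and the three-page graph are hypothetical; the graph is chosen to be strongly connected and aperiodic so the iteration converges.

```python
def power_iteration(links, steps=100):
    """Repeat r_t = M r_{t-1}, where page v passes r[v] / outdegree to each child."""
    nodes = sorted(links)
    r = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(steps):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            for w in links[v]:
                nxt[w] += r[v] / len(links[v])
        r = nxt
    return r


# Hypothetical graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
links = {1: [2, 3], 2: [3], 3: [1]}
r = power_iteration(links)
print({v: round(x, 3) for v, x in r.items()})  # {1: 0.4, 2: 0.2, 3: 0.4}
```

For this graph the stationary ranks can be checked by hand: r(2) = r(1)/2, r(3) = r(1)/2 + r(2), and r(1) = r(3), which with the ranks summing to 1 gives (0.4, 0.2, 0.4).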

Page 21: Link analysis

Problem

• "Rank sink" problem
  - In general, many Web pages have no inlinks/outlinks
  - This results in dangling edges in the graph
• E.g.
  - No parent: rank 0; $M^T$ converges to a matrix whose last column is all zero
  - No children: no solution; $M^T$ converges to the zero matrix
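A minimal sketch of the effect, assuming the plain update r_t = M r_{t-1} with rank split equally over outlinks. The chain graph and the names are hypothetical.

```python
def step(links, r):
    """One application of r_t = M r_{t-1}; rank on pages without outlinks is lost."""
    nxt = {v: 0.0 for v in links}
    for v, targets in links.items():
        for w in targets:
            nxt[w] += r[v] / len(targets)
    return nxt


# Hypothetical chain: page 1 has no parent, page 3 has no children.
links = {1: [2], 2: [3], 3: []}
r = {v: 1.0 / len(links) for v in links}
totals = [sum(r.values())]
for _ in range(3):
    r = step(links, r)
    totals.append(sum(r.values()))
print(totals)  # total rank shrinks at every step and reaches 0.0
```

Page 1 drops to rank 0 after the first step because nothing links to it, and page 3 absorbs rank without passing any on, so the total drains away.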

Page 22: Link analysis

Modification

• The surfer will restart browsing by picking a new Web page at random

  M = (B + E)

  E: escape matrix

  M: stochastic matrix

• Still a problem?
  - It is not guaranteed that M is primitive
  - If M is stochastic and primitive, PageRank converges to the corresponding stationary distribution of M

Page 23: Link analysis

PageRank Algorithm

[Algorithm listing not preserved in the transcript] (Page et al., 1998)

Page 24: Link analysis

Distribution of the Mixture Model

• The probability distribution that results from combining the Markovian random walk distribution and the static rank source distribution:

  $r = \varepsilon e + (1 - \varepsilon) x$

  - e: static rank source distribution
  - x: Markovian random walk distribution
  - ε: probability of selecting a non-linked page

• PageRank
  - Now the transition matrix [εH + (1 − ε)M] is primitive and stochastic
  - r_t converges to the dominant eigenvector
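A sketch of the mixture update r_t = εe + (1 − ε) M r_{t-1}. Several choices here are assumptions rather than details from the slides: e is uniform, ε = 0.15, and the rank of pages without outlinks is redistributed uniformly. The function name and graph are hypothetical.

```python
def pagerank(links, eps=0.15, steps=200):
    """Power iteration for r = eps * e + (1 - eps) * M r with uniform e."""
    nodes = list(links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(steps):
        # Assumption: rank held by pages without outlinks is spread uniformly.
        dangling = sum(r[v] for v in nodes if not links[v])
        nxt = {v: eps / n + (1 - eps) * dangling / n for v in nodes}
        for v in nodes:
            for w in links[v]:
                nxt[w] += (1 - eps) * r[v] / len(links[v])
        r = nxt
    return r


# Hypothetical chain: page 1 has no parent, page 3 has no children.
links = {1: [2], 2: [3], 3: []}
ranks = pagerank(links)
print(ranks[3] > ranks[2] > ranks[1] > 0)  # True
```

With the reset term every page keeps a positive rank and the total stays at 1, even on a graph with a parentless page and a childless page.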

Page 25: Link analysis

Stability

• Are the link analysis algorithms based on eigenvectors stable, in the sense that the results do not change significantly?
• Suppose the connectivity of a portion of the graph is changed arbitrarily
  - How will this affect the results of the algorithms?

Page 26: Link analysis

Stability of HITS

Ng et al. (2001)
• A bound on the number of hyperlinks k that can be added to or deleted from one page without affecting the authority or hubness weights
• It is possible to perturb a symmetric matrix by a quantity that grows as δ and that produces a constant perturbation of the dominant eigenvector
  - δ: eigengap λ1 − λ2
  - d: maximum outdegree of G

Page 27: Link analysis

Stability of PageRank

Ng et al. (2001)
• V: the set of vertices touched by the perturbation
• The parameter ε of the mixture model has a stabilization role
• If the set of pages affected by the perturbation has a small rank, the overall change will also be small
• Tighter bound by Bianchini et al. (2001)
  - δ(j) ≥ 2 depends on the edges incident on j

Page 28: Link analysis

SALSA

• SALSA (Lempel, Moran 2001)
  - Probabilistic extension of the HITS algorithm
  - A random walk is carried out by following hyperlinks both in the forward and in the backward direction
• Two separate random walks
  - Hub walk
  - Authority walk

Page 29: Link analysis

Forming a Bipartite Graph in SALSA

Page 30: Link analysis

Random Walks

• Hub walk
  - Follow a Web link from a page u_h to a page w_a (a forward link), and then
  - Immediately traverse a backlink going from w_a to v_h, where (u, w) ∈ E and (v, w) ∈ E
• Authority walk
  - Follow a Web link from a page w_a to a page u_h (a backward link), and then
  - Immediately traverse a forward link going from u_h to a page z_a, where (u, w) ∈ E and (u, z) ∈ E

Page 31: Link analysis

Computing Weights

• The hub weight is computed from the sum, over two-step paths, of the product of the inverse degrees of the out-links and the in-links traversed
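A sketch of the hub-walk transition probabilities, assuming each two-step move u → w → v contributes 1/outdegree(u) times 1/indegree(w). The function name and the toy graph are hypothetical; the formula itself follows the hub walk described by Lempel and Moran.

```python
def salsa_hub_matrix(edges):
    """Row-stochastic hub-walk matrix: forward link u -> w, then backlink w -> v."""
    out_links, in_links = {}, {}
    for u, w in set(edges):
        out_links.setdefault(u, set()).add(w)
        in_links.setdefault(w, set()).add(u)
    hubs = sorted(out_links)
    H = {u: {v: 0.0 for v in hubs} for u in hubs}
    for u in hubs:
        for w in out_links[u]:
            for v in in_links[w]:
                H[u][v] += 1.0 / len(out_links[u]) / len(in_links[w])
    return H


# Hypothetical graph: hub 1 links to 3 and 4, hub 2 links to 3.
H = salsa_hub_matrix([(1, 3), (1, 4), (2, 3)])
print(H[1], H[2])  # {1: 0.75, 2: 0.25} {1: 0.5, 2: 0.5}
```

In this connected example the stationary distribution of the walk is (2/3, 1/3), which is proportional to the outdegrees of the two hubs.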

Page 32: Link analysis

Why We Care

• Lempel and Moran (2001) showed theoretically that SALSA weights are more robust than HITS weights in the presence of the Tightly Knit Community (TKC) effect
  - This effect occurs when a small collection of pages (related to a given topic) is connected so that every hub links to every authority; it includes the mutual reinforcement effect as a special case
• The pages in a community connected in this way can be ranked highly by HITS, higher than pages in a much larger collection where only some hubs link to some authorities
• TKC could be exploited by spammers hoping to increase their page weight (e.g. link farms)

Page 33: Link analysis

A Similar Approach

• Rafiei and Mendelzon (2000) and Ng et al. (2001) propose similar approaches using reset as in PageRank
• Unlike PageRank, in this model the surfer will follow a forward link on odd steps but a backward link on even steps
• The stability properties of these ranking distributions are similar to those of PageRank (Ng et al. 2001)

Page 34: Link analysis

Overcoming TKC

• Similarity downweight sequencing and sequential clustering (Roberts and Rosenthal 2003)
  - Consider the underlying structure of clusters
  - Suggest downweight sequencing to avoid the Tightly Knit Community problem
  - Results indicate the approach is effective for the few queries tested, but it is still untested on a large scale

Page 35: Link analysis

PHITS and More

• PHITS: Cohn and Chang (2000)
  - Only the principal eigenvector is extracted using SALSA, so the authority along the remaining eigenvectors is completely neglected
  - Account for more eigenvectors of the co-citation matrix
• See also Lempel, Moran (2003)

Page 36: Link analysis

Limits of Link Analysis

• META tags / invisible text
  - Search engines relying on meta tags in documents are often misled (intentionally) by web developers
• Pay-for-place
  - Search engine bias: organizations pay search engines for page rank
  - Advertisements: organizations pay high-ranking pages for advertising space
    - The primary effect is increased visibility to end users; a secondary effect is increased respectability due to relevance to a high-ranking page

Page 37: Link analysis

Limits of Link Analysis

• Stability
  - Adding even a small number of nodes/edges to the graph has a significant impact
• Topic drift (similar to TKC)
  - A top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page
• Content evolution
  - Adding/removing links/content can affect the intuitive authority rank of a page, requiring recalculation of page ranks

Page 38: Link analysis

Further Reading

• R. Akerkar and P. Lingras, Building an Intelligent Web: Theory and Practice, Jones & Bartlett, 2008
  http://www.jbpub.com/catalog/9780763741372/