ACFOCS 2004 Chakrabarti 1

Using Graphs in Unstructured and Semistructured Data Mining

Soumen Chakrabarti

IIT Bombay

www.cse.iitb.ac.in/~soumen


Acknowledgments C. Faloutsos, CMU W. Cohen, CMU IBM Almaden (many colleagues) IIT Bombay (many students) S. Sarawagi, IIT Bombay S. Sudarshan, IIT Bombay


Graphs are everywhere Phone network, Internet, Web Databases, XML, email, blogs Web of trust (epinion) Text and language artifacts (WordNet) Commodity distribution networks

Internet Map [lumeta.com]

Food Web [Martinez1991]

Protein Interactions [genomebiology.com]


Why analyze graphs?
What properties do real-life graphs have?
How important is a node? What is importance?
Who is the best customer to target in a social network?
Who spread a raging rumor?
How similar are two nodes?
How do nodes influence each other?
Can I predict some property of a node based on its neighborhood?


Outline, some more detail Part 1 (Modeling graphs)

What do real-life graphs look like? What laws govern their formation, evolution and

properties? What structural analyses are useful?

Part 2 (Analyzing graphs) Modeling data analysis problems using graphs Proposing parametric models Estimating parameters Applications from Web search and text mining


Modeling and generating realistic graphs


Questions What do real graphs look like?

Edges, communities, clustering effects

What properties of nodes, edges are important to model? Degree, paths, cycles, …

What local and global properties are important to measure?

How to artificially generate realistic graphs?


Modeling: why care?
Algorithm design: can a skewed degree distribution make our algorithm faster?
Extrapolation: how well will Pagerank work on the Web 10 years from now?
Sampling: make sure a scaled-down algorithm shows the same performance/behavior on large-scale data
Deviation detection: is this page trying to spam the search engine?


Laws – degree distributions
Q: The average degree is ~10; what is the most probable degree?
[Figure: two sketches of count vs. degree, each with degree 10 marked on the horizontal axis; the first leaves the shape of the curve as "??"]


Power-law: outdegree exponent O
The plot is linear in log-log scale [FFF’99]: $\text{freq} \propto \text{degree}^{-2.15}$
Exponent = slope: $O = -2.15$
[Figure: log-log plot of frequency vs. outdegree, Nov ’97]


Power-law: rank exponent R
The plot is a line in log-log scale
Exponent = slope: $R = -0.74$
Rank: nodes in decreasing outdegree order
[Figure: log-log plot of outdegree vs. rank, Dec ’98]


Eigenvalues
Let A be the adjacency matrix of the graph
The eigenvalue $\lambda$ satisfies $A v = \lambda v$, where $v$ is some vector
Eigenvalues are strongly related to graph topology

[Figure: example graph on nodes A, B, C, D and its adjacency matrix]

     A  B  C  D
A       1
B    1     1  1
C       1
D       1


Power-law: eigenvalue exponent E
Eigenvalues in decreasing order
Exponent = slope: $E = -0.48$
[Figure: log-log plot of eigenvalue vs. rank of decreasing eigenvalue, Dec ’98]


The Node Neighborhood
N(h) = # of pairs of nodes within h hops
Let average degree = 3
How many neighbors should I expect within 1, 2, …, h hops?
Potential answer: 1 hop -> 3 neighbors, 2 hops -> 3 * 3, …, h hops -> $3^h$
WE HAVE DUPLICATES!
‘avg’ degree: meaningless!


Power-law: hop-plot exponent H
Pairs of nodes as a function of hops: $N(h) \propto h^{H}$
[Figure: log-log plots of # of pairs vs. hops. Dec ’98: H = 4.86. Router level ’95: H = 2.83]


Observation
Q: Intuition behind ‘hop exponent’?
A: ‘intrinsic = fractal dimensionality’ of the network
[Figure: example networks with $N(h) \sim h^1$, $N(h) \sim h^2$, ...]


Any other ‘laws’? The Web looks like a “bow-tie”

[Kumar+1999] IN, SCC, OUT, ‘tendrils’ Disconnected components


Generators
How to generate graphs from a realistic distribution?
Difficulty: simultaneously preserving many local and global properties seen in realistic graphs
Erdos-Renyi: switch on each edge independently with some probability
Problem: degree distribution not power-law
Degree-based
Process-based (“preferential attachment”)


Degree-based generator
Fix the degree distribution (e.g., ‘Zipf’)
Assign degrees to nodes
Add matching edges to satisfy degrees
No direct control over other properties
“ACL model” [AielloCL2000]


Process-based: Preferential attachment
Start with a clique with m nodes
Add one node v at every time step
v makes m links to old nodes
Suppose old node u has degree d(u)
Let $p_u = d(u) / \sum_w d(w)$
v invokes a multinomial distribution defined by the set of p’s and links to whichever u’s show up
At time t, there are m+t nodes, mt links
What is the degree distribution?
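
The growth process on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `preferential_attachment`, the `seed` parameter and the endpoint-list sampling trick are not from the slides, repeated targets are kept as parallel edges, and m is assumed to be at least 2 so that the starting clique has an edge.

```python
import random

def preferential_attachment(m, steps, seed=0):
    """Grow a graph from an m-clique, adding one node with m links per step.

    Assumes m >= 2. Returns a list of (old_node, new_node) edges.
    """
    rng = random.Random(seed)
    # Starting clique on nodes 0..m-1.
    edges = [(i, j) for i in range(m) for j in range(i + 1, m)]
    # Every node appears here once per incident edge, so a uniform draw
    # picks node u with probability d(u) / sum_w d(w).
    endpoints = [x for e in edges for x in e]
    for t in range(steps):
        v = m + t
        # m draws from the degree-proportional distribution over old nodes.
        targets = [rng.choice(endpoints) for _ in range(m)]
        for u in targets:
            edges.append((u, v))
            endpoints.extend((u, v))
    return edges
```

Counting how often each node appears in the returned edge list gives its degree, which can then be plotted on a log-log scale to inspect the tail.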


Preferential attachment: analysis
$k_i(t)$ = degree of node i at time t
Discrete random variable; approximate as continuous random variable
Let $\kappa_i(t) = E(k_i(t))$, expectation over random linking choices
At time t, the infinitesimal expected growth rate of $\kappa_i(t)$ is, by linearity of expectation,
$$\frac{\partial \kappa_i(t)}{\partial t} = m \cdot \frac{\kappa_i(t)}{2mt} = \frac{\kappa_i(t)}{2t}$$
(m degrees to add; 2mt is the total degree at t)
$$\kappa_i(t) = m \sqrt{t / t_i}$$
where $t_i$ is the time at which node i was born


Preferential attachment, continued
Expected degree of each node grows as square-root of age: $\kappa_i(t) = m \sqrt{t / t_i}$
Let the current time be t
A node must be old enough for its degree to be large; for $\kappa_i(t) > k$, we need $t_i < m^2 t / k^2$
Therefore, the fraction of nodes with degree larger than k is $\frac{m^2 t / k^2}{m + t} \approx \frac{m^2}{k^2}$
Differentiating with respect to k: $\Pr(\text{degree} = k) \approx \text{const}/k^3$ (data closer to 2)


Bipartite cores
Basic preferential attachment does not
Explain dense/complete bipartite cores (100,000’s in an O(20 million)-page crawl)
Account for influence of search engines
The story isn’t over yet
[Figure: a 2:3 core (n:m core); plot of number of cores vs. log m, with curves labelled n=2 and n=7]


Other process-based generators
“Copying model” [KumarRRTU2000]
New node v picks old reference node r u.a.r.
v adds k new links to old nodes; for the ith link:
W.p. α add a link to an old node picked u.a.r.
W.p. 1–α copy the ith link from r
More difficult to analyze
Reference node compression techniques!
H.O.T.: connect to closest, high-connectivity neighbor [Fabrikant+2002]
Winner does not take all [Pennock+2002]


Reference-based graph compression
Well-motivated: pack graph into limited fast memory for, e.g., query-time Web analysis
Standard approach
Assign integer IDs to URLs, lexicographic order
Delta or Gamma encoding of outlink IDs
If link-copying is rampant, should be able to compress Outlinks(u) by recording
A reference node r
The difference between Outlinks(r) and Outlinks(u) …the “correction”
Finding r: what’s optimal? practical? [Adler, Mitzenmacher, Boldi, Vigna 2002–2004]


Reference-based compression, cont’d
r is a candidate reference for u if the overlap of Outlinks(r) and Outlinks(u) is “large enough”
Given G, construct G’ in which…
Directed edge from r to u with edge cost = number of bits needed to write down the difference between Outlinks(u) and Outlinks(r)
Dummy node z; z has no outlinks in G
z connected to each u in G’
cost(z,u) = #bits to write Outlinks(u) w/o ref
Shortest path tree rooted at z
In practice, pick “recent” r… 2.58 bits/link


Summary: Power laws are everywhere
Bible: rank vs. word frequency
Length of file transfers [Bestavros+]
Web hit counts [Huberman]
Click-stream data [Montgomery+01]
Lotka’s law of publication count (CiteSeer data)
[Figure: log(freq) vs. log(rank) for words, with “the” and “a” marked; log(count) vs. log(#citations), with J. Ullman marked]


Resources
Generators
R-MAT [email protected]
BRITE www.cs.bu.edu/brite/
INET topology.eecs.umich.edu/inet
Visualization tools
Graphviz www.graphviz.org
Pajek vlado.fmf.uni-lj.si/pub/networks/pajek
Kevin Bacon web site www.cs.virginia.edu/oracle
Erdös numbers etc.


R-MAT: Recursive MATrix generator
Goals
Power-law in- and out-degrees
Power-law eigenvalues
Small diameter (“six degrees of separation”)
Simple, few parameters
Approach
Subdivide the adjacency matrix
Choose a quadrant with probability (a,b,c,d)
[Figure: $2^n \times 2^n$ adjacency matrix (From vs. To) split into quadrants a (0.5), b (0.1), c (0.15), d (0.25)]


R-MAT algorithm, cont’d
Subdivide the adjacency matrix
Choose a quadrant with probability (a,b,c,d)
Recurse till we reach a $1 \times 1$ cell
By construction
Rich gets richer for in- and out-degree
Self-similar (communities within communities)
Small diameter
[Figure: $2^n \times 2^n$ matrix with the chosen quadrant recursively subdivided into a, b, c, d]
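
The recursive quadrant choice can be sketched as follows. This is a minimal sketch, not the authors' generator: the function names, the `seed` parameter, and the quadrant layout (a top-left, b top-right, c bottom-left, d bottom-right) are assumptions, and duplicate edges are simply kept.

```python
import random

def rmat_edge(n, a, b, c, d, rng):
    """Pick one cell of a 2^n x 2^n adjacency matrix by n quadrant choices.

    d is implied by a + b + c + d = 1; it is listed only for readability.
    """
    row = col = 0
    for _ in range(n):
        r = rng.random()
        row, col = 2 * row, 2 * col   # descend one level of the recursion
        if r < a:                     # assumed top-left quadrant
            pass
        elif r < a + b:               # assumed top-right quadrant
            col += 1
        elif r < a + b + c:           # assumed bottom-left quadrant
            row += 1
        else:                         # assumed bottom-right quadrant
            row += 1
            col += 1
    return row, col

def rmat_graph(n, num_edges, probs=(0.5, 0.1, 0.15, 0.25), seed=0):
    """Drop num_edges edges (from, to) into the matrix, one R-MAT draw each."""
    rng = random.Random(seed)
    a, b, c, d = probs
    return [rmat_edge(n, a, b, c, d, rng) for _ in range(num_edges)]
```

The default probabilities are the (a, b, c, d) values shown on the slide; skewing them further towards a concentrates edges on low-numbered nodes.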


Evaluation on clickstream data

Count vs Indegree Count vs Outdegree Hop-plot

Left “Network value” Right “Network value”

R-MAT matches it well

Singular value vs Rank


Topic structure of the Web
Measure correlations between link proximity and content similarity
How to characterize “topics”?
Started with http://dmoz.org
Keep pruning until all leaf topics have enough (>300) samples
Approx 120k sample URLs
Flatten to approx 482 topics
Train a text classifier
Characterize new document d as a vector of probabilities $p_d = (\Pr(c|d)\ \forall c)$
[Figure: test doc -> classifier -> topic probabilities, e.g. Arts 0.1, Computers 0.3, Science 0.6]


Sampling the background topic distribution
What fraction of Web pages are about /Health/Diabetes?
How to sample the Web?
Invoke the “random surfer” model (Pagerank)
Walk from node to node
Sample trail adjusting for Pagerank
Modify Web graph to do better sampling
Self loops
Bidirectional edges


Convergence
Start from pairs of diverse topics
Two random walks, sample from each walk
Measure distance between topic distributions
L1 distance $|p_1 - p_2| = \sum_c |p_1(c) - p_2(c)|$ in [0,2]
Below .05–.2 within 300–400 physical pages
[Figure: background distribution over the top-level topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports (vertical axis 0 to 0.4); distribution difference vs. #hops (0 to 1000) for Stride=30k and Stride=75k]


Biases in topic directories
Use Dmoz to train a classifier
Sample the Web
Classify samples
Diff Dmoz topic distribution from Web sample topic distribution
Report maximum deviation in fractions
NOTE: Not exactly Dmoz
Dmoz over-represents: Games.Video_Games, Society.People, Arts.Celebrities, ...Education.Colleges, ...Travel.Reservations
Dmoz under-represents: …WWW…Directories!, Sports.Hockey, Society.Philosophy, Education…K12…, Recreation…Camping


Topic-specific degree distribution
Preferential attachment: connect u to v w.p. proportional to the degree of v, regardless of topic
More realistic: u has a topic, and links to v with related topics
Unclear if power-law should be upheld
[Figure: intra-topic linkage vs. inter-topic linkage]


Random forward walk without jumps
Sampling walk is designed to mix topics well
How about walking forward without jumping?
Start from a page $u_0$ on a specific topic
Forward random walk $(u_0, u_1, \ldots, u_i, \ldots)$
Compare $(\Pr(c|u_i)\ \forall c)$ with $(\Pr(c|u_0)\ \forall c)$ and with the background distribution
[Figure: L1 distance vs. wander hops (0 to 20) for /Arts/Music and /Sports/Soccer, measured from the background distribution and from hop 0]


Observations and implications
Forward walks wander away from starting topic slowly
But do not converge to the background distribution
Global PageRank ok also for topic-specific queries
Jump parameter d=.1–.2
Topic drift not too bad within path length of 5–10
Prestige conferred mostly by same-topic neighbors
Also explains why focused crawling works
[Figure: random surfer; w.p. d jump to a random node, w.p. (1-d) jump to an out-neighbor u.a.r.; high-prestige node]


Citation matrix
Given a page is about topic i, how likely is it to link to topic j?
Matrix C[i,j] = probability that page about topic i links to page about topic j
Soft counting: for each link (u,v), C[i,j] += Pr(i|u)Pr(j|v)
Applications
Classifying Web pages into topics
Focused crawling for topic-specific pages
Finding relations between topics in a directory
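
Soft counting over links can be sketched as below. The function name `citation_matrix`, the dictionary `topic_probs` (page -> list of Pr(topic | page), assumed to come from the text classifier) and the final row normalization are assumptions made for illustration; the slide only gives the accumulation rule.

```python
def citation_matrix(links, topic_probs, num_topics):
    """Accumulate C[i][j] += Pr(i|u) * Pr(j|v) over links (u, v)."""
    C = [[0.0] * num_topics for _ in range(num_topics)]
    for u, v in links:
        pu, pv = topic_probs[u], topic_probs[v]
        for i in range(num_topics):
            for j in range(num_topics):
                C[i][j] += pu[i] * pv[j]
    # Assumed step: normalize each row so C[i][j] reads as
    # Pr(target is about topic j | source is about topic i).
    for row in C:
        total = sum(row)
        if total > 0:
            for j in range(num_topics):
                row[j] /= total
    return C
```

Rows for topics that never appear as a link source stay all zero rather than being given an arbitrary distribution.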


Citation, confusion, correction
[Figure: topic-by-topic matrices over Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports; axes are from topic vs. to topic, and true topic vs. guessed topic]
Classifier’s confusion on held-out documents can be used to correct confusion matrix


Fine-grained views of citation
Clear block-structure derived from coarse-grain topics
Strong diagonals reflect tightly-knit topic communities
Prominent off-diagonal entries raise design issues for taxonomy editors and maintainers


Outline, some more detail Part 1 (Modeling graphs)

What do real-life graphs look like? What laws govern their formation, evolution and

properties? What structural analyses are useful?

Part 2 (Analyzing graphs) Modeling data analysis problems using graphs Proposing parametric models Estimating parameters Applications from Web search and text mining


Centrality and prestige


How important is a node? Degree, min-max radius, … Pagerank Maximum entropy network flows HITS and stochastic variants Stability and susceptibility to spamming Hypergraphs and nonlinear systems Using other hypertext properties Applications: Ranking, crawling, clustering,

detecting obsolete pages


Importance/prestige as Pagerank
A node is important if it is connected to important nodes
“Random surfer” walks along links for ever
If current node has 3 outlinks, take each with probability 1/3
Importance = steady-state probability (ssp) of visit
“Maxwell’s equation for the Web”
$$PR(v) = \sum_{(u,v) \in E} \frac{PR(u)}{\text{OutDegree}(u)}$$
[Figure: node u with OutDegree(u)=3 linking to v]


(Simplified) Pagerank algorithm
Let A be the node adjacency matrix
Column normalize $A^T$
Want vector p such that $A^T p = p$
I.e. p is the eigenvector corresponding to the largest eigenvalue, which is 1
[Figure: example graph on nodes 1 to 5 and the equation $A^T p = p$ with $p = (p_1, \ldots, p_5)^T$; rows of $A^T$ are indexed by “To”, columns by “From”, with entries 1 and 1/2]


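Intuition
A as vector transformation
$$A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \quad x = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad x' = A x = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$$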


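Intuition
By defn., eigenvectors remain parallel to themselves (‘fixed points’)
$$A v_1 = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} 0.52 \\ 0.85 \end{pmatrix} = 3.62 \begin{pmatrix} 0.52 \\ 0.85 \end{pmatrix} = \lambda_1 v_1$$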


Convergence
Usually, fast: depends on ratio $\lambda_1 : \lambda_2$


Eigenvalues and epidemic networks
Will a virus spreading across an arbitrary network create an epidemic?
Susceptible-Infected-Susceptible (SIS)
Was healthy, did not get infected
Was infected, got cured without further attack
Was infected, got cured immediately after attack
Cured nodes immediately become susceptible
[Figure: two states, susceptible/healthy and infected & infectious; transitions “infected by neighbor” and “cured internally”]


The infection model
(virus) Birth rate β: probability that an infected neighbor attacks
(virus) Death rate δ: probability that an infected node heals
[Figure: healthy node N with infected neighbors N1, N2, N3; each neighbor attacks with prob. β; an infected node heals with prob. δ]


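Working parameters
$p_{i,t}$ = prob. node i is infected at time t
$\zeta_{i,t}$ = prob. that node i does NOT receive an infection from neighbors
$$\zeta_{i,t} = \prod_{(i,j) \in E} \left( p_{j,t-1}(1-\beta) + (1 - p_{j,t-1}) \right) = \prod_{(i,j) \in E} (1 - \beta\, p_{j,t-1})$$
$1 - p_{i,t}$ is the sum of three cases:
Was healthy, did not get infected: $(1 - p_{i,t-1})\, \zeta_{i,t}$
Was infected, got cured without further attack: $\delta\, p_{i,t-1}\, \zeta_{i,t}$
Was infected, got attacks from neighbors and yet cured itself in this time step … somewhat arbitrary: $\tfrac{1}{2}\, \delta\, p_{i,t-1}\, (1 - \zeta_{i,t})$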


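Time recurrence for $p_{i,t}$
Assuming various probabilities are “suitably small”
$$\prod_{(i,j) \in E} (1 - \beta\, p_{j,t-1}) \approx 1 - \beta \sum_{(i,j) \in E} p_{j,t-1}$$
Recurrence of the probability of infection can be approximated by this linear form:
$$p_{i,t} \approx (1 - \delta)\, p_{i,t-1} + \beta \sum_{(i,j) \in E} p_{j,t-1}$$
In other words, the (symmetric) transition matrix is
$$S = (1 - \delta) I + \beta A$$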


Epidemic threshold
Virus ‘strength’ $s = \beta/\delta$
Epidemic threshold $\tau$ of a graph is defined as the value of $\tau$ such that if strength $s = \beta/\delta < \tau$ then an epidemic cannot happen
Problem: compute epidemic threshold
[Figure: healthy node N with infected neighbors N1, N2, N3; attack prob. β, healing prob. δ]


Epidemic threshold
What should $\tau$ depend on?
avg. degree? and/or highest degree?
and/or variance of degree?
and/or third moment of degree?


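Analysis
Eigenvectors of S and A are the same
Eigenvalues are shifted and scaled: $\lambda_{i,S} = 1 - \delta + \beta \lambda_{i,A}$
From spectral decomposition
$$p(n) = S^n p(0) = \sum_i \lambda_{i,S}^{\,n}\, u_{i,S}\, u_{i,S}^T\, p(0)$$
A sufficient condition for infection dying down is that
$$\lambda_{1,S} < 1 \quad\text{or}\quad 1 - \delta + \beta \lambda_{1,A} < 1 \quad\text{or}\quad \beta/\delta < 1/\lambda_{1,A}$$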


Epidemic threshold
An epidemic must die down if
$$\beta/\delta < \tau = 1/\lambda_{1,A}$$
β = attack prob., δ = recovery prob., τ = epidemic threshold, $\lambda_{1,A}$ = largest eigenvalue of adj. matrix A
Proof: [Wang+03]
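
The threshold test can be sketched with plain power iteration. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the graph is an undirected 0/1 adjacency matrix (list of lists) with at least one edge, and the iteration runs on A + I, a choice made here so that bipartite graphs, whose spectrum contains both +λ and -λ, still converge.

```python
def largest_eigenvalue(adj, iters=500):
    """Largest eigenvalue of a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    v = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        # One multiplication by (A + I), then rescale so max(v) == 1.
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)
        v = [x / lam for x in w]
    return lam - 1.0  # undo the shift by I

def epidemic_threshold(adj):
    """tau = 1 / lambda_1 of the adjacency matrix (needs at least one edge)."""
    return 1.0 / largest_eigenvalue(adj)

def dies_out(beta, delta, adj):
    """True when beta/delta is below the threshold, the sufficient condition."""
    return beta / delta < epidemic_threshold(adj)
```

For a star with three leaves the largest eigenvalue is the square root of 3, so the threshold is about 0.577.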


Experiments
$\beta/\delta > \tau$ (above threshold)
$\beta/\delta = \tau$ (at the threshold)
$\beta/\delta < \tau$ (below threshold)


Remarks “Primal” problem: topology design

Design a network resilient to infection

“Dual” problem: viral marketing Selectively convert important customers who

can influence many others

Will come back to this later


Back to the random surfer
Practical issues with Pagerank [BrinP1997]
PR converges only if E is aperiodic and irreducible; make it so:
$$PR(v) = \frac{d}{N} + (1 - d) \sum_{(u,v) \in E} \frac{PR(u)}{\text{OutDegree}(u)}$$
d is the (tuned) probability of “teleporting” to one of N pages uniformly at random (0.1–0.2 in practice)
(Possibly) unintended consequences: topic sensitivity, stability
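
The teleporting random surfer can be sketched as a fixed number of update sweeps. This is a hedged sketch, not a production implementation: the function name `pagerank`, the dictionary representation (every node is a key mapping to its list of out-neighbors) and the uniform redistribution of rank from nodes without outlinks are assumptions.

```python
def pagerank(out_links, d=0.15, iters=100):
    """Iterate PR(v) = d/N + (1 - d) * sum over in-links of PR(u)/OutDegree(u).

    out_links: dict node -> list of out-neighbors; every target must be a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: d / n for v in nodes}          # teleport share
        for u in nodes:
            targets = out_links[u]
            if targets:
                share = (1 - d) * pr[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                # Assumed handling of dangling nodes: spread rank uniformly.
                for v in nodes:
                    nxt[v] += (1 - d) * pr[u] / n
        pr = nxt
    return pr
```

A page with no in-links, such as `c` in `pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})`, ends up with exactly the teleport share d/N.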


Prestige as network flow
$y_{ij}$ = #surfers clicking from i to j per unit time
Hits per unit time on page j is $H_j = \sum_{(i,j) \in E} y_{ij}$
Flow is conserved at j: $\sum_{(i,j) \in E} y_{ij} = \sum_{(j,k) \in E} y_{jk}$
The total traffic is $Y = \sum_j H_j = \sum_{i,j} y_{ij}$
Normalize: $p_{ij} = y_{ij} / Y$
Can interpret $p_{ij}$ as a probability
Standard Pagerank corresponds to one solution: $p_{ij} = H_i / (Y \cdot \text{OutDegree}(i))$
Many other solutions possible


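Maximum entropy flow [Tomlin2003]
Flow conservation modeled using features, for $r = 1, \ldots, N$:
$$f_r(x_{ij}) = \begin{cases} 1 & j = r,\ (i,r) \in E \\ -1 & i = r,\ (r,j) \in E \\ 0 & \text{otherwise} \end{cases}$$
And the constraints $E(f_r(x_{ij})) = \sum_{i,j} p_{ij} f_r(x_{ij}) = 0$
Goal is to maximize $-\sum_{i,j} p_{ij} \log p_{ij}$ subject to $\sum_{i,j} p_{ij} = 1$
Solution has form $p_{ij} = \exp(-\lambda_0 - \lambda_j + \lambda_i)$
$\lambda_i$ is the “hotness” of page i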


Maxent flow results
Two IBM intranet data sets with known top URLs
Average rank ($10^6$) of known top URLs when sorted by Pagerank, $\lambda_i$, $H_i$ (smaller rank is better)
$\lambda_i$ ranking is better than Pagerank; $H_i$ ranking is worse
Depth up to which dmoz.org URLs are used as ground truth; average rank ($10^8$)


HITS [Kleinberg1997]
Two kinds of prestige
Good hubs link to good authorities
Good authorities are linked to by good hubs
$$a(v) = \sum_{(u,v) \in E} h(u); \qquad h(u) = \sum_{(u,v) \in E} a(v)$$
In matrix notation, iterations amount to
$$a = E^T h, \quad h = E a, \quad \text{i.e.,} \quad a = E^T E\, a$$
with interleaved normalization of h and a
Note that scores are copied not divided
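
The hub/authority iteration with interleaved normalization can be sketched as follows. The function name `hits`, the edge-list input and the choice of the L2 norm are assumptions for illustration; the slides do not fix a norm.

```python
import math

def hits(edges, iters=50):
    """Return (hub, auth) score dicts for a list of directed edges (u, v)."""
    nodes = sorted({x for e in edges for x in e})
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # a = E^T h: an authority collects (copies) the scores of its hubs.
        auth = {v: 0.0 for v in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # h = E a: a hub collects the scores of the authorities it links to.
        hub = {v: 0.0 for v in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # Interleaved normalization (L2 norm assumed here).
        for vec in (auth, hub):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for k in vec:
                vec[k] /= norm
    return hub, auth
```

Scores are copied along each edge rather than divided by out-degree, which is the contrast with Pagerank noted on the slide.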


HITS graph acquisition
Steps
Get root set via keyword query
Expand by one move forward and backward
Drop same-site links (why?)
Each node assigned both a hub and an auth score
Graph tends to be “topic-specific”
Whereas Pagerank is usually run on “the entire Web” (but need not be)
[Figure: IN, root set, OUT]


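HITS vs. Pagerank, more comments
Dominant eigenvectors of different matrices
HITS: Eigensystems of $E E^T$ (h) and $E^T E$ (a)
Pagerank: eigensystem of $dJ + (1 - d) L^T$ where
$$J = \begin{pmatrix} 1/N & \cdots & 1/N \\ \vdots & \ddots & \vdots \\ 1/N & \cdots & 1/N \end{pmatrix} \quad\text{and}\quad L(i,j) = \begin{cases} 1/\text{OutDegree}(i) & (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$
HITS copies scores, Pagerank distributes
HITS drops same-site links, Pagerank does (can?) not
Implications?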


HITS: Dyadic interpretation [CohnC2000]
Graph includes many communities z
Query=“Jaguar” gets auto, game, animal links
Each URL is represented as two things
A document d
A citation c
Max
$$\prod_{(d,c) \in E} \Pr(d,c) = \prod_{(d,c) \in E} \Pr(d) \Pr(c|d) = \prod_{(d,c) \in E} \Pr(d) \sum_z \Pr(z|d) \Pr(c|z)$$
Guess number of aspects zs and use [Hofmann 1999] to estimate Pr(c|z)
These are the most authoritative URLs


Dyadic results for “Machine learning”

Clustering based on citations + ranking within clusters


Spamming link-based ranking Recipe for spamming HITS

Create a hub linking to genuine authorities Then mix in links to your customers’ sites Highly susceptible to adversarial behavior

Recipe for spamming Pagerank Buy a bunch of domains, cloak IP addresses Host a site at each domain Sprinkle a few links at random per page to other

sites you own Takes more work than spamming HITS


Stability of link analysis [NgZJ2001]
Compute HITS authority scores and Pagerank
Delete 30% of nodes/links at random
Recompute and compare ranks; repeat
Pagerank ranks more stable than HITS authority ranks
Why?
How to design more stable algorithms?

HITS Authority (one row of ranks per page)
 1    3    1    1    1
 2    5    3    3    2
 3   12    6    6    3
 4   52   20   23    4
 5  171  119   99    5
 6  135   56   40    8
10  179  159  100    7
 8  316  141  170    6

Pagerank (one row of ranks per page)
1  1  1  1  1
2  2  2  2  2
3  5  6  4  5
4  3  5  5  4
5  6  3  6  3
6  4  4  3  6
7  7  7  7  7
8  8  8  8  9


Stability depends on graph and params
Auth score is eigenvector for $E^T E = S$, say
Let $\lambda_1 > \lambda_2$ be the first two eigenvalues
There exists an S’ such that
S and S’ are close: $\|S - S'\|_F = O(\lambda_1 - \lambda_2)$
But $\|u_1 - u'_1\|_2 = \Omega(1)$
Pagerank p is eigenvector of $(\epsilon U + (1 - \epsilon) E)^T$
U is a matrix full of 1/N and $\epsilon$ is the jump prob
If set C of nodes are changed in any way, the new Pagerank vector p’ satisfies
$$\|p - p'\|_2 \le \frac{2}{\epsilon} \sum_{u \in C} p_u$$


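Randomized HITS
Each half-step, with probability $\epsilon$, teleport to a node chosen uniformly at random
$$a^{(t+1)} = \epsilon \vec{1} + (1 - \epsilon)\, E_{\text{row}}^T\, h^{(t)}$$
$$h^{(t+1)} = \epsilon \vec{1} + (1 - \epsilon)\, E_{\text{col}}\, a^{(t+1)}$$
Much more stable than HITS
Results more meaningful
$\epsilon$ near 1 will always stabilize
Here $\epsilon$ was 0.2

[Two rank tables, labelled Pagerank and Randomized HITS on the slide; one row of ranks per page]

1  3  3  2  1
4  1  1  1  2
2  2  2  3  4
3  4  4  4  3
5  6  6  6  5
6  5  5  5  6
7  7  7  7  7
8  8  8  8  8

1  1  1  1   2
3  2  2  2   1
2  3  3  3   3
4  4  4  4   4
5  6  7  5   5
6  7  6  6   6
7  5  5  7   7
8  9  9  9  11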


Another random walk variation of HITS
SALSA: Stochastic HITS [Lempel+2000]
Two separate random walks
From authority to authority via hub
From hub to hub via authority
Transition probability
$$\Pr(a_i \to a_j) = \sum_{h:\,(h,a_i),\,(h,a_j) \in E} \frac{1}{\text{InDegree}(a_i)} \cdot \frac{1}{\text{OutDegree}(h)}$$
If transition graph is irreducible, $\pi_a \propto \text{InDegree}(a)$
For disconnected components, depends on relative size of bipartite cores
Avoids dominance of larger cores
[Figure: example walk from authority $a_1$ to $a_2$ via hubs, with step probabilities 1/3 and 1/2]


SALSA sample result (“movies”)

HITS: The Tightly-Knit Community (TKC) effect

SALSA: Less TKC influence (but no reinforcement!)


Links in relational data [GibsonKR1998]
(Attribute, value) pair is a node
Each node v has weight $w_v$
Each tuple is a hyperedge
Tuple r has weight $x_r$
HITS-like iterations to update weight $w_v$
For each tuple $r = (v, u_1, \ldots, u_k)$: $x_r = \oplus(w_{u_1}, \ldots, w_{u_k})$
Update weight $w_v = \sum_r x_r$
Combining operator $\oplus$ can be sum, max, product, $L_p$ avg, etc.


Distilling links in relational data
[Figure: columns Author, Author, Forum, Year; groups labelled Database and Theory]


Searching and annotating graph data


Searching graph data
Nodes in graph contain text
Random -> Intelligent surfer [RichardsonD2001]
Topic-sensitive Pagerank [Haveliwala2002]
Assigning image captions using random walks [PanYFD2004]
Rotting pages and links [BarYossefBKT2004]
Query is a set of keywords
All keywords may not match a single node
Implicit joins [Hulgeri+2001, Agrawal+2002]
Or rank aggregation [Balmin+2004] required


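Intelligent Web surfer
Query Q = set of words; pick a query word q per some distribution Pr(q), e.g. IDF
$$\mathrm{Pr}_q(j) = (1 - \beta)\, \mathrm{Pr}'_q(j) + \beta \sum_{(i,j) \in E} \mathrm{Pr}_q(i)\, \mathrm{Pr}_q(i \to j)$$
Probability of teleporting to node j for keyword q, where $R_q(k)$ is the relevance of node k wrt q:
$$\mathrm{Pr}'_q(j) = \frac{R_q(j)}{\sum_{k \in V} R_q(k)}$$
Probability of walking from i to j wrt q (pick out-link to walk on in proportion to relevance of target out-neighbor):
$$\mathrm{Pr}_q(i \to j) = \frac{R_q(j)}{\sum_{(i,k) \in E} R_q(k)}$$
$$PR_Q(j) = \sum_{q \in Q} \Pr(q)\, \mathrm{Pr}_q(j)$$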


Implementing the intelligent surfer

PRQ(j) approximates a walk that picks a query keyword using Pr(q) at every step

Precompute and store Prq(j) for each keyword q in lexicon: space blowup = avg doc length

Query-dependent PR rated better by volunteers


Topic-sensitive Pagerank
High overhead for per-word Pagerank
Instead, compute Pageranks for some collection of broad topics: $PR_c(j)$
Topic c has sample page set $S_c$
Walk as in Pagerank
Jump to a node in $S_c$ uniformly at random
“Project” query onto set of topics
$$\Pr(c|Q) \propto \Pr(c) \prod_{q \in Q} \Pr(q|c)$$
Rank responses by projection-weighted Pageranks
$$\text{Score}(Q, j) = \sum_c \Pr(c|Q)\, PR_c(j)$$


Topic-sensitive Pagerank results

Users prefer topic-sensitive Pagerank on most queries to global Pagerank + keyword filter


Image captioning
Segment images into regions
Image has caption words
Three-layer graph: image, regions, caption words
Threshold on region similarity to connect regions (dotted)


Random walks with restarts
Find regions in test image
Connect regions to other nodes in the region layer using region similarity
Random walk, restarting at test image node
Pick words with largest visit probability
[Figure: three-layer graph of images, regions and words, with the test image attached]
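
A random walk with restart on such a layered graph can be sketched as below. The function name, the `restart` probability of 0.15, the fixed iteration count and the adjacency-dictionary representation are all assumptions for illustration; in the captioning setting the graph would hold image, region and word nodes.

```python
def random_walk_with_restart(neighbors, start, restart=0.15, iters=200):
    """Approximate visit probabilities of a walk that restarts at `start`.

    neighbors: dict node -> list of adjacent nodes (every node is a key).
    """
    nodes = list(neighbors)
    p = {v: 0.0 for v in nodes}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        nxt[start] = restart                     # restart mass
        for u in nodes:
            nbrs = neighbors[u]
            if not nbrs:
                # Assumed handling of isolated nodes: return to the start.
                nxt[start] += (1 - restart) * p[u]
                continue
            share = (1 - restart) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        p = nxt
    return p
```

Candidate caption words are then the word nodes with the largest entries in the returned dictionary; words that are not connected to the test image receive zero probability.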


More random walks: “Link rot”
How stale is a page?
Last-mod unreliable
Automatic dead-link cleaners mask disuse
A page is “completely stale” if it is “dead”
Let D be the set of pages which cannot be accessed (404 and other problems)
How stale is a page u? Start with $p \leftarrow u$
If $p \in D$ declare decay value of u to be 1, else
With probability $\sigma$ declare decay value of u = 0
W.p. $1 - \sigma$ choose outlink v, set $p \leftarrow v$, loop


Page staleness results

Decay score is correlated with, but generally larger than the fraction of dead outlinks on a page

Removing direct dead links automatically does not eliminate live but “rotting” pages

404s

Decay


Graph proximity search: two paradigms A single node as query response

Find node that matches query terms… …or is “near” nodes matching query terms

[Goldman+ 1998]

A connected subgraph as query response Single node may not match all keywords No natural “page boundary” [Bhalotia+2002]

[Agrawal+2002]


Single-node response examples
Travolta, Cage -> Actor, Face/Off
Travolta, Cage, Movie -> Face/Off
Kleiser, Movie -> Gathering, Grease
Kleiser, Woo, Actor -> Travolta
[Figure: graph with nodes Travolta, Cage, Kleiser, Woo, Face/Off, Grease, Gathering and type nodes Actor, Movie, Director; edge labels “acted-in”, “directed”, “is-a”]


Basic search strategy
Node subset A activated because they match query keyword(s)
Look for node near nodes that are activated
Goodness of response node depends
Directly on degree of activation
Inversely on distance from activated node(s)


Proximity query: screenshot

http://www.cse.iitb.ac.in/banks/


Ranking a single node response
Activated node set A
Rank node r in “response set” R based on proximity to nodes a in A
Nodes have relevance $\rho_R$ and $\rho_A$ in [0,1]
Edge costs are “specified by the system”
d(a,r) = cost of shortest path from a to r
Bond between a and r
$$b(a,r) = \frac{\rho_A(a)\, \rho_R(r)}{d(a,r)^t}$$
Parameter t tunes relative emphasis on distance and relevance score
Several ad-hoc choices


Scoring single response nodes
Additive: $\text{score}(r) = \sum_{a \in A} b(a,r)$
Belief: $\text{score}(r) = 1 - \prod_{a \in A} (1 - b(a,r))$
Goal: list a limited number of find nodes with the largest scores
Performance issues
Assume the graph is in memory?
Precompute all-pairs shortest path ($|V|^3$)?
Prune unpromising candidates?
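
The two ways of combining bonds into a score can be sketched as follows. The function names and the default t = 2 are assumptions, and the bond is taken to be the product of the two relevances divided by the distance raised to the power t; the belief score only behaves like a probability when every bond lies in [0, 1], i.e. when distances are at least 1.

```python
def bond(rel_a, rel_r, dist, t=2.0):
    """Bond between an activated node a and a response node r."""
    return rel_a * rel_r / (dist ** t)

def additive_score(bonds):
    """Sum of the bonds to all activated nodes."""
    return sum(bonds)

def belief_score(bonds):
    """1 - prod(1 - b): bonds combined like independent evidence."""
    remaining = 1.0
    for b in bonds:
        remaining *= (1.0 - b)
    return 1.0 - remaining
```

Two bonds of 0.5 give an additive score of 1.0 but a belief score of 0.75, so the belief form saturates instead of growing without bound.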


Hub indexing
Decompose APSP problem using sparse vertex cuts
|A|+|B| shortest paths to p
|A|+|B| shortest paths to q
d(p,q)
To find d(a,b) compare
d(a→p→b) not through q
d(a→q→b) not through p
d(a→p→q→b)
d(a→q→p→b)
Greatest savings when |A| ≈ |B|
Heuristics to find cuts, e.g. large-degree nodes
[Figure: node sets A and B separated by cut vertices p and q, with a in A and b in B]


ObjectRank [Balmin+2004]
Given a data graph with nodes having text
For each keyword precompute a keyword-sensitive Pagerank [RichardsonD2001]
Score of a node for multiple keyword search based on fuzzy AND/OR
Approximation to Pagerank of node with restarts to nodes matching keywords
Use Fagin-merge [Fagin2002] to get best nodes in data graph


Connected subgraph as response
Single node may not match all keywords
No natural “page boundary”
On-the-fly joins make up a “response page”
Two scenarios
Keyword search on relational data
• Keywords spread among normalized relations
Keyword search on XML-like or Web data
• Keywords spread among DOM nodes and subtrees


Keyword search on relational data
Tuple = node
Some columns have text
Foreign key constraints = edges in schema graph
Query = set of terms
No natural notion of a document
Normalization
Join may be needed to generate results
Cycles may exist in schema graph: ‘Cites’

Schema: Paper(PaperID, PaperName), Author(AuthorID, AuthorName), Writes(AuthorID, PaperID), Cites(Citing, Cited)

Paper
PaperID  PaperName
P1       DBXplorer
P2       BANKS

Author
AuthorID  AuthorName
A1        Chaudhuri
A2        Sudarshan
A3        Hulgeri

Cites
Citing  Cited
P2      P1

Writes
AuthorID  PaperID
A1        P1
A2        P2
A3        P2


DBXplorer and DISCOVER
Enumerate subsets of relations in schema graph which, when joined, may contain rows which have all keywords in the query
“Join trees” derived from schema graph
Output SQL query for each join tree
Generate joins, checking rows for matches [Agrawal+2001], [Hristidis+2002]
[Figure: schema graph over tables T1 to T5 annotated with keywords K1, K2, K3, and candidate join trees built from T2, T3, T4, T5]


Discussion
Exploits relational schema information to contain search
Pushes final extraction of joined tuples into RDBMS
Faster than dealing with full data graph directly
Coarse-grained ranking based on schema tree
Does not model proximity or (dis)similarity of individual tuples
No recipe for data with less regular (e.g. XML) or ill-defined schema


Motivation from Web search
“Linux modem driver for a Thinkpad A22p”
Hyperlink path matches query collectively
Conjunction query would fail
Projects where X and P work together
Conjunction may retrieve wrong page
General notion of graph proximity
[Figure: example pages. “IBM Thinkpads” (A20m, A22p); “Thinkpad Drivers” (Windows XP, Linux); “Download / Installation tips” (Modem, Ethernet); “Home Page of Professor X” (Papers: VLDB…; Students: P, Q); “P’s home page: I work on the B project.”; “The B System” (Group members: P, S, X)]


Data structures for search
Answer = tree with at least one leaf containing each keyword in query
Group Steiner tree problem, NP-hard
Query term t found in source nodes $S_t$
Single-source-shortest-path SSSP iterator
Initialize with a source (near-) node
Consider edges backwards
getNext() returns next nearest node
For each iterator, each visited node v maintains for each t a set $v.R_t$ of nodes in $S_t$ which have reached v


Generic expanding search
Near node sets $S_t$ with $S = \bigcup_t S_t$
For all source nodes $s \in S$ create a SSSP iterator with source s
While more results required
Get next iterator and its next-nearest node v
Let t be the term for the iterator’s source s
crossProduct = $\{s\} \times \prod_{t' \neq t} v.R_{t'}$
For each tuple of nodes in crossProduct
• Create an answer tree rooted at v with paths to each source node in the tuple
Add s to $v.R_t$


Search example (“Vu Kleinberg”)
[Figure: data graph with author nodes Quoc Vu, Jon Kleinberg, Divyakant Agrawal, Eva Tardos; paper nodes “A metric labeling problem”, “Authoritative sources in a hyperlinked environment”, “Organizing Web pages by ‘Information Unit’”; edges labelled writes and cites]


First response
[Figure: the “Vu Kleinberg” data graph (authors Quoc Vu, Jon Kleinberg, Divyakant Agrawal, Eva Tardos; papers “A metric labeling problem”, “Authoritative sources in a hyperlinked environment”, “Organizing Web pages by ‘Information Unit’”; writes and cites edges), showing the first response]

Subgraph search: screenshot

http://www.cse.iitb.ac.in/banks/


Similarity, neighborhood, influence


Why are two nodes similar?
What is/are the best paths connecting two nodes explaining why/how they are related?
Graph of co-starring, citation, telephone call, …
Graph with nodes s and t; budget of b nodes
Find “best” b nodes capturing relationship between s and t [FaloutsosMT2004]:
Proposing a definition of goodness
How to efficiently select best connections
[Figure: example with Negroponte, Palmisano, Esther Dyson, Gerstner]


Simple proposals that do not work
Shortest path
Pizza boy p gets same attention as g
Network flow
s→a→b→t is as good as s→g→t
Voltage
Connect +1V at s, ground t
Both g and p will be at +0.5V
Observations
Must reward parallel paths
Must reward short paths
Must penalize/tax pizza boys
[Figure: example graph on nodes s, a, b, g, p, t]


Resistive network with universal sink
Connect +1V to s
Ground t
Introduce universal sink
Grounded
Connected to every node
Universal sink is a “tax collector”
Penalizes pizza boys
Penalizes long paths
Goodness of a path is the electric current it carries
[Figure: example graph on nodes s, a, b, g, p, t with a universal sink connected to every node]


Resistive network algorithm
Ohm’s law: $I(u,v) = C(u,v)\,[V(u) - V(v)] \quad \forall u, v$
Kirchhoff’s current law: $\forall v \neq s, t:\ \sum_u I(u,v) = 0$
Boundary conditions (without sink): $V(s) = 1,\ V(t) = 0$
Solution: $V(u) = \sum_v \frac{V(v)\, C(u,v)}{\sum_w C(u,w)}$ for $u \neq s, t$
Here C(u,v) is the conductance from u to v
Add grounded universal sink z with V(z)=0
Set $\forall u:\ C(u,z) = \alpha \sum_{w \neq z} C(u,w)$
Display subgraph carrying high current
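
The voltage equations can be solved by repeatedly replacing each free node's voltage with the conductance-weighted average of its neighbors. This is a sketch under stated assumptions: the function names, the nested-dictionary representation of conductances, the fixed number of sweeps and the parameter `alpha` scaling each node's conductance to the grounded sink are illustrative choices, and every node is assumed to have at least one neighbor.

```python
def voltages(conductance, s, t, alpha=0.0, sweeps=2000):
    """Solve for node voltages with V(s) = 1, V(t) = 0.

    conductance: dict u -> dict v -> C(u, v), assumed symmetric.
    alpha: conductance to the grounded universal sink, as a fraction of
           the node's total conductance (0 means no sink).
    """
    V = {u: 0.0 for u in conductance}
    V[s] = 1.0
    for _ in range(sweeps):
        for u in conductance:
            if u == s or u == t:
                continue                      # boundary conditions stay fixed
            total = sum(conductance[u].values())
            sink = alpha * total              # C(u, z); the sink sits at 0 V
            V[u] = sum(c * V[v] for v, c in conductance[u].items()) / (total + sink)
    return V

def current(conductance, V, u, v):
    """Ohm's law: I(u, v) = C(u, v) * (V(u) - V(v))."""
    return conductance[u][v] * (V[u] - V[v])
```

On the path s, g, t with unit conductances, g sits at 0.5 V without the sink and drops to 0.25 V with `alpha=1.0`, which shows how the sink taxes current passing through a node.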


Distributions influenced via graphs
Directed or undirected graph; nodes have
Observable properties
Some unobservable (random) state
Edges indicate that distributions over unobservable states are coupled
Many applications
Hypertext classification (topics are clustered)
Social network of customers buying products
Hierarchical classification
Labeling categorical sequences: POS/NE tagging, sense disambiguation, linkage analysis


Basic techniques

Directed (acyclic) graphs: Bayesian network
Markov networks
  (Loopy) belief propagation
Conditional Markov networks
  Avoid modeling joint distribution over observable and hidden properties of nodes
Some computationally simple special cases


Hypertext classification

Want to assign labels to Web pages
Text on a single page may be too little or misleading
Page is not an isolated instance by itself
Problem setup
  Web graph G=(V,E)
  Node u ∈ V is a page having text u_T
  Edges in E are hyperlinks
  Some nodes are labeled
  Make collective guess at missing labels
Probabilistic model? Benefits?


Graph labeling model

Seek a labeling f of all unlabeled nodes so as to maximize
$$ \Pr\bigl(f(V) \mid E, \{u_T : u \in V\}\bigr) = \frac{\Pr(f(V))\,\Pr\bigl(E, \{u_T : u \in V\} \mid f(V)\bigr)}{\sum_{\phi} \Pr(\phi)\,\Pr\bigl(E, \{u_T : u \in V\} \mid \phi\bigr)} $$
(Luckily we don’t need to worry about the denominator)
Let V^K be the nodes with known labels and f(V^K) their known label assignments
Let N(v) be the neighbors of v and N^K(v) ⊆ N(v) be neighbors with known labels
Markov assumption: f(v) is conditionally independent of rest of G given f(N(v))


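Markov graph labeling

$$ \Pr\bigl(f(v) \mid E, V_T, f(V^K)\bigr) = \sum_{f(N^U(v)) \in \Omega_v} \Pr\bigl(f(v), f(N^U(v)) \mid E, V_T, f(V^K)\bigr) $$
$$ = \sum_{f(N^U(v)) \in \Omega_v} \Pr\bigl(f(N^U(v)) \mid E, V_T, f(V^K)\bigr)\,\Pr\bigl(f(v) \mid f(N^U(v)), E, V_T, f(V^K)\bigr) $$
Here V_T = {u_T : u ∈ V}, N^U(v) are the neighbors of v with unknown labels, and Ω_v is the set of their possible labelings
Left-hand side: probability of labeling specific node v, given edges, text, partial labeling
Sum: over all possible labelings of unknown neighbors of v
Markov assumption: label of v does not depend on unknown labels outside the set N^U(v)
Circularity between f(v) and f(N^U(v))
Some form of iterative Gibbs sampling or MCMC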


Iterative labeling

$$ \Pr^{(r+1)}\bigl(f(v) \mid E, V_T, f(V^K)\bigr) \approx \sum_{f(N^U(v)) \in \Omega_v} \Bigl[\prod_{w \in N^U(v)} \Pr^{(r)}\bigl(f(w) \mid E, V_T, f(V^K)\bigr)\Bigr]\,\Pr\bigl(f(v) \mid f(N^U(v)), E, V_T, f(V^K)\bigr) $$
Left-hand side: label estimates of v in the next iteration
Product: joint distribution over neighbors approximated as product of marginals
Last term: take the expectation of this term over N^U(v) labelings
Sum over all possible N^U(v) labelings still too expensive to compute
In practice, prune to most likely configurations
Let us look at the last term more carefully


A generative node model

By the Markov assumption, finally we need a distribution coupling f(v) and v_T (the text on v) and f(N(v))
$$ \Pr\bigl(f(v) \mid f(N^U(v)), E, V_T, f(V^K)\bigr) = \Pr\bigl(f(v) \mid f(N(v)), v_T\bigr) $$
Can use Bayes classifier as with ordinary text: estimate a parametric model for the class-conditional joint distribution between the text on the page v and the labels of neighbors of v
$$ \Pr\bigl(f(N(v)), v_T \mid f(v)\bigr) $$
Must make naïve Bayes assumptions to keep practical
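
The iterative labeling scheme with a naïve Bayes node model can be sketched on a four-node chain. This is an illustration under stated assumptions, not the implementation behind the slides: the graph, the two labels, the per-node text scores (standing in for Pr(c)·Pr(v_T|c)) and the neighbor-label table `COMPAT` are all made up, and the sum over neighbor labelings is enumerated exactly rather than pruned.

```python
from itertools import product

LABELS = ("A", "B")

def conditional(text_score, compat, nbr_labels):
    """Pr(f(v)=c | neighbor labels, text) under a naive Bayes factorization."""
    scores = {}
    for c in LABELS:
        s = text_score[c]            # stands in for Pr(c) * Pr(v_T | c)
        for lab in nbr_labels:
            s *= compat[c][lab]      # Pr(a neighbor has label lab | f(v)=c)
        scores[c] = s
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

def iterate_labels(graph, text_score, known, compat, rounds=10):
    unknown = [v for v in graph if v not in known]
    # Round 0: text-only estimates.
    est = {v: conditional(text_score[v], compat, []) for v in unknown}
    for _ in range(rounds):
        new_est = {}
        for v in unknown:
            fixed = [known[w] for w in graph[v] if w in known]
            free = [w for w in graph[v] if w not in known]
            post = dict.fromkeys(LABELS, 0.0)
            # Sum over all labelings of the unknown neighbors, weighting
            # each by the product of the current marginal estimates.
            for combo in product(LABELS, repeat=len(free)):
                weight = 1.0
                for w, lab in zip(free, combo):
                    weight *= est[w][lab]
                cond = conditional(text_score[v], compat, fixed + list(combo))
                for c in LABELS:
                    post[c] += weight * cond[c]
            new_est[v] = post
        est = new_est
    return est

# Chain k1 - u1 - u2 - k2 with both end nodes known to be "A".
GRAPH = {"k1": ["u1"], "u1": ["k1", "u2"], "u2": ["u1", "k2"], "k2": ["u2"]}
KNOWN = {"k1": "A", "k2": "A"}
TEXT = {"u1": {"A": 0.5, "B": 0.5}, "u2": {"A": 0.4, "B": 0.6}}
COMPAT = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}

estimates = iterate_labels(GRAPH, TEXT, KNOWN, COMPAT)
for node, dist in estimates.items():
    print(node, {c: round(p, 3) for c, p in dist.items()})
```

In this toy setting the text on u2 weakly favors "B", but the labeled neighbors pull both unknown nodes toward "A".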


Pictorially…

c=class, t=text, N=neighbors
Text-only model: Pr[t|c]
Using neighbors’ text to judge my topic: Pr[t, t(N) | c] (hurts)
Better model: Pr[t, c(N) | c]
Estimate histograms and update based on neighbors’ histograms


Generative model: results

9600 patents from 12 classes marked by USPTO
Patents have text and cite other patents
Expand test patent to include neighborhood
‘Forget’ fraction of neighbors’ classes

[Figure: %Error (axis 0 to 40) versus %Neighborhood known (axis 0 to 100) for the Text, Link and Text+Link models]


Detour: generative vs. discriminative

x = feature vector, y = label ∈ {0,1} say
Generative method models Pr(x|y) or Pr(x,y)
  “the generation of data x given label y”
  Use Bayes rule to get Pr(y|x)
  Inaccurate; x may have many dimensions
Discriminative: directly estimates Pr(y|x)
  Simple linear case: want w s.t. w·x > 0 if y=1 and w·x ≤ 0 if y=0
  Cannot differentiate for w; instead pick some smooth “loss function”
$$ \Pr(y = 1 \mid x) = \frac{1}{1 + \exp(-w \cdot x)} $$
  Works very well in practice
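
To make the discriminative option concrete, here is a small sketch of fitting the logistic form Pr(y=1|x) = 1/(1+exp(−w·x)) by plain gradient ascent on the conditional log-likelihood. The data set, learning rate and epoch count are arbitrary choices for illustration; the first feature is a constant 1 acting as a bias term.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Pr(y=1 | x) under the logistic model."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(xs, ys, rate=0.5, epochs=5000):
    """Gradient ascent on sum_i log Pr(y_i | x_i); gradient is sum_i (y_i - p_i) x_i."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            p = predict(w, x)
            for j, xj in enumerate(x):
                grad[j] += (y - p) * xj
        w = [wj + rate * g / len(xs) for wj, g in zip(w, grad)]
    return w

# One real-valued feature plus a bias; labels overlap, so the data
# are not linearly separable and the weights stay finite.
XS = [(1.0, 0.1), (1.0, 0.3), (1.0, 0.4), (1.0, 0.5),
      (1.0, 0.6), (1.0, 0.8), (1.0, 0.9)]
YS = [0, 0, 1, 0, 1, 1, 1]

w = fit_logistic(XS, YS)
print("weights:", [round(wj, 2) for wj in w])
print("Pr(y=1 | x=0.1):", round(predict(w, (1.0, 0.1)), 3))
print("Pr(y=1 | x=0.9):", round(predict(w, (1.0, 0.9)), 3))
```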


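Discriminative node model

OA(X) = direct “own” attributes of node X
LD(X) = link-derived attributes of node X
  Mode-link: most frequent label of neighbors(X)
  Count-link: histogram of neighbor labels
  Binary-link: 0/1 histogram of neighbor labels
Local model, with params w_o (c ∈ {−1, +1}):
$$ \Pr(c \mid w_o, \mathrm{OA}(X)) = \frac{1}{\exp(-w_o^{T}\,\mathrm{OA}(X)\,c) + 1} $$
Neighborhood model, with params w_l:
$$ \Pr(c \mid w_l, \mathrm{LD}(X)) = \frac{1}{\exp(-w_l^{T}\,\mathrm{LD}(X)\,c) + 1} $$
Combined prediction:
$$ \hat{C}(X) = \arg\max_c \Pr(c \mid \mathrm{OA}(X))\,\Pr(c \mid \mathrm{LD}(X)) $$
Iterate as in generative case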
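
The three link-derived encodings are easy to compute from the labels currently assigned to a node's neighbors. A minimal sketch follows; the function name and the class list are hypothetical, and ties in mode-link are broken by class order, which is an arbitrary choice.

```python
from collections import Counter

def link_features(neighbor_labels, classes):
    """Return (mode_link, count_link, binary_link) for one node."""
    counts = Counter(neighbor_labels)
    count_link = [counts[c] for c in classes]            # histogram
    binary_link = [1 if n > 0 else 0 for n in count_link]  # 0/1 histogram
    # Most frequent neighbor label; None if the node has no neighbors.
    mode_link = max(classes, key=lambda c: counts[c]) if neighbor_labels else None
    return mode_link, count_link, binary_link

CLASSES = ["A", "B", "C"]
mode, count, binary = link_features(["A", "B", "A"], CLASSES)
print(mode, count, binary)
```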


Discriminative model: results [LuG2003]

Binary-link and count-link outperform content-only at 95% confidence
Better to separately estimate w_l and w_o
In+Out+Cocitation better than any subset for LD


Undirected Markov networks

Clique c ∈ C(G): a set of completely connected nodes
Clique potential φ_c(V_c): a function over all possible configurations of nodes in V_c
Decompose Pr(v) as (1/Z) ∏_{c ∈ C(G)} φ_c(V_c)
Parametric form
$$ \phi_c(\mathbf{v}_c) = \exp\bigl(\mathbf{w}_c \cdot \mathbf{f}_c(\mathbf{v}_c)\bigr) $$
  w_c: params of model
  f_c: feature functions

[Figure: network over an instance, showing label variables, local feature variables and label coupling]


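Conditional and relational networks

x = vector of observed features at all nodes
y = vector of labels
$$ \Pr(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{c \in C(G)} \phi_c(\mathbf{x}_c, \mathbf{y}_c) \quad \text{where} \quad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{c \in C(G)} \phi_c(\mathbf{x}_c, \mathbf{y}'_c) $$
$$ \phi_c(\mathbf{x}_c, \mathbf{y}_c) = \exp\bigl(\mathbf{w}_c \cdot \mathbf{F}_c(\mathbf{x}_c, \mathbf{y}_c)\bigr) $$
A set of “clique templates” specifying links to use
Other features: “in the same HTML section”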
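
For a graph small enough to enumerate, the conditional distribution Pr(y|x) can be computed by brute force. The sketch below assumes two hypothetical clique templates (one per node, one per edge), binary labels, scalar node features and hand-picked weights; none of these come from the slides.

```python
import math
from itertools import product

LABELS = (0, 1)

def total_log_potential(x, y, edges, w_node, w_edge):
    """Sum over cliques of w_c . F_c(x_c, y_c)."""
    total = 0.0
    for xi, yi in zip(x, y):
        # Node template: feature xi votes for label 1, against label 0.
        total += w_node * (xi if yi == 1 else -xi)
    for i, j in edges:
        # Edge template: reward linked nodes that share a label.
        total += w_edge * (1.0 if y[i] == y[j] else 0.0)
    return total

def conditional(x, edges, w_node=1.0, w_edge=0.5):
    """Pr(y | x) for every labeling y, normalized by Z(x)."""
    scores = {y: math.exp(total_log_potential(x, y, edges, w_node, w_edge))
              for y in product(LABELS, repeat=len(x))}
    z = sum(scores.values())  # Z(x): sum over all labelings y'
    return {y: s / z for y, s in scores.items()}

X = [2.0, -0.1, 1.5]          # middle node has weak, slightly negative evidence
EDGES = [(0, 1), (1, 2)]

coupled = conditional(X, EDGES)
independent = conditional(X, EDGES, w_edge=0.0)
best_coupled = max(coupled, key=coupled.get)
best_independent = max(independent, key=independent.get)
print("with label coupling:   ", best_coupled)
print("without label coupling:", best_independent)
```

With the edge weight switched off, each node is labeled from its own feature alone; with coupling, the middle node follows its confidently labeled neighbors.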


“Toy problem”: Hierarchical classification

Obvious approaches
  Flatten to leaf topics, losing hierarchy info
  Level-by-level, compounding error probability
Cascaded generative model
$$ \Pr(c \mid d) = \Pr(r \mid d)\,\Pr(c \mid d, r) $$
  Pr(c|d,r) estimated as Pr(c|r)Pr(d|c)/Z(r)
  Estimate of Pr(d|c) makes naïve independence assumptions if d has high dimensionality
  Pr(c|d,r) tends to 0/1 for large dimensions, and mistakes made at shallow levels become irrevocable

[Figure: topic tree with node r above node c]


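Global discriminative model

Each node has an associated bit X
Propose a parametric form
$$ \Pr(X_c = 1 \mid d, x_r) = \frac{\exp\bigl(w_c \cdot F(d, x_r)\bigr)}{1 + \exp\bigl(w_c \cdot F(d, x_r)\bigr)} $$
Each training instance sets one path to 1, all other nodes have X=0

[Figure: feature vector F(d, x_r) of length 2T+1, built from document d of length T and the bit x_r (cases x_r=0 and x_r=1), with weights w_c]

%Accuracy
Dataset      SVM 1v1   Ctree
Reuters 1%   41.9      47.3
Reuters 5%   68        71
News20 1%    17.2      21.8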


Network value of customers

Customer = node X in graph, neighbors N
M is a marketing action (promotion, coupon)
Want to predict
$$ \Pr(X_i \mid \mathbf{X}^K, \mathbf{Y}, \mathbf{M}) $$
  X_i: response of customer i
  X^K: known response of other customers
  Y: product attributes
  M: aggregate marketing action
Broader objective is to design action M
Again, we approximate as
$$ \Pr(X_i \mid \mathbf{X}^K, \mathbf{Y}, \mathbf{M}) = \sum_{C(N_i^U)} \Pr(X_i \mid N_i, \mathbf{Y}, \mathbf{M})\,\Pr(N_i^U \mid \mathbf{X}^K, \mathbf{Y}, \mathbf{M}) $$
  Sum over unknown neighbor configurations


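Network value, continued

Let the action be boolean
  c is the cost of marketing
  r0 is the revenue without marketing, r1 with
Expected lift in profit by marketing to customer i in isolation
$$ \mathrm{ELP}_i(\mathbf{X}^K, \mathbf{Y}, \mathbf{M}) = r_1 \Pr\bigl(X_i = 1 \mid \mathbf{X}^K, \mathbf{Y}, f_i^1(\mathbf{M})\bigr) - r_0 \Pr\bigl(X_i = 1 \mid \mathbf{X}^K, \mathbf{Y}, f_i^0(\mathbf{M})\bigr) - c $$
  Here f_i^1(M) is M with the action for customer i set to 1, and f_i^0(M) is M with it set to 0
Global effect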
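
As a worked instance of the expected lift in profit for one customer, the sketch below plugs in made-up numbers; the two purchase probabilities would in practice come from the network model, and all values here are assumptions chosen only to show the arithmetic.

```python
def expected_lift_in_profit(p_buy_marketed, p_buy_unmarketed, r1, r0, c):
    """ELP_i = r1 * Pr(X_i=1 | marketed) - r0 * Pr(X_i=1 | not marketed) - c."""
    return r1 * p_buy_marketed - r0 * p_buy_unmarketed - c

# Hypothetical customer: a coupon lowers revenue per sale from 12 to 10
# but raises the purchase probability from 0.10 to 0.30; sending it costs 1.
elp = expected_lift_in_profit(p_buy_marketed=0.30, p_buy_unmarketed=0.10,
                              r1=10.0, r0=12.0, c=1.0)
print("expected lift in profit:", round(elp, 2))
print("worth marketing to in isolation:", elp > 0)
```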


Special case: sequential networks

Text modeled as sequence of tokens drawn from a large but finite vocabulary
Each token has attributes
  Visible: allCaps, noCaps, hasXx, allDigits, hasDigit, isAbbrev, (part-of-speech, wnSense)
  Not visible: part-of-speech, (isPersonName, isOrgName, isLocation, isDateTime), {starts|continues|ends}-noun-phrase
Visible (symbols) and invisible (states) attributes of nearby tokens are dependent
Application decides what is (not) visible
Goal: Estimate invisible attributes


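Hidden Markov model

A generative sequential model for the joint distribution of states (s) and symbols (o)
$$ \mathbf{s} = s_0, s_1, \ldots, s_n, \qquad \mathbf{o} = o_1, o_2, \ldots, o_n $$
$$ \Pr(\mathbf{s}, \mathbf{o}) = \prod_{t=1}^{|\mathbf{o}|} \Pr(s_t \mid s_{t-1})\,\Pr(o_t \mid s_t) $$

[Figure: chain of states … S_{t-1}, S_t, S_{t+1} …, with symbols O_{t-1}, O_t, O_{t+1}]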
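
The HMM factorization can be evaluated directly, and the most likely state sequence found by Viterbi decoding. The following is a small sketch with a made-up two-tag model; the transition and emission tables, the `START` pseudo-state (playing the role of s_0) and the two-word input are assumptions for illustration only.

```python
import math

START = "<s>"  # plays the role of s_0
TRANS = {START: {"DT": 0.7, "NN": 0.3},
         "DT": {"DT": 0.1, "NN": 0.9},
         "NN": {"DT": 0.4, "NN": 0.6}}
EMIT = {"DT": {"the": 0.9, "fiber": 0.1},
        "NN": {"the": 0.1, "fiber": 0.9}}

def joint_log_prob(states, symbols, trans=TRANS, emit=EMIT):
    """log Pr(s, o) = sum_t [log Pr(s_t | s_{t-1}) + log Pr(o_t | s_t)]."""
    prev, total = START, 0.0
    for s, o in zip(states, symbols):
        total += math.log(trans[prev][s]) + math.log(emit[s][o])
        prev = s
    return total

def viterbi(symbols, trans=TRANS, emit=EMIT):
    """Return (log probability, state path) of the best state sequence."""
    states = list(emit)
    best = {s: (math.log(trans[START][s]) + math.log(emit[s][symbols[0]]), [s])
            for s in states}
    for o in symbols[1:]:
        nxt = {}
        for s in states:
            # Best predecessor for state s at this position.
            score, path = max((best[p][0] + math.log(trans[p][s]), best[p][1])
                              for p in states)
            nxt[s] = (score + math.log(emit[s][o]), path + [s])
        best = nxt
    return max(best.values())

score, path = viterbi(["the", "fiber"])
print(path, round(math.exp(score), 4))
```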


Using redundant token features

Each o is usually a vector of features extracted from a token
Might have high dependence/redundancy: hasCap, hasDigit, isNoun, isPreposition
Parametric model for Pr(o_t|s_t) needs to make naïve assumptions to be practical
Overall joint model Pr(s,o) can be very inaccurate
(Same argument as in naïve Bayes vs. SVM or maximum entropy text classifiers)


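Discriminative graphical model

Assume one-stage Markov dependence
Propose direct parametric form for conditional probability of state sequence given symbol sequence
Model:
$$ \Pr(\mathbf{s} \mid \mathbf{o}) = \frac{1}{\Pr(\mathbf{o})} \prod_{t=1}^{|\mathbf{o}|} \Pr(s_t \mid s_{t-1})\,\Pr(o_t \mid s_t) $$
Log-linear form:
$$ \Pr(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z(\mathbf{o})} \prod_{t=1}^{|\mathbf{o}|} \Phi(s_{t-1}, s_t, \mathbf{o}, t), \qquad \Phi(s_{t-1}, s_t, \mathbf{o}, t) = \exp\Bigl(\sum_k \lambda_k f_k(s_{t-1}, s_t, \mathbf{o}, t)\Bigr) $$
  λ_k: parameters to fit
  f_k: feature function; might depend on whole o

[Figure: chain of states … S_{t-1}, S_t, S_{t+1} …, with symbols O_{t-1}, O_t, O_{t+1}]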


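Feature functions and parameters

Maximize total conditional likelihood over all instances, and penalize large params:
$$ L = \sum_{(\mathbf{s}, \mathbf{o}) \in D} \log\left[\frac{1}{Z(\mathbf{o})} \exp\Bigl(\sum_{t=1}^{|\mathbf{o}|} \sum_k \lambda_k f_k(s_{t-1}, s_t, \mathbf{o}, t)\Bigr)\right] - \sum_k \frac{\lambda_k^2}{2\sigma^2} $$
Find ∂L/∂λ_k for each k and perform a gradient-based numerical optimization
Efficient for linear state dependence structure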
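
Why a linear chain is efficient can be seen in a short sketch: the normalizer Z(o) is obtained by a forward recursion over positions instead of enumerating every state sequence, and the penalized objective follows from it. The two states, the indicator features, the weights and the variance `sigma2` below are all hypothetical; positions are 0-based in the code, with `START` standing in for s_0.

```python
import math
from itertools import product

STATES = ("N", "V")
START = "<s>"

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def local_score(prev, cur, obs, t, weights):
    """sum_k lambda_k * f_k(s_{t-1}, s_t, o, t) for indicator features."""
    active = [("trans", prev, cur), ("word", cur, obs[t])]
    if obs[t].endswith("s"):
        active.append(("suffix-s", cur))  # overlapping feature on the same token
    return sum(weights.get(f, 0.0) for f in active)

def sequence_score(states, obs, weights):
    prev, total = START, 0.0
    for t, cur in enumerate(states):
        total += local_score(prev, cur, obs, t, weights)
        prev = cur
    return total

def log_z(obs, weights):
    """Forward recursion: O(|o| * |S|^2) work."""
    alpha = {s: local_score(START, s, obs, 0, weights) for s in STATES}
    for t in range(1, len(obs)):
        alpha = {s: logsumexp([alpha[p] + local_score(p, s, obs, t, weights)
                               for p in STATES]) for s in STATES}
    return logsumexp(list(alpha.values()))

def log_z_brute_force(obs, weights):
    """Same quantity by enumerating all |S|^|o| state sequences."""
    return logsumexp([sequence_score(s, obs, weights)
                      for s in product(STATES, repeat=len(obs))])

def log_cond_prob(states, obs, weights):
    return sequence_score(states, obs, weights) - log_z(obs, weights)

def objective(data, weights, sigma2=10.0):
    """Total conditional log-likelihood minus the quadratic penalty."""
    fit = sum(log_cond_prob(s, o, weights) for s, o in data)
    penalty = sum(v * v for v in weights.values()) / (2.0 * sigma2)
    return fit - penalty

WEIGHTS = {("trans", START, "N"): 1.0, ("trans", "N", "V"): 1.5,
           ("word", "N", "researchers"): 2.0, ("word", "V", "said"): 2.0,
           ("suffix-s", "N"): 0.5}
OBS = ("researchers", "said")
DATA = [(("N", "V"), OBS)]

print("log Z (forward):    ", round(log_z(OBS, WEIGHTS), 6))
print("log Z (brute force):", round(log_z_brute_force(OBS, WEIGHTS), 6))
print("objective L:        ", round(objective(DATA, WEIGHTS), 6))
```

A gradient-based optimizer would repeatedly evaluate this objective and its partial derivatives with respect to each weight; only the evaluation is sketched here.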


Conditional vs. joint: results

Penn Treebank: 45 tags, 1M words training data

The/DT asbestos/NN fiber/NN ,/, crocidolite/NN ,/, is/VBZ unusually/RB resilient/JJ once/IN
it/PRP enters/VBZ the/DT lungs/NNS ,/, with/IN even/RB brief/JJ exposures/NNS to/TO it/PRP causing/VBG
symptoms/NNS that/WDT show/VBP up/RP decades/NNS later/JJ ,/, researchers/NNS said/VBD ./.

Algorithm/Features               %Error   %OOVE*
HMM with words                   5.69     45.99
CRF with words                   5.55     48.05
CRF with words and orthography   4.27     23.76
* OOVE: out-of-vocabulary error

Orthography: Use words, plus overlapping features: isCap, startsWithDigit, hasHyphen, endsWith… -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
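
The orthographic features listed above are simple string tests. A minimal sketch of an extractor follows; the function name and the string encoding of each feature are assumptions, and several features may fire on the same token (for example -ion and -tion), which is the overlap a conditional model tolerates.

```python
SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

def orthographic_features(word):
    """Return the set of (possibly overlapping) features active for a token."""
    feats = {"word=" + word.lower()}
    if word[:1].isupper():
        feats.add("isCap")
    if word[:1].isdigit():
        feats.add("startsWithDigit")
    if "-" in word:
        feats.add("hasHyphen")
    for suffix in SUFFIXES:
        if word.lower().endswith(suffix[1:]):   # drop the leading "-"
            feats.add("endsWith" + suffix)
    return feats

for token in ("Station", "1990s", "well-being"):
    print(token, sorted(orthographic_features(token)))
```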


Summary

Graphs provide a powerful way to model many kinds of data, at multiple levels
  Web pages, XML, relational data, images…
  Words, senses, phrases, parse trees…
A few broad paradigms for analysis
  Factors affecting graph evolution over time
  Eigen analysis, conductance, random walks
  Coupled distributions between node attributes and graph neighborhood
Several new classes of model estimation and inferencing algorithms


References

[BrinP1998] The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
[GoldmanSVG1998] Proximity search in databases. VLDB, 26–37.
[ChakrabartiDI1998] Enhanced hypertext categorization using hyperlinks. SIGMOD.
[BikelSW1999] An Algorithm that Learns What’s in a Name. Machine Learning Journal.
[GibsonKR1999] Clustering categorical data: An approach based on dynamical systems. VLDB. http://www.cs.cornell.edu/home/kleinber/vldb98.ps
[Kleinberg1999] Authoritative sources in a hyperlinked environment. JACM 46. http://www.cs.cornell.edu/home/kleinber/auth.ps




References

[CohnC2000] Probabilistically Identifying Authoritative Documents. ICML.
[LempelM2000] The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks 33 (1-6): 387-401.
[RichardsonD2001] The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. NIPS 14 (1441-1448).
[LaffertyMP2001] Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML.
[BorkarDS2001] Automatic text segmentation for extracting structured records. SIGMOD.


References

[NgZJ2001] Stable algorithms for link analysis. SIGIR.
[Hulgeri+2001] Keyword Search in Databases. IEEE Data Engineering Bulletin 24(3): 22-32.
[Hristidis+2002] DISCOVER: Keyword Search in Relational Databases. VLDB.
[Agrawal+2002] DBXplorer: A system for keyword-based search over relational databases. ICDE.
[TaskarAK2002] Discriminative probabilistic models for relational data.
[Fagin2002] Combining fuzzy information: an overview. SIGMOD Record 31(2), 109–118.


References

[Chakrabarti2002] Mining the Web: Discovering Knowledge from Hypertext Data.
[Tomlin2003] A New Paradigm for Ranking Pages on the World Wide Web. WWW.
[Haveliwala2003] Topic-Sensitive Pagerank: A Context-Sensitive Ranking Algorithm for Web Search. IEEE TKDE.
[LuG2003] Link-based Classification. ICML.
[FaloutsosMT2004] Connection Subgraphs in Social Networks. SIAM-DM workshop.
[PanYFD2004] GCap: Graph-based Automatic Image Captioning. MDDE/CVPR.


References

[Balmin+2004] Authority-Based Keyword Queries in Databases using ObjectRank. VLDB.
[BarYossefBKT2004] Sic transit gloria telae: Towards an understanding of the Web’s decay. WWW2004.
