Transcript
Page 1: ACFOCS 2004Chakrabarti1 Using Graphs in Unstructured and Semistructured Data Mining Soumen Chakrabarti IIT Bombay soumen

ACFOCS 2004 Chakrabarti 1

Using Graphs in Unstructured and Semistructured Data Mining

Soumen Chakrabarti

IIT Bombay

www.cse.iitb.ac.in/~soumen

Page 2

Acknowledgments
C. Faloutsos, CMU
W. Cohen, CMU
IBM Almaden (many colleagues)
IIT Bombay (many students)
S. Sarawagi, IIT Bombay
S. Sudarshan, IIT Bombay

Page 3

Graphs are everywhere
Phone network, Internet, Web
Databases, XML, email, blogs
Web of trust (epinion)
Text and language artifacts (WordNet)
Commodity distribution networks

Internet Map [lumeta.com]

Food Web [Martinez1991]

Protein Interactions [genomebiology.com]

Page 4

Why analyze graphs?
What properties do real-life graphs have?
How important is a node? What is importance?
Who is the best customer to target in a social network?
Who spread a raging rumor?
How similar are two nodes?
How do nodes influence each other?
Can I predict some property of a node based on its neighborhood?

Page 5

Outline, some more detail
Part 1 (Modeling graphs)
  What do real-life graphs look like?
  What laws govern their formation, evolution and properties?
  What structural analyses are useful?
Part 2 (Analyzing graphs)
  Modeling data analysis problems using graphs
  Proposing parametric models
  Estimating parameters
  Applications from Web search and text mining

Page 6

Modeling and generating realistic graphs

Page 7

Questions
What do real graphs look like?
  Edges, communities, clustering effects
What properties of nodes, edges are important to model?
  Degree, paths, cycles, …
What local and global properties are important to measure?
How to artificially generate realistic graphs?

Page 8

Modeling: why care?
Algorithm design
  Can skewed degree distribution make our algorithm faster?
Extrapolation
  How well will Pagerank work on the Web 10 years from now?
Sampling
  Make sure scaled-down algorithm shows same performance/behavior on large-scale data
Deviation detection
  Is this page trying to spam the search engine?

Page 9

Laws – degree distributions
Q: avg degree is ~10 - what is the most probable degree?

[Figure: count vs. degree, with degree 10 marked on the axis and the curve labeled "??"]

Page 10

Laws – degree distributions
Q: avg degree is ~10 - what is the most probable degree?

[Figure: two plots of count vs. degree, each with degree 10 marked on the axis; the first is labeled "??"]

Page 11

Power-law: outdegree O
The plot is linear in log-log scale [FFF'99]
$\mathrm{freq} = \mathrm{degree}^{-2.15}$
Exponent = slope; O = -2.15

[Figure: log-log plot of frequency vs. outdegree, Nov'97, with fitted slope -2.15]
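The statement "exponent = slope" can be checked with a small sketch: fit a least-squares line to log(frequency) against log(degree). The helper name `loglog_slope` and the synthetic data are illustrative assumptions, not part of the original study.

```python
import math

def loglog_slope(degrees, counts):
    """Least-squares slope of log(count) against log(degree)."""
    xs = [math.log(d) for d in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic (assumed) data that follows freq = 1000 * degree^(-2.15) exactly.
degrees = list(range(1, 51))
counts = [1000.0 * d ** -2.15 for d in degrees]
slope = loglog_slope(degrees, counts)  # recovers the exponent as the slope
```

On real data the fit is noisy in the tail, so the recovered slope is only an estimate of the exponent.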

Page 12

Power-law: rank R
The plot is a line in log-log scale
Exponent = slope; R = -0.74
Rank: nodes in decreasing outdegree order

[Figure: log-log plot of outdegree vs. rank, Dec'98]

Page 13

Eigenvalues
Let A be the adjacency matrix of the graph
The eigenvalue $\lambda$ satisfies $A v = \lambda v$, where v is some vector
Eigenvalues are strongly related to graph topology

[Figure: graph on nodes A, B, C, D in which B is linked to A, C and D]

    A  B  C  D
A   0  1  0  0
B   1  0  1  1
C   0  1  0  0
D   0  1  0  0

Page 14

Power-law: eigenvalues of E
Eigenvalues in decreasing order
Exponent = slope; E = -0.48

[Figure: log-log plot of eigenvalue vs. rank of decreasing eigenvalue, Dec'98]

Page 15

The Node Neighborhood
N(h) = # of pairs of nodes within h hops
Let average degree = 3
How many neighbors should I expect within 1, 2, … h hops?
Potential answer:
  1 hop -> 3 neighbors
  2 hops -> 3 * 3
  …
  h hops -> $3^h$

Page 16

The Node Neighborhood
N(h) = # of pairs of nodes within h hops
Let average degree = 3
How many neighbors should I expect within 1, 2, … h hops?
Potential answer:
  1 hop -> 3 neighbors
  2 hops -> 3 * 3
  …
  h hops -> $3^h$
WE HAVE DUPLICATES!

Page 17

The Node Neighborhood
N(h) = # of pairs of nodes within h hops
Let average degree = 3
How many neighbors should I expect within 1, 2, … h hops?
Potential answer:
  1 hop -> 3 neighbors
  2 hops -> 3 * 3
  …
  h hops -> $3^h$
'avg' degree: meaningless!

Page 18

Power-law: hop-plot H
Pairs of nodes as a function of hops: $N(h) = h^{H}$

[Figure: two log-log plots of # of pairs vs. hops; Dec'98 data with H = 4.86, and router-level '95 data with H = 2.83]
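A hop-plot can be computed by running a breadth-first search from every node and counting the pairs reached within h hops. The sketch below assumes ordered pairs of distinct nodes are counted (the slides do not say whether pairs are ordered or whether self-pairs are included); `hop_plot` and the small path graph are illustrative names and data.

```python
from collections import deque

def hop_plot(adj, max_hops):
    """Return [N(1), ..., N(max_hops)], where N(h) is the number of ordered
    pairs (u, v), u != v, whose shortest-path distance is at most h."""
    at_distance = [0] * (max_hops + 1)
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue  # do not explore beyond the hop limit
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                at_distance[d] += 1
    counts, total = [], 0
    for h in range(1, max_hops + 1):
        total += at_distance[h]  # cumulative: "within h hops"
        counts.append(total)
    return counts

# Assumed toy input: the undirected path 0 - 1 - 2 - 3 - 4.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 5] for i in range(5)}
pairs_within = hop_plot(path, 4)
```

Because each node is reached only once per search, duplicates are never counted, which is exactly what the naive $3^h$ estimate gets wrong.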

Page 19

Observation
Q: Intuition behind 'hop exponent'?
A: 'intrinsic=fractal dimensionality' of the network

[Figure: example networks with $N(h) \sim h^1$ and $N(h) \sim h^2$]

...

Page 20

Any other 'laws'?
The Web looks like a "bow-tie" [Kumar+1999]
IN, SCC, OUT, 'tendrils'
Disconnected components

Page 21

Generators
How to generate graphs from a realistic distribution?
Difficulty: simultaneously preserving many local and global properties seen in realistic graphs
Erdos-Renyi: switch on each edge independently with some probability
  Problem: degree distribution not power-law
Degree-based
Process-based ("preferential attachment")

Page 22

Degree-based generator
Fix the degree distribution (e.g., 'Zipf')
Assign degrees to nodes
Add matching edges to satisfy degrees
No direct control over other properties
"ACL model" [AielloCL2000]

Page 23

Process-based: Preferential attachment
Start with a clique with m nodes
Add one node v at every time step
v makes m links to old nodes
Suppose old node u has degree d(u)
Let $p_u = d(u) / \sum_w d(w)$
v invokes a multinomial distribution defined by the set of p's
And links to whichever u's show up
At time t, there are m+t nodes, mt links
What is the degree distribution?
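The process above can be simulated in a few lines. This is a minimal sketch, assuming m >= 2 (so the starting clique has at least one edge) and allowing repeated targets, as multinomial draws would; `preferential_attachment` is an illustrative name.

```python
import random
from collections import Counter

def preferential_attachment(m, steps, seed=0):
    """Grow a graph by preferential attachment; returns a list of edges."""
    rng = random.Random(seed)
    # Start with a clique on nodes 0 .. m-1.
    edges = [(u, v) for u in range(m) for v in range(u)]
    # Each node appears in `endpoints` once per unit of degree, so a uniform
    # draw from this list picks node u with probability d(u) / sum_w d(w).
    endpoints = [x for e in edges for x in e]
    for t in range(steps):
        v = m + t
        # Draw all m targets before adding v, so v links only to old nodes.
        targets = [rng.choice(endpoints) for _ in range(m)]
        for u in targets:
            edges.append((v, u))
            endpoints.extend((v, u))
    return edges

edges = preferential_attachment(m=3, steps=1000)
degree = Counter(x for e in edges for x in e)
oldest_degree = max(degree[u] for u in range(3))
newest_degree = degree[3 + 999]
```

Keeping the flat `endpoints` list trades memory for a constant-time degree-proportional draw.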

Page 24

Preferential attachment: analysis
$k_i(t)$ = degree of node i at time t
  Discrete random variable
  Approximate as continuous random variable
Let $\kappa_i(t) = E(k_i(t))$, expectation over random linking choices
At time t, the infinitesimal expected growth rate of $\kappa_i(t)$ is, by linearity of expectation,

$\frac{\partial \kappa_i(t)}{\partial t} = m \cdot \frac{\kappa_i(t)}{2mt} = \frac{\kappa_i(t)}{2t}$

where m is the number of degrees to add and 2mt is the total degree at t. Solving,

$\kappa_i(t) = m \sqrt{t / t_i}$

where $t_i$ is the time at which node i was born

Page 25

Preferential attachment, continued
Expected degree of each node grows as square-root of age: $\kappa_i(t) = m \sqrt{t / t_i}$
Let the current time be t
A node must be old enough for its degree to be large; for $\kappa_i(t) > k$, we need $t_i < m^2 t / k^2$
Therefore, the fraction of nodes with degree larger than k is

$\frac{m^2 t / k^2}{m + t} \approx \frac{m^2}{k^2}$

$\Pr(\mathrm{degree} = k) \approx \mathrm{const}/k^3$ (data closer to 2)

Page 26

Bipartite cores
Basic preferential attachment does not
  Explain dense/complete bipartite cores (100,000's in a O(20 million)-page crawl)
  Account for influence of search engines
The story isn't over yet

[Figure: a 2:3 core (n:m core); plot of number of cores vs. log m, with curves labeled n=2 and n=7]

Page 27

Other process-based generators
"Copying model" [KumarRRTU2000]
  New node v picks old reference node r u.a.r.
  v adds k new links to old nodes; for ith link:
    W.p. $\alpha$ add a link to an old node picked u.a.r.
    W.p. $1 - \alpha$ copy the ith link from r
  More difficult to analyze
  Reference node -> compression techniques!
H.O.T.: connect to closest, high-connectivity neighbor [Fabrikant+2002]
Winner does not take all [Pennock+2002]
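The copying model can be sketched as follows. The seed graph (k+1 nodes, each linking to the k others) and the name `alpha` for the random-link probability are assumptions made for illustration; the slide does not specify either.

```python
import random

def copying_model(k, steps, alpha, seed=0):
    """Grow a directed graph; returns {node: list of k out-neighbors}."""
    rng = random.Random(seed)
    # Assumed seed graph: k+1 nodes, each linking to all the others.
    out = {u: [v for v in range(k + 1) if v != u] for u in range(k + 1)}
    for _ in range(steps):
        v = len(out)              # id of the new node
        r = rng.randrange(v)      # reference node, uniform over old nodes
        links = []
        for i in range(k):
            if rng.random() < alpha:
                links.append(rng.randrange(v))  # random old node
            else:
                links.append(out[r][i])         # copy i-th link of r
        out[v] = links
    return out

graph = copying_model(k=3, steps=50, alpha=0.3)
pure_copy = copying_model(k=3, steps=20, alpha=0.0)
```

With `alpha = 0` every new node duplicates the link list of its reference node, which is the regularity that reference-based compression exploits.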

Page 28

Reference-based graph compression
Well-motivated: pack graph into limited fast memory for, e.g., query-time Web analysis
Standard approach
  Assign integer IDs to URLs, lexicographic order
  Delta or Gamma encoding of outlink IDs
If link-copying is rampant, should be able to compress Outlinks(u) by recording
  A reference node r
  The difference between Outlinks(r) and Outlinks(u) …the "correction"
Finding r: what's optimal? practical? [Adler, Mitzenmacher, Boldi, Vigna 2002-2004]

Page 29

Reference-based compression, cont'd
r is a candidate reference for u if $\mathrm{Outlinks}(r) \cap \mathrm{Outlinks}(u)$ is "large enough"
Given G, construct G' in which…
  Directed edge from r to u with edge cost = number of bits needed to write down Outlinks(u) relative to Outlinks(r)
  Dummy node z, z has no outlinks in G
  z connected to each u in G'
  cost(z,u) = #bits to write Outlinks(u) w/o ref
Shortest path tree rooted at z
In practice, pick "recent" r… 2.58 bits/link
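To make the edge costs concrete, the sketch below compares the bits needed to gap-encode an outlink list on its own with the bits needed when a reference list is available. The cost model is an assumption chosen for illustration (Elias gamma codes for gaps, one copy/skip bit per reference entry, no length field); it is not the encoding used in the cited work.

```python
def gamma_bits(n):
    """Length in bits of the Elias gamma code of an integer n >= 1."""
    return 2 * n.bit_length() - 1

def gap_cost(ids):
    """Bits to gamma-code a set of IDs as gaps between sorted values."""
    bits, prev = 0, -1          # start at -1 so that ID 0 gives gap 1
    for x in sorted(ids):
        bits += gamma_bits(x - prev)
        prev = x
    return bits

def cost_with_reference(out_u, out_r):
    """Assumed model: one copy/skip bit per entry of out_r, plus the
    gap-coded outlinks of u that the reference does not supply."""
    return len(out_r) + gap_cost(set(out_u) - set(out_r))

# Hypothetical outlink lists with heavy overlap.
out_r = [10, 11, 12, 500]
out_u = [10, 11, 12, 501]
plain = gap_cost(out_u)                       # cost(z, u)
referenced = cost_with_reference(out_u, out_r)  # cost(r, u)
```

In the construction above, `plain` would label the edge from the dummy node z to u and `referenced` the edge from r to u.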

Page 30

Summary: Power laws are everywhere
Bible: rank vs. word frequency
Length of file transfers [Bestavros+]
Web hit counts [Huberman]
Click-stream data [Montgomery+01]
Lotka's law of publication count (CiteSeer data)

[Figures: log(freq) vs. log(rank) for words, with "a" and "the" marked; log(count) vs. log(#citations), with J. Ullman marked]

Page 31

Resources
Generators
  R-MAT [email protected]
  BRITE www.cs.bu.edu/brite/
  INET topology.eecs.umich.edu/inet
Visualization tools
  Graphviz www.graphviz.org
  Pajek vlado.fmf.uni-lj.si/pub/networks/pajek
Kevin Bacon web site www.cs.virginia.edu/oracle
Erdös numbers etc.

Page 32

R-MAT: Recursive MATrix generator
Goals
  Power-law in- and out-degrees
  Power-law eigenvalues
  Small diameter ("six degrees of separation")
  Simple, few parameters
Approach
  Subdivide the adjacency matrix
  Choose a quadrant with probability (a,b,c,d)

[Figure: $2^n \times 2^n$ adjacency matrix (From vs. To) split into quadrants a (0.5), b (0.1), c (0.15), d (0.25)]

Page 33

R-MAT algorithm, cont'd
Subdivide the adjacency matrix
Choose a quadrant with probability (a,b,c,d)
Recurse till we reach a $1 \times 1$ cell
By construction
  Rich gets richer for in- and out-degree
  Self-similar (communities within communities)
  Small diameter

[Figure: $2^n \times 2^n$ adjacency matrix with quadrants a, b, c, d, one of which is subdivided again into a, b, c, d]
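The recursion can be written as a loop that fixes one bit of the row and column index per level. The quadrant layout used here (a top-left, b top-right, c bottom-left, d bottom-right) is an assumption; the extracted figure does not preserve the positions. `rmat_edge` is an illustrative name.

```python
import random

def rmat_edge(n, a, b, c, d, rng):
    """Pick one cell (row, col) of a 2^n x 2^n adjacency matrix by
    choosing a quadrant n times with probabilities (a, b, c, d)."""
    row = col = 0
    for _ in range(n):
        row, col = 2 * row, 2 * col
        x = rng.random()
        if x < a:
            pass                 # assumed top-left: neither bit set
        elif x < a + b:
            col += 1             # assumed top-right
        elif x < a + b + c:
            row += 1             # assumed bottom-left
        else:
            row += 1             # assumed bottom-right
            col += 1
    return row, col

rng = random.Random(0)
# Parameters taken from the slide: a=0.5, b=0.1, c=0.15, d=0.25.
edges = [rmat_edge(4, 0.5, 0.1, 0.15, 0.25, rng) for _ in range(1000)]
```

Repeated draws may hit the same cell; a full generator would need a policy for such duplicate edges.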

Page 34

Evaluation on clickstream data
R-MAT matches it well

[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Left "Network value"; Right "Network value"; Singular value vs Rank]

Page 35

Topic structure of the Web
Measure correlations between link proximity and content similarity
How to characterize "topics"?
  Started with http://dmoz.org
  Keep pruning until all leaf topics have enough (>300) samples
  Approx 120k sample URLs
  Flatten to approx 482 topics
  Train a text classifier
Characterize new document d as a vector of probabilities $p_d = (\Pr(c|d)\ \forall c)$

[Figure: test doc -> classifier -> topic probabilities]

Topic      Prob
Arts       0.1
Computers  0.3
Science    0.6

Page 36

Sampling the background topic distrib.
What fraction of Web pages are about /Health/Diabetes?
How to sample the Web?
Invoke the "random surfer" model (Pagerank)
  Walk from node to node
  Sample trail adjusting for Pagerank
Modify Web graph to do better sampling
  Self loops
  Bidirectional edges

Page 37

Convergence
Start from pairs of diverse topics
Two random walks, sample from each walk
Measure distance between topic distributions
  L1 distance $|p_1 - p_2| = \sum_c |p_1(c) - p_2(c)|$ in [0,2]
  Below 0.05 to 0.2 within 300 to 400 physical pages

[Figure: background distribution, a bar chart (0 to 0.4) over Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports]
[Figure: distribution difference (0 to 1) vs. #hops (0 to 1000), for Stride=30k and Stride=75k]
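The L1 distance between two topic distributions is a one-line sum. In this sketch the distributions are dictionaries from topic name to probability, an assumed representation; the example values reuse the Arts/Computers/Science vector from the classifier slide.

```python
def l1_distance(p1, p2):
    """Sum over topics c of |p1(c) - p2(c)|; a missing topic counts as 0."""
    topics = set(p1) | set(p2)
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in topics)

doc = {"Arts": 0.1, "Computers": 0.3, "Science": 0.6}
other = {"Arts": 0.6, "Computers": 0.3, "Science": 0.1}   # hypothetical
same = l1_distance(doc, doc)
apart = l1_distance(doc, other)
# Distributions with disjoint support reach the upper end of [0, 2].
disjoint = l1_distance({"Arts": 1.0}, {"Sports": 1.0})
```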

Page 38

Biases in topic directories
Use Dmoz to train a classifier
Sample the Web
Classify samples
Diff Dmoz topic distribution from Web sample topic distribution
Report maximum deviation in fractions
NOTE: Not exactly Dmoz

Dmoz over-represents
  Games.Video_Games
  Society.People
  Arts.Celebrities
  ...Education.Colleges
  ...Travel.Reservations
Dmoz under-represents
  …WWW…Directories!
  Sports.Hockey
  Society.Philosophy
  Education…K12…
  Recreation…Camping

Page 39

Topic-specific degree distribution
Preferential attachment: connect u to v w.p. proportional to the degree of v, regardless of topic
More realistic: u has a topic, and links to v with related topics
Unclear if power-law should be upheld

[Figure: intra-topic linkage vs. inter-topic linkage]

Page 40

Random forward walk without jumps
Sampling walk is designed to mix topics well
How about walking forward without jumping?
  Start from a page $u_0$ on a specific topic
  Forward random walk $(u_0, u_1, \ldots, u_i, \ldots)$
  Compare $(\Pr(c|u_i)\ \forall c)$ with $(\Pr(c|u_0)\ \forall c)$ and with the background distribution

[Figure: two panels, /Arts/Music and /Sports/Soccer, plotting L1 distance vs. wander hops (0 to 20); one curve "From background", one curve "From hop0"]

Page 41

Observations and implications
Forward walks wander away from starting topic slowly
But do not converge to the background distribution
Global PageRank ok also for topic-specific queries
  Jump parameter d = 0.1 to 0.2
  Topic drift not too bad within path length of 5 to 10
  Prestige conferred mostly by same-topic neighbors
Also explains why focused crawling works

[Figure: random surfer near a high-prestige node; w.p. d jump to a random node; w.p. (1-d) jump to an out-neighbor u.a.r.]

Page 42

Citation matrix
Given a page is about topic i, how likely is it to link to topic j?
  Matrix C[i,j] = probability that page about topic i links to page about topic j
  Soft counting: C[i,j] += Pr(i|u)Pr(j|v)
Applications
  Classifying Web pages into topics
  Focused crawling for topic-specific pages
  Finding relations between topics in a directory

[Figure: page u linking to page v]
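Soft counting can be sketched as a double loop over topics for every link (u, v). The final row normalization, which turns counts into probabilities, is an assumption; the slide only shows the increment. Page names and probabilities below are made up for illustration.

```python
def soft_citation_matrix(links, topic_probs, topics):
    """Accumulate C[i][j] += Pr(i|u) * Pr(j|v) over links (u, v), then
    normalize each row so it sums to 1 (assumed normalization)."""
    C = {i: {j: 0.0 for j in topics} for i in topics}
    for u, v in links:
        for i in topics:
            for j in topics:
                C[i][j] += topic_probs[u].get(i, 0.0) * topic_probs[v].get(j, 0.0)
    for i in topics:
        total = sum(C[i].values())
        if total > 0:
            for j in topics:
                C[i][j] /= total
    return C

topics = ["Arts", "Science"]
topic_probs = {                      # hypothetical classifier output
    "p1": {"Arts": 0.9, "Science": 0.1},
    "p2": {"Arts": 0.8, "Science": 0.2},
    "p3": {"Arts": 0.1, "Science": 0.9},
}
links = [("p1", "p2"), ("p3", "p3"), ("p1", "p3")]
C = soft_citation_matrix(links, topic_probs, topics)
```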

Page 43

Citation, confusion, correction

[Figure: matrices over the top-level topics (Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports); axes "From topic" vs. "To topic" and "True topic" vs. "Guessed topic"]

Classifier's confusion on held-out documents can be used to correct confusion matrix

Page 44

Fine-grained views of citation
Clear block-structure derived from coarse-grain topics
Strong diagonals reflect tightly-knit topic communities
Prominent off-diagonal entries raise design issues for taxonomy editors and maintainers

Page 45

Outline, some more detail
Part 1 (Modeling graphs)
  What do real-life graphs look like?
  What laws govern their formation, evolution and properties?
  What structural analyses are useful?
Part 2 (Analyzing graphs)
  Modeling data analysis problems using graphs
  Proposing parametric models
  Estimating parameters
  Applications from Web search and text mining

Page 46

Centrality and prestige

Page 47

How important is a node?
Degree, min-max radius, …
Pagerank
Maximum entropy network flows
HITS and stochastic variants
Stability and susceptibility to spamming
Hypergraphs and nonlinear systems
Using other hypertext properties
Applications: Ranking, crawling, clustering, detecting obsolete pages

Page 48

Importance/prestige as Pagerank
A node is important if it is connected to important nodes
"Random surfer" walks along links for ever
If current node has 3 outlinks, take each with probability 1/3
Importance = steady-state probability (ssp) of visit
"Maxwell's equation for the Web":

$PR(v) = \sum_{(u,v) \in E} \frac{PR(u)}{\mathrm{OutDegree}(u)}$

[Figure: node u with OutDegree(u)=3, one of whose links points to v]

Page 49

(Simplified) Pagerank algorithm
Let A be the node adjacency matrix
Column normalize $A^T$
Want vector p such that $A^T p = p$
I.e. p is the eigenvector corresponding to the largest eigenvalue, which is 1

[Figure: example graph on nodes 1 to 5, and the equation $A^T p = p$ with $p = (p_1, \ldots, p_5)^T$; rows of $A^T$ are indexed by "To", columns by "From", and the nonzero entries are 1 and 1/2]
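The fixed point $A^T p = p$ can be approximated by repeatedly pushing each node's score along its outlinks. This sketch assumes every node has at least one outlink and that the graph is aperiodic and irreducible; the 3-node graph is a made-up example, not the 5-node graph from the slide.

```python
def simple_pagerank(out_links, iterations=200):
    """Power iteration for p = A^T p with column-normalized A^T.
    Assumes no dangling nodes and an aperiodic, irreducible graph."""
    nodes = list(out_links)
    p = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        nxt = {u: 0.0 for u in nodes}
        for u, outs in out_links.items():
            share = p[u] / len(outs)   # split score equally over outlinks
            for v in outs:
                nxt[v] += share
        p = nxt
    return p

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
p = simple_pagerank({0: [1, 2], 1: [2], 2: [0]})
```

For this graph the steady state can be solved by hand: $p_0 = p_2 = 0.4$ and $p_1 = 0.2$.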

Page 50

Intuition
A as vector transformation: $x' = A x$

$\begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix}$

[Figure: vectors x and x' drawn in the plane]

Page 51

Intuition
By defn., eigenvectors remain parallel to themselves ('fixed points'): $A v_1 = \lambda_1 v_1$

$\begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} 0.52 \\ 0.85 \end{pmatrix} = 3.62 \begin{pmatrix} 0.52 \\ 0.85 \end{pmatrix}$
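The numbers in this example can be reproduced by power iteration: multiply by A, renormalize, repeat. `power_iteration` is an illustrative helper written in plain Python; a numerical library would normally be used instead.

```python
import math

def power_iteration(A, iterations=50):
    """Return (dominant eigenvalue, unit eigenvector) of a square matrix
    given as a list of rows, by repeated multiplication."""
    v = [1.0] * len(A)
    for _ in range(iterations):
        w = [sum(a * x for a, x in zip(row, v)) for row in A]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    eigenvalue = sum(x * y for x, y in zip(Av, v))  # Rayleigh quotient
    return eigenvalue, v

lam, v1 = power_iteration([[2.0, 1.0], [1.0, 3.0]])
```

The error shrinks by roughly the ratio of the second to the first eigenvalue at every step, which is why that ratio governs convergence speed.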

Page 52

Convergence
Usually, fast: depends on ratio $\lambda_1 : \lambda_2$

Page 53

Eigenvalues and epidemic networks
Will a virus spreading across an arbitrary network create an epidemic?
Susceptible-Infected-Susceptible (SIS)
  Was healthy, did not get infected
  Was infected, got cured without further attack
  Was infected, got cured immediately after attack
  Cured nodes immediately become susceptible

[Figure: two states, "Susceptible/healthy" and "Infected & infectious"; transition "Infected by neighbor" from healthy to infected, and "Cured internally" from infected back to healthy]

Page 54

The infection model
(virus) Birth rate $\beta$: probability that an infected neighbor attacks
(virus) Death rate $\delta$: probability that an infected node heals

[Figure: node N with neighbors N1, N2, N3, some infected and some healthy; each infected neighbor attacks with prob. $\beta$; an infected node heals with prob. $\delta$]

Page 55

Working parameters
$p_{i,t}$ = prob. node i is infected at time t
$\zeta_{i,t}$ = prob. that node i does NOT receive an infection from neighbors:

$\zeta_{i,t} = \prod_{(i,j) \in E} \left( p_{j,t-1}(1-\beta) + (1 - p_{j,t-1}) \right) = \prod_{(i,j) \in E} (1 - \beta p_{j,t-1})$

Ways for node i to be healthy at time t:
  Was healthy, did not get infected: $(1 - p_{i,t-1})\,\zeta_{i,t}$
  Was infected, got cured without further attack: $\delta\, p_{i,t-1}\,\zeta_{i,t}$
  Was infected, got attacks from neighbors and yet cured itself in this time step: $\frac{1}{2}\,\delta\, p_{i,t-1}(1 - \zeta_{i,t})$ … somewhat arbitrary

Page 56

Time recurrence for $p_{i,t}$
Assuming various probabilities are "suitably small":

$\zeta_{i,t} = \prod_{(i,j) \in E} (1 - \beta p_{j,t-1}) \approx 1 - \beta \sum_{(i,j) \in E} p_{j,t-1}$

Recurrence of the probability of infection can be approximated by this linear form:

$p_{i,t} \approx (1 - \delta)\, p_{i,t-1} + \beta \sum_{(i,j) \in E} p_{j,t-1}$

In other words, the (symmetric) transition matrix is

$S = (1 - \delta) I + \beta A$

Page 57

Epidemic threshold
Virus 'strength' $s = \beta/\delta$
Epidemic threshold $\tau$ of a graph is defined as the value of $\tau$ such that if strength $s = \beta/\delta < \tau$ then an epidemic cannot happen
Problem: compute epidemic threshold

[Figure: node N with neighbors N1, N2, N3, some infected and some healthy; attack prob. $\beta$, healing prob. $\delta$]

Page 58

Epidemic threshold
What should $\tau$ depend on?
  avg. degree?
  and/or highest degree?
  and/or variance of degree?
  and/or third moment of degree?

Page 59

Analysis
Eigenvectors of S and A are the same
Eigenvalues are shifted and scaled: $\lambda_{i,S} = 1 - \delta + \beta \lambda_{i,A}$
From spectral decomposition:

$p(n) = S^n p(0) = \sum_i \lambda_{i,S}^n\, u_{i,S}\, u_{i,S}^T\, p(0)$

A sufficient condition for infection dying down is that

$\lambda_{1,S} < 1$, or $1 - \delta + \beta \lambda_{1,A} < 1$, or $\beta/\delta < 1/\lambda_{1,A}$

Page 60

Epidemic threshold
An epidemic must die down if $\beta/\delta < \tau = 1/\lambda_{1,A}$
  $\beta$: attack prob.
  $\delta$: recovery prob.
  $\tau$: epidemic threshold
  $\lambda_{1,A}$: largest eigenvalue of adj. matrix A
Proof: [Wang+03]
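As a sketch of how the threshold could be computed for a small undirected graph, the code below estimates the largest adjacency eigenvalue by power iteration and returns its reciprocal. Iterating on A + I rather than A is a choice made here (an assumption, not from the slides) so that bipartite graphs such as stars do not oscillate.

```python
import math

def largest_eigenvalue(adj, iterations=200):
    """Largest eigenvalue of the adjacency matrix of an undirected graph
    given as {node: list of neighbors}."""
    nodes = list(adj)
    v = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Multiply by (A + I); the shift leaves eigenvectors unchanged.
        w = {u: v[u] + sum(v[x] for x in adj[u]) for u in nodes}
        norm = math.sqrt(sum(x * x for x in w.values()))
        v = {u: x / norm for u, x in w.items()}
    Av = {u: sum(v[x] for x in adj[u]) for u in nodes}
    return sum(Av[u] * v[u] for u in nodes)   # Rayleigh quotient for A

def epidemic_threshold(adj):
    """tau = 1 / lambda_1 of the adjacency matrix."""
    return 1.0 / largest_eigenvalue(adj)

# Toy graphs: a star with 4 leaves and a 4-clique.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
clique = {u: [v for v in range(4) if v != u] for u in range(4)}
tau_star = epidemic_threshold(star)
tau_clique = epidemic_threshold(clique)
```

The denser clique has the larger eigenvalue and hence the lower threshold, so a weaker virus can persist on it.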

Page 61

Experiments

[Figure: three cases: $\beta/\delta > \tau$ (above threshold); $\beta/\delta = \tau$ (at the threshold); $\beta/\delta < \tau$ (below threshold)]

Page 62

Remarks
"Primal" problem: topology design
  Design a network resilient to infection
"Dual" problem: viral marketing
  Selectively convert important customers who can influence many others
Will come back to this later

Page 63

Back to the random surfer
Practical issues with Pagerank [BrinP1997]
PR converges only if E is aperiodic and irreducible; make it so:

$PR(v) = \frac{d}{N} + (1 - d) \sum_{(u,v) \in E} \frac{PR(u)}{\mathrm{OutDegree}(u)}$

d is the (tuned) probability of "teleporting" to one of N pages uniformly at random (0.1 to 0.2 in practice)
(Possibly) unintended consequences: topic sensitivity, stability
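The teleporting recurrence translates directly into an iteration. This is a minimal sketch that assumes every node has at least one outlink (dangling nodes would need extra handling); the 3-node graph is hypothetical.

```python
def pagerank(out_links, d=0.15, iterations=200):
    """Iterate PR(v) = d/N + (1-d) * sum over in-links (u, v) of
    PR(u)/OutDegree(u). Assumes no node has an empty outlink list."""
    n = len(out_links)
    pr = dict.fromkeys(out_links, 1.0 / n)
    for _ in range(iterations):
        nxt = dict.fromkeys(out_links, d / n)   # teleport share
        for u, outs in out_links.items():
            for v in outs:
                nxt[v] += (1 - d) * pr[u] / len(outs)
        pr = nxt
    return pr

# Hypothetical graph: 0 <-> 1 form a 2-cycle, and 2 -> 0.
pr = pagerank({0: [1], 1: [0], 2: [0]})
```

Node 2 has no in-links, so its score is exactly the teleport share d/N; without teleporting, the 2-cycle would make the walk periodic.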

Page 64

Prestige as network flow
$y_{ij}$ = #surfers clicking from i to j per unit time
Hits per unit time on page j is $H_j = \sum_{(i,j) \in E} y_{ij}$
Flow is conserved at j: $\sum_{(i,j) \in E} y_{ij} = \sum_{(j,k) \in E} y_{jk}$
The total traffic is $Y = \sum_j H_j = \sum_{i,j} y_{ij}$
Normalize: $p_{ij} = y_{ij} / Y$
  Can interpret $p_{ij}$ as a probability
Standard Pagerank corresponds to one solution: $p_{ij} = H_i / (Y \cdot \mathrm{OutDegree}(i))$
Many other solutions possible

Page 65

Maximum entropy flow [Tomlin2003]
Flow conservation modeled using feature

$\forall r = 1, \ldots, N: \quad f_r(x_{ij}) = \begin{cases} 1 & j = r,\ (i,r) \in E \\ -1 & i = r,\ (r,j) \in E \\ 0 & \text{otherwise} \end{cases}$

And the constraints $0 = E(f_r(x_{ij})) = \sum_{i,j} p_{ij} f_r(x_{ij})$
Goal is to maximize $-\sum_{i,j} p_{ij} \log p_{ij}$ subject to $\sum_{i,j} p_{ij} = 1$
Solution has form $p_{ij} = \exp(-\lambda_0 - \lambda_j + \lambda_i)$
$\lambda_i$ is the "hotness" of page i

Page 66

Maxent flow results
Two IBM intranet data sets with known top URLs
Average rank ($10^6$) of known top URLs when sorted by Pagerank, $\lambda_i$, $H_i$ (smaller rank is better)
$\lambda_i$ ranking is better than Pagerank; $H_i$ ranking is worse

[Figure: average rank ($10^8$) vs. depth up to which dmoz.org URLs are used as ground truth]

Page 67

HITS [Kleinberg1997]
Two kinds of prestige
  Good hubs link to good authorities
  Good authorities are linked to by good hubs

$a(v) = \sum_{(u,v) \in E} h(u); \quad h(u) = \sum_{(u,v) \in E} a(v)$

In matrix notation, iterations amount to $a = E^T h$, $h = E a$, i.e., $a = E^T E\, a$, with interleaved normalization of h and a
Note that scores are copied not divided
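The two update rules with interleaved normalization can be sketched as below. The choice of the Euclidean norm and the tiny edge list are assumptions for illustration.

```python
import math

def hits(edges, iterations=50):
    """Iterate a = E^T h, h = E a with L2 normalization after each half-step.
    `edges` is a list of (u, v) links. Returns (hub, authority) scores."""
    nodes = {x for e in edges for x in e}
    h = dict.fromkeys(nodes, 1.0)
    a = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        a = dict.fromkeys(nodes, 0.0)
        for u, v in edges:
            a[v] += h[u]          # full hub score is copied, not divided
        norm = math.sqrt(sum(x * x for x in a.values()))
        a = {k: x / norm for k, x in a.items()}
        h = dict.fromkeys(nodes, 0.0)
        for u, v in edges:
            h[u] += a[v]
        norm = math.sqrt(sum(x * x for x in h.values()))
        h = {k: x / norm for k, x in h.items()}
    return h, a

# Hypothetical links: two hubs cite a1, only h1 also cites a2.
hub, auth = hits([("h1", "a1"), ("h2", "a1"), ("h1", "a2")])
```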

Page 68

HITS graph acquisition
Steps
  Get root set via keyword query
  Expand by one move forward and backward
  Drop same-site links (why?)
Each node assigned both a hub and an auth score
Graph tends to be "topic-specific"
Whereas Pagerank is usually run on "the entire Web" (but need not be)

[Figure: IN, Root set, OUT]

Page 69

HITS vs. Pagerank, more comments
Dominant eigenvectors of different matrices
  HITS: Eigensystems of $E E^T$ (h) and $E^T E$ (a)
  Pagerank: eigensystem of $dJ + (1-d)L^T$, where

  $J = \begin{pmatrix} 1/N & \cdots & 1/N \\ \vdots & \ddots & \vdots \\ 1/N & \cdots & 1/N \end{pmatrix}$ and $L(i,j) = \begin{cases} 1/\mathrm{OutDegree}(i) & (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$

HITS copies scores, Pagerank distributes
HITS drops same-site links, Pagerank does (can?) not
Implications?

Page 70

HITS: Dyadic interpretation [CohnC2000]
Graph includes many communities z
  Query="Jaguar" gets auto, game, animal links
Each URL is represented as two things
  A document d
  A citation c
Max

$\prod_{(d,c) \in E} \Pr(d,c) = \prod_{(d,c) \in E} \Pr(d) \Pr(c|d) = \prod_{(d,c) \in E} \Pr(d) \sum_z \Pr(z|d) \Pr(c|z)$

Guess number of aspects zs and use [Hofmann 1999] to estimate Pr(c|z)
These are the most authoritative URLs

Page 71

Dyadic results for “Machine learning”

Clustering based on citations + ranking within clusters

Page 72

Spamming link-based ranking
Recipe for spamming HITS
  Create a hub linking to genuine authorities
  Then mix in links to your customers' sites
  Highly susceptible to adversarial behavior
Recipe for spamming Pagerank
  Buy a bunch of domains, cloak IP addresses
  Host a site at each domain
  Sprinkle a few links at random per page to other sites you own
  Takes more work than spamming HITS

Page 73

Stability of link analysis [NgZJ2001]
Compute HITS authority scores and Pagerank
Delete 30% of nodes/links at random
Recompute and compare ranks; repeat
Pagerank ranks more stable than HITS authority ranks
  Why?
  How to design more stable algorithms?

Ranks across five runs (one row per page)

HITS Authority
 1    3    1    1   1
 2    5    3    3   2
 3   12    6    6   3
 4   52   20   23   4
 5  171  119   99   5
 6  135   56   40   8
10  179  159  100   7
 8  316  141  170   6

Pagerank
1  1  1  1  1
2  2  2  2  2
3  5  6  4  5
4  3  5  5  4
5  6  3  6  3
6  4  4  3  6
7  7  7  7  7
8  8  8  8  9

Page 74

Stability depends on graph and params
Auth score is eigenvector for $E^T E = S$, say
Let $\lambda_1 > \lambda_2$ be the first two eigenvalues
There exists an S' such that
  S and S' are close: $\|S - S'\|_F = O(\lambda_1 - \lambda_2)$
  But $\|u_1 - u'_1\|_2 = \Omega(1)$
Pagerank p is eigenvector of $(\epsilon U + (1 - \epsilon) E)^T$
  U is a matrix full of 1/N and $\epsilon$ is the jump prob
If set C of nodes are changed in any way, the new Pagerank vector p' satisfies

$\|p' - p\|_2 \le \frac{2}{\epsilon} \sum_{u \in C} p_u$


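Randomized HITS
Each half-step, with probability $\epsilon$, teleport to a node chosen uniformly at random
  $$a^{(t+1)} = \epsilon \vec{1} + (1-\epsilon)\,E_{\text{row}}^T\,h^{(t)}$$
  $$h^{(t+1)} = \epsilon \vec{1} + (1-\epsilon)\,E_{\text{col}}\,a^{(t+1)}$$
Much more stable than HITS
  Results more meaningful
$\epsilon$ near 1 will always stabilize
  Here $\epsilon$ was 0.2

Rank tables, one row per page and one column per run (panel labels: Pagerank, Randomized HITS)
  1  3  3  2  1
  4  1  1  1  2
  2  2  2  3  4
  3  4  4  4  3
  5  6  6  6  5
  6  5  5  5  6
  7  7  7  7  7
  8  8  8  8  8

  1  1  1  1  2
  3  2  2  2  1
  2  3  3  3  3
  4  4  4  4  4
  5  6  7  5  5
  6  7  6  6  6
  7  5  5  7  7
  8  9  9  9 11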
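The randomized HITS half-steps can be sketched in a few lines of plain Python. This is a minimal sketch, not the implementation from [NgZJ2001]: the function name, the edge-list input, the uniform teleport of eps/N per node (so that scores sum to 1), and the treatment of nodes without out-links or in-links as teleporting are all assumptions made for illustration.

```python
def randomized_hits(edges, n, eps=0.2, iters=100):
    """edges: (u, v) pairs meaning u links to v; nodes are 0..n-1."""
    out_nbrs = [[] for _ in range(n)]
    in_nbrs = [[] for _ in range(n)]
    for u, v in edges:
        out_nbrs[u].append(v)
        in_nbrs[v].append(u)

    def half_step(score, nbrs):
        # With probability eps teleport uniformly; otherwise spread each
        # node's score evenly over its neighbours in the given direction.
        nxt = [eps / n] * n
        for u in range(n):
            targets = nbrs[u] or range(n)  # assumption: dead ends teleport
            share = (1 - eps) * score[u] / len(targets)
            for v in targets:
                nxt[v] += share
        return nxt

    a = h = [1.0 / n] * n
    for _ in range(iters):
        a = half_step(h, out_nbrs)  # hubs push score forward along out-links
        h = half_step(a, in_nbrs)   # authorities push score back along in-links
    return a, h


# Toy graph: nodes 0, 1 and 3 all point at node 2; node 0 also points at 1.
authority, hub = randomized_hits([(0, 2), (1, 2), (3, 2), (0, 1)], n=4)
```

Because every half-step mixes in a uniform teleport, both score vectors remain probability distributions, which is one way to see why small changes to the graph cannot move them arbitrarily far.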


Another random walk variation of HITS
SALSA: Stochastic HITS [LempelM2000]
Two separate random walks
  From authority to authority via hub
  From hub to hub via authority
Transition probability
  $$\Pr(a_i \to a_j) = \sum_{h:\,(h,a_i),(h,a_j) \in E} \frac{1}{\text{InDegree}(a_i)} \cdot \frac{1}{\text{OutDegree}(h)}$$
If transition graph is irreducible, $\pi_a \propto \text{InDegree}(a)$
For disconnected components, depends on relative size of bipartite cores
Avoids dominance of larger cores
[Figure: authorities a1 and a2 with transition probabilities 1/3, 1/3, 1/3 and 1/2, 1/2]
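The authority-to-authority walk of SALSA can be made concrete with a small sketch. The function name and the toy hub/authority labels below are hypothetical; the code builds the transition probabilities by stepping backward over an in-link and then forward over an out-link, and returns in-degrees so the in-degree-proportional distribution can be checked for stationarity.

```python
from collections import defaultdict


def salsa_authority_chain(edges):
    """edges: (hub, authority) pairs. Returns (P, indegree), where
    P[ai][aj] is the probability of stepping from authority ai to aj."""
    in_nbrs = defaultdict(set)
    out_nbrs = defaultdict(set)
    for h, a in edges:
        in_nbrs[a].add(h)
        out_nbrs[h].add(a)
    P = {a: defaultdict(float) for a in in_nbrs}
    for ai, hubs in in_nbrs.items():
        for h in hubs:                # step back to a hub citing ai
            for aj in out_nbrs[h]:    # step forward to an authority cited by h
                P[ai][aj] += (1.0 / len(hubs)) * (1.0 / len(out_nbrs[h]))
    indegree = {a: len(hubs) for a, hubs in in_nbrs.items()}
    return P, indegree


toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a2"), ("h3", "a2"), ("h3", "a3")]
P, indegree = salsa_authority_chain(toy_edges)
total_edges = sum(indegree.values())
# Candidate stationary distribution: authority score proportional to in-degree.
pi = {a: d / total_edges for a, d in indegree.items()}
```

On this toy graph, multiplying `pi` by `P` returns `pi` again, which illustrates why SALSA scores reduce to in-degree within a connected component.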


SALSA sample result (“movies”)

HITS: The Tightly-Knit Community (TKC) effect

SALSA: Less TKC influence (but no reinforcement!)


Links in relational data [GibsonKR1998]
(Attribute, value) pair is a node
  Each node $v$ has weight $w_v$
Each tuple is a hyperedge
  Tuple $r$ has weight $x_r$
HITS-like iterations to update weight $w_v$
  For each tuple $r = (v, u_1, \ldots, u_k)$: $x_r \leftarrow \oplus(w_{u_1}, \ldots, w_{u_k})$
  Update weight: $w_v \leftarrow \sum_r x_r$
Combining operator $\oplus$ can be sum, max, product, $L_p$ avg, etc.


Distilling links in relational data

[Figure: columns Author, Author, Forum, Year; node groups labeled Database and Theory]


Searching and annotating graph data


Searching graph data
Nodes in graph contain text
  Random → Intelligent surfer [RichardsonD2001]
  Topic-sensitive Pagerank [Haveliwala2003]
  Assigning image captions using random walks [PanYFD2004]
  Rotting pages and links [BarYossefBKT2004]
Query is a set of keywords
  All keywords may not match a single node
  Implicit joins [Hulgeri+2001, Agrawal+2002]
  Or rank aggregation [Balmin+2004] required


Intelligent Web surfer

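Query $Q$ = set of words; pick a query word (keyword) $q$ per some distribution $\Pr(q)$, e.g. IDF
$$\Pr_q(j) = (1-\beta)\,\Pr'_q(j) + \beta \sum_{(i,j) \in E} \Pr_q(i)\,\Pr_q(i \to j)$$
Probability of teleporting to node $j$, where $R_q(k)$ is the relevance of node $k$ wrt $q$:
$$\Pr'_q(j) = \frac{R_q(j)}{\sum_{k \in V} R_q(k)}$$
Probability of walking from $i$ to $j$ wrt $q$ (pick out-link to walk on in proportion to relevance of target out-neighbor):
$$\Pr_q(i \to j) = \frac{R_q(j)}{\sum_{(i,k) \in E} R_q(k)}$$
Combining over the query words:
$$\text{PR}_Q(j) = \sum_{q \in Q} \Pr(q)\,\Pr_q(j)$$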
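The intelligent surfer for a single keyword can be sketched as a power iteration in which both the teleport target and the out-link choice are weighted by relevance. This is an illustrative sketch only: the function name, the parameter `beta` (probability of following a link), and the fallback to the teleport distribution when no out-neighbor is relevant are assumptions, not details taken from [RichardsonD2001].

```python
def query_pagerank(out_links, relevance, beta=0.85, iters=100):
    """out_links: node -> list of out-neighbours (every node is a key).
    relevance: node -> non-negative relevance R_q(node) for one keyword q."""
    nodes = list(out_links)
    total = sum(relevance[k] for k in nodes)
    jump = {j: relevance[j] / total for j in nodes}   # teleport distribution
    p = dict(jump)
    for _ in range(iters):
        nxt = {j: (1 - beta) * jump[j] for j in nodes}
        for i in nodes:
            denom = sum(relevance[k] for k in out_links[i])
            if denom == 0:
                # Assumption: with no relevant out-neighbour, teleport instead.
                for j in nodes:
                    nxt[j] += beta * p[i] * jump[j]
            else:
                for j in out_links[i]:
                    nxt[j] += beta * p[i] * relevance[j] / denom
        p = nxt
    return p


toy_links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
toy_relevance = {"a": 1.0, "b": 2.0, "c": 0.0}
pr_q = query_pagerank(toy_links, toy_relevance)
```

Node `c` has zero relevance for the keyword, so the surfer never teleports to it and never chooses the link into it; its score stays at zero even though it is linked from `a`.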


Implementing the intelligent surfer

PRQ(j) approximates a walk that picks a query keyword using Pr(q) at every step

Precompute and store Prq(j) for each keyword q in lexicon: space blowup = avg doc length

Query-dependent PR rated better by volunteers


Topic-sensitive Pagerank
High overhead for per-word Pagerank
Instead, compute Pageranks for some collection of broad topics: $\text{PR}_c(j)$
  Topic $c$ has sample page set $S_c$
  Walk as in Pagerank
  Jump to a node in $S_c$ uniformly at random
“Project” query onto set of topics
  $$\Pr(c \mid Q) \propto \Pr(c) \prod_{q \in Q} \Pr(q \mid c)$$
Rank responses by projection-weighted Pageranks
  $$\text{Score}(Q, j) = \sum_c \Pr(c \mid Q)\,\text{PR}_c(j)$$
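The query-time part of topic-sensitive Pagerank (projecting the query onto topics, then mixing precomputed per-topic Pageranks) can be sketched as follows. The function names, the toy topic models, and the small `floor` probability for words unseen in a topic are assumptions for illustration; the per-topic Pagerank vectors are taken as already computed.

```python
def topic_posterior(query, prior, word_given_topic, floor=1e-9):
    """Pr(c | Q) proportional to Pr(c) * product over q of Pr(q | c)."""
    weights = {}
    for c, pc in prior.items():
        w = pc
        for q in query:
            # Assumption: unseen words get a tiny floor probability.
            w *= word_given_topic[c].get(q, floor)
        weights[c] = w
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


def topic_sensitive_score(query, page, prior, word_given_topic, topic_pagerank):
    """Score(Q, j) = sum over topics c of Pr(c | Q) * PR_c(j)."""
    post = topic_posterior(query, prior, word_given_topic)
    return sum(post[c] * topic_pagerank[c][page] for c in post)


toy_prior = {"sports": 0.5, "tech": 0.5}
toy_words = {"sports": {"goal": 0.1}, "tech": {"goal": 0.01, "linux": 0.1}}
toy_pr = {"sports": {"p1": 0.7, "p2": 0.3}, "tech": {"p1": 0.2, "p2": 0.8}}
posterior = topic_posterior(["linux"], toy_prior, toy_words)
score_p1 = topic_sensitive_score(["linux"], "p1", toy_prior, toy_words, toy_pr)
score_p2 = topic_sensitive_score(["linux"], "p2", toy_prior, toy_words, toy_pr)
```

Only the small posterior over topics depends on the query, so the expensive Pagerank computations stay offline.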


Topic-sensitive Pagerank results

Users prefer topic-sensitive Pagerank on most queries to global Pagerank + keyword filter


Image captioning
Segment images into regions
Image has caption words
Three-layer graph: image, regions, caption words
Threshold on region similarity to connect regions (dotted)


Random walks with restarts

Find regions in test image
Connect regions to other nodes in the region layer using region similarity
Random walk, restarting at test image node
Pick words with largest visit probability
[Figure: three-layer graph with Regions, Images and Words layers, plus the test image node]
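A random walk with restarts over the image/region/word graph can be sketched with a simple power iteration. The node names, the toy graph, and the restart probability of 0.15 are made up for illustration and are not taken from [PanYFD2004].

```python
def undirected(edges):
    """Build an adjacency list from undirected (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def random_walk_with_restart(adj, start, restart=0.15, iters=200):
    """Steady-state visit probabilities of a walk that jumps back to
    `start` with probability `restart` at every step."""
    p = {u: 0.0 for u in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        nxt[start] += restart
        for u, nbrs in adj.items():
            share = (1 - restart) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        p = nxt
    return p


toy_graph = undirected([
    ("img1", "r1"), ("img1", "r2"), ("img1", "sky"), ("img1", "sea"),
    ("img2", "r3"), ("img2", "tiger"),
    ("test", "rt"), ("rt", "r1"),   # test region rt is similar to region r1
])
visit = random_walk_with_restart(toy_graph, "test")
caption_words = sorted(["sky", "sea", "tiger"], key=lambda w: -visit[w])
```

In this toy graph the test image reaches `sky` and `sea` through the similar region `r1`, while `tiger` lies in a component the walk never enters.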


More random walks: “Link rot”
How stale is a page?
  Last-mod unreliable
  Automatic dead-link cleaners mask disuse
A page is “completely stale” if it is “dead”
  Let $D$ be the set of pages which cannot be accessed (404 and other problems)
How stale is a page $u$? Start with $p \leftarrow u$
  If $p \in D$ declare decay value of $u$ to be 1, else
  With probability $\sigma$ declare decay value of $u$ = 0
  W.p. $1-\sigma$ choose outlink $v$, set $p \leftarrow v$, loop
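The expected outcome of the staleness walk satisfies a simple recursion: a dead page has decay 1, and a live page has decay equal to the continuation probability times the average decay of its out-links. The sketch below iterates that recursion. The function name, the uniform choice among out-links, the value 0.1 for the stopping probability, and giving live pages without out-links a decay of 0 are all assumptions.

```python
def decay_scores(out_links, dead, stop_prob=0.1, iters=100):
    """out_links: page -> list of out-linked pages (every page is a key).
    dead: set of pages that cannot be accessed."""
    decay = {u: (1.0 if u in dead else 0.0) for u in out_links}
    for _ in range(iters):
        for u, nbrs in out_links.items():
            if u in dead or not nbrs:
                continue  # dead pages stay at 1; live dead ends stay at 0
            avg = sum(decay[v] for v in nbrs) / len(nbrs)
            decay[u] = (1 - stop_prob) * avg
    return decay


toy_web = {"a": ["b", "x"], "b": ["x"], "x": [], "c": ["a"]}
decay = decay_scores(toy_web, dead={"x"})
```

Page `c` has no dead out-links at all, yet it inherits a substantial decay score from `a` and `b`, which is the “rotting” effect that the fraction of direct dead links misses.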


Page staleness results

Decay score is correlated with, but generally larger than the fraction of dead outlinks on a page

Removing direct dead links automatically does not eliminate live but “rotting” pages

[Plot: Decay vs. 404s]


Graph proximity search: two paradigms
A single node as query response
  Find node that matches query terms…
  …or is “near” nodes matching query terms [Goldman+1998]
A connected subgraph as query response
  Single node may not match all keywords
  No natural “page boundary” [Bhalotia+2002] [Agrawal+2002]


Single-node response examples
Travolta, Cage → Actor, Face/Off
Travolta, Cage, Movie → Face/Off
Kleiser, Movie → Gathering, Grease
Kleiser, Woo, Actor → Travolta
[Figure: graph with nodes Travolta, Cage, Kleiser, Woo, Face/Off, Grease, Gathering and type nodes Actor, Movie, Director; edge labels “acted-in”, “directed”, “is-a”]


Basic search strategy
Node subset A activated because they match query keyword(s)
Look for node near nodes that are activated
Goodness of response node depends
  Directly on degree of activation
  Inversely on distance from activated node(s)


Proximity query: screenshot

http://www.cse.iitb.ac.in/banks/


Ranking a single node response
Activated node set $A$
Rank node $r$ in “response set” $R$ based on proximity to nodes $a$ in $A$
  Nodes have relevance $\rho_R$ and $\rho_A$ in [0,1]
  Edge costs are “specified by the system”
$d(a,r)$ = cost of shortest path from $a$ to $r$
Bond between $a$ and $r$
  $$b(a,r) = \frac{\rho_A(a)\,\rho_R(r)}{d(a,r)^t}$$
Parameter $t$ tunes relative emphasis on distance and relevance score
Several ad-hoc choices


Scoring single response nodes
Additive: $\text{score}(r) = \sum_{a \in A} b(a,r)$
Belief: $\text{score}(r) = 1 - \prod_{a \in A} \big(1 - b(a,r)\big)$
Goal: find a limited number of nodes with the largest scores
Performance issues
  Assume the graph is in memory?
  Precompute all-pairs shortest path ($|V|^3$)?
  Prune unpromising candidates?
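The bond between activated and response nodes, and the additive and belief scores, can be sketched as follows. The function names, the adjacency format, the unit edge costs in the toy graph, and skipping pairs at distance zero or with no connecting path are assumptions; the belief score additionally assumes edge costs of at least 1 so that every bond stays in [0,1].

```python
import heapq
import math


def shortest_costs(adj, source):
    """Dijkstra over adj: node -> list of (neighbour, cost)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue
        for v, cost in adj.get(u, []):
            nd = d + cost
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def bonds_to(r, activated, adj, rel_a, rel_r, t=2.0):
    """Bond of each activated node a with response r: rel_a * rel_r / d^t."""
    bonds = []
    for a in activated:
        d = shortest_costs(adj, a).get(r)
        if d is None or d == 0:
            continue  # assumption: skip unreachable pairs and a == r
        bonds.append(rel_a[a] * rel_r[r] / d ** t)
    return bonds


def additive_score(bonds):
    return sum(bonds)


def belief_score(bonds):
    miss = 1.0
    for b in bonds:
        miss *= 1.0 - b
    return 1.0 - miss


toy_adj = {
    "Travolta": [("Face/Off", 1.0)],
    "Cage": [("Face/Off", 1.0)],
    "Face/Off": [("Travolta", 1.0), ("Cage", 1.0)],
}
toy_bonds = bonds_to("Face/Off", ["Travolta", "Cage"], toy_adj,
                     rel_a={"Travolta": 0.5, "Cage": 0.5},
                     rel_r={"Face/Off": 1.0})
```

Running one shortest-path search per activated node is the naive approach; it is the cost that the precomputation and pruning questions on this slide aim to reduce.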


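Hub indexing
Decompose APSP problem using sparse vertex cuts
  $|A|+|B|$ shortest paths to $p$
  $|A|+|B|$ shortest paths to $q$
  $d(p,q)$
To find $d(a,b)$ compare
  $d(a \to p \to b)$ not through $q$
  $d(a \to q \to b)$ not through $p$
  $d(a \to p \to q \to b)$
  $d(a \to q \to p \to b)$
Greatest savings when $|A| \approx |B|$
Heuristics to find cuts, e.g. large-degree nodes
[Figure: node sets A (containing a) and B (containing b) separated by cut vertices p and q]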


ObjectRank [Balmin+2004]
Given a data graph with nodes having text
For each keyword precompute a keyword-sensitive Pagerank [RichardsonD2001]
Score of a node for multiple keyword search based on fuzzy AND/OR
  Approximation to Pagerank of node with restarts to nodes matching keywords
Use Fagin-merge [Fagin2002] to get best nodes in data graph


Connected subgraph as response
Single node may not match all keywords
No natural “page boundary”
  On-the-fly joins make up a “response page”
Two scenarios
  Keyword search on relational data
    • Keywords spread among normalized relations
  Keyword search on XML-like or Web data
    • Keywords spread among DOM nodes and subtrees


Keyword search on relational data
Tuple = node
Some columns have text
Foreign key constraints = edges in schema graph
Query = set of terms
No natural notion of a document
  Normalization
  Join may be needed to generate results
  Cycles may exist in schema graph: ‘Cites’

Schema
  Paper(PaperID, PaperName)
  Author(AuthorID, AuthorName)
  Writes(AuthorID, PaperID)
  Cites(Citing, Cited)

Paper
  PaperID  PaperName
  P1       DBXplorer
  P2       BANKS

Author
  AuthorID  AuthorName
  A1        Chaudhuri
  A2        Sudarshan
  A3        Hulgeri

Cites
  Citing  Cited
  P2      P1

Writes
  AuthorID  PaperID
  A1        P1
  A2        P2
  A3        P2


DBXplorer and DISCOVER
Enumerate subsets of relations in schema graph which, when joined, may contain rows which have all keywords in the query
  “Join trees” derived from schema graph
Output SQL query for each join tree
Generate joins, checking rows for matches [Agrawal+2002], [Hristidis+2002]
[Figure: schema graph over tables T1–T5 annotated with keyword matches K1, K2, K3, and candidate join trees built from T2, T3, T4 and T5]


Discussion
Exploits relational schema information to contain search
Pushes final extraction of joined tuples into RDBMS
Faster than dealing with full data graph directly
Coarse-grained ranking based on schema tree
Does not model proximity or (dis)similarity of individual tuples
No recipe for data with less regular (e.g. XML) or ill-defined schema


Motivation from Web search
“Linux modem driver for a Thinkpad A22p”
  Hyperlink path matches query collectively
  Conjunction query would fail
Projects where X and P work together
  Conjunction may retrieve wrong page
General notion of graph proximity
[Figure: linked pages “IBM Thinkpads” (A20m, A22p), “Thinkpad Drivers” (Windows XP, Linux), “Download / Installation tips” (Modem, Ethernet); and “Home Page of Professor X” (Papers: VLDB…; Students: P, Q), “P’s home page: I work on the B project.”, “The B System” (Group members: P, S, X)]


Data structures for search
Answer = tree with at least one leaf containing each keyword in query
  Group Steiner tree problem, NP-hard
Query term $t$ found in source nodes $S_t$
Single-source-shortest-path SSSP iterator
  Initialize with a source (near-) node
  Consider edges backwards
  getNext() returns next nearest node
For each iterator, each visited node $v$ maintains for each $t$ a set $v.R_t$ of nodes in $S_t$ which have reached $v$


Generic expanding search
Near node sets $S_t$ with $S = \bigcup_t S_t$
For all source nodes $\sigma \in S$
  Create a SSSP iterator with source $\sigma$
While more results required
  Get next iterator and its next-nearest node $v$
  Let $t$ be the term for the iterator’s source $s$
  crossProduct = $\{s\} \times \prod_{t' \ne t} v.R_{t'}$
  For each tuple of nodes in crossProduct
    • Create an answer tree rooted at $v$ with paths to each source node in the tuple
  Add $s$ to $v.R_t$


Search example (“Vu Kleinberg”)

[Figure: data graph with author nodes Quoc Vu, Jon Kleinberg, Divyakant Agrawal, Eva Tardos; paper nodes “A metric labeling problem”, “Authoritative sources in a hyperlinked environment”, “Organizing Web pages by ‘Information Unit’”; edges labeled writes and cites]


First response

[Figure: the same data graph of authors (Quoc Vu, Jon Kleinberg, Divyakant Agrawal, Eva Tardos) and papers (“A metric labeling problem”, “Authoritative sources in a hyperlinked environment”, “Organizing Web pages by ‘Information Unit’”) with writes and cites edges, showing the first response]


Subgraph search: screenshot

http://www.cse.iitb.ac.in/banks/


Similarity, neighborhood, influence


Why are two nodes similar?
What is/are the best paths connecting two nodes explaining why/how they are related?
  Graph of co-starring, citation, telephone call, …
Graph with nodes $s$ and $t$; budget of $b$ nodes
Find “best” $b$ nodes capturing relationship between $s$ and $t$ [FaloutsosMT2004]:
  Proposing a definition of goodness
  How to efficiently select best connections
[Figure: connection subgraph with nodes Negroponte, Palmisano, Esther Dyson, Gerstner]


Simple proposals that do not work
Shortest path
  Pizza boy $p$ gets same attention as $g$
Network flow
  $s \to a \to b \to t$ is as good as $s \to g \to t$
Voltage
  Connect +1V at $s$, ground $t$
  Both $g$ and $p$ will be at +0.5V
Observations
  Must reward parallel paths
  Must reward short paths
  Must penalize/tax pizza boys
[Figure: graph on nodes s, a, b, g, p, t]


Resistive network with universal sink
Connect +1V to $s$
Ground $t$
Introduce universal sink
  Grounded
  Connected to every node
Universal sink is a “tax collector”
  Penalizes pizza boys
  Penalizes long paths
Goodness of a path is the electric current it carries
[Figure: graph on nodes s, a, b, g, p, t with a universal sink connected to every node]


Resistive network algorithm
Ohm’s law: $I(u,v) = C(u,v)\,[V(u) - V(v)] \quad \forall u, v$
Kirchhoff’s current law: $\forall v \ne s,t: \sum_u I(u,v) = 0$
Boundary conditions (without sink): $V(s) = 1,\ V(t) = 0$
Solution: $V(u) = \sum_v V(v)\,C(u,v) \Big/ \sum_w C(u,w)$, for $u \ne s,t$
Here $C(u,v)$ is the conductance from $u$ to $v$
Add grounded universal sink $z$ with $V(z)=0$
Set $\forall u: C(u,z) = \alpha \sum_{w \ne z} C(u,w)$
Display subgraph carrying high current
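The voltage equations, with and without the universal sink, can be solved on a toy graph by repeatedly replacing each free node's voltage with the conductance-weighted average of its neighbors. This is a sketch under assumptions: the function names, the Gauss-Seidel style iteration, the value alpha = 1, and the toy graph (a "pizza boy" p that also touches five otherwise unrelated nodes) are all illustrative choices, not taken from [FaloutsosMT2004].

```python
def symmetric(edges):
    """Build cond[u][v] = conductance from undirected (u, v, c) triples."""
    cond = {}
    for u, v, c in edges:
        cond.setdefault(u, {})[v] = c
        cond.setdefault(v, {})[u] = c
    return cond


def voltages(cond, s, t, alpha=0.0, iters=2000):
    """V(s) = 1, V(t) = 0; every other node is the weighted average of its
    neighbours, where a grounded sink adds alpha * (total conductance)."""
    V = {u: 0.0 for u in cond}
    V[s] = 1.0
    for _ in range(iters):
        for u in cond:
            if u in (s, t):
                continue
            total = sum(cond[u].values())
            # The sink sits at 0 V, so it only enlarges the denominator.
            V[u] = sum(c * V[v] for v, c in cond[u].items()) / (total * (1 + alpha))
    return V


def current(cond, V, u, v):
    """Ohm's law: current flowing from u to v."""
    return cond[u][v] * (V[u] - V[v])


toy_edges = [("s", "a", 1.0), ("a", "b", 1.0), ("b", "t", 1.0),
             ("s", "g", 1.0), ("g", "t", 1.0),
             ("s", "p", 1.0), ("p", "t", 1.0)]
toy_edges += [("p", "x%d" % i, 1.0) for i in range(1, 6)]
toy_cond = symmetric(toy_edges)
V_plain = voltages(toy_cond, "s", "t")
V_sink = voltages(toy_cond, "s", "t", alpha=1.0)
```

Without the sink, g and p both sit at 0.5 V and look equally good. With the sink, the high-degree node p leaks most of its current to ground, so the path through g delivers more current to t.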


Distributions influenced via graphs
Directed or undirected graph; nodes have
  Observable properties
  Some unobservable (random) state
Edges indicate that distributions over unobservable states are coupled
Many applications
  Hypertext classification (topics are clustered)
  Social network of customers buying products
  Hierarchical classification
  Labeling categorical sequences: POS/NE tagging, sense disambiguation, linkage analysis


Basic techniques
Directed (acyclic) graphs: Bayesian network
Markov networks
  (Loopy) belief propagation
Conditional Markov networks
  Avoid modeling joint distribution over observable and hidden properties of nodes
Some computationally simple special cases


Hypertext classification
Want to assign labels to Web pages
Text on a single page may be too little or misleading
Page is not an isolated instance by itself
Problem setup
  Web graph $G=(V,E)$
  Node $u \in V$ is a page having text $u_T$
  Edges in $E$ are hyperlinks
  Some nodes are labeled
  Make collective guess at missing labels
Probabilistic model? Benefits?


Graph labeling model
Seek a labeling $f$ of all unlabeled nodes so as to maximize
$$\Pr\big(f(V) \mid E, \{u_T : u \in V\}\big) = \frac{\Pr(f(V))\,\Pr\big(E, \{u_T : u \in V\} \mid f(V)\big)}{\sum_{\phi} \Pr(\phi)\,\Pr\big(E, \{u_T : u \in V\} \mid \phi\big)}$$
Luckily we don’t need to worry about this (the denominator)
Let $V^K$ be the nodes with known labels and $f(V^K)$ their known label assignments
Let $N(v)$ be the neighbors of $v$ and $N^K(v) \subseteq N(v)$ be neighbors with known labels
Markov assumption: $f(v)$ is conditionally independent of rest of $G$ given $f(N(v))$


Markov graph labeling
Probability of labeling specific node $v$ given edges, text, partial labeling, written as a sum over all possible labelings of unknown neighbors of $v$:
$$\Pr\big(f(v) \mid E, V_T, f(V^K)\big) = \sum_{f(N^U(v)) \in \Omega_v} \Pr\big(f(v), f(N^U(v)) \mid E, V_T, f(V^K)\big)$$
$$= \sum_{f(N^U(v)) \in \Omega_v} \Pr\big(f(N^U(v)) \mid E, V_T, f(V^K)\big)\,\Pr\big(f(v) \mid f(N^U(v)), E, V_T, f(V^K)\big)$$
Markov assumption: label of $v$ does not depend on unknown labels outside the set $N^U(v)$
Circularity between $f(v)$ and $f(N^U(v))$
Some form of iterative Gibbs sampling or MCMC


Iterative labeling
Label estimates of $v$ in the next iteration:
$$\Pr^{(r+1)}\big(f(v) \mid E, V_T, f(V^K)\big) = \sum_{f(N^U(v)) \in \Omega_v} \Big[\prod_{w \in N^U(v)} \Pr^{(r)}\big(f(w) \mid E, V_T, f(V^K)\big)\Big]\,\Pr\big(f(v) \mid f(N^U(v)), E, V_T, f(V^K)\big)$$
Joint distribution over neighbors approximated as product of marginals (the bracketed product)
Take the expectation of the last term over $N^U(v)$ labelings
Sum over all possible $N^U(v)$ labelings still too expensive to compute
  In practice, prune to most likely configurations
Let us look at the last term more carefully


A generative node model

By the Markov assumption, finally we need a distribution coupling $f(v)$ and $v_T$ (the text on $v$) and $f(N(v))$:
$$\Pr\big(f(v) \mid f(N^U(v)), E, V_T, f(V^K)\big) = \Pr\big(f(v) \mid f(N(v)), v_T\big)$$
Can use Bayes classifier as with ordinary text: estimate a parametric model for the class-conditional joint distribution between the text on the page $v$ and the labels of neighbors of $v$
$$\Pr\big(f(N(v)), v_T \mid f(v)\big)$$
Must make naïve Bayes assumptions to keep practical


Pictorially…
c=class, t=text, N=neighbors
Text-only model: Pr[t|c]
Using neighbors’ text to judge my topic: Pr[t, t(N) | c] (hurts)
Better model: Pr[t, c(N) | c]
Estimate histograms and update based on neighbors’ histograms


Generative model: results
9600 patents from 12 classes marked by USPTO
Patents have text and cite other patents
Expand test patent to include neighborhood
‘Forget’ fraction of neighbors’ classes
[Plot: %Error (0 to 40) vs. %Neighborhood known (0 to 100) for Text, Link and Text+Link]


Detour: generative vs. discriminative
$x$ = feature vector, $y$ = label $\in \{0,1\}$ say
Generative method models $\Pr(x|y)$ or $\Pr(x,y)$: “the generation of data $x$ given label $y$”
  Use Bayes rule to get $\Pr(y|x)$
  Inaccurate; $x$ may have many dimensions
Discriminative: directly estimates $\Pr(y|x)$
  Simple linear case: want $w$ s.t. $w \cdot x > 0$ if $y=1$ and $w \cdot x \le 0$ if $y=0$
  Cannot differentiate for $w$; instead pick some smooth “loss function”
    $$\Pr(y = 1 \mid x) = \frac{1}{1 + \exp(-w \cdot x)}$$
  Works very well in practice
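The smooth logistic form can be fitted by plain gradient descent on the negative log-likelihood. The following is a minimal sketch: the function names, learning rate, epoch count, and the four-point toy data set (with a constant first feature acting as a bias) are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Gradient descent on the average logistic loss; returns weights w."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for k, xk in enumerate(x):
                grad[k] += (p - label) * xk   # derivative of the log loss
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def predict_proba(w, x):
    """Estimated Pr(y = 1 | x) under the fitted weights."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


toy_X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
toy_y = [0, 0, 1, 1]
w_fit = fit_logistic(toy_X, toy_y)
```

Unlike the hard constraint on the sign of $w \cdot x$, the log loss has a gradient everywhere, which is what makes this kind of update possible.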


Discriminative node model
OA(X) = direct “own” attributes of node X
LD(X) = link-derived attributes of node X
  Mode-link: most frequent label of neighbors(X)
  Count-link: histogram of neighbor labels
  Binary-link: 0/1 histogram of neighbor labels
Local model params $w_o$
  $$\Pr(c \mid w_o, \text{OA}(X)) = \frac{1}{\exp\big(-w_o^T\,\text{OA}(X)\,c\big) + 1}$$
Neighborhood model params $w_l$
  $$\Pr(c \mid w_l, \text{LD}(X)) = \frac{1}{\exp\big(-w_l^T\,\text{LD}(X)\,c\big) + 1}$$
Iterate as in generative case
  $$\hat{C}(X) = \arg\max_c \Pr(c \mid \text{OA}(X))\,\Pr(c \mid \text{LD}(X))$$


Discriminative model: results [LuG2003]

Binary-link and count-link outperform content-only at 95% confidence

Better to separately estimate wl and wo

In+Out+Cocitation better than any subset for LD


Undirected Markov networks
Clique $c \in C(G)$: a set of completely connected nodes
Clique potential $\phi_c(V_c)$: a function over all possible configurations of nodes in $V_c$
Decompose $\Pr(v)$ as $(1/Z) \prod_{c \in C(G)} \phi_c(V_c)$
Parametric form: $\phi_c(v_c) = \exp\big(w_c \cdot f_c(v_c)\big)$
  $w_c$: params of model; $f_c$: feature functions
[Figure: instance with label variables, local feature variables, and label coupling edges]


Conditional and relational networks
$x$ = vector of observed features at all nodes
$y$ = vector of labels
$$\Pr(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C(G)} \phi_c(x_c, y_c), \quad \text{where } Z(x) = \sum_{y'} \prod_{c \in C(G)} \phi_c(x_c, y'_c)$$
$$\phi_c(x_c, y_c) = \exp\big(w_c \cdot F_c(x_c, y_c)\big)$$
A set of “clique templates” specifying links to use
Other features: “in the same HTML section”


“Toy problem”: Hierarchical classification
Obvious approaches
  Flatten to leaf topics, losing hierarchy info
  Level-by-level, compounding error probability
Cascaded generative model, for a class $c$ with parent $r$
  $$\Pr(c \mid d) = \Pr(r \mid d)\,\Pr(c \mid d, r)$$
  $\Pr(c|d,r)$ estimated as $\Pr(c|r)\Pr(d|c)/Z(r)$
  Estimate of $\Pr(d|c)$ makes naïve independence assumptions if $d$ has high dimensionality
  $\Pr(c|d,r)$ tends to 0/1 for large dimensions
  Mistakes made at shallow levels become irrevocable


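Global discriminative model
Each node has an associated bit $X$
Propose a parametric form
  $$\Pr(X_c = 1 \mid d, x_r) = \frac{\exp\big(w_c \cdot F(d, x_r)\big)}{1 + \exp\big(w_c \cdot F(d, x_r)\big)}$$
Each training instance sets one path to 1, all other nodes have $X=0$
[Figure: feature vector $F(d, x_r)$ built from document $d$ and parent bit $x_r$ (cases $x_r=0$ and $x_r=1$); labels $T$ and $2T+1$; weights $w_c$]

%Accuracy      SVM 1v1   Ctree
Reuters 1%     41.9      47.3
Reuters 5%     68        71
News20 1%      17.2      21.8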


Network value of customers
Customer = node $X$ in graph, neighbors $N$
$M$ is a marketing action (promotion, coupon)
Want to predict $\Pr(X_i \mid X^K, Y, M)$
  $X_i$: response of customer $i$
  $X^K$: known response of other customers
  $Y$: product attributes
  $M$: aggregate marketing action
Broader objective is to design action $M$
Again, we approximate as a sum over unknown neighbor configurations
  $$\Pr(X_i \mid X^K, Y, M) = \sum_{C(N_i^U)} \Pr(X_i \mid N_i, Y, M)\,\Pr(N_i^U \mid X^K, Y, M)$$


Network value, continued
Let the action be boolean
  $c$ is the cost of marketing
  $r_0$ is the revenue without marketing, $r_1$ with
Expected lift in profit by marketing to customer $i$ in isolation
  $$\text{ELP}_i(X^K, Y, M) = r_1 \Pr\big(X_i = 1 \mid X^K, Y, f_i^1(M)\big) - r_0 \Pr\big(X_i = 1 \mid X^K, Y, f_i^0(M)\big) - c$$
Global effect


Special case: sequential networks
Text modeled as sequence of tokens drawn from a large but finite vocabulary
Each token has attributes
  Visible: allCaps, noCaps, hasXx, allDigits, hasDigit, isAbbrev, (part-of-speech, wnSense)
  Not visible: part-of-speech, (isPersonName, isOrgName, isLocation, isDateTime), {starts|continues|ends}-noun-phrase
Visible (symbols) and invisible (states) attributes of nearby tokens are dependent
Application decides what is (not) visible
Goal: Estimate invisible attributes


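Hidden Markov model
A generative sequential model for the joint distribution of states ($s$) and symbols ($o$)
$$s = s_0, s_1, \ldots, s_n \qquad o = o_1, o_2, \ldots, o_n$$
$$\Pr(s, o) = \prod_{t=1}^{|o|} \Pr(s_t \mid s_{t-1})\,\Pr(o_t \mid s_t)$$
[Figure: chain of states … S_{t-1}, S_t, S_{t+1} …, each state S_t emitting symbol O_t]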
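The HMM joint probability is a product of one transition and one emission term per position, which also makes the most likely state sequence computable by dynamic programming (Viterbi). The sketch below uses a made-up two-state model with made-up probabilities; the function names and the explicit `START` state standing in for $s_0$ are assumptions.

```python
from itertools import product


def joint_prob(states_seq, obs, trans, emit, start="START"):
    """Pr(s, o): product over t of Pr(s_t | s_{t-1}) * Pr(o_t | s_t)."""
    p, prev = 1.0, start
    for s, o in zip(states_seq, obs):
        p *= trans[prev][s] * emit[s][o]
        prev = s
    return p


def viterbi(obs, states, trans, emit, start="START"):
    """Return (probability, path) of the most likely state sequence."""
    best = {s: (trans[start][s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        nxt = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda pair: pair[0],
            )
            nxt[s] = (prob * emit[s][o], path + [s])
        best = nxt
    return max(best.values(), key=lambda pair: pair[0])


def brute_force_best(obs, states, trans, emit):
    """Check value: maximise the joint over every possible state sequence."""
    return max(joint_prob(seq, obs, trans, emit)
               for seq in product(states, repeat=len(obs)))


toy_states = ["Name", "Other"]
toy_trans = {"START": {"Name": 0.3, "Other": 0.7},
             "Name": {"Name": 0.4, "Other": 0.6},
             "Other": {"Name": 0.2, "Other": 0.8}}
toy_emit = {"Name": {"cap": 0.9, "lower": 0.1},
            "Other": {"cap": 0.2, "lower": 0.8}}
toy_obs = ["lower", "cap", "cap", "lower"]
best_prob, best_path = viterbi(toy_obs, toy_states, toy_trans, toy_emit)
```

Viterbi keeps only the best path into each state at every position, so its cost grows linearly with sequence length instead of exponentially as in the brute-force check.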


Using redundant token features
Each $o$ is usually a vector of features extracted from a token
Might have high dependence/redundancy: hasCap, hasDigit, isNoun, isPreposition
Parametric model for $\Pr(s_t \to o_t)$ needs to make naïve assumptions to be practical
Overall joint model $\Pr(s,o)$ can be very inaccurate
(Same argument as in naïve Bayes vs. SVM or maximum entropy text classifiers)


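Discriminative graphical model
Assume one-stage Markov dependence
Propose direct parametric form for conditional probability of state sequence given symbol sequence
Model
  $$\Pr(s \mid o) = \frac{1}{\Pr(o)} \prod_{t=1}^{|o|} \Pr(s_t \mid s_{t-1})\,\Pr(o_t \mid s_t)$$
Log-linear form
  $$\Pr(s \mid o) = \frac{1}{Z(o)} \prod_{t=1}^{|o|} \Phi(s_{t-1}, s_t, o, t)$$
  $$\Phi(s_{t-1}, s_t, o, t) = \exp\Big(\sum_k \lambda_k f_k(s_{t-1}, s_t, o, t)\Big)$$
  $\lambda_k$: parameters to fit
  $f_k$: feature function; might depend on whole $o$
[Figure: chain of states … S_{t-1}, S_t, S_{t+1} … over symbols O_{t-1}, O_t, O_{t+1}]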


Feature functions and parameters

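Maximize total conditional likelihood over all instances, and penalize large params:
$$L = \sum_{(s,o) \in D} \log\left[\frac{1}{Z(o)} \exp\Big(\sum_{t=1}^{|o|} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t)\Big)\right] - \sum_k \frac{\lambda_k^2}{2\sigma^2}$$
Find $\partial L / \partial \lambda_k$ for each $k$ and perform a gradient-based numerical optimization
Efficient for linear state dependence structure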
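For a linear chain, the normalizer $Z(o)$ can be computed with a forward pass rather than by enumerating state sequences, which is what makes the penalized conditional likelihood cheap to evaluate. The sketch below handles a single training instance and indexes positions from 0. The two feature functions, their weights, the `START` state, and the penalty variance are made up for illustration.

```python
import math
from itertools import product


def local_score(prev, s, obs, t, weights, feats):
    """Weighted sum of feature functions at one position."""
    return sum(w * f(prev, s, obs, t) for w, f in zip(weights, feats))


def sequence_score(seq, obs, weights, feats, start="START"):
    total, prev = 0.0, start
    for t, s in enumerate(seq):
        total += local_score(prev, s, obs, t, weights, feats)
        prev = s
    return total


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_partition(obs, states, weights, feats, start="START"):
    """log Z(o) by the forward algorithm, in log space."""
    alpha = {s: local_score(start, s, obs, 0, weights, feats) for s in states}
    for t in range(1, len(obs)):
        alpha = {s: logsumexp([alpha[r] + local_score(r, s, obs, t, weights, feats)
                               for r in states])
                 for s in states}
    return logsumexp(list(alpha.values()))


def penalized_log_likelihood(seq, obs, states, weights, feats, sigma2=10.0):
    """log Pr(s | o) minus the penalty on large parameters."""
    penalty = sum(w * w for w in weights) / (2.0 * sigma2)
    return (sequence_score(seq, obs, weights, feats)
            - log_partition(obs, states, weights, feats) - penalty)


def f_cap_name(prev, s, obs, t):      # hypothetical orthography feature
    return 1.0 if s == "NAME" and obs[t][0].isupper() else 0.0


def f_same_state(prev, s, obs, t):    # hypothetical transition feature
    return 1.0 if prev == s else 0.0


toy_states = ["NAME", "OTHER"]
toy_feats = [f_cap_name, f_same_state]
toy_weights = [1.5, 0.5]
toy_obs = ["Alice", "met", "Bob"]
log_z = log_partition(toy_obs, toy_states, toy_weights, toy_feats)
brute_z = sum(math.exp(sequence_score(list(seq), toy_obs, toy_weights, toy_feats))
              for seq in product(toy_states, repeat=len(toy_obs)))
ll = penalized_log_likelihood(["NAME", "OTHER", "NAME"], toy_obs,
                              toy_states, toy_weights, toy_feats)
```

Summing this quantity over all training instances gives the objective to be maximized; its gradient needs the same forward quantities plus a backward pass, which this sketch leaves out.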


Conditional vs. joint: results

Penn Treebank: 45 tags, 1M words training data

The/DT asbestos/NN fiber/NN ,/, crocidolite/NN ,/, is/VBZ unusually/RB resilient/JJ once/IN
it/PRP enters/VBZ the/DT lungs/NNS ,/, with/IN even/RB brief/JJ exposures/NNS to/TO it/PRP causing/VBG
symptoms/NNS that/WDT show/VBP up/RP decades/NNS later/JJ ,/, researchers/NNS said/VBD ./.

Algorithm/Features               %Error   %OOVE*
HMM with words                   5.69     45.99
CRF with words                   5.55     48.05
CRF with words and orthography   4.27     23.76
* Out-of-vocabulary error

Orthography: Use words, plus overlapping features: isCap, startsWithDigit, hasHyphen, endsWith… -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies


Summary
Graphs provide a powerful way to model many kinds of data, at multiple levels
  Web pages, XML, relational data, images…
  Words, senses, phrases, parse trees…
A few broad paradigms for analysis
  Factors affecting graph evolution over time
  Eigen analysis, conductance, random walks
  Coupled distributions between node attributes and graph neighborhood
Several new classes of model estimation and inferencing algorithms


References
[BrinP1998] The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW.
[GoldmanSVG1998] Proximity search in databases. VLDB, 26–37.
[ChakrabartiDI1998] Enhanced hypertext categorization using hyperlinks. SIGMOD.
[BikelSW1999] An Algorithm that Learns What’s in a Name. Machine Learning Journal.
[GibsonKR1998] Clustering categorical data: An approach based on dynamical systems. VLDB.
[Kleinberg1999] Authoritative sources in a hyperlinked environment. JACM 46.


[CohnC2000] Probabilistically Identifying Authoritative Documents. ICML.
[LempelM2000] The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks 33 (1-6): 387-401.
[RichardsonD2001] The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. NIPS 14 (1441-1448).
[LaffertyMP2001] Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML.
[BorkarDS2001] Automatic text segmentation for extracting structured records. SIGMOD.


[NgZJ2001] Stable algorithms for link analysis. SIGIR.
[Hulgeri+2001] Keyword Search in Databases. IEEE Data Engineering Bulletin 24(3): 22-32.
[Hristidis+2002] DISCOVER: Keyword Search in Relational Databases. VLDB.
[Agrawal+2002] DBXplorer: A system for keyword-based search over relational databases. ICDE.
[TaskarAK2002] Discriminative probabilistic models for relational data.
[Fagin2002] Combining fuzzy information: an overview. SIGMOD Record 31(2), 109–118.


[Chakrabarti2002] Mining the Web: Discovering Knowledge from Hypertext Data.
[Tomlin2003] A New Paradigm for Ranking Pages on the World Wide Web. WWW.
[Haveliwala2003] Topic-Sensitive Pagerank: A Context-Sensitive Ranking Algorithm for Web Search. IEEE TKDE.
[LuG2003] Link-based Classification. ICML.
[FaloutsosMT2004] Connection Subgraphs in Social Networks. SIAM-DM workshop.
[PanYFD2004] GCap: Graph-based Automatic Image Captioning. MDDE/CVPR.


[Balmin+2004] Authority-Based Keyword Queries in Databases using ObjectRank. VLDB.
[BarYossefBKT2004] Sic transit gloria telae: Towards an understanding of the Web’s decay. WWW2004.

