The Structure of Broad Topics on the Web Soumen Chakrabarti Mukul M. Joshi Kunal Punera (IIT Bombay) David M. Pennock (NEC Research Institute)

Post on 19-Jan-2016


Page 1

The Structure of Broad Topics on the Web

Soumen Chakrabarti, Mukul M. Joshi, Kunal Punera (IIT Bombay)

David M. Pennock (NEC Research Institute)

Page 2

Graph structure of the Web

Over two billion nodes, two trillion links
Power-law degree distribution
• $\Pr(\text{degree} = k) \propto 1/k^{2.1}$
Looks like a “bow-tie” at large scale

[Figure: bow-tie diagram with components labeled IN, strongly connected core (SCC), and OUT; caption: “This is the Web”]

Page 3

The need for content-based models

Why does a radius-1 expansion help in topic distillation?

Why does topic-specific focused crawling work?

Why is a global PageRank useful for specific queries?

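[Figure: topic distillation; a query goes to a search engine, which returns the root set]

[Figure: focused crawling; the crawler uses a classifier to check the topic of frontier pages and prunes if irrelevant]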

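$$p(v) = \frac{d}{N} + (1-d) \sum_{u \to v} \frac{p(u)}{\mathrm{OutDegree}(u)}$$

The first term, $d/N$, is the uniform jump; the second term is the walk to an out-neighbor.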
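The PageRank recurrence on this slide can be evaluated by simple power iteration. The following is a minimal sketch, not the authors' implementation: the function name, the toy graph, the default `d=0.15`, and the uniform handling of dead-end pages are all assumptions made for illustration.

```python
def pagerank(out_links, d=0.15, iterations=50):
    """Power iteration for p(v) = d/N + (1-d) * sum_{u->v} p(u)/OutDegree(u)."""
    nodes = sorted(out_links)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: d / n for v in nodes}          # uniform jump term
        for u in nodes:
            targets = out_links[u]
            if targets:
                share = (1 - d) * p[u] / len(targets)
                for v in targets:                # walk to an out-neighbor
                    nxt[v] += share
            else:
                # Assumption: a dead-end page spreads its mass uniformly.
                share = (1 - d) * p[u] / n
                for v in nodes:
                    nxt[v] += share
        p = nxt
    return p

# Hypothetical 3-page graph: a -> b, a -> c, b -> c, c -> a
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

In this toy graph page `c` is cited by both other pages, so it ends up with more prestige than `b`, which is cited only by `a`.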

Page 4

The need for content-based models

How are different topics linked to each other?
Are topic directories representative of Web topic populations?
Are standard collections (e.g., TREC WT10g) representative of Web topics?

[Figure caption: “This is the Web with topics”]

Page 5

How to characterize “topics”

Web directories: most natural choice
Started with http://dmoz.org
Keep pruning until all leaf topics have enough (>300) samples
Approx 120k sample URLs
Flatten to approx 482 topics
Train text classifier (Rainbow)
Characterize new document d as a vector of probabilities $p_d = (\Pr(c|d)\ \forall c)$

[Figure: a test doc is fed to the classifier, which outputs a topic probability table]

Topic      Prob
Arts       0.1
Computers  0.3
Science    0.6
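To make the vector $p_d$ concrete, here is a minimal sketch of a text classifier that returns $\Pr(c|d)$ for every topic $c$. It assumes a multinomial naive Bayes model with add-one smoothing; the slides only name the Rainbow toolkit and do not say which model was used, and the function names and training snippets below are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (topic, text). Returns the counts used at prediction time."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for topic, text in labeled_docs:
        words = text.lower().split()
        doc_counts[topic] += 1
        word_counts[topic].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab

def topic_vector(model, text):
    """Return p_d as a dict {c: Pr(c|d)} over all trained topics c."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for topic in doc_counts:
        total_words = sum(word_counts[topic].values())
        score = math.log(doc_counts[topic] / total_docs)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing
                score += math.log((word_counts[topic][w] + 1) / (total_words + len(vocab)))
        log_scores[topic] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}

# Made-up training snippets, for illustration only.
toy_model = train_naive_bayes([
    ("Arts", "painting gallery music"),
    ("Computers", "software compiler network"),
    ("Science", "physics experiment theory"),
])
p_d = topic_vector(toy_model, "quantum physics experiment")
```

The returned dictionary plays the role of the Topic/Prob table on the slide: one probability per topic, summing to one.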

Page 6

Critique and defense

Cannot capture fine-grained or emerging topics
• Emerging topics most often specialize existing broad topics
• Broad topics rarely change
Classifier may be inaccurate
• Adequate if much better than random guessing of topic label
• Can compensate for errors using held-out validation data

Page 7

Background topic distribution

What fraction of Web pages are about Health?

Sampling via random walk
• PageRank walk (Henzinger et al.)
• Undirected regular walk (Bar-Yossef et al.)
Make graph undirected
Add self-loops so that all nodes have the same degree
Sample with large stride
Collect topic histograms
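The steps above can be sketched as follows. This is an illustrative reading of the slide, not the Bar-Yossef et al. procedure in full: the function names, the toy star graph, and the stride value are assumptions. Padding every node with self-loops up to a common degree makes the walk's stationary distribution uniform over nodes, which is what makes the samples fair.

```python
import random

def make_undirected(out_links):
    """Symmetrize a directed adjacency dict into sorted neighbor lists."""
    neighbors = {u: set() for u in out_links}
    for u, targets in out_links.items():
        for v in targets:
            if u != v:
                neighbors.setdefault(v, set()).add(u)
                neighbors[u].add(v)
    return {u: sorted(vs) for u, vs in neighbors.items()}

def regular_walk_samples(neighbors, start, max_degree, stride, num_samples, rng):
    """Walk the graph as if every node had max_degree edges, sampling every stride steps."""
    samples = []
    node = start
    for step in range(1, stride * num_samples + 1):
        # Pick one of max_degree slots; slots beyond the real neighbors
        # act as self-loops, so the walk stays where it is.
        slot = rng.randrange(max_degree)
        if slot < len(neighbors[node]):
            node = neighbors[node][slot]
        if step % stride == 0:
            samples.append(node)
    return samples

# Hypothetical star graph: hub "a" links to three leaves.
toy_links = {"a": ["b", "c", "d"], "b": [], "c": [], "d": []}
toy_neighbors = make_undirected(toy_links)
toy_max_degree = max(len(vs) for vs in toy_neighbors.values())
samples = regular_walk_samples(
    toy_neighbors, "a", toy_max_degree, stride=50, num_samples=2000,
    rng=random.Random(0),
)
```

Although the hub has three times the degree of each leaf, each of the four nodes should appear in roughly a quarter of the samples. Classifying the sampled pages and averaging their topic vectors would then give the topic histogram.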

Page 8

Convergence

Start from pairs of diverse topics
Two random walks, sample from each walk
Measure distance between topic distributions
• $L_1$ distance $|p_1 - p_2| = \sum_c |p_1(c) - p_2(c)| \in [0, 2]$
• Below 0.05 to 0.2 within 300 to 400 physical pages

[Figure: “Background distribution” bar chart over the top-level topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports; y-axis from 0 to 0.4]

[Figure: distribution difference (0 to 1) versus #hops (0 to 1000), with series Stride=30k and Stride=75k]
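The $L_1$ distance used to compare the two walks is straightforward to compute. A minimal sketch, assuming each topic distribution is stored as a dictionary from topic name to probability (the representation and the example values are illustrative, not from the talk):

```python
def l1_distance(p1, p2):
    """Sum over topics c of |p1(c) - p2(c)|; topics missing from a dict count as 0."""
    topics = set(p1) | set(p2)
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in topics)

# Made-up distributions for illustration.
disjoint = l1_distance({"Arts": 1.0}, {"Sports": 1.0})
close = l1_distance({"Arts": 0.5, "Science": 0.5}, {"Arts": 0.25, "Science": 0.75})
same = l1_distance({"Health": 1.0}, {"Health": 1.0})
```

Two distributions with no topic in common reach the maximum of 2, and identical distributions give 0, which matches the stated range.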

Page 9

Biases in topic directories

Use Dmoz to train a classifier
Sample the Web
Classify samples
Diff Dmoz topic distribution from Web sample topic distribution
Report maximum deviation in fractions
NOTE: Not exactly Dmoz

Dmoz over-represents
• Games.Video_Games
• Society.People
• Arts.Celebrities
• ...Education.Colleges
• ...Travel.Reservations

Dmoz under-represents
• …WWW…Directories!
• Sports.Hockey
• Society.Philosophy
• Education…K12…
• Recreation…Camping
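The comparison between the directory and the Web sample can be sketched as a per-topic difference of fractions, ranked by size. This assumes "deviation" means the signed difference (directory fraction minus Web-sample fraction); the slide does not define it precisely, and the function name and numbers below are made up.

```python
def topic_deviations(directory_dist, web_dist):
    """Per-topic signed difference (directory minus Web sample), largest magnitude first."""
    topics = set(directory_dist) | set(web_dist)
    diffs = {c: directory_dist.get(c, 0.0) - web_dist.get(c, 0.0) for c in topics}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)

# Made-up fractions, not measurements from the talk.
directory = {"Games": 0.50, "Sports": 0.25, "Science": 0.25}
web_sample = {"Games": 0.25, "Sports": 0.375, "Science": 0.375}
ranked = topic_deviations(directory, web_sample)
```

A positive entry marks a topic the directory over-represents relative to the sample; a negative entry marks one it under-represents.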

Page 10

Topic-specific degree distribution

Preferential attachment: connect u to v w.p. proportional to the degree of v, regardless of topic

More realistic: u has a topic, and links to v with related topics

Unclear if power-law should be upheld

[Figure: intra-topic linkage versus inter-topic linkage]
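For reference, the topic-blind preferential attachment process mentioned on this slide can be sketched as below. The seed graph, the single link per new node, and the function name are assumptions chosen to keep the example small; the topic-aware variant is not specified in the slides and is not attempted here.

```python
import random

def preferential_attachment(num_nodes, rng):
    """Grow a graph one node at a time; each new node u links to one existing
    node v chosen with probability proportional to v's current degree."""
    edges = [(0, 1)]            # assumed seed graph: a single edge
    degree = {0: 1, 1: 1}
    for u in range(2, num_nodes):
        candidates = list(degree)
        weights = [degree[v] for v in candidates]
        v = rng.choices(candidates, weights=weights, k=1)[0]
        edges.append((u, v))
        degree[u] = 1
        degree[v] += 1
    return edges, degree

edges, degree = preferential_attachment(200, random.Random(1))
```

Sorting `degree.values()` in descending order gives a quick look at how skewed the resulting degree distribution is.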

Page 11

Random forward walk without jumps

[Figure: two panels, /Arts/Music and /Sports/Soccer; x-axis: wander hops (0 to 20); y-axis: L_1 distance; series: from background, from hop 0]

Sampling walk is designed to mix topics well
How about walking forward without jumping?
• Start from a page $u_0$ on a specific topic
• Forward random walk $(u_0, u_1, \ldots, u_i, \ldots)$
• Compare $(\Pr(c|u_i)\ \forall c)$ with $(\Pr(c|u_0)\ \forall c)$ and with the background distribution
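The experiment described above can be sketched as a walk that records two distances at every hop. This is a minimal sketch under stated assumptions: topic vectors are already available per page as dictionaries, a dead-end page simply ends the walk, and the chain of pages and its topic vectors are hypothetical.

```python
import random

def l1_distance(p, q):
    """Sum over topics of the absolute probability difference."""
    return sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))

def forward_walk_drift(out_links, topic_vec, background, start, hops, rng):
    """Follow random out-links from start. For each page visited, record
    (L1 distance from the start page, L1 distance from the background)."""
    drift = []
    node = start
    for _ in range(hops + 1):
        drift.append((
            l1_distance(topic_vec[node], topic_vec[start]),
            l1_distance(topic_vec[node], background),
        ))
        if not out_links[node]:
            break  # dead end: no jumps are allowed, so the walk stops
        node = rng.choice(out_links[node])
    return drift

# Hypothetical chain u0 -> u1 -> u2 with made-up topic vectors.
links = {"u0": ["u1"], "u1": ["u2"], "u2": []}
vecs = {
    "u0": {"Arts": 1.0},
    "u1": {"Arts": 0.5, "Sports": 0.5},
    "u2": {"Sports": 1.0},
}
background = {"Arts": 0.5, "Sports": 0.5}
drift = forward_walk_drift(links, vecs, background, "u0", 5, random.Random(0))
```

Averaging the first component over many walks would correspond to the "from hop 0" series, and the second to the "from background" series.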

Page 12

Observations and implications

Forward walks wander away from the starting topic slowly
But do not converge to the background distribution
Global PageRank is OK even for topic-specific queries
• Jump parameter d = 0.1 to 0.2
• Topic drift not too bad within path length of 5 to 10
• Prestige conferred mostly by same-topic neighbors
Also explains why focused crawling works

[Figure: random surfer reaching a high-prestige node; w.p. d jump to a random node, w.p. (1-d) jump to an out-neighbor u.a.r.]

Page 13

Citation matrix

Given a page is about topic i, how likely is it to link to topic j?
• Matrix C[i,j] = probability that page about topic i links to page about topic j
• Soft counting: C[i,j] += Pr(i|u)Pr(j|v)
Applications
• Classifying Web pages into topics
• Focused crawling for topic-specific pages
• Finding relations between topics in a directory

[Figure: a link from page u to page v]
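Soft counting can be sketched as a double loop over the topic vectors of the two endpoints of each link. The row normalization at the end is an assumption added so that each row reads as a probability distribution over target topics; the function name, pages, and topic vectors are made up.

```python
from collections import defaultdict

def soft_citation_matrix(links, topic_vec):
    """links: iterable of (u, v) pairs; topic_vec[page] = {topic: Pr(topic|page)}."""
    counts = defaultdict(lambda: defaultdict(float))
    for u, v in links:
        for i, p_i in topic_vec[u].items():
            for j, p_j in topic_vec[v].items():
                counts[i][j] += p_i * p_j   # C[i,j] += Pr(i|u) Pr(j|v)
    # Assumption: normalize each row so that C[i][j] sums to 1 over j.
    matrix = {}
    for i, row in counts.items():
        total = sum(row.values())
        matrix[i] = {j: c / total for j, c in row.items()}
    return matrix

# Made-up pages and topic vectors.
page_vecs = {
    "u": {"Arts": 0.5, "Sports": 0.5},
    "v": {"Arts": 1.0},
    "w": {"Sports": 1.0},
}
C = soft_citation_matrix([("u", "v"), ("v", "w")], page_vecs)
```

Because page `u` is split evenly between two topics, its single link contributes half a count to each of two rows rather than a full count to one.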

Page 14

Citation, confusion, correction

[Figure: three matrices over the top-level topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports; axes: From topic / To topic, True topic / Guessed topic, and From topic / To topic]

The classifier’s confusion on held-out documents can be used to correct the citation matrix

Page 15

Fine-grained views of citation

Clear block-structure derived from coarse-grain topics

Strong diagonals reflect tightly-knit topic communities

Prominent off-diagonal entries raise design issues for taxonomy editors and maintainers

Page 16

Concluding remarks

A model for content-based communities
• New characterization and measurement of topical locality on the Web
• How to set the PageRank jump parameter?
• Topical stability of topic distillation
• Better crawling and classification
A tool for Web directory maintenance
• Fair sampling and representation of topics
• Block-structure and off-diagonals
• Taxonomy inversion