PageRank without hyperlinks: Structural re-ranking using links induced by language models


TRANSCRIPT

Page 1: PageRank without hyperlinks: Structural re-ranking using links induced by language models

Oren Kurland and Lilian Lee

Cornell

SIGIR 2005

Page 2: Objective

IR re-ranking on non-hypertext documents using PageRank

Use language-model-based weights in the PageRank matrix

Page 3: Method Outline

Initial retrieval using KL-Divergence model (use Lemur)

Generate PageRank matrix from top k retrieved documents according to the paper’s model

Do the PageRank iterations

Re-rank the documents (a minimal sketch of the whole pipeline follows)
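A minimal sketch of this pipeline in Python, assuming a precomputed matrix gen_prob where gen_prob[i, j] approximates the probability that document j generates document i (a stand-in for the paper's p_d(s)); in practice Lemur supplies the initial retrieval and its scores:

import numpy as np

def build_transition_matrix(gen_prob, n_top, lam):
    # Row-stochastic matrix over the top-k retrieved documents:
    # each document links to its n_top strongest generators.
    k = gen_prob.shape[0]
    W = np.zeros((k, k))
    for o in range(k):
        order = np.argsort(gen_prob[o])[::-1]       # strongest generators first
        top = [g for g in order if g != o][:n_top]  # drop self, keep n_top
        W[o, top] = gen_prob[o, top]
        if W[o].sum() > 0:
            W[o] /= W[o].sum()                      # normalize outgoing weight
    return lam / k + (1.0 - lam) * W                # smoothing: all entries non-zero

def pagerank(P, iters=100):
    # Power iteration: converges to the stationary distribution of P.
    cen = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        cen = cen @ P
    return cen

# Re-ranking: order documents by centrality alone (R-W-In) or by
# centrality * initial retrieval score (R-W-In+LM), e.g.:
# scores = pagerank(build_transition_matrix(gen_prob, 10, 0.05)) * initial_scores
# reranked = np.argsort(-scores)

The smoothing step in build_transition_matrix plays the role of the weight-smoothing described on page 11.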

Page 4: Concept 1: Generation Probability

The probability of a word w occurring in a document (or document collection) x, according to the maximum likelihood model, is:

tf is the term frequency
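The formula on the slide is an image; reconstructed, it is the standard MLE estimate:

p_x^{MLE}(w) = \frac{tf(w \in x)}{\sum_{w'} tf(w' \in x)}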

Page 5: Concept 1: Generation Probability (Cont.)

Using the Dirichlet-smoothed model, we get

pcMLE(w) is the MLE probability of w in the entire document collection c

μ controls the influence of pcMLE(w)
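The slide's formula is an image; reconstructed, it is the standard Dirichlet-smoothed estimate:

p_x^{[\mu]}(w) = \frac{tf(w \in x) + \mu \cdot p_c^{MLE}(w)}{\sum_{w'} tf(w' \in x) + \mu}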

Page 6: Concept 1: Generation Probability (Cont.)

Two ways of defining the probability of a document x generating a sequence of words w1 w2 … wn are:
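The two formulas are images on the slide; plausibly the plain unigram product and its length-normalized (geometric-mean) variant:

p_x(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} p_x^{[\mu]}(w_i)

p_x(w_1 w_2 \ldots w_n) = \left( \prod_{i=1}^{n} p_x^{[\mu]}(w_i) \right)^{1/n}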

Page 7: Concept 1: Generation Probability (Cont.)

KL-divergence combines the previous two functions into the paper's generation probability function: the probability of document d generating word sequence s
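Reconstructed from the slide's description (D denotes KL divergence; treat the exact form as an assumption):

p_d^{KL,\mu}(s) = \exp\left( -D\left( p_s^{MLE}(\cdot) \,\|\, p_d^{[\mu]}(\cdot) \right) \right)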

Page 8: Concept 2: Top Generators

The top generators of a document s are the documents d with the highest generation probabilities
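Written out (the cutoff N for "top" is notation assumed here, not from the slide):

TopGen(s) = the N documents d ≠ s with the highest p_d^{KL,\mu}(s)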

Page 9: Graph Generation

We can construct a graph from a collection of documents

Two ways of defining the edges and edge weights are

Page 10: Graph Generation (Cont.)

o → g means an edge from document o to document g

The first definition assigns a uniform weight of 1 to all edges pointing from a document to its top generators

The second definition uses the generation probability as the weight
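A reconstruction of the two weight functions (an edge o → g exists only when g is a top generator of o):

Uniform: wt(o \to g) = 1 if g \in TopGen(o), else 0

Generation-based: wt(o \to g) = p_g^{KL,\mu}(o) if g \in TopGen(o), else 0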

Page 11: Weight-Smoothing

We can smooth the edge weights to give non-zero weights for all edges

Dinit is the set of documents we wish to re-rank; λ controls the influence of the two components
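A reconstruction in the style of PageRank teleportation; the exact normalization is an assumption based on the slide's description:

wt_{[\lambda]}(o \to g) = \lambda \cdot \frac{1}{|D_{init}|} + (1 - \lambda) \cdot \frac{wt(o \to g)}{\sum_{g'} wt(o \to g')}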

Page 12: Concept 3: Graph Centrality

Now that we have a graph, how do we define the centrality (importance) of each node (document)?

Influx version:

The centrality of a node is simply the total weight of the edges pointing to it
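In symbols:

Cen(d; G) = \sum_{o \to d} wt(o \to d)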

Page 13: Concept 3: Graph Centrality (Cont.)

Recursive Influx Version:

Centrality is recursively defined

This is the PageRank version
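In symbols (the fixed point of this recursion, computed by the PageRank iterations over the smoothed graph):

Cen(d; G) = \sum_{o \to d} wt_{[\lambda]}(o \to d) \cdot Cen(o; G)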

Page 14: Concept 3: Graph Centrality (Cont.)

We get a total of 4 models if we consider uniform/non-uniform weights and non-recursive/recursive influx

Recall that uniform weights mean edge weights with values 0 or 1

                       Recursive   Non-recursive
Uniform weight         R-U-In      U-In
Non-uniform weight     R-W-In      W-In

Page 15: Combining Centrality with Initial Relevance Score

Centrality scores are computed on the set of initially retrieved documents

The initially retrieved documents also have a relevance score assigned by the KL-divergence retrieval model

We can combine the two scores:

Cen(d;G) is the centrality score; pd(q) is the retrieval score

The combination is just a simple product of the two scores
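In symbols:

score(d) = Cen(d; G) \cdot p_d(q)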

Page 16: Final combinations of models

Now we have 8 models:

U-In
W-In
U-In+LM (centrality * retrieval score)
W-In+LM
R-U-In
R-W-In
R-U-In+LM
R-W-In+LM

Page 17: Experiment 1: Model Comparison

4 TREC corpora

Re-rank the top 50 retrieved documents

Upper-bound performance: move all relevant documents among the top 50 to the front

Initial ranking: optimize the parameter for best precision at 1000

Optimal baseline: performance of the best parameter

Page 18: Experiment 1 Results

Highlighted values indicate the best performances

The R-W-In+LM model has the best performance on average

Page 19: Experiment 2: Cosine Similarity

Top generators and edge weights are computed using the language model pd(s)

Replace pd(s) with the tf·idf cosine similarity between two documents
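A sketch of the substitute similarity, using scikit-learn's TfidfVectorizer as an assumed stand-in for the paper's tf·idf weighting:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_cosine(docs):
    # Pairwise tf*idf cosine similarities over raw document strings;
    # sim[i, j] replaces p_d(s) as the top-generator / edge-weight score.
    tfidf = TfidfVectorizer().fit_transform(docs)
    return cosine_similarity(tfidf)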

Page 20: Experiment 2: Results

One marking means the language model is better than cosine similarity by at least 5%

The other marking means cosine similarity is better than the language model by at least 5%

The language model is better overall

Page 21: Experiment 3: Centrality Alternatives

The best re-ranking model so far is R-W-In+LM

What if we replace Cen(d;G) with other scores?

Page 22: Experiment 3: Results

Again, R-W-In+LM wins

Page 23: Conclusion

PageRank can be applied to documents without explicit hyperlinks by inducing links with language models, and the resulting structural re-ranking improves retrieval