
SIGIR’09, Boston

Entropy-biased Models for Query Representation on the Click Graph

Hongbo Deng, Irwin King and Michael R. Lyu

Department of Computer Science and Engineering

The Chinese University of Hong Kong

July 21st, 2009


Introduction

Query log analysis: improve a search engine’s capabilities
  Query suggestion
  Query classification
  Targeted advertising
  Ranking


Introduction

Click graph: an important technique
  A bipartite graph between queries and URLs
  Edges connect a query with the URLs clicked for it
  Captures some semantic relations, e.g., between “map” and “travel”

How to utilize and model the click graph to represent queries?

Traditional model: based on the raw click frequency (CF)
  Robustness: some queries with skewed click counts may disproportionately influence the click graph
  Spam: raw CF can be easily manipulated

Proposal: an entropy-biased framework


Motivation

Basic idea: various query-URL pairs should be treated differently

Intuition: common clicks on less frequent but more specific URLs are of greater value than common clicks on frequent and general URLs

Is a single click on different URLs equally important?
  (Figure: a general URL versus a specific URL)


Outline

Introduction
Related Work
Methodology
  Preliminaries
  Click Frequency Model
  Entropy-biased Model
Experiments
Conclusion


Related Work

Using click graph
  Query clustering (Beeferman and Berger, KDD’00; Wen et al., WWW’01)
  Random walks for relevance ranking in image search (Craswell and Szummer, SIGIR’07)
  Query suggestion by computing the hitting time on a click graph (Mei et al., CIKM’08)
  Query classification from a regularized click graph (Li et al., SIGIR’08)

These methods are built on the click graph, while our objective is to investigate a better model to utilize and represent the click graph.


Related Work

Modeling queries and URLs (modeling the representation)
  Use the content of clicked Web pages to define a term-weight vector model for a query (Baeza-Yates et al., 2004)
  Represent a query as a vector of documents (URLs) without considering the content information (Baeza-Yates and Tiberi, KDD’07)
  Propose the query-set document model, which represents documents by mining frequent query patterns rather than the content of the documents (Poblete et al., WWW’08)

These existing methods do not distinguish the variation among different query-URL pairs.


Related Work

Click entropy and result entropy (for personalization)
  Explore click entropy to measure the variability in click results (Dou et al., WWW’07)
  Propose result entropy to capture how often results change (Teevan et al., SIGIR’08)

These methods focus on personalization for different queries, while our entropy-biased models focus on the weighting scheme of various query-URL pairs.




Preliminaries

Query:

URL:

User:

Query instance:


Traditional Click Frequency Model

Edges of the click graph: weighted by the raw click frequency (CF)

Transition probability: normalize the CF
  From query to URL
  From URL to query

Based on the transition probabilities, a query and a document can each be represented by a vector of transition probabilities.
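The normalization described on this slide can be sketched as follows. This is a minimal illustration, assuming that "normalize CF" means dividing each click count by the total for its query (query to URL) or for its URL (URL to query). The function name and the toy click counts are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def transition_probabilities(clicks):
    """clicks maps (query, url) -> raw click frequency (CF).

    Returns two dicts: P(url | query) keyed by (query, url) and
    P(query | url) keyed by (url, query).
    """
    query_totals = defaultdict(float)
    url_totals = defaultdict(float)
    for (q, d), c in clicks.items():
        query_totals[q] += c
        url_totals[d] += c
    q_to_d = {(q, d): c / query_totals[q] for (q, d), c in clicks.items()}
    d_to_q = {(d, q): c / url_totals[d] for (q, d), c in clicks.items()}
    return q_to_d, d_to_q

# Hypothetical click counts; d1 stands for a general, heavily clicked URL.
clicks = {
    ("Yahoo", "d1"): 10,
    ("map", "d1"): 5, ("map", "d2"): 3,
    ("travel", "d1"): 1, ("travel", "d2"): 4, ("travel", "d3"): 2,
}
q_to_d, d_to_q = transition_probabilities(clicks)
print(q_to_d[("map", "d1")], d_to_q[("d1", "map")])
```

Each query's outgoing probabilities sum to 1, so the row for a query can be used directly as its vector representation.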


Traditional Click Frequency Model

Measure the similarity between queries
  Cosine similarity
  Under the CF model, the most similar query to q2 (“map”) is q1 (“Yahoo”)
  More reasonable: q2 (“map”) and q3 (“travel”)

The CF model only considers the raw click frequency and treats different query-URL pairs equally, even if some URLs are heavily clicked.
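A small sketch of how cosine similarity over CF-based vectors can favor the "Yahoo" pairing. The three vectors are hypothetical transition probabilities over URLs d1 to d3 (d1 being a general, heavily clicked URL). They are not the values from the slide's figure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical P(url | query) vectors under the CF model.
yahoo = {"d1": 1.0}
map_q = {"d1": 5 / 8, "d2": 3 / 8}
travel = {"d1": 1 / 7, "d2": 4 / 7, "d3": 2 / 7}

sim_yahoo = cosine(map_q, yahoo)
sim_travel = cosine(map_q, travel)
print(round(sim_yahoo, 3), round(sim_travel, 3))
```

In this toy setting the shared clicks on the general URL d1 dominate, so "map" ends up closer to "Yahoo" than to "travel".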


Methodology


Traditional click frequency model

Entropy-biased models


Entropy-biased Model

The more general and highly ranked a URL is
  the more queries it connects with
  the greater its ambiguity and uncertainty

The entropy of a URL:
  Suppose
  Tends to be proportional to n(dj)

It would be more reasonable to weight these two edges differently because the connected URLs differ.
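The entropy argument can be illustrated with a short sketch. It assumes the standard Shannon entropy over P(query | URL) and, for the "Suppose" case, a uniform distribution over the n(dj) queries connected to the URL. Both are assumptions, since the slide's formulas are not reproduced in this transcript.

```python
import math

def url_entropy(query_probs):
    """Shannon entropy (natural log) of the distribution P(q | d) for one URL."""
    return -sum(p * math.log(p) for p in query_probs if p > 0)

# A specific URL reached from a single query: no uncertainty.
specific = url_entropy([1.0])

# A general URL reached uniformly from n = 4 queries: entropy equals log(n).
n = 4
general = url_entropy([1.0 / n] * n)

print(specific, general, math.log(n))
```

Under the uniform assumption the entropy is log n(dj), so it grows with the number of queries connected to the URL.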


Entropy Discriminative Ability

As entropy increases, discriminative ability decreases
  The two are inversely related
  A URL with a high query frequency is less discriminative overall

Inverse query frequency (IQF)
  Measures the discriminative ability of a URL

Benefits
  Constrains the influence of heavily clicked URLs
  Balances the inherent click bias toward highly ranked URLs
  Can be combined with other factors to tune the model
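A minimal sketch of the inverse query frequency, assuming the form iqf(d) = log(|Q| / n(d)) by analogy with IDF, where |Q| is the number of distinct queries and n(d) is the number of distinct queries connected to URL d. The formula, the function name, and the toy edges are assumptions for illustration.

```python
import math

def inverse_query_frequency(edges):
    """edges: iterable of (query, url) pairs from the click graph."""
    queries = set()
    queries_per_url = {}
    for q, d in edges:
        queries.add(q)
        queries_per_url.setdefault(d, set()).add(q)
    total = len(queries)
    return {d: math.log(total / len(qs)) for d, qs in queries_per_url.items()}

# Hypothetical edges: d1 is general (three queries), d3 is specific (one query).
edges = [
    ("Yahoo", "d1"), ("map", "d1"), ("travel", "d1"),
    ("map", "d2"), ("travel", "d2"),
    ("travel", "d3"),
    ("weather", "d4"),
]
iqf = inverse_query_frequency(edges)
print(iqf)
```

The general URL d1 receives the lowest weight, which is how the IQF constrains heavily shared URLs.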


CF-IQF Model

Incorporate the IQF with the click frequency

A high click frequency

A low query frequency

“A” is weighted higher than “B”


CF-IQF Model

Transition probability

CF-IQF model: the most similar query to q2 (“map”) is q3 (“travel”)

CF model: the most similar query to q2 (“map”) is q1 (“Yahoo”)
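The change in the most similar query can be reproduced on toy data. This sketch assumes the CF-IQF edge weight is the click frequency multiplied by iqf(d) = log(|Q| / n(d)). The click counts, the extra "weather" query, and all function names are hypothetical. Cosine similarity is unaffected by per-query normalization, so the raw weights are compared directly.

```python
import math

# Hypothetical click counts; d1 is a general, heavily clicked URL.
CLICKS = {
    ("Yahoo", "d1"): 10,
    ("map", "d1"): 5, ("map", "d2"): 3,
    ("travel", "d1"): 1, ("travel", "d2"): 4, ("travel", "d3"): 2,
    ("weather", "d4"): 6,
}

def iqf(clicks):
    """Assumed form: log(number of queries / number of queries per URL)."""
    queries = {q for q, _ in clicks}
    per_url = {}
    for _, d in clicks:
        per_url[d] = per_url.get(d, 0) + 1  # each key is a distinct (q, d) pair
    return {d: math.log(len(queries) / n) for d, n in per_url.items()}

def query_vectors(clicks, url_weight=None):
    """Build one sparse vector per query, optionally scaling each URL."""
    vectors = {}
    for (q, d), c in clicks.items():
        scale = url_weight[d] if url_weight else 1.0
        vectors.setdefault(q, {})[d] = c * scale
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def most_similar(vectors, query):
    others = [q for q in vectors if q != query]
    return max(others, key=lambda q: cosine(vectors[query], vectors[q]))

cf = query_vectors(CLICKS)
cf_iqf = query_vectors(CLICKS, iqf(CLICKS))
print(most_similar(cf, "map"), most_similar(cf_iqf, "map"))
```

Down-weighting the general URL d1 lets the shared clicks on the more specific d2 decide the ranking.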


UF Model and UF-IQF Model

Drawback of the CF model
  Prone to spam from malicious clicks (e.g., a single user clicking on a certain URL thousands of times)

UF model
  Weight by user frequency instead of click frequency
  Improves resistance against malicious clicks

UF-IQF model
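The difference between click frequency and user frequency can be sketched from raw log records. The record layout (user, query, url), the function name, and the example hosts are assumptions for illustration. UF is taken here to be the number of distinct users who issued the query and clicked the URL.

```python
from collections import Counter, defaultdict

def click_and_user_frequency(records):
    """records: iterable of (user, query, url) click events."""
    cf = Counter()
    users = defaultdict(set)
    for user, query, url in records:
        cf[(query, url)] += 1
        users[(query, url)].add(user)
    uf = {pair: len(us) for pair, us in users.items()}
    return cf, uf

# One hypothetical user clicks the same URL 1000 times; three users click another.
records = [("spammer", "map", "spam.example")] * 1000 + [
    ("u1", "map", "maps.example"),
    ("u2", "map", "maps.example"),
    ("u3", "map", "maps.example"),
]
cf, uf = click_and_user_frequency(records)
print(cf[("map", "spam.example")], uf[("map", "spam.example")])
print(cf[("map", "maps.example")], uf[("map", "maps.example")])
```

Repeated clicks by one user inflate CF but leave UF at 1. A UF-IQF weight would then multiply UF by the IQF of the URL, in the same way as CF-IQF.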


Connection with TF-IDF

TF-IDF has been extensively and successfully used in the vector space model for text retrieval

Several researchers have tried to interpret IDF based on binary independence retrieval (BIR), Poisson models, information entropy, and language models (LM)

TF-IDF has not previously been explored on bipartite graphs, and the IQF is new. The CF-IQF is a simplified version of the entropy-biased model

The entropy-biased model is used to determine the edge weighting of the click graph, and it can be applied to other bipartite graphs


Mining Query Log on Click Graph

Query clustering

Query-to-query similarity

Query suggestion

Models



Similarity Measurement

Cosine similarity

Jaccard coefficient

The similarity results are reported and analyzed
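The slide does not preserve the Jaccard formula. The sketch below uses the extended (Tanimoto) form for weighted vectors, which is an assumption. Unlike the set-based form, it is sensitive to edge weights and so can reflect a change of weighting scheme. The vectors are hypothetical.

```python
def extended_jaccard(u, v):
    """Extended Jaccard (Tanimoto) coefficient for sparse weighted vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    denom = sum(w * w for w in u.values()) + sum(w * w for w in v.values()) - dot
    return dot / denom if denom else 0.0

# Hypothetical query vectors over URLs d1..d3.
a = {"d1": 1.0, "d2": 2.0}
b = {"d2": 2.0, "d3": 1.0}
score = extended_jaccard(a, b)
print(score)
```

With binary weights this reduces to the usual set-based Jaccard coefficient, the size of the intersection over the size of the union.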


Graph-based Random Walk

Query-to-query graph
  The transition probability from qi to qj

The personalized PageRank
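A minimal sketch of the random walk, under two assumptions not confirmed by this transcript: the query-to-query transition probability is obtained by a two-step walk through the URLs, P(qj | qi) = sum over d of P(d | qi) P(qj | d), and the personalized PageRank restarts at the seed query with probability 1 - damping. The damping value, iteration count, and toy click counts are hypothetical.

```python
def query_transition_matrix(clicks):
    """clicks maps (query, url) -> edge weight (CF, CF-IQF, ...)."""
    q_tot, d_tot = {}, {}
    for (q, d), c in clicks.items():
        q_tot[q] = q_tot.get(q, 0.0) + c
        d_tot[d] = d_tot.get(d, 0.0) + c
    queries = sorted(q_tot)
    P = {qi: {qj: 0.0 for qj in queries} for qi in queries}
    for (qi, d1), c1 in clicks.items():
        for (qj, d2), c2 in clicks.items():
            if d1 == d2:  # walk qi -> d -> qj
                P[qi][qj] += (c1 / q_tot[qi]) * (c2 / d_tot[d1])
    return P

def personalized_pagerank(P, seed, damping=0.85, iterations=50):
    """Power iteration with all restart mass placed on the seed query."""
    queries = list(P)
    rank = {q: 1.0 if q == seed else 0.0 for q in queries}
    for _ in range(iterations):
        nxt = {q: (1.0 - damping) * (1.0 if q == seed else 0.0) for q in queries}
        for qi in queries:
            for qj, p in P[qi].items():
                nxt[qj] += damping * rank[qi] * p
        rank = nxt
    return rank

# Hypothetical click counts; "weather" shares no URL with the other queries.
clicks = {
    ("Yahoo", "d1"): 10,
    ("map", "d1"): 5, ("map", "d2"): 3,
    ("travel", "d1"): 1, ("travel", "d2"): 4, ("travel", "d3"): 2,
    ("weather", "d4"): 6,
}
P = query_transition_matrix(clicks)
rank = personalized_pagerank(P, seed="map")
suggestions = sorted((q for q in rank if q != "map"), key=rank.get, reverse=True)
print(suggestions)
```

Ranking the other queries by their score gives a suggestion list for the seed query. Swapping CF weights for CF-IQF weights in `clicks` changes only the input, not the walk.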




Experimental Evaluation

Data collection
  AOL query log data

Cleaning the data
  Removing the queries that appear fewer than 2 times
  Combining the near-duplicate queries

Resulting click graph
  883,913 queries and 967,174 URLs
  4,900,387 edges


Distributions


Evaluation: ODP Similarity

A simple measure of similarity among queries using ODP categories (query category)

Definition:

Example:
  Q1: “United States”, category “Regional > North America > United States”
  Q2: “National Parks”, category “Regional > North America > United States > Travel and Tourism > National Parks and Monuments”
  Similarity: 3/5

Precision at rank n (P@n):

Evaluation set: 300 distinct queries
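The 3/5 in the example is consistent with dividing the length of the longest common category prefix by the length of the longer path. The sketch below assumes that definition, and assumes P@n is the average ODP similarity between a query and its top n retrieved queries. Both are assumptions, since the slide's formulas are not preserved, and the function names are hypothetical.

```python
def odp_similarity(path_a, path_b):
    """Longest common prefix length over the longer path length."""
    a = [part.strip() for part in path_a.split(">")]
    b = [part.strip() for part in path_b.split(">")]
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))

def precision_at_n(query_path, ranked_paths, n):
    """Average ODP similarity between a query and its top n similar queries."""
    return sum(odp_similarity(query_path, p) for p in ranked_paths[:n]) / n

q1 = "Regional > North America > United States"
q2 = ("Regional > North America > United States > "
      "Travel and Tourism > National Parks and Monuments")
sim = odp_similarity(q1, q2)
print(sim)
```

The two category paths share three levels and the longer path has five, which gives the 3/5 shown on the slide.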


Experimental Results

Query similarity analysis

1. CF-IQF is better than CF; UF-IQF is better than UF
   The results support the intuition behind the entropy-biased framework: various query-URL pairs should be treated differently.

2. UF is better than CF; UF-IQF is better than CF-IQF
   The results indicate that the user frequency associated with a query-URL pair is more robust than the click frequency for modeling the click graph.


Experimental Results

Query similarity analysis

3. TF-IDF is better than TF
   The improvements of CF-IQF over CF and of UF-IQF over UF are consistent with the improvement of TF-IDF over TF.
   The reason: both share the same key idea of identifying and tuning the importance of a term or a query-URL edge.


Experimental Results

Query similarity analysis

4. Jaccard coefficient
   The improvements are consistent with those observed under cosine similarity.


Experimental Results

Query similarity analysis

5. UF-IQF achieves the best performance in most cases.
   It is essential and promising to consider the entropy-biased models for the click graph.

6. CF and UF are better than TF; CF-IQF and UF-IQF are better than TF-IDF
   The click graph captures more semantic relations between queries than the query terms do.


Experimental Results

Random Walk Evaluation

Results:

1. As n increases, both models improve their performance.

2. The CF-IQF model always performs better than the CF model.


Experimental Results

Random Walk Evaluation

In general, the results generated by the CF and CF-IQF models are similar, and mostly semantically related to the original query, such as “American airline”.

Another important observation is that the CF-IQF model can promote more relevant queries as suggestions and filter out some irrelevant ones.


Conclusions

Introduce the inverse query frequency (IQF) to measure the discriminative ability of a URL

Identify a new source, user frequency, for reducing the effect of manipulation by malicious clicks

Propose the entropy-biased models to combine the IQF with the CF as well as UF for click graphs

Experimental results show that the improvements of our proposed models are consistent and promising


Q&A

Thanks!