
Personalized Query Classification

Bin Cao, Qiang Yang, Derek Hao Hu, et al.
Computer Science and Engineering

Hong Kong UST

Query Classification and Online Advertisement

QC as Machine Learning

• Inspired by the KDDCUP’05 competition
  – Classify a query into a ranked list of categories
  – Queries are collected from real search engines
  – Target categories are organized in a tree, with each node being a category

Our QC Demo

• http://q2c.cs.ust.hk/q2c/

Personalization

• The aim of Personalized Query Classification is to classify a user query Q into a ranked list of predefined categories, for different users

Queries          Categories
golf             Car; Sports; Places
bass             Entertainment/Music; Living/Fishing
Michael Jordan   Information/Research; Sports/Basketball; Shopping


Question: Can we personalize search without user registration info?

Outline

• Introduction
• Profile based PQC
• Context based PQC
• Conclusion

Difficulties

• Web queries are
  – Short and sparse: “adi”, “cs”, “ps”
  – Noisy: “contnt”, “gogle”
  – Full of newly emerging words: “windows7”
• Training data are hard for humans to label
  – Experts may have different understandings of the same ambiguous query, e.g. “Apple”, “Office”

Method 1: Profile Based

• Profile(U) = { <Query, Search-Result, Clicked-URL> } collected in the past
• Profile based Personalized Query Classification
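To make the profile representation concrete, here is a minimal sketch of the triple store described above; the type and field names are illustrative, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class ProfileEntry:
    query: str            # the submitted query Q
    search_results: list  # URLs returned for the query
    clicked_urls: list    # URLs the user actually clicked

@dataclass
class Profile:
    user: str
    history: list = field(default_factory=list)  # past ProfileEntry triples

    def add(self, query, results, clicks):
        self.history.append(ProfileEntry(query, results, clicks))
```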

[Figure: clickthrough logs collected from many different users form per-user profiles; the same query, “Michael Jordan”, can then be classified differently for each user]

Method 2: Context Based

• Context = a session of queries submitted by the user
• Example session: “Graphical Model”, “Machine Learning”, “UCB”, “Michael Jordan”

Outline

• Introduction

• Profile based PQC

• Context based PQC

• Conclusion

How to construct a user profile?

• To achieve personalized query classification, under an independence assumption:

  p(c|q,u) ∝ p(q|c) p(u|c) p(c)

• ACM KDDCUP 2005 solution: estimating p(q|c)
• Our focus: estimating p(u|c) for personalization
• Difficulty: sparseness
  – Too many possible categories
  – Limited information for each user
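As a rough sketch of how this decomposition produces a ranked category list (the probability tables below are hypothetical toy values, and the estimators behind them are stand-ins, not the paper's):

```python
import math

def rank_categories(query, user, p_q_given_c, p_u_given_c, p_c):
    """Rank categories by log p(q|c) + log p(u|c) + log p(c)."""
    def score(c):
        return (math.log(p_q_given_c.get((query, c), 1e-9))
                + math.log(p_u_given_c.get((user, c), 1e-9))
                + math.log(p_c.get(c, 1e-9)))
    return sorted(p_c, key=score, reverse=True)

# hypothetical toy estimates for the query "michael jordan"
p_c = {"Sports/Basketball": 0.4, "Information/Research": 0.3, "Shopping": 0.3}
p_q_given_c = {("michael jordan", "Sports/Basketball"): 0.05,
               ("michael jordan", "Information/Research"): 0.02,
               ("michael jordan", "Shopping"): 0.01}
p_u_given_c = {("ml_student", "Information/Research"): 0.3,
               ("ml_student", "Sports/Basketball"): 0.05,
               ("ml_student", "Shopping"): 0.02}

print(rank_categories("michael jordan", "ml_student", p_q_given_c, p_u_given_c, p_c))
# a research-oriented user sees Information/Research promoted above Sports/Basketball
```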

Categorized Clickthrough Data: Too Few!

• Clickthrough data
  [Figure: a user (Person 1) submits the query “SIGIR” to a search engine and clicks some of the returned results]

Collaborative Classification

• Leverage information from similar users: user-class matrix

         C1   C2   C3   C4   C5
User A   ✓    ✗    ✓    ?    ✗
User B   ✓    ✓    ?    ✗    ✓
User C   ✗    ✗    ✓    ?    ✗
User D   ✓    ?    ✓    ✓    ✗

✓ = interested, ✗ = not interested, ? = unknown
(each entry can also be a value indicating the degree of interest)
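As a rough memory-based sketch of filling in a “?” entry from similar users; the agreement-ratio similarity below is an assumed simplification, not necessarily the measure used in the paper:

```python
import numpy as np

# user-class interest matrix: 1 = interested, 0 = not interested, nan = unknown
R = np.array([
    [1, 0, 1, np.nan, 0],   # User A
    [1, 1, np.nan, 0, 1],   # User B
    [0, 0, 1, np.nan, 0],   # User C
    [1, np.nan, 1, 1, 0],   # User D
])

def similarity(u, v):
    """Fraction of co-observed classes on which two users agree."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    return (u[mask] == v[mask]).mean() if mask.any() else 0.0

def predict(R, user, cls):
    """Similarity-weighted vote of other users' known entries for this class."""
    num = den = 0.0
    for other in range(R.shape[0]):
        if other != user and not np.isnan(R[other, cls]):
            s = similarity(R[user], R[other])
            num += s * R[other, cls]
            den += s
    return num / den if den else np.nan

print(predict(R, 0, 3))  # User A's unknown interest in C4, estimated from Users B and D
```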

Extending Collaborative Filtering (CF) Models to Ranking (Liu and Yang, SIGIR 2008)

• Previous methods for CF:
  – Memory based approach: find users with similar interests to help predict missing values
  – Model based approach: estimate probabilities based on the new user’s known values
• We propose a collaborative ranking model to improve the model based approach
  – Using preferences (rankings) instead of rating values
  – Better at estimating users’ preferences

Nathan Liu and Qiang Yang. EigenRank: Collaborative Filtering via Rank Aggregation. In Proceedings of the ACM SIGIR Conference (SIGIR ’08), Singapore, 2008.

Rating database:

       y1  y2  y3  y4
  U1    5   4   ?   ?
  U2    ?   5   2   5
  U3    4   ?   4   3
  U4    1   5   ?   5

Active user ratings:   a:  1  ?  ?  5
Predicted ratings:     a:  1  5  2  5

Rating prediction, then sorting, yields the ranked item list: 1. Item y2, 2. Item y3

• Collaborative Ranking Framework
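A minimal sketch of the preference-based idea under toy data (this uses a simple weighted aggregation of neighbors' pairwise preferences, a simplification of EigenRank's actual rank-aggregation methods):

```python
from itertools import combinations

# hypothetical rating data; the active user has rated only y1 and y4
ratings = {
    'U2': {'y1': 4, 'y2': 5, 'y3': 2, 'y4': 2},
    'U3': {'y1': 1, 'y2': 2, 'y3': 5, 'y4': 5},
}
active = {'y1': 5, 'y4': 1}

def kendall_sim(a, b):
    """Rank correlation over commonly rated items (+1 = same order, -1 = reversed)."""
    score = n = 0
    for i, j in combinations(set(a) & set(b), 2):
        da, db = a[i] - a[j], b[i] - b[j]
        if da and db:
            score += 1 if da * db > 0 else -1
            n += 1
    return score / n if n else 0.0

def preference(i, j):
    """Aggregate neighbors' i-vs-j preferences, weighted by similarity to the active user."""
    total = 0.0
    for r in ratings.values():
        if i in r and j in r and r[i] != r[j]:
            total += kendall_sim(active, r) * (1 if r[i] > r[j] else -1)
    return total

unseen = ['y2', 'y3']
ranked = sorted(unseen, key=lambda i: sum(preference(i, j) for j in unseen if j != i),
                reverse=True)
print(ranked)  # ['y2', 'y3']: unseen items are ranked directly, without predicting ratings first
```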

Collaborative Ranking for Intention Mining

• Input: a preference matrix, |User| × |Preference = {(URL1 < URL2)}|, per user or user group
• Output: an interest score matrix P(U|C), |user or user group| × |intention category|
• Our objective is to uncover the interest probability P(U|C) consistent with the observed preferences for each query

Solution: Automatically Generate Labeled Data (to assist human labelers)

• Clickthrough
  – Connects queries and URLs
  – Contains users’ personal interpretations of a query

[Figure: for the same query, User A clicks url a and User B clicks url b; the clicked URLs correspond to different categories C1 and C2]

We need the category information for URLs … (see the sketch below)
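A minimal sketch of that auto-labeling step, assuming a `url_category` lookup built from a directory such as ODP (the URLs and categories below are hypothetical):

```python
from collections import defaultdict

# hypothetical URL-to-category lookup, e.g. built from an ODP-style directory
url_category = {
    "http://store.example.com/shoes": "Shopping",
    "http://nba.example.com/stats": "Sports",
}

def auto_label(clickthrough):
    """clickthrough: iterable of (user, query, clicked_url) triples."""
    labels = defaultdict(set)
    for user, query, url in clickthrough:
        if url in url_category:
            # the user's click reveals their personal interpretation of the query
            labels[(user, query)].add(url_category[url])
    return labels

log = [("UserA", "jordan", "http://store.example.com/shoes"),
       ("UserB", "jordan", "http://nba.example.com/stats")]
print(auto_label(log))  # the same query gets different labels for different users
```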

Experimental Results: F1 metric

How to enlarge the training set?

• A few human labeled data
• A HUGE number of clickthrough logs without labels
• Online knowledge bases, such as ODP and Wikipedia

Online Knowledge Base, such as Wikipedia

• Plentiful documents
• Links
• Meaningful ontology

“Label” Retrieval from Online KB

• Taking Online Commercial Intention as an example
• Labels on result pages:
  – Shopping: Commercial
  – Sports: non-Commercial
  – Video Games: Commercial
  – Research: non-Commercial
• Use the labeled result pages as “seeds” to retrieve the most relevant documents (via the Wikipedia concept graph) as training data

Obtain “Pseudo-Relevance” Data

• Start from a few human labeled data and a HUGE number of unlabeled clickthrough logs
• We learn a classifier using the retrieved “labeled” documents
• We apply the classifier to “label” the HUGE clickthrough log
• We can then use the HUGE “labeled” clickthrough log for evaluation
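A minimal sketch of the pipeline, with toy document strings standing in for the seed result pages, the Wikipedia pool, and the clickthrough log; the nearest-seed labeling rule is an assumption, not the paper's exact retrieval:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

# a few human labeled seed pages (hypothetical toy data)
seed_texts = ["buy discount basketball shoes online store price",
              "basketball league season history statistics research"]
seed_labels = ["Commercial", "non-Commercial"]

# unlabeled knowledge-base documents, e.g. Wikipedia article texts (toy stand-ins)
kb_docs = ["online shopping retail store discount price shipping",
           "academic research university statistics publication",
           "sneaker store price deals buy online",
           "league season team history players research"]

vec = TfidfVectorizer()
kb_X = vec.fit_transform(kb_docs)
seed_X = vec.transform(seed_texts)

# retrieve pseudo-relevance labels: each KB doc inherits the label of its most similar seed
sims = cosine_similarity(kb_X, seed_X)
pseudo_labels = [seed_labels[row.argmax()] for row in sims]

# train a classifier on the pseudo-labeled KB documents ...
clf = LogisticRegression().fit(kb_X, pseudo_labels)

# ... and apply it to "label" (a snippet of) the huge clickthrough log
log_snippets = ["cheap jordan sneakers buy"]
print(clf.predict(vec.transform(log_snippets)))
```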

Preliminary results on F(URL)C

• We evaluated the performance of the classifier trained with the relevant documents retrieved from Wikipedia

• AOL query data set, 10,000 queries held out for testing

F1 for 18 classes on the AOL query classification task:

Number of labeled   Training queries enriched   Training documents
query seeds         by search snippets          retrieved from Wikipedia
100                 12%                         28% (5,000 instances)
200                 21%                         36% (10,000 instances)
400                 31%                         38% (15,000 instances)

Outline

• Introduction

• Profile based PQC

• Context based PQC: Hao Hu, Huanhuan Cao, et al. @ SIGIR 2009, ACML 2009.

• Conclusion

Context based PQC for Online Commercial Intention

• The commercial intention of the same query can be identified given its context information
• Example session: “Allen Iverson”, “shoes”, “T-shirt”, “Michael Jordan” → Commercial! Offer ads!

Context based PQC for Online Commercial Intention [Cao et al., SIGIR ’09]

• The commercial intention of the same query can be identified given its context information
• Example session: “Graphical Model”, “Machine Learning”, “UCB”, “Michael Jordan” → Non-Commercial! Redirect to scholar search!

Two questions:

• How do we model query context?

• How do we detect whether two queries are semantically similar?

Feature Generation/Enrichment

Graphical Models

Conditional Random Field

Motivation: model the query log as a conditional random field, so that relationships between consecutive queries, and even between non-consecutive “skip” queries, can be modeled.

Question: How do we decide whether two “skip queries” (non-consecutive queries) are related and should be linked?
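As a rough sketch of the context model, assuming the third-party sklearn-crfsuite package: a linear-chain CRF over the queries in a session. Note this only captures consecutive-query dependencies; the model in the paper also adds skip edges between related non-consecutive queries, which a plain linear-chain CRF cannot express.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def query_features(session, t):
    """Features for the t-th query in a session (a simplified, assumed feature set)."""
    feats = {'query': session[t].lower(), 'position': str(t)}
    if t > 0:
        feats['prev_query'] = session[t - 1].lower()  # consecutive-query context
    return feats

# hypothetical training sessions labeled with commercial intention (C / N)
sessions = [["Allen Iverson", "shoes", "T-shirt", "Michael Jordan"],
            ["Graphical Model", "Machine Learning", "UCB", "Michael Jordan"]]
labels = [["C", "C", "C", "C"],
          ["N", "N", "N", "N"]]

X = [[query_features(s, t) for t in range(len(s))] for s in sessions]
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X, labels)

test = ["Graphical Model", "Michael Jordan"]
print(crf.predict([[query_features(test, t) for t in range(len(test))]]))
```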

Semantic Relationship between Queries

• Given Query A and Query B, how do we determine their degree of semantic relatedness?
  – Send both queries to search engines
  – Obtain the search results
  – Measure the distance between the two result sets (see the sketch below)
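A minimal sketch of one such distance, assuming a hypothetical `fetch_snippets(query)` helper (e.g. a wrapper around a search API) that returns the top result snippets for a query:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def query_similarity(query_a, query_b, fetch_snippets):
    """Cosine similarity between the TF-IDF vectors of the two queries' result snippets."""
    docs = [" ".join(fetch_snippets(query_a)), " ".join(fetch_snippets(query_b))]
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# two non-consecutive queries would then be linked by a skip edge whenever
# query_similarity(...) exceeds a tuned threshold (see the tuning note below).
```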


Evaluation

• Using context information vs. not using context information

Preliminary Experimental Results of PQC for Online Commercial Intention

• Dataset
  – AOL query log data
  – Around 20M Web queries
  – Around 650K Web users
  – Data is sorted by anonymized user ID and arranged sequentially
• Each item of the clickthrough log contains
  – {AnonID, Query, QueryTime, ItemRank, ClickURL}
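A minimal sketch of turning such a log into per-user query sessions, assuming a tab-separated file with the fields above; the 30-minute session gap is an assumption, not from the slides:

```python
import csv
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed gap separating two sessions

def sessions(path):
    """Yield (user, [queries]) sessions from an AOL-style tab-separated log."""
    by_user = defaultdict(list)
    with open(path, newline='') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            t = datetime.strptime(row['QueryTime'], '%Y-%m-%d %H:%M:%S')
            by_user[row['AnonID']].append((t, row['Query']))
    for user, events in by_user.items():
        events.sort()  # chronological order within each user
        session = [events[0][1]]
        for (prev, _), (cur, q) in zip(events, events[1:]):
            if cur - prev > SESSION_GAP:
                yield user, session
                session = []
            session.append(q)
        yield user, session
```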

Preliminary Results

• In our preliminary experimental studies, we annotated the clickthrough logs of four users with OCI (commercial / non-commercial) status
• Larger-scale experimental studies to follow
• Evaluation metric: standard F1-measure
• Baseline classifier: the classifier from Dai et al.’s WWW 2006 work (http://adlab.msn.com/OCI/OCI.aspx)

F1 for users on AOL data:

Model                     User 1   User 2   User 3   User 4
Baseline (non-context)    83.4%    82.3%    84.0%    83.1%
Context based PQC         92.7%    94.2%    91.3%    92.6%

Preliminary Results

The parameter we tune is the threshold used to decide whether to add a “skip edge” to the CRF model.

Ongoing work: Personalized Query Classification

• Efficiency

• More ground truth data for evaluation

PQC and Personalized Search

• Similar input:
  – Query log, clickthrough data, IP address, etc.
• Different output:
  – Personalized search: ranked results
  – PQC: discrete intention categories
    • Application: advertisements, etc.

Conclusions: PQC

• Have user profile information?
  – Profile = <User, Query, URLs>
  – Output = Class
  – Method = collaborative ranking
• Have query stream information?
  – Context = <User, Query-Stream, URLs>
  – Output = Class
  – Method = CRF-based method

Q & A