
CMPE 493 INTRODUCTION TO INFORMATION RETRIEVAL

PERSONALIZED QUERY EXPANSION FOR THE WEB

Chirita, P.A., Firan, C.S., and Nejdl, W. SIGIR, 2007, pp. 7-14

Bahtiyar Kaba, 2007102824

Introduction

• Aim: improve the search output by expanding the query, exploiting the user's PIR (Personal Information Repository).

• Why? – Inherent ambiguity of short queries.

• Ex: "language ambiguity" => a computer scientist and a linguist probably search for something different.

• So, help them formulate a better query by expansion, e.g. "language ambiguity in computing".

• Come up with the latter term by investigating the user's desktop (PIR).

• Studies show 80% of users prefer personalized output for their searches.

• What will we use? – The personal collection of all documents: text documents, emails, cached Web pages, etc.

• By personalizing this way, we have 2 advantages:

– Better description of the user's interests, since there is a large amount of information.

– Privacy: "profile" information is extracted and exploited locally; we need not track the URLs clicked or the queries issued.

Algorithms

• Local desktop query context:

– Determine expansion terms from the personal documents matching the query best.

– Keyword-, expression-, and summary-based techniques.

• Global desktop collection:

– Investigate expansions based on co-occurrence metrics and external thesauri over the entire personal repository.

• Before the details of these, a glance at previous work.

Previous Work

• Two IR research areas: Search Personalization and Automatic Query Expansion

• Many algorithms exist for both domains, but not as many for combining them.

• Personalized search: ranking search results according to user profiles (e.g., by means of past search history).

• Query Expansion: derive a better formulation of the query to enhance retrieval, based on exploiting social or collection-specific characteristics.

Personalized search

• Two major components:

– User Profiles: generated from features of the visited pages.

• Topic preference vectors -> Topic-Sensitive PageRank.

• Advantage of being easy to obtain and process.

• But they may not suffice to obtain a good understanding of the user's interests, and they raise privacy concerns.

– The Personalization Algorithm itself:

• Topic-oriented PageRank: compute PageRank vectors per topic, then bias the results according to these vectors and their similarity to the search terms.

Query Expansion

• Relevance Feedback:

– Useful information for the expansion terms can be extracted from the relevant documents returned.

– Extract such keywords based on term frequency, document frequency, or summarization of top-ranked documents.

• Co-occurrence:

– Terms highly co-occurring together were shown to increase precision; assess term relationship levels.

• Thesaurus:

– Expand the query with new terms having close meanings.

– These can be extracted from a large thesaurus, e.g., WordNet.

Query Expansion with PIR

• We have a rich personal collection, but the data is very unstructured in format, content, etc.

• So, we analyze the PIR at various granularity levels, from term frequency within desktop documents to global co-occurrence statistics.

• Then an empirical analysis of the algorithms is proposed.

Local Desktop Analysis

• Similar to the relevance feedback method for query expansion, but this time we use the best hits from the PIR.

• Investigate at 3 granularity levels:

– Term and document frequency:

• Advantage of being fast to compute, as we have a previous offline computation.

• Independently associate a score with each term based on the two statistics.

Local Desktop Analysis

• Term Frequency:

– Use the actual frequency information and the position where the term first appears.

– TermScore = [1/2 + 1/2 * (nrWords - pos) / nrWords] * log(1 + TF)

– Position information is used because more informative terms tend to appear earlier in the document.
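A minimal Python sketch of this scoring, assuming the document is already tokenized into a lowercase word list (function and variable names are illustrative, not from the paper):

    import math

    def term_score(term, doc_words):
        """Score a candidate expansion term by its frequency (TF) and the
        position of its first appearance in the document."""
        tf = doc_words.count(term)
        if tf == 0:
            return 0.0
        nr_words = len(doc_words)
        pos = doc_words.index(term)  # index of the first appearance
        # Terms appearing earlier get a factor closer to 1, later ones closer to 1/2.
        position_factor = 0.5 + 0.5 * (nr_words - pos) / nr_words
        return position_factor * math.log(1 + tf)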

Local Desktop Analysis

• Document frequency:

– Given the set of top-k relevant documents, generate snippets focusing on the original search request, then order the terms by their DF scores.

– Focusing on the query is necessary, since DF scores are calculated over the entire PIR.

• TFxIDF weighting may not be good for local desktop analysis, since a term with a high DF on the desktop may be rare on the Web.

– Ex: "page-rank" may have a high DF in an IR scientist's PIR, giving it a low TFxIDF score locally, while it works well as an expansion term on the Web.
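A sketch of this selection under stated assumptions: `desktop_df` is a precomputed term-to-document-frequency map over the PIR, and `make_snippet` is a hypothetical helper that extracts a window of text around the query terms:

    def df_expansion_terms(query, top_docs, desktop_df, k=4):
        """Collect candidate terms from query-focused snippets of the
        top-ranked PIR documents and order them by desktop DF."""
        query_terms = set(query.lower().split())
        candidates = set()
        for doc in top_docs:
            snippet = make_snippet(doc, query)  # hypothetical helper
            candidates.update(snippet.lower().split())
        candidates -= query_terms  # do not propose the query itself
        ranked = sorted(candidates, key=lambda t: desktop_df.get(t, 0), reverse=True)
        return ranked[:k]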

Local Desktop Analysis

• Lexical Dispersion Hypothesis: an expression's lexical dispersion can be used to identify key concepts.

• Expressions matching {adjective? noun+}.

• Generate such compound expressions offline and use them for query expansion at runtime.

• Further improvements by ordering according to lexical dispersion.
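A sketch of the {adjective? noun+} extraction using NLTK's part-of-speech tagger (assuming the tokenizer and tagger models are installed; the pattern handling is simplified):

    import nltk  # requires the 'punkt' tokenizer and POS tagger data

    def lexical_compounds(text):
        """Extract compound expressions matching {adjective? noun+},
        e.g. 'language ambiguity' or 'personal information repository'."""
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        compounds, current, has_noun = [], [], False
        for word, tag in tagged:
            if tag.startswith('NN'):               # a noun extends the run
                current.append(word.lower())
                has_noun = True
            else:
                if has_noun and len(current) > 1:  # flush a finished compound
                    compounds.append(' '.join(current))
                # an adjective may start a new candidate; anything else resets
                current = [word.lower()] if tag.startswith('JJ') else []
                has_noun = False
        if has_noun and len(current) > 1:
            compounds.append(' '.join(current))
        return compounds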

Local Desktop Analysis

• Summarization:

– The set of relevant desktop documents is identified.

– Then a summary containing the most important sentences is generated as output.

– The most comprehensive output, but not efficient, as it cannot be computed offline.

– Rank the sentences according to their salience scores, computed as follows:

Local Desktop Analysis

• Summarization:

– SalienceScore = SW^2/TW + PS + TQ^2/NQ

– SW: number of significant terms in the sentence; a term is significant if its TF is above a threshold ms, where NS is the number of sentences in the document:

• ms = 7 - 0.1 * (25 - NS), if NS < 25; 7, if 25 <= NS <= 40; 7 + 0.1 * (NS - 40), if NS > 40

– TW: total number of terms in the sentence.

– PS: position score.

• PS = (Avg(NS) - SentenceIndex) / Avg(NS)^2

• Scaling it this way, short documents are not affected, as they do not have summaries at the beginning.

– The final term balances towards the original query: the more query terms a sentence contains, the more related it is (TQ: query terms in the sentence, NQ: number of query terms).
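A sketch of this sentence scoring under the definitions above (the document-level TF map, the average sentence count Avg(NS), and all names are illustrative assumptions):

    def ms_threshold(ns):
        """TF threshold above which a term counts as 'significant',
        as a function of the number of sentences NS in the document."""
        if ns < 25:
            return 7 - 0.1 * (25 - ns)
        if ns <= 40:
            return 7.0
        return 7 + 0.1 * (ns - 40)

    def salience_score(sentence, sentence_index, doc_tf, ns, avg_ns, query_terms):
        """SalienceScore = SW^2/TW + PS + TQ^2/NQ for one sentence."""
        words = sentence.lower().split()
        tw = len(words)
        sw = sum(1 for w in words if doc_tf.get(w, 0) > ms_threshold(ns))
        tq = sum(1 for q in query_terms if q.lower() in words)
        ps = (avg_ns - sentence_index) / (avg_ns ** 2)  # position score
        return sw ** 2 / tw + ps + tq ** 2 / len(query_terms)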

Global Desktop Analysis

• Previous techniques were based on relevant documents for the query.

• Now, we rely on information across the entire PIR of the user.

• We have two techniques:

– Co-occurrence Statistics

– Thesaurus-Based Expansion

Global Desktop Analysis

• For each term, we compute the terms co-occurring most frequently with it in our PIR collection, then use this information at runtime to expand our queries.

Global Desktop Analysis

• Algorithm:

• Off-line computation:

1: Filter potential keywords k with DF in [10, ..., 20% * N]
2: For each keyword ki
3:   For each keyword kj
4:     Compute SC(ki, kj), the similarity coefficient of (ki, kj)

• On-line computation:

1: Let S be the set of keywords potentially similar to an input expression E.
2: For each keyword k of E:
3:   S <- S ∪ TSC(k), where TSC(k) contains the Top-K terms most similar to k
4: For each term t of S:
5a:   Let Score(t) <- ∏_{k ∈ E} (0.01 + SC(t, k))
5b:   Let Score(t) <- #DesktopHits(E|t)
6: Select the Top-K terms of S with the highest scores.

Global Desktop Analysis

• We have each term's correlated terms calculated offline. At runtime we need to calculate the correlation of every candidate term with the entire query. Two approaches:

– Product of the correlations between the term and all query keywords (step 5a).

– The number of desktop documents in which the proposed term co-occurs with the entire query (step 5b).

• Similarity coefficients are calculated using:

– Cosine similarity (correlation coefficient)

– Mutual information

– Likelihood ratio
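A sketch of the offline/online split, with cosine similarity standing in for the coefficient; the document representation (a set of terms per document) and all names are illustrative:

    import math
    from collections import defaultdict
    from itertools import combinations

    def cooccurrence_statistics(documents):
        """Offline: document frequency of each term and of each term pair.
        In practice, first filter keywords with DF in [10, 20% * N] (step 1)."""
        df, pair_df = defaultdict(int), defaultdict(int)
        for doc_terms in documents:          # each document as a set of terms
            for t in doc_terms:
                df[t] += 1
            for a, b in combinations(sorted(doc_terms), 2):
                pair_df[(a, b)] += 1
        return df, pair_df

    def cosine_sc(a, b, df, pair_df):
        """Cosine-style similarity coefficient of two terms."""
        joint = pair_df.get((min(a, b), max(a, b)), 0)
        return joint / math.sqrt(df[a] * df[b]) if df[a] and df[b] else 0.0

    def expansion_score(term, query_terms, df, pair_df):
        """Online, step 5a: product of (0.01 + SC) over all query keywords."""
        score = 1.0
        for k in query_terms:
            score *= 0.01 + cosine_sc(term, k, df, pair_df)
        return score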

Global Desktop Analysis

• Thesaurus-Based Expansion:

– Identify the set of terms related to the query terms (using thesaurus information), then calculate the co-occurrence level of each possible expansion (i.e., of the original search query with the new term). Select the ones with the highest frequency.

Thesaurus-Based Expansion

1: For each keyword k of an input query Q:
2:   Select the following sets of related terms:
2a:     Syn: all synonyms
2b:     Sub: all sub-concepts residing one level below k
2c:     Super: all super-concepts residing one level above k
3: For each set Si of the above-mentioned sets:
4:   For each term t of Si:
5:     Search the PIR with (Q|t), i.e., the original query expanded with t
6:     Let H be the number of hits of the above search (i.e., the co-occurrence level of t with Q)
7: Return the Top-K terms as ordered by their H values.
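A sketch of steps 1-7 with NLTK's WordNet interface; `search_pir` stands in for a desktop-search call returning a hit count and is hypothetical:

    from nltk.corpus import wordnet as wn  # requires the WordNet corpus data

    def related_terms(keyword):
        """Step 2: synonyms, plus hyponyms (sub-concepts) and hypernyms
        (super-concepts) one level away from the keyword in WordNet."""
        syn, sub, sup = set(), set(), set()
        for synset in wn.synsets(keyword):
            syn.update(l.name().replace('_', ' ') for l in synset.lemmas())
            for h in synset.hyponyms():
                sub.update(l.name().replace('_', ' ') for l in h.lemmas())
            for h in synset.hypernyms():
                sup.update(l.name().replace('_', ' ') for l in h.lemmas())
        syn.discard(keyword)
        return syn, sub, sup

    def thesaurus_expansion(query, search_pir, top_k=4):
        """Steps 3-7: rank candidate terms by their number of PIR hits
        when appended to the original query."""
        hits = {}
        for k in query.split():
            for term_set in related_terms(k):
                for t in term_set:
                    hits[t] = search_pir(query + ' ' + t)  # hypothetical hook
        return sorted(hits, key=hits.get, reverse=True)[:top_k]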

Experiments

• 18 subjects indexed their content under their selected paths: emails, documents, Web cache.

• Types of queries:

– Random log query, hitting 10 documents in the PIR.

– Self-selected specific query, which the subject thinks has one meaning.

– Self-selected ambiguous query, which the subject thinks has more than one meaning.

• We set the number of expansion terms to 4.

Experiments

• Measure – Discounted Cumulative Gain:

• DCG(i) = G(1), if i = 1; DCG(i) = DCG(i-1) + G(i)/log(i), otherwise.

• Gives more weight to highly ranked documents and incorporates different relevance levels.
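A direct transcription of this recurrence (the slide leaves the logarithm base unspecified; base 2 is the common choice and is assumed here):

    import math

    def dcg(gains):
        """Discounted cumulative gain of a ranked list of gains G(1..n)."""
        total = 0.0
        for i, g in enumerate(gains, start=1):
            total += g if i == 1 else g / math.log2(i)
        return total

    # e.g. a relevant hit at rank 1 counts fully, one at rank 4 only half:
    # dcg([1, 0, 0, 0]) == 1.0, dcg([0, 0, 0, 1]) == 0.5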

Experiments

• Labels for the following results tables:

– Google: actual Google results

– TF, DF: term and document frequency, as described

– LC, LC[O]: regular and optimized lexical compounds

– SS: sentence selection (summarization)

– TC[CS], TC[MI], TC[LR]: term co-occurrence statistics with cosine similarity, mutual information, and likelihood ratio, respectively

– WN[SYN], WN[SUB], WN[SUP]: WordNet-based thesaurus expansion with synonyms, sub-concepts, and super-concepts, respectively

Results for log queries

Results for selected queries

Results

• For log queries, the best performance is achieved with TF, LC[O], and TC[LR].

• We get good results with the simple keyword- and expression-oriented techniques (TF, LC[O]), whereas the more complicated ones do not show significant improvements.

• For unambiguous selected queries we do not see much improvement, but for ambiguous ones there is a clear benefit.

• For clear (unambiguous) queries, decreasing the number of expansion terms can bring further improvements. This suggests the idea of adaptive algorithms.

Adaptivity

• An optimal personalized query expansion algorithm should adapt itself according to the initial query.

• How should we measure this, i.e., how much personal data should be fed into our search?

• Query Length:

– The number of words in the user query; not effective, as there are both short and long complicated queries.

• Query Scope:

– IDF of the entire query:

• log(#documents in collection / #hits for query)

• Performs well when the collection is focused on a single topic.

• Query Clarity:

– Measures the divergence between the language model of the query and the language model of the collection (PIR).

– Clarity = Σ_w P(w | Query) * log(P(w | Query) / P(w)), where w is a word in the query, P(w | Query) is the probability of the word in the query, and P(w) its probability in the entire collection.

• Calculate “scope” for the PIR and “clarity” for the web.
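A sketch of both measures; `collection_tf` is an assumed term-frequency map over the whole collection (PIR, or a Web sample for clarity), and all names are illustrative:

    import math
    from collections import Counter

    def query_scope(num_docs, num_hits):
        """Scope = IDF of the whole query: log(#docs in collection / #hits)."""
        return math.log(num_docs / max(num_hits, 1))

    def query_clarity(query_terms, collection_tf):
        """Clarity = KL divergence between the query language model
        and the collection language model."""
        q_tf = Counter(t.lower() for t in query_terms)
        q_total = sum(q_tf.values())
        c_total = sum(collection_tf.values())
        clarity = 0.0
        for w, tf in q_tf.items():
            p_w_query = tf / q_total
            p_w = collection_tf.get(w, 0) / c_total
            if p_w > 0:
                clarity += p_w_query * math.log(p_w_query / p_w)
        return clarity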

• We will use LC[O] (best performance in the previous experiment), TF, and WN[SYN], which produced good first and second expansion terms.

• Tailor the number of expansion terms as a function of the query's ambiguity within the PIR (scope) and on the Web (clarity).

• The scores for the combinations of scope and clarity levels are as follows:

Clarity Levels

Experiments

• A similar approach is taken as in the previous experiments.

• For top log queries, an improvement over Google and even over the static methods (number of terms = 4).

• For random queries, again better results than Google, but behind the static methods; we may need a better selection of the number of expansion terms.

• For self-selected queries:

– A clear improvement for ambiguous queries.

– A slight performance increase for clear queries.

• The results suggest that adaptivity is a further step for research in Web search personalization.

Conclusion

• Five techniques for determining expansion terms generated from personal documents.

• Empirical analysis shows a 51.28% improvement.

• Further work adapts the expansion process to the search query.

• This brings an additional improvement of 8.47%.

Further Work

• Investigations on how to optimally select the number of expansion terms.

• Other query expansion approaches, e.g., Latent Semantic Analysis.

Thank you…