Chapter 5: Query Operations
Hassan Bashiri
April 2009
1
Cross-Language
• What is CLIR?
• Users enter their query in one language and the search engine retrieves relevant documents in other languages.
[Diagram: English Query → Retrieval System → French Documents]
2
Cross-Language Text Retrieval
  Query Translation
    Controlled Vocabulary
    Free Text
      Knowledge-based
        Ontology-based
        Dictionary-based
        Thesaurus-based
      Corpus-based
        Parallel (term-aligned, sentence-aligned, document-aligned, unaligned)
        Comparable
  Document Translation
    Text Translation
    Vector Translation
3
Query Language
• Visual languages. Example: a library shown on the screen; the user acts on it (takes books, opens catalogs, etc.)
• Better Boolean queries: "I need books by Cervantes AND Lope de Vega"?!
4
IR Interface
Query interface
Selection interface
Examination interface
Document delivery
5
Retrieval System Model
[Diagram: User → Query Formulation → Detection (against the Index) → Selection → Examination → Delivery; Docs → Indexing → Index]
6
Starfield
7
Query Formulation
• Without detailed knowledge of the collection and the retrieval environment, it is difficult to formulate queries that are well designed for retrieval
• Good retrieval typically needs several query formulations:
  - First formulation: a naïve attempt to retrieve relevant information
  - The documents initially retrieved are examined for relevance information
  - Improved query formulations then retrieve additional relevant documents
• Query reformulation:
  - Expanding the original query with new terms
  - Reweighting the terms in the expanded query
8
Three approaches
• Approaches based on feedback from users (relevance feedback)
• Approaches based on information derived from the set of initially retrieved documents (local analysis)
• Approaches based on global information derived from the whole document collection
9
User relevance feedback
• The most popular query reformulation strategy
• Cycle: the user is presented with a list of retrieved documents and marks those which are relevant
• In practice: the top 10-20 ranked documents are examined; the process can be incremental
• Select important terms from the documents assessed relevant by the user, and enhance the importance of these terms in a new query
• Expected effect: the new query moves towards the relevant documents and away from the non-relevant documents
• For instance: Q1: US Open → Q2: US Open Robocup
10
User relevance feedback
Two basic techniques:
• Query expansion: add new terms from relevant documents
• Term reweighting: modify term weights based on user relevance judgements
11
Query Expansion and Term Reweighting for the Vector Model
Basic idea:
• Relevant documents resemble each other
• Non-relevant documents have term-weight vectors which are dissimilar from those of the relevant documents
• The reformulated query is moved closer to the term-weight vector space of the relevant documents
12
13
Query Expansion and Term Reweighting for the Vector Model (Continued)
• Cr: set of relevant documents among all documents in the collection
• Dr: set of relevant documents, as identified by the user, among the retrieved documents
• Dn: set of non-relevant documents among the retrieved documents
14
User relevance feedback: Vector Space Model
If the complete set Cr of relevant documents were known in advance, the optimal query would be:

qopt = (1/|Cr|) Σ_{dj ∈ Cr} dj − (1/(N − |Cr|)) Σ_{dj ∉ Cr} dj

where:
• Dr: set of relevant documents, as identified by the user, among the retrieved documents
• Dn: set of non-relevant documents among the retrieved documents
• Cr: set of relevant documents among all documents in the collection
• |Dr|, |Dn|, |Cr|: numbers of documents in the sets Dr, Dn, Cr, respectively
• α, β, γ: tuning constants
15
Since Cr is not known in advance, the modified query qm is calculated from the user's relevance judgements:

Standard Rocchio:  qm = α·q + (β/|Dr|) Σ_{dj ∈ Dr} dj − (γ/|Dn|) Σ_{dj ∈ Dn} dj
Ide Regular:       qm = α·q + β Σ_{dj ∈ Dr} dj − γ Σ_{dj ∈ Dn} dj
Ide Dec-Hi:        qm = α·q + β Σ_{dj ∈ Dr} dj − γ·max_non-relevant(dj)

where max_non-relevant(dj) is the highest ranked non-relevant document.

α, β, γ: tuning constants (usually β > γ); α = 1 (Rocchio, 1971); α = β = γ = 1 (Ide, 1971); γ = 0 gives positive feedback.

The three variants show similar performance.
16
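The three formulas differ only in how the relevant and non-relevant document vectors are combined. A minimal sketch of the Standard Rocchio variant, assuming term-weight vectors as NumPy arrays; the defaults β = 0.75 and γ = 0.15 are conventional choices (the slide only requires β > γ), and clipping negative weights to zero is a common convention not stated on the slide:

```python
import numpy as np

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard Rocchio reformulation: move the query vector towards the
    centroid of the relevant documents and away from the centroid of the
    non-relevant documents.  All inputs are term-weight vectors."""
    q_m = alpha * query
    if relevant:
        q_m = q_m + beta * np.mean(relevant, axis=0)
    if non_relevant:
        q_m = q_m - gamma * np.mean(non_relevant, axis=0)
    # negative term weights are usually clipped to zero
    return np.maximum(q_m, 0.0)
```

With vocabulary (US, Open, Robocup), a query (1, 1, 0), one relevant document (1, 1, 1) and one non-relevant document (1, 0, 0), the Robocup weight enters the reformulated query, matching the Q1 → Q2 example on slide 10.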
Analysis
Advantages: simplicity; good results
Disadvantages: no optimality criterion is adopted
17
User relevance feedback: Probabilistic Model
The similarity of a document dj to a query q:

sim(dj, q) = Σi wi,q · wi,j · ( log[ P(ki|R) / (1 − P(ki|R)) ] + log[ (1 − P(ki|¬R)) / P(ki|¬R) ] )

where:
• P(ki|R): the probability of observing the term ki in the set R of relevant documents
• P(ki|¬R): the probability of observing the term ki in the set of non-relevant documents

Initial search: P(ki|R) = 0.5 and P(ki|¬R) = ni/N
18
User relevance feedback: Probabilistic Model

With the initial estimates, the similarity reduces to:

sim(dj, q) = Σi wi,q · wi,j · ( log[ 0.5 / (1 − 0.5) ] + log[ (1 − ni/N) / (ni/N) ] )
           = Σi wi,q · wi,j · log[ (N − ni) / ni ]

Feedback search: the estimates are updated from the documents judged relevant:

P(ki|R) = |Dr,i| / |Dr|
P(ki|¬R) = (ni − |Dr,i|) / (N − |Dr|)
19
User relevance feedback: Probabilistic Model

Feedback search: substituting the updated estimates into the similarity formula gives the reweighted ranking:

sim(dj, q) = Σi wi,q · wi,j · ( log[ |Dr,i| / (|Dr| − |Dr,i|) ] + log[ (N − |Dr| − ni + |Dr,i|) / (ni − |Dr,i|) ] )

No query expansion occurs: only the original query terms are reweighted.
20
User relevance feedback: Probabilistic Model

For small values of |Dr| and |Dr,i| (e.g., |Dr| = 1, |Dr,i| = 0) the estimates break down, so an adjustment factor is added:

Alternative 1 (add 0.5):
P(ki|R) = (|Dr,i| + 0.5) / (|Dr| + 1)
P(ki|¬R) = (ni − |Dr,i| + 0.5) / (N − |Dr| + 1)

Alternative 2 (add ni/N):
P(ki|R) = (|Dr,i| + ni/N) / (|Dr| + 1)
P(ki|¬R) = (ni − |Dr,i| + ni/N) / (N − |Dr| + 1)
21
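The adjusted estimates can be turned directly into a log-odds term weight. A sketch using Alternative 1 (the +0.5 adjustment); the function name and argument order are illustrative:

```python
import math

def term_weight(n_i, N, d_r, d_ri):
    """Probabilistic relevance-feedback weight for term ki, using the
    0.5-adjusted estimates (Alternative 1) so that small |Dr| is safe.
    n_i: documents containing ki; N: collection size;
    d_r = |Dr|; d_ri = |Dr,i|."""
    p_rel = (d_ri + 0.5) / (d_r + 1)                # P(ki | R)
    p_nonrel = (n_i - d_ri + 0.5) / (N - d_r + 1)   # P(ki | not R)
    return (math.log(p_rel / (1 - p_rel)) +
            math.log((1 - p_nonrel) / p_nonrel))
```

Terms that occur in more of the user-marked relevant documents receive a larger weight, which is the reweighting effect the model aims for.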
Analysis
Advantages:
• The feedback process is directly related to the derivation of new weights for the query terms
• The term reweighting is optimal (under the model's assumptions)

Disadvantages:
• Document term weights are not considered
• No query expansion is used
22
Query Expansion
Query Expansion
  Global
    Similarity Thesaurus
    Statistical Thesaurus
  Local
    Context Analysis
    Clustering
      Association Clustering
      Metric Clustering
      Scalar Clustering
23
Automatic Local Analysis
• User relevance feedback: known relevant documents contain terms which can be used to describe a larger cluster of relevant documents, with assistance from the user (clustering)
• Automatic analysis: obtain a description (in terms of index terms) of a larger cluster of relevant documents automatically
  - Global strategy: a global thesaurus-like structure is trained from all documents before querying
  - Local strategy: terms from the documents retrieved for a given query are selected at query time
24
Query Expansion based on a Similarity Thesaurus
Query expansion is done in three steps:
1. Represent the query in the concept space used for representation of the index terms
2. Based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q
3. Expand the query with the top r ranked terms according to sim(q,kv)
25
Query Expansion – Step 1
To the query q is associated a vector q in the term-concept space, given by:

q = Σ_{ki ∈ q} wi,q · ki

where wi,q is the weight associated with the index-query pair [ki, q].
26
Query Expansion – Step 2
Compute a similarity sim(q,kv) between each term kv and the user query q:

sim(q, kv) = q · kv = Σ_{ku ∈ q} wu,q · cu,v

where cu,v is the correlation factor between the terms ku and kv.
27
Query Expansion – Step 3
Add the top r ranked terms according to sim(q,kv) to the original query q to form the expanded query q'. To each expansion term kv in the query q' is assigned a weight wv,q' given by:

wv,q' = sim(q, kv) / Σ_{ku ∈ q} wu,q

The expanded query q' is then used to retrieve new documents for the user.
28
Query Expansion - Sample
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A

c(A,A) = 10.991
c(A,C) = 10.781
c(A,D) = 10.781
...
c(D,E) = 10.398
c(B,E) = 10.396
c(E,E) = 10.224
29
Query Expansion - Sample
Query: q = A E E
sim(q,A) = 24.298
sim(q,C) = 23.833
sim(q,D) = 23.833
sim(q,B) = 23.830
sim(q,E) = 23.435

New query: q' = A C D E E
w(A,q') = 6.88
w(C,q') = 6.75
w(D,q') = 6.75
w(E,q') = 6.64
30
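Steps 2 and 3 of the expansion can be sketched as follows, assuming the correlation factors cu,v are already available as a dictionary; the toy correlation values in the usage below are hypothetical, not the ones from the sample slide:

```python
def expand_query(query_weights, corr, r=2):
    """Similarity-thesaurus query expansion (steps 2 and 3).
    query_weights: {term: w_uq}; corr: {(u, v): c_uv} correlation factors.
    Returns the expanded query as {term: weight}."""
    # step 2: sim(q, kv) = sum over query terms ku of w_uq * c_uv
    vocab = {v for (_, v) in corr}
    sim = {v: sum(w * corr.get((u, v), 0.0)
                  for u, w in query_weights.items())
           for v in vocab}
    # step 3: add the top r terms, weighted by sim(q, kv) / sum of w_uq;
    # terms already in the query keep their original weights
    norm = sum(query_weights.values())
    expanded = dict(query_weights)
    for v in sorted(sim, key=sim.get, reverse=True)[:r]:
        if v not in expanded:
            expanded[v] = sim[v] / norm
    return expanded
```

For example, `expand_query({'a': 1.0}, {('a', 'a'): 1.0, ('a', 'b'): 0.8}, r=2)` adds the correlated term `b` with weight 0.8.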
Query Expansion
• Methods of local analysis extract information from the local set of documents retrieved in order to expand the query
• An alternative is to expand the query using information from the whole set of documents
31
Local Cluster
• V(s): a non-empty subset of words which are grammatical variants of each other, e.g., {polish, polishing, polished}
• A canonical form s of V(s) is called a stem, e.g., polish
• Local document set Dl: the set of documents retrieved for a given query
• Local vocabulary Vl (Sl): the set of all distinct words (stems) in the local document set
32
Local Cluster
Basic concept:
• Expand the query with terms correlated to the query terms
• The correlated terms are found in local clusters built from the local document set

Local clusters:
• Association clusters: based on co-occurrences of pairs of terms in documents
• Metric clusters: take the distance between two terms into account
• Scalar clusters: terms with similar neighborhoods have some synonymity relationship
33
Association Clusters
Idea: based on the co-occurrence of stems (or terms) inside documents.

Association matrix:
• fsi,j: the frequency of a stem si in a document dj (dj ∈ Dl)
• m = (fsi,j): an association matrix with |Sl| rows and |Dl| columns
• s = m · mᵗ: a local stem-stem association matrix
34
• cu,v = Σ_{dj ∈ Dl} fsu,j · fsv,j: a correlation between the stems su and sv (an element of m · mᵗ)
• Unnormalized matrix: su,v = cu,v
• Normalized matrix: su,v = cu,v / (cu,u + cv,v − cu,v)
• Su(n): the local association cluster around the stem su — take the u-th row of s and return the set of the n largest values su,v (u ≠ v)
35
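The construction above (build s = m·mᵗ, optionally normalize, take the n largest off-diagonal values per row) can be sketched as:

```python
import numpy as np

def association_clusters(freq, n=2, normalized=True):
    """Local association clusters from a stem-by-document frequency
    matrix `freq` (|Sl| rows, |Dl| columns).  Returns, for each stem u,
    the indices of the n stems with the largest association su,v (u != v).
    Assumes every stem occurs at least once (nonzero denominators)."""
    c = freq @ freq.T                      # c_uv = sum_j f_uj * f_vj
    if normalized:
        diag = np.diag(c)
        s = c / (diag[:, None] + diag[None, :] - c)
    else:
        s = c.astype(float)
    clusters = []
    for u in range(s.shape[0]):
        row = s[u].astype(float).copy()
        row[u] = -np.inf                   # exclude the stem itself
        clusters.append(list(np.argsort(row)[::-1][:n]))
    return clusters
```

The normalized form favors stems that co-occur with su relative to their overall frequency, while the unnormalized form simply favors frequent co-occurrences.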
Metric Clusters
Idea: consider the distance between two terms when computing their correlation factor.

Local stem-stem metric correlation matrix:
• r(ki, kj): the number of words between the keywords ki and kj in a same document
• cu,v: metric correlation between the stems su and sv:

cu,v = Σ_{ki ∈ V(su)} Σ_{kj ∈ V(sv)} 1 / r(ki, kj)
36
• Normalized matrix: su,v = cu,v / (|V(su)| · |V(sv)|)
• Unnormalized matrix: su,v = cu,v
• Su(n): the local metric cluster around the stem su — take the u-th row of s and return the set of the n largest values su,v (u ≠ v)
37
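The correlation cu,v for one document can be sketched from the word positions of the two stems' variants; here r(ki, kj) is taken as the absolute difference of word positions, which is an assumption (the slide defines r as the number of words between the two keywords, which differs by one):

```python
def metric_correlation(positions_u, positions_v):
    """Metric correlation c_uv between two stems in one document.
    positions_u, positions_v: word positions of the variants of su and sv.
    Closer keyword pairs contribute more (1 / distance); coinciding
    positions are skipped to avoid division by zero."""
    return sum(1.0 / abs(pi - pj)
               for pi in positions_u for pj in positions_v
               if pi != pj)
```

Summing this quantity over all documents in the local set Dl yields the matrix entry cu,v.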
Scalar Clusters
Idea:
• Two stems with similar neighborhoods have a synonymity relationship
• The relationship is indirect, induced by the neighborhood

Scalar association matrix:
• The row corresponding to a stem in the term co-occurrence matrix forms its neighborhood
• su,v = (s̄u · s̄v) / (|s̄u| · |s̄v|): the cosine between the two neighborhood vectors
• Su(n): the local scalar cluster around the stem su — take the u-th row and return the set of the n largest values su,v (u ≠ v)
38
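The cosine of neighborhoods above can be sketched in a few lines, assuming the association matrix from the previous slides is available as a NumPy array with no all-zero rows:

```python
import numpy as np

def scalar_cluster_matrix(assoc):
    """Scalar association matrix: entry (u, v) is the cosine between the
    u-th and v-th rows of the association matrix `assoc`, i.e. the
    similarity of the two stems' neighborhoods."""
    norms = np.linalg.norm(assoc, axis=1)
    return (assoc @ assoc.T) / np.outer(norms, norms)
```

Two stems with identical neighborhoods get a scalar similarity of 1 even if they never co-occur directly, which is exactly the induced relationship the slide describes.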
Interactive Search Formulation
• Neighbors of the query term sv: terms su belonging to clusters associated with sv, i.e., su ∈ Sv(n)
• su is called a searchonym of sv
[Figure: terms scattered in space, with the searchonyms su of sv lying inside the cluster Sv(n)]
39
Similarity Thesaurus
• The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence
• These relationships are not derived directly from co-occurrence of terms inside documents
• They are obtained by considering the terms as concepts in a concept space
• In this concept space, each term is indexed by the documents in which it appears
• Terms assume the original role of documents, while documents are interpreted as indexing elements
40
Similarity Thesaurus
Inverse term frequency for document dj:

itfj = log(t / tj)

where:
• t: number of terms in the collection
• N: number of documents in the collection
• fi,j: frequency of occurrence of the term ki in the document dj
• tj: vocabulary (number of distinct terms) of document dj
• itfj: inverse term frequency for document dj

To each term ki is associated a vector:

ki = (wi,1, wi,2, ..., wi,N)
41
Similarity Thesaurus
where wi,j is a weight associated with the index-document pair [ki, dj]. These weights are computed as follows:

wi,j = ( (0.5 + 0.5 · fi,j / maxj(fi,j)) · itfj ) / sqrt( Σ_{l=1..N} (0.5 + 0.5 · fi,l / maxl(fi,l))² · itfl² )
42
Similarity Thesaurus
The relationship between two terms ku and kv is computed as a correlation factor cu,v given by:

cu,v = ku · kv = Σ_{dj} wu,j · wv,j

The global similarity thesaurus is built by computing the correlation factor cu,v for each pair of index terms [ku, kv] in the collection.
43
Step 1: represent the query in the concept space used for representation of the index terms:

q = Σ_{ki ∈ q} wi,q · ki

Step 2: based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q:

sim(q, kv) = q · kv = Σ_{ku ∈ q} wu,q · cu,v

(wu,q: query term weight; cu,v: correlation with the expansion term kv)
44

Step 3: expand the query with the top r ranked terms according to sim(q,kv):

wv,q' = sim(q, kv) / Σ_{ku ∈ q} wu,q
45
Similarity Thesaurus
• This computation is expensive
• However, the global similarity thesaurus has to be computed only once and can be updated incrementally
46