Chapter 5: Query Operations
Hassan Bashiri
April 2009
1
Cross-Language
• What is CLIR?
• Users enter their query in one language and the search engine retrieves relevant documents in other languages.
[Diagram: English Query → Retrieval System → French Documents]
2
Cross-Language Text Retrieval
  Query Translation
    Controlled Vocabulary
    Free Text
      Knowledge-based
        Ontology-based
        Dictionary-based
        Thesaurus-based
      Corpus-based
        Parallel (term-aligned, sentence-aligned, document-aligned, unaligned)
        Comparable
  Document Translation
    Text Translation
    Vector Translation
3
Query Language
• Visual languages. Example: a library shown on the screen; the user acts on it (takes books, opens catalogs, etc.)
• Better Boolean queries: "I need books by Cervantes AND Lope de Vega"?!
4
IR Interface
Query interface
Selection interface
Examination interface
Document delivery
5
Retrieval System Model
[Diagram: User → Query Formulation → Detection (against the Index) → Selection → Examination → Delivery; Docs → Indexing → Index]
6
Starfield
7
Query Formulation
• Without detailed knowledge of the collection and the retrieval environment, it is difficult to formulate queries that are well designed for retrieval
• Good retrieval typically needs several query formulations:
  - First formulation: a naïve attempt to retrieve relevant information
  - The documents initially retrieved are examined for relevance information
  - Improved query formulations then retrieve additional relevant documents
• Query reformulation:
  - Expanding the original query with new terms
  - Reweighting the terms in the expanded query
8
Three approaches
• Approaches based on feedback from users (relevance feedback)
• Approaches based on information derived from the set of initially retrieved documents (local analysis)
• Approaches based on global information derived from the whole document collection
9
User relevance feedback
• The most popular query reformulation strategy
• Cycle: the user is presented with a list of retrieved documents and marks those which are relevant
• In practice: the top 10-20 ranked documents are examined; the process can be incremental
• Select important terms from the documents assessed relevant by the user, and enhance the importance of these terms in a new query
• Expected effect: the new query moves towards the relevant documents and away from the non-relevant documents
• For instance: Q1: US Open → Q2: US Open Robocup
10
User relevance feedback
Two basic techniques:
• Query expansion: add new terms from relevant documents
• Term reweighting: modify term weights based on user relevance judgements
11
Query Expansion and Term Reweighting for the Vector Model
Basic idea:
• Relevant documents resemble each other
• Non-relevant documents have term-weight vectors which are dissimilar from those of the relevant documents
• The reformulated query is moved closer to the term-weight vector space of the relevant documents
12
13
Query Expansion and Term Reweighting for the Vector Model (Continued)
• Cr: set of relevant documents among all documents in the collection
• Dr: set of relevant documents, as identified by the user, among the retrieved documents
• Dn: set of non-relevant documents among the retrieved documents
14
User relevance feedback: Vector Space Model
If the complete set Cr of relevant documents were known in advance, the optimal query would be:

qopt = (1/|Cr|) Σ_{dj ∈ Cr} dj − (1/(N − |Cr|)) Σ_{dj ∉ Cr} dj

where:
• Dr: set of relevant documents, as identified by the user, among the retrieved documents
• Dn: set of non-relevant documents among the retrieved documents
• Cr: set of relevant documents among all documents in the collection
• |Dr|, |Dn|, |Cr|: numbers of documents in the sets Dr, Dn, Cr, respectively
• α, β, γ: tuning constants
15
Since Cr is not known in advance, the modified query qm is calculated from the user's relevance judgements:

Standard Rocchio:  qm = α·q + (β/|Dr|) Σ_{dj ∈ Dr} dj − (γ/|Dn|) Σ_{dj ∈ Dn} dj
Ide Regular:       qm = α·q + β Σ_{dj ∈ Dr} dj − γ Σ_{dj ∈ Dn} dj
Ide Dec-Hi:        qm = α·q + β Σ_{dj ∈ Dr} dj − γ·max_non-relevant(dj)

where max_non-relevant(dj) is the highest ranked non-relevant document.

α, β, γ: tuning constants (usually β > γ); α = 1 (Rocchio, 1971); α = β = γ = 1 (Ide, 1971); γ = 0 gives positive feedback.

The three variants show similar performance.
16
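The three formulas differ only in how the relevant and non-relevant document vectors are combined. A minimal sketch of the Standard Rocchio variant, assuming term-weight vectors as NumPy arrays; the defaults β = 0.75 and γ = 0.15 are conventional choices (the slide only requires β > γ), and clipping negative weights to zero is a common convention not stated on the slide:

```python
import numpy as np

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard Rocchio reformulation: move the query vector towards the
    centroid of the relevant documents and away from the centroid of the
    non-relevant documents.  All inputs are term-weight vectors."""
    q_m = alpha * query
    if relevant:
        q_m = q_m + beta * np.mean(relevant, axis=0)
    if non_relevant:
        q_m = q_m - gamma * np.mean(non_relevant, axis=0)
    # negative term weights are usually clipped to zero
    return np.maximum(q_m, 0.0)
```

With vocabulary (US, Open, Robocup), a query (1, 1, 0), one relevant document (1, 1, 1) and one non-relevant document (1, 0, 0), the Robocup weight enters the reformulated query, matching the Q1 → Q2 example on slide 10.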
Analysis
Advantages: simplicity; good results
Disadvantages: no optimality criterion is adopted
17
User relevance feedback: Probabilistic Model
The similarity of a document dj to a query q:

sim(dj, q) = Σi wi,q · wi,j · ( log[ P(ki|R) / (1 − P(ki|R)) ] + log[ (1 − P(ki|¬R)) / P(ki|¬R) ] )

where:
• P(ki|R): the probability of observing the term ki in the set R of relevant documents
• P(ki|¬R): the probability of observing the term ki in the set of non-relevant documents

Initial search: P(ki|R) = 0.5 and P(ki|¬R) = ni/N
18
User relevance feedback: Probabilistic Model

With the initial estimates, the similarity reduces to:

sim(dj, q) = Σi wi,q · wi,j · ( log[ 0.5 / (1 − 0.5) ] + log[ (1 − ni/N) / (ni/N) ] )
           = Σi wi,q · wi,j · log[ (N − ni) / ni ]

Feedback search: the estimates are updated from the documents judged relevant:

P(ki|R) = |Dr,i| / |Dr|
P(ki|¬R) = (ni − |Dr,i|) / (N − |Dr|)
19
User relevance feedback: Probabilistic Model

Feedback search: substituting the updated estimates into the similarity formula gives the reweighted ranking:

sim(dj, q) = Σi wi,q · wi,j · ( log[ |Dr,i| / (|Dr| − |Dr,i|) ] + log[ (N − |Dr| − ni + |Dr,i|) / (ni − |Dr,i|) ] )

No query expansion occurs: only the original query terms are reweighted.
20
User relevance feedback: Probabilistic Model

For small values of |Dr| and |Dr,i| (e.g., |Dr| = 1, |Dr,i| = 0) the estimates break down, so an adjustment factor is added:

Alternative 1 (add 0.5):
P(ki|R) = (|Dr,i| + 0.5) / (|Dr| + 1)
P(ki|¬R) = (ni − |Dr,i| + 0.5) / (N − |Dr| + 1)

Alternative 2 (add ni/N):
P(ki|R) = (|Dr,i| + ni/N) / (|Dr| + 1)
P(ki|¬R) = (ni − |Dr,i| + ni/N) / (N − |Dr| + 1)
21
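The adjusted estimates can be turned directly into a log-odds term weight. A sketch using Alternative 1 (the +0.5 adjustment); the function name and argument order are illustrative:

```python
import math

def term_weight(n_i, N, d_r, d_ri):
    """Probabilistic relevance-feedback weight for term ki, using the
    0.5-adjusted estimates (Alternative 1) so that small |Dr| is safe.
    n_i: documents containing ki; N: collection size;
    d_r = |Dr|; d_ri = |Dr,i|."""
    p_rel = (d_ri + 0.5) / (d_r + 1)                # P(ki | R)
    p_nonrel = (n_i - d_ri + 0.5) / (N - d_r + 1)   # P(ki | not R)
    return (math.log(p_rel / (1 - p_rel)) +
            math.log((1 - p_nonrel) / p_nonrel))
```

Terms that occur in more of the user-marked relevant documents receive a larger weight, which is the reweighting effect the model aims for.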
Analysis
Advantages:
• The feedback process is directly related to the derivation of new weights for the query terms
• The term reweighting is optimal (under the model's assumptions)

Disadvantages:
• Document term weights are not considered
• No query expansion is used
22
Query Expansion
Query Expansion
  Global
    Similarity Thesaurus
    Statistical Thesaurus
  Local
    Context Analysis
    Clustering
      Association Clustering
      Metric Clustering
      Scalar Clustering
23
Automatic Local Analysis
• User relevance feedback: known relevant documents contain terms which can be used to describe a larger cluster of relevant documents, with assistance from the user (clustering)
• Automatic analysis: obtain a description (in terms of index terms) of a larger cluster of relevant documents automatically
  - Global strategy: a global thesaurus-like structure is trained from all documents before querying
  - Local strategy: terms from the documents retrieved for a given query are selected at query time
24
Query Expansion based on a Similarity Thesaurus
Query expansion is done in three steps:
1. Represent the query in the concept space used for representation of the index terms
2. Based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q
3. Expand the query with the top r ranked terms according to sim(q,kv)
25
Query Expansion – Step 1
To the query q is associated a vector q in the term-concept space, given by:

q = Σ_{ki ∈ q} wi,q · ki

where wi,q is the weight associated with the index-query pair [ki, q].
26
Query Expansion – Step 2
Compute a similarity sim(q,kv) between each term kv and the user query q:

sim(q, kv) = q · kv = Σ_{ku ∈ q} wu,q · cu,v

where cu,v is the correlation factor between the terms ku and kv.
27
Query Expansion – Step 3
Add the top r ranked terms according to sim(q,kv) to the original query q to form the expanded query q'. To each expansion term kv in the query q' is assigned a weight wv,q' given by:

wv,q' = sim(q, kv) / Σ_{ku ∈ q} wu,q

The expanded query q' is then used to retrieve new documents for the user.
28
Query Expansion - Sample
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A

c(A,A) = 10.991
c(A,C) = 10.781
c(A,D) = 10.781
...
c(D,E) = 10.398
c(B,E) = 10.396
c(E,E) = 10.224
29
Query Expansion - Sample
Query: q = A E E
sim(q,A) = 24.298
sim(q,C) = 23.833
sim(q,D) = 23.833
sim(q,B) = 23.830
sim(q,E) = 23.435

New query: q' = A C D E E
w(A,q') = 6.88
w(C,q') = 6.75
w(D,q') = 6.75
w(E,q') = 6.64
30
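Steps 2 and 3 of the expansion can be sketched as follows, assuming the correlation factors cu,v are already available as a dictionary; the toy correlation values in the usage below are hypothetical, not the ones from the sample slide:

```python
def expand_query(query_weights, corr, r=2):
    """Similarity-thesaurus query expansion (steps 2 and 3).
    query_weights: {term: w_uq}; corr: {(u, v): c_uv} correlation factors.
    Returns the expanded query as {term: weight}."""
    # step 2: sim(q, kv) = sum over query terms ku of w_uq * c_uv
    vocab = {v for (_, v) in corr}
    sim = {v: sum(w * corr.get((u, v), 0.0)
                  for u, w in query_weights.items())
           for v in vocab}
    # step 3: add the top r terms, weighted by sim(q, kv) / sum of w_uq;
    # terms already in the query keep their original weights
    norm = sum(query_weights.values())
    expanded = dict(query_weights)
    for v in sorted(sim, key=sim.get, reverse=True)[:r]:
        if v not in expanded:
            expanded[v] = sim[v] / norm
    return expanded
```

For example, `expand_query({'a': 1.0}, {('a', 'a'): 1.0, ('a', 'b'): 0.8}, r=2)` adds the correlated term `b` with weight 0.8.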
Query Expansion
• Methods of local analysis extract information from the local set of documents retrieved in order to expand the query
• An alternative is to expand the query using information from the whole set of documents
31
Local Cluster
• V(s): a non-empty subset of words which are grammatical variants of each other, e.g., {polish, polishing, polished}
• A canonical form s of V(s) is called a stem, e.g., polish
• Local document set Dl: the set of documents retrieved for a given query
• Local vocabulary Vl (Sl): the set of all distinct words (stems) in the local document set
32
Local Cluster
Basic concept:
• Expand the query with terms correlated to the query terms
• The correlated terms are found in local clusters built from the local document set

Local clusters:
• Association clusters: based on co-occurrences of pairs of terms in documents
• Metric clusters: take the distance between two terms into account
• Scalar clusters: terms with similar neighborhoods have some synonymity relationship
33
Association Clusters
Idea: based on the co-occurrence of stems (or terms) inside documents.

Association matrix:
• fsi,j: the frequency of a stem si in a document dj (dj ∈ Dl)
• m = (fsi,j): an association matrix with |Sl| rows and |Dl| columns
• s = m · mᵗ: a local stem-stem association matrix
34
• cu,v = Σ_{dj ∈ Dl} fsu,j · fsv,j: a correlation between the stems su and sv (an element of m · mᵗ)
• Unnormalized matrix: su,v = cu,v
• Normalized matrix: su,v = cu,v / (cu,u + cv,v − cu,v)
• Su(n): the local association cluster around the stem su — take the u-th row of s and return the set of the n largest values su,v (u ≠ v)
35
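The construction above (build s = m·mᵗ, optionally normalize, take the n largest off-diagonal values per row) can be sketched as:

```python
import numpy as np

def association_clusters(freq, n=2, normalized=True):
    """Local association clusters from a stem-by-document frequency
    matrix `freq` (|Sl| rows, |Dl| columns).  Returns, for each stem u,
    the indices of the n stems with the largest association su,v (u != v).
    Assumes every stem occurs at least once (nonzero denominators)."""
    c = freq @ freq.T                      # c_uv = sum_j f_uj * f_vj
    if normalized:
        diag = np.diag(c)
        s = c / (diag[:, None] + diag[None, :] - c)
    else:
        s = c.astype(float)
    clusters = []
    for u in range(s.shape[0]):
        row = s[u].astype(float).copy()
        row[u] = -np.inf                   # exclude the stem itself
        clusters.append(list(np.argsort(row)[::-1][:n]))
    return clusters
```

The normalized form favors stems that co-occur with su relative to their overall frequency, while the unnormalized form simply favors frequent co-occurrences.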
Metric Clusters
Idea: consider the distance between two terms when computing their correlation factor.

Local stem-stem metric correlation matrix:
• r(ki, kj): the number of words between the keywords ki and kj in a same document
• cu,v: metric correlation between the stems su and sv:

cu,v = Σ_{ki ∈ V(su)} Σ_{kj ∈ V(sv)} 1 / r(ki, kj)
36
• Normalized matrix: su,v = cu,v / (|V(su)| · |V(sv)|)
• Unnormalized matrix: su,v = cu,v
• Su(n): the local metric cluster around the stem su — take the u-th row of s and return the set of the n largest values su,v (u ≠ v)
37
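The correlation cu,v for one document can be sketched from the word positions of the two stems' variants; here r(ki, kj) is taken as the absolute difference of word positions, which is an assumption (the slide defines r as the number of words between the two keywords, which differs by one):

```python
def metric_correlation(positions_u, positions_v):
    """Metric correlation c_uv between two stems in one document.
    positions_u, positions_v: word positions of the variants of su and sv.
    Closer keyword pairs contribute more (1 / distance); coinciding
    positions are skipped to avoid division by zero."""
    return sum(1.0 / abs(pi - pj)
               for pi in positions_u for pj in positions_v
               if pi != pj)
```

Summing this quantity over all documents in the local set Dl yields the matrix entry cu,v.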
Scalar Clusters
Idea:
• Two stems with similar neighborhoods have a synonymity relationship
• The relationship is indirect, induced by the neighborhood

Scalar association matrix:
• The row corresponding to a stem in the term co-occurrence matrix forms its neighborhood
• su,v = (s̄u · s̄v) / (|s̄u| · |s̄v|): the cosine between the two neighborhood vectors
• Su(n): the local scalar cluster around the stem su — take the u-th row and return the set of the n largest values su,v (u ≠ v)
38
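The cosine of neighborhoods above can be sketched in a few lines, assuming the association matrix from the previous slides is available as a NumPy array with no all-zero rows:

```python
import numpy as np

def scalar_cluster_matrix(assoc):
    """Scalar association matrix: entry (u, v) is the cosine between the
    u-th and v-th rows of the association matrix `assoc`, i.e. the
    similarity of the two stems' neighborhoods."""
    norms = np.linalg.norm(assoc, axis=1)
    return (assoc @ assoc.T) / np.outer(norms, norms)
```

Two stems with identical neighborhoods get a scalar similarity of 1 even if they never co-occur directly, which is exactly the induced relationship the slide describes.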
Interactive Search Formulation
• Neighbors of the query term sv: terms su belonging to clusters associated with sv, i.e., su ∈ Sv(n)
• su is called a searchonym of sv
[Figure: terms scattered in space, with the searchonyms su of sv lying inside the cluster Sv(n)]
39
Similarity Thesaurus
• The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence
• These relationships are not derived directly from co-occurrence of terms inside documents
• They are obtained by considering the terms as concepts in a concept space
• In this concept space, each term is indexed by the documents in which it appears
• Terms assume the original role of documents, while documents are interpreted as indexing elements
40
Similarity Thesaurus
Inverse term frequency for document dj:

itfj = log(t / tj)

where:
• t: number of terms in the collection
• N: number of documents in the collection
• fi,j: frequency of occurrence of the term ki in the document dj
• tj: vocabulary (number of distinct terms) of document dj
• itfj: inverse term frequency for document dj

To each term ki is associated a vector:

ki = (wi,1, wi,2, ..., wi,N)
41
Similarity Thesaurus
where wi,j is a weight associated with the index-document pair [ki, dj]. These weights are computed as follows:

wi,j = ( (0.5 + 0.5 · fi,j / maxj(fi,j)) · itfj ) / sqrt( Σ_{l=1..N} (0.5 + 0.5 · fi,l / maxl(fi,l))² · itfl² )
42
Similarity Thesaurus
The relationship between two terms ku and kv is computed as a correlation factor cu,v given by:

cu,v = ku · kv = Σ_{dj} wu,j · wv,j

The global similarity thesaurus is built by computing the correlation factor cu,v for each pair of index terms [ku, kv] in the collection.
43
Step 1: represent the query in the concept space used for representation of the index terms:

q = Σ_{ki ∈ q} wi,q · ki

Step 2: based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q:

sim(q, kv) = q · kv = Σ_{ku ∈ q} wu,q · cu,v

(wu,q: query term weight; cu,v: correlation with the expansion term kv)
44

Step 3: expand the query with the top r ranked terms according to sim(q,kv):

wv,q' = sim(q, kv) / Σ_{ku ∈ q} wu,q
45
Similarity Thesaurus
• This computation is expensive
• However, the global similarity thesaurus has to be computed only once and can be updated incrementally
46