IR Models
J. H. Wang, Mar. 11, 2008
The Retrieval Process

[Figure: architecture of the retrieval process. Text from the text database passes through text operations to produce a logical view; indexing (via the DB manager module) builds an inverted file. The user need, expressed through the user interface, goes through text and query operations to form a query; searching retrieves documents from the index, ranking orders them, and user feedback over the ranked documents refines the query.]
Introduction
• Traditional information retrieval systems usually adopt index terms to index and retrieve documents
  – An index term is a keyword (or a group of related words) which has some meaning of its own (usually a noun)
• Advantages
  – Simple
  – The semantics of the documents and of the user information need can be naturally expressed through sets of index terms
[Figure: both the documents and the information need are mapped to index-term representations; retrieval matches the query representation against each document representation and ranks the results.]
IR Models
Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).
A Taxonomy of Information Retrieval Models
• User task: retrieval (ad hoc or filtering) and browsing

• Classic models: Boolean, vector, probabilistic
  – Set theoretic extensions: fuzzy, extended Boolean
  – Algebraic extensions: generalized vector, latent semantic indexing, neural networks
  – Probabilistic extensions: inference network, belief network
• Structured models: non-overlapping lists, proximal nodes
• Browsing models: flat, structure guided, hypertext

Models by document logical view and user task:
  – Retrieval over index terms or full text: classic, set theoretic, algebraic, probabilistic
  – Retrieval over full text + structure: structured
  – Browsing over index terms: flat
  – Browsing over full text: flat, hypertext
  – Browsing over full text + structure: structure guided, hypertext

Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.
Retrieval: Ad Hoc and Filtering
• Ad hoc (Search): The documents in the collection remain relatively static while new queries are submitted to the system
• Routing (Filtering): The queries remain relatively static while new documents come into the system
Retrieval: Ad Hoc vs. Filtering

• Ad hoc retrieval:

[Figure: a collection of relatively fixed size receives a stream of distinct queries Q1 through Q5.]
Retrieval: Ad Hoc vs. Filtering

• Filtering:

[Figure: a stream of incoming documents is matched against the static profiles of User 1 and User 2; each user receives only the documents filtered for his or her profile.]
A Formal Characterization of IR Models
• D: a set composed of logical views (or representations) of the documents in the collection
• Q: a set composed of logical views (or representations) of the user information needs (queries)
• F: a framework for modeling document representations, queries, and their relationships
• R(qi, dj): a ranking function which associates a real number with a query qi and a document dj, defining an ordering among the documents with regard to the query
Definition
• ki: a generic index term
• K: the set of all index terms, {k1, …, kt}
• wi,j: a weight associated with index term ki of a document dj
• gi: a function that returns the weight associated with ki in any t-dimensional vector (gi(dj) = wi,j)
Classic IR Models

• Basic concept: each document is described by a set of representative keywords called index terms
• Numerical weights are assigned to index terms to capture how relevant each term is to the document
• Three classic models: Boolean, vector, probabilistic
Boolean Model
• Binary decision criterion
  – A document is either relevant or nonrelevant (no partial match)
• A data retrieval (rather than information retrieval) model
• Advantage
  – Clean formalism, simplicity
• Disadvantages
  – It is not simple to translate an information need into a Boolean expression
  – Exact matching may lead to retrieval of too few or too many documents
Example

• The query $q = k_a \land (k_b \lor \neg k_c)$ can be represented as a disjunction of conjunctive vectors (in DNF), over the weights for $(k_a, k_b, k_c)$:
  – $\vec{q}_{dnf} = (1,1,1) \lor (1,1,0) \lor (1,0,0)$
• Formal definition
  – For the Boolean model, the index term weights are all binary, i.e., $w_{i,j} \in \{0,1\}$
  – A query is a conventional Boolean expression, which can be transformed into a disjunctive normal form $\vec{q}_{dnf}$ (each $\vec{q}_{cc}$ is a conjunctive component)
  – $sim(d_j, q) = 1$ if there exists $\vec{q}_{cc} \in \vec{q}_{dnf}$ such that $g_i(\vec{d_j}) = g_i(\vec{q}_{cc})$ for all $k_i$; 0 otherwise

[Figure: Venn diagram over Ka, Kb, Kc locating the three conjunctive components (1,1,1), (1,1,0), and (1,0,0).]
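As a concrete illustration (a minimal sketch, not from the slides), the DNF matching rule above can be coded directly: a document's binary term vector matches the query iff it equals one of the conjunctive components.

```python
# Sketch of Boolean retrieval via DNF matching. A document matches iff
# its binary term vector equals some conjunctive component of the query.

def boolean_sim(doc, q_dnf):
    """Return 1 if doc matches any conjunctive component, else 0."""
    return 1 if any(doc == cc for cc in q_dnf) else 0

# q = ka AND (kb OR NOT kc), over the weights for (ka, kb, kc)
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

print(boolean_sim((1, 1, 0), q_dnf))  # matches the second component -> 1
print(boolean_sim((0, 1, 1), q_dnf))  # no component matches -> 0
```

Note the all-or-nothing behavior: a document differing in a single term weight from every component scores 0, which is exactly the "no partial match" disadvantage discussed above.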
Vector Model [Salton, 1968]
• Assign non-binary weights to index terms in queries and in documents => TFxIDF
• Compute the similarity between documents and query => Sim(Dj, Q)
• More precise than the Boolean model
The IR Problem as a Clustering Problem

• We think of the documents as a collection C of objects and think of the user query as a specification of a set A of objects
• Intra-cluster similarity
  – What are the features which better describe the objects in the set A?
• Inter-cluster similarity
  – What are the features which better distinguish the objects in the set A from the remaining objects in the collection C?
Idea for TFxIDF

• TF: intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj
  – Term frequency (the tf factor) provides one measure of how well that term describes the document contents
• IDF: inter-cluster similarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection
  – Inverse document frequency (the idf factor)
Vector Model (1/4)
• Index terms are assigned positive and non-binary weights
• The index terms in the query are also weighted
• Term weights are used to compute the degree of similarity between documents and the user query
• Retrieved documents are then sorted in decreasing order of their degree of similarity
$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
$\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
Vector Model (2/4)
• Degree of similarity

$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$

Figure 2.4 The cosine of the angle $\theta$ between $\vec{d_j}$ and $\vec{q}$ is adopted as sim(dj, q).
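The cosine formula above can be sketched in a few lines of Python (an illustration, not part of the original slides):

```python
import math

# Sketch of the cosine similarity sim(d, q) = (d . q) / (|d| * |q|).
def cosine_sim(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention: an empty vector matches nothing
    return dot / (norm_d * norm_q)

print(cosine_sim([1, 2, 3], [2, 4, 6]))  # same direction -> 1.0
print(cosine_sim([1, 0, 0], [0, 1, 0]))  # orthogonal vectors -> 0.0
```

Because the norms cancel out vector length, only the angle between document and query matters, which is why long documents are not automatically favored.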
Vector Model (3/4)
• Definitions
  – normalized frequency: $f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$
  – inverse document frequency: $idf_i = \log \frac{N}{n_i}$
  – term-weighting scheme: $w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$
  – query-term weights: $w_{i,q} = \left( 0.5 + \frac{0.5 \times freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log \frac{N}{n_i}$
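A minimal sketch of these weighting formulas on a tiny hypothetical corpus (the documents and terms below are invented for illustration):

```python
import math

# Hypothetical toy corpus: each document is a list of terms.
docs = [
    ["ir", "models", "boolean", "ir"],
    ["vector", "models"],
    ["probabilistic", "ir"],
]
N = len(docs)

def n_i(term):
    """Number of documents in which `term` appears."""
    return sum(1 for d in docs if term in d)

def tf(term, doc):
    """Normalized frequency: raw frequency over the max frequency in doc."""
    max_freq = max(doc.count(t) for t in set(doc))
    return doc.count(term) / max_freq

def weight(term, doc):
    """w_{i,j} = f_{i,j} * log(N / n_i)."""
    return tf(term, doc) * math.log(N / n_i(term))

# "ir" occurs twice in doc 0 (tf = 1.0) but appears in 2 of 3 docs (low idf);
# "boolean" occurs once (tf = 0.5) but only in doc 0 (high idf).
print(weight("ir", docs[0]))
print(weight("boolean", docs[0]))
```

This shows the intended balance: a frequent term that also occurs in most documents gets its tf factor discounted by a small idf.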
Vector Model (4/4)
• Advantages
  – Its term-weighting scheme improves retrieval performance
  – Its partial-matching strategy allows retrieval of documents that approximate the query conditions
  – Its cosine ranking formula sorts the documents according to their degree of similarity to the query
• Disadvantage
  – The assumption of mutual independence between index terms
The Vector Model: Example I
       k1  k2  k3  | dj . q
  d1    1   0   1  |  2
  d2    1   0   0  |  1
  d3    0   1   1  |  2
  d4    1   0   0  |  1
  d5    1   1   1  |  3
  d6    1   1   0  |  2
  d7    0   1   0  |  1
  q     1   1   1  |

[Figure: the seven documents and the query plotted in the 3-D space spanned by k1, k2, k3.]
The Vector Model: Example II
[Figure: the seven documents and the query in the 3-D space spanned by k1, k2, k3.]

       k1  k2  k3  | dj . q
  d1    1   0   1  |  4
  d2    1   0   0  |  1
  d3    0   1   1  |  5
  d4    1   0   0  |  1
  d5    1   1   1  |  6
  d6    1   1   0  |  3
  d7    0   1   0  |  2
  q     1   2   3  |
The Vector Model: Example III
[Figure: the seven documents and the query in the 3-D space spanned by k1, k2, k3.]

       k1  k2  k3  | dj . q
  d1    2   0   1  |  5
  d2    1   0   0  |  1
  d3    0   1   3  | 11
  d4    2   0   0  |  2
  d5    1   2   4  | 17
  d6    1   2   0  |  5
  d7    0   5   0  | 10
  q     1   2   3  |
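The dot products of Example III can be recomputed with a short script (an illustrative sketch; the document vectors and query follow the table above):

```python
# Example III: non-binary term weights in both documents and query.
docs = {
    "d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
    "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0],
}
q = [1, 2, 3]

# Unnormalized similarity: the dot product of each document with q.
scores = {name: sum(w * wq for w, wq in zip(vec, q)) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)   # {'d1': 5, 'd2': 1, 'd3': 11, 'd4': 2, 'd5': 17, 'd6': 5, 'd7': 10}
print(ranking)  # d5 first (17), then d3 (11), then d7 (10), ...
```

Unlike the Boolean model, every document receives a graded score, so partial matches such as d7 still appear in the ranking.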
Probabilistic Model (1/6)
• Introduced by Robertson and Sparck Jones, 1976
  – Binary independence retrieval (BIR) model
• Idea: given a user query q and the ideal answer set R of relevant documents, the problem is to specify the properties of this set
  – Assumption (probabilistic principle): the probability of relevance depends on the query and document representations only; the ideal answer set R should maximize the overall probability of relevance
  – The probabilistic model tries to estimate the probability that the user will find the document dj relevant, via the ratio P(dj relevant to q) / P(dj nonrelevant to q)
Probabilistic Model (2/6)
• Definitions
  – All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$
  – Let R be the set of documents known to be relevant to query q
  – Let $\bar{R}$ be the complement of R (the set of nonrelevant documents)
  – Let $P(R \mid \vec{d_j})$ be the probability that the document dj is relevant to the query q
  – Let $P(\bar{R} \mid \vec{d_j})$ be the probability that the document dj is nonrelevant to the query q
Probabilistic Model (3/6)
• The similarity sim(dj, q) of the document dj to the query q is defined as the ratio

$sim(d_j, q) = \frac{P(R \mid \vec{d_j})}{P(\bar{R} \mid \vec{d_j})}$

• Using Bayes' rule,

$sim(d_j, q) = \frac{P(\vec{d_j} \mid R) \times P(R)}{P(\vec{d_j} \mid \bar{R}) \times P(\bar{R})}$

  – P(R) stands for the probability that a document randomly selected from the entire collection is relevant
  – $P(\vec{d_j} \mid R)$ stands for the probability of randomly selecting the document dj from the set R of relevant documents
Probabilistic Model (4/6)
• Since P(R) and $P(\bar{R})$ are the same for all documents in the collection, they can be dropped from the ranking:

$sim(d_j, q) \sim \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \bar{R})}$

• Assuming independence of index terms,

$sim(d_j, q) \sim \frac{\prod_{g_i(\vec{d_j})=1} P(k_i \mid R) \times \prod_{g_i(\vec{d_j})=0} P(\bar{k_i} \mid R)}{\prod_{g_i(\vec{d_j})=1} P(k_i \mid \bar{R}) \times \prod_{g_i(\vec{d_j})=0} P(\bar{k_i} \mid \bar{R})}$
Probabilistic Model (5/6)
  – $P(k_i \mid R)$ stands for the probability that the index term ki is present in a document randomly selected from the set R
  – $P(\bar{k_i} \mid R)$ stands for the probability that the index term ki is not present in a document randomly selected from the set R
Probabilistic Model (6/6)
• Taking logarithms, using $P(k_i \mid R) + P(\bar{k_i} \mid R) = 1$, and ignoring factors that are constant for all documents, we obtain the final ranking formula:

$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$
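The final ranking formula can be sketched as follows (the probability estimates below are hypothetical; with binary weights, only terms present in both document and query contribute):

```python
import math

# Sketch of the binary-independence ranking formula.
# p_r[i] = P(k_i | R), p_nr[i] = P(k_i | not-R); d and q are binary vectors.
def bir_sim(d, q, p_r, p_nr):
    score = 0.0
    for wd, wq, pr, pnr in zip(d, q, p_r, p_nr):
        if wd and wq:  # term present in both document and query
            score += math.log(pr / (1 - pr)) + math.log((1 - pnr) / pnr)
    return score

# Hypothetical estimates for a 3-term vocabulary.
p_r  = [0.8, 0.5, 0.5]   # k1 is likely in relevant documents
p_nr = [0.2, 0.5, 0.5]   # ... and unlikely in nonrelevant ones
q = [1, 1, 1]
print(bir_sim([1, 0, 0], q, p_r, p_nr))  # positive: strong evidence from k1
print(bir_sim([0, 1, 0], q, p_r, p_nr))  # 0.0: k2 carries no evidence
```

A term with $P(k_i \mid R) = P(k_i \mid \bar{R})$ contributes nothing, matching the intuition that such a term cannot separate relevant from nonrelevant documents.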
Estimation of Term Relevance
In the very beginning (before any documents have been retrieved):

$P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{df_i}{N}$

Let V be the subset of the documents initially retrieved, and $V_i$ the subset of V containing the index term ki. The ranking can then be improved as follows:

$P(k_i \mid R) = \frac{V_i}{V}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i}{N - V}$

For small values of V and $V_i$, adjustment factors are added:

$P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + 0.5}{N - V + 1}$

or, using $df_i / N$ as the adjustment factor:

$P(k_i \mid R) = \frac{V_i + \frac{df_i}{N}}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + \frac{df_i}{N}}{N - V + 1}$
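The smoothed estimates with the 0.5 adjustment factor can be sketched as follows (the counts are hypothetical):

```python
# Sketch of the improved estimates: V documents initially retrieved,
# V_i of them containing term k_i, df_i documents overall containing k_i,
# N documents in the whole collection.
def estimate(V_i, V, df_i, N):
    """Smoothed P(k_i | R) and P(k_i | not-R) with 0.5 adjustment factors."""
    p_r = (V_i + 0.5) / (V + 1)
    p_nr = (df_i - V_i + 0.5) / (N - V + 1)
    return p_r, p_nr

# Term in 4 of 5 initially retrieved documents, 10 of 100 overall.
p_r, p_nr = estimate(V_i=4, V=5, df_i=10, N=100)
print(p_r)   # (4 + 0.5) / 6 = 0.75
print(p_nr)  # (10 - 4 + 0.5) / 96
```

The adjustment keeps the estimates away from 0 and 1 even when V is tiny, so the log-odds in the ranking formula stay finite.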
• Advantage
  – Documents are ranked in decreasing order of their probability of being relevant
• Disadvantages
  – The need to guess the initial relevant and nonrelevant sets
  – Term frequency is not considered (all weights are binary)
  – Independence assumption for index terms
Brief Comparison of Classic Models
• The Boolean model is the weakest
  – Not able to recognize partial matches
• Controversy between the probabilistic and vector models
  – The vector model is expected to outperform the probabilistic model on general collections
Alternative Set Theoretic Models
• Fuzzy Set Model
• Extended Boolean Model
Fuzzy Theory
• A fuzzy subset A of a universe U is characterized by a membership function $\mu_A : U \rightarrow [0,1]$ which associates with each element $u \in U$ a number $\mu_A(u)$ in [0,1]
• Let A and B be two fuzzy subsets of U; then

$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$
$\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$
$\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$
Fuzzy Information Retrieval
• Using a term-term correlation matrix:

$c_{i,l} = \frac{df_{i,l}}{df_i + df_l - df_{i,l}}$

  where $df_i$ ($df_l$) is the number of documents containing term ki (kl) and $df_{i,l}$ is the number of documents containing both

• Define a fuzzy set associated with each index term ki, with membership function

$\mu_i(d_j) = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$

  – If a term kl in dj is strongly related to ki, i.e., $c_{i,l} \approx 1$, then $\mu_i(d_j) \approx 1$
  – If every term in dj is only loosely related to ki, i.e., $c_{i,l} \approx 0$, then $\mu_i(d_j) \approx 0$
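A minimal sketch of this membership computation (the correlation matrix below is hypothetical):

```python
# Sketch of the fuzzy membership function mu_i(d_j).
# c[i][l] is a term-term correlation matrix; doc_terms lists the
# indices of the terms occurring in document d_j.
def membership(i, doc_terms, c):
    """mu_i(d_j) = 1 - prod over terms l in d_j of (1 - c[i][l])."""
    prod = 1.0
    for l in doc_terms:
        prod *= 1.0 - c[i][l]
    return 1.0 - prod

# Hypothetical correlations among three terms (1.0 on the diagonal).
c = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
# A document containing terms 1 and 2 is strongly in term 0's fuzzy set,
# mostly via its correlation with term 1: 1 - (1 - 0.8)(1 - 0.1) = 0.82.
print(membership(0, [1, 2], c))
```

The complemented product means one strongly correlated term is enough to pull the membership close to 1, while unrelated terms leave it near 0.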
Example

• Disjunctive normal form of the query $q = k_a \land (k_b \lor \neg k_c)$:

$\vec{q}_{dnf} = (k_a \land k_b \land k_c) \lor (k_a \land k_b \land \bar{k_c}) \lor (k_a \land \bar{k_b} \land \bar{k_c}) = cc_1 \lor cc_2 \lor cc_3$

• The membership of dj in the query's fuzzy set is the algebraic sum over the three conjunctive components (writing $\mu_{a,j}$ for $\mu_a(d_j)$, etc.):

$\mu_{cc_1}(d_j) = \mu_{a,j} \, \mu_{b,j} \, \mu_{c,j}$
$\mu_{cc_2}(d_j) = \mu_{a,j} \, \mu_{b,j} \, (1 - \mu_{c,j})$
$\mu_{cc_3}(d_j) = \mu_{a,j} \, (1 - \mu_{b,j})(1 - \mu_{c,j})$

$\mu_q(d_j) = 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i}(d_j))$

[Figure: Venn diagram over Ka, Kb, Kc locating the conjunctive components cc1, cc2, cc3.]
Algebraic Sum and Product
• The degree of membership in a disjunctive fuzzy set is computed using an algebraic sum, instead of the max function
• The degree of membership in a conjunctive fuzzy set is computed using an algebraic product, instead of the min function
• These are smoother than the max and min functions
Alternative Algebraic Models
• Generalized Vector Space Model
• Latent Semantic Model
• Neural Network Model
Sparse Matrix Problem
• Consider a term-document matrix of dimensions 1M × 1M
  – Most of the entries will be 0 (a sparse matrix)
  – A waste of storage and computation
  – How can we reduce the dimensions?
Latent Semantic Indexing (1/5)
• Let $M = (M_{ij})$ be a term-document association matrix with t rows and N columns
• Latent semantic indexing decomposes M using singular value decomposition:

$M = K S D^t$

  – K is the matrix of eigenvectors derived from the term-to-term correlation matrix ($M M^t$)
  – $D^t$ is the transpose of D, the matrix of eigenvectors of the document-to-document matrix ($M^t M$)
  – S is an $r \times r$ diagonal matrix of singular values, where $r = \min(t, N)$ is the rank of M
Latent Semantic Indexing (2/5)
• Consider now only the s largest singular values of S, together with their corresponding columns in K and D
  – (The remaining singular values of S are deleted)
• The resultant matrix (of rank s)

$M_s = K_s S_s D_s^t$

  is the matrix closest to the original matrix M in the least-squares sense
• s < r is the dimensionality of a reduced concept space
Latent Semantic Indexing (3/5)
• The selection of s attempts to balance two opposing effects
  – s should be large enough to allow fitting all the structure in the real data
  – s should be small enough to allow filtering out the non-relevant representational details
Latent Semantic Indexing (4/5)
• Consider the relationship between any two documents:

$M_s^t M_s = (K_s S_s D_s^t)^t (K_s S_s D_s^t) = D_s S_s K_s^t K_s S_s D_s^t = D_s S_s S_s D_s^t = (D_s S_s)(D_s S_s)^t$

  (using $K_s^t K_s = I$, since the columns of $K_s$ are orthonormal)
Latent Semantic Indexing (5/5)
• To rank documents with regard to a given user query, we model the query as a pseudo-document in the original matrix M
  – Assume the query is modeled as the document with number k
  – Then the kth row of the matrix $M_s^t M_s$ provides the ranks of all documents with respect to this query
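Under the decomposition above, LSI can be sketched with an off-the-shelf SVD. The sketch below uses NumPy (a third-party library, assumed available) and the 3-term, 7-document matrix from the example that follows:

```python
import numpy as np

# Sketch of LSI: decompose the t x N term-document matrix with SVD,
# keep the s largest singular values, and compare documents in the
# reduced concept space.
M = np.array([
    [2, 1, 0, 2, 1, 1, 0],   # k1 across documents d1..d7
    [0, 0, 1, 0, 2, 2, 5],   # k2
    [1, 0, 3, 0, 4, 0, 0],   # k3
], dtype=float)

K, svals, Dt = np.linalg.svd(M, full_matrices=False)  # M = K S D^t

s = 2                                    # reduced dimensionality s < r
Ms = K[:, :s] @ np.diag(svals[:s]) @ Dt[:s, :]

# Document-document relationships in the reduced space:
# Ms^t Ms = (Ds Ss)(Ds Ss)^t, as derived above.
doc_sims = Ms.T @ Ms
print(np.round(doc_sims[0], 2))  # row for d1: its relationship to every document
```

By the Eckart–Young property of the SVD, `Ms` is the rank-s matrix closest to `M` in the least-squares sense, which is exactly the claim made on the earlier slide.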
Computing an Example

• Let $(M_{ij})$ be given by the matrix

       k1  k2  k3
  d1    2   0   1
  d2    1   0   0
  d3    0   1   3
  d4    2   0   0
  d5    1   2   4
  d6    1   2   0
  d7    0   5   0
  q     1   2   3

  – Compute the matrices K, S, and $D^t$
• Latent Semantic Indexing transforms the occurrence matrix into a relation between the terms and concepts, and a relation between the concepts and the documents
  – Indirect relation between terms and documents through some hidden (or latent) concepts

[Figure: a term such as "Taipei" may not occur in a document mentioning "Taiwan", yet both connect to the document through shared latent concepts.]
Alternative Probabilistic Model
• Bayesian Networks
• Inference Network Model
• Belief Network Model
Bayesian Network
• Let $x_i$ be a node in a Bayesian network G and $\Gamma_{x_i}$ be the set of parent nodes of $x_i$
• The influence of $\Gamma_{x_i}$ on $x_i$ can be specified by any set of functions $F_i(x_i, \Gamma_{x_i})$ that satisfy

$0 \le F_i(x_i, \Gamma_{x_i}) \le 1, \qquad \sum_{x_i} F_i(x_i, \Gamma_{x_i}) = 1$

• Example (for a network with arcs $x_1 \to x_2$, $x_1 \to x_3$, $\{x_2, x_3\} \to x_4$, $x_3 \to x_5$):

$P(x_1, x_2, x_3, x_4, x_5) = P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1) P(x_4 \mid x_2, x_3) P(x_5 \mid x_3)$
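The factorization above can be sketched for binary variables (all conditional tables below are hypothetical); because each factor is a proper conditional distribution, the factored joint sums to 1 over all assignments:

```python
from itertools import product

# Hypothetical conditional tables for the network
# x1 -> x2, x1 -> x3, (x2, x3) -> x4, x3 -> x5; all variables binary.
def p1(x1):         return 0.6 if x1 else 0.4
def p2(x2, x1):     return (0.8 if x2 else 0.2) if x1 else (0.3 if x2 else 0.7)
def p3(x3, x1):     return 0.5  # x3 independent of x1 in this toy table
def p4(x4, x2, x3): return (0.9 if x4 else 0.1) if (x2 and x3) else (0.4 if x4 else 0.6)
def p5(x5, x3):     return (0.7 if x5 else 0.3) if x3 else (0.1 if x5 else 0.9)

def joint(x1, x2, x3, x4, x5):
    # P(x1..x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)
    return p1(x1) * p2(x2, x1) * p3(x3, x1) * p4(x4, x2, x3) * p5(x5, x3)

# The joint distribution sums to 1 over all 2^5 assignments.
total = sum(joint(*xs) for xs in product((0, 1), repeat=5))
print(round(total, 10))  # 1.0
```

The point of the factorization is economy: five small conditional tables replace a full table of $2^5$ joint probabilities.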
Belief Network Model (1/6)
• The probability space: the set K = {k1, k2, …, kt} of all index terms is the universe. To each subset u of K is associated a vector $\vec{k}$ such that $g_i(\vec{k}) = 1 \iff k_i \in u$
• Random variables
  – To each index term ki is associated a binary random variable
Belief Network Model (2/6)
• Concept space
  – A document dj is represented as a concept composed of the terms used to index dj
  – A user query q is also represented as a concept composed of the terms used to index q
  – Both user query and document are modeled as subsets of index terms
• Probability distribution P over K:

$P(c) = \sum_u P(c \mid u) P(u), \qquad P(u) = \left(\frac{1}{2}\right)^t$

  – P(c) computes the degree of coverage of the space K by the concept c; a uniform prior over the $2^t$ subsets u is assumed
Belief Network Model (3/6)
• A query q is modeled as a network node
  – This random variable is set to 1 whenever q completely covers the concept space K
  – P(q) computes the degree of coverage of the space K by q
• A document dj is modeled as a network node
  – This random variable is set to 1 to indicate that dj completely covers the concept space K
  – P(dj) computes the degree of coverage of the space K by dj
Belief Network Model (4/6)
Belief Network Model (5/6)
• Assumption
  – P(dj | q) is adopted as the rank of the document dj with respect to the query q

$P(d_j \mid q) = \frac{P(d_j \land q)}{P(q)}$

$P(d_j \land q) = \sum_u P(d_j \land q \mid u) P(u) = \sum_u P(d_j \mid u) P(q \mid u) P(u) = \sum_{\vec{k}} P(d_j \mid \vec{k}) P(q \mid \vec{k}) P(\vec{k})$
Belief Network Model (6/6)
• Specify the conditional probabilities as follows:

$P(q \mid \vec{k}) = \begin{cases} \dfrac{w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,q}^2}} & \text{if } \vec{k} = \vec{k_i} \land g_i(\vec{q}) = 1 \\ 0 & \text{otherwise} \end{cases}$

$P(d_j \mid \vec{k}) = \begin{cases} \dfrac{w_{i,j}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}} & \text{if } \vec{k} = \vec{k_i} \land g_i(\vec{d_j}) = 1 \\ 0 & \text{otherwise} \end{cases}$

• Thus, the belief network model can be tuned to subsume the vector model
Comparison
• Belief network model
  – Is based on a set-theoretic view
  – Provides a separation between the document and the query
  – Is able to reproduce any ranking strategy generated by the inference network model
• Inference network model
  – Takes a purely epistemological view, which is more difficult to grasp