Classic IR Models


Page 1: Classic  IR  Models

E.G.M. Petrakis, Information Retrieval Models

Classic IR Models
- Boolean model
  - a simple model based on set theory
  - queries as Boolean expressions
  - adopted by many commercial systems
- Vector space model
  - queries and documents as vectors in an M-dimensional space, where M is the number of terms
  - find the documents most similar to the query in the M-dimensional space
- Probabilistic model
  - a probabilistic approach
  - assume an ideal answer set for each query
  - iteratively refine the properties of the ideal answer set

Page 2: Classic  IR  Models

Document Index Terms
- Each document is represented by a set of representative index terms or keywords
  - requires text pre-processing (off-line)
  - these terms summarize the document contents
  - adjectives, adverbs, and connectives are less useful; the index terms are mainly nouns (lexicon look-up)
- Not all terms are equally useful
  - very frequent terms are not useful; very infrequent terms are not useful either
  - terms have varying relevance (weights) when used to describe documents

Page 3: Classic  IR  Models

Text Preprocessing
- Extract terms from documents and queries (the document/query profile)
- Processing stages
  - word separation, sentence splitting
  - change terms to a standard form (e.g., lowercase)
  - eliminate stop-words (e.g., and, is, the, …)
  - reduce terms to their base form (e.g., eliminate prefixes and suffixes)
  - construct term indices (usually inverted files)

Page 4: Classic  IR  Models

Text Preprocessing Chart
[Figure: text preprocessing stages, from Baeza-Yates & Ribeiro-Neto, 1999]

Page 5: Classic  IR  Models

Inverted Index
[Figure: an inverted index. A sorted term dictionary (the Greek example terms άγαλμα, αγάπη, …, δουλειά, …, πρωί, …, ωκεανός) maps each term to a posting list of (document id, term frequency) pairs, e.g., (1,2) (3,4) for one term and (4,3) (7,5) for another; the postings point into documents 1–11 of the corpus]
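The dictionary-plus-posting-list structure in the figure can be sketched in a few lines of Python; the tiny corpus, English terms, and document ids below are invented for the example:

```python
from collections import Counter

def build_inverted_index(docs):
    """Map each term to a sorted posting list of (doc_id, term_frequency) pairs."""
    index = {}
    for doc_id, text in docs.items():
        for term, freq in Counter(text.split()).items():
            index.setdefault(term, []).append((doc_id, freq))
    for postings in index.values():
        postings.sort()  # keep posting lists ordered by document id
    return index

# toy corpus; terms and ids are illustrative
docs = {1: "sea morning sea", 3: "morning work", 4: "sea work work"}
index = build_inverted_index(docs)
# index["sea"] -> [(1, 2), (4, 1)]
```

A real system would store the dictionary sorted on disk and compress the posting lists; the dictionary-of-lists here only mirrors the figure's layout.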

Page 6: Classic  IR  Models

Basic Notation
- Document: usually text
  - D: the document collection (corpus); d: an instance of D
- Query: same representation as documents
  - Q: the set of all possible queries; q: an instance of Q
- Relevance: R(d,q)
  - binary relation R: D × Q → {0,1}; d is "relevant" to q iff R(d,q) = 1
  - or a degree of relevance: R(d,q) ∈ [0,1]
  - or a probability of relevance: R(d,q) = Prob(R|d,q)

Page 7: Classic  IR  Models

Term Weights
- T = {t1, t2, …, tM}: the terms in the corpus
- N: the number of documents in the corpus
- a document dj is represented by (w1j, w2j, …, wMj), where
  - wij > 0 if ti appears in dj
  - wij = 0 otherwise
- a query q is represented by (q1, q2, …, qM)
- R(d,q) > 0 if q and d have common terms

Page 8: Classic  IR  Models

Term Weighting
The term-document matrix: rows are terms, columns are documents, and entry wij is the weight of term ti in document dj

        d1    d2    …    dN
  t1    w11   w12   …    w1N
  t2    w21   w22   …    w2N
  …
  tM    wM1   wM2   …    wMN

Page 9: Classic  IR  Models

Document Space (corpus)
[Figure: the document space D containing a query q, relevant documents, and non-relevant documents]

Page 10: Classic  IR  Models

Boolean Model
- Based on set theory and Boolean algebra
- Boolean queries: "John" and "Mary" not "Ann"
  - terms are linked by "and", "or", "not"
  - term weights are binary (wij = 0 or 1): query terms are either present or absent in a document
  - a document is relevant if the query condition is satisfied
- Pros: simple; used in many commercial systems
- Cons: no ranking; complex queries are not easy to formulate
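With binary weights, each term's posting list is just a set of document ids, and a conjunctive query reduces to set operations. A minimal sketch (the index contents are made up for the example):

```python
def boolean_and_not(index, and_terms, not_terms=()):
    """Documents containing every term in and_terms and no term in not_terms."""
    if not and_terms:
        return set()
    result = set.intersection(*(index.get(t, set()) for t in and_terms))
    for t in not_terms:
        result -= index.get(t, set())
    return result

# hypothetical binary index: term -> set of document ids
index = {"john": {1, 2, 4}, "mary": {2, 4, 5}, "ann": {4}}
# the slide's query: "John" and "Mary" not "Ann"
print(boolean_and_not(index, ["john", "mary"], ["ann"]))  # {2}
```

Note there is no ranking: the result is an unordered set, which is exactly the Boolean model's main limitation.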

Page 11: Classic  IR  Models

Query Processing
- For each term ti in the query q = {t1, t2, …, tM}:
  1) use the index to retrieve all dj with wij > 0
  2) sort them in decreasing order (e.g., by term frequency)
- Return the documents satisfying the query condition
- Slow for many terms: involves set intersections
- Speed-ups: keep only the top K documents for each term at step 2, or do not process all query terms
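The top-K pruning heuristic from step 2 might be sketched as follows; the index contents are illustrative, and as the slide implies the pruning is approximate (a document dropped from one term's top K can be missed):

```python
def candidate_docs(index, query_terms, K=2):
    """Steps 1-2 with pruning: for each query term fetch its postings,
    sort by term frequency, and keep only the top-K documents."""
    docs = set()
    for t in query_terms:
        postings = index.get(t, [])          # list of (doc_id, tf) pairs
        top = sorted(postings, key=lambda p: p[1], reverse=True)[:K]
        docs.update(doc_id for doc_id, _ in top)
    return docs

index = {"sea": [(1, 5), (2, 1), (3, 4)], "work": [(2, 3), (4, 2)]}
print(candidate_docs(index, ["sea", "work"], K=1))  # {1, 2}
```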

Page 12: Classic  IR  Models

Vector Space Model
- Documents and queries are M-dimensional term vectors
- non-binary weights are assigned to index terms
- a query is similar to a document if their vectors are similar
- retrieved documents are sorted in decreasing order of similarity
- a document may match a query only partially
- SMART is the most popular implementation

Page 13: Classic  IR  Models

Query-Document Similarity
Similarity is defined as the cosine of the angle θ between the document and query vectors:

  Sim(q,d) = (q · d) / (|q| |d|)
           = Σ_{i=1..M} w_iq w_id / ( sqrt(Σ_{i=1..M} w_iq²) · sqrt(Σ_{i=1..M} w_id²) )
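The cosine formula above translates directly into code; a minimal sketch over dense weight vectors:

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between query vector q and document vector d."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

print(cosine_similarity([3, 4], [3, 4]))  # 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
```

Real systems iterate only over the terms the query and document share (via the inverted index) instead of over all M dimensions.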

Page 14: Classic  IR  Models

Weighting Scheme
The tf × idf weighting scheme:
- wij: weight of term ti associated with document dj
- freqij: raw frequency of term ti in document dj
- tfij = freqij / max_l freqlj : normalized term frequency (the maximum is computed over all terms l in dj)
- idfi = log(N / ni): inverse document frequency, where ni is the number of documents in which term ti occurs
- wij = tfij × idfi
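The tf × idf weights for one document can be computed as below; the slide leaves the logarithm base open, so the natural log is used here:

```python
import math

def tf_idf(freqs, doc_freq, N):
    """wij = tfij * idfi, with tfij = freqij / max_l freqlj and idfi = log(N / ni).
    freqs: term -> raw frequency in one document;
    doc_freq: term -> number of documents containing the term (ni);
    N: number of documents in the corpus."""
    max_freq = max(freqs.values())
    return {t: (f / max_freq) * math.log(N / doc_freq[t]) for t, f in freqs.items()}

# a term occurring in every document gets idf = log(1) = 0
print(tf_idf({"the": 7}, {"the": 100}, 100)["the"])  # 0.0
```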

Page 15: Classic  IR  Models

Weight Normalization
- There are many ways to express the weights, e.g., using log(tfij)
- The weight is normalized to [0,1]
- Normalize by document length:

  w_ij = (1 + log(tf_ij)) · idf_i / sqrt( Σ_{k=1..M} (1 + log(tf_kj))² )

Page 16: Classic  IR  Models

Normalization by Document Length
- The longer the document, the more likely a given term is to appear in it
- Normalize the term weights by document length, so that longer documents are not given more weight:

  w'_ij = w_ij / sqrt( Σ_{k=1..M} w_kj² )
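The length normalization above simply scales a document's weight vector to unit Euclidean length:

```python
import math

def normalize(weights):
    """w'_ij = w_ij / sqrt(sum_k w_kj^2): scale a weight vector to unit length."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights] if length else list(weights)

print(normalize([3.0, 4.0]))  # [0.6, 0.8]
```

After this step the cosine similarity reduces to a plain dot product, since the norms in its denominator are all 1.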

Page 17: Classic  IR  Models

Comments on Term Weighting
- tfij: the term frequency measures how well a term describes a document (intra-document characterization)
- idfi: terms appearing in many documents are not very useful in distinguishing relevant from non-relevant documents (inter-document characterization)
- This scheme favors terms of average frequency

Page 18: Classic  IR  Models

Comments on the Vector Space Model
- Pros:
  - at least as good as the other models
  - approximate query matching: a query and a document need not contain exactly the same terms
  - allows ranking of the results
- Cons:
  - assumes term independence

Page 19: Classic  IR  Models

Document Distance
- Consider documents d1, d2 with vectors u1, u2 (normalized to unit length); their distance is defined as the length of the chord AB between the vector endpoints:

  distance(d1,d2) = 2 sin(θ/2) = sqrt( 2 (1 − cos θ) ) = sqrt( 2 (1 − similarity(d1,d2)) )
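The chord-length identity can be checked numerically (unit vectors assumed, as above):

```python
import math

def distance(similarity):
    """Chord length between two unit vectors with the given cosine similarity."""
    return math.sqrt(2 * (1 - similarity))

# 2*sin(theta/2) and sqrt(2*(1 - cos(theta))) agree for any angle
theta = math.pi / 3
assert abs(distance(math.cos(theta)) - 2 * math.sin(theta / 2)) < 1e-12
```

Identical documents are at distance 0 and diametrically opposite ones at distance 2.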

Page 20: Classic  IR  Models

Probabilistic Model
- Computes the probability that a document is relevant to the query
- ranks the documents according to their probability of being relevant to the query
- Assumption: there is a set R of relevant documents which maximizes the overall probability of relevance (R: the ideal answer set)
- R is not known in advance
  - initially assume a description (the terms) of R
  - iteratively refine this description

Page 21: Classic  IR  Models

Basic Notation
- D: corpus; d: an instance of D
- Q: set of queries; q: an instance of Q
- R = {(d,q) | d ∈ D, q ∈ Q, d is relevant to q}
- R̄ = {(d,q) | d ∈ D, q ∈ Q, d is not relevant to q}
- P(R|d): probability that d is relevant
- P(R̄|d): probability that d is not relevant

Page 22: Classic  IR  Models

Probability of Relevance
- P(R|d): probability that d is relevant
- Bayes' rule:

  P(R|d) = P(d|R) P(R) / P(d)

- P(d|R): probability of selecting d from R
- P(R): probability of selecting R from D
- P(d): probability of selecting d from D

Page 23: Classic  IR  Models

Document Ranking
- Take the odds of relevance as the rank; this minimizes the probability of an erroneous judgment:

  Sim(d,q) = P(R|d) / P(R̄|d) = ( P(d|R) P(R) ) / ( P(d|R̄) P(R̄) )

- P(R) and P(R̄) are the same for all documents, so

  Sim(d,q) ∝ P(d|R) / P(d|R̄)

Page 24: Classic  IR  Models

Ranking (cont'd)
- Each document is represented by a set of index terms t1, t2, …, tM
- assume binary weights wi for the terms ti: d = (w1, w2, …, wM), where wi = 1 if the term appears in d and wi = 0 otherwise
- Assuming independence of the index terms:

  P(d|R) = Π_{ti ∈ d} P(ti|R) · Π_{ti ∉ d} P(t̄i|R)

Page 25: Classic  IR  Models

Ranking (cont'd)
- Taking logarithms and omitting constant terms:

  Sim(d,q) ~ P(d|R) / P(d|R̄)
           ~ Σ_{i=1..M} w_iq w_id log( P(ti|R) / (1 − P(ti|R)) )
           + Σ_{i=1..M} w_iq w_id log( (1 − P(ti|R̄)) / P(ti|R̄) )

- R is initially unknown

Page 26: Classic  IR  Models

Initial Estimation
- Make simplifying assumptions, such as

  P(ti|R) = 0.5,  P(ti|R̄) = ni / N

  where ni is the number of documents containing ti and N is the total number of documents
- Retrieve an initial answer set using these values
- Refine the answer iteratively

Page 27: Classic  IR  Models

Improvement
- Let V be the number of documents retrieved initially
- Take the first r answers as relevant; from them compute Vi, the number of those documents containing ti
- Update the initial probabilities:

  P(ti|R) = Vi / V,  P(ti|R̄) = (ni − Vi) / (N − V)

- Resubmit the query and repeat until convergence
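One update step of this refinement can be sketched as follows (the toy term sets and corpus size are illustrative):

```python
def refine(term_docs, top_docs, N):
    """One feedback iteration: the V = len(top_docs) retrieved documents are
    taken as relevant. For each term, Vi of them contain the term and ni
    documents in the whole corpus contain it. Returns
    term -> (P(t|R), P(t|Rbar)) = (Vi / V, (ni - Vi) / (N - V))."""
    V = len(top_docs)
    probs = {}
    for t, docs in term_docs.items():
        Vi = len(docs & top_docs)
        ni = len(docs)
        probs[t] = (Vi / V, (ni - Vi) / (N - V))
    return probs

# term "sea" occurs in docs 1, 2, 3; docs 1 and 2 were retrieved; corpus of 10
print(refine({"sea": {1, 2, 3}}, {1, 2}, 10))  # {'sea': (1.0, 0.125)}
```

In practice the ratios are smoothed (e.g., adding 0.5 to numerator and 1 to denominator) so that Vi = 0 or Vi = V does not produce degenerate log-odds.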

Page 28: Classic  IR  Models

Comments on the Probabilistic Model
- Pros: good theoretical basis
- Cons: need to guess the initial probabilities; binary weights; independence assumption
- Extensions:
  - relevance feedback: humans choose the relevant docs
  - the OKAPI formula for non-binary weights

Page 29: Classic  IR  Models

Comparison of Models
- The Boolean model is simple and used almost everywhere, but it does not allow partial matches; it is the weakest model
- The vector space model has been shown (Salton and Buckley) to outperform the other two models
- Various extensions deal with their weaknesses

Page 30: Classic  IR  Models

Query Modification
- The results are not always satisfactory
  - some answers are correct, others are not
  - queries can't specify the user's needs precisely
- Iteratively reformulate and resubmit the query until the results become satisfactory
- Two approaches: relevance feedback and query expansion

Page 31: Classic  IR  Models

Relevance Feedback
- Mark answers as
  - relevant: positive examples
  - irrelevant: negative examples
- Query: a point in document space
  - at each iteration compute a new query point
  - the query moves towards an "optimal point" that distinguishes relevant from non-relevant documents
  - the weights of the query terms are modified ("term reweighting")

Page 32: Classic  IR  Models

Rocchio Vectors
[Figure: successive query points q0, q1, q2 moving towards the optimal query]

Page 33: Classic  IR  Models

Rocchio Formula
- New query point:

  q = α q0 + (β / n1) Σ_i di − (γ / n2) Σ_j dj

- di: a relevant answer; dj: a non-relevant answer
- n1: number of relevant answers; n2: number of non-relevant answers
- α, β, γ: relative strengths (usually α = β = γ = 1)
- α = 1, β = 0.75, γ = 0.25: q0 and the relevant answers contain the important information
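The Rocchio update translates directly into vector arithmetic; a minimal sketch over dense term vectors (the example vectors are invented):

```python
def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """q = alpha*q0 + (beta/n1)*sum(relevant) - (gamma/n2)*sum(non_relevant)."""
    q = [alpha * w for w in q0]
    if relevant:
        n1 = len(relevant)
        for d in relevant:
            for i, w in enumerate(d):
                q[i] += beta / n1 * w
    if non_relevant:
        n2 = len(non_relevant)
        for d in non_relevant:
            for i, w in enumerate(d):
                q[i] -= gamma / n2 * w
    return q

# the query moves toward the relevant answer and away from the non-relevant one
print(rocchio([1.0, 0.0], [[0.0, 1.0]], [[1.0, 0.0]]))  # [0.75, 0.75]
```

Implementations usually also clamp negative components of q to zero, since negative term weights have no natural interpretation.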

Page 34: Classic  IR  Models

Query Expansion
- Adds new terms to the query which are somehow related to the existing terms
  - synonyms from a dictionary (e.g., staff, crew)
  - semantically related terms from a thesaurus (e.g., WordNet): man, woman, mankind, human, …
  - terms with similar pronunciation (Phonix, Soundex)
- Better results in many cases, but the query defocuses (topic drift)

Page 35: Classic  IR  Models

Comments
- Do all of the above together
  - query expansion: new terms are added from relevant documents, dictionaries, thesauri
  - term reweighting by the Rocchio formula
- If consistent relevance judgments are provided
  - 2-3 iterations improve the results
  - quality depends on the corpus

Page 36: Classic  IR  Models

Extensions
- Pseudo relevance feedback: mark the top k answers as relevant and the bottom k answers as non-relevant, then apply the Rocchio formula
- Relevance models for the probabilistic model
  - evaluation of the initial answers by humans
  - term reweighting model by Bruce Croft, 1983

Page 37: Classic  IR  Models

Text Clustering
- The grouping of similar vectors into clusters
- Similar documents tend to be relevant to the same requests
- Clustering in the M-dimensional space, where M is the number of terms

Page 38: Classic  IR  Models

Clustering Methods
- Sound methods, based on the document-to-document similarity matrix
  - graph-theoretic methods
  - O(N²) time
- Iterative methods, operating directly on the document vectors
  - O(N logN) or O(N²/logN) time

Page 39: Classic  IR  Models

Sound Methods
1. Two documents with similarity > T (a threshold) are connected with an edge [Duda & Hart 73]
- clusters: the connected components (or maximal cliques) of the resulting graph
- problem: selection of an appropriate threshold T
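The connected-components variant of this method can be sketched as follows (the similarity values are invented; a real system would derive them from the cosine measure):

```python
def threshold_clusters(sim, T):
    """Connect documents with similarity > T; clusters are the connected
    components of the resulting graph.
    sim: symmetric dict-of-dicts similarity matrix keyed by doc id."""
    docs = list(sim)
    adj = {d: [e for e in docs if e != d and sim[d][e] > T] for d in docs}
    seen, clusters = set(), []
    for d in docs:
        if d in seen:
            continue
        stack, component = [d], set()
        while stack:  # depth-first traversal of one component
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                component.add(x)
                stack.extend(adj[x])
        clusters.append(component)
    return clusters

sim = {1: {2: 0.9, 3: 0.1}, 2: {1: 0.9, 3: 0.2}, 3: {1: 0.1, 2: 0.2}}
print(threshold_clusters(sim, 0.5))  # [{1, 2}, {3}]
```

The sensitivity to T is visible even here: raising T to 0.95 splits every document into its own cluster, while lowering it below 0.1 merges everything.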

Page 40: Classic  IR  Models

Zahn's Method [Zahn71]
- Find the minimum spanning tree of the documents
- For each document, delete incident edges with length l > lavg, where lavg is the average length of its incident edges
- Or remove the longest edges (1 edge removed → 2 clusters, 2 edges removed → 3 clusters)
- Clusters: the connected components of the resulting graph
[Figure: a minimum spanning tree in which the dashed (inconsistent) edge is deleted]

Page 41: Classic  IR  Models

Iterative Methods
- K-means clustering (K is known in advance)
  - choose some seed points (documents) as possible cluster centroids
  - repeat until the centroids do not change:
    - assign each vector (document) to its closest seed
    - compute the new centroids
    - reassign vectors to improve the clusters
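The loop above can be sketched as plain k-means; the toy 2-dimensional vectors stand in for M-dimensional term vectors, and Euclidean distance is used for simplicity (document clustering more often uses cosine similarity):

```python
import math
import random

def kmeans(vectors, k, iters=100, seed=0):
    """Plain k-means: seed centroids, then alternate assignment and
    centroid recomputation until the centroids stop changing."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]  # seed points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:  # assign each vector to its closest centroid
            j = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            clusters[j].append(v)
        new = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[j]
               for j, cl in enumerate(clusters)]  # recompute centroids
        if new == centroids:  # centroids did not change: done
            break
        centroids = new
    return centroids, clusters

vecs = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids, clusters = kmeans(vecs, 2)
print(sorted(centroids))  # [[0.0, 0.5], [10.0, 10.5]]
```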

Page 42: Classic  IR  Models

Cluster Searching
- The M-dimensional query vector is compared with the cluster centroids
  - search the closest cluster
  - retrieve the documents with similarity > T

Page 43: Classic  IR  Models

References
- "Modern Information Retrieval", Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, 1999
- "Searching Multimedia Databases by Content", Christos Faloutsos, Kluwer Academic Publishers, 1996
- Information Retrieval Resources: http://nlp.stanford.edu/IR-book/information-retrieval.html
- TREC: http://trec.nist.gov/
- SMART: http://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
- LEMUR: http://www.lemurproject.org/
- LUCENE: http://lucene.apache.org/