
Vector Space Model

CS 652 Information Extraction and Integration

2

Introduction

[Diagram: Docs in the collection are represented by Index Terms; the user's Information Need is expressed as a query; matching each doc representation against the query produces a Ranking]

3

Introduction

A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query

A ranking is based on fundamental premises regarding the notion of relevance, such as:

common sets of index terms

sharing of weighted terms

likelihood of relevance

Each set of premises leads to a distinct IR model

4

IR Models

User Task: Retrieval, Browsing

Retrieval:

Classic Models: Boolean, Vector (Space), Probabilistic

Set Theoretic (extending Boolean): Fuzzy, Extended Boolean

Algebraic (extending Vector): Generalized Vector (Space), Latent Semantic Indexing, Neural Networks

Probabilistic (extending Probabilistic): Inference Network, Belief Network

Structured Models: Non-Overlapping Lists, Proximal Nodes

Browsing: Flat, Structure Guided, Hypertext

5

Basic Concepts

Each document is described by a set of representative keywords or index terms

Index terms are document words (e.g., nouns) that carry meaning by themselves and capture the main themes of a document

However, search engines assume that all words are index terms (full text representation)

6

Basic Concepts

Not all terms are equally useful for representing the document contents

The importance of the index terms is represented by weights associated with them

Let:

$k_i$ be an index term

$d_j$ be a document

$w_{ij}$ be a weight associated with the pair $(k_i, d_j)$, which quantifies the importance of $k_i$ for describing the contents of $d_j$

7

The Vector (Space) Model

Define:

$w_{ij} > 0$ whenever $k_i \in d_j$

$w_{iq} \geq 0$ associated with the pair $(k_i, q)$

$\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj})$, the document vector of $d_j$

$\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$, the query vector of $q$

The unit vectors $\vec{k_i}$ and $\vec{k_j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)

Queries and documents are represented as weighted vectors

8

The Vector (Space) Model

$$\mathrm{sim}(q, d_j) = \cos\theta = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^{2}}\; \sqrt{\sum_{i=1}^{t} w_{iq}^{2}}}$$

where $\cdot$ is the inner product operator and $|\vec{q}|$ is the length of $\vec{q}$

Since $w_{ij} \geq 0$ and $w_{iq} \geq 0$, we have $0 \leq \mathrm{sim}(q, d_j) \leq 1$

A document is retrieved even if it matches the query terms only partially

[Diagram: document vector $\vec{d_j}$ and query vector $\vec{q}$ separated by angle $\theta$]
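To make the formula concrete, here is a minimal Python sketch of the cosine measure over sparse term-to-weight dictionaries; the function name and the example weights are illustrative, not from the slides.

```python
import math

def cosine_sim(d, q):
    """Cosine similarity between two sparse term -> weight vectors."""
    # Inner product over shared terms: sum_i w_ij * w_iq
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    # Vector lengths |d_j| and |q|
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Hypothetical weights for one document and one query
d1 = {"vector": 0.8, "space": 0.6, "model": 0.4}
q = {"vector": 1.0, "model": 0.5}
print(cosine_sim(d1, q))  # ~0.83, between 0 and 1 as the slide states
```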

9

The Vector (Space) Model

$$\mathrm{sim}(q, d_j) = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{|\vec{d_j}|\,|\vec{q}|}$$

How do we compute the weights $w_{ij}$ and $w_{iq}$?

A good weight must take into account two effects:

quantification of intra-document contents (similarity)

tf factor, the term frequency within a document

quantification of inter-document separation (dissimilarity)

idf factor, the inverse document frequency

$w_{ij} = \mathrm{tf}(i, j) \times \mathrm{idf}(i)$

10

The Vector (Space) Model

Let:

$N$ be the total number of documents in the collection

$n_i$ be the number of documents which contain $k_i$

$\mathrm{freq}(i, j)$ be the raw frequency of $k_i$ within $d_j$

A normalized tf factor is given by

$$f(i, j) = \frac{\mathrm{freq}(i, j)}{\max_l \mathrm{freq}(l, j)}$$

where the maximum is computed over all terms which occur within the document $d_j$

The inverse document frequency (idf) factor is

$$\mathrm{idf}(i) = \log \frac{N}{n_i}$$

The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term $k_i$.
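A minimal sketch of these two factors in Python; the toy corpus and function names are illustrative, not from the slides:

```python
import math
from collections import Counter

def tf_factor(term, doc_tokens):
    """f(i, j): raw frequency normalized by the max frequency in d_j."""
    freqs = Counter(doc_tokens)
    return freqs[term] / max(freqs.values())

def idf_factor(term, corpus):
    """idf(i) = log(N / n_i); n_i = number of documents containing the term."""
    n_i = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_i) if n_i else 0.0

# Hypothetical three-document corpus
corpus = [["vector", "space", "model", "vector"],
          ["boolean", "model"],
          ["probabilistic", "model", "ranking"]]
print(tf_factor("vector", corpus[0]))  # 1.0: "vector" is the most frequent term
print(idf_factor("vector", corpus))    # log(3/1) ~ 1.10: term is rare
print(idf_factor("model", corpus))     # log(3/3) = 0: term appears everywhere
```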

11

The Vector (Space) Model

The best term-weighting schemes use weights which are given by

$$w_{ij} = f(i, j) \times \log \frac{N}{n_i}$$

This strategy is called a tf-idf weighting scheme

For the query term weights, a suggestion is

$$w_{iq} = \left(0.5 + \frac{0.5 \times \mathrm{freq}(i, q)}{\max_l \mathrm{freq}(l, q)}\right) \times \log \frac{N}{n_i}$$

The vector model with tf-idf weights is a good ranking strategy with general collections

The VSM is usually as good as the known ranking alternatives. It is also simple and fast to compute
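Putting the pieces together, a sketch of tf-idf document weighting plus the suggested query weighting; the corpus representation and helper names are illustrative, not from the slides:

```python
import math
from collections import Counter

def tfidf_doc_vector(doc_tokens, corpus):
    """w_ij = f(i, j) * log(N / n_i) for every term of the document."""
    freqs = Counter(doc_tokens)
    max_f = max(freqs.values())
    vec = {}
    for term, f in freqs.items():
        n_i = sum(1 for d in corpus if term in d)
        vec[term] = (f / max_f) * math.log(len(corpus) / n_i)
    return vec

def tfidf_query_vector(query_tokens, corpus):
    """w_iq = (0.5 + 0.5 * freq(i, q) / max_l freq(l, q)) * log(N / n_i)."""
    freqs = Counter(query_tokens)
    max_f = max(freqs.values())
    vec = {}
    for term, f in freqs.items():
        n_i = sum(1 for d in corpus if term in d)
        if n_i:  # skip query terms that occur in no document
            vec[term] = (0.5 + 0.5 * f / max_f) * math.log(len(corpus) / n_i)
    return vec
```

Ranking a collection then reduces to scoring each document vector against the query vector with the cosine function sketched earlier and sorting by the score.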

12

The Vector (Space) Model

Advantages:

term-weighting improves quality of the answer set

partial matching allows retrieval of documents that approximate the query conditions

cosine ranking formula sorts documents according to degree of similarity to the query

A popular IR model because of its simplicity & speed

Disadvantages:

assumes mutual independence of index terms (??);

not clear that this is bad, though

Naïve Bayes Classifier

CS 652 Information Extraction and Integration

14

Bayes Theorem

The basic starting point for inference problems using probability theory as logic
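For reference, for a hypothesis $h$ and observed data $D$:

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

where $P(h)$ is the prior probability of $h$, $P(D \mid h)$ is the likelihood of the data under $h$, and $P(h \mid D)$ is the posterior probability of $h$ after seeing $D$.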

15

Bayes Theorem

$P(\text{cancer}) = .008 \qquad P(\lnot\text{cancer}) = .992$

$P(+ \mid \text{cancer}) = .98 \qquad P(- \mid \text{cancer}) = .02$

$P(+ \mid \lnot\text{cancer}) = .03 \qquad P(- \mid \lnot\text{cancer}) = .97$

$P(+ \mid \text{cancer})\,P(\text{cancer}) = (.98)(.008) = .0078$

$P(+ \mid \lnot\text{cancer})\,P(\lnot\text{cancer}) = (.03)(.992) = .0298$
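Normalizing the two quantities above gives the posterior:

$$P(\text{cancer} \mid +) = \frac{.0078}{.0078 + .0298} \approx .21$$

So even after a positive test, $\lnot\text{cancer}$ remains the more probable hypothesis.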

16

Basic Formulas for Probabilities
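For reference, the standard identities are the product rule, the sum rule, and the theorem of total probability:

$$P(A \land B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$$

$$P(A \lor B) = P(A) + P(B) - P(A \land B)$$

$$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\, P(A_i) \quad \text{if } A_1, \ldots, A_n \text{ are mutually exclusive and } \sum_{i=1}^{n} P(A_i) = 1$$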

17

Naïve Bayes Classifier

18

Naïve Bayes Classifier

19

Naïve Bayes Classifier
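In summary: given a target value set $V$ and an instance described by attribute values $a_1, \ldots, a_n$, the most probable target value is

$$v_{MAP} = \operatorname*{argmax}_{v_j \in V} P(v_j \mid a_1, \ldots, a_n) = \operatorname*{argmax}_{v_j \in V} P(a_1, \ldots, a_n \mid v_j)\, P(v_j)$$

Under the naïve assumption that attribute values are conditionally independent given the target value, $P(a_1, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$, and the classifier becomes

$$v_{NB} = \operatorname*{argmax}_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$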

20

Naïve Bayes Algorithm

21

Naïve Bayes Subtleties

22

Naïve Bayes Subtleties

m-estimate of probability
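The m-estimate smooths the raw frequency $n_c / n$ toward a prior estimate $p$:

$$P(a_i \mid v_j) \approx \frac{n_c + m\,p}{n + m}$$

where $n$ is the number of training examples with target value $v_j$, $n_c$ is the number of those that also have attribute value $a_i$, $p$ is a prior estimate of the probability, and $m$ is the equivalent sample size, i.e., the weight given to the prior.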

23

Learning to Classify Text

Classify text into manually defined groups

Estimate probability of class membership

Rank by relevance

Discover groupings, relationships

– Between texts

– Between real-world entities mentioned in text

24

Learn_Naïve_Bayes_Text(Examples, V)

25

Calculate_Probability_Terms
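The probability terms are estimated from word counts, with the vocabulary acting as a uniform prior (an m-estimate with $p = 1/|Vocabulary|$ and $m = |Vocabulary|$):

$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |Vocabulary|}$$

where $n$ is the total number of word positions in all training documents of class $v_j$ and $n_k$ is the number of those positions occupied by word $w_k$.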

26

Classify_Naïve_Bayes_Text(Doc)
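Since the procedure bodies are short, here is a minimal Python sketch of both routines under the count-based estimates above; the toy training set and the use of log-probabilities (to avoid underflow in the product) are illustrative choices, not from the slides.

```python
import math
from collections import Counter, defaultdict

def learn_naive_bayes_text(examples):
    """examples: list of (tokens, label). Returns priors, word probs, vocab."""
    vocab = {w for tokens, _ in examples for w in tokens}
    priors, word_probs = {}, defaultdict(dict)
    for label in {lab for _, lab in examples}:
        docs = [tokens for tokens, lab in examples if lab == label]
        priors[label] = len(docs) / len(examples)
        counts = Counter(w for tokens in docs for w in tokens)
        n = sum(counts.values())  # total word positions in this class
        for w in vocab:
            # P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
            word_probs[label][w] = (counts[w] + 1) / (n + len(vocab))
    return priors, word_probs, vocab

def classify_naive_bayes_text(doc_tokens, priors, word_probs, vocab):
    """v_NB = argmax_v log P(v) + sum_i log P(a_i | v), over known words."""
    best, best_score = None, -math.inf
    for label, prior in priors.items():
        score = math.log(prior)
        score += sum(math.log(word_probs[label][w])
                     for w in doc_tokens if w in vocab)
        if score > best_score:
            best, best_score = label, score
    return best

# Toy training data (hypothetical)
train = [(["great", "movie"], "pos"), (["boring", "movie"], "neg")]
model = learn_naive_bayes_text(train)
print(classify_naive_bayes_text(["great", "film"], *model))  # 'pos'
```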

27

How to Improve

More training data

Better training data

Better text representation

– Usual IR tricks (term weighting, etc.)

– Manually construct good predictor features

Hand off hard cases to a human being
