Vector Space Model
CS 652 Information Extraction and Integration
Introduction
[Diagram: documents are represented by index terms and the user's information need by a query; matching the query against the indexed documents produces a ranking]
A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query
A ranking is based on fundamental premises regarding the notion of relevance, such as:
common sets of index terms
sharing of weighted terms
likelihood of relevance
Each set of premises leads to a distinct IR model
IR Models
[Taxonomy diagram:
User task: Retrieval, Browsing
Classic models: Boolean, Vector (Space), Probabilistic
Set-theoretic models: Fuzzy, Extended Boolean
Algebraic models: Generalized Vector (Space), Latent Semantic Indexing, Neural Networks
Probabilistic models: Inference Network, Belief Network
Structured models: Non-Overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext]
Basic Concepts
Each document is described by a set of representative keywords or index terms
Index terms are document words (typically nouns) that have meaning by themselves and capture the main themes of a document
However, search engines assume that all words are index terms (full-text representation)
Not all terms are equally useful for representing the document contents
The importance of an index term is represented by a weight associated with it
Let:
$k_i$ be an index term
$d_j$ be a document
$w_{ij}$ be the weight associated with the pair $(k_i, d_j)$, which quantifies the importance of $k_i$ for describing the contents of $d_j$
The Vector (Space) Model
Define:
$w_{ij} > 0$ whenever $k_i \in d_j$
$w_{iq} \geq 0$, the weight associated with the pair $(k_i, q)$
$\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj})$, the document vector of $d_j$
$\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$, the query vector of $q$
The unit vectors $\vec{k_i}$ and $\vec{k_j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
Queries and documents are represented as weighted vectors in a $t$-dimensional space
The similarity between a document and a query is the cosine of the angle $\theta$ between their vectors:
$$\mathrm{sim}(q, d_j) = \cos(\theta) = \frac{\vec{d_j} \bullet \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2} \, \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$$
where $\bullet$ is the inner product operator and $|\vec{q}|$ is the length of $\vec{q}$
Since $w_{ij} \geq 0$ and $w_{iq} \geq 0$, we have $0 \leq \mathrm{sim}(q, d_j) \leq 1$
A document is retrieved even if it matches the query terms only partially
[Diagram: document vector $\vec{d_j}$ and query vector $\vec{q}$ with the angle $\theta$ between them]
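To make the cosine measure concrete, here is a minimal Python sketch (the function and variable names are illustrative, not from the slides), representing documents and queries as sparse term-to-weight dictionaries:

```python
import math

def cosine_sim(doc_weights, query_weights):
    """Cosine similarity between two sparse term-weight vectors.

    doc_weights, query_weights: dict mapping term -> weight (w_ij, w_iq).
    Returns a value in [0, 1] since all weights are non-negative.
    """
    # Inner product over the terms the two vectors share
    dot = sum(w * query_weights[t] for t, w in doc_weights.items()
              if t in query_weights)
    # Vector lengths |d_j| and |q|
    doc_len = math.sqrt(sum(w * w for w in doc_weights.values()))
    query_len = math.sqrt(sum(w * w for w in query_weights.values()))
    if doc_len == 0 or query_len == 0:
        return 0.0
    return dot / (doc_len * query_len)

# Partial matching: the document shares only one of the two query terms,
# yet still receives a non-zero similarity score.
doc = {"information": 0.8, "retrieval": 0.6}
query = {"information": 1.0, "extraction": 1.0}
print(cosine_sim(doc, query))  # ~0.57
```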
$$\mathrm{sim}(q, d_j) = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{|\vec{d_j}| \, |\vec{q}|}$$
How do we compute the weights $w_{ij}$ and $w_{iq}$?
A good weight must take into account two effects:
quantification of intra-document contents (similarity): the tf factor, the term frequency within a document
quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
$$w_{ij} = tf(i, j) \times idf(i)$$
Let:
$N$ be the total number of documents in the collection
$n_i$ be the number of documents which contain $k_i$
$freq(i, j)$ be the raw frequency of $k_i$ within $d_j$
A normalized tf factor is given by
$$f(i, j) = \frac{freq(i, j)}{\max_l freq(l, j)}$$
where the maximum is computed over all terms which occur within the document $d_j$
The inverse document frequency (idf) factor is
$$idf(i) = \log \frac{N}{n_i}$$
The log is used to make the values of the tf and idf factors comparable. It can also be interpreted as the amount of information associated with the term $k_i$.
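A short sketch of these two factors in Python (the helper names and data layout are my own assumptions, with documents as lists of words):

```python
import math
from collections import Counter

def normalized_tf(doc_terms):
    """f(i, j) = freq(i, j) / max_l freq(l, j) for every term in one document."""
    freq = Counter(doc_terms)
    max_freq = max(freq.values())
    return {term: f / max_freq for term, f in freq.items()}

def idf(term, docs):
    """idf(i) = log(N / n_i), where n_i counts documents containing the term."""
    n_i = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_i) if n_i else 0.0

docs = [["vector", "space", "model"], ["vector", "retrieval"], ["probabilistic", "model"]]
print(normalized_tf(docs[0]))   # each factor lies in (0, 1]
print(idf("vector", docs))      # log(3/2) ~ 0.405
```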
The best term-weighting schemes use weights given by
$$w_{ij} = f(i, j) \times \log \frac{N}{n_i}$$
This strategy is called a tf-idf weighting scheme
For the query term weights, a suggested formula is
$$w_{iq} = \left( 0.5 + \frac{0.5 \times freq(i, q)}{\max_l freq(l, q)} \right) \times \log \frac{N}{n_i}$$
The vector model with tf-idf weights is a good ranking strategy with general collections
The VSM is usually as good as the known ranking alternatives. It is also simple and fast to compute
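Putting the pieces together, a hedged end-to-end sketch of tf-idf ranking that reuses the hypothetical cosine_sim, normalized_tf, and idf helpers sketched above:

```python
def tfidf_weights(doc_terms, docs):
    """w_ij = f(i, j) * log(N / n_i) for one document."""
    return {t: f * idf(t, docs) for t, f in normalized_tf(doc_terms).items()}

def query_weights(query_terms, docs):
    """w_iq = (0.5 + 0.5 * freq(i, q) / max_l freq(l, q)) * log(N / n_i)."""
    freq = Counter(query_terms)
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_freq) * idf(t, docs)
            for t, f in freq.items()}

def rank(query_terms, docs):
    """Sort documents by cosine similarity to the query, best first."""
    q = query_weights(query_terms, docs)
    scored = [(cosine_sim(tfidf_weights(d, docs), q), i)
              for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)

print(rank(["vector", "model"], docs))  # list of (similarity, doc index) pairs
```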
Advantages:
term-weighting improves quality of the answer set
partial matching allows retrieval of documents that approximate the query conditions
cosine ranking formula sorts documents according to degree of similarity to the query
A popular IR model because of its simplicity & speed
Disadvantages:
assumes mutual independence of index terms;
though it is not clear that this is a disadvantage in practice
Naïve Bayes Classifier
CS 652 Information Extraction and Integration
Bayes Theorem
The basic starting point for inference problems using probability theory as logic:
$$P(h \mid D) = \frac{P(D \mid h) \, P(h)}{P(D)}$$
where $P(h)$ is the prior probability of hypothesis $h$, $P(D)$ is the prior probability of observing data $D$, and $P(h \mid D)$ is the posterior probability of $h$ given $D$
Bayes Theorem: Example
Does a patient have cancer, given a positive lab test? Suppose the test returns a correct positive result in 98% of the cases in which the disease is present, a correct negative result in 97% of the cases in which it is absent, and 0.8% of the population has this cancer:

P(cancer) = .008        P(~cancer) = .992
P(+|cancer) = .98       P(-|cancer) = .02
P(+|~cancer) = .03      P(-|~cancer) = .97

P(+|cancer) P(cancer) = (.98)(.008) = .0078
P(+|~cancer) P(~cancer) = (.03)(.992) = .0298

Normalizing, P(cancer|+) = .0078 / (.0078 + .0298) ≈ .21, so the more probable hypothesis given a positive test is still ~cancer
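The same computation as a small Python check of the arithmetic above:

```python
# Prior and test characteristics from the example above
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_not = 0.03

# Unnormalized posteriors P(+|h) P(h) for each hypothesis
num_cancer = p_pos_given_cancer * p_cancer   # ~0.0078
num_not = p_pos_given_not * (1 - p_cancer)   # ~0.0298

# Normalize by P(+) to obtain P(cancer | +)
posterior = num_cancer / (num_cancer + num_not)
print(round(posterior, 3))  # 0.209 -- a positive test still leaves cancer unlikely
```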
Basic Formulas for Probabilities
Product rule: $P(A \wedge B) = P(A \mid B) \, P(B) = P(B \mid A) \, P(A)$
Sum rule: $P(A \vee B) = P(A) + P(B) - P(A \wedge B)$
Theorem of total probability: if events $A_1, \ldots, A_n$ are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then $P(B) = \sum_{i=1}^{n} P(B \mid A_i) \, P(A_i)$
Naïve Bayes Classifier
Assume a target function $f: X \to V$, where each instance $x$ is described by attribute values $\langle a_1, a_2, \ldots, a_n \rangle$. The most probable target value is
$$v_{MAP} = \operatorname*{argmax}_{v_j \in V} P(v_j \mid a_1, \ldots, a_n) = \operatorname*{argmax}_{v_j \in V} P(a_1, \ldots, a_n \mid v_j) \, P(v_j)$$
The naïve Bayes assumption is that the attributes are conditionally independent given the target value:
$$P(a_1, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$
which gives the naïve Bayes classifier
$$v_{NB} = \operatorname*{argmax}_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
Naïve Bayes Algorithm
Naive_Bayes_Learn(examples): for each target value $v_j$, estimate $\hat{P}(v_j)$; for each attribute value $a_i$ of each attribute, estimate $\hat{P}(a_i \mid v_j)$
Classify_New_Instance(x): $v_{NB} = \operatorname*{argmax}_{v_j \in V} \hat{P}(v_j) \prod_{a_i \in x} \hat{P}(a_i \mid v_j)$
Naïve Bayes Subtleties
The conditional independence assumption is often violated in practice, but naïve Bayes works surprisingly well anyway; classification only requires that the argmax be correct, not that the estimated posteriors be accurate
If none of the training instances with target value $v_j$ have attribute value $a_i$, then $\hat{P}(a_i \mid v_j) = 0$ and the whole product $\hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j)$ is zeroed out
The typical solution is the m-estimate of probability:
$$\hat{P}(a_i \mid v_j) = \frac{n_c + m p}{n + m}$$
where $n$ is the number of training examples with $v = v_j$, $n_c$ is the number of those examples with $a = a_i$, $p$ is a prior estimate of the probability, and $m$ is a weight (the equivalent sample size)
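A tiny sketch of the m-estimate (the function name is mine):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m).
    Falls back to the prior p when there is no data (n == 0) and
    approaches the maximum-likelihood estimate n_c/n as n grows."""
    return (n_c + m * p) / (n + m)

print(m_estimate(n_c=0, n=10, p=0.1, m=5))  # 0.033..., never exactly zero
```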
Learning to Classify Text
Classify text into manually defined groups
Estimate probability of class membership
Rank by relevance
Discover groupings and relationships
– between texts
– between real-world entities mentioned in text
Learn_Naïve_Bayes_Text(Examples, V)
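A minimal Python sketch of this learning step under the standard formulation (the data layout and helper names are my own assumptions): collect the vocabulary, then estimate $P(v_j)$ and the smoothed $P(w_k \mid v_j)$ for each class.

```python
from collections import Counter
import math

def learn_naive_bayes_text(examples, classes):
    """examples: list of (list_of_words, class_label) pairs.
    Returns priors P(v_j), smoothed term probabilities P(w_k | v_j),
    and the vocabulary."""
    vocabulary = {w for words, _ in examples for w in words}
    priors, cond_probs = {}, {}
    for v in classes:
        docs_v = [words for words, label in examples if label == v]
        priors[v] = len(docs_v) / len(examples)
        text_v = [w for words in docs_v for w in words]  # concatenate docs of class v
        n = len(text_v)                                  # total word positions
        counts = Counter(text_v)
        # Laplace smoothing: P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
        cond_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                         for w in vocabulary}
    return priors, cond_probs, vocabulary
```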
Calculate_Probability_Terms
$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |Vocabulary|}$$
where $n$ is the total number of word positions in all training documents of class $v_j$, and $n_k$ is the number of times word $w_k$ occurs in those positions
Classify_Naïve_Bayes_Text(Doc)
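A matching classification sketch (again with hypothetical names), scoring classes in log space to avoid floating-point underflow and skipping words outside the vocabulary:

```python
def classify_naive_bayes_text(doc_words, priors, cond_probs, vocabulary):
    """Return v_NB = argmax_v P(v) * prod_k P(w_k | v) over known words."""
    positions = [w for w in doc_words if w in vocabulary]
    best_class, best_score = None, float("-inf")
    for v, prior in priors.items():
        # Sum of logs instead of a product of small probabilities
        score = math.log(prior) + sum(math.log(cond_probs[v][w]) for w in positions)
        if score > best_score:
            best_class, best_score = v, score
    return best_class

examples = [("buy cheap pills now".split(), "spam"),
            ("meeting agenda for monday".split(), "ham")]
priors, cond_probs, vocab = learn_naive_bayes_text(examples, ["spam", "ham"])
print(classify_naive_bayes_text("cheap pills".split(), priors, cond_probs, vocab))  # spam
```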
How to Improve
More training data
Better training data
Better text representation
– Usual IR tricks (term weighting, etc.)
– Manually construct good predictor features
Hand off hard cases to a human being