VSM (Vector Space Model)

Chapter 2: Modeling

Uploaded by guesta34d441 on 17-Jun-2015


DESCRIPTION

Information Retrieval Systems lecture notes, Prof. Seung-Shik Kang (강승식)

TRANSCRIPT

Page 1: VSM (Vector Space Model)

Chapter 2

Modeling

Page 2

http://nlp.kookmin.ac.kr/

Contents

Introduction
A Taxonomy of IR Models
Retrieval: Ad hoc, Filtering
A Formal Characterization of IR Models
Classic IR Models
Alternative Set Theoretic Models
Alternative Algebraic Models
Alternative Probabilistic Models

Page 3

Contents (Cont.)

Structured Text Retrieval Models
Models for Browsing
Trends and Research Issues

Page 4

2.1 Introduction

Traditional IR System
– Adopts index terms to index and retrieve documents

Index Term
– Restricted sense
  • A keyword which has some meaning of its own (usually a noun)
– General form
  • Any word which appears in the text of a document

Ranking Algorithm
– Attempts to establish a simple ordering of the documents retrieved
– Operates according to basic premises regarding the notion of document relevance

Page 5

2.2 A Taxonomy of IR Models

(Figure: taxonomy of IR models, organized by user task)

User Task
– Retrieval: Ad hoc, Filtering
– Browsing

Classic Models
– Boolean
– Vector
– Probabilistic

Set Theoretic
– Fuzzy
– Extended Boolean

Algebraic
– Generalized Vector
– Latent Semantic Indexing
– Neural Networks

Probabilistic
– Inference Network
– Belief Network

Structured Models
– Non-Overlapping Lists
– Proximal Nodes

Browsing
– Flat
– Structure Guided
– Hypertext

Page 6

A Taxonomy of IR Models (Cont.)

Retrieval models

– Most frequently associated with distinct combinations of a document logical view and a user task

USER TASK   | Index Terms    | Full Text      | Logical View of Documents: Full Text + Structure
------------+----------------+----------------+-------------------------------------------------
Retrieval   | Classic        | Classic        | Structured
            | Set theoretic  | Set theoretic  |
            | Algebraic      | Algebraic      |
            | Probabilistic  | Probabilistic  |
Browsing    | Flat           | Flat           | Structure Guided
            |                | Hypertext      | Hypertext

Page 7

2.3 Retrieval

Ad hoc
– The documents in the collection remain relatively static while new queries are submitted to the system
– The most common form of user task

Filtering
– The queries remain relatively static while new documents come into the system (and leave)
– User profile
  • Describes the user's preferences
– Routing (a variation of filtering that ranks the filtered documents)

Page 8

2.4 A Formal Characterization of IR Models

IR Model

An IR model is a quadruple [D, Q, F, R(q_i, d_j)] where

– D : set composed of logical views for the documents in the collection
– Q : set composed of logical views for the user information needs (queries)
– F : framework for modeling documents, queries, and their relationships
– R(q_i, d_j) : ranking function which associates a real number with a query q_i and a document d_j

Examples
– Boolean model : sets of documents and operations on sets
– Vector model : t-dimensional vector space and linear algebra operations
– Probabilistic model : sets, probabilistic operations, and Bayes' theorem

Page 9

2.5 Classic Information Retrieval

Boolean Model
– Based on set theory and Boolean algebra
– Queries are specified as Boolean expressions
– The model considers that index terms are present or absent in a document

Vector Model
– Partial matching is possible
– Assigns non-binary weights to index terms
– Term weights are used to compute the degree of similarity

Probabilistic Model
– Given a query q, the model assigns each document d_j, as a measure of similarity to the query, P(d_j relevant to q) / P(d_j non-relevant to q), which computes the odds of the document d_j being relevant to the query q

Page 10

2.5.1 Basic Concepts

Index Term
– A word whose semantics helps in remembering the document's main themes
– Mainly nouns
  • Nouns have meaning by themselves
– Weights
  • Not all terms are equally useful for describing the document

Definition
– Set of index terms : K = {k_1, ..., k_t}
– Document : d_j = (w_1j, w_2j, ..., w_tj)
– w_ij : weight associated with the index term k_i of a document d_j
– g_i : function that returns the weight associated with the index term k_i in any t-dimensional vector (i.e., g_i(d_j) = w_ij)

Page 11

Basic Concepts (Cont.)

Mutual Independence

– Index term weights are usually assumed to be mutually independent

– Knowing the weight wij associated with the pair (ki, dj) tells us nothing about the weight w(i+1)j associated with the pair (ki+1, dj)

– It does simplify the task of computing index term weights and allows for fast ranking computation

Page 12

2.5.2 Boolean Model

Basis
– A simple retrieval model based on set theory and Boolean algebra
– Operations : and, or, not

Advantages
– Clean formalism
– Boolean query expressions have precise semantics

Disadvantages
– Binary decision (no notion of a partial match)
  • Retrieval of too few or too many documents
– Users find it difficult to express their query requests in terms of Boolean expressions

Page 13

Boolean Model (Cont.)

Definition
– d_j = (w_1j, w_2j, ..., w_tj), with w_ij ∈ {0, 1}
– sim(d_j, q) = 1 if ∃ q_cc | (q_cc ∈ q_dnf) ∧ (∀ k_i, g_i(d_j) = g_i(q_cc)); 0 otherwise

Example
– q = k_a ∧ (k_b ∨ ¬k_c)
– q_dnf = (1, 1, 1) ∨ (1, 1, 0) ∨ (1, 0, 0)
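The DNF matching rule above can be sketched in a few lines of Python. This is an illustration, not code from the lecture notes; it hard-codes the example query q = k_a ∧ (k_b ∨ ¬k_c) over three index terms.

```python
# Minimal sketch of Boolean-model matching (illustrative).
# Query: q = ka AND (kb OR NOT kc)
# Its disjunctive normal form: q_dnf = (1,1,1) OR (1,1,0) OR (1,0,0)

Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}  # conjunctive components

def sim(doc_vector):
    """Return 1 if the document's binary term vector equals some
    conjunctive component of the query's DNF, else 0."""
    return 1 if tuple(doc_vector) in Q_DNF else 0

print(sim((1, 1, 0)))  # 1 : matches the component (1,1,0)
print(sim((0, 1, 1)))  # 0 : no component matches (ka is absent)
```

Note the all-or-nothing behavior: a document either matches a conjunctive component exactly or scores 0, which is precisely the "no partial match" disadvantage listed on the previous slide.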

Page 14

Boolean Model (Cont.)

q = 병렬 ∧ (프로그램 ∨ 시스템)   (parallel ∧ (program ∨ system))
q_dnf = (1, 1, 1) ∨ (1, 1, 0) ∨ (1, 0, 1)

Document | 병렬 (parallel) | 프로그램 (program) | 시스템 (system) | ... | Similarity
---------+-----------------+--------------------+-----------------+-----+-----------
001      | 1               | 0                  | 1               | ... | 1
002      | 0               | 0                  | 1               | ... | 0
003      | 0               | 1                  | 1               | ... | 0
004      | 1               | 1                  | 0               | ... | 1

Page 15

2.5.3 Vector model

Motivation

– Binary weights are too limiting

• Assign non-binary weights to index terms

– A framework in which partial matching is possible

• Instead of attempting to predict whether a document is relevant or not

• Rank the documents according to their degree of similarity to the query

Page 16

Vector model (Cont.)

Definition

– q = (w_1q, w_2q, ..., w_tq), with w_iq ≥ 0
– d_j = (w_1j, w_2j, ..., w_tj), with w_ij ≥ 0

sim(d_j, q) = (d_j · q) / (|d_j| × |q|)
            = Σ_{i=1..t} (w_ij × w_iq) / ( sqrt(Σ_{i=1..t} w_ij²) × sqrt(Σ_{i=1..t} w_iq²) )

– 0 ≤ sim(d_j, q) ≤ 1 (cosine similarity)
– |q| : does not affect the ranking
– |d_j| : normalization in the space of the documents
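The cosine formula above computes directly, for example as the following sketch (plain Python lists stand in for the weight vectors):

```python
import math

def cosine_sim(d, q):
    """sim(d_j, q) = (d_j . q) / (|d_j| * |q|).
    For non-negative term weights the value lies between 0 and 1."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty document or query matches nothing
    return dot / (norm_d * norm_q)

print(cosine_sim([1.0, 2.0], [2.0, 4.0]))  # parallel vectors: ~1.0
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 0.0
```

Because |q| is the same for every document, dividing by it never changes the ordering; only |d_j| matters for the ranking, as the slide notes.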

Page 17

Vector model (Cont.)

Clustering Problem

– Intra-cluster similarity

• What are the features which better describe the objects

– Inter-cluster similarity

• What are the features which better distinguish the objects

IR Problem

– Intra-cluster similarity (tf factor)

• Raw frequency of a term ki inside a document dj

– Inter-cluster similarity (idf factor)

• Inverse of the frequency of a term ki among the documents

Page 18

Vector model (Cont.)

Weighting Scheme

– Term Frequency (tf)
  • A measure of how well the term describes the document contents

    f_ij = freq_ij / max_l freq_lj
    (freq_ij : raw frequency of the term k_i in the document d_j)

– Inverse Document Frequency (idf)
  • Terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one

    idf_i = log(N / n_i)
    (n_i : number of documents in which the index term k_i appears; N : total number of documents)

Page 19

Vector model (Cont.)

Best known index term weighting scheme
– Balances tf and idf (the tf-idf scheme)

  w_ij = f_ij × idf_i

Query term weighting scheme

  w_iq = (0.5 + 0.5 × f_iq) × idf_i
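A small sketch of these two weighting formulas (the base-10 logarithm is assumed here, which matches the idf values in the worked example that follows):

```python
import math

def tf(freq_ij, max_freq_j):
    # Normalized term frequency: f_ij = freq_ij / max_l freq_lj
    return freq_ij / max_freq_j

def idf(N, n_i):
    # Inverse document frequency: idf_i = log(N / n_i), base 10 assumed
    return math.log10(N / n_i)

def doc_weight(freq_ij, max_freq_j, N, n_i):
    # Document term weight: w_ij = f_ij * idf_i
    return tf(freq_ij, max_freq_j) * idf(N, n_i)

def query_weight(freq_iq, max_freq_q, N, n_i):
    # Query term weight: w_iq = (0.5 + 0.5 * f_iq) * idf_i
    return (0.5 + 0.5 * tf(freq_iq, max_freq_q)) * idf(N, n_i)

# A term occurring in 2 of N = 3 documents: idf = log10(3/2)
print(round(idf(3, 2), 3))  # 0.176
```

A term appearing in every document gets idf = log(N/N) = 0, so stopwords like "a", "in", and "of" contribute nothing to the similarity score.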

Page 20

Vector model (Cont.)

D1 : "Shipment of gold damaged in a fire"
D2 : "Delivery of silver arrived in a silver truck"
D3 : "Shipment of gold arrived in a truck"

Q : "gold silver truck"

idf_i = log(N / n_i), with N = 3; w_ij = f_ij × idf_i; w_iq = f_iq × idf_i

Term | a | arrived | damaged | delivery | fire | gold | in | of | silver | shipment | truck
-----+---+---------+---------+----------+------+------+----+----+--------+----------+------
idf  | 0 | .176    | .477    | .477     | .477 | .176 | 0  | 0  | .477   | .176     | .176

Page 21

Vector model (Cont.)

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11

D1 0 0 .477 0 .477 .176 0 0 0 .176 0

D2 0 .176 0 .477 0 0 0 0 .954 0 .176

D3 0 .176 0 0 0 .176 0 0 0 .176 .176

Q 0 0 0 0 0 .176 0 0 .477 0 .176

SC(Q, D_j) = Σ_{i=1..t} (w_iq × w_ij)

SC(Q, D1) = (0)(0) + (0)(0) + (0)(.477) + (0)(0) + (0)(.477) + (.176)(.176)
          + (0)(0) + (0)(0) + (.477)(0) + (0)(.176) + (.176)(0)
          = (0.176)² ≈ 0.031

SC(Q, D2) = (.954)(.477) + (.176)² ≈ 0.486

SC(Q, D3) = (.176)² + (.176)² ≈ 0.062

Hence, the ranking would be D2, D3, D1

(The document vectors are not normalized in this example.)
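The ranking can be reproduced with a short script; the weight vectors below are copied from the table on this slide, and the score is the unnormalized inner product SC(Q, D_j) = Σ_i (w_iq × w_ij):

```python
# Reproduce the "gold silver truck" example ranking.
docs = {
    "D1": [0, 0, .477, 0, .477, .176, 0, 0, 0, .176, 0],
    "D2": [0, .176, 0, .477, 0, 0, 0, 0, .954, 0, .176],
    "D3": [0, .176, 0, 0, 0, .176, 0, 0, 0, .176, .176],
}
query = [0, 0, 0, 0, 0, .176, 0, 0, .477, 0, .176]

def score(q, d):
    # Inner product of the query and document weight vectors
    return sum(wq * wd for wq, wd in zip(q, d))

ranking = sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)
print([(name, round(score(query, docs[name]), 3)) for name in ranking])
# [('D2', 0.486), ('D3', 0.062), ('D1', 0.031)]
```

D2 wins because "silver" occurs twice in it (w = 2 × .477 = .954) and "silver" is the rarest, hence highest-idf, query term.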

Page 22

Vector model (Cont.)

Advantages
– The term-weighting scheme improves retrieval performance
– The partial matching strategy allows retrieval of documents that approximate the query conditions
– The cosine ranking formula sorts the documents according to their degree of similarity to the query

Disadvantages
– Index terms are assumed to be mutually independent
  • The tf-idf scheme does not account for index term dependencies
  • However, in practice, consideration of term dependencies might be a disadvantage