a language modeling approach to information retrieval 한 경 수 2002-04-02 introduction previous...

A Language Modeling Approach

to Information Retrieval

한 경 수2002-04-02

Introduction Previous Work Model Description Empirical Results Conclusions and Future Work Relevance Feedback in LM

Introduction Previous Work Model Description Empirical Results Conclusions and Future Work Relevance Feedback in LM

한경수 LM Approach to IR 2

Indexing model of probabilistic retrieval model

A model of the assignment of indexing terms to documents

Indexing model of 2-Poisson model Indicate the useful indexing terms by means of the

differences in their rate of occurrence in documents elite for a given term vs. those without the property of eliteness.

The current indexing models have not led to improved retrieval results. Due to 2 unwarranted assumptions Documents are members of pre-defined classes.

– Combinatorial explosion of elite sets The parametric assumption

– Unnecessary to construct a parametric model of the data when we have the actual data.

Introduction

)|()|()|()|()|( REPETFPREPETFPRTFP


Retrieval based on probabilistic LM

Treat the generation of queries as a random process.

Approach Infer a language model for each document. Estimate the probability of generating the query according

to each of these models. Rank the documents according to these probabilities.

Intuition Users …

– Have a reasonable idea of terms that are likely to occur in documents of interest.

– Will choose query terms that distinguish these documents from others in the collection.

Collection statistics … Are integral parts of the language model. Are not used heuristically as in many other approaches.

Introduction


Probabilistic IR

query

d1

d2

dn

…

Information need

document collection

matchingmatching

),|( dQRP

Introduction


IR based on LM

query

d1

d2

dn

…

Information need

document collection

generationgeneration

)|( dMQP 1dM

2dM

…

ndM

Introduction


Previous Work Difference from the 2-Poisson model

Don’t make distributional assumptions. Don’t distinguish a subset of specialty words.

– Don’t assume a preexisting classification of documents into elite and non-elite sets.

Difference from Robertson & Sparck Jones model and Croft & Harper model Don’t focus on relevance except to the extent that the

process of query production is correlated with it. Fuhr model INQUERY Kwok, Wong & Yao, Kalt

Previous Work


Query generation probability Ranking formula

The probability of producing the query given the language model of document d

Model Description

Qt d

dt

Qtdmld

dl

tf

MtpMQp

),(

)|(ˆ)|(ˆ

Assumption:Given a particular language model, the query term occur independently

),( dttf

ddl

: language model of document d

: raw tf of term t in document d

: total number of tokens in document d

dM

)|()(

)|()(),(

dMQpdp

dQpdpdQp


Insufficient data Zero probability

Don’t wish to assign a probability of zero to a document that is missing one or more of the query terms.

Somewhat radical assumption to infer that Assumption

A non-occurring term is possible, but no more likely than what would be expected by chance in the collection.

If ,

Model Description

0)|( dMtp

0),( dttf

cs

cs

cfMtp t

d )|(

tcf : raw count of term t in the collection

: raw collection size(total number of tokens in the collection)


Averaging for robustness If we could get an arbitrary sized sample of

data from we could be reasonably confident in the maximum likelihood estimator. We only have a document sized sample from that

distribution. To circumvent this problem,

Need an estimate from a larger amount of data

Model Description

dM

tdf

t

dtddml

avg df

Mtp

tp

)(

)|(

)(ˆ

: document frequency of t


The Risk Cannot and are not assuming that every document

containing t is drawn from the same language model. There is some risk in using the mean to estimate

If we used the mean by itself, there would be no distinction between documents with different term frequencies.

The risk for a term t in a document d (geometric distribution)

As the tf gets further away from the normalized mean, the mean probability becomes riskier to use as an estimate.

Model Description

)|( dMtp

dttf

t

t

tdt f

f

fR

,

)0.1()0.1(

0.1ˆ,

davg dltp )(tf : mean term frequency of term t in documents where t occurs

normalized by document length (= )


Combining the two estimatesModel Description

otherwise

0 if )(),()|(ˆ

),(

ˆ)ˆ0.1( ,,

cs

cftftpdtp

Mtpt

dtR

avgR

ml

d

dtdt

Qt

dQt

dd MtpMtpMQp )|(ˆ0.1)|(ˆ)|(ˆ

)|(ˆ)(ˆ),(ˆ dMQpdpdQp


Analysis of the formulation Generalization: formulation of the LM for IR

Conception The user has a document in mind, and generate the

query from this document. The equation represents the probability that the

document that the user had in mind was in fact this one.

Model Description

Qt

dMtptpdpdQp ))|()()1(()(),(

general language model

individual-document model


Experiment Environment Data

TREC topics 202-250 on TREC disks 2 and 3– Natural language queries consisting of one sentence each

TREC topics 51-100 on TREC disk 3 using the concept fields

– Lists of good terms

Empirical Results

<num>Number: 054

<dom>Domain: International Economics

<title>Topic: Satellite Launch Contracts

<desc>Description:

…

<con>Concept(s):

1. Contract, agreement

2. Launch vehicle, rocket, payload, satellite

3. Launch services, …

…

<num>Number: 054

<dom>Domain: International Economics

<title>Topic: Satellite Launch Contracts

<desc>Description:

…

<con>Concept(s):

1. Contract, agreement

2. Launch vehicle, rocket, payload, satellite

3. Launch services, …

…


Recall/Precision Experiments(1)

Empirical Results


Recall/Precision Experiments(2)

Empirical Results


Improving the Basic Model(1) Smoothing the estimate of the average

probability for terms with low document frequency The estimate is based on a small amount of data So could be sensitive to outliers

Binned estimate Bin the low frequency data by document frequency

– Cutoff: df=100 Use the binned estimate for the average

Empirical Results


Improving the Basic Model(2)Empirical Results


Improving the Basic Model(3)Empirical Results


Conclusions & Future Work Conclusions

Novel way of looking at the problem of text retrieval based on probabilistic language modeling

– Conceptually simple and explanatory LM will provide effective retrieval and can be improved to

the extent that the following conditions can be met– Our language models are accurate representations of the

data.– Users understand our approach to retrieval.– Users have a some sense of term distribution.

The ability to think about retrieval in a new way Future Work

Estimate of default probability– Current estimator could in some strange cases assign a

higher probability to a non-occurring query term. Query expansion

Conclusions & Future Work


LM approach to multiple relevant documents

Current LM approach Allow for N+1 language models

– N(collection size) + general language model The relationship between general language model and

the individual document models is never raised.– How can a document be generated from one language

model when the entire collection is generated from a different one?

We need … General model for some accumulation of text, which is

modified (not replaced) by a local model for some smaller part of the same text.

Relevance Feedback in LM


3-level model(1) 3-level model

1. Whole collection model ( )2. Specific-topic model; relevant-documents model ( )3. Individual-document model ( )

Relevance hypothesis A request(query; topic) is generated from a specific-topic

model { , }. Iff a document is relevant to the topic, the same model

will apply to the document.– It will replace part of the individual-document model in

explaining the document. The probability of relevance of a document

– The probability that this model explains part of the document

– The probability that the { , , } combination is better than the { , } combination

Relevance Feedback in LM

CM

dMTM

CM TM

CM TM dMCM dM


3-level model(2)

query

d1

d2

dn

…

Information need

document collection

generationgeneration

),,|( dTC MMMQP

1dM

2dM

…

ndM

CM

1TM

2TM

mTM

…),|( TC MMQP


Geometric distribution(1) 기하분포

첫번째 성공을 거둘 때까지 성공률이 p 인 베르누이 시행을 반복 할 때 , 총 시행횟수를 X 라 두면 이 확률변수 X 가 갖는 분포가 기하분포이다 .

,...3,2,1 ][ 1 xpqxXP x

22 ,

1

p

q

p


Geometric distribution(2) 예

어떤 실험을 한번 하는데 드는 비용은 10 만원이다 . 이 실험이 성공할 확률은 0.2 이고 성공할 때까지 이 실험을 반복한다고 할 때 실험에 드는 총비용을 얼마로 예상하면 될까 ?

000,5002.0

1000,100)(000,100)( XEYE

,...3,2,1 )8.0)(2.0(][ 1 xxXP x