a language modeling approach to information retrieval 한 경 수 2002-04-02 introduction previous...
TRANSCRIPT
![Page 1: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/1.jpg)
A Language Modeling Approach
to Information Retrieval
한 경 수2002-04-02
Introduction Previous Work Model Description Empirical Results Conclusions and Future Work Relevance Feedback in LM
Introduction Previous Work Model Description Empirical Results Conclusions and Future Work Relevance Feedback in LM
![Page 2: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/2.jpg)
한경수 LM Approach to IR 2
Indexing model of probabilistic retrieval model
A model of the assignment of indexing terms to documents
Indexing model of 2-Poisson model Indicate the useful indexing terms by means of the
differences in their rate of occurrence in documents elite for a given term vs. those without the property of eliteness.
The current indexing models have not led to improved retrieval results. Due to 2 unwarranted assumptions Documents are members of pre-defined classes.
– Combinatorial explosion of elite sets The parametric assumption
– Unnecessary to construct a parametric model of the data when we have the actual data.
Introduction
)|()|()|()|()|( REPETFPREPETFPRTFP
![Page 3: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/3.jpg)
한경수 LM Approach to IR 3
Retrieval based on probabilistic LM
Treat the generation of queries as a random process.
Approach Infer a language model for each document. Estimate the probability of generating the query according
to each of these models. Rank the documents according to these probabilities.
Intuition Users …
– Have a reasonable idea of terms that are likely to occur in documents of interest.
– Will choose query terms that distinguish these documents from others in the collection.
Collection statistics … Are integral parts of the language model. Are not used heuristically as in many other approaches.
Introduction
![Page 4: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/4.jpg)
한경수 LM Approach to IR 4
Probabilistic IR
query
d1
d2
dn
…
Information need
document collection
matchingmatching
),|( dQRP
Introduction
![Page 5: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/5.jpg)
한경수 LM Approach to IR 5
IR based on LM
query
d1
d2
dn
…
Information need
document collection
generationgeneration
)|( dMQP 1dM
2dM
…
ndM
Introduction
![Page 6: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/6.jpg)
한경수 LM Approach to IR 6
Previous Work Difference from the 2-Poisson model
Don’t make distributional assumptions. Don’t distinguish a subset of specialty words.
– Don’t assume a preexisting classification of documents into elite and non-elite sets.
Difference from Robertson & Sparck Jones model and Croft & Harper model Don’t focus on relevance except to the extent that the
process of query production is correlated with it. Fuhr model INQUERY Kwok, Wong & Yao, Kalt
Previous Work
![Page 7: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/7.jpg)
한경수 LM Approach to IR 7
Query generation probability Ranking formula
The probability of producing the query given the language model of document d
Model Description
Qt d
dt
Qtdmld
dl
tf
MtpMQp
),(
)|(ˆ)|(ˆ
Assumption:Given a particular language model, the query term occur independently
),( dttf
ddl
: language model of document d
: raw tf of term t in document d
: total number of tokens in document d
dM
)|()(
)|()(),(
dMQpdp
dQpdpdQp
![Page 8: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/8.jpg)
한경수 LM Approach to IR 8
Insufficient data Zero probability
Don’t wish to assign a probability of zero to a document that is missing one or more of the query terms.
Somewhat radical assumption to infer that Assumption
A non-occurring term is possible, but no more likely than what would be expected by chance in the collection.
If ,
Model Description
0)|( dMtp
0),( dttf
cs
cs
cfMtp t
d )|(
tcf : raw count of term t in the collection
: raw collection size(total number of tokens in the collection)
![Page 9: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/9.jpg)
한경수 LM Approach to IR 9
Averaging for robustness If we could get an arbitrary sized sample of
data from we could be reasonably confident in the maximum likelihood estimator. We only have a document sized sample from that
distribution. To circumvent this problem,
Need an estimate from a larger amount of data
Model Description
dM
tdf
t
dtddml
avg df
Mtp
tp
)(
)|(
)(ˆ
: document frequency of t
![Page 10: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/10.jpg)
한경수 LM Approach to IR 10
The Risk Cannot and are not assuming that every document
containing t is drawn from the same language model. There is some risk in using the mean to estimate
If we used the mean by itself, there would be no distinction between documents with different term frequencies.
The risk for a term t in a document d (geometric distribution)
As the tf gets further away from the normalized mean, the mean probability becomes riskier to use as an estimate.
Model Description
)|( dMtp
dttf
t
t
tdt f
f
fR
,
)0.1()0.1(
0.1ˆ,
davg dltp )(tf : mean term frequency of term t in documents where t occurs
normalized by document length (= )
![Page 11: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/11.jpg)
한경수 LM Approach to IR 11
Combining the two estimatesModel Description
otherwise
0 if )(),()|(ˆ
),(
ˆ)ˆ0.1( ,,
cs
cftftpdtp
Mtpt
dtR
avgR
ml
d
dtdt
Qt
dQt
dd MtpMtpMQp )|(ˆ0.1)|(ˆ)|(ˆ
)|(ˆ)(ˆ),(ˆ dMQpdpdQp
![Page 12: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/12.jpg)
한경수 LM Approach to IR 12
Analysis of the formulation Generalization: formulation of the LM for IR
Conception The user has a document in mind, and generate the
query from this document. The equation represents the probability that the
document that the user had in mind was in fact this one.
Model Description
Qt
dMtptpdpdQp ))|()()1(()(),(
general language model
individual-document model
![Page 13: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/13.jpg)
한경수 LM Approach to IR 13
Experiment Environment Data
TREC topics 202-250 on TREC disks 2 and 3– Natural language queries consisting of one sentence each
TREC topics 51-100 on TREC disk 3 using the concept fields
– Lists of good terms
Empirical Results
<num>Number: 054
<dom>Domain: International Economics
<title>Topic: Satellite Launch Contracts
<desc>Description:
…
<con>Concept(s):
1. Contract, agreement
2. Launch vehicle, rocket, payload, satellite
3. Launch services, …
…
<num>Number: 054
<dom>Domain: International Economics
<title>Topic: Satellite Launch Contracts
<desc>Description:
…
<con>Concept(s):
1. Contract, agreement
2. Launch vehicle, rocket, payload, satellite
3. Launch services, …
…
![Page 14: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/14.jpg)
한경수 LM Approach to IR 14
Recall/Precision Experiments(1)
Empirical Results
![Page 15: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/15.jpg)
한경수 LM Approach to IR 15
Recall/Precision Experiments(2)
Empirical Results
![Page 16: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/16.jpg)
한경수 LM Approach to IR 16
Improving the Basic Model(1) Smoothing the estimate of the average
probability for terms with low document frequency The estimate is based on a small amount of data So could be sensitive to outliers
Binned estimate Bin the low frequency data by document frequency
– Cutoff: df=100 Use the binned estimate for the average
Empirical Results
![Page 17: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/17.jpg)
한경수 LM Approach to IR 17
Improving the Basic Model(2)Empirical Results
![Page 18: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/18.jpg)
한경수 LM Approach to IR 18
Improving the Basic Model(3)Empirical Results
![Page 19: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/19.jpg)
한경수 LM Approach to IR 19
Conclusions & Future Work Conclusions
Novel way of looking at the problem of text retrieval based on probabilistic language modeling
– Conceptually simple and explanatory LM will provide effective retrieval and can be improved to
the extent that the following conditions can be met– Our language models are accurate representations of the
data.– Users understand our approach to retrieval.– Users have a some sense of term distribution.
The ability to think about retrieval in a new way Future Work
Estimate of default probability– Current estimator could in some strange cases assign a
higher probability to a non-occurring query term. Query expansion
Conclusions & Future Work
![Page 20: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/20.jpg)
한경수 LM Approach to IR 20
LM approach to multiple relevant documents
Current LM approach Allow for N+1 language models
– N(collection size) + general language model The relationship between general language model and
the individual document models is never raised.– How can a document be generated from one language
model when the entire collection is generated from a different one?
We need … General model for some accumulation of text, which is
modified (not replaced) by a local model for some smaller part of the same text.
Relevance Feedback in LM
![Page 21: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/21.jpg)
한경수 LM Approach to IR 21
3-level model(1) 3-level model
1. Whole collection model ( )2. Specific-topic model; relevant-documents model ( )3. Individual-document model ( )
Relevance hypothesis A request(query; topic) is generated from a specific-topic
model { , }. Iff a document is relevant to the topic, the same model
will apply to the document.– It will replace part of the individual-document model in
explaining the document. The probability of relevance of a document
– The probability that this model explains part of the document
– The probability that the { , , } combination is better than the { , } combination
Relevance Feedback in LM
CM
dMTM
CM TM
CM TM dMCM dM
![Page 22: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/22.jpg)
한경수 LM Approach to IR 22
3-level model(2)
query
d1
d2
dn
…
Information need
document collection
generationgeneration
),,|( dTC MMMQP
1dM
2dM
…
ndM
CM
1TM
2TM
mTM
…),|( TC MMQP
![Page 23: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/23.jpg)
한경수 LM Approach to IR 23
Geometric distribution(1) 기하분포
첫번째 성공을 거둘 때까지 성공률이 p 인 베르누이 시행을 반복 할 때 , 총 시행횟수를 X 라 두면 이 확률변수 X 가 갖는 분포가 기하분포이다 .
,...3,2,1 ][ 1 xpqxXP x
22 ,
1
p
q
p
![Page 24: A Language Modeling Approach to Information Retrieval 한 경 수 2002-04-02 Introduction Previous Work Model Description Empirical Results Conclusions](https://reader035.vdocuments.site/reader035/viewer/2022070411/56649f4f5503460f94c71b6f/html5/thumbnails/24.jpg)
한경수 LM Approach to IR 24
Geometric distribution(2) 예
어떤 실험을 한번 하는데 드는 비용은 10 만원이다 . 이 실험이 성공할 확률은 0.2 이고 성공할 때까지 이 실험을 반복한다고 할 때 실험에 드는 총비용을 얼마로 예상하면 될까 ?
000,5002.0
1000,100)(000,100)( XEYE
,...3,2,1 )8.0)(2.0(][ 1 xxXP x