Smoothing Methods for LM in IR (Alejandro Figueroa)

Post on 18-Dec-2015

TRANSCRIPT

Page 1:

Smoothing Methods for LM in IR

Alejandro Figueroa

Page 2:

Outline

• The linguistic phenomena behind the retrieval of documents.
• Language Modeling Approach.
• Smoothing methods.
  – Overview.
  – Methods.
  – Parameter setting.
• Interpolation vs. Back-off.
• Comparison of methods.
• Combination of methods.
• Personal outlook and conclusions.

Page 3:

The Linguistic Phenomena behind IR

• "Reducing Information Variation on Texts" (Agata Savary and Christian Jacquemin).
• Work in our QA Group at DFKI.

Page 4:

Information Variation

• The problem: simple keyword matching is not enough to retrieve the best documents for a query. For example: "When was Albert Einstein born?"
  – The Nobel prize of physics Albert Einstein was born in 1879 in Ulm, Germany.
  – Born: 14 March 1879 in Ulm, Württemberg, Germany.
  – Physics Nobel prize Albert Einstein was born at Ulm, in Württemberg, Germany, on March 14, 1879.
  – Died 18 Apr 1955 (born 14 Mar 1879) German-American physicist.
• The same information can be found in several ways.

Page 5:

Information Variation

• Kinds of variation:
  – Graphic: "14 March 1879" and "14 Mar 1879".
  – Morphological: "Physics Nobel prize".
  – Syntactical: "German-American physicist".
  – Semantic: "Albert Einstein was born at Ulm" and "German-American physicist".
• Appropriateness:
  – Precision.
  – Economy.

Page 6:

Language Modeling Approach

• "A Study of Smoothing Methods for Language Models Applied to Information Retrieval" (Chengxiang Zhai and John Lafferty).

Page 7:

Language Modeling

• The probability that a query Q was generated by a probabilistic model based on a document:

  q = q1 q2 ... qn,  d = d1 d2 ... dm

  p(d|q) ∝ p(q|d) * p(d)

• Uni-gram model:

  p(q|d) = Π_{i=1}^{n} p(qi|d)

  Problem: p(q|d) = 0 as soon as any query term does not occur in d.
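A minimal sketch of the unsmoothed uni-gram scoring just described; the toy document and queries are invented for illustration:

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """Unsmoothed unigram model: p(q|d) = prod_i p_ml(q_i|d)."""
    counts = Counter(doc_terms)
    n = len(doc_terms)
    p = 1.0
    for t in query_terms:
        p *= counts[t] / n  # becomes 0 as soon as a term is unseen in d
    return p

doc = "albert einstein was born in ulm in 1879".split()
print(query_likelihood(["einstein", "born"], doc))  # (1/8)*(1/8) = 0.015625
print(query_likelihood(["einstein", "died"], doc))  # 0.0 -- the sparseness problem
```

The second call shows exactly why smoothing is needed: one unseen term zeroes out the whole query likelihood.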

Page 8:

Language Modeling

• Smoothing methods make use of two probabilities for the model: Ps(w|d) for seen words and Pu(w|d) for unseen words.

  log p(q|d) = Σ_{i=1}^{n} log p(qi|d)

  Pu(qi|d) = αd P(qi|C)

  log p(q|d) = Σ_{i: c(qi;d)>0} log [ Ps(qi|d) / Pu(qi|d) ] + Σ_{i=1}^{n} log Pu(qi|d)

Page 9:

Language Modeling

  log p(q|d) = Σ_{i: c(qi;d)>0} log [ Ps(qi|d) / (αd P(qi|C)) ] + n log αd + Σ_{i=1}^{n} log P(qi|C)

  The first sum is carried out over the matched terms only.
  Longer documents => less smoothing, but longer documents => greater penalty!

Page 10:

Smoothing Methods

Page 11:

Overview

• The problem: adjust the MLE to compensate for data sparseness.
• The role of smoothing:
  – Make the LM more accurate.
  – Explain the non-informative words in the query.
• Goals of the work:
  – How sensitive is retrieval performance to the smoothing of a document LM?
  – How should the model and the parameters be chosen?

Page 12:

Overview

• The unsmoothed model is the MLE:

  P_ml(w|d) = c(w;d) / Σ_{w'∈V} c(w';d)

• The general form of a smoothed model:

  P(w|d) = Ps(w|d)     if word w is seen
           αd P(w|C)   otherwise

  where αd is chosen so that the probabilities sum to one:

  αd = (1 - Σ_{w: c(w;d)>0} Ps(w|d)) / (1 - Σ_{w: c(w;d)>0} P(w|C))
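The normalization constant αd above falls out directly from the seen-word sums. A small sketch, where the smoothed estimates passed in stand for whatever Ps(w|d) the chosen method produces (the toy numbers are made up):

```python
def alpha_d(ps_seen, p_coll_seen):
    """alpha_d = (1 - sum of Ps(w|d) over seen w) / (1 - sum of P(w|C) over seen w).

    ps_seen:     dict mapping each seen word to its smoothed Ps(w|d)
    p_coll_seen: dict mapping each seen word to its collection prob P(w|C)
    """
    return (1.0 - sum(ps_seen.values())) / (1.0 - sum(p_coll_seen.values()))

# Toy numbers: the seen words keep 0.9 of the document mass and cover
# 0.2 of the collection mass, so unseen words are scaled by 0.1/0.8.
print(alpha_d({"a": 0.5, "b": 0.4}, {"a": 0.15, "b": 0.05}))  # 0.125
```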

Page 13:

Overview

• Smoothing: tackles the effect of statistical variability in small training sets.
• Discounting: the relative frequencies of seen events are discounted; the gained probability mass is then distributed over the unseen words.

Page 14:

Smoothing Methods

• Based on the Good-Turing idea: estimate the probability of new events by taking the count of singleton events and dividing it by the total number of events (a value in (0,1)).

Page 15:

Good-Turing Idea

  tf* = (tf + 1) E(N_{tf+1}) / E(N_{tf})

  The probability of a term with frequency tf is given by:

  P_GT(t|d) = tf* / N_d = (tf + 1) S(N_{tf+1}) / (S(N_{tf}) N_d)

  N_{tf} = number of terms with frequency tf in a document.
  E(N_{tf}) = expected value of N_{tf} (S is a smoothed estimate of it).
  N_d = total number of terms occurring in d.
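A sketch of the Good-Turing estimate for one document. It uses the raw frequency-of-frequency counts N_tf in place of the expected values E(N_tf) (no curve fitting), so it is only an illustration of the idea, not the full method:

```python
from collections import Counter

def good_turing(doc_terms):
    """Good-Turing adjusted probabilities for the terms of one document."""
    tf = Counter(doc_terms)          # term -> frequency
    n_tf = Counter(tf.values())      # frequency -> number of terms with it
    n_d = len(doc_terms)
    probs = {}
    for term, f in tf.items():
        if n_tf.get(f + 1):          # adjusted count tf* = (tf+1) N_{tf+1} / N_tf
            f_star = (f + 1) * n_tf[f + 1] / n_tf[f]
        else:
            f_star = f               # no higher-frequency bucket: keep raw count
        probs[term] = f_star / n_d
    p_unseen = n_tf.get(1, 0) / n_d  # mass reserved for unseen terms: N_1 / N_d
    return probs, p_unseen

probs, p0 = good_turing("a a a b b c".split())
print(probs["c"], p0)  # singleton "c" gets 2/6; unseen mass is 1/6
```

With raw counts the adjusted probabilities need not sum to one; the curve-fitted S(N_tf) of the paper fixes that.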

Page 16:

Smoothing Methods

• Jelinek-Mercer method: involves a linear interpolation of the ML model with the collection model.

  P(w|d) = (1 - λ) P_ml(w|d) + λ P(w|C)
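The Jelinek-Mercer formula in code; the 4-term document and 6-term collection are toy data, and the default λ = 0.1 is just one reasonable choice:

```python
from collections import Counter

doc = "a b a c".split()                 # toy document
coll = Counter("a b c d a b".split())   # toy collection, 6 tokens

def p_jm(w, doc_terms, coll_counts, coll_len, lam=0.1):
    """Jelinek-Mercer: p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|C)."""
    counts = Counter(doc_terms)
    p_ml = counts[w] / len(doc_terms)
    p_c = coll_counts[w] / coll_len
    return (1 - lam) * p_ml + lam * p_c

print(p_jm("a", doc, coll, 6, lam=0.5))  # 0.5*0.5 + 0.5*(2/6) ≈ 0.4167
```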

Page 17:

Smoothing Methods

• Absolute discounting: decrease the probability of seen words by subtracting a constant from their counts.

  Ps(w|d) = max(c(w;d) - δ, 0) / Σ_{w'∈V} c(w';d) + σ P(w|C)

  with σ = δ |d|_u / |d|, where |d|_u is the number of unique terms in d.
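The same discounting formula as a sketch, reusing the toy document/collection shape; δ = 0.7 is the value the experiments later single out:

```python
from collections import Counter

doc = "a b a c".split()                 # |d| = 4, |d|_u = 3
coll = Counter("a b c d a b".split())   # toy collection, 6 tokens

def p_abs(w, doc_terms, coll_counts, coll_len, delta=0.7):
    """Absolute discounting:
    p(w|d) = max(c(w;d) - delta, 0)/|d| + (delta * |d|_u / |d|) * p(w|C)."""
    counts = Counter(doc_terms)
    d_len = len(doc_terms)
    sigma = delta * len(counts) / d_len  # delta * |d|_u / |d|
    return max(counts[w] - delta, 0) / d_len + sigma * coll_counts[w] / coll_len

print(p_abs("a", doc, coll, 6))  # max(2-0.7,0)/4 + 0.525*(2/6) = 0.5
```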

Page 18:

Smoothing Methods

• Bayesian smoothing using Dirichlet priors: the model is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution:

  P(w|d) = (c(w;d) + μ P(w|C)) / (Σ_{w'∈V} c(w';d) + μ)

• The idea is to adjust the probabilities according to the query.
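And the Dirichlet-prior estimate in the same style; the usual default of μ around 2000 is only a convention, and μ = 6 in the demo is picked to keep the arithmetic readable:

```python
from collections import Counter

doc = "a b a c".split()
coll = Counter("a b c d a b".split())   # toy collection, 6 tokens

def p_dirichlet(w, doc_terms, coll_counts, coll_len, mu=2000.0):
    """Dirichlet prior: p(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu)."""
    counts = Counter(doc_terms)
    p_c = coll_counts[w] / coll_len
    return (counts[w] + mu * p_c) / (len(doc_terms) + mu)

print(p_dirichlet("a", doc, coll, 6, mu=6.0))  # (2 + 6*(2/6)) / (4 + 6) = 0.4
```

Note how μ acts as a pseudo-count budget: short documents are pulled harder toward the collection model than long ones.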

Page 19:

Summary: Smoothing Methods

  Method                 Ps(w|d)                                         αd               Parameter
  Jelinek-Mercer         (1-λ) P_ml(w|d) + λ P(w|C)                      λ                λ
  Dirichlet              (c(w;d) + μ P(w|C)) / (|d| + μ)                 μ / (|d| + μ)    μ
  Absolute discounting   max(c(w;d)-δ, 0)/|d| + (δ|d|_u/|d|) P(w|C)      δ |d|_u / |d|    δ

Page 20:

Parameter Setting

• 5 databases from TREC:
  – Financial Times on disk 4.
  – FBIS on disk 5.
  – Los Angeles Times on disk 5.
  – Disk 4 and disk 5 minus Congressional Record.
  – The TREC8 web data.
• Queries:
  – Topics 351-400 (TREC 7 ad-hoc task).
  – Topics 401-450 (TREC 8 ad-hoc web task).

Page 21:

Parameter Setting

<num> Number: 384
<title> space station moon
<desc> Description:
Identify documents that discuss the building of a space station with the intent of colonizing the moon.
<narr> Narrative:
A relevant document will discuss the purpose of a space station, initiatives towards colonizing the moon, impediments which thus far have thwarted such a project, plans currently underway or in the planning stages for such a venture; cost, countries prepared to make a commitment of men, resources, facilities and money to accomplish such a feat.
</top>

TREC7

Page 22:

Parameter Setting

<num> Number: 414
<title> Cuba, sugar, exports
<desc> Description:
How much sugar does Cuba export and which countries import it?
<narr> Narrative:
A relevant document will provide information regarding Cuba's sugar trade. Sugar production statistics are not relevant unless exports are mentioned explicitly.
</top>

TREC8

Page 23:

Parameter Setting

• Interaction of query length/type:
  – Two different versions of each set of queries:
    • Title only (2 or 3 words).
    • A long version (title + description + narrative).
• Optimize the performance of each method by means of the non-interpolated average precision.

Page 24:

Parameter Setting

• Jelinek-Mercer smoothing:
  – Weight of a matched term:

  log( 1 + ((1-λ)/λ) * P_ml(qi|d) / P(qi|C) )

  As λ -> 1, log(1 + x) ≈ x, so the ranking approaches

  Σ_i P_ml(qi|d) / P(qi|C)

Page 25:

Parameter Setting

• Dirichlet priors:
  – Term weight:

  log( 1 + c(qi;d) / (μ P(qi|C)) ) = log( 1 + |d| P_ml(qi|d) / (μ P(qi|C)) )

  αd = μ / (μ + |d|) is a document-dependent length normalization factor that penalizes long documents.

Page 26:

Parameter Setting

• Absolute discounting: αd = δ |d|_u / |d| is document-dependent:
  – Larger for a document with a flatter distribution of words.
  – Weight of a matched term:

  log( 1 + c(qi;d) / (δ |d|_u P(qi|C)) )

Page 27:

Parameter Setting

• Conclusions, Jelinek-Mercer:
  – The precision is much more sensitive to λ for long queries than for title queries.
    • Long queries need more smoothing, that is, less emphasis on the relative weighting of terms.
  – In the web collection, it was sensitive to smoothing for title queries too.
  – For title queries the retrieval performance tends to be optimized when λ = 0.1.

Page 28:

Parameter Setting

• Conclusions, Dirichlet priors:
  – The precision is more sensitive to μ for long queries than for title queries, especially when μ is small.
  – When μ is large, all long queries performed better than short queries; the opposite holds when μ is small.
  – The optimal value of μ tends to be larger for long queries than for title queries.
  – The value of μ tends to vary from collection to collection.

Page 29:

Parameter Setting

• Conclusions, absolute discounting:
  – The precision is more sensitive to δ for long queries than for title queries.
  – The optimal value δ ≈ 0.7 does not seem to be much different for title queries and long queries.
  – Smoothing plays a more important role for long, verbose queries than for concise queries.

Page 30:

Interpolation vs. Back-off

Page 31:

Interpolation vs. Back-off

• Interpolation-based methods: the counts of the seen words are discounted, and the extra probability mass is shared by both the seen words and the unseen words.
• Back-off: trust the MLE for the high-count words; discount and redistribute mass only for the less common terms.

Page 32:

Interpolation vs. Back-off

• Interpolation:

  Ps(w|d) = P_dml(w) + αd P(w|C)
  Pu(w|d) = αd P(w|C)

  (P_dml is the discounted maximum-likelihood estimate.)

Page 33:

Interpolation vs. Back-off

• Back-off:

  Ps(w|d) = P_dml(w)
  Pu(w|d) = αd P(w|C) / (1 - Σ_{w'∈V: c(w';d)>0} P(w'|C))

Page 34:

Interpolation vs. Back-off

• Results:
  – The performance of the back-off strategy is more sensitive to the smoothing parameters.
    • Especially for Jelinek-Mercer and Dirichlet priors.
  – This sensitivity is smaller for the absolute discounting method, due to the lower upper bound on αd = δ |d|_u / |d|.

Page 35:

Comparison of Methods

Page 36:

Comparison of Methods

• For title queries:
  – Dirichlet prior is better than absolute discounting, which is better than Jelinek-Mercer.
  – Dirichlet prior performed extremely well on the web collection and is insensitive to the value of μ.
  – Many non-optimal runs were better than the other two methods.

Page 37:

Comparison of Methods

• For long queries:
  – Jelinek-Mercer is better than Dirichlet, which is better than absolute discounting.
  – All three methods perform better on long queries than on short queries.
  – Jelinek-Mercer is much more effective for long and verbose queries.

Page 38:

Comparison of Methods

• General remark:
  – The strong correlation between the effect of smoothing and the type of the query is unexpected.
  – Smoothing should only improve the accuracy of estimating the unigram language model based on a document; so what is the effect of verbose queries?

Page 39:

Query Length/Verbosity

• Four types of query:
  – Short keywords: only the title of the topic description.
  – Short verbose: using only the description field.
  – Long keywords: using the concept field, 28 keywords on average.
  – Long verbose: using the title, description, and narrative fields (more than 50 words on average).
• Generated for the TREC topics 1-150.
• Both keyword queries behaved in a similar way, and so did the verbose queries.
• The retrieval performance is much less sensitive to smoothing for the keyword queries than for the verbose queries.

Page 40:

Combining Methods

• "A General Language Model for Information Retrieval" (Fei Song and W. Bruce Croft).

Page 41:

A General LM for IR

• They propose an extensible model based on:
  – Good-Turing estimate.
  – Curve-fitting functions.
  – Model combinations.
• The idea of using n-grams is to take the local context into account; uni-gram models assume independence.

Page 42:

A General LM for IR

• The new model:
  1. Smooth each document with the Good-Turing estimate.
  2. Expand each document with the corpus.
  3. Consider term pairs and expand the unigram model to the bi-gram model.

Page 43:

Step 1: Good-Turing Idea, Revisited

  tf* = (tf + 1) E(N_{tf+1}) / E(N_{tf})

  N_{tf} = number of terms with frequency tf in a doc.
  E(N_{tf}) = expected value of N_{tf}.

  The probability of a term with frequency tf is given by:

  P_GT(t|d) = tf* / N_d = (tf + 1) S(N_{tf+1}) / (S(N_{tf}) N_d)

  N_d = total number of terms occurring in d.

Page 44:

Step 2

• Expanding a document model with the corpus:

  P_sum(t|d) = ω P_d(t|d) + (1 - ω) P_corpus(t)

  P_weighted(t|d) = P_d(t|d)^ω * P_corpus(t)^(1-ω)

Page 45:

Step 3

• Modeling a query as a set or as a sequence of terms:

  P_set(Q|d) = Π_{t∈Q} P(t|d) * Π_{t∉Q} (1.0 - P(t|d))

  P_seq(Q|d) = Π_{i=1}^{m} P(t_i|d)

Page 46:

Step 4

• Combining uni-grams and bi-grams:

  P(t_{i-1}, t_i | d) = λ1 P(t_i|d) + λ2 P(t_i | t_{i-1}, d),  with λ1 + λ2 = 1
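A sketch of this uni-gram/bi-gram interpolation for scoring a query as a sequence; the mixture weights λ1 = 0.3, λ2 = 0.7 and the toy document are invented, and the bi-gram probability is the plain MLE P(t_i|t_{i-1}, d) = c(t_{i-1}, t_i; d) / c(t_{i-1}; d):

```python
from collections import Counter

def score_bigram(query_terms, doc_terms, lam1=0.3, lam2=0.7):
    """Query score with P(t_{i-1}, t_i | d) ≈ lam1*P(t_i|d) + lam2*P(t_i|t_{i-1},d)."""
    uni = Counter(doc_terms)
    bi = Counter(zip(doc_terms, doc_terms[1:]))
    n = len(doc_terms)
    score = uni[query_terms[0]] / n                 # first term: uni-gram only
    for prev, cur in zip(query_terms, query_terms[1:]):
        p_uni = uni[cur] / n
        p_bi = bi[(prev, cur)] / uni[prev] if uni[prev] else 0.0
        score *= lam1 * p_uni + lam2 * p_bi
    return score

doc = "the space station orbits the moon".split()
print(score_bigram(["space", "station"], doc))  # (1/6) * (0.3*(1/6) + 0.7*1) = 0.125
```

A matched word pair ("space station") contributes through the bi-gram component, which is the point of the extension.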

Page 47:

Results

• Two collections:
  – The Wall Street Journal (WSJ), 250 MB, 74,520 docs.
  – TREC 4, 2 GB, 567,529 docs.
• Phrases of word pairs can be useful in improving the retrieval performance.
• The strategy can be easily extended.

Page 48:

Personal Outlook / Conclusions

Page 49:

Personal Outlook / Conclusions

• Stop-list.
• Porter stemmer.
• N-grams cannot capture large-span relationships in the language.
• The performance of the n-gram model has reached a plateau.
• P(d).

Page 50:

Principal Component Analysis

• A low-dimensional representation of the data.
• Relations between features.
• PCA tries to find a low-rank approximation, where the quality of the approximation depends on how close the data is to lying in a subspace of the given dimensionality.

Page 51:

Latent Semantic Analysis

• Latent Semantic Analysis:
  – Semantic information is extracted by means of the Singular Value Decomposition (SVD).

  D = U Σ V^T

  φ(d) = U_k^T d

  LSI uses a reduction to the first k columns of U.
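The projection φ(d) = U_k^T d can be sketched with NumPy; the tiny term-document matrix and the choice k = 2 are made up for illustration:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents);
# the counts are invented purely for illustration.
D = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(D, full_matrices=False)  # D = U * diag(s) * Vt

k = 2                 # keep the first k columns of U as "concepts"
Uk = U[:, :k]

def phi(doc_vector):
    """Map a term-count vector into the k-dimensional LSI concept space."""
    return Uk.T @ doc_vector

print(phi(D[:, 0]).shape)  # (2,)
```

Any new document (or query) with counts over the same terms can be mapped through the same `phi` and compared in the reduced space.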

Page 52:

Latent Semantic Analysis

• Latent Semantic Analysis:
  – The eigenvectors for a set of documents can be viewed as concepts described by a linear combination of terms, chosen in such a way that documents are described as accurately as possible using only k such concepts.
  – Terms that co-occur frequently will tend to align in the same eigenvectors.

Page 53:

Latent Semantic Analysis

• SVD is expensive to compute.
• Cristianini developed an approximation strategy based on the Gram-Schmidt decomposition.
• Multilinguality:
  – The semantic space proposed here provides an ideal representation for performing multilingual information retrieval.

Page 54:

Personal Outlook / Conclusions

• What happens if we use LSA to improve smoothing?
  – One idea: smooth terms by assigning probability mass according to their semantic distance to the terms in the collection/query.
  – Problem: scalability of the model. If a term is not in the set W from which the SVD decomposition was made, then we have to resort to an approximation.

Page 55:

Personal Outlook / Conclusions

• What happens if we use LSA to improve smoothing?
  – Problem:
    • If the documents belong to diverse topics, the classification in the new space becomes too heterogeneous.
    • If the documents belong to diverse topics, the classification of the words in the new space is ambiguous.

Page 56:

Personal Outlook / Conclusions

• Conclusions:
  – Smoothing methods are simple and efficient.
  – They provide an elegant way to deal with the data sparseness problem.
  – They can be chosen according to the taste of the consumer.
  – But they do not model the linguistic phenomena behind the scenes... at least for the moment.
  – Even though the techniques do not require language knowledge, the Markov assumption leads to some sort of language dependency.

Page 57:

Questions?

• English only?
• Query expansion?
• How would smoothing help the Question Answering task?
• Which method would help a QA system in a more appropriate way? Why?