Language Modeling: Putting a curve to the bag of words (courtesy of Chris Jordan)


Page 1

Language Modeling

Putting a curve to the bag of words

Courtesy of Chris Jordan

Page 2

What models we covered in class so far

• Boolean

• Extended Boolean

• Vector Space – TF*IDF

• Probabilistic Modeling – log P(D|R) / P(D|N)

Page 3

Probability Ranking Principle

“If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”

- Robertson

Page 4

Bag of words? What bag?

• A document is a vector of term occurrences

• Assumption of exchangeability

• What is this really?
– A hyperspace where each dimension is represented by a term
– Values are term occurrences

Page 5

Can we model this bag?

• Binomial Distribution
– Bernoulli / success-fail trials
– e.g. Flipping a coin: chance of getting a head

• Multinomial
– Probability of each of several possible events occurring
– e.g. Flipping a coin: chance of head, chance of tail
– e.g. Die roll: chance of 1, 2, …, 6
– e.g. Document: chance of a term occurring
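
A minimal sketch of these three examples in plain Python (the vocabulary and probabilities below are made up for illustration):

```python
import random

# Binomial / Bernoulli trials: flipping a coin (two outcomes per trial)
coin_flips = random.choices(["H", "T"], weights=[0.5, 0.5], k=10)

# Multinomial: a die roll has six possible outcomes per trial
die_rolls = random.choices([1, 2, 3, 4, 5, 6], k=10)

# Bag-of-words view of a document: each token is a draw from a
# multinomial over the vocabulary (toy probabilities assumed here)
vocab = ["the", "cat", "sat", "mat"]
probs = [0.4, 0.3, 0.2, 0.1]
toy_doc = random.choices(vocab, weights=probs, k=8)

print(coin_flips, die_rolls, toy_doc)
```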

Page 6

Review

• What is the Probability Ranking Principle?

• What is the bag of words model?

• What is exchangeability?

• What is a binomial?

• What is a multinomial?

Page 7

Some Terminology

• Term: t

• Vocabulary: V = {t1 t2 … tn}

• Document: dx = tdx1 … tdxm, with each term ∈ V

• Corpus: C = {d1 d2… dk}

• Query: Q = q1 q2 … qi, with each q ∈ V

Page 8

Language Modeling

• A document is represented by a multinomial distribution over terms

• Unigram model
– Each term in a piece of text is generated independently

• p(t1 t2 … tn) = p(t1)p(t2)…p(tn)

• p(t1)+p(t2)+…+p(tn)=1
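
A short sketch of the unigram independence assumption, assuming a toy term distribution p (hypothetical values that sum to 1):

```python
# Hypothetical unigram model: term probabilities over the vocabulary sum to 1
p = {"the": 0.4, "cat": 0.3, "sat": 0.2, "mat": 0.1}

def unigram_prob(text, p):
    """p(t1 t2 ... tn) = p(t1) * p(t2) * ... * p(tn) under the unigram assumption."""
    prob = 1.0
    for t in text:
        prob *= p.get(t, 0.0)  # unseen terms get probability 0 (foreshadowing the smoothing problem)
    return prob

print(unigram_prob(["the", "cat", "sat"], p))  # 0.4 * 0.3 * 0.2 = 0.024
```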

Page 9

Why Unigram?

• Easy to implement
– Reasonable performance

• Word order and structure not captured
– How much benefit would they add?

• Open question

• More parameters to tune in complex models
– Need more data to train
– Need more time to compute
– Need more space to store

Page 10

Enough… how do I retrieve documents?

• p(Q|d) = p(q1|d)p(q2|d)…p(qn|d)

• How do we estimate p(q|d)?
– Maximum Likelihood Estimate
– MLE(q|d) = freq(q|d) / ∑ freq(i|d), i.e. the frequency of q in d divided by the total number of term occurrences in d (see the sketch below)

• Probability Ranking Principle
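
A minimal sketch of MLE-based query-likelihood scoring over two toy documents (the documents and query are made up); documents are then ranked by p(Q|d):

```python
def mle(term, doc_tokens):
    """MLE(q|d): frequency of q in d divided by the total number of tokens in d."""
    return doc_tokens.count(term) / len(doc_tokens)

def query_likelihood(query_tokens, doc_tokens):
    """p(Q|d) = p(q1|d) * p(q2|d) * ... under the unigram assumption."""
    score = 1.0
    for q in query_tokens:
        score *= mle(q, doc_tokens)
    return score

docs = {"d1": "the cat sat on the mat".split(),
        "d2": "dogs chase the cat".split()}
query = "the cat".split()

ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranking)  # ['d2', 'd1']: (1/4)*(1/4) = 0.0625 beats (2/6)*(1/6) ≈ 0.056
```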

Page 11

Review

• What is the unigram model?

• Is the language model a binomial or multinomial?

• Why use the unigram model?

• Given a query, how do we use a language model to retrieve documents?

Page 12

What is wrong with MLE?

• Assigns 0 probability to terms that do not occur in the document

• 0 probabilities break the similarity scoring function: a single unseen query term zeroes out the whole product p(Q|d)

• Is a 0 probability sensible?
– Can a word never, ever occur?

Page 13

How can we fix this?

• How do we get around the zero probabilities?
– New similarity function?
– Remove zero probabilities?

• Build a different model?

Page 14

Smoothing Approaches

• Laplace / Additive

• Mixture Models
– Interpolation
• Jelinek-Mercer
• Dirichlet
• Absolute Discounting
– Backoff

Page 15

Laplace

• Simply add 1 to all term frequencies

• Where have you seen this before?

• Is this a good idea?
– Strengths
– Weaknesses
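
A sketch of add-one (Laplace) smoothing, assuming a toy document and vocabulary:

```python
def laplace(term, doc_tokens, vocab):
    """Add 1 to every term's count so no term in the vocabulary gets probability 0."""
    return (doc_tokens.count(term) + 1) / (len(doc_tokens) + len(vocab))

doc = "the cat sat on the mat".split()
vocab = {"the", "cat", "sat", "on", "mat", "dog"}

print(laplace("dog", doc, vocab))  # 1/12 instead of 0 under plain MLE
print(laplace("the", doc, vocab))  # 3/12 instead of 2/6
```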

Page 16

Interpolation

• Mixture model approach
– Combine probability models

• Traditionally combines the document model with the corpus model

• Is this a good idea?
– What else is the corpus model used for?
– Strengths
– Weaknesses
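
A sketch of interpolation in the Jelinek-Mercer style, mixing the document model with the corpus model (the weight lam is a hypothetical tuning parameter):

```python
def interpolated(term, doc_tokens, corpus_tokens, lam=0.7):
    """p(q|d) = lam * MLE(q|d) + (1 - lam) * MLE(q|C)."""
    p_doc = doc_tokens.count(term) / len(doc_tokens)
    p_corpus = corpus_tokens.count(term) / len(corpus_tokens)
    return lam * p_doc + (1 - lam) * p_corpus

doc = "the cat sat".split()
corpus = "the cat sat on the mat dogs chase the cat".split()

print(interpolated("mat", doc, corpus))  # 0.03: unseen in the document, but the corpus model supplies mass
```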

Page 17

Backoff

• Only add probability mass to terms that are not seen

• What does this do to the probability model?
– Flatter?

• Is this a good idea?
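
A simplified sketch of the backoff idea (the discount alpha is hypothetical, and the probabilities here are not properly renormalized; real backoff schemes are more careful):

```python
def backoff(term, doc_tokens, corpus_tokens, alpha=0.1):
    """Seen terms keep a discounted document probability;
    unseen terms back off to the corpus model."""
    if term in doc_tokens:
        return (1 - alpha) * doc_tokens.count(term) / len(doc_tokens)
    return alpha * corpus_tokens.count(term) / len(corpus_tokens)

doc = "the cat sat".split()
corpus = "the cat sat on the mat dogs chase the cat".split()

print(backoff("cat", doc, corpus))  # seen: 0.9 * 1/3 = 0.3
print(backoff("mat", doc, corpus))  # unseen: 0.1 * 1/10 = 0.01
```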

Page 18

Are there other sources of probability mass?

• Document Clusters

• Document Classes

• User Profiles

• Topic models

Page 19

Review

• What is wrong with 0 probabilities?
• How does smoothing fix it?
• What is smoothing really doing?
• What is Interpolation?

– What is that mixture model really representing?

• What can we use to mix with the document model?

Page 20

Bored yet? Let’s do something complicated

• Entropy – Information Theory
– H(X) = -∑ p(x) log p(x)
– Good for data compression

• Relative Entropy
– D(p||q) = ∑ p(x) log (p(x)/q(x))
– Not a true distance measure
– Used to find differences between probability models
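
A sketch of both quantities in Python, assuming two toy distributions over the same events (natural log used here; any base works as long as it is consistent):

```python
import math

def entropy(p):
    """H(X) = -sum over x of p(x) log p(x)."""
    return -sum(px * math.log(px) for px in p.values() if px > 0)

def relative_entropy(p, q):
    """D(p || q) = sum over x of p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.8, "b": 0.2}
q = {"a": 0.6, "b": 0.4}

print(entropy(p))              # about 0.50
print(relative_entropy(p, q))  # about 0.09; note D(p||q) != D(q||p), so it is not a true distance
```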

Page 21

Ok… that’s nice

• What does relative entropy give us?
– Why not just subtract probabilities?
– On your calculators, calculate p(x) log (p(x)/q(x)) for:
• p(x) = .8, q(x) = .6
• p(x) = .6, q(x) = .4
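
Worked out with the natural log: 0.8 log(0.8/0.6) ≈ 0.23, while 0.6 log(0.6/0.4) ≈ 0.24 (base 2 gives 0.33 and 0.35, same ordering). The gap between p and q is 0.2 in both cases, yet the contributions differ, because relative entropy weighs the ratio of the probabilities by p(x) rather than just taking their difference.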

Page 22

Clarity Score

• Calculate the relative entropy between the result-set language model and the corpus model
– Positive correlation between a high clarity score / relative entropy and query performance
– So what is that actually saying?
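
A sketch of the clarity computation, assuming result-set and corpus language models have already been estimated (the distributions below are made up):

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) between two term distributions."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Hypothetical result-set and corpus language models over a shared vocabulary
result_set_model = {"jaguar": 0.30, "car": 0.25, "speed": 0.20, "the": 0.25}
corpus_model     = {"jaguar": 0.01, "car": 0.04, "speed": 0.05, "the": 0.90}

clarity = kl(result_set_model, corpus_model)
print(clarity)  # a large value: the result set looks very unlike the corpus as a whole
```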

Page 23

Relative Entropy Query Expansion

• Relevance Feedback

• Blind Relevance Feedback

• Expand query with terms that contribute the most to relative entropy

• What are we doing to the query when we do this?
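
A sketch of choosing expansion terms by their individual contribution p(x) log(p(x)/q(x)), where p is a (blind) feedback model estimated from the top-ranked documents and q is the corpus model (toy numbers assumed):

```python
import math

def term_contributions(feedback_model, corpus_model):
    """Per-term contribution to D(feedback || corpus)."""
    return {t: p * math.log(p / corpus_model[t])
            for t, p in feedback_model.items() if p > 0}

feedback_model = {"jaguar": 0.30, "car": 0.25, "speed": 0.20, "the": 0.25}
corpus_model   = {"jaguar": 0.01, "car": 0.04, "speed": 0.05, "the": 0.90}

contrib = term_contributions(feedback_model, corpus_model)
top_terms = sorted(contrib, key=contrib.get, reverse=True)[:2]
print(top_terms)  # ['jaguar', 'car']: topical terms dominate, while 'the' contributes negatively
```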

Page 24

Controlled Query Generation

• Some of my research

• p(x) log (p(x)/q(x)) is a good term discrimination function

• Regulate the construction of queries for evaluating retrieval algorithms
– First real controlled reaction experiments with retrieval algorithms

Page 25

Review

• Who is the father of Information Theory?

• What is Entropy?
• What is Relative Entropy?
• What is the Clarity Score?
• What are the terms that contribute the most to relative entropy?
– Are they useful?

Page 26

You have been a good class

• Introduced to the language model for information retrieval

• Documents represented as multinomial distributions
– Generative model
– Queries are generated from document models

• Smoothing
• Applications in IR

Page 27

Questions for me?

Page 28

Questions for you

• What is the Maximum Likelihood Estimate?

• Why is smoothing important?

• What is interpolation?

• What is entropy?

• What is relative entropy?

• Does language modeling make sense?