
Latent Dirichlet Allocation (Blei et al.)

Kris Sankaran

2016-11-14


Introduction


Agenda

- Generative Mechanism (15 minutes): What is the proposed model, and how does it differ from what existed before?
- Interpretations (10 minutes): What are alternative ways to understand the model?
- Model Inference (15 minutes): How would we fit this model in practice?
- Examples and Conclusion (10 minutes): Why might we fit LDA in practice, and what are its limitations?


Context and Motivation

- Motivated by topic modeling:
  - Building interpretable representations of text data
  - Designing preprocessing steps for classification or information retrieval
- This said, LDA is not necessarily tied to text analysis
- Generative Modeling: Design unified probabilistic models
  - Is explicit about assumptions, feels less ad hoc
  - Gives access to the (large) Bayesian inference literature
  - Can be used as a module in larger probabilistic models


Generative Model


Latent Dirichlet Allocation

- For the nth word in document d,

  w_dn | β, z_dn = k ∼ Cat(β_·k)
  z_dn | θ_d ∼ Cat(θ_d)
  θ_d | α ∼ Dir(α)

- Mnemonics:
  - w_dn ∈ {1, . . . , V} is the term used as the nth word in document d
  - z_dn ∈ {1, . . . , K} is the topic associated with the nth word in document d
  - θ_d ∈ S^(K−1) are the topic mixture proportions for document d
  - β_·k ∈ S^(V−1) are the term mixture proportions for topic k
  - α is the topic shrinkage parameter

(a sampling sketch of this process follows)
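To make the generative process concrete, here is a minimal sampling sketch in Python/NumPy. The corpus sizes and hyperparameter values are toy placeholders of my own choosing rather than anything from the slides; only the three sampling steps mirror the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N, V, K = 100, 50, 1000, 5                      # toy sizes: docs, words per doc, vocab, topics
alpha = np.full(K, 0.1)                            # topic shrinkage parameter (placeholder value)
beta = rng.dirichlet(np.full(V, 0.1), size=K).T    # V x K, column k = term proportions for topic k

docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha)                 # theta_d | alpha ~ Dir(alpha)
    z_d = rng.choice(K, size=N, p=theta_d)         # z_dn | theta_d ~ Cat(theta_d)
    w_d = np.array([rng.choice(V, p=beta[:, k]) for k in z_d])   # w_dn | z_dn = k ~ Cat(beta_.k)
    docs.append(w_d)
```

Each column of beta lives on the V−1 simplex and each theta_d on the K−1 simplex, matching the mnemonics above.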


Latent Dirichlet Allocation

[Graphical model: α → θ → z → w, with β → w; θ sits inside the document plate (D), z and w inside the nested word plate (N), and α, β outside the plates]

- w are observed data
- α, β are fixed, global parameters
- θ, z are random, local parameters


Observed Counts (sum of w_dn's)

[Figure: heatmap of observed word counts, word × doc, fill "count" on a 0–30 scale]


Mixing Proportions (θ_d's)

[Figure: heatmap of mixing proportions, topic × doc, fill "theta" on a 0.25–0.75 scale]


Topic Counts (sum of z_dn's)

[Figure: heatmaps of per-topic counts, word × document with one panel per topic, fill "z counts" on a 10–30 scale]


Latent Dirichlet Allocation (β)

[Figure: heatmap of β, topic × word, fill "beta" on a 0.01–0.03 scale]


Unigram Model

- It can be illustrative to compare with earlier topic modeling approaches
- The unigram model draws all words from the same multinomial, w_dn ∼ Cat(β)

[Graphical model: β → w; w sits inside the word plate (N), nested in the document plate (D)]


Mixture of Unigrams

- This is the multinomial analog of Gaussian mixture models
- Each word is drawn from a mixture of K topics,

  z_d ∼ p(z)
  w_dn | z_d ∼ Cat(β_·z_d)

- Topic assignment is drawn at the document level

[Graphical model: z → w, with β → w; z sits inside the document plate (D), w inside the nested word plate (N)]


Probabilistic Latent Semantic Indexing (pLSI)

- pLSI draws a different topic for each word in the document,

  z_dn | d ∼ p(z_dn | d)
  w_dn | z_dn ∼ Cat(β_·z_dn)

- The per-document topic mixture proportions are nonrandom and different for each document
- The number of fixed parameters grows linearly with the number of documents

(a small sketch contrasting the document-level and word-level topic draws follows)

[Graphical model: d → z → w, with β → w; d indexes the document plate (D), z and w sit inside the nested word plate (N)]
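To make the document-level vs. word-level distinction concrete, here is a small illustrative sketch; the dimensions and the uniform topic weights p_z are my own placeholders (in pLSI the weights would be the fixed per-document proportions, in LDA a Dirichlet draw), not anything from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, N = 3, 20, 10
beta = rng.dirichlet(np.ones(V), size=K).T   # V x K, column k = term proportions for topic k
p_z = np.ones(K) / K                         # stand-in for the per-document topic weights

# Mixture of unigrams: a single topic z_d is drawn for the whole document
z_d = rng.choice(K, p=p_z)
doc_mixture = rng.choice(V, size=N, p=beta[:, z_d])

# pLSI / LDA: a separate topic z_dn is drawn for every word
z_dn = rng.choice(K, size=N, p=p_z)
doc_word_level = np.array([rng.choice(V, p=beta[:, k]) for k in z_dn])
```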


Back to LDA

- Essential difference: Randomness in the topic mixture proportions lets us share information across documents
- The number of fixed parameters does not grow with the number of documents

[Graphical model: α → θ → z → w, with β → w; the same plate diagram as before, with plates over the N words and D documents]


Interpretations


Geometric

- Each topic is a point on the simplex, and the K topics determine a topic simplex
- The mixture of unigrams model gives each document a corner of the topic simplex
- pLSI estimates the empirical distribution of observed mixing proportions
- LDA estimates a smooth density over the topic simplex


Matrix Factorization

- We can think of topics as latent factors and mixing proportions as document scores,

  p(w_dn = v | θ_d, β) = ∑_{k=1}^{K} p(w_dn = v | β_·k) p(z_dn = k) = β_v·^T p(z_dn)

- The different models treat the p(z_dn)'s differently
- In LDA, this probability is β_v·^T θ_d (see the sketch below)

[Figure: the D × V matrix of p(w_dn = v) factors as the D × K matrix of θ_dk times the K × V matrix of β_kv]
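A small numerical sketch of this factorization view, with toy dimensions of my own choosing: the D × V matrix of word probabilities is the product of the document scores θ (D × K) and the topic loadings arranged as a K × V matrix (the transpose of the β_·k columns used elsewhere in these slides).

```python
import numpy as np

rng = np.random.default_rng(2)
D, K, V = 4, 2, 6
theta = rng.dirichlet(np.ones(K), size=D)   # D x K, row d = theta_d
beta = rng.dirichlet(np.ones(V), size=K)    # K x V, row k = term proportions for topic k

# p(w_dn = v) = sum_k theta_dk * beta_kv, i.e. the (d, v) entry of theta @ beta
word_probs = theta @ beta                   # D x V
assert np.allclose(word_probs.sum(axis=1), 1.0)   # each row is a distribution over the V terms
```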


Inference


Variational Bayes

- As scientists / modelers, our primary interest is in the posterior p(θ, z | w, α, β) after observing the words w
- This is not available in closed form (the normalizing constant is intractable; see the expression below)
- In practice, we also need to estimate α and β – more on this later
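For reference, the quantity that breaks the closed form is the marginal likelihood of a document; written in the notation of the model above (and consistent with the marginal likelihood in Blei et al.), the normalizing constant for a single document is

  p(w_d | α, β) = ∫ p(θ_d | α) ∏_{n=1}^{N} [ ∑_{k=1}^{K} p(z_dn = k | θ_d) p(w_dn | z_dn = k, β) ] dθ_d,

and the coupling between θ_d and the per-word sums over z_dn inside the product is what prevents this integral from being evaluated analytically.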


Variational Bayes

- (Blei et al.) propose a variational approach
- Turns Bayesian inference into an optimization problem
- Specifically, consider the family Γ of q's that factor like

  q(θ, z | γ, φ) = ∏_{d=1}^{D} [ Dir(θ_d | γ_d) ∏_{n=1}^{N} Cat(z_dn | φ_dn) ],

  and try to identify

  argmin_{q ∈ Γ} KL( q(θ, z | γ, φ) ‖ p(θ, z | w, α, β) )


KL Minimization

- Note that

  KL(q, p) = E_q[ log ( q(θ, z | γ, φ) / p(θ, z | w, α, β) ) ]
           = −H(q) + log p(w | α, β) − E_q[ log p(θ, z, w | α, β) ],

  and that the middle term (the "evidence") is irrelevant to our optimization.

- Hence, find γ*, φ* that maximize

  E_q[ log p(θ, z, w | α, β) ] + H(q),

  the "evidence lower bound" (ELBO).


KL Minimization

- The ELBO can be written explicitly (though it's not pretty),

  ∑_{d=1}^{D} ∑_{k=1}^{K} (α_k − 1) E_q[log θ_dk | γ_d]
  + ∑_{d=1}^{D} ∑_{n=1}^{N} ∑_{k=1}^{K} φ_dnk E_q[log θ_dk | γ_d]
  + ∑_{d=1}^{D} ∑_{n=1}^{N} ∑_{k=1}^{K} ∑_{v=1}^{V} I(w_dn = v) φ_dnk log β_vk
  − ∑_{d=1}^{D} [ log Γ(∑_{k=1}^{K} γ_dk) − ∑_{k=1}^{K} log Γ(γ_dk) + ∑_{k=1}^{K} (γ_dk − 1) E_q[log θ_dk | γ_d] ]
  − ∑_{d=1}^{D} ∑_{n=1}^{N} ∑_{k=1}^{K} φ_dnk log φ_dnk,

  where we have omitted constants in γ, φ.


KL Minimization

- The point is that we can perform coordinate ascent on the parameters φ and γ to find locally optimal φ* and γ*
- The updates look like

  φ_dnk ∝ β_{w_dn, k} exp( E_q[log θ_dk | γ_d] )

  γ_dk = α_k + ∑_{n=1}^{N} φ_dnk

- Interpretation (a code sketch of these updates follows this list)
  - The first update is like p(z_dn | w_dn) ∝ p(w_dn | z_dn) p(z_dn)
  - The second update is like a Dirichlet posterior update upon observing data φ_dnk
  - The φ_dnk are the same across occurrences of the same term → saves memory
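A minimal sketch of these coordinate ascent updates for a single document, assuming α and β are known and using the standard Dirichlet identity E_q[log θ_dk | γ_d] = Ψ(γ_dk) − Ψ(∑_j γ_dj), where Ψ is the digamma function; the function name, initialization, and fixed iteration count are my own simplifications rather than the paper's exact algorithm.

```python
import numpy as np
from scipy.special import digamma

def cavi_update_document(w_d, alpha, beta, n_iter=100):
    """Coordinate ascent for the local parameters (gamma_d, phi_d) of one document.

    w_d   : length-N array of term indices in {0, ..., V-1}
    alpha : length-K Dirichlet parameter
    beta  : V x K matrix, column k = term proportions for topic k
    """
    N, K = len(w_d), len(alpha)
    phi = np.full((N, K), 1.0 / K)      # phi_dn initialized uniform over topics
    gamma = alpha + N / K               # gamma_d initialized to alpha + N/K (a common choice)
    for _ in range(n_iter):
        # E_q[log theta_dk | gamma_d] = digamma(gamma_dk) - digamma(sum_j gamma_dj)
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        # phi_dnk proportional to beta_{w_dn, k} * exp(E_q[log theta_dk | gamma_d])
        phi = beta[w_d, :] * np.exp(e_log_theta)
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_dk = alpha_k + sum_n phi_dnk
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi
```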


Estimating α,β

- So far, we have assumed the fixed parameters α, β are known, when in practice they aren't
- (Blei et al.) propose two approaches
  - Variational EM: Here, the ELBO takes the place of the usual Expected Complete Log-Likelihood, and we alternate between optimizing φ_dnk, γ_d (Variational E-step) and α, β (Variational M-step); a sketch of this loop follows the list
  - Smoothed Variational Bayes: Place a Dirichlet prior on β and introduce this to the variational approximation. The Variational M-step now only optimizes α.
- The Smoothed Bayesian approach is better when ML estimates of β are unreliable (e.g., when data are sparse).
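As a high-level outline of the variational EM alternation, here is a sketch assuming the per-document routine above and a corpus stored as a list of term-index arrays. For brevity the M-step shown updates only β by pooling the φ's and holds α fixed (Blei et al. also update α, via Newton–Raphson), so treat this as an organizational outline rather than the paper's exact procedure.

```python
import numpy as np

def variational_em(docs, V, K, alpha, n_outer=20):
    """Alternate variational E-steps (gamma, phi per document) with an M-step for beta.

    docs  : list of length-N_d arrays of term indices in {0, ..., V-1}
    alpha : length-K Dirichlet parameter, held fixed here for simplicity
    """
    rng = np.random.default_rng(0)
    beta = rng.dirichlet(np.ones(V), size=K).T        # V x K, random initialization
    for _ in range(n_outer):
        suff = np.zeros((V, K))                       # expected term-topic counts
        for w_d in docs:
            # Variational E-step: optimize the local parameters for this document
            gamma_d, phi_d = cavi_update_document(w_d, alpha, beta)
            np.add.at(suff, w_d, phi_d)               # add phi_dn into the row for term w_dn
        # Variational M-step: beta_vk proportional to sum_{d,n} I(w_dn = v) phi_dnk
        beta = suff / suff.sum(axis=0, keepdims=True)
    return beta
```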


Conclusion


Examples

(Blei et al.) compare approaches to a variety of topic modeling tasks,

- Directly fitting to the Associated Press corpus, evaluated using held-out likelihood
- As preprocessing for classification on the Reuters data
- Collaborative filtering – evaluate likelihood on held-out movies, instead of words


Conclusion

- The basic LDA model can be easily extended by removing various exchangeability assumptions (D. M. Blei and J. D. Lafferty; D. Blei and J. Lafferty; Lacoste-Julien et al.)
- More generally, the three-level hierarchical Bayesian idea opens the door to a variety of "mixed-membership" models (Airoldi et al.; Erosheva and Fienberg; Mackey et al.; Fox and Jordan)
- Alternative MCMC, variational inference, and method of moments techniques are still an active area of research (M. Hoffman et al.; Anandkumar et al.; M. D. Hoffman et al.; Teh et al.)


References

Airoldi, Edoardo M., et al. "Mixed Membership Stochastic Blockmodels." Journal of Machine Learning Research, vol. 9, no. Sep, 2008, pp. 1981–2014.

Anandkumar, Anima, et al. "A Spectral Algorithm for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2012, pp. 917–925.

Blei, David M., and John D. Lafferty. "Dynamic Topic Models." Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 113–120.

Blei, David M., et al. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, no. Jan, 2003, pp. 993–1022.

Blei, David, and John Lafferty. "Correlated Topic Models." Advances in Neural Information Processing Systems, vol. 18, MIT Press, 2006, p. 147.

Erosheva, Elena A., and Stephen E. Fienberg. "Bayesian Mixed Membership Models for Soft Clustering and Classification." Classification—The Ubiquitous Challenge, Springer, 2005, pp. 11–26.

Fox, Emily B., and Michael I. Jordan. "Mixed Membership Models for Time Series." ArXiv Preprint ArXiv:1309.3533, 2013.

Hoffman, Matthew D., et al. "Stochastic Variational Inference." Journal of Machine Learning Research, vol. 14, no. 1, 2013, pp. 1303–1347.

Hoffman, Matthew, et al. "Online Learning for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2010, pp. 856–864.

Lacoste-Julien, Simon, et al. "DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification." Advances in Neural Information Processing Systems, 2009, pp. 897–904.

Mackey, Lester W., et al. "Mixed Membership Matrix Factorization." Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 711–718.

Teh, Yee W., et al. "A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2006, pp. 1353–1360.
