An Introduction to LDA Tools Kuan-Yu Chen Institute of Information Science, Academia Sinica


Page 1: An Introduction  to LDA Tools

An Introduction to LDA Tools

Kuan-Yu ChenInstitute of Information Science, Academia Sinica

Page 2: An Introduction  to LDA Tools

References
• D. M. Blei et al., “Latent Dirichlet allocation,” Journal of Machine Learning Research, 3, pp. 993–1022, January 2003.
• D. Blei and J. Lafferty, “Topic models,” in A. Srivastava and M. Sahami (eds.), Text Mining: Theory and Applications. Taylor and Francis, 2009.
• T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, 42, pp. 177–196, 2001.
• T. Griffiths and M. Steyvers, “Finding scientific topics,” in Proc. of the National Academy of Sciences, 2004.
• X. Wei and W. B. Croft, “LDA-based document models for ad-hoc retrieval,” in Proc. of ACM SIGIR, 2006.

Page 3: An Introduction  to LDA Tools

Outline
• A Brief Review of Mixture Models
  – Unigram Model
  – Mixture of Unigrams
  – Probabilistic Latent Semantic Analysis
  – Latent Dirichlet Allocation
• LDA Tools
  – GibbsLDA++
  – VB-EM source code from Blei
• Examples

Page 4: An Introduction  to LDA Tools

Unigram Model & Mixture of Unigrams
• Unigram model
  – Under the unigram model, the words of every document are drawn independently from a single multinomial distribution:

    p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

• Mixture of unigrams
  – Under this mixture model, each document is generated by first choosing a topic z and then generating the words independently from the conditional multinomial:

    p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)
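The key property of the mixture of unigrams can be made concrete with a short simulation. This is an illustrative sketch, not part of the tools discussed later; the topic and word names are made up. Because a single topic z is drawn once per document, every word in the document comes from that one topic's multinomial:

```python
import random

def sample_mixture_of_unigrams(topic_probs, word_probs, n_words, seed=0):
    """Mixture of unigrams: draw ONE topic z for the whole document,
    then draw every word independently from P(w | z)."""
    rng = random.Random(seed)
    topics = list(topic_probs)
    # choose a single topic for the entire document
    z = rng.choices(topics, weights=[topic_probs[t] for t in topics])[0]
    vocab = list(word_probs[z])
    # all n_words words are drawn from the same conditional multinomial
    words = rng.choices(vocab, weights=[word_probs[z][w] for w in vocab],
                        k=n_words)
    return z, words
```

With one word per topic, every sampled document is single-topic by construction, which is exactly the simplifying assumption that PLSA relaxes on the next slide.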

Page 5: An Introduction  to LDA Tools

Probabilistic Latent Semantic Analysis
• Probabilistic latent semantic analysis (PLSA/PLSI)
  – The PLSA model attempts to relax the simplifying assumption made in the mixture of unigrams model that each document is generated from only one topic:

    p(\mathbf{w}) = \prod_{n=1}^{N} \sum_{z} p(w_n \mid z) \, p(z \mid d)

  – p(z \mid d) serves as the mixture weights of the topics for a particular document
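Since PLSA re-draws the topic at every word position, the document likelihood is a product over positions of a per-word topic mixture. A minimal sketch (illustrative names; plain dictionaries stand in for the learned distributions):

```python
def plsa_doc_likelihood(words, p_w_given_z, p_z_given_d):
    """P(w) = prod_n sum_z P(w_n | z) P(z | d): the topic mixture
    is applied independently at each word position."""
    prob = 1.0
    for w in words:
        # marginalize over topics for this single word position
        prob *= sum(p_w_given_z[z].get(w, 0.0) * p_z_given_d[z]
                    for z in p_z_given_d)
    return prob
```

Note the contrast with the mixture of unigrams: there, the sum over z sits outside the product over words; here it sits inside.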

Page 6: An Introduction  to LDA Tools

Latent Dirichlet Allocation
• The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words
• LDA assumes the following generative process for each document \mathbf{w} in a corpus D:
  1. Choose N \sim \mathrm{Poisson}(\xi)
  2. Choose \theta \sim \mathrm{Dir}(\alpha)
  3. For each of the N words w_n:
     a) Choose a topic z_n \sim \mathrm{Multinomial}(\theta)
     b) Choose a word w_n from p(w_n \mid z_n, \beta), a multinomial probability conditioned on the topic z_n
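The three-step generative story can be run directly in code. The sketch below is an illustrative simulation only (it uses Knuth's Poisson sampler and normalized gamma draws for the Dirichlet); it is not part of GibbsLDA++ or lda-c:

```python
import math
import random

def lda_generate_document(alpha, beta, xi, seed=0):
    """Simulate one document from the LDA generative process:
    N ~ Poisson(xi), theta ~ Dir(alpha), then for each word
    z_n ~ Multinomial(theta) and w_n ~ Multinomial(beta[z_n])."""
    rng = random.Random(seed)
    # 1. document length N ~ Poisson(xi), via Knuth's method
    limit, n, p = math.exp(-xi), 0, 1.0
    while p > limit:
        n += 1
        p *= rng.random()
    n -= 1
    # 2. topic proportions theta ~ Dirichlet(alpha),
    #    as normalized independent Gamma(alpha_i, 1) draws
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    # 3. for each of the N word positions
    words = []
    for _ in range(n):
        z = rng.choices(range(len(theta)), weights=theta)[0]      # topic z_n
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]  # word w_n
        words.append(w)
    return theta, words
```

Here beta is a list of per-topic word distributions (the k-by-V matrix of the next slide, row by row), and word IDs stand in for actual words.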

Page 7: An Introduction  to LDA Tools

Latent Dirichlet Allocation
• Several simplifying assumptions are made:
  – The dimensionality k of the Dirichlet distribution is assumed known and fixed
  – The word probabilities are parameterized by a k \times V matrix \beta, which we treat as a fixed quantity that is to be estimated
  – The Poisson assumption is not critical to anything
    • Note that the document length N is independent of all the other data-generating variables (\theta and \mathbf{z})

Page 8: An Introduction  to LDA Tools

Latent Dirichlet Allocation
• Given the parameters \alpha and \beta, the joint distribution of a topic mixture \theta, a set of topics \mathbf{z}, and a set of words \mathbf{w} is given by:

    p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)

• Integrating over \theta and summing over \mathbf{z}, we obtain the marginal distribution of a document:

    p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta

• Taking the product over the M documents, we obtain the probability of a corpus:

    p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d

Page 9: An Introduction  to LDA Tools

Latent Dirichlet Allocation
• The key inferential problem is that of computing the posterior distribution of the hidden variables given a document:

    p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

  – Unfortunately, this distribution is intractable to compute in general
  – Although the posterior distribution is intractable for exact inference, a wide variety of approximate inference algorithms can be considered for LDA

Page 10: An Introduction  to LDA Tools

Latent Dirichlet Allocation – VB-EM

• The basic idea of convexity-based variational inference is to make use of Jensen's inequality to obtain an adjustable lower bound on the log likelihood

• A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed

Page 11: An Introduction  to LDA Tools

Latent Dirichlet Allocation – VB-EM
• This family is characterized by the following variational distribution:

    q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

• The desideratum of finding a tight lower bound on the log likelihood translates directly into the following optimization problem:

    (\gamma^*, \phi^*) = \arg\min_{(\gamma, \phi)} \mathrm{D}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)

  – i.e., by minimizing the Kullback–Leibler (KL) divergence between the variational distribution and the true posterior

Page 12: An Introduction  to LDA Tools

GibbsLDA++
• GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) using the Gibbs sampling technique for parameter estimation and inference
• The main page of GibbsLDA++ is:
  http://gibbslda.sourceforge.net/
• We can download this tool from:
  http://sourceforge.net/projects/gibbslda/
• It needs to be compiled in a Linux/Cygwin environment

Page 13: An Introduction  to LDA Tools

GibbsLDA++
• Extract “GibbsLDA++-0.2.tar.gz”
• Run Cygwin
• Switch the current directory to “/GibbsLDA++-0.2”
• Execute the commands:

    make clean
    make all

• Then, we have an executable file “lda.exe” in the “/GibbsLDA++-0.2/src” directory

Page 14: An Introduction  to LDA Tools

An Example of GibbsLDA++
• Format of the training corpus:

    2265                              ← total number of documents
    40889 44022 10092 2471 9800 …    ← Doc 1: word1 word2 …
    31677 653 657 17998 1788 …       ← Doc 2
    1521 15820 3015 48825 2690 …
    42763 7680 38280 2913 42763 …
    42763 2997 732 42472 3844 …
    2572 1583 2584 44400 3015 …
    …
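Producing this format from tokenized documents is mechanical; a small sketch (the word IDs are illustrative, and in practice would come from your own vocabulary mapping):

```python
def format_gibbslda_corpus(docs):
    """GibbsLDA++ training format: the first line is the total number
    of documents, then one line per document with its words (here,
    word IDs) separated by single spaces."""
    lines = [str(len(docs))]
    for doc in docs:
        lines.append(" ".join(str(w) for w in doc))
    return "\n".join(lines) + "\n"
```

The returned string can be written directly to the file passed as -dfile in the next slide.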

Page 15: An Introduction  to LDA Tools

An Example of GibbsLDA++
• LDA Parameter Estimation
  – Command:

    lda.exe -est -dfile Gibbs_TDT2_Text.txt -alpha 6.25 -beta 0.1 -ntopics 8 -niters 2000

  – Parameter settings:
    -dfile: the input training data
    -alpha: the hyper-parameter of LDA
    -beta: the hyper-parameter of LDA
    -ntopics: the number of latent topics
    -niters: the number of iterations

Page 16: An Introduction  to LDA Tools

An Example of GibbsLDA++
• Outputs of Gibbs sampling estimation of GibbsLDA++ include the following files:
  – model.others: contains some parameters of the LDA model
  – model.phi: contains the word-topic distributions (topic-by-word matrix)
  – model.theta: contains the topic-document distributions (document-by-topic matrix)
  – model.tassign: contains the topic assignments for words in the training data
  – wordmap.txt: contains the maps between words and word IDs (integers)
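The model.theta matrix is plain whitespace-separated text, so it can be read back with a few lines. A sketch, assuming one document per row and one proportion per topic, which is how GibbsLDA++ lays the file out:

```python
def parse_theta(text):
    """Parse a model.theta file: one row per document, one
    whitespace-separated topic proportion per column."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]
```

The same parser works for model.phi, whose rows are topics rather than documents.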

Page 17: An Introduction  to LDA Tools

VB-EM source code from Blei
• Blei implemented Latent Dirichlet Allocation (LDA) using VB-EM for parameter estimation and inference
• The main page of the source code is:
  http://www.cs.princeton.edu/~blei/lda-c/index.html
• We can download this tool from:
  http://www.cs.princeton.edu/~blei/lda-c/lda-c-dist.tgz
• It needs to be compiled in a Linux/Cygwin environment

Page 18: An Introduction  to LDA Tools

VB-EM source code from Blei
• Extract “lda-c-dist.tgz”
• Run Cygwin
• Switch the current directory to “/lda-c-dist”
• Execute the command:

    make

• Then, we have an executable file “lda.exe” in the “/lda-c-dist” directory

Page 19: An Introduction  to LDA Tools

An Example of LDA
• Format of the training corpus:

    77 508:1 596:3 612:2 709:1 713:1 …   ← Doc 1
    72 508:2 596:5 597:1 653:1 657:3 …   ← Doc 2
    88 457:1 508:1 572:2 596:6 795:1 …
    62 457:1 508:1 596:2 657:1 732:1 …
    53 336:4 341:1 457:1 596:1 657:1 …
    …

  – Each line is one document: the number of unique words, followed by word-id:count pairs (the count is the number of times the word appears)
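Converting a bag-of-words count dictionary into one such line is a one-liner; a sketch (the IDs and counts used below are illustrative):

```python
def format_lda_c_doc(counts):
    """One lda-c corpus line: '[num unique terms] termid:count ...',
    from a {word_id: count} dictionary."""
    items = sorted(counts.items())  # stable, ascending word-id order
    return " ".join([str(len(items))] + ["%d:%d" % (w, c) for w, c in items])
```

Joining one such line per document with newlines yields the training file passed as [data] on the next slide.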

Page 20: An Introduction  to LDA Tools

An Example of LDA
• LDA Parameter Estimation
  – The input format can be expressed as:

    lda.exe est [alpha] [k] [settings] [data] [initialization] [directory]

    • [alpha]: the hyper-parameter of LDA
    • [k]: the number of latent topics
    • [settings]: the settings file
    • [data]: the input training data
    • [initialization]: specifies how the topics will be initialized
    • [directory]: the output directory

  – Command:

    lda.exe est 6.25 8 ./settings.txt Blei_TDT2_Text.txt random ./

Page 21: An Introduction  to LDA Tools

An Example of LDA
• The settings file contains several experimental values:
  – var max iter: the maximum number of iterations for a single document
  – var convergence: the convergence criterion for inference
  – em max iter: the maximum number of iterations of VB-EM
  – em convergence: the convergence criterion for VB-EM
  – alpha: set to “fixed” or “estimate”

    var max iter 20
    var convergence 1e-6
    em max iter 100
    em convergence 1e-4
    alpha estimate

Page 22: An Introduction  to LDA Tools

An Example of LDA
• The saved models are in three files:
  – <iteration>.other: contains alpha and some other statistical information of the LDA model
  – <iteration>.beta: contains the log of the topic distributions over words (topic-by-word matrix)
  – <iteration>.gamma: contains the variational posterior Dirichlets of each document (document-by-topic matrix)
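Because <iteration>.beta stores log probabilities, a reader has to exponentiate before ranking words. A sketch of pulling the top words per topic (the file contents and vocabulary in the test are illustrative; real files come from the [directory] output of the est command):

```python
import math

def top_words_per_topic(beta_text, vocab, n=3):
    """Each row of <iteration>.beta is one topic's log P(word | topic);
    exponentiate and return the n highest-probability words per topic."""
    tops = []
    for line in beta_text.splitlines():
        if not line.strip():
            continue
        probs = [math.exp(float(x)) for x in line.split()]
        # indices of the n largest probabilities, descending
        ranked = sorted(range(len(probs)), key=lambda i: -probs[i])[:n]
        tops.append([vocab[i] for i in ranked])
    return tops
```

Since only the ordering matters here, the math.exp call could be dropped; it is kept to make the log-space storage explicit.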