LDA Training System, [email protected], 8/22/2012
TRANSCRIPT
Outline
• Introduction
• SparseLDA
• Rethinking LDA: Why Priors Matter
• LDA Training System Design: MapReduce-LDA
Problem – Text Relevance
• Q1: apple pie
• Q2: iphone crack
• Doc1: Apple Computer Inc. is a well-known company located in California, USA.
• Doc2: The apple is the pomaceous fruit of the apple tree, species Malus domestica, in the rose family.
Topic Models
Topic Model – Generative Process
Topic Model - Inference
Latent Dirichlet Allocation
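The generative process behind the preceding slides can be sketched in a few lines of Python; the corpus sizes and hyperparameter values below are illustrative, not from the deck:

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Topic-word distributions: phi_k ~ Dirichlet(beta)
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    corpus = []
    for _ in range(n_docs):
        # Document-topic distribution: theta_d ~ Dirichlet(alpha)
        theta = rng.dirichlet(np.full(n_topics, alpha))
        doc = []
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=theta)     # draw a topic for this token
            w = rng.choice(vocab_size, p=phi[z])  # draw a word from that topic
            doc.append(int(w))
        corpus.append(doc)
    return corpus

docs = generate_corpus(n_docs=5, doc_len=8, n_topics=3, vocab_size=20)
```

Inference (the next slides) runs this process in reverse: given only the words, recover the topic assignments.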
Outline
• Introduction
• SparseLDA
• Rethinking LDA: Why Priors Matter
• LDA Training System Design: MapReduce-LDA
Gibbs Sampling for LDA
Document-Topic Statistics
Topic-Word Statistics
For each token, sample a new topic.
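The per-token update on these slides (remove the token's current topic from the statistics, sample a new topic from the full conditional, then restore the statistics) can be sketched as one collapsed Gibbs sweep; the count-array names below are mine, not the deck's:

```python
import numpy as np

def gibbs_pass(docs, z, ndk, nkw, nk, alpha, beta, V, rng):
    """One sweep of collapsed Gibbs sampling over all tokens.

    docs: list of word-id lists; z: matching list of topic assignments.
    ndk[d, k]: doc-topic counts; nkw[k, w]: topic-word counts; nk[k]: topic totals.
    """
    K = len(nk)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove the token's current assignment from the statistics.
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # Full conditional: p(z=k) proportional to
            # (ndk + alpha) * (nkw + beta) / (nk + V*beta)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))
            # Add the token back under its newly sampled topic.
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
```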
Summary so far
The normalizing constant
Statistics are sparse
Summary so far
Huge savings: time and memory
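The savings come from splitting the normalizing constant into a smoothing bucket, a document bucket, and a topic-word bucket, as in Yao et al. (KDD'09, see references); only the last two are recomputed per token, and both are sparse. A minimal sketch, with toy counts of my own choosing:

```python
import numpy as np

def sparse_lda_masses(alpha, beta, V, ndk_d, nkw_w, nk):
    """Split the SparseLDA normalizer into smoothing (s), document (r),
    and topic-word (q) buckets.  ndk_d: topic counts of the current doc;
    nkw_w: topic counts of the current word; nk: global topic totals."""
    denom = beta * V + nk
    s = np.sum(alpha * beta / denom)             # dense, but cacheable across tokens
    r = np.sum(ndk_d * beta / denom)             # nonzero only for topics in this doc
    q = np.sum((alpha + ndk_d) * nkw_w / denom)  # nonzero only for topics with this word
    return s, r, q
```

Since (alpha + ndk)(beta + nkw) = alpha*beta + ndk*beta + (alpha + ndk)*nkw, the three masses sum exactly to the full normalizer, so a uniform draw over [0, s+r+q) can be routed to the right bucket.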
Outline
• Introduction
• SparseLDA
• Rethinking LDA: Why Priors Matter
• LDA Training System Design: MapReduce-LDA
Priors for LDA
Comparing Priors for LDA
Optimizing m
Selecting T
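For reference, the prior structure studied in Wallach et al. (NIPS 2009, see references) can be written as follows; alpha and beta are concentration parameters, m is the learned asymmetric base measure, and u is the uniform one:

```latex
% Asymmetric Dirichlet prior over document-topic distributions,
% symmetric Dirichlet prior over topic-word distributions:
\theta_d \sim \operatorname{Dirichlet}(\alpha m),
  \quad m_k \ge 0,\ \textstyle\sum_k m_k = 1
\qquad
\phi_t \sim \operatorname{Dirichlet}(\beta u),
  \quad u = (1/V, \dots, 1/V)
```

Their reported finding is that this combination (asymmetric over document-topic, symmetric over topic-word) is the most robust; "Optimizing m" on the slide above refers to fitting the base measure m from the data.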
Outline
• Introduction
• SparseLDA
• Rethinking LDA: Why Priors Matter
• LDA Training System Design: MapReduce-LDA
Overview
MapReduce Jobs
Scalability
• Hypothesis:
  - 40 GB of memory per machine;
  - 5 words per doc.
• Scalability:
  - if #docs <= 1,000,000,000, no limit on #topics;
  - if #topics < 14,000, no limit on #docs.
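A back-of-envelope check of the topic limit: the topic-word count matrix must fit in the 40 GB per machine. The vocabulary size and counter width below are my assumptions, not stated on the slide, so this only shows the limit is of the right order:

```python
# Back-of-envelope check of the ~14,000-topic limit under 40 GB/machine.
MEMORY_BYTES = 40 * 2**30   # 40 GB per machine (from the slide)
BYTES_PER_COUNT = 4         # assumed: 32-bit counters
VOCAB_SIZE = 700_000        # assumed vocabulary size

# Topic-word matrix is VOCAB_SIZE x T counters and must fit in memory.
max_topics = MEMORY_BYTES // (BYTES_PER_COUNT * VOCAB_SIZE)
print(max_topics)  # same order of magnitude as the 14,000-topic limit
```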
Experiment for Correctness Validation
References
• D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR 2003.
• T. L. Griffiths and M. Steyvers. Finding Scientific Topics. PNAS 2004.
• G. Heinrich. Parameter Estimation for Text Analysis. Technical Report, 2009.
• L. Yao, D. Mimno, and A. McCallum. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD 2009.
• H. M. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why Priors Matter. NIPS 2009.
• D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed Inference for Latent Dirichlet Allocation. NIPS 2007.
• Y. Wang, H. Bai, M. Stanton, W.-Y. Chen, and E. Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications. AAIM 2009.
• Xuemin Zhao. LDA Design Doc. http://x.x.x.x/~xueminzhao/html_docs/internal/modules/lda.html
Thanks!