(hierarchical) topic modeling_yueshen xu

Post on 16-Jan-2017

Page 1: (Hierarchical) Topic Modeling_Yueshen Xu

(Hierarchical) Topic Modeling

Yueshen Xu (lecturer)

[email protected] / [email protected]

Data and Knowledge Engineering Research Center

Xidian University

Text Mining & NLP & ML

Page 2: (Hierarchical) Topic Modeling_Yueshen Xu

Software Engineering, 2016/12/29

Outline

Background

Some Concepts

Topic Modeling

Probabilistic Latent Semantic Indexing (PLSI)

Latent Dirichlet Allocation (LDA)

Hierarchical Topic Modeling

Chinese Restaurant Process (CRP)

What I do

Supplement & Reference

Keywords: topic modeling, hierarchical topic modeling, probabilistic graphical model, Bayesian model

Basics, not state-of-the-art

Page 3: (Hierarchical) Topic Modeling_Yueshen Xu

Background

Information Overload

Big Data, Cloud Computing, Artificial Intelligence, Deep Learning, etc.

We need:

Summarization

Visualization

Dimension Reduction

Page 4: (Hierarchical) Topic Modeling_Yueshen Xu

Background

Text Summarization

Document Summarization

What do these docs (or this doc) talk about?

Review Summarization

What do these consumers care about or complain about?

Short Text/Tweets Summarization

What are people discussing?

Basic requirements: automatic, applicable, explainable

Topic Modeling

Page 5: (Hierarchical) Topic Modeling_Yueshen Xu

General Concepts

Latent Semantic Analysis

Text Mining

Natural Language Processing

Computational Linguistics

Information Retrieval

Dimension Reduction

Topic Modeling

Some Concepts

[Figure: diagram of how the fields overlap: Information Retrieval, Computational Linguistics, Natural Language Processing, Text Mining, Data Mining, Machine Learning, Machine Translation, Dimension Reduction, LSA, and Topic Modeling]

Topic modeling: learning the latent topics from a corpus or document

Page 6: (Hierarchical) Topic Modeling_Yueshen Xu

Topic Modeling

Topic modeling

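an example from my doctoral thesis (originally in Chinese)

Continue to implement a prudent monetary policy, keep it appropriately balanced between loose and tight with timely anticipatory adjustments and fine-tuning, coordinate with the supply-side structure, and make comprehensive use of quantity-based, price-based, and other monetary policies

In terms of personnel numbers, this reform goes far beyond the scale of the troop reduction; it is a structural reform and a key step in modernizing the army's organizational structure

The US dollar's status as the main international currency remains irreplaceable for the foreseeable future; the only way forward is to push global governance in a more balanced direction. In a recent speech at the University of Maryland in the United States, IMF Managing Director Lagarde called for international governance reform to recognize the reality that emerging economies are increasingly important.

After being "weaned" from their parent universities, independent colleges may face growing pains in branding, student recruitment, and other areas; but after the implementation guidelines in which the state, provinces, and cities encourage private capital to enter the education sector were issued, some independent colleges decisively cut the "umbilical cord" connecting them to their parent universities and set out to develop on their own.

Corpus: Doc1, Doc2, Doc3, Doc4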

Page 7: (Hierarchical) Topic Modeling_Yueshen Xu

Topic Modeling

After topic modeling

Corpus: the same four documents (Doc1 to Doc4) as on the previous slide

Topics learned from the corpus (Topic1 to Topic4), each shown by its top words and their probabilities:

policy 0.082, reform 0.063, …

finance 0.074, currency 0.051, …

college 0.077, education 0.071, …

army 0.083, organization 0.079, …

Page 8: (Hierarchical) Topic Modeling_Yueshen Xu

Topic Modeling

A topic

A word cluster: a group of words

Clustered meaningfully, not randomly (but not semantically)

Models

Parametric models

Latent Semantic Indexing (LSI)

PLSI; Latent Dirichlet Allocation (LDA)

Non-parametric models (Dirichlet Process)

(Nested) Chinese Restaurant Process

Indian Buffet Process

Pitman-Yor Process

Page 9: (Hierarchical) Topic Modeling_Yueshen Xu

Topic Modeling

pLSI Model

[Figure: pLSI as a three-layer graph. Documents d1, d2, …, dM connect to latent topics z1, z2, …, zK, which connect to words w1, w2, …, wN; the layers are labeled p(d), p(z|d), and p(w|z).]

Assumptions

Pairs (d, w) are assumed to be generated independently

Conditioned on z, w is generated independently of d

Words in a document are exchangeable

Documents are exchangeable

Latent topics z are independent

The generative process

$$p(d, w) = p(w \mid d)\,p(d) = p(d) \sum_{z \in Z} p(w, z \mid d) = p(d) \sum_{z \in Z} p(w \mid z)\,p(z \mid d)$$

Both p(z|d) and p(w|z) are multinomial distributions

One layer of a ‘Deep Neural Network’
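The pLSI factorization above can be checked numerically. The following is a minimal sketch, assuming toy sizes and randomly drawn distributions; the array names `p_d`, `p_z_given_d`, and `p_w_given_z` are illustrative and not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 3, 2, 5  # documents, topics, vocabulary size (toy values, assumed)

p_d = np.full(M, 1.0 / M)                        # p(d), uniform over documents
p_z_given_d = rng.dirichlet(np.ones(K), size=M)  # row d holds p(z|d), shape (M, K)
p_w_given_z = rng.dirichlet(np.ones(N), size=K)  # row z holds p(w|z), shape (K, N)

# p(w|d) = sum_z p(w|z) p(z|d): a single matrix product over the topic axis
p_w_given_d = p_z_given_d @ p_w_given_z          # shape (M, N)

# p(d, w) = p(d) * p(w|d)
p_dw = p_d[:, None] * p_w_given_d
```

The matrix product is also why the slide compares pLSI to one layer of a neural network: the topic layer is a linear bottleneck between documents and words.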

Page 10: (Hierarchical) Topic Modeling_Yueshen Xu

Topic Modeling

Latent Dirichlet Allocation (LDA)

David M. Blei, Andrew Y. Ng, Michael I. Jordan

Hierarchical Bayesian model; Bayesian pLSI

[Figure: LDA plate diagram with nodes 𝜃 → z → w, the topic-word parameter 𝛽, and a plate N marking how many times the word-level step is repeated]

Generative process of LDA

For each document d = {𝑤1, 𝑤2, …, 𝑤𝑁}:

1. Choose N ~ Poisson(𝜉)

2. Choose 𝜃 ~ 𝐷𝑖𝑟(𝛼)

3. For each of the N words 𝑤𝑛 in d:

a) Choose a topic 𝑧𝑛 ~ 𝑀𝑢𝑙𝑡𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝜃)

b) Choose a word 𝑤𝑛 from 𝑝(𝑤𝑛 | 𝑧𝑛, 𝛽), a multinomial distribution conditioned on 𝑧𝑛
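The generative process can be sketched in a few lines of NumPy. This is an illustration only: the function name `generate_document`, the toy sizes, and the hyperparameter values are assumptions, and `beta` is taken to be a K × V matrix whose rows are topic-word distributions.

```python
import numpy as np

def generate_document(alpha, beta, xi, rng):
    """Sample one document following the LDA generative process."""
    n_words = rng.poisson(xi)                  # N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)               # theta ~ Dir(alpha)
    # z_n ~ Multinomial(theta), one topic per word position
    topics = rng.choice(len(alpha), size=n_words, p=theta)
    # w_n ~ p(w | z_n, beta): draw each word from its topic's row of beta
    words = np.array(
        [rng.choice(beta.shape[1], p=beta[z]) for z in topics], dtype=int
    )
    return theta, topics, words

rng = np.random.default_rng(0)
K, V = 3, 8                                    # topics, vocabulary size (assumed)
alpha = np.full(K, 0.5)                        # symmetric Dirichlet prior (assumed)
beta = rng.dirichlet(np.full(V, 0.5), size=K)  # hypothetical topic-word matrix
theta, topics, words = generate_document(alpha, beta, xi=20, rng=rng)
```

Inference runs this process in reverse: given only `words`, estimate `theta` and `beta`.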

Page 11: (Hierarchical) Topic Modeling_Yueshen Xu

Topic Modeling

Parameter Estimation

Variational Inference (+EM): complex, rarely used

‘I want to know a distribution, but I don’t know it yet, so I find a similar distribution (a tight upper or lower bound)’

K-L divergence (or information gain)

Gibbs Sampling (MCMC, Markov Chain Monte Carlo)

‘I want to know a distribution, but I don’t know it yet, so I find a way to generate its samples’

300 lines of code for LDA: not complex, but solid

Stationary Distribution

$$\lim_{n \to \infty} P^n = \begin{pmatrix} \pi(1) & \cdots & \pi(|S|) \\ \vdots & \vdots & \vdots \\ \pi(1) & \cdots & \pi(|S|) \end{pmatrix}, \qquad \lim_{n \to \infty} \pi_0 P^n = \pi, \qquad \pi = \{\pi(1), \pi(2), \ldots, \pi(j), \ldots, \pi(|S|)\}$$
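The stationary-distribution property that MCMC relies on can be seen on a small chain. The sketch below uses an arbitrary 3-state transition matrix `P` and start distribution `pi0` chosen for illustration (they are not from the slides); the chain is irreducible and aperiodic, so the limit exists.

```python
import numpy as np

# Hypothetical transition matrix: row i holds the probabilities of moving from state i
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])
pi0 = np.array([1.0, 0.0, 0.0])       # start in state 0 with certainty

Pn = np.linalg.matrix_power(P, 200)   # P^n for a large n
pi = pi0 @ Pn                         # pi_0 P^n

rows_identical = np.allclose(Pn, Pn[0])  # every row of P^n approaches pi
is_stationary = np.allclose(pi @ P, pi)  # pi P = pi
```

Whatever `pi0` is chosen, `pi` comes out the same, which is why a Gibbs sampler can start from a random topic assignment and still end up sampling from the target posterior.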

Page 12: (Hierarchical) Topic Modeling_Yueshen Xu

Hierarchical Topic Modeling

Topic modeling is not enough

Hierarchical Structure

Page 13: (Hierarchical) Topic Modeling_Yueshen Xu

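Hierarchical Topic Modeling

Chinese Restaurant Process (Dirichlet Process)

A restaurant has an infinite number of tables, and customers (words) enter it sequentially. The i-th customer (𝜃𝑖) either joins an occupied table (𝜙𝑘) with probability proportional to the number of customers already seated there, or opens a new table. In the standard formulation, with n_k customers at table k and concentration parameter 𝛾:

$$p(\theta_i \text{ sits at } \phi_k) = \frac{n_k}{i - 1 + \gamma}, \qquad p(\theta_i \text{ opens a new table}) = \frac{\gamma}{i - 1 + \gamma}$$

The tables 𝜙𝑘 act as clusters. Clustering is about half of unsupervised learning: clustering, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …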
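The seating rule of the Chinese Restaurant Process can be simulated directly. This is a minimal sketch assuming the standard rule (an occupied table is chosen in proportion to its head count, a new table in proportion to a concentration parameter, here called `gamma`); the function name is illustrative.

```python
import numpy as np

def chinese_restaurant_process(n_customers, gamma, rng):
    """Seat customers one by one; return each customer's table and the table sizes."""
    counts = []    # counts[k] = number of customers at table k
    seating = []   # seating[i] = table index chosen by customer i
    for _ in range(n_customers):
        # Occupied table k has weight n_k; a new table has weight gamma.
        weights = np.array(counts + [gamma], dtype=float)
        table = int(rng.choice(len(weights), p=weights / weights.sum()))
        if table == len(counts):
            counts.append(1)       # open a new table
        else:
            counts[table] += 1     # join an occupied table
        seating.append(table)
    return seating, counts

seating, counts = chinese_restaurant_process(100, gamma=1.0, rng=np.random.default_rng(0))
```

The number of occupied tables is not fixed in advance and grows slowly with the number of customers, which is what makes the model non-parametric.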

Page 14: (Hierarchical) Topic Modeling_Yueshen Xu

Hierarchical Topic Modeling

The generative process (nested CRP)

Focus on the insight

1. Let 𝑐1 be the root restaurant (only one table)

2. For each level 𝑙 ∈ {2, … , 𝐿}:

Draw a table from restaurant 𝑐𝑙−1 using the CRP. Set 𝑐𝑙 to be the restaurant referred to by that table

3. Draw an 𝐿-dimensional topic proportion vector 𝜃 ~ 𝐷𝑖𝑟(𝛼)

4. For each word 𝑤𝑛:

Draw 𝑧 ∈ {1, … , 𝐿} ~ Mult(𝜃)

Draw 𝑤𝑛 from the topic associated with restaurant 𝑐𝑧
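The four steps above can be sketched as follows. The `Node` class, the function names, the toy sizes, and the topic prior `eta` are assumptions introduced for illustration; each restaurant is given a topic drawn from a symmetric Dirichlet, which the slide does not specify.

```python
import numpy as np

class Node:
    """One restaurant in the tree; it owns a topic (a distribution over the vocabulary)."""
    def __init__(self, vocab_size, eta, rng):
        self.topic = rng.dirichlet(np.full(vocab_size, eta))  # assumed topic prior
        self.children = []  # restaurants referred to by this restaurant's tables
        self.counts = []    # counts[k] = documents that chose table k

def draw_path(root, depth, gamma, vocab_size, eta, rng):
    """Steps 1-2: start at the root and pick a table by CRP at each level."""
    path = [root]
    for _ in range(1, depth):
        node = path[-1]
        weights = np.array(node.counts + [gamma], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(node.children):  # a new table opens a new child restaurant
            node.children.append(Node(vocab_size, eta, rng))
            node.counts.append(0)
        node.counts[k] += 1
        path.append(node.children[k])
    return path

def generate_document(root, depth, n_words, alpha, gamma, vocab_size, eta, rng):
    path = draw_path(root, depth, gamma, vocab_size, eta, rng)
    theta = rng.dirichlet(np.full(depth, alpha))       # step 3: theta ~ Dir(alpha)
    levels = rng.choice(depth, size=n_words, p=theta)  # step 4: z ~ Mult(theta)
    words = [int(rng.choice(vocab_size, p=path[z].topic)) for z in levels]
    return path, words

rng = np.random.default_rng(0)
root = Node(vocab_size=10, eta=0.5, rng=rng)
docs = [generate_document(root, depth=3, n_words=15, alpha=1.0, gamma=1.0,
                          vocab_size=10, eta=0.5, rng=rng) for _ in range(20)]
```

Every document shares the root topic, while documents that follow the same path also share the more specific topics below it; this sharing is what produces the hierarchy.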

[Figure: plate diagram of the model, with nodes 𝛼, z_{m,n}, w_{m,n}, c_1, c_2, …, c_L, 𝛾, 𝛽 and plates N, M, T]

Analogy: a Matryoshka (Russian) doll

Page 15: (Hierarchical) Topic Modeling_Yueshen Xu

Hierarchical Topic Modeling

Examples

[Figure: a learned topic hierarchy; each node lists its top five words]

Root topic: analysis, obtain, base, system, concentration

Other topics in the tree:

thermal, polymer, acid, property, diamine

activity, compound, acid, derivative, active

compound, ligand, group, investigate, synergistic

reaction, derivative, yield, synthesis, microwave

assay, food, quality, content, analysis

decoction, component, radix, quality, constituent

compound, activity, synthesize, salt, derivative

antioxidant, activity, extract, inhibitory, flavonoid

interaction, cation, metal, energy, solution

Page 16: (Hierarchical) Topic Modeling_Yueshen Xu

What I do

Topic-specific opinion mining

Goal: automatically learn which groups of aspects people like or dislike, how they like them, and why

Methods: topic model (LDA), Dirichlet process, Gibbs sampling, etc.

Collaborative recommendation

Goal: automatically learn which groups of products people like or dislike, how they like them, and why

Methods: matrix factorization, gradient descent, regularization norm, etc.

Common basics: Bayesian inference (MLE, MAP, PGM)

Page 17: (Hierarchical) Topic Modeling_Yueshen Xu

Supplement

Some supplements

Probabilistic Graphical Model

Modeling a Bayesian network using plates and circles

Generative Model & Discriminative Model: both target 𝑝(𝜃|𝑋), where 𝑋 is the data

Generative Model: 𝑝(𝜃|𝑋) ∝ 𝑝(𝑋|𝜃)𝑝(𝜃)

- Naïve Bayes, GMM, pLSA, LDA, HMM, HDP…: mostly unsupervised learning

Discriminative Model: models 𝑝(𝜃|𝑋) directly

- LR, KNN, SVM, Boosting, Decision Tree: supervised learning

These can also be represented by graphical models
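The relation p(θ|X) ∝ p(X|θ)p(θ) can be made concrete with a coin-flip example. This is a sketch under assumed numbers: the counts `heads` and `tails` and the Beta prior pseudo-counts `a` and `b` are invented for illustration. It contrasts MLE (likelihood only) with MAP (likelihood times prior).

```python
import numpy as np

heads, tails = 7, 3   # observed data X (assumed)
a, b = 2.0, 2.0       # Beta(a, b) prior on theta (assumed)

# MLE maximizes p(X|theta) = theta^heads * (1 - theta)^tails
mle = heads / (heads + tails)

# The posterior is Beta(a + heads, b + tails); MAP is its mode
map_estimate = (a + heads - 1) / (a + b + heads + tails - 2)

# Same MAP found numerically by maximizing likelihood * prior on a grid
theta = np.linspace(0.001, 0.999, 999)
unnormalized_posterior = (theta**heads * (1 - theta)**tails      # p(X|theta)
                          * theta**(a - 1) * (1 - theta)**(b - 1))  # p(theta), up to a constant
grid_map = theta[np.argmax(unnormalized_posterior)]
```

The prior pulls the estimate toward 0.5, and its influence shrinks as more data arrive. The Dirichlet priors in LDA play the same role for topic and word proportions.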

Page 18: (Hierarchical) Topic Modeling_Yueshen Xu

Reference

My previous tutorials/notes (ZJU/UIC/Netease/ITRZJU, as a Ph.D. student)

‘Topic modeling (an introduction)’

‘Non-parametric Bayesian learning in discrete data’

‘The research of topic modeling in text mining’

‘Matrix factorization with user generated content’

etc.

Website

All of my slides can be downloaded from:

http://web.xidian.edu.cn/ysxu/teach.html

http://liu.cs.uic.edu/yueshenxu/

http://www.slideshare.net/obamaxys2011

https://www.researchgate.net/profile/Yueshen_Xu

Page 19: (Hierarchical) Topic Modeling_Yueshen Xu

Reference

• David Blei et al. Latent Dirichlet Allocation. JMLR, 2003

• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007

• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006

• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012

• David Blei et al. The Nested Chinese Restaurant Process and Bayesian Inference of Topic Hierarchies. Journal of the ACM, 2010

• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008

• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973

• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference

• Rick Durrett. Probability: Theory and Examples, 2010

• Christopher Bishop. Pattern Recognition and Machine Learning, 2007

• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014

Page 20: (Hierarchical) Topic Modeling_Yueshen Xu

Q&A