Hierarchical Dirichlet Processes

Presenters: Micah Hodosh, Yizhou Sun 4/7/2010

Page 1: Hierarchical Dirichlet Processes

1

Hierarchical Dirichlet Processes

Presenters:

Micah Hodosh, Yizhou Sun

4/7/2010

Page 2: Hierarchical Dirichlet Processes

2

Content

• Introduction and Motivation

• Dirichlet Processes

• Hierarchical Dirichlet Processes
– Definition
– Three Analogs

• Inference
– Three Sampling Strategies

Page 3: Hierarchical Dirichlet Processes

3

Introduction

Hierarchical approach to model-based clustering of grouped data

• Find an unknown number of clusters to capture the structure of each group, and allow clusters to be shared among the groups

• For example, documents with an arbitrary number of topics, which are shared globally across the set of corpora

• A Dirichlet process (DP) will be used as a prior on the mixture components

• The DP will be extended to a hierarchical Dirichlet process (HDP) to allow clusters to be shared among related clustering problems

Page 4: Hierarchical Dirichlet Processes

4

Motivation

Interested in problems with observations organized into groups

Let $x_{ji}$ be the $i$-th observation of group $j$, with $x_j = \{x_{j1}, x_{j2}, \ldots\}$

$x_{ji}$ is exchangeable with any other element of $x_j$

For all $j, k$, $x_j$ is exchangeable with $x_k$

Page 5: Hierarchical Dirichlet Processes

5

Motivation

Assume each observation is drawn independently from a mixture model

Factor $\theta_{ji}$ is the mixture component associated with $x_{ji}$

Let $F(\theta_{ji})$ be the distribution of $x_{ji}$ given $\theta_{ji}$

Let $G_j$ be the prior distribution of $\theta_{j1}, \theta_{j2}, \ldots$, which are conditionally independent given $G_j$

Page 6: Hierarchical Dirichlet Processes

6

Content

• Introduction and Motivation

• Dirichlet Processes

• Hierarchical Dirichlet Processes
– Definition
– Three Analogs

• Inference
– Three Sampling Strategies

Page 7: Hierarchical Dirichlet Processes

7

The Dirichlet Process

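Let $(\Theta, \mathcal{B})$ be a measurable space, and let $G_0$ be a probability measure on that space

Let $A = (A_1, A_2, \ldots, A_r)$ be a finite measurable partition of that space

Let $\alpha_0$ be a positive real number

$G \sim \mathrm{DP}(\alpha_0, G_0)$ is defined s.t. for all $A$:

$$(G(A_1), \ldots, G(A_r)) \sim \mathrm{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_r))$$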

Page 8: Hierarchical Dirichlet Processes

8

Stick Breaking Construction

The general idea is that the distribution $G$ is a weighted sum of point masses placed at an infinite sequence of random variables

Two infinite sets of i.i.d. random variables:

• $\phi_k \sim G_0$ – samples from the initial probability measure

• $\pi_k' \sim \mathrm{Beta}(1, \alpha_0)$ – defines the weights of these samples

Page 9: Hierarchical Dirichlet Processes

9

Stick Breaking Construction

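$\pi_k' \sim \mathrm{Beta}(1, \alpha_0)$

Define $\pi_k$ as

$$\pi_k = \pi_k' \prod_{l=1}^{k-1} (1 - \pi_l')$$

[Figure: a unit-length stick from 0 to 1. The first break removes a piece of length $\pi_1'$ and leaves $1 - \pi_1'$; the second piece has length $(1 - \pi_1')\pi_2'$, and so on.]

$$\sum_{k=1}^{\infty} \pi_k = 1$$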

Page 10: Hierarchical Dirichlet Processes

10

Stick Breaking Construction

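$\pi = (\pi_k)_{k=1}^{\infty} \sim \mathrm{GEM}(\alpha_0)$

These $\pi_k$ define the weight of drawing the value corresponding to $\phi_k$:

$$G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}$$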
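The stick-breaking construction can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the truncation level, the standard normal base measure and the function names `stick_breaking_weights` and `sample_truncated_dp` are all assumptions made for the example. A real DP draw has infinitely many atoms; here the mass not yet broken off is returned as `leftover`.

```python
import random


def stick_breaking_weights(alpha0, truncation, rng):
    """Return the first `truncation` weights pi_k of a GEM(alpha0) draw
    and the length of the stick that has not been broken off yet."""
    weights = []
    remaining = 1.0  # length of the stick still available
    for _ in range(truncation):
        frac = rng.betavariate(1.0, alpha0)  # pi_k' ~ Beta(1, alpha0)
        weights.append(remaining * frac)     # pi_k = pi_k' * prod_{l<k} (1 - pi_l')
        remaining *= 1.0 - frac
    return weights, remaining


def sample_truncated_dp(alpha0, base_sampler, truncation, rng):
    """Truncated draw G ~ DP(alpha0, G0): atoms phi_k ~ G0 with weights pi_k."""
    weights, leftover = stick_breaking_weights(alpha0, truncation, rng)
    atoms = [base_sampler(rng) for _ in range(truncation)]
    return atoms, weights, leftover


# Assumed example: G0 is a standard normal, alpha0 = 2, 50 atoms kept.
rng = random.Random(0)
atoms, weights, leftover = sample_truncated_dp(
    alpha0=2.0,
    base_sampler=lambda r: r.gauss(0.0, 1.0),
    truncation=50,
    rng=rng,
)
print(sum(weights) + leftover)  # the pieces plus the leftover make up the whole stick
```

Larger values of `alpha0` break off smaller pieces, so the mass is spread over more atoms.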

Page 11: Hierarchical Dirichlet Processes

11

Polya urn scheme/ CRP

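Let $\theta_1, \theta_2, \ldots$ be i.i.d. random variables distributed according to $G$

Consider the distribution of $\theta_i$ given $\theta_1, \ldots, \theta_{i-1}$, integrating out $G$:

$$\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \alpha_0, G_0 \sim \sum_{l=1}^{i-1} \frac{1}{i-1+\alpha_0} \delta_{\theta_l} + \frac{\alpha_0}{i-1+\alpha_0} G_0$$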

Page 12: Hierarchical Dirichlet Processes

12

Polya urn scheme

Consider a simple urn model representation. Each sample is a ball of a certain color

Balls are drawn equiprobably, and when a ball of color x is drawn, both that ball and a new ball of color x are returned to the urn

With probability proportional to $\alpha_0$, a new atom is created from $G_0$: a new ball of a new color is added to the urn

Page 13: Hierarchical Dirichlet Processes

13

Polya urn scheme

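Let $\phi_1, \ldots, \phi_K$ be the distinct values taken on by $\theta_1, \ldots, \theta_{i-1}$

If $m_k$ is the number of values of $\theta_1, \ldots, \theta_{i-1}$ equal to $\phi_k$:

$$\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \alpha_0, G_0 \sim \sum_{k=1}^{K} \frac{m_k}{i-1+\alpha_0} \delta_{\phi_k} + \frac{\alpha_0}{i-1+\alpha_0} G_0$$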

Page 14: Hierarchical Dirichlet Processes

14

Chinese restaurant process:

[Figure: Chinese restaurant process, with customers θ1, θ2, θ3, θ4 seated at tables labeled ϕ1, ϕ3, ...]
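The seating rule of the Chinese restaurant process can be simulated directly. The sketch below is an illustration only (the function name `chinese_restaurant_process` and the example parameters are assumptions): customer $i$ joins existing table $k$ with probability proportional to its occupancy $m_k$ and opens a new table with probability proportional to $\alpha_0$.

```python
import random


def chinese_restaurant_process(num_customers, alpha0, rng):
    """Seat customers one at a time; return table assignments and table sizes."""
    table_counts = []  # m_k: number of customers at table k
    assignments = []   # table index chosen by each customer
    for i in range(num_customers):
        # i customers are already seated, so the total weight is i + alpha0.
        u = rng.random() * (i + alpha0)
        cumulative = 0.0
        chosen = len(table_counts)  # default: open a new table
        for k, m_k in enumerate(table_counts):
            cumulative += m_k
            if u < cumulative:
                chosen = k
                break
        if chosen == len(table_counts):
            table_counts.append(1)
        else:
            table_counts[chosen] += 1
        assignments.append(chosen)
    return assignments, table_counts


# Assumed example: 100 customers, alpha0 = 1.
assignments, table_counts = chinese_restaurant_process(100, 1.0, random.Random(0))
print(len(table_counts), table_counts)
```

The number of occupied tables grows slowly with the number of customers, which is what lets the DP mixture use an unknown number of clusters.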

Page 15: Hierarchical Dirichlet Processes

15

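Dirichlet Process Mixture Model

Dirichlet process as a nonparametric prior on the parameters of a mixture model:

$$\begin{aligned} G \mid \alpha_0, G_0 &\sim \mathrm{DP}(\alpha_0, G_0) \\ \theta_i \mid G &\sim G \\ x_i \mid \theta_i &\sim F(\theta_i) \end{aligned}$$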

Page 16: Hierarchical Dirichlet Processes

16

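Dirichlet Process Mixture Model

From the stick-breaking representation:

$\theta_i$ takes the value $\phi_k$ with probability $\pi_k$

Let $z_i$ be the indicator variable representing which $\phi_k$ the factor $\theta_i$ is associated with:

$$\begin{aligned} \pi \mid \alpha_0 &\sim \mathrm{GEM}(\alpha_0) \\ z_i \mid \pi &\sim \pi \\ \phi_k \mid G_0 &\sim G_0 \\ x_i \mid z_i, (\phi_k)_{k=1}^{\infty} &\sim F(\phi_{z_i}) \end{aligned}$$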

Page 17: Hierarchical Dirichlet Processes

17

Infinite Limit of Finite Mixture Model

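Consider a multinomial on $L$ mixture components with parameters $\pi = (\pi_1, \ldots, \pi_L)$

Let $\pi$ have a symmetric Dirichlet prior with hyperparameters $(\alpha_0/L, \ldots, \alpha_0/L)$

If $x_i$ is drawn from a mixture component $z_i$ according to the defined distribution:

$$\begin{aligned} \pi \mid \alpha_0 &\sim \mathrm{Dir}(\alpha_0/L, \ldots, \alpha_0/L) \\ z_i \mid \pi &\sim \pi \\ \phi_k \mid G_0 &\sim G_0 \\ x_i \mid z_i, (\phi_k)_{k=1}^{L} &\sim F(\phi_{z_i}) \end{aligned}$$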

Page 18: Hierarchical Dirichlet Processes

18

Infinite Limit of Finite Mixture Model

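If $G^L = \sum_{k=1}^{L} \pi_k \delta_{\phi_k}$, then as $L$ approaches $\infty$, for every measurable function $f$ integrable with respect to $G_0$:

$$\int f(\theta)\, dG^L(\theta) \xrightarrow{\;d\;} \int f(\theta)\, dG(\theta)$$

The marginal distribution of $x_1, x_2, \ldots$ approaches that of a Dirichlet process mixture model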
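One way to see the limit is to integrate out $\pi$ in the finite model. With a symmetric $\mathrm{Dir}(\alpha_0/L, \ldots, \alpha_0/L)$ prior, the next item joins an already used component $k$ with probability $(n_k + \alpha_0/L)/(i + \alpha_0)$ after $i$ items, and opens one of the unused components with the remaining probability. The sketch below (function names and parameters are assumptions for illustration) shows that the latter tends to the Chinese restaurant value $\alpha_0/(i + \alpha_0)$ as $L$ grows.

```python
import random


def prob_new_component(num_seen, num_used, alpha0, num_components):
    """P(next item opens a currently unused component), with pi integrated out."""
    unused = num_components - num_used
    return unused * (alpha0 / num_components) / (num_seen + alpha0)


def finite_mixture_assignments(num_items, alpha0, num_components, rng):
    """Draw z_1..z_n from the L-component model with pi ~ Dir(alpha0/L, ...)
    integrated out. Components are labeled in order of first use."""
    counts = []       # n_k for the components used so far
    assignments = []
    per_component = alpha0 / num_components
    for i in range(num_items):
        u = rng.random() * (i + alpha0)
        cumulative = 0.0
        chosen = None
        for k, n_k in enumerate(counts):
            cumulative += n_k + per_component
            if u < cumulative:
                chosen = k
                break
        if chosen is None:
            if len(counts) < num_components:
                counts.append(0)     # open one of the unused components
            chosen = len(counts) - 1  # (last index also covers rounding when all are used)
        counts[chosen] += 1
        assignments.append(chosen)
    return assignments, counts


# Assumed example: the probability of a new component after 10 items, 3 components used.
for L in (5, 50, 5000):
    print(L, prob_new_component(10, 3, 2.0, L))
print("CRP limit:", 2.0 / (10 + 2.0))
```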

Page 19: Hierarchical Dirichlet Processes

19

Content

• Introduction and Motivation

• Dirichlet Processes

• Hierarchical Dirichlet Processes
– Definition
– Three Analogs

• Inference
– Three Sampling Strategies

Page 20: Hierarchical Dirichlet Processes

20

HDP Definition

• General idea
– To model grouped data
• Each group j <=> a Dirichlet process mixture model
• Hierarchical prior to link these mixture models <=> hierarchical Dirichlet process

– A hierarchical Dirichlet process is
• A distribution over a set of random probability measures $(G_j)$

Page 21: Hierarchical Dirichlet Processes

21

HDP Definition (Cont.)

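• Formally, a hierarchical Dirichlet process defines
– A set of random probability measures $G_j$, one for each group $j$
– A global random probability measure $G_0$

• $G_0$ is distributed as a Dirichlet process: $G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)$

• The $G_j$ are conditionally independent given $G_0$, and also follow a DP: $G_j \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0)$

$G_0$ is discrete!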

Page 22: Hierarchical Dirichlet Processes

22

Hierarchical Dirichlet Process Mixture Model

• Hierarchical Dirichlet process as prior distribution over the factors for grouped data

• For each group $j$
– Each observation $x_{ji}$ corresponds to a factor $\theta_{ji}$
– The factors are i.i.d. random variables distributed as $G_j$

Page 23: Hierarchical Dirichlet Processes

23

Some Notes

• HDP can be extended to more than two levels
– The base measure $H$ can itself be drawn from a DP, and so on
– A tree can be formed
• Each node is a DP
• Child nodes are conditionally independent given their parent, which serves as their base measure
• The atoms at a given node are shared among all its descendant nodes

Page 24: Hierarchical Dirichlet Processes

24

Analog I: The stick-breaking construction

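• Stick-breaking representation of $G_0$

$$G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}$$

i.e., $\phi_k \sim H$ and $\beta = (\beta_k)_{k=1}^{\infty} \sim \mathrm{GEM}(\gamma)$

• Stick-breaking representation of $G_j$

$$G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\phi_k}$$

i.e., $\pi_j = (\pi_{jk})_{k=1}^{\infty} \sim \mathrm{DP}(\alpha_0, \beta)$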

Page 25: Hierarchical Dirichlet Processes

25

Equivalent representation using conditional distributions
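In Teh et al. (2006) the conditional form reads $\beta \mid \gamma \sim \mathrm{GEM}(\gamma)$, $\pi_j \mid \alpha_0, \beta \sim \mathrm{DP}(\alpha_0, \beta)$, $z_{ji} \mid \pi_j \sim \pi_j$, $\phi_k \mid H \sim H$, $x_{ji} \mid z_{ji}, (\phi_k) \sim F(\phi_{z_{ji}})$, and the group weights have their own stick-breaking form $\pi_{jk}' \sim \mathrm{Beta}\big(\alpha_0 \beta_k,\ \alpha_0 (1 - \sum_{l=1}^{k} \beta_l)\big)$, $\pi_{jk} = \pi_{jk}' \prod_{l<k} (1 - \pi_{jl}')$. The following is a rough sketch of the two levels of weights under a finite truncation; the truncation level, the parameter values and the function names are assumptions for illustration.

```python
import random


def gem_weights(gamma, truncation, rng):
    """First `truncation` global weights beta_k of a GEM(gamma) draw, plus the
    tail masses 1 - sum_{l<=k} beta_l left after each break."""
    betas, tails = [], []
    remaining = 1.0
    for _ in range(truncation):
        frac = rng.betavariate(1.0, gamma)
        betas.append(remaining * frac)
        remaining *= 1.0 - frac
        tails.append(remaining)
    return betas, tails


def group_weights(alpha0, betas, tails, rng):
    """Truncated draw of pi_j ~ DP(alpha0, beta) by stick breaking:
    pi_jk' ~ Beta(alpha0 * beta_k, alpha0 * (1 - sum_{l<=k} beta_l))."""
    pis = []
    remaining = 1.0
    for beta_k, tail_k in zip(betas, tails):
        a, b = alpha0 * beta_k, alpha0 * tail_k
        if a <= 0.0:      # guard against numerical underflow of beta_k
            frac = 0.0
        elif b <= 0.0:    # guard against numerical underflow of the tail
            frac = 1.0
        else:
            frac = rng.betavariate(a, b)
        pis.append(remaining * frac)
        remaining *= 1.0 - frac
    return pis, remaining


# Assumed example: 3 groups sharing 20 global atoms.
rng = random.Random(0)
betas, tails = gem_weights(gamma=1.0, truncation=20, rng=rng)
for j in range(3):
    pis, rest = group_weights(alpha0=5.0, betas=betas, tails=tails, rng=rng)
    print(j, [round(p, 3) for p in pis[:5]])
```

Every group reuses the same atoms $\phi_k$ but with its own weights $\pi_{jk}$, which vary around the global weights $\beta_k$; larger `alpha0` keeps them closer to $\beta$.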

Page 26: Hierarchical Dirichlet Processes

26

Analog II: the Chinese restaurant franchise

• General idea:
– Allow multiple restaurants to share a common menu, which includes a set of dishes
– A restaurant has infinitely many tables, and each table serves only one dish

Page 27: Hierarchical Dirichlet Processes

27

Notations

• $\theta_{ji}$ – the factor (dish) corresponding to $x_{ji}$

• $\phi_1, \ldots, \phi_K$ – the factors (dishes) drawn from $H$

• $\psi_{jt}$ – the dish chosen by table $t$ in restaurant $j$

• $t_{ji}$: the index of the $\psi_{jt}$ associated with $\theta_{ji}$

• $k_{jt}$: the index of the $\phi_k$ associated with $\psi_{jt}$

Page 28: Hierarchical Dirichlet Processes

28

Conditional distributions

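• Integrate out $G_j$ (sampling table for customer)

$$\theta_{ji} \mid \theta_{j1}, \ldots, \theta_{j,i-1}, \alpha_0, G_0 \sim \sum_{t=1}^{m_{j\cdot}} \frac{n_{jt\cdot}}{i-1+\alpha_0} \delta_{\psi_{jt}} + \frac{\alpha_0}{i-1+\alpha_0} G_0$$

• Integrate out $G_0$ (sampling dish for table)

$$\psi_{jt} \mid \psi_{11}, \psi_{12}, \ldots, \psi_{21}, \ldots, \psi_{j,t-1}, \gamma, H \sim \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot}+\gamma} \delta_{\phi_k} + \frac{\gamma}{m_{\cdot\cdot}+\gamma} H$$

Count notation:
– $n_{jtk}$: number of customers in restaurant $j$ at table $t$ eating dish $k$
– $m_{jk}$: number of tables in restaurant $j$ serving dish $k$
– A dot in place of an index denotes a sum over that index, e.g. $m_{\cdot k} = \sum_j m_{jk}$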
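The two conditional distributions translate into a two-level seating simulation. The sketch below draws only the table and dish indices from the prior (no data and no dish parameters); the function names, the dictionary layout and the example parameters are assumptions for illustration. A customer picks an existing table in proportion to its occupancy or a new table in proportion to $\alpha_0$; a new table picks an existing dish in proportion to the number of tables serving it across all restaurants, or a new dish in proportion to $\gamma$.

```python
import random


def weighted_choice(weights, rng):
    """Return an index drawn with probability proportional to `weights`."""
    u = rng.random() * sum(weights)
    cumulative = 0.0
    for index, w in enumerate(weights):
        cumulative += w
        if u < cumulative:
            return index
    return len(weights) - 1  # fallback for floating-point rounding


def chinese_restaurant_franchise(group_sizes, alpha0, gamma, rng):
    """Simulate table indices t_ji and dish indices k_jt from the CRF prior."""
    dish_tables = []  # m_.k: number of tables, over all restaurants, serving dish k
    franchise = []
    for n_j in group_sizes:
        table_sizes = []     # n_jt.: customers at table t of this restaurant
        table_dish = []      # k_jt: dish served at table t
        customer_table = []  # t_ji: table chosen by customer i
        for _ in range(n_j):
            t = weighted_choice(table_sizes + [alpha0], rng)
            if t == len(table_sizes):  # new table: order a dish from the shared menu
                k = weighted_choice(dish_tables + [gamma], rng)
                if k == len(dish_tables):  # new dish
                    dish_tables.append(0)
                dish_tables[k] += 1
                table_sizes.append(0)
                table_dish.append(k)
            table_sizes[t] += 1
            customer_table.append(t)
        franchise.append({
            "table_sizes": table_sizes,
            "table_dish": table_dish,
            "customer_table": customer_table,
        })
    return franchise, dish_tables


# Assumed example: 3 restaurants with 30 customers each.
franchise, dish_tables = chinese_restaurant_franchise([30, 30, 30], 1.0, 1.0, random.Random(0))
for j, restaurant in enumerate(franchise):
    print(j, restaurant["table_dish"])
print("tables per dish:", dish_tables)
```

The same dish index typically appears in several restaurants, which is the sharing of mixture components across groups.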

Page 29: Hierarchical Dirichlet Processes

29

Analog III: The infinite limit of finite mixture models

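• Two different finite models both yield the hierarchical Dirichlet process mixture model (HDPM)

– Global mixing proportions $\beta$ place a prior on the group-specific mixing proportions $\pi_j$

$$\begin{aligned} \beta \mid \gamma &\sim \mathrm{Dir}(\gamma/L, \ldots, \gamma/L) \\ \pi_j \mid \alpha_0, \beta &\sim \mathrm{Dir}(\alpha_0 \beta) \\ \phi_k \mid H &\sim H \\ z_{ji} \mid \pi_j &\sim \pi_j \\ x_{ji} \mid z_{ji}, (\phi_k)_{k=1}^{L} &\sim F(\phi_{z_{ji}}) \end{aligned}$$

As $L$ goes to infinity, this finite model approaches the HDP mixture model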

Page 30: Hierarchical Dirichlet Processes

30

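– Each group chooses a subset of $T$ mixture components

$$\begin{aligned} \beta \mid \gamma &\sim \mathrm{Dir}(\gamma/L, \ldots, \gamma/L) \\ k_{jt} \mid \beta &\sim \beta \\ \pi_j \mid \alpha_0 &\sim \mathrm{Dir}(\alpha_0/T, \ldots, \alpha_0/T) \\ t_{ji} \mid \pi_j &\sim \pi_j \\ \phi_k \mid H &\sim H \\ x_{ji} \mid t_{ji}, (k_{jt})_{t=1}^{T}, (\phi_k)_{k=1}^{L} &\sim F(\phi_{k_{j t_{ji}}}) \end{aligned}$$

As $L$ and $T$ go to infinity, this model also approaches the HDP mixture model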

Page 31: Hierarchical Dirichlet Processes

31

Content

• Introduction and Motivation

• Dirichlet Processes

• Hierarchical Dirichlet Processes
– Definition
– Three Analogs

• Inference
– Three Sampling Strategies

Page 32: Hierarchical Dirichlet Processes

32

Introduction to three MCMC schemes

• Assumption: $H$ is conjugate to $F$
– Scheme I: a straightforward Gibbs sampler based on the Chinese restaurant franchise
– Scheme II: an augmented representation involving both the Chinese restaurant franchise and the posterior for $G_0$
– Scheme III: a variation on Scheme II with streamlined bookkeeping

Page 33: Hierarchical Dirichlet Processes

33

Conditional density of data under mixture component k

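• For data $x_{ji}$, the conditional density under component $k$ given all data items except $x_{ji}$ is:

$$f_k^{-x_{ji}}(x_{ji}) = \frac{\int f(x_{ji} \mid \phi_k) \prod_{j'i' \neq ji,\; z_{j'i'} = k} f(x_{j'i'} \mid \phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{j'i' \neq ji,\; z_{j'i'} = k} f(x_{j'i'} \mid \phi_k)\, h(\phi_k)\, d\phi_k}$$

where $f(\cdot \mid \phi)$ and $h$ are the densities of $F(\phi)$ and $H$

• For a data set $\mathbf{x}_{jt}$, the conditional density $f_k^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt})$ is similarly defined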
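Conjugacy is what makes this conditional density cheap to evaluate. As one concrete case (an assumption for illustration, matching the topic-model setting rather than anything fixed by the slides), let $F$ be a categorical distribution over a vocabulary of size $V$ and $H$ a symmetric Dirichlet with parameter $\eta$. The integral then reduces to a ratio of counts, $(n_{kw} + \eta)/(n_{k\cdot} + V\eta)$, where the counts exclude the item being scored. The function name and arguments below are hypothetical.

```python
def predictive_density(word, component_word_counts, eta, vocab_size):
    """Conditional density of `word` under one component for a categorical
    likelihood with a symmetric Dirichlet(eta) prior.

    `component_word_counts` maps each word to the number of OTHER data items
    currently assigned to the component (the item being scored is excluded).
    """
    n_kw = component_word_counts.get(word, 0)
    n_k = sum(component_word_counts.values())
    return (n_kw + eta) / (n_k + vocab_size * eta)


# Assumed example: vocabulary {"a", "b", "c", "d"}, eta = 0.5.
counts = {"a": 3, "c": 1}
print(predictive_density("a", counts, eta=0.5, vocab_size=4))
print(predictive_density("b", counts, eta=0.5, vocab_size=4))
# A brand-new component has no counts, so every word gets density 1 / vocab_size.
print(predictive_density("b", {}, eta=0.5, vocab_size=4))
```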

Page 34: Hierarchical Dirichlet Processes

34

Scheme I: Posterior sampling in the Chinese restaurant franchise

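• Sampling $t$ and $k$
– Sampling $t$

$$p(t_{ji} = t \mid \mathbf{t}^{-ji}, \mathbf{k}) \propto \begin{cases} n_{jt\cdot}^{-ji}\, f_{k_{jt}}^{-x_{ji}}(x_{ji}) & \text{if } t \text{ is previously used} \\ \alpha_0\, p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji} = t^{\mathrm{new}}, \mathbf{k}) & \text{if } t = t^{\mathrm{new}} \end{cases}$$

• If $t_{ji}$ is a new $t$, sample the $k$ corresponding to it by

$$p(k_{jt^{\mathrm{new}}} = k \mid \mathbf{t}, \mathbf{k}^{-jt^{\mathrm{new}}}) \propto \begin{cases} m_{\cdot k}\, f_k^{-x_{ji}}(x_{ji}) & \text{if } k \text{ is previously used} \\ \gamma\, f_{k^{\mathrm{new}}}^{-x_{ji}}(x_{ji}) & \text{if } k = k^{\mathrm{new}} \end{cases}$$

• And

$$p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji} = t^{\mathrm{new}}, \mathbf{k}) = \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma} f_k^{-x_{ji}}(x_{ji}) + \frac{\gamma}{m_{\cdot\cdot} + \gamma} f_{k^{\mathrm{new}}}^{-x_{ji}}(x_{ji})$$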

Page 35: Hierarchical Dirichlet Processes

35

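– Sampling $k$

$$p(k_{jt} = k \mid \mathbf{t}, \mathbf{k}^{-jt}) \propto \begin{cases} m_{\cdot k}^{-jt}\, f_k^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt}) & \text{if } k \text{ is previously used} \\ \gamma\, f_{k^{\mathrm{new}}}^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt}) & \text{if } k = k^{\mathrm{new}} \end{cases}$$

where $\mathbf{x}_{jt}$ is all the observations for table $t$ in restaurant $j$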
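As a rough sketch of the table-sampling step in Scheme I, the function below computes the unnormalized probabilities for $t_{ji}$ from Teh et al. (2006): each existing table gets weight $n_{jt\cdot}^{-ji} f_{k_{jt}}^{-x_{ji}}(x_{ji})$, and a new table gets $\alpha_0$ times the likelihood of $x_{ji}$ averaged over the dish the new table might serve. The function name, the argument layout and the numbers in the example are assumptions; the conditional densities are taken as precomputed inputs, and all counts are assumed to already exclude customer $ji$.

```python
def table_weights(table_sizes, table_dish, dish_tables, f_existing, f_new, alpha0, gamma):
    """Unnormalized probabilities for t_ji in Scheme I.

    table_sizes[t] : customers at table t of restaurant j (excluding customer ji)
    table_dish[t]  : dish index k_jt served at table t
    dish_tables[k] : m_.k, tables over all restaurants serving dish k
    f_existing[k]  : conditional density of x_ji under existing component k
    f_new          : conditional density of x_ji under a brand-new component
    The last entry of the result is the weight of opening a new table.
    """
    total_tables = sum(dish_tables)  # m_..
    # Likelihood of x_ji at a new table, averaging over the dish it would serve.
    new_table_likelihood = (
        sum(m_k * f_k for m_k, f_k in zip(dish_tables, f_existing)) + gamma * f_new
    ) / (total_tables + gamma)
    weights = [n_jt * f_existing[k] for n_jt, k in zip(table_sizes, table_dish)]
    weights.append(alpha0 * new_table_likelihood)
    return weights


# Assumed toy numbers: two tables in this restaurant, two dishes in the franchise.
weights = table_weights(
    table_sizes=[2, 1], table_dish=[0, 1], dish_tables=[3, 2],
    f_existing=[0.5, 0.1], f_new=0.25, alpha0=1.0, gamma=1.0,
)
total = sum(weights)
print([w / total for w in weights])  # normalized probabilities over tables
```

If the new table is chosen, its dish is then drawn with weights $m_{\cdot k} f_k^{-x_{ji}}(x_{ji})$ for existing dishes and $\gamma f_{k^{\mathrm{new}}}^{-x_{ji}}(x_{ji})$ for a new one.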

Page 36: Hierarchical Dirichlet Processes

36

Scheme II: Posterior sampling with an augmented representation

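• Posterior of $G_0$ given $\psi_{jt}$:

$$G_0 \mid \gamma, H, \boldsymbol{\psi} \sim \mathrm{DP}\left(\gamma + m_{\cdot\cdot},\; \frac{\gamma H + \sum_{k=1}^{K} m_{\cdot k} \delta_{\phi_k}}{\gamma + m_{\cdot\cdot}}\right)$$

• An explicit construction for $G_0$ is given:

$$\begin{aligned} \beta = (\beta_1, \ldots, \beta_K, \beta_u) &\sim \mathrm{Dir}(m_{\cdot 1}, \ldots, m_{\cdot K}, \gamma) \\ G_u &\sim \mathrm{DP}(\gamma, H) \\ G_0 &= \sum_{k=1}^{K} \beta_k \delta_{\phi_k} + \beta_u G_u \end{aligned}$$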

Page 37: Hierarchical Dirichlet Processes

37

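• Given a sample of $G_0$, the posterior for each group factorizes, and sampling in each group can be performed separately

• Sampling $t$ and $k$:
– Almost the same as in Scheme I
• Except using $\beta_k$ and $\beta_u$ to replace $m_{\cdot k}$ and $\gamma$
• When a new component $k^{\mathrm{new}}$ is instantiated, draw $b \sim \mathrm{Beta}(1, \gamma)$, and set $\beta_{k^{\mathrm{new}}} = b\beta_u$ and $\beta_u^{\mathrm{new}} = (1 - b)\beta_u$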

Page 38: Hierarchical Dirichlet Processes

38

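– Sampling for $\beta$

$$(\beta_1, \ldots, \beta_K, \beta_u) \mid \mathbf{t}, \mathbf{k} \sim \mathrm{Dir}(m_{\cdot 1}, \ldots, m_{\cdot K}, \gamma)$$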

Page 39: Hierarchical Dirichlet Processes

39

Scheme III: Posterior sampling by direct assignment

• Difference from Schemes I and II:
– In I and II, data items are first assigned to some table $t$, and the tables are then assigned to some component $k$
– In III, data items are assigned directly to a component via the variable $z_{ji}$, which is equivalent to $k_{j t_{ji}}$
• Tables are collapsed to the counts $m_{jk}$

Page 40: Hierarchical Dirichlet Processes

40

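• Sampling $z$:

$$p(z_{ji} = k \mid \mathbf{z}^{-ji}, \mathbf{m}, \beta) \propto \begin{cases} (n_{j\cdot k}^{-ji} + \alpha_0 \beta_k)\, f_k^{-x_{ji}}(x_{ji}) & \text{if } k \text{ is previously used} \\ \alpha_0 \beta_u\, f_{k^{\mathrm{new}}}^{-x_{ji}}(x_{ji}) & \text{if } k = k^{\mathrm{new}} \end{cases}$$

• Sampling $m$:

$$p(m_{jk} = m \mid \mathbf{z}, \mathbf{m}^{-jk}, \beta) = \frac{\Gamma(\alpha_0 \beta_k)}{\Gamma(\alpha_0 \beta_k + n_{j\cdot k})}\, s(n_{j\cdot k}, m)\, (\alpha_0 \beta_k)^m$$

where $s(n, m)$ are unsigned Stirling numbers of the first kind

• Sampling $\beta$:

$$(\beta_1, \ldots, \beta_K, \beta_u) \mid \mathbf{m} \sim \mathrm{Dir}(m_{\cdot 1}, \ldots, m_{\cdot K}, \gamma)$$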
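Two of these steps are easy to sketch. The conditional for $m_{jk}$ is the distribution of the number of occupied tables in a Chinese restaurant process with $n_{j\cdot k}$ customers and concentration $\alpha_0 \beta_k$, so instead of evaluating Stirling numbers one can simulate the seating: customer $i$ (counting from 0) opens a new table with probability $\alpha_0\beta_k / (\alpha_0\beta_k + i)$. The weights $(\beta_1, \ldots, \beta_K, \beta_u)$ are then a Dirichlet draw with parameters $(m_{\cdot 1}, \ldots, m_{\cdot K}, \gamma)$. The function names and example numbers below are assumptions, and using simulation in place of the Stirling-number formula is a substitution made here for brevity.

```python
import random


def sample_num_tables(n_jk, concentration, rng):
    """Draw m_jk by simulating a CRP with n_jk customers and the given
    concentration (alpha0 * beta_k); returns the number of occupied tables."""
    m = 0
    for i in range(n_jk):
        # The first customer (i = 0) always opens a table.
        if rng.random() < concentration / (concentration + i):
            m += 1
    return m


def sample_beta(table_totals, gamma, rng):
    """Draw (beta_1, ..., beta_K, beta_u) ~ Dir(m_.1, ..., m_.K, gamma) by
    normalizing independent gamma variates. Every entry of `table_totals`
    must be positive; components with no tables should be removed first."""
    draws = [rng.gammavariate(m_k, 1.0) for m_k in table_totals]
    draws.append(rng.gammavariate(gamma, 1.0))
    total = sum(draws)
    return [d / total for d in draws]


# Assumed example: one group, three components, alpha0 = 1, gamma = 1.
rng = random.Random(0)
alpha0, gamma = 1.0, 1.0
betas = [0.5, 0.3, 0.1]  # current beta_1..beta_3 (beta_u = 0.1)
n_jk = [40, 5, 12]       # items of the group assigned to each component
m_jk = [sample_num_tables(n, alpha0 * b, rng) for n, b in zip(n_jk, betas)]
print(m_jk)
print(sample_beta(m_jk, gamma, rng))
```

With several groups, the `m_jk` values are summed over groups before being passed to `sample_beta`.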

Page 41: Hierarchical Dirichlet Processes

41

Comparison of Sampling Schemes

• In terms of ease of implementation
– Direct assignment is better

• In terms of convergence speed
– Direct assignment changes the component membership of data items one at a time
– In Schemes I and II, changing the component membership of one table changes the membership of multiple data items at the same time, which can lead to better performance

Page 42: Hierarchical Dirichlet Processes

42

Applications

• Hierarchical DP extension of LDA
– In the CRF representation: dishes are topics, customers are the observed words

Page 43: Hierarchical Dirichlet Processes

43

Applications

• HDP-HMM

Page 44: Hierarchical Dirichlet Processes

44

References

• Yee Whye Teh et al., Hierarchical Dirichlet Processes, 2006