Information Bottleneck EM
Gal Elidan and Nir Friedman
School of Engineering & Computer Science, The Hebrew University, Jerusalem, Israel


Page 1:

Information Bottleneck EM

Gal Elidan and Nir Friedman
School of Engineering & Computer Science, The Hebrew University, Jerusalem, Israel

Page 2:

Learning with Hidden Variables

Input: DATA (instances of X1 … XN; the values of the hidden variable T are unobserved, "??????")
Output: A model P(X,T)

Problem: no closed-form solution for ML estimation
Use Expectation Maximization (EM)

Problem: EM gets stuck in inferior local maxima
Existing remedies: random restarts, deterministic or simulated annealing

[Figure: likelihood as a function of the parameters, showing multiple local maxima]

This talk: EM + information regularization for learning parameters.

[Figure: naive Bayes model, hidden T with observed children X1, X2, X3]

Page 3:

Learning Parameters

Input: DATA (instances of X1 … XN)
Output: A model P(X)

Empirical distribution Q(X), counting over the M instances:
  Q(x1, x2, x3) = #(x1, x2, x3) / M

Parametrization of P for the structure X1 → X2, X1 → X3:
  P(X1) = Q(X1)
  P(X2|X1) = Q(X2|X1)
  P(X3|X1) = Q(X3|X1)
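With fully observed data these equations are just counting. A minimal sketch (not the authors' code), assuming discrete instances stored as tuples:

```python
from collections import Counter

def empirical_joint(data):
    """Q(x1, x2, x3) = #(x1, x2, x3) / M for a list of (x1, x2, x3) tuples."""
    M = len(data)
    return {x: count / M for x, count in Counter(data).items()}

def empirical_conditional(data, child, parent):
    """Empirical Q(X_child | X_parent), e.g. P(X2|X1) = Q(X2|X1)."""
    pair_counts = Counter((row[parent], row[child]) for row in data)
    parent_counts = Counter(row[parent] for row in data)
    return {(p, c): n / parent_counts[p] for (p, c), n in pair_counts.items()}
```

For the structure above, empirical_conditional(data, child=1, parent=0) yields the table for P(X2|X1).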

Page 4:

Learning with Hidden Variables

Input: DATA (instances 1, 2, 3, 4, …, M of X1 … XN, indexed by the instance ID Y); the values of T are unobserved ("??????")

Empirical distribution Q(X,T) = ?

Desired structure: hidden T with children X1, X2, X3 (the parametrization for P)

EM iterations: given a guess of P, complete the value of T for each instance ID:

Empirical distribution Q(X,T,Y) = Q(X,Y)·Q(T|Y)

Page 5:

The EM Algorithm:

E-Step: generate the empirical distribution using Q(T|Y) = P(T|x[Y])
M-Step: maximize using argmax_P E_Q[log P(T,X)]

EM is equivalent to optimizing a functional of both Q and P [Neal and Hinton, 1998]:

EM Functional: F[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y)

Each step increases the value of the functional.
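To make the functional concrete, here is a minimal sketch under assumed shapes: Q is an M×K matrix of soft assignments Q(t|y) (one row per instance y, K values of T), and log_joint[y, t] holds log P(x[y], t) under the current model.

```python
import numpy as np

def em_functional(Q, log_joint):
    """F[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y), with Y uniform over the M instances."""
    M = Q.shape[0]
    Q_safe = np.clip(Q, 1e-12, 1.0)              # guard against log(0)
    expected_ll = np.sum(Q * log_joint) / M      # E_Q[log P(X,T)]
    entropy = -np.sum(Q * np.log(Q_safe)) / M    # H_Q(T|Y)
    return expected_ll + entropy
```

At the EM fixed point Q(t|y) = P(t|x[y]) the functional equals the per-instance log-likelihood; in general it is a lower bound, which is why increasing it is safe.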

Page 6:

Information Bottleneck EM

Target: L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(F[Q,P] is the EM target; I_Q(T;Y) is the information between the hidden variable and the instance ID)

In the rest of the talk…
Understanding this objective
How to use it to learn better models

Page 7:

Information Regularization

Motivating idea:
Fit the training data: set T to be the instance ID to "predict" X
Generalize: "forget" the ID and keep the essence of X

Objective: L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)

The first term is a (lower bound of the) likelihood of P; the second trades it off against compression of the instance ID. This is a parameter-free regularization of Q [Tishby et al., 1999].
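Under the same assumed representation, and reusing em_functional from the earlier sketch, the objective and its information term can be written as:

```python
import numpy as np

def information_t_y(Q):
    """I_Q(T;Y) with Y uniform over the M instances; Q holds rows Q(t|y)."""
    M = Q.shape[0]
    Qt = Q.mean(axis=0)                        # Q(t) = (1/M) · sum_y Q(t|y)
    ratio = np.clip(Q, 1e-12, 1.0) / np.clip(Qt, 1e-12, 1.0)
    return float(np.sum(Q * np.log(ratio)) / M)

def ib_em_target(Q, log_joint, gamma):
    """L_IB-EM = gamma·F[Q,P] − (1−gamma)·I_Q(T;Y)."""
    return gamma * em_functional(Q, log_joint) - (1 - gamma) * information_t_y(Q)
```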

Page 8:

Clustering example: γ = 0 (total compression)

L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(EM target vs. compression measure)

At γ = 0 only the compression measure remains, so all instances (1–11) collapse into a single cluster.

[Figure: instances 1–11 merged into one cluster]

Page 9:

Clustering example: γ = 1 (total preservation)

L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(EM target vs. compression measure)

At γ = 1 only the EM target remains, so every instance (1–11) keeps its own cluster: T ≡ ID.

[Figure: instances 1–11 each preserved as their own cluster]

Page 10:

Clustering example: γ = ? (desired)

L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(EM target vs. compression measure)

For some intermediate γ we obtain the desired clustering with |T| = 2.

[Figure: instances 1–11 grouped into the two desired clusters]

Page 11:

Information Bottleneck EM

L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(F[Q,P] is the EM functional)

Formal equivalence with the Information Bottleneck: at γ = 1, EM and the Information Bottleneck coincide [generalizing the result of Slonim and Weiss for the univariate case].

Page 12:

Information Bottleneck EM

Formal equivalence with the Information Bottleneck: the maximum of Q(T|Y) is obtained when

Q(t|y) = (1/Z(y)) · Q(t)^(1−γ) · P(t|x[y])^γ

where P(t|x[y]) is the prediction of T using P, Q(t) is the marginal of T in Q, and Z(y) is a normalization constant.

Page 13:

The IB-EM Algorithm for fixed γ: iterate until convergence

E-Step: maximize L_IB-EM by optimizing Q:
  Q(t|y) = (1/Z(y)) · Q(t)^(1−γ) · P(t|x[y])^γ

M-Step: maximize L_IB-EM by optimizing P (same as the standard M-step):
  argmax_P E_Q[log P(T,X)]

Each step improves L_IB-EM, so the algorithm is guaranteed to converge.
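A minimal sketch of this E-step under the assumptions used earlier: P_post holds the rows P(t|x[y]) computed from the current model, and the self-consistent update is iterated to a fixed point.

```python
import numpy as np

def ib_e_step(P_post, gamma, n_iters=100):
    """Iterate Q(t|y) = (1/Z(y)) · Q(t)^(1−gamma) · P(t|x[y])^gamma."""
    M, K = P_post.shape
    Q = np.full((M, K), 1.0 / K)                 # start from the uniform Q
    for _ in range(n_iters):
        Qt = Q.mean(axis=0)                      # marginal Q(t)
        Q = (Qt ** (1.0 - gamma)) * (P_post ** gamma)
        Q = np.clip(Q, 1e-12, None)
        Q /= Q.sum(axis=1, keepdims=True)        # the 1/Z(y) normalization
    return Q
```

At γ = 1 this reduces to Q = P_post, the standard EM E-step; at γ = 0 every row collapses to the marginal Q(t), i.e. total compression.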

Page 14:

Information Bottleneck EM

Target: L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(F[Q,P] is the EM target; I_Q(T;Y) is the information between the hidden variable and the instance ID)

In the rest of the talk…
Understanding this objective
How to use it to learn better models

Page 15:

Continuation

Learning is easy at γ = 0 and hard at γ = 1; follow the ridge of L_IB-EM from the optimum at γ = 0.

[Figure: surface of L_IB-EM over Q and γ ∈ [0, 1], with a ridge leading from the easy optimum at γ = 0 to the hard solution at γ = 1]

Page 16:

Continuation

Recall: if Q is a local maximum of L_IB-EM then, for all t and y,

G_t,y(Q, γ) = Q(t|y) − (1/Z(y)) · Q(t)^(1−γ) · P(t|x[y])^γ = 0

We want to follow a path in (Q, γ) space along which G_t,y(Q, γ) = 0 is maintained, so that Q remains a local maximum for all γ.

Page 17:

Continuation Step

1. Start at (Q, γ) such that G_t,y(Q, γ) = 0
2. Compute the gradient ∇G = (∂G/∂Q, ∂G/∂γ)
3. Take the direction that preserves G_t,y(Q, γ) = 0 (tangent to the ridge)
4. Take a step in the desired direction

[Figure: a step along the ridge in (Q, γ) space, starting from the optimum at γ = 0]
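In the talk G is a vector of constraints, one per (t, y) pair; as a toy illustration of steps 2–4, here is the scalar case, where the direction that preserves G(q, γ) = 0 is the gradient rotated by 90 degrees:

```python
import numpy as np

def continuation_step(q, gamma, dG_dq, dG_dgamma, step_size=0.01):
    """One predictor step along the ridge G(q, gamma) = 0 (scalar toy case).

    The tangent of the level set is orthogonal to the gradient
    (dG/dq, dG/dgamma), i.e. proportional to (-dG/dgamma, dG/dq).
    """
    d = np.array([-dG_dgamma, dG_dq], dtype=float)
    d /= np.linalg.norm(d)
    if d[1] < 0.0:                  # orient the step so gamma increases
        d = -d
    return q + step_size * d[0], gamma + step_size * d[1]
```

In the multivariate case the same idea applies with the Jacobian of G: the step direction lies in its null space.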

Page 18:

Staying on the ridge

Potential problem: the direction is only tangent to the path, so a step can miss the optimum.
Solution: use EM steps to regain the path.

[Figure: a tangent step drifting off the ridge in (Q, γ) space, with EM steps pulling it back]

Page 19:

The IB-EM Algorithm

Set γ = 0 (start at the easy solution)
Iterate until γ = 1 (the EM solution is reached):
  Iterate (stay on the ridge):
    E-Step: maximize L_IB-EM by optimizing Q
    M-Step: maximize L_IB-EM by optimizing P
  γ-Step (follow the ridge):
    Compute the gradient and the direction
    Take the step by changing γ and Q
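A high-level sketch of this loop, with a fixed γ-schedule standing in for the gradient-based γ-step; model.posteriors and m_step are hypothetical stand-ins for model-specific inference and parameter estimation:

```python
def ib_em(model, data, gamma_step=0.02):
    """Sweep gamma from 0 to 1, staying near the ridge with E/M steps."""
    gamma = 0.0                                        # start at the easy solution
    while True:
        Q = ib_e_step(model.posteriors(data), gamma)   # E-step: optimize Q
        model = m_step(Q, data)                        # M-step: optimize P
        if gamma >= 1.0:                               # the EM solution is reached
            return model, Q
        gamma = min(1.0, gamma + gamma_step)           # follow the ridge
```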

Page 20:

Calibrating the step size

Potential problem:
Step size too small: too slow
Step size too large: overshoot the target, ending in an inferior solution

[Figure: a too-large step overshooting the ridge in (Q, γ) space]

Page 21:

Calibrating the step size

Use the change in I(T;Y) to calibrate the step size:
Non-parametric: involves only Q
Can be bounded: I(T;Y) ≤ log₂|T|

Recall that I(T;Y) measures the compression of the instance ID; when I(T;Y) rises, more of the data is captured.

[Figure: I(T;Y) as a function of γ. A naive uniform schedule of γ values is too sparse in the "interesting" region where I(T;Y) rises.]
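One plausible implementation of this calibration (a sketch, not necessarily the talk's exact policy) targets a fixed change in I(T;Y) per step, reusing information_t_y from the earlier sketch:

```python
def calibrate_step(Q_prev, Q_new, step, budget=0.05):
    """Adapt the gamma step so each step changes I_Q(T;Y) by ~`budget` nats."""
    delta = abs(information_t_y(Q_new) - information_t_y(Q_prev))
    if delta > 2.0 * budget:        # moved too far: risk of overshooting
        return step * 0.5
    if delta < 0.5 * budget:        # barely moved: region is uninteresting
        return step * 2.0
    return step
```

Because I(T;Y) is bounded and involves only Q, a fixed information budget per step automatically concentrates the evaluated γ values in the region where I(T;Y) rises.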

Page 22:

The Stock Dataset [Boyen et al., 1999]

Naive Bayes model; daily changes of 20 NASDAQ stocks; 1213 train / 303 test instances.

IB-EM outperforms the best of the EM solutions.
I(T;Y) follows the changes of the likelihood.
Continuation approximately follows the region of change (marks show where γ was evaluated).

[Figure: I(T;Y) and train likelihood per instance as functions of γ; IB-EM vs. best of EM]

Page 23:

Multiple Hidden Variables

We want to learn a model with many hidden variables (T1 … TK).

Naive: representing Q(T1 … TK | Y) is potentially exponential in the number of hiddens.

Variational approximation: use a factorized form (mean field) [Friedman et al., 2002]:

Q(T|Y) = Q(T1 … TK | Y) = ∏ᵢ Q(Tᵢ | Y)

L_IB-EM = γ·(variational EM functional) − (1−γ)·(regularization)
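A minimal sketch of the factorized representation (the cardinalities and the random initialization are assumptions): one small table per hidden variable instead of one exponentially large joint table.

```python
import numpy as np

def mean_field_init(M, cardinalities, seed=0):
    """One (M, |T_i|) table per hidden variable: Q(T1…TK|Y) = prod_i Q(Ti|Y)."""
    rng = np.random.default_rng(seed)
    factors = []
    for K in cardinalities:
        Q = rng.random((M, K))
        Q /= Q.sum(axis=1, keepdims=True)    # each row is a distribution Q(Ti|y)
        factors.append(Q)
    return factors
```

Storage is Σᵢ M·|Tᵢ| instead of M·∏ᵢ|Tᵢ|, which is what makes many hidden variables tractable.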

Page 24:

The USPS Digits dataset

400 samples, 21 hidden variables.

A single IB-EM run (27 min) is superior to all Mean Field EM runs (1 min/run), at a time comparable to a single exact EM run (25 min/run).
Only 3/50 EM runs come close to IB-EM; EM needs ×17 the time for similar results.

Offers good value for your time!

[Figure: test log-loss per instance vs. percentage of random runs]

Page 25:

Yeast Stress Response

173 experiments (variables), 6152 genes (samples), 25 hidden variables.

IB-EM (~6 hours) is superior to all Mean Field EM runs (~0.5 hours per run), and an order of magnitude faster than exact EM (>60 hours; 5–24 experiments).

Effective when the exact solution becomes intractable!

[Figure: test log-loss per instance vs. percentage of random runs]

Page 26:

Summary

New framework for learning hidden variables
Formal relation of the Bottleneck and EM
Continuation for bypassing local maxima
Flexible: structure / variational approximation

Future Work

• Learn the optimal γ ≤ 1 for better generalization
• Explore other approximations of Q(T|Y)
• Model selection: learning cardinality and enriching the structure

Page 27:

Relation to Weight Annealing [Elidan et al., 2002]

Weight annealing perturbs the weights W of the instances 1 … M of DATA (X1 … XN):
Init: temp = hot
Iterate until temp = cold:
  Perturb w according to temp
  Use Q_W and optimize
  Cool down

Similarities:
  Change in the empirical Q
  Morph toward the EM solution

Differences:
  IB-EM uses information regularization
  IB-EM uses continuation
  WA requires a cooling policy
  WA is applicable to a wider range of problems

Page 28:

Relation to Deterministic Annealing

Init: temp = hot
Iterate until temp = cold:
  "Insert" entropy according to temp into the model
  Optimize the noisy model
  Cool down

Similarities:
  Use of an information measure
  Morph toward the EM solution

Differences:
  DA is parameterization dependent
  IB-EM uses continuation
  DA requires a cooling policy
  DA is applicable to a wider range of problems