Information Bottleneck EM
Gal Elidan and Nir Friedman
School of Engineering & Computer Science, The Hebrew University, Jerusalem, Israel


Page 1:

Information Bottleneck EM

Gal Elidan and Nir Friedman
School of Engineering & Computer Science, The Hebrew University, Jerusalem, Israel

Page 2:

Learning with Hidden Variables

Input: DATA (instances of X1 … XN; the values of the hidden variable T are unobserved, "??????")
Output: A model P(X,T)

Problem: no closed-form solution for ML estimation
Use Expectation Maximization (EM)

Problem: EM gets stuck in inferior local maxima
Existing remedies: random restarts, deterministic or simulated annealing

[Figure: likelihood as a function of the parameters, showing multiple local maxima]

This talk: EM + information regularization for learning parameters.

[Figure: naive Bayes model, hidden T with observed children X1, X2, X3]

Page 3:

Learning Parameters

Input: DATA (instances of X1 … XN)
Output: A model P(X)

Empirical distribution Q(X), counting over the M instances:
  Q(x1, x2, x3) = #(x1, x2, x3) / M

Parametrization of P for the structure X1 → X2, X1 → X3:
  P(X1) = Q(X1)
  P(X2|X1) = Q(X2|X1)
  P(X3|X1) = Q(X3|X1)
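With fully observed data these equations are just counting. A minimal sketch (not the authors' code), assuming discrete instances stored as tuples:

```python
from collections import Counter

def empirical_joint(data):
    """Q(x1, x2, x3) = #(x1, x2, x3) / M for a list of (x1, x2, x3) tuples."""
    M = len(data)
    return {x: count / M for x, count in Counter(data).items()}

def empirical_conditional(data, child, parent):
    """Empirical Q(X_child | X_parent), e.g. P(X2|X1) = Q(X2|X1)."""
    pair_counts = Counter((row[parent], row[child]) for row in data)
    parent_counts = Counter(row[parent] for row in data)
    return {(p, c): n / parent_counts[p] for (p, c), n in pair_counts.items()}
```

For the structure above, empirical_conditional(data, child=1, parent=0) yields the table for P(X2|X1).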

Page 4:

Learning with Hidden Variables

Input: DATA (instances 1, 2, 3, 4, …, M of X1 … XN, indexed by the instance ID Y); the values of T are unobserved ("??????")

Empirical distribution Q(X,T) = ?

Desired structure: hidden T with children X1, X2, X3 (the parametrization for P)

EM iterations: given a guess of P, complete the value of T for each instance ID:

Empirical distribution Q(X,T,Y) = Q(X,Y)·Q(T|Y)

Page 5:

The EM Algorithm:

E-Step: generate the empirical distribution using Q(T|Y) = P(T|x[Y])
M-Step: maximize using argmax_P E_Q[log P(T,X)]

EM is equivalent to optimizing a functional of both Q and P [Neal and Hinton, 1998]:

EM Functional: F[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y)

Each step increases the value of the functional.
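To make the functional concrete, here is a minimal sketch under assumed shapes: Q is an M×K matrix of soft assignments Q(t|y) (one row per instance y, K values of T), and log_joint[y, t] holds log P(x[y], t) under the current model.

```python
import numpy as np

def em_functional(Q, log_joint):
    """F[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y), with Y uniform over the M instances."""
    M = Q.shape[0]
    Q_safe = np.clip(Q, 1e-12, 1.0)              # guard against log(0)
    expected_ll = np.sum(Q * log_joint) / M      # E_Q[log P(X,T)]
    entropy = -np.sum(Q * np.log(Q_safe)) / M    # H_Q(T|Y)
    return expected_ll + entropy
```

At the EM fixed point Q(t|y) = P(t|x[y]) the functional equals the per-instance log-likelihood; in general it is a lower bound, which is why increasing it is safe.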

Page 6:

Information Bottleneck EM

Target: L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(F[Q,P] is the EM target; I_Q(T;Y) is the information between the hidden variable and the instance ID)

In the rest of the talk…
Understanding this objective
How to use it to learn better models

Page 7:

Information Regularization

Motivating idea:
Fit the training data: set T to be the instance ID to "predict" X
Generalize: "forget" the ID and keep the essence of X

Objective: L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)

The first term is a (lower bound of the) likelihood of P; the second trades it off against compression of the instance ID. This is a parameter-free regularization of Q [Tishby et al., 1999].
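Under the same assumed representation, and reusing em_functional from the earlier sketch, the objective and its information term can be written as:

```python
import numpy as np

def information_t_y(Q):
    """I_Q(T;Y) with Y uniform over the M instances; Q holds rows Q(t|y)."""
    M = Q.shape[0]
    Qt = Q.mean(axis=0)                        # Q(t) = (1/M) · sum_y Q(t|y)
    ratio = np.clip(Q, 1e-12, 1.0) / np.clip(Qt, 1e-12, 1.0)
    return float(np.sum(Q * np.log(ratio)) / M)

def ib_em_target(Q, log_joint, gamma):
    """L_IB-EM = gamma·F[Q,P] − (1−gamma)·I_Q(T;Y)."""
    return gamma * em_functional(Q, log_joint) - (1 - gamma) * information_t_y(Q)
```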

Page 8:

Clustering example: γ = 0 (total compression)

L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(EM target vs. compression measure)

At γ = 0 only the compression measure remains, so all instances (1–11) collapse into a single cluster.

[Figure: instances 1–11 merged into one cluster]

Page 9:

Clustering example: γ = 1 (total preservation)

L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(EM target vs. compression measure)

At γ = 1 only the EM target remains, so every instance (1–11) keeps its own cluster: T ≡ ID.

[Figure: instances 1–11 each preserved as their own cluster]

Page 10:

Clustering example: γ = ? (desired)

L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(EM target vs. compression measure)

For some intermediate γ we obtain the desired clustering with |T| = 2.

[Figure: instances 1–11 grouped into the two desired clusters]

Page 11:

Information Bottleneck EM

L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(F[Q,P] is the EM functional)

Formal equivalence with the Information Bottleneck: at γ = 1, EM and the Information Bottleneck coincide [generalizing the result of Slonim and Weiss for the univariate case].

Page 12:

Information Bottleneck EM

Formal equivalence with the Information Bottleneck: the maximum of Q(T|Y) is obtained when

Q(t|y) = (1/Z(y)) · Q(t)^(1−γ) · P(t|x[y])^γ

where P(t|x[y]) is the prediction of T using P, Q(t) is the marginal of T in Q, and Z(y) is a normalization constant.

Page 13:

The IB-EM Algorithm for fixed γ: iterate until convergence

E-Step: maximize L_IB-EM by optimizing Q:
  Q(t|y) = (1/Z(y)) · Q(t)^(1−γ) · P(t|x[y])^γ

M-Step: maximize L_IB-EM by optimizing P (same as the standard M-step):
  argmax_P E_Q[log P(T,X)]

Each step improves L_IB-EM, so the algorithm is guaranteed to converge.
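A minimal sketch of this E-step under the assumptions used earlier: P_post holds the rows P(t|x[y]) computed from the current model, and the self-consistent update is iterated to a fixed point.

```python
import numpy as np

def ib_e_step(P_post, gamma, n_iters=100):
    """Iterate Q(t|y) = (1/Z(y)) · Q(t)^(1−gamma) · P(t|x[y])^gamma."""
    M, K = P_post.shape
    Q = np.full((M, K), 1.0 / K)                 # start from the uniform Q
    for _ in range(n_iters):
        Qt = Q.mean(axis=0)                      # marginal Q(t)
        Q = (Qt ** (1.0 - gamma)) * (P_post ** gamma)
        Q = np.clip(Q, 1e-12, None)
        Q /= Q.sum(axis=1, keepdims=True)        # the 1/Z(y) normalization
    return Q
```

At γ = 1 this reduces to Q = P_post, the standard EM E-step; at γ = 0 every row collapses to the marginal Q(t), i.e. total compression.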

Page 14:

Information Bottleneck EM

Target: L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
(F[Q,P] is the EM target; I_Q(T;Y) is the information between the hidden variable and the instance ID)

In the rest of the talk…
Understanding this objective
How to use it to learn better models

Page 15:

Continuation

Learning is easy at γ = 0 and hard at γ = 1; follow the ridge of L_IB-EM from the optimum at γ = 0.

[Figure: surface of L_IB-EM over Q and γ ∈ [0, 1], with a ridge leading from the easy optimum at γ = 0 to the hard solution at γ = 1]

Page 16:

Continuation

Recall: if Q is a local maximum of L_IB-EM then, for all t and y,

G_t,y(Q, γ) = Q(t|y) − (1/Z(y)) · Q(t)^(1−γ) · P(t|x[y])^γ = 0

We want to follow a path in (Q, γ) space along which G_t,y(Q, γ) = 0 is maintained, so that Q remains a local maximum for all γ.

Page 17:

Continuation Step

1. Start at (Q, γ) such that G_t,y(Q, γ) = 0
2. Compute the gradient ∇G = (∂G/∂Q, ∂G/∂γ)
3. Take the direction that preserves G_t,y(Q, γ) = 0 (tangent to the ridge)
4. Take a step in the desired direction

[Figure: a step along the ridge in (Q, γ) space, starting from the optimum at γ = 0]
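In the talk G is a vector of constraints, one per (t, y) pair; as a toy illustration of steps 2–4, here is the scalar case, where the direction that preserves G(q, γ) = 0 is the gradient rotated by 90 degrees:

```python
import numpy as np

def continuation_step(q, gamma, dG_dq, dG_dgamma, step_size=0.01):
    """One predictor step along the ridge G(q, gamma) = 0 (scalar toy case).

    The tangent of the level set is orthogonal to the gradient
    (dG/dq, dG/dgamma), i.e. proportional to (-dG/dgamma, dG/dq).
    """
    d = np.array([-dG_dgamma, dG_dq], dtype=float)
    d /= np.linalg.norm(d)
    if d[1] < 0.0:                  # orient the step so gamma increases
        d = -d
    return q + step_size * d[0], gamma + step_size * d[1]
```

In the multivariate case the same idea applies with the Jacobian of G: the step direction lies in its null space.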

Page 18:

Staying on the ridge

Potential problem: the direction is only tangent to the path, so a step can miss the optimum.
Solution: use EM steps to regain the path.

[Figure: a tangent step drifting off the ridge in (Q, γ) space, with EM steps pulling it back]

Page 19:

The IB-EM Algorithm

Set γ = 0 (start at the easy solution)
Iterate until γ = 1 (the EM solution is reached):
  Iterate (stay on the ridge):
    E-Step: maximize L_IB-EM by optimizing Q
    M-Step: maximize L_IB-EM by optimizing P
  γ-Step (follow the ridge):
    Compute the gradient and the direction
    Take the step by changing γ and Q
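A high-level sketch of this loop, with a fixed γ-schedule standing in for the gradient-based γ-step; model.posteriors and m_step are hypothetical stand-ins for model-specific inference and parameter estimation:

```python
def ib_em(model, data, gamma_step=0.02):
    """Sweep gamma from 0 to 1, staying near the ridge with E/M steps."""
    gamma = 0.0                                        # start at the easy solution
    while True:
        Q = ib_e_step(model.posteriors(data), gamma)   # E-step: optimize Q
        model = m_step(Q, data)                        # M-step: optimize P
        if gamma >= 1.0:                               # the EM solution is reached
            return model, Q
        gamma = min(1.0, gamma + gamma_step)           # follow the ridge
```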

Page 20:

Calibrating the step size

Potential problem:
Step size too small: too slow
Step size too large: overshoot the target, ending in an inferior solution

[Figure: a too-large step overshooting the ridge in (Q, γ) space]

Page 21:

Calibrating the step size

Use the change in I(T;Y) to calibrate the step size:
Non-parametric: involves only Q
Can be bounded: I(T;Y) ≤ log₂|T|

Recall that I(T;Y) measures the compression of the instance ID; when I(T;Y) rises, more of the data is captured.

[Figure: I(T;Y) as a function of γ. A naive uniform schedule of γ values is too sparse in the "interesting" region where I(T;Y) rises.]
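One plausible implementation of this calibration (a sketch, not necessarily the talk's exact policy) targets a fixed change in I(T;Y) per step, reusing information_t_y from the earlier sketch:

```python
def calibrate_step(Q_prev, Q_new, step, budget=0.05):
    """Adapt the gamma step so each step changes I_Q(T;Y) by ~`budget` nats."""
    delta = abs(information_t_y(Q_new) - information_t_y(Q_prev))
    if delta > 2.0 * budget:        # moved too far: risk of overshooting
        return step * 0.5
    if delta < 0.5 * budget:        # barely moved: region is uninteresting
        return step * 2.0
    return step
```

Because I(T;Y) is bounded and involves only Q, a fixed information budget per step automatically concentrates the evaluated γ values in the region where I(T;Y) rises.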

Page 22:

The Stock Dataset [Boyen et al., 1999]

Naive Bayes model; daily changes of 20 NASDAQ stocks; 1213 train / 303 test instances.

IB-EM outperforms the best of the EM solutions.
I(T;Y) follows the changes of the likelihood.
Continuation approximately follows the region of change (marks show where γ was evaluated).

[Figure: I(T;Y) and train likelihood per instance as functions of γ; IB-EM vs. best of EM]

Page 23:

Multiple Hidden Variables

We want to learn a model with many hidden variables (T1 … TK).

Naive: representing Q(T1 … TK | Y) is potentially exponential in the number of hiddens.

Variational approximation: use a factorized form (mean field) [Friedman et al., 2002]:

Q(T|Y) = Q(T1 … TK | Y) = ∏ᵢ Q(Tᵢ | Y)

L_IB-EM = γ·(variational EM functional) − (1−γ)·(regularization)
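A minimal sketch of the factorized representation (the cardinalities and the random initialization are assumptions): one small table per hidden variable instead of one exponentially large joint table.

```python
import numpy as np

def mean_field_init(M, cardinalities, seed=0):
    """One (M, |T_i|) table per hidden variable: Q(T1…TK|Y) = prod_i Q(Ti|Y)."""
    rng = np.random.default_rng(seed)
    factors = []
    for K in cardinalities:
        Q = rng.random((M, K))
        Q /= Q.sum(axis=1, keepdims=True)    # each row is a distribution Q(Ti|y)
        factors.append(Q)
    return factors
```

Storage is Σᵢ M·|Tᵢ| instead of M·∏ᵢ|Tᵢ|, which is what makes many hidden variables tractable.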

Page 24:

The USPS Digits dataset

400 samples, 21 hidden variables.

A single IB-EM run (27 min) is superior to all Mean Field EM runs (1 min/run), at a time comparable to a single exact EM run (25 min/run).
Only 3/50 EM runs come close to IB-EM; EM needs ×17 the time for similar results.

Offers good value for your time!

[Figure: test log-loss per instance vs. percentage of random runs]

Page 25:

Yeast Stress Response

173 experiments (variables), 6152 genes (samples), 25 hidden variables.

IB-EM (~6 hours) is superior to all Mean Field EM runs (~0.5 hours per run), and an order of magnitude faster than exact EM (>60 hours; 5–24 experiments).

Effective when the exact solution becomes intractable!

[Figure: test log-loss per instance vs. percentage of random runs]

Page 26:

Summary

New framework for learning hidden variables
Formal relation of the Bottleneck and EM
Continuation for bypassing local maxima
Flexible: structure / variational approximation

Future Work

• Learn the optimal γ ≤ 1 for better generalization
• Explore other approximations of Q(T|Y)
• Model selection: learning cardinality and enriching the structure

Page 27:

Relation to Weight Annealing [Elidan et al., 2002]

Weight annealing perturbs the weights W of the instances 1 … M of DATA (X1 … XN):
Init: temp = hot
Iterate until temp = cold:
  Perturb w according to temp
  Use Q_W and optimize
  Cool down

Similarities:
  Change in the empirical Q
  Morph toward the EM solution

Differences:
  IB-EM uses information regularization
  IB-EM uses continuation
  WA requires a cooling policy
  WA is applicable to a wider range of problems

Page 28:

Relation to Deterministic Annealing

Init: temp = hot
Iterate until temp = cold:
  "Insert" entropy according to temp into the model
  Optimize the noisy model
  Cool down

Similarities:
  Use of an information measure
  Morph toward the EM solution

Differences:
  DA is parameterization dependent
  IB-EM uses continuation
  DA requires a cooling policy
  DA is applicable to a wider range of problems