
Page 1: Relaxations for inference in restricted Boltzmann machines (nlp.stanford.edu/~sidaw/home/_media/papers:rrr_iclr2014.pdf)

Relaxations for inference in restricted Boltzmann machines

Sida Wang*, Roy Frostig*, Percy Liang, Christopher D. Manning ({sidaw, rf}@cs.stanford.edu)

ICLR 2014

April 15, 2014

Wang & Frostig (Stanford) Randomized relax-and-round April 15, 2014 1 / 27

Page 2

1 Background: The problem; Integer quadratic program

2 Randomized Relax and Round: Relaxations; Randomized rounding

3 Evaluations

4 Conclusions

Page 3

Background

Outline

1 Background: The problem; Integer quadratic program

2 Randomized Relax and Round: Relaxations; Randomized rounding

3 Evaluations

4 Conclusions

Page 4

Background / The problem

ICLR context

Probabilistic models seem to be losing ground to feedforward networks.

Probabilistic models and max-margin models requiring inference are still necessary, or better, for tasks with structured outputs:

Word alignment

CRFs for image segmentation, sequence tagging

Parsing

or if you want to make use of unlabelled data.

Page 5

Background / The problem

Inference methods

Partly, this is because we cannot do inference as well as we can do gradient descent:

MCMC (Gibbs)

Variational inference (mean field)

Belief propagation (not used in RBMs)

Relaxation (not used in RBMs)

We propose a new relaxation-based inference method and solve it using gradient descent. It is applicable to RBMs, DBMs, MRFs, and CRFs.

Page 6

Background / The problem

Theoretical properties

Inference method         Terminate   Correct
MCMC                     No          Yes
practical MCMC           Yes         No
variational inference    Yes         No
belief prop.             No          No
relaxation               Yes         Approx.

Terminate: guaranteed to converge to an answer (say, in polynomial time)
Correct: guaranteed to return the right answer
Approx.: within some multiplicative/additive constant of the right answer

No method can have two "Yes" entries, because Long and Servedio (2010) showed it is NP-hard to do inference even in RBMs, even approximately.

Page 9

Background / The problem

How to make practical use of a hardness result

To show problem A is hard:

Find a similar problem B known to be hard

Make enough assumptions on A so that B reduces to A

To make use of the hardness proof:

Look at the assumptions needed to make A harder than B

Decide that B is harder than practical instances of A

Try to solve A using the best approximation algorithm for B

B is CUTNORM (Alon and Naor, 2006), which is a special case of MAXCUT.

Page 12

Background / The problem

What is this work?

Do RBM inference with the (almost provably) optimal approximation algorithm for MAXCUT

SDP relaxation and randomized rounding from vectors

solve a practical low-rank version of the SDP

use random roundings to get a variety of near-MAP solutions

Page 13

Background / The problem

Flash RBM review

The restricted Boltzmann machine defines a probability distribution over bit vectors v ∈ {0, 1}^n.

RBM:

p_W(v) = (1/Z) Σ_h exp(v⊤Wh + a⊤v + b⊤h)

Energy: E(v, h) = −v⊤Wh − a⊤v − b⊤h

Partition function: Z = Σ_{v,h} exp(v⊤Wh + a⊤v + b⊤h)

A bipartite Markov random field with latent variables.

Better guarantees here, but we present a more general solution.
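To make the definitions concrete, here is a minimal sketch that evaluates E(v, h), the unnormalized probability, and Z by brute force on a tiny RBM. The parameters W, a, b are made-up toy values, not anything from the talk; exhaustive enumeration is only feasible at this scale.

```python
import itertools
import math

# Toy RBM parameters (hypothetical values): 3 visible units, 2 hidden units.
W = [[0.5, -1.0],
     [1.2,  0.3],
     [-0.7, 0.8]]
a = [0.1, -0.2, 0.0]   # visible biases
b = [0.3, -0.1]        # hidden biases

def energy(v, h):
    """E(v, h) = -v^T W h - a^T v - b^T h."""
    vWh = sum(v[i] * W[i][j] * h[j]
              for i in range(len(v)) for j in range(len(h)))
    av = sum(ai * vi for ai, vi in zip(a, v))
    bh = sum(bj * hj for bj, hj in zip(b, h))
    return -(vWh + av + bh)

def unnormalized_p(v):
    """sum_h exp(-E(v, h)), i.e. Z * p_W(v), summing out the hidden units."""
    return sum(math.exp(-energy(v, h))
               for h in itertools.product([0, 1], repeat=len(b)))

# Brute-force partition function: enumerate all visible configurations too.
Z = sum(unnormalized_p(v) for v in itertools.product([0, 1], repeat=len(a)))
p = unnormalized_p((1, 0, 1)) / Z   # p_W(v) for one configuration
```

Both sums are exponential in the number of units, which is exactly why the inference methods above exist.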

Page 14

Background / Integer quadratic program

MAP as an IQP

MAP inference in RBM is an instance of the integer quadratic program

maximize  x⊤Ax
subject to  x ∈ {−1, 1}^n

where x⊤Ax = Σ_{i,j} A_ij x_i x_j.

For the RBM:

A = (1/2) [ 0    W
            W⊤   0 ],    x = [v, h]⊤

The case x ∈ {0, 1}^n, and any biases, can also be reduced to a different A.
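A quick sketch of the reduction in the bias-free case, with made-up toy numbers: build A = (1/2)[[0, W], [W⊤, 0]] and check that x⊤Ax recovers v⊤Wh for x = [v, h].

```python
# Toy weights (hypothetical): 2 visible x 2 hidden units, no biases.
W = [[1.0, -2.0],
     [0.5,  3.0]]
n_v, n_h = len(W), len(W[0])
n = n_v + n_h

# A = (1/2) [[0, W], [W^T, 0]], a symmetric n x n matrix.
A = [[0.0] * n for _ in range(n)]
for i in range(n_v):
    for j in range(n_h):
        A[i][n_v + j] = 0.5 * W[i][j]
        A[n_v + j][i] = 0.5 * W[i][j]

def quad(A, x):
    """x^T A x = sum_{i,j} A_ij x_i x_j."""
    return sum(A[i][j] * x[i] * x[j]
               for i in range(len(x)) for j in range(len(x)))

# The two off-diagonal blocks each contribute (1/2) v^T W h,
# so x^T A x = v^T W h for x = [v, h].
v, h = [1, -1], [-1, 1]
x = v + h
vWh = sum(v[i] * W[i][j] * h[j] for i in range(n_v) for j in range(n_h))
```

The factor 1/2 compensates for A being symmetric: the upper and lower blocks together contribute v⊤Wh once.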

Page 17

Randomized Relax and Round

Outline

1 Background: The problem; Integer quadratic program

2 Randomized Relax and Round: Relaxations; Randomized rounding

3 Evaluations

4 Conclusions

Page 18

Randomized Relax and Round / Relaxations

Warm-up: a simple relaxation

MAP inference in RBM is an instance of the integer quadratic program

maximize  x⊤Ax
subject to  x ∈ {−1, 1}^n

relaxes to

maximize  x⊤Ax
subject to  x ∈ [−1, 1]^n
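A minimal sketch of attacking this box relaxation with projected gradient ascent, clipping back to [−1, 1] after each step. The matrix, starting point, and step size are made-up toy values, not from the talk.

```python
# Toy symmetric matrix with zero diagonal (hypothetical values).
A = [[0.0, 1.0, -0.5],
     [1.0, 0.0,  2.0],
     [-0.5, 2.0, 0.0]]
n = len(A)

def obj(x):
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

x = [0.3, -0.2, 0.5]   # a feasible starting point in [-1, 1]^3
start = obj(x)
step = 0.1             # small enough for monotone ascent on this A
for _ in range(200):
    # Gradient of x^T A x is 2 A x (A is symmetric).
    grad = [2.0 * sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    # Ascent step, then project back onto the box [-1, 1]^n.
    x = [max(-1.0, min(1.0, xi + step * gi)) for xi, gi in zip(x, grad)]
```

In general only a local optimum is guaranteed, and the relaxed solution may be fractional, which is why a rounding step is still needed.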

Page 19

Randomized Relax and Round / Relaxations

Introducing the SDP relaxation

MAP inference in RBM is an instance of the integer quadratic program

maximize  x⊤Ax = tr(Axx⊤)
subject to  x ∈ {−1, 1}^n

Make the objective linear by reparametrizing S = xx⊤:

maximize  tr(AS)
subject to  S ∈ R^{n×n}, S ⪰ 0, diag(S) = 1, rank(S) = 1

Dropping the non-convex constraint rank(S) = 1 gives us a semidefiniteprogram (SDP).
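The identity behind the reparametrization can be checked numerically; a tiny sketch with made-up values:

```python
# For x in {-1, 1}^n and S = x x^T: tr(A S) = x^T A x, diag(S) = 1,
# and S is rank one (hypothetical 2x2 example).
A = [[0.0, 1.5],
     [1.5, 0.0]]
x = [1, -1]
n = len(x)

S = [[x[i] * x[j] for j in range(n)] for i in range(n)]   # S = x x^T

# tr(A S) = sum_i (A S)_ii = sum_{i,j} A_ij S_ji
tr_AS = sum(A[i][j] * S[j][i] for i in range(n) for j in range(n))
xAx = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
```

After dropping rank(S) = 1, a feasible S no longer corresponds to a single x, which is what the rounding step below recovers.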

Page 21

Randomized Relax and Round / Relaxations

SDP as relaxation of IQP

Picture: SDP

Rows have Euclidean norm 1

Page 22

Randomized Relax and Round / Relaxations

Rank k relaxation

While the SDP can be solved efficiently in theory, it does not scale very well in n. Consider the rank-k relaxation instead:

maximize  tr(AS)
subject to  S ∈ R^{n×n}, S ⪰ 0, diag(S) = 1, rank(S) ≤ k

We can reparametrize in the reverse direction, S = XX⊤:

maximize  tr(AS) = tr(X⊤AX)
subject to  X ∈ R^{n×k}, with ‖X_i‖₂ = 1 for each row i

We lose convexity, but we can efficiently find a local optimum by projected gradient descent on X.
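A sketch of that procedure with made-up sizes and data: the gradient of tr(X⊤AX) with respect to X is 2AX (for symmetric A), and the projection renormalizes each row of X to the unit sphere. This is only an illustrative implementation of the idea, not the authors' code.

```python
import math
import random

random.seed(0)
n, k, step = 4, 2, 0.05

# Random symmetric toy matrix with zero diagonal (hypothetical data).
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        A[i][j] = A[j][i] = random.uniform(-1, 1)

def normalize_rows(X):
    """Project each row back onto the unit sphere (the diag(S) = 1 constraint)."""
    return [[v / math.sqrt(sum(w * w for w in row)) for v in row] for row in X]

def objective(X):
    """tr(X^T A X) = sum_{i,j} A_ij <X_i, X_j>."""
    return sum(A[i][j] * sum(X[i][d] * X[j][d] for d in range(k))
               for i in range(n) for j in range(n))

X = normalize_rows([[random.gauss(0, 1) for _ in range(k)] for _ in range(n)])
for _ in range(300):
    # Gradient step in the direction of 2 A X, then re-project.
    G = [[2.0 * sum(A[i][j] * X[j][d] for j in range(n)) for d in range(k)]
         for i in range(n)]
    X = normalize_rows([[X[i][d] + step * G[i][d] for d in range(k)]
                        for i in range(n)])
```

At k = n this searches the same space as the full SDP, while small k keeps each iteration cheap: O(n²k) for the matrix product.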

Page 24

Randomized Relax and Round / Relaxations

A picture

Rank 3 relaxation. Rows have Euclidean norm 1.

Page 25

Randomized Relax and Round / Randomized rounding

X → x rounding scheme

Goemans and Williamson style rounding for MAXCUT. Given X ∈ R^{n×k}, we want to get x ∈ {−1, 1}^n:

Sample random spherical unit vector g

Take each x_i = sgn(⟨X_i, g⟩)

[Figure: rounding illustration with unit vectors X1, X2, X3 and a random direction g assigning signs +1 and −1.]
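In code, the rounding step is tiny. Here it is applied to a hypothetical relaxed solution X of three unit rows in R^2 (toy values, not from the talk):

```python
import random

random.seed(1)

def round_once(X, k):
    """Sample a random direction g and set x_i = sgn(<X_i, g>)."""
    g = [random.gauss(0.0, 1.0) for _ in range(k)]
    return [1 if sum(Xi[d] * g[d] for d in range(k)) >= 0 else -1 for Xi in X]

# Hypothetical relaxed solution: three unit-norm rows in R^2.
X = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, 0.0]]

# Each call gives one candidate x in {-1, 1}^3; repeated calls yield a
# variety of roundings, from which the best (or a sample set) can be kept.
samples = [round_once(X, 2) for _ in range(5)]
```

Rows that point in similar directions tend to receive the same sign, which is how the rounded x inherits the correlation structure of the relaxed solution.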

Page 26

Evaluations

Outline

1 Background: The problem; Integer quadratic program

2 Randomized Relax and Round: Relaxations; Randomized rounding

3 Evaluations

4 Conclusions

Page 27

Evaluations

MAP Inference in RBM

Solving

maximize  x⊤Ax = tr(Axx⊤)
subject to  x ∈ {−1, 1}^n

with randomized relax and rounding (rrr)

          rrr       AG        rrr-AG    Gu
MNIST     340.29    377.47    377.39    319.34
Random    22309     22175     23358     12939
Hard      40037     36236     41016     23347

AG: annealed Gibbs

Gu: Gurobi IQP solver, given 10x longer time

Page 28

Evaluations

RRR can be faster and better than annealed Gibbs

[Figure: negative energy (×10^4) vs. number of steps, for the relaxation objective, the rounded solutions, annealed Gibbs (AG), and high-temperature annealed Gibbs (AG-hitemp).]

Figure: A comparison of convergence speeds

Page 29

Evaluations

RRR is fast, and performs well

              seg               dbn
            time     obj      time    obj
Gurobi     115.68   3.7958   301     2.1013
A. Gibbs     1.26   3.3537    33     1.9086
k = 1        0.15   3.2267     3.7   1.6950
k = 2        0.20   3.6830    20     2.0188
k = 4        0.49   3.7653    12     2.1119
k = 8        0.86   3.7643     5.0   2.1120

Table: Averaged results on PASCAL PIC 2011 instances. Time is measured in seconds. DBN is a deep belief network trained on MNIST.

Page 30

Evaluations

RRR gives us near-MAP samples

[Figure: two panels of density histograms of negative energy for rrr-MAP and Gibbs samples.]

Figure: The negative energy of samples from rrr-MAP compared to Gibbs. Top: RBM on MNIST; bottom: random matrix.

Page 31

Evaluations

Histograms of the rrr-MAP samples vs. k

[Figure: a 3×4 grid of histograms; rows k = 2, 4, 8; columns Random, Distance, Seg., DBN.]

Figure: From top to bottom, rows vary across k = 2, 4, 8. From left to right, columns show: (1) random A; (2) a pairwise distance matrix formed by MNIST digits 4 and 9; (3) an instance from seg; (4) an instance from dbn. The limits of the x-axis are identical in each column.

Page 32

Evaluations

Use these samples to estimate the partition function

            True     AIS      rrr-low   rrr-IS
MNIST       -        436.37   436.69    438.40
Random-S    5127.6   5127.5   5095.7    5092.4
Random-L    -        9750.5   9547.7    9606.7

Table: Estimates of the RBM log-partition function log Z(A)
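On models small enough to enumerate, the exact log-partition function is available as a sanity check for such estimators. A toy sketch with made-up parameters, unrelated to the table above:

```python
import itertools
import math

# Tiny toy RBM (hypothetical parameters): 2 visible, 2 hidden units.
W = [[0.4, -0.3],
     [0.1,  0.7]]
a = [0.2, -0.1]
b = [0.0,  0.5]

def log_Z():
    """log Z = log sum_{v,h} exp(v^T W h + a^T v + b^T h), by enumeration."""
    total = 0.0
    for v in itertools.product([0, 1], repeat=len(a)):
        for h in itertools.product([0, 1], repeat=len(b)):
            score = (sum(v[i] * W[i][j] * h[j]
                         for i in range(len(a)) for j in range(len(b)))
                     + sum(ai * vi for ai, vi in zip(a, v))
                     + sum(bj * hj for bj, hj in zip(b, h)))
            total += math.exp(score)
    return math.log(total)
```

Exact enumeration costs 2^(n_v + n_h) terms, which is exactly why estimators such as AIS or the rrr-based ones above are needed at realistic sizes.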

Page 33

Conclusions

Outline

1 Background: The problem; Integer quadratic program

2 Randomized Relax and Round: Relaxations; Randomized rounding

3 Evaluations

4 Conclusions

Page 34

Conclusions

Current and future work

Randomized relax-and-round for learning

Better theoretical bounds specializing to machine learning problems

Tighten the relaxation using relaxation hierarchies

Page 35

Conclusions

Conclusions

Proposed the randomized relax-and-round method for MAP inference in MRFs

Evaluated on RBMs. RRR sometimes performs better than annealed Gibbs, and always gives annealed Gibbs a better initialization.

Generates near-MAP samples
