TRANSCRIPT
Probabilistic Inference by Hashing and
Optimization
Stefano Ermon
Stanford University
Challenges in reasoning about complex systems
[Figure: fields facing these challenges: Operations Research, Management Science, Economics, Statistics, Computer Science, Engineering, Physics]
Three major challenges:
1. High-dimensional spaces: need to consider many variables
2. Uncertainty: limited information, need to use stochastic models
3. Preferences and utilities: need to consider optimization criteria
Decision Making and Inference under
Limited Information and Large Dimensionality
[Figure: the three challenges and the research areas they connect: high-dimensional spaces (combinatorial reasoning, combinatorial optimization); lack of information/uncertainty (probabilistic reasoning, statistics, stochastic optimization); preferences and utilities (optimization, decision-making)]
Decision Making and Inference for Sustainability
[Figure: example applications organized by challenge]
• Lack of information/uncertainty: fuel cell materials [SAT-12, AAAI-15, IJCAI-15], EV batteries [ECML-12], poverty traps [AAAI-15], poverty mapping [ongoing]
• High-dimensional spaces: natural resources management [UAI-10, IJCAI-11]
• Preferences and utilities: habitat monitoring [ongoing]
Research Agenda
Foundations:
– Inference by hashing and optimization [ICML-13, UAI-13, ICML-14, ICML-15, UAI-15, AAAI-16]
– Bridging statistical physics and computer science [CP-10, IJCAI-11, NIPS-11, NIPS-12]
– Sampling [NIPS-13, UAI-12]
Applications:
– High-throughput data mining and catalyst discovery [SAT-12, AAAI-15, IJCAI-15]
– Poverty mapping [AAAI-16]
– Natural resources management [UAI-10, IJCAI-11]
Outline
1. Introduction
2. WISH: probabilistic inference by hashing and optimization
I. Formalism
II. Empirical validation
3. Families of hash functions
4. Random projections and hashing
5. Conclusions
Statistical Modeling and Probabilistic Inference
Statistical Modeling:
• Stochastic variables
• Models are probability distributions
• Goal is to make inferences/predictions. Examples:
– What is the probability that an object is in a certain
location, given some noisy sensor measurements?
– How well will this policy perform on average?
• Inference is also needed to learn models from data
Probabilistic inference is a cornerstone of statistical
machine learning and statistical applications
Progress So Far
Fundamental problem: studied since the early days of computing (Manhattan Project – diffusion behavior of nuclear particles).
Much progress has been made (contributions from CS, physics, statistics, etc.). To date, two main approaches:
– Markov Chain Monte Carlo (MCMC) sampling (ranked one of the top 10 algorithms of all time)
– Variational methods
This work: a fundamentally different approach based on hashing and optimization [Chakraborty, Fremont, Meel, Seshia, and Vardi; Achlioptas et al.; Belle et al.; Gomes et al., …]
A Worst-Case Perspective
[Figure: complexity hierarchy, from easy to hard: P ⊆ NP ⊆ PH ⊆ P^#P ⊆ PSPACE ⊆ EXP]
• NP-complete/hard: combinatorial reasoning (SAT, puzzles, Integer Programming (MIP)); high-dimensional spaces
• #P-complete/hard: probabilistic reasoning (#SAT, sampling); high-dimensional spaces plus uncertainty
Note: this is the widely believed hierarchy; we only know P ≠ EXP for sure.
Inference by Hashing and Optimization
Main technical contribution: if we allow
– a small failure probability (e.g., 0.1%)
– a small error (e.g., 1% relative error)
then we can reduce probabilistic inference (#P) to a small number of NP-equivalent queries (combinatorial search/optimization).
Search: is there a way to set the variables such that some property holds?
Integration: how many ways are there to set the variables such that a certain property holds?
[Figure: reduction from #P (probabilistic reasoning over high-dimensional spaces under uncertainty) to NP (combinatorial search)]
A Practical Perspective
• Progress in combinatorial search since the 1990s:
  – Boolean Satisfiability (SAT) / SMT
  – Mixed Integer Programming (MIP)
• Problem size: from 100 variables and 200 constraints (early 90s) to 1,000,000 variables and 5,000,000 constraints in 20 years
• The time is ripe for probabilistic inference applications: techniques traditionally used for deductive reasoning (verification, synthesis) are brought to the realm of probability and inductive reasoning.
[Figure: high-dimensional combinatorial spaces; preferences and utilities]
Probabilistic Inference
For any configuration (or state), defined by an assignment of values to the random variables, we can compute the weight/probability of that configuration.
[Figure: Bayes net with nodes RAIN, SPRINKLER, GRASS WET]
Factors (conditional probability tables):
  RAIN:       P(Rain=T) = 0.2,  P(Rain=F) = 0.8
  SPRINKLER:  P(Sprinkler=T | Rain=T) = 0.01,  P(Sprinkler=F | Rain=T) = 0.99
              P(Sprinkler=T | Rain=F) = 0.4,   P(Sprinkler=F | Rain=F) = 0.6
  GRASS WET:  P(Wet=T | Sprinkler=T, Rain=T) = 0.99,  P(Wet=F | Sprinkler=T, Rain=T) = 0.01 (other rows elided)
Example: Pr[Rain=T, Sprinkler=T, Wet=T] ∝ 0.01 * 0.2 * 0.99
Probabilistic Inference
What is the (conditional) probability of an event? For example,
Pr[Wet=T] = Σ_{x∈{T,F}} Σ_{y∈{T,F}} Pr[Rain=x, Sprinkler=y, Wet=T]
involves expectations/nested sums (sum up contributions from all
possible variable assignments)
Probabilistic Inference in High Dimensions
• We are given:
  – a set of 2^n configurations (= assignments)
  – non-negative weights w (from a Bayes net, factor graph, weighted CNF, ...)
• Goal: compute the total weight
• Example 1: n = 2 binary variables, sum over 4 configurations:
  P[Wet=T] = Σ_{x∈{T,F}} Σ_{y∈{T,F}} P[Rain=x, Sprinkler=y, Wet=T] = 0 + 0.15 + 0.29 + 0.002 = 0.442
  [Figure: bar chart of the weights of the four configurations (Rain, Sprinkler) ∈ {T,F}²]
• Example 2: n = 100 variables, sum over 2^100 ≈ 10^30 terms (curse of dimensionality; generally #P-complete)
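As a concrete illustration of the nested sum above, here is a minimal brute-force sketch in Python. The RAIN and SPRINKLER tables are taken from the slide; the GRASS WET entries that the slide elides are filled in with the standard textbook values (0.9, 0.8, 0.0), which reproduce the per-configuration weights shown (0, 0.15, 0.29, 0.002 after rounding).

```python
from itertools import product

# Conditional probability tables (RAIN and SPRINKLER from the slide;
# the GRASS WET entries marked "..." on the slide are assumed textbook values).
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},   # P(Sprinkler | Rain=True)
               False: {True: 0.4, False: 0.6}}    # P(Sprinkler | Rain=False)
P_wet = {(True, True): 0.99,   # P(Wet=T | Sprinkler=T, Rain=T), from the slide
         (True, False): 0.9,   # assumed
         (False, True): 0.8,   # assumed
         (False, False): 0.0}  # assumed

def weight(rain, sprinkler, wet=True):
    """Unnormalized weight of one configuration: the product of the factors."""
    p = P_rain[rain] * P_sprinkler[rain][sprinkler]
    return p * (P_wet[(sprinkler, rain)] if wet else 1 - P_wet[(sprinkler, rain)])

# Marginal P(Wet=T): enumerate every assignment of the remaining variables.
total = sum(weight(r, s) for r, s in product([True, False], repeat=2))
print(round(total, 3))  # ~0.448 (the slide rounds each term first: 0 + 0.15 + 0.29 + 0.002 = 0.442)
```

For n = 100 variables this enumeration would need roughly 2^100 terms, which is exactly the blow-up the rest of the talk is about avoiding.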
A New Approach: WISH
Running example: a probabilistic model with 10 binary variables x1, …, x10, i.e., 1024 configurations, sorted by increasing weight.
[Figure: configurations on the x-axis sorted by increasing weight, weight on the y-axis; all 1024 weights are unknown ('?')]
Use optimization to find the most likely configuration (NP-equivalent): in this example its weight is 50.
[Figure: the most likely configuration now has weight 50; all other weights are still unknown]
Use optimization to find the least likely configuration: here its weight is 0.
[Figure: least likely configuration has weight 0, most likely has weight 50; the rest are unknown]
Every remaining configuration therefore has weight between 0 and 50.
[Figure: each unknown bar labeled 0 ≤ ? ≤ 50]
The weight distribution could look like this: Total weight = 1023 * 0 + 50 = 50.
[Figure: all unknown weights equal to 0]
OR it could look like this: Total weight = 1023 * 50 + 0 = 51150.
[Figure: all unknown weights equal to 50]
So far we only know: 50 <= Total weight <= 1023 * 50.
Find the weight of the 2nd most likely configuration (how is explained later): here it is 40, so every remaining configuration has weight between 0 and 40.
90 <= Total weight <= 50 + 40*1022
Find the weight of the 4th most likely configuration (how is explained later): here it is 9. The configurations below it have weight between 0 and 9, and the 3rd most likely is between 9 and 40.
90 + 9*2 <= Total weight <= 50 + 40*2 + 9*1020
Find the weight of the 8th most likely configuration (how is explained later): here it is 5. The configurations below it have weight between 0 and 5, and those between the 8th and the 4th are between 5 and 9.
90 + 9*2 + 5*4 <= Total weight <= 50 + 40*2 + 9*4 + 5*1016
A New Approach: WISH
If we have
• the weight of the most likely configuration,
• the weight of the 2nd most likely configuration,
• the weight of the 4th most likely configuration,
• …
• the weight of the 512th most likely configuration
(i.e., the weight of the 2^i-th most likely configuration for each i), then we can show:
this yields a factor-of-2 approximation of the total weight.
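To make the factor-of-2 claim concrete, here is a small sketch (Python) of how these quantile weights combine into lower and upper bounds on the total weight. The values 50, 40, 9, 5 come from the running example; the remaining entries of b are hypothetical placeholders just to complete the list.

```python
# b[i] = weight of the 2^i-th most likely configuration, i = 0 .. n (n = 10 here).
# 50, 40, 9, 5 come from the running example; the rest are hypothetical placeholders.
b = [50, 40, 9, 5, 3, 2, 1, 1, 1, 0.5, 0]
n = len(b) - 1  # 10 binary variables -> 2^10 = 1024 configurations

# Configurations ranked 2^i + 1 .. 2^(i+1) (there are 2^i of them) each have weight
# between b[i+1] and b[i], which sandwiches the total weight Z:
lower = b[0] + sum(b[i + 1] * 2**i for i in range(n))
upper = b[0] + sum(b[i] * 2**i for i in range(n))
estimate = lower  # matches the final-estimate formula later in the talk

print(lower, upper, upper / lower)  # the ratio is at most 2, so the estimate is a 2-approximation
```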
How can we compute the weight of the 2^i-th most likely configuration when i is large (say i = 50)?
Random parity constraints
• XOR/parity constraints:
  – Example: a ⊕ b ⊕ c ⊕ d = 1, satisfied if an odd number of a, b, c, d are set to 1
• For every configuration, set its weight to zero if the parity constraint is not satisfied
• Randomly generated parity constraint X (each variable added with prob. 0.5): x1 ⊕ x3 ⊕ x4 ⊕ x7 ⊕ x10 = 1
[Figure: weight function over the configurations before and after applying the constraint]
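A minimal sketch of this projection step in Python. The constraint is sampled exactly as on the slide (each variable included independently with probability 0.5, parity required to be 1); the 10-variable weight function is a made-up placeholder.

```python
import random
from itertools import product

n = 10  # number of binary variables

def random_parity_constraint(n, p=0.5):
    """Pick a random subset of variables; the XOR of the chosen bits must equal 1."""
    return [i for i in range(n) if random.random() < p]

def satisfies(config, constraint):
    """config is a tuple of n bits; the parity of the selected bits must be odd."""
    return sum(config[i] for i in constraint) % 2 == 1

def project(weights, constraint):
    """Zero out the weight of every configuration violating the constraint."""
    return {c: (w if satisfies(c, constraint) else 0.0) for c, w in weights.items()}

# Placeholder weight function over all 2^10 configurations (purely illustrative).
weights = {c: sum(c) + 1.0 for c in product([0, 1], repeat=n)}

X = random_parity_constraint(n)
projected = project(weights, X)
# Roughly half of the configurations keep their weight; the rest are zeroed out.
print(len(X), sum(w > 0 for w in projected.values()))
```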
Strongly Universal Hashing
• XOR/parity constraints to the rescue
  – a ⊕ b ⊕ c ⊕ d = 1: satisfied if an odd number of a, b, c, d are set to 1
• Randomly generated parity constraint X (each variable added with prob. 0.5): x1 ⊕ x3 ⊕ x4 ⊕ x7 ⊕ x10 = 1
Two crucial properties (strongly universal hashing [Valiant-Vazirani, Stockmeyer]):
  1) Uniformity: for every configuration A, Pr[A satisfies X] = 0.5
  2) Pairwise independence: for every two configurations A and B, "A satisfies X" and "B satisfies X" are independent events
[Figure: weight function over the configurations]
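A quick empirical check of the two properties (Python). Note one assumption on my part: the check below uses the standard pairwise-independent construction h(x) = a·x ⊕ b over GF(2) with a random right-hand-side bit b as well, which is what makes uniformity hold for every configuration; the slide's example fixes the right-hand side to 1.

```python
import random
from itertools import product

n = 6
configs = list(product([0, 1], repeat=n))

def random_hash(n):
    """h(x) = (a . x) XOR b over GF(2): each variable in with prob. 0.5, plus a random parity bit."""
    a = [random.randint(0, 1) for _ in range(n)]
    b = random.randint(0, 1)
    return lambda x: (sum(ai * xi for ai, xi in zip(a, x)) + b) % 2

trials = 20000
A, B = configs[3], configs[40]  # two arbitrary, distinct configurations
cnt_A = cnt_B = cnt_AB = 0
for _ in range(trials):
    h = random_hash(n)
    sa, sb = h(A) == 1, h(B) == 1   # "A satisfies X", "B satisfies X"
    cnt_A += sa; cnt_B += sb; cnt_AB += sa and sb

# Uniformity: each frequency is ~0.5.  Pairwise independence: the joint is ~0.25 = 0.5 * 0.5.
print(cnt_A / trials, cnt_B / trials, cnt_AB / trials)
```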
Universal Hashing and Optimization
• Example: suppose we want to compute the weight of the 64th most likely configuration
• After adding 6 random parity constraints we are left with 1024/64 = 1024/2^6 = 16 configurations with nonzero weight, on average
• The weight of the mode (most likely configuration) of the randomly projected distribution is "often" "close enough" to the weight of the 64th most likely configuration of the original distribution
• Use optimization to find the most likely configuration of the projected distribution
[Figure: weight function before and after the random projection]
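A quick empirical check of the "1024/2^6 = 16 survivors on average" claim (Python). As before, this is a hedged sketch using parity constraints with random right-hand sides, not the exact code behind the talk.

```python
import random
from itertools import product

n, m, trials = 10, 6, 200
configs = list(product([0, 1], repeat=n))

def survives(x, constraints):
    """x survives if it satisfies every parity constraint (a, b): a.x = b (mod 2)."""
    return all(sum(ai * xi for ai, xi in zip(a, x)) % 2 == b for a, b in constraints)

counts = []
for _ in range(trials):
    constraints = [([random.randint(0, 1) for _ in range(n)], random.randint(0, 1))
                   for _ in range(m)]
    counts.append(sum(survives(x, constraints) for x in configs))

# Expected number of surviving configurations is 2^n / 2^m = 1024 / 64 = 16.
print(sum(counts) / trials)
```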
Visual working of the algorithm
• How it works:
  – For i = 0, 1, 2, …, n: add i random parity constraints to the weight function over x1, …, xn and use optimization to find the mode of the projected model; repeat log(n) times and take the median M_i.
  – M_0 is the weight of the mode of the original model; M_1 ≈ weight of the 2nd most likely configuration; M_2 ≈ weight of the 4th; M_3 ≈ weight of the 8th; and so on.
  – Final estimate for the total weight: M_0 + M_1×1 + M_2×2 + M_3×4 + …
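Putting the pieces together, here is a brute-force sketch of the procedure just described (Python). The "optimization" oracle is replaced by explicit enumeration, so it only runs for tiny n; the weight function is a made-up placeholder, and a fixed T repetitions per level stands in for the log(n) repetitions of the actual algorithm.

```python
import random
from itertools import product
from statistics import median

def wish_estimate(weight, n, T=7):
    """Estimate Z = sum_x weight(x) via hashing + optimization (brute-force oracle)."""
    configs = list(product([0, 1], repeat=n))

    def projected_mode(i):
        # Add i random parity constraints a.x = b (mod 2), then return the max weight
        # among the surviving configurations (0 if none survive).
        cons = [([random.randint(0, 1) for _ in range(n)], random.randint(0, 1))
                for _ in range(i)]
        ok = lambda x: all(sum(ai * xi for ai, xi in zip(a, x)) % 2 == b for a, b in cons)
        return max((weight(x) for x in configs if ok(x)), default=0.0)

    # M[i] = median over T repetitions of the mode after i constraints.
    M = [median(projected_mode(i) for _ in range(T)) for i in range(n + 1)]
    return M[0] + sum(M[i + 1] * 2**i for i in range(n))

# Placeholder weight function on n = 8 binary variables.
n = 8
w = lambda x: 2.0 ** sum(x)                              # hypothetical factored weight
exact = sum(w(x) for x in product([0, 1], repeat=n))     # = 3^8 = 6561
print(exact, wish_estimate(w, n))                        # estimate is within a constant factor
```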
Accuracy Guarantees
Main Theorem (stated informally):
With probability at least 1- δ (e.g., 99.9%),
WISH (Weighted-Sums-Hashing) computes a sum defined
over 2n configurations (probabilistic inference, #P-hard)
with a relative error that can be made as small as desired,
and it requires solving Θ(n log n) optimization instances
(NP-equivalent problems).
Key Features
• Strong accuracy guarantees
• Can leverage recent advances in optimization
• Decoding techniques for error correcting codes (LDPC)
• Even without optimality guarantees, bounds/approximations
on the optimization translate to bounds/approximations on
the weighted sum
• Straightforward parallelization (independent optimizations)
– We experimented with >600 cores
Outline
1. Introduction
2. WISH: probabilistic inference by hashing and optimization
I. Formalism
II. Empirical validation
3. Families of hash functions
4. Random projections and hashing
5. Conclusions
Counting via NP-Queries: Sudoku
Filling a Sudoku puzzle is a combinatorial search problem.
We are reasonably good at it.
Computers are extremely good at it.
Counting via NP-Queries: Sudoku
• How many ways are there to fill a valid Sudoku square? There are ≈10^77 possible ways to fill the 81 cells x1, …, x81; the weight of a filled square is 1 if the square is valid and 0 otherwise.
• WISH estimate: 1.634×10^21
• Specialized proof: 6.671×10^21
• Traditional probabilistic inference methods fail: they cannot reason about such complex constraints.
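The only thing the reduction needs here is the 0/1 weight oracle over the ≈10^77 ways of filling the 81 cells. A minimal sketch of that oracle in Python; the combinatorial optimization itself would be handed to a SAT/MIP solver rather than written by hand.

```python
def sudoku_weight(grid):
    """grid: 9x9 list of digits 1..9 (a full assignment to x1..x81).
    Returns 1 if it is a valid Sudoku square, 0 otherwise."""
    digits = set(range(1, 10))
    rows = all(set(row) == digits for row in grid)
    cols = all(set(grid[r][c] for r in range(9)) == digits for c in range(9))
    boxes = all(set(grid[r + i][c + j] for i in range(3) for j in range(3)) == digits
                for r in range(0, 9, 3) for c in range(0, 9, 3))
    return 1 if rows and cols and boxes else 0

# Counting valid squares is then the #P-style sum over all 9^81 grids of sudoku_weight(grid).
```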
Inference benchmark: Ising models
Ising models arise in statistical physics and in computer vision.
[Plots: log normalization constant vs. number of variables]
Inference benchmark: Ising models from physics
[Plot: log normalization constant vs. number of variables for WISH, Belief Propagation, Mean Field, a BP variant, and the ground truth]
• Guaranteed relative error: the ground truth is provably within WISH's error band (too small to see at this scale).
• Variational methods provide inaccurate estimates (well outside the error band).
Inference benchmark: Ising models (continued)
[Plots: estimates from Gibbs Sampling, Belief Propagation, and WISH, added one at a time]
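For context, the quantity plotted in these benchmarks is the log normalization constant (log partition function) of an Ising model. A brute-force sketch for a tiny grid in Python, with made-up interaction strengths; the instances in the talk are far too large for this, which is why the estimates above matter.

```python
import math
from itertools import product

# Tiny L x L grid Ising model with spins in {-1, +1}.
L = 3
edges = [((r, c), (r, c + 1)) for r in range(L) for c in range(L - 1)] + \
        [((r, c), (r + 1, c)) for r in range(L - 1) for c in range(L)]
coupling = {e: 0.5 for e in edges}   # hypothetical uniform interaction strength
field = 0.1                          # hypothetical local field

def log_weight(spins):
    """spins: dict mapping (row, col) -> +/-1; returns the exponent of the weight."""
    e = sum(coupling[(u, v)] * spins[u] * spins[v] for (u, v) in edges)
    e += field * sum(spins.values())
    return e

sites = [(r, c) for r in range(L) for c in range(L)]
Z = sum(math.exp(log_weight(dict(zip(sites, s))))
        for s in product([-1, 1], repeat=len(sites)))
print(math.log(Z))   # the "log normalization constant" on the plots' y-axis
```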
Outline
1. Introduction
2. WISH: probabilistic inference by hashing and optimization
I. Formalism
II. Empirical validation
3. Families of hash functions
4. Random projections and hashing
5. Conclusions
Hashing as coin flipping
• Hashing as a distributed coin-flipping mechanism
• Ideally, each configuration flips a coin independently
[Figure: each configuration 0000, 0001, …, 1101, 1110, 1111 flips its own coin (heads or tails)]
• Issue: we cannot simulate so many independent coin flips (one for each possible variable assignment)

Pairwise Independent Hashing
• Solution: each configuration flips a coin pairwise independently
  – any two coin flips are independent
  – three or more might not be independent
• Still works! Pairwise independence guarantees that configurations do not cluster too much in a single bucket.
• Can be simulated using parity constraints: add each variable with probability ½.
Average Universal Hashing
• "Long" parity constraints are difficult to deal with!
• New class of average universal hash functions (coin-flip generation mechanism) [ICML-14]
• Pairs of coin flips are NOT guaranteed to be independent anymore
• Key idea: look at large enough sets. If we look at all pairs, on average they are "independent enough". Main result:
  1. These coin flips are good enough for probabilistic inference (applies to all previous techniques based on hashing; same theoretical guarantees)
  2. Can be implemented with short parity constraints
Short Parity Constraints for Probabilistic Inference
and Model Counting
Main Theorem (stated informally): [AAAI-16]
For large enough n (= number of variables),
– Parity constraints of length log(n) are sufficient.
– Parity constraints of length log(n) are also necessary.
Proof borrows ideas from the theory of low-density parity
check codes.
Short constraints are much easier to deal with in practice.
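A hedged illustration of what "length log(n)" means operationally: instead of including each variable with probability 1/2 (expected length n/2), include it with probability about c·log(n)/n, so each constraint touches only O(log n) variables. This is just a sampling sketch, not the construction analyzed in [AAAI-16].

```python
import math
import random

def short_parity_constraint(n, c=2.0):
    """Each variable joins independently with prob. c*log(n)/n, plus a random parity bit.
    Expected constraint length is c*log(n) instead of n/2."""
    p = min(1.0, c * math.log(n) / n)
    vars_in = [i for i in range(n) if random.random() < p]
    return vars_in, random.randint(0, 1)

n = 1_000_000
vars_in, b = short_parity_constraint(n)
print(len(vars_in))   # around 2*log(1e6) ~ 28 variables, vs. ~500,000 for dense constraints
```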
Model Selection
• Model Selection: we have several competing statistical models. Which
one fits the data better?
– Pick the one that is most likely to have generated the data
– In the majority of models of interest, probabilities are defined up to a
normalization constant
– Ranking requires the computation of the normalization constant
• Model selection for RBM (Restricted Boltzmann Machine) models:
[Figure: handwritten digits modeled by RBMs]
Using short constraints we get state-of-the-art estimates of the normalization constant on Restricted Boltzmann Machines (vs. Annealed Importance Sampling) [UAI-15]
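The normalization constant being ranked here is the RBM partition function Z = Σ_{v,h} exp(vᵀWh + bᵀv + cᵀh). A brute-force sketch for a tiny RBM in Python, with random made-up parameters; real digit models are far too large for this, which is why hashing-based estimates matter.

```python
import math
import random
from itertools import product

# Tiny RBM: nv visible units, nh hidden units, random made-up parameters.
nv, nh = 6, 4
random.seed(0)
W = [[random.gauss(0, 0.1) for _ in range(nh)] for _ in range(nv)]
b = [random.gauss(0, 0.1) for _ in range(nv)]   # visible biases
c = [random.gauss(0, 0.1) for _ in range(nh)]   # hidden biases

def unnorm_log_prob(v, h):
    """Log of the unnormalized probability: v'Wh + b'v + c'h."""
    return (sum(v[i] * W[i][j] * h[j] for i in range(nv) for j in range(nh))
            + sum(b[i] * v[i] for i in range(nv))
            + sum(c[j] * h[j] for j in range(nh)))

logZ = math.log(sum(math.exp(unnorm_log_prob(v, h))
                    for v in product([0, 1], repeat=nv)
                    for h in product([0, 1], repeat=nh)))
print(logZ)   # the log normalization constant needed to rank competing models
```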
Outline
1. Introduction
2. WISH: probabilistic inference by hashing and optimization
I. Formalism
II. Empirical validation
3. Families of hash functions
4. Random projections and hashing
5. Conclusions
Hashing as a random projection
• XOR/parity constraints:
  – Example: a ⊕ b ⊕ c ⊕ d = 1, satisfied if an odd number of a, b, c, d are set to 1
• Randomly generated parity constraint X (each variable added with prob. 0.5): x1 ⊕ x3 ⊕ x4 ⊕ x7 ⊕ x10 = 1
• Set a configuration's weight to zero if the constraint is not satisfied
[Figure: weight function over the configurations before and after the random projection]
This random projection:
1. "simplifies" the model
2. preserves its "key properties"
Variational Methods and Random Projections
[Figure: a complex, intractable model p (a probability distribution); its I-projection onto a family Q of "simple" distributions, i.e., the tractable distribution that is as close as possible to p; and a random projection of p to p', followed by an I-projection of p']
• Key issue with all variational methods: the approximation can be arbitrarily bad (the distance from p to Q cannot be bounded).
• Idea: take a random projection first, then an I-projection [AISTATS-16]. Taking a random projection of p:
  1. preserves the properties of p
  2. makes p' "simpler" and therefore guaranteed to be "close" to Q
• Main result: provably good approximation with high probability
Mean Field with Random Projections
[Figure: deep generative model of handwritten digits; results for the standard variational approach (I-projection) vs. the approach augmented with random projections]
Adding random projections leads to an optimization problem:
• with fewer variables
• with additional non-convex constraints
• still convex coordinate-wise
Summary
Probabilistic inference by hashing and
optimization:
– Probabilistic inference reduced to a small number of
combinatorial optimization problems by universal hashing
– Hashing as a random projection: simplifies a model while
preserving some of its key properties
– Can leverage existing combinatorial solvers and variational
methods
– Provides formal accuracy guarantees
– Works well in practice on a range of domains
Conclusions
[Figure: inductive reasoning (machine learning, probabilistic inference, data) meets deductive reasoning (logic, constraint optimization, background theories such as physical laws)]
• Foundational AI/machine learning work:
  – New probabilistic inference by hashing and optimization
  – Learning
  – Decision making under uncertainty
  – …
• Applications:
  – Incorporating background knowledge into probabilistic models to aid scientific discovery
  – …
Three challenges:
1. High-dimensional spaces
2. Uncertainty
3. Preferences and utilities