TRANSCRIPT
Probabilistic Inference by Hashing and
Optimization
Stefano Ermon
Stanford University
Challenges in reasoning about complex systems
[Figure: fields facing these challenges: Operations Research, Management Science, Economics, Statistics, Computer Science, Engineering, Physics]
Three major challenges:
1. High-dimensional spaces: need to consider many variables
2. Uncertainty: limited information, need to use stochastic models
3. Preferences and utilities: need to consider optimization criteria
Decision Making and Inference under
Limited Information and Large Dimensionality
[Figure: the three challenges and the research areas they connect: high-dimensional spaces (combinatorial reasoning, combinatorial optimization); lack of information/uncertainty (probabilistic reasoning, statistics, stochastic optimization); preferences and utilities (optimization, decision-making)]
Decision Making and Inference for Sustainability
[Figure: example applications organized by challenge]
• Lack of information/uncertainty: fuel cell materials [SAT-12, AAAI-15, IJCAI-15], EV batteries [ECML-12], poverty traps [AAAI-15], poverty mapping [ongoing]
• High-dimensional spaces: natural resources management [UAI-10, IJCAI-11]
• Preferences and utilities: habitat monitoring [ongoing]
Research Agenda
Foundations:
– Inference by hashing and optimization [ICML-13, UAI-13, ICML-14, ICML-15, UAI-15, AAAI-16]
– Bridging statistical physics and computer science [CP-10, IJCAI-11, NIPS-11, NIPS-12]
– Sampling [NIPS-13, UAI-12]
Applications:
– High-throughput data mining and catalyst discovery [SAT-12, AAAI-15, IJCAI-15]
– Poverty mapping [AAAI-16]
– Natural resources management [UAI-10, IJCAI-11]
Outline
1. Introduction
2. WISH: probabilistic inference by hashing and optimization
I. Formalism
II. Empirical validation
3. Families of hash functions
4. Random projections and hashing
5. Conclusions
Statistical Modeling and Probabilistic Inference
Statistical Modeling:
• Stochastic variables
• Models are probability distributions
• Goal is to make inferences/predictions. Examples:
– What is the probability that an object is in a certain
location, given some noisy sensor measurements?
– How well will this policy perform on average?
• Inference is also needed to learn models from data
Probabilistic inference is a cornerstone of statistical
machine learning and statistical applications
Progress So Far
Fundamental problem: studied since the early days of computing (Manhattan Project – diffusion behavior of nuclear particles).
Much progress has been made (contributions from CS, physics, statistics, etc.). To date, two main approaches:
– Markov Chain Monte Carlo (MCMC) sampling (ranked one of the top 10 algorithms of all time)
– Variational methods
This work: a fundamentally different approach based on hashing and optimization [Chakraborty, Fremont, Meel, Seshia, and Vardi; Achlioptas et al.; Belle et al.; Gomes et al., …]
A Worst-Case Perspective
[Figure: complexity hierarchy, from easy to hard: P ⊆ NP ⊆ PH ⊆ P^#P ⊆ PSPACE ⊆ EXP]
• NP-complete/hard: combinatorial reasoning (SAT, puzzles, Integer Programming (MIP)); high-dimensional spaces
• #P-complete/hard: probabilistic reasoning (#SAT, sampling); high-dimensional spaces plus uncertainty
Note: this is the widely believed hierarchy; we only know P ≠ EXP for sure.
Inference by Hashing and Optimization
Main technical contribution: if we allow
– a small failure probability (e.g., 0.1%)
– a small error (e.g., 1% relative error)
then we can reduce probabilistic inference (#P) to a small number of NP-equivalent queries (combinatorial search/optimization).
Search: is there a way to set the variables such that some property holds?
Integration: how many ways are there to set the variables such that a certain property holds?
[Figure: reduction from #P (probabilistic reasoning over high-dimensional spaces under uncertainty) to NP (combinatorial search)]
A Practical Perspective
• Progress in combinatorial search since the 1990s:
  – Boolean Satisfiability (SAT) / SMT
  – Mixed Integer Programming (MIP)
• Problem size: from 100 variables and 200 constraints (early 90s) to 1,000,000 variables and 5,000,000 constraints in 20 years
• The time is ripe for probabilistic inference applications: techniques traditionally used for deductive reasoning (verification, synthesis) are brought to the realm of probability and inductive reasoning.
[Figure: high-dimensional combinatorial spaces; preferences and utilities]
Probabilistic Inference
For any configuration (or state), defined by an assignment of values to the random variables, we can compute the weight/probability of that configuration.
[Figure: Bayes net with nodes RAIN, SPRINKLER, GRASS WET]
Factors (conditional probability tables):
  RAIN:       P(Rain=T) = 0.2,  P(Rain=F) = 0.8
  SPRINKLER:  P(Sprinkler=T | Rain=T) = 0.01,  P(Sprinkler=F | Rain=T) = 0.99
              P(Sprinkler=T | Rain=F) = 0.4,   P(Sprinkler=F | Rain=F) = 0.6
  GRASS WET:  P(Wet=T | Sprinkler=T, Rain=T) = 0.99,  P(Wet=F | Sprinkler=T, Rain=T) = 0.01 (other rows elided)
Example: Pr[Rain=T, Sprinkler=T, Wet=T] ∝ 0.01 * 0.2 * 0.99
Probabilistic Inference
What is the (conditional) probability of an event? For example,
Pr[Wet=T] = Σ_{x∈{T,F}} Σ_{y∈{T,F}} Pr[Rain=x, Sprinkler=y, Wet=T]
involves expectations/nested sums (sum up contributions from all
possible variable assignments)
Probabilistic Inference in High Dimensions
• We are given:
  – a set of 2^n configurations (= assignments)
  – non-negative weights w (from a Bayes net, factor graph, weighted CNF, ...)
• Goal: compute the total weight
• Example 1: n = 2 binary variables, sum over 4 configurations:
  P[Wet=T] = Σ_{x∈{T,F}} Σ_{y∈{T,F}} P[Rain=x, Sprinkler=y, Wet=T] = 0 + 0.15 + 0.29 + 0.002 = 0.442
  [Figure: bar chart of the weights of the four configurations (Rain, Sprinkler) ∈ {T,F}²]
• Example 2: n = 100 variables, sum over 2^100 ≈ 10^30 terms (curse of dimensionality; generally #P-complete)
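As a concrete illustration of the nested sum above, here is a minimal brute-force sketch in Python. The RAIN and SPRINKLER tables are taken from the slide; the GRASS WET entries that the slide elides are filled in with the standard textbook values (0.9, 0.8, 0.0), which reproduce the per-configuration weights shown (0, 0.15, 0.29, 0.002 after rounding).

```python
from itertools import product

# Conditional probability tables (RAIN and SPRINKLER from the slide;
# the GRASS WET entries marked "..." on the slide are assumed textbook values).
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},   # P(Sprinkler | Rain=True)
               False: {True: 0.4, False: 0.6}}    # P(Sprinkler | Rain=False)
P_wet = {(True, True): 0.99,   # P(Wet=T | Sprinkler=T, Rain=T), from the slide
         (True, False): 0.9,   # assumed
         (False, True): 0.8,   # assumed
         (False, False): 0.0}  # assumed

def weight(rain, sprinkler, wet=True):
    """Unnormalized weight of one configuration: the product of the factors."""
    p = P_rain[rain] * P_sprinkler[rain][sprinkler]
    return p * (P_wet[(sprinkler, rain)] if wet else 1 - P_wet[(sprinkler, rain)])

# Marginal P(Wet=T): enumerate every assignment of the remaining variables.
total = sum(weight(r, s) for r, s in product([True, False], repeat=2))
print(round(total, 3))  # ~0.448 (the slide rounds each term first: 0 + 0.15 + 0.29 + 0.002 = 0.442)
```

For n = 100 variables this enumeration would need roughly 2^100 terms, which is exactly the blow-up the rest of the talk is about avoiding.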
A New Approach: WISH
Running example: a probabilistic model with 10 binary variables x1, …, x10, i.e., 1024 configurations, sorted by increasing weight.
[Figure: configurations on the x-axis sorted by increasing weight, weight on the y-axis; all 1024 weights are unknown ('?')]
Use optimization to find the most likely configuration (NP-equivalent): in this example its weight is 50.
[Figure: the most likely configuration now has weight 50; all other weights are still unknown]
Use optimization to find the least likely configuration: here its weight is 0.
[Figure: least likely configuration has weight 0, most likely has weight 50; the rest are unknown]
Every remaining configuration therefore has weight between 0 and 50.
[Figure: each unknown bar labeled 0 ≤ ? ≤ 50]
The weight distribution could look like this: Total weight = 1023 * 0 + 50 = 50.
[Figure: all unknown weights equal to 0]
OR it could look like this: Total weight = 1023 * 50 + 0 = 51150.
[Figure: all unknown weights equal to 50]
So far we only know: 50 <= Total weight <= 1023 * 50.
Find the weight of the 2nd most likely configuration (how is explained later): here it is 40, so every remaining configuration has weight between 0 and 40.
90 <= Total weight <= 50 + 40*1022
Find the weight of the 4th most likely configuration (how is explained later): here it is 9. The configurations below it have weight between 0 and 9, and the 3rd most likely is between 9 and 40.
90 + 9*2 <= Total weight <= 50 + 40*2 + 9*1020
Find the weight of the 8th most likely configuration (how is explained later): here it is 5. The configurations below it have weight between 0 and 5, and those between the 8th and the 4th are between 5 and 9.
90 + 9*2 + 5*4 <= Total weight <= 50 + 40*2 + 9*4 + 5*1016
A New Approach: WISH
If we have
• the weight of the most likely configuration,
• the weight of the 2nd most likely configuration,
• the weight of the 4th most likely configuration,
• …
• the weight of the 512th most likely configuration
(i.e., the weight of the 2^i-th most likely configuration for each i), then we can show:
this yields a factor-of-2 approximation of the total weight.
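To make the factor-of-2 claim concrete, here is a small sketch (Python) of how these quantile weights combine into lower and upper bounds on the total weight. The values 50, 40, 9, 5 come from the running example; the remaining entries of b are hypothetical placeholders just to complete the list.

```python
# b[i] = weight of the 2^i-th most likely configuration, i = 0 .. n (n = 10 here).
# 50, 40, 9, 5 come from the running example; the rest are hypothetical placeholders.
b = [50, 40, 9, 5, 3, 2, 1, 1, 1, 0.5, 0]
n = len(b) - 1  # 10 binary variables -> 2^10 = 1024 configurations

# Configurations ranked 2^i + 1 .. 2^(i+1) (there are 2^i of them) each have weight
# between b[i+1] and b[i], which sandwiches the total weight Z:
lower = b[0] + sum(b[i + 1] * 2**i for i in range(n))
upper = b[0] + sum(b[i] * 2**i for i in range(n))
estimate = lower  # matches the final-estimate formula later in the talk

print(lower, upper, upper / lower)  # the ratio is at most 2, so the estimate is a 2-approximation
```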
How can we compute the weight of the 2^i-th most likely configuration when i is large (say i = 50)?
Random parity constraints
• XOR/parity constraints:
  – Example: a ⊕ b ⊕ c ⊕ d = 1, satisfied if an odd number of a, b, c, d are set to 1
• For every configuration, set its weight to zero if the parity constraint is not satisfied
• Randomly generated parity constraint X (each variable added with prob. 0.5): x1 ⊕ x3 ⊕ x4 ⊕ x7 ⊕ x10 = 1
[Figure: weight function over the configurations before and after applying the constraint]
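A minimal sketch of this projection step in Python. The constraint is sampled exactly as on the slide (each variable included independently with probability 0.5, parity required to be 1); the 10-variable weight function is a made-up placeholder.

```python
import random
from itertools import product

n = 10  # number of binary variables

def random_parity_constraint(n, p=0.5):
    """Pick a random subset of variables; the XOR of the chosen bits must equal 1."""
    return [i for i in range(n) if random.random() < p]

def satisfies(config, constraint):
    """config is a tuple of n bits; the parity of the selected bits must be odd."""
    return sum(config[i] for i in constraint) % 2 == 1

def project(weights, constraint):
    """Zero out the weight of every configuration violating the constraint."""
    return {c: (w if satisfies(c, constraint) else 0.0) for c, w in weights.items()}

# Placeholder weight function over all 2^10 configurations (purely illustrative).
weights = {c: sum(c) + 1.0 for c in product([0, 1], repeat=n)}

X = random_parity_constraint(n)
projected = project(weights, X)
# Roughly half of the configurations keep their weight; the rest are zeroed out.
print(len(X), sum(w > 0 for w in projected.values()))
```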
Strongly Universal Hashing
• XOR/parity constraints to the rescue
  – a ⊕ b ⊕ c ⊕ d = 1: satisfied if an odd number of a, b, c, d are set to 1
• Randomly generated parity constraint X (each variable added with prob. 0.5): x1 ⊕ x3 ⊕ x4 ⊕ x7 ⊕ x10 = 1
Two crucial properties (strongly universal hashing [Valiant-Vazirani, Stockmeyer]):
  1) Uniformity: for every configuration A, Pr[A satisfies X] = 0.5
  2) Pairwise independence: for every two configurations A and B, "A satisfies X" and "B satisfies X" are independent events
[Figure: weight function over the configurations]
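A quick empirical check of the two properties (Python). Note one assumption on my part: the check below uses the standard pairwise-independent construction h(x) = a·x ⊕ b over GF(2) with a random right-hand-side bit b as well, which is what makes uniformity hold for every configuration; the slide's example fixes the right-hand side to 1.

```python
import random
from itertools import product

n = 6
configs = list(product([0, 1], repeat=n))

def random_hash(n):
    """h(x) = (a . x) XOR b over GF(2): each variable in with prob. 0.5, plus a random parity bit."""
    a = [random.randint(0, 1) for _ in range(n)]
    b = random.randint(0, 1)
    return lambda x: (sum(ai * xi for ai, xi in zip(a, x)) + b) % 2

trials = 20000
A, B = configs[3], configs[40]  # two arbitrary, distinct configurations
cnt_A = cnt_B = cnt_AB = 0
for _ in range(trials):
    h = random_hash(n)
    sa, sb = h(A) == 1, h(B) == 1   # "A satisfies X", "B satisfies X"
    cnt_A += sa; cnt_B += sb; cnt_AB += sa and sb

# Uniformity: each frequency is ~0.5.  Pairwise independence: the joint is ~0.25 = 0.5 * 0.5.
print(cnt_A / trials, cnt_B / trials, cnt_AB / trials)
```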
Universal Hashing and Optimization
• Example: suppose we want to compute the weight of the 64th most likely configuration
• After adding 6 random parity constraints we are left with 1024/64 = 1024/2^6 = 16 configurations with nonzero weight, on average
• The weight of the mode (most likely configuration) of the randomly projected distribution is "often" "close enough" to the weight of the 64th most likely configuration of the original distribution
• Use optimization to find the most likely configuration of the projected distribution
[Figure: weight function before and after the random projection]
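A quick empirical check of the "1024/2^6 = 16 survivors on average" claim (Python). As before, this is a hedged sketch using parity constraints with random right-hand sides, not the exact code behind the talk.

```python
import random
from itertools import product

n, m, trials = 10, 6, 200
configs = list(product([0, 1], repeat=n))

def survives(x, constraints):
    """x survives if it satisfies every parity constraint (a, b): a.x = b (mod 2)."""
    return all(sum(ai * xi for ai, xi in zip(a, x)) % 2 == b for a, b in constraints)

counts = []
for _ in range(trials):
    constraints = [([random.randint(0, 1) for _ in range(n)], random.randint(0, 1))
                   for _ in range(m)]
    counts.append(sum(survives(x, constraints) for x in configs))

# Expected number of surviving configurations is 2^n / 2^m = 1024 / 64 = 16.
print(sum(counts) / trials)
```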
Visual working of the algorithm
• How it works:
  – For i = 0, 1, 2, …, n: add i random parity constraints to the weight function over x1, …, xn and use optimization to find the mode of the projected model; repeat log(n) times and take the median M_i.
  – M_0 is the weight of the mode of the original model; M_1 ≈ weight of the 2nd most likely configuration; M_2 ≈ weight of the 4th; M_3 ≈ weight of the 8th; and so on.
  – Final estimate for the total weight: M_0 + M_1×1 + M_2×2 + M_3×4 + …
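Putting the pieces together, here is a brute-force sketch of the procedure just described (Python). The "optimization" oracle is replaced by explicit enumeration, so it only runs for tiny n; the weight function is a made-up placeholder, and a fixed T repetitions per level stands in for the log(n) repetitions of the actual algorithm.

```python
import random
from itertools import product
from statistics import median

def wish_estimate(weight, n, T=7):
    """Estimate Z = sum_x weight(x) via hashing + optimization (brute-force oracle)."""
    configs = list(product([0, 1], repeat=n))

    def projected_mode(i):
        # Add i random parity constraints a.x = b (mod 2), then return the max weight
        # among the surviving configurations (0 if none survive).
        cons = [([random.randint(0, 1) for _ in range(n)], random.randint(0, 1))
                for _ in range(i)]
        ok = lambda x: all(sum(ai * xi for ai, xi in zip(a, x)) % 2 == b for a, b in cons)
        return max((weight(x) for x in configs if ok(x)), default=0.0)

    # M[i] = median over T repetitions of the mode after i constraints.
    M = [median(projected_mode(i) for _ in range(T)) for i in range(n + 1)]
    return M[0] + sum(M[i + 1] * 2**i for i in range(n))

# Placeholder weight function on n = 8 binary variables.
n = 8
w = lambda x: 2.0 ** sum(x)                              # hypothetical factored weight
exact = sum(w(x) for x in product([0, 1], repeat=n))     # = 3^8 = 6561
print(exact, wish_estimate(w, n))                        # estimate is within a constant factor
```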
Accuracy Guarantees
Main Theorem (stated informally):
With probability at least 1- δ (e.g., 99.9%),
WISH (Weighted-Sums-Hashing) computes a sum defined
over 2n configurations (probabilistic inference, #P-hard)
with a relative error that can be made as small as desired,
and it requires solving Θ(n log n) optimization instances
(NP-equivalent problems).
Key Features
• Strong accuracy guarantees
• Can leverage recent advances in optimization
• Decoding techniques for error correcting codes (LDPC)
• Even without optimality guarantees, bounds/approximations
on the optimization translate to bounds/approximations on
the weighted sum
• Straightforward parallelization (independent optimizations)
– We experimented with >600 cores
Outline
1. Introduction
2. WISH: probabilistic inference by hashing and optimization
I. Formalism
II. Empirical validation
3. Families of hash functions
4. Random projections and hashing
5. Conclusions
Counting via NP-Queries: Sudoku
Filling a Sudoku puzzle is a combinatorial search problem.
We are reasonably good at it.
Computers are extremely good at it.
Counting via NP-Queries: Sudoku
• How many ways are there to fill a valid Sudoku square? There are ≈10^77 possible ways to fill the 81 cells x1, …, x81; the weight of a filled square is 1 if the square is valid and 0 otherwise.
• WISH estimate: 1.634×10^21
• Specialized proof: 6.671×10^21
• Traditional probabilistic inference methods fail: they cannot reason about such complex constraints.
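The only thing the reduction needs here is the 0/1 weight oracle over the ≈10^77 ways of filling the 81 cells. A minimal sketch of that oracle in Python; the combinatorial optimization itself would be handed to a SAT/MIP solver rather than written by hand.

```python
def sudoku_weight(grid):
    """grid: 9x9 list of digits 1..9 (a full assignment to x1..x81).
    Returns 1 if it is a valid Sudoku square, 0 otherwise."""
    digits = set(range(1, 10))
    rows = all(set(row) == digits for row in grid)
    cols = all(set(grid[r][c] for r in range(9)) == digits for c in range(9))
    boxes = all(set(grid[r + i][c + j] for i in range(3) for j in range(3)) == digits
                for r in range(0, 9, 3) for c in range(0, 9, 3))
    return 1 if rows and cols and boxes else 0

# Counting valid squares is then the #P-style sum over all 9^81 grids of sudoku_weight(grid).
```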
Inference benchmark: Ising models
Ising models arise in statistical physics and in computer vision.
[Plots: log normalization constant vs. number of variables]
Inference benchmark: Ising models from physics
[Plot: log normalization constant vs. number of variables for WISH, Belief Propagation, Mean Field, a BP variant, and the ground truth]
• Guaranteed relative error: the ground truth is provably within WISH's error band (too small to see at this scale).
• Variational methods provide inaccurate estimates (well outside the error band).
Inference benchmark: Ising models (continued)
[Plots: estimates from Gibbs Sampling, Belief Propagation, and WISH, added one at a time]
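For context, the quantity plotted in these benchmarks is the log normalization constant (log partition function) of an Ising model. A brute-force sketch for a tiny grid in Python, with made-up interaction strengths; the instances in the talk are far too large for this, which is why the estimates above matter.

```python
import math
from itertools import product

# Tiny L x L grid Ising model with spins in {-1, +1}.
L = 3
edges = [((r, c), (r, c + 1)) for r in range(L) for c in range(L - 1)] + \
        [((r, c), (r + 1, c)) for r in range(L - 1) for c in range(L)]
coupling = {e: 0.5 for e in edges}   # hypothetical uniform interaction strength
field = 0.1                          # hypothetical local field

def log_weight(spins):
    """spins: dict mapping (row, col) -> +/-1; returns the exponent of the weight."""
    e = sum(coupling[(u, v)] * spins[u] * spins[v] for (u, v) in edges)
    e += field * sum(spins.values())
    return e

sites = [(r, c) for r in range(L) for c in range(L)]
Z = sum(math.exp(log_weight(dict(zip(sites, s))))
        for s in product([-1, 1], repeat=len(sites)))
print(math.log(Z))   # the "log normalization constant" on the plots' y-axis
```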
Outline
1. Introduction
2. WISH: probabilistic inference by hashing and optimization
I. Formalism
II. Empirical validation
3. Families of hash functions
4. Random projections and hashing
5. Conclusions
Hashing as coin flipping
• Hashing as a distributed coin-flipping mechanism
• Ideally, each configuration flips a coin independently
[Figure: each configuration 0000, 0001, …, 1101, 1110, 1111 flips its own coin (heads or tails)]
• Issue: we cannot simulate so many independent coin flips (one for each possible variable assignment)

Pairwise Independent Hashing
• Solution: each configuration flips a coin pairwise independently
  – any two coin flips are independent
  – three or more might not be independent
• Still works! Pairwise independence guarantees that configurations do not cluster too much in a single bucket.
• Can be simulated using parity constraints: add each variable with probability ½.
Average Universal Hashing
• "Long" parity constraints are difficult to deal with!
• New class of average universal hash functions (coin-flip generation mechanism) [ICML-14]
• Pairs of coin flips are NOT guaranteed to be independent anymore
• Key idea: look at large enough sets. If we look at all pairs, on average they are "independent enough". Main result:
  1. These coin flips are good enough for probabilistic inference (applies to all previous techniques based on hashing; same theoretical guarantees)
  2. Can be implemented with short parity constraints
Short Parity Constraints for Probabilistic Inference
and Model Counting
Main Theorem (stated informally): [AAAI-16]
For large enough n (= number of variables),
– Parity constraints of length log(n) are sufficient.
– Parity constraints of length log(n) are also necessary.
Proof borrows ideas from the theory of low-density parity
check codes.
Short constraints are much easier to deal with in practice.
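A hedged illustration of what "length log(n)" means operationally: instead of including each variable with probability 1/2 (expected length n/2), include it with probability about c·log(n)/n, so each constraint touches only O(log n) variables. This is just a sampling sketch, not the construction analyzed in [AAAI-16].

```python
import math
import random

def short_parity_constraint(n, c=2.0):
    """Each variable joins independently with prob. c*log(n)/n, plus a random parity bit.
    Expected constraint length is c*log(n) instead of n/2."""
    p = min(1.0, c * math.log(n) / n)
    vars_in = [i for i in range(n) if random.random() < p]
    return vars_in, random.randint(0, 1)

n = 1_000_000
vars_in, b = short_parity_constraint(n)
print(len(vars_in))   # around 2*log(1e6) ~ 28 variables, vs. ~500,000 for dense constraints
```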
Model Selection
• Model Selection: we have several competing statistical models. Which
one fits the data better?
– Pick the one that is most likely to have generated the data
– In the majority of models of interest, probabilities are defined up to a
normalization constant
– Ranking requires the computation of the normalization constant
• Model selection for RBM (Restricted Boltzmann Machine) models:
[Figure: handwritten digits modeled by RBMs]
Using short constraints we get state-of-the-art estimates of the normalization constant on Restricted Boltzmann Machines (vs. Annealed Importance Sampling) [UAI-15]
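The normalization constant being ranked here is the RBM partition function Z = Σ_{v,h} exp(vᵀWh + bᵀv + cᵀh). A brute-force sketch for a tiny RBM in Python, with random made-up parameters; real digit models are far too large for this, which is why hashing-based estimates matter.

```python
import math
import random
from itertools import product

# Tiny RBM: nv visible units, nh hidden units, random made-up parameters.
nv, nh = 6, 4
random.seed(0)
W = [[random.gauss(0, 0.1) for _ in range(nh)] for _ in range(nv)]
b = [random.gauss(0, 0.1) for _ in range(nv)]   # visible biases
c = [random.gauss(0, 0.1) for _ in range(nh)]   # hidden biases

def unnorm_log_prob(v, h):
    """Log of the unnormalized probability: v'Wh + b'v + c'h."""
    return (sum(v[i] * W[i][j] * h[j] for i in range(nv) for j in range(nh))
            + sum(b[i] * v[i] for i in range(nv))
            + sum(c[j] * h[j] for j in range(nh)))

logZ = math.log(sum(math.exp(unnorm_log_prob(v, h))
                    for v in product([0, 1], repeat=nv)
                    for h in product([0, 1], repeat=nh)))
print(logZ)   # the log normalization constant needed to rank competing models
```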
Outline
1. Introduction
2. WISH: probabilistic inference by hashing and optimization
I. Formalism
II. Empirical validation
3. Families of hash functions
4. Random projections and hashing
5. Conclusions
Hashing as a random projection
• XOR/parity constraints:
  – Example: a ⊕ b ⊕ c ⊕ d = 1, satisfied if an odd number of a, b, c, d are set to 1
• Randomly generated parity constraint X (each variable added with prob. 0.5): x1 ⊕ x3 ⊕ x4 ⊕ x7 ⊕ x10 = 1
• Set a configuration's weight to zero if the constraint is not satisfied
[Figure: weight function over the configurations before and after the random projection]
This random projection:
1. "simplifies" the model
2. preserves its "key properties"
Variational Methods and Random Projections
[Figure: a complex, intractable model p (a probability distribution); its I-projection onto a family Q of "simple" distributions, i.e., the tractable distribution that is as close as possible to p; and a random projection of p to p', followed by an I-projection of p']
• Key issue with all variational methods: the approximation can be arbitrarily bad (the distance from p to Q cannot be bounded).
• Idea: take a random projection first, then an I-projection [AISTATS-16]. Taking a random projection of p:
  1. preserves the properties of p
  2. makes p' "simpler" and therefore guaranteed to be "close" to Q
• Main result: provably good approximation with high probability
Mean Field with Random Projections
[Figure: deep generative model of handwritten digits; results for the standard variational approach (I-projection) vs. the approach augmented with random projections]
Adding random projections leads to an optimization problem:
• with fewer variables
• with additional non-convex constraints
• still convex coordinate-wise
Summary
Probabilistic inference by hashing and
optimization:
– Probabilistic inference reduced to a small number of
combinatorial optimization problems by universal hashing
– Hashing as a random projection: simplifies a model while
preserving some of its key properties
– Can leverage existing combinatorial solvers and variational
methods
– Provides formal accuracy guarantees
– Works well in practice on a range of domains
Conclusions
[Figure: inductive reasoning (machine learning, probabilistic inference, data) meets deductive reasoning (logic, constraint optimization, background theories such as physical laws)]
• Foundational AI/machine learning work:
  – New probabilistic inference by hashing and optimization
  – Learning
  – Decision making under uncertainty
  – …
• Applications:
  – Incorporating background knowledge into probabilistic models to aid scientific discovery
  – …
Three challenges:
1. High-dimensional spaces
2. Uncertainty
3. Preferences and utilities