Tutorial on Optimization with Submodular Functions
Andreas Krause
2
Acknowledgements: Partly based on material presented at IJCAI 2009 together with Carlos Guestrin (CMU)
MATLAB Toolbox available at http://las.ethz.ch/
Algorithms implemented
Discrete Optimization
Many discrete optimization problems can be formulated as follows:
Given a finite set V, we wish to select a subset A ⊆ V (subject to some constraints) maximizing a utility function F(A).
These problems are the focus of this tutorial.
3
Contamination of drinking water could affect millions of people
4
Example: Sensor Placement [with Leskovec, Guestrin, VanBriesen, Faloutsos, J Wat Res Mgt ‘08]
Place sensors to detect contaminations: “Battle of the Water Sensor Networks” competition
Where should we place sensors to quickly detect contamination?
[Figure: water distribution network with sensors and a contamination event; water-flow simulator from EPA; Hach sensor, ~$14K]
5
Can only make a limited number of measurements!
Robotic monitoring of rivers and lakes [with Singh, Guestrin, Kaiser, Journal of AI Research ’08]
Need to monitor large spatial phenomena: temperature, nutrient distribution, fluorescence, …
Predict at unobserved locations
Use robotic sensors to cover large areas
[Figure: NIMS robotic sensor, Kaiser et al. (UCLA); depth vs. location across lake, with color indicating actual and predicted temperature]
Where should we sense to get most accurate predictions?
6
Example: Model-based sensing
Model predicts impact of contaminations
For water networks: water flow simulator from EPA
For lake monitoring: statistical, data-driven models
For each subset A of V, compute sensing quality F(A)
[Figure: set V of all network junctions; the model predicts high-, medium-, and low-impact locations; a sensor reduces impact through early detection. A placement {S1,…,S4} near the contamination has high sensing quality F(A) = 0.9; a poor placement has low sensing quality F(A) = 0.01]
7
Sensor placement
Given: finite set V of locations, sensing quality F
Want: A* ⊆ V with |A*| ≤ k such that F(A*) = max_{|A| ≤ k} F(A)
NP-hard!
How well can this simple heuristic do?
[Figure: greedy placement of sensors S1,…,S6]
Greedy algorithm:
Start with A = {}
For i = 1 to k:
  s* := argmax_s F(A ∪ {s})
  A := A ∪ {s*}
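A minimal Python sketch of this greedy loop, using an illustrative coverage function as the black-box F (the actual tutorial toolbox is in MATLAB; this is only a sketch):

```python
def greedy(F, V, k):
    """Greedy maximization of a monotone set function F over ground set V.
    F is a black-box oracle mapping a set of elements to a float."""
    A = set()
    for _ in range(k):
        best_s, best_val = None, float("-inf")
        for s in V:
            if s in A:
                continue
            val = F(A | {s})
            if val > best_val:
                best_s, best_val = s, val
        A.add(best_s)
    return A

# Illustrative coverage objective (monotone submodular).
regions = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}
F = lambda A: len(set().union(*(regions[s] for s in A))) if A else 0
print(greedy(F, list(regions), 2))  # {1, 2}: covers a, b, c
```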
8
Performance of greedy algorithm
[Plot: population protected (higher is better) vs. number of sensors placed, on a small subset of the water networks data; the greedy curve nearly matches the optimal curve]
Greedy score empirically close to optimal. Why?
9
Key property 1: Monotonicity
[Figure: placement A = {S1, S2} vs. placement B = {S1, S2, S3, S4}]
F is monotonic: A ⊆ B ⇒ F(A) ≤ F(B)
Adding sensors can only help
10
Key property 2: Diminishing returns
[Figure: placement A = {S1, S2} vs. placement B = {S1, S2, S3, S4}; adding a new sensor S’ to A helps a lot (large improvement), while adding S’ to B doesn’t help much (small improvement)]
Submodularity: for all A ⊆ B and S’ ∉ B,
F(A ∪ {S’}) − F(A) ≥ F(B ∪ {S’}) − F(B)
11
One reason submodularity is useful
Theorem [Nemhauser et al. ’78]: Suppose F is monotonic and submodular. Then the greedy algorithm gives a constant-factor approximation:
F(A_greedy) ≥ (1 − 1/e) max_{|A| ≤ k} F(A)   (≈ 63% of optimal)
Greedy algorithm gives a near-optimal solution! In general, this guarantee is the best possible unless P = NP!
In this tutorial you will:
See examples and properties of submodular functions
Learn how submodularity helps to find good solutions
Find out about current research directions
12
13
Set functions
Finite set V = {1, 2, …, n}
Function F: 2^V → R
Will always assume F({}) = 0 (w.l.o.g.)
Assume a black box that can evaluate F for any input A
Approximate (noisy) evaluation of F is OK
[Figure (repeated from the sensing example): set V of all network junctions; one placement with high sensing quality F(A) = 0.9, another with low sensing quality F(A) = 0.01]
14
Submodular set functions
A set function F on V is called submodular if, for all A, B ⊆ V:
F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
Equivalent diminishing-returns characterization: for all A ⊆ B ⊆ V and s ∉ B,
F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)
[Figure: adding element s to the small set A yields a large improvement; adding s to the larger set B yields a small improvement]
15
Submodularity and supermodularity
F is called supermodular if −F is submodular
F is called modular if F is both sub- and supermodular
For modular (“additive”) F: F(A) = Σ_{s ∈ A} w(s) for some weights w
16
Example: Set cover
Node predicts values of positions within some radius
[Figure: building floorplan (server, lab, kitchen, offices, …) with candidate sensor locations and coverage discs]
Want to cover the floorplan with discs
Place sensors in the building; possible locations V
For A ⊆ V: F(A) = “area covered by sensors placed at A”
Formally: W a finite set, together with a collection of n subsets S_i ⊆ W
For A ⊆ V define F(A) = |∪_{i ∈ A} S_i|
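A tiny sketch of this coverage objective, with a numeric check of the diminishing-returns inequality (the sets S_i below are made up):

```python
# Coverage function F(A) = |union of S_i for i in A|; the sets are illustrative.
S = {1: {"w1", "w2"}, 2: {"w2", "w3"}, 3: {"w3", "w4"}}

def F(A):
    return len(set().union(*(S[i] for i in A))) if A else 0

A, B, s = {1}, {1, 3}, 2            # A subset of B, s not in B
gain_A = F(A | {s}) - F(A)          # adding S_2 still covers a new point on top of A
gain_B = F(B | {s}) - F(B)          # ... but covers nothing new on top of B
print(gain_A, gain_B)               # 1 0 -- diminishing returns
```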
17
Set cover is submodular
[Figure: the same floorplan shown twice, once with A = {S1, S2} and once with B = {S1, S2, S3, S4}; the new sensor S’ covers more uncovered area under A than under B]
A = {S1, S2}, B = {S1, S2, S3, S4}:
F(A ∪ {S’}) − F(A) ≥ F(B ∪ {S’}) − F(B)
18
A better model for sensing
Joint probability distribution P(X1,…,Xn,Y1,…,Yn) = P(Y1,…,Yn) P(X1,…,Xn | Y1,…,Yn)
Y_s: temperature at location s
X_s: sensor value at location s
X_s = Y_s + noise
(prior × likelihood)
[Figure: graphical model with hidden temperature variables Y1,…,Y6 and sensor observations X1,…,X6]
Information gain: utility of having sensors at subset A of all locations
19
F(A) = H(Y) − H(Y | X_A): uncertainty about temperature Y before sensing, minus uncertainty after sensing
[Figure: placement A = {1, 2, 3} has high value F(A); placement A = {1, 4, 5} has low value F(A)]
20
Example: Submodularity of information gain
Y1,…,Ym, X1,…,Xn discrete random variables
F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)
F(A) is always monotonic
However, NOT always submodular
Theorem [Krause & Guestrin, UAI ’05]: If the X_i are all conditionally independent given Y, then F(A) is submodular!
[Figure: naive-Bayes-like model with Y1, Y2, Y3 and observations X1,…,X4]
21
Proof: Submodularity of information gain
Y1,…,Ym, X1,…,Xn discrete random variables; F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)
Variables X independent given Y, so
F(A ∪ {s}) − F(A) = H(X_s | X_A) − H(X_s | Y)
The second term H(X_s | Y) is constant (independent of A); the first term is nonincreasing in A: A ⊆ B ⇒ H(X_s | X_A) ≥ H(X_s | X_B) (“information never hurts”)
Hence Δ(s | A) = F(A ∪ {s}) − F(A) is monotonically nonincreasing in A ⇒ F is submodular
22
Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD ’03]
Who should get free cell phones?
V = {Alice, Bob, Charlie, Dorothy, Eric, Fiona}
F(A) = expected number of people influenced when targeting A
[Figure: social network over the six people, edges labeled with probabilities of influencing (0.2–0.5)]
23
Influence in social networks is submodular [Kempe, Kleinberg, Tardos KDD ’03]
[Figure: the same social network, with coin-flip outcomes determining which edges are “live”]
Key idea: flip coins c in advance → “live” edges
F_c(A) = people influenced under outcome c (a set cover function!)
F(A) = Σ_c P(c) F_c(A) is submodular as well!
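A sketch of the live-edge construction as a Monte Carlo estimate of F(A) (graph, probabilities, and sample count are illustrative):

```python
import random

# Illustrative directed influence probabilities.
edges = {("Alice", "Bob"): 0.5, ("Bob", "Charlie"): 0.3, ("Charlie", "Eric"): 0.2,
         ("Alice", "Dorothy"): 0.4, ("Dorothy", "Fiona"): 0.5}

def live_edge_sample():
    """Flip a coin for every edge; keep the 'live' ones."""
    return {e for e, p in edges.items() if random.random() < p}

def reached(A, live):
    """People influenced = nodes reachable from A along live edges (set cover per outcome)."""
    frontier, seen = set(A), set(A)
    while frontier:
        frontier = {v for (u, v) in live if u in frontier and v not in seen}
        seen |= frontier
    return seen

def F(A, samples=1000):
    """Monte Carlo estimate of F(A) = E_c[F_c(A)]; each F_c is coverage, so F is submodular."""
    return sum(len(reached(A, live_edge_sample())) for _ in range(samples)) / samples

print(F({"Alice"}), F({"Alice", "Dorothy"}))
```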
24
Closedness properties
F1,…,Fm submodular functions on V and λ1,…,λm > 0
Then F(A) = Σ_i λ_i F_i(A) is submodular!
Submodularity is closed under nonnegative linear combinations!
Extremely useful fact:
F_θ(A) submodular ⇒ Σ_θ P(θ) F_θ(A) submodular!
Multicriterion optimization: F1,…,Fm submodular, λ_i > 0 ⇒ Σ_i λ_i F_i(A) submodular
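For small ground sets, submodularity can be checked by brute force; the sketch below verifies the diminishing-returns definition exhaustively and illustrates closure under nonnegative combinations (the particular functions are illustrative):

```python
from itertools import combinations

def is_submodular(F, V):
    """Exhaustively check F(A ∪ {s}) - F(A) >= F(B ∪ {s}) - F(B) for all A ⊆ B ⊆ V, s ∉ B.
    Only feasible for small ground sets (2^|V| subsets)."""
    V = list(V)
    subsets = [set(c) for r in range(len(V) + 1) for c in combinations(V, r)]
    for B in subsets:
        for A in subsets:
            if not A <= B:
                continue
            for s in V:
                if s in B:
                    continue
                if F(A | {s}) - F(A) < F(B | {s}) - F(B) - 1e-12:
                    return False
    return True

V = {1, 2, 3}
F1 = lambda A: min(len(A), 2)                  # concave function of |A| -> submodular
F2 = lambda A: len({x for x in A if x != 3})   # modular (additive) -> submodular
F = lambda A: 2.0 * F1(A) + 0.5 * F2(A)        # nonnegative combination stays submodular
print(is_submodular(F1, V), is_submodular(F2, V), is_submodular(F, V))  # True True True
```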
25
Submodularity and concavity
Suppose g: N → R and F(A) = g(|A|). Then F(A) is submodular if and only if g is concave!
E.g., g could say “buying in bulk is cheaper”
[Plot: concave g(|A|) as a function of |A|]
26
Maximum of submodular functions
Suppose F1(A) and F2(A) are submodular. Is F(A) = max(F1(A), F2(A)) submodular?
[Plot: two curves F1(A), F2(A) over |A| and their pointwise maximum]
max(F1,F2) not submodular in general!
27
Minimum of submodular functions
Well, maybe F(A) = min(F1(A), F2(A)) instead?
A        F1(A)   F2(A)   F(A)
{}       0       0       0
{a}      1       0       0
{b}      0       1       0
{a,b}    1       1       1
F({b}) − F({}) = 0  <  F({a,b}) − F({a}) = 1, violating diminishing returns
min(F1,F2) not submodular in general!
28
Submodularity and convexity
Key result [Lovász ’83]: Every submodular function F induces a convex function g on R^n_+ such that
F(A) = g(w_A) for all A ⊆ V (w_A the indicator vector of A)
min_A F(A) = min_w g(w) s.t. w ∈ [0,1]^n, and the minimum is attained at a corner!
g is efficient to evaluate
Can use the ellipsoid algorithm to minimize submodular functions!
[Figure: F: 2^V → R submodular on V = {a, b}; its Lovász extension g: R^|V| → R, with g(1,1) = F({a,b}) and g(1,0) = F({a})]
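A small sketch of evaluating the Lovász extension g(w) via the standard sorting construction (the table of values for F is illustrative):

```python
def lovasz_extension(F, V, w):
    """Evaluate the Lovász extension: sort elements by decreasing weight and
    take a weighted sum of marginal gains along that chain of sets."""
    order = sorted(V, key=lambda s: -w[s])
    g, S, FS = 0.0, set(), F(set())
    for s in order:
        S_new = S | {s}
        g += w[s] * (F(S_new) - FS)
        S, FS = S_new, F(S_new)
    return g

# Illustrative values: F({}) = 0, F({a}) = -1, F({b}) = 2, F({a,b}) = 0.
table = {frozenset(): 0, frozenset({"a"}): -1, frozenset({"b"}): 2, frozenset({"a", "b"}): 0}
F = lambda A: table[frozenset(A)]
print(lovasz_extension(F, ["a", "b"], {"a": 1.0, "b": 1.0}))  # 0.0  = F({a,b})
print(lovasz_extension(F, ["a", "b"], {"a": 1.0, "b": 0.0}))  # -1.0 = F({a})
```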
Minimization ofsubmodular functions
30
Minimizing submodular functions
Theorem [Iwata & Orlin, SODA 2009]: There is a fully combinatorial, strongly polynomial algorithm for minimizing submodular functions that runs in time O(n^6 log n)
Polynomial time = practical??
Is there a more practical alternative?
31
The submodular polyhedron P_F
P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}, where x(A) = Σ_{s ∈ A} x_s
Example: V = {a, b}
x({a}) ≤ F({a})
x({b}) ≤ F({b})
x({a,b}) ≤ F({a,b})
[Figure: the polyhedron P_F in the (x_{a}, x_{b}) plane]
A        F(A)
{}       0
{a}      -1
{b}      2
{a,b}    0
32
A more practical alternative? [Fujishige ’91, Fujishige et al ‘11]
Minimum norm algorithm:
1. Find x* = argmin ||x||_2 s.t. x ∈ B_F (here x* = [-1, 1])
2. Return A* = {i : x*(i) < 0} (here A* = {a})
Theorem [Fujishige ’91]: A* is an optimal solution!
Note: can solve step 1 using Wolfe’s algorithm; runtime finite but unknown!!
[Figure: base polytope B_F = {x ∈ P_F : x({a,b}) = F({a,b})} for the example table above, with minimum-norm point x* = (-1, 1)]
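Wolfe's algorithm only needs a linear-optimization oracle over B_F, which for submodular F is Edmonds' greedy rule; a sketch on the 2-element example (the min-norm iteration itself is omitted):

```python
def base_polytope_vertex(F, order):
    """Edmonds' greedy: for an ordering s1,...,sn of V, the point with
    x[s_i] = F({s1,...,si}) - F({s1,...,s_{i-1}}) is a vertex of the base polytope B_F.
    Linear optimization over B_F amounts to calling this with elements sorted by
    decreasing objective weight."""
    x, S, prev = {}, set(), float(F(set()))
    for s in order:
        S = S | {s}
        x[s] = F(S) - prev
        prev = F(S)
    return x

table = {frozenset(): 0, frozenset({"a"}): -1, frozenset({"b"}): 2, frozenset({"a", "b"}): 0}
F = lambda A: table[frozenset(A)]
print(base_polytope_vertex(F, ["a", "b"]))  # {'a': -1, 'b': 1}  (the min-norm point here)
print(base_polytope_vertex(F, ["b", "a"]))  # {'b': 2, 'a': -2}
```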
33
Empirical comparison [Fujishige et al ’06]
Minimum norm algorithm is orders of magnitude faster! Our implementation can solve n = 10k in < 6 minutes!
[Plot: running time in seconds (lower is better, log scale) vs. problem size (log scale, 64–1024) on cut functions from the DIMACS Challenge; the minimum norm algorithm dominates]
34
Example: Image denoising
35
Example: Image denoising
[Figure: grid-structured pairwise Markov random field with noisy pixels X1,…,X9 and “true” pixels Y1,…,Y9]
Pairwise Markov random field:
P(x1,…,xn, y1,…,yn) = Π_{i,j} ψ_{i,j}(y_i, y_j) Π_i φ_i(x_i, y_i)
X_i: noisy pixels, Y_i: “true” pixels
Want argmax_y P(y | x) = argmax_y log P(x, y) = argmin_y Σ_{i,j} E_{i,j}(y_i, y_j) + Σ_i E_i(y_i)
where E_{i,j}(y_i, y_j) = −log ψ_{i,j}(y_i, y_j)
When is this MAP inference efficiently solvable (in high-treewidth graphical models)?
36
MAP inference in Markov Random Fields
Energy E(y) = Σ_{i,j} E_{i,j}(y_i, y_j) + Σ_i E_i(y_i)
Suppose the y_i are binary; define F(A) = E(y^A) where y^A_i = 1 iff i ∈ A
Then min_y E(y) = min_A F(A)
Theorem [cf. Hammer ’65; Kolmogorov et al., PAMI ’04]: the MAP inference problem is solvable by graph cuts ⇔ for all i, j: E_{i,j}(0,0) + E_{i,j}(1,1) ≤ E_{i,j}(0,1) + E_{i,j}(1,0), i.e., each E_{i,j} is submodular
“Efficient if we prefer that neighboring pixels have the same color”
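A tiny sketch of checking this regularity (submodularity) condition for each pairwise term before handing the energy to a graph-cut solver (the Potts-style terms are illustrative):

```python
def is_regular(E_ij):
    """Pairwise term E_ij given as a dict {(yi, yj): energy} for binary labels.
    Regular (submodular) iff E(0,0) + E(1,1) <= E(0,1) + E(1,0)."""
    return E_ij[(0, 0)] + E_ij[(1, 1)] <= E_ij[(0, 1)] + E_ij[(1, 0)]

# Potts-like smoothness term: penalty lam if neighboring labels differ.
lam = 1.0
potts = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): lam, (1, 0): lam}
print(is_regular(potts))   # True -> graph cuts give the exact MAP

# An "anti-smoothness" term that rewards disagreement is not regular.
anti = {(0, 0): lam, (1, 1): lam, (0, 1): 0.0, (1, 0): 0.0}
print(is_regular(anti))    # False
```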
37
Constrained minimization
Have seen: if F is submodular on V, we can solve min_A F(A) in polynomial time
What about min_A F(A) subject to constraints on A?
E.g., balanced cut, …
In general, not much is known about constrained minimization. However, the following can be done:
A* = argmin F(A) s.t. 0 < |A| < n
A* = argmin F(A) s.t. |A| is odd/even [Goemans & Ramakrishnan ’95]
A* = argmin F(A) s.t. A ∈ argmin G(A) for G submodular [Fujishige ’91]
Some recent lower bounds by Svitkina & Fleischer ‘08
State of the art in minimization
Faster algorithms for special cases via modern results for convex minimization [Stobbe & Krause NIPS ’10]
Approximation algorithms for constrained minimization: cooperative cuts [Jegelka & Bilmes CVPR ’11]
Fast approximate minimization via iterative reduction to graph cuts [Jegelka & Bilmes NIPS ’11]
Many open questions in principled, efficient, large-scale submodular minimization!
38
Maximizing submodular functions
40
Maximizing submodular functions
Minimizing convex functions: polynomial-time solvable!
Minimizing submodular functions: polynomial-time solvable!
Maximizing convex functions: NP-hard!
Maximizing submodular functions: NP-hard!
But can get approximation guarantees
41
Maximizing non-monotonic functions
Suppose we want max_A F(A) for non-monotonic F
Example: F(A) = U(A) − C(A), where U(A) is a submodular utility and C(A) a supermodular cost function
E.g., trading off utility and privacy in personalized search [Krause & Horvitz AAAI ’08]
In general: NP-hard. Moreover, if F(A) can take negative values, it is as hard to approximate as maximum independent set (i.e., NP-hard to get an O(n^{1−ε}) approximation)
[Plot: non-monotonic F(A) vs. |A| with an interior maximum]
42
Maximizing positive submodular functions [Feige, Mirrokni, Vondrák FOCS ’07]
Picking a random set gives a 1/4 approximation (a 1/2 approximation if F is symmetric!)
Current best approximation [Chekuri, Vondrák, Zenklusen]: ≈0.41 via simulated annealing
We cannot get better than a 1/2 approximation unless P = NP
Theorem: There is an efficient randomized local search procedure that, given a positive submodular function F with F({}) = 0, returns a set A_LS such that
F(A_LS) ≥ (2/5) max_A F(A)
43
Scalarization vs. constrained maximization
Given monotonic utility F(A) and cost C(A), optimize either:
Option 1 (“scalarization”): max_A F(A) − C(A)
Option 2 (“constrained maximization”): max_A F(A) s.t. C(A) ≤ B
Option 1: can get a 2/5 approximation … if F(A) − C(A) ≥ 0 for all sets A; positiveness is a strong requirement
Option 2: greedy algorithm gets a (1 − 1/e) ≈ 0.63 approximation; many more results!!
Focus for the rest of today
44
Battle of the Water Sensor Networks Competition [with Leskovec, Guestrin, VanBriesen, Faloutsos, J Wat Res Mgt 2008]
Real metropolitan-area network (12,527 nodes)
Water flow simulator provided by EPA
3.6 million contamination events
Multiple objectives: detection time, affected population, …
Place sensors that detect well “on average”
45
Claim: the reward function is monotonic submodular
Consider event i:
R_i(u_k) = benefit from sensor u_k in event i
R_i(A) = max_{u_k ∈ A} R_i(u_k)
⇒ R_i is submodular
Overall objective: R(A) = Σ_i Prob(i) R_i(A)
⇒ R is submodular
Can use the greedy algorithm to (approximately) solve max_{|A| ≤ k} R(A)!
[Figure: contamination event i with sensors u1, u2 and their benefits R_i(u1), R_i(u2)]
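A minimal sketch of this objective on made-up data (per-event benefits and event probabilities are illustrative):

```python
# R_i(A) = max benefit of any chosen sensor in event i; R(A) = expected benefit over events.
benefit = {  # benefit[event][sensor], illustrative values
    "e1": {"u1": 0.9, "u2": 0.1, "u3": 0.0},
    "e2": {"u1": 0.0, "u2": 0.7, "u3": 0.2},
    "e3": {"u1": 0.3, "u2": 0.0, "u3": 0.8},
}
prob = {"e1": 0.5, "e2": 0.3, "e3": 0.2}

def R(A):
    """Expected detection benefit: a nonnegative mixture of per-event max terms,
    hence monotone submodular."""
    return sum(p * max((benefit[i][u] for u in A), default=0.0) for i, p in prob.items())

print(R({"u1"}), R({"u1", "u3"}))  # 0.51, 0.67
```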
46
Bounds on optimal solution [Krause et al., J Wat Res Mgt ’08]
The (1 − 1/e) bound is quite loose… can we get better bounds?
[Plot: population protected F(A) (higher is better) vs. number of sensors placed, water networks data; the offline (Nemhauser) bound lies well above the greedy solution]
47
Data dependent bounds [Minoux ’78]
Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k, and let A* = {s1,…,sk} be an optimal solution.
Then F(A*) ≤ F(A ∪ A*) = F(A) + Σ_i (F(A ∪ {s1,…,si}) − F(A ∪ {s1,…,si−1})) ≤ F(A) + Σ_i (F(A ∪ {si}) − F(A)) = F(A) + Σ_i Δ_{si}
For each s ∈ V \ A, let Δ_s = F(A ∪ {s}) − F(A); order them so that Δ_1 ≥ Δ_2 ≥ … ≥ Δ_n
Then: F(A*) ≤ F(A) + Σ_{i=1}^k Δ_i
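A short sketch computing this data-dependent bound for a candidate solution (the coverage-style F is illustrative):

```python
def minoux_bound(F, V, A, k):
    """Data-dependent upper bound on the optimum of max F(A) s.t. |A| <= k:
    F(A) plus the k largest marginal gains Delta_s = F(A ∪ {s}) - F(A), s ∉ A."""
    base = F(A)
    deltas = sorted((F(A | {s}) - base for s in V if s not in A), reverse=True)
    return base + sum(deltas[:k])

regions = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}, 4: {"c", "d"}}
F = lambda A: len(set().union(*(regions[s] for s in A))) if A else 0
A = {1, 3}                                           # some candidate with |A| = k = 2
print(F(A), minoux_bound(F, regions.keys(), A, 2))   # 3 5 (true optimum here is 4)
```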
48
Bounds on optimal solution [Krause et al., J Wat Res Mgt ’08]
Submodularity gives data-dependent bounds on the performance of any algorithm
[Plot: sensing quality F(A) (higher is better) vs. number of sensors placed, water networks data; the data-dependent bound is much tighter than the offline (Nemhauser) bound and lies close to the greedy solution]
49
BWSN competition results
13 participants; performance measured on 30 different criteria
[Bar chart: total score (higher is better) for the greedy approach vs. Berry et al., Dorini et al., Wu & Walski, Ostfeld & Salomons, Propato & Piller, Eliades & Polycarpou, Huang et al., Guan et al., Ghimire & Barkdoll, Trachtman, Gueli, Preis & Ostfeld; methods labeled E (“exact” MIP), D (domain knowledge), G (genetic algorithm), H (other heuristic)]
24% better performance than runner-up!
50
Simulated all 3.6M contaminations on 2 weeks / 40 processors
152 GB data on disk, 16 GB in main memory (compressed) ⇒ very accurate computation of F(A)
Very slow evaluation of F(A): 30 hours / 20 sensors; 6 weeks for all 30 settings
[Plot: running time in minutes (lower is better) vs. number of sensors selected; exhaustive search (all subsets) vs. naive greedy]
What was the trick?
Submodularity to the rescue
51
Scaling up the greedy algorithm [Minoux ’78]
In round i+1, having picked A_i = {s1,…,si}, pick s_{i+1} = argmax_s F(A_i ∪ {s}) − F(A_i)
I.e., maximize the “marginal benefit” Δ(s | A_i) = F(A_i ∪ {s}) − F(A_i)
Key observation: submodularity implies
i ≤ j ⇒ Δ(s | A_i) ≥ Δ(s | A_j)
Marginal benefits can never increase!
52
“Lazy” greedy algorithm [Minoux ’78]
Lazy greedy algorithm:
First iteration as usual
Keep an ordered list of marginal benefits Δ_i from the previous iteration
Re-evaluate Δ_i only for the top element
If Δ_i stays on top, use it; otherwise re-sort
[Figure: priority list over elements a–e being lazily re-sorted between iterations]
Note: very easy to compute online bounds, lazy evaluations, etc. [Leskovec, Krause et al. ’07]
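A minimal sketch of lazy greedy using a max-heap of stale marginal gains (illustrative, not the toolbox implementation):

```python
import heapq

def lazy_greedy(F, V, k):
    """Lazy greedy: keep stale marginal gains in a max-heap (negated for heapq);
    only the top element is re-evaluated, thanks to diminishing returns."""
    A, fA = set(), 0.0
    heap = [(-(F({s}) - F(set())), s) for s in V]   # true gains w.r.t. the empty set
    heapq.heapify(heap)
    for _ in range(k):
        while True:
            neg_gain, s = heapq.heappop(heap)
            fresh = F(A | {s}) - fA                 # re-evaluate only the current top
            if not heap or fresh >= -heap[0][0]:
                A.add(s)
                fA += fresh
                break
            heapq.heappush(heap, (-fresh, s))       # stale; push back with updated gain
    return A

regions = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}, 4: {"c", "d"}}
F = lambda A: len(set().union(*(regions[s] for s in A))) if A else 0
print(lazy_greedy(F, list(regions), 2))  # e.g. {1, 4}
```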
53
Simulated all 3.6M contaminations on 2 weeks / 40 processors
152 GB data on disk, 16 GB in main memory (compressed) ⇒ very accurate computation of F(A)
Very slow evaluation of F(A): 30 hours / 20 sensors; 6 weeks for all 30 settings
Using “lazy evaluations”: 1 hour / 20 sensors; done after 2 days!
[Plot: running time in minutes (lower is better) vs. number of sensors selected; exhaustive search (all subsets), naive greedy, and fast (lazy) greedy]
Submodularity to the rescue: result of lazy evaluation
54
Non-constant cost functions
For each s ∈ V, let c(s) > 0 be its cost (e.g., feature acquisition costs, …)
Cost of a set: C(A) = Σ_{s ∈ A} c(s)
Want to solve
A* = argmax F(A) s.t. C(A) ≤ B
Cost-benefit greedy algorithm:
Start with A := {}
While there is an s ∈ V \ A s.t. C(A ∪ {s}) ≤ B:
  s* := argmax over feasible s of [F(A ∪ {s}) − F(A)] / c(s)
  A := A ∪ {s*}
55
Performance of cost-benefit greedy
Want max_A F(A) s.t. C(A) ≤ 1
Set A    F(A)   C(A)
{a}      2ε     ε
{b}      1      1
Cost-benefit greedy picks a (best benefit/cost ratio). Then it cannot afford b!
Cost-benefit greedy performs arbitrarily badly (value 2ε vs. the optimal value 1)!
56
Cost-benefit optimization [Wolsey ’82, Sviridenko ’04, Krause et al. ’05]
Theorem [Krause & Guestrin ’05]: Let A_CB be the cost-benefit greedy solution and A_UC the unit-cost greedy solution (i.e., ignoring costs). Then
max { F(A_CB), F(A_UC) } ≥ (1 − 1/√e) OPT   (≈ 39%)
Can still compute online bounds and speed up using lazy evaluations
Note: can also get a (1 − 1/e) approximation in time O(n^4) [Sviridenko ’04]; a (1 − 1/e) approximation for multiple linear constraints [Kulik ’09]; and a 0.38/k approximation for k matroid and m linear constraints [Chekuri et al. ’11]
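A sketch of the two greedy variants and the max-of-both rule from the theorem (budget, costs, and F are illustrative):

```python
def budgeted_greedy(F, V, cost, B, by_ratio):
    """Greedy under a knapsack budget B. If by_ratio, pick by marginal gain per unit cost
    (cost-benefit greedy); otherwise by raw marginal gain (unit-cost greedy)."""
    A, fA, spent = set(), 0.0, 0.0
    while True:
        feasible = [s for s in V if s not in A and spent + cost[s] <= B]
        if not feasible:
            return A, fA
        def score(s):
            gain = F(A | {s}) - fA
            return gain / cost[s] if by_ratio else gain
        s = max(feasible, key=score)
        A.add(s); fA = F(A); spent += cost[s]

def cb_uc_max(F, V, cost, B):
    """Return the better of cost-benefit and unit-cost greedy (the theorem's rule)."""
    return max(budgeted_greedy(F, V, cost, B, True),
               budgeted_greedy(F, V, cost, B, False),
               key=lambda res: res[1])

regions = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}
F = lambda A: len(set().union(*(regions[s] for s in A))) if A else 0
cost = {1: 2.0, 2: 1.0, 3: 1.0}
print(cb_uc_max(F, list(regions), cost, B=2.0))  # ({2, 3}, 3)
```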
57
Informative path planning
So far: max F(A) s.t. |A| ≤ k
[Figure: graph of locations s1,…,s11 with travel costs on the edges]
Most informative locations might be far apart! The robot needs to travel between selected locations
Locations V = nodes in a graph; C(A) = cost of the cheapest path connecting the nodes in A
Now: max F(A) s.t. C(A) ≤ B
Can exploit submodularity to get near-optimal paths; outperforms existing heuristics
58
Question
I have 10 minutes. Which blogs should I read to be most up to date? [Leskovec, Krause et al. ’07]
77% read blogs; 184M blogs [Universal McCann ’08]
59
[Figure: information cascade spreading over blogs through time]
Cascades in the blogosphere [with Leskovec, Guestrin, Faloutsos, VanBriesen, Glance, KDD ’07 – Best Paper]
Which blogs should we read to learn about big cascades early?
Learn about the story after us!
60
Water vs. Web
In both problems we are given:
A graph with nodes (junctions / blogs) and edges (pipes / links)
Cascades spreading dynamically over the graph (contamination / citations)
Want to pick nodes to detect big cascades early
Placing sensors in water networks vs. selecting informative blogs
In both applications, the utility functions are submodular [generalizes Kempe, Kleinberg, Tardos, KDD ’03]
61
Performance on Blog selection
Outperforms state-of-the-art heuristics; 700x speedup using submodularity!
[Plot: running time in seconds (lower is better) vs. number of blogs selected (~45k blogs); exhaustive search (all subsets), naive greedy, fast greedy]
[Plot: cascades captured (higher is better) vs. number of blogs; greedy beats in-links, all out-links, # posts, and random baselines]
Naïve approach: just pick the 10 best blogs. This selects big, well-known blogs (Instapundit, etc.); these contain many posts and take long to read!
Taking “attention” into account
[Plot: cascades captured vs. number of posts (time) allowed; cost-benefit analysis vs. ignoring cost]
Cost-benefit optimization picks summarizer blogs!
62
Generalizations
Learning to optimize submodular functions
Online submodular optimization: learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions. Application: building algorithm portfolios
Adaptive submodular optimization: gradually build up a set, taking feedback into account. Application: sequential experimental design
64
65
Predicting the “hot” blogs
Want blogs that will be informative in the future
Split the data set: train on historic data, test on future data
[Plot: cascades captured vs. number of posts (time) allowed; “greedy on historic, test on future” generalizes poorly compared to the “cheating” baseline “greedy on future, test on future”]
[Plot: # detections per month (Jan–May) for Greedy and Saturate: detects well on the training period, poorly afterwards]
Blog selection “overfits” to the training data! Poor generalization! Why’s that?
Let’s see what goes wrong here.
Want blogs that continue to do well!
66
Online optimization
[Plot: # detections per month (Jan–May) for Greedy and Saturate; the “overfit” blog selection A achieves F1(A)=.5, F2(A)=.8, F3(A)=.6, F4(A)=.01, F5(A)=.02 across the intervals]
Online optimization: F_i(A) = detections in interval i
67
Online maximization of submodular functions [Streeter, Golovin NIPS ’08]
Theorem: Can efficiently choose A1,…,AT such that, in expectation,
(1/T) Σ_t F_t(A_t) ≥ (1 − 1/e) max_{|A| ≤ k} (1/T) Σ_t F_t(A) − o(1)
for any sequence F_i, as T → ∞
Can get ‘no-regret’ over the “cheating” greedy algorithm
[Protocol figure: at each step t, pick a set A_t; a submodular function F_t arrives and reward r_t = F_t(A_t) is collected; goal: maximize Σ_t r_t; observe either all of F_t, or only F_t(A_t)]
Online Greedy Algorithm [Streeter & Golovin, NIPS ’08]
Replace each stage of the greedy algorithm with an expert / multi-armed bandit algorithm.
Select {a1, a2, a3, …, ak}, one element per stage.
Feedback to the j-th stage’s algorithm for its action a_j is f({a1, a2, …, a_{j−1}, a_j}) − f({a1, a2, …, a_{j−1}})
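A compact sketch of this online greedy idea, with one multiplicative-weights (“Hedge”) learner per greedy stage and full-information feedback (learning rate, reward scaling, and the bandit variant are simplified away; purely illustrative):

```python
import math, random

class Hedge:
    """Multiplicative-weights expert algorithm over a fixed action set."""
    def __init__(self, actions, eta=0.5):
        self.actions, self.eta = list(actions), eta
        self.w = {a: 1.0 for a in self.actions}
    def sample(self):
        total = sum(self.w.values())
        r = random.random() * total
        for a in self.actions:
            r -= self.w[a]
            if r <= 0:
                return a
        return self.actions[-1]
    def update(self, a, reward):            # reward assumed scaled to [0, 1]
        self.w[a] *= math.exp(self.eta * reward)

def online_greedy(F_stream, V, k):
    """One Hedge learner per greedy slot; slot j is rewarded with its marginal gain."""
    learners = [Hedge(V) for _ in range(k)]
    total = 0.0
    for F in F_stream:                       # F_t is revealed after the set is chosen
        A, gains, chosen = set(), [], []
        for learner in learners:
            a = learner.sample()
            gains.append(F(A | {a}) - F(A))
            chosen.append(a)
            A |= {a}
        total += F(A)
        for learner, a, g in zip(learners, chosen, gains):
            learner.update(a, g)
        yield A, total
```

The design point is simply that each greedy stage becomes its own no-regret learner, so the concatenation of their choices inherits the (1 − 1/e)-style guarantee of offline greedy in hindsight.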
69
Results on blogs [Streeter, Golovin, Krause NIPS ’09]
Performance of the online algorithm converges quickly to the “cheating” offline greedy algorithm!
Can learn to rank by modeling position-dependent effects
[Plot: average normalized performance vs. time in days (T = 47)]
Hybridizing Solvers [Streeter, Golovin, Smith]
• Hybridization via task-switching schedules
• Need to learn which schedules work well!
[Figure: example schedule S switching between heuristics #1–#4 for 2, 4, 6, 2, 10 seconds]
Hybridizing Solvers [Streeter & Golovin ’08]
• Actions A = {run heuristic h for t seconds : h ∈ H, t > 0}
• Instances: f_i(S) = Pr[schedule S completes job i within the time limit]
• This is a submodular function!
• Task: select {S_i : i = 1, 2, …, T} online to maximize the number of instances solved
• Online greedy algorithm achieves no-(1 − 1/e)-regret
[Plots (log scale) of solver performance]
[Streeter, Golovin, Smith, AAAI ’07 & CSP ’08]
Learning to optimize submodular functions
Online submodular optimization: learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions. Application: building algorithm portfolios
Adaptive submodular optimization: gradually build up a set, taking feedback into account. Application: sequential experimental design
74
75
Adaptive Selection
[Figure: decision tree branching on test outcomes x1, x2, x3 ∈ {0, 1}]
Want to effectively diagnose while minimizing the cost of testing!
Interested in a policy (decision tree), not a set.
Can’t use submodular set functions!!
Can we generalize submodularity for sequential decision making?
Can we generalize submodularity for sequential decision making?
Problem Statement
Given:
Items (tests, experiments, actions, …) V = {1,…,n}
Associated random variables X1,…,Xn taking values in O
Objective: f(selected items, observed outcomes)
Value of a policy π: F(π) = E_{x_V}[ f(S(π, x_V), x_V) ], where S(π, x_V) is the set of items (tests) run by π if the world is in state x_V
Want: a policy maximizing F(π) (subject to constraints, e.g., on the number of items picked)
NP-hard (also hard to approximate!)
76
Adaptive greedy algorithm
Suppose we’ve seen X_A = x_A.
Conditional expected benefit of adding item s:
Δ(s | x_A) = E[ f(A ∪ {s}, X_V) − f(A, X_V) | X_A = x_A ]   (benefit if the world is in state x_V, averaged over the posterior)
Adaptive greedy algorithm:
Start with A = {}
For i = 1:k
  Pick s* = argmax_s Δ(s | x_A)
  Observe X_{s*} = x_{s*}
  Set A := A ∪ {s*}
77
When does this adaptive greedy algorithm work??
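A small sketch of the adaptive greedy loop on a toy noise-free diagnosis problem, where the objective is the expected prior mass of hypotheses ruled out (priors and outcome tables are made up):

```python
# Toy adaptive greedy: hidden hypothesis y with a prior; each test s has a
# deterministic outcome outcome[s][y]. Gain = expected newly ruled-out mass.
prior = {"flu": 0.5, "cold": 0.3, "allergy": 0.2}
outcome = {                       # outcome[test][hypothesis] in {0, 1}
    "fever": {"flu": 1, "cold": 0, "allergy": 0},
    "cough": {"flu": 1, "cold": 1, "allergy": 0},
    "rash":  {"flu": 0, "cold": 0, "allergy": 1},
}

def expected_gain(s, consistent):
    """Expected mass of hypotheses ruled out by test s, given the still-consistent set."""
    Z = sum(prior[y] for y in consistent)
    gain = 0.0
    for o in (0, 1):
        agree = {y for y in consistent if outcome[s][y] == o}
        p_o = sum(prior[y] for y in agree) / Z           # prob. of observing outcome o
        gain += p_o * sum(prior[y] for y in consistent - agree)
    return gain

def adaptive_greedy(true_y, k):
    consistent, A = set(prior), set()
    for _ in range(k):
        s = max(set(outcome) - A, key=lambda s: expected_gain(s, consistent))
        A.add(s)                                         # pick the most informative test
        o = outcome[s][true_y]                           # observe its outcome
        consistent = {y for y in consistent if outcome[s][y] == o}
    return A, consistent

print(adaptive_greedy("cold", 2))
```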
Adaptive submodularity [Golovin & Krause, JAIR 2011]
f is called adaptive submodular iff
Δ(s | x_A) ≥ Δ(s | x_B) whenever x_B observes more than x_A (i.e., x_A is a sub-realization of x_B)
f is called adaptive monotone iff Δ(s | x_A) ≥ 0 for all observations x_A
Theorem: If f is adaptive submodular and adaptive monotone w.r.t. the distribution P, then
F(π_greedy) ≥ (1 − 1/e) F(π_opt)
78
Many other results about submodular set functions can also be “lifted” to the adaptive setting!
79
[Figure: social network over Alice, Bob, Charlie, Daria, Eric, Fiona with probabilities of influencing (0.2–0.5)]
Application: Adaptive Viral Marketing
Adaptively select promotion targets; after each selection, see which people are influenced.
80
Application: Adaptive Viral Marketing
Theorem: Objective adaptive submodular!
Hence, adaptive greedy choice is a (1-1/e) ≈ 63% approximation to the optimal adaptive solution.
[Figure: the same social network with influence probabilities]
Stochastic Minimum Cost Cover
Suppose we wish to get 25% market penetration at minimum cost
Theorem [Golovin & Krause ’11]: If F is adaptive submodular and adaptive monotone, then Cost(π_greedy) ≤ O(log n) · Cost(π_opt)
[Figure: the same social network with influence probabilities]
Optimal Decision Trees
Prior over diseases P(Y)
Likelihood of test outcomes P(X_V | Y)
Suppose that P(X_V | Y) is deterministic (noise free)
Each test eliminates hypotheses y
How should we test to eliminate all incorrect hypotheses?
“Generalized binary search”, equivalent to maximizing information gain
82
[Figure: naive Bayes model with Y “Sick” and tests X1 “Fever”, X2 “Rash”, X3 “Cough”; decision tree branching on outcomes such as X1 = 1, X3 = 0, X2 = 0/1]
ODT is Adaptive Submodular
83
Objective = probability mass of hypotheses you have ruled out.
[Figure: tests s, v, w with outcomes 0/1 ruling out different hypotheses]
Not hard to show that this objective is adaptive monotone and adaptive submodular.
Guarantees for Optimal Decision Trees
[Figure: decision tree on test outcomes x1, x2, x3]
Prior work: Garey & Graham 1974; Loveland 1985; Arkin et al. 1993; Kosaraju et al. 1999; Dasgupta 2004; Guillory & Bilmes 2009; Nowak 2009; Gupta et al. 2010
With adaptive submodular analysis!
Result requires that tests are exact (no noise)!
Noisy Bayesian diagnosis [with Daniel Golovin, Deb Ray, NIPS ’10]
In practice, observations typically are noisy
Results for the noise-free case do not generalize
Key problem: tests no longer rule out hypotheses (they only make them less likely)
Intuitively, want to gather enough information to make the right decision!
85
[Figure: naive Bayes model with Y “Sick” and noisy tests X1 “Fever”, X2 “Rash”, X3 “Cough”]
Noisy Bayesian diagnosis
Suppose I run all tests, i.e., see X_V = x_V. The best I can do is to maximize expected utility: a* = argmax_a E[ U(a, Y) | X_V = x_V ]
How should I cheaply test to guarantee that I choose a*?
Existing approaches: generalized binary search? maximize information gain? maximize value of information?
Theorem: All of these approaches can have cost more than n / log n times the optimal cost!
86
They are not adaptive submodular in the noisy setting!
A new criterion for nonmyopic VOI
Strategy: replace the noisy problem with a noiseless one
Key idea: make test outcomes part of the hypothesis
87
[Figure: hypotheses are outcome vectors such as [0,0,0], [0,1,0], [1,0,0], [0,0,1], [1,0,1], [1,1,0]]
A new criterion for nonmyopic VOI
Only need to distinguish between noisy hypotheses that lead to different decisions!
88
[Figure: graph over outcome-vector hypotheses ([0,0,0], [1,0,0], [0,1,0], [1,0,1], [1,1,1], …); edges connect hypotheses that lead to different decisions]
Weight of an edge = product of the two hypotheses’ probabilities
Objective = total mass of edges cut by the observations made so far (“equivalence class edge-cutting”, EC2)
Theoretical guarantees [with Daniel Golovin, Deb Ray, NIPS ’10]
Theorem: Equivalence class edge-cutting (EC2) is adaptive submodular and adaptive monotone. If every outcome vector has prior probability at least δ, the greedy EC2 policy’s cost is within an O(log(1/δ)) factor of the optimal cost.
First approximation guarantees for nonmyopic VOI in general graphical models!
89
Example: The Iowa Gambling Task [with Colin Camerer, Deb Ray]
Various competing theories on how people make decisions under uncertainty:
Maximize expected utility? [von Neumann & Morgenstern ’47]
Constant relative risk aversion? [Pratt ’64]
Portfolio optimization? [Hanoch & Levy ’70]
(Normalized) prospect theory? [Kahneman & Tversky ’79]
How should we design tests to distinguish theories?
90
[Figure: two gambles over outcomes −10$, 0$, +10$ with probabilities .3/.7 vs. .7/.3. What would you prefer, A) or B)?]
Iowa Gambling as Bayesian experimental design (BED)
Every possible test X_s = (g_{s,1}, g_{s,2}) is a pair of gambles
Theories parameterized by θ
Each theory predicts a utility for every gamble, U(g, y, θ)
[Figure: model with hidden theory Y and parameters θ generating responses X1,…,Xn to gamble pairs (g_{i,1}, g_{i,2})]
91
Simulation Results
Adaptive submodular criterion (EC2) outperforms existing approaches
92
Experimental Study [with Colin Camerer, Deb Ray]
Study with 57 naïve subjects
Indication of heterogeneity in the population
Unexpectedly, no support for CRRA
Submodularity crucial for real-time performance: 32,000 designs in < 5s; 8x faster through lazy evaluations!
93
94
Structure in learning problems
MATLAB Toolbox for optimizing submodular functions (JMLR ’10)
Series of NIPS Workshops on Discrete Optimization in ML (DISCML); talks recorded on videolectures.net
AI/ML, last 10 years: convexity — kernel machines, SVMs, GPs, MLE, …
AI/ML, “next 10 years”: submodularity — new structural properties
Submodularity in ML / AI:
Fast inference for high-order submodular MAP problems [NIPS ’10]
Submodular dictionary selection for sparse representation [ICML ’10]
Learning to solve zero-sum submodular games [IJCAI ’11]
First convergence rates for GP optimization [ICML ’10]
95
Conclusions
Discrete optimization is abundant in applications
Fortunately, some of those problems have structure: submodularity
Submodularity can be exploited to develop efficient, scalable algorithms with strong guarantees
Can handle complex constraints
Can learn to optimize (online, adaptive, …)
96