A reinforcement learning algorithm for sampling design in Markov random fields
Mathieu BONNEAU Nathalie PEYRARD Régis SABBADIN
INRA-MIA Toulouse. E-mail: {mbonneau,peyrard,sabbadin}@toulouse.inra.fr
MSTGA, Bordeaux, 7 2012
Plan
1. Problem statement.
2. General approach.
3. Formulation using a dynamic model.
4. Reinforcement learning solution.
5. Experiments.
6. Conclusions.
PROBLEM STATEMENT
Adaptive sampling
[Figure: grid of variables X(1), …, X(9) with first-order neighbourhood]
• Adaptive selection of variables to observe, for reconstruction of the random vector X = (X(1), …, X(n))
• c(A) → cost of observing variables X(A)
• B → initial budget
• Observations are reliable
PROBLEM STATEMENT
Adaptive sampling
• Adaptive selection of variables to observe, for reconstruction of the random vector X = (X(1), …, X(n))
• c(A, x(A)) → cost of observing variables X(A) in state x(A)
• B → initial budget
• Observations are reliable

Problem: find a sampling policy that adaptively selects the variables to observe, in order to:
optimize the quality of the reconstruction of X / respect the initial budget.
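Stated a little more formally (a hedged reconstruction assembled from the bullets above rather than the talk's own display, writing V(𝛿) for the reconstruction quality achieved by policy 𝛿):

```latex
\delta^* \;\in\; \operatorname*{argmax}_{\delta}\; V(\delta)
\qquad \text{subject to} \qquad c(\delta) \le B .
```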
DEFINITION: Adaptive sampling policy
For any sampling plans A1, …, At and observations x(A1), …, x(At), an adaptive sampling policy 𝛿 is a function giving the next variable(s) to observe:
𝛿((A1, x(A1)), …, (At, x(At))) = At+1.
- - Example - -
A1 = 𝛿1 = {8}, observed value x(A1) = 0
A2 = 𝛿2((8, 0)) = {6, 7}, observed values x(A2) = (1, 3)
[Figure: grid of variables X(1), …, X(9); X(8) is observed first, then X(6) and X(7)]
DEFINITIONS
Vocabulary:
• A history {(A1, x(A1)), …, (AH, x(AH))} is a trajectory followed when applying 𝛿.
• τ𝛿: the set of all reachable histories of 𝛿.
• c(𝛿) ≤ B: the cost of any history respects the initial budget.
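To make the definition concrete, here is a minimal, purely illustrative Python sketch (not from the talk; the branch values are invented): a policy is simply a function from the history of (plan, observation) pairs to the next set of variables to observe.

```python
# Hypothetical illustration of an adaptive sampling policy: a function from
# the history ((A1, x(A1)), ..., (At, x(At))) to the next set of variables.
def delta(history):
    if not history:
        return {8}                 # first step: observe X(8)
    (A1, x_A1), *_ = history
    if x_A1 == 0:                  # branch on the first observation,
        return {6, 7}              # as in the slides' example
    return {2, 3}                  # purely illustrative alternative branch
```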
GENERAL APPROACH
1. Find a distribution ℙ that describes well the phenomenon under study.
2. Define the value of an adaptive sampling policy.
3. Define an approximate resolution method for finding a near-optimal policy.
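The value display on this slide is an image in the original; a standard form consistent with the notation used later in the talk (histories τ, reachable set τ𝛿, utility U) might be:

```latex
V(\delta) \;=\; \mathbb{E}\big[\, U(\tau) \,\big]
\;=\; \sum_{\tau \in \tau_{\delta}} \mathbb{P}(\tau)\, U(\tau).
```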
STATE OF THE ART
1. Find a distribution ℙ that describes well the phenomenon under study: X is a continuous random vector; ℙ is a multivariate Gaussian joint distribution.
2. Define the value of an adaptive sampling policy: entropy-based criteria; kriging variance.
3. Define an approximate resolution method for finding a near-optimal policy: greedy algorithm.
OUR CONTRIBUTION
1. Distribution ℙ: X is a discrete random vector; ℙ is a Markov random field distribution.
2. Value of an adaptive sampling policy: Maximum Posterior Marginals (MPM).
3. Approximate resolution method: reinforcement learning.
Formulation using a dynamic model
An adapted framework for reinforcement learning
Summarize the knowledge on X in a random vector S of length n.
• Observing variables updates our knowledge on X → evolution of S.
• Example: s = (-1, …, k, …, -1): the entry k at position i means that variable X(i) was observed in state k (entries equal to -1 are unobserved).
[Figure: dynamic model s1 → s2 → s3 → … → sH+1 under actions A1, A2, A3, …, with final utility U(sH+1)]
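A minimal sketch of this representation (an assumption consistent with the slide's example, not code from the talk): one entry per variable, -1 for "not yet observed", the observed state otherwise.

```python
def initial_state(n):
    """Nothing observed yet: every entry of s is -1."""
    return (-1,) * n

def update_state(s, observations):
    """Fold new observations {i: k} (variable X(i) seen in state k) into s."""
    s = list(s)
    for i, k in observations.items():
        s[i] = k
    return tuple(s)   # tuples are hashable, convenient as Q-table keys

# Example on the 9-variable grid: observe X(8) in state 0, then X(6) and X(7).
s1 = initial_state(9)
s2 = update_state(s1, {7: 0})          # X(8) -> index 7 (0-based)
s3 = update_state(s2, {5: 1, 6: 3})    # X(6) in state 1, X(7) in state 3
```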
Reinforcement learning solution
Find the optimal policy: the Q-function
• Q*(st, At) = «the expected value of the history when starting in st, observing variables X(At) and then following the optimal policy 𝛿*».
• Compute Q* → compute 𝛿*.
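In symbols (a hedged reconstruction with standard finite-horizon definitions; the slide's own displays are images): the optimal policy is read off greedily from Q*, which satisfies a backward recursion,

```latex
\delta^*(s_t) \;\in\; \operatorname*{argmax}_{A_t}\; Q^*(s_t, A_t),
\qquad
Q^*(s_t, A_t) \;=\; \mathbb{E}\!\left[\,\max_{A_{t+1}} Q^*(s_{t+1}, A_{t+1}) \;\middle|\; s_t, A_t \right],
```

with the inner maximum replaced by the final utility U(sH+1) at the last decision step.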
• How to compute Q*: classical solution (Q-learning, …):
1. Initialize Q.
2. Simulate a history, choosing at each step the current greedy action with probability 1-𝜀 and a random action with probability 𝜀.
3. Update Q(s1, A1), Q(s2, A2), … along the simulated history.
4. Repeat many times: Q converges to Q*.
[Figure: simulated history s1 → s2 → s3 under actions A1, A2, ending with utility U(s3); Q is updated after each transition]
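A generic tabular sketch of this loop (hedged: `simulate_step`, `utility` and `actions` are placeholder callbacks, and the talk's exact update rule is not shown on the slides; rewards arrive only through the final utility, as in the diagrams).

```python
import random
from collections import defaultdict

def q_learning(simulate_step, utility, actions, horizon, s0,
               n_episodes=10_000, alpha=0.1, eps=0.1):
    """Tabular Q-learning with epsilon-greedy simulation of histories.

    simulate_step(s, a): sample the next knowledge state after action a
    utility(s):          final reconstruction quality U(s)
    actions(s):          admissible sampling actions in state s
    """
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, transitions = s0, []
        for t in range(horizon):
            acts = actions(s)
            if random.random() < eps:                      # random with proba eps
                a = random.choice(acts)
            else:                                          # greedy with proba 1 - eps
                a = max(acts, key=lambda b: Q[(t, s, b)])
            s_next = simulate_step(s, a)
            transitions.append((t, s, a, s_next))
            s = s_next
        for t, st, at, st1 in transitions:                 # update Q along the history
            if t == horizon - 1:
                target = utility(st1)                      # terminal reward U(s_{H+1})
            else:
                target = max(Q[(t + 1, st1, b)] for b in actions(st1))
            Q[(t, st, at)] += alpha * (target - Q[(t, st, at)])
    return Q
```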
Alternative approach
• Linear approximation of the Q-function: Q(s, A) ≈ Σi wi Φi(s, A).
• Choice of the functions Φi: …
LSDP Algorithm
• Linear approximation of the Q-function, with a separate weight vector wt for each decision step t = 1, …, H.
• Compute the weights using “backward induction”.
LSDP Algorithm: application to sampling
1. Computation of Φi(st, At): …
2. Computation of …
3. Computation of …
• We fix |At| = 1 and use the approximation: …
Experiments
• Regular grid with first-order neighbourhood.
• The X(i) are binary variables.
• ℙ is a Potts model with β = 0.5.
• Simple cost: observing each variable costs 1.
[Figure: regular grid of variables X(1), …, X(9)]
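For concreteness, a sketch of how such a field could be simulated (an assumption: the slides do not specify the sampler; Gibbs sampling is one standard choice for a binary Potts model).

```python
import math
import random

def gibbs_potts(m, beta=0.5, n_sweeps=200, seed=0):
    """Gibbs sampler for a binary Potts model on an m x m grid with
    first-order (4-nearest) neighbourhood: P(x) ~ exp(beta * #{agreeing pairs})."""
    rng = random.Random(seed)
    x = [[rng.randint(0, 1) for _ in range(m)] for _ in range(m)]
    for _ in range(n_sweeps):
        for i in range(m):
            for j in range(m):
                nbrs = [x[a][b]
                        for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                        if 0 <= a < m and 0 <= b < m]
                w0 = math.exp(beta * sum(v == 0 for v in nbrs))
                w1 = math.exp(beta * sum(v == 1 for v in nbrs))
                x[i][j] = int(rng.random() < w1 / (w0 + w1))   # resample site (i, j)
    return x

field = gibbs_potts(10)   # one 10 x 10 realisation, as in the n = 100 experiments
```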
• Comparison between:
   Random policy.
   BP-max heuristic: at each time step, the variable to observe is selected using the current max posterior marginals (computed by belief propagation).
   LSPI policy: a common reinforcement learning algorithm.
   LSDP policy.
• Comparison using the score: …
Experiment: 100 variables (n=100)
LSDP and BP-max: no observation
[Figure: action values for the LSDP policy; max marginals]
LSDP and BP-max: one observation
[Figure: action values for the LSDP policy; max marginals]
LSDP and BP-max: two observations
[Figure: action values for the LSDP policy; max marginals]
Experiment: 100 variables, constrained moves
• Only the second-order neighbourhood of the current site may be visited!
Experiment: 200 variables, different costs (initial budget = 38)

                                    Random   BP-Max   LSDP    Min Cost
Policy value                        60.27%   61.77%   64.8%   64.58%
Mean number of observed variables   26.65    19       27.3    38
• Cost repartition: [Figure: cost repartition for the Random, BP-max and LSDP policies]
Experiment: 100 variables, different costs

                                    Random   BP-Max   LSDP
Policy value                        65.3%    66.2%    67%
Mean number of observed variables   15.4     15.4     15.4
WR 1                                65.9%    65.4%    67%
WR 0                                61%      63.2%    64%

• Initial budget = 30
• Cost of 1 when the observed variable is in state 0
• Cost of 3 when the observed variable is in state 1
Conclusions
• An adapted framework for adaptive sampling of discrete random variables.
• LSDP: a reinforcement learning approach for finding a near-optimal policy:
   Adaptation of a common reinforcement learning algorithm to the adaptive sampling problem.
   Computation of a near-optimal policy «off-line».
   Design of new policies that outperform simple heuristics and a usual RL method.
• Possible application: weed sampling in crop fields.
THANK YOU!
Reconstruction of X(R) and trajectory value
A1 = 𝛿1 = {8}, observed value x(A1) = 0
A2 = 𝛿2((8, 0)) = {6, 7}, observed values x(A2) = (4, 2)
[Figure: reconstruction of X(R) from the observations]
• Maximum Posterior Marginal for reconstruction: …
• Quality of a trajectory: …
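Both displays are images in the original; standard choices consistent with the MPM vocabulary of the talk would be a marginal-wise reconstruction and a quality equal to the expected number of correctly reconstructed variables:

```latex
x^*(i) \;=\; \operatorname*{argmax}_{k}\; \mathbb{P}\big(X(i)=k \,\big|\, x(A_1),\ldots,x(A_t)\big),
\qquad
U(\tau) \;=\; \sum_{i} \mathbb{P}\big(X(i)=x^*(i) \,\big|\, \tau\big).
```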
LSDP Algorithm
• Linear approximation of the Q-function (as above).
• How to compute w1, …, wH:
[Figure: simulated trajectories s1 → s2 → … → sH → sH+1 under actions a1, …, aH, each ending with the utility U(sH+1)]
• Step t = H: the final transitions (sH, aH) and utilities U(sH+1) of the simulated trajectories define a LINEAR SYSTEM whose solution gives wH.
• Step t = H-1: truncate the trajectories at sH-1 and solve the corresponding linear system to obtain wH-1; proceed backwards in the same way down to w1.
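Assembling the three slides, a hedged numerical sketch of this backward scheme (assumptions: trajectories are stored as lists of (s_t, a_t, s_next) triples; `features`, `actions` and `utility` are placeholder callbacks; the linear system is solved as an ordinary least-squares fit, in line with the "LS" in LSDP):

```python
import numpy as np

def lsdp_weights(trajectories, features, actions, utility, horizon):
    """Backward computation of the step-dependent weight vectors w^H, ..., w^1.

    trajectories: list of simulated episodes, each a list of (s_t, a_t, s_next)
    features(s, a) -> np.ndarray Phi(s, a);  utility(s) -> U(s)
    """
    w = [None] * (horizon + 2)              # w[t] used for t = 1..H
    for t in range(horizon, 0, -1):         # backward induction over decision steps
        X, y = [], []
        for episode in trajectories:
            s_t, a_t, s_next = episode[t - 1]
            X.append(features(s_t, a_t))
            if t == horizon:                # last step: target is the final utility
                y.append(utility(s_next))
            else:                           # earlier: value predicted one step later
                y.append(max(float(np.dot(w[t + 1], features(s_next, b)))
                             for b in actions(s_next)))
        # least-squares solution of the linear system X w^t ~ y
        w[t], *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return w[1:horizon + 1]
```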