Optimal Sensor Scheduling via Classification Reduction of Policy Search (CROPS)
DESCRIPTION
Optimal Sensor Scheduling via Classification Reduction of Policy Search (CROPS). MURI Workshop 2006. Doron Blatt and Alfred Hero, University of Michigan. Motivating example: landmine confirmation/detection with EMI, GPR, and seismic sensors.
TRANSCRIPT
Optimal Sensor Scheduling via Classification Reduction of Policy Search (CROPS)
MURI Workshop 2006
Doron Blatt and Alfred Hero
University of Michigan
Motivating Example: Landmine Confirmation/Detection
A vehicle carries three sensors for land-mine detection, each with its own characteristics.
The goal is to optimally schedule the three sensors for mine detection.
This is a sequential choice of experiment problem (DeGroot 1970).
The optimal policy maximizes the average reward.
[Figure: sensor-scheduling decision tree for a new location. Possible targets: plastic anti-personnel mine, plastic anti-tank mine, nail, rock. The EMI, GPR, and seismic sensors are scheduled in sequence; after each measurement path (EMI data, GPR data, seismic data) the policy either schedules another sensor or makes the final detection.]
Reinforcement Learning
General objective: to find optimal policies for controlling stochastic decision processes:
– without an explicit model.
– when the exact solution is intractable.
Applications:
– Sensor scheduling.
– Treatment design.
– Elevator dispatching.
– Robotics.
– Electric power system control.
– Job-shop scheduling.
Learning from Generative Models
It is possible to evaluate the value of any policy from trajectory trees:
Let R_i(π) be the sum of rewards on the path that agrees with policy π on the i-th tree. Then, the value estimate is V̂_n(π) = (1/n) Σ_{i=1}^n R_i(π).
[Figure: a trajectory tree of depth 3. The root observation O0 branches on action a0 ∈ {0, 1}; each subsequent node branches on a1 and a2, yielding observations O1, O2, O3 indexed by the action history (e.g., O3^011).]
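The trajectory-tree value estimate described above can be sketched in code. This is an illustrative sketch, not the authors' implementation: the tree layout (nested dicts with "obs", per-action "reward", and per-action "child" entries) and the function names are assumptions.

```python
# Illustrative sketch: evaluate a policy on trajectory trees.
# Each node is {"obs": ..., "reward": {action: r}, "child": {action: node or None}}.

def path_reward(node, policy, t=0):
    """Sum of rewards on the path that agrees with `policy` in one tree (R_i)."""
    if node is None:  # past the horizon
        return 0.0
    a = policy(t, node["obs"])  # the policy picks an action from the observation
    return node["reward"][a] + path_reward(node["child"][a], policy, t + 1)

def value_estimate(trees, policy):
    """Average of R_i(pi) over the n trajectory trees."""
    return sum(path_reward(root, policy) for root in trees) / len(trees)
```

Because each tree stores the outcomes of every action sequence, the same n trees can be reused to score every policy in the class.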
Three Sources of Error in RL
– Coupling of optimal decisions at each stage: finding the optimal decision rule at a certain stage hinges on knowing the optimal decision rules for future stages.
– Misallocation of approximation resources to the state space: without knowing the optimal policy, one cannot sample from the distribution that it induces on the stochastic system's state space.
– Inadequate control of generalization errors: without a model, ensemble averages must be approximated from training trajectories.
– J. Bagnell, S. Kakade, A. Ng, and J. Schneider, “Policy search by dynamic programming,” in Advances in Neural Information Processing Systems, vol. 16. 2003.
– A. Fern, S. Yoon, and R. Givan, “Approximate policy iteration with a policy language bias,” in Advances in Neural Information Processing Systems, vol. 16, 2003.
– M. Lagoudakis and R. Parr, “Reinforcement learning as classification: Leveraging modern classifiers,” in Proceedings of the Twentieth International Conference on Machine Learning, 2003.
– J. Langford and B. Zadrozny, “Reducing T-step reinforcement learning to classification,” http://hunch.net/~jl/projects/reductions/reductions.html, 2003.
– M. Kearns, Y. Mansour, and A. Ng, “Approximate planning in large POMDPs via reusable trajectories,” in Advances in Neural Information Processing Systems, vol. 12. MIT Press, 2000.
– S. A. Murphy, “A generalization error for Q-learning,” Journal of Machine Learning Research, vol. 6, pp. 1073–1097, 2005.
Learning from Generative Models
Drawbacks:
– The combinatorial optimization problem can only be solved for small n and small policy classes.
Our remedies:
– Break the multi-stage search problem into a sequence of single-stage optimization problems.
– Use a convex surrogate to simplify each optimization problem.
We will obtain generalization bounds similar to (Kearns et al., 2000) that apply to the case in which the decision rules are estimated sequentially by reduction to classification.
Fitting the Hindsight Path
Zadrozny & Langford (2003): on each tree, find the reward-maximizing path.
Fit T+1 classifiers to these paths.
Driving the classification error to zero is equivalent to finding the optimal policy.
Drawback: in stochastic problems, no classifier can predict the hindsight action choices.
[Figure: the same depth-3 trajectory tree, illustrating the hindsight (reward-maximizing) path.]
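The hindsight step can be sketched as an exhaustive search over a tree's action paths. A minimal sketch under the same assumed nested-dict tree layout; `best_path` is an illustrative name, not from the paper.

```python
# Illustrative sketch: find the reward-maximizing (hindsight) path in one tree.
# Each node is {"reward": {action: r}, "child": {action: node or None}}.

def best_path(node):
    """Return (max path reward, action list) over all root-to-leaf paths."""
    if node is None:  # past the horizon
        return 0.0, []
    best = None
    for a, r in node["reward"].items():
        sub_reward, sub_actions = best_path(node["child"][a])
        candidate = (r + sub_reward, [a] + sub_actions)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best
```

The T+1 classifiers are then fit to predict these hindsight actions from the observations along each tree's best path.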
Approximate Dynamic Programming Approach
Assume the policy class has the form:
Estimating π_T via tree pruning:
This is the empirical equivalent of:
Call the resulting policy π̂_T.
[Figure: tree pruning for the final stage: random actions are chosen at the earlier stages, and the single-stage RL problem is solved at the last stage.]
Approximate Dynamic Programming Approach
Estimating π_{T-1} given π̂_T via tree pruning:
This is the empirical equivalent of:
[Figure: tree pruning for stage T-1: random actions are chosen at earlier stages, rewards are propagated according to π̂_T, and the single-stage RL problem is solved at stage T-1.]
Approximate Dynamic Programming Approach
Estimating π_{T-2} = π_0 given π̂_{T-1} and π̂_T via tree pruning:
This is the empirical equivalent of:
[Figure: tree pruning for stage 0: rewards are propagated according to π̂_1 and π̂_2, and the single-stage RL problem is solved at the root.]
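The backward pass over stages can be sketched as follows. This is an illustrative sketch (not the authors' implementation): at stage t, each candidate decision rule is scored by the average propagated reward it obtains when the already-estimated rules π̂_{t+1}, …, π̂_T choose all later actions. The tree layout and names are assumptions.

```python
# Illustrative sketch of the stage-by-stage (Gauss-Seidel-style) backward pass.
# Each node is {"obs": ..., "reward": {action: r}, "child": {action: node or None}}.

def propagated_reward(node, rules, t):
    """Reward from stage t onward when rules[t], rules[t+1], ... pick actions."""
    if node is None:  # past the horizon
        return 0.0
    a = rules[t](node["obs"])
    return node["reward"][a] + propagated_reward(node["child"][a], rules, t + 1)

def fit_stage(subtrees, candidates, later_rules, t):
    """Pick the candidate stage-t rule with the best average propagated reward.

    subtrees: tree nodes reached at stage t (via the random earlier actions);
    later_rules: dict {t+1: pi_hat_{t+1}, ..., T: pi_hat_T}, already estimated.
    """
    def score(rule):
        rules = dict(later_rules)
        rules[t] = rule
        return sum(propagated_reward(n, rules, t) for n in subtrees) / len(subtrees)
    return max(candidates, key=score)
```

Solving the stages from T down to 0 yields one single-stage problem per stage, in place of the joint search over all T+1 decision rules.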
Reduction to Weighted Classification
Our approximate dynamic programming algorithm converts the multi-stage optimization problem into a sequence of single-stage optimization problems.
Unfortunately, each single-stage problem is still a combinatorial optimization problem.
Our solution: reduce it to learning classifiers with a convex surrogate.
This classification reduction is different from that of Langford & Zadrozny (2003).
Consider a single-stage RL problem with actions in {-1, +1}, and a class of real-valued functions F. Each f ∈ F induces a policy π_f(o) = sign(f(o)). The optimal action classifies (Blatt & Hero, NIPS 2005):
[Figure: a single-stage trajectory tree: from observation O0, action a0 = 1 leads to O1^1 and action a0 = -1 leads to O1^{-1}.]
Reduction to Weighted Classification
It is often much easier to solve the surrogate problem in which the 0-1 loss is replaced by a convex function φ. For example:
– In neural network training, φ is the truncated quadratic loss.
– In boosting, φ is the exponential loss.
– In support vector machines, φ is the hinge loss.
– In logistic regression, φ is the scaled deviance.
The effect of introducing φ is well understood for the classification problem, and the results can be applied to the single-stage RL problem as well.
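As a concrete illustration of the surrogate idea, the sketch below fits a linear score f(o) = w·o by batch gradient descent on a weighted logistic surrogate. The data layout, function name, step size, and iteration count are all assumptions for illustration, not the paper's algorithm.

```python
import math

# Illustrative sketch: weighted classification with a convex surrogate.
# Each sample is (features, label y in {-1, +1}, nonnegative weight); we
# minimize sum_i weight_i * log(1 + exp(-y_i * w.x_i)) -- a convex surrogate
# for the weighted 0-1 loss -- by batch gradient descent on w.

def fit_linear_policy(samples, dim, lr=0.1, iters=500):
    w = [0.0] * dim
    for _ in range(iters):
        grad = [0.0] * dim
        for x, y, weight in samples:
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # gradient of weight * log(1 + exp(-margin)) w.r.t. w_j is g * x_j:
            g = -y * weight / (1.0 + math.exp(margin))
            for j in range(dim):
                grad[j] += g * x[j]
        for j in range(dim):
            w[j] -= lr * grad[j] / len(samples)
    # the induced policy: pi_f(o) = sign(f(o))
    return lambda x: 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

Swapping the logistic loss for the exponential, hinge, or truncated quadratic loss recovers the boosting, SVM, and neural-network variants listed above.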
Reduction to Weighted Classification: Multi-Stage Problem
Let π̂ be the policy estimated by the approximate dynamic programming algorithm, where each single-stage RL problem is solved via φ-risk minimization.
Theorem 2: Assume P-dim(F_t) = d_t, t = 0, …, T. Then, with probability greater than 1 - δ over the set of trajectory trees, the bound holds for n satisfying the stated condition.
The proof uses recent results in P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, classification, and risk bounds,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, March 2006.
This bound is tighter than the analogous Q-learning bound (Murphy, JMLR 2005).
Landsat MSS Experiment
LANDSAT Multispectral Scanner (MSS): a multispectral scanning radiometer that was carried on board Landsat 1-5.
MSS data consists of four spectral bands:
1. Visible green
2. Visible red
3. Near-infrared 1
4. Near-infrared 2
The resolution of all bands is 79 meters, and the approximate scene size is 185 x 170 kilometers.
STATLOG Project (Michie et al., 1994): annotated dataset for testing classifier performance.
Data consists of 4435 training cases and 2000 test cases.
Each case is a 3x3x4 image stack in 36 dimensions having 1 class attribute.
There are 6 class labels:
1. Red soil
2. Cotton
3. Vegetation stubble
4. Gray soil
5. Damp gray soil
6. Very damp gray soil
Class sizes are unequal in both the training and test sets.
• For each image location we adopt a two-stage policy to classify its label:
  • Select one of the 6 possible pairs of the 4 MSS bands for initial illumination.
  • Based on the initial measurement, either:
    • make the final decision on the terrain class and stop, or
    • illuminate with the remaining two MSS bands and make the final decision.
• The reward is the average probability of a correct decision minus the stopping time (energy).
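The two-stage reward just described can be sketched for a single case as follows. This is an illustrative sketch: `episode_reward` and the stand-in classifier callables are assumed names, not part of the experiment's code.

```python
from itertools import combinations

# Illustrative sketch of one episode of the two-stage measurement policy:
# first measure a pair of the four MSS bands, then either classify and stop
# (reward = 1 if correct) or pay energy cost c for the remaining two bands
# and classify with all four (reward = 1 if correct, minus c).

BAND_PAIRS = list(combinations((1, 2, 3, 4), 2))  # the 6 possible initial pairs

def episode_reward(true_label, initial_pair, stop, classify_pair, classify_all, c):
    assert initial_pair in BAND_PAIRS
    if stop:  # classify from the initial pair only
        return 1.0 if classify_pair(initial_pair) == true_label else 0.0
    # acquire the remaining two bands, then classify with all four
    return (1.0 if classify_all() == true_label else 0.0) - c
```

Averaging this reward over cases gives the objective the scheduler maximizes: probability of correct decision minus expected measurement energy.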
Waveform Scheduling: CROPS
[Figure: scheduling tree for a new location. One of the six band pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) is measured first; the policy then either classifies immediately (Reward = I(correct)) or measures the remaining pair and then classifies (Reward = I(correct) - c).]
[Figure: optimal sub-band usage under energy constraints. Annotations mark the best two sensors, the best four sensors, and the regimes where the myopic policy is good and the non-myopic policy is better.]
Sub-band performance (probability of correct classification per clutter type, and overall Pc)

Bands   1     2     3     4     5     6     Pc
1+2     0.98  0.85  0.96  0.00  0.60  0.94  0.806
1+3     0.90  0.84  0.91  0.55  0.56  0.80  0.796
1+4     0.96  0.93  0.92  0.48  0.56  0.76  0.803
2+3     0.91  0.94  0.84  0.56  0.65  0.82  0.812
2+4     0.90  0.92  0.90  0.18  0.76  0.87  0.805
3+4     0.86  0.92  0.76  0.50  0.42  0.79  0.739
All     0.97  0.95  0.92  0.54  0.84  0.82  0.862
Best myopic choice.
Best non-myopic choice when likely to take more than one observation.
Sub-band optimal scheduling
Optimal initial sub-bands are 1+2.

Clutter type                      1     2     3     4     5     6     Pc
Performance of sub-bands 1+2      0.98  0.85  0.96  0.00  0.60  0.94  0.806
Performance of all sub-bands      0.97  0.95  0.92  0.54  0.84  0.82  0.862
Performance of optimal scheduler  0.98  0.94  0.93  0.51  0.84  0.82  0.861

Policy statistics: the scheduler uses the full spectrum only 60% of the time.
Alternative Comparisons
[Figure: probability of error (Pe, 0.095 to 0.135) versus number of dwells (1 to 2) for a neural network and a k-nearest-neighbor classifier. Curves are shown for additional-band costs C = 0 through C = 0.18; annotations mark the "Classify" versus "Additional bands" choice, the best myopic initial pair (2,3), the non-myopic initial pair (2,3), and the performance with all four bands.]
* C is the cost of using the additional two bands.
LANDSAT data: a total of 4 bands, each producing a 9-dimensional vector.
Conclusions
Elements of CROPS:
– A Gauss-Seidel-type DP approximation reduces the multi-stage problem to a sequence of single-stage RL problems.
– A classification reduction is used to solve each of these single-stage RL problems.
We obtained tight finite-sample generalization error bounds for RL based on classification theory.
The CROPS methodology was illustrated for energy-constrained landmine detection and waveform selection.
Publications
– Blatt D., “Adaptive Sensing in Uncertain Environments,” PhD thesis, Dept. of EECS, University of Michigan, 2006.
– Blatt D. and Hero A. O., “From weighted classification to policy search,” Nineteenth Conference on Neural Information Processing Systems (NIPS), 2005.
– Kreucher C., Blatt D., Hero A. O., and Kastella K., “Adaptive multi-modality sensor scheduling for detection and tracking of smart targets,” Digital Signal Processing, 2005.
– Blatt D., Murphy S. A., and Zhu J., “A-learning for Approximate Planning,” Technical Report 04-63, The Methodology Center, Pennsylvania State University, 2004.