Optimal Sensor Scheduling via Classification Reduction of Policy Search (CROPS)
DESCRIPTION
Optimal Sensor Scheduling via Classification Reduction of Policy Search (CROPS). MURI Workshop 2006. Doron Blatt and Alfred Hero, University of Michigan. Motivating example: landmine confirmation/detection with EMI, GPR, and seismic sensors.
TRANSCRIPT
Optimal Sensor Scheduling via Classification Reduction of Policy Search (CROPS)
MURI Workshop 2006
Doron Blatt and Alfred Hero
University of Michigan
Motivating Example: Landmine Confirmation/Detection
A vehicle carries three sensors for land-mine detection, each with its own characteristics.
The goal is to optimally schedule the three sensors for mine detection.
This is a sequential choice of experiment problem (DeGroot 1970).
The optimal policy maximizes the average reward.
[Figure: sensor-scheduling decision tree for a new location. Possible targets: plastic anti-personnel mine, plastic anti-tank mine, nail, rock. The EMI, GPR, and seismic sensors are scheduled in sequence; after each measurement path (EMI data, GPR data, seismic data) the policy either schedules another sensor or makes the final detection.]
Reinforcement Learning
General objective: to find optimal policies for controlling stochastic decision processes:
– without an explicit model.
– when the exact solution is intractable.
Applications:
– Sensor scheduling.
– Treatment design.
– Elevator dispatching.
– Robotics.
– Electric power system control.
– Job-shop scheduling.
Learning from Generative Models
It is possible to evaluate the value of any policy from trajectory trees:
Let R_i(π) be the sum of rewards on the path that agrees with policy π on the i-th tree. Then, the value estimate is V̂_n(π) = (1/n) Σ_{i=1}^n R_i(π).
[Figure: a trajectory tree of depth 3. The root observation O0 branches on action a0 ∈ {0, 1}; each subsequent node branches on a1 and a2, yielding observations O1, O2, O3 indexed by the action history (e.g., O3^011).]
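The trajectory-tree value estimate described above can be sketched in code. This is an illustrative sketch, not the authors' implementation: the tree layout (nested dicts with "obs", per-action "reward", and per-action "child" entries) and the function names are assumptions.

```python
# Illustrative sketch: evaluate a policy on trajectory trees.
# Each node is {"obs": ..., "reward": {action: r}, "child": {action: node or None}}.

def path_reward(node, policy, t=0):
    """Sum of rewards on the path that agrees with `policy` in one tree (R_i)."""
    if node is None:  # past the horizon
        return 0.0
    a = policy(t, node["obs"])  # the policy picks an action from the observation
    return node["reward"][a] + path_reward(node["child"][a], policy, t + 1)

def value_estimate(trees, policy):
    """Average of R_i(pi) over the n trajectory trees."""
    return sum(path_reward(root, policy) for root in trees) / len(trees)
```

Because each tree stores the outcomes of every action sequence, the same n trees can be reused to score every policy in the class.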
Three Sources of Error in RL
– Coupling of optimal decisions at each stage: finding the optimal decision rule at a certain stage hinges on knowing the optimal decision rules for future stages.
– Misallocation of approximation resources to the state space: without knowing the optimal policy, one cannot sample from the distribution that it induces on the stochastic system's state space.
– Inadequate control of generalization errors: without a model, ensemble averages must be approximated from training trajectories.
– J. Bagnell, S. Kakade, A. Ng, and J. Schneider, “Policy search by dynamic programming,” in Advances in Neural Information Processing Systems, vol. 16. 2003.
– A. Fern, S. Yoon, and R. Givan, “Approximate policy iteration with a policy language bias,” in Advances in Neural Information Processing Systems, vol. 16, 2003.
– M. Lagoudakis and R. Parr, “Reinforcement learning as classification: Leveraging modern classifiers,” in Proceedings of the Twentieth International Conference on Machine Learning, 2003.
– J. Langford and B. Zadrozny, “Reducing T-step reinforcement learning to classification,” http://hunch.net/~jl/projects/reductions/reductions.html, 2003.
– M. Kearns, Y. Mansour, and A. Ng, “Approximate planning in large POMDPs via reusable trajectories,” in Advances in Neural Information Processing Systems, vol. 12. MIT Press, 2000.
– S. A. Murphy, “A generalization error for Q-learning,” Journal of Machine Learning Research, vol. 6, pp. 1073–1097, 2005.
Learning from Generative Models
Drawbacks:
– The combinatorial optimization problem can only be solved for small n and small policy classes.
Our remedies:
– Break the multi-stage search problem into a sequence of single-stage optimization problems.
– Use a convex surrogate to simplify each optimization problem.
We will obtain generalization bounds similar to (Kearns et al., 2000) that apply to the case in which the decision rules are estimated sequentially by reduction to classification.
Fitting the Hindsight Path
Zadrozny & Langford (2003): on each tree, find the reward-maximizing path.
Fit T+1 classifiers to these paths.
Driving the classification error to zero is equivalent to finding the optimal policy.
Drawback: in stochastic problems, no classifier can predict the hindsight action choices.
[Figure: the same depth-3 trajectory tree, illustrating the hindsight (reward-maximizing) path.]
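The hindsight step can be sketched as an exhaustive search over a tree's action paths. A minimal sketch under the same assumed nested-dict tree layout; `best_path` is an illustrative name, not from the paper.

```python
# Illustrative sketch: find the reward-maximizing (hindsight) path in one tree.
# Each node is {"reward": {action: r}, "child": {action: node or None}}.

def best_path(node):
    """Return (max path reward, action list) over all root-to-leaf paths."""
    if node is None:  # past the horizon
        return 0.0, []
    best = None
    for a, r in node["reward"].items():
        sub_reward, sub_actions = best_path(node["child"][a])
        candidate = (r + sub_reward, [a] + sub_actions)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best
```

The T+1 classifiers are then fit to predict these hindsight actions from the observations along each tree's best path.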
Approximate Dynamic Programming Approach
Assume the policy class has the form:
Estimating π_T via tree pruning:
This is the empirical equivalent of:
Call the resulting policy π̂_T.
[Figure: tree pruning for the final stage: random actions are chosen at the earlier stages, and the single-stage RL problem is solved at the last stage.]
Approximate Dynamic Programming Approach
Estimating π_{T-1} given π̂_T via tree pruning:
This is the empirical equivalent of:
[Figure: tree pruning for stage T-1: random actions are chosen at earlier stages, rewards are propagated according to π̂_T, and the single-stage RL problem is solved at stage T-1.]
Approximate Dynamic Programming Approach
Estimating π_{T-2} = π_0 given π̂_{T-1} and π̂_T via tree pruning:
This is the empirical equivalent of:
[Figure: tree pruning for stage 0: rewards are propagated according to π̂_1 and π̂_2, and the single-stage RL problem is solved at the root.]
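The backward pass over stages can be sketched as follows. This is an illustrative sketch (not the authors' implementation): at stage t, each candidate decision rule is scored by the average propagated reward it obtains when the already-estimated rules π̂_{t+1}, …, π̂_T choose all later actions. The tree layout and names are assumptions.

```python
# Illustrative sketch of the stage-by-stage (Gauss-Seidel-style) backward pass.
# Each node is {"obs": ..., "reward": {action: r}, "child": {action: node or None}}.

def propagated_reward(node, rules, t):
    """Reward from stage t onward when rules[t], rules[t+1], ... pick actions."""
    if node is None:  # past the horizon
        return 0.0
    a = rules[t](node["obs"])
    return node["reward"][a] + propagated_reward(node["child"][a], rules, t + 1)

def fit_stage(subtrees, candidates, later_rules, t):
    """Pick the candidate stage-t rule with the best average propagated reward.

    subtrees: tree nodes reached at stage t (via the random earlier actions);
    later_rules: dict {t+1: pi_hat_{t+1}, ..., T: pi_hat_T}, already estimated.
    """
    def score(rule):
        rules = dict(later_rules)
        rules[t] = rule
        return sum(propagated_reward(n, rules, t) for n in subtrees) / len(subtrees)
    return max(candidates, key=score)
```

Solving the stages from T down to 0 yields one single-stage problem per stage, in place of the joint search over all T+1 decision rules.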
Reduction to Weighted Classification
Our approximate dynamic programming algorithm converts the multi-stage optimization problem into a sequence of single-stage optimization problems.
Unfortunately, each single-stage problem is still a combinatorial optimization problem.
Our solution: reduce it to learning classifiers with a convex surrogate.
This classification reduction is different from that of Langford & Zadrozny (2003).
Consider a single-stage RL problem with actions in {-1, +1}, and a class of real-valued functions F. Each f ∈ F induces a policy π_f(o) = sign(f(o)). The optimal action classifies (Blatt & Hero, NIPS 2005):
[Figure: a single-stage trajectory tree: from observation O0, action a0 = 1 leads to O1^1 and action a0 = -1 leads to O1^{-1}.]
Reduction to Weighted Classification
It is often much easier to solve the surrogate problem in which the 0-1 loss is replaced by a convex function φ. For example:
– In neural network training, φ is the truncated quadratic loss.
– In boosting, φ is the exponential loss.
– In support vector machines, φ is the hinge loss.
– In logistic regression, φ is the scaled deviance.
The effect of introducing φ is well understood for the classification problem, and the results can be applied to the single-stage RL problem as well.
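As a concrete illustration of the surrogate idea, the sketch below fits a linear score f(o) = w·o by batch gradient descent on a weighted logistic surrogate. The data layout, function name, step size, and iteration count are all assumptions for illustration, not the paper's algorithm.

```python
import math

# Illustrative sketch: weighted classification with a convex surrogate.
# Each sample is (features, label y in {-1, +1}, nonnegative weight); we
# minimize sum_i weight_i * log(1 + exp(-y_i * w.x_i)) -- a convex surrogate
# for the weighted 0-1 loss -- by batch gradient descent on w.

def fit_linear_policy(samples, dim, lr=0.1, iters=500):
    w = [0.0] * dim
    for _ in range(iters):
        grad = [0.0] * dim
        for x, y, weight in samples:
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # gradient of weight * log(1 + exp(-margin)) w.r.t. w_j is g * x_j:
            g = -y * weight / (1.0 + math.exp(margin))
            for j in range(dim):
                grad[j] += g * x[j]
        for j in range(dim):
            w[j] -= lr * grad[j] / len(samples)
    # the induced policy: pi_f(o) = sign(f(o))
    return lambda x: 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

Swapping the logistic loss for the exponential, hinge, or truncated quadratic loss recovers the boosting, SVM, and neural-network variants listed above.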
Reduction to Weighted Classification: Multi-Stage Problem
Let π̂ be the policy estimated by the approximate dynamic programming algorithm, where each single-stage RL problem is solved via φ-risk minimization.
Theorem 2: Assume P-dim(F_t) = d_t, t = 0, …, T. Then, with probability greater than 1 - δ over the set of trajectory trees, the bound holds for n satisfying the stated condition.
The proof uses recent results in P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, classification, and risk bounds,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, March 2006.
This bound is tighter than the analogous Q-learning bound (Murphy, JMLR 2005).
Landsat MSS Experiment
LANDSAT Multispectral Scanner (MSS): a multispectral scanning radiometer that was carried on board Landsat 1-5.
MSS data consists of four spectral bands:
1. Visible green
2. Visible red
3. Near-infrared 1
4. Near-infrared 2
The resolution of all bands is 79 meters, and the approximate scene size is 185 x 170 kilometers.
STATLOG Project (Michie et al., 1994): annotated dataset for testing classifier performance.
Data consists of 4435 training cases and 2000 test cases.
Each case is a 3x3x4 image stack in 36 dimensions having 1 class attribute.
There are 6 class labels:
1. Red soil
2. Cotton
3. Vegetation stubble
4. Gray soil
5. Damp gray soil
6. Very damp gray soil
Class sizes are unequal in both the training and test sets.
• For each image location we adopt a two-stage policy to classify its label:
  • Select one of the 6 possible pairs of the 4 MSS bands for initial illumination.
  • Based on the initial measurement, either:
    • make the final decision on the terrain class and stop, or
    • illuminate with the remaining two MSS bands and make the final decision.
• The reward is the average probability of a correct decision minus the stopping time (energy).
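The two-stage reward just described can be sketched for a single case as follows. This is an illustrative sketch: `episode_reward` and the stand-in classifier callables are assumed names, not part of the experiment's code.

```python
from itertools import combinations

# Illustrative sketch of one episode of the two-stage measurement policy:
# first measure a pair of the four MSS bands, then either classify and stop
# (reward = 1 if correct) or pay energy cost c for the remaining two bands
# and classify with all four (reward = 1 if correct, minus c).

BAND_PAIRS = list(combinations((1, 2, 3, 4), 2))  # the 6 possible initial pairs

def episode_reward(true_label, initial_pair, stop, classify_pair, classify_all, c):
    assert initial_pair in BAND_PAIRS
    if stop:  # classify from the initial pair only
        return 1.0 if classify_pair(initial_pair) == true_label else 0.0
    # acquire the remaining two bands, then classify with all four
    return (1.0 if classify_all() == true_label else 0.0) - c
```

Averaging this reward over cases gives the objective the scheduler maximizes: probability of correct decision minus expected measurement energy.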
Waveform Scheduling: CROPS
[Figure: scheduling tree for a new location. One of the six band pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) is measured first; the policy then either classifies immediately (Reward = I(correct)) or measures the remaining pair and then classifies (Reward = I(correct) - c).]
[Figure: optimal sub-band usage under energy constraints. Annotations mark the best two sensors, the best four sensors, and the regimes where the myopic policy is good and the non-myopic policy is better.]
Sub-band performance (probability of correct classification per clutter type, and overall Pc)

Bands   1     2     3     4     5     6     Pc
1+2     0.98  0.85  0.96  0.00  0.60  0.94  0.806
1+3     0.90  0.84  0.91  0.55  0.56  0.80  0.796
1+4     0.96  0.93  0.92  0.48  0.56  0.76  0.803
2+3     0.91  0.94  0.84  0.56  0.65  0.82  0.812
2+4     0.90  0.92  0.90  0.18  0.76  0.87  0.805
3+4     0.86  0.92  0.76  0.50  0.42  0.79  0.739
All     0.97  0.95  0.92  0.54  0.84  0.82  0.862
Best myopic choice.
Best non-myopic choice when likely to take more than one observation.
Sub-band optimal scheduling
Optimal initial sub-bands are 1+2.

Clutter type                      1     2     3     4     5     6     Pc
Performance of sub-bands 1+2      0.98  0.85  0.96  0.00  0.60  0.94  0.806
Performance of all sub-bands      0.97  0.95  0.92  0.54  0.84  0.82  0.862
Performance of optimal scheduler  0.98  0.94  0.93  0.51  0.84  0.82  0.861

Policy statistics: the scheduler uses the full spectrum only 60% of the time.
Alternative Comparisons
[Figure: probability of error (Pe, 0.095 to 0.135) versus number of dwells (1 to 2) for a neural network and a k-nearest-neighbor classifier. Curves are shown for additional-band costs C = 0 through C = 0.18; annotations mark the "Classify" versus "Additional bands" choice, the best myopic initial pair (2,3), the non-myopic initial pair (2,3), and the performance with all four bands.]
* C is the cost of using the additional two bands.
LANDSAT data: a total of 4 bands, each producing a 9-dimensional vector.
Conclusions
Elements of CROPS:
– A Gauss-Seidel-type DP approximation reduces the multi-stage problem to a sequence of single-stage RL problems.
– A classification reduction is used to solve each of these single-stage RL problems.
We obtained tight finite-sample generalization error bounds for RL based on classification theory.
The CROPS methodology was illustrated for energy-constrained landmine detection and waveform selection.
Publications
– Blatt D., “Adaptive Sensing in Uncertain Environments,” PhD thesis, Dept. of EECS, University of Michigan, 2006.
– Blatt D. and Hero A. O., “From weighted classification to policy search,” Nineteenth Conference on Neural Information Processing Systems (NIPS), 2005.
– Kreucher C., Blatt D., Hero A. O., and Kastella K., “Adaptive multi-modality sensor scheduling for detection and tracking of smart targets,” Digital Signal Processing, 2005.
– Blatt D., Murphy S. A., and Zhu J., “A-learning for Approximate Planning,” Technical Report 04-63, The Methodology Center, Pennsylvania State University, 2004.