A Real Time Approximate Dynamic Programming Algorithm for Planning
Nikolaos E. Pratikakis, Jay H. Lee, and Matthew J. Realff
Agenda
Motivation
Background information
The Curse of Dimensionality
Algorithm
Exploitation vs. Exploration
Results
Conclusions and Future Directions
Motivation – Capacity Planning (1)
[Figure: job-shop flow diagram with queues at Main Processing (Station 1), the Testing Area (Station 2), and the Reconstruction Area (Station 3); completed jobs serve the demand D, and a fraction R (the recirculation rate) of jobs recirculates while the remaining 1-R proceed.]
Motivation – Capacity Planning (2)
MIP with a deterministic future is well studied.
MIP with an uncertain future runs into computational bottlenecks:
(a) Solving for an expected value by sampling the future. This misses the opportunity to revise the actions depending on the state.
(b) Solving the full problem where actions can depend on the state. This requires branching over the future scenarios:
• the number of branching points and scenarios grows quickly, and
• fairly restrictive assumptions are needed about how the actions and the future interact.
A rolling horizon is a compromise between (a) and (b).
Our solution strategy relates to Approximate Dynamic Programming (ADP). Key advantage: ADP is based on a procedural representation of the problem (simulation code), whereas MIP must declare the alternatives explicitly before it starts.
Background Information
Markov Decision Processes (MDP): a mathematical representation of a sequential decision-making problem in which:
• A system evolves through time.
• A decision maker controls it by taking actions at pre-specified points in time.
• Actions incur immediate costs or rewards and affect the subsequent system state.
Basic model ingredients:
• State space S (generic state s)
• Action space A (generic action α)
• Rewards r(s, α)
• Transition probabilities P(s' | s, α)
A model is called stationary if the rewards and transition probabilities are independent of t.
Dynamic Programming (DP) is the computational tool to address MDPs.
Model Ingredients
State space:
• Queue lengths (wi)
• Finished stock (St)
• States of the random variables (D, R)
A conservative estimate gives more than 1 billion discrete states.
Action space:
• Number of machines available at each stage
• Percentage of machines used at each stage
More than 1 million discrete controls per state.
To achieve substantial performance:
• Meet demand.
• The stock level is controlled around SSP.
• The queue levels (w2, w3) are minimized.
Transition equations are material balances (see the sketch below).
Demand and recirculation are modeled as a first-order Markov model.
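A minimal sketch of what one such time step could look like, under an assumed flow topology; the function name, the `served` argument, and the balance equations below are illustrative stand-ins, not the paper's exact model:

```python
def step(w, St, D, R, served):
    """One hypothetical time step via material balances. The flow topology,
    names, and rounding below are assumptions for illustration only.
    w: dict of queue lengths; served: jobs processed per station this period."""
    out1 = min(w["w1"], served["w1"])           # station 1 (main processing) output
    out2 = min(w["w2"] + out1, served["w2"])    # station 2 (testing) output
    recirc = int(round(R * out2))               # fraction R is recirculated
    out3 = min(w["w3"] + recirc, served["w3"])  # station 3 (reconstruction) output
    w_next = {
        "w1": w["w1"] - out1,                   # plus any new arrivals
        "w2": w["w2"] + out1 - out2,
        "w3": w["w3"] + recirc - out3,
    }
    completed = (out2 - recirc) + out3          # jobs leaving the system
    St_next = St + completed - min(D, St + completed)  # stock after serving demand
    return w_next, St_next
```

The (D, R) pair would then be advanced by sampling its first-order Markov transition before the next step.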
Formal Definition of Value Function
Given a policy $\pi$, the value function of state $s_0$ is the expected reward
$$J^{\pi}(s_0) = E\Big[\sum_{t \ge 0} r(s_t, \pi(s_t)) \,\Big|\, s_0\Big]$$
The optimal value function corresponds to $J^{*}(s) = \max_{\pi} J^{\pi}(s)$.
Optimal value functions are the solution of the optimality equations
$$J^{*}(s) = \max_{\alpha}\Big\{\, r(s, \alpha) + \sum_{s'} P(s' \mid s, \alpha)\, J^{*}(s') \,\Big\}$$
The optimal action can easily be computed if one knows the optimal value function for all the states.
[Figure: grid-world illustration with a starting state and a goal state (*).]
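On a small MDP the optimality equations can be solved exactly by successive approximation; a minimal sketch, assuming dictionary-based rewards and transitions and a toy discount factor added only so the iteration converges (the paper's state space is far too large for this exact approach):

```python
def value_iteration(states, actions, r, P, gamma=0.95, tol=1e-6):
    """Solve the optimality equations by repeated Bellman backups.
    r[s][a] = immediate reward; P[s][a] = {s_next: prob}. The discount
    gamma is included only to guarantee convergence of this toy sketch."""
    J = {s: 0.0 for s in states}                 # initial value function
    while True:
        delta = 0.0
        for s in states:
            backup = max(
                r[s][a] + gamma * sum(p * J[sn] for sn, p in P[s][a].items())
                for a in actions
            )
            delta = max(delta, abs(backup - J[s]))
            J[s] = backup
        if delta < tol:                          # stop once values stabilize
            return J
```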
The Curse of Dimensionality – Motivation for ADP
The cardinality of the state space:
• Approximate Dynamic Programming (Lee et al., 2004)
• Explicit, Explore or Exploit (E3) (Kearns, 1998)
• Real Time Dynamic Programming (Barto, 1995)
The cardinality of the action space:
• Little effort in the literature targets this source of computational bottleneck (convergence cannot be guaranteed).
The calculation of the expectation over all dimensions of the random quantities:
• Stochastic gradient method (Powell, 2005)
• Monte Carlo sampling (see the sketch below)
Bellman equation:
$$J(s) = \max_{\alpha \in A}\Big\{\, r(s, \alpha) + \sum_{s' \in S} P(s' \mid s, \alpha)\, J(s') \,\Big\}$$
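A hedged sketch of the Monte Carlo route: the exact expectation in the Bellman backup is replaced by an average over simulated transitions. The `simulate(s, a)` interface, returning one sampled next state and reward, is an assumption:

```python
def sampled_backup(s, actions, simulate, J, n_samples=50):
    """Approximate Bellman backup: the expectation over transitions is
    replaced by a Monte Carlo average of simulated outcomes."""
    best_value, best_action = float("-inf"), None
    for a in actions:
        total = 0.0
        for _ in range(n_samples):
            s_next, reward = simulate(s, a)   # one sampled transition
            total += reward + J.get(s_next, 0.0)
        q = total / n_samples                 # Monte Carlo estimate of Q(s, a)
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action
```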
Trial-Based Real Time Dynamic Programming (Barto, 1995)
The algorithm proposed in Barto et al. (1995) to address stochastic shortest path problems (one goal state and a finite but large state space).
Inputs:
Initialize number of trials n.
Initialize the value functions for all the states in S
[Figure: grid-world illustration of a trial path from the starting state to the goal state.]
Basic steps for each trial (a sketch follows the list):
1. Start from the starting state s_t.
2. Use the Bellman equation to pick the greedy action α*.
3. a) Simulate the system using α*. b) Update J(s_t).
4. Set s_t ← s_{t+1}.
5. End at the goal state.
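A minimal sketch of one such trial, reusing the `sampled_backup` sketch from above; the `simulate` interface and the goal test are assumptions:

```python
def rtdp_trial(start, goal, actions, simulate, J):
    """One trial of trial-based RTDP: pick the greedy action via a
    (sampled) Bellman backup, update the visited state, move on."""
    s = start
    while s != goal:
        value, a_star = sampled_backup(s, actions, simulate, J)  # steps 2-3
        J[s] = value                 # step 3b: update J(s_t)
        s, _ = simulate(s, a_star)   # steps 3a and 4: s_t <- s_{t+1}
    return J
```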
On Modifying RTDP
The concept of the relevant state space (S_REL):
• We want to solve for the value function only in the region where the system normally operates.
• A greedy heuristic, for example, is a good initial policy for defining that operating space.
RTDP modifications:
• The state space is "evolved" starting from an empty set.
• The "adaptive action set".
• Exploration vs. exploitation, tuned through the initialization of the value function for unseen successor states.
• The use of k-NN (local value function approximators).
Schematic Illustration Concerning State Space Terminology
$$S' = \{\, s_i \in S : S_t > \theta_1 \ \text{or}\ w > \theta_2 \,\}$$
The Proposed Method to Overcome the COD in the State, Action, and Uncertainty Space
[Figure: from the initial state s_i, each candidate optimal action in the adaptive action set A_sub is evaluated using the Bellman equation; the uncertainty is handled by sampling from the possible transitions to successive states s_j, yielding the candidate optimal action α*.]
Pratikakis, N.E., Realff, M.J., and Lee, J.H., "Strategic Capacity Decisions in Manufacturing Using Real-Time Adaptive Dynamic Programming", submitted to Naval Research Logistics.
The Adaptive Action Set
Designed to circumvent the curse of dimensionality in the action space.
The idea: use optimization and heuristics to selectively choose a small number of controls for each state.
How to construct the A_sub ⊆ A? From four sources (a sketch follows the list):
• Heuristic actions
• Mathematical programming actions
• Random actions
• Best known actions
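A hedged sketch of such a construction; the four helper callables (`heuristic`, `mip_solve`, `sample_actions`, `best_known`) are hypothetical placeholders for the slide's four sources:

```python
def adaptive_action_set(s, heuristic, mip_solve, sample_actions, best_known,
                        n_random=5):
    """Assemble a small candidate set A_sub for state s from four sources,
    instead of enumerating the full action space A."""
    A_sub = set()
    A_sub.add(heuristic(s))                    # a heuristic action
    A_sub.add(mip_solve(s))                    # a mathematical-programming action
    A_sub.update(sample_actions(s, n_random))  # random actions drawn from A
    A_sub.update(best_known.get(s, ()))        # best actions found so far for s
    return A_sub
```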
On Evaluating the Actions in the A_sub
1st scenario: all s_j's belong to S_REL. Retrieve J(s_j) from the look-up table.
2nd scenario: some of the s_j's do not belong to S_REL. Use the k-NN estimate defined below (a code sketch follows the formulas).
3rd scenario: what if |N(s_j)| < k? Fall back to the initial estimation schemes:
• underestimating the optimal value function for all s_j, or
• overestimating the optimal value function for all s_j.
$$J(s_i) = \max_{a \in A_{sub}} \Big\{\, r(s_i, a) + \sum_{s_j \in S_{REL}} P(s_j \mid s_i, a)\, J(s_j) \,\Big\}$$
$$\text{Find } N(s_j) \stackrel{\text{def}}{=} \{\, s_i \in S_{REL} : d(s_i, s_j) \le \varepsilon \,\}, \qquad d(s_i, s_j) = (s_i - s_j)^{T} W (s_i - s_j)$$
$$\text{If } |N(s_j)| \ge k: \quad J(s_j) = \frac{1}{k} \sum_{x \in N(s_j)} J(x)$$
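A minimal sketch of this k-NN estimate, assuming states are stored as tuples of numbers and that the k nearest stored values are averaged; the weighting matrix W and threshold eps follow the formulas above:

```python
import numpy as np

def knn_estimate(s_j, explored, J, W, eps, k):
    """Local k-NN value estimate for an unseen state s_j (sketch).
    explored: visited states in S_REL, as tuples; J: value look-up table."""
    target = np.asarray(s_j, dtype=float)
    def dist(x):                                  # weighted quadratic distance
        diff = np.asarray(x, dtype=float) - target
        return float(diff @ W @ diff)
    neighbors = [x for x in explored if dist(x) <= eps]
    if len(neighbors) < k:                        # Scenario 3: too few neighbors,
        return None                               # fall back to initialization
    nearest = sorted(neighbors, key=dist)[:k]     # average the k nearest values
    return sum(J[x] for x in nearest) / k
```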
Exploration vs. Exploitation
The initialization of the state values turns out to be an important parameter of the algorithm for controlling exploitation vs. exploration (a toy sketch follows this slide):
• A consistent underestimation scheme as the initialization of the optimal value function intuitively leads to minimum exploration.
• A consistent overestimation scheme as the initialization of the optimal value function intuitively leads to maximum exploration.
[Figure: |S'| versus the number of iterations for simulation with RTADP, comparing the over-estimation and under-estimation variants.]
a) RTADP with the under-estimator picks α* = α1 (a bias toward actions that lead to already-explored space).
b) RTADP with the over-estimator picks α* = α2 (a bias toward actions that lead to unexplored regions of the state space).
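A toy sketch of the three initialization schemes from the slides; the constants are illustrative assumptions, not the paper's values:

```python
def initial_value(scheme, under=0.0, prior=-10.0, over=1e6):
    """Initial J for an unseen successor state (illustrative constants only).
    A low value biases the greedy step toward already-explored states;
    a high value makes unexplored states look attractive, forcing exploration."""
    return {"under": under, "prior": prior, "over": over}[scheme]
```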
Results
RTADP with different exploration strategies
Comparison: MIP with full information – RTADP – heuristic – MIP with a rolling horizon
Behavior of the architectures in time series (smooth stock-level control)
RTADP Combined with Different Exploration Strategies
Value-function initialization used by each strategy:

Strategy | Estimate for s_j with no neighbors (Scenario 3) | Estimate for s_j that belong to S'
RTADP – a (under-estimator) | 0 | 0
RTADP – b (prior knowledge) | 0 | -10
RTADP – c (over-estimator) | $J^0(s_j) \ge J^*(s_j)$ | $J^0(s_j) \ge J^*(s_j)$
RTADP Combined with Different Exploration Strategies (2)
[Figure panels: RTADP + under-estimation, RTADP + prior knowledge, RTADP + over-estimation.]
Performance Comparison
13.3% performance gap between RTADP – a and the MIP with full information.
Is multistage uncertainty not very dominant here?
The Control of Stock Level Using RTADP in Time Series
Conclusions and Future Directions
RTADP was successfully implemented for this manufacturing job shop.
It alleviates the curse of dimensionality through:
• the adaptive action set
• the evolving state space
Use of k-NN as a local approximator.
The performance gap against an upper-bound solution is 13.3%.
Importance of multistage uncertainty.
The initialization of the optimal value function can be used as a tuning parameter for exploitation vs. exploration.
Possible extensions:
• Incorporate coherent risk measures into the objective function to shape the profit (cost) distribution accordingly.
Acknowledgements
Advisors:
Jay H. Lee
Matthew J. Realff
ISSICL Group members
Financial support: NSF (CTS #03019993)