A Real Time Approximate Dynamic Programming Algorithm for Planning
Nikolaos E. Pratikakis, Jay H. Lee, and Matthew J. Realff
Agenda
Motivation
Background information
The Curse of Dimensionality
Algorithm
Exploitation vs. Exploration
Results
Conclusions and Future Directions
Motivation – Capacity Planning (1)
[Figure: job-shop flow diagram with queues at Main Processing (Station 1), the Testing Area (Station 2), and the Reconstruction Area (Station 3); completed jobs serve the demand D, and a fraction R (the recirculation rate) of jobs recirculates while the remaining 1-R proceed.]
Motivation – Capacity Planning (2)
MIP with a deterministic future is well studied.
MIP with an uncertain future runs into computational bottlenecks:
(a) Solving for an expected value by sampling the future. This misses the opportunity to revise the actions depending on the state.
(b) Solving the full problem where actions can depend on the state. This requires branching over the future scenarios:
• the number of branching points and scenarios grows quickly, and
• fairly restrictive assumptions are needed about how the actions and the future interact.
A rolling horizon is a compromise between (a) and (b).
Our solution strategy relates to Approximate Dynamic Programming (ADP). Key advantage: ADP is based on a procedural representation of the problem (simulation code), whereas MIP must declare the alternatives explicitly before it starts.
Background Information
Markov Decision Processes (MDP): a mathematical representation of a sequential decision-making problem in which:
• A system evolves through time.
• A decision maker controls it by taking actions at pre-specified points in time.
• Actions incur immediate costs or rewards and affect the subsequent system state.
Basic model ingredients:
• State space S (generic state s)
• Action space A (generic action α)
• Rewards r(s, α)
• Transition probabilities P(s' | s, α)
A model is called stationary if the rewards and transition probabilities are independent of t.
Dynamic Programming (DP) is the computational tool to address MDPs.
Model Ingredients
State space:
• Queue lengths (wi)
• Finished stock (St)
• States of the random variables (D, R)
A conservative estimate gives more than 1 billion discrete states.
Action space:
• Number of machines available at each stage
• Percentage of machines used at each stage
More than 1 million discrete controls per state.
To achieve substantial performance:
• Meet demand.
• The stock level is controlled around SSP.
• The queue levels (w2, w3) are minimized.
Transition equations are material balances (see the sketch below).
Demand and recirculation are modeled as a first-order Markov model.
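A minimal sketch of what one such time step could look like, under an assumed flow topology; the function name, the `served` argument, and the balance equations below are illustrative stand-ins, not the paper's exact model:

```python
def step(w, St, D, R, served):
    """One hypothetical time step via material balances. The flow topology,
    names, and rounding below are assumptions for illustration only.
    w: dict of queue lengths; served: jobs processed per station this period."""
    out1 = min(w["w1"], served["w1"])           # station 1 (main processing) output
    out2 = min(w["w2"] + out1, served["w2"])    # station 2 (testing) output
    recirc = int(round(R * out2))               # fraction R is recirculated
    out3 = min(w["w3"] + recirc, served["w3"])  # station 3 (reconstruction) output
    w_next = {
        "w1": w["w1"] - out1,                   # plus any new arrivals
        "w2": w["w2"] + out1 - out2,
        "w3": w["w3"] + recirc - out3,
    }
    completed = (out2 - recirc) + out3          # jobs leaving the system
    St_next = St + completed - min(D, St + completed)  # stock after serving demand
    return w_next, St_next
```

The (D, R) pair would then be advanced by sampling its first-order Markov transition before the next step.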
Formal Definition of Value Function
Given a policy $\pi$, the value function of state $s_0$ is the expected reward
$$J^{\pi}(s_0) = E\Big[\sum_{t \ge 0} r(s_t, \pi(s_t)) \,\Big|\, s_0\Big]$$
The optimal value function corresponds to $J^{*}(s) = \max_{\pi} J^{\pi}(s)$.
Optimal value functions are the solution of the optimality equations
$$J^{*}(s) = \max_{\alpha}\Big\{\, r(s, \alpha) + \sum_{s'} P(s' \mid s, \alpha)\, J^{*}(s') \,\Big\}$$
The optimal action can easily be computed if one knows the optimal value function for all the states.
[Figure: grid-world illustration with a starting state and a goal state (*).]
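On a small MDP the optimality equations can be solved exactly by successive approximation; a minimal sketch, assuming dictionary-based rewards and transitions and a toy discount factor added only so the iteration converges (the paper's state space is far too large for this exact approach):

```python
def value_iteration(states, actions, r, P, gamma=0.95, tol=1e-6):
    """Solve the optimality equations by repeated Bellman backups.
    r[s][a] = immediate reward; P[s][a] = {s_next: prob}. The discount
    gamma is included only to guarantee convergence of this toy sketch."""
    J = {s: 0.0 for s in states}                 # initial value function
    while True:
        delta = 0.0
        for s in states:
            backup = max(
                r[s][a] + gamma * sum(p * J[sn] for sn, p in P[s][a].items())
                for a in actions
            )
            delta = max(delta, abs(backup - J[s]))
            J[s] = backup
        if delta < tol:                          # stop once values stabilize
            return J
```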
The Curse of Dimensionality – Motivation for ADP
The cardinality of the state space:
• Approximate Dynamic Programming (Lee et al., 2004)
• Explicit, Explore or Exploit (E3) (Kearns, 1998)
• Real Time Dynamic Programming (Barto, 1995)
The cardinality of the action space:
• Little effort in the literature targets this source of computational bottleneck (convergence cannot be guaranteed).
The calculation of the expectation over all dimensions of the random quantities:
• Stochastic gradient method (Powell, 2005)
• Monte Carlo sampling (see the sketch below)
Bellman equation:
$$J(s) = \max_{\alpha \in A}\Big\{\, r(s, \alpha) + \sum_{s' \in S} P(s' \mid s, \alpha)\, J(s') \,\Big\}$$
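A hedged sketch of the Monte Carlo route: the exact expectation in the Bellman backup is replaced by an average over simulated transitions. The `simulate(s, a)` interface, returning one sampled next state and reward, is an assumption:

```python
def sampled_backup(s, actions, simulate, J, n_samples=50):
    """Approximate Bellman backup: the expectation over transitions is
    replaced by a Monte Carlo average of simulated outcomes."""
    best_value, best_action = float("-inf"), None
    for a in actions:
        total = 0.0
        for _ in range(n_samples):
            s_next, reward = simulate(s, a)   # one sampled transition
            total += reward + J.get(s_next, 0.0)
        q = total / n_samples                 # Monte Carlo estimate of Q(s, a)
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action
```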
Trial-Based Real Time Dynamic Programming (Barto, 1995)
The algorithm proposed in Barto et al. (1995) to address stochastic shortest path problems (one goal state and a finite but large state space).
Inputs:
Initialize number of trials n.
Initialize the value functions for all the states in S
[Figure: grid-world illustration of a trial path from the starting state to the goal state.]
Basic steps for each trial (a sketch follows the list):
1. Start from the starting state s_t.
2. Use the Bellman equation to pick the greedy action α*.
3. a) Simulate the system using α*. b) Update J(s_t).
4. Set s_t ← s_{t+1}.
5. End at the goal state.
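A minimal sketch of one such trial, reusing the `sampled_backup` sketch from above; the `simulate` interface and the goal test are assumptions:

```python
def rtdp_trial(start, goal, actions, simulate, J):
    """One trial of trial-based RTDP: pick the greedy action via a
    (sampled) Bellman backup, update the visited state, move on."""
    s = start
    while s != goal:
        value, a_star = sampled_backup(s, actions, simulate, J)  # steps 2-3
        J[s] = value                 # step 3b: update J(s_t)
        s, _ = simulate(s, a_star)   # steps 3a and 4: s_t <- s_{t+1}
    return J
```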
On Modifying RTDP
The concept of the relevant state space (S_REL):
• We want to solve for the value function only in the region where the system normally operates.
• A greedy heuristic, for example, is a good initial policy for defining that operating space.
RTDP modifications:
• The state space is "evolved" starting from an empty set.
• The "adaptive action set".
• Exploration vs. exploitation, tuned through the initialization of the value function for unseen successor states.
• The use of k-NN (local value function approximators).
Schematic Illustration Concerning State Space Terminology
$$S' = \{\, s_i \in S : S_t > \theta_1 \ \text{or}\ w > \theta_2 \,\}$$
The Proposed Method to Overcome the COD in the State, Action, and Uncertainty Space
[Figure: from the initial state s_i, each candidate optimal action in the adaptive action set A_sub is evaluated using the Bellman equation; the uncertainty is handled by sampling from the possible transitions to successive states s_j, yielding the candidate optimal action α*.]
Pratikakis, N.E., Realff, M.J., and Lee, J.H., "Strategic Capacity Decisions in Manufacturing Using Real-Time Adaptive Dynamic Programming", submitted to Naval Research Logistics.
The Adaptive Action Set
Designed to circumvent the curse of dimensionality in the action space.
The idea: use optimization and heuristics to selectively choose a small number of controls for each state.
How to construct the A_sub ⊆ A? From four sources (a sketch follows the list):
• Heuristic actions
• Mathematical programming actions
• Random actions
• Best known actions
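A hedged sketch of such a construction; the four helper callables (`heuristic`, `mip_solve`, `sample_actions`, `best_known`) are hypothetical placeholders for the slide's four sources:

```python
def adaptive_action_set(s, heuristic, mip_solve, sample_actions, best_known,
                        n_random=5):
    """Assemble a small candidate set A_sub for state s from four sources,
    instead of enumerating the full action space A."""
    A_sub = set()
    A_sub.add(heuristic(s))                    # a heuristic action
    A_sub.add(mip_solve(s))                    # a mathematical-programming action
    A_sub.update(sample_actions(s, n_random))  # random actions drawn from A
    A_sub.update(best_known.get(s, ()))        # best actions found so far for s
    return A_sub
```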
On Evaluating the Actions in the A_sub
1st scenario: all s_j's belong to S_REL. Retrieve J(s_j) from the look-up table.
2nd scenario: some of the s_j's do not belong to S_REL. Use the k-NN estimate defined below (a code sketch follows the formulas).
3rd scenario: what if |N(s_j)| < k? Fall back to the initial estimation schemes:
• underestimating the optimal value function for all s_j, or
• overestimating the optimal value function for all s_j.
$$J(s_i) = \max_{a \in A_{sub}} \Big\{\, r(s_i, a) + \sum_{s_j \in S_{REL}} P(s_j \mid s_i, a)\, J(s_j) \,\Big\}$$
$$\text{Find } N(s_j) \stackrel{\text{def}}{=} \{\, s_i \in S_{REL} : d(s_i, s_j) \le \varepsilon \,\}, \qquad d(s_i, s_j) = (s_i - s_j)^{T} W (s_i - s_j)$$
$$\text{If } |N(s_j)| \ge k: \quad J(s_j) = \frac{1}{k} \sum_{x \in N(s_j)} J(x)$$
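A minimal sketch of this k-NN estimate, assuming states are stored as tuples of numbers and that the k nearest stored values are averaged; the weighting matrix W and threshold eps follow the formulas above:

```python
import numpy as np

def knn_estimate(s_j, explored, J, W, eps, k):
    """Local k-NN value estimate for an unseen state s_j (sketch).
    explored: visited states in S_REL, as tuples; J: value look-up table."""
    target = np.asarray(s_j, dtype=float)
    def dist(x):                                  # weighted quadratic distance
        diff = np.asarray(x, dtype=float) - target
        return float(diff @ W @ diff)
    neighbors = [x for x in explored if dist(x) <= eps]
    if len(neighbors) < k:                        # Scenario 3: too few neighbors,
        return None                               # fall back to initialization
    nearest = sorted(neighbors, key=dist)[:k]     # average the k nearest values
    return sum(J[x] for x in nearest) / k
```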
Exploration vs. Exploitation
The initialization of the state values turns out to be an important parameter of the algorithm for controlling exploitation vs. exploration (a toy sketch follows this slide):
• A consistent underestimation scheme as the initialization of the optimal value function intuitively leads to minimum exploration.
• A consistent overestimation scheme as the initialization of the optimal value function intuitively leads to maximum exploration.
[Figure: |S'| versus the number of iterations for simulation with RTADP, comparing the over-estimation and under-estimation variants.]
a) RTADP with the under-estimator picks α* = α1 (a bias toward actions that lead to already-explored space).
b) RTADP with the over-estimator picks α* = α2 (a bias toward actions that lead to unexplored regions of the state space).
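A toy sketch of the three initialization schemes from the slides; the constants are illustrative assumptions, not the paper's values:

```python
def initial_value(scheme, under=0.0, prior=-10.0, over=1e6):
    """Initial J for an unseen successor state (illustrative constants only).
    A low value biases the greedy step toward already-explored states;
    a high value makes unexplored states look attractive, forcing exploration."""
    return {"under": under, "prior": prior, "over": over}[scheme]
```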
Results
RTADP with different exploration strategies
Comparison: MIP with full information – RTADP – heuristic – MIP with a rolling horizon
Behavior of the architectures in time series (smooth stock-level control)
RTADP Combined with Different Exploration Strategies
Value-function initialization used by each strategy:

Strategy | Estimate for s_j with no neighbors (Scenario 3) | Estimate for s_j that belong to S'
RTADP – a (under-estimator) | 0 | 0
RTADP – b (prior knowledge) | 0 | -10
RTADP – c (over-estimator) | $J^0(s_j) \ge J^*(s_j)$ | $J^0(s_j) \ge J^*(s_j)$
RTADP Combined with Different Exploration Strategies (2)
[Figure panels: RTADP + under-estimation, RTADP + prior knowledge, RTADP + over-estimation.]
Performance Comparison
13.3% performance gap between RTADP – a and the MIP with full information.
Is multistage uncertainty not very dominant here?
The Control of Stock Level Using RTADP in Time Series
Conclusions and Future Directions
RTADP was successfully implemented for this manufacturing job shop.
It alleviates the curse of dimensionality through:
• the adaptive action set
• the evolving state space
Use of k-NN as a local approximator.
The performance gap against an upper-bound solution is 13.3%.
Importance of multistage uncertainty.
The initialization of the optimal value function can be used as a tuning parameter for exploitation vs. exploration.
Possible extensions:
• Incorporate coherent risk measures into the objective function to shape the profit (cost) distribution accordingly.
Acknowledgements
Advisors:
Jay H. Lee
Matthew J. Realff
ISSICL Group members
Financial support: NSF (CTS #03019993)