![Page 1: A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles Scott Proper Oregon State University Prasad Tadepalli Hong TangRasaratnam Logendran](https://reader035.vdocuments.site/reader035/viewer/2022062320/56649d7b5503460f94a5f4b3/html5/thumbnails/1.jpg)
A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles
Scott Proper
Oregon State University
Prasad Tadepalli, Hong Tang, Rasaratnam Logendran
Vehicle Routing & Product Delivery
Contributions of our Research
Multiple vehicle product delivery is a well-studied problem in operations research
We have formulated this problem as an average reward reinforcement learning (RL) problem
We have combined inventory control with vehicle routing
We have scaled RL methods to work with large state spaces
Markov Decision Processes
Action a
Actions are stochastic: P_{i,j}(a)
Actions have costs or rewards: r_i(a)
[Diagram: example actions labeled Move and Unload]
Average Reward Reinforcement Learning
Goal: Maximize average reward per time step
– Minimize stockout penalty + movement penalty
Policy: states → actions
Value function: states → real values
– Expected long-term reward from a state, relative to other states, when following the optimal policy
H-Learning
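The value function h satisfies the Bellman equation (standard average-reward form, with ρ the average reward per time step):
$$h(i) = \max_a \Big[ r_i(a) + \sum_j P_{i,j}(a)\, h(j) \Big] - \rho$$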
The optimal action a* maximizes the immediate reward + expected value of the next state
H-Learning is a real-time algorithm for solving the value function
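A minimal tabular sketch of an H-Learning style update may help make the idea concrete. It estimates the action models (transition frequencies and mean rewards) from experience, tracks an average-reward estimate `rho`, and backs up the value `h` of the current state. The class name, attribute names, and the `greedy` flag are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict


class HLearner:
    """Illustrative tabular H-Learning sketch (names are hypothetical)."""

    def __init__(self, actions, alpha=1.0):
        self.actions = list(actions)
        self.h = defaultdict(float)          # relative value of each state
        self.rho = 0.0                       # estimated average reward per step
        self.alpha = alpha                   # step size for the rho update
        self.n_sa = defaultdict(int)         # times action a was taken in s
        self.n_sas = defaultdict(lambda: defaultdict(int))  # observed s -> s'
        self.r_hat = defaultdict(float)      # estimated mean reward r_s(a)

    def backup(self, s, a):
        """Estimated r_s(a) + sum_j P_{s,j}(a) h(j); 0.0 for untried actions."""
        n = self.n_sa[(s, a)]
        if n == 0:
            return 0.0
        expected_next = sum(
            count / n * self.h[s2] for s2, count in self.n_sas[(s, a)].items()
        )
        return self.r_hat[(s, a)] + expected_next

    def greedy_action(self, s):
        return max(self.actions, key=lambda a: self.backup(s, a))

    def update(self, s, a, reward, s_next, greedy=True):
        # Update the learned action models from the observed transition.
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a)][s_next] += 1
        n = self.n_sa[(s, a)]
        self.r_hat[(s, a)] += (reward - self.r_hat[(s, a)]) / n
        # Only greedy (non-exploratory) steps adjust the average reward.
        if greedy:
            sample = self.r_hat[(s, a)] - self.h[s] + self.h[s_next]
            self.rho = (1 - self.alpha) * self.rho + self.alpha * sample
            self.alpha = self.alpha / (self.alpha + 1)
        # Bellman backup for the current state.
        self.h[s] = max(self.backup(s, u) for u in self.actions) - self.rho
```

In a one-state problem where the only action always costs 0.1, `rho` settles at -0.1 and the relative value of that state stays at 0, which is what the average-reward formulation predicts.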
H-Learning: an example 1
[Diagram: states A, B, C, D, E; edge labels: -.1, 1/1; 0, 9/9; 0, 0/9]
Value Table
A 0
B 0
C 0
D 0
E 0
H-Learning: an example 2
Stockout penalty: -20
[Diagram: states A, B, C, D, E; edge labels: -.1, 1/1; 0, 9/10; -20, 1/10]
Value Table
A -.1
B 0
C 0
D 0
E 0
H-Learning: an example 3
[Diagram: states A, B, C, D, E; edge labels: -.1, 1/1; 0, 9/10; -20, 1/10]
Value Table
A -.1
B 0
C 0
D 0
E 0
H-Learning: an example 4
Move penalty: -.1
[Diagram: states A, B, C, D, E; edge labels: -.1, 2/2; 0, 9/10; -20, 1/10]
Value Table
A -.1
B 0
C 0
D 0
E 0
On-line Product Delivery
Deliver 1 product
9 truck actions:
– 4 levels of unload
– 4 move directions
– wait
P(Inventory decrease | shop)
Stockout penalty: -20
Movement penalty: -.1
[Diagram labels: 5 Shops, Depot]
The problem of state-space explosion
The loads of trucks and shop inventories are discretized into 5 levels
States grow exponentially in shops and trucks
– 10 locations, 5 shops, 2 trucks = (10^2)(5^5)(5^2) = 7,812,500 states
– 5 trucks = (10^5)(5^5)(5^5) = 976,562,500,000 states
Table-based methods take too much time and space
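The state counts follow from multiplying the number of joint truck positions, shop inventory settings, and truck load settings. A short sketch of that arithmetic (the function and parameter names are illustrative):

```python
def state_count(locations: int, shops: int, trucks: int, levels: int = 5) -> int:
    """Number of states when loads and inventories are discretized into `levels`."""
    truck_positions = locations ** trucks   # each truck can be at any location
    shop_inventories = levels ** shops      # each shop has one of `levels` inventories
    truck_loads = levels ** trucks          # each truck has one of `levels` loads
    return truck_positions * shop_inventories * truck_loads


print(f"{state_count(10, 5, 2):,}")  # 7,812,500
print(f"{state_count(10, 5, 5):,}")  # 976,562,500,000
```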
Piecewise Linear Function Approximation
We use a different linear function for each possible 5-tuple of truck locations l_1, …, l_5
Each function is linear in truck loads and shop inventories
Each function represents about 10 million states (5^5 × 5^5 = 9,765,625)
Roughly a million-fold reduction in learnable parameters
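One way to picture this representation is a table keyed by the tuple of truck locations, where each entry holds the weights of a linear function over truck loads and shop inventories. The sketch below is an illustration under stated assumptions: the class name, the bias term, and the simple delta-rule update are not taken from the slides.

```python
from collections import defaultdict


class PiecewiseLinearValue:
    """One linear function per tuple of truck locations (illustrative sketch)."""

    def __init__(self, num_trucks, num_shops=5):
        # Assumed features: a bias term, one load per truck, one inventory per shop.
        self.num_features = 1 + num_trucks + num_shops
        self.weights = defaultdict(lambda: [0.0] * self.num_features)

    @staticmethod
    def features(loads, inventories):
        return [1.0, *loads, *inventories]

    def value(self, locations, loads, inventories):
        w = self.weights[tuple(locations)]
        x = self.features(loads, inventories)
        return sum(wi * xi for wi, xi in zip(w, x))

    def update(self, locations, loads, inventories, target, step=0.01):
        """Assumed delta-rule step that moves the prediction toward `target`."""
        w = self.weights[tuple(locations)]
        x = self.features(loads, inventories)
        error = target - sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            w[i] += step * error * xi
```

Only the weights for the visited location tuple change on each update, so states that share truck locations generalize to one another through the shared linear function.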
Piecewise linear function approximation vs. table-based
10 locations, 5 shops, 2 trucks, 10^6 iterations
[Plot: average reward (y-axis, -8 to 0) vs. 1000's of iterations (x-axis, 10 to 910). Series: Piecewise Linear Function Approximation; Table-based]
Storing and using the action models
Problem: exponential time to determine the expected value of the next state:
- Each shop’s consumption is independent
- Value function is piecewise linear
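One plausible way the two facts above can be combined is linearity of expectation: when consumption at each shop is independent and the value function is linear in the inventories, the expected value of the next state can be obtained by evaluating the function at the expected inventories, without enumerating every joint outcome. The sketch below checks this on a toy model in which each shop consumes either 0 or 1 unit; that consumption model, the function names, and the omission of clipping at zero inventory are assumptions made for illustration.

```python
from itertools import product


def expected_value_bruteforce(weights, bias, inventories, p_consume):
    """Enumerate all 2^n consumption outcomes (exponential in the number of shops)."""
    total = 0.0
    for outcome in product([0, 1], repeat=len(inventories)):
        prob = 1.0
        value = bias
        for w, inv, p, used in zip(weights, inventories, p_consume, outcome):
            prob *= p if used else 1.0 - p
            value += w * (inv - used)
        total += prob * value
    return total


def expected_value_linear(weights, bias, inventories, p_consume):
    """Evaluate the linear function at the expected inventories (linear time)."""
    return bias + sum(
        w * (inv - p) for w, inv, p in zip(weights, inventories, p_consume)
    )
```

Both functions return the same number up to floating-point error, but the second needs one pass over the shops instead of a sum over every combination of outcomes.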
Ignoring Truck Identity
m = number of locations (10)
k = number of trucks (2-5)
Distinguishing trucks (m^k functions), 5 trucks: 10^5 functions, 1.1 million learnable parameters
Ignoring truck identity, 5 trucks: 2002 functions, 22,022 learnable parameters
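The figure of 2002 functions is consistent with counting multisets of k truck locations drawn from m locations, that is, C(m + k - 1, k). The sketch below reproduces the counts on this slide; the function name is illustrative, and the figure of 11 parameters per function (for example 5 loads, 5 inventories, and a bias) is an inference from the slide's totals rather than something it states.

```python
from math import comb


def num_functions(m: int, k: int, ignore_identity: bool) -> int:
    """Number of linear functions for m locations and k trucks."""
    if ignore_identity:
        # Unordered placements of k interchangeable trucks on m locations.
        return comb(m + k - 1, k)
    # Ordered k-tuples of locations, one per distinguishable truck.
    return m ** k


PARAMS_PER_FUNCTION = 11  # assumption inferred from 1.1 million / 10^5

for ignore in (False, True):
    n = num_functions(10, 5, ignore)
    print(ignore, n, n * PARAMS_PER_FUNCTION)
```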
The problem of action-space explosion
Every action a is a vector of individual “truck actions”: a = (a_1, a_2, …, a_n)
Actions grow exponentially in the number of trucks
– 9 “truck actions”
– For 2 trucks: 9^2 = 81 total actions
– For 5 trucks: 9^5 = 59,049 total actions
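To see where these counts come from, the joint actions can be enumerated as a Cartesian product of the per-truck actions, which is also what an exhaustive search has to scan at every step. In this sketch the action names and the `score` callback are hypothetical placeholders.

```python
from itertools import product

# Hypothetical names for the 9 truck actions: 4 unload levels, 4 moves, wait.
TRUCK_ACTIONS = (
    "unload_1", "unload_2", "unload_3", "unload_4",
    "north", "south", "east", "west",
    "wait",
)


def joint_actions(num_trucks):
    """Lazily yield every joint action, one tuple entry per truck."""
    return product(TRUCK_ACTIONS, repeat=num_trucks)


def exhaustive_best(num_trucks, score):
    """Return the joint action with the highest score by scanning all of them."""
    return max(joint_actions(num_trucks), key=score)


print(sum(1 for _ in joint_actions(2)))  # 81
print(sum(1 for _ in joint_actions(5)))  # 59049
```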
Hill Climbing Search
We initialize the vector of truck actions a to all “wait” actions
We use hill climbing to reach a local optimum
Randomly perturb a truck action, repeat
This results in an order-of-magnitude improvement in search time
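The search described above might be sketched as follows. This is one possible reading of the slide, not the authors' implementation: start from all "wait" actions, change one truck's action at a time while the score improves, then randomly perturb one truck action and climb again. The action names, the `score` callback, and the number of restarts are assumptions.

```python
import random

# Hypothetical names for the 9 truck actions: 4 unload levels, 4 moves, wait.
TRUCK_ACTIONS = (
    "unload_1", "unload_2", "unload_3", "unload_4",
    "north", "south", "east", "west",
    "wait",
)


def local_optimum(joint, score):
    """Change one truck's action at a time until no single change improves."""
    best = score(joint)
    improved = True
    while improved:
        improved = False
        for i in range(len(joint)):
            for action in TRUCK_ACTIONS:
                candidate = joint[:i] + [action] + joint[i + 1:]
                value = score(candidate)
                if value > best:
                    joint, best, improved = candidate, value, True
    return joint, best


def hill_climb(num_trucks, score, rng, restarts=5):
    """Hill climbing from all-wait, with random single-truck perturbations."""
    joint, best = local_optimum(["wait"] * num_trucks, score)
    for _ in range(restarts):
        candidate = list(joint)
        candidate[rng.randrange(num_trucks)] = rng.choice(TRUCK_ACTIONS)
        candidate, value = local_optimum(candidate, score)
        if value > best:
            joint, best = candidate, value
    return joint, best
```

Each pass of `local_optimum` evaluates 9 candidates per truck rather than all 9^k joint actions, which is where the saving over exhaustive search comes from. A call such as `hill_climb(5, score, random.Random(0))` returns the best joint action found and its score.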
Hill climbing vs. exhaustive search for 4 and 5 trucks
10 locations, 5 shops, 5 trucks, 10^6 iterations
[Plot: average reward (y-axis, -3 to 0) vs. 1000's of iterations (x-axis, 10 to 910). Series: Hill Climbing with 5 trucks; Exhaustive search, 5 trucks]
Conclusion
Average-reward RL and piecewise linear function approximation are promising approaches for real-time product delivery
Hill climbing shows great potential for speeding up search in domains with a large action space
Problems of scaling are surmountable
Future Work
Scaling! More trucks, more locations, more shops, more depots, and more items
Allowing trucks to move with non-uniform speeds (event-based model needed)
Real-valued shop inventory and truck load levels