TRANSCRIPT
Introduction to Neuro-Dynamic Programming (Or, how to count cards in blackjack and do other fun things too.)
Eric B. Laber
February 12, 2008
Eric B. Laber () Introduction to Neuro-Dynamic Programming (Or, how to count cards in blackjack and do other fun things too.)February 12, 2008 1 / 32
Framework
Introduction
Objectives:
Define Neuro-Dynamic Programming (NDP)
Understand how NDP is used by learning to cheat at blackjack
Learn other (more noble) applications of NDP
Framework
What is NDP?
NDP is about sequential decision making
An agent (decision maker) is faced with a series of decisions
Each decision results in a reward
Each decision changes the environment
Agent’s objective: maximize accumulated reward over time
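A minimal sketch of this agent-environment loop in Python (the toy environment, function names, and policy are illustrative, not from the slides):

```python
def run_episode(policy, step, s0, horizon):
    """Generic loop: the agent picks an action, receives a reward,
    and the action changes the environment's state."""
    state, total = s0, 0
    for _ in range(horizon):
        action = policy(state)               # agent's decision
        reward, state = step(state, action)  # reward + new environment state
        total += reward                      # accumulate reward over time
    return total

# Toy environment: action 1 collects a reward equal to the current
# state and advances the state; action 0 does nothing.
def toy_step(state, action):
    return (state, state + 1) if action == 1 else (0, state)

print(run_episode(lambda s: 1, toy_step, s0=1, horizon=5))  # 1+2+3+4+5 = 15
```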
Framework
What is NDP?
S0 → A1 → R1 → S1
(Initial State → First Decision → First Reward → Second State)
Framework
What is NDP?
Actions affect future states, so myopic decision making is NOT sufficient
[Decision tree: from S0, action a11 gives r1 = 1 and leads to S1, while a12 gives r1 = 2 and leads to S′1. From S1, action a21 gives r2 = 100 and a22 gives r2 = −100; from S′1, a′21 gives r2 = 50 and a′22 gives r2 = 0.]
Framework
What is NDP?
Solution: Go backwards!
[The same decision tree: the best second-stage choice from S1 is a21 (r2 = 100) and from S′1 is a′21 (r2 = 50).]
Framework
What is NDP?
Solution: Go backwards!
[Decision tree with backed-up values: taking a11 and then the best action a21 yields 1 + 100 = 101, while taking a12 and then a′21 yields 2 + 50 = 52, so a11 is optimal despite its smaller immediate reward.]
Backup diagrams put the DP in NDP
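The backed-up values can be computed mechanically. Here is a sketch of backward induction on the toy tree from the slides (state and action names follow the diagram; the dict encoding is mine):

```python
# Each state maps actions to (immediate reward, next state or None).
tree = {
    "S0":  {"a11": (1, "S1"),   "a12": (2, "S1'")},
    "S1":  {"a21": (100, None), "a22": (-100, None)},
    "S1'": {"a21'": (50, None), "a22'": (0, None)},
}

def value(state):
    """Value of a state = best (immediate reward + value of next state)."""
    if state is None:
        return 0
    return max(r + value(nxt) for r, nxt in tree[state].values())

print(value("S0"))  # 101: take a11 (r=1), then a21 (r=100)
# Myopic play takes a12 for the larger immediate reward (2),
# but can then earn at most 2 + 50 = 52.
```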
Framework
What is NDP?
Real sequential problems are more sophisticated
Systems are stochastic
System dynamics are unknown:
Reward function is unknown
Transition probabilities between states are unknown
Number of states and actions may be large or even infinite
Must use data to estimate some (or all) of the above
Framework
What is NDP?
NDP is a method for approximating the backup-diagram approach for sequential decision problems with unknown system dynamics, large state or action spaces, or both.
The term Neuro in Neuro-Dynamic Programming refers to approximating elements of the backup diagram (using what computer science calls neural networks)
The term Dynamic Programming refers to solving the system with the approximated components using the backup-diagram approach
Often these steps of approximation and evaluation are alternated repeatedly
Example: Cheating at Blackjack
Cheating at Blackjack
Example: Counting cards in Blackjack
Example: Cheating at Blackjack
Intro to Blackjack
Blackjack (aka Twenty-one or Pontoon) is a popular casino game.
Object is to obtain cards whose numerical sum is large without exceeding 21
Player draws cards until he is satisfied with his total or it exceeds 21 (he loses)
Dealer draws cards according to a fixed policy: hit until the total is 17 or higher
Winner is the person with the highest numerical total less than or equal to 21
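The dealer's fixed policy is easy to simulate. A minimal sketch (the slides give no code; the soft-Ace rule below is the standard casino rule, added here for completeness):

```python
def dealer_total(deck):
    """Draw from the end of `deck` until the total is 17 or higher.
    Ranks are 1 (Ace) through 13 (King); face cards count as 10,
    and an Ace counts as 11 unless that would bust the hand."""
    total, soft_aces = 0, 0
    while total < 17:
        card = deck.pop()
        if card == 1:
            total, soft_aces = total + 11, soft_aces + 1
        else:
            total += min(card, 10)
        while total > 21 and soft_aces:   # demote an Ace from 11 to 1
            total, soft_aces = total - 10, soft_aces - 1
    return total

print(dealer_total([10, 7]))        # draws 7 then 10 -> stands on 17
print(dealer_total([6, 10, 2, 1]))  # Ace=11, +2, +10 busts -> Ace=1, +6 -> 19
```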
Example: Cheating at Blackjack
Intro to Blackjack
Available information at time t:
All cards used prior to time t
Player's current total
One of the dealer's cards
Example: Cheating at Blackjack
Intro to Blackjack
Beginning of a blackjack hand as a sequential decision problem:
State: the card history (number of Aces, Twos, Threes, . . ., Kings seen)
First decision: choose bet b in {minBet, maxBet}; Reward: r1 = 0
Next state: the card history plus the player's hand and one dealer card
Second decision: hit (take another card) or stand (take no more cards)
Notice we essentially need two strategies
One for deciding which bet to place
One for deciding when to hit/stand
Should the strategies be independent?
Example: Cheating at Blackjack
Intro to Blackjack
Is the following hand a “good one”?
Example: Cheating at Blackjack
Intro to Blackjack
The “goodness” of a particular hand depends on the strategy being employed
Betting strategy depends on the estimated “goodness” of the next hand
Formally, we define the “goodness” of a hand under a particular strategy as the expected total winnings from that hand and all future hands
We must estimate betting and playing strategies simultaneously
Example: Cheating at Blackjack
NDP and Blackjack
Why solving blackjack directly is difficult:
1 No explicit model
2 Large number of states and actions
3 Variable number of decks (1, 2, 4, or 8)
What makes this a good NDP problem:
1 Easy to simulate blackjack
2 Important features of the game are easy to summarize
3 Can simultaneously solve for any number of decks
Example: Cheating at Blackjack
NDP and Blackjack
Features for blackjack:
We could keep track of the total number of Aces, Twos, Threes, etc.
Better to keep track of the percentage of Aces, Twos, Threes, etc. that have appeared (i.e., at time t we've seen 23% of all Aces)
It is usually sufficient to keep track of less information
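A sketch of the percentage-seen features (the function and variable names are illustrative; note that the same feature set works for any number of decks, which is one reason the problem suits NDP):

```python
def card_features(seen, num_decks=1):
    """Fraction of each rank already dealt, e.g. 'we have seen 25% of
    all Aces'. `seen` maps rank (1=Ace ... 13=King) to a count."""
    per_rank = 4 * num_decks   # copies of each rank in the shoe
    return {rank: seen.get(rank, 0) / per_rank for rank in range(1, 14)}

feats = card_features({1: 2, 13: 1}, num_decks=2)
print(feats[1])   # 2 of 8 Aces seen  -> 0.25
print(feats[13])  # 1 of 8 Kings seen -> 0.125
```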
Example: Cheating at Blackjack
NDP and Blackjack
NDP Algorithm: For k = 1, 2, . . .:
1 Choose strategy πk, which decides an action for EVERY possible state, so that it improves on the previous strategy πk−1
2 Estimate the expected performance of πk on every possible scenario using computer simulation
Example: Cheating at Blackjack
NDP and Blackjack
NDP Algorithm:
[Diagram: a loop alternating the Evaluate and Improve steps]
Example: Cheating at Blackjack
NDP and Blackjack
Questions:
How to improve a strategy πk? Suppose at time t we observe state st:
1 We estimate the performance of choosing action πk(st) and following πk afterward
2 We also estimate the performance of choosing alternate actions when faced with st and following πk afterward
If an improvement can be made at any state st, we can improve πk by choosing the optimal action at st and leaving πk unchanged at other states
How long to run the algorithm?
We run the algorithm until no further improvements can be made; convergence to a near-optimal strategy is guaranteed
How to choose a starting policy?
Any starting policy will do, but some choices will lead to faster convergence
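The improvement step can be sketched on a toy problem: estimate each action's performance by simulation and keep the best at every state. This is illustrative only; episodes here end after one action, so the "follow πk afterward" part is trivial, whereas in blackjack the rollouts would continue the hand under πk:

```python
import random

# Toy problem: in state s, action a yields a noisy reward with mean
# MEANS[s][a]. The dynamics are "unknown" to the agent, which can
# only sample them by simulation.
MEANS = {0: {0: 1.0, 1: 3.0}, 1: {0: 2.0, 1: -1.0}}

def simulate(s, a):
    return MEANS[s][a] + random.gauss(0, 0.1)

def improve(n_sims=2000):
    """One improvement sweep: at every state, estimate each action's
    performance by simulation and keep the best."""
    new_policy = {}
    for s in MEANS:
        est = {a: sum(simulate(s, a) for _ in range(n_sims)) / n_sims
               for a in MEANS[s]}
        new_policy[s] = max(est, key=est.get)
    return new_policy

print(improve())  # {0: 1, 1: 0} with overwhelming probability
```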
Example: Cheating at Blackjack
NDP and Blackjack
The preceding algorithm produces a strategy π∞ which is near optimal. However,
Using π∞ requires memorizing every possible scenario!
Fortunately, NDP allows us to restrict ourselves to simpler strategies
Linear strategies like:
Bet Large if: 2 ∗ NumberAcesLeft + NumberFaceCardsLeft − NumberLowCardsLeft > 0
are currently popular
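The linear rule can be checked directly (the variable names follow the slide's formula; the slides do not say which ranks count as "low", so the example counts below are illustrative):

```python
def bet_large(aces_left, face_cards_left, low_cards_left):
    """Bet large when 2*NumberAcesLeft + NumberFaceCardsLeft
    - NumberLowCardsLeft > 0 (the linear strategy from the slide)."""
    return 2 * aces_left + face_cards_left - low_cards_left > 0

print(bet_large(4, 12, 20))  # False: 8 + 12 - 20 = 0, not > 0
print(bet_large(4, 12, 10))  # True:  8 + 12 - 10 = 10 > 0
```

As low cards leave the deck the count rises, so the rule bets large exactly when the remaining deck is rich in Aces and face cards.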
Other NDP Applications
Other Applications
NDP is utilized in a large number of applications including:
Autonomous flight
Tailored medical treatments for chronic illness
Adaptive standardized tests (e.g., the GRE)
End
Further Information
There exist several standard references for NDP:
Dynamic Programming and Optimal Control by Bertsekas, Athena Scientific
Neuro-Dynamic Programming by Bertsekas and Tsitsiklis, Athena Scientific
Reinforcement Learning by Sutton and Barto, MIT Press