Introduction to Neuro-Dynamic Programming
(Or, how to count cards in blackjack and do other fun things too.)

Eric B. Laber
February 12, 2008


Framework

Introduction

Objectives:

Define Neuro-Dynamic Programming (NDP)

Understand how NDP is used by learning to cheat at blackjack

Learn other (more noble) applications of NDP


Framework

What is NDP?

NDP is about sequential decision making

An agent (decision maker) is faced with a series of decisions

Each decision results in a reward

Each decision changes the environment

Agent’s objective: maximize accumulated reward over time
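This agent-environment loop can be sketched in code; the toy state, action, and reward rules below are invented stand-ins for illustration, not anything from the talk:

```python
import random

def run_episode(n_steps=5, seed=0):
    """Sketch of the sequential decision loop: the agent observes a state,
    makes a decision, receives a reward, and the decision changes the
    environment; the agent's objective is the accumulated reward."""
    rng = random.Random(seed)
    state = 0                                        # S0: initial state
    total_reward = 0.0
    for t in range(1, n_steps + 1):
        action = rng.choice([0, 1])                  # A_t: the agent's decision
        reward = 1.0 if action == state % 2 else 0.0 # R_t: reward for that decision
        state = state + action                       # S_t: the environment changes
        total_reward += reward                       # accumulate reward over time
    return total_reward

print(run_episode())
```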


Framework

What is NDP?

[Diagram, built up over four slides: the chain S0 → A1 → R1 → S1, labeling the initial state S0, the first decision A1, the first reward R1, and the second state S1.]

Framework

What is NDP?

Actions affect future states, so myopic decision making is NOT sufficient

[Decision tree: from S0, action a11 yields r1 = 1 and leads to S1, while a12 yields r1 = 2 and leads to S′1. From S1, action a21 yields r2 = 100 and a22 yields r2 = −100; from S′1, action a′21 yields r2 = 50 and a′22 yields r2 = 0. The myopic choice a12 (larger immediate reward) forfeits the r2 = 100 reachable only through a11.]


Framework

What is NDP?

Solution: Go backwards!

[Same decision tree, with values backed up from the end: taking a11 is worth 1 + 100 = 101, while a12 is worth at most 2 + 50 = 52, so the optimal first action is a11 even though its immediate reward is smaller.]

Backup diagrams put the DP in NDP
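The backup computation on this small tree can be carried out explicitly; a minimal sketch (the dictionary encoding of the tree is mine, matching the rewards on the slide):

```python
# Two-stage decision tree from the slide: each first action gives an
# immediate reward r1 and leads to a state with second-stage actions.
tree = {
    "a11": (1, {"a21": 100, "a22": -100}),   # S0 --a11--> S1
    "a12": (2, {"a'21": 50, "a'22": 0}),     # S0 --a12--> S'1
}

# Backward induction: value of a first action = immediate reward plus the
# best achievable second-stage reward from the state it leads to.
backed_up = {a: r1 + max(second.values()) for a, (r1, second) in tree.items()}
best_first_action = max(backed_up, key=backed_up.get)

print(backed_up)            # a11 is worth 1 + 100 = 101; a12 only 2 + 50 = 52
print(best_first_action)    # the myopic choice a12 (r1 = 2) is suboptimal
```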


Framework

What is NDP?

Real sequential problems are more sophisticated

Systems are stochastic

System dynamics are unknown:

Reward function is unknown

Transition probabilities between states are unknown

Number of states and actions may be large or even infinite

Must use data to estimate some (or all) of the above
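Estimating the unknown pieces from data can be sketched as follows; the two-state black-box dynamics here are invented purely for illustration:

```python
import random
from collections import Counter, defaultdict

rng = random.Random(42)

def step(state, action):
    """Black-box dynamics: we can sample transitions but don't know the model."""
    next_state = rng.choice([0, 1])
    reward = 1.0 if next_state == action else 0.0
    return next_state, reward

counts = defaultdict(Counter)     # (s, a) -> Counter of observed next states
reward_sums = defaultdict(float)
n = defaultdict(int)
for _ in range(10_000):
    s, a = rng.choice([0, 1]), rng.choice([0, 1])
    s_next, r = step(s, a)
    counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    n[(s, a)] += 1

# Empirical estimates of the transition probabilities and the reward function.
p_hat = {sa: {s2: c / n[sa] for s2, c in cnt.items()} for sa, cnt in counts.items()}
r_hat = {sa: reward_sums[sa] / n[sa] for sa in n}
```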


Framework

What is NDP?

NDP is a method for approximating the backup diagram approach for sequential decision problems with unknown system dynamics, large state or action spaces, or both.

The term Neuro in Neuro-Dynamic Programming refers to approximation of elements in the backup diagram (using what computer scientists call Neural Networks)

The term Dynamic Programming refers to solving the system with approximated components using the backup diagram approach

Often the above steps of approximation and evaluation are alternated repeatedly


Example: Cheating at Blackjack

Cheating at Blackjack

Example: Counting cards in Blackjack


Example: Cheating at Blackjack

Intro to Blackjack

Blackjack, aka Twenty-one or Pontoon, is a popular casino game.

Object is to obtain cards whose numerical sum is large without exceeding 21

Player draws cards until he is satisfied with his total or it exceeds 21 (loses)

Dealer draws cards according to a fixed policy: hit until total is 17 or higher

Winner is the person with the highest numerical total less than or equal to 21
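The dealer's fixed policy is simple to simulate; a sketch under two simplifying assumptions of mine (cards drawn with replacement as if from an infinite deck, and the ace counted only as 1, ignoring the soft/hard distinction):

```python
import random

def draw(rng):
    """Draw one card value: face cards count as 10; the ace is simplified to 1."""
    return min(rng.randint(1, 13), 10)

def dealer_total(rng):
    """Dealer's fixed policy: hit until the total is 17 or higher."""
    total = 0
    while total < 17:
        total += draw(rng)
    return total

rng = random.Random(0)
totals = [dealer_total(rng) for _ in range(1000)]
```

Since the dealer stops at 17 and the largest draw from a total of 16 is 10, every simulated total lands between 17 and 26, with anything over 21 a bust.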


Example: Cheating at Blackjack

Intro to Blackjack

Available information at time t:

All cards used prior to time t

Player's current total

One of the dealer's cards


Example: Cheating at Blackjack

Intro to Blackjack

Beginning of a blackjack hand as a sequential decision problem:

[Diagram: the initial state is the card history (number of Aces, Twos, Threes, ..., Kings). First decision: choose a bet b in {minBet, maxBet}, with reward r1 = 0. The next state is the card history together with the player's hand and one dealer card, where the actions are hit (take another card) and stand (take no more cards).]

Notice we essentially need two strategies

One for deciding which bet to place

One for deciding when to hit/stand

Should the strategies be independent?
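The two state types on this slide can be written down directly; a minimal sketch in which the class and field names are mine:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CardHistory:
    """Counts of each rank seen so far: Aces, Twos, ..., Kings (13 ranks)."""
    counts: List[int] = field(default_factory=lambda: [0] * 13)

@dataclass
class BettingState:
    """State for the first decision: choose a bet in {minBet, maxBet}; r1 = 0."""
    history: CardHistory

@dataclass
class PlayingState:
    """State for hit/stand decisions: history plus the visible cards."""
    history: CardHistory
    player_hand: List[int]
    dealer_card: int

s = PlayingState(CardHistory(), player_hand=[10, 6], dealer_card=9)
```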


Example: Cheating at Blackjack

Intro to Blackjack

Is the following hand a “good one”?


Example: Cheating at Blackjack

Intro to Blackjack

The “goodness” of a particular hand depends on the strategy being employed

Betting strategy depends on estimated “goodness” of the next hand

Formally, we define “goodness” of a hand using a particular strategy as the expected total winnings from that hand and all future hands

We must estimate betting and playing strategies simultaneously


Example: Cheating at Blackjack

NDP and Blackjack

Why solving blackjack directly is difficult:

1 No explicit model

2 Large number of states and actions

3 Variable number of decks (1,2,4, or 8)

What makes this a good NDP problem:

1 Easy to simulate blackjack

2 Important features of the game are easy to summarize

3 Can simultaneously solve for any number of decks


Example: Cheating at Blackjack

NDP and Blackjack

Features for blackjack:

We could keep track of the total number of Aces, Twos, Threes, etc.

Better to keep track of the total percentage of Aces, Twos, Threes, etc. that have appeared (i.e., at time t we’ve seen 23% of all Aces)

It is usually sufficient to keep track of less information
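The percentage features are a one-liner per rank; a sketch (the function name and interface are mine):

```python
def percentage_features(counts_seen, n_decks=1):
    """Fraction of each rank already dealt, e.g. "we've seen 25% of all Aces".
    counts_seen[i] = number of cards of rank i seen so far (Aces, Twos, ..., Kings);
    there are 4 cards of each rank per deck."""
    per_rank = 4 * n_decks
    return [seen / per_rank for seen in counts_seen]

# Single deck, after seeing 1 of the 4 Aces and 2 of the 4 Kings:
feats = percentage_features([1] + [0] * 11 + [2])
```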


Example: Cheating at Blackjack

NDP and Blackjack

NDP Algorithm: For k = 1, 2, . . .:

1 Choose strategy πk, which decides an action for EVERY possible state, so that it improves on the previous strategy πk−1

2 Estimate expected performance of πk on every possible scenario using computer simulation
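This alternation is policy iteration; a minimal sketch on a tiny two-state problem of my own invention, using exhaustive evaluation in place of the computer simulation the slides describe:

```python
# Tiny deterministic decision problem: reward R and next state T per (state, action).
STATES, ACTIONS = [0, 1], [0, 1]
R = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 2}
T = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}
GAMMA, SWEEPS = 0.9, 50

def evaluate(policy):
    """Step 2: estimate the performance of the policy in every state
    (iterative policy evaluation with a discount, in place of simulation)."""
    v = {s: 0.0 for s in STATES}
    for _ in range(SWEEPS):
        v = {s: R[(s, policy[s])] + GAMMA * v[T[(s, policy[s])]] for s in STATES}
    return v

def improve(v):
    """Step 1: choose a strategy that does at least as well in every state."""
    return {s: max(ACTIONS, key=lambda a: R[(s, a)] + GAMMA * v[T[(s, a)]])
            for s in STATES}

policy = {s: 0 for s in STATES}       # arbitrary starting policy
for _ in range(10):                   # for k = 1, 2, ...
    v = evaluate(policy)
    new_policy = improve(v)
    if new_policy == policy:          # stop when no further improvement
        break
    policy = new_policy
```

Here the loop settles on taking action 0 in state 0 (to reach state 1) and action 1 in state 1 (to keep collecting the reward of 2).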


Example: Cheating at Blackjack

NDP and Blackjack

NDP Algorithm:

[Diagram: a cycle alternating between the Evaluate and Improve steps.]


Example: Cheating at Blackjack

NDP and Blackjack

Questions:

How to improve a strategy πk? Suppose at time t we observe state st

1 We estimate performance of choosing action πk(st) and following πk afterward

2 We also estimate performance of choosing alternate actions when faced with st and following πk afterward

If an improvement can be made at any state st, we can improve πk by choosing the optimal action at st and leaving πk unchanged at other states

How long to run the algorithm?

We run the algorithm until no further improvements can be made

Convergence to a near-optimal strategy is guaranteed

How to choose a starting policy?

Any starting policy will do, but some choices will lead to faster convergence
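The improvement test can be sketched with Monte Carlo rollouts: estimate the return of each candidate action at st while following πk afterward, and keep the best. The simulator interface (`env_step`) and the toy environment below are assumptions of mine:

```python
import random

def rollout_return(env_step, state, first_action, policy, horizon, rng):
    """Return from taking first_action at state, then following policy."""
    total, s, a = 0.0, state, first_action
    for _ in range(horizon):
        s, r = env_step(s, a, rng)
        total += r
        a = policy(s)
    return total

def improved_action(env_step, state, policy, actions, horizon=20,
                    n_rollouts=200, seed=0):
    """Estimate each action's performance at `state` by simulation; pick the best."""
    rng = random.Random(seed)
    def q_hat(a):
        return sum(rollout_return(env_step, state, a, policy, horizon, rng)
                   for _ in range(n_rollouts)) / n_rollouts
    return max(actions, key=q_hat)

# Toy check: if action 1 always pays 1 and action 0 pays 0, the improved
# action beats a policy that always plays 0.
def env_step(s, a, rng):
    return s, float(a)

print(improved_action(env_step, 0, lambda s: 0, [0, 1]))
```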


Example: Cheating at Blackjack

NDP and Blackjack

The preceding algorithm produces a strategy π∞ which is near optimal. However,

Using π∞ requires memorizing every possible scenario!

Fortunately, NDP allows us to restrict ourselves to simpler strategies

Linear strategies like:

Bet Large if: 2 ∗ NumberAcesLeft + NumberFaceCardsLeft − NumberLowCardsLeft > 0

are currently popular
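The linear rule above is a one-line function; a sketch (which ranks count as "low cards" is my assumption, in the spirit of hi-lo counting):

```python
def bet_large(num_aces_left, num_face_cards_left, num_low_cards_left):
    """Linear betting rule from the slide:
    bet large iff 2*AcesLeft + FaceCardsLeft - LowCardsLeft > 0."""
    return 2 * num_aces_left + num_face_cards_left - num_low_cards_left > 0

# Example: 4 aces, 16 ten-valued cards, and 20 low cards remaining
# gives 2*4 + 16 - 20 = 4 > 0, so bet large.
print(bet_large(4, 16, 20))
```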


Other NDP Applications

Other Applications

NDP is utilized in a large number of applications including:

Autonomous flight

Tailored medical treatments for chronic illness

Adaptive standardized tests (e.g., the GRE)


End

Further Information

There exist several standard references for NDP:

Dynamic Programming and Optimal Control by Bertsekas, Athena Scientific

Neuro-Dynamic Programming by Bertsekas and Tsitsiklis, Athena Scientific

Reinforcement Learning by Sutton and Barto, MIT Press
