Introduction to Neuro-Dynamic Programming
(Or, how to count cards in blackjack and do other fun things too.)

Eric B. Laber
February 12, 2008


Framework

Introduction

Objectives:

Define Neuro-Dynamic Programming (NDP)

Understand how NDP is used by learning to cheat at blackjack

Learn other (more noble) applications of NDP


Framework

What is NDP?

NDP is about sequential decision making

An agent (decision maker) is faced with a series of decisions

Each decision results in a reward

Each decision changes the environment

Agent’s objective: maximize accumulated reward over time
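This agent-environment loop can be sketched in code; the toy state, action, and reward rules below are invented stand-ins for illustration, not anything from the talk:

```python
import random

def run_episode(n_steps=5, seed=0):
    """Sketch of the sequential decision loop: the agent observes a state,
    makes a decision, receives a reward, and the decision changes the
    environment; the agent's objective is the accumulated reward."""
    rng = random.Random(seed)
    state = 0                                        # S0: initial state
    total_reward = 0.0
    for t in range(1, n_steps + 1):
        action = rng.choice([0, 1])                  # A_t: the agent's decision
        reward = 1.0 if action == state % 2 else 0.0 # R_t: reward for that decision
        state = state + action                       # S_t: the environment changes
        total_reward += reward                       # accumulate reward over time
    return total_reward

print(run_episode())
```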


Framework

What is NDP?

[Diagram, built up over four slides: the chain S0 → A1 → R1 → S1, labeling the initial state S0, the first decision A1, the first reward R1, and the second state S1.]

Framework

What is NDP?

Actions affect future states, so myopic decision making is NOT sufficient

[Decision tree: from S0, action a11 yields r1 = 1 and leads to S1, while a12 yields r1 = 2 and leads to S′1. From S1, action a21 yields r2 = 100 and a22 yields r2 = −100; from S′1, action a′21 yields r2 = 50 and a′22 yields r2 = 0. The myopic choice a12 (larger immediate reward) forfeits the r2 = 100 reachable only through a11.]


Framework

What is NDP?

Solution: Go backwards!

[Same decision tree, with values backed up from the end: taking a11 is worth 1 + 100 = 101, while a12 is worth at most 2 + 50 = 52, so the optimal first action is a11 even though its immediate reward is smaller.]

Backup diagrams put the DP in NDP
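The backup computation on this small tree can be carried out explicitly; a minimal sketch (the dictionary encoding of the tree is mine, matching the rewards on the slide):

```python
# Two-stage decision tree from the slide: each first action gives an
# immediate reward r1 and leads to a state with second-stage actions.
tree = {
    "a11": (1, {"a21": 100, "a22": -100}),   # S0 --a11--> S1
    "a12": (2, {"a'21": 50, "a'22": 0}),     # S0 --a12--> S'1
}

# Backward induction: value of a first action = immediate reward plus the
# best achievable second-stage reward from the state it leads to.
backed_up = {a: r1 + max(second.values()) for a, (r1, second) in tree.items()}
best_first_action = max(backed_up, key=backed_up.get)

print(backed_up)            # a11 is worth 1 + 100 = 101; a12 only 2 + 50 = 52
print(best_first_action)    # the myopic choice a12 (r1 = 2) is suboptimal
```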


Framework

What is NDP?

Real sequential problems are more sophisticated

Systems are stochastic

System dynamics are unknown:

Reward function is unknown

Transition probabilities between states are unknown

Number of states and actions may be large or even infinite

Must use data to estimate some (or all) of the above
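Estimating the unknown pieces from data can be sketched as follows; the two-state black-box dynamics here are invented purely for illustration:

```python
import random
from collections import Counter, defaultdict

rng = random.Random(42)

def step(state, action):
    """Black-box dynamics: we can sample transitions but don't know the model."""
    next_state = rng.choice([0, 1])
    reward = 1.0 if next_state == action else 0.0
    return next_state, reward

counts = defaultdict(Counter)     # (s, a) -> Counter of observed next states
reward_sums = defaultdict(float)
n = defaultdict(int)
for _ in range(10_000):
    s, a = rng.choice([0, 1]), rng.choice([0, 1])
    s_next, r = step(s, a)
    counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    n[(s, a)] += 1

# Empirical estimates of the transition probabilities and the reward function.
p_hat = {sa: {s2: c / n[sa] for s2, c in cnt.items()} for sa, cnt in counts.items()}
r_hat = {sa: reward_sums[sa] / n[sa] for sa in n}
```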


Framework

What is NDP?

NDP is a method for approximating the backup diagram approach for sequential decision problems with unknown system dynamics, large state or action spaces, or both.

The term Neuro in Neuro-Dynamic Programming refers to approximation of elements in the backup diagram (using what computer scientists call Neural Networks)

The term Dynamic Programming refers to solving the system with approximated components using the backup diagram approach

Often the above steps of approximation and evaluation are alternated repeatedly


Example: Cheating at Blackjack

Cheating at Blackjack

Example: Counting cards in Blackjack


Example: Cheating at Blackjack

Intro to Blackjack

Blackjack, aka Twenty-one or Pontoon, is a popular casino game.

Object is to obtain cards whose numerical sum is large without exceeding 21

Player draws cards until he is satisfied with his total or it exceeds 21 (loses)

Dealer draws cards according to a fixed policy: hit until total is 17 or higher

Winner is the person with the highest numerical total less than or equal to 21
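The dealer's fixed policy is simple to simulate; a sketch under two simplifying assumptions of mine (cards drawn with replacement as if from an infinite deck, and the ace counted only as 1, ignoring the soft/hard distinction):

```python
import random

def draw(rng):
    """Draw one card value: face cards count as 10; the ace is simplified to 1."""
    return min(rng.randint(1, 13), 10)

def dealer_total(rng):
    """Dealer's fixed policy: hit until the total is 17 or higher."""
    total = 0
    while total < 17:
        total += draw(rng)
    return total

rng = random.Random(0)
totals = [dealer_total(rng) for _ in range(1000)]
```

Since the dealer stops at 17 and the largest draw from a total of 16 is 10, every simulated total lands between 17 and 26, with anything over 21 a bust.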


Example: Cheating at Blackjack

Intro to Blackjack

Available information at time t:

All cards used prior to time t

Player's current total

One of the dealer's cards


Example: Cheating at Blackjack

Intro to Blackjack

Beginning of a blackjack hand as a sequential decision problem:

[Diagram: the initial state is the card history (number of Aces, Twos, Threes, ..., Kings). First decision: choose a bet b in {minBet, maxBet}, with reward r1 = 0. The next state is the card history together with the player's hand and one dealer card, where the actions are hit (take another card) and stand (take no more cards).]

Notice we essentially need two strategies

One for deciding which bet to place

One for deciding when to hit/stand

Should the strategies be independent?
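The two state types on this slide can be written down directly; a minimal sketch in which the class and field names are mine:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CardHistory:
    """Counts of each rank seen so far: Aces, Twos, ..., Kings (13 ranks)."""
    counts: List[int] = field(default_factory=lambda: [0] * 13)

@dataclass
class BettingState:
    """State for the first decision: choose a bet in {minBet, maxBet}; r1 = 0."""
    history: CardHistory

@dataclass
class PlayingState:
    """State for hit/stand decisions: history plus the visible cards."""
    history: CardHistory
    player_hand: List[int]
    dealer_card: int

s = PlayingState(CardHistory(), player_hand=[10, 6], dealer_card=9)
```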


Example: Cheating at Blackjack

Intro to Blackjack

Is the following hand a “good one”?


Example: Cheating at Blackjack

Intro to Blackjack

The “goodness” of a particular hand depends on the strategy being employed

Betting strategy depends on estimated “goodness” of the next hand

Formally, we define “goodness” of a hand using a particular strategy as the expected total winnings from that hand and all future hands

We must estimate betting and playing strategies simultaneously


Example: Cheating at Blackjack

NDP and Blackjack

Why solving blackjack directly is difficult:

1 No explicit model

2 Large number of states and actions

3 Variable number of decks (1,2,4, or 8)

What makes this a good NDP problem:

1 Easy to simulate blackjack

2 Important features of the game are easy to summarize

3 Can simultaneously solve for any number of decks


Example: Cheating at Blackjack

NDP and Blackjack

Features for blackjack:

We could keep track of the total number of Aces, Twos, Threes, etc.

Better to keep track of the total percentage of Aces, Twos, Threes, etc. that have appeared (i.e., at time t we’ve seen 23% of all Aces)

It is usually sufficient to keep track of less information
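The percentage features are a one-liner per rank; a sketch (the function name and interface are mine):

```python
def percentage_features(counts_seen, n_decks=1):
    """Fraction of each rank already dealt, e.g. "we've seen 25% of all Aces".
    counts_seen[i] = number of cards of rank i seen so far (Aces, Twos, ..., Kings);
    there are 4 cards of each rank per deck."""
    per_rank = 4 * n_decks
    return [seen / per_rank for seen in counts_seen]

# Single deck, after seeing 1 of the 4 Aces and 2 of the 4 Kings:
feats = percentage_features([1] + [0] * 11 + [2])
```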


Example: Cheating at Blackjack

NDP and Blackjack

NDP Algorithm: For k = 1, 2, . . .:

1 Choose strategy πk, which decides an action for EVERY possible state, so that it improves on the previous strategy πk−1

2 Estimate expected performance of πk on every possible scenario using computer simulation
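This alternation is policy iteration; a minimal sketch on a tiny two-state problem of my own invention, using exhaustive evaluation in place of the computer simulation the slides describe:

```python
# Tiny deterministic decision problem: reward R and next state T per (state, action).
STATES, ACTIONS = [0, 1], [0, 1]
R = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 2}
T = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}
GAMMA, SWEEPS = 0.9, 50

def evaluate(policy):
    """Step 2: estimate the performance of the policy in every state
    (iterative policy evaluation with a discount, in place of simulation)."""
    v = {s: 0.0 for s in STATES}
    for _ in range(SWEEPS):
        v = {s: R[(s, policy[s])] + GAMMA * v[T[(s, policy[s])]] for s in STATES}
    return v

def improve(v):
    """Step 1: choose a strategy that does at least as well in every state."""
    return {s: max(ACTIONS, key=lambda a: R[(s, a)] + GAMMA * v[T[(s, a)]])
            for s in STATES}

policy = {s: 0 for s in STATES}       # arbitrary starting policy
for _ in range(10):                   # for k = 1, 2, ...
    v = evaluate(policy)
    new_policy = improve(v)
    if new_policy == policy:          # stop when no further improvement
        break
    policy = new_policy
```

Here the loop settles on taking action 0 in state 0 (to reach state 1) and action 1 in state 1 (to keep collecting the reward of 2).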


Example: Cheating at Blackjack

NDP and Blackjack

NDP Algorithm:

[Diagram: a cycle alternating between the Evaluate and Improve steps.]


Example: Cheating at Blackjack

NDP and Blackjack

Questions:

How to improve a strategy πk? Suppose at time t we observe state st

1 We estimate performance of choosing action πk(st) and following πk afterward

2 We also estimate performance of choosing alternate actions when faced with st and following πk afterward

If an improvement can be made at any state st, we can improve πk by choosing the optimal action at st and leaving πk unchanged at other states

How long to run the algorithm?

We run the algorithm until no further improvements can be made

Convergence to a near-optimal strategy is guaranteed

How to choose a starting policy?

Any starting policy will do, but some choices will lead to faster convergence
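The improvement test can be sketched with Monte Carlo rollouts: estimate the return of each candidate action at st while following πk afterward, and keep the best. The simulator interface (`env_step`) and the toy environment below are assumptions of mine:

```python
import random

def rollout_return(env_step, state, first_action, policy, horizon, rng):
    """Return from taking first_action at state, then following policy."""
    total, s, a = 0.0, state, first_action
    for _ in range(horizon):
        s, r = env_step(s, a, rng)
        total += r
        a = policy(s)
    return total

def improved_action(env_step, state, policy, actions, horizon=20,
                    n_rollouts=200, seed=0):
    """Estimate each action's performance at `state` by simulation; pick the best."""
    rng = random.Random(seed)
    def q_hat(a):
        return sum(rollout_return(env_step, state, a, policy, horizon, rng)
                   for _ in range(n_rollouts)) / n_rollouts
    return max(actions, key=q_hat)

# Toy check: if action 1 always pays 1 and action 0 pays 0, the improved
# action beats a policy that always plays 0.
def env_step(s, a, rng):
    return s, float(a)

print(improved_action(env_step, 0, lambda s: 0, [0, 1]))
```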


Example: Cheating at Blackjack

NDP and Blackjack

The preceding algorithm produces a strategy π∞ which is near optimal. However,

Using π∞ requires memorizing every possible scenario!

Fortunately, NDP allows us to restrict ourselves to simpler strategies

Linear strategies like:

Bet Large if: 2 ∗ NumberAcesLeft + NumberFaceCardsLeft − NumberLowCardsLeft > 0

are currently popular
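The linear rule above is a one-line function; a sketch (which ranks count as "low cards" is my assumption, in the spirit of hi-lo counting):

```python
def bet_large(num_aces_left, num_face_cards_left, num_low_cards_left):
    """Linear betting rule from the slide:
    bet large iff 2*AcesLeft + FaceCardsLeft - LowCardsLeft > 0."""
    return 2 * num_aces_left + num_face_cards_left - num_low_cards_left > 0

# Example: 4 aces, 16 ten-valued cards, and 20 low cards remaining
# gives 2*4 + 16 - 20 = 4 > 0, so bet large.
print(bet_large(4, 16, 20))
```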


Other NDP Applications

Other Applications

NDP is utilized in a large number of applications including:

Autonomous flight

Tailored medical treatments for chronic illness

Adaptive standardized tests (e.g., the GRE)


End

Further Information

There exist several standard references for NDP:

Dynamic Programming and Optimal Control by Bertsekas, Athena Scientific

Neuro-Dynamic Programming by Bertsekas and Tsitsiklis, Athena Scientific

Reinforcement Learning by Sutton and Barto, MIT Press
