adversarial search continued: stochastic gamesbarto/courses/cs383-fall11/383... · 2011. 10. 4. ·...

31
1 CMPSCI 383 October 4, 2011 Adversarial Search Continued: Stochastic Games

Upload: others

Post on 14-Mar-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

1

CMPSCI 383 October 4, 2011

Adversarial Search Continued: Stochastic Games

Page 2: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

2

Tip for doing well

•  Begin work on assignments early!

Page 3: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

3

Game tree (2=player, deterministic, turns)

Page 4: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

4

Minimax algorithm

•  Perfect play for deterministic games

•  Idea — select moves with highest minimax value. That is, select the best achievable payoff against best play by your opponent

Page 5: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

5

α-β pruning

Page 6: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

6

α-β pruning

Page 7: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

7

Example: α-β pruning

Page 8: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

8

Example: α-β pruning

Page 9: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

9

Example: α-β pruning

Page 10: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

10

Example: α-β pruning

Page 11: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

11

Why is it called α-β?

•  α is the value of the best (highest-value) choice found so far at any choice point along the path for MAX •  If v is worse than α, MAX will avoid it, so that branch can be

pruned.

•  β is the value of the best (lowest-value) choice found so far at any choice point along the path for MIN •  If v is worse than β , MIN will avoid it, so that branch can

be pruned.

If m is better than n, we will never get to n

Page 12: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

12

What if a game has a “chance element”?

Page 13: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

13

Chance nodes

We know how to value the other nodes. How do we value chance nodes?

Page 14: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

14

Expected value

•  The sum of the probability of each possible outcome multiplied by its value:

•  Are there pathological cases where this statistic could do something strange? •  Extreme values (“outliers”) •  Functions that are a non-linear transformation

of the probability of winning

E(X) = pixii∑

Page 15: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

15

Expected minimax value

•  Now three different cases to evaluate, rather than just two. •  MAX •  MIN •  CHANCE

EXPECTIMINIMAX(n) = UTILITY(n), If terminal node maxs ∈ successors(n) EXPRCTIMINIMAX(s), If n is MAX node mins ∈ successors(n) EXPECTIMINIMAX(s), If n is MIN node ∑s ∈ successors(n) P(s) • EXPECTIMINIMAX(s), If n is CHANCE node

Page 16: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

16

Expectiminimaxing

Page 17: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

17

In Backgammon

Page 18: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

18

Evaluation functions

•  So cut off the search and evaluate leaves with an evaluation function (as in H-MINIMAX)

•  But do evaluation functions behave the same way in stochastic games? •  Just need to order nodes in the right way, so

particular values are not so important.

•  Chance nodes make things more difficult.

Page 19: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

19

Particular values DO matter

Order-preserving transformation

Page 20: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

20

Page 21: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

21

Partially Observable Games (Games of Imperfect Information)

argmaxa ∑s P(s) MINIMAX(RESULT(s,a)),

argmaxa 1/N ∑ MINIMAX(RESULT(si,a)), N

i = 1

Monte Carlo approximation:

“averaging over clairvoyance”

Page 22: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

22

Example: Kriegspiel

Maintain belief states

Page 23: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

23

Rollouts

•  Play out a position to completion several thousand times with different random dice sequences.

•  The best play is assumed to be the one that produced the best outcome statistics in the rollout.

•  What program should select moves?

Page 24: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

24

TD Gammon

Tesauro 1992, 1994, 1995, ...

•  White has just rolled a 5 and a 2 so can move one of his pieces 5 and one (possibly the same) 2 steps

•  Objective is to advance all pieces to points 19-24

•  Hitting •  Doubling •  30 pieces, 24 locations implies

enormous number of configurations

•  Effective branching factor of 400

Page 25: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

25

Multi-layer Neural Network

Page 26: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

26

Summary of TD-Gammon Results

Bill Robertie: world-class human grandmaster and former World Champion

TD-Gammon 2.1 plays at a strong master level that is extremely close to equaling the world’s best human planers. ….would be the favorite in a long money game session or grueling tournament like the World Cup competition (it never gets tired or careless).

Page 27: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

27

A Few Details

•  Reward: 0 at all times except those in which the game is won, when it is 1

•  Episodic (game = episode), undiscounted •  Gradient descent TD(λ) with a multi-layer

neural network •  weights initialized to small random numbers •  backpropagation of TD error •  four input units for each point; unary encoding

of number of white pieces, plus other features •  Use of afterstates •  Learning during self-play

Page 28: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

28

MENACE (Michie, 1961) “Matchbox Educable Noughts and Crosses Engine”

x"

x"

x"

x"

o"

o"

o"

o"

o"

x"

x"

x"x"

o"

o"

o"x"

x"

x"o"

o"

o"x" x"

x" x"

x" x"

x" x"

o"

o"

x"o"

o"x"

x"

x"

x"x"

x" o"

o"

o"

o"

o"

o"

o"

o"

o"

o"

o"

x"

x"

x"

o"

o"

o"

o"o"

o"

o"

x"

x"

x"

o"o"x"

x"

x"

x"

o"

o"

o"

x" o"

o"

o"

o"

o"

o"

o"

o"

o"

x"

x"x"

x"

o"o"

o"

o"

o"

x"

x"

x"

x"

o"

o"

o"

o"

o"

o"

o"

o"

x"

x"

x"

x"

o"

o"

o"

o"

o"

o"

o"

x"

x"

x"

x"

o"

o"

o"

o"x"

x"

x"

x"

o"

o"

o"

o"x"

x"

Page 29: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

29

Tic-Tac-Toe

X X X O O X

X O X

O X O

X O X

X O X

O X O

X O X

O X O

X

} x’s move

} x’s move

} o’s move

} x’s move

} o’s move

...

... ... ...

... ... ... ... ...

x x x

x o x

o x o

x

x

x x o o

Assume an imperfect opponent: —he/she sometimes makes mistakes

Page 30: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

30

1. Make a table with one entry per state:

2. Now play lots of games. To pick our moves,

look ahead one step:

State V(s) – estimated probability of winning: value .5 ? .5 ? . . .

. . .

. . . . . .

1 win

0 loss

. . . . . .

0 draw

x

x x x o o

o o o x x

o o o o x x x x o

current state

various possible next states *

Just pick the next state with the highest estimated prob. of winning — the largest V(s); a greedy move.

But 10% of the time pick a move at random; an exploratory move.

Learning an evaluation function

Page 31: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon

31

“Exploratory” move

– the state before our greedy move – the state after our greedy move

x y

We increment each V(x) toward V(y) – a backup :

a small positive fraction, the step – size parameter

← V(x) V(x) + α [V(y) – V(x)]

A Temporal-Difference (TD) method

Backups