

Page 1:

Monte Carlo Tree Search

Deep Reinforcement Learning and Control

Katerina Fragkiadaki

Carnegie Mellon

School of Computer Science

Spring 2020, CMU 10-403

Part of slides inspired by Sebag, Gaudel

Page 2:

Learning: the acquisition of knowledge or skills through experience, study, or by being taught.

Planning: any computational process that uses a model to create or improve a policy.

Definitions

Model → Planning → Policy

Page 3:

Given a deterministic transition function $T$, a root state $s$, and a simulation policy $\pi$ (potentially random):

Simulate $K$ episodes from the current (real) state:

$$\{s, a, R_1^k, S_1^k, A_1^k, R_2^k, S_2^k, A_2^k, \ldots, S_T^k\}_{k=1}^{K} \sim T, \pi$$

Evaluate the action-value function of the root by the mean return:

$$Q(s, a) = \frac{1}{K} \sum_{k=1}^{K} G_k \;\rightarrow\; q_\pi(s, a)$$

Select the root action:

$$a = \arg\max_{a \in \mathcal{A}} Q(s, a)$$

Simplest Monte-Carlo Search
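A minimal Python sketch of this procedure, assuming a hypothetical environment object with `legal_actions(state)` and a deterministic `step(state, action) -> (next_state, reward, done)`; the simulation policy here is uniform random.

```python
import random

def simple_mc_search(env, root_state, num_episodes=100):
    """Simplest Monte-Carlo search: score each root action by the mean return
    of `num_episodes` rollouts under a fixed (random) simulation policy."""
    q_values = {}
    for action in env.legal_actions(root_state):
        total_return = 0.0
        for _ in range(num_episodes):
            state, reward, done = env.step(root_state, action)
            episode_return = reward
            # follow the simulation policy until the episode terminates
            while not done:
                a = random.choice(env.legal_actions(state))
                state, reward, done = env.step(state, a)
                episode_return += reward
            total_return += episode_return
        q_values[action] = total_return / num_episodes   # Q(s, a) ≈ mean return G_k
    return max(q_values, key=q_values.get)               # greedy root action
```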

Page 4:

• Could we improve our simulation policy as we obtain more simulations?

• Yes we can! We can have two policies:

• Internal to the tree: keep track of action values Q not only for the root but also for nodes internal to the tree we are expanding, and use them (e.g., ε-greedy over Q) to improve the simulation policy over time.

• External to the tree: we have no Q estimates, so we use a random policy.

In MCTS, the simulation policy improves

• Can we think of anything better than ε-greedy?

Can we do better?

Page 5:

1. Selection

• Used for nodes we have seen before

• Pick according to UCB

2. Expansion

• Used when we reach the frontier

• Add one node per playout

3. Simulation

• Used beyond the search frontier

• Don't bother with UCB, just play randomly

4. Back-propagation

• After reaching a terminal node

• Update value and visits for states expanded in selection and expansion

Monte-Carlo Tree Search

Bandit based Monte-Carlo Planning, Kocsis and Szepesvari, 2006
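The four phases above, sketched in Python. This is a generic UCT-style skeleton under an assumed game interface (`legal_actions`, deterministic `step(state, action) -> next_state`, `is_terminal`, `outcome`), not the exact implementation from the paper.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}        # action -> Node
        self.visits = 0
        self.value_sum = 0.0      # sum of playout outcomes (e.g., wins)

def ucb_score(parent, child, c=1.4):
    if child.visits == 0:
        return float("inf")       # unvisited children are tried first
    return (child.value_sum / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def mcts(root, env, num_playouts=1000):
    for _ in range(num_playouts):
        node = root
        # 1. Selection: follow UCB while the node has been fully expanded
        while (node.children
               and len(node.children) == len(env.legal_actions(node.state))):
            node = max(node.children.values(), key=lambda ch: ucb_score(node, ch))
        # 2. Expansion: add one new node per playout at the frontier
        if not env.is_terminal(node.state):
            untried = [a for a in env.legal_actions(node.state)
                       if a not in node.children]
            action = random.choice(untried)
            child = Node(env.step(node.state, action), parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation: play randomly beyond the search frontier
        state = node.state
        while not env.is_terminal(state):
            state = env.step(state, random.choice(env.legal_actions(state)))
        outcome = env.outcome(state)  # e.g., 1 for a win, 0 for a loss, seen from
                                      # the root player (two-player sign flips omitted)
        # 4. Back-propagation: update values and visits along the selected path
        while node is not None:
            node.visits += 1
            node.value_sum += outcome
            node = node.parent
    # act with the most visited root action
    return max(root.children, key=lambda a: root.children[a].visits)
```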

Page 6:

Monte-Carlo Tree Search

For every state within the search tree we keep track of the number of visits and the number of wins.

Page 7:

Monte-Carlo Tree Search

Sample actions based on the UCB score
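The slide does not spell the score out; the standard UCB1 rule from Kocsis and Szepesvári, with $W(s, a)$ the win count, $N(s, a)$ the edge visit count, $N(s)$ the parent visit count, and $c$ an exploration constant, is:

$$a^{*} = \arg\max_{a} \left[ \frac{W(s, a)}{N(s, a)} + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]$$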

Page 8:

Monte-Carlo Tree Search

Page 9:

Monte-Carlo Tree Search

Page 10:

• Estimates state-action values $Q(s, a)$ by look-ahead planning.

• Questions:

• Are those estimates more or less accurate than those discovered with model-free RL methods, e.g., DQN?

• Why don't we simply use MCTS to select actions while playing Atari games (no prior knowledge)?

• How can we use the estimates discovered with MCTS and at the same time play fast at test time?

Monte-Carlo Tree Search planner

Page 11:

Use cases:

• As an online planner for selecting the next move

• For state-action value estimation at training time. At test time, just use the reactive policy network, without any lookahead planning.

• In combination with policy and value networks at test time (AlphaGo)

• In combination with policy and value networks at both train and test time (AlphaGoZero)

Monte-Carlo Tree Search

Page 12:

Can we do better?

Can we inject prior knowledge into state and action values instead of initializing them uniformly?

Page 13:

• Value neural net to evaluate board positions to help prune the tree depth.

MCTS + Policy/ Value networks

Page 14:

• Value neural net to evaluate board positions to help prune the tree depth.

• Policy neural net to select moves to help prune the tree breadth.

MCTS + Policy/ Value networks

Page 15:

• Train two policies by mimicking expert moves: one cheap rollout policy $p_\pi$ and one expensive SL policy $p_\sigma$.

MCTS + Policy/ Value networks

Page 16:

• Train two policies by mimicking expert moves: one cheap rollout policy $p_\pi$ and one expensive SL policy $p_\sigma$.

• Then, train a new policy $p_\rho$ with RL and self-play, initialized from the SL policy $p_\sigma$.

MCTS + Policy/ Value networks

Page 17:

• Train two policies by mimicking expert moves: one cheap rollout policy $p_\pi$ and one expensive SL policy $p_\sigma$.

• Train a new policy $p_\rho$ with RL and self-play, initialized from the SL policy $p_\sigma$.

• Train a value network that predicts the winner of games played by $p_\rho$ against itself.

MCTS + Policy/ Value networks

Page 18:

• Train two policies by mimicking expert moves: one cheap rollout policy $p_\pi$ and one expensive SL policy $p_\sigma$.

• Train a new policy $p_\rho$ with RL and self-play, initialized from the SL policy $p_\sigma$.

• Train a value network that predicts the winner of games played by $p_\rho$ against itself.

• Combine the policy and value networks with MCTS at test time.

AlphaGo: Learning-guided search

Page 19:

• Train two policies by mimicking expert moves: one cheap rollout policy $p_\pi$ and one expensive SL policy $p_\sigma$.

• Train a new policy $p_\rho$ with RL and self-play, initialized from the SL policy $p_\sigma$.

• Train a value network that predicts the winner of games played by $p_\rho$ against itself.

• Combine the policy and value networks with MCTS at test time.

MCTS + Policy/ Value networks

Page 20:

• Objective: predicting expert moves

• Input: randomly sampled state-action pairs $(s, a)$ from expert games

• Output: a probability distribution over all legal moves $a$.

SL policy network: a 13-layer policy network trained on 30 million positions. The network predicted expert moves on a held-out test set with an accuracy of 57.0% using all input features, and 55.7% using only the raw board position and move history as inputs, compared to the state-of-the-art from other research groups of 44.4%.

Supervised learning of policy networks
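A hedged PyTorch sketch of this supervised step. The layer widths and the `PolicyNet`/`sl_step` names are illustrative, not AlphaGo's actual configuration; only the 13-layer depth and the cross-entropy objective on expert (state, move) pairs follow the slide.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Toy stand-in for the 13-layer SL policy network: board features in,
    a distribution over the 19x19 = 361 board points out."""
    def __init__(self, in_channels=48, channels=64, num_layers=13):
        super().__init__()
        layers = [nn.Conv2d(in_channels, channels, 3, padding=1), nn.ReLU()]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, 1)]   # one logit per board point
        self.net = nn.Sequential(*layers)

    def forward(self, boards):                  # boards: (B, in_channels, 19, 19)
        return self.net(boards).flatten(1)      # (B, 361) move logits

def sl_step(policy, optimizer, boards, expert_moves):
    """One supervised step: cross-entropy against the expert's move index."""
    loss = F.cross_entropy(policy(boards), expert_moves)   # expert_moves: (B,) indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```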

Page 21:

• Train two policies by mimicking expert moves: one cheap rollout policy $p_\pi$ and one expensive SL policy $p_\sigma$.

• Train a new policy $p_\rho$ with RL and self-play, initialized from the SL policy $p_\sigma$.

• Train a value network that predicts the winner of games played by $p_\rho$ against itself.

• Combine the policy and value networks with MCTS at test time.

MCTS + Policy/ Value networks

Page 22:

• Objective: improve over the SL policy

• Weight initialization from the SL network

• Input: sampled states during self-play

• Output: a probability distribution over all legal moves $a$.

Rewards are provided only at the end of the game: +1 for winning, -1 for losing.

The RL policy network won more than 80% of games against the SL policy network.

$$\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho} \, z_t$$

Reinforcement learning of policy networks
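A minimal REINFORCE-style sketch of this update (tensor shapes and names are assumptions; `policy` is any network returning move logits): every move of a game is reinforced by the final outcome $z_t$.

```python
import torch

def rl_policy_step(policy, optimizer, states, actions, outcomes):
    """Policy-gradient step matching Δρ ∝ ∂log p_ρ(a_t|s_t)/∂ρ · z_t,
    with z_t = +1 for a win and -1 for a loss, shared by all moves of that game."""
    log_probs = torch.log_softmax(policy(states), dim=-1)            # (T, num_moves)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)    # log p_ρ(a_t|s_t)
    loss = -(chosen * outcomes).mean()        # minimize the negative PG objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```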

Page 23:

• Train two policies by mimicking expert moves: one cheap rollout policy $p_\pi$ and one expensive SL policy $p_\sigma$.

• Train a new policy $p_\rho$ with RL and self-play, initialized from the SL policy $p_\sigma$.

• Train a value network that predicts the winner of games played by $p_\rho$ against itself.

• Combine the policy and value networks with MCTS at test time.

MCTS + Policy/ Value networks

Page 24:

• Objective: estimating a value function $v^p(s)$ that predicts the outcome from position $s$ of games played using the RL policy $p_\rho$ for both players.

• Input: sampled states during self-play, 30 million distinct positions, each sampled from a separate game.

• Output: a scalar value

Trained by regression on state-outcome pairs $(s, z)$ to minimize the mean squared error between the predicted value $v(s)$ and the corresponding outcome $z$.

Reinforcement learning of value networks
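A small sketch of the regression objective (names and shapes are illustrative): the value network maps a board position to a scalar and is fit by mean squared error against the self-play outcome $z$.

```python
import torch.nn.functional as F

def value_step(value_net, optimizer, positions, outcomes):
    """Minimize MSE between predicted values v(s) and outcomes z in {-1, +1};
    one position is sampled per game to keep the training targets decorrelated."""
    v = value_net(positions).squeeze(-1)      # (B,) predicted game outcomes
    loss = F.mse_loss(v, outcomes)            # outcomes z: (B,) floats in {-1, +1}
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```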

Page 25:

• Train two policies by mimicking expert moves: one cheap rollout policy $p_\pi$ and one expensive SL policy $p_\sigma$.

• Train a new policy $p_\rho$ with RL and self-play, initialized from the SL policy $p_\sigma$.

• Train a value network that predicts the winner of games played by $p_\rho$ against itself.

• Combine the policy and value networks with MCTS at test time.

MCTS + Policy/ Value networks

Page 26:

Selection: selecting actions within the expanded tree

Tree policy

• $a_t$: the action selected at time step $t$ from state $s_t$

• $Q(s, a)$: average reward collected so far from MC simulations

• $P(s, a)$: prior expert probability provided by the SL policy $p_\sigma$

• $N(s, a)$: number of times we have visited the parent node

• $u$ acts as a bonus value

$$a_t = \arg\max_{a} \left( Q(s_t, a) + u(s_t, a) \right), \qquad u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$$

MCTS + Policy/ Value networks
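A sketch of this tree policy in Python. The `node.edges` container with `Q`, `P`, `N` fields is an assumed data structure, and note that AlphaGo's actual bonus additionally scales with the parent visit count.

```python
def select_action(node, c_puct=5.0):
    """Pick argmax_a [ Q(s, a) + u(s, a) ] with u(s, a) ∝ P(s, a) / (1 + N(s, a))."""
    def score(edge):
        u = c_puct * edge.P / (1.0 + edge.N)   # prior-weighted exploration bonus
        return edge.Q + u
    return max(node.edges, key=lambda a: score(node.edges[a]))
```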

Page 27:

Expansion: when reaching a leaf, play the action with the highest score from $p_\sigma$

• When a leaf node is reached, it has a chance to be expanded.

• It is processed once by the SL policy network $p_\sigma$ and the outputs are stored as prior probabilities $P(s, a)$.

• Pick the child node with the highest prior probability.

MCTS + Policy/ Value networks

Page 28:

MCTS + Policy/ Value networks

• From the selected leaf node, run multiple simulations in parallel using the rollout policy $p_\pi$.

• Evaluate the leaf node as:

$$V(s_L) = (1 - \lambda)\, v_\rho(s_L) + \lambda z_L$$

• $v_\rho(s_L)$: value from the trained value function for board position $s_L$

• $z_L$: reward from a fast rollout with $p_\pi$, played until the terminal step

• $\lambda$: mixing parameter

Simulation/Evaluation: use the rollout policy to reach the end of the game
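As a short sketch (hypothetical `value_net` and `rollout` callables; the default mixing weight is only illustrative):

```python
def evaluate_leaf(value_net, rollout, leaf_state, lam=0.5):
    """V(s_L) = (1 - λ) v_ρ(s_L) + λ z_L: mix the learned evaluation with the
    outcome of one fast rollout played with p_π until the end of the game."""
    v = value_net(leaf_state)     # v_ρ(s_L): value-network estimate
    z = rollout(leaf_state)       # z_L: ±1 result of a fast p_π rollout
    return (1.0 - lam) * v + lam * z
```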

Page 29:

MCTS + Policy/ Value networks

• Backup: update visitation counts and recorded rewards for the chosen path inside the tree:

$$N(s, a) = \sum_{i=1}^{n} \mathbf{1}_{(s,a) \in \tau_i}, \qquad Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} \mathbf{1}_{(s,a) \in \tau_i} \, V(s_L^i)$$

• The extra index $i$ denotes the $i$-th simulation, out of $n$ total simulations.

• Update the visit count and mean reward of the simulations passing through each node.

• Once MCTS completes, the algorithm chooses the most visited move from the root position.
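A sketch of the backup step in incremental-mean form (same assumed edge structure as in the selection sketch above):

```python
def backup(path, leaf_value):
    """For every (node, action) edge on the traversed path, bump the visit
    count N(s, a) and move Q(s, a) toward the leaf evaluation V(s_L)."""
    for node, action in path:
        edge = node.edges[action]
        edge.N += 1
        edge.Q += (leaf_value - edge.Q) / edge.N   # running mean over simulations
```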

Page 30:

• So far, look-ahead search was used for online planning at test time!

• We saw in the last lecture that MCTS is also useful at training time: it in fact reaches superior Q-value estimates to those of vanilla model-free RL.

• AlphaGoZero uses MCTS during training instead.

• AlphaGoZero gets rid of human supervision.

AlphaGoZero: Lookahead search during training!

Page 31:

• Given any policy, an MCTS guided by this policy for action selection (as described earlier) will produce an improved policy for the root node (a policy improvement operator).

• Train the policy network to mimic this improved policy.

AlphaGoZero: Lookahead search during training!

Page 32:

MCTS as policy improvement operator

• Train so that the policy network mimics this improved policy.

• Train so that the position-evaluation network output matches the game outcome (same as in AlphaGo). A sketch of this joint objective follows below.
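A hedged sketch of the joint training step (loss weights, the L2 coefficient, and the `net` interface returning (move logits, value) are assumptions): the policy head is pulled toward the MCTS visit distribution π, the value head toward the game outcome z.

```python
import torch
import torch.nn.functional as F

def alphazero_step(net, optimizer, states, mcts_policies, outcomes, c_reg=1e-4):
    """loss = cross-entropy(π, p) + MSE(z, v) + c_reg * ||θ||²"""
    logits, values = net(states)                          # (B, moves), (B, 1)
    policy_loss = -(mcts_policies * torch.log_softmax(logits, -1)).sum(-1).mean()
    value_loss = F.mse_loss(values.squeeze(-1), outcomes) # outcomes z: (B,)
    l2 = sum((p ** 2).sum() for p in net.parameters())    # weight-decay term
    loss = policy_loss + value_loss + c_reg * l2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```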

Page 33:

MCTS: no MC rollouts till termination.

MCTS: always using value-net evaluations of leaf nodes, no full rollouts!

Page 34:

Architectures

• ResNets help

• Jointly training the policy and value function using the same main feature extractor helps

[Figure: separate policy/value nets vs. joint policy/value nets]

Page 35:

Architectures

• ResNets help

• Jointly training the policy and value function using the same main feature extractor helps

• Lookahead tremendously improves the basic policy

Page 36:

RL VS SL

Page 37:

Question

• Why don't we always use MCTS (or some other planner) as supervision for reactive policy learning?

• Because in many domains we do not have access to the dynamics.