Monte Carlo Tree Search
Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon
School of Computer Science
Spring 2020, CMU 10-403
Part of slides inspired by Sebag, Gaudel
Learning: the acquisition of knowledge or skills through experience, study, or by being taught.
Planning: any computational process that uses a model to create or improve a policy.
Definitions
Model → Planning → Policy
Given a deterministic transition function T, a root state s, and a simulation policy π (potentially random):
Simulate K episodes from the current (real) state:
{s, a, R_1^k, S_1^k, A_1^k, R_2^k, S_2^k, A_2^k, ..., S_T^k}_{k=1}^K ∼ T, π
Evaluate the action-value function of the root by the mean return:
Q(s, a) = (1/K) ∑_{k=1}^{K} G_k → qπ(s, a)
Select the root action:
a = argmax_{a ∈ 𝒜} Q(s, a)
Simplest Monte-Carlo Search
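A minimal Python sketch of this flat Monte-Carlo search. The environment interface used here (`step`, `actions`, `is_terminal`) and the discounting are illustrative assumptions, not part of the slides.

```python
import random

def flat_mc_search(state, step, actions, is_terminal, K=100, gamma=1.0):
    """Estimate Q(s, a) by averaging K random-rollout returns per root action."""
    best_action, best_value = None, float("-inf")
    for a in actions(state):
        total_return = 0.0
        for _ in range(K):
            s, r = step(state, a)          # take the root action once
            g, discount = r, gamma
            while not is_terminal(s):      # then follow the random simulation policy
                s, r = step(s, random.choice(actions(s)))
                g += discount * r
                discount *= gamma
            total_return += g
        q = total_return / K               # Q(s, a) = (1/K) * sum of returns
        if q > best_value:
            best_action, best_value = a, q
    return best_action                     # a = argmax_a Q(s, a)
```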
• Could we improve our simulation policy as we obtain more simulations?
• Yes, we can! We can have two policies:
• Internal to the tree: keep track of action values Q not only for the root but also for nodes internal to the tree we are expanding, and use them to improve the simulation policy over time.
• External to the tree: we do not have Q estimates, and thus we use a random policy.
In MCTS, the simulation policy improves
• Can we think of anything better than ϵ-greedy?
Can we do better?
1. Selection
• Used for nodes we have seen before
• Pick according to UCB
2. Expansion
• Used when we reach the frontier
• Add one node per playout
3. Simulation
• Used beyond the search frontier
• Don't bother with UCB, just play randomly
4. Back-propagation
• After reaching a terminal node
• Update value and visits for states expanded in selection and expansion
Monte-Carlo Tree Search
Bandit based Monte-Carlo Planning, Kocsis and Szepesvari, 2006
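A compact Python sketch of these four phases with UCB1 selection. The game-model callables (`actions`, `step`, `is_terminal`, `result`) are assumed interfaces for illustration; this follows the generic algorithm rather than the exact formulation in the Kocsis and Szepesvári paper.

```python
import math
import random

class Node:
    """One state in the search tree, with visit/value bookkeeping."""
    def __init__(self, state, actions, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action                 # action that led here from the parent
        self.children = []
        self.visits = 0
        self.value = 0.0                     # sum of simulation outcomes
        self.untried = list(actions(state))  # actions not yet expanded

    def ucb_child(self, c=1.4):
        # UCB1: mean outcome plus an exploration bonus for rarely tried children
        return max(self.children,
                   key=lambda ch: ch.value / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root_state, actions, step, is_terminal, result, n_playouts=1000):
    root = Node(root_state, actions)
    for _ in range(n_playouts):
        node = root
        # 1. Selection: walk down fully expanded nodes using UCB
        while not node.untried and node.children:
            node = node.ucb_child()
        # 2. Expansion: add one new child per playout at the frontier
        if node.untried:
            a = node.untried.pop()
            node.children.append(Node(step(node.state, a), actions, node, a))
            node = node.children[-1]
        # 3. Simulation: random rollout beyond the frontier (no UCB)
        s = node.state
        while not is_terminal(s):
            s = step(s, random.choice(actions(s)))
        z = result(s)                        # e.g. +1 for a win, -1 for a loss
        # 4. Back-propagation: update values and visit counts along the path
        while node is not None:
            node.visits += 1
            node.value += z
            node = node.parent
    # act with the most visited root child
    return max(root.children, key=lambda ch: ch.visits).action
```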
Monte-Carlo Tree Search
For every state within the search tree, we bookkeep the number of visits and the number of wins.
Monte-Carlo Tree Search
Sample actions based on the UCB score
Monte-Carlo Tree Search
• Estimates state-action values Q(s, a) by look-ahead planning.
• Questions:
• Are those estimates more or less accurate than those discovered with model-free RL methods, e.g., DQN?
• Why don't we simply use MCTS to select actions when playing Atari games (no prior knowledge)?
• How can we use the estimates discovered with MCTS but at the same time play fast at test time?
Monte-Carlo Tree Search planner
Use cases:
• As an online planner for selecting the next move
• For state-action value estimation at training time. At test time, just use the reactive policy network, without any lookahead planning.
• In combination with policy and value networks at test time (AlphaGo)
• In combination with policy and value networks at both train and test time (AlphaGoZero)
Monte-Carlo Tree Search
Can we do better?
Can we inject prior knowledge into state and action values instead of initializing them uniformly?
• Value neural net to evaluate board positions to help prune the tree depth.
• Policy neural net to select moves to help prune the tree breadth.
MCTS + Policy/ Value networks
• Train two policies by mimicking expert moves: one cheap rollout policy pπ and one expensive policy pσ.
• Train a new policy pρ with RL and self-play, initialized from the SL policy pσ.
• Train a value network that predicts the winner of games played by pρ against itself.
• Combine the policy and value networks with MCTS at test time.
AlphaGo: Learning-guided search
• Objective: predicting expert moves
• Input: randomly sampled state-action pairs (s, a) from expert games
• Output: a probability distribution over all legal moves a.
SL policy network pσ: a 13-layer policy network trained from 30 million positions. The network predicted expert moves on a held-out test set with an accuracy of 57.0% using all input features, and 55.7% using only the raw board position and move history as inputs, compared to the state of the art from other research groups of 44.4%.
Supervised learning of policy networks
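A hedged PyTorch sketch of this supervised training step. The toy fully connected network below is a stand-in for the 13-layer convolutional SL policy network, and the input is reduced to a raw 19×19 board for brevity; these are illustrative assumptions, not the AlphaGo implementation.

```python
import torch
import torch.nn as nn

policy_sl = nn.Sequential(            # placeholder for the 13-layer conv net
    nn.Flatten(),
    nn.Linear(19 * 19, 512), nn.ReLU(),
    nn.Linear(512, 19 * 19),          # one logit per board point (legal moves)
)
optimizer = torch.optim.SGD(policy_sl.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()       # maximize log-likelihood of expert moves

def sl_step(states, expert_moves):
    """states: (B, 19, 19) float tensor; expert_moves: (B,) indices of expert moves."""
    logits = policy_sl(states)
    loss = loss_fn(logits, expert_moves)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```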
• Objective: improve over the SL policy pσ with an RL policy pρ
• Weight initialization from the SL network
• Input: states sampled during self-play
• Output: a probability distribution over all legal moves a.
Rewards are provided only at the end of the game: +1 for winning, -1 for losing.
The RL policy network won more than 80% of games against the SL policy network.
Δρ ∝ (∂ log pρ(at | st) / ∂ρ) · zt
Reinforcement learning of policy networks
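A minimal sketch of this policy-gradient (REINFORCE) update, reusing the `policy_sl` network from the sketch above as the initialization; the trajectory format and hyperparameters are assumptions, not the exact AlphaGo training code.

```python
import copy
import torch

policy_rl = copy.deepcopy(policy_sl)          # initialize the RL policy from SL weights
rl_optimizer = torch.optim.SGD(policy_rl.parameters(), lr=1e-3)

def reinforce_step(states, actions_taken, z):
    """One self-play game: states (T, 19, 19), actions_taken (T,), z = +1 win / -1 loss."""
    logits = policy_rl(states)                                # (T, num_moves)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions_taken.unsqueeze(1)).squeeze(1)
    loss = -(chosen * z).mean()                               # gradient ∝ ∂ log pρ(at|st)/∂ρ · zt
    rl_optimizer.zero_grad()
    loss.backward()
    rl_optimizer.step()
    return loss.item()
```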
• Objective: estimate a value function vp(s) that predicts the outcome from position s of games played using the RL policy pρ for both players.
• Input: states sampled during self-play; 30 million distinct positions, each sampled from a separate game.
• Output: a scalar value
Trained by regression on state-outcome pairs (s, z) to minimize the mean squared error between the predicted value v(s) and the corresponding outcome z.
Reinforcement learning of value networks
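A short sketch of this regression step; the architecture is a placeholder (a small fully connected net on the raw board), not AlphaGo's value network.

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(19 * 19, 512), nn.ReLU(),
    nn.Linear(512, 1), nn.Tanh(),      # scalar prediction in [-1, 1]
)
value_opt = torch.optim.SGD(value_net.parameters(), lr=1e-2)
mse = nn.MSELoss()

def value_step(states, outcomes):
    """states: (B, 19, 19); outcomes z: (B,) in {+1, -1}, one position per game."""
    v = value_net(states).squeeze(1)
    loss = mse(v, outcomes)            # minimize (v(s) - z)^2
    value_opt.zero_grad()
    loss.backward()
    value_opt.step()
    return loss.item()
```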
Selection: selecting actions within the expanded tree
Tree policy
• at — the action selected at time step t from state st
• Q(st, a) — the average reward collected so far from MC simulations
• P(s, a) — the prior expert probability provided by the SL policy pσ
• N(s, a) — the number of times we have visited the parent node
• u(s, a) acts as a bonus value
at = argmax_a (Q(st, a) + u(st, a)), with u(s, a) ∝ P(s, a) / (1 + N(s, a))
MCTS + Policy/ Value networks
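A small sketch of this selection rule, assuming the per-edge statistics Q, N and the priors P are kept in dictionaries keyed by (state, action); the slides only state the proportionality for u(s, a), so the scaling constant `c_puct` below is an assumption.

```python
def select_action(s, legal_actions, Q, N, P, c_puct=5.0):
    """Pick a_t = argmax_a Q(s, a) + u(s, a), with u(s, a) ∝ P(s, a) / (1 + N(s, a))."""
    def score(a):
        u = c_puct * P[(s, a)] / (1.0 + N.get((s, a), 0))   # exploration bonus u(s, a)
        return Q.get((s, a), 0.0) + u
    return max(legal_actions, key=score)
```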
Expansion: when reaching a leaf, play the action with the highest score from pσ
• When a leaf node is reached, it has a chance to be expanded
• It is processed once by the SL policy network pσ and stored as prior probabilities P(s, a)
• Pick the child node with the highest prior probability
MCTS + Policy/ Value networks
• From the selected leaf node, run multiple simulations in parallel using the rollout policy pπ
• Evaluate the leaf node as:
V(sL) = (1 − λ) vρ(sL) + λ zL
• vρ(sL): value from the trained value function for board position sL
• zL: reward from a fast rollout with pπ, played until the terminal step
• λ: mixing parameter
Simulation/Evaluation: use the rollout policy to reach the end of the game
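A one-function sketch of this mixed leaf evaluation. `value_net_eval` and `rollout_to_end` are hypothetical helpers, and the default mixing parameter is only a placeholder.

```python
def evaluate_leaf(s_L, value_net_eval, rollout_to_end, lam=0.5):
    """V(s_L) = (1 - lambda) * v(s_L) + lambda * z_L."""
    v = value_net_eval(s_L)      # value-network prediction for the board position
    z = rollout_to_end(s_L)      # z_L: outcome of a fast rollout with the rollout policy
    return (1.0 - lam) * v + lam * z
```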
MCTS + Policy/ Value networks
• Backup: update visitation counts and recorded rewards for the chosen path inside the tree
N(s, a) = ∑_{i=1}^{n} 1[(s, a) ∈ τi]
Q(s, a) = (1 / N(s, a)) ∑_{i=1}^{n} 1[(s, a) ∈ τi] · V(sL^i)
• The extra index i denotes the i-th simulation; n is the total number of simulations.
• Update the visit count and mean reward of simulations passing through a node.
• Once MCTS completes, the algorithm chooses the most visited move from the root position.
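The same backup can be done incrementally after each simulation; the sketch below keeps a running sum W so that Q(s, a) = W(s, a) / N(s, a), which is an equivalent rewrite of the indicator-sum formulas above (the dictionaries are hypothetical bookkeeping, not AlphaGo's data structures).

```python
def backup(path, leaf_value, N, W, Q):
    """Update edge statistics along the traversed path after one simulation.

    `path` is the list of (s, a) edges visited during selection/expansion,
    and `leaf_value` is V(s_L) for this simulation.
    """
    for edge in path:
        N[edge] = N.get(edge, 0) + 1
        W[edge] = W.get(edge, 0.0) + leaf_value
        Q[edge] = W[edge] / N[edge]          # running mean of leaf evaluations
```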
• So far, look-ahead search was used for online planning at test time!
• We saw in the last lecture that MCTS is also useful at training time: it in fact reaches Q-value estimates superior to those of vanilla model-free RL.
• AlphaGoZero uses MCTS during training instead.
• AlphaGoZero gets rid of human supervision.
AlphaGoZero: Lookahead search during training!
• Given any policy, an MCTS guided by this policy for action selection (as described earlier) will produce an improved policy for the root node (policy improvement operator).
• Train the policy network to mimic this improved policy.
AlphaGoZero: Lookahead search during training!
MCTS as policy improvement operator
• Train so that the policy network mimics this improved policy
• Train so that the position evaluation network output matches the outcome (same as in AlphaGo)
MCTS: no MC rollouts till termination
MCTS: always uses value-net evaluations of leaf nodes, no full rollouts!
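A high-level sketch of this AlphaGoZero-style training loop: self-play with MCTS produces improved move distributions π and outcomes z, and the joint policy/value network is trained to match them. The `game` and `run_mcts` interfaces and the per-example update are illustrative assumptions (in a real two-player setup, z must also be viewed from the player to move).

```python
import torch

def self_play_and_train(net, optimizer, game, run_mcts, n_games=100):
    """net(state) is assumed to return (move log-probabilities, scalar value)."""
    for _ in range(n_games):
        examples, s = [], game.initial_state()
        while not game.is_terminal(s):
            pi = run_mcts(s, net)          # MCTS visit-count distribution = improved policy
            examples.append((s, pi))
            s = game.play(s, pi)           # sample the next move from pi
        z = game.result(s)                 # final outcome, e.g. +1 / -1
        for s_t, pi_t in examples:
            log_p, v = net(s_t)
            # policy head mimics the MCTS policy; value head regresses to the outcome
            loss = -(pi_t * log_p).sum() + (v - z) ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```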
Architectures
• ResNets help
• Jointly training the policy and value function using the same main feature extractor helps
Separate policy/value nets vs. joint policy/value nets
Architectures
• Lookahead tremendously improves the basic policy
RL vs. SL
Question
• Why don't we always use MCTS (or some other planner) as supervision for reactive policy learning?
• Because in many domains we do not have access to the dynamics.