
Monte Carlo Tree Search

Deep Reinforcement Learning and Control

Katerina Fragkiadaki

Carnegie Mellon

School of Computer Science

Spring 2020, CMU 10-403

Part of slides inspired by Sebag, Gaudel

Learning: the acquisition of knowledge or skills through experience, study, or by being taught.

Planning: any computational process that uses a model to create or improve a policy

Definitions

[Diagram: Model → Planning → Policy]

Given a deterministic transition function $T$, a root state $s$, and a simulation policy $\pi$ (potentially random):

Simulate $K$ episodes from the current (real) state:

$$\{ s, a, R_1^k, S_1^k, A_1^k, R_2^k, S_2^k, A_2^k, \ldots, S_T^k \}_{k=1}^{K} \sim T, \pi$$

Evaluate the action-value function of the root by the mean return:

$$Q(s, a) = \frac{1}{K} \sum_{k=1}^{K} G_k \;\rightarrow\; q_\pi(s, a)$$

Select the root action:

$$a = \arg\max_{a \in \mathcal{A}} Q(s, a)$$

Simplest Monte-Carlo Search
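A minimal Python sketch of this procedure. The helpers `actions(s)`, `step(s, a)` (the deterministic model $T$), and `simulation_policy(s)` are hypothetical stand-ins for an environment interface, not something defined in the slides.

```python
# Minimal sketch of the simplest Monte-Carlo search, assuming hypothetical helpers:
#   actions(s)           -> legal actions in state s
#   step(s, a)           -> (next_state, reward, done) under the deterministic model T
#   simulation_policy(s) -> an action (e.g., uniformly random)

def rollout_return(state, step, simulation_policy, max_steps=200):
    """Play one episode from `state` with the simulation policy; return the (undiscounted) return G."""
    G = 0.0
    for _ in range(max_steps):
        a = simulation_policy(state)
        state, reward, done = step(state, a)
        G += reward
        if done:
            break
    return G

def simplest_mc_search(root, actions, step, simulation_policy, K=100):
    """Q(root, a) = mean return over K simulated episodes; return argmax_a Q(root, a)."""
    Q = {}
    for a in actions(root):
        total = 0.0
        for _ in range(K):
            s, reward, done = step(root, a)   # take the candidate root action once
            total += reward if done else reward + rollout_return(s, step, simulation_policy)
        Q[a] = total / K                       # Q(s, a) = (1/K) * sum_k G_k
    return max(Q, key=Q.get)                   # a = argmax_{a in A} Q(s, a)
```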

• Could we improve our simulation policy as we obtain more simulations?

• Yes we can! We can have two policies:

• Internal to the tree: keep track of action values Q not only for the root but also for nodes internal to the tree we are expanding, and use them to improve the simulation policy over time

• External to the tree: we do not have Q estimates, and thus we use a random policy

In MCTS, the simulation policy improves

• Can we think anything bePer than ?ϵ − greedy

Can we do bePer?

1. Selection

• Used for nodes we have seen before

• Pick according to UCB

2. Expansion

• Used when we reach the frontier

• Add one node per playout

3. Simulation

• Used beyond the search frontier

• Don't bother with UCB, just play randomly

4. Back-propagation

• After reaching a terminal node

• Update value and visits for states expanded in selection and expansion (a code sketch of these four phases follows the citation below)

Monte-Carlo Tree Search

Bandit based Monte-Carlo Planning, Kocsis and Szepesvari, 2006
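To make the four phases concrete, here is a minimal Python sketch of UCT-style MCTS. The environment helpers `legal_actions`, `step`, `is_terminal`, and `result`, as well as the exploration constant `c`, are assumptions for illustration and do not appear in the slides; for two-player games the backed-up value would also need to be negated at alternating depths.

```python
import math, random

# Assumed environment interface (hypothetical):
#   legal_actions(s) -> list of actions, step(s, a) -> next state,
#   is_terminal(s) -> bool, result(s) -> numeric outcome of a finished game.

class Node:
    """One tree node: bookkeeps visit count and accumulated value."""
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = list(legal_actions(state))   # actions not yet expanded
        self.visits, self.value = 0, 0.0

def ucb(child, c=1.4):
    # Exploitation (mean value) + exploration (visit-count bonus).
    return (child.value / child.visits
            + c * math.sqrt(math.log(child.parent.visits) / child.visits))

def mcts(root_state, n_playouts=1000):
    root = Node(root_state)
    for _ in range(n_playouts):
        node = root
        # 1. Selection: descend through fully expanded nodes by UCB.
        while not node.untried and node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: add one node per playout at the frontier.
        if node.untried:
            a = node.untried.pop()
            child = Node(step(node.state, a), parent=node, action=a)
            node.children.append(child)
            node = child
        # 3. Simulation: random rollout beyond the search frontier.
        state = node.state
        while not is_terminal(state):
            state = step(state, random.choice(legal_actions(state)))
        outcome = result(state)
        # 4. Back-propagation: update values and visits along the selected path.
        while node is not None:
            node.visits += 1
            node.value += outcome
            node = node.parent
    # Play the most visited move from the root.
    return max(root.children, key=lambda ch: ch.visits).action
```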

Monte-Carlo Tree Search

For every state within the search tree we bookkeep the number of visits and the number of wins

Monte-Carlo Tree Search

Sample actions based on the UCB score
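The slides do not write the score out here; a standard choice is the UCT rule of Kocsis and Szepesvári,

$$\text{UCB}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}$$

where $N(s)$ is the number of visits to the parent state, $N(s, a)$ the number of times action $a$ has been tried from $s$, and $c$ an exploration constant.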


• Estimates action-state values $Q(s, a)$ by look-ahead planning.

• Questions:

• Are those estimates more or less accurate than those discovered with model-free RL methods, e.g., DQN?

• Why don't we simply use MCTS to select actions when playing Atari games (no prior knowledge)?

• How can we use the estimates discovered with MCTS but at the same time play fast at test time?

Monte-Carlo Tree Search planner

Use cases:

• As an online planner for selecting the next move

• For state-action value estimation at training time. At test time, just use the reactive policy network, without any lookahead planning.

• In combination with policy and value networks at test time (AlphaGo)

• In combination with policy and value networks at both train and test time (AlphaGoZero)

Monte-Carlo Tree Search

Can we do better?

Can we inject prior knowledge into state and action values instead of initializing them uniformly?

• Value neural net to evaluate board positions, to help prune the tree depth.

• Policy neural net to select moves, to help prune the tree breadth.

MCTS + Policy/ Value networks


• Train two policies, one cheap policy $p_\pi$ and one expensive $p_\sigma$, by mimicking expert moves.

• Train a new policy $p_\rho$ with RL and self-play, initialized from the SL policy $p_\sigma$.

• Train a value network that predicts the winner of games played by $p_\rho$ against itself.

• Combine the policy and value networks with MCTS at test time.

AlphaGo: Learning-guided search


• Objective: predicting expert moves

• Input: randomly sampled state-action pairs (s, a) from expert games

• Output: a probability distribution over all legal moves a.

SL policy network: a 13-layer policy network trained on 30 million positions. The network predicted expert moves on a held-out test set with an accuracy of 57.0% using all input features, and 55.7% using only the raw board position and move history as inputs, compared to the state of the art from other research groups of 44.4%. (A training sketch follows below.)

Supervised learning of policy networks
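A minimal PyTorch-style sketch of this supervised stage: a policy network trained with a cross-entropy loss to imitate expert moves. The architecture, feature-plane count, and optimizer settings below are illustrative placeholders, not the paper's exact 13-layer network.

```python
import torch
import torch.nn as nn

BOARD_PLANES, BOARD_SIZE = 48, 19   # placeholder input encoding for a 19x19 board

class PolicyNet(nn.Module):
    """Small stand-in for the SL policy network p_sigma."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(BOARD_PLANES, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),              # one logit per board point
        )

    def forward(self, x):                     # x: (B, planes, 19, 19)
        return self.conv(x).flatten(1)        # logits over 19*19 candidate moves

def sl_update(net, optimizer, states, expert_moves):
    """One SGD step on expert (state, move) pairs: maximize log p_sigma(a|s)."""
    logits = net(states)
    loss = nn.functional.cross_entropy(logits, expert_moves)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

net = PolicyNet()
optimizer = torch.optim.SGD(net.parameters(), lr=0.003)
# states: (B, 48, 19, 19) float tensor; expert_moves: (B,) long tensor of move indices
```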


• Objective: improve over the SL policy

• Weight initialization from the SL network

• Input: sampled states during self-play

• Output: a probability distribution over all legal moves a.

Rewards are provided only at the end of the game: +1 for winning, -1 for losing.

The RL policy network won more than 80% of games against the SL policy network.

Policy-gradient update (see the sketch below):

$$\Delta\rho \propto \frac{\partial \log p_\rho (a_t \mid s_t)}{\partial \rho}\, z_t$$

Reinforcement learning of policy networks
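A sketch of the corresponding self-play policy-gradient step, which scales the log-likelihood gradient of each chosen move by the final game outcome $z_t$. `PolicyNet` and the optimizer are reused from the SL sketch above; `play_self_play_game` is a hypothetical helper, not an API from the paper.

```python
import torch

def reinforce_update(net, optimizer, states, actions, z):
    """REINFORCE step: weight log-likelihood gradients by the game outcome z (+1 win, -1 loss)."""
    logits = net(states)                                            # (T, 361) move logits
    log_probs = torch.log_softmax(logits, dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log p_rho(a_t|s_t)
    loss = -(z * chosen).mean()                                     # ascend on z * log-prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Self-play loop (sketch; play_self_play_game would return one game's trajectory
# between the current network and a past checkpoint):
# states, actions, z = play_self_play_game(net, opponent_pool)
# reinforce_update(net, optimizer, states, actions, z)
```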


• Objective: estimating a value function $v^{p}(s)$ that predicts the outcome from position $s$ of games played using the RL policy $p$ for both players.

• Input: sampled states during self-play, 30 million distinct positions, each sampled from a separate game.

• Output: a scalar value

Trained by regression on state-outcome pairs (s, z) to minimize the mean squared error between the predicted value v(s) and the corresponding outcome z (a regression sketch follows below).

Reinforcement learning of value networks
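A minimal sketch of the value regression under the same placeholder-architecture assumptions: a network mapping board features to a scalar in [-1, 1], fit by mean squared error against the game outcome z.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Stand-in value network v(s): board features -> scalar outcome prediction."""
    def __init__(self, planes=48):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(planes, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64 * 19 * 19, 256),
                                  nn.ReLU(), nn.Linear(256, 1), nn.Tanh())

    def forward(self, x):
        return self.head(self.conv(x)).squeeze(1)   # (B,) predicted outcomes in [-1, 1]

def value_update(vnet, optimizer, states, z):
    """MSE regression of v(s) onto the observed outcome z in {-1, +1}."""
    loss = nn.functional.mse_loss(vnet(states), z)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```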


Selection: selecting actions within the expanded tree

Tree policy

• $a_t$ - action selected at time step $t$ from state $s_t$
• $Q(s, a)$ - average reward collected so far from MC simulations
• $P(s, a)$ - prior expert probability provided by the SL policy $p_\sigma$
• $N(s, a)$ - number of times we have visited the node
• $u$ acts as a bonus value (see the selection sketch below)

$$a_t = \arg\max_a \bigl( Q(s_t, a) + u(s_t, a) \bigr), \qquad u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$$

MCTS + Policy/ Value networks
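A tiny sketch of this selection rule over the stored edge statistics. The proportionality constant `c_puct` and the edge-dictionary layout are assumptions for illustration; the rule itself is the one on the slide.

```python
def select_action(edges, c_puct=5.0):
    """Pick argmax_a Q(s,a) + u(s,a), with u(s,a) = c_puct * P(s,a) / (1 + N(s,a)).

    `edges` maps each action to a dict with 'Q' (mean value), 'N' (visit count),
    and 'P' (prior probability from the SL policy network).
    """
    def score(a):
        e = edges[a]
        return e['Q'] + c_puct * e['P'] / (1 + e['N'])
    return max(edges, key=score)

# Example with made-up numbers:
edges = {'move_a': {'Q': 0.52, 'N': 10, 'P': 0.40},
         'move_b': {'Q': 0.48, 'N': 2,  'P': 0.35}}
print(select_action(edges))
```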

Expansion: when reaching a leaf, play the action with the highest score from $p_\sigma$

• When a leaf node is reached, it has a chance to be expanded

• Processed once by the SL policy network $p_\sigma$ and stored as prior probabilities $P(s, a)$

• Pick the child node with the highest prior probability

MCTS + Policy/ Value networks


Simulation/Evaluation: use the rollout policy to reach the end of the game

• From the selected leaf node, run multiple simulations in parallel using the rollout policy $p_\pi$

• Evaluate the leaf node as:

$$V(s_L) = (1 - \lambda)\, v_\rho(s_L) + \lambda z_L$$

• $v_\rho(s_L)$: value from the trained value function for board position $s_L$

• $z_L$: reward from the fast rollout, played until the terminal step

• $\lambda$: mixing parameter
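In code, the leaf evaluation is a single convex combination (the value of λ below is just a placeholder):

```python
def leaf_value(v_net_estimate, rollout_outcome, lam=0.5):
    # V(s_L) = (1 - lambda) * v_rho(s_L) + lambda * z_L
    return (1.0 - lam) * v_net_estimate + lam * rollout_outcome
```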

MCTS + Policy/ Value networks

• Backup: update visitation counts and recorded rewards for the chosen path inside the tree

$$N(s, a) = \sum_{i=1}^{n} \mathbf{1}_{(s,a) \in \tau_i}, \qquad Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} \mathbf{1}_{(s,a) \in \tau_i}\, V(s_L^i)$$

• The extra index $i$ denotes the $i$-th simulation, out of $n$ total simulations

• Update the visit count and mean reward of simulations passing through the node

• Once MCTS completes, the algorithm chooses the most visited move from the root position.
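In practice the sums above are maintained incrementally during back-propagation. A sketch, assuming each traversed edge stores N (visits), W (total value), and Q (mean value); the edge objects themselves are an illustrative assumption:

```python
def backup(path, leaf_value):
    """Update visit counts and mean values for every edge on the traversed path."""
    for edge in path:            # `path` is the list of edges selected in this simulation
        edge['N'] += 1
        edge['W'] += leaf_value
        edge['Q'] = edge['W'] / edge['N']

def best_root_move(root_edges):
    """After all simulations, play the most visited move from the root."""
    return max(root_edges, key=lambda a: root_edges[a]['N'])
```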

• So far, look-ahead search was used for online planning at test time!

• We saw in the last lecture that MCTS is also useful at training time: in fact it reaches superior Q values compared to vanilla model-free RL.

• AlphaGoZero uses MCTS during training instead.

• AlphaGoZero gets rid of human supervision.

AlphaGoZero: Lookahead search during training!

• Given any policy, an MCTS guided by this policy for action selection (as described earlier) will produce an improved policy for the root node (policy improvement operator)

• Train to mimic such an improved policy

AlphaGoZero: Lookahead search during training!

MCTS as policy improvement operator

• Train so that the policy network mimics this improved policy (see the sketch below)

• Train so that the position evaluation network output matches the game outcome (same as in AlphaGo)

MCTS: no MC rollouts till termination

MCTS: always use value-net evaluations of leaf nodes, no full rollouts!
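A compact sketch of how MCTS acts as a policy improvement operator during AlphaGoZero-style training: root visit counts define the improved policy target π, and the joint network is trained toward (π, z). The joint policy/value network interface and the self-play helper are assumptions for illustration; the loss follows the standard (z − v)² − πᵀ log p form.

```python
import torch
import torch.nn.functional as F

def improved_policy_from_visits(visit_counts, temperature=1.0):
    """MCTS as policy improvement: pi(a) proportional to N(root, a)^(1/T)."""
    counts = torch.tensor(visit_counts, dtype=torch.float32) ** (1.0 / temperature)
    return counts / counts.sum()

def zero_style_update(net, optimizer, states, mcts_policies, outcomes):
    """Train the policy head toward the MCTS visit distribution and the value head toward z."""
    logits, values = net(states)                     # assumed joint policy/value network
    policy_loss = -(mcts_policies * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    value_loss = F.mse_loss(values.squeeze(-1), outcomes)
    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Per self-play move (sketch):
# pi = improved_policy_from_visits([edges[a]['N'] for a in moves])
# record (state, pi); after the game ends, label each recorded state with the outcome z.
```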

Architectures

• ResNets help

• Jointly training the policy and value function using the same main feature extractor helps

• Lookahead tremendously improves the basic policy

[Figure: separate policy/value nets vs. joint policy/value nets]

RL vs. SL

Question

• Why don't we always use MCTS (or some other planner) as supervision for reactive policy learning?

• Because in many domains we do not have access to the dynamics model needed to simulate ahead.
