Monte Carlo Tree Search
Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon
School of Computer Science
Spring 2020, CMU 10-403
Part of slides inspired by Sebag, Gaudel
Learning: the acquisition of knowledge or skills through experience, study, or by being taught.
Planning: any computational process that uses a model to create or improve a policy.
Definitions
Model → Planning → Policy
Given a deterministic transition function T, a root state s, and a simulation policy π (potentially random):
Simulate K episodes from the current (real) state:
{s, a, R_1^k, S_1^k, A_1^k, R_2^k, S_2^k, A_2^k, ..., S_T^k}_{k=1}^K ∼ T, π
Evaluate the action-value function of the root by the mean return:
Q(s, a) = (1/K) ∑_{k=1}^{K} G_k → qπ(s, a)
Select the root action:
a = argmax_{a ∈ 𝒜} Q(s, a)
Simplest Monte-Carlo Search
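A minimal Python sketch of this flat Monte-Carlo search. The environment interface used here (`step`, `actions`, `is_terminal`) and the discounting are illustrative assumptions, not part of the slides.

```python
import random

def flat_mc_search(state, step, actions, is_terminal, K=100, gamma=1.0):
    """Estimate Q(s, a) by averaging K random-rollout returns per root action."""
    best_action, best_value = None, float("-inf")
    for a in actions(state):
        total_return = 0.0
        for _ in range(K):
            s, r = step(state, a)          # take the root action once
            g, discount = r, gamma
            while not is_terminal(s):      # then follow the random simulation policy
                s, r = step(s, random.choice(actions(s)))
                g += discount * r
                discount *= gamma
            total_return += g
        q = total_return / K               # Q(s, a) = (1/K) * sum of returns
        if q > best_value:
            best_action, best_value = a, q
    return best_action                     # a = argmax_a Q(s, a)
```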
• Could we improve our simulation policy as we obtain more simulations?
• Yes, we can! We can have two policies:
• Internal to the tree: keep track of action values Q not only for the root but also for nodes internal to the tree we are expanding, and use them to improve the simulation policy over time.
• External to the tree: we do not have Q estimates, and thus we use a random policy.
In MCTS, the simulation policy improves
• Can we think of anything better than ϵ-greedy?
Can we do better?
1. Selection
• Used for nodes we have seen before
• Pick according to UCB
2. Expansion
• Used when we reach the frontier
• Add one node per playout
3. Simulation
• Used beyond the search frontier
• Don't bother with UCB, just play randomly
4. Back-propagation
• After reaching a terminal node
• Update value and visits for states expanded in selection and expansion
Monte-Carlo Tree Search
Bandit based Monte-Carlo Planning, Kocsis and Szepesvari, 2006
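A compact Python sketch of these four phases with UCB1 selection. The game-model callables (`actions`, `step`, `is_terminal`, `result`) are assumed interfaces for illustration; this follows the generic algorithm rather than the exact formulation in the Kocsis and Szepesvári paper.

```python
import math
import random

class Node:
    """One state in the search tree, with visit/value bookkeeping."""
    def __init__(self, state, actions, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action                 # action that led here from the parent
        self.children = []
        self.visits = 0
        self.value = 0.0                     # sum of simulation outcomes
        self.untried = list(actions(state))  # actions not yet expanded

    def ucb_child(self, c=1.4):
        # UCB1: mean outcome plus an exploration bonus for rarely tried children
        return max(self.children,
                   key=lambda ch: ch.value / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root_state, actions, step, is_terminal, result, n_playouts=1000):
    root = Node(root_state, actions)
    for _ in range(n_playouts):
        node = root
        # 1. Selection: walk down fully expanded nodes using UCB
        while not node.untried and node.children:
            node = node.ucb_child()
        # 2. Expansion: add one new child per playout at the frontier
        if node.untried:
            a = node.untried.pop()
            node.children.append(Node(step(node.state, a), actions, node, a))
            node = node.children[-1]
        # 3. Simulation: random rollout beyond the frontier (no UCB)
        s = node.state
        while not is_terminal(s):
            s = step(s, random.choice(actions(s)))
        z = result(s)                        # e.g. +1 for a win, -1 for a loss
        # 4. Back-propagation: update values and visit counts along the path
        while node is not None:
            node.visits += 1
            node.value += z
            node = node.parent
    # act with the most visited root child
    return max(root.children, key=lambda ch: ch.visits).action
```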
Monte-Carlo Tree Search
For every state within the search tree, we bookkeep the number of visits and the number of wins.
Monte-Carlo Tree Search
Sample actions based on the UCB score
Monte-Carlo Tree Search
• Estimates state-action values Q(s, a) by look-ahead planning.
• Questions:
• Are those estimates more or less accurate than those discovered with model-free RL methods, e.g., DQN?
• Why don't we simply use MCTS to select actions when playing Atari games (no prior knowledge)?
• How can we use the estimates discovered with MCTS but at the same time play fast at test time?
Monte-Carlo Tree Search planner
Use cases:
• As an online planner for selecting the next move
• For state-action value estimation at training time. At test time, just use the reactive policy network, without any lookahead planning.
• In combination with policy and value networks at test time (AlphaGo)
• In combination with policy and value networks at both train and test time (AlphaGoZero)
Monte-Carlo Tree Search
Can we do better?
Can we inject prior knowledge into state and action values instead of initializing them uniformly?
• Value neural net to evaluate board positions to help prune the tree depth.
• Policy neural net to select moves to help prune the tree breadth.
MCTS + Policy/ Value networks
• Train two policies by mimicking expert moves: one cheap rollout policy pπ and one expensive policy pσ.
• Train a new policy pρ with RL and self-play, initialized from the SL policy pσ.
• Train a value network that predicts the winner of games played by pρ against itself.
• Combine the policy and value networks with MCTS at test time.
AlphaGo: Learning-guided search
• Objective: predicting expert moves
• Input: randomly sampled state-action pairs (s, a) from expert games
• Output: a probability distribution over all legal moves a.
SL policy network pσ: a 13-layer policy network trained from 30 million positions. The network predicted expert moves on a held-out test set with an accuracy of 57.0% using all input features, and 55.7% using only the raw board position and move history as inputs, compared to the state of the art from other research groups of 44.4%.
Supervised learning of policy networks
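A hedged PyTorch sketch of this supervised training step. The toy fully connected network below is a stand-in for the 13-layer convolutional SL policy network, and the input is reduced to a raw 19×19 board for brevity; these are illustrative assumptions, not the AlphaGo implementation.

```python
import torch
import torch.nn as nn

policy_sl = nn.Sequential(            # placeholder for the 13-layer conv net
    nn.Flatten(),
    nn.Linear(19 * 19, 512), nn.ReLU(),
    nn.Linear(512, 19 * 19),          # one logit per board point (legal moves)
)
optimizer = torch.optim.SGD(policy_sl.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()       # maximize log-likelihood of expert moves

def sl_step(states, expert_moves):
    """states: (B, 19, 19) float tensor; expert_moves: (B,) indices of expert moves."""
    logits = policy_sl(states)
    loss = loss_fn(logits, expert_moves)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```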
• Objective: improve over the SL policy pσ with an RL policy pρ
• Weight initialization from the SL network
• Input: states sampled during self-play
• Output: a probability distribution over all legal moves a.
Rewards are provided only at the end of the game: +1 for winning, -1 for losing.
The RL policy network won more than 80% of games against the SL policy network.
Δρ ∝ (∂ log pρ(at | st) / ∂ρ) · zt
Reinforcement learning of policy networks
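A minimal sketch of this policy-gradient (REINFORCE) update, reusing the `policy_sl` network from the sketch above as the initialization; the trajectory format and hyperparameters are assumptions, not the exact AlphaGo training code.

```python
import copy
import torch

policy_rl = copy.deepcopy(policy_sl)          # initialize the RL policy from SL weights
rl_optimizer = torch.optim.SGD(policy_rl.parameters(), lr=1e-3)

def reinforce_step(states, actions_taken, z):
    """One self-play game: states (T, 19, 19), actions_taken (T,), z = +1 win / -1 loss."""
    logits = policy_rl(states)                                # (T, num_moves)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions_taken.unsqueeze(1)).squeeze(1)
    loss = -(chosen * z).mean()                               # gradient ∝ ∂ log pρ(at|st)/∂ρ · zt
    rl_optimizer.zero_grad()
    loss.backward()
    rl_optimizer.step()
    return loss.item()
```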
• Objective: estimate a value function vp(s) that predicts the outcome from position s of games played using the RL policy pρ for both players.
• Input: states sampled during self-play; 30 million distinct positions, each sampled from a separate game.
• Output: a scalar value
Trained by regression on state-outcome pairs (s, z) to minimize the mean squared error between the predicted value v(s) and the corresponding outcome z.
Reinforcement learning of value networks
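A short sketch of this regression step; the architecture is a placeholder (a small fully connected net on the raw board), not AlphaGo's value network.

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(19 * 19, 512), nn.ReLU(),
    nn.Linear(512, 1), nn.Tanh(),      # scalar prediction in [-1, 1]
)
value_opt = torch.optim.SGD(value_net.parameters(), lr=1e-2)
mse = nn.MSELoss()

def value_step(states, outcomes):
    """states: (B, 19, 19); outcomes z: (B,) in {+1, -1}, one position per game."""
    v = value_net(states).squeeze(1)
    loss = mse(v, outcomes)            # minimize (v(s) - z)^2
    value_opt.zero_grad()
    loss.backward()
    value_opt.step()
    return loss.item()
```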
Selection: selecting actions within the expanded tree
Tree policy
• at — the action selected at time step t from state st
• Q(st, a) — the average reward collected so far from MC simulations
• P(s, a) — the prior expert probability provided by the SL policy pσ
• N(s, a) — the number of times we have visited the parent node
• u(s, a) acts as a bonus value
at = argmax_a (Q(st, a) + u(st, a)), with u(s, a) ∝ P(s, a) / (1 + N(s, a))
MCTS + Policy/ Value networks
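A small sketch of this selection rule, assuming the per-edge statistics Q, N and the priors P are kept in dictionaries keyed by (state, action); the slides only state the proportionality for u(s, a), so the scaling constant `c_puct` below is an assumption.

```python
def select_action(s, legal_actions, Q, N, P, c_puct=5.0):
    """Pick a_t = argmax_a Q(s, a) + u(s, a), with u(s, a) ∝ P(s, a) / (1 + N(s, a))."""
    def score(a):
        u = c_puct * P[(s, a)] / (1.0 + N.get((s, a), 0))   # exploration bonus u(s, a)
        return Q.get((s, a), 0.0) + u
    return max(legal_actions, key=score)
```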
Expansion: when reaching a leaf, play the action with the highest score from pσ
• When a leaf node is reached, it has a chance to be expanded
• It is processed once by the SL policy network pσ and stored as prior probabilities P(s, a)
• Pick the child node with the highest prior probability
MCTS + Policy/ Value networks
• From the selected leaf node, run multiple simulations in parallel using the rollout policy pπ
• Evaluate the leaf node as:
V(sL) = (1 − λ) vρ(sL) + λ zL
• vρ(sL): value from the trained value function for board position sL
• zL: reward from a fast rollout with pπ, played until the terminal step
• λ: mixing parameter
Simulation/Evaluation: use the rollout policy to reach the end of the game
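A one-function sketch of this mixed leaf evaluation. `value_net_eval` and `rollout_to_end` are hypothetical helpers, and the default mixing parameter is only a placeholder.

```python
def evaluate_leaf(s_L, value_net_eval, rollout_to_end, lam=0.5):
    """V(s_L) = (1 - lambda) * v(s_L) + lambda * z_L."""
    v = value_net_eval(s_L)      # value-network prediction for the board position
    z = rollout_to_end(s_L)      # z_L: outcome of a fast rollout with the rollout policy
    return (1.0 - lam) * v + lam * z
```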
MCTS + Policy/ Value networks
• Backup: update visitation counts and recorded rewards for the chosen path inside the tree
N(s, a) = ∑_{i=1}^{n} 1[(s, a) ∈ τi]
Q(s, a) = (1 / N(s, a)) ∑_{i=1}^{n} 1[(s, a) ∈ τi] · V(sL^i)
• The extra index i denotes the i-th simulation; n is the total number of simulations.
• Update the visit count and mean reward of simulations passing through a node.
• Once MCTS completes, the algorithm chooses the most visited move from the root position.
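The same backup can be done incrementally after each simulation; the sketch below keeps a running sum W so that Q(s, a) = W(s, a) / N(s, a), which is an equivalent rewrite of the indicator-sum formulas above (the dictionaries are hypothetical bookkeeping, not AlphaGo's data structures).

```python
def backup(path, leaf_value, N, W, Q):
    """Update edge statistics along the traversed path after one simulation.

    `path` is the list of (s, a) edges visited during selection/expansion,
    and `leaf_value` is V(s_L) for this simulation.
    """
    for edge in path:
        N[edge] = N.get(edge, 0) + 1
        W[edge] = W.get(edge, 0.0) + leaf_value
        Q[edge] = W[edge] / N[edge]          # running mean of leaf evaluations
```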
• So far, look-ahead search was used for online planning at test time!
• We saw in the last lecture that MCTS is also useful at training time: it in fact reaches Q-value estimates superior to those of vanilla model-free RL.
• AlphaGoZero uses MCTS during training instead.
• AlphaGoZero gets rid of human supervision.
AlphaGoZero: Lookahead search during training!
• Given any policy, an MCTS guided by this policy for action selection (as described earlier) will produce an improved policy for the root node (policy improvement operator).
• Train the policy network to mimic this improved policy.
AlphaGoZero: Lookahead search during training!
MCTS as policy improvement operator
• Train so that the policy network mimics this improved policy
• Train so that the position evaluation network output matches the outcome (same as in AlphaGo)
MCTS: no MC rollouts till termination
MCTS: always uses value-net evaluations of leaf nodes, no full rollouts!
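A high-level sketch of this AlphaGoZero-style training loop: self-play with MCTS produces improved move distributions π and outcomes z, and the joint policy/value network is trained to match them. The `game` and `run_mcts` interfaces and the per-example update are illustrative assumptions (in a real two-player setup, z must also be viewed from the player to move).

```python
import torch

def self_play_and_train(net, optimizer, game, run_mcts, n_games=100):
    """net(state) is assumed to return (move log-probabilities, scalar value)."""
    for _ in range(n_games):
        examples, s = [], game.initial_state()
        while not game.is_terminal(s):
            pi = run_mcts(s, net)          # MCTS visit-count distribution = improved policy
            examples.append((s, pi))
            s = game.play(s, pi)           # sample the next move from pi
        z = game.result(s)                 # final outcome, e.g. +1 / -1
        for s_t, pi_t in examples:
            log_p, v = net(s_t)
            # policy head mimics the MCTS policy; value head regresses to the outcome
            loss = -(pi_t * log_p).sum() + (v - z) ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```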
Architectures
• ResNets help
• Jointly training the policy and value function using the same main feature extractor helps
Separate policy/value nets vs. joint policy/value nets
Architectures
• Lookahead tremendously improves the basic policy
RL vs. SL
Question
• Why don't we always use MCTS (or some other planner) as supervision for reactive policy learning?
• Because in many domains we do not have access to the dynamics.