Chakrabarti AlphaGo Analysis
TRANSCRIPT
Mastering the Game of Go
• DeepMind problem domain
• Deep learning and reinforcement learning concepts
• Design of AlphaGo
• Execution
Reduce search space
• Reduce breadth: not all moves are equally likely; some moves are better; leverage moves made by expert players
• Reduce depth: evaluate the strength of the board (likelihood of winning); collapse symmetrical or similar boards; simulate the games
A toy sketch of both reductions follows this list.
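Here is a small illustrative sketch, not AlphaGo's actual algorithm: breadth is reduced by keeping only the top-k moves a policy ranks highly, and depth is reduced by stopping at a board-evaluation function. The callables policy_topk, value, legal_moves, and play are hypothetical stand-ins for a real Go engine.

```python
# Toy sketch of the two reductions, with hypothetical callables passed in.
def search(board, depth, policy_topk, value, legal_moves, play, k=5):
    """Negamax-style search: the policy prunes breadth, the value caps depth."""
    if depth == 0:
        return value(board)  # reduce depth: evaluate instead of recursing
    # Reduce breadth: consider only the k moves the policy ranks highest.
    candidates = policy_topk(board, legal_moves(board), k)
    return max(-search(play(board, m), depth - 1,
                       policy_topk, value, legal_moves, play, k)
               for m in candidates)
```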
Reinforcement learning
• State: S_t
• Reward (feedback): R_t
• Action: A_t
• Feedback is delayed.
• No supervisor, only a reward signal.
• Rules of the game are unknown.
• Agent's actions affect the subsequent state.
[Figure: agent-environment interaction loop]
40"
Predicting the move
1. Reducing "action candidates"
(1) Imitating expert moves (supervised learning)
Expert Moves Imitator Model (w/ CNN)
Training: current board → next action (the expert's move)
Prediction Model
[Figure: the current board encoded as a matrix of -1/0/+1 values, and the next action as a one-hot matrix]
g: s → p(a|s); next action = argmax_a p(a|s)
Current board → next action
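A minimal sketch of such an imitator, assuming PyTorch and a deliberately small architecture (the real network is far deeper and uses richer input planes): a CNN maps the board encoding to a distribution over moves and is trained with cross-entropy against the expert's move.

```python
# Minimal sketch (assumed toy architecture, not the exact AlphaGo layout):
# a small convolutional policy network mapping a board encoding
# (own stones = +1, opponent = -1, empty = 0) to move logits.
import torch
import torch.nn as nn

BOARD = 19  # Go board size

class PolicyNet(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(channels, 1, kernel_size=1)  # one logit per point

    def forward(self, board):            # board: (N, 1, 19, 19)
        logits = self.head(self.body(board))
        return logits.flatten(1)         # (N, 361) logits, one per move

# Supervised training step: imitate the expert's next move.
net = PolicyNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
boards = torch.randn(8, 1, BOARD, BOARD)              # placeholder positions
expert_moves = torch.randint(0, BOARD * BOARD, (8,))  # placeholder labels
loss = nn.functional.cross_entropy(net(boards), expert_moves)
opt.zero_grad(); loss.backward(); opt.step()
```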
Two kinds of policies
● used a large database of online expert games
● learned two versions of the neural network
○ a fast network p_π for use in evaluation (rollouts)
○ an accurate network p_σ for use in selection
Step 1: learn to predict human moves (CS63 topic: neural networks, week 7, 14?)
Further reduce search space: symmetries
[Figure: a Go position under its eight symmetries - the input, rotations by 90, 180, and 270 degrees, and the vertical reflection of each]
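A small sketch of how symmetric boards can be collapsed, assuming the board is stored as a NumPy array: enumerate the eight dihedral transforms and pick one canonical representative so symmetric positions share a single evaluation.

```python
import numpy as np

def symmetries(board: np.ndarray):
    """Yield all 8 dihedral transforms of a square board."""
    for k in range(4):                 # 0, 90, 180, 270 degree rotations
        rotated = np.rot90(board, k)
        yield rotated
        yield np.flipud(rotated)       # vertical reflection of each rotation

def canonical(board: np.ndarray) -> bytes:
    """Pick one fixed representative so symmetric boards collapse together."""
    return min(b.tobytes() for b in symmetries(board))
```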
Reduce depth by board evaluation
2. Board Evaluation
Value Prediction Model (regression): Updated Model (ver. 1,000,000)
Training: Board Position → Win / Loss
Output: Win (0~1)
Adds a regression layer to the model. Predicts values between 0 and 1. Close to 1: a good board position; close to 0: a bad board position.
Value follows from policy
Step 3: learn a board evaluation network, v_θ
● use random samples from the self-play database
● prediction target: probability that black wins from a given board
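A minimal sketch of the value network, again assuming PyTorch and a toy architecture: a convolutional body with a sigmoid regression head, trained with a squared-error loss toward the game outcome sampled from self-play.

```python
# Minimal sketch (assumed toy architecture): a value network v_theta that
# regresses a board position to a win probability in [0, 1].
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, channels=32, board=19):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Regression head: a single scalar squashed to (0, 1).
        self.head = nn.Sequential(
            nn.Linear(channels * board * board, 1), nn.Sigmoid()
        )

    def forward(self, board):             # (N, 1, 19, 19) -> (N, 1)
        return self.head(self.body(board))

net = ValueNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
positions = torch.randn(8, 1, 19, 19)            # placeholder self-play samples
outcomes = torch.randint(0, 2, (8, 1)).float()   # 1 = black won, 0 = lost
loss = nn.functional.mse_loss(net(positions), outcomes)
opt.zero_grad(); loss.backward(); opt.step()
```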
Putting it all together
Looking ahead (with Monte Carlo Tree Search)
• Action candidate reduction (policy network)
• Board evaluation (value network)
• Rollout: a faster version of estimating p(a|s); uses shallow networks (3 ms → 2 µs)
Expansion
1. If the visit count exceeds a threshold, N_r(s, a) > n_thr, insert the node for the successor state s'.
2. For every possible a', initialize the statistics:
   N_v(s', a') = N_r(s', a') = 0
   W_v(s', a') = W_v(s', a') = 0, i.e. W_r(s', a') = W_v(s', a') = 0
   P(s', a') = p_σ(a'|s')
Evaluation
1. Evaluate s' by the value network: v_θ(s').
2. Simulate the action by the rollout policy network p_π; when reaching the terminal state s_T, calculate the reward r(s_T).
Distribute search through GPUs
Distributed search:
• Main search tree: master CPU
• Policy and value networks (p_σ(a'|s'), v_θ(s')): 176 GPUs
• Rollout policy networks (p_π, r(s_T)): 1,202 CPUs
Takeaways
Apply trained networks to tasks with a different loss function: use the networks trained for a certain task (with different loss objectives) for several other tasks.
Single most important takeaway
• Feature abstraction is the key component of any machine learning algorithm
• Convolutional neural networks are great at automated feature abstraction
Reference
Silver, D., et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 529, 484–489 (January 2016).