go in numbers€¦ · evaluating seoul alphago against computers crazy stone zen 3000 2500 2000...

Go in numbers

10^17040MPlayers Positions

3,000Years Old

The Rules of Go

Capture Territory

Brute force search intractable:

1. Search space is huge

2. “Impossible” for computers to evaluate who is winning

Game tree complexity = bd

Why is Go hard for computers to play?

Convolutional neural network

Value networkEvaluation

Position

s

v (s)

Policy network Move probabilities

Position

s

p (a|s)

Exhaustive search

Monte-Carlo rollouts

Reducing depth with value network

Reducing breadth with policy network

Deep reinforcement learning in AlphaGo

Human expertpositions

Supervised Learningpolicy network

Self-play data Value networkReinforcement Learningpolicy network

Policy network: 12 layer convolutional neural network

Training data: 30M positions from human expert games (KGS 5+ dan)

Training algorithm: maximise likelihood by stochastic gradient descent

Training time: 4 weeks on 50 GPUs using Google Cloud

Results: 57% accuracy on held out test data (state-of-the art was 44%)

Supervised learning of policy networks

Policy network: 12 layer convolutional neural network

Training data: games of self-play between policy network

Training algorithm: maximise wins z by policy gradient reinforcement learning

Training time: 1 week on 50 GPUs using Google Cloud

Results: 80% vs supervised learning. Raw network ~3 amateur dan.

Reinforcement learning of policy networks

Value network: 12 layer convolutional neural network

Training data: 30 million games of self-play

Training algorithm: minimise MSE by stochastic gradient descent

Training time: 1 week on 50 GPUs using Google Cloud

Results: First strong position evaluation function - previously thought impossible

Reinforcement learning of value networks

Monte-Carlo tree search in AlphaGo: selection

P prior probabilityQ action value

Monte-Carlo tree search in AlphaGo: expansion

Policy networkP prior probability p

Monte-Carlo tree search in AlphaGo: evaluation

Value networkv

Monte-Carlo tree search in AlphaGo: rollout

Value networkr Game scorerv

Monte-Carlo tree search in AlphaGo: backup

Value networkr Game scorerv Q Action value

Deep Blue

Handcrafted chess knowledge

Alpha-beta search guided by

heuristic evaluation function

200 million positions / second

AlphaGo

Knowledge learned from expert

games and self-play

Monte-Carlo search guided by

policy and value networks

60,000 positions / second

Nature AlphaGo Seoul AlphaGo

Evaluating Nature AlphaGo against computers

Zen

Crazy Stone

3000

2500

2000

1500

1000

500

0

3500

Go Programs

AlphaGo

9p7p5p3p1p9d

7d

5d

3d

1d1k

3k

5k

7k

Professional dan (p)

Amateur

dan (d)Beginner kyu (k)

Pachi

494/495 against computer opponents

>75% winning rate with 4 stone handicap

Even stronger using distributed machines

Fuego

Gnu

Go

Elo rating

Evaluating Nature AlphaGo against humans

Fan Hui (2p): European Champion 2013 - 2016

Match was played in October 2015

AlphaGo won the match 5-0

First program ever to beat a professional

on a full size 19x19 in an even game

Deep Reinforcement Learning (as Nature AlphaGo)

● Improved value network

● Improved policy network

● Improved search

● Improved hardware (TPU vs GPU)

Seoul AlphaGo

Evaluating Seoul AlphaGo against computers

Zen

Crazy Stone

3000

2500

2000

1500

1000

500

0

3500

AlphaGo (N

ature v13)

Pachi

Fuego

Gnu

Go

4000

4500 AlphaGo (Seoul v18)

9p7p5p3p1p9d

7d

5d

3d

1d1k

3k

5k

7k

Professional dan (p)

Amateur

dan (d)Beginner kyu (k)

>50% against Nature AlphaGo with 3 to 4 stone handicap

CAUTION: ratings based on self-play results

Evaluating Seoul AlphaGo against humans

Lee Sedol (9p): winner of 18 world titles

Match was played in Seoul, March 2016

AlphaGo won the match 4-1

AlphaGo vs Lee Sedol: Game 1

Deep Reinforcement Learning: Beyond AlphaGo

http://www.youtube.com/watch?v=U_WY2qxUqHQ

http://www.youtube.com/watch?v=nMR5mjCFZCw

http://www.youtube.com/watch?v=0xo1Ldx3L5Q

What’s Next?

Team

David Silver1*, Aja Huang1*, Chris J. Maddison1, Arthur Guez1, Laurent Sifre1, George van den Driessche1,Julian Schrittwieser1, Ioannis Antonoglou1, , Veda Panneershelvam1, Marc Lanctot1, Sander Dieleman1, Dominik Grewe1John Nham2, Nal Kalchbrenner1, Ilya Sutskever2, Timothy Lillicrap1, Madeleine Leach1, Koray Kavukcuoglu1,Thore Graepel1 & Demis Hassabis1

With thanks to: Lucas Baker, David Szepesvari, Malcolm Reynolds, Ziyu Wang, Nando De Freitas, Mike Johnson, Ilya Sutskever, Jeff Dean, Mike Marty, Sanjay Ghemawat.

DaveSilver

ChrisMaddison

ArthurGuez

LaurentSifre

GeorgeVan Den Driessche

JulianSchrittwieser

IoannisAntonoglou

VedaPanneershelvam

MarcLanctot

SanderDieleman

DominikGrewe

JohnNham

NalKalchbrenner

TimLillicrap

MaddyLeach

KorayKavukcuoglu

ThoreGraepel

AjaHuang

DemisHassabis

YutianChen

Demis Hassabis

go in numbers€¦ · evaluating seoul alphago against computers crazy stone zen 3000 2500 2000...

Documents