gpu-based a3c for deep reinforcement learning · m. babaeizadeh†,‡ , i.frosio‡ , s.tyree ‡...

M. Babaeizadeh†,‡, I. Frosio‡, S. Tyree‡, J. Clemons‡, J. Kautz‡

†University of Illinois at Urbana-Champaign, USA ‡ NVIDIA, USA

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

An ICLR 2017 paper

A github project



Learning to accomplish a task

Image from www.33rdsquare.com

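Definitions

✓ Environment

✓ Agent

✓ Observable status S_t

✓ Reward R_t

✓ Action a_t

✓ Policy a_t = π(S_t)

[Diagram: the environment passes the status S_t and the reward R_t to the RL agent, which returns the action a_t = π(S_t). Diagram labels: S_t, R_t, ~R_t, a_t = π(S_t).]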
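The loop these definitions describe can be sketched in a few lines. This is a minimal illustration only: `CoinFlipEnv`, `policy` and `run_episode` are hypothetical names, and the toy environment is an assumption that does not come from the slides.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the status is 0 or 1.

    The reward is 1 when the action matches the current status.
    """

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.status = self.rng.randint(0, 1)

    def step(self, action):
        reward = 1.0 if action == self.status else 0.0
        self.status = self.rng.randint(0, 1)  # move to the next status
        return self.status, reward


def policy(status):
    """The policy pi(S_t): here a fixed rule that copies the observed status."""
    return status


def run_episode(env, policy, steps=5):
    status, rewards = env.status, []
    for _ in range(steps):
        action = policy(status)            # a_t = pi(S_t)
        status, reward = env.step(action)  # environment returns next status and R_t
        rewards.append(reward)
    return rewards


rewards = run_episode(CoinFlipEnv(), policy)
```

In a deep RL agent the hand-written `policy` would be replaced by a neural network that is trained from the collected rewards.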


[Diagram: the same loop with a deep RL agent, which receives S_t, R_t and returns a_t = π(S_t).]

[Diagram: the rewards R0 R1 R2 R3 R4 collected over time produce the policy update Δπ(∙).]
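One common way to turn a reward sequence such as R0 … R4 into a training signal is the discounted return. The sketch below assumes a discount factor `gamma`, which the slides do not mention. A3C additionally bootstraps from a learned value estimate, which is omitted here.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = R_t + gamma * G_{t+1} for every step, computed backwards."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Five rewards, standing in for R0 ... R4; gamma=0.5 keeps the numbers easy to check.
returns = discounted_returns([0.0, 0.0, 1.0, 0.0, 1.0], gamma=0.5)
```

Early steps receive credit for rewards that arrive later, scaled down by the discount.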


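Asynchronous Advantage Actor-Critic (Mnih et al., arXiv:1602.01783v2, 2016)

[Diagram: Agent 1, Agent 2, …, Agent 16 each receive S_t, R_t and act with a_t = π(S_t); the rewards R0 R1 R2 R3 R4 produce an update Δπ(∙) that is applied to the master model, giving the new policy π'(∙).]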
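The asynchronous pattern in the diagram can be imitated with threads that share one master model. This is a toy sketch under stated assumptions: `MasterModel` and `agent` are hypothetical, the "model" is a single scalar, and the gradient comes from a made-up quadratic loss rather than from an actor-critic objective.

```python
import threading


class MasterModel:
    """Hypothetical master model: one scalar parameter guarded by a lock."""

    def __init__(self):
        self.theta = 0.0
        self.updates = 0
        self._lock = threading.Lock()

    def snapshot(self):
        with self._lock:
            return self.theta

    def apply(self, delta):
        with self._lock:
            self.theta += delta
            self.updates += 1


def agent(master, target, steps, lr=0.01):
    for _ in range(steps):
        local = master.snapshot()          # copy the current parameters
        gradient = 2.0 * (local - target)  # toy gradient of (theta - target)^2
        master.apply(-lr * gradient)       # asynchronous update of the master


master = MasterModel()
threads = [threading.Thread(target=agent, args=(master, 1.0, 50)) for _ in range(16)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Each agent may compute its gradient from a slightly stale snapshot, which is the price of not synchronizing the agents with each other.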


The GPU

LOW OCCUPANCY (33%)

HIGH OCCUPANCY (100%)

[Plot axis: batch size]

HIGH OCCUPANCY (100%), LOW UTILIZATION (40%)

[Plot axis: time]

HIGH OCCUPANCY (100%), HIGH UTILIZATION (100%)

[Plot axis: time]

BANDWIDTH LIMITED

[Plot axis: time]
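The two percentages on these slides can be read as simple ratios. The sketch below uses simplified definitions that are assumptions made for illustration: occupancy as the fraction of parallel slots filled by a batch, and utilization as the fraction of time the device is busy. It is not the formal CUDA occupancy metric.

```python
def occupancy(batch_size, capacity):
    """Fraction of the device's parallel slots filled by one batch (toy model)."""
    return min(batch_size, capacity) / capacity


def utilization(busy_intervals, total_time):
    """Fraction of wall-clock time during which the device is computing."""
    busy = sum(end - start for start, end in busy_intervals)
    return busy / total_time


low_occupancy = occupancy(batch_size=1, capacity=3)    # one slot of three in use
full_occupancy = occupancy(batch_size=3, capacity=3)   # every slot in use
low_utilization = utilization([(0, 2), (5, 7)], total_time=10)  # idle gaps between batches
```

A larger batch raises occupancy, while keeping the device fed without idle gaps raises utilization.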

[Diagram: the GPU + the A3C architecture (master model with update Δπ(∙) and new policy π'(∙); Agent 1, Agent 2, …, Agent 16, each with S_t, R_t).]

MAPPING DEEP PROBLEMS TO A GPU

REGRESSION, CLASSIFICATION, …

data, labels (Pear, pear, pear, pear, …; Empty, empty, …; Fig, fig, fig, fig, fig, fig, …; Strawberry, strawberry, strawberry, …): 100% utilization / occupancy

REINFORCEMENT LEARNING

status, reward, action: ?

A3C

[Diagram, built up over several slides: Agent 1, Agent 2, …, Agent 16 each send S_t, R_t to the master model; the update Δπ(∙) is applied to the master model, giving the new policy π'(∙).]

GPU-based A3C

GA3C

El Capitan big wall, Yosemite Valley

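GA3C (INFERENCE)

[Diagram: Agent 1, Agent 2, …, Agent N each place their status S_t in the prediction queue; the predictors take batches {S_t} from the queue, run them through the master model, and return the actions {a_t}, one a_t per agent.]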
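The inference path can be sketched with a standard queue. This is an illustration under assumptions, not the GA3C implementation: `predictor_step`, `toy_model` and the `max_batch` parameter are hypothetical names introduced here.

```python
import queue


def predictor_step(prediction_queue, model, max_batch=8):
    """Drain up to max_batch pending (agent_id, state) requests and answer them in one call."""
    batch = [prediction_queue.get()]  # block until at least one request arrives
    while len(batch) < max_batch:
        try:
            batch.append(prediction_queue.get_nowait())
        except queue.Empty:
            break
    agent_ids = [agent_id for agent_id, _ in batch]
    states = [state for _, state in batch]
    actions = model(states)               # one batched call for {S_t}
    return dict(zip(agent_ids, actions))  # each a_t is routed back to its agent


def toy_model(states):
    """Hypothetical stand-in for the master model: action 1 if the state is positive."""
    return [1 if state > 0 else 0 for state in states]


prediction_queue = queue.Queue()
for agent_id, state in enumerate([0.5, -1.0, 2.0]):
    prediction_queue.put((agent_id, state))

actions = predictor_step(prediction_queue, toy_model)
```

Batching the requests of many agents into one model call is what lets a GPU work on more than one status at a time.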

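GA3C (TRAINING)

[Diagram: Agent 1, Agent 2, …, Agent N each place their experiences S_t, R_t (rewards R0 R1 R2 R4) in the training queue; the trainers take batches {S_t, R_t} from the queue and apply the update Δπ(∙) to the master model.]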

GA3C

[Diagram: the inference and training paths combined. Agents 1 … N send S_t to the prediction queue, predictors return {a_t} from the master model, agents send S_t, R_t to the training queue, and trainers apply Δπ(∙) to the master model.]

GPU-based A3C

GA3C


Learn how to balance

GA3C: PREDICTIONS PER SECOND (PPS)

[Diagram: the full GA3C architecture (agents, prediction queue, predictors, training queue, trainers, master model).]

GA3C: TRAININGS PER SECOND (TPS)

[Diagram: the full GA3C architecture (agents, prediction queue, predictors, training queue, trainers, master model).]

AUTOMATIC SCHEDULING
Balancing the system at run time

[Plots: ATARI Boxing, ATARI Pong]

NP = # predictors, NT = # trainers, NA = # agents, TPS = trainings per second
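Run-time balancing of NP, NT and NA can be pictured as a search that keeps whichever configuration yields the highest TPS. The greedy random search below is an assumed strategy chosen for illustration, and `tune` and `toy_tps` are hypothetical; the slides do not specify the scheduling algorithm.

```python
import random


def tune(measure_tps, config, steps=200, seed=0):
    """Randomly nudge NP, NT or NA by +/-1 and keep a change only if measured TPS improves."""
    rng = random.Random(seed)
    best_tps = measure_tps(config)
    for _ in range(steps):
        candidate = dict(config)
        key = rng.choice(["NP", "NT", "NA"])
        candidate[key] = max(1, candidate[key] + rng.choice([-1, 1]))
        tps = measure_tps(candidate)
        if tps > best_tps:
            config, best_tps = candidate, tps
    return config, best_tps


def toy_tps(config):
    """Hypothetical throughput model with a single peak at NP=2, NT=3, NA=16."""
    return (100.0
            - (config["NP"] - 2) ** 2
            - (config["NT"] - 3) ** 2
            - 0.1 * (config["NA"] - 16) ** 2)


start = {"NP": 1, "NT": 1, "NA": 8}
best_config, best_tps = tune(toy_tps, start)
```

In a real system `measure_tps` would be a noisy measurement taken over a time window, so a practical scheduler would need to average or tolerate occasional bad steps.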

THE ADVANTAGE OF SPEED
More frames = faster convergence

LARGER DNNS
For real world applications (e.g. robotics, automotive)

A3C (ATARI): Conv 16, 8x8 filters, stride 4; Conv 32, 4x4 filters, stride 2; FC 256

Others (robotics): Conv 32, 8x8 filters, stride 1, 2, 3, 4; Conv 32, 4x4 filters, stride 2; Conv 64, 4x4 filters, stride 2; FC 256; ?

Timothy P. Lillicrap et al., Continuous control with deep reinforcement learning, International Conference on Learning Representations, 2016.

S. Levine et al., End-to-end training of deep visuomotor policies, Journal of Machine Learning Research, 17:1-40, 2016.
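To see how the first-layer stride changes the size of the network, the number of features reaching the FC layer can be computed from the layer list. The sketch assumes an 84x84 input and convolutions without padding, neither of which is stated on the slide. It also assumes the larger network is Conv 32 (8x8), Conv 32 (4x4, stride 2), Conv 64 (4x4, stride 2), followed by FC 256.

```python
def conv_output_size(input_size, kernel, stride):
    """Spatial size after a convolution without padding."""
    return (input_size - kernel) // stride + 1


def feature_count(input_size, layers):
    """layers is a list of (filters, kernel, stride); returns the number of inputs to the FC layer."""
    size, filters = input_size, 0
    for filters, kernel, stride in layers:
        size = conv_output_size(size, kernel, stride)
    return size * size * filters


INPUT_SIZE = 84  # assumed frame size, not stated on the slide

small_dnn = [(16, 8, 4), (32, 4, 2)]
small_features = feature_count(INPUT_SIZE, small_dnn)

large_features = {
    stride: feature_count(INPUT_SIZE, [(32, 8, stride), (32, 4, 2), (64, 4, 2)])
    for stride in (1, 2, 3, 4)
}
```

Under these assumptions, a smaller first-layer stride leaves larger feature maps for every later layer, so the compute per prediction grows quickly as the stride goes from 4 to 1.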

GA3C VS. A3C*: PREDICTIONS PER SECOND
* Our TensorFlow implementation on a CPU

[Bar chart: PPS of A3C and GA3C, axis from 1 to 10000. GA3C speedup over A3C: Small DNN 4x; Large DNN, stride 4: 11x; Large DNN, stride 3: 12x; Large DNN, stride 2: 20x; Large DNN, stride 1: 45x.]

CPU & GPU UTILIZATION IN GA3C
For larger DNNs

[Bar chart: CPU % and GPU % utilization (0 to 100%) for Small DNN and for Large DNN with stride 4, 3, 2 and 1.]

GA3C POLICY LAG
Asynchronous playing and training

[Diagram: the full GA3C architecture (agents, prediction queue, predictors, training queue, trainers, master model), with the labels DNN A and DNN B.]

STABILITY AND CONVERGENCE SPEED
Reducing policy lag through min training batch size
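A minimum training batch size can be enforced on the trainer side by merging queued experiences until enough samples are available. The sketch below is an assumption about how such a rule might look: `collect_training_batch` and the shape of the queue items are hypothetical and not taken from the GA3C code.

```python
import queue


def collect_training_batch(training_queue, min_batch_size):
    """Merge queued per-agent experiences until at least min_batch_size samples are gathered."""
    states, rewards = [], []
    while len(states) < min_batch_size:
        agent_states, agent_rewards = training_queue.get()  # blocks until an agent delivers
        states.extend(agent_states)
        rewards.extend(agent_rewards)
    return states, rewards


training_queue = queue.Queue()
training_queue.put(([0.1, 0.2], [0.0, 1.0]))
training_queue.put(([0.3, 0.4, 0.5], [0.0, 0.0, 1.0]))
training_queue.put(([0.6], [1.0]))

states, rewards = collect_training_batch(training_queue, min_batch_size=4)
```

Larger batches mean fewer model updates per second, so fewer updates can happen between the moment an experience is generated and the moment it is used for training.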

GPU-based A3C

GA3C (45x faster)


Balancing computational resources, speed and stability.

RESOURCES

THEORY

M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl).

CODING

GA3C, a GPU implementation of A3C (open source at https://github.com/NVlabs/GA3C).

A general architecture to generate and consume training data.

QUESTIONS

Policy lag

Multiple GPUs

Why TensorFlow

Replay memory

…

ATARI 2600

Github: https://github.com/NVlabs/GA3C

ICLR 2017: Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU.

BACKUP SLIDES

POLICY LAG IN GA3C

Potentially large time lag between training data generation and network update

SCORES

Training a larger DNN

BALANCING COMPUTATIONAL RESOURCES
The actors: CPU, PCI-E & GPU

[Plots: trainings per second, training queue size, prediction queue size]
