
M. Babaeizadeh†,‡, I. Frosio‡, S. Tyree‡, J. Clemons‡, J. Kautz‡

†University of Illinois at Urbana-Champaign, USA; ‡NVIDIA, USA

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

An ICLR 2017 paper

A GitHub project


Learning to accomplish a task

[Image from www.33rdsquare.com]

Definitions

[Diagram: an RL agent observes the status St and reward Rt of the environment and responds with an action at = π(St), aiming to maximize the expected return ~Rt.]

✓ Environment

✓ Agent

✓ Observable status St

✓ Reward Rt

✓ Action at

✓ Policy at = π(St)
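These pieces fit together in a simple interaction loop. A minimal sketch in Python, assuming a hypothetical environment with Gym-style reset()/step() methods and a policy function (the names are illustrative, not from the paper):

# Minimal RL interaction loop (sketch; `env` and `policy` are
# hypothetical objects with the methods used below).
def run_episode(env, policy):
    s_t = env.reset()                    # initial observable status S_t
    total_reward, done = 0.0, False
    while not done:
        a_t = policy(s_t)                # action a_t = pi(S_t)
        s_t, r_t, done = env.step(a_t)   # next status S_t and reward R_t
        total_reward += r_t
    return total_reward                  # proxy for the return ~R_t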

Definitions

[Diagram: a deep RL agent; the policy at = π(St) is implemented by a DNN, and the agent seeks to maximize the expected return ~Rt.]

Definitions

[Diagram: training the deep RL agent; the rewards R0 R1 R2 R3 R4 collected along an episode are used to compute an update Δπ(·) of the policy.]


Asynchronous Advantage Actor-Critic (Mnih et al., arXiv:1602.01783v2, 2016)

[Diagram: 16 agents (Agent 1, Agent 2, …, Agent 16) interact with independent copies of the environment. Each agent observes St, Rt, selects at = π(St), collects the rewards R0 R1 R2 R3 R4, computes a policy update Δπ(·), and sends it to the master model, receiving the updated policy π′(·) in return.]
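One way to picture an asynchronous worker is sketched below; the master object and its get_weights()/apply_gradients() methods are assumptions for illustration, not the actual API of Mnih et al.:

# Sketch of a single A3C worker thread. Each of the 16 workers
# repeats: sync with the master, play t_max steps, send an update.
def a3c_worker(master, env, t_max=5):
    local = master.clone()                       # local copy of the policy
    s_t = env.reset()
    while True:
        local.set_weights(master.get_weights())  # fetch updated pi'(.)
        states, actions, rewards = [], [], []
        for _ in range(t_max):                   # collect R_0 ... R_4
            a_t = local.act(s_t)                 # a_t = pi(S_t)
            s_next, r_t, done = env.step(a_t)
            states.append(s_t); actions.append(a_t); rewards.append(r_t)
            s_t = env.reset() if done else s_next
            if done:
                break
        grads = local.gradients(states, actions, rewards)  # Delta pi(.)
        master.apply_gradients(grads)            # asynchronous update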


The GPU

LOW OCCUPANCY (33%)

HIGH OCCUPANCY (100%)

[Increasing the batch size raises occupancy.]

HIGH OCCUPANCY (100%), LOW UTILIZATION (40%)

[Timeline diagram: the GPU is busy only part of the time.]

HIGH OCCUPANCY (100%), HIGH UTILIZATION (100%)

[Timeline diagram: back-to-back work keeps the GPU busy all the time.]

BANDWIDTH LIMITED

[Timeline diagram: data transfers can limit throughput.]

A3C + GPU?

[Diagram: the A3C architecture (16 agents exchanging St, Rt, at = π(St), and updates Δπ(·) with the master model) next to a GPU.]

MAPPING DEEP PROBLEMS TO A GPU

REGRESSION, CLASSIFICATION, …: large batches of data and labels (“pear, pear, pear, pear, …”, “empty, empty, …”, “fig, fig, fig, fig, fig, fig, …”, “strawberry, strawberry, strawberry, …”) can drive the GPU to 100% utilization / occupancy.

REINFORCEMENT LEARNING: each agent runs a sequential loop of status, reward → action. How can this be batched onto the GPU? [?]

A3C

[Diagram, built up over several slides: the 16 agents (Agent 1, Agent 2, …, Agent 16) play and collect experiences (St, Rt, at = π(St), rewards R0 R1 R2 R3 R4), compute updates Δπ(·), and exchange them with the master model for the updated policy π′(·).]

GPU-based A3C

GA3C

[Image: El Capitan big wall, Yosemite Valley]

GA3C (INFERENCE)

[Diagram: Agent 1, Agent 2, …, Agent N submit their states St to a prediction queue; predictor threads collect the queued states {St}, run a single batched forward pass on the master model, and return the resulting actions {at} to the requesting agents.]
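A rough sketch of a predictor thread, using Python's standard queue module; model.predict() and the per-agent wait_queues are illustrative assumptions, not GA3C's exact code:

import queue

# Sketch of a GA3C predictor thread: drain queued states into one
# batch, run a single forward pass on the GPU, and route each
# action back to the agent that requested it.
def predictor(prediction_queue, wait_queues, model, max_batch=128):
    while True:
        agent_id, s_t = prediction_queue.get()   # block for the first state
        ids, states = [agent_id], [s_t]
        while len(states) < max_batch:           # greedily fill the batch
            try:
                agent_id, s_t = prediction_queue.get_nowait()
            except queue.Empty:
                break
            ids.append(agent_id)
            states.append(s_t)
        actions = model.predict(states)          # batched {a_t} = pi({S_t})
        for agent_id, a_t in zip(ids, actions):
            wait_queues[agent_id].put(a_t)       # return a_t to its agent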

GA3C (TRAINING)

[Diagram: the agents submit their experiences {St, Rt} (with the collected rewards R0 R1 R2 … R4) to a training queue; trainer threads batch them and compute updates Δπ(·) to the master model.]
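And the training side, sketched under the same assumptions (model.train() is an illustrative name):

# Sketch of a GA3C trainer thread: gather a batch of experiences
# and apply a single update Delta_pi(.) to the master model.
def trainer(training_queue, model, min_batch=32):
    while True:
        batch = [training_queue.get()]           # block for the first item
        while len(batch) < min_batch:            # enforce a minimum batch size
            batch.append(training_queue.get())   # (used later to reduce policy lag)
        states, actions, rewards = zip(*batch)
        model.train(states, actions, rewards)    # update Delta_pi(.)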

GA3C

[Diagram: the full GA3C architecture. The agents feed both the prediction queue (states St → batched actions {at} via the predictors) and the training queue (experiences {St, Rt} → updates Δπ(·) via the trainers), with a single master model on the GPU.]
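The agents themselves hold no copy of the model; a sketch of one agent process under the same illustrative API:

# Sketch of a GA3C agent: it owns no network; it requests actions
# through the prediction queue and ships experiences to the trainers.
def agent(agent_id, env, prediction_queue, wait_queue, training_queue):
    s_t = env.reset()
    while True:
        prediction_queue.put((agent_id, s_t))    # ask for a_t = pi(S_t)
        a_t = wait_queue.get()                   # wait for the batched answer
        s_next, r_t, done = env.step(a_t)
        training_queue.put((s_t, a_t, r_t))      # experience for the trainers
        s_t = env.reset() if done else s_next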

GPU-based A3C

GA3C

[Image: El Capitan big wall, Yosemite Valley]

Learn how to balance.

GA3C: PREDICTIONS PER SECOND (PPS)

[Diagram: the full GA3C architecture, highlighting the inference path; PPS measures how many actions at the predictors serve per second.]

GA3C: TRAININGS PER SECOND (TPS)

[Diagram: the full GA3C architecture, highlighting the training path; TPS measures how many updates Δπ(·) the trainers apply per second.]

AUTOMATIC SCHEDULING
Balancing the system at run time

[Plots: TPS over training time for ATARI Boxing and ATARI Pong while NP, NT, and NA are adjusted on the fly.]

NP = # predictors, NT = # trainers, NA = # agents, TPS = trainings per second
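The run-time balancing can be pictured as a simple annealing loop that perturbs the thread counts and keeps a change only when TPS improves; everything below (system, resize(), num_trainings) is an illustrative assumption, not GA3C's actual interface:

import random
import time

# Sketch of automatic scheduling: periodically add or remove a
# predictor/trainer and keep the change only if TPS improves.
def auto_schedule(system, period=60.0):
    best_tps = measure_tps(system, period)
    while True:
        kind = random.choice(['predictor', 'trainer'])
        delta = random.choice([-1, +1])
        system.resize(kind, delta)           # e.g. spawn or retire a thread
        tps = measure_tps(system, period)
        if tps >= best_tps:
            best_tps = tps                   # keep the improvement
        else:
            system.resize(kind, -delta)      # revert the change

def measure_tps(system, period):
    start = system.num_trainings             # counter of applied updates
    time.sleep(period)
    return (system.num_trainings - start) / period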

THE ADVANTAGE OF SPEED
More frames = faster convergence

LARGER DNNS

A3C (ATARI) vs. others (robotics):

Timothy P. Lillicrap et al., Continuous control with deep reinforcement learning, International Conference on Learning Representations, 2016.

S. Levine et al., End-to-end training of deep visuomotor policies, Journal of Machine Learning Research, 17:1-40, 2016.

For real-world applications (e.g. robotics, automotive), larger DNNs are needed:

Small DNN (A3C): Conv, 16 8x8 filters, stride 4 → Conv, 32 4x4 filters, stride 2 → FC 256

Large DNN: Conv, 32 8x8 filters, stride 1, 2, 3, or 4 → Conv, 32 4x4 filters, stride 2 → Conv, 64 4x4 filters, stride 2 [?] → FC 256
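For concreteness, the two policy networks as they might look in Keras (a sketch: the layer sizes follow the slide, while the 84x84x4 input, the activations, and the softmax policy head are standard A3C choices assumed here; the value head of the actor-critic is omitted for brevity):

import tensorflow as tf
from tensorflow.keras import layers, Sequential

# Small DNN used by A3C on ATARI (sizes from the slide).
def small_dnn(num_actions, input_shape=(84, 84, 4)):
    return Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 8, strides=4, activation='relu'),
        layers.Conv2D(32, 4, strides=2, activation='relu'),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dense(num_actions, activation='softmax'),  # policy head
    ])

# Larger DNN from the slide; the stride of the first convolution
# is the experimental knob (1, 2, 3 or 4).
def large_dnn(num_actions, first_stride=4, input_shape=(84, 84, 4)):
    return Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 8, strides=first_stride, activation='relu'),
        layers.Conv2D(32, 4, strides=2, activation='relu'),
        layers.Conv2D(64, 4, strides=2, activation='relu'),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dense(num_actions, activation='softmax'),  # policy head
    ])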

GA3C VS. A3C*: PREDICTIONS PER SECOND
(*our TensorFlow implementation on a CPU)

[Bar chart, PPS on a log scale from 1 to 10,000, comparing A3C and GA3C:]

Small DNN: 4x speedup
Large DNN, stride 4: 11x speedup
Large DNN, stride 3: 12x speedup
Large DNN, stride 2: 20x speedup
Large DNN, stride 1: 45x speedup

CPU & GPU UTILIZATION IN GA3C
For larger DNNs

[Bar chart: CPU and GPU utilization (%) for the Small DNN and the Large DNN with stride 4, 3, 2, and 1.]

GA3C POLICY LAG
Asynchronous playing and training

[Diagram: because playing and training are asynchronous, an agent may generate an experience with one copy of the network (DNN A) while the master model has meanwhile been updated (DNN B); the update Δπ(·) is then computed for a policy that differs from the one that generated the data.]

STABILITY AND CONVERGENCE SPEED
Reducing policy lag through a minimum training batch size

GPU-based A3C

GA3C (up to 45x faster)

[Image: El Capitan big wall, Yosemite Valley]

Balancing computational resources, speed, and stability.

RESOURCES

THEORY: M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl).

CODING: GA3C, a GPU implementation of A3C (open source at https://github.com/NVlabs/GA3C); a general architecture to generate and consume training data.

QUESTIONS

Policy lag

Multiple GPUs

Why TensorFlow

Replay memory

ATARI 2600

GitHub: https://github.com/NVlabs/GA3C
ICLR 2017: Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU.

BACKUP SLIDES


POLICY LAG IN GA3C

Potentially large time lag between training data generation and network update


SCORES


Training a larger DNN


BALANCING COMPUTATIONAL RESOURCES
The actors: CPU, PCI-E & GPU

[Plots: trainings per second, training queue size, and prediction queue size over time.]