gpu-based a3c for deep reinforcement learning · m. babaeizadeh†,‡ , i.frosio‡ , s.tyree ‡...

48
M. Babaeizadeh †,‡ , I.Frosio , S.Tyree , J. Clemons , J.Kautz University of Illinois at Urbana-Champaign, USA NVIDIA, USA GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING An ICLR 2017 paper A github project

Upload: others

Post on 25-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree‡ , J. Clemons‡ , J.Kautz‡

†University of Illinois at Urbana-Champaign, USA ‡ NVIDIA, USA

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

An ICLR 2017 paper

A github project

Page 2: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

M. Babaeizadeh†,‡, I.Frosio‡, S.Tyree‡, J. Clemons‡, J.Kautz‡

†University of Illinois at Urbana-Champaign, USA ‡ NVIDIA, USA

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Page 3: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

M. Babaeizadeh†,‡, I.Frosio‡, S.Tyree‡, J. Clemons‡, J.Kautz‡

†University of Illinois at Urbana-Champaign, USA ‡ NVIDIA, USA

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Page 4: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

4

GPU-BASED A3C FORDEEP REINFORCEMENT LEARNING

Page 5: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

5

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Learning to accomplish a task

Image from www.33rdsquare.com

Page 6: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

6

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

St

at = p(St)

~Rt

StRL agent

✓ Environment

✓ Agent

✓ Observable status St

✓ Reward Rt

✓ Action at

✓ Policy at = p(St)

, Rt

Page 7: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

7

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

St, Rt

at = p(St)

~Rt

StDeep RL agent

Page 8: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

8

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

St, Rt

at = p(St)

Dp(∙)

R0 R1 R2 R3 R4

~Rt

St

Page 9: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

9

GPU-BASED A3C FORDEEP REINFORCEMENT LEARNING

Page 10: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

10

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Asynchronous Advantage Actor-Critic (Mnih et al., arXiv:1602.01783v2, 2015)

Dp(∙)

p’(∙)Master

model

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

Agent

1A

gent

2A

gent

16

Page 11: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

11

GPU-BASED A3C FORDEEP REINFORCEMENT LEARNING

Page 12: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

12

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

The GPU

Page 13: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

13

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

LOW OCCUPANCY

LOW OCCUPANCY (33%)

Page 14: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

14

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

HIGH OCCUPANCY (100%)

Batc

h s

ize

Page 15: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

15

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

HIGH OCCUPANCY (100%), LOW UTILIZATION (40%)

time time

Page 16: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

16

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

HIGH OCCUPANCY (100%), HIGH UTILIZATION (100%)

timetime

Page 17: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

17

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

BANDWIDTH LIMITED

timetime

Page 18: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

18

+

Dp(∙)

p’(∙)Master

model

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

Agent

1A

gent

2A

gent

16

Page 19: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

19

REGRESSION, CLASSIFICATION, …

MAPPING DEEP PROBLEMS TO A GPU

REINFORCEMENT LEARNING

Pear, pear, pear, pear, …

Empty, empty, …

Fig, fig, fig, fig, fig, fig,

Strawberry, Strawberry,

Strawberry, …

data

labels

100% utilization / occupancy

status,

reward

action

?

Page 20: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

20

A3C

Master

model

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

Agent

1A

gent

2A

gent

16

Page 21: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

21

A3C

Master

model

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

Agent

1A

gent

2A

gent

16

Page 22: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

22

A3C

Master

model

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

Agent

1A

gent

2A

gent

16

Page 23: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

23

A3C

Master

model

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

Agent

1A

gent

2A

gent

16

Page 24: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

24

A3C

Master

model

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

Agent

1A

gent

2A

gent

16

Dp(∙)

Page 25: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

25

A3C

Dp(∙)

p’(∙)Master

model

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

St, Rt

R0 R1 R2 R3 R4at = p(St)

Agent

1A

gent

2A

gent

16

Page 26: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

26

GPU-based A3C

GA3C

El Capitan big wall, Yosemite Valley

Page 27: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

27

GA3C (INFERENCE)

Master

model

… …

St

predictors

{St}

prediction queue

Agent

1A

gent

2

Agent

N

{at}

at

Page 28: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

28

GA3C (TRAINING)

… …

R0 R1 R2

R4

trainers

{St,Rt}

training queue

Dp(∙)

Agent

1A

gent

2

Agent

N

St,Rt

Master

model

Page 29: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

29

GA3C

Master

model

… …

St

predictors

{St}

prediction queue

Agent

1A

gent

2

Agent

N

{at}

at

… …

R0 R1 R2

R4

trainers

{St,Rt}

training queue

Dp(∙)

Page 30: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

30

GPU-based A3C

GA3C

El Capitan big wall, Yosemite Valley

Page 31: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

31

GPU-based A3C

GA3C

El Capitan big wall, Yosemite Valley

Learn how to balance

Page 32: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

32

GA3C: PREDICTIONS PER SECOND (PPS)

Master

model

… …

St

predictors

{St}

prediction queue

Agent

1A

gent

2

Agent

N

{at}

at

… …

R0 R1 R2

R4

trainers

{St,Rt}

training queue

Dp(∙)

Page 33: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

33

GA3C: TRAININGS PER SECOND (TPS)

Master

model

… …

St

predictors

{St}

prediction queue

Agent

1A

gent

2

Agent

N

{at}

at

… …

R0 R1 R2

R4

trainers

{St,Rt}

training queue

Dp(∙)

Page 34: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

34

AUTOMATIC SCHEDULINGBalancing the system at run time

ATARI Boxing ATARI Pong

NP = # predictors, NT = # trainers, NA = # agents, TPS = training per seconds

Page 35: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

35

THE ADVANTAGE OF SPEEDMore frames = faster convergence

Page 36: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

36

LARGER DNNS

A3C (ATARI) Others (robotics)

Timothy P. Lillicrap et al., Continuouscontrol with deep reinforcement learning,International Conference on LearningRepresentations, 2016.

S. Levine et al., End-to-end training of deepvisuomotor policies, Journal of MachineLearning Research, 17:1-40, 2016.

For real world applications (e.g. robotics, automotive)

Conv 16

8x8

filters,

stride 4

Conv 32

4x4

filters,

stride 2

FC 256

Conv 32

8x8

filters,

stride 1,

2, 3, 4

Conv 32

4x4

filters,

stride 2

FC 256

Conv 64

4x4

filters,

stride 2 ?

Page 37: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

37

GA3C VS. A3C*: PREDICTIONS PER SECONDS* Our Tensor Flow implementation on a CPU

1

10

100

1000

10000

Small DNN Large DNN,stride 4

Large DNN,stride 3

Large DNN,stride 2

Large DNN,stride 1

4x 11x 12x20x

45x

PP

S

A3C

GA3C

Page 38: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

38

CPU & GPU UTILIZATION IN GA3CFor larger DNNs

0

10

20

30

40

50

60

70

80

90

100

Small DNN Large DNN,stride 4

Large DNN,stride 3

Large DNN,stride 2

Large DNN,stride 1

Uti

lizat

ion

(%

)

CPU %

GPU %

Page 39: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

39

Master

model

… …

St

predictors

{St}

prediction queue

Agent

1A

gent

2

Agent

N

{at}

… …

R0 R1 R2

R4

trainers

{St,Rt}

training queue

Dp(∙)

GA3C POLICY LAGAsynchronous playing and training

DNN A

DNN B

Page 40: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

40

STABILITY AND CONVERGENCE SPEEDReducing policy lag through min training batch size

Page 41: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

41

GPU-based A3C

GA3C (45x faster)

El Capitan big wall, Yosemite Valley

Balancing computational

resources, speed and

stability.

Page 42: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

42

THEORY

RESOURCES

M. Babaeizadeh, I. Frosio, S. Tyree, J.Clemons, J. Kautz, ReinforcementLearning through AsynchronousAdvantage Actor-Critic on a GPU,ICLR 2017 (available athttps://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl).

CODING

GA3C, a GPU implementation of A3C (open source at https://github.com/NVlabs/GA3C).

A general architecture to generate and consume training data.

Page 43: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

43

QUESTIONS

QUESTIONS

Policy lag

Multiple GPUs

Why TensorFlow

Replay memory

ATARI 2600

Github: https://github.com/NVlabs/GA3C ICLR 2017: Reinforcement Learning through

Asynchronous Advantage Actor-Critic on a GPU.

Page 44: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

BACKUP SLIDES

Page 45: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

45

POLICY LAG IN GA3C

Potentially large time lag between training data generation and network update

Page 46: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

46

SCORES

Page 47: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

47

Training a larger DNN

Page 48: GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING · M. Babaeizadeh†,‡ , I.Frosio‡ , S.Tyree ‡ , J. Clemons , J.Kautz‡ †University of Illinois at Urbana-Champaign, USA ‡

48

BALANCING COMPUTATIONAL RESOURCESThe actors: CPU, PCI-E & GPU

Trainings per second Training queue size Prediction queue size