
Page 1:

Improving Systems Management Policies Using Hybrid Reinforcement Learning

Gerry Tesauro <[email protected]>
IBM TJ Watson Research Center

Joint work with Rajarshi Das (IBM), Nick Jong (U. Texas), and Mohamed Bennani (George Mason Univ.)

Page 2:

Outline: Main points of the talk

Introduction: Brief Overview of “Autonomic Computing”

Grandiose Motivation: Combining Machine Learning with domain knowledge in Autonomic Computing

Problem Description

Scenario: Online server allocation in Internet Data Center

Data Center Prototype Implementation

Reinforcement Learning Approach

Quick RL Overview

Prior Online RL Approach

New Hybrid RL Approach

Results/Insights into Hybrid RL outperformance

Fresh results on new application: Power Management

Page 3:

Challenges in Systems Management

[Figure: a real IBM Global Services network diagram ("SDC North Physical/Logical WAN Connectivity", IBM's Global IP Network / AT&T, drawn by Gregg Machovec, 9/13/2001): dozens of backbone routers and Token-Ring, FDDI, ATM, and sync links with OSPF areas and costs spanning Poughkeepsie, Endicott, Rochester, Burlington, Fishkill, Southbury, Palisades, Sterling Forest, Armonk, Hawthorne, Yorktown, and Somers sites; alongside event-monitoring charts showing morning and weekly rebooting patterns, outages, event bursts, and excessive DM events across hosts]

Large-scale, heterogeneous distributed systems with highly dynamic, complex multi-component interactions

Large volumes of real-time high-dimensional data, but also lots of missing information and uncertainty

Too much complexity, too few (skilled) administrators

Need for "self-managing" systems → autonomic computing

Page 4:

What is Autonomic Computing?

"Computing systems that manage themselves in accordance with high-level objectives from humans" (Kephart and Chess, "A Vision of Autonomic Computing," IEEE Computer, 2003)

“Self-management” capabilities include

Self-Configuration: Automated configuration of components, systems according to high-level policies; rest of system adjusts seamlessly.

Self-Healing: Automated detection, diagnosis, and repair of localized software/hardware problems.

Self-Optimization: Automatic and continual adaptive tuning of hundreds of parameters (database params, server params,…) affecting performance & efficiency

Self-Protection: Automated defense against malicious attacks or cascading failures; use early warning to anticipate and prevent system-wide failures.

Good application domain for ML: rich opportunities, little previously done

Page 5:

A “Knowledge Bottleneck” in Autonomic Computing

[Figure: the autonomic manager architecture: Monitor, Analyze, Plan, and Execute components arranged around a shared Knowledge source inside an Autonomic Manager, which sits above a Managed Element; the Knowledge component is the bottleneck highlighted here]

Page 6:

Machine Learning to the Rescue

Can avoid knowledge bottleneck: automatically extract knowledge from observations of data

Examples:
- Supervised Learning: Input → Predicted Output (classification, regression)
- Unsupervised Learning: Input → Structure among input variables (clustering, data mining)
- Reinforcement Learning: learns behavioral policies: State → Action

Page 7:

Will ML Without Built-In Knowledge Work?

[Figure: the same autonomic manager loop (Monitor, Analyze, Plan, Execute over a Managed Element), but with the Knowledge component replaced by "Tabula Rasa ML"]

Tabula Rasa = “blank slate” (Latin)

Page 8:

A Hybrid Approach Combining Knowledge + ML

Initial Knowledge → Behavioral Data → ML → Improved Knowledge

Several advantages:
- No direct interface between ML and the Initial Knowledge; we don't engineer knowledge into the ML
- Initial knowledge can be virtually anything: very simple (e.g. a crude heuristic), highly sophisticated (a multi-tier closed queuing network), or even human behavior
- Can do multiple iterations to keep improving

Page 9:

Outline: Main points of the talk

Introduction:

Problem Description

Scenario: Online server allocation in Internet Data Center

Data Center Prototype Implementation

Reinforcement Learning Approach

Results

Insights into Hybrid RL outperformance

Wrapup

Page 10:

Application: Allocating Server Resources in a Data Center

Scenario: Data center serving multiple customers, each running high-volume web apps with independent time-varying workloads

[Figure: three customer application environments (Macy's online shopping, E-Trade online trading, Citibank online banking), each with its own Application Manager, Router, pool of servers, and DB2 back end, and each generating SLA revenue ($$); a Resource Arbiter allocates the data center's servers across them to maximize business value across all customers]

Page 11:

Problem Description

Scenario: Online server allocation in an Internet Data Center

Data Center Prototype Implementation:
- Real servers: Linux cluster (xSeries machines)
- Realistic Web-based workload: Trade3 (online trading emulation), running on top of WebSphere and DB2
- Realistic time-varying demand generation:
  - Open-loop scenario: Poisson HTTP requests; vary the mean arrival rate λ
  - Closed-loop scenario: finite number of customers M with a fixed think-time distribution; M varies with time
  - Use the Squillante-Yao-Zhang time-series model to vary M or λ above
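A hedged sketch of the two demand-generation regimes described above, in Python. The parameter values and the bounded random walk standing in for the Squillante-Yao-Zhang time-series model are illustrative assumptions, not the prototype's actual generator.

```python
import numpy as np

rng = np.random.default_rng(42)

def open_loop_arrivals(mean_rates, interval=1.0):
    """Open loop: Poisson HTTP request counts per interval, with a time-varying mean rate."""
    return rng.poisson(np.asarray(mean_rates) * interval)

def closed_loop_clients(m0=20, steps=100, lo=5, hi=50):
    """Closed loop: the number of active customers M follows a bounded random walk
    (a simple stand-in for the time-series model that modulates M in the prototype)."""
    m = [m0]
    for _ in range(steps - 1):
        m.append(int(np.clip(m[-1] + rng.integers(-3, 4), lo, hi)))
    return m

# Example: a slowly oscillating mean arrival rate for the open-loop scenario.
rates = 50 + 30 * np.sin(np.linspace(0, 4 * np.pi, 100))
print(open_loop_arrivals(rates)[:10])
print(closed_loop_clients()[:10])
```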

Page 12:

Data Center Prototype: Experimental setup

[Figure: experimental setup: 8 xSeries servers shared between two Trade3 application environments (each running on WebSphere 5.1 and DB2, driven by HTTP demand and paid according to an SLA on response time) and a Batch environment (SLA on number of servers); every 5 sec each Application Manager reports Value(#srvrs) to the Resource Arbiter, which reallocates servers to maximize total SLA revenue]

Page 13:

Standard Approach: Queuing Models
- Design an appropriate model of flows and queues (arrival process, routing discipline, service process, etc.) in the system
- Estimate model parameters offline or online
- The model estimates Value(numServers) by estimating (asymptotic) performance changes due to changes in numServers
- Has worked well in many deployed systems

Two main limitations:
- Model design is difficult and knowledge-intensive
- Model assumptions don't exactly match the real system: real systems have complex dynamics, while standard models assume steady-state behavior

Two prospective benefits of a machine learning approach:
- Avoid the knowledge bottleneck
- Decisions can reflect dynamic consequences of actions, e.g. properly handle transients and switching delays
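To make the standard approach concrete, here is a minimal sketch under strong simplifying assumptions: the application's load is split evenly over n servers each treated as an M/M/1 queue, the resulting mean response time is mapped through an illustrative SLA, and that yields the Value(numServers) curve a model-based arbiter would use. The closed form, service rate, and SLA shape are assumptions for illustration, not the models used in the prototype.

```python
def mean_response_time(arrival_rate, num_servers, service_rate=10.0):
    """Crude estimate: split load evenly over servers and treat each as M/M/1."""
    per_server = arrival_rate / num_servers
    if per_server >= service_rate:
        return float("inf")                      # saturated: unbounded response time
    return 1.0 / (service_rate - per_server)

def sla_payment(rt, target=0.2, pay=100.0, penalty_slope=500.0):
    # Illustrative SLA: flat payment below the target response time, linear penalty above.
    return pay if rt <= target else pay - penalty_slope * (rt - target)

def value_of_servers(arrival_rate, max_servers=8):
    """Model-based Value(numServers), as a queuing-model policy would report it."""
    return {n: sla_payment(mean_response_time(arrival_rate, n))
            for n in range(1, max_servers + 1)}

print(value_of_servers(arrival_rate=35.0))
```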

Page 14:

Outline: Main points of the talk

Introduction

Problem Description

Reinforcement Learning Approach Quick RL Overview

Results

Insights into Hybrid RL outperformance

Wrapup

Page 15:

Reinforcement Learning (RL) approach

[Figure: an RL module (algorithm to be chosen, "Alg?") placed inside App 1's manager, connected to the System via State (monitored data streams), Reward (Value(RT)), and Action (# servers)]

Page 16:

Reinforcement Learning: 1-slide Tutorial

A learning agent interacts with the environment:
- Observes the current state s of the environment
- Takes an action a
- Receives an (immediate) scalar reward r

[Figure: agent-environment loop: the Agent observes State, sends an Action to the System, and receives a Reward]

The agent learns a long-range value function V(s,a) estimating cumulative future reward:

  R_t = Σ_{k=0..∞} γ^k r_{t+k},  with discount factor 0 ≤ γ < 1

We use a standard RL algorithm, Sarsa, which learns the state-action value function via

  V(s,a) ← V(s,a) + α [ r + γ V(s',a') - V(s,a) ]

- By design, RL does "trial-and-error" learning without a model of the environment
- Naturally handles long-range dynamic consequences of actions (e.g., transients, switching delays)
- Solid theoretical grounding for MDPs; recent practical success stories
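As a concrete illustration of the Sarsa update above, here is a minimal tabular sketch in Python. The epsilon-greedy exploration, learning rate, and discount factor are illustrative assumptions; this is not the prototype's code.

```python
import random
from collections import defaultdict

class TabularSarsa:
    """Minimal Sarsa(0): V(s,a) <- V(s,a) + alpha*[r + gamma*V(s',a') - V(s,a)]."""

    def __init__(self, actions, alpha=0.2, gamma=0.5, epsilon=0.1):
        self.V = defaultdict(float)      # lookup table indexed by (state, action)
        self.actions = list(actions)     # discrete action set
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        # epsilon-greedy exploration over the discrete action set
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.V[(state, a)])

    def update(self, s, a, r, s_next, a_next):
        # one Sarsa(0) backup using the actually chosen next action a'
        target = r + self.gamma * self.V[(s_next, a_next)]
        self.V[(s, a)] += self.alpha * (target - self.V[(s, a)])
```

In the data-center setting of this talk, `state` would be a discretized demand level, `actions` the feasible server counts, and the reward the SLA payment received over the interval.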

Page 17:

Outline: Main points of the talk

Introduction

Problem Description

Reinforcement Learning Approach Quick RL Overview

Online RL Approach

Results

Insights into Hybrid RL outperformance

Wrapup

Page 18:

Will ML Without Built-In Knowledge Work?

[Figure: repeat of the slide 7 diagram: the autonomic manager loop with "Tabula Rasa ML" in place of the Knowledge component]

Tabula Rasa = “blank slate” (Latin)

Page 19:

Application: Allocating Server Resources in a Data Center

Scenario: Data center serving multiple customers, each running high-volume web apps with independent time-varying workloads

[Figure: same data-center diagram as slide 10]

Page 20:

Assumptions Behind RL Formulation

[Figure: same data-center diagram as slide 10]

- Each application has local state, unaffected by other apps
- Each application has local state transitions and local rewards, depending only on its local state and local resource
- Together this is a collection of separate local MDPs, but the global decision maker wants to maximize the sum of local rewards

Page 21:

Global RL versus Local RL

One approach: make the Resource Arbiter a global Q-learner
- Advantages: the arbiter's problem is a true MDP, and we can rely on a convergence guarantee
- Main disadvantage: the arbiter's state space is huge (the cross product of all local state spaces), a serious curse of dimensionality with many applications

Alternative approach: Local RL
- Each application does local Sarsa(0) based on local state, local provisioning, and local reward, and learns a local value function
- Each application conveys its current V(resource) estimates to the arbiter
- The arbiter then acts to maximize the sum of the current value functions (sketched in code below)
- Local learning should be much easier than global learning, but we no longer have a convergence guarantee
- Related work: Russell & Zimdars, ICML-03 (local rewards only)
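A minimal sketch of the Local RL decomposition just described: each application reports its current value estimate for every candidate server count, and the arbiter searches feasible allocations to maximize the sum. The brute-force enumeration and the toy value functions are illustrative assumptions (a real arbiter could use a smarter search).

```python
from itertools import product

def best_allocation(value_fns, total_servers):
    """value_fns: list of dicts mapping n_servers -> estimated long-range value.
    Returns the allocation (n_1, ..., n_k) with sum(n_i) <= total_servers
    that maximizes the sum of the per-application value estimates."""
    k = len(value_fns)
    best, best_val = None, float("-inf")
    for alloc in product(range(total_servers + 1), repeat=k):
        if sum(alloc) > total_servers:
            continue
        val = sum(vf[n] for vf, n in zip(value_fns, alloc))
        if val > best_val:
            best, best_val = alloc, val
    return best, best_val

# Example: three applications, 8 servers (as in the prototype), toy value estimates.
v1 = {n: 10 * n - 0.5 * n * n for n in range(9)}
v2 = {n: 6 * n for n in range(9)}
v3 = {n: 4 * (n > 0) + n for n in range(9)}
print(best_allocation([v1, v2, v3], total_servers=8))
```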

Page 22:

Online RL in Trade3 Application Manager (AAAI 2005)

[Figure: the Trade3 application environment: demand arrives at the TRADE3 App Mgr, which observes response time against the SLA(RT), computes utility U, runs RL to learn V(λ, n), and reports V(n) to the Resource Arbiter, which assigns servers]

Observed state = current demand only

Arbiter action = # servers provided (n)

Instantaneous reward U = SLA payment

Learns long-range expected value function V(state, action) = V(λ, n) (a two-dimensional lookup table)

Data Center results:

good asymptotic performance, but

poor performance during long training period

method scales poorly with state space size


Page 23:

Amazingly Enough, RL Works! :-)
Results of overnight training (~25k RL updates = 16 hours real time) with a random initial condition

Page 24:

Comparison of Performance: 2 Application Environments

Page 25:

3 Application Environments: Performance

Page 26:

Outline: Main points of the talk

Introduction

Problem Description

Reinforcement Learning Approach

Quick RL Overview

Online RL Approach

Hybrid RL Approach (Tesauro et al., ICAC 2006)

Results

Insights into Hybrid RL outperformance

Wrapup

Page 27:

Hybrid Reinforcement Learning Illustrated

[Figure: the system interacts with an existing model-based policy (MBP) via state, action, and reward; logged traces of that interaction are used to train RL value functions offline]

- Run RL offline on data from the initial policy
- Bellman Policy Improvement Theorem (1957): V(state, action) defines a new policy guaranteed better than the original policy
- Combines the best aspects of both RL and model-based (e.g. queuing) methods
- A very general method that automatically improves any existing systems management policy

In the Data Center prototype:
- Implement the best queuing models within each Trade3 manager
- Log system data in an overnight run (~12-20 hrs)
- Train RL on the log data (~2 CPU hrs) → new value functions
- Replace the queuing models by the RL value functions and rerun the experiment

Page 28:

Two key ingredients of the Trade3 implementation

1. "Delay-Aware" State Representation:
- Include the previous allocation decision as part of the current state: V = V(λt, nt-1, nt)
- Can learn to properly evaluate switching delay (provided that the delay < allocation interval), e.g. it can distinguish V(λ, 2, 3) from V(λ, 3, 3)
- The delay need not be directly observable: RL only observes the delayed reward
- Also handles transient suboptimal performance

2. Nonlinear Function Approximation (Neural Nets):
- Generalizes across states and actions; obviates visiting every state in the space
- Greatly reduces the need for "exploratory" actions
- Much better scaling to larger state spaces: from 2-3 state variables to 20-30, potentially
- But: lose guaranteed optimality
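To make the two ingredients above concrete, here is a hedged sketch of offline, batch Sarsa-style training of a small neural-network value function V(λt, nt-1, nt) on logged data from an initial policy. The network size, learning rate, discount factor, data format, and the fabricated toy log are all illustrative assumptions; the actual ICAC 2006 implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Logged transitions from the initial (queuing-model) policy. Each transition is
# (state, reward, next_state), where state = (lambda_t, n_prev, n_t) is the
# delay-aware state/action pair described above. The log below is a fabricated
# toy trace purely to keep the sketch self-contained.
def toy_log(T=2000, max_servers=5):
    lam = np.clip(50 + np.cumsum(rng.normal(0, 2, T)), 10, 100)   # drifting demand
    n = rng.integers(1, max_servers + 1, size=T + 1)              # logged allocations
    rt = lam / (5.0 * n[:T])                                      # crude response-time proxy
    reward = 1.0 - np.minimum(rt / 20.0, 2.0)                     # crude SLA-payment proxy
    rows = []
    for t in range(1, T - 1):
        s = (lam[t] / 100.0, n[t - 1], n[t])          # scaled demand, previous and current allocation
        s_next = (lam[t + 1] / 100.0, n[t], n[t + 1])
        rows.append((s, reward[t], s_next))
    return rows

# Tiny one-hidden-layer value network V(lambda, n_prev, n), trained with plain SGD.
H = 16
W1 = rng.normal(0, 0.3, (3, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.3, (H, 1)); b2 = np.zeros(1)

def V(x):
    h = np.tanh(x @ W1 + b1)
    return (h @ W2 + b2).ravel(), h

def sgd_step(x, target, lr=1e-2):
    global W1, b1, W2, b2
    v, h = V(x)
    err = (v - target)[:, None]                      # gradient of 0.5*(v - target)^2 w.r.t. v
    gW2 = h.T @ err / len(x); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

log = toy_log()
X  = np.array([s  for s, _, _ in log], dtype=float)
R  = np.array([r  for _, r, _ in log], dtype=float)
Xn = np.array([sn for _, _, sn in log], dtype=float)
gamma = 0.5
for sweep in range(200):
    # Batch Sarsa(0)-style targets: bootstrap from the *logged* next state/action,
    # so no interaction with the live system is needed during training.
    targets = R + gamma * V(Xn)[0]
    sgd_step(X, targets)
```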

Page 29:

Outline: Main points of the talk

Introduction

Problem Description

Reinforcement Learning Approach

Results

Insights into Hybrid RL outperformance

Wrapup

Page 30:

Results: Open Loop, No Switching Delay

[Plot annotations: +2.6% Trade3 RT, +12.7% Batch throughput; -0.4% Trade3 RT, +38.9% Batch throughput; +73% Trade3 RT, +221% Batch throughput]

Page 31:

Results: Closed Loop, No Switching Delay

Page 32:

Results: Effects of Switching Delay

Page 33:

Outline: Main points of the talk

Introduction

Problem Description

Reinforcement Learning Approach

Results

Insights into Hybrid RL outperformance

Wrapup

Page 34:

Insights into Hybrid RL outperformance

1. Less biased estimation errors
- The queuing model predicts indirectly: RT → SLA(RT) → V; the nonlinear SLA induces an overprovisioning bias
- RL estimates utility directly → a less biased estimate of V

2. RL handles transients and switching delays; steady-state queuing models cannot

3. RL learns to avoid thrashing
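A small numeric illustration of point 1, using assumed numbers: when response time fluctuates, applying a nonlinear SLA to the model's mean response-time estimate gives a different (biased) answer than averaging the SLA payments actually observed, which is what RL effectively does.

```python
import numpy as np

rng = np.random.default_rng(1)

def sla_payment(rt, target=1.0):
    # Illustrative nonlinear SLA: full payment below the response-time target,
    # steep linear penalty above it.
    return np.where(rt <= target, 100.0, 100.0 - 400.0 * (rt - target))

# Response times that fluctuate around a mean of 1.0 (assumed distribution).
rt_samples = rng.gamma(shape=4.0, scale=0.25, size=100_000)

model_estimate = float(sla_payment(rt_samples.mean()))   # SLA applied to the mean RT
rl_estimate = float(sla_payment(rt_samples).mean())      # mean of SLA over the actual RTs

print(f"SLA(E[RT]) = {model_estimate:.1f}   E[SLA(RT)] = {rl_estimate:.1f}")
# The two numbers differ substantially; this kind of gap is the estimation bias that
# arises when V is predicted indirectly through a nonlinear SLA, and it is what RL
# avoids by estimating utility (the observed payments) directly.
```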

Page 35:

Policy Hysteresis in the Learned Value Function: stable joint allocations (T1, T2, Batch) at fixed λ2

Page 36:

Hybrid RL learns not to thrash

[Figure: closed-loop demand (number of customers in T1 and T2; allocation delay 4.5 s) over time, together with the server allocations to T1 and T2 chosen by the queuing-model policy and by the Hybrid RL policy]

Page 37:

Hybrid RL does less swapping than QM

Average number of servers reallocated per decision, <n>, by experiment:

  Open loop,   delay = 0 s:    QM 0.578    Hybrid RL 0.464
  Open loop,   delay = 4.5 s:  QM 0.581    Hybrid RL 0.269
  Closed loop, delay = 0 s:    QM 0.654    Hybrid RL 0.486
  Closed loop, delay = 4.5 s:  QM 0.736    Hybrid RL 0.331

Page 38:

Outline: Main points of the talk

Introduction

Problem Description

Reinforcement Learning Approach

Results

Insights into Hybrid RL outperformance

Power Management (Kephart et al., ICAC 2007)

Page 39:

Power and Performance Management

Joint objective: {U(RT) - C(Pwr)}

[Figure: WebSphere XD testbed: Stock Trading, Account Management, and Financial Advice workloads of high, medium, and low importance are classified, prioritized, and routed by a WebSphere On Demand Router across computing nodes; WebSphere XD controllers and the performance manager set placement and load-balancing parameters driven by U(RT), while IBM Director / Power Executive manipulates power controls (CPU speeds) dynamically]

Page 40:

Architecture Overview (ICAC 2007, to appear)

[Figure: architecture diagram; power controls are exercised via IBM Director]

Page 41:

Experiment with hand-tuned policy

[Figure: two panels showing workload intensity, CPU, power, and response time over time: one with no power management (avg power = 107.9 watts) and one with power management using the hand-tuned policy (avg power = 96.6 watts; savings: 11.3 watts = 10.5%)]

Page 42:

Hybrid RL Results

- Learn V = V(s,a); state s uses a single input variable (numClients)
- Both response-time performance and power consumption are comparable to the hand-crafted policy
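Under assumed names, here is a minimal sketch of how a learned two-dimensional value function V(numClients, powerCap) like the one above could be used as the controller: each interval, pick the power cap whose value estimate is highest for the observed load. The toy value function stands in for the learned one; this is an illustration consistent with the slides, not the ICAC 2007 implementation.

```python
from typing import Callable, Sequence

def choose_power_cap(num_clients: int,
                     caps: Sequence[int],
                     value_fn: Callable[[int, int], float]) -> int:
    """Greedy policy over a learned V(load, cap): pick the cap with the highest
    estimated long-range value of U(RT) - C(Pwr) for the observed load."""
    return max(caps, key=lambda cap: value_fn(num_clients, cap))

# Toy stand-in for a learned value function (assumption: higher caps help more
# under heavier load, but always cost some power).
def toy_value(num_clients: int, cap: int) -> float:
    perf_utility = min(num_clients, cap * 10)       # crude U(RT) proxy
    power_cost = 0.3 * cap                          # crude C(Pwr) proxy
    return perf_utility - power_cost

print(choose_power_cap(num_clients=45, caps=[1, 2, 3, 4, 5], value_fn=toy_value))
```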

Page 43:

Hybrid RL results (15 input variables)

Avg power = 98.3 watts (savings = 8.9%); SLA violations = 1.5% vs. 21%

Page 44:

Conclusions

Hybrid RL works quite well for server allocation:
- combines the disparate strengths of RL and queuing models
- exploits the domain knowledge built into the queuing model, but doesn't need access to that knowledge: it only uses the externally observable behavior of the queuing-model policy

Initial promising results in power management:
- suggests that a basic 2-d value function V(load_intensity, resource_knob) may be generally useful and easy to learn

Potential for wide usage of Hybrid RL in systems management:
- managing other resource types: memory, storage, VMs, etc.
- managing control params: OS/DB params, etc.
- simultaneous management of multiple criteria: performance/utilization, performance/availability, etc.

Page 45:

For further info/reading material

Papers:
- "Online Resource Allocation Using Decompositional Reinforcement Learning," G. Tesauro, Proc. of AAAI-05.
- "A Hybrid Reinforcement Learning Approach to Autonomic Computing," G. Tesauro et al., Proc. of ICAC-06.
- "Coordinating Multiple Autonomic Managers to Achieve Specified Power-Performance Tradeoffs," J. Kephart et al., Proc. of ICAC-07.

More info about R&D in Autonomic Computing:
- Our work: www.research.ibm.com/nedar
- AC toolkit (Autonomic Manager ToolSet): AMTS v1.0 available as part of the Emerging Technologies Toolkit v1.1 on IBM alphaWorks: www.alphaworks.com
- IBM: www.research.ibm.com/autonomic
- Intl. Conf. on Autonomic Computing (ICAC-07): www.autonomic-conference.org

Summer internships: email me: [email protected]

Thanks! Any questions?

Page 46:

The End

Page 47:

[Backup slide: repeat of the SDC North network connectivity diagram from slide 3]

Page 48:

Evolution of Computing