
Page 1: Risk, Reward & Reinforcement

Risk, Reward & Reinforcement

John Moody, Department of Computer Science

OGI School of Science & Engineering, Oregon Health & Science University

Machine Learning, Statistics & Discovery, AMS Workshop, Snowbird, Utah, June 25, 2003


Goals of This Talk

• Introduce Reinforcement Learning

• Present Direct Reinforcement
  – Contrast w/ Value Function RL Methods
  – Causal, Non-Markovian, Partially-Observed

• Describe Risk-Averse Reinforcement

• Demonstrate application to
  – A Competitive Game
  – Trading & Asset Allocation

Page 2: Risk, Reward & Reinforcement

Preview: S&P-500 / T-Bill Asset Allocation

[Figure: RRL-Trader System vs. Q-Trader System: equity curves (log scale) for Buy and Hold, RRL-Trader, and Q-Trader, together with the RRL-Trader and Q-Trader positions, 1970–1990.]


What is Reinforcement Learning?

RL Considers:
• A Goal-Directed Agent
• interacting with an Uncertain Environment
• that attempts to maximize Reward / Utility

RL is a Dynamic Learning Paradigm:
• Trial & Error Discovery of Strategy
• Actions result in Reinforcement

Time Plays a Critical Role:
• Rewards depend on sequences of actions
• Rewards may be delayed or received over time

Page 3: Risk, Reward & Reinforcement


Reinforcement vs. Supervised Learning

Sound Bites:
• "Learning from Examples" (SL)
• "Learning by Trial and Error" (RL)

Distinctions:
• Static (SL) vs. Dynamic (RL)
• Feedback: "Instructive" (SL) vs. "Evaluative" (RL)
• SL usually ignores the larger problem (goals, utility)
• RL agents take action, may influence the environment

Characteristics of RL Applications:
• Dynamical model of the world is not known
• Labeled examples expensive or unavailable
• Temporal credit assignment problem


Origins of Reinforcement Learning

• Psychology and Animal Behavior
  – Thorndike's "Law of Effect", Animal Intelligence (1911)
  – Skinner's "Operant Conditioning", The Behavior of Organisms (1938)
  – "Trial and Error" Learning & "Reinforcement" Theories
• Computational Intelligence
  – Turing, "Computing Machinery and Intelligence" (1950)
  – Minsky, "Neural-Analog Reinforcement" (1954)
  – * Farley & Clark's Policy Gradient Learner (1954)
  – Samuel's Checkers Program (1959)
• Operations Research & Control Engineering
  – Bellman, Dynamic Programming (1957)

Page 4: Risk, Reward & Reinforcement

Modern RL

• Value Function Methods
  – Sutton's "Temporal Difference" TD(λ) (1988)
  – Watkins' "Q-Learning" (1989)
  – Tesauro's "TD-Gammon" (1994)
• Actor-Critic Methods
  – Barto, Sutton & Anderson (1983)
  – Werbos' Taxonomy (1992)
  – Konda & Tsitsiklis, NIPS*1999 (2000)
• Direct Reinforcement: Policy Gradient & Policy Search
  – Williams' "REINFORCE" (1988, 1992)
  – Moody et al.: "RRL" and Finance (1996 -- present)
  – Baxter & Bartlett: "Direct Gradient-Based RL" (1999)
  – Ng & Jordan: "Pegasus: Policy Search" (2000)
  – NIPS*2000 Workshop

"Learn the Policy or Learn the Value-Function?"

Is a paradigm shift occurring?

Dynamic Programming

Discrete Time Stochastic Control
Markov Decision Process (MDP): an agent operating with discrete states x, taking actions a, receiving rewards R.

MDP System Model:
Transition probability $P_{xy}(a)$ for actions $a: x \to y$;
Distribution $P_R(x,a)$ of rewards $R(x,a)$.

Value Function and Policy:
$V^\pi(x) = E^\pi\!\left\{ \sum_{t=0}^{\infty} \gamma^t R[x_t, \pi(x_t)] \;\middle|\; x_0 = x \right\}$,
with "policy" $a_t = \pi(x_t)$, $t = 0, \ldots, \infty$.

Goal: find the optimal $V$ and hence $\pi$.

Page 5: Risk, Reward & Reinforcement


Dynamic Programming, cont.

Bellman's Recursion Equation:
$V^\pi(x) = \sum_a \pi(x,a) \sum_y P_{xy}(a) \left\{ E(R(x,a)) + \gamma V^\pi(y) \right\}$

The optimal policy satisfies:
$V^*(x) = \max_a \left\{ E(R(x,a)) + \gamma \sum_y P_{xy}(a) V^*(y) \right\}$

The Optimal Policy is defined implicitly:
$V^*(x) = \max_\pi V^\pi(x)$ and $\pi^*(x) = \arg\max_\pi V^\pi(x)$

Finding $\pi^*$ requires also determining $V^*$, knowing the System Model $P_{xy}(a)$, $P_R(x,a)$, and computing expectations!


Reinforcement Learning: Beyond Dynamic Programming

RL Algorithms offer approximate solutions to:
• Dynamic Programming Problems
• Stochastic Control Problems

RL Algorithms:
• Do not require a model of the system
  – Learn via simulation or live trial & error experience
• Avoid Bellman's "curse of dimensionality"
  – By using "function approximation" to smooth the state space
• Learn on-line:
  – Stochastic optimization
  – "Exploration" vs. "Exploitation"

Page 6: Risk, Reward & Reinforcement


Q-Learning: Adaptive DP

Q-Function: state ⊗ action → value
$V^*(x) = \max_a Q^*(x,a)$
$Q^*(x,a) = E(R(x,a)) + \gamma \sum_y P_{xy}(a) \max_b Q^*(y,b)$

Q-Learning (Watkins 1989): estimates $\hat{Q}^*$ iteratively via simulation, without a system model $P_{xy}(a)$, $P_R(x,a)$:
$\Delta \hat{Q}(x,a) = \eta \left[ R(x,a) + \gamma \max_b \hat{Q}(y,b) - \hat{Q}(x,a) \right]$

The Optimal Policy is defined implicitly!
$\hat{a}^*(x) = \arg\max_b \hat{Q}^*(x,b)$

Problems: representations & robustness
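To make the update rule concrete, here is a minimal tabular sketch of a Watkins-style Q-learning step; the two-state toy problem, function names, and constants are illustrative assumptions, not part of the talk.

```python
import numpy as np

def q_learning_update(Q, x, a, r, y, eta=0.1, gamma=0.95):
    """One tabular Q-learning step:
    Q(x,a) <- Q(x,a) + eta * [ r + gamma * max_b Q(y,b) - Q(x,a) ]."""
    td_error = r + gamma * np.max(Q[y]) - Q[x, a]
    Q[x, a] += eta * td_error
    return Q

def greedy_policy(Q, x):
    """The policy is defined implicitly: a*(x) = argmax_b Q(x,b)."""
    return int(np.argmax(Q[x]))

# Illustrative usage with a hypothetical 2-state, 2-action problem.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, x=0, a=1, r=1.0, y=1)
print(greedy_policy(Q, 0))
```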


Direct Reinforcement

Represent / learn the policy: observation → action, directly, without learning a value function!

Motivation:
• Simpler, more natural problem representations
• Only the local gradient of the value function matters
  – Local estimates of performance are often available
• Solve non-Markovian problems
• Find solutions for problems with only "partial observability"
• Seek a "good" policy, not an "optimal" policy

RRL (Recurrent Reinforcement Learning): a "policy gradient" algorithm

Page 7: Risk, Reward & Reinforcement


Learning via Direct Reinforcement

DR Agent:
• "Partially Observes": information $I_t$, not the full state $S_t$
• "Non-Markovian": Recurrent policy $F_t = F(\theta_t; F_{t-1}, I_t)$
• Takes action, receives reward $R_t(F_t, F_{t-1}; S_t)$
• Causal performance function $U_t(R_t, R_{t-1}, \ldots, R_1)$ (generally path-dependent)
• Learns the policy $F(\theta_t; F_{t-1}, I_t)$ by varying $\theta_t$

GOAL: Maximize performance $U_T$ or marginal performance $D_t \equiv \Delta U_t = U_t - U_{t-1}$


Recurrent Reinforcement Learning (RRL)

Deterministic gradient (batch):
$\dfrac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \dfrac{dU_T}{dR_t} \left\{ \dfrac{dR_t}{dF_t}\dfrac{dF_t}{d\theta} + \dfrac{dR_t}{dF_{t-1}}\dfrac{dF_{t-1}}{d\theta} \right\}$
with recursion:
$\dfrac{dF_t}{d\theta} = \dfrac{\partial F_t}{\partial \theta} + \dfrac{dF_t}{dF_{t-1}}\dfrac{dF_{t-1}}{d\theta}$

Stochastic gradient (on-line):
$\dfrac{dU_t(\theta_t)}{d\theta_t} \approx \dfrac{dU_t}{dR_t} \left\{ \dfrac{dR_t}{dF_t}\dfrac{dF_t}{d\theta_t} + \dfrac{dR_t}{dF_{t-1}}\dfrac{dF_{t-1}}{d\theta_{t-1}} \right\}$
stochastic recursion:
$\dfrac{dF_t}{d\theta_t} \approx \dfrac{\partial F_t}{\partial \theta_t} + \dfrac{dF_t}{dF_{t-1}}\dfrac{dF_{t-1}}{d\theta_{t-1}}$

Stochastic parameter update (on-line):
$\Delta\theta_t = \rho\, \dfrac{dU_t(\theta_t)}{d\theta_t}$

Constant $\rho$: adaptive learning. Declining $\rho$: stochastic approximation.
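As a rough illustration of the on-line algorithm, the sketch below applies the stochastic recursion and parameter update to an assumed single-layer tanh trader (a continuous position, rather than the discrete positions used later in the talk), taking the immediate trading reward R_t = F_{t-1} r_t - delta |F_t - F_{t-1}| as the marginal performance. The trader form, input lags, and constants are assumptions for illustration only, not the exact RRL systems reported here.

```python
import numpy as np

def rrl_online(prices, rho=0.01, delta=0.002, n_lags=5):
    """On-line RRL sketch: F_t = tanh(w.x_t + u*F_{t-1} + b), with the
    stochastic recursion dF_t/dtheta ~ dF_t/dtheta + (dF_t/dF_{t-1}) dF_{t-1}/dtheta
    and the update Delta theta_t = rho * dU_t/dtheta_t (here U_t = R_t)."""
    r = np.diff(prices)                        # simple returns r_t = z_t - z_{t-1}
    theta = np.zeros(n_lags + 2)               # weights: [w (n_lags), u (recurrent), b (bias)]
    dF_dtheta_prev = np.zeros_like(theta)
    F_prev, total_profit = 0.0, 0.0
    for t in range(n_lags, len(r)):
        x = np.concatenate([r[t - n_lags:t], [F_prev, 1.0]])   # lagged returns, recurrence, bias
        F = np.tanh(theta @ x)
        # recursive policy gradient: (1-F^2)*(x + u * dF_{t-1}/dtheta)
        dF_dtheta = (1 - F**2) * (x + theta[-2] * dF_dtheta_prev)
        s = np.sign(F - F_prev)
        R = F_prev * r[t] - delta * abs(F - F_prev)            # trading reward with costs
        dR_dF, dR_dFprev = -delta * s, r[t] + delta * s        # derivatives of R_t
        grad = dR_dF * dF_dtheta + dR_dFprev * dF_dtheta_prev  # dU_t/dtheta with dU_t/dR_t = 1
        theta = theta + rho * grad
        dF_dtheta_prev, F_prev = dF_dtheta, F
        total_profit += R
    return theta, total_profit

# Illustrative usage on a random-walk price series.
rng = np.random.default_rng(0)
prices = 1.0 + np.cumsum(rng.normal(0.0, 0.01, 2000))
print(rrl_online(prices))
```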

Page 8: Risk, Reward & Reinforcement


RL Algorithms Compared

Q-Learning
• Learn the Q-Function; Value = Q(Action)
• Q: state ⊗ action → value
• Action: $F = \arg\max_b N(x, b, \theta)$

Properties
• Bellman's Equation: A-Causal
• MDP Assumption
• Complex representations
• Curse of Dimensionality
• Computations expensive
• Policies often unstable

Direct Reinforcement
• Learn the Policy F; use local performance estimates
• F: observations → action
• Action: $F = N(x, \theta)$

Properties
• Causal: Forward in Time
• Recurrent, Partially Observable
• Enables simpler representations
• Reduces the Curse of Dimensionality
• More efficient in practice
• Yields more robust policies


The Oracle Problem
(How to Turn an Easy Problem into a Harder Problem)

Problem Description
• Binary actions: $A_t \in \{-1, +1\}$
• Rewards for {incorrect, correct}: $R_t \in \{-1, +1\}$
• Vector of inputs: $X_t$
  – Oracle input: $X_1$ (tells the agent the correct action $A_t$)
  – Noise inputs: $X_2, X_3, \ldots, X_N$ (boolean random variables)

Complexity of the DR Agent
• Perceptron Policy Learner: $A_t = \mathrm{sign}(W \cdot X_t)$
• Optimal policy with a single threshold: $A_t = \mathrm{sign}(X_{1,t})$

Complexity of the Q Agent
• Optimal policy is an XOR function: $Q(A_t, X_{1,t})$
• Representation requires at least two thresholds.
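The representational gap can be seen by enumerating the action values as a function of the oracle input, ignoring any constant offset from future rewards; a minimal sketch using the ±1 encoding above.

```python
import itertools

# DR policy: a single threshold suffices, A = sign(X1).
# Q agent: the action value is an XOR-like function of (A, X1),
# which a single linear threshold cannot represent.
for A, X1 in itertools.product([-1, +1], repeat=2):
    Q = +1 if A == X1 else -1   # immediate reward of action A when the oracle input is X1
    print(f"A={A:+d}, X1={X1:+d} -> Q={Q:+d}")
```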

Page 9: Risk, Reward & Reinforcement


The Oracle Simulation

• Measure how many trials are required to learn the representation
• Convergence criteria:
  – RRL: Correct policy for all possible input vectors
  – Q-Learner: MSE < 0.01 for all possible input vectors
  – Maximum of 30,000 trials per run
• Repeated runs for multiple learning rates
  – Choose the learning rate with the quickest convergence on average
• 50 random initializations; N = 1, 2, 3, 4, 5, 10 inputs

Results (RRL vs. Q-Learner):
– Min # trials: 1 vs. 1150
– Max # trials: 39 vs. 29350
– # Runs Non-Converged: 0 vs. 15


The Oracle: Simulation Results

[Figure: Oracle Simulation, RRL vs. Q-Learner: log(# Trials) to convergence for RRL-3, Q-3, RRL-5, Q-5, RRL-10, and Q-10.]

Page 10: Risk, Reward & Reinforcement


RoShamBo (RPS)

Rules: Rock beats Scissors. Paper beats Rock. Scissors beats Paper.

"The rules are simple, but the game itself is as complex as the mind of your opponent."
(www.worldrps.com)

Character of human throws:
• Rock: commonly perceived as the most aggressive throw

• Paper: considered the most subtle throw

• Scissors: often perceived as clever or crafty

Competitions: World Championship, Computer Olympiad


RPS Player Representation

Softmax representation for probability(action):
$F(m_a) = \dfrac{\exp[f(m_a)]}{\sum_b \exp[f(m_b)]}$, for moves $a = 1, \ldots, 3$

Inputs & Weights:
• Opponent's two previous moves
• Player's two previous moves (recurrent)
(weight parameters $W$, $B_i$ and $V$, $A_i$)

Learning Algorithm:
Stochastic Direct Reinforcement (SDR), a generalization of RRL
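A minimal sketch of the softmax action probabilities; the one-hot encoding of the previous two opponent and player moves and the weight shapes are assumptions for illustration, not the exact SDR player used in the competition.

```python
import numpy as np

def rps_policy(theta, opp_prev, self_prev):
    """Softmax over the 3 throws (R=0, P=1, S=2):
    p(a) = exp[f(a)] / sum_b exp[f(b)], with f linear in one-hot encodings
    of the opponent's and the player's (recurrent) previous two moves."""
    x = np.zeros(12)
    for i, m in enumerate(list(opp_prev) + list(self_prev)):   # 4 past moves
        x[3 * i + m] = 1.0                                     # one-hot encode each move
    f = theta @ x                                              # theta has shape (3, 12)
    p = np.exp(f - f.max())
    return p / p.sum()

# Illustrative usage: random weights, opponent played (R, P), we played (S, S).
rng = np.random.default_rng(1)
theta = rng.normal(0, 0.1, (3, 12))
probs = rps_policy(theta, opp_prev=(0, 1), self_prev=(2, 2))
action = rng.choice(3, p=probs)   # sample a throw from the stochastic policy
print(probs, action)
```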

Page 11: Risk, Reward & Reinforcement


Simulated Competition

'Human Champions' Dataset: the "Great Eight" Gambits are the eight most widely used by recent champions.
1. Avalanche (RRR)
2. Bureaucrat (PPP)
3. Crescendo (PSR)
4. Dénouement (RSP)
5. Fistfull o' Dollars (RPP)
6. Paper Dolls (PSS)
7. Scissor Sandwich (PSP)
8. Toolbox (SSS)

"Great Eight Plus Four" yields equal probability of R, P, S.

Training a 'Dynamic Opponent' using SDR: learn by playing against "Great Eight Plus Four".

Learn to beat the opponent using Recurrent SDR: learn by playing against the Dynamic Opponent.

Opponent tries to predict Player’s move, and vice versa


Three Model Players: Learning Curves

[Figure: The fraction of Wins, Draws, and Losses with SDR: learning curves over 2×10^4 plays for three players.]
– No recurrence: Wins 0.357, Draws 0.383, Losses 0.260
– Only recurrence: Wins 0.422, Draws 0.305, Losses 0.273
– All weights: Wins 0.463, Draws 0.298, Losses 0.239

Page 12: Risk, Reward & Reinforcement


[Wins – Losses] for Three Model Players

[Figure: Summary statistics over 10 simulations of (Wins – Losses) for the No Recurrence, Only Recurrence, and All Weights players.]


Three Players: {Losses, Draws, Wins}

[Figure: Summary statistics over 10 simulations of the fractions of Losses, Draws, and Wins with SDR, for the No Recurrence, Only Recurrence, and All Weights players.]

Page 13: Risk, Reward & Reinforcement


Risk Averse Reinforcement

Sources of Uncertainty in RL:
• Stochastic Rewards
• Stochastic Environment
• Stochastic Policies

The standard RL framework takes expectations of the above!
• "Optimal Policies" are only "optimal" in expectation

Partial Observability: another source of uncertainty

Risk: distributions of trajectories, rewards, outcomes
• Probability of Poor Performance
• Risk of Ruin


Asset Management as a Microcosm for Reinforcement Learning Research

• Simulations
  – Bang-Bang Control Problem: Simple Buy / Sell Decisions ("Long" / "Short" Positions)
  – Transaction Costs → Recurrent, Non-Markovian Representation
• Uncertain Environment:
  – Details of markets / economy are Unobservable
  – Prices / fundamentals / economic data are Very Noisy
  – Markets react quickly to Unpredictable News / Events
• Challenging Problem:
  – Efficient Markets Theory: You Can't Beat the Market.
  – Competitive Game: Trading opportunities will be discovered, exploited and eliminated by others.
  – Prediction Accuracy Limited: "1/2 + ε"

Page 14: Risk, Reward & Reinforcement


Trading based on Forecasts

[Diagram: Input Series → Forecasting System (parameters θ, trained by Supervised Learning against a Target Series to minimize Error(θ)) → Forecasts → information Bottleneck → Trading Rules (parameters θ′, with Transaction Costs) → Trades / Portfolio Weights → Profits / Losses U(θ, θ′).]

Four limitations:
• Two sets of parameters
• Forecast error is not Utility
• Forecaster ignores transaction costs
• Information bottleneck


Learning to Trade via Direct Reinforcement

[Diagram: Input Series and Target Series → Trading System (parameters θ, trained by Reinforcement Learning) → Trades / Portfolio Weights → (with Transaction Costs) → Profits / Losses U(θ), fed back to the learner through a Delay.]

Four advantages:
• One set of parameters
• A single utility function
• U includes transaction costs
• Direct mapping from inputs to actions

Page 15: Risk, Reward & Reinforcement


Structure of Traders

• Single Asset
  – Price series: $z_t$
  – Return series: simple returns $r_t = z_t - z_{t-1}$, or rates of return $r_t = \dfrac{z_t}{z_{t-1}} - 1$
• Traders
  – Discrete position size: $F_t \in \{-1, 0, 1\}$
  – Recurrent policy: $F_t = F(\theta_t; F_{t-1}, I_t)$
• Information Set: $I_t = \{ z_t, z_{t-1}, z_{t-2}, \ldots;\ y_t, y_{t-1}, y_{t-2}, \ldots \}$
  – The full system state is not known


Returns, Profit & Wealth for Traders

• Simple Trading Returns and Profit:
  $R_t = \mu \left( F_{t-1} r_t - \delta\, |F_t - F_{t-1}| \right)$, $\quad P_T = \sum_{t=1}^{T} R_t$
• Compounded Trading Returns and Wealth:
  $R_t = (1 + F_{t-1} r_t)(1 - \delta\, |F_t - F_{t-1}|) - 1$, $\quad W_T = W_0 \prod_{t=1}^{T} (1 + R_t)$
• Transaction Costs: represented by $\delta$.
• Risk-Free Rate: suppressed for simplicity ($r^f_t = 0$).
• Note: Market impact: $\delta$ = function(trade size)
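The return definitions translate directly into code; a minimal sketch with hypothetical positions, returns, and cost parameters.

```python
import numpy as np

def trading_performance(F, r, delta=0.005, mu=1.0, W0=1.0):
    """Simple returns R_t = mu*(F_{t-1}*r_t - delta*|F_t - F_{t-1}|) and profit P_T,
    plus compounded returns and wealth W_T = W0 * prod(1 + R_t)."""
    F_prev = np.concatenate([[0.0], F[:-1]])          # F_{t-1}, starting from a flat position
    trades = np.abs(F - F_prev)                       # |F_t - F_{t-1}|
    R_simple = mu * (F_prev * r - delta * trades)
    P_T = R_simple.sum()
    R_comp = (1.0 + F_prev * r) * (1.0 - delta * trades) - 1.0
    W_T = W0 * np.prod(1.0 + R_comp)
    return P_T, W_T

# Illustrative usage with hypothetical positions in {-1, 0, 1} and returns r_t.
r = np.array([0.01, -0.02, 0.015, 0.005])
F = np.array([1, 1, -1, 0])
print(trading_performance(F, r))
```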

Page 16: Risk, Reward & Reinforcement


Balancing Reward with Risk: Financial Performance Measures

Performance Functions:
• Path independent: $U_t = U(W_t)$ (Standard Utility Functions)
• Path dependent: $U_t = U(W_t, W_{t-1}, \ldots, W_1, W_0)$
• In general: $U_t = U(R_t, R_{t-1}, \ldots; W_0)$

Performance Ratios:
• Sharpe Ratio: $\dfrac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}$
• Downside Deviation Ratio: $\dfrac{\mathrm{Average}(R_t)}{\mathrm{Downside\ Deviation}(R_t)}$
• Sterling Ratio: $\dfrac{\mathrm{Average}(R_t)}{\mathrm{Draw\text{-}Down}(R_t)}$

For Learning:
• Per-Period Returns: $R_t$
• Marginal Performance: $D_t \equiv \Delta U_t = U_t - U_{t-1}$


Maximizing the Sharpe Ratio

Sharpe Ratio:
$S_T = \dfrac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}$

Exponential Moving Average Sharpe Ratio:
$S_t(\eta) = \dfrac{A_t}{K_\eta \left( B_t - A_t^2 \right)^{1/2}}$
with time scale $\eta^{-1}$ and
$A_t = A_{t-1} + \eta (R_t - A_{t-1})$, $\quad B_t = B_{t-1} + \eta (R_t^2 - B_{t-1})$, $\quad K_\eta = \left( \dfrac{1 - \eta/2}{1 - \eta} \right)^{1/2}$

Motivation: the EMA Sharpe ratio
• emphasizes recent patterns;
• is causal & can be updated incrementally.
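A small sketch of the incremental EMA Sharpe ratio above; the zero initialization of A and B and the sample returns are illustrative assumptions.

```python
import numpy as np

def ema_sharpe(returns, eta=0.01):
    """Exponential moving average Sharpe ratio:
    S_t = A_t / (K_eta * sqrt(B_t - A_t^2)),
    A_t = A_{t-1} + eta*(R_t - A_{t-1}),  B_t = B_{t-1} + eta*(R_t^2 - B_{t-1})."""
    K = np.sqrt((1.0 - eta / 2.0) / (1.0 - eta))
    A = B = 0.0
    S = []
    for R in returns:
        A += eta * (R - A)
        B += eta * (R * R - B)
        var = B - A * A
        S.append(A / (K * np.sqrt(var)) if var > 0 else 0.0)
    return np.array(S)

# Illustrative usage on random per-period returns.
print(ema_sharpe(np.random.default_rng(2).normal(0.001, 0.01, 10))[-1])
```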

Page 17: Risk, Reward & Reinforcement


Differential Sharpe Ratio for Adaptive Optimization

Expand $S_t(\eta)$ to first order in $\eta$:
$S_t(\eta) \approx S_{t-1} + \eta \left. \dfrac{dS_t}{d\eta} \right|_{\eta=0} + O(\eta^2)$

Define the Differential Sharpe Ratio as:
$D_t \equiv \dfrac{dS_t}{d\eta} = \dfrac{B_{t-1}\, \Delta A_t - \tfrac{1}{2} A_{t-1}\, \Delta B_t}{\left( B_{t-1} - A_{t-1}^2 \right)^{3/2}}$
where $\Delta A_t = R_t - A_{t-1}$ and $\Delta B_t = R_t^2 - B_{t-1}$.

Motivation for the DSR:
• isolates the contribution of $R_t$ to $S_t(\eta)$ ("marginal utility");
• provides interpretability;
• adapts to changing market conditions;
• facilitates efficient on-line learning (stochastic optimization).
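The differential Sharpe ratio follows from the same EMA statistics; a minimal sketch that returns D_t together with the updated (A_t, B_t), assuming warm-started statistics. D_t can then play the role of the marginal utility in the on-line update.

```python
def differential_sharpe(R, A_prev, B_prev, eta=0.01):
    """D_t = (B_{t-1}*dA_t - 0.5*A_{t-1}*dB_t) / (B_{t-1} - A_{t-1}^2)^(3/2),
    with dA_t = R_t - A_{t-1}, dB_t = R_t^2 - B_{t-1}.
    Returns D_t and the updated EMA statistics (A_t, B_t)."""
    dA = R - A_prev
    dB = R * R - B_prev
    denom = (B_prev - A_prev ** 2) ** 1.5
    D = (B_prev * dA - 0.5 * A_prev * dB) / denom if denom > 0 else 0.0
    return D, A_prev + eta * dA, B_prev + eta * dB

# Illustrative usage: warm-started statistics, one new per-period return.
D, A, B = differential_sharpe(R=0.004, A_prev=0.001, B_prev=0.0001, eta=0.01)
print(D, A, B)
```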


Long / Short Trader Simulation

• Learns from scratch and on-line
• Moving average Sharpe Ratio with η = 0.01

[Figure: Long/short trader simulation: panels show the Price series, the trading Signal, cumulative Profit (%), and the moving-average Sharpe Ratio over 10,000 time steps.]

Page 18: Risk, Reward & Reinforcement


Trader Simulation (summary stats)

Effects of transaction costs on performance: 100 runs; costs = 0.2, 0.5, and 1%.

[Figure: Box plots versus transaction cost (0.2%, 0.5%, 1%) of Trading Frequency (%), Cumulative Sum of Profits (%), and Sharpe Ratio.]


Asset Allocation Example: S&P-500 Index and 3-Month T-Bill

[Figure: The S&P 500 Index with dividends reinvested (log scale), and annualized yields (%) of the Treasury Bill and the S&P 500 dividend yield, 1970–1990.]

Page 19: Risk, Reward & Reinforcement


Maximizing the Differential Sharpe Ratio: S&P-500 / T-Bill Asset Allocation

[Figure: RRL-Trader System vs. Q-Trader System: equity curves (log scale) for Buy and Hold, RRL-Trader, and Q-Trader, together with the RRL-Trader and Q-Trader positions, 1970–1990.]


Gaining Economic Insights by Opening Up the "Black Box"

Which of the 85 economic / financial input series for the S&P-500 / T-Bill trader are most important?

Relative sensitivity of input i:
$S_i = \dfrac{dF / dx_i}{\max_j \left( dF / dx_j \right)}$

Each year, average the sensitivity for each input.

Note: Sensitivity Analysis is straightforward for Direct RL, but not for Q-Learning.
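A sketch of the relative-sensitivity computation, approximating dF/dx_i by central finite differences; the placeholder trader function and its inputs are assumptions for illustration.

```python
import numpy as np

def relative_sensitivities(F, x, eps=1e-4):
    """S_i = (dF/dx_i) / max_j (dF/dx_j), with derivatives estimated by
    central finite differences around the current input vector x;
    normalized by the largest magnitude (cf. 'normalized absolute sensitivity')."""
    grads = np.zeros(len(x))
    for i in range(len(x)):
        x_hi, x_lo = x.copy(), x.copy()
        x_hi[i] += eps
        x_lo[i] -= eps
        grads[i] = (F(x_hi) - F(x_lo)) / (2 * eps)
    return grads / np.max(np.abs(grads))

# Illustrative usage with a hypothetical tanh trader over 3 inputs.
w = np.array([0.5, -1.2, 0.1])
trader = lambda x: np.tanh(w @ x)
print(relative_sensitivities(trader, np.array([0.2, -0.1, 0.4])))
```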

Page 20: Risk, Reward & Reinforcement


S&P-500: Three Most Important Variables

85 series: learned relationships are nonstationary over time.

[Figure: Sensitivity Analysis, averaged over the RRL-Trader Committee: normalized absolute sensitivity vs. date (1970–1995) for the Yield Curve Slope, the 6-Month Difference in AAA Bond Yield, and the 6-Month Difference in T-Bill Yield.]


Minimizing Downside Risk

Downside Deviation:
$DD_T = \left( \dfrac{1}{T} \sum_{t=1}^{T} \min\{R_t, 0\}^2 \right)^{1/2}$

Lower Partial Moment:
$LPM_T(n) = \dfrac{1}{T} \sum_{t=1}^{T} \max\{-R_t, 0\}^n$

N-th Degree Downside Deviation:
$DD_T(n) = \left[ LPM_T(n) \right]^{1/n}$

Downside Deviation Ratio:
$DDR_T = \dfrac{\mathrm{Average}(R_t)}{\mathrm{Downside\ Deviation}(R_t)}$
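A minimal sketch of the downside-risk measures above; the sample returns are arbitrary.

```python
import numpy as np

def downside_stats(R, n=2):
    """DD_T = sqrt(mean(min(R_t,0)^2)); LPM_T(n) = mean(max(-R_t,0)^n);
    DD_T(n) = LPM_T(n)^(1/n); DDR_T = mean(R_t) / DD_T."""
    R = np.asarray(R, dtype=float)
    dd = np.sqrt(np.mean(np.minimum(R, 0.0) ** 2))
    lpm = np.mean(np.maximum(-R, 0.0) ** n)
    dd_n = lpm ** (1.0 / n)
    ddr = np.mean(R) / dd if dd > 0 else np.inf
    return dd, lpm, dd_n, ddr

# Illustrative usage with hypothetical per-period returns.
print(downside_stats([0.01, -0.02, 0.015, -0.005, 0.03]))
```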

Page 21: Risk, Reward & Reinforcement


Artificial Price Series

[Figure: Artificial price series over 10,000 time steps.]


Performance Results (cont'd)

[Figure: Draw-down over time for the DDR Trader vs. the SR Trader, and moving-average deviations (Downside vs. Standard) over time.]

Page 22: Risk, Reward & Reinforcement


Draw-Down Comparison

[Figure: Log histograms of maximum draw-downs for the DDR and SR traders.]


Position Comparison

[Figure: Percent of time spent in each position (Short, Neutral, Long) by the SR and DDR traders, on data with negatively skewed returns.]

Page 23: Risk, Reward & Reinforcement


British Pound: Return_A = 15%, SR_A = 2.3, DDR_A = 3.3

[Figure: British Pound trading results: panels show the Price series, the trading Signal, Equity, and the moving-average Sharpe Ratio over roughly 8,000 time steps.]

Comments on the British Pound Results

The Simulations are Suggestive, not Conclusive:
• BP performance was better than for the Deutschmark or Yen
• Price data are Reuters "indicative" quotes, not transactions
• Market microstructure effects could influence profitability.
• We spent little time designing the trader.
• From a real-world standpoint, the work is preliminary.

Efficacy of RRL for FX confirmed by Carl Gold (Caltech):
• More extensive simulations w/ Olsen 30-minute FX quotes
• IEEE CIFER*2003 Proceedings

Looking Ahead:
• Further analysis of microstructure / transaction costs is needed. FX broker transaction prices would help.
• A true test requires live trading.

Page 24: Risk, Reward & Reinforcement


Closing Remarks

• Direct Reinforcement
  Advantages over Value Function RL:
  – Natural representations
  – Efficiency & robustness
  – Causal algorithms and nonstationary problems
• Recurrence
  – Naturally occurs in real-world problems
  – Must abandon the MDP framework
• Risk-Averse Reinforcement
  – Conventional RL considers only expected rewards
  – Risk: the distribution of rewards matters
  – Robust policies for the real world must be low risk!

• Interesting research opportunities for Statisticians, Computer Scientists

URL: www.cse.ogi.edu/~moody


Some References:

URL: www.cse.ogi.edu/~moody

Papers:

John Moody and Matthew Saffell, 'Learning to Trade via Direct Reinforcement', Special Issue on Financial Engineering, IEEE Transactions on Neural Networks, 12(4):875-889, July 2001.

John Moody and Matthew Saffell, 'Minimizing Downside Risk via Stochastic Dynamic Programming', in Computational Finance 1999, Y. S. Abu-Mostafa, B. LeBaron, A. W. Lo, and A. S. Weigend, editors, MIT Press, Cambridge, MA, pp. 403-416, 2000.

John Moody and Matthew Saffell, 'Reinforcement Learning for Trading', in Advances in Neural Information Processing Systems, S. Solla, M. Kearns and D. Cohn, eds., v. 11, pp. 917-923, MIT Press, 1999.

Moody, J., Wu, L., Liao, Y. & Saffell, M., 'Performance Functions and Reinforcement Learning for Trading Systems and Portfolios', Journal of Forecasting, v. 17, pp. 441-470, 1998.