
Page 1: Risk, Reward & Reinforcement

Risk, Reward & Reinforcement

John Moody, Department of Computer Science

OGI School of Science & Engineering, Oregon Health & Science University

Machine Learning, Statistics & Discovery, AMS Workshop, Snowbird, Utah, June 25, 2003


Goals of This Talk

• Introduce Reinforcement Learning

• Present Direct Reinforcement
  – Contrast w/ Value Function RL Methods
  – Causal, Non-Markovian, Partially-Observed

• Describe Risk-Averse Reinforcement

• Demonstrate application to
  – A Competitive Game
  – Trading & Asset Allocation

Page 2: Risk, Reward & Reinforcement

Preview: S&P-500 / T-Bill Asset Allocation

[Figure: RRL-Trader System vs. Q-Trader System: equity curves (log scale) for Buy and Hold, RRL-Trader, and Q-Trader, together with the RRL-Trader and Q-Trader positions, 1970–1990.]


What is Reinforcement Learning?

RL Considers:
• A Goal-Directed Agent
• interacting with an Uncertain Environment
• that attempts to maximize Reward / Utility

RL is a Dynamic Learning Paradigm:
• Trial & Error Discovery of Strategy
• Actions result in Reinforcement

Time Plays a Critical Role:
• Rewards depend on sequences of actions
• Rewards may be delayed or received over time

Page 3: Risk, Reward & Reinforcement


Reinforcement vs. Supervised Learning

Sound Bites:
• "Learning from Examples" (SL)
• "Learning by Trial and Error" (RL)

Distinctions:
• Static (SL) vs. Dynamic (RL)
• Feedback: "Instructive" (SL) vs. "Evaluative" (RL)
• SL usually ignores the larger problem (goals, utility)
• RL agents take action, may influence the environment

Characteristics of RL Applications:
• Dynamical model of the world is not known
• Labeled examples expensive or unavailable
• Temporal credit assignment problem


Origins of Reinforcement Learning

• Psychology and Animal Behavior
  – Thorndike's "Law of Effect", Animal Intelligence (1911)
  – Skinner's "Operant Conditioning", The Behavior of Organisms (1938)
  – "Trial and Error" Learning & "Reinforcement" Theories
• Computational Intelligence
  – Turing, "Computing Machinery and Intelligence" (1950)
  – Minsky, "Neural-Analog Reinforcement" (1954)
  – * Farley & Clark's Policy Gradient Learner (1954)
  – Samuel's Checkers Program (1959)
• Operations Research & Control Engineering
  – Bellman, Dynamic Programming (1957)

Page 4: Risk, Reward & Reinforcement

Modern RL

• Value Function Methods
  – Sutton's "Temporal Difference" TD(λ) (1988)
  – Watkins' "Q-Learning" (1989)
  – Tesauro's "TD-Gammon" (1994)
• Actor-Critic Methods
  – Barto, Sutton & Anderson (1983)
  – Werbos' Taxonomy (1992)
  – Konda & Tsitsiklis, NIPS*1999 (2000)
• Direct Reinforcement: Policy Gradient & Policy Search
  – Williams' "REINFORCE" (1988, 1992)
  – Moody et al.: "RRL" and Finance (1996 -- present)
  – Baxter & Bartlett: "Direct Gradient-Based RL" (1999)
  – Ng & Jordan: "Pegasus: Policy Search" (2000)
  – NIPS*2000 Workshop

"Learn the Policy or Learn the Value-Function?"

Is a paradigm shift occurring?

Dynamic Programming

Discrete Time Stochastic Control
Markov Decision Process (MDP): an agent operating with discrete states x, taking actions a, receiving rewards R.

MDP System Model:
Transition probability $P_{xy}(a)$ for actions $a: x \to y$;
Distribution $P_R(x,a)$ of rewards $R(x,a)$.

Value Function and Policy:
$V^\pi(x) = E^\pi\!\left\{ \sum_{t=0}^{\infty} \gamma^t R[x_t, \pi(x_t)] \;\middle|\; x_0 = x \right\}$,
with "policy" $a_t = \pi(x_t)$, $t = 0, \ldots, \infty$.

Goal: find the optimal $V$ and hence $\pi$.

Page 5: Risk, Reward & Reinforcement


Dynamic Programming, cont.

Bellman's Recursion Equation:
$V^\pi(x) = \sum_a \pi(x,a) \sum_y P_{xy}(a) \left\{ E(R(x,a)) + \gamma V^\pi(y) \right\}$

The optimal policy satisfies:
$V^*(x) = \max_a \left\{ E(R(x,a)) + \gamma \sum_y P_{xy}(a) V^*(y) \right\}$

The Optimal Policy is defined implicitly:
$V^*(x) = \max_\pi V^\pi(x)$ and $\pi^*(x) = \arg\max_\pi V^\pi(x)$

Finding $\pi^*$ requires also determining $V^*$, knowing the System Model $P_{xy}(a)$, $P_R(x,a)$, and computing expectations!


Reinforcement Learning: Beyond Dynamic Programming

RL Algorithms offer approximate solutions to:
• Dynamic Programming Problems
• Stochastic Control Problems

RL Algorithms:
• Do not require a model of the system
  – Learn via simulation or live trial & error experience
• Avoid Bellman's "curse of dimensionality"
  – By using "function approximation" to smooth the state space
• Learn on-line:
  – Stochastic optimization
  – "Exploration" vs. "Exploitation"

Page 6: Risk, Reward & Reinforcement


Q-Learning: Adaptive DP

Q-Function: state ⊗ action → value
$V^*(x) = \max_a Q^*(x,a)$
$Q^*(x,a) = E(R(x,a)) + \gamma \sum_y P_{xy}(a) \max_b Q^*(y,b)$

Q-Learning (Watkins 1989): estimates $\hat{Q}^*$ iteratively via simulation, without a system model $P_{xy}(a)$, $P_R(x,a)$:
$\Delta \hat{Q}(x,a) = \eta \left[ R(x,a) + \gamma \max_b \hat{Q}(y,b) - \hat{Q}(x,a) \right]$

The Optimal Policy is defined implicitly!
$\hat{a}^*(x) = \arg\max_b \hat{Q}^*(x,b)$

Problems: representations & robustness
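To make the update rule concrete, here is a minimal tabular sketch of a Watkins-style Q-learning step; the two-state toy problem, function names, and constants are illustrative assumptions, not part of the talk.

```python
import numpy as np

def q_learning_update(Q, x, a, r, y, eta=0.1, gamma=0.95):
    """One tabular Q-learning step:
    Q(x,a) <- Q(x,a) + eta * [ r + gamma * max_b Q(y,b) - Q(x,a) ]."""
    td_error = r + gamma * np.max(Q[y]) - Q[x, a]
    Q[x, a] += eta * td_error
    return Q

def greedy_policy(Q, x):
    """The policy is defined implicitly: a*(x) = argmax_b Q(x,b)."""
    return int(np.argmax(Q[x]))

# Illustrative usage with a hypothetical 2-state, 2-action problem.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, x=0, a=1, r=1.0, y=1)
print(greedy_policy(Q, 0))
```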


Direct Reinforcement

Represent / learn the policy: observation → action, directly, without learning a value function!

Motivation:
• Simpler, more natural problem representations
• Only the local gradient of the value function matters
  – Local estimates of performance are often available
• Solve non-Markovian problems
• Find solutions for problems with only "partial observability"
• Seek a "good" policy, not an "optimal" policy

RRL (Recurrent Reinforcement Learning): a "policy gradient" algorithm

Page 7: Risk, Reward & Reinforcement


Learning via Direct Reinforcement

DR Agent:
• "Partially Observes": information $I_t$, not the full state $S_t$
• "Non-Markovian": Recurrent policy $F_t = F(\theta_t; F_{t-1}, I_t)$
• Takes action, receives reward $R_t(F_t, F_{t-1}; S_t)$
• Causal performance function $U_t(R_t, R_{t-1}, \ldots, R_1)$ (generally path-dependent)
• Learns the policy $F(\theta_t; F_{t-1}, I_t)$ by varying $\theta_t$

GOAL: Maximize performance $U_T$ or marginal performance $D_t \equiv \Delta U_t = U_t - U_{t-1}$


Recurrent Reinforcement Learning (RRL)

Deterministic gradient (batch):
$\dfrac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \dfrac{dU_T}{dR_t} \left\{ \dfrac{dR_t}{dF_t}\dfrac{dF_t}{d\theta} + \dfrac{dR_t}{dF_{t-1}}\dfrac{dF_{t-1}}{d\theta} \right\}$
with recursion:
$\dfrac{dF_t}{d\theta} = \dfrac{\partial F_t}{\partial \theta} + \dfrac{dF_t}{dF_{t-1}}\dfrac{dF_{t-1}}{d\theta}$

Stochastic gradient (on-line):
$\dfrac{dU_t(\theta_t)}{d\theta_t} \approx \dfrac{dU_t}{dR_t} \left\{ \dfrac{dR_t}{dF_t}\dfrac{dF_t}{d\theta_t} + \dfrac{dR_t}{dF_{t-1}}\dfrac{dF_{t-1}}{d\theta_{t-1}} \right\}$
stochastic recursion:
$\dfrac{dF_t}{d\theta_t} \approx \dfrac{\partial F_t}{\partial \theta_t} + \dfrac{dF_t}{dF_{t-1}}\dfrac{dF_{t-1}}{d\theta_{t-1}}$

Stochastic parameter update (on-line):
$\Delta\theta_t = \rho\, \dfrac{dU_t(\theta_t)}{d\theta_t}$

Constant $\rho$: adaptive learning. Declining $\rho$: stochastic approximation.
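As a rough illustration of the on-line algorithm, the sketch below applies the stochastic recursion and parameter update to an assumed single-layer tanh trader (a continuous position, rather than the discrete positions used later in the talk), taking the immediate trading reward R_t = F_{t-1} r_t - delta |F_t - F_{t-1}| as the marginal performance. The trader form, input lags, and constants are assumptions for illustration only, not the exact RRL systems reported here.

```python
import numpy as np

def rrl_online(prices, rho=0.01, delta=0.002, n_lags=5):
    """On-line RRL sketch: F_t = tanh(w.x_t + u*F_{t-1} + b), with the
    stochastic recursion dF_t/dtheta ~ dF_t/dtheta + (dF_t/dF_{t-1}) dF_{t-1}/dtheta
    and the update Delta theta_t = rho * dU_t/dtheta_t (here U_t = R_t)."""
    r = np.diff(prices)                        # simple returns r_t = z_t - z_{t-1}
    theta = np.zeros(n_lags + 2)               # weights: [w (n_lags), u (recurrent), b (bias)]
    dF_dtheta_prev = np.zeros_like(theta)
    F_prev, total_profit = 0.0, 0.0
    for t in range(n_lags, len(r)):
        x = np.concatenate([r[t - n_lags:t], [F_prev, 1.0]])   # lagged returns, recurrence, bias
        F = np.tanh(theta @ x)
        # recursive policy gradient: (1-F^2)*(x + u * dF_{t-1}/dtheta)
        dF_dtheta = (1 - F**2) * (x + theta[-2] * dF_dtheta_prev)
        s = np.sign(F - F_prev)
        R = F_prev * r[t] - delta * abs(F - F_prev)            # trading reward with costs
        dR_dF, dR_dFprev = -delta * s, r[t] + delta * s        # derivatives of R_t
        grad = dR_dF * dF_dtheta + dR_dFprev * dF_dtheta_prev  # dU_t/dtheta with dU_t/dR_t = 1
        theta = theta + rho * grad
        dF_dtheta_prev, F_prev = dF_dtheta, F
        total_profit += R
    return theta, total_profit

# Illustrative usage on a random-walk price series.
rng = np.random.default_rng(0)
prices = 1.0 + np.cumsum(rng.normal(0.0, 0.01, 2000))
print(rrl_online(prices))
```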

Page 8: Risk, Reward & Reinforcement


RL Algorithms Compared

Q-Learning
• Learn the Q-Function; Value = Q(Action)
• Q: state ⊗ action → value
• Action: $F = \arg\max_b N(x, b, \theta)$

Properties
• Bellman's Equation: A-Causal
• MDP Assumption
• Complex representations
• Curse of Dimensionality
• Computations expensive
• Policies often unstable

Direct Reinforcement
• Learn the Policy F; use local performance estimates
• F: observations → action
• Action: $F = N(x, \theta)$

Properties
• Causal: Forward in Time
• Recurrent, Partially Observable
• Enables simpler representations
• Reduces the Curse of Dimensionality
• More efficient in practice
• Yields more robust policies


The Oracle Problem
(How to Turn an Easy Problem into a Harder Problem)

Problem Description
• Binary actions: $A_t \in \{-1, +1\}$
• Rewards for {incorrect, correct}: $R_t \in \{-1, +1\}$
• Vector of inputs: $X_t$
  – Oracle input: $X_1$ (tells the agent the correct action $A_t$)
  – Noise inputs: $X_2, X_3, \ldots, X_N$ (boolean random variables)

Complexity of the DR Agent
• Perceptron Policy Learner: $A_t = \mathrm{sign}(W \cdot X_t)$
• Optimal policy with a single threshold: $A_t = \mathrm{sign}(X_{1,t})$

Complexity of the Q Agent
• Optimal policy is an XOR function: $Q(A_t, X_{1,t})$
• Representation requires at least two thresholds.
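The representational gap can be seen by enumerating the action values as a function of the oracle input, ignoring any constant offset from future rewards; a minimal sketch using the ±1 encoding above.

```python
import itertools

# DR policy: a single threshold suffices, A = sign(X1).
# Q agent: the action value is an XOR-like function of (A, X1),
# which a single linear threshold cannot represent.
for A, X1 in itertools.product([-1, +1], repeat=2):
    Q = +1 if A == X1 else -1   # immediate reward of action A when the oracle input is X1
    print(f"A={A:+d}, X1={X1:+d} -> Q={Q:+d}")
```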

Page 9: Risk, Reward & Reinforcement


The Oracle Simulation

• Measure how many trials are required to learn the representation
• Convergence criteria:
  – RRL: Correct policy for all possible input vectors
  – Q-Learner: MSE < 0.01 for all possible input vectors
  – Maximum of 30,000 trials per run
• Repeated runs for multiple learning rates
  – Choose the learning rate with the quickest convergence on average
• 50 random initializations; N = 1, 2, 3, 4, 5, 10 inputs

Results (RRL vs. Q-Learner):
– Min # trials: 1 vs. 1150
– Max # trials: 39 vs. 29350
– # Runs Non-Converged: 0 vs. 15


The Oracle: Simulation Results

[Figure: Oracle Simulation, RRL vs. Q-Learner: log(# Trials) to convergence for RRL-3, Q-3, RRL-5, Q-5, RRL-10, and Q-10.]

Page 10: Risk, Reward & Reinforcement


RoShamBo (RPS)

Rules: Rock beats Scissors. Paper beats Rock. Scissors beats Paper.

"The rules are simple, but the game itself is as complex as the mind of your opponent."
(www.worldrps.com)

Character of human throws:
• Rock: commonly perceived as the most aggressive throw

• Paper: considered the most subtle throw

• Scissors: often perceived as clever or crafty

Competitions: World Championship, Computer Olympiad


RPS Player Representation

Softmax representation for probability(action):
$F(m_a) = \dfrac{\exp[f(m_a)]}{\sum_b \exp[f(m_b)]}$, for moves $a = 1, \ldots, 3$

Inputs & Weights:
• Opponent's two previous moves
• Player's two previous moves (recurrent)
(weight parameters $W$, $B_i$ and $V$, $A_i$)

Learning Algorithm:
Stochastic Direct Reinforcement (SDR), a generalization of RRL
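A minimal sketch of the softmax action probabilities; the one-hot encoding of the previous two opponent and player moves and the weight shapes are assumptions for illustration, not the exact SDR player used in the competition.

```python
import numpy as np

def rps_policy(theta, opp_prev, self_prev):
    """Softmax over the 3 throws (R=0, P=1, S=2):
    p(a) = exp[f(a)] / sum_b exp[f(b)], with f linear in one-hot encodings
    of the opponent's and the player's (recurrent) previous two moves."""
    x = np.zeros(12)
    for i, m in enumerate(list(opp_prev) + list(self_prev)):   # 4 past moves
        x[3 * i + m] = 1.0                                     # one-hot encode each move
    f = theta @ x                                              # theta has shape (3, 12)
    p = np.exp(f - f.max())
    return p / p.sum()

# Illustrative usage: random weights, opponent played (R, P), we played (S, S).
rng = np.random.default_rng(1)
theta = rng.normal(0, 0.1, (3, 12))
probs = rps_policy(theta, opp_prev=(0, 1), self_prev=(2, 2))
action = rng.choice(3, p=probs)   # sample a throw from the stochastic policy
print(probs, action)
```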

Page 11: Risk, Reward & Reinforcement


Simulated Competition

'Human Champions' Dataset: the "Great Eight" Gambits are the eight most widely used by recent champions.
1. Avalanche (RRR)
2. Bureaucrat (PPP)
3. Crescendo (PSR)
4. Dénouement (RSP)
5. Fistfull o' Dollars (RPP)
6. Paper Dolls (PSS)
7. Scissor Sandwich (PSP)
8. Toolbox (SSS)

"Great Eight Plus Four" yields equal probability of R, P, S.

Training a 'Dynamic Opponent' using SDR: learn by playing against "Great Eight Plus Four".

Learn to beat the opponent using Recurrent SDR: learn by playing against the Dynamic Opponent.

Opponent tries to predict Player’s move, and vice versa


Three Model Players: Learning Curves

[Figure: The fraction of Wins, Draws, and Losses with SDR: learning curves over 2×10^4 plays for three players.]
– No recurrence: Wins 0.357, Draws 0.383, Losses 0.260
– Only recurrence: Wins 0.422, Draws 0.305, Losses 0.273
– All weights: Wins 0.463, Draws 0.298, Losses 0.239

Page 12: Risk, Reward & Reinforcement


[Wins – Losses] for Three Model Players

[Figure: Summary statistics over 10 simulations of (Wins – Losses) for the No Recurrence, Only Recurrence, and All Weights players.]


Three Players: {Losses, Draws, Wins}

[Figure: Summary statistics over 10 simulations of the fractions of Losses, Draws, and Wins with SDR, for the No Recurrence, Only Recurrence, and All Weights players.]

Page 13: Risk, Reward & Reinforcement


Risk Averse Reinforcement

Sources of Uncertainty in RL:
• Stochastic Rewards
• Stochastic Environment
• Stochastic Policies

The standard RL framework takes expectations of the above!
• "Optimal Policies" are only "optimal" in expectation

Partial Observability: another source of uncertainty

Risk: distributions of trajectories, rewards, outcomes
• Probability of Poor Performance
• Risk of Ruin


Asset Management as a Microcosm for Reinforcement Learning Research

• Simulations
  – Bang-Bang Control Problem: Simple Buy / Sell Decisions ("Long" / "Short" Positions)
  – Transaction Costs → Recurrent, Non-Markovian Representation
• Uncertain Environment:
  – Details of markets / economy are Unobservable
  – Prices / fundamentals / economic data are Very Noisy
  – Markets react quickly to Unpredictable News / Events
• Challenging Problem:
  – Efficient Markets Theory: You Can't Beat the Market.
  – Competitive Game: Trading opportunities will be discovered, exploited and eliminated by others.
  – Prediction Accuracy Limited: "1/2 + ε"

Page 14: Risk, Reward & Reinforcement


Trading based on Forecasts

[Diagram: Input Series → Forecasting System (parameters θ, trained by Supervised Learning against a Target Series to minimize Error(θ)) → Forecasts → information Bottleneck → Trading Rules (parameters θ′, with Transaction Costs) → Trades / Portfolio Weights → Profits / Losses U(θ, θ′).]

Four limitations:
• Two sets of parameters
• Forecast error is not Utility
• Forecaster ignores transaction costs
• Information bottleneck


Learning to Trade via Direct Reinforcement

[Diagram: Input Series and Target Series → Trading System (parameters θ, trained by Reinforcement Learning) → Trades / Portfolio Weights → (with Transaction Costs) → Profits / Losses U(θ), fed back to the learner through a Delay.]

Four advantages:
• One set of parameters
• A single utility function
• U includes transaction costs
• Direct mapping from inputs to actions

Page 15: Risk, Reward & Reinforcement


Structure of Traders

• Single Asset
  – Price series: $z_t$
  – Return series: simple returns $r_t = z_t - z_{t-1}$, or rates of return $r_t = \dfrac{z_t}{z_{t-1}} - 1$
• Traders
  – Discrete position size: $F_t \in \{-1, 0, 1\}$
  – Recurrent policy: $F_t = F(\theta_t; F_{t-1}, I_t)$
• Information Set: $I_t = \{ z_t, z_{t-1}, z_{t-2}, \ldots;\ y_t, y_{t-1}, y_{t-2}, \ldots \}$
  – The full system state is not known


Returns, Profit & Wealth for Traders

• Simple Trading Returns and Profit:
  $R_t = \mu \left( F_{t-1} r_t - \delta\, |F_t - F_{t-1}| \right)$, $\quad P_T = \sum_{t=1}^{T} R_t$
• Compounded Trading Returns and Wealth:
  $R_t = (1 + F_{t-1} r_t)(1 - \delta\, |F_t - F_{t-1}|) - 1$, $\quad W_T = W_0 \prod_{t=1}^{T} (1 + R_t)$
• Transaction Costs: represented by $\delta$.
• Risk-Free Rate: suppressed for simplicity ($r^f_t = 0$).
• Note: Market impact: $\delta$ = function(trade size)
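The return definitions translate directly into code; a minimal sketch with hypothetical positions, returns, and cost parameters.

```python
import numpy as np

def trading_performance(F, r, delta=0.005, mu=1.0, W0=1.0):
    """Simple returns R_t = mu*(F_{t-1}*r_t - delta*|F_t - F_{t-1}|) and profit P_T,
    plus compounded returns and wealth W_T = W0 * prod(1 + R_t)."""
    F_prev = np.concatenate([[0.0], F[:-1]])          # F_{t-1}, starting from a flat position
    trades = np.abs(F - F_prev)                       # |F_t - F_{t-1}|
    R_simple = mu * (F_prev * r - delta * trades)
    P_T = R_simple.sum()
    R_comp = (1.0 + F_prev * r) * (1.0 - delta * trades) - 1.0
    W_T = W0 * np.prod(1.0 + R_comp)
    return P_T, W_T

# Illustrative usage with hypothetical positions in {-1, 0, 1} and returns r_t.
r = np.array([0.01, -0.02, 0.015, 0.005])
F = np.array([1, 1, -1, 0])
print(trading_performance(F, r))
```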

Page 16: Risk, Reward & Reinforcement


Balancing Reward with Risk: Financial Performance Measures

Performance Functions:
• Path independent: $U_t = U(W_t)$ (Standard Utility Functions)
• Path dependent: $U_t = U(W_t, W_{t-1}, \ldots, W_1, W_0)$
• In general: $U_t = U(R_t, R_{t-1}, \ldots; W_0)$

Performance Ratios:
• Sharpe Ratio: $\dfrac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}$
• Downside Deviation Ratio: $\dfrac{\mathrm{Average}(R_t)}{\mathrm{Downside\ Deviation}(R_t)}$
• Sterling Ratio: $\dfrac{\mathrm{Average}(R_t)}{\mathrm{Draw\text{-}Down}(R_t)}$

For Learning:
• Per-Period Returns: $R_t$
• Marginal Performance: $D_t \equiv \Delta U_t = U_t - U_{t-1}$


Maximizing the Sharpe Ratio

Sharpe Ratio:
$S_T = \dfrac{\mathrm{Average}(R_t)}{\mathrm{Standard\ Deviation}(R_t)}$

Exponential Moving Average Sharpe Ratio:
$S_t(\eta) = \dfrac{A_t}{K_\eta \left( B_t - A_t^2 \right)^{1/2}}$
with time scale $\eta^{-1}$ and
$A_t = A_{t-1} + \eta (R_t - A_{t-1})$, $\quad B_t = B_{t-1} + \eta (R_t^2 - B_{t-1})$, $\quad K_\eta = \left( \dfrac{1 - \eta/2}{1 - \eta} \right)^{1/2}$

Motivation: the EMA Sharpe ratio
• emphasizes recent patterns;
• is causal & can be updated incrementally.
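A small sketch of the incremental EMA Sharpe ratio above; the zero initialization of A and B and the sample returns are illustrative assumptions.

```python
import numpy as np

def ema_sharpe(returns, eta=0.01):
    """Exponential moving average Sharpe ratio:
    S_t = A_t / (K_eta * sqrt(B_t - A_t^2)),
    A_t = A_{t-1} + eta*(R_t - A_{t-1}),  B_t = B_{t-1} + eta*(R_t^2 - B_{t-1})."""
    K = np.sqrt((1.0 - eta / 2.0) / (1.0 - eta))
    A = B = 0.0
    S = []
    for R in returns:
        A += eta * (R - A)
        B += eta * (R * R - B)
        var = B - A * A
        S.append(A / (K * np.sqrt(var)) if var > 0 else 0.0)
    return np.array(S)

# Illustrative usage on random per-period returns.
print(ema_sharpe(np.random.default_rng(2).normal(0.001, 0.01, 10))[-1])
```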

Page 17: Risk, Reward & Reinforcement


Differential Sharpe Ratio for Adaptive Optimization

Expand $S_t(\eta)$ to first order in $\eta$:
$S_t(\eta) \approx S_{t-1} + \eta \left. \dfrac{dS_t}{d\eta} \right|_{\eta=0} + O(\eta^2)$

Define the Differential Sharpe Ratio as:
$D_t \equiv \dfrac{dS_t}{d\eta} = \dfrac{B_{t-1}\, \Delta A_t - \tfrac{1}{2} A_{t-1}\, \Delta B_t}{\left( B_{t-1} - A_{t-1}^2 \right)^{3/2}}$
where $\Delta A_t = R_t - A_{t-1}$ and $\Delta B_t = R_t^2 - B_{t-1}$.

Motivation for the DSR:
• isolates the contribution of $R_t$ to $S_t(\eta)$ ("marginal utility");
• provides interpretability;
• adapts to changing market conditions;
• facilitates efficient on-line learning (stochastic optimization).
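The differential Sharpe ratio follows from the same EMA statistics; a minimal sketch that returns D_t together with the updated (A_t, B_t), assuming warm-started statistics. D_t can then play the role of the marginal utility in the on-line update.

```python
def differential_sharpe(R, A_prev, B_prev, eta=0.01):
    """D_t = (B_{t-1}*dA_t - 0.5*A_{t-1}*dB_t) / (B_{t-1} - A_{t-1}^2)^(3/2),
    with dA_t = R_t - A_{t-1}, dB_t = R_t^2 - B_{t-1}.
    Returns D_t and the updated EMA statistics (A_t, B_t)."""
    dA = R - A_prev
    dB = R * R - B_prev
    denom = (B_prev - A_prev ** 2) ** 1.5
    D = (B_prev * dA - 0.5 * A_prev * dB) / denom if denom > 0 else 0.0
    return D, A_prev + eta * dA, B_prev + eta * dB

# Illustrative usage: warm-started statistics, one new per-period return.
D, A, B = differential_sharpe(R=0.004, A_prev=0.001, B_prev=0.0001, eta=0.01)
print(D, A, B)
```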


Long / Short Trader Simulation

• Learns from scratch and on-line
• Moving average Sharpe Ratio with η = 0.01

[Figure: Long/short trader simulation: panels show the Price series, the trading Signal, cumulative Profit (%), and the moving-average Sharpe Ratio over 10,000 time steps.]

Page 18: Risk, Reward & Reinforcement


Trader Simulation (summary stats)

Effects of transaction costs on performance: 100 runs; costs = 0.2, 0.5, and 1%.

[Figure: Box plots versus transaction cost (0.2%, 0.5%, 1%) of Trading Frequency (%), Cumulative Sum of Profits (%), and Sharpe Ratio.]


Asset Allocation Example: S&P-500 Index and 3-Month T-Bill

[Figure: The S&P 500 Index with dividends reinvested (log scale), and annualized yields (%) of the Treasury Bill and the S&P 500 dividend yield, 1970–1990.]

Page 19: Risk, Reward & Reinforcement


Maximizing the Differential Sharpe Ratio: S&P-500 / T-Bill Asset Allocation

[Figure: RRL-Trader System vs. Q-Trader System: equity curves (log scale) for Buy and Hold, RRL-Trader, and Q-Trader, together with the RRL-Trader and Q-Trader positions, 1970–1990.]


Gaining Economic Insights by Opening Up the "Black Box"

Which of the 85 economic / financial input series for the S&P-500 / T-Bill trader are most important?

Relative sensitivity of input i:
$S_i = \dfrac{dF / dx_i}{\max_j \left( dF / dx_j \right)}$

Each year, average the sensitivity for each input.

Note: Sensitivity Analysis is straightforward for Direct RL, but not for Q-Learning.
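A sketch of the relative-sensitivity computation, approximating dF/dx_i by central finite differences; the placeholder trader function and its inputs are assumptions for illustration.

```python
import numpy as np

def relative_sensitivities(F, x, eps=1e-4):
    """S_i = (dF/dx_i) / max_j (dF/dx_j), with derivatives estimated by
    central finite differences around the current input vector x;
    normalized by the largest magnitude (cf. 'normalized absolute sensitivity')."""
    grads = np.zeros(len(x))
    for i in range(len(x)):
        x_hi, x_lo = x.copy(), x.copy()
        x_hi[i] += eps
        x_lo[i] -= eps
        grads[i] = (F(x_hi) - F(x_lo)) / (2 * eps)
    return grads / np.max(np.abs(grads))

# Illustrative usage with a hypothetical tanh trader over 3 inputs.
w = np.array([0.5, -1.2, 0.1])
trader = lambda x: np.tanh(w @ x)
print(relative_sensitivities(trader, np.array([0.2, -0.1, 0.4])))
```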

Page 20: Risk, Reward & Reinforcement


S&P-500: Three Most Important Variables

85 series: learned relationships are nonstationary over time.

[Figure: Sensitivity Analysis, averaged over the RRL-Trader Committee: normalized absolute sensitivity vs. date (1970–1995) for the Yield Curve Slope, the 6-Month Difference in AAA Bond Yield, and the 6-Month Difference in T-Bill Yield.]


Minimizing Downside Risk

Downside Deviation:
$DD_T = \left( \dfrac{1}{T} \sum_{t=1}^{T} \min\{R_t, 0\}^2 \right)^{1/2}$

Lower Partial Moment:
$LPM_T(n) = \dfrac{1}{T} \sum_{t=1}^{T} \max\{-R_t, 0\}^n$

N-th Degree Downside Deviation:
$DD_T(n) = \left[ LPM_T(n) \right]^{1/n}$

Downside Deviation Ratio:
$DDR_T = \dfrac{\mathrm{Average}(R_t)}{\mathrm{Downside\ Deviation}(R_t)}$
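A minimal sketch of the downside-risk measures above; the sample returns are arbitrary.

```python
import numpy as np

def downside_stats(R, n=2):
    """DD_T = sqrt(mean(min(R_t,0)^2)); LPM_T(n) = mean(max(-R_t,0)^n);
    DD_T(n) = LPM_T(n)^(1/n); DDR_T = mean(R_t) / DD_T."""
    R = np.asarray(R, dtype=float)
    dd = np.sqrt(np.mean(np.minimum(R, 0.0) ** 2))
    lpm = np.mean(np.maximum(-R, 0.0) ** n)
    dd_n = lpm ** (1.0 / n)
    ddr = np.mean(R) / dd if dd > 0 else np.inf
    return dd, lpm, dd_n, ddr

# Illustrative usage with hypothetical per-period returns.
print(downside_stats([0.01, -0.02, 0.015, -0.005, 0.03]))
```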

Page 21: Risk, Reward & Reinforcement


Artificial Price Series

[Figure: Artificial price series over 10,000 time steps.]


Performance Results (cont'd)

[Figure: Draw-down over time for the DDR Trader vs. the SR Trader, and moving-average deviations (Downside vs. Standard) over time.]

Page 22: Risk, Reward & Reinforcement


Draw-Down Comparison

[Figure: Log histograms of maximum draw-downs for the DDR and SR traders.]


Position Comparison

[Figure: Percent of time spent in each position (Short, Neutral, Long) by the SR and DDR traders, on data with negatively skewed returns.]

Page 23: Risk, Reward & Reinforcement


British Pound: Return_A = 15%, SR_A = 2.3, DDR_A = 3.3

[Figure: British Pound trading results: panels show the Price series, the trading Signal, Equity, and the moving-average Sharpe Ratio over roughly 8,000 time steps.]

Comments on the British Pound Results

The Simulations are Suggestive, not Conclusive:
• BP performance was better than for the Deutschmark or Yen
• Price data are Reuters "indicative" quotes, not transactions
• Market microstructure effects could influence profitability.
• We spent little time designing the trader.
• From a real-world standpoint, the work is preliminary.

Efficacy of RRL for FX confirmed by Carl Gold (Caltech):
• More extensive simulations w/ Olsen 30-minute FX quotes
• IEEE CIFER*2003 Proceedings

Looking Ahead:
• Further analysis of microstructure / transaction costs is needed. FX broker transaction prices would help.
• A true test requires live trading.

Page 24: Risk, Reward & Reinforcement


Closing Remarks

• Direct Reinforcement
  Advantages over Value Function RL:
  – Natural representations
  – Efficiency & robustness
  – Causal algorithms and nonstationary problems
• Recurrence
  – Naturally occurs in real-world problems
  – Must abandon the MDP framework
• Risk-Averse Reinforcement
  – Conventional RL considers only expected rewards
  – Risk: the distribution of rewards matters
  – Robust policies for the real world must be low risk!

• Interesting research opportunities for Statisticians, Computer Scientists

URL: www.cse.ogi.edu/~moody


Some References:

URL: www.cse.ogi.edu/~moody

Papers:

John Moody and Matthew Saffell, 'Learning to Trade via Direct Reinforcement', Special Issue on Financial Engineering, IEEE Transactions on Neural Networks, 12(4):875-889, July 2001.

John Moody and Matthew Saffell, 'Minimizing Downside Risk via Stochastic Dynamic Programming', in Computational Finance 1999, Y. S. Abu-Mostafa, B. LeBaron, A. W. Lo, and A. S. Weigend, editors, MIT Press, Cambridge, MA, pp. 403-416, 2000.

John Moody and Matthew Saffell, 'Reinforcement Learning for Trading', in Advances in Neural Information Processing Systems, S. Solla, M. Kearns and D. Cohn, eds., v. 11, pp. 917-923, MIT Press, 1999.

Moody, J., Wu, L., Liao, Y. & Saffell, M., 'Performance Functions and Reinforcement Learning for Trading Systems and Portfolios', Journal of Forecasting, v. 17, pp. 441-470, 1998.