
Page 1: Kernelized Value Function Approximation for Reinforcement Learning

Kernelized Value Function Approximation for Reinforcement Learning

Gavin Taylor and Ronald Parr, Duke University

Page 2: Kernelized Value Function Approximation for Reinforcement Learning

Overview

Diagram: starting from a kernel k(s,s′) and training data (s,r,s′), (s,r,s′), (s,r,s′), …, there are two routes to a kernelized value function V = Kw:

• Solve for the value function directly using KLSTD or GPTD
• Solve for a kernelized model (as in GPRL), then solve for the value function given that model

Page 3: Kernelized Value Function Approximation for Reinforcement Learning

Overview - Contributions

• Construct a new model-based VFA
• Equate this novel VFA with previous work
• Decompose the Bellman Error into reward and transition error
• Use the decomposition to understand VFA

Bellman Error: BE(Kw) = ΔR + γΔK′w, where ΔR is the reward error and γΔK′w is the transition error


Page 4: Kernelized Value Function Approximation for Reinforcement Learning

Outline

• Motivation, Notation, and Framework
• Kernel-Based Models
  – Model-Based VFA
  – Interpretation of Previous Work
• Bellman Error Decomposition
• Experimental Results and Conclusions

Page 5: Kernelized Value Function Approximation for Reinforcement Learning

Markov Reward Processes

• M = (S, P, R, γ)
• Value: V(s) = expected, discounted sum of rewards from state s
• Bellman equation: V(s_i) = r_i + γ Σ_{s_j ∈ S} P(s_j | s_i) V(s_j)
• Bellman equation in matrix notation: V = R + γPV
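To make the matrix form concrete, here is a minimal NumPy sketch (my own toy two-state example, not from the slides) that solves V = R + γPV by rearranging it to (I − γP)V = R:

```python
import numpy as np

def mrp_value(P, R, gamma):
    """Solve V = R + gamma * P @ V exactly for a small Markov reward process.

    P     : (n, n) transition matrix, rows sum to 1
    R     : (n,)   expected reward per state
    gamma : discount factor in [0, 1)
    """
    n = P.shape[0]
    # Rearranging the Bellman equation gives (I - gamma * P) V = R.
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Tiny two-state example.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([0.0, 1.0])
print(mrp_value(P, R, gamma=0.95))
```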

Page 6: Kernelized Value Function Approximation for Reinforcement Learning

Kernels

• Properties:
  – Symmetric function between two points: k(s_i, s_j)
  – Positive semi-definite kernel matrix K
• Uses:
  – Dot product in a high-dimensional space (the kernel trick)
  – Gain expressiveness
• Risks:
  – Overfitting
  – High computational cost
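As a small illustration of these properties (my own example, using a Gaussian kernel on random states), the sketch below builds a kernel matrix and checks symmetry and positive semi-definiteness:

```python
import numpy as np

def gaussian_kernel(s_i, s_j, bandwidth=1.0):
    # Symmetric function between two points: k(s_i, s_j) = k(s_j, s_i)
    return np.exp(-np.sum((s_i - s_j) ** 2) / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
states = rng.standard_normal((10, 2))           # 10 sampled states in R^2

# K_ij = k(s_i, s_j)
K = np.array([[gaussian_kernel(si, sj) for sj in states] for si in states])

print(np.allclose(K, K.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # PSD (up to numerical tolerance)
```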

Page 7: Kernelized Value Function Approximation for Reinforcement Learning

Outline

• Motivation, Notation, and Framework
• Kernel-Based Models
  – Model-Based VFA
  – Interpretation of Previous Work
• Bellman Error Decomposition
• Experimental Results and Conclusions

Page 8: Kernelized Value Function Approximation for Reinforcement Learning

Kernelized Regression

• Apply the kernel trick to least-squares regression:

  y(x) = k(x)^T (K + Σ)^{−1} t

• t: target values
• K: kernel matrix, where K_ij = k(s_i, s_j)
• k(x): column vector, where k_i(x) = k(s_i, x)
• Σ: regularization matrix
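A minimal sketch of the regression formula above; the RBF kernel, the diagonal Σ, and the toy 1-D data are my own choices for illustration:

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_regression_predict(S, t, x, Sigma, bandwidth=1.0):
    """y(x) = k(x)^T (K + Sigma)^{-1} t  -- regularized kernel regression."""
    K = rbf_kernel(S, S, bandwidth)                     # K_ij = k(s_i, s_j)
    k_x = rbf_kernel(S, x[None, :], bandwidth)[:, 0]    # k_i(x) = k(s_i, x)
    return k_x @ np.linalg.solve(K + Sigma, t)

# Fit a noisy 1-D function from a handful of samples.
rng = np.random.default_rng(0)
S = rng.uniform(0, 2 * np.pi, size=(25, 1))
t = np.sin(S[:, 0]) + 0.1 * rng.standard_normal(25)
Sigma = 0.1 * np.eye(25)
print(kernel_regression_predict(S, t, np.array([np.pi / 2]), Sigma))
```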

Page 9: Kernelized Value Function Approximation for Reinforcement Learning

Kernel-Based Models

• Approximate reward model:

  R̂(s) = k(s)^T (K + Σ_R)^{−1} r

• Approximate transition model
  – Want to predict k(s′) (not s′ itself)
  – Construct the matrix K′, where K′_ij = k(s′_i, s_j)

  k̂(s′) = k(s)^T (K + Σ_P)^{−1} K′

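A minimal sketch of these two estimators (the function names and the k_s convention are mine, not from the slides): k_s holds the kernel values k(s_i, s) between a query state s and each sampled state, K and Kp are the matrices K and K′, and r is the vector of sampled rewards.

```python
import numpy as np

def approx_reward(k_s, K, r, Sigma_R):
    """R_hat(s) = k(s)^T (K + Sigma_R)^{-1} r"""
    return k_s @ np.linalg.solve(K + Sigma_R, r)

def approx_next_kernel(k_s, K, Kp, Sigma_P):
    """k_hat(s') = k(s)^T (K + Sigma_P)^{-1} K'  (predicted next-state kernel values)"""
    return k_s @ np.linalg.solve(K + Sigma_P, Kp)
```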

Page 10: Kernelized Value Function Approximation for Reinforcement Learning

Model-based Value Function

Model equations:

  R̂(s) = k(s)^T (K + Σ_R)^{−1} r
  k̂(s′) = k(s)^T (K + Σ_P)^{−1} K′

Expanding the value as discounted predicted rewards, V̂(s) = R̂(s) + γR̂(s′) + γ²R̂(s′′) + …, and using the unregularized models (Σ_R = Σ_P = 0):

  V̂(s) = k(s)^T K^{−1} r + γ k(s′)^T K^{−1} r + γ² k(s′′)^T K^{−1} r + …
        = k(s)^T K^{−1} r + γ k(s)^T K^{−1} K′ K^{−1} r + γ² k(s)^T (K^{−1} K′)² K^{−1} r + …
        = k(s)^T (K − γK′)^{−1} r


Page 11: Kernelized Value Function Approximation for Reinforcement Learning

Model-based Value Function

Unregularized:

  V̂(s) = k(s)^T (K − γK′)^{−1} r

Regularized:

  V̂(s) = k(s)^T [(K + Σ_R) − γ(K + Σ_R)(K + Σ_P)^{−1} K′]^{−1} r

Whole state space:

  V̂ = Kw, where w = [(K + Σ_R) − γ(K + Σ_R)(K + Σ_P)^{−1} K′]^{−1} r
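A minimal NumPy sketch of the regularized weight formula above (names are mine); passing zero regularization matrices reduces it to the unregularized (K − γK′)^{−1} r:

```python
import numpy as np

def model_based_weights(K, Kp, r, gamma, Sigma_R=None, Sigma_P=None):
    """w = [(K + Sigma_R) - gamma (K + Sigma_R)(K + Sigma_P)^{-1} K']^{-1} r,
    so that V_hat = K w at the sampled states."""
    n = K.shape[0]
    Sigma_R = np.zeros((n, n)) if Sigma_R is None else Sigma_R
    Sigma_P = np.zeros((n, n)) if Sigma_P is None else Sigma_P
    KR = K + Sigma_R
    A = KR - gamma * KR @ np.linalg.solve(K + Sigma_P, Kp)
    return np.linalg.solve(A, r)
```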

Page 12: Kernelized Value Function Approximation for Reinforcement Learning

Previous Work

• Kernel Least-Squares Temporal Difference Learning (KLSTD) [Xu et al., 2005]
  – Rederive LSTD, replacing dot products with kernels
  – No regularization
• Gaussian Process Temporal Difference Learning (GPTD) [Engel et al., 2005]
  – Model the value function directly with a GP
• Gaussian Processes in Reinforcement Learning (GPRL) [Rasmussen and Kuss, 2004]
  – Model transitions and value with GPs
  – Deterministic reward

Page 13: Kernelized Value Function Approximation for Reinforcement Learning

Equivalency (Method / Value Function / Model-based Equivalent):

• KLSTD: w = (KHK)^{−1} K r; equivalent to the model-based solution with Σ_P = Σ_R = 0
• GPTD: w = H^T (HKH^T + Σ)^{−1} r; equivalent with Σ_P = Σ_R = Σ(H^T)^{−1}
• GPRL: w = (K + σ²Δ − γK′)^{−1} r; equivalent with Σ_P = Σ_R = σ²Δ
• Model-based [T&P '09]: w = [(K + Σ_R) − γ(K + Σ_R)(K + Σ_P)^{−1} K′]^{−1} r

Here H = I − γP, Σ is the GPTD noise parameter, and σ²Δ is the GPRL regularization parameter.

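As a quick numerical sanity check of the GPRL row (a toy example of my own; Δ is taken to be the identity purely for illustration), the model-based weights with Σ_P = Σ_R = σ²Δ collapse to the GPRL solution:

```python
import numpy as np

# Sanity check: with Sigma_P = Sigma_R = sigma2 * Delta, the model-based
# weights reduce to the GPRL solution (K + sigma2*Delta - gamma*K')^{-1} r.
rng = np.random.default_rng(1)
n, gamma, sigma2 = 6, 0.9, 0.5
S = rng.standard_normal((n + 1, 2))                      # a chain of states
K_all = np.exp(-((S[:, None] - S[None, :]) ** 2).sum(-1))
K, Kp = K_all[:n, :n], K_all[1:, :n]                     # K'_ij = k(s'_i, s_j)
r = rng.standard_normal(n)
Delta = np.eye(n)

Reg = K + sigma2 * Delta
w_model = np.linalg.solve(Reg - gamma * Reg @ np.linalg.solve(Reg, Kp), r)
w_gprl = np.linalg.solve(K + sigma2 * Delta - gamma * Kp, r)
print(np.allclose(w_model, w_gprl))                      # True
```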

Page 14: Kernelized Value Function Approximation for Reinforcement Learning

Outline

• Motivation, Notation, and Framework
• Kernel-Based Models
  – Model-Based VFA
  – Interpretation of Previous Work
• Bellman Error Decomposition
• Experimental Results and Conclusions

Page 15: Kernelized Value Function Approximation for Reinforcement Learning

Model Error

• Error in reward approximation:

  ΔR = R − R̂ = R − K(K + Σ_R)^{−1} r

• Error in transition approximation:

  ΔK′ = PK − P̂K = K′ − K(K + Σ_P)^{−1} K′, where K′_ij = E[k(s′_i, s_j)]

• PK: expected next kernel values
• P̂K: approximate next kernel values

Page 16: Kernelized Value Function Approximation for Reinforcement Learning

Bellman Error

BE(Kw) = ΔR + γΔK′w

where ΔR = R − R̂ is the reward error and γΔK′w, with ΔK′ = PK − P̂K, is the transition error.

The Bellman Error is a linear combination of the reward and transition errors.
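A minimal NumPy sketch of this decomposition for the model-based weights (names are mine; it assumes the true expected rewards R and expected next kernel values PK are available, as in the experiments later in the deck):

```python
import numpy as np

def bellman_error_decomposition(K, Kp, r, R_true, PK_true, gamma, Sigma_R, Sigma_P):
    """Split BE(Kw) into reward error and transition error for the model-based w."""
    # Model-based weights: V_hat = K w.
    KR = K + Sigma_R
    A = KR - gamma * KR @ np.linalg.solve(K + Sigma_P, Kp)
    w = np.linalg.solve(A, r)

    R_hat = K @ np.linalg.solve(K + Sigma_R, r)          # approximate reward model
    PK_hat = K @ np.linalg.solve(K + Sigma_P, Kp)        # approximate transition model

    reward_error = R_true - R_hat                        # Delta R
    transition_error = gamma * (PK_true - PK_hat) @ w    # gamma * Delta K' * w
    return reward_error, transition_error, reward_error + transition_error
```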

Page 17: Kernelized Value Function Approximation for Reinforcement Learning

Outline

• Motivation, Notation, and Framework
• Kernel-Based Models
  – Model-Based VFA
  – Interpretation of Previous Work
• Bellman Error Decomposition
• Experimental Results and Conclusions

Page 18: Kernelized Value Function Approximation for Reinforcement Learning

Experiments

• A version of the two-room problem [Mahadevan & Maggioni, 2006]

• Use Bellman Error decomposition to tune regularization parameters

Figure: the two-room grid domain, with the reward location labeled REWARD.

Page 19: Kernelized Value Function Approximation for Reinforcement Learning

Experiments

Figures: V̂, BE, ΔK′w, and ΔR, shown for Σ_P = 0, Σ_R = 0 and for Σ_P = 0.1I, Σ_R = 0.

Page 20: Kernelized Value Function Approximation for Reinforcement Learning

Conclusion

• Novel, model-based view of kernelized RL built around kernel regression

• Previous work differs from model-based view only in approach to regularization

• Bellman Error can be decomposed into transition and reward error

• Transition and reward error can be used to tune parameters

Page 21: Kernelized Value Function Approximation for Reinforcement Learning

Thank you!

Page 22: Kernelized Value Function Approximation for Reinforcement Learning

What about policy improvement?

• Wrap policy iteration around kernelized VFA
  – Example: KLSPI
  – The Bellman error decomposition will be policy dependent
  – The choice of regularization parameters may be policy dependent
• Our results do not apply to SARSA variants of kernelized RL, e.g., GPSARSA

Page 23: Kernelized Value Function Approximation for Reinforcement Learning

What’s left?

• Kernel selection
  – Kernel selection (not just parameter tuning)
  – Varying kernel parameters across states
  – Combining kernels (see Kolter & Ng '09)
• Computation costs in large problems
  – K is O(#samples)
  – Inverting K is expensive
  – Role of sparsification and its interaction with regularization

Page 24: Kernelized Value Function Approximation for Reinforcement Learning

Comparing model-based approaches

• Transition model
  – GPRL: models s′ as a GP
  – T&P: approximates k(s′) given k(s)
• Reward model
  – GPRL: deterministic reward
  – T&P: reward approximated with regularized, kernelized regression

Page 25: Kernelized Value Function Approximation for Reinforcement Learning

Don’t you have to know the model?

• For our experiments and graphs: reward and transition errors are calculated with the true R and K′

• In practice: Cross-validation could be used to tune parameters to minimize reward and transition errors

Page 26: Kernelized Value Function Approximation for Reinforcement Learning

Why is the GPTD regularization term asymmetric?

• GPTD is equivalent to T&P when Σ_P = Σ_R = Σ(H^T)^{−1}
• This can be viewed as propagating the regularizer through the transition model
  – (H^T)^{−1} = Σ_{i=0}^{∞} γ^i (P^T)^i, so Σ(H^T)^{−1} spreads Σ across future transitions
  – Is this a good idea?
  – Our contribution: tools to evaluate this question

Page 27: Kernelized Value Function Approximation for Reinforcement Learning

What about Variances?

• Variances can play an important role in Bayesian interpretations of kernelized RL
  – Can guide exploration
  – Can ground regularization parameters
• Our analysis focuses on the mean
• Variances are a valid topic for future work

Page 28: Kernelized Value Function Approximation for Reinforcement Learning

Does this apply to the recent work of Farahmand et al.?

• Not directly
• All methods assume (s,r,s′) data
• Farahmand et al. include next states (s′′) in their kernel, i.e., k(s′′,s) and k(s′′,s′)
• Previous work, and ours, includes only s′ in the kernel: k(s′,s)

Page 29: Kernelized Value Function Approximation for Reinforcement Learning

How is This Different from Parr et al. ICML 2008?

• Parr et al. considers linear fixed-point solutions, not kernelized methods
• Equivalence between linear fixed-point methods was fairly well understood already
• Our contribution:
  – We provide a unifying view of previous kernel-based methods
  – We extend the equivalence between model-based and direct methods to the kernelized case