
Planning and Model Selection in Data-Driven Markov Models

Shie Mannor

Department of Electrical Engineering, Technion

Joint work with many people along the way: Dotan Di-Castro (Yahoo!), Assaf Hallak (Technion), Ofir Mebel (Apple), Aviv Tamar (Technion), John Tsitsiklis (MIT), Huan Xu (NUS), and many others.

ICML, June 2014


Classical planning problems

We typically want to maximize the expected average/discounted reward.

In planning: the model is “known”, and there is a single scalar reward.

We believe the model to be true, even though it is not ...


Flavors of uncertainty

“Natural” uncertainty: random transitions/rewards → classical MDP planning. Not this talk.

Deterministic uncertainty in the parameters → Robust MDPs. 1/2 of this talk.

Probabilistic uncertainty in the parameters → Bayesian RL/MDPs → not this talk (but a book is coming!).

Model uncertainty: the model is not known → second 1/2 of this talk.


Motivation I: Power grid

Unit commitment scheduling problem (very dynamic). Objective: guarantee reliability (99.9%).

The model is kind of known, but highly stochastic. Failures (outages) do happen!


Motivation II

Large US retailer (Fortune 500 company). Marketing problem: send or do not send a coupon/invitation/mail-order catalogue.

Common wisdom: per customer, look at RFM (Recency, Frequency, Monetary value).

Dynamics matter. How to discretize?


Common to the problems

“Real” state space is huge with lots of uncertainty and parameters

Batch data are available

Reasonable simulators, but not ground truth

Computational speed is less of an issue, but we still need to get things done.

Important low-probability events.

Uncertainty and (model) risk are THE concern.


This talk

How to choose the “right” model based on the available data?

How to optimize when the model is not (fully) known?


Part I: The Model Selection Problem

We will focus on a simpler problem:

1. Ignore actions completely (MRP). We have: state → reward → next state.

2. We observe a sequence of T observations and rewards in some space $O \times \mathbb{R}$ ($O$ is complicated):
   $D(T) = (o_1, r_1, o_2, r_2, \dots, o_T, r_T)$.

3. We are given K mappings from $O$ to state spaces $S_1, \dots, S_K$, belonging to MRPs $M_1, \dots, M_K$, respectively. Each mapping $H_i : O \to S_i$ describes a model where $S_i = \{x^{(i)}_1, \dots, x^{(i)}_{|S_i|}\}$ is the state space of the MRP $M_i$.

4. We do not describe how the mappings $\{H_i\}_{i=1}^{K}$ are constructed.

[Hallak & Mannor, KDD 2013]


The Identification Problem

A model selection criterion takes as input $D_T$ and the models $M_1, \dots, M_K$, and returns one of the K models as the proposed true model.

Definition: A model selection criterion is weakly consistent if

$P_i\big(M(D_T) \neq i\big) \to 0 \quad \text{as } T \to \infty,$

where $P_i$ is the probability induced when model $i$ is the correct model.


Penalized Likelihood Criteria

In abstraction: data samples $y_1, y_2, \dots, y_T$.

$L_i(T) = \max_\theta \{\log P(y_1, \dots, y_T \mid M_i(\theta))\}.$

We denote the dimension of $\theta$ by $|M_i|$. Then an MDL model estimator has the following structure:

$\mathrm{MDL}(i) \triangleq |M_i|\, f(T) - L_i(T),$

where $f(T)$ is some sub-linear function.

Many related criteria: AIC, BIC, and many others.
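A minimal sketch of how such a penalized score can be evaluated, assuming the maximized log-likelihoods are already available and taking $f(T) = T^{2/3}$; the function names and the penalty choice are illustrative assumptions only:

```python
import numpy as np

def mdl_score(log_likelihood, num_params, T, f=lambda T: T ** (2.0 / 3.0)):
    """Penalized-likelihood score: MDL(i) = |M_i| * f(T) - L_i(T)."""
    return num_params * f(T) - log_likelihood

def select_model(log_likelihoods, num_params_list, T):
    """Return the index of the model with the smallest penalized score."""
    scores = [mdl_score(ll, k, T) for ll, k in zip(log_likelihoods, num_params_list)]
    return int(np.argmin(scores)), scores
```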


Impossibility result

Theorem: There does not exist a consistent MDL-like criterion.


Identifying Markov Reward Processes

We look at the aggregate prediction error.

Two types of aggregation:
1. Reward aggregation
2. Transition probability aggregation

We will focus on refined models: $M_1 \preceq M_2$ if $M_2$ is a refinement of $M_1$.


Reward Aggregation

Define the reward mean square error (RMSE) operator to be

$L^i_{\mathrm{RMSE}}(D_T) = \frac{1}{T} \sum_{j \in S_i} \varepsilon(x^i_j),$

where $\varepsilon(x^i_j)$ is the error in state $j$ in model $i$ of the reward estimate.

Observation: $\lim_{T \to \infty} L^i_{\mathrm{RMSE}}(D_T) = \sum_{x \in S_i} \pi(x)\,\mathrm{Var}(x)$.

Lemma: Suppose $M_i$ contains $M_k$. Then, for a single trajectory $D_T$, we have $L^i_{\mathrm{RMSE}}(D_T) \le L^k_{\mathrm{RMSE}}(D_T)$. Moreover, if the states aggregated in $M_i$ have different mean rewards, then the inequality is sharp.

Corollary: Consider a series of refined models $M_1 \preceq \dots \preceq M_k$. Then

$L^1_{\mathrm{RMSE}}(D_T) \ge L^2_{\mathrm{RMSE}}(D_T) \ge \dots \ge L^k_{\mathrm{RMSE}}(D_T).$


Reward Aggregation Score

The (reward) score for the j-th model is

$M(j) = \frac{|M_j|\, f(T)}{T} + L^j_{\mathrm{RMSE}}(D_T), \qquad (1)$

where $f(T)$ is a sub-linear increasing function with $\lim_{T \to \infty} f(T)/\sqrt{T} = \infty$.

Based on the RMSE, we consider the following model selector:

$M_{\mathrm{RMSE}} = \arg\min_j \{M(j)\}.$

Theorem: The model selector $M_{\mathrm{RMSE}}$ is weakly consistent.

Finite-time analysis gives rates.
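A sketch of the resulting selector, assuming the per-state error $\varepsilon(x)$ is taken as the within-state squared deviation of the observed rewards from the state's empirical mean, and again using $f(T) = T^{2/3}$ (sub-linear, with $f(T)/\sqrt{T} \to \infty$); the data layout and names are assumptions for illustration:

```python
import numpy as np

def rmse_criterion(rewards_per_state, T):
    """One possible empirical reading of L^j_RMSE: within-state squared deviation
    of observed rewards from the state's empirical mean, summed and divided by T."""
    total = 0.0
    for r in rewards_per_state:            # rewards observed in one aggregated state
        r = np.asarray(r, dtype=float)
        if r.size > 0:
            total += np.sum((r - r.mean()) ** 2)
    return total / T

def aggregation_score(rewards_per_state, T, f=lambda T: T ** (2.0 / 3.0)):
    """Score M(j) = |M_j| * f(T) / T + L^j_RMSE(D_T), as in eq. (1)."""
    num_states = len(rewards_per_state)    # |M_j|: number of aggregated states
    return num_states * f(T) / T + rmse_criterion(rewards_per_state, T)

def select_aggregation(candidates, T):
    """candidates[j] is the per-state grouping of rewards induced by mapping H_j."""
    scores = [aggregation_score(c, T) for c in candidates]
    return int(np.argmin(scores)), scores
```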


Experiments with artificial data

The figure reports the test statistic M(k) for different model dimensions k. The error bars are one standard deviation from the mean.


Experiments with artificial data

The test statistic AIC(k)/BIC(k) for different model dimensions k.


Experiments with real data

Large US apparel retailer. RFM measures: Recency, Frequency and Monetary value.

Problem: How to aggregate? Focus on recency:

1. Randomly
2. Most recent
3. Least recent


Real data

Panels: (random), (lowest), (highest).

Each graph is for a different value of f(T)/T: blue = 1, green = 10, red = 50.


Mini-conclusion

A very special model selection problem.

Standard approaches fail, but not all is lost.

How to aggregate?

Mismatched models?

MRP → MDP: for optimization and for model discovery.


Part II: Robust MDPs

When in doubt, assume the worst.

Model = {S, P, R, A}

Assume: S and A are known and given; $P \in \mathcal{P}$ and $R \in \mathcal{R}$ are not known.

Look for a policy with best worst-case performance.


Robust MDPs

The problem becomes:

$(*) \qquad \max_{\text{policy}} \; \min_{\text{disturbance}} \; \mathbb{E}_{\text{policy},\,\text{disturbance}} \left[ \sum_t \gamma^t r_t \right]$

Game against nature.

In general, the problem is NP-hard.

Nice (non-robust) generative probabilistic interpretation: if $\Pr_{\mathrm{gen}}(\text{disturbance}) \ge 1 - \delta$, then $(*)$ approximates percentile optimization.

Can get generalization bounds: robustness = generalization.


Robust MDPs (Cont’)

The problem is solvable when the disturbance is a product uncertainty:

$\text{disturbance} = \left\{ (P, R) : P \in \prod_s U_p(s),\; R \in \prod_s U_r(s) \right\}$

Take the worst outcome for every state.

Eventually solved by Nilim and El Ghaoui (OR, 2005) using DP (assuming regular uncertainty sets).

If really bad events are to be included, we must make $U_p$ and $U_r$ huge → conservativeness.

A complicated tradeoff to make: can't correlate bad events.
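For intuition, a small sketch of robust value iteration under such a rectangular (product) uncertainty set, assuming each per-state-action uncertainty set is given explicitly as a finite list of candidate next-state distributions and the rewards are known; this is an illustrative instance, not the algorithm of Nilim and El Ghaoui:

```python
import numpy as np

def robust_value_iteration(P_sets, R, gamma, n_iter=500, tol=1e-8):
    """Robust value iteration with per-(s, a) rectangular uncertainty.

    P_sets[s][a] is a list of candidate next-state distributions (a finite
    uncertainty set U_p(s, a)); R[s, a] is a known reward. Returns the robust
    value function and a greedy robust policy.
    """
    n_states = len(P_sets)
    n_actions = len(P_sets[0])
    V = np.zeros(n_states)
    for _ in range(n_iter):
        Q = np.empty((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                # Adversary picks the worst candidate model for this state-action pair.
                worst = min(np.dot(p, V) for p in P_sets[s][a])
                Q[s, a] = R[s, a] + gamma * worst
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)
```

The inner minimum is taken independently for every state-action pair, which is exactly the rectangularity assumption that makes the DP tractable.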


Uncertainty sets

Convex sets → large uncertainty.


Useful uncertainty sets I

At most K disasters:

$\text{disturbance}_K = \big\{ (P, R) : P_s = P_{\mathrm{nom}},\ R_s = R_{\mathrm{nom}} \text{ except at most at } s_1, \dots, s_K, \text{ where } P_{s_i}, R_{s_i} \in (\Delta_P, \Delta_R) \big\}$

A generative model: every node gets disturbed with probability $\mu$:

$\Pr_{\mathrm{generative}}(\text{at most } K \text{ disturbances}) = 1 - \text{binomial tail} = \sum_{k=0}^{K} \binom{K}{k} \mu^k (1 - \mu)^{K - k}$


Useful uncertainty sets II

Continuous uncertainty “budget”

Suppose there is a probability $\Pr_{\mathrm{dist}}\big((P, R); (P_{\mathrm{nom}}, R_{\mathrm{nom}})\big)$.

A generative model: every node gets disturbed with a continuous probability $\Pr_{\mathrm{dist}}$:

$\Pr_{\mathrm{generative}}(\text{disturbance}) = \prod_{s \in S} \Pr_{\mathrm{dist}}\big((P_s, R_s); (P_{s,\mathrm{nom}}, R_{s,\mathrm{nom}})\big)$

$\text{disturbance}_\lambda = \big\{ (P, R) : \Pr_{\mathrm{generative}}(\text{disturbance}) \le \lambda \big\}$

Not convex → cannot solve.


State Augmentation

Original state space: S
New state space: S × {0, 1, ..., K} for discrete disasters
New state space: S × [0, λ] for the uncertainty budget

Recipe: use your favorite algorithm with the augmented state space (see the sketch below).
1. Can use structural properties (monotonicity) in the uncertainty.
2. For continuous uncertainty, can discretize efficiently.

Caveat: must assume the uncertainty is observed.
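A sketch of the recipe for the at-most-K-disasters set, assuming a fixed policy (an MRP), a single candidate "disaster" transition row per state, and that the remaining budget is observed; all names and the specific disturbance model are illustrative assumptions:

```python
import numpy as np

def k_disaster_values(P_nom, P_bad, r, gamma, K, n_iter=1000, tol=1e-8):
    """Pessimistic values for an MRP where an adversary may replace the nominal
    transition row by a 'disaster' row at most K times along the trajectory.

    P_nom: (S, S) nominal transition matrix; P_bad: (S, S) disturbed rows;
    r: (S,) rewards. Returns V of shape (S, K + 1), where V[s, k] is the value
    when k disasters remain (the augmented state is (s, k)).
    """
    S = len(r)
    V = np.zeros((S, K + 1))
    for _ in range(n_iter):
        V_new = np.empty_like(V)
        for k in range(K + 1):
            nominal = P_nom @ V[:, k]
            if k == 0:
                worst = nominal                                   # budget exhausted
            else:
                worst = np.minimum(nominal, P_bad @ V[:, k - 1])  # adversary may strike
            V_new[:, k] = r + gamma * worst
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V
```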


Scaling-up robust MDPs

Smallish MDPs are easy (robust or not).
Non-robust MDPs can be handled with function approximation or policy search methods.

What about robust MDPs?

Difficulty 1: the state space in many problems is very large.
Difficulty 2: the parameters of the dynamics are rarely known, so it is hard to do Monte Carlo.


A crash course on ADP

In ADP (Approximate Dynamic Programming) we want to solve the Bellman equation.

(Policy evaluation): Write $T^\pi y = r + \gamma P y$.

The Bellman equation is $J^\pi = T^\pi J^\pi$.

Assume $J = \Phi w$ ($\Phi$ is known and given, and $w$ is a vector in, say, $\mathbb{R}^n$). With approximation:

$\Phi w = \Pi T^\pi \Phi w$

($\Pi$ is a projection onto $\{\Phi w : w \in \mathbb{R}^n\}$.)

Not too hard to show that $\Pi T^\pi$ is a contraction → unique fixed point.
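When the MRP is small and fully known, the projected fixed point can be computed in closed form, $w = \big(\Phi^\top D (\Phi - \gamma P \Phi)\big)^{-1} \Phi^\top D r$. A minimal sketch (the weighting $d$ and the names are assumptions for illustration):

```python
import numpy as np

def projected_fixed_point(Phi, P, r, gamma, d):
    """Solve Phi w = Pi T^pi Phi w for a small, fully known MRP.

    Phi: (S, n) feature matrix; P: (S, S) transition matrix under the evaluated
    policy; r: (S,) rewards; d: (S,) nonnegative weights defining the projection
    Pi (e.g. the stationary distribution). Returns the weight vector w.
    """
    D = np.diag(d)
    A = Phi.T @ D @ (Phi - gamma * (P @ Phi))   # Phi^T D (I - gamma P) Phi
    b = Phi.T @ D @ r
    return np.linalg.solve(A, b)
```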


Back to robust MDPs

In robust MDPs:

$J^\pi(s) = r(s, \pi(s)) + \gamma \inf_{P \in \mathcal{P}} \mathbb{E}_\pi\big[ J^\pi(s') \mid s, \pi(s) \big]$

Writing the pessimistic transition as $\sigma^\pi(v) = \inf_{p \in \mathcal{P}} p^\top v$, we can write

$T^\pi y = r + \gamma\, \sigma^\pi(y)$

Not linear anymore!

Can show that $T v = \sup_\pi T^\pi v$ is a contraction [Iyengar, 2005].


Solving Robust MDPs

Fixed-point equation (writing $J = \Phi w$):

$\Phi w = \Pi T^\pi \Phi w \qquad (*)$

In general: no solution.

But if for some $\beta < 1$ we have

$P(s' \mid s, \pi(s)) \le \frac{\beta}{\gamma}\, \bar{P}(s' \mid s, \pi(s)) \quad \text{for all } P \in \mathcal{P}$

(where $\bar{P}$ is the conditional distribution according to which the data are observed), then:

1. $\Pi T^\pi$ is a contraction and we have a unique solution to $(*)$.
2. This solution is close to the optimal one.


Solving Robust MDPs (cont')

Value iteration:

$\Phi w_{k+1} = \Pi T^\pi \Phi w_k \qquad (*)$

Solution:

$w_{k+1} = (\Phi^\top D \Phi)^{-1}\big( \Phi^\top D r + \gamma\, \Phi^\top D\, \sigma^\pi(\Phi w_k) \big)$

Three terms to control:

$\Phi^\top D \Phi \approx \frac{1}{N} \sum_{t=1}^{N} \phi(s_t)\phi(s_t)^\top, \qquad \Phi^\top D r \approx \frac{1}{N} \sum_{t=1}^{N} \phi(s_t)\, r(s_t, \pi(s_t))$

and

$\Phi^\top D\, \sigma^\pi(\Phi w_k) \approx \frac{1}{N} \sum_{t=1}^{N} \phi(s_t)\, \sigma_{\mathcal{P}(s_t)}(\Phi w_k).$

But how do we compute the pessimistic $\sigma_{\mathcal{P}}(\Phi w_k)$?
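A minimal sample-based sketch of this update, assuming the pessimizing oracle $\sigma$ is available as a callable (how to compute it is the subject of the next slide); the data layout and names are illustrative assumptions:

```python
import numpy as np

def robust_projected_vi(phi, rewards, sigma, gamma, n_iter=100):
    """Sample-based w_{k+1} = (Phi^T D Phi)^{-1} (Phi^T D r + gamma Phi^T D sigma(Phi w_k)).

    phi: (N, n) features of the visited states s_1..s_N;
    rewards: (N,) observed rewards r(s_t, pi(s_t));
    sigma(t, w): pessimizing oracle returning inf_{p in P(s_t)} sum_s p(s) phi(s)^T w
                 for the t-th sample.
    """
    N, n = phi.shape
    A = phi.T @ phi / N                       # estimates Phi^T D Phi
    b = phi.T @ rewards / N                   # estimates Phi^T D r
    w = np.zeros(n)
    for _ in range(n_iter):
        sig = np.array([sigma(t, w) for t in range(N)])
        c = phi.T @ sig / N                   # estimates Phi^T D sigma^pi(Phi w)
        w = np.linalg.solve(A, b + gamma * c)
    return w
```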


Solving the inner problem

Pessimizing oracle:

$\inf_{p \in \mathcal{P}} \sum_{s \text{ reachable}} p(s)\, \phi(s)^\top w_k$

Can be solved when $\mathcal{P}(s, a) = \{ p : \mathrm{dist}(p, \hat{p}) \le \varepsilon,\ p \text{ is a probability} \}$ and in some other lucky cases.

So we know how to “find” the value function of a given policy.

Finding the optimal policy, SARSA-style:
1. Define $\phi(s, a)$ instead of just $\phi(s)$.
2. Let $\pi_w(s) = \arg\max_a \phi(s, a)^\top w$.
3. Do value iteration in the new state-action problem.

[Tamar, Mannor & Xu, ICML 2014]
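One of the "lucky cases" is a total-variation (L1) ball around an empirical distribution $\hat{p}$, for which the inner minimization has a simple greedy solution: move at most $\varepsilon/2$ of probability mass from the highest-value states to the lowest-value state. A sketch under that assumption (the specific uncertainty set is chosen here for illustration only):

```python
import numpy as np

def worst_case_expectation_l1(p_hat, v, eps):
    """min_p p.v  s.t.  ||p - p_hat||_1 <= eps and p is a probability vector.

    Greedy solution: add up to eps/2 of mass to the lowest-value state and
    remove the same amount from the highest-value states.
    """
    p = np.array(p_hat, dtype=float)
    i_min = int(np.argmin(v))
    budget = min(eps / 2.0, 1.0 - p[i_min])    # mass we are allowed/able to add
    p[i_min] += budget
    for i in np.argsort(v)[::-1]:              # remove mass from high-value states first
        if budget <= 0:
            break
        if i == i_min:
            continue
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
    return float(np.dot(p, v)), p
```

Such an oracle could be plugged in as the `sigma` callable in the previous sketch, evaluated on the features of the states reachable from $s_t$.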


Example from option trading

Trading an option:


Back to motivating examples I

Unit commitment problem (day-ahead/few days ahead):

Classical approach is N/N+1.
Renewables (wind/solar) and demand response cause much stochasticity.
Modeling is hard (complex internal state, dependencies, high dimensionality).
Rare events are essential: need to robustify.


Back to motivating examples II

Marketing problem:

Our model is quite bad to start with.
Robust optimization helps you fight model mismatch.
Uncertainty set construction is mysterious (but is it important?).

Main issue: validation.
Secondary issue: optimization.


Conclusions

Model misspecification → parameter uncertainty → robust approach.

Why does uncertainty matter?
Models come from data, and data are a bitch.
Rewards come from data or humans, which is even worse.

Outlook:
Contextual parameters
Latent models [Bandits: Maillard & Mannor, ICML 2014]
Policy evaluation
Model validation

