ORF 544 Stochastic Optimization and Learning
Derivative-free stochastic search
Warren Powell, Princeton University, Spring 2019
http://www.castlelab.princeton.edu

Page 1:

ORF 544
Stochastic Optimization and Learning
Spring, 2019

Warren Powell
Princeton University

http://www.castlelab.princeton.edu

© 2018 W.B. Powell

Page 2:

Week 4

Chapter 7: Derivative-free stochastic search

© 2019 Warren Powell Slide 2

Page 3:

Derivative-free stochastic search

Notes:
» The material for this week and next will all be drawn from chapter 7.
» Chapter 7 is 60 pages, and desperately in need of a rewrite. I don't have time to do this, but the lectures will follow a new (and improved) outline.

» Derivative-free stochastic optimization is an extremely rich problem class. We will use this to illustrate four fundamental classes of policies, which can be organized along two core strategies:

• Policy search – Here we will search within a class of policies to identify which work best over time/iterations.

• Lookahead policies – These are policies that are constructed to optimize the value of an experiment plus the value of the downstream state.

© 2019 Warren Powell

Page 4:

Derivative-free stochastic search

Week 4 (this week): We will cover:
» Introduction to derivative-free stochastic optimization
» Introduction to two strategies for developing policies:
• Policy search class
• Lookahead class
» Then we will focus on the "policy search" class, which can be divided into two classes:
• Policy function approximations (PFAs)
• Cost function approximations (CFAs)

» We will do the more difficult lookahead classes next week.

© 2019 Warren Powell

Page 5:

Introduction to derivative-free stochastic search

© 2019 Warren Powell Slide 5

Page 6:

Sports

Finding the best player
» We have a set of players from which to choose a team.
» The effectiveness of a certain team is best revealed by playing a game.
» We maximize the total number of games won in the season.

© 2019 Warren Powell Slide 6

Page 7:

Optimal learning in diabetes

How do we find the best treatment for diabetes?
» The standard treatment is a medication called metformin, which works for about 70 percent of patients.
» What do we do when metformin does not work for a patient?
» There are about 20 other treatments, and it is a process of trial and error. Doctors need to get through this process as quickly as possible.

© 2019 Warren Powell

Page 8:

Drug discovery

Biomedical research
» How do we find the best drug to cure cancer?
» There are millions of combinations, with laboratory budgets that cannot test everything.
» We need a method for sequencing experiments.

© 2019 Warren Powell Slide 8

Page 9:

Drug discovery

Designing molecules
» X and Y are sites where we can hang substituents to change the behavior of the molecule. We approximate the performance using a linear belief model:

$Y = \theta_0 + \sum_{\text{sites } i} \sum_{\text{substituents } j} \theta_{ij} X_{ij}$

» How do we sequence experiments to learn the best molecule as quickly as possible?

© 2019 Warren Powell

Page 10:

© 2019 Warren Powell Slide 10

Page 11:

Designing menus

What should you offer?
» No-one will like everything on a menu.
» The trick is to find choices that will satisfy people.
» You will probably need to observe the response to a particular menu over a period of time.
» So, observations are time-consuming.

© 2019 Warren Powell

Page 12:

Derivative-free stochastic search

Settings

© 2019 Warren Powell

Page 13:

The multiarmed bandit problem

Slide 13

Page 14:

Page 15:

Multiarmed bandit problems

Bandit problems and objective functions
» The classical bandit problem is cumulative reward:

$\max_\pi \mathbb{E} \sum_{n=0}^{N-1} F(X^\pi(S^n), W^{n+1})$

where
• $S^n$ = our state of knowledge (what we know about each slot machine)
• $x^n = X^\pi(S^n)$ = the choice of the next "arm" to play
• $W^{n+1}$ = the new information (the "winnings")

» The essential characteristic of bandit problems is the exploration vs. exploitation tradeoff: do you try something that does not look as good, potentially incurring a lower reward, so you can learn and make better decisions in the future?
» Then the bandit community discovered that the same tradeoff exists in final reward. These problems became known as "best arm" bandit problems in the bandit vocabulary.

© 2019 Warren Powell


Page 16:

Multiarmed bandit problems

Arms:

Page 17:

Multiarmed bandit problems

Bandits:

Page 18:

Multiarmed bandit problems

Page 19:

Multiarmed bandit problems

Page 20:

Multiarmed bandit problems

Dimensions of a "bandit" problem:
» The "arms" (decisions) may be
• Binary (A/B testing, stopping problems)
• Discrete alternatives (drug, catalyst, …)
• Continuous choices (price)
• Vector-valued (basketball team, products, movies, …)
• Multiattribute (attributes of a movie, song, person)
• Static vs. dynamic choice sets
• Sequential vs. batch
» Information (what we observe)
• Success-failure/discrete outcome
• Exponential family (e.g. Gaussian, exponential, …)
• Heavy-tailed (e.g. Cauchy)
• Data-driven (distribution unknown)
• Stationary vs. nonstationary processes
• Lagged responses?
• Adversarial?

Page 21:

Multiarmed bandit problems

Dimensions of a "bandit" problem:
» Belief models
• Lookup tables (these are the most common)
– Independent or correlated beliefs
• Parametric models
– Linear or nonlinear in the parameters
• Nonparametric models
– Locally linear
– Deep neural networks/SVM
• Bayesian vs. frequentist
» Objective function
• Expected performance (e.g. regret)
• Offline (final reward) vs. online (cumulative reward)
– Just interested in the final design?
– Or optimizing while learning?
• Risk metrics

Page 22:

Multiarmed bandit problems

What is a "bandit problem"?
» The literature seems to characterize a "bandit problem" as any problem where a policy has to balance exploration vs. exploitation.
» But this means that a bandit "problem" is defined by how it is solved. E.g., if you use a pure exploration policy, is it a bandit problem?

My definition:
» Any sequential learning problem:
• Maximizing cumulative rewards (finite or infinite horizon)
• Maximizing the terminal reward with a finite budget.
» Fundamental elements of a "bandit" problem:
• A belief model that describes what we know about the function
• The ability to learn sequentially about the function.
» I prefer to distinguish three problem classes:
• Active learning – These problems arise when our decisions affect the information that arrives. This opens the door to making decisions that balance current performance against learning to improve future performance.
• Passive learning – Here we may learn as we go, but we do not directly influence the learning process through our decisions.
• No learning – This would be for problems where there is no element of learning.

Page 23:

An energy storage problem

Transition function (wind energy E, grid G, battery B, load L):

$E_{t+1} = E_t + \hat{E}_{t+1}$
$p_{t+1} = \theta_0 p_t + \theta_1 p_{t-1} + \theta_2 p_{t-2} + \varepsilon^p_{t+1}$
$D_{t+1} = f^D(D_t) + \varepsilon^D_{t+1}$
$R^{battery}_{t+1} = R^{battery}_t + x_t$

Page 24:

An energy storage problem

Types of learning:
» No learning ($\theta_0, \theta_1, \theta_2$ are known):
$p_{t+1} = \theta_0 p_t + \theta_1 p_{t-1} + \theta_2 p_{t-2} + \varepsilon^p_{t+1}$
» Passive learning (learn from price data):
$p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \varepsilon^p_{t+1}$
We have no control over the evolution of prices.
» Active learning ("bandit problems"):
$p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \bar{\theta}_{t3} x^{GB}_t + \varepsilon^p_{t+1}$
If we have to learn the parameters, then we have to introduce this learning process into the dynamics.

Page 25:

Learning in stochastic optimization

Learning the pricing model:
» Let $p_{t+1}$ be the new price and let

$F^{price}_t(p_t|\bar{\theta}_t) = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} = (\bar{\theta}_t)^T p_t$

where $p_t$ on the right is the vector $(p_t, p_{t-1}, p_{t-2})$.
» We update our estimate using our recursive least squares equations:

$\bar{\theta}_{t+1} = \bar{\theta}_t - \frac{1}{\gamma_{t+1}} B_t p_t \varepsilon_{t+1}$
$\varepsilon_{t+1} = F^{price}_t(p_t|\bar{\theta}_t) - p_{t+1}$
$B_{t+1} = B_t - \frac{1}{\gamma_{t+1}} \left( B_t p_t p_t^T B_t \right)$
$\gamma_{t+1} = 1 + p_t^T B_t p_t$
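To make the updates concrete, here is a minimal Python sketch of these recursive least squares equations. The AR(3) feature vector, the initialization $B_0 = I$, and the simulated prices are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def rls_update(theta, B, phi, p_new):
    """One step of the recursive least squares equations above.

    theta : current coefficient estimates (theta_t0, theta_t1, theta_t2)
    B     : the 3x3 matrix B_t
    phi   : feature vector (p_t, p_{t-1}, p_{t-2})
    p_new : the newly observed price p_{t+1}
    """
    gamma = 1.0 + phi @ B @ phi                      # gamma_{t+1}
    eps = theta @ phi - p_new                        # forecast error eps_{t+1}
    theta_new = theta - (B @ phi) * (eps / gamma)    # update the coefficients
    B_new = B - np.outer(B @ phi, B @ phi) / gamma   # B_t p_t p_t^T B_t / gamma
    return theta_new, B_new

# Illustrative usage on a short simulated price series.
theta, B = np.zeros(3), np.eye(3)
prices = [31.0, 29.5, 32.2, 30.8, 33.1, 31.7]
for t in range(2, len(prices) - 1):
    phi = np.array([prices[t], prices[t - 1], prices[t - 2]])
    theta, B = rls_update(theta, B, phi, prices[t + 1])
print(theta)
```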

Page 26:

An energy storage problem

Types of learning:
» No learning ($\theta_0, \theta_1, \theta_2$ are known):
$p_{t+1} = \theta_0 p_t + \theta_1 p_{t-1} + \theta_2 p_{t-2} + \varepsilon^p_{t+1}$
» Passive learning (learn from price data):
$p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \varepsilon^p_{t+1}$
» Active learning ("bandit problems"), where $x^{GB}_t$ is the buy/sell decision:
$p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \bar{\theta}_{t3} x^{GB}_t + \varepsilon^p_{t+1}$
Our decisions influence the prices we observe, which helps with learning.

Page 27:

Classes of problems

© 2019 Warren Powell Slide 27

Page 28:

Major problem classes

Special structure
» There are special cases where we can solve
$\max_x \mathbb{E} F(x, W)$
exactly (the expectation makes this a deterministic problem). But not very many.

Sampled problems (SAA, scenario trees)
» If the only problem is that we cannot compute the expectation, we might solve a sampled approximation (also deterministic):
$\max_x \frac{1}{N} \sum_{n=1}^{N} F(x, \hat{W}^n)$

Adaptive learning algorithms
» This is what we have to turn to for most problems, and is the focus of this tutorial.

Page 29:

Major problem classes

State independent problems
» The problem does not depend on the state of the system.
» The only state variable is what we know (or believe) about the unknown function $\mathbb{E} F(x, W)$, called the belief state $B_t$, so $S_t = B_t$. For example (a newsvendor problem):
$\max_x \mathbb{E} F(x, W) = \mathbb{E}\left[ p \min(x, W) - c x \right]$

State dependent problems
» Now the problem may depend on what we know at time t:
$\max_{0 \le x \le R_t} \mathbb{E} C(S_t, x, W) = \mathbb{E}\left[ p_t \min(x, W) - c x \right]$
» Now the state is $S_t = (R_t, p_t, B_t)$.

Page 30:

Major problem classes

Offline (final reward)
» We can iteratively search for the best solution, but only care about the final answer.
» Asymptotic formulation:
$\max_x \mathbb{E} F(x, W)$
» Finite horizon formulation:
$\max_\pi \mathbb{E} F(x^{\pi,N}, W)$
These offline problems are known as "ranking and selection" or "stochastic search."

Online (cumulative reward)
» We have to learn as we go:
$\max_\pi \mathbb{E} \sum_{n=0}^{N-1} F(X^\pi(S^n), W^{n+1})$
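The two objectives can be compared directly in simulation. Below is a minimal sketch that evaluates the same learning run under both the final-reward and cumulative-reward objectives; the three-alternative setup, the Gaussian noise, and the two toy policies are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([1.0, 1.5, 1.2])   # unknown truth for three alternatives

def run(policy, N=50):
    """Run a learning policy once; return (final reward, cumulative reward)."""
    mu_hat, counts, cumulative = np.zeros(3), np.zeros(3), 0.0
    for n in range(N):
        x = policy(mu_hat, n)
        w = mu_true[x] + rng.normal()              # observation W^{n+1}
        counts[x] += 1
        mu_hat[x] += (w - mu_hat[x]) / counts[x]   # update the belief
        cumulative += w                            # online (cumulative) objective
    final = mu_true[np.argmax(mu_hat)]             # offline (final) objective
    return final, cumulative

explore = lambda mu, n: n % 3                      # pure exploration
exploit = lambda mu, n: int(np.argmax(mu))         # pure exploitation
print("explore:", run(explore))
print("exploit:", run(exploit))
```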

Page 31:

Major problem classes

There are entire fields of stochastic optimization built around "final reward" and "cumulative reward" objectives:
» Final reward
• Stochastic search (e.g. derivative-based stochastic optimization)
• Ranking and selection (finding the best out of a set of choices)
» Cumulative reward
• Multiarmed bandit problems
• Classical dynamic programming, optimal control, stochastic programming, …
» Our presentation will not care whether we are optimizing final or cumulative reward – it is just an objective function.

© 2019 Warren Powell

Page 32:

Major problem classes

"Offline" vs. "online" learning
» My view (or perhaps the view of stochastic optimization):

• “offline learning” is learning in the computer. We do not care about making mistakes as long as we get a good answer in the end.

• “online learning” would be learning in the field, where you have to live with your mistakes.

» The machine learning community uses these terms differently:

• “offline learning” means batch – you have a batch dataset from which you fit a model.

• “online learning” is sequential, as would happen if data is arriving in the field over time.

• The problem is that there are many sequential algorithms that are used in “offline” (e.g. laboratory) settings, and the machine learning community calls these “online.”

© 2019 Warren Powell

Page 33:

Major problem classes

Notes:
» For chapter 7, we will focus on the top row, "state independent problems."
» These are pure learning problems, where we are trying to learn a deterministic decision or policy.

© 2019 Warren Powell

Page 34:

Modeling

© 2019 Warren Powell Slide 34

Page 35:

Modeling

Any learning problem can be written as the sequence

$(S^0, x^0, W^1, S^1, x^1, W^2, \ldots, S^n, x^n, W^{n+1}, \ldots, x^{N-1}, W^N)$

where:
» $S^n = (\bar{\mu}^n_x, \beta^n_x)_{x \in X}$, $X = \{x_1, \ldots, x_M\}$ = belief about the value of $\mu_x$.
» $S^0$ = initial belief.
» $x^n$ = decision made using information from the first $n$ experiments.
» $x^0$ = initial decision based on the information in $S^0$.
» $W^n$ = observation from the $n$th experiment, $n = 1, 2, \ldots, N$.

Decisions are made using a policy: $x^n = X^\pi(S^n)$.

Page 36:

Modeling

All sequential decision problems can be described using:
» States
• Belief states – what do we know about $\mathbb{E} F(x, W)$, or any other function we are learning (transition functions, value functions, policies)
• Physical and informational states (we will get to these later).
» Decisions – What are our choices?
• Pure information collection decisions (e.g. run an MRI, purchase a credit report)
• Implementation decisions, which may also affect what we learn.
» Exogenous information
• Initial belief/prior
• What we learn from an experiment
» Transition function
• Updating beliefs (statistical estimation)
• Possibly updating a physical state.
» Objective function
• Performance metrics
• Finding the best policy

© 2019 Warren Powell

Page 37:

Modeling

State variables
» For derivative-free problems, we often just have a belief state, capturing what we know about our function:
• Lookup tables
• Parametric models
• Nonparametric models
» Physical state
• What node we are at in a network
• How much inventory we have
• We will get to this in the second half of the course
» Information state
• Weather forecast, current price
• The information state at time t may be independent of the state at t-1.
• Or the state at time t depends on the state at t-1:
– Information evolves exogenously
– Our decisions influence the state (e.g. selling stock)

© 2019 Warren Powell

Page 38:

Modeling

Decisions
» We have a finite set of choices:
• $x \in X = \{x_1, x_2, \ldots, x_M\}$
» Examples:
• Experimenting with different drugs to kill cancer in mice.
• Running simulations to plan operations (e.g. for Amazon).
• Evaluating different players for baseball during spring training.
• Testing different materials to maximize the energy density of a battery.
• Test marketing new products in specific regions.
» We only care about the performance at the end of a set of N experiments. We do not care about how well we do along the way.

Page 39:

Modeling

Types of decisions
» Binary: $x \in X = \{0, 1\}$
» Finite: $x \in X = \{1, 2, \ldots, M\}$
» Continuous scalar: $x \in X = [a, b]$
» Continuous vector: $x = (x_1, \ldots, x_K)$, $x_k \in \mathbb{R}$
» Discrete vector: $x = (x_1, \ldots, x_K)$, $x_k$ integer
» Categorical: $x = (a_1, \ldots, a_I)$, where $a_i$ is a category (e.g. red/green/blue)

There are entire fields dedicated to particular classes of decisions.

Page 40:

Modeling

Exogenous information
» Simplest: let $W^{n+1}_x$ be the observed performance of alternative $x = x^n$.
» If we are using a lookup table representation with a Bayesian belief model, we would assume $W^{n+1}_x \sim N(\mu_x, \sigma^2_W)$.
» If we are using a parametric representation, we would write $W^{n+1} = f(x^n|\theta) + \varepsilon^{n+1}$.

© 2019 Warren Powell

Page 41:

Modeling

Transition function:
» Now use $W^{n+1}_x$ to update our belief model:
• Frequentist:
$\bar{\mu}^{n+1}_x = \begin{cases} \left(1 - \frac{1}{N^n_x + 1}\right) \bar{\mu}^n_x + \frac{1}{N^n_x + 1} W^{n+1}_x & \text{if } x^n = x \\ \bar{\mu}^n_x & \text{otherwise} \end{cases}$
with $N^{n+1}_x = N^n_x + 1$ if $x^n = x$.
• Bayesian:
$\bar{\mu}^{n+1}_x = \begin{cases} \dfrac{\beta^n_x \bar{\mu}^n_x + \beta_W W^{n+1}_x}{\beta^n_x + \beta_W} & \text{if } x^n = x \\ \bar{\mu}^n_x & \text{otherwise} \end{cases}$
with $\beta^{n+1}_x = \beta^n_x + \beta_W$ if $x^n = x$.
» Use any of the recursive learning methods from chapter 3.

© 2019 Warren Powell
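A minimal sketch of this transition function in Python. The array-based lookup-table representation is an assumption for illustration:

```python
import numpy as np

def update_beliefs(mu, beta, n_counts, x, w, beta_w, bayesian=True):
    """Apply the transition function above to lookup-table beliefs.

    mu, beta : posterior means and precisions for each alternative
    n_counts : number of times each alternative has been tested (frequentist)
    x        : the alternative just tested (x^n)
    w        : the observation W^{n+1}_x
    beta_w   : precision (1/variance) of a single observation
    """
    mu, beta, n_counts = mu.copy(), beta.copy(), n_counts.copy()
    if bayesian:
        mu[x] = (beta[x] * mu[x] + beta_w * w) / (beta[x] + beta_w)
        beta[x] += beta_w
    else:  # frequentist running sample mean
        n_counts[x] += 1
        mu[x] += (w - mu[x]) / n_counts[x]
    return mu, beta, n_counts
```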

Page 42:

Modeling

Objective functions for:
» State independent and state dependent problems
» Final reward and cumulative reward

Page 43:

Modeling

Simulating objective functions
» It is important to know how to simulate objective functions. We will get to this in chapter 9.

Page 44:

Modeling

Objective functions

Learning policies:
» Approximate dynamic programming
» Q-learning
» SDDP
» …

We will focus on these in the second half of the course.

Page 45:

Modeling

Objective functions

"Online" (cumulative reward) dynamic programming is recognized as "the dynamic programming problem," but the entire literature on solving dynamic programs describes class (4) problems (state-dependent problems with a cumulative-reward objective). Class (3) (state-dependent problems with a final-reward objective) appears to be an open problem class.

Page 46:

Belief models

© 2019 Warren Powell Slide 46

Page 47:

Belief models

Notes:
» With derivative-based search, we had gradients.
» With derivative-free search, we have to form a belief model about $\mathbb{E} F(x, W)$. Classes of belief models are:
• Lookup table
– Independent beliefs
– Correlated beliefs
• Parametric models
– Linear
– Nonlinear
» Logistic regression
» Step functions (sell if price is over some number)
» Neural networks

© 2019 Warren Powell

Page 48:

Approximation strategies

Approximation strategies
» Lookup tables
• Independent beliefs
• Correlated beliefs
» Linear parametric models
• Linear models
• Sparse-linear
• Tree regression
» Nonlinear parametric models
• Logistic regression
• Neural networks
» Nonparametric models
• Gaussian process regression
• Kernel regression
• Support vector machines
• Deep neural networks

© 2019 Warren Powell

Page 49:

Approximating strategies

Lookup tables
» Independent beliefs
$\bar{\mu}^n_x \approx \mathbb{E} F(x, W), \quad x \in \{x_1, \ldots, x_M\}$
» Correlated beliefs
• A few dozen observations can teach us about thousands of points.
» Hierarchical models
• Create beliefs at different levels of aggregation and then use weighted combinations.

© 2019 Warren Powell

Page 50:

Approximating strategies

Parametric models
» Linear models
$F(x|\theta) = \sum_{f \in F} \theta_f \phi_f(x)$
• Might include sparse-additive, where many parameters are zero.
» Nonlinear models, e.g. logistic regression
$F(x|\theta) = \frac{e^{\theta_0 + \theta_1 x_1 + \ldots}}{1 + e^{\theta_0 + \theta_1 x_1 + \ldots}}$
or a parametric policy such as a charge/discharge rule (with state $S_t$ including the electricity price $p_t$):
$X^\pi(S_t|\theta) = \begin{cases} +1 & \text{if } p_t \le \theta^{charge} \\ 0 & \text{if } \theta^{charge} < p_t < \theta^{discharge} \\ -1 & \text{if } p_t \ge \theta^{discharge} \end{cases}$
• (Shallow) neural networks

© 2019 Warren Powell

Page 51:

Approximating strategies

Nonparametric models
» Kernel regression
• Weighted average of neighboring points
• Limited to low dimensional problems
» Locally linear methods
• Dirichlet process mixtures
• Radial basis functions
» Splines
» Support vector machines
» Deep neural networks

© 2019 Warren Powell

Page 52:

Belief models

Lookup table
» We can organize potential catalysts into groups.
» Scientists using domain knowledge can estimate correlations in experiments between similar catalysts.

© 2019 Warren Powell

Page 53:

Belief models

We start with a belief about each material (lookup table).
[Figure: prior beliefs for each material.]

© 2019 Warren Powell Slide 53

Page 54:

Belief models

Testing one material teaches us about other materials.
[Figure: correlated updates of the beliefs after one test.]

© 2019 Warren Powell Slide 54

Page 55:

Belief models

We express our belief using a linear, additive QSAR model:

$Y^m = \theta_0 + \sum_{\text{sites } i} \sum_{\text{substituents } j} \theta_{ij} X^m_{ij}$

where $X^m_{ij}$ is the indicator variable for substituent $j$ at site $i$ of molecule $m$.

© 2019 Warren Powell

Page 56:

Belief models

A sampled belief model
» Each candidate truth $i$ is a logistic response curve in the concentration/temperature $x$:

$R(x|\theta_i) = \theta_{i0} \frac{e^{\theta_{i1} + \theta_{i2} x}}{1 + e^{\theta_{i1} + \theta_{i2} x}}, \quad \theta_i = (\theta_{i0}, \theta_{i1}, \theta_{i2})$

[Figure: sampled response curves 1–4 plotted against concentration/temperature, with line thickness proportional to the probability $p^n_k$ of each sampled parameter vector.]

© 2019 Warren Powell
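A sampled belief model is easy to work with in code: updating the probabilities $p^n_k$ is just Bayes' rule applied to a finite set of candidate models. A minimal sketch (the specific sampled parameter vectors, the Gaussian observation noise, and the function names are assumptions for illustration):

```python
import numpy as np

def response(x, theta):
    """Logistic response R(x | theta) with theta = (theta0, theta1, theta2)."""
    z = theta[1] + theta[2] * x
    return theta[0] / (1.0 + np.exp(-z))   # theta0 * e^z / (1 + e^z)

thetas = np.array([[1.0, -2.0, 0.5],
                   [0.8, -1.0, 0.3],
                   [1.2, -3.0, 0.7]])      # K sampled parameter vectors
p = np.ones(len(thetas)) / len(thetas)     # prior probabilities p^0_k

def bayes_update(p, thetas, x, y_obs, sigma=0.1):
    """Reweight the sampled models by the likelihood of observation y_obs."""
    resid = np.array([y_obs - response(x, th) for th in thetas])
    like = np.exp(-0.5 * (resid / sigma) ** 2)   # Gaussian likelihoods
    p_new = p * like
    return p_new / p_new.sum()

p = bayes_update(p, thetas, x=2.0, y_obs=0.6)
print(p)   # posterior probabilities over the sampled models
```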

Page 57:

Belief models

Parametric belief models
» The value of information (the KG, or knowledge gradient) behaves in an unexpected way.
» The graph to the right shows the value of information while learning a quadratic approximation $Y = \theta_1 x + \theta_2 x^2$.
» We are using these insights to develop simple rules for where to run experiments that avoid the complexity of KG calculations.

[Figure: value of information plotted against the experiment number, 0 to 120.]

© 2019 Warren Powell

Page 58:

Belief models

Truckload brokerages:
» Now we have a logistic curve for each origin-destination pair (i,j), giving the probability that a carrier accepts the price p offered by the shipper:
$P(p|\theta_{ij}) = \frac{e^{\theta^0_{ij} + \theta^1_{ij} p}}{1 + e^{\theta^0_{ij} + \theta^1_{ij} p}}$
» The number of offers for each (i,j) pair is relatively small.
» We need to generalize the learning across "traffic lanes."

© 2019 Warren Powell

Page 59:

Belief models

Hotel revenue management
» Locally linear curves
[Figure: reservations/day (0.5 to 3.0) vs. price $/night (100 to 220).]

© 2019 Warren Powell

Page 60:

Belief models

Hotel revenue management
» Guessing the right logistic curve
[Figure: candidate demand curves, reservations/day vs. price $/night (100 to 220).]

© 2019 Warren Powell

Page 61:

Designing policies

© 2019 Warren Powell Slide 61

Page 62:

Designing policies

We have to start by describing what we mean by a policy.
» Definition: A policy is a mapping from a state to an action. … any mapping.

How do we search over an arbitrary space of policies?

Page 63:

Designing policies

Two fundamental strategies:

1) Policy search – Search over a class of functions for making decisions to optimize some metric:

$\max_{f \in F, \theta^f \in \Theta^f} \mathbb{E}\left\{ \sum_{t=0}^{T} C\left(S_t, X^f(S_t|\theta^f)\right) \,\middle|\, S_0 \right\}$

2) Lookahead approximations – Approximate the impact of a decision now on the future:

$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb{E}\left\{ \max_\pi \mathbb{E}\left\{ \sum_{t'=t+1}^{T} C\left(S_{t'}, X^\pi(S_{t'})\right) \,\middle|\, S_{t+1} \right\} \,\middle|\, S_t, x_t \right\} \right)$

Page 64:

Designing policies

Policy search:
1a) Policy function approximations (PFAs): $x_t = X^{PFA}_t(S_t|\theta)$
• Lookup tables – "when in this state, take this action"
• Parametric functions
– Order-up-to policies: if inventory is less than s, order up to S.
– Affine policies: $X^{PFA}_t(S_t|\theta) = \sum_{f \in F} \theta_f \phi_f(S_t)$
– Neural networks
• Locally/semi/non parametric
– Requires optimizing over local regions
1b) Cost function approximations (CFAs)
• Optimizing a deterministic model modified to handle uncertainty (buffer stocks, schedule slack):
$X^{CFA}_t(S_t|\theta) = \arg\max_{x_t \in X^\pi_t(\theta)} \bar{C}(S_t, x_t|\theta)$

Page 65:

Designing policies

Lookahead approximations – Approximate the impact of a decision now on the future:
» An optimal policy (based on looking ahead):

$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb{E}\left\{ \max_\pi \mathbb{E}\left\{ \sum_{t'=t+1}^{T} C\left(S_{t'}, X^\pi(S_{t'})\right) \,\middle|\, S_{t+1} \right\} \,\middle|\, S_t, x_t \right\} \right)$

2a) Approximating the value of being in a downstream state using machine learning ("value function approximations"):

$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb{E}\left\{ V_{t+1}(S_{t+1}) \mid S_t, x_t \right\} \right)$
$X^{VFA}_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb{E}\left\{ \bar{V}_{t+1}(S_{t+1}) \mid S_t, x_t \right\} \right)$
$\phantom{X^{VFA}_t(S_t)} = \arg\max_{x_t} \left( C(S_t, x_t) + \bar{V}^x_t(S^x_t) \right)$


Page 68:

Designing policies

The ultimate lookahead policy is optimal:

$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb{E}\left\{ \max_\pi \mathbb{E}\left\{ \sum_{t'=t+1}^{T} C\left(S_{t'}, X^\pi(S_{t'})\right) \,\middle|\, S_{t+1} \right\} \,\middle|\, S_t, x_t \right\} \right)$

Page 69:

Designing policies

The ultimate lookahead policy is optimal:

$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb{E}\left\{ \max_\pi \mathbb{E}\left\{ \sum_{t'=t+1}^{T} C\left(S_{t'}, X^\pi(S_{t'})\right) \,\middle|\, S_{t+1} \right\} \,\middle|\, S_t, x_t \right\} \right)$

But it contains expectations that we cannot compute, and a maximization over policies that we cannot compute.

Page 70:

Designing policies

The ultimate lookahead policy is optimal:

$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb{E}\left\{ \max_\pi \mathbb{E}\left\{ \sum_{t'=t+1}^{T} C\left(S_{t'}, X^\pi(S_{t'})\right) \,\middle|\, S_{t+1} \right\} \,\middle|\, S_t, x_t \right\} \right)$

» 2b) Instead, we have to solve an approximation called the lookahead model:

$X^{LA}_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \tilde{\mathbb{E}}\left\{ \max_{\tilde{\pi}} \tilde{\mathbb{E}}\left\{ \sum_{t'=t+1}^{t+H} C\left(\tilde{S}_{tt'}, \tilde{X}^{\tilde{\pi}}(\tilde{S}_{tt'})\right) \,\middle|\, \tilde{S}_{t,t+1} \right\} \,\middle|\, S_t, x_t \right\} \right)$

» A lookahead policy works by approximating the lookahead model.

Page 71:

Designing policies

Types of lookahead approximations
» One-step lookahead – Widely used in pure learning policies:
• Bayes greedy/naïve Bayes
• Expected improvement
• Value of information (knowledge gradient)
» Multi-step lookahead
• Deterministic lookahead, also known as model predictive control or a rolling horizon procedure
• Stochastic lookahead:
– Two-stage (widely used in stochastic linear programming)
– Multistage
» Monte Carlo tree search (MCTS) for discrete action spaces
» Multistage scenario trees (stochastic linear programming) – typically not tractable.

Page 72:

Four (meta)classes of policies

Policy search:
1) Policy function approximations (PFAs)
» Lookup tables, rules, parametric/nonparametric functions
2) Cost function approximations (CFAs)
» $X^{CFA}_t(S_t|\theta) = \arg\max_{x_t \in X^\pi_t(\theta)} \bar{C}(S_t, x_t|\theta)$

Lookahead approximations:
3) Policies based on value function approximations (VFAs)
» $X^{VFA}_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \bar{V}^x_t(S^x_t) \right)$
4) Direct lookahead policies (DLAs)
» Deterministic lookahead/rolling horizon proc./model predictive control:
$X^{LA-D}_t(S_t) = \arg\max_{x_{tt}, \ldots, x_{t,t+H}} \sum_{t'=t}^{t+H} C(\tilde{S}_{tt'}, \tilde{x}_{tt'})$
» Chance constrained programming: $P[A_t x_t \ge f(W)] \ge 1 - \delta$
» Stochastic lookahead/stochastic prog./Monte Carlo tree search:
$X^{LA-S}_t(S_t) = \arg\max_{x_{tt}, \tilde{x}_{t,t+1}, \ldots, \tilde{x}_{t,t+H}} \left( C(S_t, x_{tt}) + \sum_{\omega} p(\omega) \sum_{t'=t+1}^{t+H} C\left(\tilde{S}_{tt'}(\omega), \tilde{x}_{tt'}(\omega)\right) \right)$
» "Robust optimization":
$X^{LA-RO}_t(S_t) = \arg\max_{x_{tt}, \ldots, x_{t,t+H}} \min_{w \in W_t(\theta)} \sum_{t'=t}^{t+H} C\left(\tilde{S}_{tt'}(w), \tilde{x}_{tt'}(w)\right)$

Page 73:

Four (meta)classes of policies (the same list as the previous slide, annotated to highlight which classes rest on function approximation).

Page 74:

Four (meta)classes of policies (the same list again, annotated to highlight which classes contain an imbedded optimization problem).

Page 75:

Optimal policies

Online vs. offline learning

Slide 75

Page 76:

Optimal policies

Earlier we expressed an optimal policy using Bellman's equation:
» Recall that we used a graph.
» Bellman equation (state = node)

Page 77:

Optimal policies

Bellman's equation for a learning problem:
» The state is now a belief state.
» Bellman equation (state = belief)
» We can only solve this in very special cases (and not with normally distributed beliefs): with 5 alternatives, each described by a mean and a precision, the belief state is a 10-dimensional continuous state.

Page 78:

Optimal policies

Online (cumulative reward) vs. offline (final reward)
» Bellman's equation for online (cumulative reward) learning
» Bellman's equation for offline (final reward) learning
» If we only care about the final reward, we do not care how well we do along the way.

Page 79:

Policy function approximation

© 2019 Warren Powell Slide 79

Page 80:

Policy function approximation

Examples:

» Optimizing buying-selling energy from the grid

» Finding the best selling price for my book on Amazon.

© 2019 Warren Powell

Page 81:

Policy function approximations

Battery arbitrage – When to charge, when to discharge, given volatile LMPs (locational marginal prices).

Page 82:

Policy function approximations

[Figure: hourly electricity prices (LMPs) over several days, with the charge and discharge thresholds drawn as horizontal lines.]

Grid operators require that batteries bid charge and discharge prices, an hour in advance. We have to search for the best values for the policy parameters $\theta^{charge}$ and $\theta^{discharge}$.

Page 83:

Policy function approximations

Our policy function might be the parametric model (this is nonlinear in the parameters):

$X^\pi(S_t|\theta) = \begin{cases} +1 & \text{if } p_t \le \theta^{charge} \\ 0 & \text{if } \theta^{charge} < p_t < \theta^{discharge} \\ -1 & \text{if } p_t \ge \theta^{discharge} \end{cases}$

where the state $S_t$ captures the energy in storage and the price of electricity $p_t$.

Page 84:

Policy function approximations

Finding the best policy
» We need to maximize

$\max_\theta F(\theta) = \mathbb{E} \sum_{t=0}^{T} C\left(S_t, X^\pi(S_t|\theta)\right)$

» We cannot compute the expectation, so we run simulations over $\theta = (\theta^{charge}, \theta^{discharge})$.
[Figure: simulated performance as a function of the charge and discharge parameters.]
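Policy search for this PFA is straightforward to simulate. The sketch below is illustrative only: the mean-reverting price model, the battery capacity, the unit charge/discharge rate, and the grid of candidate parameters are all assumptions, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_policy(theta_charge, theta_discharge, T=1000):
    """Simulate one sample path of the charge/discharge policy above.

    Toy dynamics: mean-reverting price around $30, unit charge/discharge
    rate, battery capacity of 10 units. Returns the cumulative reward.
    """
    price, energy, reward = 30.0, 5.0, 0.0
    for _ in range(T):
        price = 30.0 + 0.8 * (price - 30.0) + rng.normal(0, 5)
        if price <= theta_charge and energy < 10:      # x_t = +1 (charge/buy)
            energy += 1; reward -= price
        elif price >= theta_discharge and energy > 0:  # x_t = -1 (discharge/sell)
            energy -= 1; reward += price
    return reward

# Policy search: evaluate a grid of (theta_charge, theta_discharge) pairs.
best = max(((simulate_policy(lo, hi), lo, hi)
            for lo in range(10, 35, 5)
            for hi in range(lo + 5, 60, 5)), key=lambda z: z[0])
print(best)
```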

Page 85:

Policy function approximation

Optimal pricing

Page 86:

Policy function approximation

Dynamic pricing
» After n sales, we have an estimate of the demand function.
» From this we can compute the revenue.
» The optimal price given our estimates would maximize this estimated revenue.

Page 87:

Revenue management

Earning vs. learning

» You earn the most with prices near the middle.

» You learn the most with extreme prices.

» Challenge is to strike a balance.

Page 88:

Policy function approximation

Designing a policy:
» Just doing what appears to be best will encourage prices too close to the middle.
» We can try an excitation policy by adding noise to the apparent best price:
$p^n = \arg\max_p R(p|\bar{\theta}^n) + \varepsilon^{n+1}$
where $\varepsilon^{n+1} \sim N(0, \rho^2)$, and $\rho$ is a tunable parameter.
» After charging price $p^n$, we observe revenue
$\hat{R}^{n+1} = p^n \hat{D}^{n+1}$,
where $\hat{D}^{n+1}$ is the observed demand, which includes the noise when observing demand.

© 2019 Warren Powell

Page 89:

Policy function approximation

Designing a policy:
» The noise level $\rho$ is a tunable parameter that needs to solve:
$\max_\rho \mathbb{E}\left\{ \sum_{n=0}^{N-1} \hat{R}^{n+1} \,\middle|\, S^0 \right\}$
» The best value of $\rho$ should strike a balance between doing a better job of exploration, without overdoing it.

© 2019 Warren Powell

Page 90:

Policy function approximation

Other examples of PFAs:
» A basic linear decision rule:
$X^\pi(S_t|\theta) = \sum_{f \in F} \theta_f \phi_f(S_t)$
» The features $\phi_f(S_t)$ have to be designed by hand.

© 2019 Warren Powell

Page 91:

Cost function approximation

© 2019 Warren Powell Slide 91

Materials science example

Page 92:

Materials science application

Heuristic measurement policies
» Boltzmann exploration
• Explore choice x with probability
$P^n(x) = \frac{e^{\theta \bar{\mu}^n_x}}{\sum_{x'} e^{\theta \bar{\mu}^n_{x'}}}$
» Upper confidence bounding
$X^{UCB}(S^n|\theta^{UCB}) = \arg\max_x \left( \bar{\mu}^n_x + \theta^{UCB} \sqrt{\frac{\log n}{N^n_x}} \right)$
» Thompson sampling
$X^{TS}(S^n) = \arg\max_x \hat{\mu}^n_x, \quad \text{where } \hat{\mu}^n_x \sim N(\bar{\mu}^n_x, \bar{\sigma}^{2,n}_x)$
» Interval estimation (a form of upper confidence bounding)
• Choose the x which maximizes
$X^{IE}(S^n|\theta^{IE}) = \arg\max_x \left( \bar{\mu}^n_x + \theta^{IE} \bar{\sigma}^n_x \right)$
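Each of these rules is a one-line computation over the belief state. A minimal sketch in Python; the array-based belief representation and the default parameter values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# mu = posterior means, sigma = posterior std devs, counts = N^n_x.

def boltzmann(mu, sigma, counts, n, theta=1.0):
    p = np.exp(theta * mu)
    p /= p.sum()                       # explore x with probability P^n(x)
    return int(rng.choice(len(mu), p=p))

def ucb(mu, sigma, counts, n, theta=1.0):
    bonus = theta * np.sqrt(np.log(n + 1) / (counts + 1e-9))
    return int(np.argmax(mu + bonus))  # untested arms get a huge bonus first

def thompson(mu, sigma, counts, n):
    return int(np.argmax(rng.normal(mu, sigma)))   # sample a truth, act on it

def interval_estimation(mu, sigma, counts, n, theta=1.645):
    return int(np.argmax(mu + theta * sigma))
```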

Page 93:

Materials science application

Notes:
» Each one of these policies has an imbedded "max" (or "min") operator.
» But each one still has a tunable parameter – this is the distinguishing feature of "policy search" policies.
» For these policies, the max involves a simple search over a set of discrete choices, so this is not too difficult.
» Another example is solving a shortest path into New York. The tunable parameter might be how much time you add to the trip to deal with traffic.
» The max/min could also be a large optimization problem, such as what is solved to plan energy generation for tomorrow.

© 2019 Warren Powell

Page 94:

Materials science application

Lookup table
» We can organize potential catalysts into groups.
» Scientists using domain knowledge can estimate correlations in experiments between similar catalysts.

Page 95:

Materials science application

Correlated beliefs: Testing one material teaches us about other materials.
[Figure: correlated belief updates across the materials.]

Page 96:

Materials science application

Heuristic measurement policies (repeated from the earlier slide): Boltzmann exploration, upper confidence bounding, Thompson sampling, interval estimation.

Page 97:

Materials science application

Notes:
» Upper confidence bounding and Thompson sampling are very popular in the computer science community.
» These have been shown to have provable regret bounds, which the CS community then uses to claim that they must be very good.
» My own experimental work has not supported this claim, but we have found that a properly tuned version of interval estimation tends to work surprisingly well.
» … The problem is the tuning.

© 2019 Warren Powell

Page 98:

Materials science application

Picking $\theta^{IE} = 0$ means we are evaluating each choice at the mean.
[Figure: beliefs for each material evaluated at the mean.]

Page 99:

Materials science application

Picking $\theta^{IE} = 1.645$ means we are evaluating each choice at the 95th percentile.
[Figure: beliefs for each material evaluated at the 95th percentile.]

Page 100:

Materials science application

Optimizing the policy
» We optimize $\theta^{IE}$ to maximize:

$\max_{\theta^{IE}} F^{IE}(\theta^{IE}) = \mathbb{E} F(x^{\pi,N}, W)$

where

$x^n = X^{IE}(S^n|\theta^{IE}) = \arg\max_x \left( \bar{\mu}^n_x + \theta^{IE} \bar{\sigma}^n_x \right)$

Notes:
» This can handle any belief model, including correlated beliefs and nonlinear belief models.
» All we require is that we be able to simulate a policy.
[Figure: regret as a function of $\theta^{IE}$.]
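Tuning $\theta^{IE}$ reduces to simulating the policy and averaging the final-reward objective. A minimal sketch; the Gaussian truths, the weak prior, and the grid of $\theta$ values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def run_ie(theta, mu_true, sigma_w=1.0, N=50):
    """Run interval estimation with parameter theta; return the final reward."""
    M = len(mu_true)
    mu, beta = np.zeros(M), np.full(M, 0.01)    # weak prior beliefs
    beta_w = 1.0 / sigma_w**2                   # observation precision
    for n in range(N):
        x = int(np.argmax(mu + theta / np.sqrt(beta)))       # IE decision
        w = mu_true[x] + rng.normal(0, sigma_w)              # observe W^{n+1}
        mu[x] = (beta[x] * mu[x] + beta_w * w) / (beta[x] + beta_w)
        beta[x] += beta_w                                    # Bayesian update
    return mu_true[int(np.argmax(mu))]          # value of the final design

mu_true = rng.normal(0, 1, 20)
for theta in [0.0, 0.5, 1.0, 2.0, 3.0]:
    F_bar = np.mean([run_ie(theta, mu_true) for _ in range(200)])
    print(theta, round(F_bar, 3))
```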

Page 101:

Materials science application

Other applications

» Airlines optimizing schedules with schedule slack to handle weather uncertainty.

» Manufacturers using buffer stocks to hedge against production delays and quality problems.

» Grid operators scheduling extra generation capacity in case of outages.

» Adding time to a trip planned by Google maps to account for uncertain congestion.

Page 102:

Materials science application

Notes:
» CFA policies are exceptionally popular in internet applications: choosing ads, news articles, products, … that maximize ad-clicks.
» The ease of computation of the policies is very attractive.
» A representative from Google once noted: "We can use any policy that can be computed in under 50 milliseconds."

© 2019 Warren Powell

Page 103:

Cost function approximation

© 2019 Warren Powell Slide 103

Fleet management example

Page 104:

Schneider National

© 2018 Warren B. Powell

Page 105:

© 2018 Warren B. Powell Slide 105

Page 106:

Fleet management application

[Figure: a network of drivers to be matched to demands (loads).]

© 2018 Warren B. Powell

Page 107:

Fleet management application

[Figure: assignments at times t, t+1, t+2.]

The assignment of drivers to loads evolves over time, with new loads being called in, along with updates to the status of a driver.

© 2018 Warren B. Powell

Page 108:

Fleet management application

A purely myopic policy would solve this problem using

$\min_x \sum_d \sum_l c_{tdl} x_{tdl}$

where

$x_{tdl} = \begin{cases} 1 & \text{if we assign driver } d \text{ to load } l \\ 0 & \text{otherwise} \end{cases}$
$c_{tdl}$ = cost of assigning driver $d$ to load $l$ at time $t$

What if a load is not assigned to any driver, and has been delayed for a while? This model ignores the fact that we eventually have to assign someone to the load.

© 2018 Warren B. Powell

Page 109:

Fleet management application

We can minimize delayed loads by solving a modified objective function:

$\min_x \sum_d \sum_l \left( c_{tdl} - \theta \tau_{tl} \right) x_{tdl}$

where
$\tau_{tl}$ = how long load $l$ has been delayed by time $t$
$\theta$ = bonus for moving a delayed load

We refer to our modified objective function as a cost function approximation.

© 2018 Warren B. Powell

Page 110:

Fleet management application

We now have to tune our policy, which we define as:

$X^\pi(S_t|\theta) = \arg\min_x \sum_d \sum_l \left( c_{tdl} - \theta \tau_{tl} \right) x_{tdl}$

We can now optimize $\theta$, another form of policy search, by solving

$\min_\theta F(\theta) = \mathbb{E} \sum_{t=0}^{T} C\left(S_t, X^\pi(S_t|\theta)\right)$

© 2018 Warren B. Powell
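As a concrete sketch, the CFA policy at a single time t is just an assignment problem with modified costs, which scipy's Hungarian-algorithm solver can stand in for. The cost matrix and delays below are hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cfa_assign(c, delay, theta):
    """CFA assignment: subtract a delay bonus from the costs, then solve
    the driver-to-load assignment problem.

    c     : (drivers x loads) cost matrix c_{tdl}
    delay : per-load vector of how long each load has waited (tau_tl)
    theta : bonus per unit of delay (the tunable parameter)
    """
    c_mod = c - theta * delay[None, :]      # c_tdl - theta * tau_tl
    rows, cols = linear_sum_assignment(c_mod)
    return list(zip(rows, cols))

c = np.array([[4.0, 2.0, 8.0],
              [6.0, 4.0, 3.0],
              [5.0, 7.0, 6.0]])
delay = np.array([0.0, 5.0, 1.0])           # load 1 has waited the longest
print(cfa_assign(c, delay, theta=1.0))
```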

Page 111:

Cost function approximation

© 2019 Warren Powell Slide 111

Logistics example

Page 112:

Logistics application

Inventory management
» How much product should I order to anticipate future demands?
» Need to accommodate different sources of uncertainty:
• Market behavior
• Transit times
• Supplier uncertainty
• Product quality

© 2018 Warren B. Powell

Page 113:

Logistics application

Imagine that we want to purchase parts from different suppliers. Let $x_{tp}$ be the amount of product we purchase at time t from supplier p to meet forecasted demand $D_t$. We would solve

$X^\pi_t(S_t) = \arg\max_x \sum_{p \in P} c_p x_{tp}$

subject to

$\sum_{p \in P} x_{tp} = D_t$
$x_{tp} \le u_p$
$x_{tp} \ge 0$

» This assumes our demand forecast $D_t$ is accurate.

© 2018 Warren B. Powell

Page 114:

Logistics application

Imagine that we want to purchase parts from different suppliers. Let $x_{tp}$ be the amount of product we purchase at time t from supplier p to meet forecasted demand $D_t$. We would solve

$X^\pi_t(S_t|\theta) = \arg\max_x \sum_{p \in P} c_p x_{tp}$

subject to

$\sum_{p \in P} x_{tp} = D_t + \theta^{buffer}$ (buffer stock)
$x_{tp} \le u_p \theta^{reserve}$ (reserve)
$x_{tp} \ge 0$

» This is a "parametric cost function approximation."

© 2018 Warren B. Powell
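A minimal sketch of this parametric CFA using scipy's linear programming solver. The costs, capacities, demand, and the exact placement of $\theta^{buffer}$ and $\theta^{reserve}$ in the constraints are illustrative assumptions (the code minimizes purchase cost):

```python
import numpy as np
from scipy.optimize import linprog

def cfa_purchase(c, u, D, theta_buffer, theta_reserve):
    """Parametric CFA for the supplier-purchase problem (toy sketch).

    Minimize c @ x subject to the theta-modified constraints:
      sum_p x_p = D + theta_buffer       (buffer stock)
      0 <= x_p <= theta_reserve * u_p    (reserve on each supplier)
    """
    res = linprog(c,
                  A_eq=np.ones((1, len(c))), b_eq=[D + theta_buffer],
                  bounds=[(0, theta_reserve * up) for up in u])
    return res.x

c = np.array([4.0, 5.0, 6.5])      # hypothetical unit costs per supplier
u = np.array([50.0, 40.0, 80.0])   # nominal supplier capacities
print(cfa_purchase(c, u, D=100.0, theta_buffer=15.0, theta_reserve=0.9))
```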

Page 115:

Logistics application

An even more general CFA model:
» Define our policy:

$X^\pi_t(S_t|\theta) = \arg\max_x \bar{C}(S_t, x|\theta)$ (parametrically modified costs)

subject to

$A x \le b(\theta)$ (parametrically modified constraints)

» We tune $\theta$ by optimizing:

$\min_\theta F(\theta) = \mathbb{E} \sum_{t=0}^{T} C\left(S_t, X^\pi_t(S_t|\theta)\right)$

© 2018 Warren B. Powell
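A minimal sketch of this tuning loop, under a made-up single-product model (the forecast, demand distribution, prices, and grid of buffer levels $\theta$ are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_policy(theta, T=200, price=5.0, cost=3.0):
    """One sample path of the buffer-stock CFA: order forecast + theta,
    observe demand, accumulate profit over T periods."""
    profit = 0.0
    for t in range(T):
        forecast = 10.0
        x = forecast + theta                 # D_t-hat plus the buffer parameter
        demand = rng.poisson(forecast)       # hypothetical demand model
        profit += price * min(x, demand) - cost * x
    return profit

# Policy search over a finite set of buffer levels
thetas = np.linspace(0.0, 5.0, 11)
F_bar = [np.mean([simulate_policy(th) for _ in range(50)]) for th in thetas]
best = thetas[int(np.argmax(F_bar))]
print(best, max(F_bar))
```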

Page 116: Cost function approximation (Energy storage example)

© 2019 Warren Powell Slide 116

Page 117: Parametric cost function approximation

Notes:
» In this set of slides, we are going to illustrate the use of a parametrically modified lookahead policy.
» This is designed to handle a nonstationary problem with rolling forecasts. This is just what you do when you plan your path with Google Maps.
» The lookahead policy is modified with a set of parameters that factor the forecasts. These parameters have to be tuned using the basic methods of policy search.

© 2018 W.B. Powell

Page 118: Parametric cost function approximation

An energy storage problem:

Page 119: Parametric cost function approximation

Forecasts evolve over time as new information arrives:

[Figure: rolling forecasts, updated each hour, plotted against the actual; one curve shows the forecast made at midnight.]

Page 120: Parametric cost function approximation

Benchmark policy – Deterministic lookahead. At time $t$ we optimize the lookahead decisions $\tilde{x}_{tt'}$ (superscripts: $w$ = wind, $g$ = grid, $r$ = storage, $d$ = demand) over the horizon, subject to, for each $t'$:

$$\tilde{x}^{wd}_{tt'} + \tilde{x}^{rd}_{tt'} + \tilde{x}^{gd}_{tt'} \le f^{D}_{tt'} \quad \text{(forecasted demand)}$$
$$\tilde{x}^{gd}_{tt'} + \tilde{x}^{gr}_{tt'} \le f^{G}_{tt'} \quad \text{(grid limit)}$$
$$\tilde{x}^{rd}_{tt'} + \tilde{x}^{rg}_{tt'} \le \tilde{R}_{tt'} \quad \text{(energy in storage)}$$
$$\tilde{x}^{wr}_{tt'} + \tilde{x}^{gr}_{tt'} \le R^{\max} - \tilde{R}_{tt'} \quad \text{(remaining capacity)}$$
$$\tilde{x}^{wr}_{tt'} + \tilde{x}^{wd}_{tt'} \le f^{E}_{tt'} \quad \text{(forecasted wind energy)}$$
$$\tilde{x}^{wr}_{tt'} + \tilde{x}^{gr}_{tt'} \le \gamma^{charge} \quad \text{(max charge rate)}$$
$$\tilde{x}^{rd}_{tt'} + \tilde{x}^{rg}_{tt'} \le \gamma^{discharge} \quad \text{(max discharge rate)}$$

Page 121: Lookahead policies

Lookahead policies peek into the future
» Optimize over a deterministic lookahead model.

[Figure, repeated as an animation on pages 121-124: the real process runs over times t, t+1, t+2, t+3, ...; at each time a lookahead model is built and optimized over the horizon, the first decision is implemented, and the process rolls forward.]

© 2018 W.B. Powell

Page 125: Parametric cost function approximation

Benchmark policy – Deterministic lookahead (the same model shown on page 120, repeated here as the starting point for the parametric modification).

Page 126: Parametric cost function approximation

Parametric cost function approximations
» Replace the constraint

$$\tilde{x}^{wr}_{tt'} + \tilde{x}^{wd}_{tt'} \le f^{E}_{tt'}$$

with lookup-table modified forecasts (one adjustment term for each time $\tau = t' - t$ in the future):

$$\tilde{x}^{wr}_{tt'} + \tilde{x}^{wd}_{tt'} \le \theta_{t'-t} \, f^{E}_{tt'}$$

» We can simulate the performance of a parameterized policy:

$$F(\theta, \omega) = \sum_{t=0}^{T} C\big(S_t(\omega), X_t^{\pi}(S_t(\omega) \mid \theta)\big)$$

» The challenge is to optimize the parameters:

$$\min_\theta \mathbb{E} \sum_{t=0}^{T} C\big(S_t, X_t^{\pi}(S_t \mid \theta)\big)$$
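A small sketch of the modified constraint bound (the forecast values and adjustment terms are hypothetical):

```python
import numpy as np

def adjusted_wind_bounds(f_E, theta):
    """Upper bounds theta_{t'-t} * f^E_{t,t'} on wind usage in the lookahead:
    f_E[tau] is the forecast tau periods ahead, theta[tau] its adjustment."""
    return theta * f_E    # elementwise, one multiplier per lead time

# Hypothetical 4-period lookahead: discount distant, less reliable forecasts
f_E = np.array([30.0, 28.0, 25.0, 20.0])
theta = np.array([1.0, 0.9, 0.8, 0.7])
print(adjusted_wind_bounds(f_E, theta))   # [30.  25.2 20.  14. ]
```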

Page 127: Parametric cost function approximation

[Figure: one-dimensional plots of the objective as a function of $\theta_i$, for i = 1, ..., 8 hours into the future, with a perfect forecast; the axis runs from 0.0 to 1.2, and $\theta_i^* = 1$ for perfect forecasts.]

Page 128: Parametric cost function approximation

[Figure: one-dimensional plots of the objective as a function of $\theta_i$, for i = 9, ..., 15 hours into the future, with an uncertain forecast; the axis runs from 0.0 to 2.4.]

Page 129: Parametric cost function approximation

[Figure: two-dimensional contour plot of the objective over $(\theta_{10}, \theta_1)$; both axes run from 0.0 to 2.0.]

Page 130: Parametric cost function approximation

[Figure: two-dimensional contour plot of the objective over $(\theta_2, \theta_1)$.]

Page 131: Parametric cost function approximation

[Figure: two-dimensional contour plot of the objective over $(\theta_3, \theta_2)$.]

Page 132: Parametric cost function approximation

[Figure: two-dimensional contour plot of the objective over $(\theta_5, \theta_3)$.]

Page 133: Numerical derivatives

Simultaneous perturbation stochastic approximation
» Let:
• $x$ be a $p$-dimensional vector.
• $\delta$ be a scalar perturbation.
• $Z$ be a $p$-dimensional vector, with each element drawn from a normal (0,1) distribution.
» We can obtain a sampled estimate of the gradient $\nabla_x F(x, W)$ using two function evaluations, $F(x^n + \delta^n Z^n)$ and $F(x^n - \delta^n Z^n)$:

$$\nabla_x F(x^n, W^{n+1}) \approx \begin{bmatrix} \dfrac{F(x^n + \delta^n Z^n) - F(x^n - \delta^n Z^n)}{2 \delta^n Z^n_1} \\[2ex] \dfrac{F(x^n + \delta^n Z^n) - F(x^n - \delta^n Z^n)}{2 \delta^n Z^n_2} \\ \vdots \\ \dfrac{F(x^n + \delta^n Z^n) - F(x^n - \delta^n Z^n)}{2 \delta^n Z^n_p} \end{bmatrix}$$
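A minimal sketch of the estimator and a stochastic gradient loop (the objective F, perturbation size, and step sizes are hypothetical; note that Spall's original SPSA uses ±1 perturbations, which avoids dividing by small components of Z):

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_gradient(F, x, delta):
    """SPSA: estimate all p partial derivatives of F at x from just two
    (noisy) function evaluations along one random direction Z."""
    Z = rng.standard_normal(len(x))
    diff = F(x + delta * Z) - F(x - delta * Z)
    return diff / (2.0 * delta * Z)    # i-th entry: diff / (2 delta Z_i)

# Hypothetical noisy objective with maximum at (1, 2, 3)
def F(x):
    return -np.sum((x - np.array([1.0, 2.0, 3.0])) ** 2) + rng.normal(0, 0.01)

x = np.zeros(3)
for n in range(1, 500):
    g = spsa_gradient(F, x, delta=0.1)
    x = x + (0.5 / n) * g              # stochastic gradient ascent step
```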

Page 134: Numerical derivatives

Finite differences
» We use finite differences in MOLTE-DB:
• We wish to optimize the decision of when to charge or discharge a battery:

$$X^{\pi}(S_t \mid \theta) = \begin{cases} 1 & \text{if } p_t \le \theta^{charge} \\ 0 & \text{if } \theta^{charge} < p_t < \theta^{discharge} \\ -1 & \text{if } p_t \ge \theta^{discharge} \end{cases}$$

© 2019 Warren Powell
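A minimal sketch of this threshold policy and a finite-difference gradient for tuning it (the price model, storage dynamics, and starting $\theta$ are hypothetical stand-ins, not the MOLTE-DB implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def battery_policy(p, theta):
    """Charge at low prices, discharge at high prices, otherwise hold."""
    theta_charge, theta_discharge = theta
    if p <= theta_charge:
        return 1
    if p >= theta_discharge:
        return -1
    return 0

def simulate(theta, T=2000, R_max=10):
    """One sample path of trading profit under the threshold policy."""
    R, profit = 0, 0.0
    for _ in range(T):
        p = rng.lognormal(mean=3.0, sigma=0.4)   # stand-in price process
        x = battery_policy(p, theta)
        if x == 1 and R < R_max:
            R += 1; profit -= p                  # buy energy
        elif x == -1 and R > 0:
            R -= 1; profit += p                  # sell energy
    return profit

# Finite-difference gradient of F(theta), one coordinate at a time
theta, h = np.array([15.0, 30.0]), 1.0
grad = np.array([(simulate(theta + h * e) - simulate(theta - h * e)) / (2 * h)
                 for e in np.eye(2)])
```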

Page 135: Numerical derivatives

Finding the best policy
» We need to maximize

$$\max_\theta F(\theta) = \mathbb{E} \sum_{t=0}^{T} C\big(S_t, X^{\pi}(S_t \mid \theta)\big)$$

» We cannot compute the expectation, so we run simulations.

[Figure: simulated price path with the charge and discharge thresholds marked.]

© 2019 Warren Powell Slide 135

Page 136: Cost function approximation (Stochastic shortest path)

© 2019 Warren Powell Slide 136

Page 137: Stochastic shortest path problem

A stochastic network, with costs revealed as we arrive at a node:

[Figure: an 11-node network; sampled costs on the links out of the current node (3.6, 8.1, 13.5, 8.9, 12.7, 12.6, 8.4, 9.2, 15.9) are revealed on arrival.]

© 2018 W.B. Powell

Page 138: Stochastic shortest path problem

Modeling:
» State variable (iteration $n$, time $t$)
• $R_t^n = $ current node $i$ during pass $n$
• $I_t^n = (\bar{c}_{tij}^n) = $ estimates of the costs at time $t$ during pass $n$
• $S_t^n = (R_t^n, I_t^n)$
» Decisions
• $x_{t,i,j}^n = 1$ if we traverse $(i,j)$ at time $t$.
• We want a policy $X^{\pi}(S_t^n) = (x_{t,i,j}^n)$ for all links $(i,j)$.
» Costs
• $\hat{c}_{tij}^n = $ actual realization of the cost at time $t$ to traverse $(i,j)$

© 2018 W.B. Powell

Page 139: Stochastic shortest path problem

Modeling:
» Objective function
• Cost per period (costs incurred at time $t$):

$$C\big(S_t, X^{\pi}(S_t)\big) = \sum_{i,j} \hat{c}_{tij} \, x_{tij}^{\pi}$$

• Total costs:

$$\min_\pi \mathbb{E} \sum_{t=0}^{T} C\big(S_t, X^{\pi}(S_t)\big)$$

» This is the base model.

© 2018 W.B. Powell

Page 140: Stochastic shortest path problem

A policy based on a lookahead model
» At each time $t$ we are going to optimize over an estimate of the network that we are going to call the lookahead model.
» Notation: all variables in the lookahead model have tildes, and two time indices.
• The first time index, $t$, is the time at which we are making a decision. This determines the information content of all parameters (e.g. costs) and decisions.
• The second time index, $t'$, is the time within the lookahead model.
» Decisions
• $\tilde{x}_{tt'ij} = 1$ if we plan on traversing link $(i,j)$ in the lookahead model.
• $\tilde{c}_{tt'ij} = $ estimated cost, made at time $t$, of traversing link $(i,j)$ in the lookahead model.

© 2018 W.B. Powell

Page 141: Stochastic shortest path problem

© 2018 W.B. Powell

Page 142: Stochastic shortest path problem

© 2018 W.B. Powell

Page 143: Stochastic shortest path problem

A static, deterministic network

[Figure: the 11-node network with a single deterministic cost on every link (values such as 12.6, 8.4, 9.2, 3.6, 8.1, 17.4, 15.9, 16.5, 20.2, 13.5, 8.9, 12.7, 2.3, 4.5, 7.3, 9.6, 5.7).]

© 2018 W.B. Powell

Page 144: Stochastic shortest path problem

A time-dependent, deterministic network

[Figure: the network unrolled over times t, t+1, t+2, t+3, with a copy of each node at each time and time-dependent link costs.]

© 2018 W.B. Powell

Page 145: Stochastic shortest path problem

A time-dependent, deterministic lookahead network

[Figure, repeated with updates on pages 145-147: the base model runs over times t, t+1, t+2, t+3, ...; at each time t a lookahead model over t' = t, t+1, t+2, t+3, t+4 is built, solved as a deterministic network, and rolled forward.]

© 2018 W.B. Powell

Page 148: Stochastic shortest path problem

Imagine that the lookahead is just a black box:
» Solve the optimization problem

$$X_t^{\pi}(S_t^n) = \arg\min_{\tilde{x}} \sum_{i \in N} \sum_{j \in N_i^+} \tilde{c}_{tij} \tilde{x}_{tij}$$

» subject to

$$\sum_j \tilde{x}_{t,i_t^n,j} = 1 \quad \text{(flow out of current node, where we are located)}$$
$$\sum_i \tilde{x}_{t,i,r} = 1 \quad \text{(flow into destination node } r \text{)}$$
$$\sum_i \tilde{x}_{t,i,j} - \sum_k \tilde{x}_{t,j,k} = 0 \quad \text{(for all other nodes)}$$

» This is a deterministic shortest path problem that we could solve using Bellman's equation, but for now we will just view it as a black box optimization problem.

© 2018 W.B. Powell
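For concreteness, a minimal sketch of that black box (assuming costs stored as a dict of dicts; Dijkstra stands in here for any deterministic shortest-path solver, and we assume the destination is reachable):

```python
import heapq

def lookahead_path(c, origin, dest):
    """Solve the deterministic lookahead (a shortest path over the cost
    estimates c[i][j]) and return the first link out of the current node."""
    dist, prev = {origin: 0.0}, {}
    pq = [(0.0, origin)]
    while pq:
        d, i = heapq.heappop(pq)
        if i == dest:
            break
        if d > dist.get(i, float("inf")):
            continue
        for j, cij in c[i].items():
            if d + cij < dist.get(j, float("inf")):
                dist[j] = d + cij
                prev[j] = i
                heapq.heappush(pq, (d + cij, j))
    node = dest                 # walk back to find the move out of origin
    while prev[node] != origin:
        node = prev[node]
    return node
```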

Page 149: Stochastic shortest path problem

Simulating a lookahead policy
» We would like to compute

$$F^{\pi} = \mathbb{E} \sum_{t=0}^{T} \sum_{i,j} \hat{c}_{tij} X_{tij}^{\pi}(S_t)$$

but this is intractable.
» Let $\omega$ be a sample realization of the costs:

$$\hat{c}_{t,t'}(\omega), \hat{c}_{t+1,t'}(\omega), \hat{c}_{t+2,t'}(\omega), \ldots$$

» Now simulate the policy:

$$\hat{F}^{\pi}(\omega^n) = \sum_{t=0}^{T} \sum_{i,j} \hat{c}_{tij}(\omega^n) X_{tij}^{\pi}\big(S_t(\omega^n)\big)$$

» Finally, get the average performance:

$$\bar{F}^{\pi} = \frac{1}{N} \sum_{n=1}^{N} \hat{F}^{\pi}(\omega^n)$$

Talk through how this works.

© 2018 W.B. Powell
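A sketch of this simulation loop, reusing the lookahead_path function above over a small hypothetical network (the mean costs and noise model are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical layered network: mean costs on each link (node 5 is the destination)
MEANS = {0: {1: 4.0, 2: 6.0}, 1: {3: 5.0, 4: 3.0}, 2: {3: 2.0, 4: 7.0},
         3: {5: 4.0}, 4: {5: 6.0}, 5: {}}

def simulate_policy(N=200, dest=5):
    """Average realized cost of the lookahead policy over N sample paths.
    Each step: sample cost estimates, re-solve the lookahead, move one
    link, and pay the realized cost of that link."""
    totals = []
    for _ in range(N):
        node, total = 0, 0.0
        while node != dest:
            c_hat = {i: {j: max(0.1, rng.normal(m, 0.3 * m))
                         for j, m in nbrs.items()} for i, nbrs in MEANS.items()}
            nxt = lookahead_path(c_hat, node, dest)   # from the sketch above
            total += c_hat[node][nxt]
            node = nxt
        totals.append(total)
    return float(np.mean(totals))

print(simulate_policy())
```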

Page 150: Stochastic shortest path problem

Notes:
» The deterministic lookahead is still a policy for a stochastic problem.
» Can we make it better?

Idea:
» Instead of using the expected cost, what about using a percentile?
» Use the pdf of $\hat{c}_{tij}$ to find a percentile (e.g. $\theta = 0.8$). Let

$$\tilde{c}_{tij}(\theta) = \text{the } \theta\text{-percentile of } \hat{c}_{tij}$$

» Which means $\mathbb{P}\big[\hat{c}_{tij} \le \tilde{c}_{tij}(\theta)\big] = \theta$.

© 2018 W.B. Powell
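A one-function sketch of the percentile costs, assuming we keep an array of past cost observations for each link:

```python
import numpy as np

def percentile_costs(samples, theta):
    """c~_{tij}(theta): the theta-percentile of the observed costs of each
    link, where samples[i][j] is an array of observations for link (i, j)."""
    return {i: {j: float(np.percentile(obs, 100.0 * theta))
                for j, obs in nbrs.items()}
            for i, nbrs in samples.items()}

# Hypothetical observations for two links out of node 0
samples = {0: {1: np.array([3.1, 4.0, 5.2, 8.9]), 2: np.array([6.0, 6.2, 6.1])}}
print(percentile_costs(samples, theta=0.8))   # pessimistic (inflated) costs
```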

Page 151: Stochastic shortest path problem

The percentile policy.
» Solve the linear program (shortest path problem):

$$X_t^{\pi}(S_t^n \mid \theta) = \arg\min_{\tilde{x}} \sum_{i \in N} \sum_{j \in N_i^+} \tilde{c}_{tij}(\theta) \, \tilde{x}_{tij} \quad (\tilde{x}_{tij} = 1 \text{ if the decision is to take } (i,j))$$

» subject to

$$\sum_j \tilde{x}_{t,i_t^n,j} = 1 \quad \text{(flow out of current node, where we are located)}$$
$$\sum_i \tilde{x}_{t,i,r} = 1 \quad \text{(flow into destination node } r \text{)}$$
$$\sum_i \tilde{x}_{t,i,j} - \sum_k \tilde{x}_{t,j,k} = 0 \quad \text{(for all other nodes)}$$

» This is a deterministic shortest path problem that we could solve using Bellman's equation, but for now we will just view it as a black box optimization problem.

© 2018 W.B. Powell

Page 152: Stochastic shortest path problem

Simulating a lookahead policy
» Let $\omega$ be a sample realization of the costs:

$$\hat{c}_{t,t'}(\omega), \hat{c}_{t+1,t'}(\omega), \hat{c}_{t+2,t'}(\omega), \ldots$$

» Now simulate the policy:

$$\hat{F}^{\pi}(\theta, \omega^n) = \sum_{t=0}^{T} \sum_{i,j} \hat{c}_{tij}(\omega^n) X_{tij}^{\pi}\big(S_t(\omega^n) \mid \theta\big)$$

» Finally, get the average performance:

$$\bar{F}^{\pi}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \hat{F}^{\pi}(\theta, \omega^n)$$

© 2018 W.B. Powell

Page 153: Stochastic shortest path problem

Policy tuning
» Cost vs. lateness (risk)

© 2018 W.B. Powell

Page 154: Notes on cost function approximations

© 2019 Warren Powell Slide 154

Page 155: Cost function approximations

Notes:
» CFAs are the dirty secret of real-world stochastic optimization. They are easy to understand, easy to solve, and make it possible to incorporate problem structure.
» The first challenge is to identify an effective parameterization. This requires having insights into the problem.
» The second challenge is to optimize over the parameters.
» There are close parallels between policy search and machine learning.

© 2019 Warren Powell

Page 156: Cost function approximations

Tuning the parameters
» Let $X^{\pi}(S_t \mid \theta)$ be our tunable policy.
» The simulated performance of the policy is given by:
• Final reward: let $x^{\pi,N}(\theta)$ be the solution produced by following policy $\pi$ parameterized by $\theta$. The performance of the policy is given by

$$F^{\pi}(\theta, W) = F\big(x^{\pi,N}(\theta), \hat{W}\big)$$

• Cumulative reward:

$$F^{\pi}(\theta, W) = \sum_{t=0}^{T} C\big(S_t, X^{\pi}(S_t \mid \theta)\big)$$

© 2019 Warren Powell

Page 157: Cost function approximations

Tuning the parameters
» We can write any parameter tuning problem as

$$\max_\theta \mathbb{E}\, F(\theta, W)$$

» This is a basic stochastic learning problem. We can approach it using:
• Derivative-based stochastic search – we will probably need to use numerical derivatives.
• Derivative-free stochastic search – we might represent the set of possible values as $\theta \in \{\theta_1, \ldots, \theta_M\}$ and then search over this finite set.

© 2019 Warren Powell
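A minimal sketch of the derivative-free variant: estimate $\mathbb{E}\, F(\theta, W)$ on a finite grid by averaging simulations, then pick the best (the simulator here is a hypothetical stand-in):

```python
import numpy as np

rng = np.random.default_rng(3)

def tune_over_finite_set(simulate, thetas, n_reps=50):
    """Derivative-free policy search: estimate E F(theta, W) for each
    candidate theta by sample averages, then return the best candidate."""
    F_bar = np.array([np.mean([simulate(th) for _ in range(n_reps)])
                      for th in thetas])
    return thetas[int(np.argmax(F_bar))], F_bar

# Hypothetical simulator: noisy concave response with maximum near theta = 2
simulate = lambda th: -(th - 2.0) ** 2 + rng.normal(0, 0.5)
best, F_bar = tune_over_finite_set(simulate, np.linspace(0, 4, 9))
print(best)
```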

Page 158: Learning and tuning policies

Slide 158

Page 159: Learning policies

Simulating a policy
» Regardless of our belief model (frequentist or Bayesian), a learning process looks like

$$(S^0, x^0, W^1, S^1, x^1, W^2, \ldots, S^{N-1}, x^{N-1}, W^N, S^N, x^N)$$

» Our observations might come from

$$W_x^{n+1} = \mu_x + \varepsilon^{n+1}$$

or perhaps

$$\hat{F}^{n+1} = F(x^n, W^{n+1})$$

There are different ways to interpret our "W" variable.
» Imagine that we have three policies $\pi_1$, $\pi_2$, and $\pi_3$. Let $\Pi = (\pi_1, \pi_2, \pi_3)$.
» In offline learning, we only care about our final answer, which we call:

$$x^{\pi,N} = \arg\max_x \bar{\mu}_x^N$$

Page 160: Learning policies

Learning the best policy
» Imagine that we have three policies $\pi_1$, $\pi_2$, and $\pi_3$. Let $\Pi = (\pi_1, \pi_2, \pi_3)$.
» In offline (final reward) learning, we will optimize the final reward

$$\max_{\pi \in \Pi} \mathbb{E}_\mu \, \mathbb{E}_{W^1, \ldots, W^N \mid \mu} \, \mu_{x^{\pi,N}}$$

or

$$\max_{\pi \in \Pi} \mathbb{E}_\mu \, \mathbb{E}_{W^1, \ldots, W^N \mid \mu} \, \mathbb{E}_{\hat{W} \mid \mu} \, F(x^{\pi,N}, \hat{W})$$

» In online (cumulative reward) learning, we optimize the cumulative rewards:

$$\max_{\pi \in \Pi} \mathbb{E}_\mu \, \mathbb{E}_{W^1, \ldots, W^N \mid \mu} \sum_{n=0}^{N-1} F\big(X^{\pi}(S^n \mid \theta), W^{n+1}\big)$$
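A small sketch that makes the offline/online distinction concrete for a toy problem with three alternatives (the two learning policies and the truth $\mu$ are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

def run_learning(policy, mu, N=50, noise=1.0):
    """One learning episode: the policy picks x^n from current beliefs and
    observes W^{n+1} = mu_x + eps. Returns (final reward, cumulative reward)."""
    K = len(mu)
    counts, means = np.zeros(K), np.zeros(K)
    cumulative = 0.0
    for n in range(N):
        x = policy(means, counts, n)
        W = mu[x] + rng.normal(0, noise)
        cumulative += W                       # online (cumulative) criterion
        counts[x] += 1
        means[x] += (W - means[x]) / counts[x]
    final = mu[int(np.argmax(means))]         # offline: true value of x^{pi,N}
    return final, cumulative

# Two simple learning policies to compare
explore = lambda means, counts, n: int(rng.integers(len(means)))
greedy  = lambda means, counts, n: (n % len(means) if n < len(means)
                                    else int(np.argmax(means)))

mu = np.array([1.0, 1.5, 2.0])
print(np.mean([run_learning(explore, mu) for _ in range(200)], axis=0))
print(np.mean([run_learning(greedy, mu) for _ in range(200)], axis=0))
```

Pure exploration tends to score well on the final reward, while the greedy policy does better on the cumulative reward, which is exactly the offline/online distinction above.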

Page 161: Learning policies

Finding a policy
» Imagine that we have three classes of policies $\pi_1$, $\pi_2$, and $\pi_3$. Let $\Pi = (\pi_1, \pi_2, \pi_3)$.
» For each class of policy, we might have a set of parameters $\theta \in \Theta^{\pi}$ (for example) that have to be tuned.
» In offline learning, we only care about our final answer, which we call:

$$x^{\pi,N} = \arg\max_x \bar{\mu}_x^N$$

Note that the policy $\pi$ is implicit in the estimate $\bar{\mu}_x^N$. We make it explicit when we write the final design $x^{\pi,N}$.
» Now we wish to solve the optimization problem that finds the best policy (best class, and best within the class).

Page 162: Learning policies

Notes
» Buried in the estimate of $\bar{\mu}_x^N$ is:
• The truth $\mu$
• The set of observations $W^1, W^2, \ldots, W^N$
» ... where the choices $x^0, x^1, \ldots, x^{N-1}$ are determined by our learning policy $x^n = X^{\pi}(S^n)$. This is the reason we label

$$x^{\pi,N} = \arg\max_x \bar{\mu}_x^N$$

with the policy $\pi$.

Page 163: Learning policies

Evaluating a policy
» The final design $x^{\pi,N}$ is a random variable since it depends on:
• The true $\mu$ (in our Bayesian model, we treat this as a random variable)
• The observations $W^1, W^2, \ldots, W^N$
» We can evaluate a policy $X^{\pi}(S)$ using

$$F^{\pi} = \mathbb{E}\, \mu_{x^{\pi,N}} = \mathbb{E}\, \bar{\mu}_{x^{\pi,N}}^N \quad \text{(these are equivalent; more on this later)}$$

» Finally, we write our optimization problem as

$$\max_\pi \mathbb{E}\, \mu_{x^{\pi,N}}$$

» It is useful to express what the expectations are over, so we write

$$\max_\pi \mathbb{E}_\mu \, \mathbb{E}_{W^1, W^2, \ldots, W^N \mid \mu} \, \mu_{x^{\pi,N}}$$

Page 164: Learning policies

Simulating a policy
» We cannot compute expectations, so we have to learn how to simulate:

$$\max_\pi \mathbb{E}_\mu \, \mathbb{E}_{W^1, W^2, \ldots, W^N \mid \mu} \, \mu_{x^{\pi,N}}$$

» Let $\mu(\psi)$ be a sampled truth, and let $W^1(\omega), \ldots, W^N(\omega)$ be a sample path of function observations.
» Assume we have truths $\psi^\ell$, $\ell = 1, \ldots, L$, and sample paths $\omega^1, \ldots, \omega^K$.
» Let $\bar{\mu}^{\pi,N}(\psi^\ell, \omega^k)$ be the estimate of $\mu$ when the truth is $\psi^\ell$ and the sample path of realizations is $\omega^k$, while following experimental policy $\pi$.
» Now evaluate the policy using

$$\bar{F}^{\pi} = \frac{1}{L} \sum_{\ell=1}^{L} \frac{1}{K} \sum_{k=1}^{K} \max_x \bar{\mu}_x^{\pi,N}(\psi^\ell, \omega^k)$$
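A sketch of this nested simulation (the Gaussian prior over truths, the number of alternatives, and the pure-exploration learning policy are all hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(5)

explore = lambda means, counts, n: int(rng.integers(len(means)))

def evaluate_policy(policy, L=20, K=20, N=30, n_alt=5):
    """F-bar(pi): average over L sampled truths (psi) and K observation
    paths (omega) of max_x mu-bar^{pi,N}, the estimated value of the
    final design produced by the learning policy."""
    total = 0.0
    for _ in range(L):
        mu = rng.normal(0.0, 1.0, size=n_alt)   # sampled truth psi^l
        for _ in range(K):                       # sample path omega^k
            counts, means = np.zeros(n_alt), np.zeros(n_alt)
            for n in range(N):
                x = policy(means, counts, n)
                W = mu[x] + rng.normal(0.0, 1.0)
                counts[x] += 1
                means[x] += (W - means[x]) / counts[x]
            total += means.max()                 # max_x mu-bar^{pi,N}
    return total / (L * K)

print(evaluate_policy(explore))
```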

Page 165: "Contextual" learning

Slide 165

Page 166: Contextual learning

What if we are given information before making a decision?
» We see the attributes of a patient before prescribing treatment.
» We are given a weather report before having to decide how many newspapers to put in our kiosk.
» Let's call this an "environmental state" variable (this is not standard, but there is not a standard name for this).

Example:
» Newsvendor problem with a dynamic state variable (the price):

$$\max_{x \ge 0} C(S_t, x, W) = p_t \min(x, W) - c x$$

» Further assume that $p_t$ is independent of any previous prices, and that it is revealed before we make our decision $x$.

The price $p_t$ is known as a "context." This would be called a "contextual learning problem" or a "contextual bandit problem."

© 2019 Warren Powell
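A sketch of contextual learning for this newsvendor: rather than one best $x$, we estimate a lookup table $x(p)$ over a few price contexts (the demand model, price grid, and order grid are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)

def newsvendor_profit(x, p, W, c=3.0):
    """C(S, x, W) = p * min(x, W) - c * x."""
    return p * min(x, W) - c * x

# Context-dependent order: learn a lookup table over discretized prices
prices = np.array([4.0, 6.0, 8.0])
orders = np.arange(0, 21)
value = np.zeros((len(prices), len(orders)))

for i, p in enumerate(prices):
    for j, x in enumerate(orders):
        # Sample-average profit of order x under price context p
        value[i, j] = np.mean([newsvendor_profit(x, p, rng.poisson(10))
                               for _ in range(500)])

best_order = {p: int(orders[np.argmax(value[i])]) for i, p in enumerate(prices)}
print(best_order)   # x(p) is increasing in p: higher prices justify larger orders
```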

Page 167: Contextual learning

Why is this special?
» Without the dynamic information, we would be searching for the best single decision $x$.
» With the contextual information, we are now looking for $x(p)$, where $p$ is the revealed price. This means that instead of looking for a deterministic variable (or vector), we are now looking for a function.
» When we are looking for a decision as a function of a state variable (even part of a state variable), then this is what we call a policy.
» The learning literature makes a big deal about "contextual bandit problems." For us, there is nothing different about this than any other state-dependent problem.

© 2019 Warren Powell