TRANSCRIPT
COSC 878
Seminar on Large Scale Statistical Machine Learning
Today’s Plan
• Course Website: http://people.cs.georgetown.edu/~huiyang/cosc-878/
• Join the Google group: https://groups.google.com/forum/#!forum/cosc878
• Student Introductions
• Team-up and Presentation Scheduling
• First Talk
Reinforcement Learning: A Survey
Grace, 1/13/15
What is Reinforcement Learning
• The problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment
Solve RL Problems – Two strategies
• Search in the behavior space
  – Find one behavior that performs well in the environment
  – Genetic algorithms, genetic programming
• Statistical methods and dynamic programming
  – Estimate the utility of taking actions in states of the world
  – We focus on this strategy
Standard RL model
What we learn in RL
• The agent’s job is to find a policy \pi that maximizes some long-run measure of reinforcement.
  – A policy \pi maps states to actions
  – Reinforcement = reward
Difference between RL and Supervised Learning
• In RL, there is no presentation of input/output pairs
  – No training data
  – We only know the immediate reward
  – We do not know the best actions in the long run
• In RL, we need to evaluate the system online while learning
  – Online evaluation (knowing the online performance) is important
Difference between RL and AI/Planning
• AI algorithms are less general
  – AI algorithms require a predefined model of state transitions
  – And assume determinism
• RL assumes that the state space can be enumerated and stored in memory
Models
• The difficult part:
  – How to incorporate the future into the model
• Three models
  – Finite horizon
  – Infinite horizon
  – Average-reward
Finite Horizon
• At a given moment in time, the agent optimizes its expected reward for the next h steps
• Ignore what will happen after h steps
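In symbols (a formula added here for reference; the indexing follows the h = 4 example later, which sums rewards for steps t = 0 through h), the finite-horizon agent maximizes
E[ Σ_{t=0}^{h} r_t ]
where r_t is the reward received t steps into the future.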
Infinite Horizon
• Maximize the long-run reward
• Does not put a limit on the number of future steps
• Future rewards are discounted geometrically
• Mathematically more tractable than the finite-horizon model
• γ: discount factor (between 0 and 1)
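The corresponding criterion (a formula added for reference; it matches the discounted sums worked out on the comparison slides below) is
E[ Σ_{t=0}^{∞} γ^t r_t ], with 0 ≤ γ < 1.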
Average-reward
• Maximize the long-run average reward
• It is the limiting case of the infinite-horizon model as γ approaches 1
• Weakness:
  – Cannot tell when the large rewards will be received
  – If we prefer a large initial reward, this model gives us no way to express that
• Cures:
  – Maximize both the long-run average and the initial rewards
  – The bias-optimal model
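For reference (not on the original slide), the average-reward criterion maximizes
lim_{h→∞} E[ (1/h) Σ_{t=0}^{h} r_t ],
which is unaffected by any finite initial stretch of rewards; that is exactly the weakness noted above.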
Compare model optimality
[Figure: an MDP with three alternative paths (upper, middle, and lower); all unlabeled arrows produce a reward of 0, and a single action is available at each state.]
Compare model optimality
Finite horizon, h = 4
• Upper line: 0 + 0 + 2 + 2 + 2 = 6
• Middle: 0 + 0 + 0 + 0 + 0 = 0
• Lower: 0 + 0 + 0 + 0 + 0 = 0
Compare model optimality
Infinite horizon, γ = 0.9
• Upper line: 0*0.9^0 + 0*0.9^1 + 2*0.9^2 + 2*0.9^3 + 2*0.9^4 + … = 2*0.9^2*(1 + 0.9 + 0.9^2 + …) = 1.62 * 1/(1 - 0.9) = 16.2
• Middle: … + 10*0.9^5 + … ≈ 59
• Lower: … + 11*0.9^6 + … ≈ 58.5
Compare model optimality
Average reward
• Upper line: ≈ 2
• Middle: ≈ 10
• Lower: ≈ 11
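A minimal Python sketch (not from the slides) that reproduces the infinite-horizon numbers above; the three reward sequences are an assumption inferred from the arithmetic on these slides (the upper path pays 2 per step starting at step 2, the middle 10 per step starting at step 5, the lower 11 per step starting at step 6):

```python
# Discounted-return check for the three paths in the model-optimality example.
# The reward sequences below are assumptions inferred from the slide's numbers.

def discounted_return(rewards, gamma=0.9):
    """Approximate sum_t gamma^t * r_t over the given (truncated) reward list."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def path(delay, reward, horizon=200):
    """Reward sequence: 0 for `delay` steps, then `reward` at every later step."""
    return [0] * delay + [reward] * (horizon - delay)

for name, delay, reward in [("upper", 2, 2), ("middle", 5, 10), ("lower", 6, 11)]:
    print(name, round(discounted_return(path(delay, reward)), 1))
# Expected output (matches the slide): upper 16.2, middle 59.0, lower 58.5
```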
Parameters
• The finite-horizon and infinite-horizon models both have parameters
  – h
  – γ
• These parameters matter to the choice of optimality model
  – Choose them carefully in your application
• Average reward model’s advantage: not influenced by those parameters
MARKOV MODELS
Markov Process
• Markov Property1 (the “memoryless” property): for a system, its next state depends only on its current state.
  Pr(S_{i+1} | S_i, …, S_0) = Pr(S_{i+1} | S_i)
• Markov Process: a stochastic process with the Markov property.
  e.g., a chain of states s_0 → s_1 → … → s_i → s_{i+1} → …
1A. A. Markov, '06
Family of Markov Models
• Markov Chain
• Hidden Markov Model
• Markov Decision Process
• Partially Observable Markov Decision Process
• Multi-armed Bandit
Markov Chain
• Discrete-time Markov process
• Example: Google PageRank1

[Figure: a web graph of pages A, B, C, D, E, each annotated with its Pagerank value.]

Pagerank(S) = (1 - α)/N + α Σ_{Y ∈ Π} Pagerank(Y)/L(Y)
  – N: # of pages
  – Π: the pages linked to S
  – L(Y): # of outlinks of Y
  – (1 - α)/N: random jump factor

• The stable state distribution of such an MC is PageRank
  – State S: web page
  – Transition probability M
  – PageRank: how likely a random web surfer will land on a page
• (S, M)

1L. Page et al., '99
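A minimal Python sketch (not from the slides) of computing this stable-state distribution by power iteration; the five-page link graph and α = 0.85 are made-up assumptions, not the graph in the figure:

```python
# PageRank as the stationary distribution of a Markov chain (power iteration).
links = {                 # made-up link graph: page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C", "E"],
    "E": ["A", "D"],
}
alpha = 0.85              # probability of following a link; 1 - alpha is the random jump
N = len(links)
rank = {p: 1.0 / N for p in links}   # start from the uniform distribution

for _ in range(100):
    new_rank = {}
    for page in links:
        # pages Y linking to `page`, each contributing Pagerank(Y) / L(Y)
        incoming = sum(rank[y] / len(outs) for y, outs in links.items() if page in outs)
        new_rank[page] = (1 - alpha) / N + alpha * incoming
    if max(abs(new_rank[p] - rank[p]) for p in links) < 1e-10:
        rank = new_rank
        break
    rank = new_rank

print({p: round(r, 3) for p in rank})  # a probability distribution over pages (sums to 1)
```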
Hidden Markov Model
• A Markov chain whose states are hidden; observable symbols are emitted with some probability according to the states1.

[Figure: hidden states s0 → s1 → s2 → … with transition probabilities p0, p1, p2; each state si emits an observation oi with emission probability ei.]

  – si: hidden state
  – pi: transition probability
  – oi: observation
  – ei: observation probability (emission probability)
• (S, M, O, e)

1Leonard E. Baum et al., '66
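A minimal Python sketch (not from the slides) showing the (S, M, O, e) tuple as plain data and how it generates observations; the two weather states, the three symbols, and all probabilities are made-up illustrative assumptions:

```python
import random

# Hidden Markov Model as the tuple (S, M, O, e); all numbers are made up.
S = ["Rainy", "Sunny"]                       # hidden states
M = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},  # transition probabilities
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
O = ["walk", "shop", "clean"]                # observable symbols
e = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},  # emission probabilities
     "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def sample(dist):
    """Draw one key from a {outcome: probability} dictionary."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

# Generate a length-5 sequence: states evolve by M, observations are emitted by e.
state = random.choice(S)
for _ in range(5):
    print(state, "->", sample(e[state]))
    state = sample(M[state])
```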
Markov Decision Process
• MDP extends MC with actions and rewards1

[Figure: a trajectory s0 → s1 → s2 → s3 → …; at step i the agent takes action ai, receives reward ri, and transitions with probability pi.]

  – si: state
  – ai: action
  – ri: reward
  – pi: transition probability
• (S, M, A, R, γ)

1R. Bellman, '57
Definition of MDP
• A tuple (S, M, A, R, γ)
  – S: state space
  – M: transition matrix, M_a(s, s') = P(s' | s, a)
  – A: action space
  – R: reward function, R(s, a) = immediate reward for taking action a at state s
  – γ: discount factor, 0 < γ ≤ 1
• Policy π
  – π(s) = the action taken at state s
• Goal is to find an optimal policy π* maximizing the expected total rewards.
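A minimal Python sketch (not from the slides) of the tuple (S, M, A, R, γ) as plain data structures, for a made-up two-state, two-action MDP; the same toy numbers are reused in the later sketches:

```python
# The MDP tuple (S, M, A, R, gamma) as plain Python data; all numbers are made up.
S = ["s0", "s1"]                 # state space
A = ["go", "stay"]               # action space
M = {                            # transition matrix: M[a][s][s'] = P(s' | s, a)
    "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
    "stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
}
R = {                            # reward function: R[s][a] = immediate reward
    "s0": {"go": 1.0, "stay": 0.0},
    "s1": {"go": 0.0, "stay": 2.0},
}
gamma = 0.9                      # discount factor, 0 < gamma <= 1

# A policy maps states to actions, e.g. "go" from s0 and "stay" at s1.
pi = {"s0": "go", "s1": "stay"}
```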
Policy
Policy: π(s) = a
According to the policy, we select action a at state s.

[Figure example: π(s0) = move right and up, π(s1) = move right and up, π(s2) = move right.]

[Slide altered from Carlos Guestrin's ML lecture]
Value of Policy
Value: V(s) = expected long-term reward starting from s
Start from s0

[Figure: from s0, the policy chooses action π(s0) and the agent receives reward R(s0).]

V(s0) = E[R(s0) + γ R(s1) + γ^2 R(s2) + γ^3 R(s3) + γ^4 R(s4) + …]
Future rewards discounted by γ ∈ [0, 1)

[Slide altered from Carlos Guestrin's ML lecture]
[The next two slides repeat this definition while building up the figure: from s0 the policy leads to possible next states s1, s1', s1'' with rewards R(s1), R(s1'), R(s1''), and from each of these, via π(s1), π(s1'), π(s1''), on to s2, s2', s2'' with rewards R(s2), R(s2'), R(s2''), and so on.]
Computing the value of a policy
V(s0) = E[R(s0) + γ R(s1) + γ^2 R(s2) + γ^3 R(s3) + …]
      = R(s0) + γ E[R(s1) + γ R(s2) + γ^2 R(s3) + …]
      = R(s0) + γ Σ_{s'} M_{π(s0)}(s0, s') V(s')
  – V(.): the value function
  – s0: the current state; s': a possible next state
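A minimal Python sketch (not from the slides) of iterative policy evaluation, which repeatedly applies the recursion above, V(s) ← R(s, π(s)) + γ Σ_{s'} M_{π(s)}(s, s') V(s'), on the made-up two-state MDP from the earlier sketch:

```python
# Iterative policy evaluation on the made-up two-state MDP from earlier.
M = {"go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
     "stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}}}
R = {"s0": {"go": 1.0, "stay": 0.0}, "s1": {"go": 0.0, "stay": 2.0}}
gamma = 0.9
pi = {"s0": "go", "s1": "stay"}          # the fixed policy being evaluated

V = {"s0": 0.0, "s1": 0.0}
for _ in range(1000):
    # V(s) <- R(s, pi(s)) + gamma * sum_{s'} M_{pi(s)}(s, s') * V(s')
    V_new = {s: R[s][pi[s]] + gamma * sum(p * V[s2] for s2, p in M[pi[s]][s].items())
             for s in V}
    if max(abs(V_new[s] - V[s]) for s in V) < 1e-10:
        V = V_new
        break
    V = V_new

print({s: round(v, 2) for s, v in V.items()})   # e.g. V("s1") = 2/(1 - 0.9) = 20
```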
Optimality — Bellman Equation
The Bellman equation1 for an MDP is a recursive definition of the optimal value function V*(.):

V*(s) = max_a [ R(s, a) + γ Σ_{s'} M_a(s, s') V*(s') ]   (the state-value function)

Optimal Policy
π*(s) = argmax_a [ R(s, a) + γ Σ_{s'} M_a(s, s') V*(s') ]

1R. Bellman, '57
Optimality — Bellman Equation
The Bellman equation can be rewritten in terms of the action-value function Q(s, a):

Q(s, a) = R(s, a) + γ Σ_{s'} M_a(s, s') V*(s')

Optimal Policy
π*(s) = argmax_a Q(s, a)
Relationship between V and Q
• V*(s) = max_a Q(s, a)
• Q(s, a) = R(s, a) + γ Σ_{s'} M_a(s, s') V*(s')
MDP algorithms
• Model-based approaches (solve the Bellman equation)
  – Value Iteration
  – Policy Iteration
  – Modified Policy Iteration
  – Prioritized Sweeping
• Model-free approaches
  – Temporal Difference (TD) Learning
  – Q-Learning
• Both yield the optimal value V*(s) and the optimal policy π*(s)

[Bellman, '57; Howard, '60; Puterman and Shin, '78; Singh & Sutton, '96; Sutton & Barto, '98; Richard Sutton, '88; Watkins, '92]
[Slide altered from Carlos Guestrin's ML lecture]
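A minimal Python sketch (not from the slides) of the first algorithm in the model-based list, value iteration, which repeatedly applies the Bellman update V(s) ← max_a [R(s, a) + γ Σ_{s'} M_a(s, s') V(s')] on the same made-up two-state MDP:

```python
# Value iteration on the made-up two-state MDP used in the earlier sketches.
M = {"go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
     "stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}}}
R = {"s0": {"go": 1.0, "stay": 0.0}, "s1": {"go": 0.0, "stay": 2.0}}
gamma = 0.9
states, actions = list(R), ["go", "stay"]

def q(s, a, V):
    """Action-value: immediate reward plus discounted expected value of s'."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in M[a][s].items())

V = {s: 0.0 for s in states}
for _ in range(1000):
    # Bellman update: V(s) <- max_a q(s, a, V)
    V_new = {s: max(q(s, a, V) for a in actions) for s in states}
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-10:
        V = V_new
        break
    V = V_new

# The greedy policy with respect to V* is the optimal policy pi*.
pi_star = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
print({s: round(v, 2) for s, v in V.items()}, pi_star)
```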