TRANSCRIPT
Learning to Make Decisions Optimally for Self-Driving Networks
Song Chong
Graduate School of AI & School of EE, KAIST
August 3, 2020
AI Decision-Making Meets Network Autonomy
Knowledge-Defined Networking [SIGCOMM’17]
(Diagram: experience, simulation, generative models, etc. produce knowledge, which yields intelligence.)
Network Control
• Online sequential decision-making problems in dynamical systems
  • Large-scale systems => curse of complexity
  • System models are unknown and stochastic => curse of uncertainty
• It is mostly about resource management
  • Congestion control, wireless scheduling, bitrate adaptation in video streaming
  • Network function placement, resource management in datacenters
• Solved today mostly using meticulously designed heuristics
  • Painstakingly test and tune the heuristics for good performance
  • Repeated if the workload, environment, or metric of interest changes
Self-Driving Networks: Can networks learn to operate by their own decisions, with very little human intervention, i.e., directly from experience interacting with the environment?
Reinforcement Learning: Complex Decision Making for Dynamical Systems under Uncertainty
Goal: Learn a policy $\pi$ to generate $a_0, a_1, \ldots$ maximizing the expected return

$\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \quad \gamma \in [0,1)$
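The expected return above is just a discounted sum; for a finite trajectory it can be computed directly. A trivial sketch (the reward sequence in the usage example is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for a finite reward trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For example, `discounted_return([1, 1, 1], gamma=0.5)` gives 1 + 0.5 + 0.25 = 1.75; with $\gamma \in [0,1)$ the infinite sum stays bounded even for never-ending trajectories.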
Previous Works on RL-based Network Control
• Resource management in datacenter [HotNets’16]
• Adaptive bitrate video streaming [SIGCOMM’17-1] [ICML WKSHPS’19]
• Scheduling for data processing clusters [SIGCOMM’19]
• Network resource scheduling [TNET’19]
• DVFS of CPU/GPU in mobile systems [SenSys’20]
• Resource management in edge computing [NetworkMag’19]
• Network function embedding [INFOCOM WKSHPS’19]
• Cognitive network management [CommMag’18]
Learning & Approximating Value of Action
Markov Decision Process
• Mathematical model for sequential decision-making problems
  • Transition probability $P^a_{ss'} = \Pr(s' \mid s, a)$
  • Policy function $a = \pi(s)$
  • Reward function $r = r(s, a, s')$ with discount factor $\gamma \in [0,1)$
(Diagram: trajectory $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} s_4 \cdots$)

Return: $r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \cdots$

Optimal policy: $\pi^* \leftarrow \arg\max_{\pi} \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$
Action Value
• $Q$-function to measure the value-to-go of action $a$ for a given state $s$:

$Q(s,a) \triangleq r(s,a) + \mathbb{E}_{\pi^*}\left[\sum_{t=1}^{\infty} \gamma^t r_t\right]$, where $r(s,a) \triangleq \sum_{s'} P^a_{ss'}\, r(s,a,s')$

(Trajectory: $s_0 = s$, $a_0 = a$; thereafter actions follow $\pi^*$: $\pi^*(s_1), \pi^*(s_2), \pi^*(s_3), \ldots$)
• The optimal policy $\pi^*$ for state $s$ is then determined by

$\pi^*(s) = \arg\max_{a \in A} Q(s,a)$
Computing Action Value
• Bellman equation (sufficient and necessary condition for optimality), compactly $Q = TQ$:

$Q(s,a) = \sum_{s'} P^a_{ss'}\left[r(s,a,s') + \gamma \max_{a' \in A} Q(s',a')\right], \quad \forall s \in S, \forall a \in A$

• The fixed-point equation $Q = TQ$ always has a unique solution, and the iteration $Q_{k+1} \leftarrow TQ_k$ (value iteration method, a.k.a. dynamic programming) converges geometrically to it for any $Q_0$ [Bertsekas’17]
• However, in real-world applications solving this equation is not that simple
  • Curse of uncertainty: $P^a_{ss'}$ and $r(s,a,s')$ may not be known
  • Curse of complexity: the cardinality of the sets $S$ and $A$ is prohibitively large
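As a concrete sketch of the iteration $Q_{k+1} \leftarrow TQ_k$, here is tabular value iteration in Python; the dictionary-based MDP encoding is an illustrative assumption, not from the slides:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate Q <- TQ until the Bellman residual falls below tol.

    P[s][a] : list of (s_next, prob) pairs, i.e., P^a_{ss'}
    R[(s, a, s_next)] : reward r(s, a, s'), defaulting to 0
    """
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        # (TQ)(s,a) = sum_{s'} P^a_{ss'} [ r(s,a,s') + gamma * max_{a'} Q(s',a') ]
        TQ = {s: {a: sum(p * (R.get((s, a, sn), 0.0) +
                              gamma * max(Q[sn].values()))
                         for sn, p in P[s][a])
                  for a in P[s]}
              for s in P}
        residual = max(abs(TQ[s][a] - Q[s][a]) for s in P for a in P[s])
        Q = TQ
        if residual < tol:
            return Q
```

Because $T$ is a $\gamma$-contraction, the residual shrinks geometrically, so the loop terminates quickly for $\gamma$ well below 1.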
Learning Action Value
• The value iteration method requires that $P^a_{ss'}$ and $r(s,a,s')$ are known:

$Q_{k+1}(s,a) \leftarrow \sum_{s'} P^a_{ss'}\left[r(s,a,s') + \gamma \max_{a' \in A} Q_k(s',a')\right], \quad \forall s \in S, \forall a \in A$

• $Q$-learning: a stochastic value iteration method [Cambridge’89]
  • Monte Carlo averaging: $Q_{k+1}(s,a) \leftarrow \frac{1}{k+1} \sum_{t=0}^{k} \left[r_t + \gamma \max_{a' \in A} Q_k(s'_t, a')\right]$
  • Stochastic approximation: upon the $(k+1)$-th sample $(s_k, a_k, r_k, s'_k) = (s, a, r, s')$,

$Q_{k+1}(s,a) \leftarrow Q_k(s,a) + \alpha_k \left(r + \gamma \max_{a' \in A} Q_k(s',a') - Q_k(s,a)\right)$

where the term in parentheses is the temporal-difference (TD) error
• Converges if total asynchronism holds and $\alpha_k$ decreases as $k \to \infty$ such that $\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$, by asynchronous convergence theory [Tsitsiklis’94]
• Exploitation-exploration tradeoffs
  • $\varepsilon$-greedy policy
  • UCB (Upper Confidence Bound) policy
Q-Learning in Action
$Q(s,a) \leftarrow Q(s,a) + \alpha \left(r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a)\right)$
$\pi^*(s) = \arg\max_{a \in A} Q(s,a)$

Policy:
1. Learn the Q-table $Q$ from samples $(s, a, r, s')$
2. Choose actions from the Q-table $Q$

(Agent-environment loop: the policy sends action $a$ to the environment, which returns state $s$ and reward $r$.)

Q-table $Q$:
         a = 1   a = 2
s = 0     1.1     1.9
s = 1     3.0     4.1
s = 2     5.1     6.0
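The loop on this slide can be sketched end-to-end in a few lines of Python. The `env_step(s, a) -> (r, s_next, done)` interface and all hyperparameter values below are illustrative assumptions:

```python
import random

def q_learning(env_step, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    env_step(s, a) -> (r, s_next, done) is an assumed environment interface;
    episodes start from state 0.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: explore with probability eps, otherwise exploit Q
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            r, s2, done = env_step(s, a)
            # TD update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```

With $\varepsilon > 0$ every state-action pair keeps being sampled, which the convergence conditions on the previous slide require.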
Action-Value Approximation
• Q-learning breaks the curse of uncertainty but does not break the curse of complexity
  • Prohibitively large number of state-action pairs to learn and store values
  • E.g., AlphaGo: $10^{170}$ states [Nature’16]; Atari Breakout: $256^{28{,}224}$ states [Nature’15]; resource management in a Google datacenter: $2^{100}$ state-action pairs [HotNets’16]
• Approximate $Q(s,a)$ by a function defined in a lower-dimensional feature space and parameterized by $\theta$:

$Q(s,a) \approx \hat{Q}(\phi(s,a); \theta)$

(Pipeline: state-action pair $(s,a)$ → feature extraction mapping → feature vector $\phi(s,a)$ → neural network, e.g., convolutional → Q approximation $\hat{Q}(\phi(s,a); \theta)$.)
Deep Q-Network (DQN) [Nature’15]
• DQN = Q-learning + Deep Neural Network (DNN)
• Approximate the Q-function by a DNN with parameter $\theta$ in a feature space:

$Q(s,a) \approx \hat{Q}(\phi(s), a; \theta), \quad \forall s \in S, \forall a \in A$

(Q-network: input state $s$; one output action value $\hat{Q}(\phi(s), a_i; \theta)$ per action $a_1, \ldots, a_n$; the chosen action is $\arg\max_{a \in A} \hat{Q}(\phi(s), a; \theta)$.)
NN Approximation Breaks the Curse of Complexity
(Illustration: the Q-table $Q(s,a)$ is replaced by the Q-network $\hat{Q}(\phi(s,a); \theta)$.)
Projected Bellman Equation
• Bellman equation $Q = TQ$:

$Q(s,a) = \sum_{s'} P^a_{ss'}\left[r(s,a,s') + \gamma \max_{a' \in A} Q(s',a')\right], \quad \forall s \in S, \forall a \in A$

• Projected Bellman equation $\hat{Q} = \Pi T\hat{Q}$, where $\Pi$ projects $T\hat{Q}$ onto the function class $\{\hat{Q}(\phi(s), a; \theta) \mid \theta \in \Theta\}$:

$\theta = \arg\min_{\theta \in \Theta} \|\hat{Q} - T\hat{Q}\|_\xi^2 = \arg\min_{\theta \in \Theta} \mathbb{E}_{(s,a) \sim \xi}\left[\left(\hat{Q}(\phi(s), a; \theta) - \sum_{s'} P^a_{ss'}\left[r(s,a,s') + \gamma \max_{a' \in A} \hat{Q}(\phi(s'), a'; \theta)\right]\right)^2\right]$

where $\xi$ is the PDF of $(s,a)$ and the inner sum is the conditional expectation $\mathbb{E}_{s'|s,a}[\,\cdot\,]$.
Monte Carlo Averaging Breaks the Curse of Uncertainty
• Loss function to be minimized over $\theta$:

$L(\theta) = \mathbb{E}_{(s,a,s')}\left[\left(\hat{Q}(\phi(s), a; \theta) - \left(r(s,a,s') + \gamma \max_{a' \in A} \hat{Q}(\phi(s'), a'; \theta)\right)\right)^2\right]$

where $\hat{Q}(\phi(s), a; \theta)$ is the prediction and the parenthesized term is the target
• Monte Carlo averaging over samples $\{(s,a,r,s')\}$ with a fixed target $\theta^-$, solved by the Stochastic Gradient Descent (SGD) method:

$\theta \leftarrow \arg\min_\theta \sum_{\{(s,a,r,s')\}}\left[\left(r + \gamma \max_{a' \in A} \hat{Q}(\phi(s'), a'; \theta^-) - \hat{Q}(\phi(s), a; \theta)\right)^2\right]$

• Experience replay (replay buffer)
  • Removes sample correlations, enhances exploration, and improves sample efficiency
  • Possible because Q-learning is an off-policy learning method
• Target Q-network
  • A moving target can cause instability during learning
  • Fix the target by evaluating the target action value via a previously learned Q-network $\theta^-$
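Putting the pieces of this slide together, here is a minimal sketch of one training step: minibatch sampling from a replay buffer, a frozen target network $\theta^-$, and an SGD update. A linear $\hat{Q}$ stands in for the DNN so the sketch stays self-contained; the feature map and all hyperparameters are illustrative assumptions:

```python
import random

def q_hat(theta, feat, a):
    # linear stand-in for the DNN: Q_hat(phi(s), a; theta) = theta[a] . phi(s)
    return sum(w * f for w, f in zip(theta[a], feat))

def dqn_step(theta, theta_minus, buffer, phi, n_actions,
             batch=32, gamma=0.99, lr=0.01):
    """One pass of SGD on the DQN loss over a minibatch of (s, a, r, s', done)."""
    for s, a, r, s2, done in random.sample(buffer, min(batch, len(buffer))):
        # the target is evaluated with the frozen parameters theta_minus
        target = r if done else r + gamma * max(
            q_hat(theta_minus, phi(s2), a2) for a2 in range(n_actions))
        feat = phi(s)
        td = target - q_hat(theta, feat, a)   # TD error on this sample
        # gradient step: d/d theta[a] of (target - Q_hat)^2 is -2 * td * phi(s)
        theta[a] = [w + lr * td * f for w, f in zip(theta[a], feat)]
    return theta
```

In the full algorithm, $\theta^-$ is refreshed by copying $\theta$ every fixed number of steps, which keeps the regression target stationary between copies.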
Deep Q-Network in Action
(Diagram: the online network $\hat{Q}(\phi(s), a; \theta)$ is trained toward targets computed with the target network $\hat{Q}(\phi(s), a; \theta^-)$.)
Exploitation-Exploration Tradeoffs
• $\varepsilon$-greedy policy:

$\pi^*(s) = \begin{cases} \arg\max_{a \in A} Q(s,a) & \text{with probability } 1 - \varepsilon \\ \text{a random action} & \text{with probability } \varepsilon \end{cases}$

• UCB (Upper Confidence Bound) policy
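The UCB policy adds an exploration bonus that shrinks as an action is tried more often. A minimal sketch for a single state (the constant `c` and the counts interface are illustrative assumptions):

```python
import math

def ucb_action(q, counts, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / n(a)) ]; untried actions first.

    q[a] : current value estimate, counts[a] : times action a was tried,
    t : total number of decisions so far.
    """
    def score(a):
        if counts[a] == 0:          # infinite bonus: try every action once
            return float('inf')
        return q[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(range(len(q)), key=score)
```

Unlike $\varepsilon$-greedy, which explores uniformly at random, UCB steers exploration toward actions whose value estimates are still uncertain, which is the sample-efficiency idea behind Q⁺-UCB later in the talk.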
RL Is Not a Panacea for All Our Problems
Max-Weight Scheduling
• State-of-the-art scheduling algorithm
  • Throughput optimal
  • Myopic policy: minimizes the conditional queue drift

$\mathbb{E}\left[\sum_{n \in N}\left(q_n(t+1)^2 - q_n(t)^2\right) \,\middle|\, q(t)\right]$

  • Suffers poor delay performance
(Figure: per-user arrivals feed per-user queues; the scheduler allocates per-user capacity from an "unknown" capacity region.)
Q⁺-UCB: Beyond Max-Weight [TNET’19]
• RL-based algorithm considering return-to-go
• Guarantee throughput and delay optimality
• Guarantee max-weight algorithm performance during learning phase
• Sample-efficient exploration
(Plot: performance vs. training iteration; the RL-based algorithm stays above the max-weight baseline throughout learning and approaches the goal.)
Joint Throughput and Delay Optimality
Reward design:

$r(s_t, a_t, s_{t+1}) = -\sum_{n \in N} q_n(t+1) - \nu \sum_{n \in N}\left(q_n(t+1)^2 - q_n(t)^2\right), \quad \nu > 0$

where the two terms yield delay optimality and throughput optimality
(Plot: performance comparison; a 40.8% improvement is annotated.)
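The reward design above is straightforward to compute from consecutive queue-length vectors; a direct transcription in Python (the default value of ν is an illustrative assumption):

```python
def reward(q_t, q_t1, nu=0.5):
    """r = -sum_n q_n(t+1) - nu * sum_n (q_n(t+1)^2 - q_n(t)^2), with nu > 0.

    q_t, q_t1 : per-user queue lengths q_n(t) and q_n(t+1).
    """
    queue_term = -sum(q_t1)                                     # queue-backlog term
    drift_term = sum(b * b - a * a for a, b in zip(q_t, q_t1))  # max-weight drift
    return queue_term - nu * drift_term
```

Larger ν weights the max-weight drift more heavily relative to the queue backlog.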
Delay during Learning Phase
(Plot: Q⁺-UCB vs. max-weight delay over the initial 100 iterations.)
DVFS and Thermal Throttling for Mobile Devices
• Dynamic Voltage and Frequency Scaling (DVFS)
  • Dynamically adjust the Voltage-Frequency (VF) level of the processor to improve energy efficiency
• Thermal throttling
  • Lower the processor temperature by setting a very low VF level of the processor when overheated
  • OS-level rule-based control
• Limitations of existing techniques
  • Application-agnostic
  • No cooperation between CPU and GPU due to independent governors
  • Agnostic about CPU-GPU thermal coupling and thermal environments
  • Need predictive management
zTT: Learning-based DVFS with Zero Thermal Throttling [SenSys’20]
• Application-aware DVFS
  • Real-time prediction of resource requirements (CPU, GPU) for mobile applications
  • Maximize user QoE
  • Minimize power consumption
• Prevent overheating
  • Predict the thermal headroom
  • Perform DVFS within the thermal headroom and avoid thermal throttling
  • Adapt to changes in the thermal environment
• Purpose of learning
  • Model learning: learn the transition probabilities of system states
  • Environment learning: predict the temperature for a given CPU and GPU clock-frequency combination
  • Application learning: learn the CPU and GPU resource requirements of a given application
References
• [Bertsekas’17] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. 1, 4th Ed., Athena Scientific, 2017
• [Bertsekas’12] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. 2 – Approximate Dynamic Programming, 4th Ed., Athena Scientific, 2012
• [Cambridge’89] C. J. C. H. Watkins, “Learning from Delayed Rewards”, Ph.D. thesis, University of Cambridge, 1989
• [Tsitsiklis’94] J. N. Tsitsiklis, “Asynchronous Stochastic Approximation and Q-learning”, Machine Learning, 1994
• [Nature’15] V. Mnih et al., “Human-level Control through Deep Reinforcement Learning”, Nature, 2015
• [Nature’16] D. Silver et al., “Mastering the Game of Go with Deep Neural Networks and Tree Search”, Nature, 2016
• [HotNets’16] H. Mao et al., “Resource Management with Deep Reinforcement Learning”, ACM HotNets, 2016
• [SIGCOMM’17] A. Mestres et al., “Knowledge-Defined Networking”, ACM SIGCOMM, 2017
• [SIGCOMM’17-1] H. Mao et al., “Neural Adaptive Video Streaming with Pensieve”, ACM SIGCOMM, 2017
• [CommMag’18] S. Ayoubi et al., “Machine Learning for Cognitive Network Management”, IEEE Communications Magazine, vol. 56, no. 1, pp. 158–165, Jan. 2018
• [SIGCOMM’19] H. Mao et al., “Learning Scheduling Algorithms for Data Processing Clusters”, ACM SIGCOMM, 2019
• [ICML WKSHPS’19] H. Mao et al., “Real-world Video Adaptation with Reinforcement Learning”, ICML Workshop, 2019
• [TNET’19] J. Bae, J. Lee, and S. Chong, “Learning to Schedule Network Resources Throughput and Delay Optimally Using Q⁺-learning”, submitted to IEEE/ACM Transactions on Networking, 2019
• [NetworkMag’19] D. Zeng et al., “Resource Management at the Network Edge: A Deep Reinforcement Learning Approach”, IEEE Network, vol. 33, no. 3, pp. 26–33, May 2019
• [INFOCOM WKSHPS’19] M. Dolati et al., “Deep-ViNE: Virtual Network Embedding with Deep Reinforcement Learning”, IEEE INFOCOM Workshops, 2019
• [SenSys’20] S. Kim, K. Lee, and S. Chong, “zTT: Learning-based DVFS with Zero Thermal Throttling for Mobile MPSoCs”, submitted to ACM SenSys, 2020