Machine Learning Applied to Network Resource and Fault Management
Carolina Cuba - 226004carolinacuba23 at gmail dot com
Alexander Valle - 230254ra.vallers at gmail dot com
Instituto de Computação - UNICAMP
Schedule
Preliminary Concepts - Self Organized Networks
ML in Network Resource Management
ML in Network Fault Management
Case Study: Egypt Optical Network
“Ability of a system to spontaneously arrange its components or elements in a purposeful (non-random) manner, under appropriate conditions but without the help of an external agency.”
Ref.
“The main idea is to bring into the network intelligence and autonomous adaptability by diminishing human involvement, while enhancing network performance, in terms of network capacity, coverage and service quality.”
1. Preliminary Concepts
1.1 Self Organized Networks (SON)
Ref.
Planning
Deployment
Maintenance and Optimization
Self-Configuration
Self-Optimization
Self-Healing
Ref.[3]
● K-NN (K-Nearest Neighbors)
● CF (Collaborative Filtering)
● NN (Neural Network)
● SOM (Self Organizing Map)
● SVM (Support Vector Machine)
● AD (Anomaly Detector)
● DT (Decision Trees)
● QL (Q-Learning)
● MC (Markov Chains)
● GA (Genetic Algorithm)
● FQL (Fuzzy Q-Learning)
● HMM (Hidden Markov Model)
1.2 Performance of ML Algorithms in SONs
Ref.[3]
● Representative Datasets
Solution: Datasets and frameworks collected at https://sites.google.com/site/cnetmag/
● Speed vs. Accuracy
Solution: Use ensemble learning and hybrid techniques.
● Ground Truth
Solution: Explore the application of active learning to facilitate labeling.
● ML Techniques for Networks
Solution: Design new ML algorithms tailored for networks.
● Incremental Learning
Solution: Re-train the model with only the new data.
● Security of Machine Learning
Solution: Build robust ML models.
1.3 Challenges Using Machine Learning in Network Management
Ref.[14]
1.4 Cognitive Autonomous Networks (CANs) in 5G
Ref.[17]
1.5 Self Optimization Framework in SONs
SelfNet focuses on the management of NFV and SDN, with a specific focus on the SON paradigm.
Ref.[5]
Ref.[NetworkManagement_WhitePaper]
CogNet uses ML models derived by applying suitable ML algorithms to the network data and metrics collected from the NFVI and the control plane.
1.6 Cognitive Networks
Ref.[14]
● C-Monitor Function: refers to the cognitive monitor that performs intelligent probing.
● C-Analyze Function: is responsible for detecting or predicting changes in the network environment.
● C-Plan Function: can leverage ML to develop an intelligent automated planning (AP)
● C-Execute Function: can use ML to schedule the generated plans and determine the course of action should the execution of a plan fail.
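The four functions above form a closed control loop. The following is a minimal sketch of that loop, assuming link-utilisation samples as input; the function names, the 0.8 threshold and the action names are hypothetical, and simple rules stand in where a real system would plug in ML models.

```python
from statistics import mean

def c_monitor(samples):
    """C-Monitor: collect measurements (here they are simply passed in)."""
    return list(samples)

def c_analyze(samples, threshold=0.8):
    """C-Analyze: flag a change, here congestion when mean utilisation is high."""
    return mean(samples) > threshold

def c_plan(congested):
    """C-Plan: produce an ordered plan, preferred action first, fallback second."""
    return ["reroute_traffic", "add_capacity"] if congested else []

def c_execute(plan, can_run):
    """C-Execute: try each action in order and fall back if one cannot run."""
    for action in plan:
        if can_run(action):
            return action
    return None

def control_loop(samples, can_run):
    data = c_monitor(samples)
    return c_execute(c_plan(c_analyze(data)), can_run)

# Hypothetical scenario: rerouting is unavailable, so the fallback is used.
chosen = control_loop([0.9, 0.85, 0.95], can_run=lambda a: a != "reroute_traffic")
print(chosen)
```

In this sketch the fallback in `c_execute` mirrors the case where the execution of a plan fails and another course of action has to be chosen.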
2. ML in Network Resource Management
2.1 What is Resource Management?
Resource Management: the means to control the vital components of a network.
● CPU
● Memory
● Disk
● Switches and Routers
● Bandwidth
● Radio Channels
● Frequencies
Ref. 16
Naive Way: “Network service providers can provision a fixed amount of resources that satisfies an expected demand for a service.”
The challenge is to predict the network demand dynamically, so that provisioning is resilient to variations in service demand.
However… “it is non-trivial to predict demand, while over and under estimation can lead to both poor utilization and loss in revenue.”
Ref. 17
Ref. 18
Other Challenges…
● The underlying systems are complex and often impossible to model accurately.
● Practical instantiations have to make online decisions with noisy inputs and work well under diverse conditions.
● Some performance metrics of interest are notoriously hard to optimize
Resource Management
Admission Control
Resource Allocation
Admission Control: optimizes the utilization of resources by monitoring and managing the acceptance of service requests (fixed amount of resources).
Resource Allocation: adapts the amount of resources to a given service demand (dynamic amount of resources).
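The difference between the two approaches can be sketched in a few lines. This is an illustrative toy, assuming CPU units as the resource; the function names and the 20% headroom are hypothetical and not taken from the slides.

```python
def admit(request_cpu, used_cpu, capacity_cpu):
    """Admission control: capacity is fixed, so a request is accepted only if it fits."""
    return used_cpu + request_cpu <= capacity_cpu

def allocate(predicted_demand_cpu, headroom=0.2):
    """Resource allocation: the allocation follows the predicted demand plus headroom."""
    return predicted_demand_cpu * (1 + headroom)

accepted = admit(request_cpu=2, used_cpu=7, capacity_cpu=8)   # 7 + 2 exceeds 8
allocation = allocate(predicted_demand_cpu=10)
print(accepted, allocation)
```

The quality of `allocate` depends entirely on the demand prediction passed in, which is where the ML techniques discussed next come into play.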
Ref. 19
2.2 Resource Allocation
Ref. 20
Resource Allocation is a Resource Management approach where the challenge lies in predicting demand variability and future resource utilization.
Machine Learning techniques can be used to learn indicators that aid the decision of resource allocation.
The most suitable ML approach for this problem is Reinforcement Learning.
2.3 What is Reinforcement Learning?
Generally speaking…
Reinforcement Learning is a Machine Learning approach that allows an agent to learn to make better decisions directly from experience by interacting with the environment.
How does it do it?
The agent starts knowing nothing about the task at hand and learns by reinforcement: a reward that it receives based on how well it is doing on the task.
Ref. 21
At each time step t, the agent observes a state, takes an action, and receives a reward and the next state from the environment.
Ref. 22
Other things to consider...
Ref.
Policy is a function that defines what action to take at a given state s. Following it produces a sequence of tuples (state, action, reward) that leads to the objective.
Value Function measures the overall expected cumulative reward assuming the agent is in state s and then continues playing following some policy π.
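For reference, the value function is commonly written as the expected discounted return. The discount factor γ and the reward symbol r are standard notation assumed here; the slides do not define them.

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \;\middle|\; s_t = s \right], \qquad 0 \le \gamma < 1
```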
2.4 What about Q-Learning?
Ref.
Q-Value Function measures the overall expected cumulative reward from taking an action a in state s and then following the policy.
Each Q-value is stored in a Q-table, which holds all Q-values under a given policy.
Update the Q-table using the Bellman equation:
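A common textbook form of the Q-learning update, derived from the Bellman equation, is the following. The learning rate α and the discount factor γ are standard symbols assumed here; they are not defined in the slides.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```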
Ref. 25
What if the Environment is too Complex?
Imagine an environment with 10,000 states and 1,000 actions per state. This would create a table of 10 million cells. Things will quickly get out of control!
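The arithmetic behind this example can be checked quickly. The memory estimate below assumes one 64-bit float per Q-value, which is an illustrative assumption rather than something stated in the slides.

```python
n_states = 10_000
n_actions = 1_000

cells = n_states * n_actions                 # one Q-value per (state, action) pair
bytes_per_cell = 8                           # assumption: 64-bit float per Q-value
size_mb = cells * bytes_per_cell / 1_000_000

print(cells, size_mb)
```

The table grows with the product of states and actions, so richer state descriptions make a plain table impractical very quickly.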
2.3 Deep Reinforcement Learning
Ref.
Q-Value Function measures the overall expected cumulative reward from taking an action a in state s and then following the policy.
We use a neural network to approximate the Q-value function
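As a rough illustration of this idea, the sketch below replaces the table lookup with a forward pass through a tiny one-hidden-layer network that maps a state vector to one Q-value estimate per action. The layer sizes, the tanh activation and the random initialisation are illustrative assumptions, and training is omitted.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer, activation=None):
    """Apply one dense layer, optionally followed by an activation."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out

def q_values(state, hidden, output):
    """Forward pass: state vector -> one Q-value estimate per action."""
    return dense(dense(state, hidden, math.tanh), output)

rng = random.Random(0)
STATE_DIM, HIDDEN_DIM, N_ACTIONS = 4, 8, 3   # illustrative sizes
hidden = init_layer(STATE_DIM, HIDDEN_DIM, rng)
output = init_layer(HIDDEN_DIM, N_ACTIONS, rng)

q = q_values([0.5, -0.2, 0.1, 0.9], hidden, output)
greedy_action = max(range(N_ACTIONS), key=lambda a: q[a])
n_params = STATE_DIM * HIDDEN_DIM + HIDDEN_DIM + HIDDEN_DIM * N_ACTIONS + N_ACTIONS
print(q, greedy_action, n_params)
```

The number of parameters depends on the layer sizes rather than on the number of distinct states, which is what makes the approach usable when a table is not.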
Example - Playing Atari Game
2.3 What is Reinforcement Learning?
Ref.
● Objective: Complete the game with the highest score.
● State: Raw pixel input of the game state.
● Action: Game controls.
● Reward: Score increase or decrease.
2.4 State of the Art - Learning Algorithms for Dynamic Resource Allocation in Virtualized Networks
Ref.
● Objective: VNE (virtual network embedding) dynamically allocates resources based on the specification in the VN requests.
● Proposal:
○ The agent gets a resource usage status;
○ The agent produces an action (increase/decrease the allocated resource);
○ The virtual node/link is monitored to evaluate its performance. The evaluation is communicated to the agent as a reward;
○ The agent adjusts its policy (updating Q-values) to ensure better allocations in the future.
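The proposal above can be illustrated with a toy tabular Q-learning agent. This is a sketch under strong simplifications, not the setup of the cited work: the five allocation levels, the fixed demand level, the reward shape and all hyperparameters are hypothetical.

```python
import random

N_LEVELS = 5            # allocation levels 0..4 (hypothetical discretisation)
TARGET = 2              # level that matches demand in this toy setup
ACTIONS = (-1, +1)      # decrease / increase the allocated resource
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3

def step(level, action):
    """Apply the action and return (next_level, reward)."""
    nxt = min(max(level + action, 0), N_LEVELS - 1)
    return nxt, -abs(nxt - TARGET)   # penalise over- and under-allocation

def train(episodes=500, steps=20, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_LEVELS)]   # Q-table: one row per state
    for _ in range(episodes):
        level = rng.randrange(N_LEVELS)          # observed resource usage status
        for _ in range(steps):
            if rng.random() < EPSILON:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[level][i])  # exploit
            nxt, reward = step(level, ACTIONS[a])   # monitoring feeds back a reward
            # Q-learning update of the chosen (state, action) entry
            q[level][a] += ALPHA * (reward + GAMMA * max(q[nxt]) - q[level][a])
            level = nxt
    return q

q_table = train()
policy = [ACTIONS[max(range(len(ACTIONS)), key=lambda i: row[i])] for row in q_table]
print(policy)
```

After training, the greedy policy should increase the allocation when it is below the demand level and decrease it when it is above.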
Ref.
The packet drop rate of the static approach is in general constant (due to packet errors as well as buffer overflows), while that of the dynamic approach is initially high but gradually decreases.
This can be attributed to the fact that at the beginning of the simulation, when the agents are still learning, the virtual nodes are allocated varying buffer sizes, which leads to more packet drops.
Ref.
The dynamic approach performs better than the static one in terms of virtual network acceptance ratio. This can be attributed to the fact that in the dynamic approach the substrate network always has more available resources than in the static case.
2.4 State of the Art - Deep Reinforcement Learning for Resource Management in Network Slicing
Ref.
● Algorithm:
○ At episode t, the DQL agent observes the state s_t;
○ The agent chooses action a_t;
○ The agent observes the reward R(s_t, a_t) and a new state;
○ The agent stores the episode experience into D;
○ The agent samples a minibatch of experiences from D;
○ The agent updates the weights θ for the evaluation network by a gradient-based approach;
○ The agent clones the evaluation network Q to the target network;
○ The episode index is updated by t ← t + 1.
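The steps of the algorithm above can be sketched with experience replay and a target network. To stay dependency-free, a linear approximator over one-hot state features stands in for the deep network, so this is not a deep model; the toy allocation problem, the sync period `SYNC_EVERY` and every hyperparameter are assumptions made for illustration.

```python
import random
from collections import deque

N_LEVELS, TARGET, ACTIONS = 5, 2, (-1, +1)   # toy slice-allocation problem (assumed)
GAMMA, LR, EPSILON = 0.9, 0.1, 0.2
BATCH, CAPACITY, SYNC_EVERY = 16, 500, 25

def features(level):
    """One-hot encoding of the state."""
    return [1.0 if i == level else 0.0 for i in range(N_LEVELS)]

def q_value(theta, level, a):
    return sum(w * x for w, x in zip(theta[a], features(level)))

def step(level, a):
    nxt = min(max(level + ACTIONS[a], 0), N_LEVELS - 1)
    return nxt, -abs(nxt - TARGET)

def run(total_steps=3000, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_LEVELS for _ in ACTIONS]   # evaluation "network" weights
    target = [row[:] for row in theta]            # target "network" weights
    replay = deque(maxlen=CAPACITY)               # experience memory D
    level = rng.randrange(N_LEVELS)
    for t in range(1, total_steps + 1):
        # observe the state and choose an action (epsilon-greedy)
        if rng.random() < EPSILON:
            a = rng.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: q_value(theta, level, i))
        nxt, reward = step(level, a)              # observe reward and new state
        replay.append((level, a, reward, nxt))    # store the experience in D
        # sample a minibatch; one gradient step on the squared TD error per sample
        for s, b, r, s2 in rng.sample(list(replay), min(BATCH, len(replay))):
            y = r + GAMMA * max(q_value(target, s2, i) for i in range(len(ACTIONS)))
            err = y - q_value(theta, s, b)
            theta[b] = [w + LR * err * xi for w, xi in zip(theta[b], features(s))]
        if t % SYNC_EVERY == 0:                   # clone evaluation -> target
            target = [row[:] for row in theta]
        level = nxt
    return theta, target, replay

theta, target, replay = run()
print([max(range(len(ACTIONS)), key=lambda i: q_value(theta, s, i)) for s in range(N_LEVELS)])
```

Computing the regression target `y` from the frozen `target` weights, rather than from `theta`, is what the cloning step in the list is for: it keeps the target stable between syncs.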
2.4 Example: Resource Management for Slicing
Ref. 36
● Compared with the “no priority” solution, DQL-empowered slicing provisions flows with a smaller average waiting time (10.5% lower than “no priority”) and significantly higher CPU usage (27.9% higher than “no priority”).
● DQL could support alternative solutions to exploit the computing resources and reduce the waiting time.
The fault management process refers to the handling of the whole lifecycle of faults, which includes:
● faults
● errors
● failures
● symptoms
Ref.
3. ML in Network Fault Management
3.1 Fault Discovery and Diagnosis (FDD)
Ref.
• Hard faults: a sensor node is not capable of communicating with the rest of the network.
• Soft faults: a sensor node continues to operate and communicate with altered behavior, e.g., it produces faulty data or cannot act as a stable routing node.
A fault is an unexpected change or malfunction in a system, although it may not lead to physical failure or breakdown.
Ref.
Fault Detection: automatically detect when and where a fault occurred in the network.
Fault Classification: determine the causes of the problem, so that the correct solution can be triggered.
Automated mitigation
3.2 Fault Detection in Self Healing Framework
Ref.[3]
Ref.[14]
Fault Classification: categorizing malfunctions or failures and clustering similar faults together; whenever a fault occurs, root cause analysis (RCA) is also performed.
a) Cell Outage Detection: the network detects that the central site has suffered an outage.
b) Cell Outage Compensation: self-healing mechanisms in which neighboring cells adjust their coverage area and, in turn, compensate for the outaged cell.
Another option: send a Wi-Fi UAV (drone).
3.3 Cell Outage Detection
Ref.[3]
Profiling, detection and diagnosis are done per selected contexts
Ref.
3.4 Anomaly Detection and Diagnosis Function for Radio Access Networks (RANs)
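The mean time between failures (MTBF).
The time between failures is usually modeled by the Weibull distribution; the MTBF is its mean.
Ref.
The probability density function of a Weibull random variable is:
$$ f(x; \lambda, k) = \frac{k}{\lambda} \left( \frac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^{k}} \ \text{for } x \ge 0, \qquad f(x; \lambda, k) = 0 \ \text{for } x < 0, $$
where k > 0 is the shape parameter and λ > 0 is the scale parameter of the distribution.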
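As a small numerical illustration of the Weibull model, the sketch below evaluates the density and the mean λ·Γ(1 + 1/k), which is the MTBF when times between failures follow this distribution. The parameter values are made up for the example and are not taken from the slides.

```python
import math

def weibull_pdf(x, k, lam):
    """Weibull density with shape k > 0 and scale lam > 0."""
    if x < 0:
        return 0.0
    if x == 0 and k < 1:
        return math.inf          # the density diverges at 0 when k < 1
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))

def weibull_mean(k, lam):
    """Mean of the distribution: lam * Gamma(1 + 1/k)."""
    return lam * math.gamma(1 + 1 / k)

# With k = 1 the Weibull reduces to the exponential distribution,
# so the mean equals the scale parameter.
mtbf_hours = weibull_mean(k=1.0, lam=1000.0)
print(mtbf_hours, weibull_pdf(500.0, k=1.0, lam=1000.0))
```

A shape parameter k below 1 corresponds to a failure rate that decreases over time, and k above 1 to a failure rate that increases with age.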
3.5 Fault Management KPI
Ref.
The I-P region is the only part of the reliability curve where failures actually can be prevented
https://reliabilityweb.com/articles/entry/the-reliability-impact-within-the-p-f-curve
3.6 P-F Curve
Ref.[3]
3.7 ML Algorithms Performance in Fault Management
Ref.[46]
3.8 Fault Management in Optical Networks
Ref.[16]
Optical Spectrum Analysis
Soft-failures can degrade lightpaths’ quality of transmission and introduce errors in the optical layer
Ref.[Srinikethan et al. 2018]
90% accuracy using random forest
3.8 Fault Management in Optical Networks - An ML Approach to Detect and Localize Link Failures
The goal of the problem is to predict the fault severity of Telstra's network at a given time and location, based on the available log data.
● Each row in the main dataset (train.csv, test.csv) represents a location and a time point. They are identified by the "id" column, which is the key "id" used in other data files.
● Fault severity has 3 categories: 0,1,2 (0 meaning no fault, 1 meaning only a few, and 2 meaning many).
● Different types of features are extracted from log files and other sources: event_type.csv, log_feature.csv, resource_type.csv, severity_type.csv.
Telstra Network Disruptions
https://www.kaggle.com/c/telstra-recruiting-network
3.9 A Competition Problem: Telstra Network Disruptions
Ref. https://www.kaggle.com/c/telstra-recruiting-network
Converting categorical data to numeric data and adding a new numeric feature (mean_volumn).
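One way this preprocessing step could look is sketched below: categorical log features are one-hot encoded per "id" and a mean volume is appended as an extra numeric column. The sample rows and the assumed column layout (id, log_feature, volume) are hypothetical stand-ins for the competition files.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows in the assumed shape (id, log_feature, volume)
rows = [
    (1, "feature 68", 6),
    (1, "feature 82", 2),
    (2, "feature 68", 10),
]

def encode(rows):
    """Return the category order and, per id, a one-hot vector plus the mean volume."""
    categories = sorted({feat for _, feat, _ in rows})
    volumes = defaultdict(list)
    present = defaultdict(set)
    for row_id, feat, vol in rows:
        volumes[row_id].append(vol)
        present[row_id].add(feat)
    table = {}
    for row_id in volumes:
        one_hot = [1 if c in present[row_id] else 0 for c in categories]
        table[row_id] = one_hot + [mean(volumes[row_id])]   # last column: mean volume
    return categories, table

categories, table = encode(rows)
print(categories, table)
```

Each resulting row can then be joined to the main dataset through the shared "id" key.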
After applying Principal Component Analysis (PCA), it can be seen that the fault classes are not linearly separable, so we use all the features to test the classifiers.
Ref. https://www.kaggle.com/c/telstra-recruiting-network
Dataset | Description
NVNF (OVS, Firewall & Snort) | Predicting CPU consumption of an OVS, firewall and Snort connected to an SDN controller with respect to 86 traffic features
MAWI | MAWI Working Group Traffic Archive
WITS | Waikato Internet Traffic Storage
LBNL/ICSI | LBNL/ICSI Enterprise Tracing Project
KDD99 | Classifying intrusion & normal connection
NetCla | NetCla: The ECML-PKDD Network Classification Challenge
Ref.[8]
3.10 Datasets for ML in Network Management
4. Case Study: Egypt Optical Network
Challenges
1: There is no centralized system that can be used to monitor and control the total power consumption.
2: There is no automated process to speed up fault localization and find the root cause, in order to minimize the mean time to repair over the entire optical network.
3: There is no automated process to perform the performance monitoring task and give notice of the proactive actions needed in the optical network.
4: There are no automated tools to perform complete provisioning of all the resources in the optical network and to maximize the return on investment in these resources.
Ref. 57
1) Power consumption model with an ANN (3 layers)
2) Fault localization model with an ANN (3 layers)
3) Intelligent optical performance monitoring (IOPM) model with an ANN (3 layers)
4) Configuration model combining the Artificial Bee Colony (ABC) algorithm and an ANN (4 layers)
Ref. 58
4.1 Model with Artificial Neural Network (ANN)
● The fault location time is reduced from 40 min to 3 min.
● The effort to create one circuit is reduced by 30.87%.
● The number of complaints is reduced by 30% per year.
● The response time to complaints is decreased from 55 min to 5 min.
Ref. 59
4.1 Model with Artificial Neural Network (ANN) - Using Machine Learning
● With Machine Learning we can have intelligent asset management with network reliability and resource allocation.
● SDN and NFV facilitate the development of SONs or CANs with Resource and Fault Management.
● We have to balance between ML algorithms, using ensemble learning with hybrid models.
● Depending on the contract with consumers, we have to trade off speed and accuracy.
● In order to have a good ML analysis, we need to know the physical layer that we are dealing with.
5. Conclusions
A question: Specify the main ML algorithms applied to Self-Optimization and Self-Healing (in the context of Resource and Fault Management problems).
Reference: J. Moysen, L. Giupponi, From 4G to 5G: Self-organized network management meets machine learning, Computer Communications, Vol. 129, pp. 248-268, September 2018. https://doi.org/10.1016/j.comcom.2018.07.015
Available: https://arxiv.org/abs/1707.09300
References
Machine Learning and Self Organized Networks (SON)
1. A Review on Self-Healing and Self-Organizing Techniques for Wireless Sensor Networks S Diaz, D Mendez, R Kraemer - Journal of Circuits, Systems and …, 2019 - World Scientific
2. A self-adaptive deep learning-based system for anomaly detection in 5G networks LF Maimó, ÁLP Gómez, FJG Clemente, MG Pérez… - IEEE …, 2018 - ieeexplore.ieee.org
3. A survey of machine learning techniques applied to self-organizing cellular networks PV Klaine, MA Imran, O Onireti… - … Surveys & Tutorials, 2017 - ieeexplore.ieee.or
4. Deep Q-Learning for Self-Organizing Networks Fault Management and Radio Performance Improvement FB Mismar, BL Evans - 2018 52nd Asilomar Conference on …, 2018 - ieeexplore.ieee.or
5. From 4G to 5G: Self-organized network management meets machine learning J Moysen, L Giupponi - Computer Communications, 2018 - Elsevier
6. Self-Healing and Resilience in Future 5G Cognitive Autonomous Networks J Ali-Tolppa, S Kocsis, B Schultz… - … for a 5G Future (ITU K …, 2018 - ieeexplore.ieee.org
7. Self-organizing capabilities in 5G networks: NFV & SDN coordination in a complex use case M Pérez, GM Pérez, PG Giardina, G Bernini… - Proceedings of …, 2018 - researchgate.net
8. Guangyuan Piao (2018). "Machine & Deep Learning for Network Management: An Overview with Benchmarks". https://goo.gl/gp7gBb
9. https://github.com/Selfnet-5G
10. https://5g-ppp.eu/cognet/ https://github.com/CogNet-5GPPP
11. Applying Machine Learning Technology to Optimize the Operational Cost of the Egyptian Optical Network KH Rahouma, A Ali Conference: 16th Elsevier International Learning & Technology Conference 2019, At Effat University, Jeddah, Saudi
12. An overview on application of machine learning techniques in optical networks F Musumeci, C Rottondi, A Nag… - … Surveys & Tutorials, 2018 - ieeexplore.ieee.org
13. A comprehensive survey on machine learning for networking: evolution, applications and research opportunities R Boutaba, MA Salahuddin… - Journal of …, 2018 - jisajournal.springeropen.com
14. Machine learning for cognitive network management S Ayoubi, N Limam, MA Salahuddin… - IEEE …, 2018 - ieeexplore.ieee.org
15. Deep learning for radio resource allocation in multi-cell networks KI Ahmed, H Tabassum, E Hossain - IEEE Network, 2019 - ieeexplore.ieee.org
16. Machine learning for network automation: overview, architecture, and applications [Invited Tutorial] D Rafique, L Velasco - Journal of Optical Communications and …, 2018 - osapublishing.org
17. Towards Cognitive Autonomous Networks in 5G SS Mwanje, C Mannweiler - … : Machine Learning for a 5G Future …, 2018 - ieeexplore.ieee.org
Resource Management
18. A survey of machine learning techniques applied to software defined networking (SDN): Research issues and challenges J Xie, FR Yu, T Huang, R Xie, J Liu… - … Surveys & Tutorials, 2018 - ieeexplore.ieee.org
19. Deep learning for radio resource allocation in multi-cell networks KI Ahmed, H Tabassum, E Hossain - IEEE Network, 2019 - ieeexplore.ieee.org
20. Deep reinforcement learning for resource allocation in V2V communications H Ye, GY Li - 2018 IEEE International Conference on …, 2018 - ieeexplore.ieee.org
21. Network resource allocation system for QoE-aware delivery of media services in 5G networks A Martin, J Egaña, J Flórez, J Montalbán… - IEEE Transactions …, 2018 - ieeexplore.ieee.org
22. Learning to optimize: Training deep neural networks for wireless resource management H Sun, X Chen, Q Shi, M Hong, X Fu… - 2017 IEEE 18th …, 2017 - ieeexplore.ieee.or
23. Proactive resource management for LTE in unlicensed spectrum: A deep learning perspective
24. U Challita, L Dong, W Saad - IEEE transactions on wireless …, 2018 - ieeexplore.ieee.org
25. Reinforcement learning for resource provisioning in the vehicular cloud MA Salahuddin, A Al-Fuqaha… - IEEE Wireless …, 2016 - ieeexplore.ieee.org
26. RADAR: Self‐configuring and self‐healing in resource management for enhancing quality of cloud services SS Gill, I Chana, M Singh… - … and Computation: Practice …, 2019 - Wiley Online Library
27. Learning algorithms for dynamic resource allocation in virtualised networks R Mijumbi, JL Gorricho, J Serrat - Proceedings of Workshop on …, 2014 - maps.upc.edu
28. Reinforcement learning based methodology for energy-efficient resource allocation in cloud data centers T Thein, MM Myo, S Parvin, A Gawanmeh - Journal of King Saud University …, 2018 - Elsevier
Fault Management
29. Accurate Fault Location based on Deep Neural Evolution Network in Optical Networks for 5G and Beyond X Zhao, H Yang, H Guo, T Peng… - Optical Fiber …, 2019 - osapublishing.org
30. A Fault Prediction Algorithm Based on Rough Sets and Back Propagation Neural Network for Vehicular Networks R Geng, X Wang, N Ye, J Liu - IEEE Access, 2018 - ieeexplore.ieee.org
31. A survey on fault diagnosis in wireless sensor networks Z Zhang, A Mehmood, L Shu, Z Huo, Y Zhang… - IEEE …, 2018 - ieeexplore.ieee.org
32. A survey on fault management in software-defined networks PC da Rocha Fonseca, ES Mota - … Communications Surveys & …, 2017 - ieeexplore.ieee.org
33. Cognitive assurance architecture for optical network fault management D Rafique, T Szyrkowiec, H Grießer… - Journal of Lightwave …, 2017 - ieeexplore.ieee.org
34. Deep Q-Learning for Self-Organizing Networks Fault Management and Radio Performance Improvement FB Mismar, BL Evans - 2018 52nd Asilomar Conference on …, 2018 - ieeexplore.ieee.org
35. Fault Management Based on Machine Learning L Velasco, D Rafique - Optical Fiber Communication Conference, 2019 - osapublishing.org
36. Fault management in software-defined networking: A survey Y Yu, X Li, X Leng, L Song, K Bu… - … Surveys & Tutorials, 2018 - ieeexplore.ieee.org
37. Network Performance and Fault Analytics for LTE Wireless Service Providers D Kakadia, J Yang, A Gilgur - Network Performance and Fault Analytics for …, 2017 - Springer
38. Localized Fault Tolerant Algorithm Based on Node Movement Freedom Degree in Flying Ad Hoc Networks Q Guo, J Yan, W Xu - Symmetry, 2019 - mdpi.com
39. Localized Fault Tolerant and Connectivity Restoration Algorithms in Mobile Wireless Ad Hoc Network X Song, L Zhou, H Zhao, X Hu, J Wei - IEEE Access, 2018 - ieeexplore.ieee.org
40. Machine Learning Algorithms and Fault Detection for Improved Belief Function Based Decision Fusion in Wireless Sensor Networks A Javaid, N Javaid, Z Wadud, T Saba, OE Sheta… - Sensors, 2019 - mdpi.com
41. Fault and performance management in multi-cloud based NFV using shallow and deep predictive structures L Gupta, M Samaka, R Jain, A Erbad… - 2017 26th …, 2017 - ieeexplore.ieee.org
42. A survey on fault diagnosis in wireless sensor networks Z Zhang, A Mehmood, L Shu, Z Huo, Y Zhang… - IEEE …, 2018 - ieeexplore.ieee.org
43. TE-Based Machine Learning Techniques for Link Fault Localization in Complex Networks
44. SM Srinivasan, T Truong-Huu… - 2018 IEEE 6th
45. Super Base Station Fault Detection Mechanism Based on Negative Selection Algorithm and Expert Knowledge Base G Ye, Y Wang, Q Sun - IOP Conference Series: Materials Science …, 2019 - iopscience.iop.org
46. An Optical Communication's Perspective on Machine Learning and Its Applications FN Khan, Q Fan, C Lu, APT Lau - Journal of Lightwave …, 2019 - osapublishing.org
47. https://www.kaggle.com/c/telstra-recruiting-network