
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2009

© 2008 IBM Corporation

Learning, Indexing and Diagnosing Network Faults

Ting Wang†, Mudhakar Srivatsa‡,

Dakshi Agrawal‡ and Ling Liu†

Georgia Institute of Technology†

IBM T.J. Watson Research Center‡


Complex Networks

Network as a graph

– Vertices represent network entities

– Edges represent pair-wise (local) interactions between network entities

Even simple interactions give rise to complex global network phenomena

– Fault cascading in communication networks

– Information spread (e.g., via emails) in social networks

– Infection propagation in protein interaction networks

Key challenge is to detect and understand emerging global phenomena


Network Monitoring Data

Networks generate massive monitoring data (aka events)

– Monitored data consists of local (in both space & time) observations on the network

– Monitored data is incomplete and sometimes even erroneous (e.g., imprecise, out of order with respect to both time and causality, etc.)


Examples

– Ping failure, interface down, high CPU utilization, etc. in communication networks

– Email threads (time stamp, tokenized subject, MIME type, etc.) between members in an organizational hierarchy

– Pathological symptoms in biological networks – protein interaction networks (PINs)

Key observation: monitoring data gathered from network entities are correlated through the network topology


Network Patterns

Network patterns attempt to efficiently capture spatial (topological) and temporal correlations in monitored data

Key challenges

– Understand the semantics of network patterns

– Identify domain-specific network patterns (e.g., fault diagnosis & prediction in IT systems, information spread and access control on social networks, disease propagation in protein networks, etc.)

– How to learn and represent network patterns?

– How to scalably match network patterns against an online stream of network events?

Simplified Examples (events e1, e2, e3)

iBGP server; OSPF networks N1 and N2

– e1: Update configuration, withdraw prefix announcement

– e2: N1 says N2 is not reachable

– e3: N2 says N1 is not reachable

Director D; Employees N1 and N2

– e1: Meeting with D and N1

– e2: Email from N1 to N2

– e3: N2 updates project design document

Person P; Friends N1 and N2

– e1: P updates a blog on her Facebook page

– e2: N1 sends friend request to N2

– e3: N2 views P’s updates and accepts N1’s friend request


Network Patterns

Notation and Formalism

– Event data: <nodeId, type, timestamp, monitorId>

– Network Pattern: <event types, spatial pattern, temporal pattern>

– Example pattern INTERFACE DOWN: <LINK DOWN, NEIGHBOR, TIME WINDOW>

Temporal Pattern

– E.g., Markov chains, frequent item sets

Spatial Pattern: Composition/Closures of one or more topological relationships

– Communication networks: upstream, downstream, neighbor, tunnel

– Social networks: manages, friends, team members, IM buddies

– Biological networks: catalyst, inhibitor, suppressor

[Figure: events e1, e2, e3 connected by transitions t11, t12, t13, t22, t23, t33]

Temporal Pattern: Markov Chain

Temporal Pattern: Frequent Item Sets

Spatial Pattern: Downstream (transitive closure)
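As an illustration of the notation on this slide, the following is a minimal sketch, not the authors' implementation. `Event` mirrors the <nodeId, type, timestamp, monitorId> tuple; the one-hop `feeds` map, the node names, and the helper functions are hypothetical. It shows "downstream" computed as the transitive closure of a one-hop relationship, and "upstream" as its inverse.

```python
from collections import deque
from typing import Dict, List, NamedTuple, Set


class Event(NamedTuple):
    """One monitoring record: <nodeId, type, timestamp, monitorId>."""
    node_id: str
    type: str
    timestamp: float
    monitor_id: str


def downstream(feeds: Dict[str, List[str]], start: str) -> Set[str]:
    """Transitive closure of the one-hop 'feeds' relation, starting at `start`."""
    seen: Set[str] = set()
    queue = deque(feeds.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(feeds.get(node, []))
    return seen


def upstream(feeds: Dict[str, List[str]], start: str) -> Set[str]:
    """Inverse relation: every node whose downstream set contains `start`."""
    return {node for node in feeds if start in downstream(feeds, node)}


# Hypothetical four-router topology: r1 feeds r2, which feeds r3 and r4.
feeds = {"r1": ["r2"], "r2": ["r3", "r4"], "r3": [], "r4": []}
event = Event("r2", "LINK DOWN", 1000.0, "m1")
affected = downstream(feeds, event.node_id)
```

Here `affected` contains the nodes reachable from the node that raised the event, which is the set a "downstream" spatial pattern would examine.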


Fault Diagnosis and Prediction in Communication Networks

Challenges: improve scalability & expressiveness of fault diagnosis

– Limitation of current solutions: complexity grows as the square of the network size

– Correlation rules are pair-wise: expensive to support complex fault diagnosis (e.g., predicting soft failures, router failure from VRF tunnel events, etc.)

– Current solutions lack predictive capability

Approach:

– Fault signatures encode temporal patterns (frequent item sets, Markov chains) and topological patterns that span the network (upstream, downstream, neighbors, VPN tunnels, etc.)

– Topologically index streaming monitoring data to facilitate scalable single-pass event correlation and fault diagnosis

– Results in linear complexity and therefore increased scalability

Traditional RCA Engine vs. Proposed Approach

Traditional RCA engine

– Correlation Engine (ITNM RCA) over Monitoring Data (Omnibus), Topology, and pair-wise correlation rules

– Complexity: Monitoring data × Monitoring data × Rules

Proposed approach

– Fault diagnosis over Fault Signatures (Network Patterns) and a Topological Index

– Complexity: Monitoring data × Network diameter × Signatures

Monitoring data ~ linear in network size

Network diameter ~ logarithmic in network size for power-law networks
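To make the two complexity expressions concrete, here is a rough back-of-the-envelope sketch. The constants are assumptions chosen only for illustration and are not taken from the slides: one event per node, 10 rules, 10 signatures, and a diameter of log2(network size).

```python
import math


def pairwise_cost(num_events: float, num_rules: int) -> float:
    """Traditional engine: Monitoring data x Monitoring data x Rules."""
    return num_events * num_events * num_rules


def indexed_cost(num_events: float, diameter: float, num_signatures: int) -> float:
    """Proposed approach: Monitoring data x Network diameter x Signatures."""
    return num_events * diameter * num_signatures


def estimate(network_size: int, events_per_node: float = 1.0,
             rules: int = 10, signatures: int = 10):
    """Return (pairwise, indexed) cost estimates for a given network size."""
    events = events_per_node * network_size          # assumed linear in size
    diameter = max(1.0, math.log2(network_size))     # assumed logarithmic
    return (pairwise_cost(events, rules),
            indexed_cost(events, diameter, signatures))


small = estimate(1_000)
large = estimate(10_000)
pairwise_growth = large[0] / small[0]
indexed_growth = large[1] / small[1]
```

Under these assumptions, growing the network tenfold multiplies the pair-wise estimate by 100, while the indexed estimate grows only slightly faster than tenfold because of the logarithmic diameter factor.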


Step 1: Learning Network Faults

Learn fault signatures from historical network event data

– Fault Synopsis: Fault Type → Network Pattern

– Fault Signature: Network Pattern → <Fault Type, Spatial Pattern to Localize Faulty Node>

– Fault Diagnosis: <Spatial Pattern to Localize Faulty Node, Network Topology> → Faulty Node

– Fault Prediction: Use incrementally matchable network patterns

Use indexable network patterns

– Topological relationships are invertible: neighbor⁻¹ = neighbor, downstream⁻¹ = upstream

Fault Synopsis

Fault Type        up-stream   down-stream   neighbor   …
f1                c1          c2            c3         …
f2                c2          c4            c1         …

Fault Signature

Network Pattern   up-stream   down-stream   neighbor   …
c1                -           f1, p1        f2, p2     …
c2                f1, p1      f2, p2        -          …
c3                -           -             f1, p1     …
c4                f2, p2                               …
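The relationship between the fault synopsis and the fault signature can be sketched as an inversion that uses the invertibility of topological relationships. This is a minimal illustration, not the authors' learning algorithm: the dictionary layout and the function name `invert_synopsis` are hypothetical, and the p1, p2 values from the signature table are left out because the slides do not define them.

```python
# Each topological relationship maps to its inverse.
INVERSE = {
    "upstream": "downstream",
    "downstream": "upstream",
    "neighbor": "neighbor",
}


def invert_synopsis(synopsis):
    """Turn fault -> {relation: pattern} into pattern -> {inverse relation: fault}."""
    signature = {}
    for fault, by_relation in synopsis.items():
        for relation, pattern in by_relation.items():
            signature.setdefault(pattern, {})[INVERSE[relation]] = fault
    return signature


# Rows of the Fault Synopsis table.
synopsis = {
    "f1": {"upstream": "c1", "downstream": "c2", "neighbor": "c3"},
    "f2": {"upstream": "c2", "downstream": "c4", "neighbor": "c1"},
}
signature = invert_synopsis(synopsis)
```

With this input, `signature["c1"]` maps "downstream" to f1 and "neighbor" to f2, which agrees with the c1 row of the Fault Signature table.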


Step 2: Online Matching

Fault localization using topological indices and hierarchical evidence aggregation

– Topology indexing algorithms + space-time trade-off in computing R(x) and R⁻¹(x)

  • R ∈ {upstream, downstream, neighbor, tunnel, …}

– Scalable hierarchical evidence aggregation for efficient fault diagnosis

Network Pattern   up-stream     down-stream   neighbor      VPN Tunnel
c1                Device Down   -             f1            -
c2                -             f2            -             Device Down
c3                -             -             Device Down   -

[Figure: Evidence Aggregation. Network patterns c1, c2, c3 observed at nodes n1 and n2]

[Figure: Scalable Hierarchical Evidence Aggregation. Fault types f1, f2, f3, …, fn-1, fn over a hierarchy of bf nodes]
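One way to picture fault localization by evidence aggregation is as vote counting over a precomputed topological index. The sketch below is an assumption-laden illustration rather than the method from the talk: the index contents, the node names, and the reading of an evidence as (fault type, observing node, relationship) are all hypothetical.

```python
from collections import Counter

# Hypothetical precomputed index: index[R][x] is the set R(x).
index = {
    "upstream": {"n1": {"n0"}, "n2": {"n0"}, "n0": set()},
    "neighbor": {"n1": {"n2"}, "n2": {"n1"}, "n0": set()},
}


def aggregate(evidences, index):
    """Count, per (fault, candidate node), how many evidences point at it."""
    votes = Counter()
    for fault, node, relation in evidences:
        # Each evidence implicates every node related to the observer by R.
        for candidate in index[relation].get(node, ()):
            votes[(fault, candidate)] += 1
    return votes


evidences = [
    ("Device Down", "n1", "upstream"),
    ("Device Down", "n2", "upstream"),
    ("Device Down", "n1", "neighbor"),
]
votes = aggregate(evidences, index)
best = votes.most_common(1)[0]
```

Two of the three evidences implicate n0 as the faulty upstream device, so it receives the highest count. A flat counter like this is only the simplest form; the slides describe a hierarchical variant for scalability.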


Details

OFFLINE LEARNING

Inputs: Event Datasets, Network Topology

Preparation of training data

– Interval Filter: segment event dataset into event bursts

– Support Filter: eliminate high-frequency burst sets (regular network operations) and low-frequency burst sets (noise)

– Periodicity Filter: eliminate burst sets with high periodicity (maintenance operations)

Extract temporal patterns

– Markov chains and maximum likelihood estimation

Extract topological patterns

– Set of topological relationships: SE, NE, DS, US, TN

– Principle of minimum explanation

Output: Fault Signatures
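Two of the offline learning steps, the interval filter and the Markov chain estimation, can be sketched as follows. This is a simplified illustration under stated assumptions: events are reduced to (timestamp, type) pairs, the `max_gap` threshold and sample data are hypothetical, and the support and periodicity filters are omitted.

```python
from collections import Counter, defaultdict


def interval_filter(events, max_gap):
    """Split (timestamp, type) pairs into bursts separated by gaps > max_gap."""
    bursts, current, last_ts = [], [], None
    for ts, etype in sorted(events):
        if last_ts is not None and ts - last_ts > max_gap:
            bursts.append(current)
            current = []
        current.append(etype)
        last_ts = ts
    if current:
        bursts.append(current)
    return bursts


def markov_mle(bursts):
    """Maximum likelihood transition probabilities: count(a -> b) / count(a -> *)."""
    counts = defaultdict(Counter)
    for burst in bursts:
        for a, b in zip(burst, burst[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
        for a, nxt in counts.items()
    }


# Hypothetical event log with two bursts roughly a minute apart.
events = [(0, "e1"), (1, "e2"), (2, "e3"), (60, "e1"), (61, "e2"), (63, "e2")]
bursts = interval_filter(events, max_gap=10)
chain = markov_mle(bursts)
```

Transitions are counted only within a burst, so the long gap between the two bursts does not contribute a spurious e3 to e1 transition.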

ONLINE MATCHING

Inputs: Event Stream, Fault Signatures, Network Topology

Match temporal patterns

– Min-Heap + incremental pattern matching

– Evidences: <f, v, Rv>

Indexed network topology

– Inverted Index for constant time lookup

– Space-Time tradeoffs

Scalable Evidence Aggregation

– BIRCH data structure (hierarchical aggregation)

– Optimizations: filter-and-refine (Bloom filter) + slotted aggregation (BIGTABLE)

Output: Fault Diagnosis and Prediction
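The filter-and-refine optimization named on this slide can be illustrated generically with a small Bloom filter. The sketch below is a textbook-style construction and not the authors' data structure: the class, its parameters, and the node names are hypothetical. The filter step cheaply discards candidates that are definitely absent, and the refine step checks the survivors against the exact set.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def filter_and_refine(candidates, summary, exact):
    """Filter with the Bloom filter, then refine against the exact set."""
    survivors = [c for c in candidates if summary.might_contain(c)]
    return [c for c in survivors if c in exact]


# Hypothetical set of nodes that currently have evidence attached.
exact = {"n3", "n7"}
summary = BloomFilter()
for node in exact:
    summary.add(node)

result = filter_and_refine(["n1", "n3", "n7", "n9"], summary, exact)
```

Because the refine step always consults the exact set, false positives from the filter cost extra work but never change the result.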


Fault Diagnosis & Prediction: Scalability


Result Summary:

SNMP Trap messages from a large enterprise (7 ASes, 32 IGP networks, 871 subnets, 1,268 VPN tunnels, 2,068 main nodes, 18,747 interfaces and 192,000 entities) over 14 days in 2007

Topology dataset – European backbone network (2,383 main nodes, spans 7 countries, 11 ASes and over 100,000 entities)

Network fault simulator and monitoring data generation

Linear scalability; further optimizations: prune-and-search; slotted hierarchical aggregation

Ongoing activities

Integration with IBM Tivoli Network Management suite (ITNM) for live testing and fine-tuning

Network patterns for access control on information flows over: (i) ENRON email data & organization role topology; (ii) Smallblue data & social + information network topology


Summary

Network patterns encode spatial-temporal properties of various networks

– Ability to scalably mine and match network patterns is key for understanding global network phenomena

Case study on fault diagnosis and prediction in communication networks

– Complexity of solution has to be linear in network size

– Topologically indexed databases were a key tool for addressing scalability

Explore more complex network patterns for information, social and biological networks which exhibit stronger coupling relationships

– A failed router does not cause its neighboring router to fail

– A corrupt information node can corrupt its neighbor (e.g., summary node)

– A diseased enzyme can catalyze/inhibit its neighbors


Questions?

Mudhakar Srivatsa

