sherlock – diagnosing problems in the enterprise srikanth kandula victor bahl, ranveer chandra,...

41
Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Upload: natalie-davis

Post on 26-Mar-2015

230 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Sherlock – Diagnosing Problems in the Enterprise

Srikanth Kandula

Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Page 2: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Enterprise Management: Between a Rock and a Hard Place

Manageability• Stick with tried software,

never change infrastructure• Cheap• Upgrades are hard, forget

about innovation!

Usability• Keep pace with technology• Expensive

– IT staff in 1000s– 72% of MS IT budget is

staff• Reliability Issues

– Cost of down-time

Page 3: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Well-Managed Enterprises Still Unreliable

10% Troubled

85% Normal

Fraction Of Requests

0.7% Down

.1

.02

.04

.06

.08

10 100 1000 10000

Response time of a Web server (ms)

0

10% responses take up to 10x longer than normal

How do we manage evolving enterprise networks?How do we manage evolving enterprise networks?

Page 4: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Current Tools Miss the Forest for the Trees

• Monitor Individual Boxes or Protocols

• Flood admin with alerts• Don’t convey the end-to-end

picture

SQLBackend

Web Server

Authentication Server

DNS

Client

But, the primary goal of enterprise management is to diagnose user-perceived problems!

But, the primary goal of enterprise management is to diagnose user-perceived problems!

Page 5: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems

Sherlock

Page 6: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Challenges for the End-to-End Approach

• Don’t know what user’s performance depends on

Page 7: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

• Don’t know what user’s performance depends on– Dependencies are distributed

– Dependencies are non-deterministic

• Don’t know which dependency is causing the problem– Server CPU 70%, link dropped 10

packets, but which affected user?

SQLBackend

Web Server

Auth. Server

DNS

Client

E.g., Web Connection

Challenges for the End-to-End Approach

Page 8: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Sherlock’s Contributions

• Passively infers dependencies from logs• Builds a unified dependency graph incorporating

network, server and application dependencies• Diagnoses user problems in the enterprise • Deployed in a part of the Microsoft Enterprise

Page 9: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Sherlock’s Architecture

Page 10: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Servers

Clients

Sherlock’s Architecture

Web1 1000ms

Web2 30ms

File1 Timeout

User Observations+

=

List Troubled Components

Network Dependency Graph

Inference Engine

Sherlock works for various client-server applications Sherlock works for various client-server applications

Page 11: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Video Server

Data Store

DNS

How do you automatically learn such distributed dependencies?

Page 12: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Strawman: Instrument all applications and libraries

Sherlock exploits timing info

Time

My Client talks to B

t

My Client talks to C

If talks to B, whenever talks to C Dependent Connections

Not Practical

Page 13: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Sherlock exploits timing info

Time

t

BBB B BB

False Dependence

BC

If talks to B, whenever talks to C Dependent Connections

Strawman: Instrument all applications and libraries Not Practical

Page 14: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Sherlock exploits timing info

Time

If talks to B, whenever talks to C Dependent Connections

t

BB C

Inter-access timeDependent iff t << Inter-access time

As long as this occurs with probability higher than chance

Strawman: Instrument all applications and libraries Not Practical

Page 15: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Sherlock’s Algorithm to Infer Dependencies Infer dependent connections from timing

Video

DNS

Store

Dependency Graph

Page 16: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Bill’s Client StoreDNS

Sherlock’s Algorithm to Infer Dependencies Infer dependent connections from timing Infer topology from Traceroutes & configurations

Video Store

Video

Bill Watches Video

Bill DNS Bill Video

• Works with legacy applications

• Adapts to changing conditions

Dependency Graph

Video

DNS

Store

Page 17: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

But hard dependencies are not enough…

Page 18: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Bill’s Client StoreDNS

Video Store

Video

Bill watches Video

Bill DNS Bill Video

But hard dependencies are not enough…

Need Probabilities

p1

p3

If Bill caches server’s IP DNS down but Bill gets video

Sherlock uses the frequency with which a dependence occurs in logs as its edge probability

p2p1=10% p2=100%

Page 19: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

How do we use the dependency graph to diagnose user problems?

Page 20: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Bill’s Client StoreDNS

Video Store

Video

Bill Watches Video

Bill DNS Bill Video

Which components caused the problem?

Need to disambiguate!!

Diagnosing User Problems

Page 21: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Bill’s Client StoreDNS

Video Store

Video

Bill Watches Video

Bill DNS Bill Video

Diagnosing User Problems

Which components caused the problem?

Bill Sees Sales

Sales

Bill Sales

Paul Watches Video2

Paul Video2

Video2 Store

Video2

Use correlation to disambiguate!!• Disambiguate by correlating

– Across logs from same client– Across clients

• Prefer simpler explanations

Page 22: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Will Correlation Scale?

Page 23: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Corporate Core

Will Correlation Scale?Microsoft Internal Network• O(100,000) client desktops• O(10,000) servers• O(10,000) apps/services• O(10,000) network devices

Building Network

Campus Core

Data Center

Dependency Graph is Huge

Page 24: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Can we evaluate all combinations of component failures?

The number of fault combinations is exponential!

Impossible to compute!

Will Correlation Scale?

Page 25: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Scalable Algorithm to Correlate

But how many is few?

Evaluate enough to cover 99.9% of faults

For MS network, at most 2 concurrent faults 99.9% accurate

Only a few faults happen concurrently

Exponential PolynomialExponential Polynomial

Page 26: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

But how many is few?

Evaluate enough to cover 99.9% of faults

For MS network, at most 2 concurrent faults 99.9% accurate

Scalable Algorithm to Correlate

Only a few faults happen concurrently

Only few nodes change state

Exponential PolynomialExponential Polynomial

Page 27: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Re-evaluate only if an ancestor changes state

Reduces the cost of evaluating a case by 30x-70x

Reduces the cost of evaluating a case by 30x-70x

Exponential PolynomialExponential Polynomial

But how many is few?

Evaluate enough to cover 99.9% of faults

For MS network, at most 2 concurrent faults 99.9% accurate

Only a few faults happen concurrently

Only few nodes change state

Scalable Algorithm to Correlate

Page 28: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Results

Page 29: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Experimental Setup

• Evaluated on the Microsoft enterprise network

• Monitored 23 clients, 40 production servers for 3 weeks– Clients are at MSR Redmond– Extra host on server’s Ethernet logs packets

• Busy, operational network– Main Intranet Web site and software distribution file server– Load-balancing front-ends– Many paths to the data-center

Page 30: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

What Do Web Dependencies in the MS Enterprise Look Like?

Page 31: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Auth. Server

What Do Web Dependencies in the MS Enterprise Look Like?

Client Accesses Portal

Page 32: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Auth. Server

What Do Web Dependencies in the MS Enterprise Look Like?

Client Accesses Portal

Page 33: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Auth. Server

Sherlock discovers complex dependencies of real apps.Sherlock discovers complex dependencies of real apps.

What Do Web Dependencies in the MS Enterprise Look Like?

Client Accesses Portal Client Accesses Sales

Page 34: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

What Do File-Server Dependencies Look Like?

Client Accesses Software Distribution Server

Auth.Server

WINS DNS

Backend Server 1

Backend Server 2

Backend Server 3

Backend Server 4

ProxyFile Server

100%10% 6% 5% 2%

8%

5%

1%.3%

Sherlock works for many client-server applicationsSherlock works for many client-server applications

Page 35: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Dependency Graph: 2565 nodes; 358 components that can fail

Sherlock Identifies Causes of Poor Performance

Com

pone

nt In

dex

Time (days)87% of problems localized to 16 components87% of problems localized to 16 components

Page 36: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Sherlock Identifies Causes of Poor PerformanceInference Graph: 2565 nodes; 358 components that can fail

Corroborated the three significant faults

Com

pone

nt In

dex

Time (days)

Page 37: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

• SNMP-reported utilization on a link flagged by Sherlock• Problems coincide with spikes

Sherlock Goes Beyond Traditional Tools

Sherlock identifies the troubled link but SNMP cannot! Sherlock identifies the troubled link but SNMP cannot!

Page 38: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Comparing with Alternatives

• Dataset of known (fault, observations) pairs• Accuracy = 1 – (Prob. False Positives + Prob. False Negatives)

53%SCORE

(non-probabilistic)

Page 39: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Comparing with Alternatives

• Dataset of known (fault, observations) pairs• Accuracy = 1 – (Prob. False Positives + Prob. False Negatives)

59%

53%SCORE

(non-probabilistic)

Shrink (probabilistic)

Page 40: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Comparing with Alternatives

• Dataset of known (fault, observations) pairs• Accuracy = 1 – (Prob. False Positives + Prob. False Negatives)

Sherlock

Sherlock outperforms existing tools!Sherlock outperforms existing tools!

91%

53%

Shrink

SCORE (non-probabilistic)

Shrink (probabilistic)

59%

Page 41: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang

Conclusions

• Sherlock passively infers network-wide dependencies from logs and traceroutes

• It diagnoses faults by correlating user observations

• It works at scale!

• Experiments in Microsoft’s Network show– Finds faults missed by existing tools like SNMP– Is more accurate than prior techniques

• Steps towards a Microsoft product