UC Berkeley
Online System Problem Detection by Mining Console Logs
Wei Xu*, Ling Huang†
Armando Fox*, David Patterson*, Michael Jordan*
*UC Berkeley  †Intel Labs Berkeley
Why console logs?
• Detecting problems in large-scale Internet services often requires detailed instrumentation
• Instrumentation can be costly to insert & maintain
  • High code churn
  • Systems often combine open-source building blocks that are not all instrumented
• Can we use console logs in lieu of instrumentation?
+ Easy for developers, so nearly all software has them
– Imperfect: not originally intended for instrumentation
Problems we are looking for
• The easy case: rare messages
• Harder but useful: abnormal sequences
Example:
  NORMAL:   receiving blk_1 → received blk_1
  ABNORMAL: receiving blk_2, but no "received blk_2" ever follows
  ERROR: what is wrong with blk_2?
Overview and Contributions
• Large-scale system problem detection by mining console logs (SOSP ’09)
• Accurate online detection with small latency
[Figure: two-stage detection pipeline. Free-text logs from 200 nodes are parsed, then pass through frequent-pattern-based filtering: dominant cases exit as OK, and non-pattern events go on to PCA detection, which separates the remaining normal cases (OK) from real anomalies (ERROR).]
Constructing event traces from console logs
• Parse: message type + variables
• Group messages by identifiers (automatically discovered)
  – A group ≈ an event trace (a parse-and-group sketch follows the figure below)
[Figure: interleaved log lines (receiving/received/reading blk_1 mixed with receiving/received blk_2) are grouped by block ID into one event trace per block.]
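A minimal sketch of this parse-and-group step, assuming a hypothetical set of message templates (the paper's parser derives templates from the application's source code; the regexes, `parse_line`, and `build_traces` below are illustrative, not the authors' code):

```python
import re
from collections import defaultdict

# Hypothetical message templates: (event type, regex with an identifier
# group). The real parser derives these from the source code.
TEMPLATES = [
    ("receiving", re.compile(r"receiving (blk_\w+)")),
    ("received",  re.compile(r"received (blk_\w+)")),
    ("reading",   re.compile(r"reading (blk_\w+)")),
]

def parse_line(line):
    """Return (event type, identifier) or None for unmatched lines."""
    for event, pattern in TEMPLATES:
        m = pattern.search(line)
        if m:
            return event, m.group(1)
    return None

def build_traces(lines):
    """Group parsed events by identifier: one event trace per block."""
    traces = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed:
            event, ident = parsed
            traces[ident].append(event)
    return traces

# Two interleaved blocks yield two separate traces.
log = ["receiving blk_1", "receiving blk_2",
       "received blk_1", "received blk_2", "reading blk_1"]
print(dict(build_traces(log)))
# {'blk_1': ['receiving', 'received', 'reading'],
#  'blk_2': ['receiving', 'received']}
```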
Online detection: when to make a detection decision?
• Cannot wait for the entire trace
  • A trace can last an arbitrarily long time
• How long do we have to wait?
  • Long enough to keep the correlations
  • A wrong cut = a false positive
• Difficulties
  • No obvious session boundaries
  • Inaccurate message ordering
  • Variations in session duration
[Figure: a timeline of blk_1 messages (receiving, received, reading, reading, deleting, deleted, receiving, received) with no obvious boundary separating sessions.]
Frequent patterns help determine session boundaries
• Key insight: most messages/traces are normal, so they form strong patterns
• "Make common paths fast"
• Tolerate noise
Two stage detection overview
[Figure: the two-stage pipeline again: parsing → frequent-pattern-based filtering (dominant cases OK) → PCA detection (non-pattern normal cases OK, real anomalies ERROR).]
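How the two stages compose, as a hypothetical skeleton (the names `frequent_patterns`, `pca_is_anomalous`, and `to_vector` are ours; treating a trace as its set of event types anticipates the frequent-event-set idea on the next slide):

```python
def detect(trace_events, frequent_patterns, pca_is_anomalous, to_vector):
    """Two-stage check for one finished (or timed-out) event trace.

    frequent_patterns: set of frozensets of event types (stage 1).
    pca_is_anomalous:  callable on an event-count vector (stage 2).
    """
    if frozenset(trace_events) in frequent_patterns:
        return "OK"  # dominant case: filtered cheaply by stage 1
    # Non-pattern events fall through to the PCA detector.
    return "ERROR" if pca_is_anomalous(to_vector(trace_events)) else "OK"
```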
Stage 1 - Frequent patterns (1): Frequent event sets
[Figure: a timeline of blk_1 messages (receiving, received, reading, deleting, deleted, error, ...) being segmented.]
The mining loop:
1. Coarse cut by time
2. Find the frequent item set
3. Refine the time estimation
4. Repeat until all patterns are found
Events that match no pattern fall through to PCA detection.
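A rough sketch of steps 1 and 2 of this loop, under the assumption that sessions are first cut with a coarse inter-event timeout and event sets are then counted for support; the `gap` and `min_support` parameters are illustrative:

```python
from collections import Counter

def coarse_cut(events, gap):
    """Split one identifier's (time, event) stream wherever the
    inter-event gap exceeds `gap` seconds."""
    sessions, current, last_t = [], [], None
    for t, ev in sorted(events):
        if last_t is not None and t - last_t > gap:
            sessions.append(current)
            current = []
        current.append(ev)
        last_t = t
    if current:
        sessions.append(current)
    return sessions

def frequent_event_sets(all_sessions, min_support):
    """Count sessions by their event set; keep the dominant patterns."""
    counts = Counter(frozenset(s) for s in all_sessions)
    total = sum(counts.values())
    return {p for p, c in counts.items() if c / total >= min_support}
```

Step 3 would then re-estimate the cut from the matched sessions' durations (the duration model on the next slide) before the loop repeats.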
Stage 1 - Frequent patterns (2): Modeling session duration
• Assuming a Gaussian?
  • The 99.95th-percentile estimate is off by half
  • 45% more false alarms
• Mixture distribution instead
  • Power-law tail + histogram head
[Figure: the session-duration distribution, as a histogram (count vs. duration) and as a CCDF (Pr(X >= x) vs. duration), showing the heavy tail.]
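A sketch of the head-plus-tail duration model, assuming durations in seconds: the empirical histogram covers the head, and a power-law (Pareto) tail fitted with the Hill estimator covers the rest. The head/tail cutoff choice and parameter names here are ours, not the paper's:

```python
import numpy as np

def fit_tail_alpha(durations, x_min):
    """Hill estimator for the power-law exponent of the tail."""
    tail = np.asarray([d for d in durations if d >= x_min], dtype=float)
    return 1.0 + len(tail) / np.sum(np.log(tail / x_min))

def percentile_cutoff(durations, q=0.9995, head_quantile=0.9):
    """Timeout = q-th percentile under the head+tail mixture."""
    durations = np.sort(np.asarray(durations, dtype=float))
    x_min = np.quantile(durations, head_quantile)  # head/tail split
    p_tail = np.mean(durations >= x_min)           # mass in the tail
    if q <= 1 - p_tail:
        # Percentile falls in the head: read it off the histogram.
        return np.quantile(durations, q)
    alpha = fit_tail_alpha(durations, x_min)
    # Invert the Pareto CCDF: Pr(X >= x) = p_tail * (x/x_min)^(1-alpha)
    target = 1 - q
    return x_min * (target / p_tail) ** (1.0 / (1.0 - alpha))
```

The estimated percentile then serves as the timeout: a session still open past this cutoff is cut and sent on to detection.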
Stage 2 - Handling noise with PCA detection
• Principal Component Analysis (PCA) based detection
• More tolerant to noise (a sketch follows the figure below)
[Figure: the two-stage pipeline with stage 2 highlighted: non-pattern events from frequent-pattern filtering enter PCA detection, which labels them OK or ERROR.]
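A sketch of subspace-style PCA detection on per-trace event-count vectors: project each vector onto the residual subspace and flag large squared prediction error (SPE). The percentile threshold below is a simple stand-in for a proper Q-statistic, and the function names are ours:

```python
import numpy as np

def fit_pca_detector(X, k, spe_percentile=99.5):
    """X: (n_traces, n_event_types) count matrix; k: # principal dims."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Principal axes from the SVD of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                       # normal-subspace basis
    resid = Xc - (Xc @ P) @ P.T        # residual (abnormal) component
    spe = np.sum(resid ** 2, axis=1)   # squared prediction error
    threshold = np.percentile(spe, spe_percentile)
    return mean, P, threshold

def pca_is_anomalous(x, mean, P, threshold):
    """Flag one event-count vector whose residual energy is too large."""
    xc = x - mean
    resid = xc - P @ (P.T @ xc)
    return float(resid @ resid) > threshold
```

Only the ~14% of non-pattern events reach this stage, which keeps the PCA step cheap.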
Frequent pattern matching filters most of the normal events
[Figure: the pipeline annotated with message fractions: 100% of parsed events enter frequent-pattern filtering; 86% match a pattern and exit as OK; the remaining 14% go to PCA detection, which splits them into 13.97% normal non-pattern cases and 0.03% real anomalies.]
Evaluation setup
• Hadoop Distributed File System (HDFS)
• Experiment on Amazon's EC2 cloud
  • 203 nodes × 48 hours
  • Running standard MapReduce jobs
  • ~24 million lines of console logs
• 575,000 traces
  • ~680 distinct ones
• Manual labels from previous work
  • Normal/abnormal + why it is abnormal
  • "Eventually normal": the labels did not consider time
  • Used for evaluation only
Frequent patterns in HDFS
Frequent pattern               99.95th-pct duration   % of messages
Allocate, begin write          13 sec                 20.3%
Done write, update metadata    8 sec                  44.6%
Delete                         -                      12.5%
Serving block                  -                      3.8%
Read exception                 -                      3.2%
Verify block                   -                      1.1%
Total                                                 85.6%

(Total events: ~20 million)
• The patterns cover most messages
• Pattern durations are short
Detection latency
• Detection latency is dominated by the wait time
[Figure: detection latency broken down by event type: single-event patterns, frequent patterns (matched), frequent patterns (timed out), and non-pattern events.]
Detection accuracy
          True positives   False positives   False negatives   Precision   Recall
Online    16,916           2,748             0                 86.0%       100.0%
Offline   16,808           1,746             108               90.6%       99.3%

(Total traces = 575,319)
• Ambiguity in what counts as "abnormal"
  • Manual labels mark some traces "eventually normal"
  • >600 online false positives are sessions with very long latency
  • E.g., a write session takes >500 sec to complete (the 99.99th percentile is 20 sec)
Future work
• Distributed log stream processing
  – Handle large-scale clusters + partial failures
• Clustering alarms
• Allowing feedback from operators
• Correlating logs from multiple applications/layers
Summary
http://www.cs.berkeley.edu/~xuw/
Wei Xu <xuw@cs.berkeley.edu>

         Precision   Recall
Online   86.0%       100.0%
[Figure: the pipeline recap: 24 million lines of free-text logs from 200 nodes → parsing → frequent-pattern-based filtering (dominant cases OK) → PCA detection (non-pattern normal cases OK, real anomalies ERROR).]