
Semantic Aware Online Detection of Resource Anomalies on the Cloud

Arnamoy Bhattacharyya∗, Seyed Ali Jokar Jandaghi∗, Stelios Sotiriadis∗ and Cristiana Amza∗

∗Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada

Abstract—As cloud based platforms become more popular, it becomes an essential task for the cloud administrator to efficiently manage the costly hardware resources in the cloud environment. Prompt action should be taken whenever hardware resources are faulty, or configured and utilized in a way that causes application performance degradation, hence poor quality of service. In this paper, we propose a semantic aware technique based on neural network learning and pattern recognition in order to provide automated, real-time support for resource anomaly detection. We incorporate application semantics to narrow down the scope of the learning and detection phase, thus enabling our machine learning technique to work at a very low overhead when executed online. As our method runs “life-long” on monitored resource usage on the cloud, in case of wrong prediction, we can leverage administrator feedback to improve prediction on future runs. This feedback directed scheme with the attached context helps us to achieve an anomaly detection accuracy of as high as 98.3% in our experimental evaluation, and can be easily used in conjunction with other anomaly detection techniques for the cloud.

I. INTRODUCTION

Stateful online distributed systems have become central to many modern business operations. More and more people trust cloud services offered by well-known providers, such as Amazon, or large social network sites, such as Facebook, with running critical enterprise and data processing applications and/or with storage of personal profiles and other data they care about.

Behind the scenes, the dependability and reliability needs of these systems pose unprecedented problems. The support infrastructure for such sites, e.g., for large social network sites, such as Facebook’s HBase/HDFS solution [?], as well as for large-scale data processing, such as Cassandra [?], relies on traditional, relatively simple performance debugging and reliability solutions for stateful services, e.g., logging and 3-way replication. These traditional reliability solutions assumed that stateful components were few, e.g., a monolithic database system, for which failures were rare, and that these few components and their failures occurred in somewhat predictable component states, which were monitored directly by humans.

These assumptions do not currently hold in stateful, large-scale distributed systems, such as HBase, HDFS, or Cassandra, although most customer expectations have not changed. In these large-scale systems, stateful components are expected to be many, and failures are expected to be the rule rather than the exception; for example, one hardware failure per data center, per day is commonly reported [?]. Moreover, the necessary maintenance activities for monitoring, diagnosis, inspection or repair can no longer be handled through frequent human intervention. Vast amounts of logs and monitoring data are generated, e.g., 1 Terabyte of logging data per day is typical of a small to medium-size 100 node Cassandra cluster. Therefore, a system administrator cannot catch up with inspecting the vast amounts of data produced by all stateful components. Even if monitoring data could be read by operators, humans may no longer be able to predict or recognize all possible combinations of states and situations leading to unrecoverable errors for the many components of large-scale sites.

Newer solutions that take into account the fact that Clouds contain a large number of components provide automated log and monitoring data collection and some anomaly diagnosis services [?], [?], [?], [?], [?], [?]. However, in order to be able to make sense of these errors, either the user’s expectations need to be pre-specified, or significant automated post-processing overhead needs to be involved.

More recently, several automatic, purely black-box anomaly detection solutions have been developed for automating pattern recognition in monitoring data, e.g., through deriving statistical pair-wise correlations, or by peer comparison [?], [?], [?], [?]. Other black-box solutions reverse match the anomalous patterns to known code patterns in order to help the users in their anomaly diagnosis [?], [?], [?]. These black-box solutions are more flexible and applicable to large-scale systems, but they still incur significant storage and processing overhead. As before, anomaly diagnosis is typically performed off-line, post-facto, at which point vast amounts of information have already been collected.

In this paper, we leverage recurrent neural networks (RNN) [?] for learning and recognizing resource consumption patterns online, in real-time. Our method uses minimal instrumentation in order to capture and label the application contexts during which the resource monitoring occurs. For the purposes of this paper, these contexts consist of the known workload mix type and phase that was used during the monitored application runs. Alternatively, we could use as application context the application method, module, or other clearly identifiable piece of code, or semantic scope of the application, and its input values, or the application code running the same method/module, independent of parameter values.

Specifically, we capture and investigate the throughput and latency metrics at the application level, as well as the CPU, memory and disk consumption monitoring data, and tag them with the application context in which they occur at their collection point. Then, as data is collected, our RNN machine learning technique is deployed in order to learn mathematical models, per context. Finally, at the point of root-cause analyses by human inspection, whether these are performed on the live system, or off-line, only the contexts that statistically diverge from the learned model are signalled to the user and, for each, we provide context-rich problem digests.

Our technique assumes that the workload types of applications are known and/or that open-source systems are used in the Cloud software stack. Specifically, we monitor Apache Cassandra1 and MongoDB2 with regards to their resource consumption and throughput for known workload mixes involving read and write operations.

Instead of searching through all possible patterns, in order to learn and distinguish baselines versus anomalies, in vast amounts of unlabeled data, our RNN-based machine learning technique benefits from our grouping the monitoring information by context. By collecting information pre-sorted by context(s), we thus enable both i) the machine learning and ii) the user to perform the tasks that they perform best, as follows. The machine can process large amounts of data to generate the typical wavelet/signal and/or statistical attributes, i.e., distribution, standard deviation, etc., for the various resource consumption patterns of each application context, or for sets of related application contexts. Based on the extracted normal signals by context, the machine can also generate compact synopses for any suspected anomalies or corner cases only, at run-time, while remembering only the learned model for the normal behavior of each context. In turn, the user can provide feedback for selective supervised learning - normal versus anomalous - based on the few contexts which the machine has automatically detected as unusual, or for specific known error cases which the machine may have incorrectly considered normal.

After refinement of predictions, based on user feedback, our method is expected to have two advantages: i) it will reduce the high storage or performance overheads in the case where a mathematical model can be derived for each per-context application pattern or signal and all information that matches the “normal” pattern of a context can be discarded instead of written to disk and ii) the user will have pre-sorted, pre-classified, condensed and semantically tagged data to investigate in order to perform root cause analysis.

In our experimental evaluation, the applications of interest are deployed in cloud virtual machines (VMs) using OpenStack3. We utilize an intensive workload generator, the Yahoo cloud serving benchmark [?], to stress the system with workloads similar to real world scenarios.

We use two scenarios, of fault injection and resource interference, respectively, in order to explore the behavior of two well known cloud software packages, Apache Cassandra and MongoDB, under both normal and anomalous environment and workload conditions. In the fault injection experiment, we inject disk I/O delays, emulating a faulty or congested disk behavior, for a certain period of time, during separate experiments, with Cassandra or MongoDB. In the resource interference experiment, we run the regular Cassandra code in one VM together with a second VM running a misconfigured Cassandra instance that acts as a disk hog, which produces disk interference for the regular Cassandra VM. Both the disk I/O delay and the disk hog perturbations produce changes in overall VM resource usage, as well as significant application throughput drops for the regular Cassandra and MongoDB workloads. In our experimental evaluation, we show that our method allows us to detect normal versus anomalous resource consumption patterns with 98.3% accuracy.

1 http://cassandra.apache.org
2 https://www.mongodb.com
3 https://www.openstack.org

By offering context based anomaly extraction with minimal processing cost, we anticipate that our solution will promote augmenting routine system monitoring with real-time statistical analysis for context-aware anomaly detection. Moreover, our technique can be easily used in conjunction with existing, orthogonal anomaly detection methods, such as previous methods for processing application text messages from logs [?], [?], [?], [?].

II. MOTIVATIONAL EXPERIMENT

Figure 1 demonstrates a representative scenario for anomaly detection and/or diagnosis. We monitor the average total throughput of a four node Cassandra cluster running the Yahoo! Cloud Serving Benchmark (YCSB) workload4 over time.

The Cassandra cluster registers a relatively stable high throughput of about 6000 operations/second initially, until around time 70 (seconds) when we start perturbing one node of the system. YCSB has an insert and update phase. Towards the end of the YCSB update phase, for a period of 100 seconds, we inject disk delays on read and write operations as follows: 50% of Cassandra reads and 50% of Cassandra writes to/from disk on the affected node are delayed by 50 seconds each.

Although we inject the delays into a fraction of the operations of only one of the cluster nodes, we observe that (i) the throughput of the whole cluster drops significantly, as shown in Figure 1, (ii) during the drop, there is an even load distribution among the nodes and (iii) Cassandra cannot recover its throughput from the drop throughout our disk delay injection period.

This significant and sustained throughput drop is a typical scenario of performance anomaly where, even if one of the causes is known or suspected, a thorough investigation of all related monitoring and log data showing anomalies would help.

Since Cassandra is a complex, resource intensive distributed system, we introduce an anomaly detection technique, which helps with the above investigation in two ways: i) by automatically identifying normal versus anomalous patterns in the resource consumption monitoring data associated with the application run, on-the-fly, and ii) by identifying the application contexts, such as the workload characteristics and/or workload phase during which the anomalies occurred.

4 https://research.yahoo.com/news/yahoo-cloud-serving-benchmark/

Figure 1: Huge performance degradation of a Cassandra cluster for a faulty disk. (Axes: Time (seconds) vs. Throughput (ops/sec).)

In the following, we introduce our context-aware statistical learning method.

III. INFERRING ANOMALIES FROM RESOURCE USAGE MONITORING PER CONTEXT

Our key idea is to add application context to the machine learning system that is in charge of differentiating normal versus anomalous resource consumption patterns. Under the assumption that repeatable workload patterns will give rise to similar resource consumption patterns, training on data already classified by context reduces the training time by helping the classifier to discover the similar patterns in the data. We use recurrent neural networks (RNN) [?] for this purpose, because they are widely used in pattern recognition.

Figure 2 shows the complete workflow of our machine learning system. During training, the deep learning RNN engine is presented with monitoring data pre-labeled with the representative workload type and phase under which it was gathered. Over time, the engine learns the statistically expected resource consumption patterns, per resource, and per workload. All pre-labeled learned patterns are stored in a wavelet database. After a period of training, new data labelled with its workload patterns is presented to our monitoring and statistical learning system. At this point, we match the workload patterns with those stored in the wavelet database and we determine if the new resource consumption patterns match the corresponding resource consumption wavelets from the database. We use a technique called Dynamic Time Warping (DTW) [?] for this purpose. If DTW does not find a match, then an anomaly is presented to the user together with the workload type and phase in which it was detected. We can incorporate user feedback in case the unusual pattern is in fact not an anomaly. The user can also input resource consumption patterns associated with known errors to the deep learning system, which can potentially correct misclassifications by the system of known faulty patterns. In the following, we present our semantic aware resource monitoring, deep learning with RNN, and wavelet matching using DTW in more detail.

A. Semantic Aware Resource Monitoring

Our resource monitoring module samples resource consumption periodically, at a given sampling rate. Over time, it gathers training data, which is periodically fed to our deep learning system. Our data collection methodology considers the semantics of the particular workload being run on the cloud. Semantically specifying the scope within which different resource consumption data is collected greatly helps the machine learning algorithm to recognize patterns and detect anomalies with a much lower overhead.

The resource monitoring module, shown in pseudo-code in Algorithm 1, receives and initializes the workload configuration, including the YCSB dataset parameters: workloadName, recordCount, OperationCount, ReadPortion, UpdatePortion, ScanPortion, InsertPortion, and other variables related to the number of threads to be executed, the resource usage and the sampling rate.

After the initial state is captured, the workload characteristics and phase names are used as context parameters in the execution of various application phases, e.g., the YCSB data load (insert), the data read and update phases of the workload. At the end, the monitoring module returns the resource usage results correlated with the workload characteristics and with each workload execution phase (e.g., insert, read and update), thus providing pre-labeled data for pattern recognition.

Algorithm 1 Context Capture during Resource Monitoring

Requires:
  workloadConf: workloadName, RecCount, OpCount, ReadPortion, UpdatePortion, ScanPortion, InsertPortion
  NumThreads: Number of threads to be executed
  SamplingRate: The sampling rate
  ResourceRetrievalFeatures: CPU, Memory, Disk usage, etc.

1: procedure RUNYCSB(workloadConf, NumThreads)
2:   startCassandra()
3:   startWorkload(ResourceRetrievalFeatures, SamplingRate, NumThreads)
4:   load(workloadConf, “ld-phase”, RecCount)
5:   read(workloadConf, “rd-phase”, OpCount)
6:   update(workloadConf, “up-phase”, OpCount)

Returns: ResourceUsageResultsPerPhase
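To make the context-capture step more concrete, the following Python sketch shows one way the monitoring loop of Algorithm 1 could be realized. It is an illustration under our own assumptions (the psutil library for sampling, and a hypothetical run_phase callable standing in for a YCSB load/read/update phase), not the authors' implementation.

import threading
import time

import psutil  # assumption: psutil is available for resource sampling


def monitor_phase(workload_name, phase_name, run_phase, sampling_rate, samples):
    """Run one workload phase while sampling resources, tagging every sample
    with its semantic context (workload name, phase name)."""
    worker = threading.Thread(target=run_phase)   # e.g. a YCSB load/read/update phase
    worker.start()
    while worker.is_alive():
        samples.append({
            "context": (workload_name, phase_name),    # semantic label
            "cpu": psutil.cpu_percent(interval=None),  # % CPU since previous call
            "mem": psutil.virtual_memory().percent,    # % memory in use
            "disk": psutil.disk_io_counters().write_bytes,
            "time": time.time(),
        })
        time.sleep(1.0 / sampling_rate)
    worker.join()
    return samples

The per-phase labels (“ld-phase”, “rd-phase”, “up-phase” in Algorithm 1) become the context keys under which the collected wavelets are later grouped for training.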


Figure 2: Workflow of our system. (Stages: 1. Semantic Aware monitoring; 2. Deep Learning; 3. Wavelet Database of representative wavelets per workload; 4. Workload characterization; 5. Wavelet matching using DTW; 6. Anomaly / False alarm; split into training and testing paths.)

B. Deep Learning from the monitored data

Our next step is to identify a pattern from the resource monitoring data for a given semantic scope (in our case, the workload characteristics and phase). After the semantic aware resource monitoring in the first step, we have a collection of n training wavelets from n runs of each workload w_i. The output from this stage will be a pattern P_i, one for each monitored resource for a given workload. The duration d for a run of a given workload is fixed across all the training runs. The machine learning produces a representative resource usage wavelet, with a duration d. This wavelet aggregates all the minor fluctuations the n training runs have for different resource usages.

To generate this representative wavelet, we use deep learning methods, specifically Recurrent Neural Networks [?].

We have a sequence of inputs (the wavelets from n training runs) for an output y_j for a given resource, so we denote the inputs by x_j for the jth input (j = 1, 2, . . . , n). The corresponding jth output is calculated via the following equation:

y_j = W x_j + W_r y_{j-1}    (1)

We have a weight matrix W_r which incorporates the output at the previous step linearly into the current output.

In most common architectures, there is a hidden layer which is recurrently connected to itself. Let h_j denote the hidden layer at timestep j. The formulas are then:

h_0 = 0    (2)

h_j = σ(W_1 x_j + W_r h_{j-1})    (3)

y_j = W_2 h_j    (4)

where σ is a suitable non-linearity/transfer function like the sigmoid. W_1 and W_2 are the connecting weights between the input and the hidden layer and between the hidden and the output layer, respectively. W_r represents the recurrent weights. The RNN workflow is demonstrated in Figure 3.
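As a concrete illustration of equations (2)-(4), the sketch below rolls the recurrence forward in NumPy. The weight shapes, the sigmoid non-linearity and the random initialization are our own assumptions for the example, not the trained model used in the paper.

import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def rnn_forward(xs, W1, Wr, W2):
    """Apply the recurrence of equations (2)-(4) over a sequence of inputs xs."""
    h = np.zeros(Wr.shape[0])          # h_0 = 0                          (eq. 2)
    ys = []
    for x in xs:
        h = sigmoid(W1 @ x + Wr @ h)   # h_j = sigma(W1 x_j + Wr h_{j-1}) (eq. 3)
        ys.append(W2 @ h)              # y_j = W2 h_j                     (eq. 4)
    return np.array(ys)


# Example with scalar resource samples and 16 hidden units (illustrative shapes).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 1))
Wr = 0.1 * rng.normal(size=(16, 16))
W2 = rng.normal(size=(1, 16))
outputs = rnn_forward([np.array([v]) for v in (0.2, 0.4, 0.3)], W1, Wr, W2)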

Figure 3: Workflow of RNN. (Input, hidden and output layers unrolled over timesteps 1-3, connected by the weights W_1, W_r and W_2.)

Running the RNN over the n training wavelets generates a prediction model. Using this model, we perform a prediction for the input x_n, which is the last captured wavelet in the training data. The forward prediction of length d for the last training wavelet generates the representative wavelet P_i for that workload type and phase. In the next step, this wavelet is matched against future test wavelets to differentiate between a normal and an anomalous execution.
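The forward prediction step can be sketched as follows: the hidden state is warmed up on the last training wavelet and the model is then rolled forward for d steps, feeding each prediction back in as the next input. How exactly the authors seed and roll the trained RNN is not spelled out, so treat this as one plausible reading that reuses the weight conventions of the sketch above.

import numpy as np


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def representative_wavelet(last_wavelet, W1, Wr, W2, d):
    """Forward-predict d steps after the last training wavelet (scalar samples
    assumed, so an output can be fed back in as the next input)."""
    h = np.zeros(Wr.shape[0])
    for x in last_wavelet:                 # warm up the hidden state on x_n
        h = _sigmoid(W1 @ x + Wr @ h)
    x, preds = last_wavelet[-1], []
    for _ in range(d):                     # roll the recurrence forward d steps
        h = _sigmoid(W1 @ x + Wr @ h)
        x = W2 @ h                         # prediction becomes the next input
        preds.append(x)
    return np.array(preds)                 # the representative wavelet P_i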

C. Wavelet Matching Using Dynamic Time Warping

At the end of the machine learning algorithm, we form a database that consists of all the representative signals for all the tested workloads, for each monitored resource. This mapping information is necessary for testing new data for anomalies in an online fashion. When new data arrives, we know the workload type from the semantic analysis. Then we find the relation between the test wavelet T_i and the representative wavelet P_i for that given workload that is stored in the wavelet database. We use a widely recognized signal processing technique called dynamic time warping (DTW) [?] to find the relation between the test wavelet and the representative wavelet.

Both the wavelets will have d data points in them, as described in the previous section. Suppose the wavelets T_i and P_i consist of the following sampled values:

T_i = t_1, t_2, . . . , t_d    (5)

P_i = p_1, p_2, . . . , p_d    (6)

To align these two sequences using DTW, we first construct a d-by-d matrix where the (i-th, j-th) element of the matrix corresponds to the squared distance, d(t_i, p_j) = (t_i − p_j)^2, which is the alignment between points t_i and p_j. To find the best match between these two sequences, we retrieve a path through the matrix that minimizes the total cumulative distance between them. In particular, the optimal path is the path that minimizes the warping cost:

DTW(T, P) = min sqrt( Σ_{k=1}^{K} w_k )    (7)

where w_k is the matrix element (i, j)_k that also belongs to the k-th element of a warping path W, a contiguous set of matrix elements that represent a mapping between T and P.

This warping path can be found using dynamic programming to evaluate the following recurrence:

γ(i, j) = d(t_i, p_j) + min { γ(i−1, j−1), γ(i−1, j), γ(i, j−1) }    (8)

where d(i, j) is the distance found in the current cell, and γ(i, j) is the cumulative distance of d(i, j) and the minimum cumulative distances from the three adjacent cells.
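A compact NumPy rendering of the recurrence in equation (8), using the squared point distance defined above, is shown below; the authors do not publish code, so this is the standard textbook formulation rather than their implementation.

import numpy as np


def dtw_distance(T, P):
    """DTW warping cost between two wavelets T and P of length d (equation 8)."""
    d = len(T)
    gamma = np.full((d + 1, d + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, d + 1):
        for j in range(1, d + 1):
            cost = (T[i - 1] - P[j - 1]) ** 2          # squared distance d(t_i, p_j)
            gamma[i, j] = cost + min(gamma[i - 1, j - 1],
                                     gamma[i - 1, j],
                                     gamma[i, j - 1])
    return np.sqrt(gamma[d, d])                        # warping cost, as in eq. (7)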

At the end of the DTW algorithm, we have a minimum distance dist_min that will warp T to P. The lower the distance, the more similar the signals are.

To use the DTW distance to measure differences between normal and anomalous signals, we use the following method.

During the training phase, we calculate the DTW distance between each of the training wavelets and the representative wavelet, and we record the maximum such distance, which we call thres_normal. This measure tells us the maximum deviation of a normal run from the representative wavelet. During the testing phase, if the DTW distance calculated as above for a test wavelet is larger than thres_normal, we flag the execution as anomalous.
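Calibrating and applying thres_normal then reduces to a few lines; this sketch reuses the dtw_distance() helper above and is illustrative rather than the authors' code.

def calibrate_threshold(training_wavelets, representative):
    """thres_normal: the largest DTW distance of any training wavelet."""
    return max(dtw_distance(w, representative) for w in training_wavelets)


def is_anomalous(test_wavelet, representative, thres_normal):
    """Flag a test wavelet whose DTW distance exceeds the calibrated threshold."""
    return dtw_distance(test_wavelet, representative) > thres_normal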

As we are monitoring multiple resources, an anomaly in any one resource consumption pattern will generate an alarm. In the alarm information, we capture all the resource utilization patterns that are anomalous. This feedback gives the user a better understanding of what is wrong.

D. Dealing with False Alarms

There can be scenarios where, due to unexpected system activity, a normal execution still differs from our representative wavelet by more than the maximum DTW distance. In that case, we can incorporate feedback from the user to enhance our machine learning model (step 6 in Figure 2). We buffer the wavelet for any workflow we have detected as anomalous. Once the user tells us that our prediction is wrong, we feed the wavelet to our RNN and modify the representative workload to integrate characteristics from the new wavelet, including adjusting the maximum allowed deviation of the normal pattern. This method allows us to refine our prediction in case of false positives (where a normal execution is wrongly flagged as anomalous).

Similarly, user feedback is used to correct false negatives, when resource monitoring patterns for certain workloads are not flagged as anomalous by our system, even if they actually are. For known errors, or whenever a new anomalous pattern is detected by the user, we store the user-discovered patterns for anomalous executions of a given workload pattern in our wavelet database. Whenever new wavelets are presented to our RNN system thereafter, we compute the DTW distance from both the normal pattern and the user-directed anomalous patterns. We flag a new wavelet as anomalous if its distance to any of the anomalous patterns is smaller than its distance to the normal pattern.
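The feedback-directed test described above can be sketched as follows; the layout of the per-workload wavelet database entry (a normal pattern, its threshold, and a list of user-confirmed anomalous patterns) is an assumption made for the example.

def classify(test_wavelet, normal_pattern, thres_normal, anomalous_patterns):
    """Classify a wavelet against the normal pattern and any user-confirmed
    anomalous patterns stored for the same workload."""
    d_normal = dtw_distance(test_wavelet, normal_pattern)
    # Closer to a known anomalous pattern than to the normal one: flag it,
    # even if it falls within the normal threshold.
    if any(dtw_distance(test_wavelet, a) < d_normal for a in anomalous_patterns):
        return "anomalous"
    return "anomalous" if d_normal > thres_normal else "normal"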

IV. EXPERIMENTAL EVALUATION

Our experimental platform includes a four node Cassandra cluster configuration that is deployed in an OpenStack system. We use Apache Cassandra 2.0.16 and MongoDB Amazon 3.2.6 for running our experiments. Also, we use the YCSB Cassandra workloads (workload-a and workload-b) [?]. Workload-a is an application example for a session store recording recent actions and workload-b is an application example of photo tagging; adding a tag is an update operation, but most operations are to read tags. We set the number of threads to 5, the record and operation counts to 500000, and we use default values for all other variables. The monitoring sampling rate for the resource usage is set to one second. We use the following two anomaly scenarios for our experiments:

1) Disk Delay: In this anomaly scenario, we emulate a faulty disk with the help of the systemtap tool [?]. During the execution of a workload in Cassandra or MongoDB, we inject a delay of 50 seconds each in 50% of the reads and 50% of the writes to disk. We keep injecting the delay for a period of 10 seconds. This scenario is similar to the more extensive disk fault injection scenario we presented in Section II, where we motivated the investigation of a surprisingly high throughput drop in Cassandra associated with temporary disk delays.


2) Resource Interference: We co-locate two VMs running Cassandra workloads on the same physical machine. One of the workloads is acting as a disk hog, which is gradually increasing its disk utilization, eventually exceeding its predeclared requirements for disk usage and causing resource interference to the other workload. Detecting this type of anomaly may help guide VM consolidation decisions. We use 2 identical medium size VMs (4 Cores, 4GB RAM, 40GB HD) deployed in OpenStack. The physical machine on which both VMs are placed has a Xeon(R) CPU E5-2650 2.00GHz (2 x 8 Cores) and 32GB of RAM.

For each of the above two anomaly scenarios, we have 20 normal runs and 20 runs with an anomaly inserted, for both database systems. Each training run contains a complete run of the respective YCSB workload. The neural network is trained using the sampled data points from the 20 training runs and it extracts a representative wavelet for each workload. In the next step of training, we build the wavelet database by computing the DTW distance between the representative wavelet and the training wavelets from the 20 runs of each workload and storing the maximum distance, the workload type and the representative wavelet. The training time is around 1 second per training run, on average, for both Cassandra and MongoDB, thanks to our context aware machine learning method.

A. Cassandra Results

Figure 4: Training wavelets and a test run, showing the predicted wavelet (dotted) overlayed with a new run (solid) of the same workload, not seen during training, for a Cassandra workload. (Y-axis: CPU usage, 0-100%.)

Figure 4 shows the representative wavelet extracted for CPU utilization by the recurrent neural network (RNN) for a particular workload (workload-a) for Cassandra. We show 7 (out of 20) training wavelets used for extracting the predicted wavelet (dotted) shown in the test run, where we compare the extracted wavelet in dotted line with a new run of the same workload, not seen during training, shown in solid lines at the end.

As shown in the figure, for Cassandra, the extracted pattern is very similar to the actual training patterns. In the following, we use the trained RNN for recognizing normal Cassandra workloads versus anomalous patterns for Cassandra workloads under the two anomaly scenarios described above: disk delay and disk resource interference.

Figure 5: The DTW distance helps to detect an anomalous run (grey) from a normal run (light grey) for the disk delay anomaly in Cassandra, injected at 55-65 sec. (thres_normal = 220.)

Figure 6: The DTW distance helps to detect an anomalous run (grey) from a normal run (light grey) for the resource interference anomaly in Cassandra. Although the disk hog is active during the whole run, its disk utilization grows gradually and approaches maximum during the last third of the run. (thres_normal = 0.82.)


For creating test data for the detection of anomalies, we use 10 normal runs and 10 runs with an anomaly inserted for each workload. Figure 5 shows the DTW distance from the representative run (black) that was computed for new normal runs not seen during training (light grey) and anomalous runs (grey), respectively, for a YCSB workload on Cassandra. It also shows the CPU utilization for the predicted, normal and anomalous scenarios. Figure 6 shows the same information, but for the resource interference anomaly. The disparity between the DTW distance in the normal and faulty runs is even more pronounced in this case.

B. MongoDB Results

Figure 7 shows the representative wavelet extracted for CPU utilization by the recurrent neural network (RNN) for a particular workload (workload-a) for MongoDB. Again, we show 7 representative training runs in the figure, and we show the extracted pattern in dotted line with a new run shown in solid lines at the end. We see that, in the case of MongoDB, there is more variation in CPU utilization than for Cassandra. However, there is a repeatable low-high-low utilization pattern for this workload, which is extracted successfully by the RNN.

Figure 8 shows the DTW distance computed for the new normal run, not previously seen in training (light grey), and anomalous runs (grey) from the representative run (black) for the YCSB workload on MongoDB. The figure also shows the CPU utilization for the predicted, normal and anomalous scenarios. Around time 65 seconds, we inserted the disk delay for around 10 seconds. This anomaly is also captured by our method using the DTW distance.

C. Using User Feedback

Out of a total of 120 test scenarios for Cassandra and MongoDB, at first we had 15 false positives (normal runs flagged as anomalous) and 10 false negatives (anomalous runs flagged as normal). As mentioned before, we buffer the runs flagged as anomalous until the administrator diagnoses the real problem. In case of false positives, we update our wavelet database with new normal trends, and the treatment of false negatives is done using future detection of anomalies and creating representative databases for anomalous runs. After incorporating user feedback, we ran 20 more runs, half normal and half anomalous, for each of Cassandra and MongoDB. On this new test dataset of 160 runs (which includes the 120 initial test runs), our method is able to reduce the total number of false positives to 2 and false negatives to 0.

V. RELATED WORK

In this section, we briefly classify the existing techniques in anomaly detection and compare them with our work.

A. Theoretical Model-Based Approaches

Theoretical model-based approaches [?], [?], [?], [?] leverage theoretical concepts such as Queuing theory [?], [?] to contrast observed system behavior with theoretically expected behavior, which may be based on unrealistic assumptions of system behavior.

Figure 7: The training wavelets and the predicted wavelet (dotted) overlayed with an unseen actual wavelet (solid) for a workload in MongoDB. We show here 7 training runs used for extracting the wavelet shown in the test run, where we compare the prediction with a new run of the same workload. (Y-axis: CPU usage, 0-100%.)

Figure 8: The DTW distance helps to detect an anomalous run (grey) from a normal run (light grey) for the disk delay anomaly in MongoDB, injected at 65-75 sec. (Axes: Time (seconds) vs. CPU usage (%); DTW distances of 836 (normal) and 1079 (anomalous); thres_normal = 978.)

To mitigate the brittleness of theoretical modeling, Thereska et al. [?] introduce Ironmodel, which is an adaptive model that adjusts its parameters based on the context that the system operates in. However, compared to our work, model building in theoretical approaches, including Ironmodel, needs to use many assumptions and is complex, hence expensive.

B. Expectation-Based Approaches

System programmers as well as system analysts can detect anomalies through expressing their expectations about the system’s communication structure, timing, and resource consumption. Language based approaches include MACE [?], TLA+ [?], PCL [?], PSpec [?] and Pip [?]. PSpec [?] is a performance checking assertion language that allows system designers to verify their expectations about a wide range of performance properties. Performance analysis tools [?], [?], [?], [?], [?], [?] allow programmers to analyze the performance of a system to find sources of inefficiencies.

The limitation of these methods compared to our work is that they assume extensive knowledge of the user about the system’s communication structure, timing, and resource consumption; anomalous patterns in modern large-scale systems may involve a combination of states and factors that cannot always be predefined.

C. Mining Operational Logs

DISTALYZER [?] provides a tool for developers to compare a set of baseline logs with normal performance to another set with anomalous performance. It categorizes logs into event logs and state logs. Xu et al. [?] leverage the schema of logging statements in the source code to parse log records. Fu et al. [?] use Finite State Automata to learn the flow of a normal execution from a baseline log set, and use it to detect anomalies in performance in a new log set. Similar to other off-line approaches [?], [?], [?], [?], they use text mining techniques to convert unstructured text messages in log files to log keys, which requires expensive text processing. Yuan et al. [?] introduce techniques to augment logging information to enhance diagnosability.

Unlike these previous anomaly detection techniques, we are focused on recognizing patterns in resource monitoring data, which may incur less overhead than processing verbose text logs.

D. Context-Awareness in Anomaly Detection

Cohen et al. [?] correlate system metrics to high-level states, and leverage the model to find the root cause of faults based on past failure cases within similar settings [?]. Similarly, Lao et al. [?] identify known problems by mapping traces of function calls to known high level problems and form a knowledge base of associations between symptoms and root causes. Liblit et al. [?] sample traces from multiple instances of a program from multiple users to limit the overhead of monitoring on each user. They use classification techniques to predict correlation between traces and known bugs.

In recent gray-box solutions [?], [?], part of the application context in which the anomaly occurs is captured and/or recovered as part of automatic monitoring or data analytics for anomaly detection in two main ways: via manual instrumentation or automatically, through static analysis of application code. The part of the high level context thus captured is then leveraged for root cause analysis of anomalies. For example, in Zhao et al. [?], the application context is partly captured through static analysis and also recovered through automatic inferences based on data flow analysis for block identifiers in log records.

The above-mentioned gray-box methods present several trade-offs, such as run-time overhead, storage overhead, post-processing overhead and the extent of manual user intervention. Our work reduces all types of overhead and needs trivial instrumentation.

Sambasivan et al. [?], Pip [?] and PinPoint [?] infer context in terms of execution flow and diagnose anomalies by comparing the expected execution flow with the anomalous one. The common denominator of these approaches is detailed fine-grained tracing, which is provided by explicit code instrumentation or through distributed tracing enablers such as X-Trace [?], Dapper [?] and Magpie [?]. Due to high monitoring resolution and causality information, trace-based anomaly detection techniques are more precise than regular log-based approaches. However, tracing generates a large amount of data which needs to be processed offline. Also, because of the high resolution data collection, they impose undue runtime overhead.

SherLog [?] infers likely paths of execution from log messages to assist the operator in diagnosing the root cause of failures. Like the above methods, SherLog is a postmortem error diagnosis tool. In contrast, just like our previous work [?], our resource monitoring work in this paper is designed for real-time anomaly detection.

VI. CONCLUSION

With a dramatic increase in the scale and complexity of cloud infrastructure, it is no longer possible for the human user to diagnose anomalies by searching through vast amounts of log and monitoring information with no “guiding map”.

Our contribution in this paper is to bridge the gap between machine-centric black-box data mining anomaly detection solutions, and human-centric theoretical or language-based solutions. In the former, the machine attempts to process vast amounts of operational data with little or no user support. In the latter, the human is expected to know and predeclare all or most system and application patterns beforehand.

We believe that it is unreasonable to expect a reliability solution to work at either extreme of the automation spectrum in complex or large-scale distributed software for the cloud, such as Cassandra and MongoDB. Instead, we propose a semantic-aware reliability solution that puts the human back in the reliability loop, while still working with existing automatic, legacy, fault-masking, fault-tolerance and logging solutions, and with minimal modifications to running software systems. Our key idea is to record resource monitoring data pre-labeled with the semantic application contexts they belong to. In this way, instead of poring over vast amounts of machine generated low level data, the human is relegated to tasks that humans are naturally good at performing, i.e., inspecting interesting high level patterns and digging down into unusual patterns that are already signalled by the machine learning runtime. Thus, the user can obtain information from the system, and give feedback to the system about a select few unusual patterns. This enables a synergy between human and machine into a best of both worlds solution, combining the strengths of both.


Our semantically-tagged model is based on RNN models, and allows association of resource consumption signals with the workload patterns they occur in. The model thus serves as a reference for both automatic runtime pattern classification and the user’s precise interpretation or diagnosis of a detected anomalous pattern.

We believe that our highly accurate and low overhead anomaly detection tool will form the basis for an effective real-time approach which synthesizes monitoring data into semantically rich patterns that convey meaning according to a learned model of the system per-context.

REFERENCES

[1] Saeed Ghanbari, Ali B. Hashemi, and Cristiana Amza. 2014. Stage-aware anomaly detection through tracking log points. In Proceedings of the 15th International Middleware Conference (Middleware ’14). ACM, New York, NY, USA, 253-264.

[2] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC ’10). ACM, New York, NY, USA, 143-154.

[3] Williams, R.J., and Zipser, D. (1988). A learning algorithm for continually running fully recurrent neural networks (Tech. Rep. No. 8805). San Diego: University of California, Institute for Cognitive Science.

[4] Berndt D, Clifford J (1994) Using dynamic time warping to find patterns in time series. AAAI-94 workshop on knowledge discovery in databases, pp 229-248.

[5] M. E. Crovella and T. J. LeBlanc. Performance debugging using parallel performance predicates. SIGPLAN Not., 28(12):140–150, Dec. 1993.

[6] G. Jiang, H. Chen, and K. Yoshihira. Discovering likely invariants of distributed transaction systems for autonomic system management. Cluster Computing, 9(4):385–399, Oct. 2006.

[7] Matt Massie, Bernard Li, Brad Nicholes, Vladimir Vuksan, Robert Alexander, Jeff Buchbinder, Frederiko Costa, Alex Dean, Dave Josephsen, Peter Phaal, and Daniel Pocock. 2012. Monitoring with Ganglia (1st ed.). O’Reilly Media, Inc.

[8] Wolfgang Barth. 2008. Nagios: System and Network Monitoring (2nd ed.). No Starch Press, San Francisco, CA, USA.

[9] G. Jiang, H. Chen, and K. Yoshihira. Modeling and tracking of transaction flow dynamics for fault detection in complex systems. IEEE Trans. Dependable Secur. Comput., 3(4):312–326, Oct. 2006.

[10] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. SIGOPS Oper. Syst. Rev., 37(5):74–89, Oct. 2003.

[11] X. Pan, J. Tan, S. Kavulya, R. Gandhi, and P. Narasimhan. Ganesha: Blackbox diagnosis of mapreduce systems. SIGMETRICS Perform. Eval. Rev., 37(3):8–13, Jan. 2010.

[12] M. P. Kasick, J. Tan, R. Gandhi, and P. Narasimhan. Black-box problem diagnosis in parallel file systems. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST’10, pages 4–4, 2010.

[13] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica. X-trace: A pervasive network tracing framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation, NSDI’07, pages 20–20, 2007.

[14] C. E. Killian, J. W. Anderson, R. Braud, R. Jhala, and A. M. Vahdat. Mace: Language support for building distributed systems. SIGPLAN Not., 42(6):179–188, June 2007.

[15] R. R. Sambasivan, A. X. Zheng, M. De Rosa, E. Krevat, S. Whitman, M. Stroucken, W. Wang, L. Xu, and G. R. Ganger. Diagnosing performance changes by comparing request flows. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11, pages 43–56, 2011.

[16] L. Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.

[17] H. J. Wang, J. C. Platt, Y. Chen, R. Zhang, and Y.-M. Wang. Automatic misconfiguration troubleshooting with PeerPressure. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI’04, pages 17–17, 2004.

[18] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 117–132, 2009.

[19] K. Yamanishi and Y. Maruyama. Dynamic syslog mining for network failure monitoring. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, pages 499–508, 2005.

[20] J. Dean. Designs, lessons and advice from building large distributed systems. Keynote from LADIS, 2009.

[21] C. Yuan, N. Lao, J.-R. Wen, J. Li, Z. Zhang, Y.-M. Wang, and W.-Y. Ma. Automated known problem diagnosis with event traces. SIGOPS Oper. Syst. Rev., 40(4):375–388, Apr. 2006.

[22] D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage. Improving software diagnosability via log enhancement. SIGARCH Comput. Archit. News, 39(1):3–14, Mar. 2011.

[23] I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, and J. S. Chase. Correlating instrumentation data to system states: A building block for automated diagnosis and control. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI’04, pages 16–16, 2004.

[24] I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, and A. Fox. Capturing, indexing, clustering, and retrieving system history. SIGOPS Oper. Syst. Rev., 39(5):105–118, Oct. 2005.

[25] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug isolation via remote program sampling. SIGPLAN Not., 38(5):141–154, May 2003.

[26] S. Ma and J. Hellerstein. Mining partially periodic event patterns with unknown periods. In Data Engineering, 2001. Proceedings. 17th International Conference on, pages 205–214, 2001.

[27] Xu Zhao, Yongle Zhang, David Lion, Muhammad Faizan Ullah, Yu Luo, Ding Yuan, and Michael Stumm. 2014. lprof: a non-intrusive request flow profiler for distributed systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Berkeley, CA, USA, 629-644.

[28] A. A. Makanju, A. N. Zincir-Heywood, and E. E. Milios. Clustering event logs using iterative partitioning. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 1255–1264, 2009.

[29] K. Shen, M. Zhong, and C. Li. I/O system performance debugging using model-driven anomaly characterization. In Proceedings of the 4th Conference on USENIX Conference on File and Storage Technologies - Volume 4, FAST’05, pages 23–23, 2005.

[30] E. Thereska and G. R. Ganger. Ironmodel: Robust performance models in the wild. SIGMETRICS Perform. Eval. Rev., 36(1):253–264, June 2008.

[31] J. Stearley. Towards informatic analysis of syslogs. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, CLUSTER ’04, pages 309–318, 2004.

[32] C. Stewart. Performance Modeling and System Management for Internet Services. PhD thesis, Rochester, NY, USA, 2009.

[33] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for request extraction and workload modelling. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI’04, pages 18–18, 2004.

[34] D. Gross, J. F. Shortle, J. M. Thompson, and C. M. Harris. Fundamentals of Queueing Theory. Wiley-Interscience, New York, NY, USA, 4th edition, 2008.

[35] Book review: The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling by Raj Jain (John Wiley & Sons, 1991). SIGMETRICS Perform. Eval. Rev., 19(2):5–11, Sept. 1991. Reviewer: Robert Y. Al-Jaar.

[36] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam, and T. Newhall. The Paradyn parallel performance measurement tool. Computer, 28(11):37–46, Nov. 1995.

[37] S. E. Perl and W. E. Weihl. Performance assertion checking. SIGOPS Oper. Syst. Rev., 27(5):134–145, Dec. 1993.

[38] P. Reynolds, C. Killian, J. L. Wiener, J. C. Mogul, M. A. Shah, and A. Vahdat. Pip: Detecting the unexpected in distributed systems. In Proceedings of the 3rd Conference on Networked Systems Design & Implementation - Volume 3, NSDI’06, pages 9–9, 2006.

[39] K. A. Huck, O. Hernandez, V. Bui, S. Chandrasekaran, B. Chapman, A. D. Malony, L. C. McInnes, and B. Norris. Capturing performance knowledge for automated analysis. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, pages 49:1–49:10, 2008.


[40] Oprofile. http://oprofile.sourceforge.net/.

[41] Apache Cassandra. http://cassandra.apache.org/.

[42] U. Erlingsson, M. Peinado, S. Peter, and M. Budiu. Fay: Extensible distributed tracing from kernels to clusters. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP ’11, pages 311–326, 2011.

[43] T. Marian, H. Weatherspoon, K.-S. Lee, and A. Sagar. Fmeter: Extracting indexable low-level system signatures by counting kernel function calls. In P. Narasimhan and P. Triantafillou, editors, Middleware 2012, volume 7662 of Lecture Notes in Computer Science, pages 81–100. Springer Berlin Heidelberg, 2012.

[44] K. Nagaraj, C. Killian, and J. Neville. Structured comparative analysis of systems logs to diagnose performance problems. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, pages 26–26, 2012.

[45] Systemtap. http://sourceware.org/systemtap/.

[46] Facebook Scribe. https://github.com/facebookarchive/scribe/wiki.

[47] YCSB Core Workloads. Available at: https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads.

[48] Q. Fu, J.-G. Lou, Y. Wang, and J. Li. Execution anomaly detection in distributed systems through unstructured log analysis. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM ’09, pages 149–158, 2009.

[49] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and A. Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure.

[50] M. Y. Chen, A. Accardi, E. Kiciman, J. Lloyd, D. Patterson, A. Fox, and E. Brewer. Path-based failure and evolution management. In Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation - Volume 1, NSDI’04, pages 23–23, 2004.

[51] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy. SherLog: Error diagnosis by connecting clues from run-time logs. SIGARCH Comput. Archit. News, 38(1):143–154, Mar. 2010.