operations as a process

For Review Purposes

21 September 2014 Copyright 2014 Len Bass, Ingo Weber, Liming Zhu

DevOps: A Software Architect’s Perspective Chapter: Operations as a Process

By

Len Bass, Ingo Weber, Liming Zhu

With

Xiwei Xu and Min Fu

This is a chapter of a book we plan on releasing one chapter at a time in order to get

feedback. Information about this book and pointers to the other chapters can be found

at http://ssrg.nicta.com.au/projects/devops_book. We would like feedback about

what we got right, what we got wrong, and what we left out.

The table of contents is:

Part I - Background

1. What is devops? (initial release 2013.12.10)

2. The cloud as a platform (initial release: 2014.08.08)

3. Operations (initial release: 2014.08.25)

Part II Deployment Pipeline

4. Overall architecture (initial release 2014.01.23)

5. Build/Test (initial release 2014.03.13)

6. Deployment (initial release 2014.03.31)

Part III Cross Cutting Concerns

7. Monitoring (initial release: 2014.04.30)

8. Security (initial release: 2014.06.13)

9. Other ilities (initial release: 2014.09)

10. Business considerations (initial release 2014.08.20)

Part IV Case Studies

11. Case Study - Rafter, Inc (initial release 2014.04.16)

12. Case Study – Continuous Deployment Pipeline

13. Case Study - Migration

...

Part V The Future of DevOps

14. Operations as a process (this chapter)

15. Implications of industry trends on DevOps

Operations as a Process

with Xiwei Xu and Min Fu

If you can't describe what you are doing as a process,

you don't know what you're doing.

W. Edwards Deming

1. Introduction

As we discussed in the ility chapter, the continuous deployment pipeline is not

just another software product exhibiting system of systems characteristics; it also has

strong characteristics of a process. This is also true for many other operations such as

diagnosis, backup and recovery, upgrade, and maintenance. Even your favorite cron

jobs and scripts may be pipelining a set of small tools, a familiar concept in the

administrator’s world. We can view the world as a large number of such process-

oriented systems operating on your application of interest, not just sequentially, but

also with a lot of simultaneity and parallelism at both the process level and the task

level.

The purpose of this chapter is to discuss the implications of treating all operations

as processes. By treating, we mean you can discover a process model from existing

operation software/scripts and their logs. You can analyze the discovered process

models for improvement opportunities. You can use the process models to monitor

the progression of various operations and detect errors and recover from them as

early as possible. You want to achieve this at the step level rather than the whole

process level since that leads to earlier error detection and recovery. You can also use

the process models to help other activities such as setting monitoring rules and root

cause diagnosis. Performing these activities is difficult as they often lack some

context information on what is happening. The process model and the monitoring of

progression provides that context. The process models can also be a central place to

correlate different events and monitoring metrics to improve your understanding of

the runtime system. The opportunities are ample. The discovered process model can

be used to orchestrate variations in the original process and could become a

mechanism for executing future applications of the process.

From one perspective, a DevOps process may resemble a human-intensive

process similar to a software development process. You can apply the lessons learned

in improving software development processes to DevOps processes including agility,

life cycle models, quality controls and capability maturity models. The DevOps

movement arguably started as an agile attempt at Dev-Ops collaboration and the

application of software engineering practices to the Ops realm.

From another perspective, process-oriented systems can be seen as workflow

systems or business process management systems. Relevant results from these areas

include mining process models from logs and event traces, process analysis, runtime

monitoring and prediction, process quality improvement and human-intensive

processes. You can see an example of this perspective in our discussion about rolling

upgrade in Chapter: Deployment. In this chapter, we are focusing on the workflow or

business process perspective on operations processes.

One final introductory issue that you should consider is the level of abstraction in

your process models. A process model is a specification of a set of activities that,

when carried out, achieve a desired result. This set of activities

can be modeled at a fine-grained level – every step in carrying out the process – or at

a coarse-grained level – the major activities performed during the process. The

modeling level depends on the richness of the source of discovering the process

model and the results you wish to obtain from the model. As with a software system,

a process model can be understood from its run time properties – performance,

reliability, and security being three important qualities – as well as its development

time properties – interoperability and modifiability being two. In this chapter, we

focus on research that we have performed into the reliability of operations processes

whose process model specification is correct. Other perspectives involve ensuring the

correctness of the process (and its model), improving the performance of the

execution of the process, or constructing the model efficiently at a desired level of

granularity.

2. Motivation and Overview

As discussed in the Chapter: Ilities, reliability refers to the capability of the overall

deployment pipeline and its individual pieces to maintain its service provision for

defined periods of time. In the past, the use of the individual pieces in the pipeline

would be infrequent as software was released at a frequency of weeks if not months.

Now continuous delivery/deployment practices are making sporadic operations such

as deployment or integration test weekly or daily occurrences. As we mentioned

when we discussed testability, operation and infrastructure scripts are difficult to test

because mimicking the real complexity of the production environment is difficult.

Traditional exception handling mechanisms are usually designed with a single

language environment in mind. A typical pipeline, on the other hand, has to deal with

different types of error responses from different types of systems – ranging from the

error code of a Cloud API call to the potential silent failure of a configuration

change. The uncertainty inherent within a cloud, which we discussed in Chapter: The

Cloud as a Platform, also introduces some random failures so that a script that was

previously successful may produce an invalid outcome. What this means is that

defensive programming strategies, while important, are not going to be sufficient.

Instead, we advocate deriving an understanding of what should be the desired state of

a process and comparing that with the actual state. In essence, that is the basis of the

approach we discuss in this chapter. Operations processes have several

characteristics that make this approach more tractable than for general business

processes. In particular:

Operations processes manipulate only a few types of entities. In the

rolling upgrade example we use in this chapter these are ELB, ASG,

Launch Configuration, and instances. In general business processes,

there can be a large number of different types of entities that are

being manipulated.

Operations processes have a time frame measured in 10s of minutes if not

in hours. This means that gathering logs, detecting errors, and recovering

from errors in a few minutes is useful. In general business processes, the

time frames can be much shorter.

Operations tools typically generate high quality logs that can be used to

create the process model without a lot of noise.

In order to know what the desired state of a process should be we first need to

discover the process. Once we have done this, we can prepare for error detection,

diagnosis, and recovery. We do this by processing logs from successful executions of

the process. This processing is done offline after we have achieved successful execution

and generated associated logs.

During an execution of the process, we compare the desired state of the process with

the current state of the process. Any difference indicates a reliability problem and

provides the seeds of a recovery strategy. These activities happen online

(synchronously) with the process.

We will use the process of rolling upgrade as our running example throughout the

chapter. A rolling upgrade places a new version of an application into service one

server at a time. It removes a server from service, possibly deleting the server, loads

the new version of the application onto that server or a replacement server, and starts

the newly loaded server. We discussed rolling upgrade in more detail in Chapter:

Deployment. Figure 1 repeated from that chapter shows the rolling upgrade process

used by Asgard on AWS.

Figure 1: Rolling Upgrade process from Asgard on AWS

3. Offline Activities

The process model can be created manually based on your understanding of the

operation and the code/scripts. Alternatively, process-mining techniques can be used

to discover the process, especially from logs. In this section we describe the activities

that are carried out offline based on successful executions of the process. These

activities provide the basis for the online error detection and recovery.

There are a number of reasons for preferring process mining techniques. First,

automation is critical for technology adoption. We are dealing with a large number

of constantly evolving operations. Manual model creation and later maintenance will

incur a very high cost. Secondly, frequently we do not have access to the source

code/scripts of the operation software so understanding of it has to be derived from

externally observable traces such as logs. Thirdly, runtime logs can be used to trigger

tests and diagnosis as the operation process progresses without modifying the

original operation software.

Recall that in a rolling upgrade, a small number, k, of the instances currently

running the old version are taken out of service at a time and replaced with k instances running

the new version. The time taken by each wave of replacement is usually on the order

of minutes, but performing a rolling upgrade for 100s or 1000s of instances using a

small k will take on the order of hours. When the Asgard tool performs a rolling upgrade, it

produces logs such as the ones shown in Listing x.1.

[Figure 1 activity labels: Start rolling upgrade task; Update launch configuration; Sort instances; Remove and deregister old instance from ELB; Terminate old instance; Wait for ASG to start new instance; New instance ready and registered with ELB; Status info; Rolling upgrade task completed]

Listing x.1 Original logs produced by Asgard Rolling Upgrade

"2014-05-26_13:17:36 Started on thread Task:Pushing ami-4583197f into group testworkload-r01 for app testworkload."
"2014-05-26_13:17:38 The group testworkload-r01 has 8 instances. 8 will be replaced, 2 at a time."
"2014-05-26_13:17:38 Remove instances [i-226fa51c] from Load Balancer ELB-01"
"2014-05-26_13:17:39 Deregistered instances [i-226fa51c] from load balancer ELB-01"
"2014-05-26_13:17:42 Terminating instance i-226fa51c"
…
"2014-05-26_13:17:43 Waiting up to 1h 10m for new instance of testworkload-r01 to become Pending."

If you look at the log lines, you get a sense of what the operation is doing as a

process without looking at the source code. For example, the software is pushing an

Amazon Machine Image (AMI) to an instance group that has 8 instances.

This AMI contains the new version of the software. The plan is to upgrade 2

instances at a time until all instances are upgraded. Old instances are then

removed/deregistered from the Elastic Load Balancer (ELB) and terminated, and

the system waits for an instance containing the new version to be launched. Later

this new instance will be added/registered to the ELB (not shown in the listing). You

would expect the replacement steps to loop until all instances are upgraded to

the new version.

Process-mining techniques allow the discovery of a process model as shown in

Figure 1 from these logs without having access to the source code. There are two

basic steps in creating a process model from logs: 1) group the logs based on the

activity they represent and tag them with an activity name and 2) use the tagged logs

to create the process model using a tool such as ProM. Figure 2 shows the logs from

Asgard being stored in Logstash – a log management tool – and then being used for

generating the process model.
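The first of the two steps, grouping log lines by activity and tagging them, can be sketched roughly as follows. This is only an illustration under assumptions: the regular expressions are guesses derived from the lines in Listing x.1, the activity names are borrowed from Figure 1, and `tag_line` and `ACTIVITY_PATTERNS` are hypothetical names rather than part of Asgard, Logstash, or ProM.

```python
import re
from datetime import datetime

# Timestamp format taken from Listing x.1; the surrounding quotes are optional.
LINE_RE = re.compile(r'^"?(\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}) (.*?)"?$')

# Hypothetical tagging rules: (activity name from Figure 1, message pattern).
ACTIVITY_PATTERNS = [
    ("Start rolling upgrade task", re.compile(r"^Started on thread Task:")),
    ("Remove and deregister old instance from ELB",
     re.compile(r"^(Remove|Deregistered) instances \[")),
    ("Terminate old instance", re.compile(r"^Terminating instance ")),
    ("Wait for ASG to start new instance", re.compile(r"^Waiting up to ")),
]

def tag_line(line):
    """Return (timestamp, activity or None, message), or None if unparseable."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    when = datetime.strptime(match.group(1), "%Y-%m-%d_%H:%M:%S")
    message = match.group(2)
    for activity, pattern in ACTIVITY_PATTERNS:
        if pattern.search(message):
            return when, activity, message
    # No rule matched: the line is left for the analyst to cluster and name.
    return when, None, message

sample = '"2014-05-26_13:17:42 Terminating instance i-226fa51c"'
print(tag_line(sample))
```

Lines that match no rule come back with an activity of `None`, which reflects the point that a person still has to decide how such lines should be grouped and named before the tagged logs are handed to a mining tool.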

Figure 2: Using Asgard logs to produce a process model

Asgard logs are not the only source of log information for this operation. In AWS,

a feature called CloudTrail will log all the Cloud API calls. The Asgard rolling

upgrade operation calls the Cloud APIs to complete certain steps, such as

register/terminate/start instances. These Asgard operation steps will leave a trace in

the CloudTrail logs but at a lower level of abstraction – the API call level. Some

other steps such as “Sort instances” will not involve any Cloud API call and thus will not

leave any traces in CloudTrail. It is possible to combine or correlate multiple

sources of information for the same operation process. Not only may this correlation

provide a more useful process model, the correlation itself can be used to associate

causes with effects and use that information for assertions, diagnosis or even

recovery. Figure 3 shows a sample log line from CloudTrail. Notice that this log line

identifies the AWS resource being manipulated – “AutoScaling Group” – as well as

identification information and parameters associated with the resource.

Figure 3: Sample CloudTrail log

Figure 4 shows how the process activities can be correlated with the CloudTrail

logs based on time stamps. This correlation allows determining which AWS

resources are being manipulated during which activities of the process model.

Furthermore, knowing the state of these resources at the beginning of an activity and

knowing the type of manipulations that should occur allows the determination of the

state of the AWS resources at the end of each activity. We elaborate on this idea

in the next section, about online activities.
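A time-stamp-based correlation of this kind might look like the sketch below. It assumes that each activity has already been reduced to a name with a start and end time, and that each API-level record has been reduced to a time, a call name, and a resource identifier. The `correlate` function and the sample records are illustrative only and do not reproduce the real CloudTrail record format.

```python
from datetime import datetime

def correlate(activities, events):
    """Attach each API event to the first activity whose time window contains it.

    activities: list of (name, start, end) taken from the tagged operation logs.
    events: list of (time, api_call, resource) taken from the API-level log.
    """
    by_activity = {name: [] for name, _, _ in activities}
    unmatched = []
    for when, api_call, resource in events:
        for name, start, end in activities:
            if start <= when <= end:
                by_activity[name].append((api_call, resource))
                break
        else:
            # The event fell outside every known activity window.
            unmatched.append((api_call, resource))
    return by_activity, unmatched

def t(clock):
    return datetime.strptime("2014-05-26 " + clock, "%Y-%m-%d %H:%M:%S")

activities = [
    ("Remove and deregister old instance from ELB", t("13:17:38"), t("13:17:39")),
    ("Terminate old instance", t("13:17:42"), t("13:17:43")),
]
events = [  # illustrative values, not real CloudTrail records
    (t("13:17:39"), "DeregisterInstancesFromLoadBalancer", "ELB-01"),
    (t("13:17:42"), "TerminateInstances", "i-226fa51c"),
    (t("13:20:00"), "DescribeInstances", "i-226fa51c"),
]
matched, unmatched = correlate(activities, events)
print(matched)
print(unmatched)
```

Events that fall outside every activity window are returned separately, since they may belong to another operation running at the same time.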

Figure 4: Correlating CloudTrail logs with the process model to determine the

AWS resources manipulated by each activity of the process model.

There are two steps in the development of the process model that require human

intervention.

1. A human must examine the clusters to determine whether they are at a

desired level of granularity and assign names to the clusters. At the two

ends of the spectrum, every log line could be a separate cluster or there

could be only one cluster including all of the log lines. Choosing the

correct level of granularity takes some judgment.

2. A human must also examine the generated process model. It is possible

that there are spurious activities or transitions within the process model.

A human must determine that the model, in fact, represents the process

being modeled.

Creating the process model is an activity that takes less than a day for a skilled

analyst.

4. Online Activities

Recall that our current focus is on the reliability of the rolling upgrade process. This means we want to detect, diagnose and recover from errors that occur during the execution of a rolling upgrade. Error detection and recovery can be done online during the execution of the rolling upgrade. Diagnosis is an offline activity that occurs subsequent to the detection of an error.

Some timing information is useful at this point. Asgard logs are created quickly and can be processed quickly. CloudTrail logs, on the other hand, are currently not available until 15 minutes after the API calls have been made. This means that error detection and recovery proceed using just Asgard logs. The CloudTrail logs are useful for enabling the creation of the desired state of AWS resources at the end of each activity but they cannot be used directly in either error detection or recovery because of the time delay.

4.1 Error Detection

From the log lines being produced by Asgard, we can detect the start and end of each activity step. From the process model, we know the desired sequence of steps. One error detection mode is to look for steps out of the desired sequence. Such an occurrence is called a “conformance error”.

Conformance checking can detect the following types of errors:

Unknown: a log line that is completely unknown.

Error: a log line that corresponds to a known error.

Unfit: a log line that corresponds to a known activity, but that should not happen given the current execution state of the process instance. This can be due to skipped activities (going forward in the process) or undone activities (going backwards).

For example, after seeing the log line "2014-08-26_11:12:30 Remove instances [i-116fb53d] from Load Balancer ELB-01", we should expect a

log line about terminating that instance [i-116fb53d] soon according to the discovered process model. If we do not see that log line within a time

period or see a different known or unknown log line, it indicates some type of error.
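The three conformance outcomes can be sketched with a small successor table. The ordering encoded in `ALLOWED_NEXT` is one reading of Figure 1 and the surrounding text, not the authors' actual model; the "Status info" activity is left out, the timeout case is not modeled, and `classify` is a hypothetical helper.

```python
# Assumed process model: each activity maps to the activities allowed to follow it.
ALLOWED_NEXT = {
    "START": {"Start rolling upgrade task"},
    "Start rolling upgrade task": {"Update launch configuration"},
    "Update launch configuration": {"Sort instances"},
    "Sort instances": {"Remove and deregister old instance from ELB"},
    "Remove and deregister old instance from ELB": {"Terminate old instance"},
    "Terminate old instance": {"Wait for ASG to start new instance"},
    "Wait for ASG to start new instance": {"New instance ready and registered with ELB"},
    "New instance ready and registered with ELB": {
        "Remove and deregister old instance from ELB",  # next wave of the loop
        "Rolling upgrade task completed",
    },
}
KNOWN_ACTIVITIES = set(ALLOWED_NEXT) | {a for nxt in ALLOWED_NEXT.values() for a in nxt}

def classify(previous, activity, known_errors=frozenset()):
    """Classify a tagged log line relative to the previously observed activity."""
    if activity in known_errors:
        return "error"      # the line matches a known error
    if activity not in KNOWN_ACTIVITIES:
        return "unknown"    # the line matches nothing in the model
    if activity in ALLOWED_NEXT.get(previous, set()):
        return "fit"
    return "unfit"          # known activity, but skipped ahead or went backwards

prev = "Remove and deregister old instance from ELB"
print(classify(prev, "Terminate old instance"))
print(classify(prev, "Wait for ASG to start new instance"))
```

A real checker would also track elapsed time since the last conforming line, so that a missing log line can be flagged as well as an unexpected one.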

A conformance error will trigger a message to the operator and also trigger

the error recovery mechanism. The message to the operator is produced within seconds of the production of the log line out of sequence. This enables an operator to know where in the thousands or millions of log

lines being produced to begin looking in order to manually diagnose and recover from the error.

The second type of error detection relies on the AWS resources manipulated by each activity. Recall that through correlating CloudTrail

logs with the process model, we can determine the AWS resources manipulated by each activity. For example, the activity “Remove and

Deregister Old Instance from ELB” should result in one fewer instance being registered with the ELB at the end of the activity than at its beginning.

The concrete instance that will be removed during the run time execution

of the process will be different from the instance that was removed during off line analysis but we know that one instance fewer should exist. By recording the state of the ELB at the beginning of the activity and

comparing that to the state of the ELB at the end of the activity, we can determine 1) that a particular instance has in fact been removed from the ELB

and 2) its instance ID, so that in the next activity, “terminate old instance”, we know exactly which instance should be terminated.
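The before/after comparison for the deregistration activity amounts to a set difference. The sketch below assumes the set of instance IDs registered with the ELB has already been captured at the start and end of the activity; how those sets are obtained is outside the sketch, and `check_deregistration` is a hypothetical name.

```python
def check_deregistration(before, after):
    """Check that exactly one instance left the ELB and none were added.

    before, after: sets of instance IDs registered with the ELB.
    Returns (ok, removed_instance_id); the ID is None when the check fails.
    """
    removed = before - after
    added = after - before
    ok = len(removed) == 1 and not added
    return ok, (next(iter(removed)) if ok else None)

before = {"i-aaa", "i-bbb", "i-ccc"}  # illustrative instance IDs
after = {"i-aaa", "i-ccc"}
print(check_deregistration(before, after))
```

The returned instance ID is what the following activity would be checked against when the old instance is terminated.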

The rolling upgrade process manipulates only four AWS resources: ELB, ASG, Launch Configuration, and instances. This means that saving the

state of these resources at the beginning of an activity can be done quickly. Furthermore, at the end of an activity, we can determine the current state of these resources by querying AWS. The response time of these queries

depends on AWS but our experience is that the response time is on the order of several seconds. These two times mean that comparing the state

of these resources at the end of the activity to the desired state can be done on the order of seconds. Furthermore, the saving and comparing is done by

a process operating independently from Asgard so there is no degradation to the normal rolling upgrade process unless an error is detected.

The kinds of errors that can be detected by these means include errors

caused by failures in the cloud such as the long tail and also errors caused by interference between two teams simultaneously deploying different instances. Examples of the kinds of errors we have detected are:

1. AMI changed during upgrade

2. Key pair management fault

3. Security group configuration fault

4. Instance type changed during upgrade

5. AMI is unavailable during upgrade

6. Key pair is unavailable during upgrade

7. Security group is unavailable during upgrade

8. ELB is unavailable during upgrade

4.2 Error Recovery

Now suppose an error has been detected. We have three sets of states of the AWS resources that are relevant.

1. The state at the beginning of the last activity

2. The current erroneous state

3. The desired state

We know that the current state is erroneous for some reason. There are

two options to automatically recover from the error: roll back to the state at the beginning of the last activity or roll forward to the desired state.

The difficulty of performing either of these activities varies with circumstances. Suppose, for example, that an old instance was not

deregistered from the ELB. Then recovery would involve re-trying the deregistration operation. On the other hand, with more complicated

processes, it may not be possible to return to the state at the beginning of the last activity. For example, when an instance is stopped or terminated, its IP address is lost, and recovering the instance with the same IP address is not possible.
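For the simple ELB case, the choice between rolling back and rolling forward can be illustrated by computing the difference between the current state and each candidate target state. This is a minimal sketch that models state only as the set of instance IDs registered with the ELB; `recovery_plan` and the instance IDs are hypothetical.

```python
def recovery_plan(current, target):
    """Return the instance IDs to register and deregister to move current to target."""
    return {
        "register": sorted(target - current),
        "deregister": sorted(current - target),
    }

start = {"i-old1", "i-old2", "i-old3"}    # state at the beginning of the activity
desired = {"i-old2", "i-old3"}            # i-old1 should have been deregistered
current = {"i-old1", "i-old2", "i-old3"}  # erroneous: nothing was deregistered

roll_forward = recovery_plan(current, desired)
roll_back = recovery_plan(current, start)
print(roll_forward)
print(roll_back)
```

Here rolling forward means retrying the deregistration of one instance, while rolling back requires no action because the erroneous state equals the starting state. States involving resources that cannot be recreated would need a richer model than a set difference.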

5. Error Diagnosis

Repairing an error may not repair the root cause of an error. For example, some of

the errors we mentioned in Section 4.1 (Error Detection) are caused by files that were

corrupted because of two different teams simultaneously deploying different versions

of a system. Consequently, we now turn our attention to diagnosing errors.

We are looking for error diagnoses due to typical causes in cloud operation rather

than bugs in software. Diagnosing bugs in software is certainly important and useful

but outside of the current scope of our research. We use fault trees as a reference

model for error diagnosis. In our fault tree, each node represents a failure or an error

which in turn could be caused by the errors in the child nodes. The children of these

child nodes, in turn, could have caused that error. Figure 5 shows part of the fault

tree we used for diagnosing rolling upgrade errors. Although it involves some effort to

build this tree, this is a one-off effort and the tree can be reused for many different

cloud operations.

Our knowledge of the process progression helps us in diagnosis. Knowing during

which step an error occurred restricts the possible causes to those involving the

AWS resources involved in that step. We can then prune the trees to retain those

elements that affect those resources but exclude the others. Furthermore, historical

data of the types of errors that have occurred allows us to associate probabilities with

each branch of the fault tree and present those probabilities to the operator to assist in

diagnosing the root cause.
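Pruning a fault tree by the resources involved in a step, then ranking the remaining causes, could be sketched as below. The tree shape, the resource labels attached to each node, and the probabilities are all made up for illustration; only the leaf names echo the error list in Section 4.1. `Node`, `prune`, and `ranked_leaves` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    resources: frozenset = frozenset()  # AWS resources this fault involves
    probability: float = 0.0            # assumed to come from historical data
    children: list = field(default_factory=list)

def prune(node, involved):
    """Copy the tree, keeping nodes that touch `involved` or have surviving children."""
    kept = [c for c in (prune(child, involved) for child in node.children) if c is not None]
    if kept or (node.resources & involved):
        return Node(node.name, node.resources, node.probability, kept)
    return None

def ranked_leaves(node):
    """Return (name, probability) for each leaf, most probable first."""
    if not node.children:
        return [(node.name, node.probability)]
    leaves = [leaf for child in node.children for leaf in ranked_leaves(child)]
    return sorted(leaves, key=lambda leaf: -leaf[1])

root = Node("Rolling upgrade step failed", children=[
    Node("AMI is unavailable", frozenset({"Launch Configuration"}), 0.2),
    Node("Key pair is unavailable", frozenset({"Launch Configuration"}), 0.1),
    Node("ELB is unavailable", frozenset({"ELB"}), 0.7),
])
pruned = prune(root, frozenset({"Launch Configuration"}))
print(ranked_leaves(pruned))
```

In this toy tree, an error in a step that only touches the launch configuration leaves two candidate causes, which are presented to the operator in order of their assumed probabilities.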

Figure 5: Part of a fault tree for automated error diagnosis

6. Monitoring

As we mentioned in Chapter: Monitoring, one of the problems with using thresholds for alerts or alarms is the number of false positives if the thresholds are set low. Relaxing the threshold raises the possibility of false negatives. Normal practice is to adjust the thresholds to achieve a tolerable number of false positives.

The number of alerts or alarms is increased during the execution of an operations process because instances are being taken out of or added to service during these processes. Some organizations turn off alerts and alarms during these windows so that they are not flooded with alerts or alarms.

Knowing that a process is underway and knowing which activity of the process is executing allows for dynamic adjustment of monitoring thresholds. For example, if you are performing a rolling upgrade, you know when an instance is going to be taken out of service. This will have the effect of temporarily increasing the load on the other servers if you assume a constant workload during this period. The CPU threshold, for example, can be increased temporarily when a server is taken out of service and lowered again when a new server has been installed and is sharing the load.
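One simple way to compute such a temporary threshold is to scale it by the expected increase in per-server load. The sketch assumes a constant workload spread evenly across the servers in service, so that taking k of n servers out multiplies the load on each remaining server by n / (n - k). The function name and the cap at 100 percent are illustrative choices, not taken from any monitoring tool.

```python
def adjusted_cpu_threshold(base, total, out_of_service, cap=100.0):
    """Scale a CPU alert threshold (in percent) while servers are out of service."""
    in_service = total - out_of_service
    if in_service <= 0:
        raise ValueError("at least one server must remain in service")
    # Constant, evenly spread workload: per-server load grows by total / in_service.
    return min(cap, base * total / in_service)

# 8 instances upgraded 2 at a time, with a normal alert threshold of 60% CPU.
print(adjusted_cpu_threshold(60.0, 8, 2))
```

Once the replacement instances are registered and sharing the load, calling the function with `out_of_service=0` returns the original threshold.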

7. Summary

In this chapter we have summarized some of the research we have been performing since mid-2013. Viewing an operations process as a process allows us to create a process model from log lines and to use that process model to detect and sometimes repair errors caused by operational reasons. The crucial element in our research is the use of the process context to provide information enabling the determination of the desired state of the AWS resources manipulated by the process being modelled. Knowing the desired state allows, in turn, the detection of errors and, potentially, recovery from these errors. Furthermore, knowing the process context could allow for dynamic adjustment of monitoring thresholds to reduce the false positives generated when an operations process is ongoing.

8. Sources

W. v. d. Aalst, Process Mining: Discovery, Conformance and Enhancement of Business Processes, Springer Verlag, 2011.

Sherry Xu, Liming Zhu, Ingo Weber, Len Bass and Daniel Sun, POD-diagnosis: Error diagnosis of sporadic operations on cloud applications, International Conference on Dependable Systems and Networks (DSN), Atlanta, GA, USA, June 2014.

AWS. "Error Codes--Amazon Elastic Compute Cloud,"

http://docs.aws.amazon.com/AWSEC2/latest/APIReference/api-error-codes.html

Logstash – http://logstash.net

Asgard – https://github.com/Netflix/asgard

www.reliableops.com