operations as a process
DESCRIPTION
This chapter of the book DevOps: A Software Architect's Perspective describes the error recovery leverage that can be achieved by treating operations as a process and using process modeling techniques.
For Review Purposes
1 21 September 2014 Copyright 2014 Len Bass, Ingo Weber, Liming Zhu
DevOps: A Software Architect’s Perspective Chapter: Operations as a Process
By
Len Bass, Ingo Weber, Liming Zhu
With
Xiwei Xu and Min Fu
This is a chapter of a book we plan on releasing one chapter at a time in order to get
feedback. Information about this book and pointers to the other chapters can be found
at http://ssrg.nicta.com.au/projects/devops_book. We would like feedback about
what we got right, what we got wrong, and what we left out.
The table of contents is:
Part I - Background
1. What is devops? (initial release 2013.12.10)
2. The cloud as a platform (initial release: 2014.08.08)
3. Operations (initial release: 2014.08.25)
Part II Deployment Pipeline
4. Overall architecture (initial release 2014.01.23)
5. Build/Test (initial release 2014.03.13)
6. Deployment (initial release 2014.03.31)
Part III Cross Cutting Concerns
7. Monitoring (initial release: 2014.04.30)
8. Security (initial release: 2014.06.13)
9. Other ilities (initial release: 2014.09)
10. Business considerations (initial release 2014.08.20)
Part IV Case Studies
11. Case Study - Rafter, Inc (initial release 2014.04.16)
12. Case Study – Continuous Deployment Pipeline
13. Case Study - Migration
...
Part V The Future of DevOps
14. Operations as a process (this chapter)
15. Implications of industry trends on DevOps
Operations as a Process
with Xiwei Xu and Min Fu
If you can't describe what you are doing as a process,
you don't know what you're doing.
W. Edwards Deming
1. Introduction
As we discussed in Chapter: Ilities, the continuous deployment pipeline is not just
another software product exhibiting system-of-systems characteristics; it also has
strong characteristics of a process. This is also true for many other operations such as
diagnosis, backup and recovery, upgrade, and maintenance. Even your favorite cron
jobs and scripts may be pipelining a set of small tools, a familiar concept in the
administrator's world. We can view the world as a large number of such
process-oriented systems operating on your application of interest, not just
sequentially, but also with a lot of simultaneity and parallelism at both the process
level and the task level.
The purpose of this chapter is to discuss the implications of treating all operations
as processes. Treating an operation as a process means that you can discover a
process model from existing operation software, scripts, and their logs. You can
analyze the discovered process models for improvement opportunities. You can use the
process models to monitor the progress of various operations, detect errors, and
recover from them as early as possible. You want to achieve this at the step level
rather than at the level of the whole process, since that leads to earlier error
detection and recovery. You can also use the process models to support other
activities such as setting monitoring rules and diagnosing root causes. Performing
these activities is difficult because they often lack context information about what
is happening. The process model and the monitoring of progress provide that context.
The process models can also be a central place to correlate different events and
monitoring metrics to improve your understanding of the runtime system. The
opportunities are ample. The discovered process model can be used to orchestrate
variations of the original process and could become a mechanism for executing future
applications of the process.
From one perspective, a DevOps process may resemble a human-intensive
process similar to a software development process. You can apply the lessons learned
in improving software development processes to DevOps processes including agility,
life cycle models, quality controls and capability maturity models. The DevOps
movement arguably started as an agile attempt on Dev-Ops collaboration and the
application of software engineering practices to the Ops realm.
From another perspective, process-oriented systems can be seen as workflow
systems or business process management systems. Relevant results from these areas
include mining process models from logs and event traces, process analysis, runtime
monitoring and prediction, process quality improvement and human-intensive
processes. You can see an example of this perspective in our discussion about rolling
upgrade in Chapter: Deployment. In this chapter, we are focusing on the workflow or
business process perspective on operations processes.
One final introductory issue that you should consider is the level of abstraction in
your process models. A process model is a specification of a set of activities that
when carried out will result in the completion of a desired result. This set of activities
can be modeled at a fine grained level – every step in carrying out the process – or at
a coarse grained level – the major activities performed during the process. The
modeling level depends on the richness of the source of discovering the process
model and the results you wish to obtain from the model. As with a software system,
a process model can be understood from its run time properties – performance,
reliability, and security being three important qualities – as well as its development
time properties – interoperability and modifiability being two. In this chapter, we
focus on research that we have performed into the reliability of operations processes
whose process model specification is correct. Other perspectives involve ensuring the
correctness of the process (and its model), improving the performance of the
execution of the process, or constructing the model efficiently at a desired level of
granularity.
2. Motivation and Overview
As discussed in the Chapter: Ilities, reliability refers to the capability of the overall
deployment pipeline and its individual pieces to maintain its service provision for
defined periods of time. In the past, the use of the individual pieces in the pipeline
would be infrequent as software was released at a frequency of weeks if not months.
Now continuous delivery/deployment practices are making sporadic operations such
as deployment or integration test weekly or daily occurrences. As we mentioned
when we discussed testability, operation and infrastructure scripts are difficult to test
because mimicking the real complexity of the production environment is difficult.
Traditional exception handling mechanisms are usually designed with a single
language environment in mind. A typical pipeline, on the other hand, has to deal with
different types of error responses from different types of systems – ranging from the
error code of a Cloud API call to the potential silent failure of a configuration
change. The uncertainty inherent within a cloud that we discussed in Chapter: The
Cloud as a Platform, also introduces some random failures so that a script that was
previously successful may produce an invalid outcome. What this means is that
defensive programming strategies, while important, are not going to be sufficient.
Instead, we advocate deriving an understanding of what should be the desired state of
a process and comparing that with the actual state. In essence, that is the basis of the
approach we discuss in this chapter. Operations processes have several
characteristics that make this approach more tractable than for general business
processes. In particular:
- Operations processes manipulate only a few types of entities. In the
  rolling upgrade example we use in this chapter these are ELB, ASG,
  Launch Configuration, and instances. In general business processes,
  a large number of different types of entities can be manipulated.
- Operations processes have a time frame measured in tens of minutes if not
  hours. This means that gathering logs, detecting errors, and recovering
  from errors in a few minutes is useful. In general business processes, the
  time frames can be much shorter.
- Operations tools typically generate high quality logs that can be used to
  create the process model without a lot of noise.
In order to know what the desired state of a process should be, we first need to
discover the process. Once we have done this, we can prepare for error detection,
diagnosis, and recovery. We do this by processing logs from successful executions of
the process. This processing is done offline, after we have achieved successful
executions and generated the associated logs.
During an execution of the process, we compare the desired state of the process with
the current state of the process. Any difference indicates a reliability problem and
provides the seeds of a recovery strategy. These activities happen online
(synchronously) with the process.
We will use the process of rolling upgrade as our running example throughout the
chapter. A rolling upgrade places a new version of an application into service one
server at a time. It removes a server from service, possibly deleting the server, loads
the new version of the application onto that server or a replacement server, and starts
the newly loaded server. We discussed rolling upgrade in more detail in Chapter:
Deployment. Figure 1 repeated from that chapter shows the rolling upgrade process
used by Asgard on AWS.
Figure 1: Rolling Upgrade process from Asgard on AWS
3. Offline Activities
The process model can be created manually based on your understanding of the
operation and the code/scripts. Alternatively, process-mining techniques can be used
to discover the process, especially from logs. In this section we describe the activities
that are carried out off line based on successful executions of the process. These
activities provide the basis for the on line error detection and recovery.
There are a number of reasons for preferring process mining techniques. First,
automation is critical for technology adoption. We are dealing with a large number
of constantly evolving operations. Manual model creation and later maintenance will
incur a very high cost. Secondly, frequently we do not have access to the source
code/scripts of the operation software so understanding of it has to be derived from
externally observable traces such as logs. Thirdly, runtime logs can be used to trigger
tests and diagnosis as the operation process progresses without modifying the
original operation software.
Recall that in a rolling upgrade, a small number k of instances currently running
the old version are taken out of service at a time and replaced with k instances
running the new version. The time taken by each wave of replacement is usually on the
order of minutes, but performing a rolling upgrade for hundreds or thousands of
instances using a small k will take on the order of hours. The Asgard tool that
performs the rolling upgrade produces logs such as the ones shown in Listing x.1.
[Figure 1 activity labels: Start rolling upgrade task; Update launch configuration; Sort instances; Status info; Rolling upgrade task completed; Remove and deregister old instance from ELB; New instance ready and registered with ELB; Terminate old instance; Wait for ASG to start new instance]
Listing x.1 Original logs produced by Asgard Rolling Upgrade
"2014-05-26_13:17:36 Started on thread Task:Pushing ami-4583197f into group testworkload-r01 for app testworkload."
"2014-05-26_13:17:38 The group testworkload-r01 has 8 instances. 8 will be replaced, 2 at a time."
"2014-05-26_13:17:38 Remove instances [i-226fa51c] from Load Balancer ELB-01"
"2014-05-26_13:17:39 Deregistered instances [i-226fa51c] from load balancer ELB-01"
"2014-05-26_13:17:42 Terminating instance i-226fa51c"
…
"2014-05-26_13:17:43 Waiting up to 1h 10m for new instance of testworkload-r01 to become Pending."
If you look at the log lines, you get a sense of what the operation is doing as a
process without looking at the source code. For example, the software is pushing an
Amazon Machine Image (AMI) to an instance group that has 8 instances.
This AMI contains the new version of the software. The plan is to upgrade 2
instances at a time until all instances are upgraded. Old instances are then
removed/deregistered from the Elastic Load Balancer (ELB) and terminated, while
the system waits for an instance containing the new version to be launched. Later,
this new instance will be added/registered to the ELB (not shown in the listing). You
would expect the replacement step to loop until all instances are upgraded to
the new version.
Process-mining techniques allow the discovery of a process model as shown in
Figure 1 from these logs without having access to the source code. There are two
basic steps in creating a process model from logs: 1) group the logs based on the
activity they represent and tag them with an activity name and 2) use the tagged logs
to create the process model using a tool such as ProM. Figure 2 shows the logs from
Asgard being stored in Logstash – a log management tool – and then being used for
generating the process model.
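The first of the two steps, tagging log lines with activity names, can be sketched as follows. This is a minimal illustration rather than the tooling used for the chapter: the regular expressions are guesses based only on the lines in Listing x.1, the activity names are the Figure 1 labels, and `ACTIVITY_PATTERNS` and `tag_line` are hypothetical names.

```python
import re

# Hypothetical patterns (assumptions for illustration): each regular expression
# is mapped to an activity name taken from the Figure 1 labels.
ACTIVITY_PATTERNS = [
    (re.compile(r"Started on thread Task:Pushing"),
     "Start rolling upgrade task"),
    (re.compile(r"Remove instances \[.*\] from Load Balancer"),
     "Remove and deregister old instance from ELB"),
    (re.compile(r"Deregistered instances \[.*\] from load balancer"),
     "Remove and deregister old instance from ELB"),
    (re.compile(r"Terminating instance"),
     "Terminate old instance"),
    (re.compile(r"Waiting up to .* for new instance"),
     "Wait for ASG to start new instance"),
]


def tag_line(line):
    """Return (timestamp, activity) for one log line.

    The activity is None when no pattern matches, which leaves the line
    for a human to assign to a cluster.
    """
    # Drop surrounding whitespace and the quotes used in the listing,
    # then split the timestamp from the message at the first space.
    timestamp, _, message = line.strip().strip('"').partition(" ")
    for pattern, activity in ACTIVITY_PATTERNS:
        if pattern.search(message):
            return timestamp, activity
    return timestamp, None
```

The tagged (timestamp, activity) pairs would then be exported in whatever event-log format the process-mining tool accepts. Lines that come back untagged are exactly the ones where human judgment about granularity is needed.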
Figure 2: Using Asgard logs to produce a process model
Asgard logs are not the only source of log information for this operation. In AWS,
a feature called CloudTrail will log all the Cloud API calls. The Asgard rolling
upgrade operation calls the Cloud APIs to complete certain steps, such as
register/terminate/start instances. These Asgard operation steps will leave a trace in
the CloudTrail logs but at a lower level of abstraction – the API call level. Some
other steps such as “Sort Instance” will not involve any Cloud API call thus not
leaving any traces in CloudTrail. It is possible to combine or correlate multiple
sources of information for the same operation process. Not only may this correlation
provide a more useful process model, the correlation itself can be used to associate
causes with effects and use that information for assertions, diagnosis or even
recovery. Figure 3 shows a sample log line from CloudTrail. Notice that this log line
identifies the AWS resource being manipulated – “AutoScaling Group” – as well as
identification information and parameters associated with the resource.
Figure 3: Sample CloudTrail log
Figure 4 shows how the process activities can be correlated with the CloudTrail
logs based on time stamps. This correlation allows determining which AWS
resources are being manipulated during which activities of the process model.
Furthermore, knowing the state of these resources at the beginning of an activity and
knowing the type of manipulations that should occur allows the determination of the
state of the AWS resources at the end of each activity. We elaborate on this idea
during the next section about on line activities.
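The time-stamp correlation described above can be sketched as follows. The sketch assumes each activity has already been given a start and end time from the tagged Asgard logs, and it reduces each API-call record to a hypothetical (time, resource type) pair; real CloudTrail records carry far more fields. The names `parse` and `correlate` and the sample data are illustrative only.

```python
from datetime import datetime


def parse(ts):
    """Parse a timestamp in the format used by the Asgard log lines."""
    return datetime.strptime(ts, "%Y-%m-%d_%H:%M:%S")


def correlate(activities, api_events):
    """Map each activity name to the resource types touched during it.

    activities: list of (name, start, end); intervals are half-open [start, end)
    api_events: list of (time, resource_type) pairs (a simplification)
    """
    touched = {name: set() for name, _, _ in activities}
    for when, resource in api_events:
        for name, start, end in activities:
            if start <= when < end:
                touched[name].add(resource)
    return touched


# Illustrative data only: the intervals and events are made up for this sketch.
ACTIVITIES = [
    ("Remove and deregister old instance from ELB",
     parse("2014-05-26_13:17:38"), parse("2014-05-26_13:17:42")),
    ("Terminate old instance",
     parse("2014-05-26_13:17:42"), parse("2014-05-26_13:17:43")),
]
API_EVENTS = [
    (parse("2014-05-26_13:17:39"), "ELB"),
    (parse("2014-05-26_13:17:42"), "instance"),
]
RESULT = correlate(ACTIVITIES, API_EVENTS)
```

Half-open intervals are used so that an event falling exactly on the boundary between two consecutive activities is attributed to only one of them.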
Figure 4: Correlating CloudTrail logs with the process model to determine the
AWS resources manipulated by each activity of the process model.
There are two steps in the development of the process model that require human
intervention.
1. A human must examine the clusters to determine whether they are at a
desired level of granularity and assign names to the clusters. At the two
ends of the spectrum, every log line could be a separate cluster or there
could be only one cluster including all of the log lines. Choosing the
correct level of granularity takes some judgment.
2. A human must also examine the generated process model. It is possible
that there are spurious activities or transitions within the process model.
A human must determine that the model, in fact, represents the process
being modeled.
Creating the process model is an activity that takes less than a day for a skilled
analyst.
4. Online Activities
Recall that our current focus is on the reliability of the rolling upgrade process.
This means we want to detect, diagnose and recover from errors that occur during the
execution of a rolling upgrade. Error detection and recovery can be done on line
during the execution of the rolling upgrade. Diagnosis is an off line activity that
occurs subsequent to the detection of an error.
Some timing information is useful at this point. Asgard logs are created quickly and
can be processed quickly. CloudTrail logs, on the other hand, currently do not become
available until 15 minutes after the API calls have been made. This means that error
detection and recovery proceed using just the Asgard logs. The CloudTrail logs are
useful for enabling the creation of the desired state of AWS resources at the end of
each activity, but they cannot be used directly in either error detection or recovery
because of the time delay.
4.1 Error Detection
From the log lines being produced by Asgard, we can detect the start and end of each
activity step. From the process model, we know the desired sequence of steps. One
error detection mode is to look for steps out of the desired sequence. Such an
occurrence is called a “conformance error”.
Conformance checking can detect the following types of errors:
- Unknown: a log line that is completely unknown.
- Error: a log line that corresponds to a known error.
- Unfit: a log line that corresponds to a known activity, but one that should
  not happen given the current execution state of the process instance. This can
  be due to skipped activities (going forward in the process) or undone
  activities (going backwards).
For example, after seeing the log line "2014-08-26_11:12:30 Remove instances [i-116fb53d] from Load Balancer ELB-01", we should expect a
log line about terminating that instance [i-116fb53d] soon according to the discovered process model. If we do not see that log line within a time
period or see a different known or unknown log line, it indicates some type of error.
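The conformance check can be sketched as follows. This is an illustration under stated assumptions, not the implementation behind the chapter: `MODEL` is a hypothetical, simplified version of the replacement loop built from the Figure 1 activity names, `KNOWN_ERRORS` holds a placeholder tag, and the five-minute limit is an arbitrary default. The `observed` argument is the activity name assigned to an incoming log line, or None if the line could not be matched.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified model of the replacement loop: each activity maps
# to the set of activities that may legitimately follow it.
MODEL = {
    "Sort instances": {"Remove and deregister old instance from ELB"},
    "Remove and deregister old instance from ELB": {"Terminate old instance"},
    "Terminate old instance": {"Wait for ASG to start new instance"},
    "Wait for ASG to start new instance": {
        "New instance ready and registered with ELB",
    },
    "New instance ready and registered with ELB": {
        "Remove and deregister old instance from ELB",  # next loop iteration
        "Rolling upgrade task completed",
    },
}

# Placeholder tag for log lines recognized as known errors (an assumption).
KNOWN_ERRORS = {"Known error"}


def classify(current, observed):
    """Classify an observed activity against the current process state."""
    if observed is None:
        return "unknown"
    if observed in KNOWN_ERRORS:
        return "error"
    # Several log lines may belong to the same activity, so staying in the
    # current activity also counts as conforming.
    if observed == current or observed in MODEL.get(current, set()):
        return "fit"
    return "unfit"


def timed_out(last_seen, now, limit=timedelta(minutes=5)):
    """Return True if the expected next log line is overdue."""
    return now - last_seen > limit
```

With this model, a "Wait for ASG to start new instance" line arriving directly after a "Remove and deregister old instance from ELB" line is classified as unfit, because the termination step was skipped.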
A conformance error will trigger a message to the operator and also trigger
the error recovery mechanism. The message to the operator is produced within seconds
of the production of the out-of-sequence log line. This tells the operator where,
among the thousands or millions of log lines being produced, to begin looking in
order to manually diagnose and recover from the error.
The second type of error detection relies on the AWS resources manipulated by each
activity. Recall that through correlating CloudTrail logs with the process model, we
can determine the AWS resources manipulated by each activity. For example, the
activity “Remove and Deregister Old Instance from ELB” should result in one fewer
instance being registered with the ELB at the end of the activity than at the
beginning.
The concrete instance that will be removed during the run time execution of the
process will be different from the instance that was removed during off line
analysis, but we know that one fewer instance should exist. By recording the state of
the ELB at the beginning of the activity and comparing that to the state of the ELB
at the end of the activity, we can determine 1) that a particular instance has in
fact been removed from the ELB and 2) the ID of that instance, so that in the next
activity, “terminate old instance”, we know exactly which instance should be
terminated.
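The before-and-after comparison can be sketched as follows. The sketch assumes the set of instance IDs registered with the ELB has already been captured at the start and end of the activity; how those sets are obtained from AWS is not shown. The function name `check_deregistration` is hypothetical, and the second instance ID in the usage example is invented.

```python
def check_deregistration(before, after, expected_removed=1):
    """Compare ELB registrations before and after a deregistration activity.

    Returns (ok, removed): ok is True when exactly `expected_removed`
    instances disappeared and none appeared; removed is the set of IDs
    that are no longer registered.
    """
    before, after = set(before), set(after)
    removed = before - after
    unexpected = after - before
    ok = len(removed) == expected_removed and not unexpected
    return ok, removed


# Usage with one ID from Listing x.1 and one invented ID:
ok, removed = check_deregistration(
    {"i-226fa51c", "i-0000aaaa"},  # registered at the start of the activity
    {"i-0000aaaa"},                # registered at the end of the activity
)
```

The returned set identifies the instance that the following activity is expected to terminate.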
The rolling upgrade process manipulates only four AWS resources: ELB, ASG, Launch Configuration, and instances. This means that saving the
state of these resources at the beginning of an activity can be done quickly. Furthermore, at the end of an activity, we can determine the current state of these resources by querying AWS. The response time of these queries
depends on AWS but our experience is that the response time is on the order of several seconds. These two times mean that comparing the state
of these resources at the end of the activity to the desired state can be done on the order of seconds. Furthermore, the saving and comparing is done by
a process operating independently from Asgard so there is no degradation to the normal rolling upgrade process unless an error is detected.
The kinds of errors that can be detected by these means include errors
caused by failures in the cloud such as the long tail and also errors caused by interference between two teams simultaneously deploying different instances. Examples of the kinds of errors we have detected are:
1. AMI changed during upgrade
2. Key pair management fault
3. Security group configuration fault
4. Instance type changed during upgrade
5. AMI is unavailable during upgrade
6. Key pair is unavailable during upgrade
7. Security group is unavailable during upgrade
8. ELB is unavailable during upgrade
4.2 Error Recovery
Now suppose an error has been detected. We have three sets of states of
the AWS resources that are relevant.
1. The state at the beginning of the last activity
2. The current erroneous state
3. The desired state
We know that the current state is erroneous for some reason. There are
two options to automatically recover from the error: roll back to the state at the beginning of the last activity or roll forward to the desired state.
The difficulty of performing either of these activities varies with circumstances. Suppose, for example, that an old instance was not
deregistered from the ELB. Then recovery would involve re-trying the deregistration operation. On the other hand, with more complicated
processes, it may not be possible to return to the state at the beginning of the last activity. When an instance is paused or deleted, its IP address is lost. Recovering the instance with the correct IP address is not possible.
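The two recovery options can be sketched as a difference between states. This sketch assumes a state is simply the set of instance IDs registered with the ELB and that "register" and "deregister" are the only available actions; the function name `plan` and the instance IDs are hypothetical. Rolling forward targets the desired state, while rolling back targets the state at the beginning of the last activity.

```python
def plan(current, target):
    """List the actions needed to move the ELB from `current` to `target`."""
    current, target = set(current), set(target)
    steps = [("deregister", i) for i in sorted(current - target)]
    steps += [("register", i) for i in sorted(target - current)]
    return steps


# Illustrative states for a failed deregistration (IDs are invented).
START = {"i-old", "i-2"}    # state at the beginning of the last activity
CURRENT = {"i-old", "i-2"}  # erroneous state: nothing was deregistered
DESIRED = {"i-2"}           # desired state at the end of the activity

ROLL_FORWARD = plan(CURRENT, DESIRED)
ROLL_BACK = plan(CURRENT, START)
```

Here rolling forward amounts to retrying the deregistration, and rolling back requires no action because the state never changed. A sketch like this cannot express recoveries that are impossible in principle, such as restoring a lost IP address.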
5. Error Diagnosis
Repairing an error may not repair its root cause. For example, some of
the errors we mentioned in Section 4.1, Error Detection, are caused by files that were
corrupted because two different teams were simultaneously deploying different versions
of a system. Consequently, we now turn our attention to diagnosing errors.
We are looking for error diagnoses due to typical causes in cloud operation rather
than bugs in software. Diagnosing bugs in software is certainly important and useful
but outside of the current scope of our research. We use fault trees as a reference
model for error diagnosis. In our fault tree, each node represents a failure or an error
which in turn could be caused by the errors in its child nodes. The children of these
child nodes, in turn, could have caused that error. Figure 5 shows part of the fault
tree we used for diagnosing rolling upgrade errors. Although it involves some effort to
build this tree, this is a one-off effort and the tree can be reused for many different
cloud operations.
Our knowledge of the process progression helps us in diagnosis. Knowing during
which step an error occurred restricts the possible causes to those involving the
AWS resources involved in that step. We can then prune the trees to retain those
elements that affect those resources but exclude the others. Furthermore, historical
data of the types of errors that have occurred allows us to associate probabilities with
each branch of the fault tree and present those probabilities to the operator to assist in
diagnosing the root cause.
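Pruning and ranking a fault tree can be sketched as follows. The tree, the resource labels attached to each node, and the probabilities are invented for illustration; only the leaf names echo error types listed in Section 4.1. `Node`, `prune`, and `ranked_leaves` are hypothetical names, not part of any tool described in this chapter.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    resources: frozenset = frozenset()  # AWS resources this fault involves
    probability: float = 0.0            # from historical data (made up here)
    children: list = field(default_factory=list)


def prune(node, involved):
    """Copy the tree, keeping only branches that touch `involved` resources."""
    kept = [p for p in (prune(c, involved) for c in node.children)
            if p is not None]
    if kept or (node.resources & involved):
        return Node(node.name, node.resources, node.probability, kept)
    return None


def ranked_leaves(node):
    """Return (name, probability) for each leaf, most probable first."""
    if not node.children:
        return [(node.name, node.probability)]
    leaves = [leaf for child in node.children for leaf in ranked_leaves(child)]
    return sorted(leaves, key=lambda leaf: leaf[1], reverse=True)


# Illustrative tree: resource labels and probabilities are assumptions.
TREE = Node("Rolling upgrade error", children=[
    Node("ELB is unavailable", frozenset({"ELB"}), 0.2),
    Node("AMI is unavailable", frozenset({"Launch Configuration"}), 0.5),
    Node("Key pair is unavailable", frozenset({"Launch Configuration"}), 0.3),
])

CANDIDATES = ranked_leaves(prune(TREE, {"Launch Configuration"}))
```

If the failing step involved only the Launch Configuration, the ELB branch is dropped and the operator sees the remaining candidates ordered by their historical likelihood.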
Figure 5: Part of a fault tree for automated error diagnosis
6. Monitoring
As we mentioned in Chapter: Monitoring, one of the problems with using thresholds for
alerts or alarms is the number of false positives if the thresholds are set low.
Relaxing the threshold raises the possibility of false negatives. Normal practice is
to adjust the thresholds to achieve a tolerable number of false positives.
The number of alerts or alarms is increased during the execution of an operations
process because instances are being taken out of or added to service during these
processes. Some organizations turn off alerts and alarms during these windows so that
they are not flooded with alerts or alarms.
Knowing that a process is underway and knowing the current activity of the process
allows for dynamic adjustment of monitoring thresholds. For example, if you are
performing a rolling upgrade, you know when an instance is going to be taken out of
service. This will have the effect of temporarily increasing the load on the other
servers if you assume a constant workload during this period. The CPU threshold, for
example, can be increased temporarily when a server is taken out of service and
lowered again when a new server has been installed and is sharing the load.
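One possible adjustment rule can be sketched as follows. It assumes, beyond what the text states, that the constant workload is spread evenly across the servers in service, so that per-server load scales by total / in-service. The function name `adjusted_threshold`, the proportional scaling, and the cap at 100 percent are all assumptions of this sketch.

```python
def adjusted_threshold(base, total, out_of_service, cap=100.0):
    """Scale a per-server CPU alert threshold while servers are out of service.

    base: normal threshold in percent
    total: number of servers normally sharing the load
    out_of_service: servers currently removed by the operations process
    """
    in_service = total - out_of_service
    if in_service <= 0:
        raise ValueError("at least one server must remain in service")
    # With an even spread of a constant workload, each remaining server
    # carries total / in_service times its usual share.
    return min(cap, base * total / in_service)


# 8 servers with a 60% threshold; 2 are taken out during a rolling upgrade.
DURING_UPGRADE = adjusted_threshold(60.0, 8, 2)
AFTER_UPGRADE = adjusted_threshold(60.0, 8, 0)
```

The threshold returns to its base value as soon as the replacement servers are sharing the load, which is the moment the process model signals that the activity has completed.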
7. Summary
In this chapter we have summarized some of the research we have been
performing since mid-2013. Viewing an operations process as a process allows us to create a process model from log lines and to use that process
model to detect and sometimes repair errors caused by operational reasons. The crucial element in our research is the use of the process context to provide information enabling the determination of the desired
state of the AWS resources manipulated by the process being modelled. Knowing the desired state allows, in turn, the detection of errors and,
potentially, recovery from these errors. Furthermore, knowing the process context could allow for dynamic adjustment of monitoring thresholds to reduce the false positives generated when an operations process is
ongoing.
8. Sources
W. v. d. Aalst, Process Mining: Discovery, Conformance and Enhancement of
Business Processes, Springer Verlag, 2011.
Sherry Xu, Liming Zhu, Ingo Weber, Len Bass and Daniel Sun, "POD-diagnosis: Error
diagnosis of sporadic operations on cloud applications," International Conference on
Dependable Systems and Networks (DSN), Atlanta, GA, USA, June 2014.
AWS, "Error Codes--Amazon Elastic Compute Cloud,"
http://docs.aws.amazon.com/AWSEC2/latest/APIReference/api-error-codes.html
Logstash – http://logstash.net
Asgard – https://github.com/Netflix/asgard
www.reliableops.com