
Enhancing Fault-Observability by the Use of Feedback Control

Mohamed A. Bin Shams, Hector M. Budman,* and Thomas A. Duever

Department of Chemical Engineering, University of Waterloo, 200 University Avenue, Waterloo, Ontario, Canada N2L 3G1

This paper deals with the detection of faults in the Tennessee Eastman problem (TEP) that have been found unobservable by previous studies. Hotelling's T2 charting based on the cumulative sums of the faults' relevant variables was successful in detecting these faults, however, with significant delays. To reduce these delays, it is proposed to retune the feedback controllers involving the controlled and manipulated variables used for detection. An optimization-based methodology that searches for an optimal trade-off between fault detection and control performance is formulated. The resulting controller design is compared with a design previously reported in the literature, showing that significant reductions in the detection delays and overall reductions in the plant's costs can be achieved by proper tuning of the controllers. Both individual and simultaneous occurrence of faults is considered. Although the proposed method is illustrated using the TEP, the approach is general and can be applied to other processes.

1. Introduction

An important aspect of the economical and safe operation of a chemical process is the rapid detection and removal of malfunctions or faults. A fault may be defined as a deviation of at least one variable from an acceptable level.1 Different methods have been proposed in the literature for fault detection and fault diagnosis.1-5 These methods can be broadly categorized into three main classes: (1) analytical methods, which are solely based on first-principles models, e.g., observer-based techniques; (2) empirical methods, e.g., univariate and multivariate statistical methods; and (3) semiempirical methods, which combine empirical models with prior knowledge about the system. Each of these methods has its own advantages and disadvantages depending on the problem. Analytical methods require the use of first-principles models, thus making them less attractive for large-scale systems. Therefore, they are not considered in the current work, and instead an empirical method is used. A number of researchers suggest combining these methods to improve detection. For example, Chiang and Braatz6 and Lee et al.7 have observed that empirical methods are enhanced if knowledge of the process is used to account for the fundamental causal relationships among variables. The method used in the current work makes use of process knowledge to enhance detection. Different problems associated with fault detection and diagnosis have been reported in the literature.1-5 Distinguishability and observability are two central problems related to the fault detection and diagnosis problem.8 A distinguishability problem arises when a system exhibits similar responses in the variables used for detection while different faults occur.8-10 On the other hand, fault observability is related to the detection phase, and it is interpreted as a delay in identifying the occurrence of a fault. This work is focused on the observability problem and proposes a methodology to mitigate it through feedback control.

A recent study by the authors of the current paper11 has proposed the application of cumulative-sum (CUSUM) based models for the detection of faults in the Tennessee Eastman problem (TEP).12 Bin Shams et al.11 proposed the application of location CUSUM (LCS) and scale CUSUM (SCS) based models to detect three particular faults that have been found unobservable by other algorithms previously applied to the TEP.5-7,13-16 After demonstrating the detection capability of the univariate CUSUM-based methods for each one of the three faults, a Hotelling's T2 chart based on a cumulative sum of the observations was proposed for the individual or simultaneous detection of these three faults. The latter was found to be essential when large numbers of correlated variables are considered.

To quantify fault observability when using the CUSUM-based statistics, Bin Shams et al.11 proposed the out-of-control average run length (ARLoc). Their CUSUM-based statistics were successful in detecting the three unobservable faults (i.e., compared to other methods reported in the literature); however, long detection times, i.e., large ARLoc values, were obtained. To mitigate this problem, and since the faults are observed from variables that are embedded within control loops, the current work studies the effect of the controllers' tuning parameters on fault observability.

Despite the evident interaction between fault detection and control, research on fault detection and on control algorithms has evolved separately, mainly because of the challenges associated with each of these problems.21 The interaction between control and detection stems from the fact that most of the monitored variables are either process variables (PV) or manipulated variables (MV) operated within feedback control loops. Since detection is highly dependent on the dynamic response of the monitored variables, there is a possibility to speed up the detection of faults by retuning the controllers involving these variables. The trade-off between fault detection and control performance generally arises from the fact that faster detection requires higher variability in the variables used for detection, whereas higher variability generally translates into poorer control associated with lower product uniformity or higher wear of actuators. Bin Shams et al.17 have recently addressed the interaction between control and fault detection for a simple chemical process by using linear transfer function models identified in closed-loop operation. However, that approach cannot be easily generalized to a more complex process such as the TEP, since it requires the identification of a large number of models and since it is limited to linear systems only.17

* To whom correspondence should be addressed. Tel.: +1 519 888 4567, ext 36980. Fax: +1 519 746 4979. E-mail: hbudman@uwaterloo.ca.

Ind. Eng. Chem. Res. 2011, 50, 916–925

10.1021/ie101238q © 2011 American Chemical Society. Published on Web 12/02/2010.


The current study proposes a numerical-simulation-based approach for finding an optimal trade-off between control and fault detection. The tuning parameters of the controllers involving the variables used for fault detection are used as optimization variables. The fault detection algorithm is based on CUSUM statistics. Since the simulations are conducted with the full nonlinear dynamic model of the process, assumptions of linearity are not needed as in the previous study of Bin Shams et al.17 Moreover, the tedious identification of transfer functions required by the previous study is avoided in the current approach.

The paper is organized as follows: a description of the CUSUM statistics and the metric used to gauge fault observability are given in section 2. Section 3 illustrates the observability problem through three particular faults in the Tennessee Eastman process and demonstrates the ability of the CUSUM-based statistics to tackle these faults. In section 4, the interaction between feedback control and fault detection is discussed. The methodology for simultaneously optimizing fault detection and feedback control to alleviate the observability problem is given in section 5. Conclusions are given in section 6.

2. Preliminaries

2.1. Cumulative Sum Based Control Charts. Shewhart charts are often used for the assessment of faults.18,19 A key disadvantage of these statistics is that they use only current time-interval information while not accounting for the entire time history. Hence, those charts are relatively insensitive to small shifts in the process variables at small signal-to-noise ratios. These shortcomings have motivated the use of other alternatives such as the univariate or multivariate versions of the CUSUM-based charts.11 Three types of statistical charts are used in this paper: the location cumulative sum (LCS), the scale cumulative sum (SCS), and the Hotelling's T2. The LCS and SCS algorithms are examples of univariate statistics, while the Hotelling's T2 is a multivariate statistic. Both the LCS and SCS are performed using the following two statistics, corresponding to a two-sided hypothesis test:19

$C_i^+ = \max[0,\; C_{i-1}^+ + x_i - (\mu_{ic} + K)]$   (1)

$C_i^- = \max[0,\; C_{i-1}^- + (\mu_{ic} - K) - x_i]$   (2)

$C_0^+ = C_0^- = 0$

where K, μic, and Ci+ and Ci− are the slack variable, the in-control mean, and the upper and the lower CUSUM statistics, respectively. The subscript ic stands for the in-control state. The role of the slack variable is to introduce robustness to noise. At every new sample, the statistics in eqs 1 and 2 accumulate small deviations in the mean (LCS) or small changes in the variability (SCS). These accumulations are corrected using the slack variable and compared to zero using the max operation. When either one of the two statistics in eqs 1 and 2 exceeds a threshold H, the process is considered to be out of control. Following their respective definitions, the LCS is especially effective for detecting changes in the average, whereas the SCS is suitable for detecting changes in variability. Guidelines for the selection of K and H have been reported.19 Typically K is selected to be half of the expected shift in either μ or σ in standard deviation units. H is determined so that a prespecified average run length ARLoc, to be defined in the following section, is achieved. The latter is also expressed in standard deviation units. It should be noted that when using eqs 1 and 2, the LCS uses the original raw data xi, whereas the SCS uses the following standardized quantity:

$x_i = \dfrac{\sqrt{|y_i|} - 0.822}{0.349}$   (3)

where yi denotes the original raw data. A derivation of the quantities in eq 3 is given in the Appendix. Although the LCS and SCS can be applied to individual measurements, there are many situations in which a pooled representative statistic for more than one variable is necessary. This is especially important when it is desired to present the operators with compact information to simplify the monitoring activities for the process. For that purpose, when the monitored variables are normally distributed and statistically independent, the Hotelling's T2 can be used. The Hotelling's T2 statistic and the upper and lower control limits are given by

$T^2 = (x - \bar{x})^T S^{-1} (x - \bar{x})$   (4)

$\mathrm{UCL} = \dfrac{p(m+1)(m-1)}{m^2 - mp} F_{\alpha,p,m-p}, \qquad \mathrm{LCL} = 0$   (5)

where p is the number of monitored variables, m is the total number of samples, x is the sample vector, x̄ is the in-control mean, S is the estimated covariance matrix, and F is the critical value of the F distribution at the α significance level. In the current work the cumulative sum statistics are combined into one statistic, namely the Hotelling's T2, as described in the following section.
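To make the recursions concrete, the following Python sketch implements eqs 1-3 under the stated assumptions (known in-control mean and standard deviation; the K = 0.5σ and H = 4σ defaults follow the guidelines cited above). The function names are illustrative, not the authors' code:

```python
import numpy as np

def lcs(x, mu_ic, sigma_ic, k=0.5, h=4.0):
    """Two-sided location CUSUM (eqs 1 and 2).
    Returns (C_plus, C_minus, first_alarm_index), with C_0^+ = C_0^- = 0."""
    K = k * sigma_ic                  # slack variable in original units
    H = h * sigma_ic                  # decision threshold in original units
    c_plus, c_minus, alarm = 0.0, 0.0, None
    cp, cm = [], []
    for i, xi in enumerate(x):
        c_plus = max(0.0, c_plus + xi - (mu_ic + K))
        c_minus = max(0.0, c_minus + (mu_ic - K) - xi)
        cp.append(c_plus)
        cm.append(c_minus)
        if alarm is None and (c_plus > H or c_minus > H):
            alarm = i                 # first out-of-control sample
    return np.array(cp), np.array(cm), alarm

def scs_transform(y, mu_ic, sigma_ic):
    """Standardized quantity of eq 3; the SCS runs the same recursions on
    this output. The raw data are first scaled to N(0,1), consistent with
    the Appendix derivation (an assumption about the paper's notation)."""
    z = (np.asarray(y, dtype=float) - mu_ic) / sigma_ic
    return (np.sqrt(np.abs(z)) - 0.822) / 0.349
```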

2.2. Multivariate CUSUM-Based T2. To simplify the presentation of detection data to plant personnel, it is often advantageous to summarize the information of several charts into one single chart. For that purpose, Bin Shams et al.11 proposed the use of a combined version of the CUSUM algorithms, as illustrated in Figure 1. The proposed CUSUM-based T2 performs LCS and SCS on identified fault-relevant variables xi. The transformed variables are then stacked in a matrix X ∈ R^(n×m), where n is the number of samples and m the number of the fault's relevant variables. Because of the interconnected nature of chemical plants, process measurements are most likely cross-correlated. To account for any possible collinearity, the principal components of the X matrix are estimated using principal components analysis (PCA) and only those representing the major variability are retained. The reduction order can be determined with different tests.5 In this work the parallel analysis method is used. Parallel analysis is an enhanced version of the Scree test: the reduction order is determined by comparing the eigenvalue profile of the variables matrix X ∈ R^(n×m) to the eigenvalue profile of a statistically independent matrix Y ∈ R^(n×m). The reduction order is the point at which the two profiles intersect; this is the point where significant process variation is separated from random noise and any possible linear cross-correlation between the variables. If the candidate variables to be used for detection are found to be uncorrelated, they are directly used for monitoring through eq 4. However, when the variables are found to be correlated, the scores corresponding to the principal components of the X matrix are used instead of the original variables.
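As an illustration of the order-selection step, the sketch below implements a basic parallel analysis: the eigenvalue profile of the autoscaled data is compared against the average profile of random, statistically independent data of the same size, and the crossing point gives the number of retained components. This is a generic rendering of the test, not the authors' exact procedure:

```python
import numpy as np

def parallel_analysis_order(X, n_draws=50, seed=0):
    """Number of principal components to retain: the index where the
    eigenvalue profile of X first drops below that of independent noise."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)      # autoscale
    eig = np.sort(np.linalg.eigvalsh(np.cov(Xs, rowvar=False)))[::-1]
    eig_rand = np.zeros(m)
    for _ in range(n_draws):                               # average over draws
        Y = rng.standard_normal((n, m))
        eig_rand += np.sort(np.linalg.eigvalsh(np.cov(Y, rowvar=False)))[::-1]
    eig_rand /= n_draws
    below = np.nonzero(eig < eig_rand)[0]                  # first crossing
    return int(below[0]) if below.size else m
```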

2.3. Out-of-Control Average Run Length: ARLoc as an Observability Index. Observability of a fault refers to the ability to detect the fault from the chosen set of measurements. On the basis of a previous study by Bin Shams et al.,17 the out-of-control average run length (ARLoc) was proposed as a statistical measure to quantify the observability of faults when a statistical chart is used for monitoring. The subscript oc stands for out of control. The ARLoc is defined as the average number of points that must be sampled or plotted before the chart signals a violation of some prespecified threshold.19 Most statistical monitoring techniques are based on the statistical hypothesis-testing principle, which involves two types of errors, namely, type I and type II errors. A type I error occurs when a control chart indicates a fault in the absence of one, whereas a type II error occurs when a control chart fails to identify the occurrence of a fault.19 Mathematically, ARLoc is a function of the probability of a type II error (β); that is,

$\mathrm{ARL}_{oc} = f(\beta)$   (6)

For simple statistical charts such as the univariate Shewhart chart, ARLoc has a closed-form expression.19 However, for most statistical charts, numerical approaches have been reported in the literature to estimate ARLoc; these can be classified broadly into two categories: (1) the Markov chain approach and (2) simulation of random realizations of the disturbances.19 The latter approach is adopted in the current study.

Due to their integrating nature, cumulative sum based techniques require some time before a fault can be detected, especially if the changes are very small. Accordingly, the ARLoc is a suitable metric to quantify this expected delay in detection in terms of the number of sampling intervals it takes the response to surpass a prespecified threshold. For example, if in response to a certain fault the estimated ARLoc = 1, the fault would be detected, on average, after the first sample following the onset of the fault. On the other hand, when ARLoc = infinity or a very large number, the fault is unobservable or it takes a long time to observe it.
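The simulation-based ARLoc estimate described above can be sketched as follows; simulate_chart is a hypothetical callable that runs one noise realization of the monitored closed-loop process and returns the alarm sample index after the fault onset (or None if no alarm occurs within the horizon):

```python
import numpy as np

def estimate_arl_oc(simulate_chart, n_realizations=200, horizon=5000, seed=0):
    """Monte Carlo ARL_oc: average run length to alarm after the fault onset,
    averaged over independent noise realizations (censored at the horizon)."""
    rng = np.random.default_rng(seed)
    run_lengths = []
    for _ in range(n_realizations):
        alarm = simulate_chart(rng)           # one closed-loop realization
        run_lengths.append(horizon if alarm is None else alarm)
    return float(np.mean(run_lengths))
```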

Knowledge of the ARLoc can be used to establish a frequency limit such that, when the frequency of the fault is beyond this limit, the fault cannot be detected. To illustrate this property of the ARLoc, a simple example is given. Consider a unity-gain process given by y = x + n, where y is the measured value to be used for detection, x is the actual fault, and n is white noise with μ = 0 and σic2 = 1. The magnitude of the signal y is thus equal to or less than the noise level. Two square-wave-like faults, with periods T = 24 samples and T = 3 samples, respectively, are considered as candidate faults x to be monitored by the fault detection algorithm. These faults are shown in Figure 2a and Figure 3a, respectively. Following the addition of noise, the resulting measured values of y corresponding to these two faults are given in Figure 2b and Figure 3b, respectively, where the solid circles represent measurements obtained at discrete sampling intervals. For the implemented LCS, Montgomery19 gives the theoretical ARLoc = 8.38 when K = 0.5, H = 4, and σic2 = 1. On the basis of this value, it is expected that square pulses of longer duration than ARLoc = 8.38 will exceed the threshold H, shown by a horizontal solid line in Figure 2c, and therefore will be detected, as shown in that figure. On the other hand, pulses of shorter duration than the calculated ARLoc will not exceed the threshold H and therefore will not be detected, as shown in Figure 3c. Accordingly, the observability of a fault can be estimated by comparing the period of the square wave with the estimated theoretical ARLoc value (8.38 samples) or, alternatively, by comparing the fault frequency with the inverse of the estimated theoretical ARLoc. Since it is common to characterize disturbances or faults by their frequency content, (ARLoc)-1 will be used within the optimization problem described later in the paper to quantify the faults' frequency bands for which these faults remain unobservable by the detection algorithm.

3. Tennessee Eastman Process

The problem of fault observability will be illustrated using an industrial process simulator referred to in the literature as the Tennessee Eastman process. The TEP was proposed by Downs and Vogel12 and has been used as a benchmark problem in several studies to compare various control20 and monitoring5-7,13-16 solutions. It consists of five major unit operations, as shown in Figure 4: a reactor, a condenser, a compressor, a separator, and a stripper. The process produces two liquid products (G and H) and one byproduct (F) from four gaseous reactants (A, C, D, E) and an inert (B). On the basis of the required product mix and production rate, the plant can be operated according to six different modes of operation. The original open-loop FORTRAN code was provided by Downs and Vogel.12 The process is open-loop unstable because of the exothermic reaction that takes place in the reactor; hence, it cannot be operated in manual mode. Several decentralized control structures have been proposed for the TEP, and the structure used by Lyman and Georgakis20 was adopted in this work. Different monitoring techniques have been tested on a total of 15 particular faults defined for the TEP5-7,13-16 and listed in Table 1. These techniques have shown different capabilities in detecting the majority of these faults. However, all of these previously reported techniques have consistently failed to detect three particular faults, listed in Table 1 and referred to hereafter as IDV(3), IDV(9), and IDV(15).

Figure 1. Proposed CUSUM-based T2. Reprinted from Bin Shams et al.,11 proceedings of the 2010 International Symposium on Dynamics and Control of Process Systems, per published guidelines available at http://www.dycops2010.org/index.php?option=com_content&view=article&id=11&Itemid=12.


The lack of observability associated with these faults has generally been attributed to the statistically insignificant changes occurring in the monitored variables following the onset of the three faults. For fairness, it should be stressed that in most of the reported work, the detection was based on a small number of current and past time intervals; thus, the entire time histories of the measurements were not considered for detection, as is done in the CUSUM calculations used in the current study. However, the fact remains that these faults have not been detected in previous studies, whereas there may be significant negative economic or operational effects from their occurrence. Thus, it is still very relevant to attempt to detect them. The next section describes how these faults can be detected with CUSUM-based techniques.

3.1. CUSUM-Based Charting Approach for Faults IDV(3), IDV(9), and IDV(15). Bin Shams et al.11 summarized most of the monitoring solutions proposed in the literature for the Tennessee Eastman process faults.

Figure 2. Dependency of the LCS ARLoc on the fault frequency: (a) fault with frequency smaller than (ARLoc)-1; (b) noise added to the fault signal; (c) monitoring using LCS, where dark circles depict the faulty signals. ARLoc ≈ 8 samples.19

Figure 3. Dependency of the LCS ARLoc on the fault frequency: (a) fault with frequency larger than (ARLoc)-1; (b) noise added to the fault signal; (c) monitoring using LCS, where dark circles depict the faulty signals. ARLoc ≈ 8 samples.19


The inability of previous techniques to detect the three faults emphasized in Table 1, namely, IDV(3), IDV(9), and IDV(15), motivates the use of cumulative sum based measures. Initially, a multivariate cumulative sum (MCUSUM)18 that used all of the available TEP measurements was applied, but it was found to be unable to detect these three faults. It was hypothesized that this inability was due to the occurrence of a large amount of noise combined with the high correlation among the process variables. Thus, the multivariate CUSUM-based statistic presented in section 2.2 was used, and it was successful in detecting the three faults.11 In principle, all of the monitored variables could be used for detection. However, it was found that, due to the small signal-to-noise ratio in most variables combined with the fact that the variables are highly correlated in a nonlinear fashion, using all the variables did not result in improved detection of the three faults.

Although for the TEP case it is relatively easy to identify the variables that are most relevant for each fault, identifying these variables for other large-scale systems solely on the basis of process knowledge may not always be a trivial task. To make this selection systematically, it is proposed to use contribution plots5,18 based on the CUSUM-based T2 for all available measurements. For that purpose, an augmented matrix is used that contains both the LCS and SCS of all the variables, i.e., a total of 104 columns (2 × 52 measurements). After applying PCA to this augmented matrix, the contribution plots can be obtained. For example, the top and bottom plots in Figure 5 show that the dominant contributions correspond to the LCS of XMV(10) (reactor cooling water flow) for fault IDV(3) and to the SCS of XMEAS(21) (reactor cooling water exit temperature) for fault IDV(9), respectively. The horizontal axis contains numbers from 1 to 104, corresponding to the 52 LCS and 52 SCS values used for the calculation of the contributions; the large values in the plot correspond to the dominant contributions. As expected, the most dominant variable for the detection of fault 3 is an LCS indicator, since the fault manifests itself as a change in bias, whereas for fault 9 an SCS indicator is required, since the fault involves a change in variability.
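One common score-based formulation of such contribution plots is sketched below for a single mean-centered sample; P and eigvals are the retained PCA loadings and eigenvalues of the augmented LCS/SCS matrix. The paper does not spell out its exact contribution formula, so this particular variant is an assumption:

```python
import numpy as np

def t2_contributions(x, P, eigvals):
    """Per-variable contributions to T^2 for one mean-centered sample x.
    P: loadings (m x A); eigvals: retained eigenvalues (A,).
    Uses c_j = sum_a (t_a / lambda_a) * P[j, a] * x_j (a standard variant)."""
    t = x @ P                          # scores on the retained components
    return ((t / eigvals) @ P.T) * x   # length-m vector of contributions
```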

Table 2 shows the fault-variables pairing for these three faults as found from the contribution plots.

Figure 4. Tennessee Eastman process with the second control scheme described in Lyman and Georgakis.20 The circles indicate the location of the three faults described in Table 1.

Table 1. Tennessee Eastman Process Faults (the Three Unobservable Faults Are IDV(3), IDV(9), and IDV(15))

fault     description                                         nature
IDV(0)    nominal operating condition
IDV(1)    A/C feed ratio, B composition constant (stream 4)   step
IDV(2)    B composition, A/C ratio constant (stream 4)        step
IDV(3)    D feed temperature                                  step
IDV(4)    reactor cooling water inlet temperature             step
IDV(5)    condenser cooling water inlet temperature           step
IDV(6)    A feed loss (stream 1)                              step
IDV(7)    C header pressure loss                              step
IDV(8)    A, B, C feed composition (stream 4)                 random variation
IDV(9)    D feed temperature                                  random variation
IDV(10)   C feed temperature                                  random variation
IDV(11)   reactor cooling water inlet temperature             random variation
IDV(12)   condenser cooling water inlet temperature           random variation
IDV(13)   reaction kinetics                                   slow drift
IDV(14)   reactor cooling water valve                         sticking
IDV(15)   condenser cooling water valve                       sticking


Accordingly, the LCS algorithm was applied to XMV(10) for fault 3, whereas the SCS algorithm was applied to XMEAS(21) for fault 9 and to XMV(11) for fault 15; the corresponding cumulative sums were then used to drive the Hotelling's T2 statistic defined in eq 4, where x is a sample vector composed of the three corresponding cumulative sums.
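A sketch of this last step, assuming an in-control reference window is available to estimate the mean and covariance of the three cumulative sums:

```python
import numpy as np

def t2_from_cusums(C, C_ic):
    """Hotelling's T^2 (eq 4) on CUSUM-transformed variables.
    C:    n x p matrix of monitored cumulative sums (p = 3 here).
    C_ic: in-control reference data of the same form."""
    mean_ic = C_ic.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(C_ic, rowvar=False))
    D = C - mean_ic
    return np.einsum('ij,jk,ik->i', D, S_inv, D)   # one T^2 value per sample
```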

Although the T2 statistic based on the augmented matrix could be used within the proposed optimization-based methodology, for simplicity this has not been done, since the matrix has to be inverted at each optimization step, resulting in significantly higher computational effort compared to the case where only the most relevant variables for each particular fault are used.

The use of three separate control charts was appropriate for monitoring the three faults;11 however, as mentioned above, it is often convenient for practical purposes to monitor the process with a smaller number of charts. For example, Figure 6 and Figure 7 depict the ability of the proposed CUSUM-based T2 statistic to observe IDV(9) and the simultaneous occurrence of IDV(3) and IDV(15), respectively. Table 3 gives the estimated ARLoc values for the three faults using the univariate and multivariate CUSUM-based statistics.11 The calculated ARLoc is only an approximation, since a more precise estimate of the ARLoc requires averaging over a large number of noise realizations, as done later in the case study.

A problem associated with these results is the relatively long periods of time required to detect the occurrence of the faults. The immediate implication is that only faults of longer duration than the corresponding ARLoc values, or alternatively of smaller frequency than 1/ARLoc, can be detected using the CUSUM-based statistics, as explained in the previous section. Thus, faults with duration shorter than the ARLoc would go undetected, and a cost may be associated with this lack of observability. To partially address this problem, a method is proposed in the next section to retune the controllers so as to reduce the detection times while maintaining suitable control of the process.

4. Approach for Finding an Optimal Trade-off between Fault Detection and Control

To shorten the detection time given by the ARLoc, it is proposed to retune the corresponding controller.22 To illustrate the interaction between control and fault detection, IDV(15) and its corresponding controller are considered, i.e., the controller that manipulates the condenser's cooling water valve XMV(11). The ARLoc corresponding to fault IDV(15) and the variability in XMV(11) are plotted as a function of the controller proportional gain (K) in Figure 8. The controller gain was changed from 0.95 to 2 times the original value (1.09), corresponding to a change of approximately 100%. As can be seen from Figure 8, there is a significant interaction between the control, as manifested by the variability in the manipulated variable, and the detection scheme, e.g., the multivariate CUSUM-based T2 statistic in this case. Retuning the controller significantly reduces the ARLoc required to observe IDV(15), but at the expense of significant degradation in the controller's performance, as shown by the increased variability in the manipulated variable.

Figure 5. Contributions of all of the variables to the CUSUM-based T2: top plot, IDV(3); bottom plot, IDV(9). The horizontal axis includes a total of 104 variables corresponding to the LCS and SCS values of all 52 measurements. The vertical dotted line separates the LCS and SCS sets.

Table 2. Unobservable Faults/Process Variables Pairing

fault     measurement(a)   description
IDV(3)    XMV(10)          reactor cooling water flow
IDV(9)    XMEAS(21)        reactor cooling water outlet temperature
IDV(15)   XMV(11)          condenser cooling water flow

(a) Variable designations as they appear in Downs and Vogel.12

Figure 6. Hotelling's T2 for IDV(9). Horizontal and vertical lines represent the statistical limit and the fault onset, respectively.

Figure 7. Hotelling's T2 for the simultaneous occurrence of IDV(3) and IDV(15). Horizontal and vertical lines represent the statistical limit and the fault onset, respectively.

Table 3. Estimated ARLoc for the LCS, SCS, and T2

fault                 statistic   ARLoc(a) (h)
IDV(3)                LCS         127.05
IDV(9)                SCS         8.20
IDV(15)               SCS         41.00
IDV(3)                T2          102.40
IDV(9)                T2          276.05
IDV(15)               T2          89.65
IDV(3) and IDV(15)    T2          41.30

(a) All ARLoc values are calculated from the onset of the faults (i.e., after 8 h).


This variability may translate into significant wear of the corresponding valve. Thus, there is a motivation to seek a trade-off between the costs associated with detection speed and closed-loop performance. An objective function that represents such a trade-off can be formulated as follows:

$\min_{\lambda} J = a_1\gamma_1 + a_2\gamma_2 + a_3\gamma_3$   (7)

where yi represent the controlled variables, the manipulated variables, and the rate of change of the manipulated variables for i = 1, 2, and 3, respectively, and the corresponding γi represent the variability associated with these variables, calculated as

$\gamma_i = \sigma_{y_i}^2 = \sum_{lb}^{ub} (y_i - \bar{y}_i)^2 / (ub - lb)$   (8)

A key assumption in this work is that, if a fault can be detected, the faulty situation can be immediately corrected; thus, there is no cost associated with observable faults. Costs are instead assumed to be associated with faults that cannot be observed and that therefore persist for a long time without being fixed. Accordingly, J is formulated to include only the costs associated with undetected faults, which, per the discussion in the previous section, are those faults with frequencies larger than (ARLoc)-1. The tuning parameters of the controller are given by λ, and they are the decision variables of the optimization problem. The lower and upper bounds of summation, lb and ub, are the time samples at the onset of the fault and at the end of the simulation, respectively.
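A direct transcription of eqs 7 and 8 might look as follows (the window handling for the rate-of-change term is an assumption, since differencing shortens the series by one sample):

```python
import numpy as np

def variability(y, lb, ub):
    """gamma_i of eq 8: variance-type variability over the window [lb, ub)."""
    w = np.asarray(y, dtype=float)[lb:ub]
    return float(np.sum((w - w.mean()) ** 2) / (ub - lb))

def cost_J(cv, mv, lb, ub, a=(1.0, 1.0, 1.0)):
    """J of eq 7: weighted variabilities of the controlled variable,
    the manipulated variable, and the rate of change of the MV."""
    g1 = variability(cv, lb, ub)
    g2 = variability(mv, lb, ub)
    g3 = variability(np.diff(mv), lb, ub - 1)   # rate of change of the MV
    return a[0] * g1 + a[1] * g2 + a[2] * g3
```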

The coefficients ai in the objective function represent actual economic costs associated with variability in the manipulated and controlled variables and with the rate of change of the manipulated variables. These costs are problem-specific, and they should be chosen on the basis of knowledge of the process. For instance, the cost of variability in the manipulated variables (second term on the right-hand side of eq 7) may be related to the cost of utilities, e.g., changes in steam, cooling water flow, and so on. On the other hand, the cost assigned to the rate of change of a valve (third term on the right-hand side of eq 7) could be obtained from the cost of the valve and from the expected valve life at a specific variability level.

Although the objective function above was formulated for a particular fault and a particular controller, it can be generalized to the case where multiple faults may occur, either sequentially one after the other or simultaneously, and where multiple controllers are retuned. Accordingly, two objective functions will be considered. The first one assumes that the faults occur simultaneously, and accordingly the cost function is given as follows:

$\min_{\lambda_1,\lambda_2,\ldots,\lambda_k} (J \mid f_1, f_2, \ldots, f_k) = a_1\gamma_1 + a_2\gamma_2 + a_3\gamma_3$   (9)

The second proposed objective function assumes that the faults occur sequentially one after another, and accordingly it is defined as follows:

$\min_{\lambda_1,\lambda_2,\ldots,\lambda_k} J_{total} = J_{f_1} + J_{f_2} + \ldots + J_{f_k}$   (10)

In the latter, it is assumed that one fault occurs for some period of time, after which it is removed before another fault occurs.

In summary, the optimization problem consists of minimizing the objective function given by eq 7 for one fault, or by eq 9 or 10 for multiple faults, with respect to the tuning parameters of the controllers. These minimizations are carried out on the basis of numerical simulations of the process under closed-loop control for different realizations of the faults. To ensure that the cost function is evaluated for faults that cannot be observed, the frequency band of the faults introduced in the simulations in the form of an RBS (random binary signal) spans a lower limit of (ARLoc)-1 to an upper limit of the Nyquist frequency ωu, i.e., 0.5/T, where T is the sampling interval used in the numerical simulations. To reduce the computation time within the proposed scheme, the numerical simulations of the TEP model are performed in FORTRAN, whereas the optimization and the processing of the results are done in MATLAB; a program was designed to allow the two platforms to communicate. One of the key sources of computational burden in the optimization problem is the calculation of the ARLoc, which requires repeated simulations of the closed-loop model for a particular fault under different noise realizations. However, since these calculations are independent of each other, it was possible to exploit the parallel computation capabilities of MATLAB to speed up this task.

The optimization problem is solved per the following steps (a code sketch of this loop is given after the list):
(1) The controller parameters (λk) and the random number generator's seed are initialized.
(2) The IDV(j) fault is set, and the Tennessee Eastman differential/algebraic equations are solved (FORTRAN), where j ∈ {3, 9, 15}, the three unobservable faults.
(3) The simulated data are retrieved in MATLAB, where the CUSUM-based T2 is calculated and the out-of-control run length (RLoc) for a single realization is estimated.
(4) The seed of the random number generator is changed, and steps 2 and 3 are repeated until the maximum number of runs is reached.
(5) ARLoc is calculated.
(6) A random binary sequence (RBS) is designed with frequency content [(ARLoc)-1, ωu].
(7) The RBSj signal replaces the IDV(j), and the Tennessee Eastman process is resimulated in FORTRAN.
(8) Equation 8 is evaluated for the given RBSj and λk, and eq 9 is minimized until convergence.
A detailed flow chart summarizing the above steps is given in Figure 9.

It should be noticed that the methodology proposed above isbased on simulations of a first principles model of the system.

Figure 8. Trade-off between higher variability and shorter detection: change in variability and in the T2-based ARLoc as a function of the condenser controller's gain, XMV(11). (Note: the proportional gain was varied from 1.09 to 2.18, corresponding to a 100% change from its original value of 1.09.)



When such a model is not available, the method could still be applied by using empirical models of the different units identified from plant input-output data. These empirical models could then be interconnected with the corresponding controller models by using appropriate software, e.g., Simulink (MathWorks), to provide a closed-loop plant simulator to which the method above could be readily applied. However, when empirical models are used, model-plant mismatch becomes important. In that case, the calculation of variability will require either Monte Carlo simulations over different uncertainty realizations or, alternatively, robust control tools to estimate the worst-case variability in the presence of model errors.23

5. Results and Discussion

5.1. Individual Faults Case. First, the three faults are assumed to occur one at a time. For each one of these faults a single controller is optimized. This controller is selected as the one that involves the variable deemed most relevant for that particular fault, as specified in Table 2. More specifically, the reactor cooling water valve (slave controller) is identified as the fault-relevant controller for IDV(3) and IDV(9), while the condenser cooling water valve is identified for IDV(15); see Figure 10. The algorithm explained above and depicted in Figure 9 was implemented for each of the three faults. For each fault, many different initial values were tried to avoid locally optimal solutions. The results of the optimization for each fault, in terms of the optimal tuning parameters and the value of the cost function at the optimum, are given in Table 4. For these results, the ai in eq 7 are set equal to 1 for i = 1, 2, and 3, and a sampling frequency equal to (1/180) Hz is chosen. Table 4 also presents, for comparison, the tuning parameters provided by Lyman and Georgakis20 and the cost obtained with these parameters. As can be seen from the table, the ARLoc has been significantly reduced for each of the three faults. For example, for IDV(15), stiction in the condenser's cooling water valve, use of the proposed scheme reduced the ARLoc from 89.65 to 40.59 h and the associated cost from 7.34 to 0.87. For IDV(3), the proposed tuning scheme shortened the ARLoc by 2.69 h and reduced the cost by approximately 80% compared with the cost obtained with the tuning parameters proposed by Lyman and Georgakis.20 A similar trend was obtained for IDV(9).

A close look at Table 4 reveals that the controllers are tuned less aggressively than those used by Lyman and Georgakis,20 as evidenced by the consistently smaller proportional gains resulting from the optimization. This is expected, since the LCS and SCS operations used to drive the CUSUM-based T2 need a relatively slow convergence of the manipulated variables to their final steady-state values to achieve the necessary detection. By detuning the corresponding controllers, the manipulated variables converge more slowly to their steady-state values, thus providing the necessary variability for detection, but at the expense of controller performance.

The proposed algorithm was implemented on eight Intel Xeon processors, each at 2 GHz, with a total of 12.0 GB of RAM. The parallel computing scheme embedded in the proposed algorithm exploits these eight processors, yielding a significant reduction in computation time. The CPU time required to solve each problem was approximately 3 h on the server specified above, where most of the computational burden was due to the ARLoc estimation.

5.2. Simultaneous Faults Case. Since simultaneous faults are common in industrial systems, the proposed methodology can be extended to account for such an event. For illustration purposes, the simultaneous occurrence of IDV(3) and IDV(15) is considered. IDV(3) and IDV(15) are two faults of different nature, and their combined effect is not necessarily the superposition of the two inputs, in particular when a nonlinear system is considered.

Figure 9. Flow chart for solving the proposed dynamic optimization problem.


For this case, i.e., IDV(3) and IDV(15), the reactor cooling and the condenser cooling controllers are the two controllers that could be retuned to achieve an optimal trade-off between detection and control performance. Since there are two different controllers that involve the variables used for monitoring the faults, there are a total of three different options, corresponding to the cases where either one of the two controllers is retuned or where the two controllers are retuned simultaneously. For each one of these three options, two costs are considered per eqs 9 and 10, corresponding to the faults occurring either simultaneously or sequentially. The combination of three possible tuning options with two possible optimization costs results in a total of six cases, as shown in Table 5.

Comparisons with the costs obtained using the controllers' settings given by Lyman and Georgakis20 are also presented in Table 5. It is clear from the table that the optimization procedure leads to significant cost reductions, up to an order of magnitude, as compared with the costs obtained with the controllers' settings of Lyman and Georgakis.20

Figure 10. Fault-relevant controllers for IDV(3), IDV(9), IDV(15), and the simultaneous IDV(3) and IDV(15).

Table 4. Tuning Parameters and ARLoc Obtained from Implementing the Proposed Algorithm (Figure 9)

fault (controller)              tuning scheme            Kc (unit of CV)-1   τI (h)   ARLoc (h)   cost
IDV(3) (reactor controller)     Lyman and Georgakis20    -1.560              0.403    102.40      0.973
                                proposed scheme          -0.265              2.437    99.71       0.161
IDV(9) (reactor controller)     Lyman and Georgakis20    -1.560              0.403    273.05      1.001
                                proposed scheme          -0.269              0.711    115.65      0.225
IDV(15) (condenser controller)  Lyman and Georgakis20    1.090               0.722    89.65       7.343
                                proposed scheme          0.236               0.914    40.59       0.876

Table 5. Tuning Parameters and ARLoc Resulting from the Proposed Algorithm (Figure 9) for the Simultaneous Faults Case, i.e., IDV(3) and IDV(15)

cost function   controller(s)                                λ(a)                             ARLoc (h)   cost     cost (Lyman and Georgakis,20 ARLoc = 41.3 h)
eq 9            reactor cooling water valve                  [-0.269, 0.679]                  18.85       0.163    1.479
eq 9            condenser cooling water valve                [0.026, 1.062]                   110.99      0.190    2.007
eq 9            reactor and condenser cooling water valves   [-1.117, 0.448, 0.905, 0.787]    12.93       7.268    9.259
eq 10           reactor cooling water valve                  [-0.265, 0.676]                  97.64       0.326    3.685
eq 10           condenser cooling water valve                [0.235, 0.876]                   118.43      1.574    3.639
eq 10           reactor and condenser cooling water valves   [-1.138, 0.541, 0.659, 0.881]    115.91      10.469   18.680

(a) λ is the vector of controller parameters, of size 2 when a single PI controller is used and 4 when two PI controllers are considered; in the latter case λ = [λ1, λ2].


For example, for the case in which the reactor cooling controller is tuned using eq 9, the ARLoc is reduced by approximately 54.36% and the corresponding cost is reduced from 1.479 to 0.163. In the case where both controllers, i.e., the reactor cooling and the condenser cooling controllers, are tuned simultaneously using eq 9, the ARLoc is reduced from 41.30 to 12.93 h and the associated cost decreases from 9.26 to 7.27. It should be noted that the cost was defined only in terms of the variables used for fault monitoring and control. Extending the proposed methodology to include the overall costs of the process is possible, but it significantly increases the computational burden; thus, it has been left for future study.

6. Conclusion

CUSUM-based statistics combined with Hotelling's T2 charting are used to detect faults in a chemical process. This method was successful in detecting three faults in the Tennessee Eastman problem that were impossible to observe with other previously applied methods. However, long detection delays, as quantified by the ARLoc, were noticed. To speed up detection, the controllers involving the variables used for detection are retuned to achieve an optimal trade-off between fault detection and control performance. For this purpose, an optimization problem is formulated, involving the minimization of the costs due to the occurrence of faults that cannot be detected and of control variability, with respect to the tuning parameters of the controllers that involve variables used for fault detection. To quantify the limit of fault observability, the ARLoc is used. Both the individual and the simultaneous occurrence of faults are considered. The proposed objective function minimizes the cost associated with the lack of detection while penalizing excessive variability. It is shown that the proposed methodology results in significant cost reductions when compared with the costs obtained with the controllers' settings previously reported in the literature for the process under study.

Appendix: Scale CUSUM Parameters

The parameters of the scale CUSUM (SCS) are derived as follows. Let xi ∼ N(0, σ2), i = 1, 2, ..., n, and yi = |xi/σ|^λ. The characteristics of the distribution of yi are easily worked out from the standard normal distribution. That is,

$\Pr[y_i < c] = 2\Phi(c^{1/\lambda}) - 1$   (A1)

where Φ(·) denotes the standard normal distribution function. Furthermore, the kth moment of yi is

$E[y_i^k] = 2^{0.5\lambda k}\,\Gamma[0.5(\lambda k + 1)]/\sqrt{\pi}$   (A2)

where Γ(·) is the gamma function. With λ = 0.5, the transformed variate yi has a distribution that is very close to normal. In particular, using (A2), the first moment and the variance are

$E[y_i] = \mu = 2^{0.25}\,\Gamma(3/4)/\sqrt{\pi} = 0.82218$   (A3)

$V(y_i) = \sqrt{2/\pi}\,\Gamma(1) - \mu^2 = (0.34915)^2$   (A4)
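The constants appearing in eqs 3, A3, and A4 can be checked numerically (a quick verification, not part of the original paper):

```python
from math import gamma, pi, sqrt

mu = 2 ** 0.25 * gamma(0.75) / sqrt(pi)   # eq A3 -> 0.82218...
var = sqrt(2 / pi) * gamma(1.0) - mu**2   # eq A4 -> (0.34915)^2
print(round(mu, 5), round(sqrt(var), 5))  # 0.82218 0.34915
```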

Literature Cited

(1) Isermann, R. Fault Diagnosis Systems; Springer: Berlin, 2006.

(2) Venkatasubramanian, V.; Rengaswamy, R.; Yin, K.; Kavuri, N. A review of process fault detection and diagnosis. Part I: Quantitative model-based methods. Comput. Chem. Eng. 2003, 27, 293.

(3) Venkatasubramanian, V.; Rengaswamy, R.; Kavuri, N. A review of process fault detection and diagnosis. Part II: Qualitative models and search strategies. Comput. Chem. Eng. 2003, 27, 313.

(4) Venkatasubramanian, V.; Rengaswamy, R.; Kavuri, N.; Yin, K. A review of process fault detection and diagnosis. Part III: Process history based methods. Comput. Chem. Eng. 2003, 27, 327.

(5) Chiang, L. H.; Russell, E. L.; Braatz, R. D. Fault Detection and Diagnosis in Industrial Systems; Springer: London, 2001.

(6) Chiang, L. H.; Braatz, R. Process monitoring using causal map and multivariate statistics: Fault detection and identification. Chemom. Intell. Lab. Syst. 2003, 65, 159.

(7) Lee, J. M.; Yoo, C.; Lee, I. Statistical monitoring of dynamic processes based on dynamic independent component analysis. Chem. Eng. Sci. 2004, 59, 2995.

(8) Qin, S. J. Statistical process monitoring: Basics and beyond. J. Chemom. 2003, 54, 480.

(9) Benjamin, J. O.; De La Pena, D. M.; Davis, J. F.; Christofides, P. D. Enhancing data based fault isolation through nonlinear control. AIChE J. 2008, 54, 223.

(10) Lou, S. J.; Budman, H.; Duever, T. A. Comparison of fault detection techniques. J. Process Control 2003, 13, 451.

(11) Bin Shams, M.; Budman, H.; Duever, T. Fault detection using CUSUM based techniques with application to the Tennessee Eastman process. 9th International Symposium on Dynamics and Control of Process Systems (DYCOPS), Leuven, Belgium; International Federation of Automatic Control: Laxenburg, Austria, 2010.

(12) Downs, J. J.; Vogel, E. F. A plantwide industrial process control problem. Comput. Chem. Eng. 1993, 17, 245.

(13) Cheng, C. Y.; Hsu, C. C.; Chen, M. C. Adaptive kernel principal component analysis (KPCA) for monitoring small disturbances of nonlinear processes. Ind. Eng. Chem. Res. 2010, 49, 2254.

(14) Ding, S. X.; Zhang, P.; Naik, A.; Ding, E. L.; Huang, B. Subspace method aided data-driven design of fault detection and isolation systems. J. Process Control 2009, 19, 1496.

(15) Zhang, Y. Enhanced statistical analysis of nonlinear processes using KPCA, KICA and SVM. Chem. Eng. Sci. 2009, 64, 801.

(16) Ku, W.; Storer, R. H.; Georgakis, C. Disturbance detection and isolation by dynamic principal component analysis. Chemom. Intell. Lab. Syst. 1995, 30, 179.

(17) Bin Shams, M.; Budman, H.; Duever, T. Finding a trade-off between observability and economics in the fault detection of chemical processes. Comput. Chem. Eng., in press.

(18) MacGregor, J. F.; Kourti, T. Statistical process control of multivariate processes. Control Eng. Pract. 1995, 3, 403.

(19) Montgomery, D. C. Introduction to Statistical Quality Control; Wiley: New York, 1997.

(20) Lyman, P. R.; Georgakis, C. Plantwide control of the Tennessee Eastman problem. Comput. Chem. Eng. 1995, 19, 321.

(21) Tyler, M. L.; Morari, M. Optimal and robust design of integrated control and diagnosis modules. Proceedings of the 1994 American Control Conference, Baltimore, MD; IEEE: New York, 1994; p 2060.

(22) Choudhury, M. A. A. S.; Kariwala, V.; Shah, S. L.; Douke, H.; Takada, H.; Thornhill, N. F. A simple test to confirm control valve stiction. Proceedings of the IFAC World Congress, Prague, Czech Republic; International Federation of Automatic Control: Laxenburg, Austria, 2005.

(23) Ricardez-Sandoval, L. A.; Budman, H. M.; Douglas, P. L. Simultaneous design and control of chemical processes with application to the Tennessee Eastman process. J. Process Control 2009, 19, 1377.

Received for review June 7, 2010. Revised manuscript received November 6, 2010. Accepted November 15, 2010.


