
IEEE TRANSACTIONS ON RELIABILITY, VOL. 63, NO. 2, JUNE 2014

Probabilistic Novelty Detection With Support Vector Machines

Lei Clifton, David A. Clifton, Member, IEEE, Yang Zhang, Peter Watkinson, Lionel Tarassenko, and Hujun Yin, Senior Member, IEEE

Abstract—Novelty detection, or one-class classification, is of particular use in the analysis of high-integrity systems, in which examples of failure are rare in comparison with the number of examples of stable behaviour, such that a conventional multi-class classification approach cannot be taken. Support Vector Machines (SVMs) are a popular means of performing novelty detection, and it is conventional practice to use a train-validate-test approach, often involving cross-validation, to train the one-class SVM, and then select appropriate values for its parameters. An alternative method, used with multi-class SVMs, is to calibrate the SVM output into conditional class probabilities. A probabilistic approach offers many advantages over the conventional method, including the facility to select automatically a probabilistic novelty threshold. The contributions of this paper are (i) the development of a probabilistic calibration technique for one-class SVMs, such that on-line novelty detection may be performed in a probabilistic manner; and (ii) the demonstration of the advantages of the proposed method (in comparison to the conventional one-class SVM methodology) using case studies, in which one-class probabilistic SVMs are used to perform condition monitoring of a high-integrity industrial combustion plant, and in detecting deterioration in patient physiological condition during patient vital-sign monitoring.

Index Terms—Support vector machine, novelty detection, one-class classification, calibration, condition monitoring.

ABBREVIATIONS

EHM Engine Health Monitoring

SVM Support Vector Machine

MLP Multi-Layer Perceptron

PAV Pair-Adjacent Violators

Manuscript received November 20, 2011; revised February 13, 2013; accepted December 11, 2013. Date of publication April 10, 2014; date of current version May 29, 2014. The work of L. Clifton was supported by the Overseas Research Students Award Scheme, provided by the U.K. Government, and later by the NIHR Biomedical Research Centre Programme, Oxford. The work of D. A. Clifton was supported by a Royal Academy of Engineering Research Fellowship and the Centre of Excellence in Personalized Healthcare funded by the Wellcome Trust and EPSRC under Grant WT 088877/Z/09/Z. Associate Editor: Shieh.

L. Clifton, D. A. Clifton, and L. Tarassenko are with the Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, U.K. (e-mail: [email protected]).

Y. Zhang is with the Department of Mechanical Engineering, University of Sheffield, Sheffield, U.K.

P. Watkinson is with the Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, U.K.

H. Yin is with the School of Electrical and Electronic Engineering, University of Manchester, Manchester, U.K.

Digital Object Identifier 10.1109/TR.2014.2315911

GMM Gaussian Mixture Model

ROC Receiver Operating Characteristic

AUC Area-under-the-Curve


SDU Step-Down Unit

ICU Intensive Care Unit

NOTATION

$\mathcal{X}$   $d$-dimensional feature space

$\mathcal{F}$   a feature space reached by a non-linear transformation $\phi : \mathcal{X} \rightarrow \mathcal{F}$

$\mathbf{x}_i$   support vectors

$N_s$   number of support vectors; i.e., $i = 1, \ldots, N_s$

$K(\mathbf{x}, \mathbf{x}')$   kernel function

$\sigma$   bandwidth of a multivariate Gaussian

$\alpha_i$   Lagrangian multiplier

$z(\mathbf{x})$   novelty score

$\tilde{z}(\mathbf{x})$   re-scaled novelty score in the range $[0, 1]$

$\mathcal{S}$   data available during classifier construction that are stable

$\mathcal{U}$   data available during classifier construction that are unstable

$\bar{d}$   average distance to the centroid of real, stable data

$R$   radius of a hyper-sphere

$N$   number of real, stable data; i.e., $N = |\mathcal{S}|$

$\mathcal{S}'$   artificial stable data

$\mathcal{U}'$   artificial unstable data

$b$   offset of an optimal hyperplane

$p(\cdot)$   probability densities

$P(\cdot)$   probabilities

$\hat{m}(\cdot)$   stepwise-constant isotonic function

$M(\theta)$   model of stability with parameters $\theta$

0018-9529 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


I. INTRODUCTION

A. Novelty Detection for On-Line Monitoring

NOVELTY detection, also termed one-class classification, involves construction of a model of stability using examples of stable system behaviour; test data are then classified as either stable or unstable with respect to that model. This technique is particularly applicable to the monitoring of high-integrity systems, such as jet engines, human patients, or manufacturing processes, in which examples of unstable system behaviour are scarce in comparison to the number of examples of stable system behaviour, due to the reliability of such systems. In such cases, there are typically too few examples of system faults to be used to construct a robust two- or multi-class classifier, as would be used in a conventional fault detection approach to condition monitoring. For example, an engine health monitoring (EHM) system may have available many thousands of hours of data acquired from stable flying time, with very few examples of failure throughout its operational life.

Note that, in keeping with the literature on novelty detection, the term stable will refer to data from a machine's stable operating condition, rather than being used in the statistical sense of stable probability distributions.

The difficulties of monitoring high-integrity systems are further compounded by inter-system variability. Different instances of the same system type (such as different patients of the same age) are often so different in their observed data that examples of unstable system behaviour from one instance are inapplicable to the condition of other instances. For example, a heart rate of 50 beats per minute may be indicative of considerable physiological instability in one hospital patient, while it may be entirely expected for a healthier patient of the same age and background.

Finally, high-integrity systems typically exhibit a high degree of structural complexity, and can often comprise many millions of components and sub-systems that interact in a non-linear manner. Thus, the potential space of instability is extremely large, and the large resultant number of failure modes is often poorly understood.

Novelty detection avoids such problems by modelling the stable mode of operation of the system, which is often well-understood because most high-integrity systems function as intended most of the time. It then seeks to identify deviations from that stable model, termed the model of normality in the machine learning literature. Such condition monitoring applications are of particular interest to the techniques developed within this paper.

B. Overview

This paper considers the one-class support vector machine (SVM), which is a commonly-used method of performing novelty detection. Its formulation is briefly recapped in Section II, where its disadvantages with respect to probabilistic methods are discussed. A probabilistic extension to the one-class SVM is proposed, which requires the generation of artificial data using the available stable data. Section III describes methods for generating such data, which are then used for calibrating the output of the SVM into estimates of the class-conditional probabilities, in Section IV. The proposed probabilistic approach is illustrated using both simulated data, and real-world case studies. The latter include data from monitoring a large industrial combustion system, and from the analysis of patient vital signs, described in Sections V and VI, respectively.

Limitations of the method, and potential future extensions, are also discussed in Section VII.

II. ONE-CLASS SVMS

The SVM is an oft-employed method of performing novelty detection, and it has been applied to many such problems, including jet engine condition monitoring [1], signal segmentation [2], and fMRI analysis [3], among many others; a review may be found in [4].

A. Formulation

This paper considers the one-class SVM formulation proposed by [5], in which a number $N$ of $d$-dimensional data $\{\mathbf{x}_i\} \in \mathcal{X}$ are mapped into a (potentially infinite-dimensional) feature space $\mathcal{F}$ by some non-linear transformation $\phi : \mathcal{X} \rightarrow \mathcal{F}$, where the data are linearly separable from the origin in $\mathcal{F}$. We note in passing that this approach is typically employed in favour of the one-class formulation proposed in [6] and [7], in which a hypersphere of minimum radius is found to enclose the data in $\mathcal{F}$, but which can be considered to be equivalent to the separation from the origin in the feature space. We here define the output of the one-class SVM in the following manner, to be interpreted as a novelty score,

$z(\mathbf{x}) = b - f(\mathbf{x})$   (1)

$f(\mathbf{x}) = \sum_{i=1}^{N_s} \alpha_i K(\mathbf{x}, \mathbf{x}_i)$   (2)

with parameters

$0 \le \alpha_i \le C$   (3)

$\sum_{i=1}^{N_s} \alpha_i = 1$   (4)

where $\mathbf{x}_i$ are the support vectors, of which there are $N_s$, and where $K$ is a kernel function, typically a multivariate Gaussian with bandwidth $\sigma$, which provides the dot product of transformed data in $\mathcal{F}$:

$K(\mathbf{x}, \mathbf{x}') = \exp\left(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2\right)$   (5)

$= \phi(\mathbf{x}) \cdot \phi(\mathbf{x}')$   (6)

and where the $\alpha_i$ are Lagrangian multipliers used to solve the dual formulation, more details of which may be found in [5], and which are not reproduced here. Thus, unstable data (i.e., those outside the single, stable training class) take higher values of $z(\mathbf{x})$ than those for stable data, and hence $z(\mathbf{x})$ is a novelty score.
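To make the formulation concrete, the following minimal sketch (not the authors' implementation) computes a novelty score of the form (1)-(2) using the one-class SVM of scikit-learn, which is assumed to be available; the library's gamma parameter corresponds to $1/(2\sigma^2)$ for the Gaussian kernel (5), and the negated decision function plays the role of $z(\mathbf{x})$.

import numpy as np
from sklearn.svm import OneClassSVM

# Train on stable data only; test points far from the stable locus
# should receive higher novelty scores z(x).
rng = np.random.default_rng(0)
S = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(200, 2))  # stable training data
x_test = np.array([[2.1, 1.9],   # near the stable locus
                   [0.0, 0.0]])  # far from the stable locus

sigma = 0.5                      # kernel bandwidth, as in (5)
svm = OneClassSVM(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), nu=0.05)
svm.fit(S)

# decision_function() is positive inside the learned stable region;
# negating it yields a score that is higher for unstable data, as in (1).
z = -svm.decision_function(x_test)
print(z)                         # the distant point receives the higher score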

B. Disadvantages of the Conventional Formulation

The one-class SVM, as with the conventional multi-class SVM, has been repeatedly demonstrated to perform well at the task of separating classes within example datasets. Typically, it outperforms probabilistic methods in retrospective studies (in terms of misclassification rate), when on-line analysis is not performed [8]. However, its non-probabilistic formulation gives rise to several disadvantages with respect to probabilistic methods [9], [10].

• Uncertainty in classification is not modelled. This is of importance in the monitoring of high-integrity systems, for example, because it may be advantageous for a decision to be made only if the certainty in classification is sufficiently high. Monitoring systems that provide so-called don't know outputs can often be more accepted in practice [11], due to their appropriate handling of uncertainty, which can reduce false-positive classifications (at the expense of reduced sensitivity to system instability).

• Not explicitly modelling the uncertainty in the classification makes it difficult to obtain error bars on the output, as would be required, for example, in a decision support system [12].

• Cross-validation is required to set model parameters, such as the novelty threshold on the novelty score $z(\mathbf{x})$. This calculation is computationally expensive, and can require large amounts of data to perform independent testing, such that the quantity of data available for training is decreased substantially. In many condition monitoring applications, the rate of data acquisition is low, and hence, if a model is being learned on-line, all available data must be used for training if the model is to be able to sufficiently characterise stable system behaviour.
For example, a common application is the monitoring of jet engines, in which a very small number of data are broadcast from the aircraft to a ground-based monitoring station via limited-bandwidth satellite- or airport-based communications systems [13], [14]. This limited quantity of data must be used to learn a model of stability on-line, such that monitoring can take place after completion of a small number of flights.
Similarly, in the home monitoring of patients with chronic illnesses, a typical m-health monitoring system may acquire measurements of vital signs (such as blood pressure and temperature) daily or twice-daily, from which meaningful models of stability must be constructed on-line [15], [16].

• Finally, probabilistic approaches allow the use of peripheral classification techniques, such as probabilistic feature selection [17] and probabilistic combination-of-classifiers [18]–[20].

Several probabilistic extensions have been proposed for multi-class SVMs, which will be reviewed in Section IV, while other techniques such as Relevance Vector Machines (RVMs) have sought to embed probabilistic, sometimes Bayesian, methodologies into the multi-class kernel machine framework [10], [21], [22]. This paper develops probabilistic calibration techniques for one-class SVMs, suitable for performing on-line novelty detection, which will be described in Section IV. Prior to that, we introduce the notion of artificial data generation in Section III, which will be required by our probabilistic calibration method.

III. GENERATING ARTIFICIAL CALIBRATION DATA

Suppose the range of novelty scores $z(\mathbf{x})$ produced by a one-class classifier is $[z_{\min}, z_{\max}]$; then $z(\mathbf{x})$ may be mapped onto the range $[0, 1]$ through a simple linear re-scaling

$\tilde{z}(\mathbf{x}) = \dfrac{z(\mathbf{x}) - z_{\min}}{z_{\max} - z_{\min}}$   (7)

where $\tilde{z}(\mathbf{x})$ is the re-scaled score in the range $[0, 1]$. However, $\tilde{z}(\mathbf{x})$ tends to be poorly calibrated, because the novelty score may not be proportional to the actual probability of the sample being unstable [23]. Different approaches to calibration must therefore be considered.

A. Justification of the Need for Artificial Data

In the following, we denote by $\mathcal{S}$, and $\mathcal{U}$ those data available during classifier construction that are stable, and unstable, respectively. Typically, in novelty detection applications, $\mathcal{U}$ is empty or under-represented.

Previous studies of two-class SVMs have shown that calibration of output into probabilities requires a separate validation set [24], which comprises a suitable number of stable, and unstable examples (i.e., $\mathcal{S}$, and $\mathcal{U}$ of suitable size). However, as noted in Section I, novelty detection applications typically have very few examples in $\mathcal{U}$.

Furthermore, for small datasets such as those that may occur in some low-bandwidth condition monitoring applications, as discussed in the previous section, one cannot afford to further split the training set of stable examples to form a validation set suitable for calibration. Thus, with an under-represented (or frequently empty) unstable set $\mathcal{U}$, and a potentially limited stable set $\mathcal{S}$, an alternative approach must be taken to probabilistic calibration.

In this section, we provide a possible solution to the problem by generating artificial samples to be used as a dataset suitable for calibration of the novelty scores. We will describe the generation of artificial stable data $\mathcal{S}'$ and artificial unstable data $\mathcal{U}'$. These sets must be based entirely on the real, available stable data $\mathcal{S}$ acquired from the system, as we cannot assume the existence of a non-empty $\mathcal{U}$, nor can we assume that what limited examples we may have in $\mathcal{U}$ provide a complete understanding of unstable conditions for our system. (This latter effect is due to inter-system variability, and the large numbers of failure modes that may occur for a complex system, as described in Section I.)

B. Generating Stable Data $\mathcal{S}'$, and Unstable Data $\mathcal{U}'$

Generating artificial unstable data has been used in novelty detection problems where only stable data are available [25], [26]. Generating a compact set of artificial unstable data can avoid dealing with excessive quantities of empty feature space around the stable data [26]. In [25], uniformly-distributed artificial unstable data were generated to surround a hypersphere of stable data, where the latter were assumed to have a multivariate, unimodal Gaussian distribution. A similar approach was employed in [26], whereby artificial unstable data generated around stable data were used to form closed decision boundaries for a multi-layer perceptron (MLP). We note that, in all such work, the artificial data have been used merely to ensure that a decision boundary can be constructed around the single, known class of stable data; hence, no strong assumption is made about the unstable class, other than that it lies outside the locus of stable data.

Let $\bar{d}$ be the average Euclidean distance of the real stable data $\mathcal{S}$ to their centroid, and let $N$ be the number of those real stable data; i.e., $N = |\mathcal{S}|$. The procedure for generating artificial data consists of the following four steps, which have been adapted from [25], [26] to include the generation of artificial stable data $\mathcal{S}'$; a code sketch of the procedure follows the steps below.

STEP 1. Generate uniformly distributed artificial data inside and around the real stable data $\mathcal{S}$, in a hyper-sphere of radius $R$. We have chosen $R$ so as to generate a compact set of data outside the real stable data, as required in step 4 below. Results from experiments not shown here indicate that a larger radius does not improve the performance of the calibration using the one-class SVM classifier, because this simply involves the generation of data further out into the already unstable regions of the data space.

STEP 2. For each real stable datum $\mathbf{x}_i \in \mathcal{S}$, define a local average distance quantity $d_k(\mathbf{x}_i)$ to be the average of the Euclidean distances of its $k$-nearest neighbours. The global average distance $\bar{d}_k$ is found by averaging $d_k$ over all the real stable data.

STEP 3. For each artificial data-point, find the distance $d_{\min}$ from its nearest neighbour among the real stable data $\mathcal{S}$.

STEP 4a. Those artificial data with $d_{\min} > \theta_1 \bar{d}_k$ lie outside the locus of real stable data $\mathcal{S}$, and are thus used to form the artificial unstable set $\mathcal{U}'$. The parameter $\theta_1$ controls the boundary space between the real stable data $\mathcal{S}$ and the artificial unstable data $\mathcal{U}'$; a greater value of $\theta_1$ increases the separation between the two sets.

STEP 4b. Those artificial data with $d_{\min} \le (\theta_1 - \theta_2)\bar{d}_k$ are used to form the artificial stable set $\mathcal{S}'$. The parameter $\theta_2$ is usually a small positive value, which determines how close to the boundary of the locus of real stable data the artificial stable data are permitted to approach. As $\theta_2 \rightarrow 0$, the artificial stable data reach the edge of the locus of the real stable data.

The values of parameters $\theta_1$ and $\theta_2$ are additional quantities that one may optimise during the modelling process, and which may be determined with prior knowledge of the domain or with knowledge of any unstable examples that may exist; in the case studies considered later, we follow the analogous method described in [26] in setting the values of $\theta_1$ and $\theta_2$.
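The following sketch implements the four steps under the reconstruction above. The radius $R = 2\bar{d}$ and the default values of $\theta_1$ and $\theta_2$ are illustrative assumptions rather than the values used in the paper, and generate_artificial_data is a hypothetical helper name.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def generate_artificial_data(S, n_art=2000, k=None, theta1=1.5, theta2=0.2, seed=0):
    rng = np.random.default_rng(seed)
    N, d = S.shape
    k = k or max(1, int(np.sqrt(N)))

    # STEP 1: uniform samples in a hyper-sphere of radius R about the centroid,
    # where R is here assumed to be a small multiple of the average distance
    # d_bar of the real stable data to their centroid.
    c = S.mean(axis=0)
    d_bar = np.linalg.norm(S - c, axis=1).mean()
    R = 2.0 * d_bar
    direction = rng.normal(size=(n_art, d))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    radii = R * rng.uniform(size=n_art) ** (1.0 / d)  # uniform over the ball
    A = c + radii[:, None] * direction

    # STEP 2: global average k-nearest-neighbour distance over the real data.
    nn_k = NearestNeighbors(n_neighbors=k + 1).fit(S)  # +1 skips the point itself
    dists_k, _ = nn_k.kneighbors(S)
    d_bar_k = dists_k[:, 1:].mean()

    # STEP 3: distance from each artificial point to its nearest real stable datum.
    nn_1 = NearestNeighbors(n_neighbors=1).fit(S)
    d_min, _ = nn_1.kneighbors(A)
    d_min = d_min[:, 0]

    # STEPs 4a and 4b: threshold the nearest-neighbour distances against
    # multiples of the global average distance.
    U_art = A[d_min > theta1 * d_bar_k]                # artificial unstable set
    S_art = A[d_min <= (theta1 - theta2) * d_bar_k]    # artificial stable set
    return S_art, U_art

Drawing radii in proportion to the $d$-th root of a uniform variate is what makes STEP 1 uniform over the volume of the hyper-sphere, rather than concentrated at its centre.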

We note in passing that there is no strong link between the method used to generate artificial data and the algorithm used ultimately to perform novelty detection. It is feasible, for example, that we might use the above algorithm to generate artificial data, and then re-use parts of the algorithm to form a $k$-nearest neighbour analysis for eventual novelty detection. However, the focus of the work described by this paper is to demonstrate the benefit of probabilistic output for one-class SVMs, due to the traditionally superior classification performance of the latter with respect to many other methods [8].

IV. CALIBRATING NOVELTY SCORES INTO PROBABILITIES

In this section, we investigate several methods of calibrating novelty scores $z(\mathbf{x})$ into class-conditional probabilities. The most popular methods in the literature for achieving this calibration (with multi-class classifiers) include sigmoid fitting, binning, and isotonic regression. The first two are briefly recapped in Sections IV-A, and IV-B, respectively.

A. Sigmoid Fitting

In [27], and recently refined in [28], a sigmoid function is used to map the output of a two-class SVM classifier onto probabilities. With training data $\{\mathbf{x}_i, y_i\}$ labelled according to $y_i \in \{-1, +1\}$, a typical unthresholded output of a two-class SVM classifier [29] is

$f(\mathbf{x}) = \sum_i y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b$   (8)

where $b$ is the offset of an optimal hyperplane.

If we are to apply the method to one-class SVM classifiers for the purposes of novelty detection, this technique requires further evaluation. It is reported in [23] that, for some data sets, the sigmoid shape does not appear to fit two-class naïve Bayes scores as closely as it fits two-class SVM scores. Furthermore, the sigmoid-fitting method is based on the assumption that the class-conditional densities of the scores are exponential, motivated by the empirical distribution of many data sets [23], [27]. For different data sets, the suitability of a calibration method needs to be validated using a reliability diagram, which visualises the calibration of a classifier [30]. For each score $z$, the empirical probability $\hat{P}(C_2 \mid z)$ can be calculated as being the number of data with score $z$ that belong to class $C_2$, divided by the number of all samples with score $z$. If the classifier is well-calibrated, the plot of $\hat{P}(C_2 \mid z)$ versus $z$ will be the line $y = x$, meaning that the scores are equal to the empirical probabilities. However, for many of the novelty detection datasets that we have considered, such as those described later in this paper, the reliability diagram shows that the above assumption is not reasonable [8].
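For illustration only, sigmoid fitting in the style of [27] may be approximated by a one-dimensional logistic regression on the classifier scores, as sketched below; Platt's original procedure additionally regularises the 0/1 targets, which is omitted here, and fit_sigmoid is a hypothetical helper name.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_sigmoid(scores, labels):
    """Fit P(y=1 | f) = 1 / (1 + exp(-(A*f + B))) to unthresholded scores f.

    scores: shape (n,) SVM outputs; labels: shape (n,) in {0, 1}.
    """
    lr = LogisticRegression()
    lr.fit(scores.reshape(-1, 1), labels)
    return lambda f: lr.predict_proba(np.asarray(f).reshape(-1, 1))[:, 1]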

B. Binning

A histogram method, also termed binning, can be used to obtain calibrated probabilities from a naïve Bayesian classifier [31]. The training samples are first sorted according to their scores; the sorted set is then divided into $B$ subsets of equal size, called bins. For each bin $b_j$, the lower and upper boundary of scores within that bin are calculated. Given a test example $\mathbf{x}$ from bin $b_j$, the probability that $\mathbf{x}$ belongs to class $C_2$ is estimated as the fraction of all samples in $b_j$ that belong to class $C_2$.

The number of bins $B$ determines the number of different probability estimates, and must be small enough to reduce the variance of the probability estimates [31]. For instance, $B = 10$ was used in [31]. Compared with uncalibrated scores, binning improves the accuracy of probability estimates by reducing both variance and bias, at the price of reduced resolution of probability estimates.

The binning method is a non-parametric method which does not make any assumption about the mapping function between the output scores and probabilities, but it has several disadvantages [23]. First, the number of bins $B$ is usually chosen by cross-validation. However, cross-validation often fails to indicate the optimal value of $B$ if the training set is too small or unbalanced, as may be the case in novelty detection applications. Second, the size of the bins is fixed, and the positions of the boundaries are calculated accordingly. If two examples from the same bin are required to have different probability estimates, the binning method may fail to produce meaningfully calibrated novelty scores. We have chosen not to use the binning method for our calibration task due to these shortcomings.
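For completeness, a sketch of the binning method follows, assuming $B$ equal-frequency bins; fit_binning is a hypothetical helper name.

import numpy as np

def fit_binning(scores, labels, B=10):
    order = np.argsort(scores)
    s, y = scores[order], labels[order]
    bins = np.array_split(np.arange(len(s)), B)        # B subsets of (near-)equal size
    edges = [s[b[-1]] for b in bins[:-1]]              # upper score boundary per bin
    probs = np.array([y[b].mean() for b in bins])      # fraction of class C2 per bin

    def predict(z):
        idx = np.searchsorted(edges, z, side="right")  # locate the bin for each score
        return probs[idx]
    return predict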

C. Isotonic Regression

In [23], and further examined in [32], it was proposed to use an intermediary approach between sigmoid fitting and binning, termed isotonic regression. The latter is a non-parametric form of regression, with the restriction that the mapping from novelty scores into probabilities is isotonic (i.e., non-decreasing). Essentially, we wish to ensure that data are ranked correctly: a higher novelty score should always correspond to a higher probability of belonging to the unstable class, and vice versa.

The pool- or pair-adjacent violators (PAV) algorithm is employed to perform the isotonic regression, which finds the stepwise-constant isotonic function $\hat{m}(z)$ that fits the data according to a mean-squared-error criterion. This algorithm allows us to identify those cases in which correct ranking is not taking place; that is, the PAV algorithm ensures that higher novelty scores always correspond to higher probabilities of being unstable.

Let $\{z_i, y_i\}$ be the training examples from stable and unstable classes ($y_i = 0$ for class $C_1$, and $y_i = 1$ for class $C_2$), let $m(z_i)$ be the value of the function to be learned for each training sample, and let $\hat{m}$ be the isotonic function obtained from isotonic regression. The PAV algorithm works as follows.

STEP 1. Sort the examples according to their novelty scores $z_i$, in ascending order. Initialise $\hat{m}(z_i) = 1$ if $y_i = 1$, or $\hat{m}(z_i) = 0$ if $y_i = 0$. It is most likely that at this stage there will be several cases in which higher novelty scores do not correspond to higher probabilities of being unstable, and vice versa.

STEP 2. If $\hat{m}$ is isotonic, then return $\hat{m}$. Else, proceed to STEP 3.

STEP 3. Find a subscript $i$ such that $\hat{m}(z_i) > \hat{m}(z_{i+1})$. The examples $z_i$, and $z_{i+1}$ are called pair-adjacent violators. These are pairs that violate our requirement that increasing novelty scores correspond to increasing probabilities of instability. We then replace $\hat{m}(z_i)$ and $\hat{m}(z_{i+1})$ with their average

$\hat{m}(z_i) = \hat{m}(z_{i+1}) = \dfrac{\hat{m}(z_i) + \hat{m}(z_{i+1})}{2}$   (9)

This replacement removes the conflict, by smoothing the cdf (by introducing a quantisation in the probability, according to the average described above).

STEP 4. Set the averaged values as the new $\hat{m}$. Proceed to STEP 2.

Thus, $\hat{m}(z)$ is a step-wise constant function which consists of horizontal intervals, and may be interpreted as $P(C_2 \mid z)$, the probability that a sample with score $z$ is unstable. For a test example $\mathbf{x}$, we first find the interval to which its score $z(\mathbf{x})$ belongs. Then we set the value of $\hat{m}$ in this interval to be $P(C_2 \mid \mathbf{x})$, the probability estimate of $C_2$ given $\mathbf{x}$.

If the scores rank all examples correctly, then all class $C_1$ examples will appear before all class $C_2$ examples in the sorted data set in STEP 1. The calibrated probability estimate is then $\hat{m} = 0$ for class $C_1$, and $\hat{m} = 1$ for class $C_2$. Conversely, if the scores do not provide any information, $\hat{m}$ will be a constant function, taking the average target value over all examples (i.e., the overall fraction of class $C_2$ examples).

The PAV algorithm used in isotonic regression may be viewed as a binning algorithm, in which the position and the size of the bins are chosen according to how well the classifier ranks the samples [23]. Therefore, the isotonic regression overcomes the previously-described drawbacks of the binning method.

D. Obtaining and Using the Calibration Function

Isotonic regression may now be used with the artificial data generated as described in Section III, using the following procedure (sketched in code at the end of this subsection).

STEP 1. Generate artificial stable and unstable data, $\mathcal{S}'$ and $\mathcal{U}'$, using the method in Section III.

STEP 2. Construct a one-class SVM classifier following the method in Section II, using a training set comprising all available real stable data $\mathcal{S}$.

STEP 3. Calibrate the novelty scores $z$ from the one-class SVM classifier into probabilities, using $\mathcal{S}'$ and $\mathcal{U}'$. This calibration results in a PAV isotonic regression function $\hat{m}$.

STEP 4. Calibrate novelty scores of all real data into probabilities according to $\hat{m}$ obtained in STEP 3.

A novelty threshold may be set by taking advantage of the probabilistic nature of the output, at $P(C_2 \mid \mathbf{x}) = 0.5$, on the calibrated novelty scores; i.e., a test sample $\mathbf{x}$ is classified unstable if $P(C_2 \mid \mathbf{x}) \ge 0.5$, and stable otherwise. We use the threshold $P(C_2 \mid \mathbf{x}) = 0.5$ following [21], [33] for problems in which we have uniform class priors. Case studies presented in subsequent sections will demonstrate the suitability of this choice in application to datasets acquired during the monitoring of example high-integrity systems. We will also show results of area-under-the-ROC, for which all values of the threshold are considered.
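The following sketch glues STEPs 1-4 together, assuming the generate_artificial_data helper sketched in Section III and using scikit-learn's IsotonicRegression in place of a hand-written PAV; it is an illustration of the procedure, not the authors' code.

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.isotonic import IsotonicRegression

def train_svm1p(S, sigma, nu=0.05):
    # STEP 1: artificial calibration data from the real stable set alone.
    S_art, U_art = generate_artificial_data(S)
    # STEP 2: one-class SVM trained on all available real stable data.
    svm = OneClassSVM(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), nu=nu).fit(S)
    # STEP 3: calibrate novelty scores of the artificial data by isotonic regression.
    z_cal = -svm.decision_function(np.vstack([S_art, U_art]))
    y_cal = np.concatenate([np.zeros(len(S_art)), np.ones(len(U_art))])
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(z_cal, y_cal)
    # STEP 4: calibrated probability P(C2 | x) for any (real) data.
    def p_unstable(X):
        return iso.predict(-svm.decision_function(X))
    return p_unstable

# A test sample x is classified unstable if p_unstable([x]) >= 0.5.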

V. CASE STUDY I: INDUSTRIAL COMBUSTION MONITORING

A. Introduction

To demonstrate the proposed novelty detection method, this section considers the condition monitoring of an industrial combustion system, the Typhoon G30 combustor (Siemens Industrial Turbomachinery Ltd.).


Combustion instability, caused by the resonant coupling between combustive heat and acoustic pressure, is a major problem in the operation of jet engines and power generators. Early warning of combustion instability is required to prevent catastrophic system failure, and detection of a deviation away from stable operating status is important for being able to perform pre-emptive maintenance, such that hazards (and associated costs) can be avoided.

B. Methodology

As described in Section I, it is often desirable to perform system-specific novelty detection, in which a model of stability $M(\theta)$, with parameters $\theta$, is constructed on-line using data acquired from an individual system during its service life. It is typically assumed that the first training interval comprises stable data $\mathcal{S}$, such that a model can be constructed. (This assumption can be validated on-line by performing an initial comparison of data acquired during the system's first period of operation with a population-generic model, trained using data acquired from a population of systems of the same type.) That model $M(\theta)$ is then used for testing further data acquired from the system, such that new data are compared for novelty with respect to the previous, assumed stable, operation of that same system [34]. This model may be periodically re-trained as further stable data are acquired; for example, $M(\theta)$ may be re-trained at the end of each flight that is deemed to be stable compared with the existing model, in the case of aerospace EHM [35]. In the case of human vital-sign monitoring, the model could be re-trained after every fixed period of stable patient physiology [36], [37].

In the case study described in this section, the initial stable period of system operation was simulated by operating the Typhoon G30 combustor in a stable manner at atmospheric pressure. Stable data $\mathcal{S}$ were acquired from the system, as described below, from which a model of stability $M(\theta)$ was constructed using both (i) the proposed method, in which probabilistic calibration was performed, and (ii) the conventional method, in which probabilistic calibration was not performed. Subsequently-acquired data were then tested against this model. To examine the capability of the two novelty detection systems, the combustor then simulated a fault by being deliberately operated in a manner that promoted unstable combustion.

This unstable operation was achieved by increasing fuel flow-rates above some threshold, while maintaining a constant air flow-rate, which provided unstable test data $\mathcal{U}$ with which to evaluate the performance of the system-specific model of stability constructed previously.

C. Datasets

Two combustion datasets, $D_1$ and $D_2$, were acquired from a Typhoon G30 combustor, as described in the previous section. Each combustion dataset consists of measurements from three channels $c_1$, $c_2$, and $c_3$, with sampling frequencies of 1 kHz. Channel $c_1$ is the gas pressure of the fuel methane (CH$_4$) in the main burner. For stable combustion, the swirl air flow rate was 0.039 kg s$^{-1}$; the fuel supplied to the main, and the pilot burners was fixed at constant flow rates. To initiate combustion instabilities, the flow rate of fuel supplied to the main burner was increased, and that supplied to the pilot burner was decreased. Channels $c_2$ and $c_3$ are luminosity measurements recorded within the combustion chamber. A bundle of fine optical fibres was mounted at the rear focal point of a Nikon 35 mm camera, such that all light passing through the front lens was collected. The flame luminosity from the combustion chamber was measured using this system. The fibre optic bundle was bifurcated, each channel connected to a photomultiplier (ORIEL model 70704). This design allowed the measurement of chemiluminescent emitters of C$_2$ radicals (visible at light wavelength 513 nm), and the global intensity of unfiltered light, corresponding to the second, and third channels in the datasets, respectively.

Combustion flame images from a high-speed camera have been investigated to predict instability [38], in which a Gaussian mixture model (GMM) was constructed to identify novel flame patterns. A novelty detection method using SVMs [39] was able to achieve earlier identification of combustion instability, and greater distinction between stable and unstable classes, than the conventional GMM method. The optical measurement methods described above have been used to study the flame dynamics of unstable combustion [40].

The two triple-channel combustion datasets $D_1$ and $D_2$ contain 5 700, and 7 400 data-points, respectively, which were divided into non-overlapping windows of length 64. This design resulted in 89, and 115 windows of data for datasets $D_1$, and $D_2$, respectively. Wavelet analysis [41], [42] was used, with the Daubechies-3 wavelet function, to obtain wavelet coefficients for each window. Following [43], the mean value of the first-level approximation coefficients and the energy of the first-level detail coefficients were obtained for each window, to provide a bivariate data space. We note that the case study described in this section is bivariate, such that the data space can be plotted to illustrate the proposed novelty detection procedure; a multivariate example using patient vital-sign data is considered in the next section.
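A sketch of this feature extraction is given below, assuming the PyWavelets package; the window length of 64 follows from the window counts quoted above.

import numpy as np
import pywt

def wavelet_features(signal, window=64):
    """Bivariate features per non-overlapping window of a 1 kHz channel."""
    n_win = len(signal) // window
    feats = []
    for i in range(n_win):
        seg = signal[i * window:(i + 1) * window]
        cA, cD = pywt.dwt(seg, "db3")                 # first-level Daubechies-3
        feats.append([cA.mean(), np.sum(cD ** 2)])    # mean approx., detail energy
    return np.asarray(feats)                          # shape (n_win, 2)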

This procedure yielded 41 bivariate feature vectors of stable data $\mathcal{S}$, and 48 bivariate feature vectors of unstable data $\mathcal{U}$, for example combustion dataset $D_1$. 56, and 59 feature vectors for stable $\mathcal{S}$ and unstable $\mathcal{U}$ were obtained for dataset $D_2$, respectively.

For comparison, a synthetic combustion dataset $D_0$ was also generated, consisting of 200 stable examples $\mathcal{S}$, from which to construct a model of stability, and 200 unstable examples with which to examine the performance of the resulting novelty detection system. These examples were generated from a bivariate, three-component GMM, with full covariance matrices. Table I shows the components of each dataset used by the case study in this section.

TABLE I
DATASET INDICES FOR EXAMPLE REAL AND SYNTHETIC COMBUSTION DATASETS


A key point to note is that these datasets are used solely for the purposes of retrospective evaluation of the proposed novelty detection technique, and hence we have collected example unstable test data with which to determine the effectiveness of the one-class SVMs under consideration. When novelty detection is performed in practice, as noted previously, only data assumed to be stable are available, from which the model of stability is constructed. Indeed, due to the rarity of unstable events in most high-integrity systems, it is usually the case that most such systems run in a stable manner for the great majority of their operational lives. If sufficient unstable data were available at the training stage such that all possible fault conditions could be represented, then one may consider taking a conventional multi-class approach in which each unstable condition is explicitly modelled. Such a system could be expected to out-perform a one-class classifier, due to the inclusion of prior knowledge of non-stable classes [21]. Performance of a system using unlabelled data, with unlabelled unstable examples mixed with the known stable examples, could also be expected to out-perform a novelty detection system [44]–[46]. However, we will here confine ourselves to evaluating the proposed extension to the one-class SVM method, which represents the standard condition monitoring case in which only stable data are available at the time of model construction.

D. Model Construction

The conventional one-class SVM requires use of a validation set to determine the threshold on its novelty-score output $z(\mathbf{x})$. Hence, we illustrate this case for the conventional method by using 80% of the available stable data $\mathcal{S}$ for construction of a model of stability, and the remaining 20% of the stable data for setting of the novelty threshold, such that $\mathcal{S} = \mathcal{S}_{train} \cup \mathcal{S}_{val}$, where $\mathcal{S}_{val}$ is the validation set.

The SVM has two key parameters, the values of which need to be determined. The validation set is typically used to determine these values. The first parameter is the kernel bandwidth $\sigma$, as defined in (5), which corresponds to the width of the Gaussian kernel. The second parameter is typically termed $C$, which defines the complexity of the decision boundary:

$\min_{\mathbf{w}, \boldsymbol{\xi}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i - b$   (10)

$\text{subject to} \quad \mathbf{w} \cdot \phi(\mathbf{x}_i) \ge b - \xi_i, \quad \xi_i \ge 0$   (11)

where the $\xi_i$ are the individual errors [47]. We note the notational distinction between the SVM parameter $C$, and the class labels $C_1$, and $C_2$ previously used to describe the stable and unstable classes, respectively.

More intuitively, the value of the parameter $C$ determines the flexibility of the decision boundary; if $C$ takes large values in the above, then misclassifications are penalised more significantly, resulting in a decision boundary that is more flexible, attempting to include all training data. Conversely, if $C$ takes small values in the above, then misclassifications are penalised to a lesser degree, and therefore more misclassifications are allowed to occur. This latter case results in a smoother decision boundary [48]. A grid search is typically performed to obtain suitable values for $(\sigma, C)$. We will term the conventional one-class SVM method SVM-1.

approach allows automatic selection of a novelty threshold at, and hence all available stable data at the

time of model construction may be used as the training set forthe one-class SVM. The data in the training set were then usedto generate artificial data, and artificialdata, as described in Section III, for the purposes of calibratingthe SVM output into probabilities. The choice of in (5) mayalso be performed automatically, without the need for valida-tion, noting that, for a Gaussian kernel , the quantity

is the Euclidean distance between any two sam-ples scaled by a factor . Based on this close link betweenand Euclidean distance, we propose the following method to

determine an appropriate value for , based on a similar methodproposed by [49] for selecting in a density estimator.First, as described before, we calculate the local average Eu-

clidean distance of nearest neighbours from each samplein the training data, where is set to be the square root of thenumber of stable training samples . Next, the global averagedistance is found by averaging over all the training data.The value of provides a guide for the range of , such that

. Experiments not shown here (using each datasetconsidered in this article) indicate that the results obtained areinsensitive to the value of , and we have selected .These experiments varied , where ROC values weresimilar to 1 d.p. for values of over that range, using eachdataset considered in this paper. This insensitivity to the valueof was previously observed by other authors [49]. We notethat this value may be inappropriate for data spaces of particu-larly high dimension (e.g., 20-dimensional spaces), which thispaper does not consider, where a larger value of may beappropriate.Furthermore, it is common for the SVM parameter (the
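The bandwidth selection may be sketched as follows; the constant c is illustrative, and select_sigma is a hypothetical helper name.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_sigma(S, c=1.0):
    """Set sigma proportional to the global average k-NN distance, k = sqrt(N)."""
    N = len(S)
    k = max(1, int(np.sqrt(N)))
    nn = NearestNeighbors(n_neighbors=k + 1).fit(S)   # +1 skips the point itself
    dists, _ = nn.kneighbors(S)
    d_bar_k = dists[:, 1:].mean()                     # global average distance
    return c * d_bar_k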

Furthermore, it is common for the SVM parameter $C$ (the flexibility of the decision boundary) to be written in terms of a different parameter $\nu$:

$\nu = \dfrac{1}{NC}$   (12)

The support vector constraints [47]

$0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i = 1$   (13)

imply that the range of the parameter $C$ is $[1/N, 1]$; and so, from the above, we have the range for $\nu$, which is also $[1/N, 1]$. This result shows that the parameters $C$ and $\nu$ have the same range. The parameter $\nu$ was introduced in the SVM literature because it serves as an upper bound on the fraction of training samples that lie on the wrong side of the hyperplane (i.e., it is the maximum mis-classification rate). It is also a lower bound on the fraction of support vectors among stable training data [5]. We therefore use the parameterisation involving $\nu$ instead of $C$, due to its clear meaning, as described above. If we wish, the value of $C$ can be easily recovered by (12).

We will name the proposed probabilistic one-class SVM method SVM-1P.
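The stated bounds on $\nu$ can be checked empirically, as in the following sketch: after training, the fraction of training data scored as unstable is (approximately) bounded above by $\nu$, and the fraction of support vectors is bounded below by it; the equivalent $C$ is recovered via (12).

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
S = rng.normal(size=(500, 2))          # stand-in stable training data
nu = 0.1
svm = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(S)

frac_outliers = np.mean(svm.decision_function(S) < 0)  # wrong side of hyperplane
frac_sv = len(svm.support_vectors_) / len(S)
print(frac_outliers, frac_sv)          # frac_outliers <~ nu <= frac_sv
C = 1.0 / (len(S) * nu)                # the equivalent C, recovered via (12)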


Fig. 1. Novelty detection results using procedure SVM-1, applied to synthetic data set $D_0$: (a) SVM output, (b) SVM novelty scores.

E. Results—Conventional Methods

Fig. 1 shows the results obtained when applying conventional method SVM-1 to the synthetic dataset $D_0$. Part (a) shows the contour plot of SVM output in the bivariate data space. Stable training data are shown by white markers; stable and unstable test data are shown by white, and black markers, respectively. Support vectors are circled. Part (b) shows the novelty score $z(\mathbf{x})$ of each of the data. $D_0$ consists of 200 stable data, simulating the initial period of stable system operation, and 200 unstable data, here generated to evaluate the novelty detection system, and to simulate a system instability (which would be rare, in practice). In procedure SVM-1, 80% of the real stable data (data indices 1 to 160) are used for training the one-class SVM, with the remainder of the stable data (data indices 161 to 200) used as the validation set. The threshold is represented by a horizontal line, and is set to be the maximum score assigned to data in the validation set. The stable data occupy a region in the upper-right quadrant of the data space, and the conventional SVM has correctly delimited the locus of stability, as shown in Fig. 1(a). Data acquired from a conventional high-integrity system would spend the majority (if not the entirety) of their time contained within this locus, and hence would be classified stable, because the corresponding conventional SVM novelty scores fall beneath the novelty threshold, as illustrated in Fig. 1(b). In this synthetic example, the simulated fault, occurring at data index 201 and persisting for the remainder of the dataset, results in data that lie well-separated from the locus of stable data. Note that stable data in the validation set take significantly larger novelty scores than the training data, as may be seen for several of the validation data in Fig. 1(b). Hence, the selection of a suitable novelty threshold requires care when using the conventional one-class SVM formulation, SVM-1.

Fig. 2 shows the results obtained when applying the conventional SVM-1 to the example bivariate dataset $D_1$. Parts (a), (b), and (c) show contour plots of SVM novelty scores $z(\mathbf{x})$ in the bivariate data space of each channel, using the same notation as previously shown. Part (d) shows the novelty scores of channels $c_1$, $c_2$, and $c_3$, from upper to lower figures, respectively. Using SVM-1, the first 80% of stable data indices have been used as the training data, with the remainder of the stable data used as validation data. The seeded fault data, used to test the novelty detection system, comprise the unstable data indices 42 to 89. While the separation of fault data from stable data is generally large, it may be seen that more overlap exists between the two classes than with the synthetic dataset $D_0$. This overlap adversely affects the ability of the conventional SVM-1 method to set the novelty threshold on the novelty score such that the classifier is suitably sensitive to the subsequently-acquired unstable data from the seeded fault. This effect is most evident for the SVM that is trained using data from one of the channels, in which one of the validation data takes a high novelty score, which significantly reduces the sensitivity of the novelty detector to the fault data. Similarly, the SVM trained for another channel results in constantly high novelty scores throughout the stable validation dataset.

The earliest detections of the fault occur at different data indices for each of the three channels, in each case later than the onset of the fault at data index 42. Thus, the conventional one-class method SVM-1 has resulted in decreased sensitivity with respect to fault data, with a number of false-negative classifications of the earlier fault data. This result is of particular significance in the monitoring of high-integrity systems, as will be discussed in Section VII.

Fig. 2. Novelty detection results using procedure SVM-1, applied to combustion data set $D_1$: (a)–(c) SVM output of each channel, (d) SVM novelty scores.

We consider also the use of a GMM, cross-validated using the same procedure as used to set the parameters of method SVM-1. We note that the GMM typically performs less well than the SVM, in terms of both earliest warning of instability and the overall area-under-the-curve (AUC), where the curve is the receiver operating characteristic (ROC) curve.

F. Results—Proposed Method SVM-1P

Fig. 3 illustrates the proposed calibration method SVM-1P applied to synthetic dataset $D_0$. The upper figure shows real stable data $\mathcal{S}$, and artificial data $\mathcal{S}'$ and $\mathcal{U}'$ generated around $\mathcal{S}$; it may be seen that the locus of artificial unstable data $\mathcal{U}'$ is a torus around the stable data from which they were generated. The lower figure shows the corresponding isotonic function obtained using the artificial data, on which the calibrated probabilities obtained from all data are marked. The locus of data occupied by the simulated fault is confined to the lower-left quadrant of the bivariate data space, which is typical for novelty detection applications: the fault is not a full description of instability, which motivates the use of the one-class approach for monitoring high-integrity systems, where only stability is typically well-understood. Fig. 3 (lower) shows the step-wise nature of the isotonic function. The figure shows that $\hat{m}$ takes all values in the range $[0, 1]$.

Fig. 3. Isotonic regression results using procedure SVM-1P, applied to synthetic data set $D_0$.

Results obtained by applying the calibration method to exemplar dataset $D_1$ are shown in Fig. 4. Part (a) illustrates calibration results for channel $c_1$, showing (upper) the generation of artificial data $\mathcal{S}'$ and $\mathcal{U}'$ from stable dataset $\mathcal{S}$, and (lower) the corresponding isotonic function. Part (b) shows output estimated probabilities for each of the three channels $c_1$, $c_2$, and $c_3$. Fig. 4 shows that the locus of data space occupied by the data acquired during fault conditions lies significantly far from the locus of stable data $\mathcal{S}$, and from the toroidal structure of the artificial data that were used for calibration. The corresponding calibrated probabilities for this well-separated data space are therefore extremal, taking values close to 0 and 1. The a priori selection of the novelty threshold at $P(C_2 \mid \mathbf{x}) = 0.5$ is shown in Fig. 4(b).


Fig. 4. Results obtained using procedure SVM-1P applied to combustion dataset $D_1$: (a) calibration for channel $c_1$, (b) output probabilities.

TABLE II
DATA INDEX OF THE FIRST UNSTABLE CLASSIFICATION, ACCORDING TO EACH CLASSIFIER

Fig. 4(b) shows that the estimated probabilities have been forced into the extrema of the range $[0, 1]$. The onset of the fault conditions is detected earlier than with the conventional method SVM-1, as shown in Table II. Recall from Table I that the onset of instability in dataset $D_1$ occurs at data index 42, which therefore represents the point at which earliest detection could occur. The proposed method identifies the deterioration in the second unstable window, at data index 43, which is earlier than the conventional one-class SVM and the GMM.

TABLE III
AUC VALUES FOR EACH METHOD

G. Discussion

Results obtained for dataset $D_2$ are shown in Table II, where it may be seen that the differences between methods SVM-1, GMM, and SVM-1P are similar to those illustrated in more detail for dataset $D_1$, above. Recall from Table I that the onset of seeded fault conditions occurred at data index 57 for dataset $D_2$, which indicates that the proposed method SVM-1P is sufficiently sensitive to detect the onset of unstable combustion, while the conventional method SVM-1 is less sensitive, and, in the case of one of the channels, generates a false alarm by incorrectly classifying a stable data index as being unstable. Such false alarms are one of the principal reasons that conventional monitoring systems are ignored in practice, as discussed in Section VII. Again, the SVM-based methods outperform the GMM. The (specificity, sensitivity) for SVM-1, GMM, and SVM-1P were (0.92, 0.91), (0.88, 0.87), and (0.95, 0.92), respectively, for dataset $D_2$, using cross-validation thresholds for SVM-1 and GMM, and a probabilistic threshold of $P(C_2 \mid \mathbf{x}) = 0.5$ for SVM-1P.

AUC results for both methods, for all datasets, are reported in Table III, where it may be seen that the proposed method performs similarly using the synthetic dataset $D_0$, due to the separability of the simulated unstable conditions from the stable conditions; however, the proposed method provides a noticeable increase in AUC for the exemplar combustion datasets $D_1$ and $D_2$, indicating that the increase in sensitivity to fault conditions, illustrated in the previous subsections, has not come at the expense of decreased specificity (i.e., the false-alarm rate is kept sufficiently low as to be usable in practice).

VI. CASE STUDY II: PATIENT VITAL-SIGN MONITORING

This section reports results obtained from evaluating both SVM-1 and SVM-1P using dataset $D_3$, which is an example of novelty detection for patient vital-sign monitoring. Whereas the previous section evaluated the performance of the algorithm using bivariate data, such that the data space could be plotted to illustrate the SVM output and the probabilistic calibration method, this section examines the use of the algorithm for higher-dimensional data.

A. Dataset

This section considers vital-sign data acquired from patients in a step-down unit (SDU), which is a level of acuity lower than that of the intensive care unit (ICU). There is a need for effective novelty detection systems in such wards, because patient deterioration can go unnoticed by clinical staff, leading to adverse patient outcomes [50]. Existing patient monitors generate univariate alarms whenever vital signs exceed some pre-defined threshold; such alarms often go unheeded due to their high false-positive rate, where [51] reported results of a study in which it was deemed that 84% of alarms were false.

The dataset used for the work described by this section comprises measurements of heart rate, breathing rate, blood oxygen saturation, and systolic blood pressure. Data were acquired once every four hours by ward staff (as is common practice in most SDU-level wards in the UK and the US) at the Oxford Cancer Hospital, Oxford, UK. 3 000 such vectors were acquired from 40 patients.

B. Methodology

As in Section V, procedures SVM-1 and SVM-1P were applied, using the conventional and proposed methodologies, respectively. A novelty detection approach is particularly suitable to the analysis of vital-sign data from hospital patients, because it is unlikely that a full description of unstable classes could be obtained, given the range of potential physiological deteriorations that a human may undergo, and the variation in responses to unstable conditions between patients. Hence, given a large set of stable data $\mathcal{S}$, a model of stability can be constructed that is then used to detect unstable physiological variation with respect to that model.

203 unstable data were acquired, deemed to be so by existing, manual clinical systems that are used to determine if a patient requires review by senior medical personnel. Of the 2 797 remaining stable data, 500 were provided to methods SVM-1 and SVM-1P from which to learn a model of stability, with the remaining 2 297 stable data being used as test data, to evaluate the false-positive rate of both methods.

We note that the investigation described in this section is a population-generic approach, collecting the data for multiple patients to construct a single model of stable patient physiology. This approach is directly comparable to standard clinical practice, in which population-generic sets of heuristic scores are manually applied to determine if a patient's vital signs are stable or unstable.

C. Results

Results obtained using SVM-1, and SVM-1P, are shown in Fig. 5(a), and (b), respectively. Each procedure was given the first 500 stable vital-sign vectors on which to train (and, in the case of the conventional SVM-1 method, validate) a model of stability; these training data are shown in grey.

The final 203 data were classified as unstable, and representative of patient deterioration, by clinicians; these are shown in black. The remainder of the data are stable, were used for testing, and are also shown in black. The conventional method SVM-1 misclassifies significantly more stable data with respect to its model of stability (including some training data) than does the proposed method SVM-1P, due to the poor separation of SVM-1 novelty scores between the two classes.

ROC curves for each method, evaluated over statistically independent experiments, are shown in Fig. 6, where methods SVM-1P, SVM-1, and GMM are shown as black dashed, black solid, and grey dash-dotted lines, respectively.

Fig. 5. Results obtained using procedures SVM-1, and SVM-1P, applied to the patient vital-sign dataset: (a) SVM-1 output; (b) SVM-1P output.

Fig. 6. ROC curves for statistically independent evaluations.

The mean ROC of the experiments is shown by the thick lines, where the curves for SVM-1P may be seen to be closer to the upper-left corner of the ROC plot. Error bars are shown using thin dotted lines. It may be seen that the proposed method SVM-1P achieves both a higher sensitivity and a higher specificity than the conventional methods over all experiments. The (specificity, sensitivity) results for SVM-1, GMM, and SVM-1P were (0.90, 0.92), (0.82, 0.91), and (0.94, 0.97), respectively, using cross-validated thresholds for SVM-1 and GMM, and a probabilistic threshold for SVM-1P.

VII. CONCLUSIONS, AND DISCUSSION

The conventional method of using a one-class SVM is well understood in the literature, and has been evaluated here in comparison with a proposed method that (i) calibrates the novelty scores output by the one-class SVM into estimated posterior class probabilities, where special care was required due to the one-class formulation; (ii) utilises the probabilistic nature of the result to define a novelty threshold without the need for the conventional validation set; and (iii) proposes a procedure for determining other SVM parameters.


These proposals have been illustrated using lower-dimensional data from a large-scale combustor, whereby the generation of artificial data (as is required by the calibration step) and the data space itself may be visualised. A higher-dimensional evaluation was performed using patient vital-sign data. In both cases, the proposed method achieved better overall sensitivity and specificity than the conventional technique. We note that the application of the method to datasets of very high dimension (e.g., spaces of dimension greater than 20) is not considered by the work described in this paper, in which we restrict ourselves to the applications typically encountered in condition monitoring.

In practice, the acceptance of most methods for monitoring high-integrity systems is determined by their false-positive rates. While a high false-positive rate may be acceptable in, for example, the screening of cancer patients (in which the priority is to detect all cancers, and where the false-positive cost is relatively low), monitoring systems must seldom generate false alerts. In the case of industrial systems, such alerts could result in the premature landing of an aeroplane, or the halting of a power-generation engine; in the case of human vital-sign monitoring, such alerts would result in senior clinicians being brought to the patient bedside to review patients that are stable. In all such cases, the cost of false-positive alerts far exceeds the cost of false negatives. Such monitoring methods are typically part of a redundantly-configured network of systems, and so failure to detect all instabilities exhibited by the system-under-test is relatively less costly. We have demonstrated that the proposed probabilistic approach can result in significantly fewer false alerts (i.e., it has higher specificity) than the conventional method, while remaining acceptably sensitive to the detection of unstable examples.

A second requirement of monitoring systems is that unstable conditions are detected as early as possible, such that preventative action may be taken to avoid system damage from continued unstable operation (whether that action be machine maintenance, or patient review by a clinician). This is particularly important in low-bandwidth monitoring systems, in which data are acquired at a low frequency, such as in the case of aircraft monitoring, where a summary of engine performance may be downloaded at the end of each flight [13], or, as in the second case study considered by this paper, when patient vital signs are observed every four hours. We have demonstrated that the proposed probabilistic method achieves earlier warning of unstable combustion than is provided by the conventional method, which would enable the system to be shut down at the onset of unstable combustion conditions, thus avoiding further risk to the system. Further improvement could be gained by training the model with a sequentially-updating on-line training algorithm.

We note that in cases where the quantity of stable data acquired is particularly large, as would be the case if high sampling-rate sensors were used, the differences between the two procedures would be smaller than those considered here, because both systems would be able to form complete models of stability from the sufficient quantity of training data.

However, even in such cases, the proposed method could be exploited to provide rapid re-training of new system-specific models of stability after system maintenance, whereby models of the system's pre-maintenance behaviour must be discarded, and new models learned.

REFERENCES

[1] P. Hayton, L. Tarassenko, B. Scholkopf, and P. Anuzis, "Support vector novelty detection applied to jet engine vibration spectra," in Proc. NIPS, Denver, CO, USA, 2000, pp. 946–952.

[2] A. Gretton and F. Desobry, "On-line one-class support vector machines: An application to signal segmentation," presented at the IEEE ICASSP, Hong Kong, China, 2003.

[3] D. R. Hardoon and L. M. Manevitz, "fMRI analysis via one-class machine learning techniques," in Proc. 19th Int. Joint Conf. Artif. Intell. (IJCAI), Edinburgh, U.K., 2005, pp. 1604–1605.

[4] M. Markou and S. Singh, "Novelty detection: A review - part 2: Neural network based approaches," Signal Process., vol. 83, no. 12, pp. 2499–2521, 2003.

[5] B. Scholkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Comput., vol. 13, no. 7, pp. 1443–1471, 2001.

[6] D. M. J. Tax and R. P. W. Duin, "Data domain description using support vectors," in Proc. ESANN'99, Brussels, Belgium, 1999, pp. 251–256.

[7] D. Tax and R. Duin, "Support vector domain description," Pattern Recognit. Lett., vol. 20, pp. 1191–1199, 1999.

[8] L. Clifton, "Multi-Channel Novelty Detection and Classifier Combination," Ph.D. dissertation, Electrical and Electronic Engineering, Univ. Manchester, Manchester, U.K., 2007.

[9] J. Drish, Obtaining Calibrated Probability Estimates From Support Vector Machines, Univ. California, San Diego, CA, USA, Tech. Rep. 10.1.1.161.3828, 2001.

[10] H. Chen, P. Tino, and X. Yao, "Probabilistic classification vector machines," IEEE Trans. Neural Netw., vol. 20, no. 6, pp. 901–914, Jun. 2009.

[11] D. A. Clifton, "Novelty Detection With Extreme Value Theory in Jet Engine Vibration Data," Ph.D. dissertation, Univ. Oxford, Oxford, U.K., 2009.

[12] P. Sollich, "Probabilistic methods for support vector machines," Adv. Neural Inf. Process. Syst., vol. 12, pp. 349–355, 2000.

[13] D. Clifton, N. McGrogan, L. Tarassenko, S. King, P. Anuzis, and D. King, "Bayesian extreme value statistics for novelty detection in gas-turbine engines," in Proc. IEEE Aerospace Conf., Montana, USA, 2008, pp. 1–11.

[14] S. King, P. Bannister, D. Clifton, and L. Tarassenko, "Probabilistic approaches to condition monitoring of aerospace engines," Proc. IMechE Part G: J. Aerosp. Eng., vol. 223, no. 5, pp. 533–541, 2009.

[15] A. Fleury, M. Vacher, and N. Noury, "SVM-based multi-modal classification of activities of daily living in health smart homes," IEEE Trans. Inf. Technol. Biomed., vol. 14, no. 2, pp. 274–283, Mar. 2010.

[16] M. Jaana, G. Pare, and C. Sicotte, "Home telemonitoring for respiratory conditions: A systematic review," Amer. J. Managed Care, vol. 15, no. 5, pp. 313–320, 2009.

[17] K. Shen, C. Ong, X. Li, and E. Wilder-Smith, "Feature selection via sensitivity analysis of SVM probabilistic outputs," Mach. Learn., vol. 70, no. 1, pp. 1–20, 2008.

[18] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.

[19] D. M. J. Tax and R. P. W. Duin, "Combining one-class classifiers," in Proc. Multiple Classifier Syst., 2001, pp. 299–308.

[20] L. I. Kuncheva, "A theoretical study on six classifier fusion strategies," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 281–286, Feb. 2002.

[21] C. M. Bishop, Pattern Recognition and Machine Learning. Berlin, Germany: Springer-Verlag, 2006.

[22] Y. Grandvalet, J. Mariéthoz, and S. Bengio, "A probabilistic interpretation of SVMs with an application to unbalanced classification," in Adv. Neural Inf. Process. Syst. 18. Cambridge, MA, USA: MIT Press, 2006, pp. 467–474.

[23] B. Zadrozny and C. Elkan, "Transforming classifier scores into accurate multiclass probability estimates," in Proc. ACM SIGKDD, 2002, pp. 694–699.

[24] A. Niculescu-Mizil and R. Caruana, "Predicting good probabilities with supervised learning," in Proc. ICML, 2005, pp. 625–632.


[25] D. M. J. Tax and R. P. W. Duin, "Uniform object generation for optimizing one-class classifiers," J. Mach. Learn. Res., vol. 2, pp. 155–173, 2001.

[26] M. Markou and S. Singh, "A neural network-based novelty detector for image sequence analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 10, pp. 1664–1677, Oct. 2006.

[27] J. C. Platt, "Probabilistic outputs for support vector machines and comparison to regularized likelihood methods," in Adv. Large Margin Classifiers, 1999, pp. 61–74.

[28] H. Lin, C. Lin, and R. Weng, "A note on Platt's probabilistic outputs for support vector machines," Mach. Learn., vol. 68, no. 3, pp. 267–276, 2007.

[29] V. Vapnik, The Nature of Statistical Learning Theory, 2nd ed. Berlin, Germany: Springer-Verlag, 2000.

[30] M. H. DeGroot and S. E. Fienberg, "The comparison and evaluation of forecasters," Statistician, vol. 32, no. 1, pp. 12–22, 1982.

[31] B. Zadrozny and C. Elkan, "Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers," in Proc. ICML, 2001, pp. 609–616.

[32] S. Ruping, “Robust probabilistic calibration,” in Proc. ECML, 2006,pp. 743–750.

[33] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: John Wiley and Sons, 2001.

[34] I. Nabney, Netlab: Algorithms for Pattern Recognition, 1st ed. London, U.K.: Springer-Verlag, 2002.

[35] D. Clifton, S. Hugueny, and L. Tarassenko, "Novelty detection with multivariate extreme value statistics," J. Signal Process. Syst., 2010, in press.

[36] D. Clifton, L. Clifton, and L. Tarassenko, "Patient-specific biomedical condition monitoring for post-operative cancer patients," in Proc. Condition Monitoring, Dublin, Ireland, 2009, pp. 424–433.

[37] S. Hugueny, D. Clifton, and L. Tarassenko, "Probabilistic patient monitoring using extreme value theory," in Proc. Biomed. Syst. Technol., Valencia, Spain, 2010, pp. 5–12.

[38] L. Wang and H. Yin, "Wavelet analysis in novelty detection for combustion image data," in Proc. 10th CACSC, Liverpool, U.K., 2004, pp. 79–82.

[39] L. Clifton, H. Yin, and Y. Zhang, "Support vector machine in novelty detection for multi-channel combustion data," in Proc. ISNN (3), Chengdu, China, 2006, pp. 836–843.

[40] W. B. Ng, E. Clough, K. J. Syed, and Y. Zhang, "The combined investigation of the flame dynamics of an industrial gas turbine combustor using high-speed imaging and an optically integrated data collection method," Meas. Sci. Technol., vol. 15, pp. 2303–2309, 2004.

[41] I. Daubechies, "Orthonormal bases of compactly supported wavelets," Commun. Pure Appl. Math., vol. 41, pp. 909–996, 1988.

[42] S. G. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 674–693, Jul. 1989.

[43] L. Clifton, H. Yin, D. A. Clifton, and Y. Zhang, "Combined support vector novelty detection for multi-channel combustion data," in Proc. IEEE ICNSC, London, U.K., 2007, pp. 495–500.

[44] F. Letouzey, F. Denis, and R. Gilleron, "Learning from positive and unlabeled examples," in Proc. 11th Int. Conf. Algorithmic Learn. Theory, 2000, pp. 71–85.

[45] D. Zhang, "A simple probabilistic approach to learning from positive and unlabeled examples," presented at the 5th Annu. U.K. Workshop Comput. Intell., 2005.

[46] C. Elkan and K. Noto, "Learning classifiers from only positive and unlabeled data," presented at the 14th Int. Conf. Knowledge Discovery and Data Mining, 2008.

[47] C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining Knowledge Discovery, vol. 2, pp. 121–167, 1998.

[48] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. Berlin, Germany: Springer-Verlag, 2009.

[49] C. M. Bishop, “Novelty detection and neural network validation,” inProc. IEE Conf. Vis. Image Signal Process., 1994, vol. 141, no. 4, pp.217–222.

[50] National Patient Safety Agency, Safer Care for Acutely Ill Patients: Learning From Serious Incidents. NPSA, 2007.

[51] C. Tsien and J. Fackler, "Poor prognosis for existing monitors in the intensive care unit," Crit. Care Med., vol. 25, no. 4, pp. 614–619, 1997.

Lei Clifton received B.Sc. and M.Sc. degrees in electrical engineering from the Beijing Institute of Technology, China, and a Ph.D. degree in electrical engineering from the University of Manchester, U.K. After six years of post-doctoral research at the University of Oxford, U.K., she was appointed as a Medical Statistician at the Centre for Statistics in Medicine, University of Oxford. Her research interests include statistical signal processing, and machine learning for intelligent health monitoring systems.

David A. Clifton (S'07–M'08) is a member of faculty in the Department of Engineering Science at the University of Oxford, from which he previously graduated in 2009. He is the group leader of the Computational Health Informatics (CHI) laboratory in that Department, and the Associate Director of the Oxford Centre for Affordable Healthcare Technology. He is a Research Fellow of the Royal Academy of Engineering; a Fellow of Kellogg College, Oxford; a Fellow of Mansfield College, Oxford; and a College Lecturer in Engineering at Balliol College, Oxford.

Yang Zhang received a B.Eng. from Zhejiang University, China, and a Ph.D. from the Engineering Department of Cambridge University. After post-doctoral research at Cambridge University, he moved to UMIST, and then to the University of Manchester, before taking up the Chair of Combustion and Energy at the University of Sheffield.

Peter Watkinson is an Intensive Care Physician at Oxford University Hospitals NHS Trust, and is one of two clinical leads for the Critical Care research group at the Kadoorie Centre in Oxford. His research focus combines the fields of bioengineering and acute medicine to generate innovative methods for the identification of at-risk patients, both in and out of hospital.

Lionel Tarassenko received a B.A. in engineering science, and a D.Phil. in medical engineering, both from the University of Oxford, U.K., in 1978 and 1985, respectively. He then held a number of positions in academia and industry, before taking up an Assistant Professorship at Oxford in 1988. He has been the holder of the Chair in Electrical Engineering at Oxford University since October 1997. He is the author of 150 journal papers, 160 conference papers, and 3 books; and holds 24 granted patents. He was the founding Director of the Oxford Institute of Biomedical Engineering in 2008, and has been the Director of the Centre of Excellence in Medical Engineering funded by the Wellcome Trust and EPSRC since October 2009. Prof. Tarassenko was awarded the 2006 Silver Medal of the Royal Academy of Engineering for his contribution to British engineering.

Hujun Yin (S'93–M'96–SM'03) has been with the School of Electrical and Electronic Engineering, The University of Manchester, since 1996. He received B.Eng. and M.Sc. degrees from Southeast University, China, and a Ph.D. degree from the University of York, U.K., in 1983, 1986, and 1996, respectively. His main research interests are neural networks (self-organising systems in particular), pattern recognition, bio- and neuro-informatics, and face recognition. He has published over 150 peer-reviewed articles on a range of topics. He is a Senior Member of the IEEE. He was an Associate Editor of the IEEE Transactions on Neural Networks from 2006 to 2010, and has been a member of the Editorial Board of the International Journal of Neural Systems since 2005.