
2003 Special issue

A novel neural network-based survival analysis model

Antonio Eleuteri a,b,*, Roberto Tagliaferri c,d, Leopoldo Milano b,e, Sabino De Placido f, Michele De Laurentiis f

a DMA, Università di Napoli ‘Federico II’, Naples, Italy
b INFN Sez. Napoli, Naples, Italy
c DMI, Università di Salerno, Salerno, Italy
d INFM Unità di Salerno, Salerno, Italy
e Dipartimento di Scienze Fisiche, Università di Napoli ‘Federico II’, Naples, Italy
f Dipartimento di Endocrinologia ed Oncologia Molecolare e Clinica, Università di Napoli ‘Federico II’, Naples, Italy

* Corresponding author. Address: INFN Sez. Napoli, Naples, Italy. E-mail addresses: [email protected] (A. Eleuteri), [email protected] (R. Tagliaferri).

Neural Networks 16 (2003) 855–864. doi:10.1016/S0893-6080(03)00098-4

Abstract

A feedforward neural network architecture aimed at survival probability estimation is presented which generalizes the standard, usually

linear, models described in literature. The network builds an approximation to the survival probability of a system at a given time, conditional

on the system features. The resulting model is described in a hierarchical Bayesian framework. Experiments with synthetic and real world

data compare the performance of this model with the commonly used standard ones.

© 2003 Elsevier Science Ltd. All rights reserved.

Keywords: Survival analysis; Conditioning probability estimation; Neural networks; Bayesian learning; MCMC methods

1. Introduction

Survival analysis is used when we wish to study the

occurrence of some event in a population of subjects and the

time until the event is of interest. This time is called survival

time or failure time. Survival analysis is used in many

fields of study. Examples include: time until failure of a

light bulb and time until occurrence of an anomaly in an

electronic circuit in industrial reliability, time until relapse

of cancer and time until pregnancy in medical statistics,

duration of strikes in economics, length of tracks on a

photographic plate in particle physics.

In the literature we find many different modeling approaches

to survival analysis. Conventional parametric models may

involve too strict assumptions on the distributions of failure

times and on the form of the influence of the system features

on the survival time, assumptions which usually oversimplify the experimental evidence, particularly in the case

of medical data (Kalbfleisch & Prentice, 1980). In contrast,

semi-parametric models do not make assumptions on the

distributions of failures, but make instead assumptions on

how the system features influence the survival time.

Finally, non-parametric models only allow for a qualitative

description of the data.

Neural networks (NN) have been recently used for

survival analysis, but most of the efforts have been devoted

to extending Cox’s model (Faraggi & Simon, 1995; Bakker

& Heskes, 1999; Lisboa & Wong, 2001; Liestøl, Andersen,

& Andersen, 1994), or to making piecewise or discrete

approximations of the hazard functions (Neal, 2001;

Biganzoli, Boracchi, & Marubini, 2002), or treating the

problem as a classification one (De Laurentiis, De Placido,

Bianco, Clark, & Ravdin, 1999), or to ignoring the

censoring, which may introduce significant bias into

the estimates, as criticized in Ripley and Ripley (1998)

and Schwarzer, Vach, and Schumacher (2000). We refer to

the latter two excellent surveys for the description of the

many limitations which arise in the current use of NNs,

limitations due to the use of heuristic approaches which do

not take into account the peculiarities of survival data, and/or ignore fundamental issues such as regularization.

In this paper we describe a novel NN architecture which

overcomes all the limitations of currently available NN

models without making assumptions on the underlying

survival distribution and which is proven to be flexible

enough to model the most complex survival data.

We describe our model in a Bayesian framework, which,

as advocated by Raftery, Madigan, and Volinsky (1994) in


the context of survival analysis, helps to account for model uncertainty, which can improve the predictive performance of a model.

The paper is organized as follows. In Section 2 we

describe the fundamentals of survival analysis and pose the

censoring problem. In Section 3 we describe two of the most widely used standard modeling techniques, which will serve as comparisons for our model. In Section 4 we describe the

architecture of our NN. In Section 5 we describe our model

in a hierarchical Bayesian framework and a Monte Carlo

estimator of the survival function is shown. In Section 6

some experimental results are shown, in comparison with

some commonly used modeling techniques.

2. Survival analysis

Let $T(x)$ denote an absolutely continuous random variable (rv) describing the failure time of a system defined by a vector of features $x$. If $F_T(t|x)$ is the cumulative distribution function for $T(x)$, then we can define the survival function:

$$S(t|x) = P\{T(x) > t\} = 1 - F_T(t|x) = 1 - \int_0^t f_T(u|x)\,du \qquad (1)$$

which is the probability that the failure occurs after time $t$.

The requirements for a function $S(t|x)$ to be a survival function are, uniformly w.r.t. $x$:

(1) $S(0|x) = 1$ (there cannot be a failure before time 0),
(2) $S(+\infty|x) = 0$ (asymptotically all events realize),
(3) $S(t_1|x) \ge S(t_2|x)$ when $t_1 \le t_2$.

From the definitions of survival and density of failure, we can derive the hazard function, which gives the probability density of an event occurring around time $t$, given that it has not occurred before $t$. It can be shown that the hazard has this form (Kalbfleisch & Prentice, 1980):

$$\lambda(t|x) = \frac{f_T(t|x)}{S(t|x)}. \qquad (2)$$

In many studies (e.g. in medical statistics or in quality control), we do not observe realizations of the rv $T(x)$. Rather, this variable is associated with some other rv $V$, such that the observation is a realization of the rv $Z = q(T, V)$. In this case we say that the observation of the time of the event is censored. If $V$ is independent of $T$, we say that censoring is uninformative.

Depending on the form of the function $q$, we have different forms of censoring. Two of the most common ones in applications are right censoring, when $q \equiv \min(\cdot,\cdot)$, and left censoring, when $q \equiv \max(\cdot,\cdot)$.

Survival data can then be represented by triples of the form $(t, x, y)$ where $x$ is a vector of features defining the process, $t$ is an observed time, and $y$ is an indicator variable:

$$y = \begin{cases} 1 & \text{if } t \text{ is an event time} \\ 0 & \text{if } t \text{ is a censored time.} \end{cases} \qquad (3)$$

We can define the sampling density $l(y|t, x)$ of a survival process (assuming, e.g. right censoring) by noting that, if we observe an event ($y = 1$), it directly contributes to the evaluation of the sample density; if we do not observe the event ($y = 0$), the best contribution is to evaluate $S(t|x)$. We then have:

$$l(y|t, x) = S(t|x)^{1-y} f_T(t|x)^{y}. \qquad (4)$$

The joint sample density for a set of independent observations $D = \{(t_k, x_k, y_k)\}$ can then be written as:

$$L = \prod_k l(y_k|t_k, x_k). \qquad (5)$$

Since the censoring is supposed uninformative, it does not

influence inference on the failure density, but it contributes to the sample density.

Note that censored data models can be seen as a

particular class of missing data models in which the

densities of interest are not sampled directly (Robert &

Casella, 1999).
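For illustration only (the paper's own software was written in MATLAB), a minimal Python sketch of the censored likelihood of Eqs. (4) and (5) is given below; the Weibull density standing in for the model density and all function names are our assumptions.

```python
import numpy as np

def weibull_logpdf(t, shape, scale):
    # log f_T(t) for a Weibull(shape, scale) density; stands in for the model density
    return (np.log(shape / scale) + (shape - 1.0) * np.log(t / scale)
            - (t / scale) ** shape)

def weibull_logsurv(t, shape, scale):
    # log S(t) = -(t/scale)^shape for the Weibull model
    return -(t / scale) ** shape

def censored_loglik(t, y, shape, scale):
    """Log of Eq. (5): sum of y*log f_T(t) + (1-y)*log S(t), as in Eq. (4)."""
    return np.sum(y * weibull_logpdf(t, shape, scale)
                  + (1.0 - y) * weibull_logsurv(t, shape, scale))

# toy usage: three event times and one right-censored time
t = np.array([2.0, 5.0, 1.3, 4.0])
y = np.array([1.0, 1.0, 1.0, 0.0])   # 1 = event, 0 = censored
print(censored_loglik(t, y, shape=1.5, scale=3.0))
```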

3. Standard models for survival analysis

In the next sections we will describe some of the most

commonly used models, both for homogeneous (time-only)

and heterogeneous modeling.

3.1. The Kaplan–Meier non-parametric estimator

The Kaplan–Meier (KM) estimator is a non-parametric

maximum likelihood estimator of the survival function. It is

piecewise constant, and it can be thought of as an empirical

survival function for censored data. In its basic form, it is

only homogeneous.

Let $k$ be the number of events in the sample, $t_1, t_2, \ldots, t_k$ the event times (supposed unique and ordered), $e_i$ the number of events at time $t_i$ and $r_i$ the number of times (event or censored) greater than or equal to $t_i$. The estimator is given by the formula:

$$S_{KM}(t) = \prod_{i:\, t_i < t} \frac{r_i - e_i}{r_i}. \qquad (6)$$

It can be shown that the KM estimator maximizes the

generalized likelihood over the space of all distributions, so

its evaluation on large data sets gives a good qualitative

description of the true survival function. However, it should

be noted that it is noisy when the data are few, in particular

when the events are rare, since it is piecewise constant.

Therefore KM estimates from datasets sampled from


the same distribution but with different numbers of samples can differ considerably.
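A minimal sketch of Eq. (6) follows; it is our own illustration (the function name is hypothetical), assuming right-censored data with events marked by an indicator.

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function, as in Eq. (6).

    times  : observed times (event or censored)
    events : 1 if the corresponding time is an event, 0 if censored
    Returns (event_times, survival), where survival[i] is S_KM just after event_times[i].
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    surv, s = [], 1.0
    for ti in event_times:
        r_i = np.sum(times >= ti)                    # number at risk just before t_i
        e_i = np.sum((times == ti) & (events == 1))  # events at t_i
        s *= (r_i - e_i) / r_i
        surv.append(s)
    return event_times, np.array(surv)

# toy usage
t = [1.0, 2.0, 2.5, 3.0, 4.0]
d = [1,   0,   1,   1,   0]
print(kaplan_meier(t, d))
```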

3.2. Proportional hazards models

The most widely used survival specification which takes into account system features is to allow the hazard function to have the form:

$$\lambda(t|x) = \lambda(t)\exp(w^T x) \qquad (7)$$

where $\lambda(t)$ is some homogeneous hazard function (called the baseline hazard) and $w$ are the feature-dependent parameters of the model. This is called a proportional hazards

(PH) model. The two most used approaches for this kind of

model are:

• Choose a parameterized functional form for the baseline hazard, then use Maximum Likelihood (ML) techniques to find values for the parameters of the model;
• do not fix the baseline hazard, but make an ML estimation of the feature-dependent part of the model, and a non-parametric estimation of the baseline hazard. This is called Cox's model (Kalbfleisch & Prentice, 1980).

NNs have simply been used to estimate the feature

dependent part of the model; however, the main drawback

remains, which is the assumption of proportionality between

the time-dependent and feature-dependent parts of the

model.
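As a small numerical illustration of Eq. (7) (not from the paper), the sketch below evaluates a PH hazard with an assumed Weibull baseline; the baseline choice and function names are ours.

```python
import numpy as np

def weibull_baseline_hazard(t, shape=1.5, scale=3.0):
    # lambda_0(t) = (shape/scale) * (t/scale)^(shape-1); an assumed parametric baseline
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def ph_hazard(t, x, w):
    # Eq. (7): lambda(t|x) = lambda_0(t) * exp(w^T x)
    return weibull_baseline_hazard(t) * np.exp(np.dot(w, x))

# toy usage with two covariates
print(ph_hazard(2.0, x=np.array([0.5, -1.0]), w=np.array([0.8, 0.2])))
```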

4. A neural network model

In this section we define a NN which, given a data set of

system features and times, provides a model for the survival

function (and implicitly, by suitable transforms, of the other

functions of interest, e.g. failure density and hazard).

It is well known (Bishop, 1996) that a sigmoid is the appropriate activation function for the output of a NN to be interpretable as a probability. If $V(a) = (1 + \exp(-a))^{-1}$ is such a sigmoid function ($a$ is the net input to the activation), by taking into account Eq. (1) and using $S = 1 - V$ we can define the activation function of the NN as:

$$S(a) = \frac{1}{1 + \exp(a)}. \qquad (8)$$

This is not sufficient to define a survival function, since we

must also enforce the requirements listed in Section 2.

(1) $S(0|x) = 1$ and $S(+\infty|x) = 0$. We set one of the activation functions of the hidden layer to be the function $h(t|w_{t_1}) \equiv a_t = \log(t\, w_{t_1})$. We call this hidden unit the time unit. Note that the only input feeding into this unit through the weight $w_{t_1}$ is the time input (neither biases nor other inputs feed this unit).

The other activation functions (denoted by $g$) are assumed to be squashing functions. Since these functions are all bounded (and the integration function for hidden and output nodes is assumed to be the scalar product), this ensures that the required conditions are satisfied if the first-layer weight $w_{t_1}$ connecting the input $t$ is non-negative and the second-layer weight $w_{t_2}$ connecting the $h$ unit to the output is also non-negative. In this way the net input to the output activation goes to negative or positive infinity, so that the output saturates to 1 or 0, respectively. Note that the input $t$ is also connected to the squashing hidden units with variable-strength weights, so that interactions with other input features are taken into account. As shown in the next sections, these weights must also be constrained in some way.

(2) $S(t|x)$ must be non-increasing as $t$ increases. We must look at the functional form of the network output. Since it must be a non-increasing function of $a$, we must ensure that $a$ is a non-decreasing function of $t$. We have:

$$a \equiv a_N + a_t = \sum_{j=1}^{N} w_j g(a_j) + w_{t_2}\log(w_{t_1} t) \qquad (9)$$

where $g$ is the activation function in the hidden layer, $w_j$ is a weight of the second layer¹ and $N$ is the number of squashing hidden units (we have not included the biases in the summation). If we suppose that the $g(a_j)$ are non-decreasing, then $a$ is non-decreasing if and only if the weights $w_j$ are non-negative. We must now impose that the $g(a_j)$ are non-decreasing. Since they are monotonic non-decreasing functions, we must impose that their arguments are non-decreasing functions of $t$. This can be accomplished if we impose that the weights associated with the time input are non-negative. In short, all the weights on paths connecting the time input to the network output must be non-negative. In Fig. 1 the architecture of the Survival Neural Network (SNN) is shown (the bias connections are not shown).

Fig. 1. Architecture of SNN.

¹ From here on, to simplify the notation, when we must refer to expressions involving all the weights in a layer (including the weights into and out of the time unit), we will denote generic second-layer weights with a single index and generic first-layer weights with two indices.

Note that the use of an unbounded activation function as the hidden time unit does not introduce discontinuities in the function

defined by the network, since it can be easily verified that

Eq. (8) is equivalent to:

$$S = \frac{1}{1 + (w_{t_1} t)^{w_{t_2}} \exp(a_N)} \qquad (10)$$

which, however, does not have the form of a standard MLP network function.
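The sketch below (ours, not the authors' MATLAB implementation) evaluates the network function of Eq. (10) for a single input, using tanh squashing units; the parameter names and shapes are illustrative assumptions, and positivity of the constrained weights is taken for granted here (in the model it is enforced by the prior and the sampler).

```python
import numpy as np

def snn_survival(t, x, params):
    """Survival output S(t|x) of the SNN, as in Eq. (10).

    params is a dict with (shapes are illustrative assumptions):
      W_x   : (N, d)  covariate -> squashing-hidden weights (unconstrained)
      w_tx  : (N,)    time -> squashing-hidden weights      (non-negative)
      b_h   : (N,)    squashing-hidden biases               (unconstrained)
      w_out : (N,)    squashing-hidden -> output weights    (non-negative)
      w_t1  : scalar  time -> time-unit weight              (non-negative)
      w_t2  : scalar  time-unit -> output weight            (non-negative)
      b_out : scalar  output bias                           (unconstrained)
    """
    a_hidden = params["W_x"] @ x + params["w_tx"] * t + params["b_h"]
    a_N = params["w_out"] @ np.tanh(a_hidden) + params["b_out"]
    # Eq. (10): S = 1 / (1 + (w_t1 * t)^w_t2 * exp(a_N))
    return 1.0 / (1.0 + (params["w_t1"] * t) ** params["w_t2"] * np.exp(a_N))

# toy usage with random, positivity-respecting weights
rng = np.random.default_rng(0)
N, d = 5, 2
params = {
    "W_x": rng.normal(size=(N, d)), "w_tx": rng.exponential(size=N),
    "b_h": rng.normal(size=N), "w_out": rng.exponential(size=N),
    "w_t1": 1.0, "w_t2": 1.0, "b_out": 0.0,
}
x = np.array([0.3, -1.2])
print([snn_survival(t, x, params) for t in (0.1, 1.0, 10.0)])  # decreasing in t
```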

5. Hierarchical Bayesian modeling

In the conventional ML approach to training, a single

weight vector is found which minimizes an error function;

in contrast, the Bayesian scheme considers a probability

distribution over weights.

We made many tests with ML approaches, training a

single network with constrained optimization algorithms,

and then using ensembles of such networks. The results,

however, were always less than satisfactory and uniformly

worse than those obtained in a Bayesian framework. For an

in-depth discussion on Bayesian learning in the context of

NNs we refer to Neal (1996).

As pointed out by Watanabe (2001), Bayesian learning

methods are necessary when we deal with hierarchical

statistical models like NNs, which are non-identifiable and

non-regular. The use of networks with a large number of

hidden units implies that neither the distribution of the ML

estimator nor the a posteriori probability density of weights

converges to the normal distribution, even in the case of

infinite data sets. This implies that model selection methods

like Akaike’s Information Criterion (AIC), Bayesian

Information Criterion (BIC) and Minimum Description

Length (MDL) cannot be used in the context of complex

NNs.

Since SNN is a very complex network, we shall build

a Bayesian statistical model, which is composed of a

parametric statistical model (the sample density) and a prior

distribution over the weights.

We can then describe the learning process as an updating of a prior distribution $p(w)$, based on the observation of the data $D$ through the joint sample density $p(D|w)$. This process can be expressed by Bayes' theorem:

$$p(w|D) = \frac{p(D|w)\,p(w)}{\int p(D|w)\,p(w)\,dw}. \qquad (11)$$

The joint sample density is given in Eq. (5), in which $S$ is the function realized by the network and $f_T$ is given by minus the Jacobian of the network, which can be efficiently evaluated with a backpropagation procedure.

5.1. Choosing the prior distribution

The prior over weights should reflect any knowledge we

have about the mapping we want to build. For the SNN

model, we must then take into account the fact that we want

a smooth mapping and that we have both constrained and

unconstrained weights; furthermore, we should take into

account the specialized role the weights have in the network

depending on their position. We can thus define the

following weight groups:

• Unconstrained weights:
  - a group for each covariate input, comprising the weights out of that input into the hidden squashing units;
  - one group for the biases feeding the hidden squashing units;
  - one group for the output bias.
• Constrained weights:
  - one group for the weights out of the input time unit feeding the hidden squashing units;
  - one group for the weight out of the input time unit feeding the hidden time unit;
  - one group for the weight out of the hidden time unit;
  - one group for the weights out of the hidden squashing units.

If $n$ is the number of inputs, then we have $n + 5$ groups.

We assign a distribution to each group, governed by its

set of parameters (hyperparameters). Since the weights are a

priori statistically independent, then the prior over the whole

set of weights will be the product of the priors over each

group.

(1) Priors for unconstrained weights. We can give a zero

mean Gaussian prior to unconstrained weights, so as to

favor small weights and consequently smooth mappings.

This kind of prior, due to its simplicity and analytical

properties, has been extensively used in the literature (Neal,

1996; MacKay, 1992; Bishop, 1996).

Assigning a different prior to the weights out of each

covariate input allows us to use the Automatic Relevance

Determination (ARD) technique (MacKay, 1992; Neal,

1996), which can give hints on the relevance of inputs to

non-linear relations with the output. Furthermore, the

weights out of each input can be scaled to reflect

the different scales of the different inputs.

The Gaussian prior over the unconstrained weights has the following form:

$$p(w_u|\alpha_u) \propto \prod_{k \in G_u} \exp\!\left(-\frac{\alpha_k}{2}\sum_{i_k=1}^{W_k} w_{i_k}^2\right) \qquad (12)$$

where $w_u$ is the vector of unconstrained weights, $G_u$ is the set of the unconstrained weight groups, $W_k$ is the cardinality of the $k$-th group and $\alpha_u$ is the vector of hyperparameters (inverse variances).

(2) Priors for constrained weights. The positivity constraints on some groups force us to choose a distribution with positive support. The simplest distribution with good analytical properties which expresses well our prior requirements of positivity and smoothness is the exponential distribution. The exponential prior over the constrained weights has this form:

$$p(w_c^{(e)}|\alpha_c^{(e)}) \propto \prod_{k \in G_c^{(e)}} \exp\!\left(-\alpha_k \sum_{i_k=1}^{W_k} w_{i_k}\right) I(w_c^{(e)}) \qquad (13)$$

where $w_c^{(e)}$ is the vector of constrained weights, $G_c^{(e)}$ is the set of the constrained weight groups, $W_k$ is the cardinality of the $k$-th group, $\alpha_c^{(e)}$ is the vector of hyperparameters (inverse means) and $I(\cdot)$ is the indicator function over the positive half-space in which the weights are constrained. It can be shown that this prior is appropriate for all constrained weights, except those out of the hidden squashing units.
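As a sketch of how the group priors of Eqs. (12) and (13) combine into a single log-prior, the following is our own illustration; the grouping interface and names are assumptions, and normalizing constants are dropped as in the proportional forms above.

```python
import numpy as np

def log_prior(groups):
    """Sum of unnormalized log-priors over weight groups.

    groups: list of (w, alpha, kind) with kind in {"gaussian", "exponential"};
    Gaussian groups follow Eq. (12), exponential (non-negative) groups Eq. (13).
    """
    lp = 0.0
    for w, alpha, kind in groups:
        w = np.asarray(w, dtype=float)
        if kind == "gaussian":
            lp += -0.5 * alpha * np.sum(w ** 2)
        elif kind == "exponential":
            if np.any(w < 0):      # indicator I(w): zero density outside the support
                return -np.inf
            lp += -alpha * np.sum(w)
        else:
            raise ValueError(kind)
    return lp

# toy usage: one unconstrained group and one constrained group
print(log_prior([(np.array([0.2, -0.4]), 1.0, "gaussian"),
                 (np.array([0.5, 1.3]), 2.0, "exponential")]))
```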

(3) Limit behaviour for priors with finite variance. We now analyze the prior over the survival functions implied by an exponential prior on the weights out of the hidden squashing units when the network inputs are fixed to a given value. From Eq. (9), we see that a contribution to the overall sum is given by the weighted sum of $N$ squashing hidden units. The terms in the sum are a priori independent and identically distributed. We can evaluate the expected value of each squashing hidden unit's contribution by noting that the $w_j$ are a priori independent of the activations $g_j$ (from here on, we omit the indices since the distribution is the same for all $j$):

$$\mu_{wg} \equiv E[wg] = E[w]E[g] \equiv \mu_w \mu_g. \qquad (14)$$

The expectation of the contribution is finite because $\mu_g$ is finite, since the activation is a bounded function.

We can also evaluate the variance of each squashing hidden unit contribution:

$$\sigma^2_{wg} \equiv E[(wg)^2] - E[wg]^2 = E[w^2]E[g^2] - \mu_w^2\mu_g^2 = (\sigma_w^2 + \mu_w^2)(\sigma_g^2 + \mu_g^2) - \mu_w^2\mu_g^2 = \sigma_w^2(\sigma_g^2 + \mu_g^2) + \mu_w^2\sigma_g^2. \qquad (15)$$

The variance of the contribution is finite because $\sigma_g^2$ is finite, since the activation is a bounded function.

We can then apply the Central Limit Theorem and conclude that for large $N$ the total contribution converges to a Gaussian distribution with parameters:

$$\mu = N\mu_w\mu_g, \qquad \sigma^2 = N\left[\sigma_w^2(\sigma_g^2 + \mu_g^2) + \mu_w^2\sigma_g^2\right]. \qquad (16)$$

This result extends Neal's analysis in Neal (1996) to non-Gaussian distributions with finite variance. To obtain a finite limit for the location parameter which does not saturate the output activation, we must scale $\mu_w$ according to $N$, by choosing $\mu_w = r_e N^{-1}$, for fixed $r_e$. However, this choice is troublesome. In fact, for the exponential distribution we have $\mu_w = \sigma_w$, so that the variance of the contribution is, for large $N$:

$$\lim_{N\to\infty} N\left[\left(\frac{r_e}{N}\right)^2(\sigma_g^2 + \mu_g^2) + \left(\frac{r_e}{N}\right)^2\sigma_g^2\right] = 0. \qquad (17)$$

This behaviour is not desirable, since we get a degenerate distribution centered on $r_e\mu_g$. To get a non-degenerate

distribution we should let the location grow as N grows,

which is undesirable since we do not want to saturate the output

of the network. Furthermore, since all the weights would tend

to assume the same value, we would lose the possibility that the

network represents hidden features of the data (Neal, 1996).

We could think of solving the problem by choosing a different distribution with continuous positive support. However, the problem is more general than it might appear. In fact, what we said before is true whenever the mean and the standard deviation of the distribution are in a proportionality relation. And this is true for many common distributions such as: the gamma family (which includes the exponential and Maxwell distributions), truncated normal, inverse gamma, Fisher, $\chi^2$, Weibull.

(4) Limit behaviour for stable priors. A possible solution

is to use distributions which are not in the normal domain of

attraction of the Gaussian, that is densities whose limiting

sums do not tend to the Gaussian density. It turns out that all

the densities with finite variance are in the normal domain of

attraction of the Gaussian.

Densities which are not in the normal domain of attraction of the Gaussian are said to be stable, and are characterized by an index $\alpha$ such that, if $X_1, \ldots, X_n$ are independent identically distributed with the same index, then $(X_1 + \cdots + X_n)/n^{1/\alpha}$ is stable with the same index (note that Gaussian densities are stable with $\alpha = 2$). For an in-depth discussion on stable processes we refer to Samorodnitsky and Taqqu (1994). We shall concentrate our attention on the Lévy density, for which $\alpha = 1/2$. The joint distribution of the weights out of the squashing hidden units is:

$$p(w_c^{(L)}|\gamma) \propto \exp\!\left(-\sum_{i=1}^{N}\left(\frac{\gamma}{2w_i} + \frac{3}{2}\log w_i\right)\right) I(w_c^{(L)}). \qquad (18)$$

To obtain a well defined limit as $N$ grows large, we need to scale the parameter: $\gamma = r_L N^{-2}$, with fixed $r_L$. With this choice of the prior, we will have a stable non-degenerate distribution whatever the number $N$ of squashing hidden units is. Furthermore, since this density is heavy tailed, the larger the value of the $\gamma$ parameter is, the higher the probability that many weights have values in a wide range, favoring complex models and the representation of hidden features in the data (Neal, 1996).
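A sketch (ours, not part of the paper) of the Lévy prior of Eq. (18): its unnormalized log-density, and a standard sampling construction based on the fact that, for $Z \sim N(0,1)$, $\gamma/Z^2$ is Lévy-distributed with scale $\gamma$; the $N^{-2}$ scaling above is applied to the scale parameter.

```python
import numpy as np

def levy_logprior(w, gamma):
    """Unnormalized log-density of Eq. (18) for non-negative weights w."""
    w = np.asarray(w, dtype=float)
    if np.any(w <= 0):
        return -np.inf
    return -np.sum(gamma / (2.0 * w) + 1.5 * np.log(w))

def sample_levy_weights(N, r_L=1.0, rng=None):
    """Draw N weights from the scaled Levy prior with gamma = r_L * N**-2.

    Uses the standard construction: if Z ~ N(0,1), then gamma / Z**2 is
    Levy-distributed with scale gamma.
    """
    rng = rng or np.random.default_rng()
    gamma = r_L * N ** -2
    z = rng.standard_normal(N)
    return gamma / z ** 2

# toy usage
w = sample_levy_weights(50, r_L=1.0, rng=np.random.default_rng(1))
print(levy_logprior(w, gamma=1.0 * 50 ** -2))
```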

5.2. Markov chain Monte Carlo methods for sampling the

posterior

The best prediction we can obtain given a new input and the observed data $D$ can be written in the following way:

$$p(y|(x,t),D) = \int p(y, w, \alpha|(x,t),D)\,dw\,d\alpha = \int p(y|(x,t),w)\,p(w,\alpha|D)\,dw\,d\alpha \qquad (19)$$

where $p(y|(x,t),w)$ is the predictive model of the network.

This can be seen as the evaluation of the expectation of the

network function with respect to the posterior distribution of

the network weights. In our case the integral cannot be

solved analytically, so we must use a numerical approxi-

mation scheme.

The posterior density in Eq. (11) can be efficiently sampled by using a Markov chain Monte Carlo (MCMC) method, which consists in generating a sequence of weight vectors, and then using it to approximate the integral in Eq. (19) with the following Monte Carlo estimator ($M$ is the number of samples from the posterior):

$$p(y|(x,t),D) \approx \frac{1}{M}\sum_{i=1}^{M} p(y|(x,t),w_i) \qquad (20)$$

which is guaranteed by the Ergodic Theorem to converge to the true value of the integral as $M$ goes to infinity.

The MCMC method we used for generating the sequence

of weight vectors is the Slice Sampling method. For a

detailed discussion see Neal (2000).
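In practice, Eq. (20) amounts to averaging the model output over the retained posterior weight samples; a minimal sketch follows (ours; the predictor interface and names are assumptions, and a dummy predictor stands in for the network).

```python
import numpy as np

def mc_survival_estimate(t, x, weight_samples, predict):
    """Eq. (20): average the predictive model over M posterior weight samples.

    predict(t, x, w) is any function returning S(t|x) for one weight sample w
    (e.g. a forward pass of the SNN).
    """
    return np.mean([predict(t, x, w) for w in weight_samples])

# toy usage with a dummy predictor standing in for the network
dummy_predict = lambda t, x, w: 1.0 / (1.0 + w * t)
samples = np.random.default_rng(0).exponential(size=100)  # stand-in posterior draws
print(mc_survival_estimate(2.0, None, samples, dummy_predict))
```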

5.3. Gibbs sampling

As we have seen, the priors over the weights used in the inference of the posterior are themselves parametrized. Instead of setting these hyperparameters to arbitrary values, we can make them part of the inference process, by sampling from Eq. (11). This can be done efficiently by using the Gibbs sampling process. In a first step the hyperparameters are kept constant, and we use an MCMC method to sample from Eq. (11). The second step consists in keeping the weights constant, while the hyperparameters are sampled from the following distribution:

$$p(\alpha|w,D) = \frac{p(\alpha,w,D)}{p(w,D)} = \frac{p(D|w)\,p(w|\alpha)\,p(\alpha)}{p(D|w)\,p(w)} \propto p(w|\alpha)\,p(\alpha). \qquad (21)$$

This process can be done efficiently if we choose a hyperprior $p(\alpha)$ which is conjugate to the hyperposterior $p(\alpha|w,D)$ (i.e. they both belong to the same family (Robert, 2001)). It is possible to show that in the case of Gaussian, exponential and Lévy priors, by choosing gamma hyperprior distributions, the resulting hyperposteriors are gamma distributions.

It can be proven that this hierarchical extension is

beneficial in many respects, one of which is the

increased robustness of the model. For a detailed

discussion on the topic of hierarchical Bayesian modeling

we refer to Robert (2001), and to Neal (1996) for applications to NNs.
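For example, for a Gaussian weight group of Eq. (12) with a gamma hyperprior on the precision $\alpha$ (shape–rate parameterization assumed), Eq. (21) yields a gamma hyperposterior, so the Gibbs step for the hyperparameter is a single draw; the sketch below is our own illustration.

```python
import numpy as np

def gibbs_update_gaussian_precision(w_group, a0=1.0, b0=1.0, rng=None):
    """One Gibbs step for the precision alpha of a Gaussian weight group (Eq. (12)).

    With a Gamma(a0, b0) hyperprior (shape a0, rate b0), Eq. (21) gives
    alpha | w ~ Gamma(a0 + W/2, b0 + sum(w^2)/2), where W is the group size.
    """
    rng = rng or np.random.default_rng()
    w_group = np.asarray(w_group, dtype=float)
    shape = a0 + 0.5 * w_group.size
    rate = b0 + 0.5 * np.sum(w_group ** 2)
    return rng.gamma(shape, 1.0 / rate)   # numpy's gamma takes a scale (= 1/rate)

# toy usage for a three-weight group
print(gibbs_update_gaussian_precision(np.array([0.3, -0.5, 1.2]),
                                       rng=np.random.default_rng(2)))
```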

6. Experiments

In this section we show the performance of the SNN

model on two data sets, a synthetic heterogeneous problem

and a real heterogeneous medical problem. All the software

for the simulations has been written by the authors in the

MATLABq language. In each experiment two models

were fitted: a Cox PH model and a SNN model with 50

hidden units. KM estimates were then evaluated for each

model.

The Cox PH model was trained using standard ML

estimation while the SNN model was trained with the

MCMC slice sampling algorithm with Gibbs sampling of

the hyperparameters. The first 400 samples were omitted

from the chain. Then, 700 samples were generated by

alternating between normal and overrelaxed steps.

Finally, the last 100 samples were used to obtain the

Monte Carlo estimator of the SNN output, based on an

inspection of the energies of the chain. Note that in a

real use of the algorithm, a simple inspection of the

energies might not be sufficient to assess convergence;

instead, a formal test of convergence, like the Estimated

Potential Scale Reduction diagnostic (Robert & Casella,

1999) should be used.

6.1. Synthetic heterogeneous data

In this experiment we considered a mixture of Weibull

and gamma densities with parameters dependent on the

variables $x_1, x_2$:

$$f_T(t|x) = x_2\,\mathrm{We}\!\left(x_1^2,\ \frac{\sin(x_1)}{3} + 2\right) + (1 - x_2)\,\mathrm{Ga}\!\left((x_1+1)^2,\ \exp(\cos(x_1) + 2)\right). \qquad (22)$$

This data set was built to test the SNN model in the case in

which the proportional hazards assumption is not verified.

The covariates $x_1$ and $x_2$ were first sampled from a bivariate Gaussian with mean vector (0, 0) and full covariance matrix such that the correlation coefficient of the variables was −0.71. Next, the variable $x_2$ was transformed to a binary indicator: values greater than or equal to 0 were set to 1, values less than 0 were set to 0. After the transformation the correlation coefficient was −0.57.

generated from the mixture, with about 27% uniformly right

censored data in the interval [0,80]. The data were then

uniformly split in two disjoint data sets: a training set of

4000 samples and a testing set of 3000 samples (with the

same percentage of censored data in each set).
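A data-generation sketch consistent with this description is given below; it is our reconstruction, not the authors' code, and the shape–scale parameterizations assumed for the Weibull and gamma components of Eq. (22), the small positive offsets added for numerical validity, and all function names are assumptions.

```python
import numpy as np

def sample_synthetic(n=7000, c_max=80.0, rho=-0.71, rng=None):
    """Right-censored draws from a mixture in the spirit of Eq. (22) (parameterization assumed)."""
    rng = rng or np.random.default_rng()
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    x1, x2 = x[:, 0], (x[:, 1] >= 0.0).astype(float)   # x2 becomes a binary indicator

    # component draws; Weibull(shape, scale) and Gamma(shape, scale) are assumptions
    t_we = (np.sin(x1) / 3.0 + 2.0) * rng.weibull(x1 ** 2 + 1e-6, size=n)
    t_ga = rng.gamma((x1 + 1.0) ** 2 + 1e-6, np.exp(np.cos(x1) + 2.0), size=n)
    t_true = np.where(x2 == 1.0, t_we, t_ga)

    c = rng.uniform(0.0, c_max, size=n)                 # uniform right censoring
    t_obs = np.minimum(t_true, c)                       # Z = min(T, V)
    y = (t_true <= c).astype(int)                       # 1 = event observed
    return np.column_stack([x1, x2]), t_obs, y

X, t, y = sample_synthetic(rng=np.random.default_rng(3))
print(f"censored fraction: {1 - y.mean():.2f}")
```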

To get a clear understanding of how well the SNN

model is capturing the behaviour of the censored data,

we made a comparison with KM estimates. The test data

was stratified into five quantiles based on the survival estimates of the models at $t = 8.4$ (the median time). Then, the mean (with respect to the covariates) survival probabilities were estimated, together with KM estimates

of the quantiles induced by the models.

Fig. 2. SNN model of a synthetic survival function with covariates (evaluated on the test set).
Fig. 3. Cox PH model of a synthetic survival function with covariates (evaluated on the test set).
Fig. 4. SNN model of a real survival function with covariates (evaluated on the test set).
Fig. 5. Cox PH model of a real survival function with covariates (evaluated on the test set).

If a model is

able to capture the distribution of the data, we expect

the output of the model and KM estimates to be similar.

As can be seen in Figs. 2 and 3, the SNN model shows

good performance on all quantiles, while the Cox PH

model shows good performance only on bad prognosis

quantiles. As we expected, the proportional hazards

model could not describe the data well.

6.2. Real heterogeneous data

These data are taken from trials of chemotherapy for

stage B/C colon cancer.² The dataset is composed of 1776 samples,

with 12 covariates. The dataset was split into a training set

of 1076 samples, and a test set of 700 samples. About 49%

of the data are censored, which makes the analysis of the

data a complex task.

The test data was stratified into three quantiles based on the survival estimates of the models at $t = 217$ weeks. Then, the mean (with respect to the covariates) survival probabilities were estimated, together with KM estimates of the quantiles induced by the models. If the model can capture the distribution of the data, we expect the output of the model and KM estimates to be similar. As can be seen in Figs. 4 and 5, the SNN and Cox PH models both exhibit good discriminatory power, even if the Cox estimate is noisy, since it is piecewise constant. This test shows that

the SNN model is able to work with real-world data for

which the proportional hazards assumption is roughly

verified.

7. Conclusions and perspectives

In this paper we have described a novel NN architecture

that addresses survival analysis in a neural computation

paradigm.

Starting from the requirements on the survival

function, we have defined a feedforward NN which

satisfies those requirements with appropriate choice of

hidden unit activations and constraints on the weight

space.

The proposed model does not make any assumption on

the form of the survival process, and does not use

discretizations or piecewise approximations as other NN

approaches to survival analysis do.

Experiments on synthetic and real survival data demonstrate that the SNN model can approximate complex survival functions. On synthetic survival data it outperformed the PH model, since the data did not verify the proportional hazards assumptions by construction. On real survival data which verifies the PH assumptions it showed very similar performance to the Cox PH model, thus showing the capability to also model proportional-hazards, real-world data.

As an empirical proof of effectiveness, we have also

shown that the SNN model closely follows the time

evolution of the KM non-parametric estimator on large

data sets, for which the KM estimator is known to give very

accurate descriptions of the survival process.

The development of DNA microarrays and other

techniques to produce high-quality, highly informative and

complex datasets will require correspondingly complex

modeling techniques. We believe that the neural computation paradigm, if correctly employed, can be a powerful

tool to analyze complex datasets, and the SNN model shows

that complex data can be analyzed in a robust Bayesian

framework.

Acknowledgements

This work was partially supported by MIUR Progetto di

Rilevanza Nazionale, MIUR-CNR, AIRC.

References

Bakker, B., & Heskes, T. (1999). A neural-Bayesian approach to survival

analysis. Proceedings of IEE Artificial Neural Networks, 832–837.

Biganzoli, E., Boracchi, P., & Marubini, E. (2002). A general framework

for neural network models on censored survival data. Neural Networks,

15, 209–218.

Bishop, C. M. (1996). Neural networks for pattern recognition. Oxford:

Clarendon Press.

De Laurentiis, M., De Placido, S., Bianco, A. R., Clark, G. M., & Ravdin,

P. M. (1999). A prognostic model that makes quantitative estimates of

probability of relapse for breast cancer patients. Clinical Cancer

Research, 19, 4133–4139.

Faraggi, D., & Simon, R. (1995). A neural network model for survival data.

Statistics in Medicine, 14, 73–82.

Kalbfleisch, J. D., & Prentice, R. L. (1980). The statistical analysis of

failure time data. New York: Wiley.

Liestøl, K., Andersen, P. K., & Andersen, U. (1994). Survival analysis and

neural nets. Statistics in Medicine, 13, 1189–1200.

Lisboa, P. J. G., & Wong, H. (2001). Are neural networks best used to help

logistic regression? An example from breast cancer survival analysis.

IEEE Transactions on Neural Networks, 2472–2477.

MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.

Neal, R. M. (1996). Bayesian learning for neural networks. New York:

Springer.

Neal, R. M. (2000). Slice sampling. Technical Report No. 2005, Department

of Statistics, University of Toronto.

Neal, R. M. (2001). Survival analysis using a Bayesian neural network.

Joint Statistical Meetings report, Atlanta.

Raftery, A. E., Madigan, D., & Volinsky, C. T. (1994). Accounting for

model uncertainty in survival analysis improves predictive performance. Bayesian Statistics, 5, 323–349.

Ripley, B. D., & Ripley, R. M. (1998). Neural Networks as statistical

methods in survival analysis. In R. Dybowski & V. Gant (Eds.), Artificial neural networks: Prospects for medicine. Landes Biosciences.

² The data are distributed with the freeware R statistical software,

available at http://www.r-project.org


Robert, C. P. (2001). The Bayesian choice (2nd ed). New York: Springer.

Robert, C. P., & Casella, G. (1999). Markov chain Monte Carlo methods.

New York: Springer.

Samorodnitsky, G., & Taqqu, M. S. (1994). Stable non-Gaussian random

processes: Stochastic models with infinite variance. New York:

Chapman and Hall.

Schwarzer, G., Vach, W., & Schumacher, M. (2000). On the misuses of

artificial neural networks for prognostic and diagnostic classification in

oncology. Statistics in Medicine, 19, 541–561.

Watanabe, S. (2001). Learning Efficiency of Redundant Neural Networks in

Bayesian Estimation. Technical Report No. 687, Precision and

Intelligence Laboratory, Tokyo Institute of Technology.
