2003 Special issue
A novel neural network-based survival analysis model
Antonio Eleuteri a,b,*, Roberto Tagliaferri c,d, Leopoldo Milano b,e, Sabino De Placido f, Michele De Laurentiis f
a DMA, Università di Napoli 'Federico II', Naples, Italy
b INFN Sez. Napoli, Naples, Italy
c DMI, Università di Salerno, Salerno, Italy
d INFM Unità di Salerno, Salerno, Italy
e Dipartimento di Scienze Fisiche, Università di Napoli 'Federico II', Naples, Italy
f Dipartimento di Endocrinologia ed Oncologia Molecolare e Clinica, Università di Napoli 'Federico II', Naples, Italy
Abstract
A feedforward neural network architecture aimed at survival probability estimation is presented, which generalizes the standard, usually linear, models described in the literature. The network builds an approximation to the survival probability of a system at a given time, conditional on the system features. The resulting model is described in a hierarchical Bayesian framework. Experiments with synthetic and real world data compare the performance of this model with that of the commonly used standard ones.
© 2003 Elsevier Science Ltd. All rights reserved.
Keywords: Survival analysis; Conditioning probability estimation; Neural networks; Bayesian learning; MCMC methods
1. Introduction
Survival analysis is used when we wish to study the occurrence of some event in a population of subjects, where the time until the event is of interest. This time is called the survival time or failure time. Survival analysis is used in many fields of study. Examples include: time until failure of a
light bulb and time until occurrence of an anomaly in an
electronic circuit in industrial reliability, time until relapse
of cancer and time until pregnancy in medical statistics,
duration of strikes in economics, length of tracks on a
photographic plate in particle physics.
In the literature we find many different modeling approaches to survival analysis. Conventional parametric models may involve too strict assumptions on the distributions of failure times and on the form of the influence of the system features on the survival time, assumptions which usually oversimplify the experimental evidence, particularly in the case of medical data (Kalbfleisch & Prentice, 1980). In contrast,
semi-parametric models do not make assumptions on the
distributions of failures, but make instead assumptions on
how the system features influence the survival time.
Finally, non-parametric models only allow for a qualitative
description of the data.
Neural networks (NNs) have recently been used for survival analysis, but most of the efforts have been devoted to extending Cox's model (Faraggi & Simon, 1995; Bakker & Heskes, 1999; Lisboa & Wong, 2001; Liestøl, Andersen, & Andersen, 1994), to making piecewise or discrete approximations of the hazard functions (Neal, 2001; Biganzoli, Boracchi, & Marubini, 2002), to treating the problem as a classification one (De Laurentiis, De Placido, Bianco, Clark, & Ravdin, 1999), or to approaches that simply ignore the censoring, which may introduce significant bias into the estimates, as criticized in Ripley and Ripley (1998) and Schwarzer, Vach, and Schumacher (2000). We refer to the latter two excellent surveys for a description of the many limitations which arise in the current use of NNs, limitations due to the use of heuristic approaches which do not take into account the peculiarities of survival data, and/or ignore fundamental issues such as regularization.
In this paper we describe a novel NN architecture which overcomes the limitations of currently available NN models without making assumptions on the underlying survival distribution, and which proves flexible enough to model the most complex survival data.
We describe our model in a Bayesian framework, which,
as advocated by Raftery, Madigan, and Volinsky (1994) in
0893-6080/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved.
doi:10.1016/S0893-6080(03)00098-4
Neural Networks 16 (2003) 855–864
www.elsevier.com/locate/neunet
* Corresponding author. Address: INFN Sez, Napoli, Naples, Italy.
E-mail addresses: [email protected] (A. Eleuteri), [email protected]
(R. Tagliaferri).
the context of survival analysis, helps take model uncertainty into account, which can improve the predictive performance of a model.
The paper is organized as follows. In Section 2 we describe the fundamentals of survival analysis and pose the censoring problem. In Section 3 we describe two of the most widely used standard modeling techniques, which will serve as comparisons for our model. In Section 4 we describe the architecture of our NN. In Section 5 we describe our model in a hierarchical Bayesian framework and present a Monte Carlo estimator of the survival function. In Section 6 we present experimental results, in comparison with commonly used modeling techniques.
2. Survival analysis
Let T(x) denote an absolutely continuous random variable (rv) describing the failure time of a system defined by a vector of features x. If F_T(t|x) is the cumulative distribution function for T(x), then we can define the survival function:

S(t|x) = P\{T(x) > t\} = 1 - F_T(t|x) = 1 - \int_0^t f_T(u|x)\,du,   (1)

which is the probability that the failure occurs after time t. The requirements for a function S(t|x) to be a survival function are, uniformly w.r.t. x:

(1) S(0|x) = 1 (there cannot be a failure before time 0),
(2) S(+\infty|x) = 0 (asymptotically all events realize),
(3) S(t_1|x) \ge S(t_2|x) when t_1 \le t_2.
From the definitions of the survival function and the density of failure, we can derive the hazard function, which gives the probability density of an event occurring around time t, given that it has not occurred before t. It can be shown that the hazard has this form (Kalbfleisch & Prentice, 1980):

\lambda(t|x) = \frac{f_T(t|x)}{S(t|x)}.   (2)
In many studies (e.g. in medical statistics or in quality control), we do not observe realizations of the rv T(x). Rather, this variable is associated with some other rv V such that the observation is a realization of the rv Z = q(T, V). In this case we say that the observation of the time of the event is censored. If V is independent of T, we say that censoring is uninformative.

Depending on the form of the function q, we have different forms of censoring. Two of the most common ones in applications are right censoring, when q \equiv \min(\cdot,\cdot), and left censoring, when q \equiv \max(\cdot,\cdot).
Survival data can then be represented by triples of the form (t, x, y), where x is a vector of features defining the process, t is an observed time, and y is an indicator variable:

y = \begin{cases} 1 & \text{if } t \text{ is an event time} \\ 0 & \text{if } t \text{ is a censored time.} \end{cases}   (3)

We can define the sampling density l(y|t, x) of a survival process (assuming, e.g. right censoring) by noting that, if we observe an event (y = 1), it directly contributes to the evaluation of the sample density; if we do not observe the event (y = 0), the best contribution is to evaluate S(t|x). We have then:

l(y|t, x) = S(t|x)^{1-y}\, f_T(t|x)^{y}.   (4)
The joint sample density for a set of independent observations D = \{(t_k, x_k, y_k)\} can then be written as:

L = \prod_k l(y_k | t_k, x_k).   (5)

Since the censoring is assumed uninformative, it does not influence inference on the failure density, but it does contribute to the sample density.
Note that censored data models can be seen as a
particular class of missing data models in which the
densities of interest are not sampled directly (Robert &
Casella, 1999).
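As an illustration, the sampling density of Eqs. (4) and (5) translates directly into a log-likelihood for right-censored data. The following Python sketch is only an illustration with hypothetical function names (the paper's own software was written in MATLAB); it takes user-supplied survival and density functions:

```python
import numpy as np

def censored_log_likelihood(t, y, x, survival, density):
    """Log of the joint sample density (Eq. (5)): each event (y=1)
    contributes f_T(t|x); each censored time (y=0) contributes S(t|x)."""
    S = survival(t, x)
    f = density(t, x)
    return np.sum(y * np.log(f) + (1 - y) * np.log(S))

# Toy check with a homogeneous exponential model, S(t) = exp(-t), f(t) = exp(-t):
t = np.array([0.5, 1.2, 2.0])
y = np.array([1, 0, 1])          # the second observation is censored
ll = censored_log_likelihood(t, y, None,
                             lambda t, x: np.exp(-t),
                             lambda t, x: np.exp(-t))
```

Here the censored observation contributes log S(1.2) = −1.2 instead of the log-density, so ll = −0.5 − 1.2 − 2.0 = −3.7.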
3. Standard models for survival analysis
In the next sections we will describe some of the most
commonly used models, both for homogeneous (time-only)
and heterogeneous modeling.
3.1. The Kaplan–Meier non-parametric estimator
The Kaplan–Meier (KM) estimator is a non-parametric
maximum likelihood estimator of the survival function. It is
piecewise constant, and it can be thought of as an empirical
survival function for censored data. In its basic form, it is
only homogeneous.
Let k be the number of events in the sample, t_1, t_2, \ldots, t_k the event times (supposed unique and ordered), e_i the number of events at time t_i, and r_i the number of times (event or censored) greater than or equal to t_i. The estimator is given by the formula:

S_{KM}(t) = \prod_{i: t_i < t} \frac{r_i - e_i}{r_i}.   (6)
It can be shown that the KM estimator maximizes the generalized likelihood over the space of all distributions, so its evaluation on large data sets gives a good qualitative description of the true survival function. However, it should be noted that it is noisy when the data are few, in particular when the events are rare, since it is piecewise constant.
Therefore, KM estimates from datasets sampled from the same distribution but with different numbers of samples can differ considerably.
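Eq. (6) can be sketched directly in code. The following Python function is a minimal illustration with hypothetical names, not the authors' software:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate (Eq. (6)): returns the distinct event times t_i
    and the running product of (r_i - e_i)/r_i, where r_i is the number at
    risk at t_i and e_i the number of events at t_i."""
    times = np.asarray(times, float)
    events = np.asarray(events, int)
    event_times = np.unique(times[events == 1])
    S, prod = [], 1.0
    for ti in event_times:
        r = np.sum(times >= ti)                     # at risk: time >= t_i
        e = np.sum((times == ti) & (events == 1))   # events at t_i
        prod *= (r - e) / r
        S.append(prod)
    return event_times, np.array(S)

# Example: 5 subjects, one censored at t = 3
t_ev, S_km = kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 1])
```

The censored subject at t = 3 leaves the risk set without producing a step, which is why the estimate drops from 0.6 to 0.3 at t = 4.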
3.2. Proportional hazards models
The most widely used survival specification that takes system features into account allows the hazard function to have the form:

\lambda(t|x) = \lambda(t) \exp(w^T x),   (7)

where \lambda(t) is some homogeneous hazard function (called the baseline hazard) and w are the feature-dependent parameters of the model. This is called a proportional hazards (PH) model. The two most common approaches for this kind of model are:
model are:
• choose a parameterized functional form for the baseline hazard, then use maximum likelihood (ML) techniques to find values for the parameters of the model;
• do not fix the baseline hazard, but make an ML estimation of the feature-dependent part of the model, and a non-parametric estimation of the baseline hazard. This is called Cox's model (Kalbfleisch & Prentice, 1980).
NNs have mostly been used simply to estimate the feature-dependent part of the model; the main drawback, however, remains: the assumption of proportionality between the time-dependent and feature-dependent parts of the model.
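One consequence of Eq. (7) worth making explicit: integrating the hazard gives S(t|x) = exp(−Λ(t) exp(wᵀx)), where Λ is the baseline cumulative hazard. A minimal Python sketch (the helper name is hypothetical; the example assumes an exponential baseline, Λ(t) = t):

```python
import numpy as np

def ph_survival(t, x, w, baseline_cum_hazard):
    """Proportional hazards (Eq. (7)): lambda(t|x) = lambda(t) exp(w^T x)
    implies S(t|x) = exp(-Lambda(t) exp(w^T x)), with Lambda the baseline
    cumulative hazard."""
    return np.exp(-baseline_cum_hazard(t) * np.exp(w @ x))

# Exponential baseline Lambda(t) = t, a single covariate, and w = [0.5]:
w = np.array([0.5])
S = ph_survival(2.0, np.array([1.0]), w, lambda t: t)   # exp(-2 * exp(0.5))
```

The covariate only rescales the cumulative hazard; the shape of the survival curve over time is shared by all subjects, which is exactly the restriction the SNN model below removes.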
4. A neural network model
In this section we define a NN which, given a data set of
system features and times, provides a model for the survival
function (and implicitly, by suitable transforms, of the other
functions of interest, e.g. failure density and hazard).
It is well known (Bishop, 1996) that a sigmoid is the appropriate activation function for the output of a NN to be interpretable as a probability. If V(a) = (1 + \exp(-a))^{-1} is such a sigmoid function (a is the net input to the activation), by taking into account Eq. (1) and using S = 1 - V, we can define the activation function of the NN as:

S(a) = \frac{1}{1 + \exp(a)}.   (8)

This is not sufficient to define a survival function, since we must also enforce the requirements listed in Section 2.
(1) S(0|x) = 1 and S(+\infty|x) = 0. We set one of the activation functions of the hidden layer to be the function h(t|w_{t1}) \equiv a_t = \log(w_{t1} t). We call this hidden unit the time unit. Note that the only input feeding into this unit, through the weight w_{t1}, is the time input (neither biases nor other inputs feed this unit).
The other activation functions (denoted by g) are assumed to be squashing functions. Since these functions are all bounded (and the integration function for hidden and output nodes is assumed to be the scalar product), the required conditions are satisfied if the first-layer weight w_{t1} connecting the input t is non-negative and the second-layer weight w_{t2} connecting the h unit to the output is also non-negative. In this way the net input to the output activation goes to negative or positive infinity, so that the output saturates to 1 or 0, respectively. Note that the input t is also connected to the squashing hidden units with variable-strength weights, so that interactions with other input features are taken into account. As shown in the next sections, these weights must also be constrained in some way.
(2) S(t|x) must be non-increasing as t increases. We must look at the functional form of the network output. Since it must be a non-increasing function of a, we must ensure that a is a non-decreasing function of t. We have:

a \equiv a_N + a_t = \sum_{j=1}^{N} w_j\, g(a_j) + w_{t2} \log(w_{t1} t),   (9)

where g is the activation function in the hidden layer, w_j is a weight of the second layer¹ and N is the number of squashing hidden units (we have not included the biases in the summation). If we suppose that the g(a_j) are non-decreasing, then a is non-decreasing if and only if the weights w_j are non-negative. We must now impose that the g(a_j) are non-decreasing. Since they are monotonic non-decreasing functions, we must impose that their arguments are non-decreasing functions of t. This can be accomplished by imposing that the weights associated with the time input are non-negative. In short, all the weights on paths connecting the time input to the network output must be non-negative. In Fig. 1 the architecture of the Survival Neural Network (SNN) is shown (the bias connections are not shown).
Fig. 1. Architecture of SNN.

¹ From here on, to simplify the notation, when we must refer to expressions involving all the weights in a layer (including the weights into and out of the time unit), we will denote generic second-layer weights with a single index, and generic first-layer weights with two indices.

Note that the use of an unbounded activation function as the hidden time unit does not put discontinuities in the function
defined by the network, since it can easily be verified that Eq. (8) is equivalent to:

S = \frac{1}{1 + (w_{t1} t)^{w_{t2}} \exp(a_N)},   (10)

which, however, does not have the form of a standard MLP network function.
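The constraints above can be made concrete with a sketch of the forward pass implied by Eqs. (8)–(10). The parameter names (W1, b1, w2, b2, wt1, wt2) and the choice of tanh squashing units are assumptions for illustration; this is not the authors' implementation:

```python
import numpy as np

def snn_survival(t, x, W1, b1, w2, b2, wt1, wt2):
    """Sketch of the SNN output (Eq. (10)). W1/b1: first-layer weights/biases
    of the N squashing units (the first column of W1 multiplies t and must be
    non-negative); w2: non-negative second-layer weights; wt1/wt2:
    non-negative weights into and out of the time unit.
    S = 1 / (1 + (wt1 * t)^wt2 * exp(a_N))."""
    inp = np.concatenate(([t], x))
    g = np.tanh(W1 @ inp + b1)        # squashing hidden units
    a_N = w2 @ g + b2                 # squashing contribution to the net input
    return 1.0 / (1.0 + (wt1 * t) ** wt2 * np.exp(a_N))

rng = np.random.default_rng(0)
N, n_cov = 4, 2
W1 = rng.normal(size=(N, 1 + n_cov))
W1[:, 0] = np.abs(W1[:, 0])           # weights on the time input: non-negative
b1 = rng.normal(size=N)
w2 = np.abs(rng.normal(size=N))       # second-layer weights: non-negative
b2, wt1, wt2 = 0.1, 1.0, 1.0
x = np.array([0.3, -0.5])
S1 = snn_survival(1.0, x, W1, b1, w2, b2, wt1, wt2)
S2 = snn_survival(5.0, x, W1, b1, w2, b2, wt1, wt2)
```

With these constraints the output is exactly 1 at t = 0 (the (wt1·t)^{wt2} factor vanishes) and is non-increasing in t, so S1 > S2 regardless of the random draw.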
5. Hierarchical Bayesian modeling
In the conventional ML approach to training, a single
weight vector is found which minimizes an error function;
in contrast, the Bayesian scheme considers a probability
distribution over weights.
We made many tests with ML approaches, training a
single network with constrained optimization algorithms,
and then using ensembles of such networks. The results,
however, were always less than satisfactory and uniformly
worse than those obtained in a Bayesian framework. For an
in-depth discussion on Bayesian learning in the context of
NNs we refer to Neal (1996).
As pointed out by Watanabe (2001), Bayesian learning
methods are necessary when we deal with hierarchical
statistical models like NNs, which are non-identifiable and
non-regular. The use of networks with a large number of
hidden units implies that neither the distribution of the ML
estimator nor the a posteriori probability density of weights
converges to the normal distribution, even in the case of
infinite data sets. This implies that model selection methods
like Akaike’s Information Criterion (AIC), Bayesian
Information Criterion (BIC) and Minimum Description
Length (MDL) cannot be used in the context of complex
NNs.
Since SNN is a very complex network, we shall build a Bayesian statistical model, which is composed of a parametric statistical model (the sample density) and a prior distribution over the weights.
We can then describe the learning process as the updating of a prior distribution p(w), based on the observation of the data D through the joint sample density p(D|w). This process can be expressed by Bayes' theorem:

p(w|D) = \frac{p(D|w)\, p(w)}{\int p(D|w)\, p(w)\, dw}.   (11)

The joint sample density is given in Eq. (5), in which S is the function realized by the network and f_T is given by minus the Jacobian of the network, which can be efficiently evaluated with a backpropagation procedure.
5.1. Choosing the prior distribution
The prior over weights should reflect any knowledge we
have about the mapping we want to build. For the SNN
model, we must then take into account the fact that we want
a smooth mapping and that we have both constrained and
unconstrained weights; furthermore, we should take into
account the specialized role the weights have in the network
depending on their position. We can thus define the
following weight groups:
• Unconstrained weights:
  – a group for each covariate input, comprising the weights out of that input into the hidden squashing units;
  – one group for the biases feeding the hidden squashing units;
  – one group for the output bias.
• Constrained weights:
  – one group for the weights out of the time input feeding the hidden squashing units;
  – one group for the weight out of the time input feeding the hidden time unit;
  – one group for the weight out of the hidden time unit;
  – one group for the weights out of the hidden squashing units.

If n is the number of inputs then we have n + 5 groups.
We assign a distribution to each group, governed by its
set of parameters (hyperparameters). Since the weights are a
priori statistically independent, then the prior over the whole
set of weights will be the product of the priors over each
group.
(1) Priors for unconstrained weights. We can give a zero-mean Gaussian prior to the unconstrained weights, so as to favor small weights and consequently smooth mappings. This kind of prior, due to its simplicity and analytical properties, has been extensively used in the literature (Neal, 1996; MacKay, 1992; Bishop, 1996).
Assigning a different prior to the weights out of each covariate input allows us to use the Automatic Relevance Determination (ARD) technique (MacKay, 1992; Neal, 1996), which can give hints on the relevance of inputs to non-linear relations with the output. Furthermore, the weights out of each input can be scaled to reflect the different scales of the different inputs.
The Gaussian prior over the unconstrained weights has the following form:

p(w_u | \alpha_u) \propto \prod_{k \in G_u} \exp\!\left(-\frac{\alpha_k}{2} \sum_{i_k=1}^{W_k} w_{i_k}^2\right),   (12)

where w_u is the vector of unconstrained weights, G_u is the set of unconstrained weight groups, W_k is the cardinality of the k-th group and \alpha_u is the vector of hyperparameters (inverse variances).
(2) Priors for constrained weights. The positivity constraints on some groups force us to choose a distribution with positive support. The simplest distribution with good analytical properties which expresses well our prior requirements of positivity and smoothness is the exponential distribution. The exponential prior over the constrained weights has this form:

p(w_c^{(e)} | \alpha_c^{(e)}) \propto \prod_{k \in G_c^{(e)}} \exp\!\left(-\alpha_k \sum_{i_k=1}^{W_k} w_{i_k}\right) I(w_c^{(e)}),   (13)

where w_c^{(e)} is the vector of constrained weights, G_c^{(e)} is the set of constrained weight groups, W_k is the cardinality of the k-th group, \alpha_c^{(e)} is the vector of hyperparameters (inverse means) and I(·) is the indicator function over the positive half-space in which the weights are constrained. It can be shown that this prior is appropriate for all constrained weights, except those out of the hidden squashing units.
(3) Limit behaviour for priors with finite variance. We now analyze the prior over the survival functions implied by an exponential prior on the weights out of the hidden squashing units when the network inputs are fixed to a given value. From Eq. (9), we see that a contribution to the overall sum is given by the weighted sum of N squashing hidden units. The terms in the sum are a priori independent and identically distributed. We can evaluate the expected value of each squashing hidden unit's contribution by noting that the w_j are a priori independent of the activations g_j (from here on, we omit the indices since the distribution is the same for all j):

\mu_{wg} \equiv E[wg] = E[w]\, E[g] \equiv \mu_w \mu_g.   (14)

The expectation of the contribution is finite because \mu_g is finite, since the activation is a bounded function.

We can also evaluate the variance of each squashing hidden unit's contribution:

\sigma_{wg}^2 \equiv E[(wg)^2] - E[wg]^2 = E[w^2]\, E[g^2] - \mu_w^2 \mu_g^2 = (\sigma_w^2 + \mu_w^2)(\sigma_g^2 + \mu_g^2) - \mu_w^2 \mu_g^2 = \sigma_w^2 (\sigma_g^2 + \mu_g^2) + \mu_w^2 \sigma_g^2.   (15)

The variance of the contribution is finite because \sigma_g^2 is finite, since the activation is a bounded function.

We can then apply the Central Limit Theorem and conclude that for large N the total contribution converges to a Gaussian distribution with parameters:

\mu = N \mu_w \mu_g, \qquad \sigma^2 = N[\sigma_w^2(\sigma_g^2 + \mu_g^2) + \mu_w^2 \sigma_g^2].   (16)
This result extends Neal's analysis in Neal (1996) to non-Gaussian distributions with finite variance. To obtain a finite limit for the location parameter which does not saturate the output activation, we must scale \mu_w according to N, by choosing \mu_w = r_e N^{-1} for fixed r_e. However, this choice is troublesome. In fact, for the exponential distribution we have \mu_w = \sigma_w, so that the variance of the contribution is, for large N:

\lim_{N \to \infty} N\!\left[\left(\frac{r_e}{N}\right)^{2} (\sigma_g^2 + \mu_g^2) + \left(\frac{r_e}{N}\right)^{2} \sigma_g^2\right] = 0.   (17)

This behaviour is not desirable, since we get a degenerate distribution centered on r_e \mu_g. To get a non-degenerate distribution we should let the location grow as N grows, which is undesirable since we do not want to saturate the output of the network. Furthermore, since all the weights would tend to assume the same value, we would lose the possibility that the network represents hidden features of the data (Neal, 1996).
We might try to solve the problem by choosing a different distribution with continuous positive support. However, the problem is more general than it might appear. In fact, what we said before holds whenever the mean and the standard deviation of the distribution are in a proportionality relation, and this is true for many common distributions, such as the gamma family (which includes the exponential and Maxwell distributions), the truncated normal, the inverse gamma, the Fisher, the \chi^2, and the Weibull distributions.
(4) Limit behaviour for stable priors. A possible solution is to use distributions which are not in the normal domain of attraction of the Gaussian, that is, densities whose limiting sums do not tend to the Gaussian density. It turns out that all densities with finite variance are in the normal domain of attraction of the Gaussian.

Densities which are not in the normal domain of attraction of the Gaussian are said to be stable, and are characterized by an index \alpha such that, if X_1, \ldots, X_n are independent identically distributed with the same index, then (X_1 + \cdots + X_n)/n^{1/\alpha} is stable with the same index (note that Gaussian densities are stable with \alpha = 2). For an in-depth discussion on stable processes we refer to Samorodnitsky and Taqqu (1994). We shall concentrate our attention on the Lévy density, for which \alpha = 1/2. The joint distribution of the weights out of the squashing hidden units is:

p(w_c^{(L)} | \gamma) \propto \exp\!\left(-\sum_{i=1}^{N} \left[\frac{\gamma}{2 w_i} + \frac{3}{2} \log w_i\right]\right) I(w_c^{(L)}).   (18)
To obtain a well defined limit as N grows large, we need to scale the parameter: \gamma = r_L N^{-2}, with fixed r_L. With this choice of the prior, we will have a stable non-degenerate distribution whatever the number N of squashing hidden units. Furthermore, since this density is heavy-tailed, the larger the value of the \gamma parameter, the higher the probability that many weights have values in a wide range, favoring complex models and the representation of hidden features in the data (Neal, 1996).
5.2. Markov chain Monte Carlo methods for sampling the
posterior
The best prediction we can obtain given a new input and the observed data D can be written in the following way:

p(y | (x,t), D) = \int p(y, w, \alpha | (x,t), D)\, dw\, d\alpha = \int p(y | (x,t), w)\, p(w, \alpha | D)\, dw\, d\alpha,   (19)

where p(y | (x,t), w) is the predictive model of the network. This can be seen as the evaluation of the expectation of the network function with respect to the posterior distribution of the network weights. In our case the integral cannot be solved analytically, so we must use a numerical approximation scheme.

The posterior density in Eq. (11) can be efficiently sampled by using a Markov chain Monte Carlo (MCMC) method, which consists in generating a sequence of weight vectors and then using it to approximate the integral in Eq. (19) with the following Monte Carlo estimator (M is the number of samples from the posterior):

p(y | (x,t), D) \approx \frac{1}{M} \sum_{i=1}^{M} p(y | (x,t), w_i),   (20)

which is guaranteed by the Ergodic Theorem to converge to the true value of the integral as M goes to infinity.
The MCMC method we used for generating the sequence
of weight vectors is the Slice Sampling method. For a
detailed discussion see Neal (2000).
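The estimator of Eq. (20) is simply an average of the network's predictions over posterior weight samples. A minimal Python sketch with a hypothetical one-parameter predictive model:

```python
import numpy as np

def mc_survival_estimate(t, x, weight_samples, network):
    """Monte Carlo estimator of Eq. (20): average the network's prediction
    over M weight vectors drawn from the posterior. `network(t, x, w)` is a
    hypothetical predictive model."""
    return np.mean([network(t, x, w) for w in weight_samples])

# Toy predictive model S(t|x, w) = exp(-w*t), three posterior samples of w:
samples = [0.5, 1.0, 1.5]
S_hat = mc_survival_estimate(2.0, None, samples,
                             lambda t, x, w: np.exp(-w * t))
```

In the paper the weight samples come from the slice-sampling chain; here they are fixed numbers so the estimator reduces to the average of exp(−1), exp(−2) and exp(−3).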
5.3. Gibbs sampling
As we have seen, the priors over the weights used in
the inference of the posterior are themselves parametrized.
Instead of fixing these hyperparameters at arbitrary values, we can make them part of the inference process, by sampling from Eq. (11). This can be done efficiently using the Gibbs sampling process. In a first step the hyperparameters are kept constant, and we use an MCMC method to sample from Eq. (11). The second step consists in keeping the weights constant while the hyperparameters are sampled from the following distribution:

p(\alpha | w, D) = \frac{p(\alpha, w, D)}{p(w, D)} = \frac{p(D|w)\, p(w|\alpha)\, p(\alpha)}{p(D|w)\, p(w)} \propto p(w|\alpha)\, p(\alpha).   (21)
This process can be done efficiently if we choose a hyperprior p(\alpha) which is conjugate to the hyperposterior p(\alpha|w, D), i.e. they both belong to the same family (Robert, 2001). It is possible to show that in the case of Gaussian, exponential and Lévy priors, by choosing gamma hyperprior distributions, the resulting hyperposteriors are gamma distributions.
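For the Gaussian and exponential priors of Eqs. (12) and (13), the conjugate Gibbs step of Eq. (21) has a simple closed form. The following Python sketch is an illustration under an assumed shape/rate convention for the gamma hyperprior; the function names are hypothetical:

```python
import numpy as np

def gibbs_alpha_gaussian(w_group, a0, b0, rng):
    """Gibbs step (Eq. (21)) for the precision alpha of a Gaussian weight
    group: with a Gamma(a0, b0) hyperprior (shape/rate), the hyperposterior
    is Gamma(a0 + W/2, b0 + sum(w^2)/2); numpy's gamma takes a scale."""
    w = np.asarray(w_group)
    return rng.gamma(a0 + len(w) / 2, 1.0 / (b0 + np.sum(w ** 2) / 2))

def gibbs_alpha_exponential(w_group, a0, b0, rng):
    """Same step for an exponential (rate alpha) weight group: the
    hyperposterior is Gamma(a0 + W, b0 + sum(w))."""
    w = np.asarray(w_group)
    return rng.gamma(a0 + len(w), 1.0 / (b0 + np.sum(w)))

rng = np.random.default_rng(2)
w = rng.normal(0.0, 0.5, size=20)       # a fixed group of 20 weights
draws = np.array([gibbs_alpha_gaussian(w, 1.0, 1.0, rng)
                  for _ in range(20000)])
# The sample mean of the draws approaches the hyperposterior mean
# (a0 + W/2) / (b0 + sum(w^2)/2).
```

In the full sampler this step alternates with slice-sampling updates of the weights themselves.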
It can be proven that this hierarchical extension is beneficial in many respects, one of which is the increased robustness of the model. For a detailed discussion of hierarchical Bayesian modeling we refer to Robert (2001), and to Neal (1996) for applications to NNs.
6. Experiments
In this section we show the performance of the SNN
model on two data sets, a synthetic heterogeneous problem
and a real heterogeneous medical problem. All the software for the simulations was written by the authors in the MATLAB® language. In each experiment two models were fitted: a Cox PH model and a SNN model with 50 hidden units. KM estimates were then evaluated for each model.
The Cox PH model was trained using standard ML
estimation while the SNN model was trained with the
MCMC slice sampling algorithm with Gibbs sampling of
the hyperparameters. The first 400 samples were omitted
from the chain. Then, 700 samples were generated by
alternating between normal and overrelaxed steps.
Finally, the last 100 samples were used to obtain the
Monte Carlo estimator of the SNN output, based on an
inspection of the energies of the chain. Note that in a
real use of the algorithm, a simple inspection of the
energies might not be sufficient to assess convergence;
instead, a formal test of convergence, like the Estimated
Potential Scale Reduction diagnostic (Robert & Casella,
1999) should be used.
6.1. Synthetic heterogeneous data
In this experiment we considered a mixture of Weibull and gamma densities with parameters dependent on the variables x_1, x_2:

f_T(t|x) = x_2\, \mathrm{We}\!\left(x_1^2,\ \frac{\sin(x_1)}{3} + 2\right) + (1 - x_2)\, \mathrm{Ga}\!\left((x_1 + 1)^2,\ \exp(\cos(x_1) + 2)\right).   (22)
This data set was built to test the SNN model in the case in which the proportional hazards assumption is not verified. The covariates x_1 and x_2 were first sampled from a bivariate Gaussian with mean vector (0, 0) and full covariance matrix such that the correlation coefficient of the variables was −0.71. Next, the variable x_2 was transformed to a binary indicator: values greater than or equal to 0 were set to 1, values less than 0 were set to 0. After the transformation the correlation coefficient was −0.57. Then, 7000 samples were generated from the mixture, with about 27% uniformly right censored data in the interval [0, 80]. The data were then uniformly split in two disjoint data sets: a training set of 4000 samples and a testing set of 3000 samples (with the same percentage of censored data in each set).
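The data-generating process can be sketched as follows. This Python illustration simplifies the paper's setup (independent covariates instead of correlated ones) and assumes a shape/scale order for the Weibull and gamma parameters, which Eq. (22) does not state:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_time(x1, x2, rng):
    """Draw a failure time from the mixture in Eq. (22). The shape/scale
    parameter order is an assumption; a small offset keeps shapes positive."""
    if x2 == 1:
        return rng.weibull(x1 ** 2 + 1e-6) * (np.sin(x1) / 3 + 2)
    return rng.gamma((x1 + 1) ** 2 + 1e-6, np.exp(np.cos(x1) + 2))

# Independent covariates for simplicity (the paper samples correlated
# Gaussians and then binarizes x2).
n = 1000
x1 = rng.normal(size=n)
x2 = (rng.normal(size=n) >= 0).astype(int)
t_fail = np.array([sample_time(a, b, rng) for a, b in zip(x1, x2)])
c = rng.uniform(0, 80, size=n)            # uniform right-censoring in [0, 80]
t_obs = np.minimum(t_fail, c)             # observed time z = min(T, V)
y = (t_fail <= c).astype(int)             # event indicator (Eq. (3))
```

The resulting triples (t_obs, (x1, x2), y) have exactly the form described in Section 2.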
To get a clear understanding of how well the SNN model captures the behaviour of the censored data, we made a comparison with KM estimates. The test data were stratified into five quantiles based on the survival estimates of the models at t = 8.4 (median time). Then, the mean (with respect to the covariates) survival probabilities were estimated, together with KM estimates of the quantiles induced by the models. If a model is
Fig. 2. SNN model of a synthetic survival function with covariates (evaluated on the test set).
Fig. 3. Cox PH model of a synthetic survival function with covariates (evaluated on the test set).
Fig. 4. SNN model of a real survival function with covariates (evaluated on the test set).
Fig. 5. Cox PH model of a real survival function with covariates (evaluated on the test set).
able to capture the distribution of the data, we expect the output of the model and the KM estimates to be similar. As can be seen in Figs. 2 and 3, the SNN model shows good performance on all quantiles, while the Cox PH model shows good performance only on the bad-prognosis quantiles. As expected, the proportional hazards model could not describe the data well.
6.2. Real heterogeneous data
These data are taken from trials of chemotherapy for stage B/C colon cancer.² The data set is composed of 1776 samples with 12 covariates. It was split into a training set of 1076 samples and a test set of 700 samples. About 49% of the data are censored, which makes the analysis of the data a complex task.
The test data were stratified into three quantiles based on the survival estimates of the models at t = 217 weeks. Then, the mean (with respect to the covariates) survival probabilities were estimated, together with KM estimates of the quantiles induced by the models. If the model can capture the distribution of the data we expect the output of the model and the KM estimates to be similar. As can be seen in Figs. 4 and 5, the SNN and Cox PH models both exhibit good discriminatory power, even if the Cox estimate is noisy, since it is piecewise constant. This test shows that the SNN model is able to work with real-world data for which the proportional hazards assumption is roughly verified.
7. Conclusions and perspectives
In this paper we have described a novel NN architecture
that addresses survival analysis in a neural computation
paradigm.
Starting from the requirements on the survival function, we have defined a feedforward NN which satisfies those requirements with an appropriate choice of hidden unit activations and constraints on the weight space.
The proposed model does not make any assumption on
the form of the survival process, and does not use
discretizations or piecewise approximations as other NN
approaches to survival analysis do.
Experiments on synthetic and real survival data demonstrate that the SNN model can approximate complex survival functions. On synthetic survival data it outperformed the PH model, since the data did not satisfy the proportional hazards assumption by construction. On real survival data which satisfies the PH assumption it showed performance very similar to that of the Cox PH model, thus showing the capability to model proportional hazards, real-world data as well.
As an empirical proof of effectiveness, we have also
shown that the SNN model closely follows the time
evolution of the KM non-parametric estimator on large
data sets, for which the KM estimator is known to give very
accurate descriptions of the survival process.
The development of DNA microarrays and other
techniques to produce high-quality, highly informative and
complex datasets will require correspondingly complex
modeling techniques. We believe that the neural computation paradigm, if correctly employed, can be a powerful tool to analyze complex datasets, and the SNN model shows that complex data can be analyzed in a robust Bayesian framework.
Acknowledgements
This work was partially supported by MIUR Progetto di
Rilevanza Nazionale, MIUR-CNR, AIRC.
References
Bakker, B., & Heskes, T. (1999). A neural-Bayesian approach to survival analysis. Proceedings of IEE Artificial Neural Networks, 832–837.
Biganzoli, E., Boracchi, P., & Marubini, E. (2002). A general framework
for neural network models on censored survival data. Neural Networks,
15, 209–218.
Bishop, C. M. (1996). Neural networks for pattern recognition. Oxford:
Clarendon Press.
De Laurentiis, M., De Placido, S., Bianco, A. R., Clark, G. M., & Ravdin,
P. M. (1999). A prognostic model that makes quantitative estimates of
probability of relapse for breast cancer patients. Clinical Cancer
Research, 19, 4133–4139.
Faraggi, D., & Simon, R. (1995). A neural network model for survival data.
Statistics in Medicine, 14, 73–82.
Kalbfleisch, J. D., & Prentice, R. L. (1980). The statistical analysis of
failure time data. New York: Wiley.
Liestøl, K., Andersen, P. K., & Andersen, U. (1994). Survival analysis and neural nets. Statistics in Medicine, 13, 1189–1200.
Lisboa, P. J. G., & Wong, H. (2001). Are neural networks best used to help
logistic regression? An example from breast cancer survival analysis.
IEEE Transactions on Neural Networks, 2472–2477.
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
Neal, R. M. (1996). Bayesian learning for neural networks. New York:
Springer.
Neal, R. M. (2000). Slice sampling. Technical Report No. 2005, Department of Statistics, University of Toronto.
Neal, R. M. (2001). Survival analysis using a Bayesian neural network. Joint Statistical Meetings report, Atlanta.
Raftery, A. E., Madigan, D., & Volinsky, C. T. (1994). Accounting for model uncertainty in survival analysis improves predictive performance. Bayesian Statistics, 5, 323–349.
Ripley, B. D., & Ripley, R. M. (1998). Neural networks as statistical methods in survival analysis. In R. Dybowski, & V. Gant (Eds.), Artificial neural networks: Prospects for medicine. Landes Biosciences.
2 The data are distributed with the freeware R statistical software,
available at http://www.r-project.org
Robert, C. P. (2001). The Bayesian choice (2nd ed). New York: Springer.
Robert, C. P., & Casella, G. (1999). Markov chain Monte Carlo methods.
New York: Springer.
Samorodnitsky, G., & Taqqu, M. S. (1994). Stable non-Gaussian random
processes: Stochastic models with infinite variance. New York:
Chapman and Hall.
Schwarzer, G., Vach, W., & Schumacher, M. (2000). On the misuses of
artificial neural networks for prognostic and diagnostic classification in
oncology. Statistics in Medicine, 19, 541–561.
Watanabe, S. (2001). Learning efficiency of redundant neural networks in Bayesian estimation. Technical Report No. 687, Precision and Intelligence Laboratory, Tokyo Institute of Technology.