The SSC presented a data set on cervical cancer for analysis
INTRODUCTION
•The SSC presented a data set on cervical cancer for analysis.
•Purpose of the analysis: determine the different attributes (covariates) for predicting relapse for women that had cervical cancer and surgery, as well as classifying the patients into Low, Medium and High risk.
•It has been assumed that prediction will be done with the information obtained right after the surgery. Hence, variables observed between the surgery date and the last follow-up date will not be used. Such variables are "if patients received radiation therapy or not" and the patient status ("dead with disease, dead without disease, alive with disease, etc."), which was taken at the time of last follow-up.
•905 patients entered the study; 34 patients were dropped since they had no follow-up date yet, leaving 871 observations.
Covariates:
•surgery date
•last follow-up date
•age of the patient at time of surgery
•capillary lymphatic spaces (0=negative, 1,2=positive) (Cls)
•cell differentiation (1=better, 2=moderate, 3=worst) (Grad)
•histology of the cancer cells (determined by the pathologist, ranges from 0 to 6) (Histolog)
•disease left after surgery (0=clear, 1=para-vaginal area, 2=vaginal area, 3=both) (Margins)
•depth of the tumour (in mm.) (Maxdepth)
•pelvis involvement (0=negative, 1=positive) (Pellymph)
•size of the tumour (in mm.)
EXPLORATORY ANALYSIS
[Figure: Age by relapse, Size by relapse and Maxdepth by relapse (relapse: No/yes)]
[Figure: relapse (No/yes) for Cls=negative, Cls=positive1 and Cls=positive2]
Univariate plots by variables, such as these, were performed to better understand their behaviour.
Also pairwise contingency tables were used as an exploratory tool.
[Figure: Classification tree, no NA's on data, 217 obs; splits on maxdepth, size, histolog, age, grad and cls]
Complex model, a smaller tree might do...
When observations with NA's are dropped, too much information is lost, so NA's will be used as a factor level in all variables.
Classification trees are used to uncover inherent structure in data. They are binary arrangements created by repeatedly splitting the observations into "more homogeneous" groups, dictated by rules of the form (e.g.) "if Age<24 and Cls is positive then response is likely 1".
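The splitting idea can be sketched in a few lines. This is a minimal illustration, not the S-plus tree routine used for the poster: it assumes node deviance of the form -2 * sum(n_k * log(n_k / n)), searches cutoffs on a single numeric covariate, and uses hypothetical toy data that are not taken from the study.

```python
import math

def deviance(labels):
    """Deviance of a node: -2 * sum over classes of n_k * log(n_k / n)."""
    n = len(labels)
    dev = 0.0
    for k in set(labels):
        n_k = labels.count(k)
        dev -= 2.0 * n_k * math.log(n_k / n)
    return dev

def best_split(values, labels):
    """Return (cutoff, deviance after the split) for the best rule `value < cutoff`."""
    best = (None, deviance(labels))  # no split: deviance of the parent node
    xs = sorted(set(values))
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2.0  # candidate cutoff halfway between neighbours
        left = [y for x, y in zip(values, labels) if x < cut]
        right = [y for x, y in zip(values, labels) if x >= cut]
        dev = deviance(left) + deviance(right)
        if dev < best[1]:
            best = (cut, dev)
    return best

# Hypothetical toy data (assumption, for illustration only).
maxdepth = [2, 4, 5, 7, 12, 20, 22, 25]
relapse = ["No", "No", "No", "No", "No", "yes", "yes", "yes"]
cutoff, dev_after = best_split(maxdepth, relapse)
```

Applying the same search recursively inside each child node, over all covariates, would grow a full tree.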
[Figure: Classification tree with NA's as factor, 871 obs; splits on maxdepth, cls, grad, size, histolog, pellymph and age]
Misclassification Rate = 0.06774
Residual Mean Deviance = 0.2995
[Figure: Deviance reduction per # of terminal nodes, NA's included; x-axis: size (1 to 60), y-axis: deviance]
[Figure: Pruned Tree (NA's inc); splits on maxdepth, cls, grad, size, pellymph, age and histolog]
This smaller tree is easier to follow, and the misclassification rate is still of acceptable size:
Misclassification Rate = 0.07233
Residual Mean Deviance = 0.3696
Maxdepth, Size and Cls are observed to be important variables in the structure.
Just as regression uses the Residual Sum of Squares as a diagnostic of fit, trees use the Residual Deviance; a decrease in deviance means a better-fitted tree. In regression, more parameters might give a better fit but a more complex interpretation; here, the number of terminal nodes plays the role of the number of parameters. Pruning of a tree can therefore be based on how the deviance decreases as the number of terminal nodes grows.
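To make the trade-off concrete, the reported rates can be turned back into counts. The sketch below back-calculates the number of misclassified patients from the two reported misclassification rates and the 871 observations; the helper names are hypothetical, and the residual mean deviance formula (total deviance divided by observations minus terminal nodes) is the usual convention, assumed here rather than confirmed by the poster.

```python
def misclassified(rate, n_obs):
    """Number of misclassified observations implied by a reported rate."""
    return round(rate * n_obs)

def residual_mean_deviance(total_deviance, n_obs, n_leaves):
    """Assumed convention: total deviance / (observations - terminal nodes)."""
    return total_deviance / (n_obs - n_leaves)

N_OBS = 871  # observations in the trees with NA's as a factor

full_errors = misclassified(0.06774, N_OBS)    # tree before pruning
pruned_errors = misclassified(0.07233, N_OBS)  # pruned tree
extra_errors = pruned_errors - full_errors     # cost of the simpler tree
```

Under these assumptions, the pruned tree misclassifies only a handful more patients than the full tree while having far fewer terminal nodes.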
•Variable Size is of importance as seen in trees. Nevertheless, it has many missing values, and analyses usually drop such observations. In order to keep information we categorised it with the missing values as the lowest of the levels and used the quartiles as cutoffs for the other levels.
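A minimal sketch of that recoding, assuming missing sizes are stored as None and that the quartiles are computed on the observed sizes only. The function name, the level coding 0 to 4, and the toy data are hypothetical; the quartile method is Python's default and may differ slightly from the S-plus one.

```python
import statistics

def categorise_size(sizes):
    """Map sizes (mm, None = not measured) to levels 0..4.

    Level 0 is reserved for missing values (the lowest level);
    levels 1..4 are the quartile groups of the observed sizes.
    """
    observed = [s for s in sizes if s is not None]
    q1, q2, q3 = statistics.quantiles(observed, n=4)

    def level(s):
        if s is None:
            return 0
        # 1 + number of quartile cutoffs strictly below the value
        return 1 + sum(s > q for q in (q1, q2, q3))

    return [level(s) for s in sizes]

# Hypothetical sizes, not taken from the study.
sizes = [None, 5, 10, 15, 20, None, 25, 30, 35, 40]
levels = categorise_size(sizes)
```

Keeping the missing values as their own level means that no observation has to be dropped because Size was not measured.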
SURVIVAL ANALYSIS
A Cox Proportional Hazards model was assumed.
During the process of modeling, it was seen that the important levels of Size were three categories: Not Measured (NA's), <=30 and >30.
The model for prediction agreed on included Age, Cls, Maxdepth and Size as predictors, along with two two-way interactions: Age with Cls and Maxdepth with Size.
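Specifically, the hazard as a function of time can be written as
$$h(t) = h_0(t)\,\exp\big(\beta_{age}\,age + 1.8048\,cls_1 - 3.5872\,cls_2 + \beta_{depth}\,depth + 1.0199\,size_{\le 30} + 0.6119\,size_{>30} - 0.02653\,age{:}cls_1 + 0.1056\,age{:}cls_2 - 0.0957\,depth{:}size_{\le 30} - 0.0023\,depth{:}size_{>30}\big)$$
where $|\beta_{age}| = 0.0366$ and $|\beta_{depth}| = 0.1425$ (the signs of these two coefficients are not legible in this transcript; the remaining signs agree with the hazard ratios reported for the model).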
[Figure: Beta(t) against Time for each term of the model: age, cls positive1, cls positive2, maxdepth, size<=30, size>30, age:cls positive1, age:cls positive2, maxdepth:size<=30 and maxdepth:size>30]
The Proportional Hazards assumption was not violated, either for the individual terms or for the model as a whole (global p-value=0.14).
[Figure: K-M and Cox Survival Curves for each level of Size30 (Not Meas., <=30, >30); K-M estimate and Cox PH curves against time]
[Figure: K-M and Cox Survival Curves for each level of Cls (Negative, positive1, positive2); K-M estimate and Cox PH curves against time]
The Cox curves were calculated as the average of the curves corresponding to the different covariate patterns, rather than as the curve evaluated at the average value of the covariates (using the S-plus function avg.surv created by Dr. R. Brant, CHS Dept, U of C).
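The difference between the two approaches can be illustrated with a short sketch. It is not the avg.surv function itself; it only uses the standard Cox relation S(t | x) = S0(t)^exp(lp), where lp is the linear predictor of a covariate pattern, and the baseline survival values and linear predictors below are hypothetical.

```python
import math

def cox_survival(baseline_surv, lp):
    """Survival curve for one covariate pattern: S0(t) ** exp(lp)."""
    return [s0 ** math.exp(lp) for s0 in baseline_surv]

def average_of_curves(baseline_surv, lps):
    """Average the individual curves, one per covariate pattern."""
    curves = [cox_survival(baseline_surv, lp) for lp in lps]
    return [sum(vals) / len(vals) for vals in zip(*curves)]

def curve_at_average(baseline_surv, lps):
    """Single curve evaluated at the average linear predictor."""
    return cox_survival(baseline_surv, sum(lps) / len(lps))

# Hypothetical baseline survival and linear predictors (assumptions).
baseline = [1.0, 0.9, 0.8]
lps = [-1.0, 0.0, 1.0]

avg_curve = average_of_curves(baseline, lps)
naive_curve = curve_at_average(baseline, lps)
```

Because the survival function is nonlinear in the linear predictor, the two curves generally differ, and the averaged curve describes the group of patients rather than one "average" patient.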
•As with any model, assumptions are needed. The assumption of non-informative censored data (censoring not related to the chances of recurrence) was used.
•Some interesting results and interpretation for the model:
The hazard ratio for the comparison between having Cls positive1 and positive2, keeping all other variables fixed:
$$\frac{h(t \mid cls = pos1)}{h(t \mid cls = pos2)} = \frac{e^{1.80488 - 0.02653\,age}}{e^{-3.58724 + 0.10564\,age}} = e^{(1.80488 + 3.58724) - (0.02653 + 0.10564)\,age} = e^{5.39212 - 0.13217\,age}$$
So, for Age=30, the hazard ratio=4.166265; that is, at age 30 the hazard of having a relapse when Cls is positive1 is 4.166265 times the hazard of relapse when Cls is positive2.
Now, for Age=50, the hazard ratio=0.2963008, with analogous interpretation.
We can see the effect of the interaction between Age and Cls
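The two ratios can be checked numerically. The sketch below uses the Cls and Age:Cls coefficients of the Cox model, with the signs implied by the reported hazard ratios; the constant and function names are hypothetical, and the crossover age is a derived quantity that the poster does not report.

```python
import math

# Cox coefficients for the Cls terms (signs as implied by the reported ratios).
B_CLS1, B_CLS2 = 1.80488, -3.58724
B_AGE_CLS1, B_AGE_CLS2 = -0.02653, 0.10564

def hazard_ratio_pos1_vs_pos2(age):
    """Hazard ratio Cls positive1 vs positive2, other covariates fixed."""
    log_hr = (B_CLS1 - B_CLS2) + (B_AGE_CLS1 - B_AGE_CLS2) * age
    return math.exp(log_hr)

hr_30 = hazard_ratio_pos1_vs_pos2(30)
hr_50 = hazard_ratio_pos1_vs_pos2(50)

# Age at which the two Cls levels have equal hazard (log ratio = 0).
crossover_age = (B_CLS1 - B_CLS2) / (B_AGE_CLS2 - B_AGE_CLS1)
```

The ratio falls below 1 once age passes the crossover, which is the interaction effect described here: positive1 carries the higher hazard for younger patients and positive2 for older ones.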
Similarly, we can look at the hazard ratio between the levels of tumour size, namely Size<=30 against Size>30:
$$\frac{h(t \mid size \le 30)}{h(t \mid size > 30)} = \frac{e^{1.01990 - 0.09573\,depth}}{e^{0.61199 - 0.00234\,depth}} = e^{(1.01990 - 0.61199) - (0.09573 - 0.00234)\,depth} = e^{0.40791 - 0.09339\,depth}$$
So, for Maxdepth=10, the hazard ratio=0.59097; that is, the hazard of having a relapse when Size<=30 is 0.59097 times the hazard of relapse when Size>30.
LOGISTIC REGRESSION ANALYSIS
•The main idea of logistic regression is to model the log of the odds of a binary outcome event as a linear function of the covariates:
$$\log(odds) = \sum_{i=1}^{n} \beta_i X_i$$
•Odds is the ratio of the probability of an event happening to the probability of the same event not happening:
$$odds(event) = \frac{P(event)}{1 - P(event)}$$
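As a small illustration of these two definitions, the following sketch converts between probabilities, odds and the linear predictor. The function names are hypothetical and the code does not depend on the fitted coefficients.

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)

def log_odds(betas, xs):
    """Linear predictor: sum of beta_i * x_i."""
    return sum(b * x for b, x in zip(betas, xs))

def probability(lo):
    """Invert log(odds): p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-lo))

# A probability of 0.2 corresponds to odds of 1 to 4.
example_odds = odds(0.2)
# Going through the log-odds scale and back recovers the probability.
round_trip = probability(math.log(example_odds))
```

The inverse transformation is what turns a fitted linear predictor into a predicted probability of relapse for a patient.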
Recall that during the process of modeling, it was seen that the important levels of Size were really three categories: Not Measured (NA's), <=30 and >30.
The model for prediction agreed on included Age, Cls, Maxdepth, Size and Pellymph as predictors, along with a two-way interaction between Age and Cls.
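Specifically, the logistic model can be written as
$$\log(odds) = \beta_0 + \beta_{age}\,age + 2.016\,cls_1 - 2.9039\,cls_2 + \beta_{\le 30}\,size_{\le 30} + \beta_{>30}\,size_{>30} + 0.066\,depth + \beta_{pel}\,pellymph - 0.0288\,age{:}cls_1 + 0.0964\,age{:}cls_2$$
where $|\beta_0| = 2.9972$, $|\beta_{age}| = 0.0286$, $|\beta_{\le 30}| = 0.1862$, $|\beta_{>30}| = 1.1692$ and $|\beta_{pel}| = 0.5071$ (the signs of these five coefficients are not legible in this transcript; the remaining signs agree with the odds ratios reported for the model).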
The statistically significant model included an interaction between Pellymph and Size>30. However, there were only three observations with such values, and the inclusion of this interaction created problems for prediction. Hence, for the sake of interpretability and in order to be able to predict, we decided to drop it. The residual deviance changed from 252.99 for the fuller model to 260.99 for the one kept.
[Figure: Probability Surfaces for Cls=Neg, Cls=Pos1 and Cls=Pos2; one row for Pellymph=Neg, Size=Not Measured and one row for Pellymph=Pos, Size=Not Measured]
The usual plot for this type of analysis is a probability curve.
Given the fact that we had 2 continuous variables in our model, we present some examples of probability surfaces.
This makes it possible to read off the probability of relapse for any Age/Maxdepth combination.
Observe the interaction of Age and Cls.
[Figure: Probability Surfaces for Cls=Neg, Cls=Pos1 and Cls=Pos2; one row for Pellymph=Pos, Size>30 and one row for Pellymph=Pos, Size<=30]
We can see that Age plays a bigger role when Cls has level of positive2
Changing from Size<=30 to Size>30 increases the probability of relapse, for a fixed set of the other variables (compare top to bottom).
•Some interesting results and interpretation for the model:
The odds ratio for the comparison between having Cls positive1 and positive2, keeping all other variables fixed:
$$\frac{odds(relapse \mid cls = pos1)}{odds(relapse \mid cls = pos2)} = \frac{e^{2.0160 - 0.02888\,age}}{e^{-2.9039 + 0.09644\,age}} = e^{(2.0160 + 2.9039) - (0.02888 + 0.09644)\,age} = e^{4.9199 - 0.1253\,age}$$
So, for Age=30, the odds ratio=3.190795; that is, the odds of having a relapse when Cls is positive1 are 3.190795 times the odds of relapse when Cls is positive2.
Now, for Age=50, the odds ratio=0.2602293, with analogous interpretation.
We can see the effect of the interaction between Age and Cls.
Similarly, we can look at the odds ratio for an increase of 10 mm in tumour depth, namely:
$$\frac{odds(relapse \mid depth = a + 10)}{odds(relapse \mid depth = a)} = e^{10 \times 0.0660604}$$
So, for fixed values of the other variables and an increase of 10 in Maxdepth, the odds ratio=1.935962. That is, the odds of having a relapse when the tumour is 10 mm deeper are 1.935962 times the odds at the original depth.
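Both odds ratios of the logistic model can be reproduced with a few lines. The sketch below uses the Cls, Age:Cls and Maxdepth coefficients, with signs as implied by the reported odds ratios; the names are hypothetical, and the results match the reported values only up to the rounding of the coefficients.

```python
import math

# Logistic coefficients (signs as implied by the reported odds ratios).
B_CLS1, B_CLS2 = 2.0160, -2.9039
B_AGE_CLS1, B_AGE_CLS2 = -0.02888, 0.09644
B_DEPTH = 0.0660604

def odds_ratio_pos1_vs_pos2(age):
    """Odds ratio Cls positive1 vs positive2, other covariates fixed."""
    return math.exp((B_CLS1 - B_CLS2) + (B_AGE_CLS1 - B_AGE_CLS2) * age)

def odds_ratio_depth(increase_mm):
    """Odds ratio for a given increase in Maxdepth (mm)."""
    return math.exp(B_DEPTH * increase_mm)

or_30 = odds_ratio_pos1_vs_pos2(30)
or_50 = odds_ratio_pos1_vs_pos2(50)
or_depth_10 = odds_ratio_depth(10)
```

Note that the depth odds ratio does not depend on the starting depth, since Maxdepth enters the logistic model without an interaction.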
One of the purposes of the case study was to classify patients in Low, Medium and High risk of relapse.
We suggest doing this using the probabilities obtained from this logistic regression, in the following way:
Calculate the probability from the model for each patient. If the probability is within a prefixed range, the patient is classified as Low; if it is within another range, as Medium; and so on. For example:
Low if in (0, .35], Med if in (.35, .60] and High if >.60
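A minimal sketch of this rule, using the example cutoffs .35 and .60 as defaults. The function name is hypothetical, and treating a probability of exactly 0 as Low is an assumption, since the example interval is open at 0.

```python
def risk_class(p, low_max=0.35, med_max=0.60):
    """Classify a predicted probability of relapse as Low, Medium or High."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if p <= low_max:   # (0, .35]; p == 0 is also treated as Low (assumption)
        return "Low"
    if p <= med_max:   # (.35, .60]
        return "Medium"
    return "High"      # > .60

# Hypothetical predicted probabilities for four patients.
classes = [risk_class(p) for p in (0.02, 0.35, 0.50, 0.66)]
```

The cutoffs are parameters, so a different choice of ranges only changes the two default values.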
Another way of classifying would use at risk or not at risk as the possible classifications (as a +/- test).
Although this gives only two possibilities, predictive values can be calculated, which gives a measure of accuracy.
Do this by setting a cutoff point for the calculated probabilities and setting the value of the test for each patient as + or -.
Some examples for different cutoffs follow.
Classified + if predicted Pr(D) >= .5
True D defined as relapse ~= 0

             -------- True --------
Classified |      D        ~D      |  Total
-----------+-----------------------+--------
     +     |      5         3      |      8
     -     |     37       612      |    649
-----------+-----------------------+--------
   Total   |     42       615      |    657

Positive predictive value Pr( D| +)   62.50%
Negative predictive value Pr(~D| -)   94.30%
Correctly classified                  93.91%
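The three summary figures follow directly from the four cells of the table. A short sketch, with a hypothetical function name, using the counts for the .5 cutoff:

```python
def predictive_values(tp, fp, fn, tn):
    """Return (PPV, NPV, proportion correctly classified) for a 2x2 table."""
    ppv = tp / (tp + fp)                        # Pr(D | +)
    npv = tn / (tn + fn)                        # Pr(~D | -)
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # correctly classified
    return ppv, npv, accuracy

# Counts from the table for cutoff .5:
# + row: 5 with relapse, 3 without; - row: 37 with relapse, 612 without.
ppv, npv, accuracy = predictive_values(tp=5, fp=3, fn=37, tn=612)
```

The same function can be applied to the table of any other cutoff once its four counts are known.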
•For the next cutoff values the table itself is omitted.
Classified + if predicted Pr(D) >= .25
True D defined as relapse ~= 0
Positive predictive value Pr( D| +)   33.33%
Negative predictive value Pr(~D| -)   94.76%
Correctly classified                  92.24%

Classified + if predicted Pr(D) >= .4
True D defined as relapse ~= 0
Positive predictive value Pr( D| +)   70.00%
Negative predictive value Pr(~D| -)   94.59%
Correctly classified                  94.22%

Classified + if predicted Pr(D) >= .6
True D defined as relapse ~= 0
Positive predictive value Pr( D| +)   50.00%
Negative predictive value Pr(~D| -)   93.74%
Correctly classified                  93.61%
As a "goodness of fit" check, a table by groups follows.
Logistic model for relapse, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)

Group   Prob     Obs_1   Exp_1   Obs_0   Exp_0   Total
  1     0.0150     1      0.8     65     65.2     66
  2     0.0188     1      1.1     65     64.9     66
  3     0.0216     0      1.3     66     64.7     66
  4     0.0259     3      1.5     62     63.5     65
  5     0.0318     2      1.9     64     64.1     66
  6     0.0440     1      2.5     65     63.5     66
  7     0.0604     4      3.4     61     61.6     65
  8     0.0839     2      4.7     64     61.3     66
  9     0.1413    12      6.9     54     59.1     66
 10     0.6601    16     17.8     49     47.2     65

number of observations = 657
number of groups = 10
Pvalue = 0.2704
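The test statistic behind this table can be sketched as follows. The code assumes the usual Hosmer-Lemeshow form, a sum of (observed - expected)^2 / expected over both outcome columns, referred to a chi-square distribution with (number of groups - 2) degrees of freedom; the poster does not state which statistic was used, so this is an assumption. The names are hypothetical, and the tail probability uses the closed form that holds for even degrees of freedom.

```python
import math

# Rows: (Obs_1, Exp_1, Obs_0, Exp_0), copied from the goodness-of-fit table.
GROUPS = [
    (1, 0.8, 65, 65.2), (1, 1.1, 65, 64.9), (0, 1.3, 66, 64.7),
    (3, 1.5, 62, 63.5), (2, 1.9, 64, 64.1), (1, 2.5, 65, 63.5),
    (4, 3.4, 61, 61.6), (2, 4.7, 64, 61.3), (12, 6.9, 54, 59.1),
    (16, 17.8, 49, 47.2),
]

def hosmer_lemeshow(groups):
    """Sum of (O - E)^2 / E over relapse and non-relapse counts."""
    return sum((o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
               for o1, e1, o0, e0 in groups)

def chi2_sf_even_df(x, df):
    """Upper tail of a chi-square with even df:
    exp(-x/2) * sum_{j < df/2} (x/2)^j / j!"""
    if df % 2 != 0:
        raise ValueError("closed form used here needs an even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

statistic = hosmer_lemeshow(GROUPS)
p_value = chi2_sf_even_df(statistic, df=len(GROUPS) - 2)
```

Because the expected counts in the table are rounded to one decimal, the p-value from this sketch is close to, but not exactly, the reported 0.2704.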
CONCLUSIONS
•Given the nature of the study, and the assumption that prediction of relapse would be done right after surgery, variables observed after surgery were not taken into account. These were the status of the patient at the last follow-up date and whether the patient received radiation.
•Contrary to what we expected, Disease left after surgery did not play an important role in prediction.
•There was agreement throughout the different analyses (exploratory, survival and logistic) regarding the importance of the inclusion of three covariates: Maxdepth, Capillary Lymphatic Spaces (Cls) and Size.
•The effect of variable Age on relapse depends on its interaction with Capillary Lymphatic Spaces (Cls).
•The important variables for predicting the time to relapse are Age, Cls, Size and Maxdepth.
•The important variables for predicting the probability of relapse are Age, Cls, Size, Maxdepth and Pellymph.
FUTURE WORK
•It would be of relevance to check the importance of covariates when separating the response variable as no relapse, relapse before a specific time and relapse after that time.
•Use of trees as a classification tool rather than an exploratory tool.
ACKNOWLEDGEMENTS
We would like to thank the following for their help and support in the creation of this poster:
StatCar lab, Mathematics and Statistics Dept., U of C
Dr. R. Brant, CHS, U of C
Dr. P. Ehlers, Math and Stats, U of C
B. Teare, Math and Stats, U of C
Learning Commons, U of C