Faculty of Economics and Administrative Sciences
Department of Applied Statistics
Survival Analysis of Breast Cancer Patients in Gaza Strip
1- Introduction
Survival analysis has become a popular tool in observational and experimental studies involving follow-up of study participants over time. These studies often experience late arrival and early departure of subjects into and out of the observation period.
Survival analysis techniques allow for a study to start without all experimental units enrolled and to end before all experimental units have experienced an event.
2- Terminology and Notations:
Survival analysis is a collection of statistical procedures for data analysis in which the outcome variable of interest is the time until an event occurs. Survival time can be defined broadly as the time to the occurrence of a given event.
By time, we mean years, months, weeks, or days from the beginning of follow-up of an individual until an event occurs.
By event, we mean death, disease incidence, relapse from remission, recovery (e.g., return to work) or any designated experience of interest that may happen to an individual. Although more than one event may be considered in the same analysis, we will assume that only one event is of designated interest.
Censored Data. Most survival analyses consider a key analytical problem called censoring. In essence, censoring occurs when we have some information about individual survival time, but we don’t know the survival time exactly.
There are generally three reasons why censoring may occur:
(1) A person does not experience the event before the study ends.
(2) A person is lost to follow-up during the study period.
(3) A person withdraws from the study because of death (if death is not the event of interest) or some other reason.
The survivor function S(t) is fundamental to a survival analysis and gives the probability that a person survives longer than some specified time t: that is, S(t) gives the probability that the random variable T exceeds the specified time t.
The hazard function h(t) gives the instantaneous potential per unit time for the event to occur.
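The two definitions above can be written compactly as follows. This is the standard textbook notation, added here only as a reading aid; the last identity, which links the two functions, holds for a continuous survival time T.

```latex
S(t) = P(T > t), \qquad
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}, \qquad
S(t) = \exp\!\left( -\int_0^t h(u)\, du \right)
```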
4- Kaplan-Meier Survival Analysis (KMSA)
Several methods have been developed for constructing survival curve estimates, the most common being the life-table and Kaplan-Meier methods.
Kaplan and Meier (1958) were the first to give a simple solution to the problem of estimating the survival curve while taking right censoring into account.
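To illustrate the product-limit idea, here is a minimal Python sketch of the Kaplan-Meier estimate on a small made-up data set. The function name `kaplan_meier` and the toy data are assumptions for illustration only; the analysis in this work was carried out in R, and this is not that code.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t).

    times  : follow-up times
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, n_at_risk, deaths, S(t)) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    rows = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same time t; subjects censored
        # at t are still counted as at risk at t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            rows.append((t, n_at_risk, deaths, surv))
        n_at_risk -= leaving
    return rows


# Hypothetical toy data: six subjects, two of them censored (event = 0).
toy_times = [2, 3, 3, 5, 8, 9]
toy_events = [1, 1, 0, 1, 0, 1]
km_rows = kaplan_meier(toy_times, toy_events)
for t, n, d, s in km_rows:
    print(f"t={t}  at risk={n}  deaths={d}  S(t)={s:.4f}")
```

The estimate only changes at observed event times; censored subjects simply reduce the number at risk at later times.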
5- The Log–Rank Test for Comparison of two Survival Distributions
The log–rank test is a nonparametric method for comparing survival distributions and the most popular method for testing differences in survival between groups.
The problem of comparing survival distributions arises often in biomedical research. For example, a clinical oncologist may be interested in comparing the ability of two or more treatments to prolong life or maintain health.
These differences can be illustrated by drawing graphs of the estimated survivorship functions, but that gives only a rough idea of the difference between the distributions. It does not reveal whether the differences are significant or merely chance variations. A statistical test is therefore necessary.
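As a rough illustration of how such a test works, the sketch below implements the standard two-group log–rank statistic in plain Python: at every distinct event time it compares the observed deaths in group 1 with the number expected if both groups shared one survival curve. The function name `logrank_test` and the toy data are assumptions for illustration, and the code is not the formula numbering or software used in this work.

```python
import math


def logrank_test(times1, events1, times2, events2):
    """Two-group log-rank test. Returns (chi-square with 1 df, p-value)."""
    event_times = sorted(
        {t for t, e in zip(times1, events1) if e}
        | {t for t, e in zip(times2, events2) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n1 = sum(1 for x in times1 if x >= t)  # at risk in group 1
        n2 = sum(1 for x in times2 if x >= t)  # at risk in group 2
        d1 = sum(1 for x, e in zip(times1, events1) if x == t and e)
        d2 = sum(1 for x, e in zip(times2, events2) if x == t and e)
        n, d = n1 + n2, d1 + d2
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += n1 * n2 * d * (n - d) / (n * n * (n - 1))
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical toy data: group A fails early, group B fails late.
a_times, a_events = [1, 2, 3, 4], [1, 1, 1, 1]
b_times, b_events = [5, 6, 7, 8], [1, 1, 1, 1]
chi2, p_value = logrank_test(a_times, a_events, b_times, b_events)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

A small p-value suggests that the difference between the two estimated curves is unlikely to be a chance variation.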
6- Cox Proportional Hazards Model (CPHM)
We discuss the most commonly used model in survival data analysis, the Cox (1972) proportional hazards model, and its related statistical inference. This model does not require knowledge of the underlying distribution.
The Cox proportional hazards model (CPHM) can be called a "robust" model, in the sense that the results from using the Cox model will closely approximate the results for the correct parametric model. For example, if the correct parametric model is lognormal, then the use of the Cox model typically will give results comparable to those obtained using a lognormal model. Alternatively, if the correct model is exponential, then the Cox model results will closely approximate the results from fitting an exponential model.
The Cox proportional hazards model (CPHM) is a popular mathematical model used for analyzing survival data.
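For reference, the general form of the Cox model can be written as below. This is the standard textbook form, with $h_0(t)$ the unspecified baseline hazard, $X_1, \dots, X_p$ the explanatory variables and $\beta_1, \dots, \beta_p$ the coefficients; the symbols are generic and not tied to the formula numbers cited elsewhere in this work.

```latex
h(t, X) = h_0(t) \exp\!\left( \sum_{i=1}^{p} \beta_i X_i \right),
\qquad
\mathrm{HR} = \frac{h(t, X^{*})}{h(t, X)} = \exp\!\left( \sum_{i=1}^{p} \beta_i \,(X_i^{*} - X_i) \right)
```

Because $h_0(t)$ cancels in the ratio, the hazard ratio does not depend on time, which is the proportional hazards assumption.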
8- Case study
8-1- Introduction
Cancer is considered one of the main medical problems in both developed and developing countries because of its rate of spread, the high cost of medical treatment and its high mortality rates. In addition, it calls for medical and educational programs such as prevention programs and early detection programs, as well as social, medical and psychological rehabilitation programs for patients.
In this thesis we studied breast cancer incidence in the Gaza Strip and analysed the data using different models of survival analysis. We started with Kaplan-Meier estimation of the survivorship function (KME), then used the Log–Rank test for comparison of two survival distributions, and then applied the Cox Proportional Hazards Model (CPHM). The data were analyzed using the R program, which produced all the results below.
8-2- Cancer morbidity and reported cases
In 2005, breast cancer was the most common type of cancer among the Palestinian population (17.3%), with an incidence rate of 7.5 per 100,000 population. Lung cancer was the most common cancer among males, constituting 13.8% of all male cancers, with an incidence rate of 5.2 per 100,000 males. Breast cancer was the most common cancer among females (31.4%), with an incidence rate of 15.1 per 100,000 population.
The data for all breast cancer cases in the Gaza Strip were collected from El-shifa hospital. Missing data were obtained from the patients' records to complete the data set required for survival analysis.
8-5- Variables of the study
1- Number of patients.
2- Birth date of patients.
3- Gender.
4- Marital status.
5- Address.
6- Smoking.
7- Date of the first diagnosis (incidence).
8- Date of the end of follow-up.
9- Status (death or censoring).
10- First place for the emergence of tumor (all histology of primary).
11- Laterality: the breast that contains the primary tumor (1 = Right, 2 = Left).
12- Treatment 1, Surgery (1 = given, 2 = not given).
13- Treatment 2, Radiotherapy (1 = given, 2 = not given).
14- Treatment 3, Chemotherapy (1 = given, 2 = not given).
15- Treatment 4, Hormonal therapy (1 = given, 2 = not given).
16- Topography code (all C50).
8-6-Survival Analysis of the data
Nonparametric or distribution-free methods are quite easy to understand and apply. However, they are less efficient than parametric methods when survival times follow a theoretical distribution, and more efficient when no suitable theoretical distribution is known for the data.
In addition, the survival time of the patients does not follow the normal distribution or any distribution from the exponential family.
8-6-1-Kaplan-Meier Estimation of survivorship function (KME)
Data on 103 breast cancer patients, covering 2000 to 2005, were provided by the cancer registry of AL-Shefa hospital. These patients entered the study at the beginning of 2000. By the end of the study, 38 patients had died and 56 were censored. Survival time is measured in days from the time of diagnosis. Table (1) below lists the survival times t, in days, for the patients who died by the end of the study.
The Kaplan-Meier estimate of the survivorship function (KME), $\hat{S}(t)$, is computed following formula (3.11). The computations were carried out using the R statistical program, and the results are displayed in table (1).
We note that the KME $\hat{S}(t)$ decreases as time t increases.
As with other estimators, the standard error (S.E.) of the Kaplan-Meier estimator $\hat{S}(t)$, computed by formula (3.16), gives an indication of the potential error of $\hat{S}(t)$. The confidence interval deserves more attention than the point estimate alone. A 95% confidence interval for $S(t)$ is estimated by
$\hat{S}(t) \pm 1.96 \, \mathrm{S.E.}[\hat{S}(t)]$
This has also been calculated using the R program, and the results are given in table (1).
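The sketch below shows one common way to obtain such a standard error and interval, namely Greenwood's formula. It is an assumption that this matches formula (3.16), which is not reproduced here, and the at-risk counts 103 and 102 for the first two event times are inferred from the estimates 0.990 and 0.981 in table (1), not stated in the text. The function name is hypothetical.

```python
import math


def km_with_greenwood(risk_table, z=1.96):
    """risk_table: list of (n_at_risk, deaths) at successive event times.

    Returns a list of (S, SE, lower, upper) using Greenwood's formula and
    the plain interval S +/- z * SE. Assumes deaths < n_at_risk in each row.
    """
    surv = 1.0
    greenwood_sum = 0.0
    out = []
    for n, d in risk_table:
        surv *= 1.0 - d / n
        greenwood_sum += d / (n * (n - d))
        se = surv * math.sqrt(greenwood_sum)
        out.append((surv, se, surv - z * se, surv + z * se))
    return out


# Assumed at-risk counts for the first two event times (days 52 and 72).
km_ci = km_with_greenwood([(103, 1), (102, 1)])
for s, se, lower, upper in km_ci:
    print(f"S={s:.3f}  SE={se:.3f}  plain 95% CI=({lower:.3f}, {upper:.3f})")
```

The plain interval can exceed 1 and is usually truncated there. Statistical software may also compute the interval on a transformed scale, so the limits can differ slightly from those printed in table (1).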
Table (1), Kaplan-Meier Estimation of survivorship function (KME)
ESTIMATE is the cumulative proportion surviving at the time, $\hat{S}(t)$, and STD. ERROR is its standard error.
NO   DAYS   STATUS   ESTIMATE   STD. ERROR   LOWER 95% CI   UPPER 95% CI   HAZARD
1    52     event    0.990      0.010        0.972          1.000          0.0098
2    72     event    0.981      0.014        0.954          1.000          0.0196
3    221    event    0.971      0.017        0.939          1.000          0.0296
4    281    event    0.961      0.019        0.925          0.999          0.0396
5    355    event    0.951      0.021        0.911          0.994          0.04976
6    447    event    0.942      0.023        0.898          0.988          0.060
7    528    event    0.932      0.025        0.885          0.982          0.0703
8    596    event    0.922      0.026        0.872          0.975          0.0809
9    609    event    0.913      0.028        0.860          0.969          0.0914
10   626    event    0.903      0.029        0.848          0.962          0.10213
11   680    event    0.893      0.030        0.836          0.955          0.1129
12   748    event    0.883      0.032        0.824          0.948          0.12387
13   754    event    0.874      0.033        0.812          0.940          0.13492
14   767    event    0.864      0.034        0.800          0.933          0.1461
15   802    event    0.854      0.035        0.789          0.925          0.1574
16   806    event    0.845      0.036        0.778          0.918          0.1688
17   810    event    0.835      0.037        0.766          0.910          0.1804
18   883    event    0.825      0.037        0.755          0.902          0.1921
19   892    event    0.816      0.038        0.744          0.894          0.2039
20   893    event    0.806      0.039        0.733          0.886          0.21589
21   929    event    0.796      0.040        0.722          0.878          0.2280
22   968    event    0.786      0.040        0.711          0.870          0.2403
23   1002   event    0.777      0.041        0.700          0.861          0.2527
24   1247   event    0.767      0.042        0.690          0.853          0.2653
25   1413   event    0.757      0.042        0.679          0.845          0.2780
26   1492   event    0.748      0.043        0.668          0.836          0.2909
27   1503   event    0.738      0.043        0.658          0.828          0.30399
28   1608   event    0.728      0.044        0.647          0.819          0.3172
29   1660   event    0.718      0.044        0.637          0.811          0.3307
30   1733   event    0.709      0.045        0.626          0.802          0.3443
31   1757   event    0.699      0.045        0.616          0.793          0.3581
32   1792   event    0.689      0.046        0.606          0.785          0.372
33   1839   event    0.679      0.046        0.594          0.775          0.3878
34   1893   event    0.667      0.047        0.581          0.765          0.4055
35   1946   event    0.653      0.048        0.566          0.754          0.4261
36   2048   event    0.622      0.055        0.523          0.739          0.4749
37   2106   event    0.574      0.068        0.455          0.725          0.55495
38   2140   event    0.492      0.096        0.336          0.721          0.7091
For the breast cancer data in the Gaza Strip, the mean survival time, estimated using formula (3.19), is 1751 days. The standard error of the mean survival time, given by formula (3.20), is estimated as 62.512, as shown in table (2).
Table (2), Means for Survival Time
MEAN ESTIMATE   STD. ERROR   95% CI LOWER BOUND   95% CI UPPER BOUND
1751            62.512       1628.514             1873.560
The estimated median survival time is the 50th percentile, which is the value of t at $\hat{S}(t) = 0.50$. The median survival time for breast cancer cases in the Gaza Strip is approximately $\hat{m} = 2140$ days, at which $\hat{S}(t) = 0.492$, as shown in table (3) below.
Table (3), Median for Survival Time
RECORDS   N. MAX   N. START   EVENTS   MEDIAN
103       103      103        38       2140
Theoretically, the estimator $\hat{S}(t)$ of the survival function, which is plotted in graph (1), is expected to appear as a step function, since it remains constant between two observed exact survival times. The most commonly used summary statistic in survival analysis is the median survival time. The median survival time ($\hat{m}$ = 2140 days) is estimated from the survival curve. The estimated mean survival time ($\hat{\mu}$ = 1751 days) can be seen to equal the area under the estimated survivorship function, as described by formula (3.18).
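A minimal sketch of how these two summaries can be read off a Kaplan-Meier step function is given below. The function names are hypothetical. The median example uses the last three rows of table (1); the mean example uses made-up numbers, because the full at-risk data needed to reproduce 1751 days are not listed here.

```python
def km_median(steps):
    """steps: list of (time, S(t)) at successive event times.

    Returns the first time at which S(t) drops to 0.5 or below,
    or None if the curve never gets that low.
    """
    for t, s in steps:
        if s <= 0.5:
            return t
    return None


def km_restricted_mean(steps, t_max):
    """Area under the step function from 0 to t_max."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in steps:
        if t > t_max:
            break
        area += prev_s * (t - prev_t)  # curve is flat between event times
        prev_t, prev_s = t, s
    area += prev_s * (t_max - prev_t)
    return area


# Last three rows of table (1): (days, estimate).
tail_of_table_1 = [(2048, 0.622), (2106, 0.574), (2140, 0.492)]
median_days = km_median(tail_of_table_1)
print(median_days)

# Hypothetical toy curve for the area calculation.
toy_steps = [(2, 0.75), (4, 0.5), (6, 0.25)]
toy_mean = km_restricted_mean(toy_steps, t_max=8)
print(toy_mean)
```

Because the curve in table (1) only falls below 0.50 at the last event time, the median coincides with the largest observed death time, 2140 days.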
Graph (1), Kaplan-Meier estimate of the survivorship function for the data in Table (1) and its 95% confidence intervals.
Graph (2) and table (1) show the estimated hazard function. The death rate (hazard function) is low in the first 52 days after diagnosis. From day 52 to day 1000, it increases continuously from 0.02 to 0.24. From day 1000 to day 1890, it continues to increase, from 0.25 to 0.40. After day 1890 and up to day 2140, it increases rapidly from 0.43 to 0.71.
Overall, the hazard rises after day 52 and keeps increasing until day 2140, from 0.0196 to 0.7091.
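The HAZARD column of table (1) appears numerically consistent with the cumulative hazard $\hat{H}(t) = -\ln \hat{S}(t)$ rather than an instantaneous rate, which would explain why it never decreases. The short check below uses a few (estimate, hazard) pairs copied from table (1). This reading is an inference from the numbers, not a statement made in the source, and small gaps are expected because the estimates are rounded to three decimals.

```python
import math

# (S(t) estimate, HAZARD) pairs copied from table (1).
table_1_pairs = [
    (0.990, 0.0098),
    (0.932, 0.0703),
    (0.835, 0.1804),
    (0.709, 0.3443),
    (0.492, 0.7091),
]

# Absolute gap between -ln(S) and the tabulated HAZARD value.
gaps = [abs(-math.log(s) - h) for s, h in table_1_pairs]
for (s, h), gap in zip(table_1_pairs, gaps):
    print(f"S={s:.3f}  -ln(S)={-math.log(s):.4f}  table={h:.4f}  gap={gap:.4f}")
```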
8-6-2-The Log–Rank test for Comparison of two Survival Distributions
The problem here is to compare the survival times of two groups of breast cancer patients exposed to four different treatments (Surgery, Radiotherapy, Chemotherapy, Hormonal Therapy) by comparing the survivorship function and the hazard function of the two groups.
The survival data for the 103 females with breast cancer are divided into two groups: the first group contains the patients aged less than 50 years, and the second group contains the patients aged 50 years or more.
Survival times are measured for both groups in days from the time of diagnosis. Table (4) lists the survival times t in days.
The Kaplan-Meier estimate of the survivorship function (KME), $\hat{S}(t)$, is computed for each group in table (4).
As before, the standard error (S.E.) of the Kaplan-Meier estimator $\hat{S}(t)$ and a 95% confidence interval for $S(t)$, estimated by
$\hat{S}(t) \pm 1.96 \, \mathrm{S.E.}[\hat{S}(t)]$
are also given in table (4).
Table (4), Kaplan-Meier Estimation of survivorship function (KME) for two groups of breast cancer cases in the Gaza Strip.
ESTIMATE is the cumulative proportion surviving at the time, $\hat{S}(t)$, and STD. ERROR is its standard error.
NO   DAYS       STATUS   ESTIMATE   STD. ERROR   LOWER 95% CI   UPPER 95% CI   DIFFERENCE            HAZARD
1    447        event    0.971      0.029        0.91416        1.02784        less than 50          0.0299
2    528        event    0.941      0.04         0.8626         1.0194         less than 50          0.061
3    609        event    0.912      0.049        0.81596        1.00804        less than 50          0.0921
4    748        event    0.882      0.055        0.7742         0.9898         less than 50          0.1252
5    754        event    0.853      0.061        0.73344        0.97256        less than 50          0.1592
6    767        event    0.824      0.065        0.6966         0.9514         less than 50          0.1942
7    806        event    0.794      0.069        0.65876        0.92924        less than 50          0.231
8    883        event    0.765      0.073        0.62192        0.90808        less than 50          0.2683
9    892        event    0.735      0.076        0.58604        0.88396        less than 50          0.310
10   893        event    0.706      0.078        0.55312        0.85888        less than 50          0.348
11   968        event    0.676      0.08         0.5192         0.8328         less than 50          0.391
12   1.00E+03   event    0.647      0.082        0.48628        0.80772        less than 50          0.435
13   1.25E+03   event    0.618      0.083        0.45532        0.78068        less than 50          0.482
14   1.61E+03   event    0.588      0.084        0.42336        0.75264        less than 50          0.531
15   1.66E+03   event    0.559      0.085        0.3924         0.7256         less than 50          0.582
16   1.76E+03   event    0.529      0.086        0.36044        0.69756        less than 50          0.636
17   1.79E+03   event    0.5        0.086        0.33144        0.66856        less than 50          0.6932
18   2.11E+03   event    0.417      0.104        0.21316        0.62084        less than 50          0.8755
19   2.14E+03   event    0.312      0.119        0.07876        0.54524        less than 50          1.1632
20   52         event    0.986      0.014        0.95856        1.01344        greater or equal 50   0.0146
21   72         event    0.971      0.02         0.9318         1.0102         greater or equal 50   0.02946
22   221        event    0.957      0.025        0.908          1.006          greater or equal 50   0.04445
23   281        event    0.942      0.028        0.88712        0.99688        greater or equal 50   0.05972
24   355        event    0.928      0.031        0.86724        0.98876        greater or equal 50   0.075
25   596        event    0.913      0.034        0.84636        0.97964        greater or equal 50   0.09097
26   626        event    0.899      0.036        0.82844        0.96956        greater or equal 50   0.10697
27   680        event    0.884      0.039        0.80756        0.96044        greater or equal 50   0.1232
28   802        event    0.87       0.041        0.78964        0.95036        greater or equal 50   0.13976
29   810        event    0.855      0.042        0.77268        0.93732        greater or equal 50   0.15657
30   929        event    0.841      0.044        0.75476        0.92724        greater or equal 50   0.1737
31   1.41E+03   event    0.826      0.046        0.73584        0.91616        greater or equal 50   0.19106
32   1.49E+03   event    0.812      0.047        0.71988        0.90412        greater or equal 50   0.2088
33   1.50E+03   event    0.797      0.048        0.70292        0.89108        greater or equal 50   0.2268
34   1.73E+03   event    0.783      0.05         0.685          0.881          greater or equal 50   0.2451
35   1.84E+03   event    0.766      0.051        0.66604        0.86596        greater or equal 50   0.2662
36   1.89E+03   event    0.748      0.053        0.64412        0.85188        greater or equal 50   0.2897
37   1.95E+03   event    0.728      0.055        0.6202         0.8358         greater or equal 50   0.31788
38   2.05E+03   event    0.667      0.077        0.51608        0.81792        greater or equal 50   0.4049
The estimated mean survival time for the first group is 1583 days, and the standard error of the mean survival time is 109.04.
The estimated mean survival time for the second group is 1832 days, and the standard error of the mean survival time is 74.23.
For all patients, the estimated mean survival time is 1751 days and the standard error of the mean survival time is 62.512. These results are shown in table (5) below.
Table (5), Means for Survival Time for two groups of breast cancer in the Gaza Strip
DIFFERENT             MEAN ESTIMATE   STD. ERROR   95% CI LOWER BOUND   95% CI UPPER BOUND
less than 50          1583            109.042      1369.375             1796.818
greater or equal 50   1832            74.233       1686.644             1977.638
Overall               1751            62.512       1628.514             1873.560
For the breast cancer data, the log–rank statistic computed using formula (3.24) is 6.004, as shown in table (6), and the corresponding P-value is 0.014, which indicates that the null hypothesis should be rejected. The null hypothesis being tested is that there is no overall difference between the two survival curves. We can therefore conclude that the first group and the second group have significantly different KME survival curves.
Table (6), Test of equality of survival distributions for the different levels of different by Log Rank test
OVERALL COMPARISONS
                        Chi-Square   df   Sig.
Log Rank (Mantel-Cox)   6.004        1    0.014
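The reported significance level can be checked from the statistic alone. The sketch below assumes the log–rank statistic is referred to a chi-square distribution with 1 degree of freedom, as table (6) indicates, and uses the identity that the upper tail of that distribution equals erfc(sqrt(x / 2)). The function name is hypothetical.

```python
import math


def chi_square_upper_tail_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


log_rank_p = chi_square_upper_tail_1df(6.004)
print(round(log_rank_p, 3))  # prints 0.014
```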
Plots of the KME curves for the first group together with the second group are shown in graph (3) below. Notice that the KME curve for the second group is consistently higher than the KME curve for the first group. These results indicate that the second group has better survival and a better response to treatment than the first group.
Moreover, as the number of days increases, the two curves appear to get further apart, which indicates that the effect of treatment on the second group is greater than the effect of treatment on the first group.
Graph (3), Kaplan-Meier estimate of the survivorship function for the data in table (5.24) for two groups
Graph (4) of the estimated hazard function shows that the death rate for both groups is low in the first 750 days after diagnosis. From day 750 to the end, the death rate increases continuously for both groups. The first group, which contains the patients aged less than 50 years, has death rates between 0.13 and 0.70, while the second group, which contains the patients aged 50 years or more, has death rates between 0.14 and 0.40.
The hazard rate increases continuously for both groups, but the hazard rate for the first group (ages less than 50 years) is higher than that for the second group (ages 50 years or more).
8-6-3-Cox Proportional Hazards Model (CPHM)
8-6-3-1-The Formula for the Cox Proportional Hazards Model
We thus consider a problem involving four explanatory variables as predictors of survival time T, where T denotes the number of days until death. The first explanatory variable, $X_1$, is Groups Ages (the two groups of breast cancer patients), with 34 patients in the first group (ages less than 50 years) and 69 patients in the second group (ages greater than or equal to 50 years).
The data set also contains three further variables:
$X_2$ = Lateral: the breast that contains the primary tumor (1 = Right, 2 = Left).
$X_3$ = Surgery: breast-sparing surgery (1 = given, 2 = not given).
$X_4$ = Hormonal therapy (1 = given, 2 = not given).
The outcome variable for the model is the time in days until a patient dies.
We now describe the final model and its results concerning breast cancer cases in the Gaza Strip.
The method of estimation used to obtain the coefficients for the final model is maximum likelihood (ML) estimation.
The p-value of 0.0187 obtained for the coefficient of Groups Ages indicates a significant effect for that variable.
Likewise, the p-value of 0.0325 obtained for the coefficient of Surgery indicates a significant effect for that variable.
Each p-value is based on the Z statistic, the ratio of the coefficient to its standard error, which is known as a Wald statistic. All of these results can be found in table (7).
Table (7), Variables in the equation
VARIABLE      COEF      HR       SE(COEF)   Z        P        LOWER .95   UPPER .95
Groups Ages   -0.7695   0.4632   0.3273     -2.351   0.0187   0.2439      0.8798
Surgery       -1.037    0.3545   0.4849     -2.139   0.0325   0.1371      0.917
Log-likelihood = -157.9798
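The entries in a row of table (7) are linked by simple relations: HR = exp(COEF), Z = COEF / SE(COEF), the P-value is the two-sided normal tail of Z, and the 95% limits are exp(COEF ± 1.96 SE). The sketch below checks these relations for the Surgery row, taking the coefficient -1.037 and standard error 0.4849 as printed. The function name is hypothetical and a normal reference distribution for Z is assumed.

```python
import math


def wald_summary(coef, se, z_crit=1.96):
    """Hazard ratio, Wald Z, two-sided p-value and 95% CI for one coefficient."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return {
        "hr": math.exp(coef),
        "z": z,
        "p": p,
        "lower": math.exp(coef - z_crit * se),
        "upper": math.exp(coef + z_crit * se),
    }


surgery = wald_summary(-1.037, 0.4849)
for name, value in surgery.items():
    print(f"{name:>5} = {value:.4f}")
```

A negative coefficient gives a hazard ratio below 1, and a confidence interval for the hazard ratio that excludes 1 corresponds to a P-value below 0.05.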
Finally, we consider the final model for the breast cancer data. The fitted model, written in terms of the hazard function using formula (4.1), is given by
$\hat{h}(t, X, \hat{B}) = \hat{h}_0(t) \, e^{-0.77 \, \text{Groups Age} \, - \, 1.04 \, \text{Surgery}}$
8-6-3-2-Adjusted Survival Curves Using the (CPHM)
Typically, when computing adjusted survival curves, the value chosen for a covariate being adjusted is an average value such as an arithmetic mean or median. In fact, most computer programs for the Cox model automatically use the mean value over all subjects for each covariate being adjusted. A general formula for the adjusted survival curve for all covariates in the model, following formula (4.19), is given by:
$\hat{S}(t, X, \hat{B}) = [\hat{S}_0(t)]^{\exp(-0.77 \, \text{Group Ages} \, - \, 1.04 \, \text{Surgery})}$
To obtain the adjusted survival curve, we then substitute the mean values of the covariates into the fitted model. The resulting expression for the adjusted survival curve is shown below, and the results of applying it are given in the third column of table (8).
$\hat{S}(t, \bar{X}, \hat{B}) = [\hat{S}_0(t)]^{0.0377}$
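As a quick numerical check, the sketch below raises a few baseline survivorship values copied from table (8) to the power 0.0377 and compares the result with the SURVIVAL column of the same table. The function name is hypothetical, and agreement is only expected up to the rounding of the tabulated values.

```python
def adjusted_survival(baseline_survival, exponent=0.0377):
    """Adjusted survival S(t) = S0(t) ** exponent at the covariate means."""
    return baseline_survival ** exponent


# (time in days, baseline S0(t), tabulated SURVIVAL) copied from table (8).
table_8_rows = [
    (52, 0.7963, 0.991),
    (609, 0.1154, 0.922),
    (2140, 2.8e-08, 0.519),
]

adjusted = [adjusted_survival(s0) for _, s0, _ in table_8_rows]
for (t, s0, tabulated), value in zip(table_8_rows, adjusted):
    print(f"t={t}  S0={s0:g}  adjusted={value:.3f}  table={tabulated:.3f}")
```

Even though the baseline survivorship function falls almost to zero, the small exponent keeps the adjusted curve well above one half at day 2140.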
Table (8), Adjusted Survival function Using the Cox PH Model
BASELINE CUM HAZARD is $\hat{H}_0(t)$, BASELINE SURVIVORSHIP is $\hat{S}_0(t)$, and CUMULATIVE HAZARD is $\hat{H}(t, x, \hat{B})$.
NO   TIME   SURVIVAL   S.E     LOWER 95% CI   UPPER 95% CI   BASELINE CUM HAZARD   BASELINE SURVIVORSHIP   CUMULATIVE HAZARD
1    52     0.991      0.009   0.975          1              0.228                 0.7963                  0.009
2    72     0.983      0.012   0.96           1              0.457                 0.6332                  0.017
3    221    0.974      0.015   0.947          1              0.688                 0.50278                 0.026
4    281    0.966      0.017   0.934          1              0.921                 0.39813                 0.035
5    355    0.957      0.019   0.922          0.995          1.159                 0.31392                 0.044
6    447    0.949      0.021   0.91           0.991          1.4                   0.24658                 0.053
7    528    0.94       0.022   0.898          0.986          1.648                 0.19239                 0.062
8    596    0.931      0.024   0.886          0.98           1.902                 0.14921                 0.072
9    609    0.922      0.026   0.874          0.974          2.159                 0.1154                  0.081
10   626    0.913      0.027   0.863          0.968          2.419                 0.089                   0.091
11   680    0.904      0.028   0.851          0.962          2.681                 0.06851                 0.101
12   748    0.895      0.03    0.84           0.956          2.945                 0.05259                 0.111
13   754    0.886      0.031   0.829          0.949          3.214                 0.0402                  0.121
14   767    0.877      0.032   0.818          0.943          3.487                 0.03059                 0.131
15   802    0.868      0.033   0.807          0.936          3.763                 0.02321                 0.142
16   806    0.859      0.034   0.796          0.929          4.043                 0.01755                 0.152
17   810    0.85       0.035   0.785          0.922          4.325                 0.01323                 0.163
18   883    0.84       0.036   0.774          0.915          4.612                 0.00993                 0.174
19   892    0.831      0.037   0.763          0.908          4.907                 0.00739                 0.185
20   893    0.822      0.038   0.752          0.901          5.213                 0.00544                 0.196
21   929    0.812      0.039   0.742          0.893          5.523                 0.00399                 0.208
22   968    0.803      0.039   0.731          0.886          5.837                 0.00292                 0.22
23   1002   0.793      0.04    0.72           0.878          6.157                 0.00212                 0.232
24   1247   0.783      0.041   0.709          0.87           6.483                 0.00153                 0.244
25   1413   0.774      0.042   0.698          0.862          6.813                 0.0011                  0.257
26   1492   0.764      0.042   0.687          0.854          7.146                 0.00079                 0.269
27   1503   0.754      0.043   0.677          0.846          7.486                 0.00056                 0.282
28   1608   0.744      0.044   0.666          0.838          7.833                 0.0004                  0.295
29   1660   0.735      0.044   0.655          0.83           8.187                 0.00028                 0.309
30   1733   0.725      0.045   0.644          0.822          8.547                 0.00019                 0.322
31   1757   0.715      0.046   0.634          0.813          8.912                 0.00013                 0.336
32   1792   0.705      0.046   0.623          0.805          9.285                 9.3E-05                 0.35
33   1839   0.694      0.047   0.611          0.795          9.699                 6.1E-05                 0.365
34   1893   0.681      0.048   0.597          0.785          10.2                  3.7E-05                 0.384
35   1946   0.667      0.049   0.581          0.774          10.76                 2.1E-05                 0.406
36   2048   0.639      0.055   0.546          0.76           11.9                  6.8E-06                 0.448
37   2106   0.593      0.066   0.487          0.748          13.86                 9.5E-07                 0.522
38   2140   0.519      0.087   0.393          0.741          17.39                 2.8E-08                 0.655
9- The recommendations
1- We recommend reactivating and rehabilitating the Cancer Registry Center in Palestine in order to know the exact number of oncology cases and their different diagnostic sources, so that the problem and the reasons for its spread can be defined.
2- Develop the use of the International Classification of Diseases, Tenth Revision (ICD-10), which relates to diseases, deaths and disabilities. Using this classification requires training for doctors, health professionals and data entry staff.
3- We recommend establishing an advanced system for registering causes of death on death certificates.
4- Develop cooperation and full coordination between the System and Information Department and the Cancer Registry Center for monitoring and recording tumor cases in the Ministry of Health through a sophisticated electronic system.
5- It is important to develop the cancer patients' data in the Cancer Registry Center. Furthermore, cooperation between the Ministry of Health and the Palestinian Central Bureau of Statistics (PCBS) should be improved, to minimize the gaps between indicators that depend on Ministry of Health data and the estimated indicators of the PCBS that result from health surveys.
6- We recommend applying the Kaplan-Meier estimation of the survivorship function (KME) and estimating the mean survival time for all cancer patients, with a confidence interval for $S(t)$.
7- We recommend determining the relationship between the KME and time for all cancer patients, and the relationship between the hazard ratio and time for all cancer patients, using the statistics program R.
8- A clinical oncologist may be interested in comparing the ability of two or more treatments to prolong life or maintain health for two age groups of patients. Almost invariably, the survival times of the different groups vary. Therefore, we recommend using the Log–Rank test for comparison of two survival distributions for all cancer patients.
9- We recommend using the Cox proportional hazards model (CPHM) for analyzing survival data that contain the most important variables as predictors of survival time T, where T denotes the number of days until death or censoring, for all cancer patients in Palestine.