
STATISTICS IN MEDICINE, VOL. 10, 375-382 (1991)

ALTERNATIVE PARAMETERIZATION OF POLYCHOTOMOUS MODELS: THEORY AND APPLICATION TO MATCHED CASE-CONTROL STUDIES

HEIKO BECHER

Institute of Epidemiology and Biometry, German Cancer Research Center, Im Neuenheimer Feld 280, D-6900 Heidelberg, Federal Republic of Germany

SUMMARY

A method is proposed for transforming a class of models having an outcome variable with more than two levels into an equivalent binary model. The polychotomous logistic model is used to demonstrate the method. The equivalency to a simple logistic regression model after some data transformation (augmentation) is shown. The method is applied to the data from two case-control studies each with two control groups, and further applications are indicated.

1. INTRODUCTION

In epidemiology, outcome variables frequently have more than two values. An obvious example is a case-control study with two or more separate control groups. Aspects of this situation have been discussed recently by Rosenbaum.¹ Polychotomous logistic regression is an appropriate method of analysis, and applications of this model are given by Liang and Stewart² and Becher and Jöckel.³ The other extreme, namely comparing several disease groups with one control group, is described by Durbin and Pasternack⁴ and Thomas et al.⁵

The maximum likelihood approach to polychotomous logistic regression in case-control studies is well known.⁶ However, maximization of the conditional likelihood function, which applies to matched studies, is not immediately possible with standard software packages like SAS⁷,⁸ and EGRET.⁹ Although direct programming is possible, it is preferable to make use of existing software and develop some additional computational devices. In the next section we briefly review the model and show that the parameters of the polychotomous logistic model may be estimated by using the conditional likelihood function of the dichotomous model after some data augmentation. This approach facilitates analysis since a wide range of software may then be used. In a further section we illustrate the procedure with case-control studies of stomach cancer¹⁰ and breast cancer.² An application of the method to other study designs, in particular survival analysis with different endpoints, is discussed.

2. THE MODEL

The polychotomous logistic regression model is defined by

$$P(Y = j \mid x) = \frac{\exp(\beta_{j0} + \beta_j' x)}{\sum_{l=1}^{J} \exp(\beta_{l0} + \beta_l' x)}, \qquad j = 1, \dots, J, \tag{1}$$

in which the dependent variable $Y$ indicates disease status, which can take $J$ nominally scaled values; $x = (x_1, \dots, x_K)$ are covariates; and $\beta_{j0}$ and $\beta_j = (\beta_{j1}, \dots, \beta_{jK})$ are regression parameters. To avoid redundancy we set $\beta_{j^*0} = \dots = \beta_{j^*K} = 0$ for one of the values of $j$, denoted by $j^*$.

0277-6715/91/030375-08$05.00 © 1991 by John Wiley & Sons, Ltd.

Received December 1989; Revised June 1990

The model is illustrated most easily by setting $J = 3$ (one case group and two control groups A and B, or vice versa). We call this the 1:1:1 matched design; we let $j = 1$ indicate the case group, and $j = 2$ or 3 indicate the control groups. Here we set $\beta_1 = 0$. Then the odds ratio $\psi$ for the disease in the presence of the dichotomous risk factor $x_k$, $1 \le k \le K$, compared with control group $j$ is given by

$$\psi = \frac{P(Y = 1 \mid x_k = 1)\,/\,P(Y = j \mid x_k = 1)}{P(Y = 1 \mid x_k = 0)\,/\,P(Y = j \mid x_k = 0)} = \exp(-\beta_{jk}), \qquad j = 2, 3. \tag{2}$$

This is easily seen by inserting the appropriate probabilities from (1) into the equation. Similarly, for a continuous risk factor $X_l$ we get $\psi(X_l = x^* \text{ vs. } X_l = x_0) = \exp(-\beta_{jl}(x^* - x_0))$. Prentice and Pyke⁶ showed that applying the model to retrospective data does not affect the parameters of interest, $\beta_2$ and $\beta_3$. We get the conditional likelihood for the $i$th triplet (one case and two controls) as

$$L_i = \frac{\exp(\beta_2' x_{i2} + \beta_3' x_{i3})}{\sum_{(r,s,t)} \exp(\beta_2' x_{is} + \beta_3' x_{it})}, \tag{3}$$

where $x_{i1}$, $x_{i2}$ and $x_{i3}$ denote the covariate vectors of the case, control A and control B, and the sum in the denominator is over the six ways $(r, s, t)$ of distributing the observed covariates to the case and the two controls. The total conditional likelihood is then

$$L = \prod_{i=1}^{n} L_i, \tag{4}$$

where the product is over the $n$ triplets. Maximizing this function with respect to $\beta_2$ and $\beta_3$ gives maximum likelihood estimates of the two regression coefficients.
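As a minimal numerical sketch of the conditional likelihood (3), the Python function below evaluates the contribution of one triplet by enumerating the six assignments of the observed covariate vectors to the case and the two controls. The function name `triplet_likelihood`, the helper `dot` and the example values are illustrative assumptions and do not come from the original paper.

```python
from itertools import permutations
from math import exp

def dot(beta, x):
    """Inner product of a coefficient vector and a covariate vector."""
    return sum(b * v for b, v in zip(beta, x))

def triplet_likelihood(beta2, beta3, x_case, x_ctrl_a, x_ctrl_b):
    """Conditional likelihood contribution of one 1:1:1 triplet (beta_1 = 0).

    The numerator uses the observed assignment; the denominator sums over
    all six ways of assigning the three covariate vectors to the roles
    (case, control A, control B).
    """
    def term(x_a, x_b):
        # The case role contributes exp(0) = 1 because beta_1 = 0.
        return exp(dot(beta2, x_a) + dot(beta3, x_b))

    numerator = term(x_ctrl_a, x_ctrl_b)
    denominator = sum(term(a, b) for _, a, b in
                      permutations([x_case, x_ctrl_a, x_ctrl_b]))
    return numerator / denominator

# Hypothetical triplet with one dichotomous covariate: exposed case,
# two unexposed controls. The coefficient values are made up.
li = triplet_likelihood([-0.9], [-0.7], [1], [0], [0])

# With all coefficients zero, each of the six assignments is equally likely.
null = triplet_likelihood([0.0], [0.0], [1], [0], [0])
print(round(null, 4))  # 0.1667
```

The total conditional likelihood (4) is the product of such contributions over all triplets; in practice one would sum their logarithms.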

3. COMPUTATIONAL METHODS

There are various statistical software packages which allow the analysis of matched case-control studies, for example SAS (PROC MCSTRAT)⁸ or EGRET.⁹ However, these programs do not have a direct facility for fitting a polychotomous logistic regression model. In this section we show that the desired analysis is possible by some data augmentation.

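The likelihood (3) may be written as

$$L_i = \frac{\exp(\delta' z_{i1})}{\sum_{m=1}^{6} \exp(\delta' z_{im})}, \tag{5}$$

where $\delta = (\beta_2, \beta_3)$ and

$$\begin{aligned}
z_{i1} &= (x_{i2}, x_{i3}), & z_{i4} &= (x_{i1}, x_{i2}),\\
z_{i2} &= (x_{i2}, x_{i1}), & z_{i5} &= (x_{i3}, x_{i2}),\\
z_{i3} &= (x_{i1}, x_{i3}), & z_{i6} &= (x_{i3}, x_{i1}).
\end{aligned}$$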

Note that this is the same structure as the conditional likelihood function of the standard logistic regression model. Suppose the data are such that each row of the data matrix represents one observation, containing information on status (case, control A, control B), the number of the matched set (triplet) and the covariates of interest. To obtain a suitable structure for evaluating the likelihood (3), the data set is augmented as shown in Table I. A simple SAS program for generating the required data set is listed in the appendix.


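Table I. Data set modification for the corresponding logistic regression model

Original data structure for first set (triplet)

| SET | Case/control status | Observed covariates | Group |
|---|---|---|---|
| 1 | 1 | $x_{111}, \dots, x_{K11}$ | Case |
| 1 | 2 | $x_{112}, \dots, x_{K12}$ | Control A |
| 1 | 3 | $x_{113}, \dots, x_{K13}$ | Control B |

Augmented data structure for first set (triplet)

| SET | Case/control status | Covariates |
|---|---|---|
| 1 | 1 | $x_{112}, \dots, x_{K12},\; x_{113}, \dots, x_{K13}$ |
| 1 | 0 | $x_{112}, \dots, x_{K12},\; x_{111}, \dots, x_{K11}$ |
| 1 | 0 | $x_{111}, \dots, x_{K11},\; x_{113}, \dots, x_{K13}$ |
| 1 | 0 | $x_{111}, \dots, x_{K11},\; x_{112}, \dots, x_{K12}$ |
| 1 | 0 | $x_{113}, \dots, x_{K13},\; x_{112}, \dots, x_{K12}$ |
| 1 | 0 | $x_{113}, \dots, x_{K13},\; x_{111}, \dots, x_{K11}$ |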

The data set for analysis corresponds directly to the likelihood (5). The number of regression variables and the number of observations have both doubled. The 'case' ($Y = 1$) now has the covariates of the two controls, and there are five 'controls' ($Y = 0$) for each case with covariates as in Table I. The 1:1:1 matched design may therefore be thought of as a 1:5 matched design. The corresponding logistic regression model is

$$\operatorname{logit} P(Y = 1 \mid u, w) = \alpha_i + \beta_2' u + \beta_3' w,$$

where $\alpha_i$ is the intercept of the $i$th matched set, which is eliminated by conditioning, $u$ corresponds to the first half of the covariate vector in the modified data structure and $w$ corresponds to the second half.
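The augmentation step can be sketched in a few lines of Python. The function `augment_triplet` and the example covariate values are hypothetical; the sketch only illustrates how the six rows per triplet arise and is not a translation of the SAS program in the appendix.

```python
def augment_triplet(set_id, x_case, x_ctrl_a, x_ctrl_b):
    """Return six rows (set_id, y, u, w) for one 1:1:1 matched set.

    u is the first half of the new covariate vector (coefficient beta_2),
    w is the second half (coefficient beta_3). The first row carries the
    observed assignment and is the 'case' (y = 1) of the 1:5 design.
    """
    x1, x2, x3 = x_case, x_ctrl_a, x_ctrl_b
    pairs = [
        (x2, x3),  # observed: control A in the first half, control B in the second
        (x2, x1),
        (x1, x3),
        (x1, x2),
        (x3, x2),
        (x3, x1),
    ]
    return [(set_id, 1 if m == 0 else 0, list(u), list(w))
            for m, (u, w) in enumerate(pairs)]

# Hypothetical triplet with K = 2 covariates (an exposure flag and an age).
rows = augment_triplet(1, [1.0, 30.0], [0.0, 42.0], [1.0, 55.0])
for row in rows:
    print(row)
```

Each triplet of three records with $K$ covariates becomes six records with $2K$ covariates, which can be passed to any routine for 1:5 matched conditional logistic regression.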

The procedure may easily be generalized to more than three different groups. Note that the augmented data set can also be used for further analysis if both control groups have similar risk factor distributions. In this case a 1:2 matched design analysis is appropriate, in which either $z_{i3}$ or $z_{i4}$ is defined as the 'case' and the parameter vector $\delta$ has zeros in the second half, $\delta = (\beta_{21}, \dots, \beta_{2K}, 0, \dots, 0)$. To see this, consider the contribution of the $i$th triplet to the likelihood of the 1:2 matched design (original data set):

$$L_i = \frac{\exp(\beta' x_{i1})}{\sum_{m=1}^{3} \exp(\beta' x_{im})}.$$

From the augmented data set we get

$$L_i^{*} = \frac{\exp(\delta' z_{i3})}{\sum_{m=1}^{6} \exp(\delta' z_{im})}.$$

With $\delta$ as defined above, $\delta' z_{i1} = \delta' z_{i2} = \beta_2' x_{i2}$, $\delta' z_{i3} = \delta' z_{i4} = \beta_2' x_{i1}$ and $\delta' z_{i5} = \delta' z_{i6} = \beta_2' x_{i3}$, and therefore

$$L_i^{*} = \frac{\exp(\beta_2' x_{i1})}{2 \sum_{m=1}^{3} \exp(\beta_2' x_{im})} = 0.5\, L_i \quad \text{with } \beta = \beta_2.$$

The estimates of the regression parameters $\beta$ are clearly not affected by this transformation and neither are their standard errors. However, the deviance and the degrees of freedom are not identical in the two cases. Nested models can be compared as usual by the difference between the deviances for both models, which is not affected by the data transformation and therefore has a known asymptotic distribution. This is easily seen, as the likelihood ratio statistic for testing the null hypothesis that additional parameters are zero is minus twice the difference of the maximized log-likelihood functions; the factor 0.5 cancels out. In the next section we illustrate the method with data from a stomach cancer case-control study and with the data published by Liang and Stewart.²

Table II. Risk factor distribution in a case-control study of stomach cancer in Poland

| Dietary variable | Level | Cases | Hospital controls | Population controls |
|---|---|---|---|---|
| Protein consumption | High | 26 | 15 | 26 |
| | Medium | 60 | 70 | 61 |
| | Low | 24 | 25 | 23 |
| VSF* consumption | High | 7 | 12 | 34 |
| | Medium | 42 | 70 | 65 |
| | Low | 61 | 28 | 11 |
| Total | | 110 | 110 | 110 |

* Vegetables, salads and fruits.
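The factor 0.5 relating the augmented data set to the original 1:2 matched design (Section 3) can be checked numerically. The following Python sketch uses made-up covariates and coefficients; the function names are assumptions introduced for illustration only.

```python
from math import exp

def dot(b, x):
    """Inner product of a coefficient vector and a covariate vector."""
    return sum(bi * xi for bi, xi in zip(b, x))

def lik_1_to_2(beta, x_case, x_a, x_b):
    """Contribution of one triplet in the ordinary 1:2 matched analysis."""
    terms = [exp(dot(beta, x)) for x in (x_case, x_a, x_b)]
    return terms[0] / sum(terms)

def lik_augmented(delta, x_case, x_a, x_b):
    """Contribution from the six augmented rows, with the row
    (x_case, x_b) taken as the 'case'."""
    x1, x2, x3 = x_case, x_a, x_b
    # List concatenation builds the doubled covariate vectors.
    z = [x2 + x3, x2 + x1, x1 + x3, x1 + x2, x3 + x2, x3 + x1]
    terms = [exp(dot(delta, zm)) for zm in z]
    return terms[2] / sum(terms)

beta = [0.8, -0.3]            # hypothetical coefficients
delta = beta + [0.0, 0.0]     # zeros in the second half
x_case, x_a, x_b = [1.0, 2.0], [0.0, 1.5], [1.0, 0.5]

ratio = lik_augmented(delta, x_case, x_a, x_b) / lik_1_to_2(beta, x_case, x_a, x_b)
print(round(ratio, 12))  # 0.5
```

Because the ratio does not depend on the coefficients, the two log-likelihoods differ only by an additive constant per triplet, so estimates, standard errors and deviance differences agree.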

4. EXAMPLES

As a first example we consider a Polish case-control study of stomach cancer¹⁰ with two sets of controls (hospital controls and population controls). The cases and the hospital controls came from both urban and non-urban areas, whereas the population controls came from urban areas only. In the original publication, relative risk estimation was based solely on the hospital control group. Results from the population controls were only given descriptively. In this paper a model which is contained in Jedrychowski et al.¹⁰ is reanalysed with a polychotomous logistic regression model. The risk factor distribution is given in Table II.

Two dietary variables were considered. Both are categorical with three levels and refer to the frequencies of consumption of proteins and of vegetables, salads and fruits (VSF) (Table VI in Jedrychowski et al.¹⁰). In the original paper the authors also controlled for smoking (three categories) and residence (urban vs. rural). In this analysis the control for smoking is maintained, whereas residence is ignored because all population controls were urban. This has little impact on the estimated relative risks for the nutritional factors.

Results are given in Table III. In Table III(a) two 1:1 analyses are given (one based on the hospital controls only, the other based on the population controls only). Table III(b) gives the results for the polychotomous logistic regression model, analysed with the same software but with the augmented data set. The estimated odds ratios for cases vs. hospital controls are similar to those obtained before, whereas the odds ratios of cases vs. population controls differ somewhat, with a lower protective effect of protein and higher risk with VSF score. For illustration we also performed a 1:2 matched design analysis, which is easily possible with the augmented data set as outlined in the previous section. The results are given in Table III(c). An equivalent result follows if the original data set is used.

Table III. Case-control study of stomach cancer: analyses for protein and vegetable, salad and fruit consumption controlled for smoking

(a) Cases vs. one control group (1:1 matched analysis)

| Comparison | Variable | Level | Coefficient (SE) | Odds ratio | (95% CI) |
|---|---|---|---|---|---|
| Cases vs. hospital controls | Protein consumption | High | | 1 | |
| | | Medium | -1.04 (0.47) | 0.35 | (0.14, 0.88) |
| | | Low | -1.68 (0.66) | 0.19 | (0.05, 0.68) |
| | VSF consumption | High | | 1 | |
| | | Medium | 0.27 (0.58) | 1.31 | (0.42, 4.10) |
| | | Low | 2.04 (0.67) | 7.76 | (2.06, 29.2) |
| Cases vs. population controls | Protein consumption | High | | 1 | |
| | | Medium | -0.08 (0.47) | 0.92 | (0.37, 2.31) |
| | | Low | -1.17 (0.61) | 0.31 | (0.09, 1.03) |
| | VSF consumption | High | | 1 | |
| | | Medium | 2.26 (0.80) | 9.68 | (2.02, 46.31) |
| | | Low | 5.20 (1.05) | 182.40 | (22.50, 1416) |

(b) Polychotomous logistic regression model (1:1:1 matched analysis)

| Comparison | Variable | Level | $-\beta_{ji}$ (SE) | Odds ratio $\psi = \exp(-\beta_{ji})$ | (95% CI) |
|---|---|---|---|---|---|
| Cases vs. hospital controls | Protein consumption | High | | 1 | |
| | | Medium | -0.90 (0.42) | 0.40 | (0.18, 0.92) |
| | | Low | -1.31 (0.54) | 0.27 | (0.09, 0.78) |
| | VSF consumption | High | | 1 | |
| | | Medium | 0.28 (0.57) | 1.32 | (0.42, 4.00) |
| | | Low | 2.14 (0.68) | 8.33 | (2.27, 33.3) |
| Cases vs. population controls | Protein consumption | High | | 1 | |
| | | Medium | -0.11 (0.42) | 0.89 | (0.39, 2.04) |
| | | Low | -0.83 (0.57) | 0.44 | (0.14, 1.31) |
| | VSF consumption | High | | 1 | |
| | | Medium | 1.65 (0.55) | 5.21 | (1.79, 14.3) |
| | | Low | 4.58 (0.72) | 97.5 | (23.6, 403) |

(c) Cases vs. both control groups (1:2 matched analysis)

| Variable | Level | $\beta_i$ (SE) | Odds ratio | (95% CI) |
|---|---|---|---|---|
| Protein consumption | High | | 1 | |
| | Medium | -0.50 (0.36) | 0.61 | (0.30, 1.22) |
| | Low | -1.05 (0.48) | 0.35 | (0.14, 0.90) |
| VSF consumption | High | | 1 | |
| | Medium | 0.96 (0.48) | 2.62 | (1.02, 6.73) |
| | Low | 3.20 (0.59) | 24.7 | (7.86, 78.0) |

Table IV. Case-control study of breast cancer (from Liang and Stewart²): distribution of complete case-control triplets by age at birth of first child (<25 vs. 25+ years)

Rows give the exposure status* of the case; columns give the exposure status* of the hospital-neighbourhood controls.

| Case | 00 | 01 | 10 | 11 | Total |
|---|---|---|---|---|---|
| 0 | 47 | 17 | 18 | 8 | 90 |
| 1 | 32 | 31 | 21 | 21 | 105 |

* 0 denotes <25 years and 1 denotes 25+ years.

As a second example we consider a case-control study of breast cancer published in Liang and Stewart² in which age at birth of first child was considered as a dichotomous risk factor. These authors provide a direct method to estimate the regression coefficients $\beta_2$ and $\beta_3$ for a single dichotomous risk factor under a 1:1:1 matched design. The data are given in Table IV.

This data set was augmented and analysed as described above. The estimated regression coefficients are $\hat\beta_2 = -0.9026$ and $\hat\beta_3 = -0.6793$, in complete agreement with Liang and Stewart's result. These authors were also interested in testing the hypothesis that the two coefficients are equal. To do so they estimate the variance of the difference as $\operatorname{var}(\hat\beta_2 - \hat\beta_3) = 0.0495$. Using the estimated covariance matrix from the augmented analysis, a value of 0.0499 was found.
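The analysis of the Table IV data can be sketched in plain Python: each triplet type is augmented to six rows and the 1:5 conditional likelihood is maximized by Newton-Raphson. This is an illustrative sketch and not the software used in the paper. It assumes that the first digit of each column label in Table IV refers to the hospital control (coefficient $\beta_2$) and the second to the neighbourhood control (coefficient $\beta_3$); all function names are hypothetical.

```python
from math import exp

# Table IV: (case, hospital control, neighbourhood control) -> number of triplets.
# Assumption: column label "01" means hospital control 0, neighbourhood control 1.
counts = {(0, 0, 0): 47, (0, 0, 1): 17, (0, 1, 0): 18, (0, 1, 1): 8,
          (1, 0, 0): 32, (1, 0, 1): 31, (1, 1, 0): 21, (1, 1, 1): 21}

def augment(x1, x2, x3):
    """Six (u, w) rows of one triplet; the first row is the 'case'."""
    return [(x2, x3), (x2, x1), (x1, x3), (x1, x2), (x3, x2), (x3, x1)]

def score_and_information(b2, b3):
    """Gradient and negative Hessian of the conditional log-likelihood."""
    g = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for (x1, x2, x3), n in counts.items():
        rows = augment(x1, x2, x3)
        wts = [exp(b2 * u + b3 * w) for u, w in rows]
        tot = sum(wts)
        mu = sum(p * u for p, (u, _) in zip(wts, rows)) / tot
        mw = sum(p * w for p, (_, w) in zip(wts, rows)) / tot
        euu = sum(p * u * u for p, (u, _) in zip(wts, rows)) / tot
        eww = sum(p * w * w for p, (_, w) in zip(wts, rows)) / tot
        euw = sum(p * u * w for p, (u, w) in zip(wts, rows)) / tot
        u0, w0 = rows[0]
        g[0] += n * (u0 - mu)              # observed minus expected
        g[1] += n * (w0 - mw)
        info[0][0] += n * (euu - mu * mu)  # within-set (co)variances
        info[1][1] += n * (eww - mw * mw)
        info[0][1] += n * (euw - mu * mw)
    info[1][0] = info[0][1]
    return g, info

def invert2(m):
    """Inverse of a 2 x 2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

b2 = b3 = 0.0
for _ in range(25):                        # Newton-Raphson iterations
    g, info = score_and_information(b2, b3)
    cov = invert2(info)
    b2 += cov[0][0] * g[0] + cov[0][1] * g[1]
    b3 += cov[1][0] * g[0] + cov[1][1] * g[1]

g, info = score_and_information(b2, b3)
cov = invert2(info)                        # estimated covariance matrix
var_diff = cov[0][0] + cov[1][1] - 2 * cov[0][1]
print(round(b2, 4), round(b3, 4), round(var_diff, 4))
```

Under the stated reading of Table IV, the sketch should reproduce the coefficients and the variance of the difference quoted above up to rounding. Triplets in which all three subjects share the same exposure contribute nothing to the score or the information.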

5. DISCUSSION

Conditional logistic regression analysis has proved a useful statistical tool for analysing matched case-control studies. The polychotomous logistic regression model is an extension which is used increasingly. Some applications have been mentioned above; a further example is a recent study of cryptorchism and testicular cancer.¹¹ A special application of the computational technique to adjust for bias in a case-control study with two control groups is given by Becher¹² and Becher and Jöckel.³ Fitting of the conditional polychotomous logistic model may be facilitated by our method. If software is available which allows the conditional analysis of an n:m matched study, it may be used for the present problem by some minor augmentation of the data set.

Other applications of the method are possible. For example, for analysing experimental or epidemiological data for the time to tumour development or recurrence (survival data) a proportional hazards model¹³ is frequently used. The partial likelihood function for estimating the parameters of the proportional hazards model is closely related to the conditional likelihood function of matched case-control studies. In both epidemiological studies and animal experiments, different endpoints may occur. As an example, consider the ED₀₁ Study¹⁴ in which different types of tumour (liver and bladder) were observed. The dependence of tumour risk on dose may be analysed using two separate Cox regression models. However, a simultaneous analysis of both tumour types is possible in principle using the technique described above. A practical drawback is the limitation of computer space and time, as the augmented data set may easily become very large. If, for example, a block contains $n$ animals with one tumour of each kind and $n - 2$ censored observations, the number of observations in the block increases from $n$ to $2\binom{n}{2}$. The method may be applied if there is blocking of the data to adjust for one or more variables. Such blocking can be useful to reduce the number of estimated parameters or because some variables violate the proportional hazards assumption. The augmentation of the data set is then performed for each block separately. The method can also be applied if the outcome variable has more than three levels, for example if there are more than three distinct groups in a matched case-control study. Data augmentation is similar to that given in Table I, and the SAS program can be used after appropriate modification. For example, in a 1:1:1:1 matched design the number of observations in the augmented data set would be blown up by a factor of 6 and the number of regression variables by a factor of 3.
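The growth of the augmented data set can be made concrete with a short Python sketch. It generalizes the triplet augmentation to a 1:1:...:1 design with an arbitrary number of groups and counts the rows in the survival example. The function names are illustrative assumptions, not part of the paper.

```python
from itertools import permutations
from math import comb, factorial

def augmentation_factors(n_groups):
    """Factors by which rows and covariates grow in a 1:1:...:1 design."""
    rows_per_set = factorial(n_groups)     # all assignments of subjects to roles
    row_factor = rows_per_set // n_groups  # relative to n_groups original rows
    covariate_factor = n_groups - 1        # one covariate block per non-reference group
    return row_factor, covariate_factor

def augment_set(x_by_group):
    """All augmented rows (y, covariates) of one matched set.

    x_by_group[0] holds the covariates of the reference group (the case);
    the observed assignment gets y = 1, every other permutation y = 0.
    """
    observed = tuple(range(len(x_by_group)))
    rows = []
    for perm in permutations(observed):
        # Concatenate the covariates of the subjects placed in the
        # non-reference roles, in role order.
        covariates = [v for idx in perm[1:] for v in x_by_group[idx]]
        rows.append((1 if perm == observed else 0, covariates))
    return rows

print(augmentation_factors(3))  # (2, 2)
print(augmentation_factors(4))  # (6, 3)

rows4 = augment_set([[0.1], [0.2], [0.3], [0.4]])
print(len(rows4))               # 24

# Survival example: ordered choices of the two tumour-bearing animals among n.
n = 10
print(len(list(permutations(range(n), 2))), 2 * comb(n, 2))  # 90 90
```

The factorial growth in the number of rows per set is the practical limitation mentioned above.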

APPENDIX

An SAS program for generating the augmented data set for a 1:1:1 matched design is listed. The required data set is obtained by merging data sets appropriately. To identify those observations which are to be merged, the function modulo (mod) is used. This function gives the remainder after dividing the first argument by the second argument. The first argument is the system variable _n_, which provides a consecutive numbering of the observations in an SAS data set. The second argument is the number of observations which form a matched set in the augmented data set.

    data a;
      * input data set (see Table I)
      * variables: sets      - indicator for matched set (triplet)
      *            caco      - indicator for disease status
      *            x1 ... xk - risk factors;

    * duplicate every observation and number the six rows of each set;
    data a; set a a;
    proc sort; by sets caco;
    data a; set a;
      per = mod((_n_ + 4), 6);
      if per = 0 then per = 6;

    * copy the covariates to y1 ... yk and assign a second ordering;
    data b; set a;
      y1 = x1;
      y2 = x2;
      * ... one assignment for each of the k covariates ...;
      yk = xk;
      if mod(_n_, 6) = 1 then per = 2;
      if mod(_n_, 6) = 2 then per = 4;
      if mod(_n_, 6) = 3 then per = 3;
      if mod(_n_, 6) = 4 then per = 5;
      if mod(_n_, 6) = 5 then per = 1;
      if mod(_n_, 6) = 0 then per = 6;
      keep y1 ... yk sets per;

    proc sort data = b; by sets per;
    proc sort data = a; by sets per;

    * Augmented data set;
    data b; merge a b; by sets per;
      case = (per = 1);

    * Regression procedure;
    proc mcstrat data = b setid = sets;
      model case x1 ... xk y1 ... yk;

ACKNOWLEDGEMENT

This work was initiated by a similar algorithm developed in the author's PhD thesis. I thank the supervisor, Dr. Karl-Heinz Jöckel, and the referees for helpful comments.

REFERENCES

1. Rosenbaum, P. R. 'The role of a second control group in an observational study (with discussion)', Statistical Science, 2, 292-316 (1987).
2. Liang, K. Y. and Stewart, W. F. 'Polychotomous logistic regression methods for matched case-control studies with multiple case or control groups', American Journal of Epidemiology, 125, 720-730 (1987).
3. Becher, H. and Jöckel, K.-H. 'Bias adjustment with polychotomous logistic regression in matched case-control studies with two control groups', Biometrical Journal, in press (1990).
4. Durbin, N. and Pasternack, B. S. 'Risk assessment for case-control subgroups by polychotomous logistic regression', American Journal of Epidemiology, 123, 1101-1117 (1986).
5. Thomas, D. C., Goldberg, M., Dewar, R. and Siemiatycki, J. 'Statistical methods for relating several exposure factors to several diseases in case-heterogeneity studies', Statistics in Medicine, 5, 49-60 (1986).
6. Prentice, R. L. and Pyke, R. 'Logistic disease incidence models and case-control studies', Biometrika, 66, 403-411 (1979).
7. SAS Institute Inc. SAS User's Guide: Basics, Version 5, SAS Institute Inc., Cary, NC, 1985.
8. SAS Institute Inc. SAS Supplemental Library User's Guide, Version 5, SAS Institute Inc., Cary, NC, 1986.
9. Statistics and Epidemiology Research Corporation. EGRET User's Manual, SERC, Seattle, WA, 1989.
10. Jedrychowski, W., Wahrendorf, J., Popiela, T. and Rachtan, J. 'A case-control study of dietary factors and stomach cancer risk in Poland', International Journal of Cancer, 37, 837-842 (1986).
11. Strader, C. H., Weiss, N. S., Daling, J. R., Karagas, M. R. and McKnight, B. 'Cryptorchism, orchiopexy, and the risk of testicular cancer', American Journal of Epidemiology, 127, 1013-1018 (1988).
12. Becher, H. 'Polychotomous logistic regression and matching criteria in multicenter case-control studies (in German)', University of Dortmund PhD thesis, 1987.
13. Cox, D. R. 'Regression models and life tables (with discussion)', Journal of the Royal Statistical Society, Series B, 34, 187-220 (1972).
14. Cairns, T. 'The ED₀₁ Study: introduction, objectives and experimental design', Journal of Environmental Pathology and Toxicology, 3, 1-7 (1979).
