Some of David Byar's Work: In Appreciation




Some of David Byar's Work: In Appreciation

Lincoln E. Moses, PhD Biostatistics Division, Department of Health Research and Policy, Stanford University, Palo Alto, California

ABSTRACT: David Byar wrote and thought with great clarity about randomized clinical trials (RCTs), pointing to their strengths, and to the limitations of alternative investigative approaches. An issue that he treated quite briefly in the seven publications that I review here is the possible gains to be had from post hoc covariate adjustment of RCT results. Calculations that I offer suggest that for realistic sample sizes and covariates of modest strength, those gains cannot be large, on average. The price to be paid for them is to replace the clear, cogent RCT paradigm with a more complex and diffuse one; this seems just cause for unease.

INTRODUCTION

It is a privilege to review several of David Byar's papers at this meeting, which honors him and his work. I met him for the first time only about 2 years before his death and was immediately drawn to admire his vigor, intelligence, understanding, and vitality. I had already, much earlier, encountered him in publications that I admired.

David Byar provided a clear view of the strengths, limitations, and potentials of the randomized clinical trial (RCT) and alternative tools of clinical investigation such as historical control trials and data base studies.

I review here seven of his publications which treat methodological aspects of RCTs and observational alternatives. My aims are to give a synopsis of these writings, to acquaint the reader with a few vivid examples to which Byar turned repeatedly, and finally to point to some issues implicit in these papers that in my estimation still await clear resolution, most notably the proper role of adjustment for covariates in analyzing the results from an RCT.

THE THYROID CANCER STORY

Between 1966 and 1977 the Thyroid Cancer Co-operative Group of the European Organization for Research on Treatment of Cancer entered some 1,183

Address reprint requests to Dr. L.E. Moses, Biostatistics Division, Department of Health Research and Policy, Stanford University, Palo Alto, CA 94301.

Received November 28, 1994; accepted December 14, 1994.

Controlled Clinical Trials 16:216-229 (1995) © Elsevier Science Inc. 1995 0197-2456/95/$9.50 655 Avenue of the Americas, New York, NY 10010 SSDI 0197-2456(94)001154


[Figure 1 plot: predicted and observed survival curves over 0-54 months for Group 1 (173), Group 2 (102), Group 3 (96), Group 4 (68), and Group 5 (68)]

Figure 1 Risk groups for patients with thyroid carcinoma based on predicted survival (dashed lines) from a Weibull model with five covariates. Solid lines represent survival experience (actuarial method) for these groups. Numbers of patients are shown in parentheses.

patients from 23 European hospitals into its registry [1]. Of this group there were 507 who had complete information regarding age, sex, histology of the tumor, size and location of the tumor, extent of metastasis, and duration of survival. A regression model based on the Weibull distribution was fitted to the survival data. Using the Weibull regression coefficients, a risk score was constructed for each patient, and patients were sorted into five risk groups on the basis of these scores. (The groups ranged in size from a minimum of 68 to a maximum of 173.) The fitted curves and the observed curves for these five groups appear in Figure 1. Later Byar [2] wrote about this picture:

I find this a most startling picture because it demonstrated the extreme variability of this disease. It would be difficult to imagine separating the curves any further since they are almost touching the axes of the plot already! It should be apparent that any differences in survival attributable to different therapies are likely to be much smaller than those attributed to prognostic factors in this analysis. I have done similar analyses for patients with prostatic cancer and lung cancer, based on data collected from clinical trials of treatment of these two conditions. In both instances the separation of the curves by prognostic factors was markedly greater than that produced by the randomized treatments. This is one of the reasons that I am skeptical about the results of the historical control studies. Admittedly it is possible to adjust to some extent for differences in base-line variables, but unfortunately we can only adjust for those variables we know about, and new ones are continually being discovered.
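The risk-score construction described above can be sketched in a few lines. The covariate values and Weibull-regression coefficients below are hypothetical placeholders, not the EORTC registry's actual data or fitted model; the point is only the mechanics of turning a linear predictor into five risk groups.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariates for 507 patients (age, sex, histology grade,
# tumor size, metastasis indicator); illustrative values only.
n = 507
X = np.column_stack([
    rng.normal(55, 12, n),      # age (years)
    rng.integers(0, 2, n),      # sex
    rng.integers(1, 4, n),      # histology grade
    rng.normal(3.0, 1.0, n),    # tumor size (cm)
    rng.integers(0, 2, n),      # distant metastasis
])

# Hypothetical regression coefficients on the log-hazard scale.
beta = np.array([0.04, 0.30, 0.50, 0.25, 1.10])

# Risk score: the linear predictor from the fitted model.
score = X @ beta

# Sort patients into five risk groups at the quintiles of the score.
cut = np.quantile(score, [0.2, 0.4, 0.6, 0.8])
group = np.digitize(score, cut)   # 0 (lowest risk) .. 4 (highest)

sizes = np.bincount(group, minlength=5)
print(sizes, sizes.sum())         # five groups partitioning all 507 patients
```

In the actual study the groups were formed from the fitted Weibull model rather than from score quintiles alone, which is why their sizes (68 to 173) are unequal.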


[Figure 2 plot: observed survival (%) over 0-54 months, X-ray therapy vs. no X-ray therapy]

Figure 2 Observed survival curves (actuarial method) for those patients with thyroid carcinoma who received X-ray therapy contrasted with those patients who did not receive X-ray therapy.

The data base also contained information about therapies the patients had received, and it was natural to look into differential survival by therapy. This brings us to Figure 2. Concerning this he and Green [3] wrote:

The unadjusted survival curves (Figure 3) actually show that patients who received X-ray therapy had considerably worse survival than patients who did not. There are three possibilities here. The first is that X-ray therapy is actively harmful. We do not believe that this is the explanation. The second alternative is that perhaps some patients were given X-ray therapy in place of some more effective therapy. One might conclude that this idea explains at least part of the observed effect, but we would argue that X-ray therapy was given for a wide variety of reasons, strongly related to inherent differences in prognoses among patients, as well as in widely differing clinical situations (e.g., differing times during the course of disease), and that these factors explain the observed survival. Thus the Figure is not informative about the effectiveness of radiation therapy.

Originally he had expected to be able to disentangle the effects of therapy from the effects of predisposing factors, but in 1991 he turned once again to these data [4] and described these efforts. The unadjusted single degree of freedom chi-square for X-ray vs. no X-ray was 73, suggesting

that X-ray therapy was killing people off at a very rapid rate. Now, you probably do not believe that, and certainly the radiotherapists did not believe it! It shook their faith in statistics. I tried for about three months to make


Figure 3 [schematic: good-prognosis vs. poor-prognosis patients under old and new staging] Possible effect of a new diagnostic method on staging which may bias treatment comparisons relying on historical controls. See text for explanation.

some sense of this observation. I was able to adjust that one degree of freedom chi-square of 73 down to 17-20, depending upon how many variables I used in the adjustment, but still I could not make it non-significant. So, I did not know what to conclude and I persuaded the clinicians that we should not make treatment comparisons and publish them.

I thought that was the end of the matter, and I even described this analysis in a Biometrics article [5]. However, this year I received a manuscript to review concerning another thyroid cancer registry. The authors had data on all cases treated in one hospital over the last 25 years. In an earlier article they used our prognostic index and showed that it worked well for their data. Now they had found that X-ray therapy was doing active harm to their patients. How unlucky for them that it came to me for review! I signed my review, sent them a copy of the Biometrics paper, and rejected the article. That anecdote summarizes my point of view, that the comparison of treatments is rarely justified using information in databases except as frankly exploratory data analysis.

CRITIQUE OF HISTORICAL CONTROL STUDIES

The allure of historical control studies lies in their (apparent) economy of time and resources; they seem to promise (i) to already possess the information about suitable controls for comparison with results of the new method and (ii) the prospect of more rapidly acquiring relevant information about the new treatment since all eligible patients can be given the new treatment. This seductive line of thinking stirred up much controversy, largely in the 1970s and 1980s. Byar was in the thick of it, bringing his clear insight and writing to the debate. Table 1 compactly sets out the difficulties he finds with the method. He discusses each at some length [2]. I must be brief here. The thyroid cancer study shows how the relevant explanatory data may simply not be in the record. It also speaks of the incidence of missing data; of 1,183 patients in the registry only 507 had complete data on the half dozen variables used in the analysis. Analysis of an RCT can eschew all mathematical modeling: randomize and count!


Table 1 Difficulties with Historical Controls

1. Absence of needed information for adjustment
2. Missing data
3. Reliance on mathematical assumptions
4. Possible time trends in:
   a. Patient population
   b. Diagnostic methods
   c. Details of treatment
   d. Supportive care
5. Effects of unmeasured or unknown prognostic factors
6. Failure to convince others of the results

Trends in control populations can be ruinous and difficult, or impossible, to identify. Byar and others [6] give a clear historical example:

An example of the bias problem that can occur, even in a sequence of studies, is provided by the Veterans Administration Cooperative Urological Research Group study on treatments of cancer of the prostate. This large randomized clinical trial admitted 2313 patients over a period of seven years. A major aspect of the study was to compare survival in placebo-treated and estrogen-treated patients. For patients admitted in the last 2.5 years of the trial no significant difference in the probability of death due to cancer was detected between patients treated with estrogen and those given the placebo. However, placebo-treated patients admitted in the first 2.5 years of the study had significantly shorter survivals than the estrogen-treated patients admitted in the last 2.5 years (P < 0.01). Over the course of this study there had been no changes in admission criteria or the measure of response (survival). If these placebo-treated patients from the earlier part of the study had been the only source of comparison (i.e., recent historical controls), an incorrect inference would have been drawn that estrogen treatment was superior to placebo. Closer review of the data showed that patients admitted in the early part of the study were older and had less favorable performance status on admission. Adjustment for these two variables by the Mantel-Haenszel technic removed the apparent difference in survival, illustrating that the difference was due not to treatment but to the non-comparability of the patients receiving the two treatments. In this instance it was fortunate that data were available on variables that could explain the spurious significance. It is quite possible, however, that more subtle selection mechanisms could have been at work, producing biases that could not be removed by adjustment technic, since the nature of the biases could not have been identified.
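The Mantel-Haenszel adjustment in the passage can be illustrated with invented counts echoing the VA study's pattern (the sicker patients concentrated in one arm). Within each performance-status stratum below the two treatments have identical death odds, yet the crude pooled comparison suggests a large treatment difference; the Mantel-Haenszel odds ratio recovers the within-stratum truth. All numbers are hypothetical.

```python
# Each stratum: (treated deaths, treated survivors, control deaths, control survivors)
strata = [
    (40, 160, 10, 40),    # favorable performance status
    (30, 20, 120, 80),    # unfavorable performance status
]

# Crude odds ratio, pooling over strata (ignores the confounder).
a = sum(s[0] for s in strata); b = sum(s[1] for s in strata)
c = sum(s[2] for s in strata); d = sum(s[3] for s in strata)
crude_or = (a * d) / (b * c)

# Mantel-Haenszel adjusted odds ratio: sum(a_i * d_i / n_i) / sum(b_i * c_i / n_i)
num = sum(td * cs / (td + ts + cd + cs) for td, ts, cd, cs in strata)
den = sum(ts * cd / (td + ts + cd + cs) for td, ts, cd, cs in strata)
mh_or = num / den

print(round(crude_or, 3), round(mh_or, 3))  # crude suggests benefit; adjusted OR = 1
```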

Changes in diagnostic methods can distort the fairness of comparison between early and late cases. Byar offered Figure 3 by way of illustration [2]; new patients placed in (new) stage 1 have better average prognoses than old patients placed in (old) stage 1. Then, simply because of the change in staging, more recent patients will have better outcomes on average than old patients with the same stage-label. Hence, such a diagnostic shift will masquerade as a treatment effect. Quite parallel considerations attach to improvements in details of treatment or supportive care.
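The staging artifact can be shown with a toy calculation (all survival figures invented): reassigning the worst "stage 1" patients to stage 2, as a more sensitive diagnostic method would, improves the average outcome of both stages while changing nothing overall.

```python
# Mean survivals (months) for illustrative patients under the old staging.
stage1_old = [60, 55, 50, 20, 18]   # last two harbor occult metastases
stage2_old = [15, 12, 10]

mean = lambda xs: sum(xs) / len(xs)

# A new imaging method detects the occult metastases and reassigns
# the two worst "stage 1" patients to stage 2.
stage1_new = stage1_old[:3]
stage2_new = stage1_old[3:] + stage2_old

print(mean(stage1_old), mean(stage2_old))   # old staging: 40.6 and 12.3
print(mean(stage1_new), mean(stage2_new))   # new staging: 55.0 and 15.0
# Overall survival is unchanged, yet both stage-specific means improved:
# a diagnostic shift masquerading as a treatment-era improvement.
```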

Unmeasured or unknown prognostic factors can be important influences on outcome; at their worst, they can be unrecorded, highly prognostic, and the actual determinant of what treatment is assigned. In such instances treatment "outcome"


can stand (silently) more as a confirmation of (tacit) prognosis than as an indication of treatment effectiveness. Green and Byar [3] summarize in this way:

There are many reasons why a treatment is chosen in clinical practice, based on the physician's evaluation of prognosis. Sometimes the reasons are obvious, but other times there may be subtle clues which cause the physician to decide whether a patient should be treated in a certain manner. The fact that treatment assignments are tied up very closely with prognosis, thus producing an inherent bias, is the major problem with using observational data to compare treatments.

Byar, writing alone [4], adds:

The one piece of information that you do not have, even in the best databases, is why did some patients get surgery and why did some get medical treatment, or why did some get beta blockers and some get something else? That is the one thing that you want to know if you wanted an adjustment variable. I have yet to see an analysis presented today (or in my whole life for that matter) that had really good information on why some patients got one treatment and others got another.

Finally, expressing a view found in much of his writing, he faults historical control studies for being unconvincing. (It has been my own view that the regrettable decades-long dispute over the proper role of radical mastectomy in treatment of primary breast cancer was testimony to this incapacity of historical control studies to be convincing.)

Hard as his critique of historical controls may be, he has even stronger words for data base studies of treatment comparisons [4]:

Why should you not do treatment comparisons using databases? In the 1970's I and others argued in several articles that studies with historical controls could often be seriously misleading. Then with the rapid development of computers, clinical databases came along. At first I thought that the problems with using databases to make treatment comparisons were the same as those in historically controlled studies. After studying the problem, I concluded that the problems with databases were even worse, and others agreed. Then the "humongous" databases came along and I just cannot imagine what is going to come along next. When I wrote those papers, I was thinking about using databases like the one at Duke to choose treatment for individual patients or for groups of patients. This approach completely ignores the subset problem and can result in seriously misleading conclusions for reasons that have been mentioned many times today.

ADVANTAGES OF RANDOMIZED TRIALS

Table 2 comes from the same article [6] as Table 1 and is, I believe, nearly self-explanatory. In amplification of the fourth point about mathematical models I cite from the article:

Although randomization does not ensure that the treatment groups are 'comparable' for all factors that may affect prognosis, the validity of significance levels based on randomization does not require this unachievable assumption. As Fisher points out, randomization 'relieves the experimenter from the anxiety of considering and estimating the magnitude of the innumerable causes by which his data may be disturbed.' This is not to say that important prognostic


Table 2 Principal Advantages of Randomized Trials

1. Bias (conscious or unconscious) is avoided
2. Time trends are no problem, since they affect all treatment groups in the same way
3. Missing data are less likely, since all patients are following the same protocol
4. Mathematical models are not needed in the analysis
5. Results are more likely to be convincing
6. Randomized trials may be more ethical, since fewer patients need to be treated to get a convincing answer

factors may be ignored. A more sensitive experiment can often be obtained by randomization within strata defined by such prognostic factors. If a variable has a pronounced effect on prognosis but has not been used in stratification, it is advisable to analyze the effects of this variable separately or to apply some form of adjustment technic in the analysis.

HIS THINKING ABOUT MODIFICATIONS OF RCTS

Writing with others [6] Byar gave thoughtful consideration to the possible scope for applying "adaptive" designs to RCTs comparing therapies. The underlying idea is that outcome information obtained during the trial might advantageously be used to assign more patients to the superior treatment in the trial before definitive clarity on the relative merits has been reached. If feasible, that would have the desirable result of exposing fewer subjects in the trial to the inferior treatment. "Play the winner" and "two-armed bandit" strategies belong to the class of adaptive procedures. Byar's critique is thoughtful and compelling: (i) Adaptive rules are not practical unless outcomes are known very promptly; this consideration rules out a large fraction of all trials. (ii) Often, the outcome is in fact composed of several outcomes (in rehabilitation therapy there might be cognitive, social, and physical end points); then how to define the better of two outcomes will pose difficulties in deciding which treatment is "winning." (iii) An adaptive design may result in the majority of A-patients being "early" and a majority of B-patients being "late." This would induce a vulnerability (in treatment comparison) to time trends in patient populations or other factors, a vulnerability ruled out by the contemporaneous treatment assignment of the ordinary RCT. (iv) The imbalance in numbers assigned to treatments A and B may lead to a larger total sample size, meaning a longer clinical trial; this would imply that, although patients in the trial may have benefitted from less liability to the inferior treatment, the delay in arriving at a definitive conclusion may have exposed many more subjects (outside the trial) to the inferior treatment while a clear verdict was being awaited.
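A minimal simulation of the "play the winner" rule, under assumed success probabilities of 0.7 and 0.4 and outcomes that are observable immediately (exactly the precondition in point (i)), shows the allocation imbalance discussed in point (iv):

```python
import random

def play_the_winner(p, n_patients, seed=1):
    """Simple play-the-winner rule: repeat the last treatment after a
    success, switch arms after a failure.  p gives the (assumed) true
    success probabilities of the two arms."""
    rng = random.Random(seed)
    arm = rng.randrange(2)          # first assignment at random
    counts = [0, 0]
    for _ in range(n_patients):
        counts[arm] += 1
        success = rng.random() < p[arm]
        if not success:
            arm = 1 - arm           # switch on failure
    return counts

counts = play_the_winner(p=[0.7, 0.4], n_patients=1000)
print(counts)  # roughly two thirds of patients end up on the better arm
```

The long-run fraction assigned to the better arm is q2/(q1 + q2), where q is a failure probability; here 0.6/0.9, about two thirds, so the arms finish with markedly unequal sample sizes.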

Byar was the leader of a group of 22 authors [7] who published a strong call for rethinking aspects of the classical RCT paradigm. The impetus for the effort arose in the context of AIDS. Although a too brief account cannot do justice, I offer a sketch of some of the ideas. Much of the message can be captured under the rubric of large simple trials, key elements of which the authors present. (i) Eligibility for enrollment should be broadened greatly: "If treatments found to be effective in clinical trials are likely to be offered to most patients, most patients should be eligible to enter the trials, and the exclusionary criteria should be systematically relaxed in order to reach this goal." (ii) Randomization of a patient should be regarded as appropriate if the uncertainty principle is satisfied, that is, if "the


Table 3 Conditions Justifying Uncontrolled Trial (All Must Apply)

1. No suitable alternative treatment
2. Experience shows uniformly poor prognosis
3. Substantial side effects not expected
4. Expected benefit large, convincingly so
5. Rationale strong enough so that successful outcome will be widely accepted

choice of a recommended treatment remains substantially uncertain" for that patient. If the patient or the physician is reasonably certain that one of the treatments is contraindicated, then the patient should not be randomized. (Otherwise randomization is indicated.) (iii) Trials should be simplified by greatly reducing the amount of information recorded at enrollment and during follow-up. This should widen the opportunities for recruiting cooperating physicians and should increase patient enrollment, thus resolving therapeutic problems more promptly.

This same paper [7] points to the potential for a more rapid increase of clinical knowledge by introducing classical factorial designs.

Possibly the most original contribution in this article [7] is a consideration of when an uncontrolled clinical trial of the efficacy of a new drug may be justified. Five conditions are set; they must all be satisfied. The conditions appear in Table 3. A first reading can lead to the assessment that an uncontrolled trial is justified only if the answer is already known and no trial is needed. That is not, however, an accurate reading. The first two conditions require knowledge; the last three embody expectations. After the trial is completed those expectations are supplanted by experience, which may be favorable or not. The status of the treatment is then on a different footing from the pretrial situation.

ROLE OF STATISTICAL ADJUSTMENT IN RCT

I was struck by the clarity with which Byar presented the theoretical basis for the use of randomization in comparing treatments. I found three such statements in the set of seven papers reviewed here. The first has already appeared in this article under "Advantages of Randomized Trials." Like that passage, the other two share the same pattern: "first, here is why randomization suffices to allow valid inferences, and second, you may be wise not to rely on that, but rather supplement the study with statistical adjustments."

I interpreted this pattern as indicating ambivalence on his part about the proper role for adjustment. I may be wrong in that interpretation, and I surely regret the lack of opportunity to talk about the issue with him. Whether or not he was ambivalent on the matter, I believe our profession is.

The three passages follow:

Although randomization does not ensure that the treatment groups are 'comparable' for all factors that may affect prognosis, the validity of significance levels based on randomization does not require this unachievable assumption. As Fisher points out, randomization 'relieves the experimenter from the anxiety of considering and estimating the magnitude of the innumerable causes by which his data may be disturbed.' This is not to say that important prognostic factors must be ignored. A more sensitive experiment can often be obtained by randomization within strata defined by such prognostic factors. If a variable


has a pronounced effect on prognosis but has not been used in stratification, it is advisable to analyze the effects of this variable separately or to apply some form of adjustment technic in the analysis [6].

Although the groups compared are never perfectly balanced for important covariates in any single experiment, the process of randomization makes it possible to ascribe a probability distribution to the difference in outcome between treatment groups receiving equally effective treatments and thus to assign 'significance levels' to observed differences. It is thus the process of randomization that generates the significance tests, and this process is independent of prognostic factors, known or unknown. Nevertheless, in interpreting the results of a significance test, one should not neglect the additional information provided by covariates [2].

Perhaps the greatest advantage of randomization is that it removes both conscious and unconscious bias from treatment assignment. This does not mean that the characteristics of patients in two treatment groups in a randomized trial will be identical. What it does mean is that a comparison of these two groups will be valid despite any imbalances, since the p-value obtained by applying a statistical test of significance takes into account the fact that such imbalances may occur. In fact, the p-value is a measure of just how probable a given difference in outcome would be in the absence of any real treatment differences. This does not mean, of course, that it may not be desirable to adjust for any differences in prognostic factors that may be observed, or to perform stratified analyses in order to increase the precision of treatment comparison [2].

The dualism here, (1) randomization suffices, (2) but you may want to (need to?) adjust for covariates, is somewhat of a sore point with me. Not long before undertaking this review I had occasion as a (prospective) expert witness to behold a courtroom critique of a sound RCT; the critique consisted mainly of exhibiting those covariate adjustments that could be found to push the P value for treatment efficacy across the magic 0.05 boundary (in this case into nonsignificance). I was made very uncomfortable by all this. I felt that an extraordinarily clear and effective scientific paradigm, randomize and then count the results, had been lost sight of, in exchange for more sophisticated methods too easily distorted. Indeed, a sound scientific methodology seemed to be transmuted into some kind of open-ended ball game.
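The "randomize and then count the results" paradigm is exactly a randomization (permutation) test: the P value is simply the fraction of the possible random assignments that yield a treatment difference at least as large as the one observed. A sketch with invented outcomes from a tiny four-per-arm trial:

```python
import itertools

# Hypothetical outcomes from a tiny two-arm trial (four patients per arm).
treated = [8.1, 7.4, 9.0, 6.8]
control = [6.2, 5.9, 7.1, 6.5]
observed = sum(treated) / 4 - sum(control) / 4   # observed mean difference

# Enumerate all C(8,4) = 70 ways randomization could have split the
# eight patients, and count how often the difference is at least as large.
pooled = treated + control
count = total = 0
for idx in itertools.combinations(range(8), 4):
    t = [pooled[i] for i in idx]
    c = [pooled[i] for i in range(8) if i not in idx]
    diff = sum(t) / 4 - sum(c) / 4
    total += 1
    if diff >= observed:
        count += 1

print(count, total, round(count / total, 4))   # one-sided randomization P value
```

No model, no covariates: the inference rests on the physical act of randomization alone.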

Two goals motivate the "improvements" on the straightforward randomized design: (1) increased precision and (2) rectification of the influence of covariate imbalances between the randomly constituted treatment groups.

I distinguish three flavors of modification:

(1) In the original design, set up certain strata and apply random allocation within those strata.

(2) Adjust, post-hoc, for prespecified covariates. (This includes of course poststratification on prestated covariates.)

(3) After the data are in, identify the covariates that have turned out to be "important" and adjust for them ("you must play the hand you are dealt").

I have little objection to the first strategy. In a multisite clinical trial random assignment site-by-site seems wise, for example. Random allocation by stage may also be indicated, etc. The drawbacks are largely mechanical; there are additional

Page 10: Some of David Byar's Work: In appreciation

Byar's Work 225

possibilities for mix-up. Further, proliferation of strata can lead to effective loss of data where some observations sit in strata with only one treatment represented.
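A minimal sketch of the first strategy, permuted-block randomization within strata (here, hypothetical sites): each block of four patients at a site receives exactly two of each treatment, so site-level balance is built into the design rather than repaired afterward.

```python
import random

def stratified_assignment(strata_sizes, seed=7):
    """Permuted-block randomization within each stratum (e.g. each site):
    every consecutive block of 4 patients contains two A's and two B's."""
    rng = random.Random(seed)
    plan = {}
    for stratum, n in strata_sizes.items():
        seq = []
        while len(seq) < n:
            block = ["A", "A", "B", "B"]
            rng.shuffle(block)       # random order within the block
            seq.extend(block)
        plan[stratum] = seq[:n]      # a final partial block may be truncated
    return plan

plan = stratified_assignment({"site 1": 10, "site 2": 8})
for site, seq in plan.items():
    print(site, seq.count("A"), seq.count("B"))   # near-equal arms per site
```

With block size 4, the arms within any stratum can differ by at most 2, and only because of a truncated final block.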

The second strategy, adjust post-hoc for prespecified covariates, sounds attractively definite, but closer inspection excites my unease. Even if a specific set of covariates are named, say age, stage, years with disease, laboratory test 1, and laboratory test 2, there remains much to specify: is age to be linear, quadratic, logarithmic, or categorical? How about years with disease? On what scales are the laboratory tests performed? What two-factor interactions are to be included in the model for adjustment? If too many questions of this kind are open, then the meaning of "prespecified" fades and my unease grows.

The third strategy is attended by strong disadvantages in my opinion. First, with large numbers (say 100) of covariates being typical, making choices among them is unavoidable. Clear landmarks guiding such choices are hard to come by. Problems of multiplicity blur the meaning of P values; multicollinearity obscures the meanings ascribable to particular named variables; the opportunity for wish-guided analysis is present; suspicion about such analyses is hard to avoid.

Several years ago, if memory is faithful, Paul Canner suggested in San Francisco that an escape from difficulties like these would be to adjust for all covariates in one large regression, thus bypassing opportunities for abuse, or the appearance of it. That proposal seems to have disadvantages. First, it is not well defined for reasons already mentioned: what scales for age, what interactions, etc.? Second, if this obstacle were cleared, taking all-variables adjustment as the norm would either eliminate from analysis small trials or discredit them, since they could not meet the norm of all-variables adjustment. Additionally, in large studies it must be wasteful of data because of estimating too many coefficients in the presence of inevitable collinearity.
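The cost of all-variables adjustment can be probed with a small simulated example (all data and dimensions invented): adjusting a 40-patient trial for 20 mutually correlated covariates that are unrelated to outcome inflates the sampling variance of the treatment-effect estimate, on top of burning half the degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40                                # a small trial: 20 patients per arm
t = np.repeat([0.0, 1.0], n // 2)     # treatment indicator
y = 0.5 * t + rng.normal(size=n)      # true effect 0.5, covariates irrelevant

# Twenty mutually correlated "baseline" covariates, unrelated to outcome.
base = rng.normal(size=n)
Z = base[:, None] + 0.3 * rng.normal(size=(n, 20))

def treatment_variance(X):
    """Sampling variance of the treatment coefficient in OLS, up to sigma^2:
    the (treatment, treatment) entry of (X'X)^{-1}."""
    return np.linalg.inv(X.T @ X)[1, 1]

X_plain = np.column_stack([np.ones(n), t])       # randomize and count
X_full  = np.column_stack([np.ones(n), t, Z])    # adjust for everything

print(treatment_variance(X_plain), treatment_variance(X_full))
# The all-covariates fit has the larger variance for the treatment effect.
```

The unadjusted variance here is exactly 4/n (times sigma^2); adding regressors can only leave that entry of (X'X)^{-1} the same or larger, and chance in-sample correlation with the treatment indicator makes it strictly larger.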

To this point I have aired my reservations about post-hoc adjustment, but possibly the returns to them in the form of increased statistical efficiency of estimation and power of significance tests are great.

I have taken a peek at the effect of adjustment from an estimation point of view. I was prompted by a statement of Byar's [3] where he was talking about uncontrolled trials; he said that a confounder could exert "a crucial effect" on an estimate of treatment effect without itself differing significantly between groups. So I posed this question: "In a randomized study with n treatments and n controls, with outcome y and covariate x (both may be taken to have unit standard deviation), where ρxy = ρ, and where the true treatment difference is Δ, what is the probability that, with x̄1 − x̄2 not significant at 0.05, we will find a 'crucial' distortion in the estimate of Δ?" (I indexed "crucial" by a parameter λ.) The desired probability is a function of ρ, λ, and θ, where θ = Δ√(n/2) is the familiar standardized distance entering into the power calculation. The analysis is presented in the Appendix.

The model I have chosen for study is linear, and I do not know how the results may relate to logistic or Cox model analyses. Still, linear regression covers a lot of ground.
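The posed question can also be checked by direct simulation. The sketch below is my own construction, not part of the paper: because the data are normal, the group-mean differences may be drawn directly from their exact distributions, as in decomposition (1) of the Appendix, and the conditional probability of a "crucial" distortion estimated by counting.

```python
import numpy as np

def distortion_prob(theta, lam, rho, n=50, reps=1_000_000, seed=0):
    """Monte Carlo estimate of the probability that the covariate's
    contribution rho*(xbar2 - xbar1) to the estimated treatment effect
    is at least the fraction lam of that estimate, given that the
    covariate does not differ significantly (0.05) between groups."""
    rng = np.random.default_rng(seed)
    delta = theta * np.sqrt(2.0 / n)            # theta = delta * sqrt(n/2)
    dx = rng.normal(0.0, np.sqrt(2.0 / n), reps)                 # xbar2 - xbar1
    de = rng.normal(0.0, np.sqrt(2.0 * (1 - rho**2) / n), reps)  # ebar2 - ebar1
    dhat = delta + rho * dx + de                # estimated treatment effect
    not_sig = np.abs(dx) / np.sqrt(2.0 / n) < 1.96               # |z1| < 1.96
    crucial = np.abs(rho * dx) >= lam * dhat    # both branches of the event
    return (crucial & not_sig).mean() / not_sig.mean()
```

For θ = 3.0, λ = 0.2, ρ = 0.4 the estimate lands near the corresponding Table 4 entry (0.1365); raising θ drives the probability toward zero, as the table shows.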

Table 4 shows the results in terms of these three parameters. First, note that ρ, the correlation coefficient, would be considered large by many observers at a value of 0.4. What we see is that in an experiment designed to have reasonable power (θ = 3.0), a crucial distortion (30% or more) in the estimate of Δ is unlikely if ρ is 0.4 or less. For studies with greater power, i.e., higher values of θ, the probability of distortion by even 20% becomes quite small, except for great correlation between covariate and outcome.

Table 4  Probability^a of a "Crucial" Distortion in Estimated Treatment Effect Being Induced by a Confounder Variable that Itself Does Not Differ Significantly (0.05) between Treatment and Control

                         ρ
 θ      λ      0.2      0.4      0.6      0.8
3.0    0.2    0.0223   0.1365   0.2904   0.4168
       0.3    0.0089   0.0505   0.1330   0.2220
       0.4    0.0055   0.0247   0.0729   0.1323
       0.5    0.0040   0.0147   0.0444   0.0925
4.0    0.2    0.0021   0.0442   0.1552   0.2776
       0.3    0.0005   0.0088   0.0501   0.1131
       0.4    0.0002+  0.0027   0.0184   0.0593
       0.5    0.0002-  0.0012   0.0074   0.0302
5.0    0.2    0.0001   0.0097   0.0730   0.1716
       0.3    0.0000   0.0008   0.0131   0.0555
       0.4    0.0000   0.0001   0.0022   0.0175
       0.5    0.0000   0.0000   0.0005   0.0040
6.0    0.2    0.0000   0.0011   0.0294   0.0969
       0.3    0.0000   0.0000   0.0016   0.0204
       0.4    0.0000   0.0000   0.0001   0.0018
       0.5    0.0000   0.0000   0.0000   0.0001

^a Where a probability exceeds .05, it is underscored; if it exceeds .10, it is doubly underscored.

Perhaps the question just treated should be replaced; perhaps attention should go to the probability of a "crucial" distortion in the estimated treatment effect whether or not there is significant imbalance in the covariate between the two groups. Table 5 shows these probabilities. Now the probabilities are all somewhat larger, but it remains true that in a study of good size (θ ≥ 4) a crucial distortion of 0.3 or more is an unlikely event unless there is a very strongly predictive covariate at hand. Such a covariate can ordinarily be recognized in advance, in which case adjustment or stratification should be prespecified in preference to invoking it post hoc.

The picture I have sketched here, showing only a small probability of important benefits from adjustment in a randomized study of good size, is probably over-favorable to adjustment, because it deals only in parameters and does not allow for statistical error in adjustment. Beach and Meier [8] present strong evidence of the small gains to be sought by covariate adjustment in some actual randomized studies. They supplement their analysis with simulations exploring simultaneously sample size, number of available covariates, multiple correlation with outcome, number of covariates to be selected from the stock, and method of selection: by (i) strength of correlation, (ii) intergroup difference, and (iii) the product of these. They find that unless R² or N is large, selection on any of the three bases "actually increased the mean squared error, so that the unadjusted estimator would be preferable." Thus I find that I have put forth two propositions:

Table 5  Probability^a of a "Crucial" Distortion in Estimated Treatment Effect Being Induced by a Covariate in a Randomized Study

                        ρ
 θ      λ      0.2     0.4     0.6     0.8
3.0    0.2    0.035   0.173   0.326   0.444
       0.3    0.014   0.074   0.166   0.260
       0.4    0.008   0.039   0.095   0.160
       0.5    0.006   0.024   0.065   0.114
4.0    0.2    0.006   0.075   0.196   0.313
       0.3    0.001   0.022   0.078   0.149
       0.4    0.000   0.008   0.038   0.082
       0.5    0.000   0.004   0.021   0.053
5.0    0.2    0.001   0.029   0.112   0.211
       0.3    0.000   0.005   0.034   0.083
       0.4    0.000   0.001   0.013   0.040
       0.5    0.000   0.000   0.006   0.022
6.0    0.2    0.000   0.011   0.061   0.141
       0.3    0.000   0.001   0.014   0.045
       0.4    0.000   0.000   0.004   0.018
       0.5    0.000   0.000   0.001   0.008

^a Where a probability exceeds .05, it is underscored; if it exceeds .10, it is doubly underscored.

(1) Adjustment for covariates other than by random allocation within strata, or by a similarly explicit prespecified analysis (say, covariance on x₁), carries costs that in my judgment weigh heavily.

(2) The offsetting gains in precision are modest, and in many circumstances negative on average.
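Proposition (2) can be illustrated with a small simulation in the spirit of the Beach and Meier experiment. Everything concrete below (group size, ten covariates, R² = 0.05, selection of the single covariate most correlated with outcome) is my own assumption for illustration, not their design:

```python
import numpy as np

def selection_mse(n=20, K=10, r2=0.05, delta=0.5, reps=4000, seed=1):
    """Mean squared error of the treatment-effect estimate, unadjusted
    versus adjusted for the one covariate most correlated with outcome.
    Covariates are i.i.d. N(0,1) with equal coefficients chosen so the
    multiple R^2 with the outcome equals r2."""
    rng = np.random.default_rng(seed)
    beta = np.sqrt(r2 / K)
    t = np.r_[np.zeros(n), np.ones(n)]          # treatment indicator
    se_u = se_a = 0.0
    for _ in range(reps):
        X = rng.standard_normal((2 * n, K))
        y = X @ np.full(K, beta) + np.sqrt(1 - r2) * rng.standard_normal(2 * n)
        y += delta * t
        d_u = y[n:].mean() - y[:n].mean()       # unadjusted estimate
        yc = y - y.mean()
        corr = np.abs(X.T @ yc) / np.sqrt((X**2).sum(axis=0))  # ~ |corr(x_k, y)|
        j = int(np.argmax(corr))                # post hoc selection
        Z = np.column_stack([np.ones(2 * n), t, X[:, j]])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        d_a = coef[1]                           # covariance-adjusted estimate
        se_u += (d_u - delta) ** 2
        se_a += (d_a - delta) ** 2
    return se_u / reps, se_a / reps
```

With weak covariates and a small trial, the adjusted estimator typically fails to beat the unadjusted one, echoing Beach and Meier's finding that selection can increase mean squared error unless R² or N is large.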

Our profession would do well to consider whether the clarity of the Fisherian model should be maintained to the exclusion of post-hoc adjustment, willingly accepting the possible, apparently modest cost of (sometimes) needing to use larger sample sizes.

A further question beckons: Can a tight, flexible, fair, and transparent paradigm for post-hoc adjustment be developed?

In taking up such questions, David Byar's absence will be felt.

APPENDIX

The basis for calculating Tables 4 and 5 is given here. Let y be the outcome (σ_y = 1), let x be the covariate (σ_x = 1), and let

$$ y_i = \mu + \rho x_i + e_i, \qquad i = 1, \ldots, 2n, $$

with e independent of x and having mean zero. It is a consequence that var(e) = 1 − ρ².

Now let a random subset of n of the individuals (called group 2 from now on) have Δ (unknown, the treatment effect) added to their y values.


$$ \hat{\Delta} = \bar{y}_2 - \bar{y}_1 = \Delta + \rho(\bar{x}_2 - \bar{x}_1) + \bar{e}_2 - \bar{e}_1 \tag{1} $$

in an obvious notation. If n is large enough for the central limit theorem to be trusted, we may write (1) as:

$$ \hat{\Delta} = \Delta + \rho z_1 \sqrt{2/n} + z_2 \sqrt{2(1-\rho^2)/n} \tag{2} $$

where z₁ and z₂ are independent unit normal deviates. The contribution to Δ̂ arising from imbalance in the covariate is the term involving ρ. We seek to find the probability that this "distortion" is a "crucial" fraction of the estimated treatment effect Δ̂:

$$ \bigl|\rho z_1 \sqrt{2/n}\bigr| \geq \lambda \hat{\Delta} \tag{3} $$

This has two branches, according as ρz₁√(2/n) and Δ̂ are of the same or opposite sign. First, we take them both positive. Our event is:

$$ \rho z_1 \sqrt{2/n} \geq \lambda \hat{\Delta}, \quad \text{i.e.,} $$
$$ \rho z_1 \sqrt{2/n} \geq \lambda\Delta + \lambda\rho z_1 \sqrt{2/n} + \lambda z_2 \sqrt{2(1-\rho^2)/n}, \quad \text{or} \tag{4} $$
$$ (1-\lambda)\rho z_1 \sqrt{2/n} \geq \lambda\Delta + \lambda z_2 \sqrt{2(1-\rho^2)/n}, \quad \text{or} $$
$$ \frac{1-\lambda}{\lambda}\,\rho z_1 - z_2\sqrt{1-\rho^2} \geq \Delta\sqrt{n/2} = \theta. $$

We introduce θ for Δ√(n/2), which is the familiar parameter for power calculation (if σ = 1, as here).

The left-hand side of (4) is normally distributed with mean zero and variance σ², as given below:

$$ \sigma^2 = \rho^2\left(\frac{1-\lambda}{\lambda}\right)^2 + (1-\rho^2). $$

So the probability of the event in (4) is:

$$ \Pr\left\{ z \geq \theta \Big/ \sqrt{\rho^2\left(\tfrac{1-\lambda}{\lambda}\right)^2 + (1-\rho^2)} \right\} \tag{5} $$

where z is unit normal. The second case of the event in (3) is:

$$ -\rho z_1 \sqrt{2/n} \geq \lambda\hat{\Delta} = \lambda\Delta + \lambda\rho z_1 \sqrt{2/n} + \lambda z_2 \sqrt{2(1-\rho^2)/n} \tag{6} $$

Reductions similar to those just seen bring this event to the form:

$$ z' \geq \theta \Big/ \sqrt{(1-\rho^2) + \rho^2\left(\tfrac{1+\lambda}{\lambda}\right)^2} \tag{7} $$

The probability that either of the events (4) or (6) occurs is given by an integral of the bivariate unit normal distribution over a region in the (z₁, z₂) plane defined by the two lines corresponding to expressions (4) and (6). It is not hard to establish that, for our values of λ, ρ, and θ, the probability of both inequalities (4) and (6) occurring is small compared to 0.001. One is then able to construct the probabilities in Table 5 by univariate table look-ups employing expressions (5) and (7) and adding the results.

Table 4 arises from a more complicated treatment. Events (4) and (6) arise again, but the condition that x̄₁ and x̄₂ do not differ significantly (at the 0.05 level) demands calculating the probability of (4) or (6) occurring with |z₁| < 1.96. This requires using bivariate normal tables (or a program) to find the probability, in the (z₁, z₂) plane, of the region which is the intersection of the vertical strip |z₁| ≤ 1.96 and the region defined by inequalities (4) and (6).
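These calculations can be sketched in code (my own reconstruction, not part of the original appendix). The Table 5 entries are the sum of expressions (5) and (7); the Table 4 entries follow by integrating the same events over the strip |z₁| < 1.96 and dividing by P(|z₁| < 1.96) = 0.95, i.e., reporting the conditional probability, which appears to reproduce the published values.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def table5(theta, lam, rho):
    """Unconditional probability of a crucial distortion:
    expression (5) plus expression (7)."""
    s4 = np.sqrt(rho**2 * ((1 - lam) / lam) ** 2 + (1 - rho**2))  # event (4)
    s6 = np.sqrt(rho**2 * ((1 + lam) / lam) ** 2 + (1 - rho**2))  # event (6)
    return norm.sf(theta / s4) + norm.sf(theta / s6)

def table4(theta, lam, rho):
    """Probability of a crucial distortion given no significant covariate
    imbalance: integrate events (4) and (6) over |z1| < 1.96, then divide
    by P(|z1| < 1.96) = 0.95."""
    s = np.sqrt(1 - rho**2)
    a = rho * (1 - lam) / lam                   # slope in inequality (4)
    b = rho * (1 + lam) / lam                   # slope in inequality (6)
    f = lambda z1: norm.pdf(z1) * (norm.cdf((a * z1 - theta) / s)
                                   + norm.cdf((-b * z1 - theta) / s))
    joint, _ = quad(f, -1.96, 1.96)
    return joint / 0.95
```

For example, table5(3.0, 0.2, 0.4) agrees with the tabled 0.173, and table4(3.0, 0.2, 0.4) comes out near the tabled 0.1365.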

REFERENCES

1. Byar DP, Green SB, Dor P, Williams ED, Colon J, Van Gilse HA, Mayer M, Sylvester RJ, Van Glabbeke M: A prognostic index for thyroid cancer. A study of the E.O.R.T.C. Thyroid Cancer Cooperative Group. Eur J Cancer 15:1033-1041, 1979

2. Byar DP: The necessity and justification of randomized clinical trials. In Controversies in Cancer: Design of Trials and Treatment, H.J. Tagnon and M.J. Staquet, eds. Masson, New York, pp 75-82, 1979

3. Green SB, Byar DP: Using observational data from registries to compare treatments: the fallacy of omnimetrics. Stat Med 3:361-370, 1984

4. Byar DP: Problems with using observational databases to compare treatments. Stat Med 10:663-666, 1991

5. Byar DP: Why data bases should not replace randomized clinical trials. Biometrics 36: 337-342, 1980

6. Byar DP, Simon RM, Friedewald WT, Schlesselman JJ, DeMets DL, Ellenberg JH, Gail MH, Ware JH: Randomized clinical trials. Perspectives on some recent ideas. N Engl J Med 295:74-80, 1976

7. Byar DP, Schoenfeld DA, Green SB, Amato DA, Davis R, DeGruttola V, Finkelstein DM, Gatsonis C, Gelber RD, Lagakos S, Lefkopoulou M, Tsiatis AA, Zelen M, Peto J, Freedman LS, Gail M, Simon R, Ellenberg SS, Anderson JR, Collins R, Peto R, Peto T: Design considerations for AIDS trials. N Engl J Med 323:1343-1348, 1990

8. Beach ML, Meier P: Choosing covariates in the analysis of clinical trials. Control Clin Trials 10:161S-175S, 1989