Statistical and ethical considerations in clinical trials


Community Dent. Oral Epidemiol. 1980: 8: 267-272.
Key words: dental caries; ethics.


JOHN E. ALMAN

Forsyth Dental Center, and Veterans Administration Outpatient Clinic, Boston, U.S.A.

Ethical considerations in the planning and execution of clinical trials are not new, of course, though the past 20 years or so have witnessed a substantial increase in public concern for human rights in the conduct of clinical research in the medical care setting. In American universities and other research institutions the formation of a Committee on Investigations Involving Human Subjects was mandated by the U.S. Public Health Service in 1966, after recommendations along these lines had appeared a year earlier. At Forsyth Dental Center this committee was activated in November of 1966. In the United Kingdom the appearance of Ethical Committees in hospitals goes back to a 1967 recommendation to that effect by the Royal College of Physicians. Thus, formal institutional responsibility for the ethics of clinical trials involving human subjects was dramatically increased by these steps, taken less than a decade and a half ago.

Until these recent times, ethical restraints on the design and conduct of clinical trials had relatively little effect on the activities of statisticians in assisting with the design of clinical trials and in carrying out the data analyses. But a major outcome of the increased concern for ethical considerations is the general notion that, to the maximum extent possible, human subjects shall not be subjected to the withholding of potentially beneficial treatment in order to create control subjects. From a statistician's point of view, this goes to the heart of the matter.

A clinical trial is usually expensive and time-consuming, particularly in the dental business, and is not mounted unless someone feels that a "new" treatment is a winner. It is rare indeed that a clinical trial is staged unless the principal investigator is enthusiastic about the new treatment. But the enthusiasm of the investigator is not considered as valid evidence of efficacy. Statistical considerations state emphatically that adequate control group(s) and randomized assignment of subjects to treatments are essential to a convincing assessment of the value of the treatment under test. The late HUGO MUENCH of Harvard once offered several statistical laws, one of which said essentially that "nothing improves the performance of an innovation as much as the lack of controls" (2).

In dental research, there is heavy emphasis on research into treatments and products for the reduction of caries. Since anti-caries products are dominated by the fluoridated variety it is natural to think of an appropriate control treatment as one that is similar but does not contain the active ingredient, in other words a placebo. But when we assign by means of a lottery scheme some subjects to a placebo treatment we are effectively withholding from these subjects the opportunity of receiving a treatment with likely preventive value. Since caries is an irreversible process the ethical dilemma is obvious. As late as 1971, the U.S. Food and Drug Administration approved a clinical trial of a fluoride dentifrice in which a random one-third of the subjects used a non-fluoride paste. This was at a time when the anti-caries value of a fluoride toothpaste had already been demonstrated in many clinical trials. Today, approval of such a study design is doubtful; much the same applies in Britain.

0301-5661/80/050267-06$02.50/0 © 1980 Munksgaard, Copenhagen

The history of clinical trials of fluoride dentifrices goes back about a quarter of a century. Many trial results have been published with a variety of outcomes; in summary they seem to indicate that a reduction in caries in children of the order of 20-30% from a non-fluoride control is the norm for the fluoride dentifrices. Suppose we wish to test a "new" fluoride dentifrice for efficacy. It is not enough simply to state that the dentifrice contains a fluoride; it must be demonstrated via a clinical trial that the new dentifrice can produce at least the same order of magnitude of caries reduction that fluoride dentifrices have shown in the past. How do we do this without using the placebo control? And, from a statistical point of view, how are the design and analyses of clinical trials affected?

Clinical trials are usually of the one-on-one variety, i.e., a new treatment is tested against a control treatment. The assessment of efficacy is normally in terms of the mean difference in the number of new decayed or filled tooth surfaces observed over, say, a 2-year period of the study. To assess a mean difference we use the t-test. The statistic, t, of course is nothing more than a standardized mean difference. The value of t is judged by a statement of probability, i.e., the probability of obtaining by chance a value of t at least as large as that actually obtained, if the true difference between test and control group is essentially zero. Since the true difference is forever unknown, inasmuch as it is only within our power to look at a small sample of the potential subjects, we agree to accept the demonstration of a non-zero mean difference if the probability associated with a specific value of t is less than or equal to a specified value, say .05. This is the conventional t-test, first enunciated by GOSSET in 1908, with the theoretical basis refined by FISHER in the 1920's (3).

When we plan and budget for a clinical trial, an important consideration is the number of subjects required to demonstrate the efficacy of whatever is under test. The t-test can be used to provide an answer to the question of "how many?". To show this and to provide a basis for further discussion we go through the following development.

The formula for the t-test can be written in a variety of forms. I choose to express it initially in the following manner:

$$t = \frac{m_1 - m_2}{s_{(\text{mean diff.})}} \tag{1}$$

where $m_1$ and $m_2$ represent the mean values for groups 1 and 2 respectively, with group 1 being arbitrarily designated as the "control" group, and $s_{(\text{mean diff.})}$ denotes the estimate of the standard error of the mean difference. Thus, the value of t depends not only upon the observed mean difference, but also on the variability of the data and the numbers of cases, both of which are incorporated into the denominator. The formula for t can also be written in terms of n's, means, and standard deviations as follows:

$$t = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \tag{2}$$

where $n_1$ and $n_2$ are the numbers of cases for the two groups, and $s_1$ and $s_2$ are the respective standard deviations.
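
As a rough illustration of formula (2), read here as the mean difference divided by the square root of s1²/n1 + s2²/n2, the Python sketch below computes t from summary statistics. The function name `t_statistic` and the numerical inputs (a control mean of 4.0 DFS, a test mean of 3.2, standard deviations equal to the means, 100 subjects per group) are assumptions chosen for illustration; they are not data from any trial.

```python
import math


def t_statistic(m1, m2, s1, s2, n1, n2):
    """Standardized mean difference, following formula (2).

    m1, m2: group means (group 1 is the control)
    s1, s2: group standard deviations
    n1, n2: numbers of cases per group
    """
    standard_error = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (m1 - m2) / standard_error


# Hypothetical summary statistics, for illustration only.
t = t_statistic(m1=4.0, m2=3.2, s1=4.0, s2=3.2, n1=100, n2=100)
print(f"t = {t:.2f}")
```

With these assumed inputs, t falls a little short of the one-tail 5% critical value of about 1.7 used later in the numerical example.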

In the planning stage we assume that the numbers of cases in each group are to be equal. (We can only assume equality at the start, even though we are aware that this will be only approximately so after losses have taken their toll.) Hence, we make the simplifying assumption that $n_1 = n_2 = n$, and (2) reduces to

$$t = \frac{(m_1 - m_2)\sqrt{n}}{\sqrt{s_1^2 + s_2^2}} \tag{3}$$

Now, in (3) we solve for n, yielding

$$n = \frac{t_\alpha^2\,(s_1^2 + s_2^2)}{(m_1 - m_2)^2} \tag{4}$$

In (4) above I have appended the subscript α to t to indicate that we select the appropriate value of t corresponding to the probability value of α. Thus, if we decide to use α = .05, we are using the "critical value of t at the 5% level" as representing the value of t we would like to obtain at the end of the study. Having selected an appropriate value of t, and having conjured up estimates of the values of the means and standard deviations from some past experience, we can readily compute the n required to produce the selected value of t. Such an estimate of n is only very approximate, of course.
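
The computation just described can be sketched in a few lines of Python, reading formula (4) as n = t² (s1² + s2²) / (m1 − m2)² with t taken at the chosen α. The function name `n_per_group` and the planning values below (means of 4.0 and 3.2 DFS with standard deviations equal to the means) are hypothetical, standing in for the "estimates from some past experience" mentioned in the text.

```python
import math


def n_per_group(t_alpha, m1, m2, s1, s2):
    """Subjects per group needed to reach t_alpha, following formula (4)."""
    if m1 == m2:
        raise ValueError("a zero mean difference cannot be detected")
    return t_alpha**2 * (s1**2 + s2**2) / (m1 - m2) ** 2


# Hypothetical planning values, for illustration only.
n = n_per_group(t_alpha=1.7, m1=4.0, m2=3.2, s1=4.0, s2=3.2)
print(math.ceil(n))  # rounded up to the next whole subject
```

As the text cautions, the result is only as good as the guessed means and standard deviations that go into it.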

To facilitate the discussion we convert the mean difference in the formula for t to a proportional difference. (Since in dental trials we normally look for differences in DFS that represent a "reduction from the control group," we can conveniently talk of a proportional reduction, which is denoted by P in the following formulas.) P is defined by

$$P = \frac{m_1 - m_2}{m_1}$$

and thus is the relative mean difference. We see that $m_1 - m_2 = P m_1$, and we make this substitution in (4), yielding

$$n = \frac{t_\alpha^2\,(s_1^2 + s_2^2)}{P^2 m_1^2} \tag{5}$$

We now define $Q = 1 - P$, so that $m_2 = Q m_1$, and introduce the coefficient of variation,

$$V_i = \frac{s_i}{m_i}, \qquad i = 1, 2.$$

After substitution of $s_1 = V_1 m_1$ and $s_2 = V_2 Q m_1$ in (5) we obtain

$$n = \frac{t_\alpha^2\,(V_1^2 + V_2^2 Q^2)}{P^2} \tag{6}$$

Two-year clinical trials among children, say in the 7-11 age range, tend to show mean DFS of the order of 3-4 surfaces per child, with the coefficients of variation about unity, i.e., standard deviations are about the same size as the mean values. If mean values are higher, say 6-8, the coefficients of variation tend to drop to appreciably less than one, and when mean values are down in the 1-2 range the coefficients of variation may be as high as 2 or more. In general, however, within a short range of mean values, as might realistically be found between control and test group, the coefficients of variation tend to be more or less independent of the mean values. Accordingly, we take as a rough approximation that the coefficients of variation for the two groups are equal,

$$V_1 = V_2 = V,$$

and we define a new symbol, W, as $W = t_\alpha V$. Making these substitutions we arrive at the final form of our formula for n:

$$n = \frac{W^2\,(1 + Q^2)}{P^2} \tag{7}$$

As a numerical example of the estimation of n using (7), suppose we agree to use t = 1.7, which is approximately the critical value of t at the 5% level for the one-tail test (4). Choose V = 1; then $W^2 = (1.7 \times 1)^2 = 2.89$. Using this value for W and substituting several different values for P (and thus Q) we obtain the values of n that would just provide the specified value of t if our assumptions about the coefficient of variation hold up. For values of P ranging from .30 to .05 the corresponding values of n are given below:

P     .30    .25    .20    .15    .10    .05
n      48     73    119    222    524   2200

Thus, looking to detect, for example, a 25% reduction from the control we need only 73 subjects per group. To detect a 15% reduction requires an increase in n to 222, and to detect a 5% reduction requires 2200 cases per group. It is clear that precision in the evaluation of mean differences is purchased at the price of more and more subjects, and thus higher costs.
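
The tabled values can be checked with a short Python sketch. It assumes formula (7) reads n = W² (1 + Q²) / P² with Q = 1 − P, takes W = 1.7 (t = 1.7, V = 1, so W² = 2.89), and rounds each result up to the next whole subject; the function name `n_from_reduction` is a label introduced here for illustration.

```python
import math


def n_from_reduction(p, w):
    """Subjects per group from formula (7): n = W^2 (1 + Q^2) / P^2."""
    q = 1.0 - p
    return w**2 * (1.0 + q**2) / p**2


reductions = (0.30, 0.25, 0.20, 0.15, 0.10, 0.05)
for p in reductions:
    # Round up: a fractional subject cannot be recruited.
    print(f"P = {p:.2f}  n = {math.ceil(n_from_reduction(p, w=1.7))}")
```

Under these assumptions the loop prints the six n values given in the table above, from 48 at P = .30 to 2200 at P = .05.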

Fig. 1 provides a plot of formula (7), though I have chosen to place the value of P on the vertical axis and to use selected values of n. (This is equivalent to solving (7) for P.) This way of plotting (7) permits asking our question about n in a different way, e.g., what level of proportional reduction can I detect statistically if I use, say, 200 subjects per group with W = 1.7? We read on the vertical axis about .15, i.e., we would expect to be able to detect statistically a proportional reduction of about 15%.
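
Reading the plot can also be done numerically. Assuming formula (7) is n = W² (1 + Q²) / P² with Q = 1 − P, it rearranges to the quadratic (n − W²) P² + 2W² P − 2W² = 0, whose positive root is the smallest detectable P. The Python sketch below implements that root; the function name `detectable_reduction` is introduced here for illustration, and the rearrangement is mine rather than the author's.

```python
import math


def detectable_reduction(n, w):
    """Positive root of (n - W^2) P^2 + 2 W^2 P - 2 W^2 = 0.

    This is formula (7) solved for P, given n subjects per group.
    """
    w2 = w**2
    if n <= w2:
        raise ValueError("n must exceed W squared for a root below 1")
    return (-w2 + math.sqrt(2.0 * w2 * n - w2**2)) / (n - w2)


print(f"{detectable_reduction(n=200, w=1.7):.3f}")
```

For 200 subjects per group and W = 1.7 the root lies between .15 and .16, in line with the reading of about .15 taken from the vertical axis.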

Fig. 1. Values of proportional reduction (P) just significant according to values of W, for selected values of n.

In practice, there are good reasons why a value of n obtained from consultation with the plot should be inflated. These are summarized below:

1. Subject attrition. The value of n that we seek is that number per group at the end of the study. But losses of subjects during the course of the study can be appreciable. Experience varies, but let us take, say, 20% as a typical loss experience over 2 years; to cover our losses requires that we add 25% to the obtained value of n.

2. Setting a value for W involves both the estimation of the coefficient of variation and the selection of a t-value. To provide a margin for possible underestimation of V, and to give a margin for meeting the hoped-for α-value of t, let us add an additional 10% to n.

3. Finally, since every person who has ever undertaken a large-scale clinical trial is well aware of the Machiavellian manifestations of Murphy's Law, "if anything can go wrong, it will," we will prudently add an additional 5-10% in deference to this phenomenon (5).

Altogether, then, it is realistic to add 40-50% to the value of n that we arrive at by the use of formula (7).
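
The three allowances can be put together in a small Python sketch. It assumes the allowances compound multiplicatively, which is one reading of the text and gives an overall inflation of roughly 44-51%, close to the 40-50% quoted. The function name `inflated_n` and its parameter names are introduced here for illustration.

```python
import math


def inflated_n(n, loss=0.20, w_margin=0.10, murphy_margin=0.05):
    """Inflate a per-group n for attrition and safety margins.

    loss:          expected fraction of subjects lost during the trial
    w_margin:      allowance for underestimating V or missing the target t
    murphy_margin: allowance for unforeseen problems (5-10% in the text)
    Returns the inflated n (rounded up) and the overall factor.
    """
    # Recruiting n / (1 - loss) leaves n subjects after a loss of that
    # fraction; a 20% loss therefore calls for 25% more at the start.
    factor = (1.0 / (1.0 - loss)) * (1.0 + w_margin) * (1.0 + murphy_margin)
    return math.ceil(n * factor), factor


recruit, factor = inflated_n(222)
print(recruit, f"{factor:.3f}")
```

Note why a 20% loss needs a 25% addition rather than 20%: the loss applies to the enlarged starting group, not to the target n.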

We return to the problem of specifying a control group for a clinical trial. Consider a trial to test a new dentifrice with expected prophylactic value, say, beyond that of fluoride dentifrices now on the market. We note that we must do the testing in the face of much fluoride competition. More than three-quarters of the dentifrices purchased in the U.S. contain fluoride, and the corresponding figures for the British Isles and Western Europe are not greatly different. In addition, children are exposed to fluoride in the water supplies and via rinses, tablets, and topical applications. In short, we must evaluate our new dentifrice in a world where our available subjects are already exposed to an appreciable amount of fluoride, much of it uncontrollable in the clinical trial, and we can only form control groups of subjects who themselves experience a substantial amount of exposure to fluoride. Thus, if we must use a control that is a positive control, e.g., one that makes use of a fluoride dentifrice that is currently in wide use, it can only serve to make smaller the proportional reduction we seek to detect, and thus to increase the n required. To move our detection level from 20% down to 10%, say, using W = 1.7, requires that we increase our n (uninflated) from something over 100 to over 500. The increase in cost is obvious.

A further point is worth considering on the pessimistic side. It may just be that when we compete with a fluoride control dentifrice, and in the face of an unknown amount of other fluoride contamination, a realistic appraisal of our new dentifrice might suggest that an improvement of perhaps 5-10% is all that we could reasonably expect. If so, it may well be that it is simply not economically feasible to meet the costs of a clinical trial with the numbers of cases that would appear to be required.

It seems clear that the increased pressures of ethical restraints serve to make it more difficult, and costly, to demonstrate efficacy of a product. Thus, it is appropriate to ask if there are countermeasures that can be taken to improve the sensitivity of our clinical trials, to counter somewhat the rather pessimistic picture given above. Below, four approaches are discussed which I feel have promise of countering the pessimism of the day.

1. Reduction in variability of the data - As seen in (3) the t-ratio varies inversely with the variability of the data, and any scheme that decreases variability will tend to increase the precision of a trial to detect a difference between two treatments. Or, conversely, decreasing variability tends to decrease the value of n required to obtain a specified level of t.
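
To give a feel for the size of this effect, the Python sketch below evaluates formula (7), read as n = W² (1 + Q²) / P² with W = tV, at several coefficients of variation. Because V enters as a square, n is proportional to V². The values V = 0.9 and V = 0.8 are hypothetical; the text gives no figures for how much V might actually be reduced.

```python
import math


def n_per_group(p, t_alpha, v):
    """Formula (7) with W = t_alpha * V and equal coefficients of variation."""
    w = t_alpha * v
    q = 1.0 - p
    return w**2 * (1.0 + q**2) / p**2


# Detecting a 15% reduction at t = 1.7 under three assumed values of V.
for v in (1.0, 0.9, 0.8):
    print(f"V = {v:.1f}  n = {math.ceil(n_per_group(0.15, 1.7, v))}")
```

Under these assumptions, lowering V from 1.0 to 0.8 cuts the required n by about 36%, since 0.8² = 0.64.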

We know that coefficients of variation tend to decrease as the mean values of DFS increase, and the latter tend to increase as the length of the trial increases. Thus, an additional year, say, for a trial should have the effect of increasing its sensitivity. Working against this, of course, is the additional attrition of subjects that another year produces. Since extension of the study has obvious cost consequences, the cost-benefit in sensitivity may be trivial or even negative. Nevertheless, the point has promise. For example, it might be expected to promote a modest improvement if a 2-year study was stretched to 2½ years so that the examinations were done after 18 and 30 months rather than after 12 and 24 months. (An interesting conjecture is that this phenomenon of increasing mean levels and decreasing relative variability might be able partially to explain what has often been observed, viz., that the first year of a study may not produce the significant results that turn up after the third examination data are examined.)

A second, and possibly more promising, avenue to reducing variability is to decrease the heterogeneity of the subject pool. For example, a reduction in the age range of the children used as subjects is quite likely to decrease the data variability. Many of the trials reported in the literature cover an age range of perhaps ages 6-12. When we plot DFS by age cohorts we usually see an almost-linear rise in DFS with age. If we could restrict the age range to something appreciably less than customary, I would expect to see a decrease in the overall variability of the DFS measure. Ideally, it seems certain that a study in which subject age is held constant at some optimum value would produce a more sensitive trial, but the logistical problems could be horrendous and the costs probably excessive. Some compromise along these lines, however, seems a promising way to go.

In addition to increasing the homogeneity with respect to age, can we do the same for caries susceptibility? We can assume that in any large pool of potential subjects there are those whose caries-susceptibility is sufficiently high that relatively subtle differences in the anti-cariogenic properties of treatments cannot be seen in this group of subjects. Likewise, we can assume there are those who are sufficiently caries-resistant that they can be expected to contribute virtually nothing to the detection of treatment differences within a reasonable time span. If we could screen out an appreciable portion of these two groups without unduly disturbing the middle group that is the real target for the study, we would be almost certain to have a more sensitive study, with lower variability and quite possibly a better mean difference between treatments.

2. The combining of results from multiple studies - Under circumstances where it is necessary to test a promising new product against a positive control we may have to settle for non-significant results even though the mean difference is in favor of the new product. The numbers that come out of (7) give ample suggestion that this may be so. However, suppose that we can repeat, approximately, the advantage to the new product in a second study? We should not forget that FISHER (6) long ago showed how to combine two or more independent studies so as to arrive at a probabilistic statement about the two of them together. A numerical example is instructive. Suppose that a study using 200 subjects per group finds a proportional reduction of 14% from the standard, with t = 1.50 and a one-tail probability of .068. This is close but not significant by the usual standard. A second study, independent of the first, uses 300 subjects per group and finds a proportional reduction of 9%, with t = 1.15 and a corresponding probability value of .126. The second study is less convincing in itself, but it does reinforce the first study. Using both probability values, FISHER's method obtains a value of chi-square of 9.54 on 4 degrees of freedom, significant at the 5% level. Given the data as outlined above, it takes only a few moments with a hand-held calculator to do the computations, and the answer can provide convincing evidence of significance in the right circumstances. It is important to emphasize that the studies must be independent: different examiners, different locales. Because of this the method is not appropriate for combining probabilities taken from sub-sets of a single study. And be it noted that there may be times when the logistics make it easier (and cheaper) to mount two studies of a size rather than one of twice the size.
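
The hand-calculator computation can be sketched in Python. The sketch uses the standard form of Fisher's combined probability test, in which −2 times the sum of the natural logarithms of k independent probabilities is referred to chi-square on 2k degrees of freedom. The function name `fisher_combined` is introduced here, and the closed-form tail probability used below holds only because the degrees of freedom are even.

```python
import math


def fisher_combined(p_values):
    """Fisher's method for k independent one-tail probabilities.

    Returns the chi-square statistic (2k degrees of freedom) and its
    upper-tail probability.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # For even degrees of freedom 2k the chi-square upper tail is
    # exp(-x/2) * sum_{i<k} (x/2)^i / i!
    tail = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))
    return statistic, tail


statistic, tail = fisher_combined([0.068, 0.126])
print(f"chi-square = {statistic:.2f} on 4 d.f., combined p = {tail:.3f}")
```

Run on the rounded probabilities .068 and .126, the statistic comes out marginally below the 9.54 quoted in the text, presumably because the author worked from unrounded probabilities. Either way it exceeds the 5% critical value of 9.49 for 4 degrees of freedom.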

3. The demonstration of equivalence - Under some circumstances the goals of a clinical trial may be almost as well met if the results can demonstrate that a test product and a standard can be said to be equivalent. It is well known that acceptance of the null hypothesis, i.e., a non-significant t, is necessary but by no means sufficient to make a case for equivalence. Ponder the notion that by keeping the n's sufficiently small we can manage to make any mean difference fall into the non-significant range. We need to show that we have conducted a trial with a sufficient number of subjects and with sufficient sensitivity to have detected a difference in treatments of an arbitrary magnitude if such was actually the case. For example, suppose that using n = 400 we obtain a proportional reduction of 9%, with t = 1.34 and a two-tail probability of about .2. (I use here a two-tail test since a claim of equivalence implies no a priori statement concerning which product is "better.") Referring to the plot of (7) we find for W = 2 and for n = 400 we can detect a proportional reduction of about 13%. The latter figure is sufficiently low that the trial can be said to represent a reasonable effort to detect differences between the test product and the standard if such truly existed, and that there is good indication that the test product is at least equivalent to the standard. I use the words "at least" since the test product did have a small advantage over the standard in the actual results. Though there is no statistical reason for it, I am sure that the case for equivalence would not be nearly as persuasive if the standard showed to better advantage than the test product in the actual results.

A case for equivalence is best made in circumstances where the trial results suggest that there is perhaps a modest advantage to the test product, but within the constraints of costs and logistics it was not possible to use a sufficient number of subjects to obtain statistical significance by the usual standard.

4. Two examiners - If each subject in a trial is examined independently by two different examiners, there may be a statistical advantage that could be important. Assuming that the examiners are of similar ability and experience, and have undergone some preliminary exercises to provide reasonable agreement on diagnostic standards, it would appear that (1) the results of one examiner should confirm the other, at least within reasonable limits, (2) the reliability of a composite DFS score should be higher than for a single examiner's scoring, and (3) the higher reliability translates into the equivalent of one examination done on a larger number of subjects. We do not yet know enough about the potential of this ploy for increasing the sensitivity of a trial. And certainly we do not yet have a good idea of whether the cost-benefit might be positive or negative and how best to estimate it. Some data that will be available in due time from an on-going study will shed some light on this conjecture.

You have probably observed that I have carried on the discussion as though all clinical trials involved only two groups, a control and a product under test. It is, of course, quite possible to test more products in one trial. Combinations such as a test product against two different controls, or two test products against a single control, have been used frequently. There is the obvious attraction of obtaining more evaluations within the overhead of just one study. But, working against this advantage, we must partition our available pool of subjects among more groups, and we must take into account the number of one-on-one comparisons we can make, three for three groups, six for four groups, and so forth. Since all tests of products or treatments must come back to the one-on-one comparison, i.e., the t-test, we are back where we started with one exception. Because of the multiple t-tests involved, we must increase the critical value of t to make proper allowance for the multiple comparisons within the trial. The literature of multiple comparisons is extensive. I call to your attention a recent paper by TUKEY (7) and an earlier book by MILLER (8). In summary, I feel that the two-group trial is really more efficient, particularly since all one-on-one comparisons in a trial of three or more groups are seldom of the same importance.
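
The count of one-on-one comparisons is k(k − 1)/2 for k groups, which a short Python sketch makes concrete. The Bonferroni division of α shown alongside it is an assumption added here as one simple and conservative allowance; the text does not name a specific procedure, and the works it cites discuss others. Both function names are illustrative.

```python
import math


def pairwise_comparisons(groups):
    """Number of one-on-one comparisons among the given number of groups."""
    return math.comb(groups, 2)


def bonferroni_alpha(alpha, groups):
    """Per-comparison alpha under a Bonferroni allowance (an assumption here)."""
    return alpha / pairwise_comparisons(groups)


for k in (2, 3, 4):
    print(k, pairwise_comparisons(k), f"{bonferroni_alpha(0.05, k):.4f}")
```

A smaller per-comparison α means a larger critical t, and through formula (7) a larger n in every group, which is the cost the text weighs against the saving in overhead.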

Unless we can picture a breakthrough of major importance that renders obsolete our present level of anti-caries effectiveness, it appears we are faced with the prospect of evaluating smaller margins of efficacy among competing products, and with conducting clinical trials without being able to use placebo controls. Indeed, it may well be that placebo controls no longer can serve us well because of the high level of uncontrolled fluoride experience among potential subjects. We need more thoughtful planning of the designs of clinical trials, and need to bring to bear all of the statistical witchcraft that we can muster.

REFERENCES

1. Noted by MIKE, V. & GOOD, R. in Old problems, new challenges, appearing in Science for 18 November 1977. This is the lead article in a series based on the Birnbaum Memorial Symposium on "Medical research: statistics and ethics" held at the Memorial Sloan-Kettering Cancer Center in New York City on 27 May 1977.

2. Quoted by GILBERT, MCPEEK & MOSTELLER in "Statistics and ethics in surgery and anesthesia," ibid., p. 687.

3. No reference is needed. There is hardly a statistical text that does not devote considerable space to the t-test, and its history is well-documented in many of these texts. GOSSET wrote under the pseudonym of "Student".

4. A one-tail test is appropriate if we take the position that the difference between control and test group is uni-directional, i.e., if a true difference exists we expect it to be in favor of the test group. Clinical trials are virtually never undertaken if truly nothing is known in advance about the relative efficacy of control and test.

5. MURPHY'S Law has many corollaries and can strike in many unexpected ways. An alternative statement of the Law that is more descriptive of how it affects clinical trials is, "Any unexpected event in the course of a clinical trial can only attenuate the favorable results and sharpen the unfavorable."

6. To be found on p. 99 of the 13th edition of FISHER'S Statistical methods for research workers.

7. Writing in Science for 18 November 1977, pp. 679-684, under the title Some thoughts on clinical trials, especially problems of multiplicity.

8. MILLER, R.: Simultaneous statistical inference. McGraw-Hill, New York 1966.