
STATISTICS IN MEDICINE, VOL. 12, 1293-1316 (1993)

COMBINING INDEPENDENT STUDIES OF A DIAGNOSTIC TEST INTO A SUMMARY ROC CURVE: DATA-ANALYTIC APPROACHES AND SOME ADDITIONAL CONSIDERATIONS

LINCOLN E. MOSES AND DAVID SHAPIRO
Department of Statistics, Stanford University, Stanford, CA 94305-5092, U.S.A.

AND

BENJAMIN LITTENBERG
Department of Medicine, Dartmouth-Hitchcock Medical Center, Hanover, NH 03756, U.S.A.

SUMMARY

We consider how to combine several independent studies of the same diagnostic test, where each study reports an estimated false positive rate (FPR) and an estimated true positive rate (TPR). We propose constructing a summary receiver operating characteristic (ROC) curve by the following steps. (i) Convert each FPR to its logistic transform U and each TPR to its logistic transform V after increasing each observed frequency by adding 1/2. (ii) For each study calculate D = V - U, which is the log odds ratio of TPR and FPR, and S = V + U, an implied function of test threshold; then plot each study's point (Sᵢ, Dᵢ). (iii) Fit a robust-resistant regression line to these points (or an equally weighted least-squares regression line), with V - U as the dependent variable. (iv) Back-transform the line to ROC space. To avoid model-dependent extrapolation from irrelevant regions of ROC space we propose defining a priori a value of FPR so large that the test simply would not be used at that FPR, and a value of TPR so low that the test would not be used at that TPR. Then (a) only data points lying in the thus defined north-west rectangle of the unit square are used in the data analysis, and (b) the estimated summary ROC is depicted only within that subregion of the unit square. We illustrate the methods using simulated and real data sets, and we point to ways of comparing different tests and of taking into account the effects of covariates.

INTRODUCTION

A standard way of describing the performance of a diagnostic test is the two-by-two table giving the probabilities of positive and negative test results for patients with and without disease: see Table I. The probabilities add to 1.0 in each column; the ideal test would show TPR and TNR both equal to 1.0, with FNR and FPR both equal to 0.

The data from a study of the diagnostic test will produce a similar-looking table, where the entries are now counts - numbers of patients - rather than probabilities: see Table II. Again, the ideal sample result would occur if b and c were both zero, making a equal to n₁ and d equal to n₂.

This paper offers methods for combining several independent data sets like that of Table II into a single summary. In general, the underlying principle is to recognize that, with everything else being equal, different probabilities in the format of Table I must apply if the test threshold defining


Table I. Summary of test performance probabilities

                                    Patient status
  Test outcome        With disease                    Without disease
  Positive            Q = TPR (true positive rate)    P = FPR (false positive rate)
  Negative            1 - Q = FNR                     1 - P = TNR
  Sum                 1.0                             1.0

Table II. Frequencies from applying the diagnostic test to n₁ patients with disease and n₂ patients without

                                    Patient status
  Test outcome        With disease                    Without disease
  Positive            a                               c
  Negative            b                               d
  Sum                 n₁                              n₂

a positive outcome differs from study to study. If that threshold is made more stringent, the false positive rate will be reduced, and so will the true positive rate; if the test threshold is made less stringent then both rates will increase. Consequently, two studies alike in all respects except threshold stringency will have different underlying probabilities for Table I, and will be expected to have discordant data in their two data sets arranged as in Table II. This remains true even when the test threshold is wholly implicit, as may be the case in assessing an image as 'normal' or 'abnormal'.

It can sometimes happen that the test threshold surely does not vary from study to study (for example, a positive test may be defined in each study as a score of 10 or less on a published checklist of psychological signs and symptoms), but still an association between true positive and false positive rates can occur if there are shifts in the two patient populations from one study to another. Consider for example two sites, one a college campus, the other a detention camp; at both sites there will be depressed and undepressed persons, but probably with different distributions across the spectrum of depression. Though the test threshold is fixed at 10, the scores of both depressed and undepressed may well be higher at the detention camp, which would then show higher TPR and FPR than the college, and yield a different point on the ROC curve, at the same threshold (10).

The underlying joint dependence of TPR and FPR, which can profitably be thought of as a trade-off between test sensitivity (TPR) and test specificity (1 - FPR), is captured fully in the receiver operating characteristic (ROC) curve. An ROC curve is a path in the unit square, rising from the lower left corner, where both FPR and TPR are zero, to the upper right corner, where they are both one. In Figure 1 an ROC curve is plotted, together with two points. Point P₁ represents a study with estimated FPR (the horizontal co-ordinate) equal to 0.20 and estimated TPR (the vertical co-ordinate) 0.75, while point P₂ represents another study with estimated FPR = 0.05 and estimated TPR = 0.50. Both these points lie near the plotted ROC curve, and, viewed in this way, are seen not as clearly discordant with each other, but as roughly concordant with this one ROC curve.


Figure 1. An ROC curve with two points that lie near it

So we begin our investigation starting from the idea that to describe the behaviour of a diagnostic test we should aim to give that description in the form of an ROC curve. Consequently, in summarizing the results from several studies of the same test we should similarly aim at producing a summary ROC curve.

This paper aims to offer some practical and convenient methods to estimate such summaries, and to characterize their adequacy. Very little has been published on this problem, which we may call the meta-analysis of independent studies of a diagnostic test. In 1982 Swets1 published a letter that pointed to the underlying issue. In 1986 Inouye and Sox2 analysed such a data set, but their methods were not appropriate; we give a reanalysis here. In 1990 Kardaun and Kardaun3 attacked the same problem that we do but by more computer-intensive methods; we re-analysed their data by the methods proposed here, and found that the results using their approach and ours (not shown here) are concordant though not identical. A technical report in 1992 offers a way to combine maximum likelihood estimates of normal theory ROC curves.4

APPROACHES TO FITTING

If we knew the underlying probability laws that applied to the probabilities of false positive and true positive, perhaps in a way that depended upon a few unknown parameters, then we would find general methods such as maximum likelihood estimation suitable. But, lacking such knowledge of underlying structure, we are driven either to accepting dependence on ad hoc assumptions about that structure or to adopting a frankly data-analytic approach. We choose the latter and present here the methods, some worked examples, and some comments on the properties of the methods and of the estimates they produce.


The necessity for false positive rate and true positive rate to increase together supports looking for a regression formulation. Our programme is to use the empirical logistic transforms of the two rates, in the expectation that linear regression will then capture an important part of their joint increasing relation.

An empirical transformation

Experience has shown that interpretation of a proportion or rate is often made easier when such a rate, P, is transformed into its logistic transform, called the logit:

logit(P) = ln[P/(1 - P)].

This transformation has the effect of mapping values of P less than 0.5 into negative numbers, which are absolutely large for P close to zero, and values of P greater than 0.5 into positive numbers, which are large for P close to 1.0. The mapping is symmetric around P = 0.5, so that 0.3 and 0.7 map into points equally distant from zero, as do 0.1 and 0.9, etc.

We define

U = logit(FPR) = ln[FPR/(1 - FPR)] = ln[P/(1 - P)]

V = logit(TPR) = ln[TPR/(1 - TPR)] = ln[Q/(1 - Q)].

Transforming P and Q into U and V in this manner maps the unit square (ROC space) into the ( U , V ) plane, with the origin corresponding to P = Q = 0.5.

These definitions lead naturally to the estimates:

Û = ln[(c/n₂)/(d/n₂)] = ln[c/d]

V̂ = ln[(a/n₁)/(b/n₁)] = ln[a/b].

If any of a, b, c or d is zero, the transform involving it is undefined. To avoid this problem we follow Cox5 in making an adjustment in the rates a/n₁ and c/n₂: we use as the estimated logits

Û = ln[(c + 1/2)/(d + 1/2)]

V̂ = ln[(a + 1/2)/(b + 1/2)].

We use these definitions for all the two-by-two tables, including those with no zero. This adjustment is asymptotically negligible but, as we will see later, with finite samples it can exert perceptible downward bias on estimated ROC curves. Note that with the 1/2 adjustment,

P̂ = (c + 1/2)/(n₂ + 1) and Q̂ = (a + 1/2)/(n₁ + 1).

If the test measurement X being compared with the threshold y follows a (nearly normal) distribution called the logistic distribution, then the ROC curve becomes exactly a straight line in (U, V) space. More particularly, if for the population without disease

P(y) = Pr(X > y) = [1 + exp{(y - μ₁)/σ₁}]^{-1}

and for the population with disease

Q(y) = Pr(X > y) = [1 + exp{(y - μ₂)/σ₂}]^{-1},


then the plot of V against U will be exactly a straight line. (For distributions ‘near’ this logistic distribution, the plot in ( U , V ) space will be ‘nearly’ a straight line. Fortunately inspecting a plot of the data enables us to avoid gross error in fitting by means of a line.)

The observed points (Ûᵢ, V̂ᵢ) can be used to estimate the line in (U, V) space that relates U(y) to V(y); then that line can be mapped back into ROC space. This is the approach taken by Kardaun and Kardaun,3 who use a method asymptotically equivalent to maximum likelihood estimation of the logistic-based line in (U, V) space.

The symmetric special case that arises when σ₁ = σ₂ yields a line in (U, V) space with slope 45°. Because the eye of the data analyst is more comfortable with horizontal lines, we choose to plot V - U against V + U, the latter being the horizontal co-ordinate. Again logistic distributions imply linearity, but now the symmetric case σ₁ = σ₂ yields a horizontal line, as shown in the Appendix. Hereafter, we shall refer to S (short for sum) and D (difference) for V + U and V - U, respectively. We have found that in plotting D against S we are following footsteps more than 20 years old: see Example 6.1 in Cox.5

Sketch of estimation method(s)

Our general strategy is as follows. First, for each table calculate Û and V̂, then ŝ = V̂ + Û and d̂ = V̂ - Û. Then fit a straight line

D = A + BS

to points (S, D)ᵢ, i = 1, ..., k from the various studies. We suggest two methods: (1) equally weighted least squares (EWLS), minimizing squared deviations in the D direction; or (2) a robust-resistant method described by Kafadar,6 which also focuses on deviations in the D direction. A discussion of several alternative resistant line fitting methods appears in an interesting chapter by Emerson and Hoaglin.7 Some of the methods there are less well adapted to hand calculation than Kafadar's approach (given at (3′′) in Reference 6) but are easier to program for a computer. Having estimated the straight line in (S, D) space, transform it back to ROC space.
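To make these steps concrete, here is a minimal sketch of the computations, written in Python with numpy purely as an illustration (the paper prescribes no software); the three example tables are taken from the simulated data of Table IV below.

```python
import numpy as np

# Each row is one study's two-by-two table: (a, b, c, d) = (TP, FN, FP, TN);
# the rows shown are studies 4, 9 and 10 of Table IV.
tables = np.array([[12, 4, 5, 6],
                   [18, 1, 7, 9],
                   [11, 1, 5, 9]], dtype=float)
a, b, c, d = tables.T

# Step (i): 1/2-corrected empirical logits.
V = np.log((a + 0.5) / (b + 0.5))   # logit of TPR
U = np.log((c + 0.5) / (d + 0.5))   # logit of FPR

# Step (ii): D is the study's log odds ratio, S a proxy for its threshold.
D = V - U
S = V + U

# Step (iii): equally weighted least squares fit of D = A + B*S.
B_hat, A_hat = np.polyfit(S, D, 1)
print(A_hat, B_hat)
```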

Regardless of what estimation method is used to produce A and B, the ROC curve that corresponds to them can be plotted using the formula presented below. (Justification appears in the Appendix.) In applying the formula, one inserts a value of P, and computes the corresponding value of Q. The locus of such points is the estimated ROC curve. The equation relating them is

Q = [1 + exp{-A/(1 - B)} {(1 - P)/P}^{(1+B)/(1-B)}]^{-1}.   (1)
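As a minimal numerical sketch (again Python/numpy, an assumption of this illustration), formula (1) can be traced over a grid of FPR values; the function name is hypothetical and the values A = 2.82, B = -0.002 are of the size seen in the first worked example below.

```python
import numpy as np

def summary_roc(A, B, fpr):
    """Back-transform the fitted line D = A + B*S into ROC space via formula (1)."""
    fpr = np.asarray(fpr, dtype=float)
    expo = (1.0 + B) / (1.0 - B)
    return 1.0 / (1.0 + np.exp(-A / (1.0 - B)) * ((1.0 - fpr) / fpr) ** expo)

# Trace the curve only over the 'relevant' FPR range, e.g. (0, 0.5]:
P = np.linspace(0.01, 0.50, 50)
Q = summary_roc(A=2.82, B=-0.002, fpr=P)
```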

We observe that in fitting a linear regression, the points with horizontal coordinates that are far from the centre are disproportionately influential in determining the slope. In our problem such points correspond to high values of TPR and FPR or else low values of TPR and FPR. Practical applications, however, rarely use those regions of ROC space; for example, it would be an unusual application in which a high true positive rate was sought, at the cost of FPR > 0.50. Similarly, it would be an unusual application in which a low FPR was sought at the cost of a TPR < 0.5. Actual relevant domains of ROC space will depend on the particular situation; the major point is to avoid basing ROC estimates on data that are in a range of no practical interest, for that amounts to accepting needless extrapolation to the range of interest from data outside that range. (If we had a known true model for the ROC curve this objection would vanish, for in that case the ‘right’ way to use the data that are outside the range of interest would be understood from the true model.)


Table III. Sampling experiment: Poisson parameters

  Study    α    β    γ    δ
  1       20    1   10   10
  2       20    1    5    5
  3       16    2    4   10
  4       18    3    3   10
  5       20    4    5   20
  6       10    2    5   20
  7       10    8    1   16
  8       10    2    4   16
  9       16    1    8   10
  10      10    1    5   10
  11      20    2    5   10

α was chosen as the parameter of a Poisson distribution for the frequency occurring at a in Table II; β, γ and δ are similarly defined. The following groups of studies coincide in ROC space: 1 and 2; 5, 6 and 8; 10 and 11.

Thus, an essential component of the data-analytic methods that we propose is that a 'relevant range' be defined; it would specify an upper bound to 'relevant' values of the FPR and a lower bound to 'relevant' values of the TPR. We happen to choose 0.5 for both these bounds, but they need not be equal, and they could be larger or smaller depending on the decision problems contemplated. After that choice is made, two consequences follow: (i) only studies in that range enter the data analysis; (ii) the summary ROC curve is plotted only within that relevant range. Upon reflection it will be recognized that one effect of this truncation must be to bias upward the estimated line in (S, D) space. Consideration of the magnitude of this 'truncation bias' appears in the Appendix but we note now that the truncation bias and the bias due to adjustment by 1/2 have opposite signs, and thus oppose one another. Simulations give strong indication that the truncation bias is actually helpful when our methods of estimation are applied.

Before illustrating the methods on some data sets, we note that for any study, d̂ is the estimated log odds ratio, since

d̂ = V̂ - Û = ln[Q̂/(1 - Q̂)] - ln[P̂/(1 - P̂)] = ln[Q̂(1 - P̂)/{(1 - Q̂)P̂}].

We have already remarked that if σ₁ = σ₂ then the linear relation between D and S, D = A + BS, has B = 0. In that case the estimation becomes a one-parameter problem focusing on A, the elevation of the horizontal line in (S, D) space, and the task is to estimate the common odds ratio; this is a problem already well explored in the literature.8-12

FIRST EXAMPLE, WITH KNOWN PARAMETERS

Our first example takes simulated data from a known model, in which a common odds ratio underlies the 11 data points. Eleven two-by-two tables were constructed (see Table III); each had Poisson parameters that imply odds ratio 20 (and ln(OR) equal to 3.0). They lie on a single ROC curve, the one corresponding to a logistic translation model (thus σ₁ = σ₂) with implied odds ratio 20; that ROC curve and the locations (P, Q)ᵢ of the 11 studies are depicted in Figure 2. When a point is labelled 2 or 3, the point corresponds to two or three studies, as shown in Table III.

Figure 2. A logistic ROC curve (odds ratio = 20) on which the 11 simulation studies of Table III are identified

Table IV. Poisson sampling experiment: observed tables

  Study   a (TP)   b (FN)   c (FP)   d (TN)      ŝ        d̂        σ̂      σ̂⁻¹
  1         14       0        16        7      4.16     2.58     1.16    0.866
  2         15       2        12        4      2.85     0.803    0.845   1.18
  3         17       0         2       12      1.95     5.16     1.25    0.80
  4         12       4         5        6      0.855    1.19     0.795   1.26
  5         16       4         7       16      0.511    2.09     0.681   1.47
  6         11       1         6       22      0.795    3.28     0.908   1.10
  7         13       7         1       13     -1.61     2.79     0.905   1.11
  8          6       2         1       22     -1.75     3.66     1.05    0.952
  9         18       1         7        9      2.28     2.75     0.906   1.10
  10        11       1         5        9      1.49     2.58     0.958   1.04
  11        21       4         6        8      1.30     1.83     0.726   1.38

In calculating the last four columns the frequencies a, b, c, d have all been increased by 1/2.

For each study, using uniform random numbers to simulate the four Poisson cumulative distribution functions, Poisson-distributed random variables were drawn, one for each cell, giving the 11 observed tables shown in Table IV. Those observed frequencies have the same distribution as would be obtained by drawing two Poisson observations n₁ and n₂, the first with expected value α + β, the second with expected value γ + δ, and then observing two binomial random variables: the first, a, with parameters Q = α/(α + β) and n₁ as just found; the second, c, with P = γ/(γ + δ) and n₂ as just found. In this manner we generated the 11 two-by-two tables, each from an underlying situation with odds ratio 20.
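The following is a minimal sketch of that sampling scheme (Python/numpy assumed; the seed and the use of numpy's Poisson generator, rather than the inverse-c.d.f. construction described above, are choices of this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for the illustration

# (alpha, beta, gamma, delta) for the 11 studies of Table III; each implies OR = 20.
params = [(20, 1, 10, 10), (20, 1, 5, 5), (16, 2, 4, 10), (18, 3, 3, 10),
          (20, 4, 5, 20), (10, 2, 5, 20), (10, 8, 1, 16), (10, 2, 4, 16),
          (16, 1, 8, 10), (10, 1, 5, 10), (20, 2, 5, 10)]

# One Poisson draw per cell gives an observed table (a, b, c, d) for each study.
tables = [tuple(int(rng.poisson(lam)) for lam in p) for p in params]
```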


Figure 3. Simulation results from the 11 studies of Table III. Data of Table IV are plotted twice: + with, and * or o without, continuity correction

The data analysis: least squares

The 11 points are plotted in ROC space in Figure 3; indeed they are plotted twice. The stars show the points (a/n₁, c/n₂) that lie inside the relevant region. Two of the points lie on the upper boundary; they reflect the zero frequencies seen in studies 1 and 3. The two points at the right (which arise from studies 1 and 2), appearing as o's, are truncated and do not enter further analyses. The points are plotted again as crosses, after the addition of 1/2 to each of a, b, c, d as required by our computational routines. The effect of the adjustment may seem surprisingly large; it obviously tends to push an estimated ROC curve away from the desirable northwest corner of ROC space.

Consult Table IV again. The values of ŝ and d̂ for study 1 were computed thus: V̂ = ln(14.5/0.5) = 3.37; Û = ln(16.5/7.5) = 0.79; ŝ = 3.37 + 0.79 = 4.16; d̂ = 3.37 - 0.79 = 2.58.

The values of ŝ and d̂ shown for the nine studies (excluding studies 1 and 2, which have FPR > 0.5) are plotted in (S, D) space in Figure 4. The least squares line is constructed in the usual way from the nine (S, D) pairs (omitting studies 1 and 2), giving D = 2.82 - 0.002S. The line lies nearly flat, at an elevation 2.82, about 0.5 standard error (SE = 0.36) below the true log odds ratio, which is 3.0. (The standard error of the intercept comes from the usual least squares calculation.) Each point lies within two asymptotic standard errors of the line (where ASE = [1/(a + 1/2) + 1/(b + 1/2) + 1/(c + 1/2) + 1/(d + 1/2)]^{1/2} = σ̂), except for the one upon which standard error bars (of length 2.60) have been plotted. (The ticks show 2.00.)

The curve in ROC space corresponding to the least squares line is shown in Figures 5 and 6. The curve is the same in both; in Figure 5 the observed values a/(a + b) and c/(c + d) are plotted; in Figure 6 the values of (a + 1/2)/(a + b + 1) and (c + 1/2)/(c + d + 1) appear. Observe that the

Figure 4. The continuity-corrected data of Figure 3 plotted in (S, D) space, omitting the two out-of-range points, with the equally weighted least squares line

Figure 5. The image of Figure 4's equally weighted least squares line, in ROC space, with uncorrected data points, plotted as * or o

Figure 6. The image of Figure 4's line, in ROC space, with continuity-corrected data points plotted as +

curve, which was fitted to the 1/2 adjusted points of Figure 6, lies below six of the nine (raw data) points in Figure 5, but in Figure 6 three adjusted points are above, four are clearly below, and two are almost on the curve.

Next we consider an alternative way of fitting a line in (S, D) space.

The data analysis: robust-resistant line

On graph paper carefully plot the points (ŝ, d̂) based on the 1/2-corrected frequencies. As before, we omit the two 'outside' points. The next step is to divide the points into three groups by vertical lines so that one-third of the points lie to the left, one-third to the right and one-third between the two lines. There are three points in each region. (If we were dealing with ten points there would be four in the middle and three to right and left; with eleven points we would use four at right and left and three in the middle.) Now, for the right hand points find the median S value and for those same points the median D value. Mark the point with those two coordinates with a +. Do the same for the points in the left hand region. Join the two crosses with a straight line. With our data of Figure 7, in the left region one of the observations has both median values; on the right the point does not coincide with any observation. The slope of this line, -0.01, is our estimate of B.

Now count the points above and below this line; if they are not equally numerous, slide the line parallel to itself, up or down, to bring about equality. The elevation of this final line determines our estimate of A. In Figure 7, the initial line has three points above and four below, and passes through two points; the fit cannot be improved by raising or lowering it. So in this case the initial line is also the final one. Its intercept A is 2.77, close to the previous result.

This fitted line is thus far a graphical construct. To render it in algebraic form, which will be needed to effect the back-mapping into ROC space, take two widely separated points on the line

Figure 7. Fitting the robust-resistant line to the same continuity-corrected data points as shown in Figure 4

and identify from the graph paper their coordinates (S₁, D₁) and (S₂, D₂). Then, in the expression

D = A + B S

take

B = (D₂ - D₁)/(S₂ - S₁),    A = (D₁S₂ - D₂S₁)/(S₂ - S₁).

Next, formula (1), already given for back-mapping, is applied to A and B. The robust-resistant line fitted in Figure 7 maps into the ROC curve (original points) in Figure 8; as with the equally weighted least squares (EWLS) line, the curve lies beneath six of the nine original points. The eye cannot confidently see a difference between the EWLS ROC curve and the robust-resistant one, though that need not be true with every data set!
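For readers who prefer a programmed version, the sketch below implements the three-group construction just described (Python/numpy assumed). It is a simplified stand-in for Kafadar's procedure, not a transcription of it: the slope comes from the medians of the outer thirds, and the intercept is taken as the median residual, which is one way of sliding the line until the points above and below it balance.

```python
import numpy as np

def resistant_line(S, D):
    """Simple three-group resistant line for D = A + B*S; returns (A, B)."""
    S, D = np.asarray(S, float), np.asarray(D, float)
    order = np.argsort(S)
    S, D = S[order], D[order]
    k = (len(S) + 1) // 3           # size of the left and right thirds

    # Summary points: medians of the left and right thirds.
    SL, DL = np.median(S[:k]), np.median(D[:k])
    SR, DR = np.median(S[-k:]), np.median(D[-k:])

    B = (DR - DL) / (SR - SL)       # slope of the line joining the two crosses
    A = np.median(D - B * S)        # slide the line so residuals balance
    return A, B
```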

The data analysis: assuming constant odds ratio

For this data set we know a priori that, first, the ith two-by-two table arises from a point on a logistic ROC curve determined by a log odds ratio θᵢ and, second, all the θᵢ are equal (in fact, equal to 3.0). In this one-parameter family situation, estimating an ROC curve is the same as estimating a log odds ratio. Without supposing that all the θᵢ are equal we can get a 96.1 per cent confidence interval for the median θ in the population from which these studies may be regarded as a random sample; that confidence interval reaches from the next-to-smallest value of d̂ (1.83) to the next-to-largest value (3.66); the median value of d̂ (2.75) estimates the population median. To each of these three d̂ values there corresponds an ROC curve (of logistic form); they are shown in Figure 9. In the same one-parameter family situation the mean value of d̂ and a confidence interval based on the t-test afford an alternative approach. The mean value of d̂ is 2.81 with


Figure 8. The image of Figure 7's robust-resistant line, in ROC space, with uncorrected data points, plotted as * or o

Figure 9. The true ROC curve (OR = 20, σ₁ = σ₂) together with 11 uncorrected data points and the median-based symmetric constant-odds-ratio ROC curve, with upper and lower confidence bands (confidence 0.961)

Figure 10. The true ROC curve (OR = 20, σ₁ = σ₂) together with 11 uncorrected data points and the Mantel-Haenszel ROC curve, with upper and lower confidence bands (confidence 0.961)

standard error 0.384; the t-multiplier t₈ = 2.45 again gives 96 per cent confidence; the confidence interval reaches from 1.87 to 3.75, close to the median-based result. Notice that both these methods will produce wider confidence intervals (on average) if the θᵢ are not homogeneous. If one is prepared to rely on homogeneity of the θᵢ then the confidence interval for the common underlying θ can be constructed via the Mantel-Haenszel method. Here that method gives the point estimate 2.73 with standard error 0.315; again using 96 per cent confidence (for comparability) we find in the normal table z = 2.06, and arrive at the much narrower 96 per cent confidence interval 2.08 to 3.38, together with the implied logistic ROC curves, yielding the confidence band displayed in Figure 10. The M-H approach does not take account of systematic between-study variation if present.
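A minimal sketch of the Mantel-Haenszel point estimate itself (Python/numpy assumed; the accompanying standard error used in the paper requires a separate formula and is not reproduced here):

```python
import numpy as np

def mantel_haenszel_log_or(tables):
    """Mantel-Haenszel pooled odds ratio (point estimate only), on the log scale.

    Each table is (a, b, c, d) = (TP, FN, FP, TN); n is that study's total."""
    t = np.asarray(tables, dtype=float)
    a, b, c, d = t.T
    n = t.sum(axis=1)
    return float(np.log(np.sum(a * d / n) / np.sum(b * c / n)))
```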

SECOND EXAMPLE: COMPUTED TOMOGRAPHY AND MEDIASTINAL NODES

Inouye and Sox2 present 14 studies of the sensitivity and specificity of computed tomography as a method to assess presence or absence of mediastinal metastases in patients with non-small-cell lung cancer. Table V shows the frequencies reconstructed from their Table 2. One of these studies, number 79, does not enter our analysis; its false positive rate exceeds 0.5, and its true positive rate is less than 0.5. The table shows the values of ŝ and d̂ for each study, and Figure 11 shows the 13 studies as points in (S, D) space.

In that figure, five fitted lines appear. The two upwardly sloping lines arise from applying equally weighted least squares and the robust-resistant algorithm. The former has slope 0.024 with a standard error of 0.212, suggesting that a summary based on a zero-slope fit should be adequate. The three horizontal lines are (in descending order) those based on the median, the Mantel-Haenszel estimator and the mean. They are in close agreement; the scatter of the points


[Figure 11: the 13 studies plotted in (S, D) space, with the five fitted lines described in the text.]

Table V. Inouye and Sox CT-scan data

  Study   TP   FN   FP   TN       S         D        σ̂      σ̂⁻¹
  78      11    7    6   26    -0.978     1.83    0.637   1.57
  79*      2    5   15   13    -0.650    -0.927   0.843   1.19
  80      15    2    2   32    -0.740     4.39    0.891   1.12
  81       4    1    4   13     0.00      2.2     1.06    0.942
  82      22   21    1    6    -1.42      1.51    0.909   1.10
  83      17    1    9   15     1.97      2.95    0.869   1.15
  84      29   10    1   54    -2.56      4.63    0.809   1.24
  85       8    6    4   23    -1.38      1.92    0.723   1.38
  86       7    6   12   25    -0.57      0.856   0.638   1.57
  87      20    1   10   18     2.05      3.18    0.850   1.18
  88      19    1    9   19     1.85      3.28    0.856   1.17
  89      18    6    8   65    -0.996     3.09    0.576   1.74
  90      18    1    9   13     2.16      2.86    0.872   1.15
  91      17    3    6   49    -0.421     3.64    0.698   1.43

* This study is omitted from our analysis; it has estimated FPR > 0.5 and estimated TPR < 0.5.

from different studies dwarfs the discrepancies among those three lines. Some of the numbers associated with Figure 11 are displayed in Table VI, which shows the estimates of A, B, the standard error of A, and Q*, a global measure of test efficacy, defined in the Appendix.

The top three lines relate to estimates where zero slope is enforced. The standard errors are each of a different sort. The Mantel-Haenszel standard error, much the smallest, contemplates no source of variation other than binomial (or Poisson). (This appears to be contrafactual given the


Table VI. Coefficients of some fitted lines in (S, D) space: Inouye-Sox data

                              B          A (SE)          Q*
  Median                      0†         2.95 (0.393)‡   0.814
  Mean                        0†         2.80 (0.306)    0.802
  Mantel-Haenszel§            0†         2.90 (0.228)    0.810
  Robust-resistant            0.164      2.62            0.788
  EWLS                        0.024      2.80 (0.319)    0.802
  Inouye-Sox (implied)¶       0          2.37            0.766

† These values of B are not fitted; they are implied by the assumption of equal variances.
‡ Imputed; see text.
§ Breslow-Day test and Tarone test each reject a common odds ratio at 0.007 < p < 0.010.
¶ Their summary result, Se = 0.74, Sp = 0.79, leads to a symmetric logistic ROC curve with these values of A and Q*.

Figure 12. The uncorrected data of Figure 11 in ROC space, with three fitted curves, the lowest corresponding to the odds ratio of the Inouye-Sox summary point; three marks depicting Q* for the median, mean and Mantel-Haenszel constant odds ratio estimated curves (0.814, 0.802, 0.810; see Table VI)

results of Breslow-Day and Tarone tests.) The standard error of the mean is responsive to interstudy variation, if present. The confidence interval for the median (confidence coefficient 0.978) reaches from the third smallest value of D, 1.83 (study 78), to the third largest, 3.64 (study 91), a distance of 1.81. The confidence coefficient corresponds to normal deviates of ±2.30; dividing 1.81 by 4.6 then gives the result 0.393, which we take as an imputed standard error for the median. This 'standard error' is also responsive to between-study variation, if present.


The next two lines of Table VI derive from lines fitted to the data in (S , D) space without constraining the slope. Again the standard error shown for EWLS is responsive to between-study variation, if present. The last line is derived from establishing that the symmetric logistic ROC curve passing through the Inouye-Sox summary point has log odds ratio 2.37, and implies the value of Q* shown.

The six ROC curves of Table VI can be drawn in ROC space, and Figure 12 displays three of them. The lowest is the Inouye-Sox imputed symmetric curve; it does not appear to do justice to the thirteen studies, nine of which lie above it. The least symmetric is the curve derived from the robust-resistant line. The third curve comes from the equally weighted least squares line in (S, D) space; it is slightly asymmetric, but it lies so close to the three curves using median, mean and Mantel-Haenszel that they are not depicted, though their values of Q* are indicated on the diagonal line where sensitivity and specificity are equal, that is when P + Q = 1.

DISCUSSION

The area under the ROC curve, a convenient global measure of test efficacy, is not available to us, for we discard the relevance of much of that area. But we propose another global measure of test efficacy, the intersection of the estimated ROC curve with the line where sensitivity = specificity, that is the line TPR + FPR = 1 . We call the common value of sensitivity and specificity Q*; it is a function of A only, and a standard error for Q* is available if least squares has provided the estimate of A . Values of Q* near 1.0 indicate an ROC curve that is snugged up near the desirable north-west corner where sensitivity and specificity are both 1.0.

As elaborated in the Appendix, truncation gives an upward bias to the estimated summary ROC curve, and the 1/2 correction gives a downward bias. In the cases studied by simulation, the truncation bias opposes but does not fully compensate for the bias due to adding 1/2 to all cells.

We have little to say about assessing goodness of fit. We distinguish two kinds of fit. The first relates to whether a straight line in (S, D) space is a reasonable regression function. It takes either marked non-linearity or a wealth of data to resolve this issue. The second kind of fit is concerned with wild data points, a possibility that favours using the robust-resistant line. We believe that if equally weighted least squares is chosen as the estimation method, standard diagnostics, perhaps adapted, might be useful.

Overdispersion

Of much greater concern is the matter of overdispersion. Maximum likelihood, minimum χ², and weighted least squares all proceed as if the only source of randomness were binomial (or Poisson or multinomial) variation. With those methods of estimation, a single very large study will dominate the estimate. On general grounds we think it unwise to rely on the absence of interstudy variance. We believe that it is wise to accept the likelihood that any three studies - however large their sample sizes - would not lie on a single two-parameter ROC curve. If one accepts that premise, then it becomes important not to use estimates that ignore an interstudy component of variation. That is why we prefer equally weighted least squares to weighted least squares; equally weighted least squares is 'exactly' appropriate in the limiting case that binomial (Poisson) intrastudy variability is negligible in comparison to between-study variability. Analogously, weighted least squares is exactly appropriate if there is no component of variation between studies - something we regard as unlikely.

Doubtless it is possible to devise methods that estimate the between-study component of variance, and to construct weights that take account of both that component and the binomial


(etc.) components of the various tables. But we do not undertake that task here; if we were to address it we would expect to borrow from works of Wolfe et al.,13 Williams,14 Pocock et al.15 and Breslow.16

The role of models

We have avoided relying on a model, beyond thinking it likely that transforming binomial probabilities to logits will be beneficial, and that since TPR and FPR should increase together, maybe linear regression, on the logit scale, will capture a useful part of that relation between TPR and FPR.

We actually have reasons to distrust the translation model. It posits two separate populations, one sick and one well. This might apply well with a single-gene disease or an infectious one. But in many applications it seems more reasonable to regard disease status Y (such as heart failure) and a variable X used in diagnosis (such as treadmill endurance) as both being continuous and having a joint distribution, dichotomized in both directions: large enough Y constituting ‘heart failure’, and large enough X being a ‘positive’ test outcome. If one takes the joint distribution to be bivariate normal, the double dichotomy does not neatly present two normal (or logistic) densities. Some study of examples indicates that, except for very highly discriminating tests, the actual regression of D upon S , though obviously non-linear, can be serviceably approximated by a straight line. More work is needed.

Even if separate logistic densities are accepted, there are difficulties with the model if σ₁ ≠ σ₂. It is easily ascertained that then the likelihood ratio is not monotonic (and so the ROC curve is not convex). Further, when σ₁ ≠ σ₂ the strategy 'X > y denotes positive' is unreasonable either for all sufficiently large y or else for all sufficiently small y. That is, for sufficiently extreme values of X in both directions, one of the densities - the one with the larger standard deviation - is the more likely source of an observation; but the rule 'X > y denotes positive' does not honour this fact, and so is unreasonable. To some extent this difficulty is softened by restricting data and estimated ROC curve to the northwest quarter, for it can be shown that in that region the ROC curve is convex whatever is the ratio of σ₁ to σ₂.

Uses of the summary ROC curve

Comparing two diagnostic tests

Our methods offer ways to compare two tests. First, a 'global' comparison can be made by comparing the values of Q* for the two tests in the light of their standard errors; thus

(Q̂₁* - Q̂₂*)/[SE²(Q̂₁*) + SE²(Q̂₂*)]^{1/2}

could be referred to the normal table if there are 'many' (at least ten?) studies for each test so that the central limit theorem promises nearly normal Â₁ and Â₂, of which Q̂₁* and Q̂₂* are smooth functions. We have assumed here that the studies of the two tests were statistically independent; if each site gave a two-by-two table for each test using the same or overlapping subjects then the standard error given above would be incorrect.
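A minimal sketch of that comparison (Python assumed; scipy is used only for the normal tail area, and the function name is hypothetical):

```python
from math import sqrt
from scipy.stats import norm   # used only for the two-sided tail area

def compare_q_star(q1, se1, q2, se2):
    """Normal-theory comparison of two Q* estimates from independent sets of studies."""
    z = (q1 - q2) / sqrt(se1 ** 2 + se2 ** 2)
    return z, 2.0 * norm.sf(abs(z))
```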

A second way of comparing two tests would be to plot all the points for both tests in (S, D) space, as if they were a single sample. Fit the line. Now construct some reasonable statistic that contrasts the two tests, for example the fraction f₁ of the M₁ points of test 1 that lie above the line minus the fraction f₂ of the M₂ points of test 2 that lie above the line. If f₁ exceeds f₂ significantly, then test 1 would be regarded as the better test. Significance could be assessed by a permutation


test, but if frequencies are large enough then the two-by-two contingency table (counting, for each test, the points above and below the fitted line) can be assessed for proportionality by the usual χ² test. A third approach would be to compare the heights above (below, counted negatively) the line in the two groups using a permutation test.

Finally, if B = 0 we can apply a simple t-test to the values of Di for the two tests. If B = 0 and each site gives a value of Di for each diagnostic test, the matched pairs t-test applies.

In comparing two diagnostic tests we can expect biases (from estimation, from 1/2 correction, from truncation) to be less irksome because of a tendency for cancellation to occur when differences are taken.

Using the summary ROC locally in decisions

Decision problems involving use of a diagnostic test score sometimes reduce to setting the test threshold to correspond to a given value V₀ of the likelihood ratio of the test result. This amounts to identifying the test threshold y₀ at which the tangent to the ROC curve has slope equal to V₀. From the ROC curve one can find that point of tangency with a ruler and a protractor (or by formula, not developed here). The values TPR₀ and FPR₀ for the point can be read from the graph. But what value of y corresponds to that point on the ROC curve? Local records of test values of non-diseased subjects may allow finding what test threshold y produces the false positive rate FPR₀. If so, use that value of y for y₀ and thus obtain the point on the summary ROC curve that yields the desired (estimated) likelihood ratio.
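Under the assumption that the summary curve is given by formula (1), one crude way to locate that point of tangency numerically is to scan the relevant FPR range for the place where the slope dTPR/dFPR is closest to V₀; the sketch below (Python/numpy assumed) relies on the hypothetical `summary_roc` helper sketched after formula (1).

```python
import numpy as np

def operating_point(A, B, V0, lo=0.01, hi=0.50, n=2000):
    """(FPR, TPR) on the summary curve where the slope dTPR/dFPR is closest to V0."""
    P = np.linspace(lo, hi, n)
    Q = summary_roc(A, B, P)        # hypothetical helper sketched after formula (1)
    slope = np.gradient(Q, P)       # numerical derivative along the curve
    i = int(np.argmin(np.abs(slope - V0)))
    return P[i], Q[i]
```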

Covariates

Studies of the same diagnostic test at various sites may be accompanied by variations in circumstances. Patient populations may vary in average age, or socio-economic status, or nutritional status. Gold standards may not be the same, and so on. It is natural to ask whether such covariates influence outcomes. Perhaps some of the overdispersion can be removed by taking covariates into account.

It appears that covariates can be taken into account fairly readily, though we have not yet completed such a study. We will indicate the ideas using a single covariate; the multivariate extension is immediate. So let Z be a covariate, taking value Zᵢ in study i. With D and S being as before V - U and V + U respectively, write

Dᵢ = A + BSᵢ + CZᵢ

and fit it by equally weighted least squares, obtaining

D̂ᵢ = Â + B̂Sᵢ + ĈZᵢ = (Â + ĈZᵢ) + B̂Sᵢ = Âᵢ + B̂Sᵢ.

In writing the third equality we suggest that the summary ROC curve in (S, D) space has for the ith site an individual intercept Âᵢ, allowing each site to fit its own summary ROC curve, taking into account the local value of Z together with the information in the summary ROC curve.
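A minimal sketch of that covariate-adjusted fit (Python/numpy assumed; S, D and Z are arrays with one entry per study, and the function name is hypothetical):

```python
import numpy as np

def fit_with_covariate(S, D, Z):
    """EWLS fit of D = A + B*S + C*Z; returns (A, B, C)."""
    S, D, Z = (np.asarray(v, dtype=float) for v in (S, D, Z))
    X = np.column_stack([np.ones_like(S), S, Z])   # design matrix [1, S, Z]
    coef, *_ = np.linalg.lstsq(X, D, rcond=None)
    return tuple(coef)                             # (A, B, C)

# Site i then gets its own intercept A + C*Z_i, with the common slope B.
```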


Confidence intervals

We have presented confidence zones for logistic ROC curves in the case that B = 0. They were derived in this way: if B = 0 then we estimate the ROC curve to be the image in ROC space of

D = ln{Q/(1 - Q)} - ln{P/(1 - P)} = Â

and that image is

Q = [1 + e^{-Â} {(1 - P)/P}]^{-1}.

Study of the expression above shows that Q is a strictly increasing function of A, for any given value of P between zero and one. So, if A lies between A_L and A_U (the ends of a confidence interval) then, at each P, Q lies between

[1 + e^{-A_L} {(1 - P)/P}]^{-1}  and  [1 + e^{-A_U} {(1 - P)/P}]^{-1}.

These are the formulae that produce the curves in Figures 9 and 10. The interpretation of the upper and lower curve is that if the ROC curve has the form

Q = [1 + e^{-A} {(1 - P)/P}]^{-1}

for some value of A, then with stated confidence (the confidence coefficient used in finding A_L and A_U) the true ROC curve lies between the two confidence curves.
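A minimal sketch of how the two bounding curves can be traced once A_L and A_U are in hand (Python/numpy assumed; the function name is hypothetical):

```python
import numpy as np

def roc_band(A_L, A_U, fpr):
    """Lower and upper confidence curves for the symmetric (B = 0) summary ROC."""
    fpr = np.asarray(fpr, dtype=float)
    odds = (1.0 - fpr) / fpr
    lower = 1.0 / (1.0 + np.exp(-A_L) * odds)
    upper = 1.0 / (1.0 + np.exp(-A_U) * odds)
    return lower, upper
```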

Just what may be the value and place of confidence intervals relating to summary ROC curves is not entirely clear. Does one want a confidence interval for sensitivity, given a value of specificity? Why? How about the other way around? Should one seek a confidence interval at each point on the curve in the direction normal to the curve (regrets in decisions using the ROC curve are related to excursions in roughly that direction)? Which of these (if any) actually coincide? We feel that the purposes and uses of confidence procedures in this context may need to be clarified before further progress on confidence procedures becomes important.

Additional comments

The size of the effect of the continuity correction surprised us. A smaller value than 1/2 may be a good idea, especially where the data lie very near the north-west corner (that is, Q* near 1.0). The choice of such correction influences results strongly. Maximum likelihood or minimum χ² methods sidestep this choice, but at the cost of much increased computational effort, and biases would still call for empirical investigation.

With very abundant data or in a very high level computing environment, other methods than those we suggest may be preferable.

We recognize that our policy of utilizing only points that do not lie outside the domain of practical application is controversial. There is unlikely to be any unassailable position on the matter, just as there is no uniformly best policy concerning what to do with outliers more generally: downweight? omit? include? All such approaches have their place, at times. It is our preference in this context to omit them rather than to depend on an ad hoc model to render them informative. Others may prefer to include, still others to look at data sets both ways.


APPENDIX: TECHNICAL NOTES

The model

One way of representing the ROC model is to postulate that there are two populations of scores x on the diagnostic test. Population 1, for the normal subjects, is characterized by a distribution function F with location parameter μ₁ and scale parameter σ₁; population 2, for the subjects with disease, is characterized by the same distribution function, but with parameters μ₂ and σ₂. We lose no generality in taking μ₂ > μ₁ and then saying that the diagnostic test is positive if x > y, that is if the score observed exceeds the test threshold y.

If the location and scale family of which F is a member is the logistic family, then we may write

P(y) = Pr(X > y | no disease) = [1 + exp{(y - μ₁)/σ₁}]^{-1},
Q(y) = Pr(X > y | disease) = [1 + exp{(y - μ₂)/σ₂}]^{-1}.

The logistic transforms of P(y) and Q(y), which we call U(y) and V(y), are

U(y) = -(y - μ₁)/σ₁,    V(y) = -(y - μ₂)/σ₂.

Evidently both U(y) and V(y) decrease as y increases. Also evidently, U and V satisfy the linear relation

V(y) = (μ₂ - μ₁)/σ₂ + (σ₁/σ₂) U(y),

which is of the form

V(y) = α + βU(y)

with α = (μ₂ - μ₁)/σ₂ and β = σ₁/σ₂. Notice the linear relation does not use the value of y, which only betokens the fact that U and V (and P and Q) relate to the same threshold. But whatever that threshold is, the logit transforms of P and Q satisfy

V = α + βU.

Estimation

Because straight lines are easy to work with, it is attractive to estimate the coefficients α and β in (U, V) space and then translate back to ROC space. Kardaun and Kardaun3 do just this; the fitting is somewhat difficult because both U and V are measured subject to error and the error distributions differ from study to study and must be estimated from the fit.

The approach we take is to construct

S = U + V and D = V - U .

Because U and V satisfy a linear relation, so do S and D; it is

D = 2α/(1 + β) + {(β - 1)/(β + 1)} S,

as is readily verified. We prefer to do our estimation in (S, D) space, using simple (but necessarily approximate) procedures. That is, we fit

D = A + BS


and then deduce the relation (1) between P and Q in ROC space as follows:

V - U = A + B(V + U)

V(1 - B) = A + U(1 + B)

V = A/(1 - B) + U(1 + B)/(1 - B)

ln{Q/(1 - Q)} = A/(1 - B) + {(1 + B)/(1 - B)} ln{P/(1 - P)}

Q/(1 - Q) = exp{A/(1 - B)} {P/(1 - P)}^{(1+B)/(1-B)},

whence Q = [1 + exp{-A/(1 - B)} {(1 - P)/P}^{(1+B)/(1-B)}]^{-1}.

This formula enables finding Q for given P, and thus the drawing of the ROC curve, once the values of A and B are in hand.

From the definitions of U and V we may write

V - U = y(1/σ₁ - 1/σ₂) + μ₂/σ₂ - μ₁/σ₁

V + U = -y(1/σ₁ + 1/σ₂) + μ₂/σ₂ + μ₁/σ₁.

Study of these equations shows that V + U is essentially a coded form of y; as y varies, V + U varies with y proportionately to 1/σ₁ + 1/σ₂. On the other hand V - U embodies a comparison between μ₁ and μ₂, and varies more slowly with y, proportionately with (1/σ₁) - (1/σ₂). If σ₁ and σ₂ are equal, V - U is independent of y; V - U would then plot as a horizontal straight line in (S, D) space and the height of that horizontal line, V - U = (μ₂ - μ₁)/σ, is actually the value of the log odds ratio, since

V - U = ln[Q/(1 - Q)] - ln[P/(1 - P)] = ln[Q(1 - P)/{(1 - Q)P}].

There are many methods available for combining several studies that arise from a common odds ratio (which would occur, for example, if in every study σ₁ = σ₂ were constant and μ₁ and μ₂ were as well, or less demandingly if, for all i, (μ₂ᵢ - μ₁ᵢ)/σᵢ was constant and σ₁ᵢ = σ₂ᵢ = σᵢ).

In the more general case, where the standard deviations of normal and diseased subjects are unequal, the slope of the line in ( S , D) space is not zero, and two parameters need to be fitted: A and B.

We have offered two methods of fitting the line in (S, D) space: equally weighted least squares (EWLS) and a robust-resistant fit. Both methods minimize vertical discrepancies - in the D direction. When σ₁ = σ₂ this is quite evidently correct; all the information about discrimination is in the odds ratio, which is independent of y. But when σ₁ ≠ σ₂ and so B ≠ 0, the log odds ratio D depends on y, which is encoded in and varies with S. Now we may be embarrassed by the facts that S is subject to error and that cov(S, D) at any point is not zero unless the standard errors of P̂ and Q̂ are equal at that point. Nevertheless we are, at least tentatively, comfortable using deviations in D, because the response we care about lies in the D direction, namely relates to the log odds ratio.

A consequence of restriction to a relevant region

One consequence of our using data points and reporting ROC results for only a subset of ROC space (the north-west quadrant) is that we must forswear access to that convenient one-number descriptor, the area under the ROC curve, which depends largely on the 'irrelevant' region.


We offer, for our purposes, another convenient one-number descriptor, the point of intersection of the ROC curve with the line P + Q = 1, which slopes from the north-west corner to the south-east corner. At that intersection, sensitivity and specificity are equal; we call their common value Q*. It, like the area under the entire ROC curve, is an indicator of how closely the ROC curve shoulders up to the desirable north-west corner. The formula for Q* is readily found by the following argument. The point we seek must lie on the line P + Q = 1, whence P = 1 - Q and therefore logit P = logit(1 - Q) = - logit Q. Hence U = - V, whence S = U + V = 0. The ROC curve in (S, D) space is D = A + BS, and the point on it where S = 0 has D = A, that is V - U = A. But because U = - V, we deduce 2V = A or ln{Q/(1 - Q)} = A/2, whence Q/(1 - Q) = e^{A/2} and Q = [1 + e^{-A/2}]^{-1}. This common value of Q (sensitivity) and 1 - P (specificity) we call Q*. Notice that Q* does not depend on B, the slope of the line in (S, D) space.

Because we can get an estimated standard error for A using equally weighted least squares (call it SE(Â)), we can also get an estimated standard error for Q* by applying the formula from the delta process:

SE(Q*) ≈ |dQ*/dA| SE(Â).

With Q* = [1 + e^{-A/2}]^{-1} it is straightforward to find dQ*/dA = Q*(1 - Q*)/2, so that SE(Q*) ≈ {Q*(1 - Q*)/2} SE(Â).
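A minimal sketch of the Q* calculation and its delta-method standard error (Python assumed; the function name is hypothetical):

```python
from math import exp

def q_star(A, se_A=None):
    """Q* (common value of sensitivity and specificity) and its delta-method SE."""
    q = 1.0 / (1.0 + exp(-A / 2.0))
    if se_A is None:
        return q
    return q, 0.5 * q * (1.0 - q) * se_A   # dQ*/dA = Q*(1 - Q*)/2
```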

Another consequence: bias

Discarding points outside the ‘relevant region’ removes some sample points with large FPR, and others with small TPR; this can be expected to introduce an optimistic bias into the estimated ROC curve. More careful argument shows that this is true. To gauge the size of this bias we undertake some simulations.

On the ROC curve for θ = 3.0 and σ₁ = σ₂ were chosen 17 points as equally spaced along the curve as feasible, with seven points inside the upper left quadrant, two at the ROC's crossings of the quadrant boundary, four outside the quadrant to the right, and four outside the quadrant below.

Each iteration comprised drawing samples (n₁ = n₂ = 10, 40 or 160) at each of the 17 points. To each of the observed frequencies a, b, c, d was added 1/2; then followed the fitting of ROC curves both by the robust-resistant algorithm (R-R) and by EWLS. In each iteration these fits were done under four different regimens. (i) All 17 points were used in the fitting. (ii) Only those points of the 17 samples that fell inside the north-west quadrant were used in the fitting. (iii) Just the samples from the nine points on the ROC in the quadrant (or on the boundary) were used. (iv) Of the samples originating from those nine points, only the sample points falling inside the quadrant (or on the boundary) were used in estimating the line in (S, D) space. Table VII displays the results. All the biases shown were negative. All were reduced when truncation to the upper quadrant was imposed. With the truncated samples the biases were all less than 0.15 (the estimand equals 3.0) and, except for n₁ = n₂ = 10, the truncated method produced biases whose squares make less than a 10 per cent contribution to the mean square error over the 200 replications. The table also shows some superiority with regard to bias of EWLS over R-R for this model. Not shown are the mean square errors, which again favour EWLS over R-R, with the former being roughly 30 per cent smaller.

When the same simulation study was done with σ₁/σ₂ = 2/3 the results were qualitatively similar.


Table VII. Simulation results of bias in estimating intercept via EWLS and robust-resistant regression. σ₁ = σ₂; log odds ratio = 3.0 = true intercept; 17 equally spaced points on the ROC curve furnished the data (entries are in units of 0.001). Truncated samples used only sample points inside the upper left quadrant.

                       n₁ = n₂ = 10       40       160
  R-R
    Obs 17                 -660          -192      -56
    Truncated              -147           -60      -22
    Obs 9                  -263          -102      -39
    Truncated              -139           -71      -23
  EWLS
    Obs 17                 -664          -173      -23
    Truncated               -82           -10      -11
    Obs 9                  -222           -62      -27
    Truncated               -95           -27      -12

Based on 200 repetitions. Where on average (bias)² exceeded one-tenth of MSE the entry is in italic type; where (bias)² exceeded one-half MSE the entry is in bold type.

ACKNOWLEDGEMENTS

We are indebted to many people for helpful comments on earlier drafts. We particularly thank Bradley Efron, Les Irwig, Frederick Mosteller, Dan Rabinowitz, Seema Sonnad and John W. Tukey.

REFERENCES

1. Swets, J. A. 'Sensitivities and specificities of diagnostic tests', Journal of the American Medical Association, 248, 548-549 (1982).

2. Inouye, S. K. and Sox, H. C. Jr. 'Standard and computed tomography in the evaluation of neoplasms in the chest', Annals of Internal Medicine, 105, 906-924 (1986).

3. Kardaun, J. W. P. F. and Kardaun, O. J. W. F. 'Comparative diagnostic performance of three radiological procedures for the detection of lumbar disk herniation', Methods of Information in Medicine, 29, 12-22 (1990).

4. Hall, W. J. Efficiency of Weighted Averages, Technical Report 92/02, Department of Statistics and Biostatistics, University of Rochester, 1992.

5. Cox, D. R. The Analysis of Binary Data, Methuen, London, 1970.

6. Kafadar, K. 'Robust-resistant line', in Kotz, S. and Johnson, N. L. (eds.) Encyclopedia of Statistical Sciences, vol. 8, 1988, pp. 169-170.

7. Emerson, J. D. and Hoaglin, D. C. 'Resistant lines for y versus x', in Hoaglin, D. C., Mosteller, F. and Tukey, J. W. (eds.) Understanding Robust and Exploratory Data Analysis, Chapter 5, 1983, pp. 129-165.

8. Haber, M. 'A modified exact test for 2 x 2 contingency tables', Biometrical Journal, 4, 455-463 (1986).

9. Mantel, N. and Haenszel, W. 'Statistical aspects of the analysis of data from retrospective studies of disease', Journal of the National Cancer Institute, 22, 719-748 (1959).

10. Mehta, C. R., Patel, N. R. and Gray, R. 'Computing an exact confidence interval for the common odds ratio in several 2 x 2 contingency tables', Journal of the American Statistical Association, 80, 969-973 (1985).

11. Mehta, C. R. and Walsh, S. J. 'Comparison of exact, mid-p, and Mantel-Haenszel confidence intervals for the common odds ratio across several 2 x 2 contingency tables', The American Statistician, 46, 146-150 (1992).

12. Pagano, M. P. and Tritchler, D. 'Algorithms for the analysis of several 2 x 2 contingency tables', Journal on Scientific and Statistical Computing, 4, 302-309 (1983).

13. Wolfe, R. A., Petroni, G. R., McLaughlin, C. G. and McMahon, L. F. Jr. 'Empirical evaluation of statistical models for counts or rates', Statistics in Medicine, 10, 1405-1416 (1991).

14. Williams, D. A. 'Extra-binomial variation in logistic linear models', Applied Statistics, 31, 144-148 (1982).

15. Pocock, S. J., Cook, D. G. and Beresford, S. A. A. 'Regression of area mortality rates on explanatory variables: what weighting is appropriate?', Applied Statistics, 30, 286-295 (1981).

16. Breslow, N. E. 'Extra-Poisson variation in log-linear models', Applied Statistics, 33, 38-44 (1984).