Diagnostic Scores for Interlaboratory Study Data

© 2000 by the American Society for Testing and Materials

REFERENCE: Proctor, C. H., “Diagnostic Scores for Interlaboratory Study Data,” Journal of Testing and Evaluation, JTEVA, Vol. 28, No. 4, July 2000, pp. 307–334.

ABSTRACT: The primary goal of an interlaboratory study (ILS), or of a collaborative study as it is also called, is to verify that a measurement method will work over many testing or analytic laboratories. The data produced and the calculations of the precision statement are well established in statistical standards. We suggest some calculations beyond the precision statement to deal with detecting laboratory and data point outliers. These calculations, as suggested by Mandel, were designed to mesh with those of the precision statement but to be applied when the data first arrive back to the study director. We apply these calculations to 11 sets of ILS data from a variety of test methods and show how to read the results for indications of problems with the test method and for locating deviant laboratories.

KEYWORDS: collaborative study, outliers, power transformation, measurement method

Interlaboratory studies are those covered by standards such as ASTM E 691-98 [1], which gives valuable advice on the conduct as well as the calculations involved. When a laboratory has developed a test method and has done ruggedness tests and tried it in comparison to existing methods at cooperating laboratories and after all seems promising, it may arrange for an interlaboratory study (an ILS). Samples from a range of materials are sent to a number of laboratories that agree to learn how to follow the protocol and to apply the method. Within a specified time the measurements, two or three on each of five or so materials from eight or so laboratories, come to the study director who calculates the repeatability and reproducibility standard deviations. These are deemed acceptable or not and the standard measurement protocol is published or not. This is the simple case scenario. However, most sets of ILS data are sufficiently complex and ambiguous to tax the statistical expertise of the study director and his advisors and the simplified scenario occurs only rarely. This paper aims to furnish some calculations that can overcome the complexities seen in ILS data.

When we joined Committee E11 (Quality and Statistics) of ASTM some years ago it was to work on bulk sampling, but we also took part in discussions concerning E 691 and its applicability to the many phases of ASTM’s overview of measurement methods. When John Mandel [2] outlined some additions to E 691 that he felt were needed, we implemented them in SAS code. We also added some features of our own and to verify the reasonableness of the suite of tests we acquired a collection of ILS datasets and ran the program on them. More recently we have saved output from this program into a single file and can begin to see better how the tests perform. Our objective is to review these results.

Protocols for Analysis of Interlaboratory Study Data

In addition to the ASTM standard [1] there are two other versions. One is from the International Standards Organization (ISO), designated ISO 5725-2 [3], and the other is from the Association of Official Analytical Chemists–International Union of Pure and Applied Chemistry (AOAC-IUPAC) [4,5]. In routine scientific work the norms of replication and independent verification of findings usually guarantee honesty and correctness of measurements. In more commercial applications some formal agreement to all the steps in making measurements seems to work better, and standards play a crucial role. E 691, as mentioned earlier, nicely describes how to arrange to conduct an ILS. The ISO 5725-2 standard defines concepts of accuracy, trueness, reproducibility, and repeatability. The AOAC-IUPAC guidelines give detailed procedures for concentrations studies to follow and the earlier protocol defines their term collaborative study (meaning ILS) and also distinguishes it from proficiency study (to characterize laboratory performances) and from certification study (to characterize reference materials). All three ILS standards agree on the precision calculations of reproducibility and repeatability standard deviations. They differ somewhat on how to judge outliers, as we will see.

ILS Model Formulation

Data from an ILS are quite simple. Duplicate test values, $D$ of them, are reported by each of $L$ laboratories on each of $M$ materials. There are just $n = DLM$ observations, with $n$ around 100 or 200. In ISO 5725-2 [3] the model appears as

$$y = m + B + \varepsilon \tag{1}$$

for a particular material where $m$ is a constant, “the general mean,” $B$ is a random “laboratory component of bias,” and $\varepsilon$ is the “random error.” This model is a series of one-way random effects models. When we use $i = 1, 2, \ldots, L$; $j = 1, 2, \ldots, M$; and $k = 1, 2, \ldots, D$ as indices, then the model becomes

$$y_{ijk} = m_j + B_{ij} + \varepsilon_{ijk}, \quad \text{for } j = 1, 2, \ldots, M \tag{2}$$

The standard assumptions are zero means, normal distributions, and independence for the $B$’s and $\varepsilon$’s. Differences in variances over materials are allowed, but not by laboratories. This model gives rise to the precision parameters as two sets, each of $M$ variance components. One set is the laboratory variance components, the $\sigma_{jB}^2$, and the other set is the error components, the $\sigma_{j\varepsilon}^2$. The corresponding ANOVA estimates may be written $s_{jB}^2$ and $s_{j\varepsilon}^2$, where if $s_{jB}^2$ turns out to be negative we replace it with 0. The square roots of the $s_{j\varepsilon}^2$ quantities, the $s_{j\varepsilon}$, are written $s_r$ and called the repeatability standard deviations, while the square roots of the $(s_{jB}^2 + s_{j\varepsilon}^2)$ quantities are written $s_R$ and called the reproducibility standard deviations. A precision statement consists of a listing of the materials with the average level of the observations for each material along with the $s_r$ and the $s_R$ quantities.

Charles H. Proctor, 1110 Blenheim Drive, Raleigh, NC 27612. Manuscript received 01/18/00; accepted for publication 4/20/00.

Although it does not pay generally to observe all the niceties of expectations of mean squares behind the repeatability and reproducibility variances, we do wish to comment briefly on what happens to them when there is violation of what are called “repeatability conditions.” ISO 5725-2 states that “Each group of n (here read “D”) measurements belonging to one level shall be carried out under repeatability conditions, i.e., within a short interval of time and by the same operator and without any intermediate recalibration of the apparatus unless this is an integral part of performing a measurement.” Not all groups doing an ILS observe this prohibition, and some opt to introduce extra factors such as batches or days. It turns out that if one ignores extra sources of variance in the calculations then the estimate of total variance, $s_R^2 = s_{jB}^2 + s_{j\varepsilon}^2$, remains unbiased for total variance. Thus, $s_R$ is still the correct reproducibility standard deviation. Of course, $s_r$ estimates a mix of added sources of duplication variance but one simply keeps that in mind.
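The precision calculation for a single material can be sketched as follows. This is a minimal illustration in Python rather than the SAS code of Appendix A; it assumes a balanced layout (every laboratory reports the same number of duplicates), and the function name `precision_for_material` and the sample numbers are hypothetical.

```python
from statistics import mean


def precision_for_material(cells):
    """Return (s_r, s_R) for one material.

    cells is a list of L lists, each holding the D duplicate results
    reported by one laboratory (balanced layout assumed).
    """
    L = len(cells)
    D = len(cells[0])
    cell_means = [mean(c) for c in cells]
    grand = mean(cell_means)
    # Within-laboratory mean square: the repeatability variance.
    ms_within = sum(
        (y - m) ** 2 for c, m in zip(cells, cell_means) for y in c
    ) / (L * (D - 1))
    # Between-laboratory mean square from the cell means.
    ms_between = D * sum((m - grand) ** 2 for m in cell_means) / (L - 1)
    # ANOVA estimate of the laboratory component, replaced by 0 if negative.
    s2_lab = max(0.0, (ms_between - ms_within) / D)
    s_r = ms_within ** 0.5
    s_R = (s2_lab + ms_within) ** 0.5
    return s_r, s_R


# Three laboratories, two duplicates each (made-up values).
s_r, s_R = precision_for_material([[10.0, 10.2], [10.6, 10.4], [9.8, 10.0]])
print(round(s_r, 4), round(s_R, 4))
```

Repeating the call for each material and listing the material averages beside the two standard deviations gives the precision statement described above.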

Although all three standards caution that the measurement method should be well established, it usually happens that one or more laboratories find problems in using it. In addition, a certain number of very large or very small observations seems inevitable. One also often finds gradual and general departures from normality in the distribution of errors as skewness (asymmetric tails) and kurtosis (long or short tails). The model, Eq 2, contains an additive laboratory bias term, $B_{ij}$, but often the laboratories also differ in the way they track the materials. Putting in model terms for each laboratory that are linear in the $m$ terms allows them to represent multiplicative bias, while quadratic, cubic, etc., terms can, of course, be adjoined as well.

When laboratories are viewed individually, then laboratories become a fixed effect. By supposing that $\sum_i B_{ij} = 0$ and that $\sum_i B_{1ij} = 0$, when summations are over the laboratories, the following is a model for fixed additive and multiplicative laboratory effects and is useful for laboratory diagnostics.

$$y_{ijk} = B_{ij} + (1 + B_{1ij})\, m_j + \varepsilon_{ijk}, \quad \text{for } j = 1, 2, \ldots, M \tag{3}$$

In future work based on this model, the multiplicative biases are denoted slopes and the variances are called scatters, while the additive effects will be termed means. The quantities called additive effects or means in Eq 3 have also been called biases, but perhaps that terminology is a bit too strong. A quantity worthy to be called a bias should also be correctable, we feel. In most cases of ILS data the fact that some laboratory reports generally high values is no indication that it will do so over time. More commonly, a change of reference solution or of operator will shift the mean. That is, the quantities called “laboratory biases” are transitory quantities. They are taken to be set and “fixed” for the diagnostics phase of analysis, but are part of random measurement effects when it comes time to evaluate the measurement method.

Finding an adequate model for the observations may require higher order effects, non-normal errors, and dropping some outlying data points. However, it can happen that the model for transformed observations $x$, where $x = y^p$ and $p$ is a judiciously chosen power in the range roughly from $-2$ to $+2$ with zero representing the log, may not require such modeling complications. If the units for $x$ are acceptable, the standard may even be rewritten to be a measurement method for $x$ values. In any case, use of the $x$ values seems preferable for assessing outliers where the tests use significance points based on assumptions of normally distributed errors.

When data first arrive into the study director’s files they are scanned for outliers by laboratories and by individual observations. At this time laboratories are taken as fixed, and one can test for homogeneity of their error variances, of their tracking slopes, and of their additive effects and deal with specific laboratories with regard to their large or small observations. Incidentally, it is unwise ever to question a specific observation unless everybody is being asked about his or her observations. There may be a tendency for any laboratory whose data are questioned to wish to withdraw them, but this would prejudice the study and must be avoided. When everyone is asked to check for misplaced decimals the suspicious data point may well get changed. Reviewing how the study went with all participants may elicit some disconformities and lead to a legitimate withdrawal of some extreme observations at one laboratory. Once the data are declared legitimate, then laboratory effects become random as in Eq 2, and the precision statement is issued.

More comprehensive model selection calculations based on Eq 3 have been discussed by Fuller [6] and also in Geongbae Jeon’s thesis [7]. These approaches employ maximum likelihood methods and are not simple to program. They hold great promise and we await their trial on the collection of ILS data sets we have assembled. The SAS code that we wrote early on to implement Mandel’s suggestions [2] plus an outlier routine that includes transformation to $x$ values as powers of $y$ is copied as an Appendix in Geongbae Jeon [7]. The more recent version is the following and the SAS code for it is in Appendix A to this paper.

The ILS Calculations

In view of the many ways that ILS data might fail to conform to model assumptions, it seemed best to choose test statistics and tests that gave reasonable results for those sets of data we had. The following calculations thus cannot be shown to be best either from statistical theory or from simulations, but they seemed to work. For the first three aspects of slopes, scatters, and means, we begin with the cell means, the $\bar{Y}_{ij} = \sum_{k=1}^{D} Y_{ijk}/D$ for $i = 1, 2, \ldots, L$ and $j = 1, 2, \ldots, M$, and then compute the material means $\bar{Y}_j = \sum_{i=1}^{L} \bar{Y}_{ij}/L$. Now regress, for each laboratory, the cell means on the material means. These are the basic steps outlined by Mandel [2].

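We then use the slopes and the error variances from these regressions.

$$b_i = \frac{\sum_{j=1}^{M} (\bar{Y}_{ij} - \bar{Y}_i)(\bar{Y}_j - \bar{Y})}{\sum_{j=1}^{M} (\bar{Y}_j - \bar{Y})^2}, \quad \text{for } i = 1, 2, \ldots, L \tag{4}$$

where $\bar{Y}_i = \sum_{j=1}^{M} \bar{Y}_{ij}/M$ and $\bar{Y} = \sum_{i=1}^{L} \bar{Y}_i/L$.

$$s_{ei}^2 = \left[\sum_{j=1}^{M} (\bar{Y}_{ij} - \bar{Y}_i)^2 - b_i^2 \sum_{j=1}^{M} (\bar{Y}_j - \bar{Y})^2\right] \Big/ (M - 2), \quad \text{for } i = 1, 2, \ldots, L \tag{5}$$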
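The regressions of Eq 4 and Eq 5 can be sketched in a few lines. This is an illustrative Python version, not the SAS program of Appendix A; the function name `lab_regressions` and the input layout (one row of cell means per laboratory) are assumptions, and the residual sum of squares is taken about each laboratory's own mean, as in an ordinary simple regression.

```python
def lab_regressions(cell_means):
    """Slopes b_i (Eq 4) and error variances s2_ei (Eq 5) by laboratory.

    cell_means[i][j] is the cell mean of laboratory i on material j.
    """
    L = len(cell_means)
    M = len(cell_means[0])
    material_means = [sum(row[j] for row in cell_means) / L for j in range(M)]
    grand = sum(material_means) / M
    sxx = sum((m - grand) ** 2 for m in material_means)
    slopes, scatters = [], []
    for row in cell_means:
        lab_mean = sum(row) / M
        sxy = sum((y - lab_mean) * (m - grand)
                  for y, m in zip(row, material_means))
        syy = sum((y - lab_mean) ** 2 for y in row)
        b = sxy / sxx
        slopes.append(b)
        # Residual sum of squares of the regression on M - 2 degrees of freedom.
        scatters.append((syy - b * b * sxx) / (M - 2))
    return slopes, scatters


# Three hypothetical laboratories that track four materials with different slopes.
slopes, scatters = lab_regressions([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [3.0, 6.0, 9.0, 12.0],
])
print(slopes)
```

Because the material means are themselves averages of the cell means, the slopes average to 1 over the laboratories, which is why the slope test that follows measures departures from 1.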



Because there are usually differences in error variances by materials, the distributions of the $s_{ei}^2$ will not be easy to describe. Nonetheless, it is only the departure of the largest of them from their average that we need to test and the usual procedure gives good results. That is, one calculates the ratio of the largest over their sum and refers to the distribution that Cochran [8] derived for this statistic, called $g$, where $g = \max\{s_{ei}^2\}/\sum s_{ei}^2$. A tail probability of 0.01 seems to signal the notable cases.

Tests for the slopes and means were devised in a similar format, since both are linear combinations of the error terms. The test statistics are

$$q_{mi} = 1.5\,|\bar{Y}_i - \bar{Y}|\,\sqrt{M}/s_e, \quad \text{for } i = 1, 2, \ldots, L, \tag{6}$$

and

$$q_{bi} = 1.5\,|b_i - 1|/s_b, \quad \text{for } i = 1, 2, \ldots, L, \tag{7}$$

where $s_b^2 = s_e^2/\sum_{j} (\bar{Y}_j - \bar{Y})^2$ and $s_e^2 = \sum_{i} s_{ei}^2/L$. Both $q$ statistics are to be referred to the studentized range distribution for $L$ cases and for $M - 2$ degrees of freedom. The factor 1.5 was found helpful in boosting the number of signals at the 0.01 level to correspond to what we saw to be a “reasonable” number of cases of deviant slopes or means out of line.
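The three test statistics can be computed from the laboratory means, material means, slopes, and error variances as sketched below. This Python fragment is only an illustration with hypothetical names and made-up inputs; it stops at the statistics themselves, since the tail areas require Cochran's distribution and the studentized range distribution, which are not part of the Python standard library.

```python
from math import sqrt


def diagnostic_statistics(lab_means, material_means, slopes, scatters):
    """Return Cochran's g and the lists of q_m (Eq 6) and q_b (Eq 7)."""
    L = len(lab_means)
    M = len(material_means)
    grand = sum(material_means) / M
    sxx = sum((m - grand) ** 2 for m in material_means)
    s2_e = sum(scatters) / L          # pooled error variance
    s_e = sqrt(s2_e)
    s_b = sqrt(s2_e / sxx)            # standard error of a slope
    g = max(scatters) / sum(scatters)
    q_m = [1.5 * abs(m - grand) * sqrt(M) / s_e for m in lab_means]
    q_b = [1.5 * abs(b - 1.0) / s_b for b in slopes]
    return g, q_m, q_b


# Hypothetical inputs for three laboratories and three materials.
g, q_m, q_b = diagnostic_statistics(
    lab_means=[9.0, 10.0, 11.0],
    material_means=[8.0, 10.0, 12.0],
    slopes=[0.9, 1.0, 1.1],
    scatters=[0.02, 0.02, 0.08],
)
print(round(g, 3), [round(q, 2) for q in q_m], [round(q, 2) for q in q_b])
```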

The 1% points of the $q$ and $g$ distributions are given in the Dixon and Massey textbook [9] and all computations are well within the capability of the hand-held calculator. When we entered these formulas into an SAS program, the output was recorded as the three one-tailed P-values denoted $P_{0bi}$, $P_{0gi}$, and $P_{0mi}$ for the slopes, scatters, and means, respectively.

The next step was to find a reasonable power of transformation $p$ for producing transformed observations $x = (y + y_0)^p$. The objective now is to attain constant error variance along the range of materials. Since $p = 0$ corresponds to the log transform, a shift in origin is invoked if needed to bring all observations above zero. The constant $y_0$ is left at zero if all observations are greater than zero. If the minimum observation is zero then $y_0$ is set to half of the distance to the next largest value. If the minimum is negative then add the absolute value of the minimum to all observations and treat the case as for a minimum of zero. Now we are ready to follow Anscombe’s procedure [10] for computing $p$. First regress the variances of cell means on the material means to get the slope, called $h$, and then solve for $p$ as

$$p = 1 - \frac{1}{2}\, h\, \bar{Y} \Big/ \left(\sum_{j=1}^{M} s_j^2 / M\right) \tag{8}$$

where

$$s_j^2 = \sum_{i=1}^{L} (\bar{Y}_{ij} - \bar{Y}_j)^2/(L - 1), \quad \text{for } j = 1, 2, \ldots, M \tag{9}$$

and

$$h = \sum_{j=1}^{M} (\bar{Y}_j - \bar{Y})\, s_j^2 \Big/ \sum_{j=1}^{M} (\bar{Y}_j - \bar{Y})^2 \tag{10}$$

Once the observations have been transformed, repeat the calculations for means, slopes, and scatters using the $x$’s. The resulting P-values may be denoted $P'_{0bi}$, $P'_{0mi}$, and $P'_{0gi}$.
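The shift rule and the power of Eq 8 to Eq 10 can be sketched as follows. The Python below is an illustration only: the function names are hypothetical, the fallback used when all observations are equal is our own assumption, and the cell means in the example are made up.

```python
from math import log


def shift_constant(values):
    """Shift y0 that brings all observations above zero."""
    lowest = min(values)
    if lowest > 0:
        return 0.0
    # Move the minimum to zero, then add half the gap to the next larger value.
    shifted = sorted(set(v - lowest for v in values))
    half_gap = shifted[1] / 2 if len(shifted) > 1 else 1.0  # fallback is an assumption
    return -lowest + half_gap


def anscombe_power(cell_means):
    """Power p of Eq 8 from cell_means[i][j] (laboratory i, material j)."""
    L = len(cell_means)
    M = len(cell_means[0])
    material_means = [sum(row[j] for row in cell_means) / L for j in range(M)]
    grand = sum(material_means) / M
    # Eq 9: variance of the cell means for each material.
    variances = [
        sum((row[j] - material_means[j]) ** 2 for row in cell_means) / (L - 1)
        for j in range(M)
    ]
    # Eq 10: slope of the variances on the material means.
    h = sum((m - grand) * v for m, v in zip(material_means, variances)) / sum(
        (m - grand) ** 2 for m in material_means
    )
    return 1.0 - 0.5 * h * grand / (sum(variances) / M)


def power_transform(y, p, y0=0.0):
    """x = (y + y0)^p, with p = 0 standing for the log."""
    return log(y + y0) if p == 0 else (y + y0) ** p


print(shift_constant([-1.0, 1.0, 3.0]))
print(anscombe_power([[1.8, 5.4, 9.0], [2.2, 6.6, 11.0]]))
```

When the variances do not change with the material means the slope $h$ is zero and $p = 1$, so the data are left untransformed; a spread that grows with the mean pulls $p$ down toward the log.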

The three standards mentioned above give somewhat different advice for handling outliers. ASTM E 691 uses differences between cell means and material means (the h statistics) for one set of tests and uses cell variances (the k statistics) for another. The ISO standard calls these the Mandel h and k statistics and also offers Cochran’s test for cell variances and Grubbs’ tests for single and pairs of observations. These last two are those used by the AOAC-IUPAC protocol. The Cochran’s test is the same in all. Testing individual observations with the Grubbs’ test seems preferable to using cell means or laboratory means, but the pairing device for Grubbs’ test we find problematic. The one case in which we saw it active was for an ILS with $D = 2$, and it picked two observations in the same cell.

We believe Cochran’s criterion is better exploited for the laboratory average of error variances, which we have captured with the $g$ statistic. For outliers we have used external studentized residuals after transformation on individual observations for some time [11] and find it serves well here. Both of the Grubbs’ tests and the Mandel h statistics use separate error variances for the separate materials and thus suffer from reduced degrees of freedom. Checking for outliers after transforming allows us to use a pooled error variance based on more degrees of freedom.

Our computational routine for detecting outliers uses a simple one-way material means model. Each observation is dropped in turn, and, using the rest of the data, one computes the material mean and uses the deviation in a standardized residual with tail area from the Student’s-t distribution multiplied by the square root of n. We write out “square root” so it will not be missed. This makes the 0.01 level perform correctly in drawing attention to deviant observations. The most deviant observation is dropped if its signal is below 0.01, and then the Anscombe power is calculated and applied if its h value is significant at the 0.05 level. With the newly transformed observations we check for outliers and drop those beyond the 0.01 signal and retransform if called for. The final result of cycling through these steps is a power and a list of outliers.
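A single pass of the drop-one scan might look like the sketch below. It is a hedged Python illustration, not the routine of Appendix A: the names are hypothetical, the tail area is taken as two-sided (the text does not say which), the Student's-t tail is obtained by simple numerical integration because the standard library has no t distribution, and the cycling between dropping and transforming is left out.

```python
from math import exp, lgamma, pi, sqrt


def t_upper_tail(t, df, steps=2000):
    """P(T > |t|) for Student's t, by Simpson's rule on the density."""
    t = abs(t)
    const = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)

    def pdf(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    width = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * width)
    return max(0.0, 0.5 - total * width / 3)


def outlier_signals(data):
    """Return (material, index, t, signal) for every observation.

    data maps each material to the list of all its observations.
    """
    n = sum(len(values) for values in data.values())
    df = (n - 1) - len(data)  # one point dropped, one mean per material
    results = []
    for material, values in data.items():
        # Residual sum of squares contributed by the other materials.
        other_ss = 0.0
        for name, vals in data.items():
            if name != material:
                m = sum(vals) / len(vals)
                other_ss += sum((v - m) ** 2 for v in vals)
        for idx, y in enumerate(values):
            rest = values[:idx] + values[idx + 1:]
            rest_mean = sum(rest) / len(rest)
            ss = other_ss + sum((v - rest_mean) ** 2 for v in rest)
            s = sqrt(ss / df)  # pooled error standard deviation
            t = (y - rest_mean) / (s * sqrt(1 + 1 / len(rest)))
            # Two-sided tail area (an assumption) times the square root of n.
            signal = 2 * t_upper_tail(t, df) * sqrt(n)
            results.append((material, idx, t, signal))
    return results


data = {
    "A": [10.0, 10.1, 9.9, 10.2, 9.8, 14.0],
    "B": [20.0, 20.1, 19.9, 20.2, 19.8, 20.0],
}
print([(m, i) for m, i, t, sig in outlier_signals(data) if sig < 0.01])
```

In this made-up example only the value 14.0 is signaled; the remaining observations of material A look unremarkable because the suspect value inflates the error variance whenever it stays in the data.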

In the choice of outlier detection methods there is just “no accounting for tastes.” Our method of transforming and dropping cyclically seems to give quite similar results to those found with robust regression (for example, Terbeck and Davies [12]). This, of course, should be the case. That is, robust methods down-weight extremes, while we drop them. Our choice of the action level as a “significance signal” of 0.01 or less is based on experience with the collection of ILS data and many other sets, and we invite others to compare their results with ours. One can simulate outlier mechanisms endlessly without being sure that one approaches reality. We prefer watching persons familiar with the subject matter ponder the specific signaled observations.

Summary Scores

Having conducted tests for slopes, scatters, and means by laboratories and for outliers over the whole set of data, we now need to summarize the results. We do this by calculating a diagnostic score for each laboratory. The reader should also consult Uhlig and Lischer [13], who compute scores based largely on outliers that were designed for proficiency studies of concentrations in which the objectives are to assess laboratories, while the present scheme is for use with any type of ILS and during the review of the data phase. The diagnostic scores are computed by adding contributions from the various types of deviance. An outlier with a significance signal less than 0.01 from a laboratory is counted in that lab’s score. Any P-value less than 0.01 for a slope, scatter, or means test also adds 1 to the score of the lab involved. Any $P'$-value less than 0.01, which reflects a test after transforming, was deemed a more serious flaw and counted 3.

These contributions were collected by laboratories to give the scores called S1, S2, etc. in Table 1. They were also collected by the four sources (slopes, scatters, means, and outliers) to give the quantities called DIAGNOS1 to DIAGNOS4 for characterizing the method. Finally the total of contributions was subtracted from a maximum possible and divided by that maximum to give PCT_COND, a percent condition summary quantity for comparing across studies. As a maximum sum of all scores we take $(1 + 1 + 1 + 3 + 3 + 3)$ times the number of laboratories plus 0.08 times $n$. That is, we take 8% as the maximum rate of outliers among the data points and then add the pre-transform scores, the 1’s, for the three kinds of departures to the post-transform scores, the 3’s.
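The scoring rules translate directly into a short routine. The Python sketch below is illustrative only: the function name, the dictionary layout of the P-values, and the sample numbers are hypothetical, and each signaled outlier is assumed to add 1 to its laboratory's score.

```python
SOURCES = ("slope", "scatter", "mean")


def diagnostic_scores(p_before, p_after, outlier_labs, n_obs, level=0.01):
    """Return (scores by laboratory, totals by source, PCT_COND).

    p_before[i] and p_after[i] are dicts of P-values for laboratory i,
    keyed by the names in SOURCES, before and after transformation.
    outlier_labs lists the laboratory index of each signaled outlier.
    """
    n_labs = len(p_before)
    scores = [0] * n_labs
    by_source = {name: 0 for name in SOURCES}
    by_source["outlier"] = 0
    for lab in range(n_labs):
        for name in SOURCES:
            points = 0
            if p_before[lab][name] < level:
                points += 1   # pre-transform signal counts 1
            if p_after[lab][name] < level:
                points += 3   # post-transform signal counts 3
            scores[lab] += points
            by_source[name] += points
    for lab in outlier_labs:
        scores[lab] += 1      # assumed weight of 1 per outlier
        by_source["outlier"] += 1
    maximum = (1 + 1 + 1 + 3 + 3 + 3) * n_labs + 0.08 * n_obs
    pct_cond = 100.0 * (maximum - sum(scores)) / maximum
    return scores, by_source, pct_cond


# Two hypothetical laboratories and 50 observations.
quiet = {"slope": 0.5, "scatter": 0.5, "mean": 0.5}
scores, by_source, pct_cond = diagnostic_scores(
    p_before=[{"slope": 0.5, "scatter": 0.005, "mean": 0.2}, quiet],
    p_after=[{"slope": 0.5, "scatter": 0.001, "mean": 0.2}, quiet],
    outlier_labs=[0],
    n_obs=50,
)
print(scores, by_source, round(pct_cond, 1))
```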

Interpreting the Scores

None of the studies in Table 1 are concentrations and are thus a fairly heterogeneous mix of measurement methods. If we pretend we are examining these results as a study director would, who had just received them, the first step would be to look at the outliers signaled to see if any mistakes had occurred in copying the data. Suppose all looks correct. We would next ask if there is evidence of outlier laboratories. We suggest looking at PCT_COND, and only if this summary quantity is below 95% do we consider there could be a laboratory outlier. That is, with good condition (i.e., PCT_COND above 95%) any discrepancy could easily be a random aberration. That lets out Studies 7 and 10 and possibly 3. If we admit Study 3 as possibly having an outlying lab, then one next looks at G1 and G2, the Fisher’s shape coefficients calculated on the quantities S1, S2, etc. We are now using G1 and G2 as indicators of outliers, which they have been found to be (Ferguson [21]).

TABLE 1: Non-concentrations studies information that fits on one line for each study.

Since both G1 and G2 are zero for normally distributed scores, we will divide each by its approximate standard deviation to get test statistics. Formulas for G1 and G2 as well as for their standard deviations deserve to be exhibited to acknowledge their author, R. A. Fisher [22]. We use $n$ for historical reasons to refer to the number of observations, but $L$ should have been used for present purposes. We also write $d_i$ for deviations of the scores S1, S2, etc. from their mean. The formulas are:

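$$g_1 = k_3/k_2^{3/2} \quad \text{and} \quad g_2 = k_4/k_2^2 \tag{11}$$

where

$$k_2 = \frac{\sum d_i^2}{n - 1}, \quad k_3 = \frac{n \sum d_i^3}{(n - 1)(n - 2)}, \quad \text{and} \quad k_4 = \frac{n\left[(n + 1) \sum d_i^4 - 3(n - 1)\left(\sum d_i^2\right)^2/n\right]}{(n - 1)(n - 2)(n - 3)} \tag{12}$$

and

$$\mathrm{SE}(g_1)^2 = \frac{6n(n - 1)}{(n - 2)(n + 1)(n + 3)} \quad \text{and} \quad \mathrm{SE}(g_2)^2 = \frac{24n(n - 1)^2}{(n - 3)(n - 2)(n + 3)(n + 5)} \tag{13}$$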

The test statistics $t_1 = g_1/\mathrm{SE}(g_1)$ and $t_2 = g_2/\mathrm{SE}(g_2)$ are used to give tail areas under the standard normal distribution, and the 0.05 level seems about right for the significance point.
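Fisher's shape coefficients and their tail areas can be sketched as below. This Python fragment is an illustration with a hypothetical function name and made-up scores; it uses the standard error of $g_1$ in Fisher's usual form, $6n(n-1)/[(n-2)(n+1)(n+3)]$, and it assumes the upper tail is the one of interest, since a deviant laboratory shows up as one unusually high score.

```python
from math import erfc, sqrt


def shape_tests(scores):
    """Return (g1, g2, P1, P2) for a list of at least four scores."""
    n = len(scores)
    mean = sum(scores) / n
    d = [s - mean for s in scores]
    sum2 = sum(x ** 2 for x in d)
    sum3 = sum(x ** 3 for x in d)
    sum4 = sum(x ** 4 for x in d)
    # Fisher's k-statistics.
    k2 = sum2 / (n - 1)
    k3 = n * sum3 / ((n - 1) * (n - 2))
    k4 = n * ((n + 1) * sum4 - 3 * (n - 1) * sum2 ** 2 / n) / (
        (n - 1) * (n - 2) * (n - 3)
    )
    g1 = k3 / k2 ** 1.5
    g2 = k4 / k2 ** 2
    se1 = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se2 = sqrt(24 * n * (n - 1) ** 2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))

    def upper_tail(z):
        # Upper-tail area of the standard normal distribution.
        return 0.5 * erfc(z / sqrt(2))

    return g1, g2, upper_tail(g1 / se1), upper_tail(g2 / se2)


# Eight hypothetical laboratory scores, one of them far above the rest.
g1, g2, p1, p2 = shape_tests([0, 0, 1, 0, 1, 0, 12, 1])
print(round(g1, 2), round(p1, 4))
```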

For Study 3 they are both highly significant as can be seen in the P10 and P20 values. That is because there is one laboratory that shows all the discrepancies: Lab No. 4. Studies 4 and 9 also give clear indications of having outlying labs: Lab 11 in Study 4 and Lab 2 in Study 9. Significances for G1 and G2 are absent for Studies 5 and 6, but another three studies (1, 2, and 8) do have G1 significances near 5%. These borderline cases must be treated as such and the study director will earn his salary as he reviews the putative outlying laboratories’ report of their experiences with the method. Study 11 shows poor condition overall, and there is one good performing lab with one discrepant one that causes significance from G2, but no outlier. All of these characterizations may sound profound, but they need to be checked against what a study director, under advice from his statistical colleagues, would indeed have discovered just from patient review of the numbers.

Although percent condition is rather low for some of the methods in Table 1, the percent reproducibilities (PCTRPD’s) are all under 20%. Of course, they should all be less than 5%, or certainly less than 10%, but this is an imperfect world. The 19.4% value for PCTRPD comes from a celebrated study of Bekk Smoothness for paper [15]. Notice here the high value for DIAGNOS1 = 25 in Study 2. This is for slopes and the authors have speculated that humidity differences may affect the test. These kinds of data can be used to point out the necessity for the multiplicative bias term in Eq 3. For all 11 studies the largest contributor to the scores is DIAGNOS3 or differences in laboratory means, as might be expected, but next is slopes and lastly scatters.

Computational Considerations

Further details, on which labs showed which types of discrepancies, how the errors change over materials, and where the outliers are, can be read from other parts of the output. Such information is almost too plentiful to present, but let’s look at the detailed findings for one ILS, the bekk3.txt dataset. The original study (as described in Mandel and Lashof [15]) called for (and received) 8 duplicates from 14 materials and 14 laboratories, but the authors reported only the cell standard deviations and not the eight values. Thus we created two observations in each cell to satisfy the standard deviations, and the data appear in Table 2 with the detailed statistics for each material in Table 3.

TABLE 2: Data as stored in text file. The title appears on line 1 as: “Bekk smoothness, Table II, Mandel and Lashof ASTM Bulletin 1959.” The codes for this set of data appear on line 2 as: “4 14 14 2.” The first code digit 4 indicates that duplicates are stacked, rows are materials, and columns are laboratories. The other digits tell that there are 14 labs and 14 materials and 2 duplicates.

Notice, in Table 2, the indications concerning standard format for data from an ILS that were introduced in the computer software program of ASTM E 691 (User’s Guide [19]). The first code number allows for four types of layout: 1 = duplicates in line, labs as rows and materials as columns, 2 = duplicates in line, materials as rows, 3 = duplicates stacked, labs as rows and 4 = duplicates stacked and materials as rows. Thus, in order to verify the calculations you will first average vertical pairs to get the cell means. That is, the average of 6.00 and 6.75 gives 6.375, which is the cell mean reported in Table II of the original article (Mandel and Lashof [15]) for Laboratory 1 and the first material (No. 2 it is called there). The cell standard deviation was listed as 1.14. The authors then transformed the data by taking logs, which was a popular device at the time, and can be compared to the PWR and POWER values reported in Table 2, which were 0.292 and 0.009. That is, a simple regression of material variances on material means leads to PWR = 0.292, which is more severe than a square root transform (at 0.5) but not so severe as a log transform (at 0.0). However, by dropping outliers and using the regression of squared residuals on fitted values we would find POWER = 0.009 in agreement with the log transform.
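Reading a file in layout 4 and averaging the stacked duplicates can be sketched as follows. This is a hypothetical Python reader, not part of the ASTM E 691 software: it assumes whitespace-separated numbers, a code line ordered as layout, labs, materials, duplicates, and D consecutive rows per material. Apart from the pair 6.00 and 6.75 quoted above, the sample values are made up.

```python
def read_ils_text(text):
    """Return (title, cell_means) for a layout-4 file.

    cell_means[j][i] is the mean for material j and laboratory i.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    title = lines[0]
    layout, n_labs, n_materials, n_dups = (int(tok) for tok in lines[1].split())
    if layout != 4:
        raise ValueError("only layout 4 (stacked duplicates, materials as rows) is sketched")
    rows = [[float(tok) for tok in ln.split()] for ln in lines[2:]]
    cell_means = []
    for j in range(n_materials):
        # The duplicates of one material sit in consecutive rows.
        block = rows[j * n_dups:(j + 1) * n_dups]
        cell_means.append(
            [sum(r[i] for r in block) / n_dups for i in range(n_labs)]
        )
    return title, cell_means


sample = """Demo data, two labs and two materials
4 2 2 2
6.00 7.0
6.75 9.0
1.0 2.0
3.0 4.0
"""
title, cell_means = read_ils_text(sample)
print(cell_means)
```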

Laboratory 6 is the one out of line with a borderline indicator of an outlier (i.e., P10 = 0.054). In the original article this laboratory and also Lab 4 were cited as outliers and the reason was that they were located overseas where humidity would be higher. Now the study director has the problem of deciding whether to remove the one or both. After removing just Laboratory 6 the PCTRPD (percent reproducibility) drops to 10.5% from the initial 19.4% so one can see what drastic effects dropping data have on the precision estimate. With both Labs 4 and 6 gone, the percent reproducibility at the average material becomes 12.3%. Now PCT_COND sits at 75% and both slopes and means are still showing large discrepancy scores but there are no outliers among the individual observations. Thus, the method has its problems even when restricted to domestic labs.

TABLE 3: Details of the laboratory diagnostic statistics for Bekk smoothness study.

Source (studies of Table 1): Studies 1, 4, 6, and 9 are in ASTM [14] from the following standards: (D 4467-94, Table A1.9), (D 4483-96, Table A7.2), (D 3980-88, Table 7), and (C 802-96, Table X1.2). Study 2 is from Mandel and Lashof ([15], Table II). Study 3 is from Crowder, M. ([16], Table 1). Study 5 is from Mandel ([17], Table IV). Study 7 is from ISO ([3], Table B.6). Study 8 is from Mandel ([18], Table 8.1). Study 10 is from ASTM ([19], Fig. 3). Study 11 is from Jaech ([20], Table 6.16). All data are in Appendix B to this paper.

Although we obtained an “average material” mean by simply averaging the $M$ material means, we used interpolation in powers of material means for getting the overall $s_R$ and $s_r$ values. With apologies for over-elaborating a rather trivial operation, here are the details. Regress the squares of $s_R$ values on the material means; call the regression coefficient h; compute PWD = 0.5 h (overall-average-of-material-means)/(average-of-squares-of-$s_R$-values); transform material means by taking the power PWD; regress $s_R$ values on these transformed means and predict for the overall average mean. The predicted $s_R$ value becomes the interpolated value. Do the same for the $s_r$ values with PWT as the power. One should recognize the formula for power as that due to Anscombe [10].
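The interpolation steps can be sketched as follows. This Python fragment is only an illustration: the function name and sample numbers are hypothetical, positive material means are assumed, and when the transformed means show no spread the plain average of the standard deviations is returned, which is our own fallback.

```python
def interpolated_sd(material_means, sds):
    """Return (power, standard deviation interpolated at the average material)."""
    M = len(material_means)
    avg_mean = sum(material_means) / M
    squares = [s * s for s in sds]
    sxx = sum((m - avg_mean) ** 2 for m in material_means)
    # Slope of the squared standard deviations on the material means.
    h = sum((m - avg_mean) * v for m, v in zip(material_means, squares)) / sxx
    power = 0.5 * h * avg_mean / (sum(squares) / M)
    # Regress the standard deviations on the material means raised to the power.
    x = [m ** power for m in material_means]
    x_bar = sum(x) / M
    y_bar = sum(sds) / M
    sxx_t = sum((xi - x_bar) ** 2 for xi in x)
    if sxx_t == 0:
        return power, y_bar  # fallback when there is no trend to follow
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, sds)) / sxx_t
    return power, y_bar + slope * (avg_mean ** power - x_bar)


# Hypothetical s_R values that grow with the material mean.
power, s_R_at_average = interpolated_sd([2.0, 6.0, 10.0], [0.2, 0.6, 1.0])
print(round(power, 3), round(s_R_at_average, 3))
```

The same call with the $s_r$ values in place of the $s_R$ values gives the power called PWT and the interpolated repeatability standard deviation.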

It is obviously a great help to be able to run and rerun the computations on numerous data sets and it was the SAS code that made this possible. The calculations were developed in the first place using only a hand-held calculator, but then we included some SAS code for outliers that we had written earlier and the way became open for further complications. Referring to tabled significance points is not necessary when distribution functions are available in SAS, e.g., PROBMC, the studentized range distribution function, that gave us the P-values. Scoring the labs was easy to do in SAS and saving the whole mass of output to one file could not be resisted. Our skills with other languages are limited and, although we may prefer to have written in FORTRAN, for example, we find it is too much trouble. Thus, while we include a listing of the code in Appendix A, we cannot recommend it as a model for future use and we hope it will be treated as simple documentation of the steps.

Summary and Implications

We hope that the task we undertook has been completed. This was to furnish some calculations to make it easier to analyze ILS data and to check their performance on sets of data. We have spun out the formulas and written the SAS code. Then we made it through 11 sets of ILS data to arrive at conclusions about which are outlier laboratories, about judging when the condition of the data is not too good, and about what features of laboratory differences such as slopes, scatters, means, or outliers seem characteristic of the test method.

The implications are still a bit murky. We will be interested to see if the methods get used. When we learned about fitting cell means to material means from John Mandel [2], we thought everyone would want to run the regressions. However, discussions in Committee E 11 of ASTM subsequent to his presentation, about what to add to E 691, showed more enthusiasm for robust methods or for graphical methods. We support robust methods for outliers but would certainly keep the existing conventional calculations for $s_R$ and $s_r$. Our routine of cycling between transforming and dropping outliers is essentially a robust method, but it is more consistent with the conventional calculations for $s_R$ and $s_r$. We do not know how slopes, scatters, and means would be rendered in the robust framework. Graphical methods are best when the deviant case is obvious but they are less useful for borderline cases. We believe our rules based on significance levels and the 95% condition point will promote action toward approving or not approving the test method. That is, we fear that viewing graphics may tend too often to excuse inaction. These considerations urge us to accumulate results on existing sets of ILS data from various fields and thereby to establish norms against which to judge new sets of data.

When one considers the wide range of kinds of measurements for which ILSs have been run, it is truly amazing that the same approach can work for all. We mentioned that extra sources of variation in the duplicates do not invalidate the calculation of $s_R$, but inflate $s_r$. In a similar way the measurements may be severely rounded (read “discrete”), as in the integer ratings for Study 1 of Table 1, but laboratory averages for each material can still be nearly normally distributed and again $s_R$ makes good sense. Likewise, the diagnostics should work the same as always. The finding, for example, that the rating method of Study 1 is none too good but that there is no outlier laboratory is a borderline finding but seems trustworthy. The appendices have the data and give enough of the references so you can learn what the various committees thought about the data. We hope you will try out your own methods to compare with these.

References

[1] ASTM, Designation E 691-92, “Standard Practice for Conducting an Interlaboratory Study to Determine the Precision of a Test Method,” Annual Book of ASTM Standards, Vol. 14:02, ASTM, West Conshohocken, PA, pp. 350–369.

[2] Mandel, J., “Analyzing Interlaboratory Data According to ASTM Standard E 691,” paper presented (read by Duncan McCune) at the ASTM Quality and Statistics: Total Quality Management Symposium, Atlanta, GA, May 1993.

[3] ISO, Reference Number 5725-2: 1994(E), “Accuracy (Trueness and Precision) of Measurement Methods and Results – Part 2: Basic Method for the Determination of Repeatability and Reproducibility of a Standard Measurement Method,” 1994. Reproduced by Global Engineering Documents with permission of ISO under royalty agreement.

[4] Horwitz, W., “Protocol for the Design, Conduct and Interpretation of Collaborative Studies,” Pure and Applied Chemistry, Vol. 60, 1988, pp. 855–864.

[5] IUPAC Workshop of May 1987, “Guidelines for Collaborative Study Procedure to Validate Characteristics of a Method of Analysis,” Journal of the AOAC, Vol. 72, 1989, pp. 694–704.

[6] Fuller, W. A., Measurement Error Models, John Wiley, New York, 1987.

[7] Jeon, G., “Latent Variable Fit to Interlaboratory Studies,” unpublished Ph.D. dissertation, North Carolina State University, Department of Statistics, Raleigh, NC, 1999.

[8] Cochran, W. G., “The Distribution of the Largest of a Set of Estimated Variances as a Fraction of Their Total,” Annals of Eugenics, Vol. 11, 1941, pp. 47–52.

[9] Dixon, W. J. and Massey, F. J., Introduction to Statistical Analysis, McGraw-Hill, New York, 1983.

[10] Anscombe, F. J., “Examination of Residuals,” Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, 1961, pp. 1–36.

[11] Proctor, C. H., “Calculating Significance Levels for Checking Regression Assumptions,” Proceedings of the Social Statistics Section of the American Statistical Association, 1987, pp. 104–109.

[12] Terbeck, W. and Davies, P. L., “Interactions and Outliers in the Two-way Analysis of Variance,” The Annals of Statistics, Vol. 26, 1998, pp. 1279–1305.

[13] Uhlig, S. and Lischer, P., “Statistically-based Performance Characteristics in Laboratory Performance Studies,” Analyst, Vol. 123, 1998, pp. 167–172.

[14] ASTM, ASTM Standards on Precision and Bias for Various Applications, ASTM, West Conshohocken, PA, 1997.

[15] Mandel, J. and Lashof, T. W., “The Interlaboratory Evaluation of Testing Methods,” ASTM Bulletin, 1959, pp. 53–61.

[16] Crowder, M., “Interlaboratory Comparisons: Round Robins with Random Effects,” Applied Statistics, Vol. 41, 1992, pp. 408–425.

[17] Mandel, J., “Non-Additivity in Two-Way Analysis of Variance,” Journal of the American Statistical Association, Vol. 56, 1961, pp. 878–888.

[18] Mandel, J., Analysis of Two-Way Layouts, Chapman & Hall, 1995.

[19] ASTM, Interlaboratory Data Analysis Software: E 691 User’s Guide, ASTM, West Conshohocken, PA, 1990.

[20] Jaech, J. L., Statistical Analysis of Measurement Errors, John Wiley, New York, 1985.

[21] Ferguson, T. S., “On the Rejection of Outliers,” Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, 1961, pp. 253–287.

[22] Fisher, R. A., Statistical Methods for Research Workers, 13th ed., Hafner Pub. Co., New York, 1958.



APPENDIX A

SAS Code



APPENDIX B

ILS Non-Concentrations Datasets