an empirical study of the validity of the spearman-brown formula as

9
AN EMPIRICAL STUDY OF THE VALIDITY OF THE SPEARMAN-BROWN FORMULA AS APPLIED TO THE PURDUE RATING SCALE H. H. REMMERS, N W. SHOCK, AND E L. KELLY* Purdue University HISTORICAL SUMMARY When the same test is given at different times to the same persons, the correlation between the resulting scores gives a measure of the reliability of the test used. It is a well known fact that the relia- bility of a test is increased as its length is increased, but for some time there was no way of foretelling just how much the reliability would be increased in its length. There was need for a formula that would predict the reliability of a test n times as long as a given test. In 1911 William Brown published a formula which promised to meet this need. It was found to be only a special case of an earlier formula of Spearman's for finding the correlation between the sums or averages of scores. 2 This revised formula is now in current use and is known as the Spearman-Brown prediction formula. If r, t is the coefficient of reliability of a test, then r», the coefficient of reliability of a test n times as long, is given by the formula: 3 nr,, r Since r is always less than 1, r n will always be greater than ru, and thus the reliability of the longer test will increase as some logarithmic function in its length. In order for this formula to be valid, it must be assumed that the test material is homogeneous in character, or statis- tically speaking, that the intercorrelation between the units is equal and their standard deviations are equal. Were these assumptions strictly fulfilled, perfect prediction would result. In practice, however, such ideal conditions are never found, so our problem becomes that of determining just how accurate a prediction may be obtained for the material at hand. It was only natural that the introduction of such a formula would be of much interest to all those interested in tests and measurements. Accordingly, studies soon began to determine the validity of the for- • Responsibility for this study is divided as follows: Shock and Kelly collected the data and carried out all statistical computations under the direction of Rem- mere, who suRgestwl the problem and the experimental and statistical procedure. 187

Upload: donhu

Post on 26-Jan-2017

217 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: an empirical study of the validity of the spearman-brown formula as

AN EMPIRICAL STUDY OF THE VALIDITY OF THESPEARMAN-BROWN FORMULA AS APPLIED TO

THE PURDUE RATING SCALEH. H. REMMERS, N W. SHOCK, AND E L. KELLY*

Purdue University

HISTORICAL SUMMARY

When the same test is given at different times to the same persons,the correlation between the resulting scores gives a measure of thereliability of the test used. It is a well known fact that the relia-bility of a test is increased as its length is increased, but for some timethere was no way of foretelling just how much the reliability would beincreased in its length. There was need for a formula that wouldpredict the reliability of a test n times as long as a given test. In1911 William Brown published a formula which promised to meetthis need. It was found to be only a special case of an earlier formulaof Spearman's for finding the correlation between the sums or averagesof scores.2 This revised formula is now in current use and is knownas the Spearman-Brown prediction formula. If r,t is the coefficientof reliability of a test, then r», the coefficient of reliability of a test ntimes as long, is given by the formula:3

nr,,r

Since r is always less than 1, rn will always be greater than ru, andthus the reliability of the longer test will increase as some logarithmicfunction in its length. In order for this formula to be valid, it must beassumed that the test material is homogeneous in character, or statis-tically speaking, that the intercorrelation between the units is equaland their standard deviations are equal. Were these assumptionsstrictly fulfilled, perfect prediction would result. In practice, however,such ideal conditions are never found, so our problem becomes that ofdetermining just how accurate a prediction may be obtained for thematerial at hand.

It was only natural that the introduction of such a formula wouldbe of much interest to all those interested in tests and measurements.Accordingly, studies soon began to determine the validity of the for-

• Responsibility for this study is divided as follows: Shock and Kelly collectedthe data and carried out all statistical computations under the direction of Rem-mere, who suRgestwl the problem and the experimental and statistical procedure.

187

Page 2: an empirical study of the validity of the spearman-brown formula as

188 The Journal of Educational Psychology

mula for various classes of material. Holzinger,4 in studying theapplication of the formula to intelligence test material, found that theformula tended to predict the obtained reliability with fair accuracy fora pool of 6ve tests or less, but for greater numbers the formula over-estimated the empirical results obtained. His test material, however,violated to some extent the major assumption, i.e., homogeneity oftest items. Later, Holzinger and Clayton* applied the formula tospelling words and self-administered test material, and in each casefound a close correspondence between the predicted and observedresults. Kelley,' using Gordon's data on lifted weights, also found aclose agreement between the predicted and observed results on thismaterial. Ruch and others7 applied the formula to spelling words,and they came to the conclusion that it gave a meaningful predictionfor such test material. In two cases out of five, the tendency was tooverpredict slightly the empirical scores. Wood8 applied the sameformula to a set of scores on a hundred true-false questions given tolanguage students, and he found that it tended very nearly to predictor elBe to underestimate the observed results. Furfey* publishedresults while the present study was in progress, showing that the for-mula slightly overestimated the reliability obtained by dividing thetest into a number of components.

PURPOSE OF THIS STUDY

It is the purpose of this study to determine the validity of theSpearman-Brown prediction formula when applied to the personalrating scale now in use at Purdue University. If it is found to be avalid means of predicting the reliability of the combined ratings ofany number of judges, it will be quite useful to the personnel depart-ment in deciding on the number of judgments that should be obtainedon an individual in order to obtain the desired reliability. At thepresent time, each engineering student is rated by five fellow studentsand by five instructors, all of bis own choice. Thus the number ofjudgments has been arbitrarily placed at ten, without knowingwhether it is too low to be dependable or too high to be economical.If it is shown that the Spearman-Brown formula will accurately predictthe reliability for any given number of judgments with the scale, thenthe most efficient number of judgments, both for dependability and foreconomy, may be accurately determined.

In order that the reader may better understand the present study,a short description of the rating scale, which is now being used, is

Page 3: an empirical study of the validity of the spearman-brown formula as

Spearman-Brown Formula and Purdue Scale 189

given. It consists of a printed form containing a paragraph of instruc-tions on the upper part of the sheet, which explains the purpose of thequestionnaire and asks the reader to rate the individual whose nameappears filled in the blank. The scale itself consists of ten traits onwhich the ratings are made. These traits are:

1. Address and Manner2. Attitude3. Character4. Cooperative Ability5. Disposition6. Industry7. Judgment8. Initiative9. Leadership

10. Mental CaliberThe ratings are made on a five point scale, five being considered

perfect, one as very poor, and three as representing the average indi-vidual. Fractional judgments are permitted. It will thus be seenthat fifty is the highest possible total score.

Plice in a recent study of the scale10 came to the following con-clusions:

1. "Students rate each other higher than teachers rate them.2. "Teachers agree better on the rating of students than do the

students themselves.3. "Teachers agreed fairly well with the students except on traits

7 and 10. (Trait 7 is Judgment and Trait 10 is Mental Caliber.)4. "While the general agreement of teachers and students in rating

is low, it is somewhat higher than chance agreement."

EXPERIMENTAL PROCEDURE

If the Spearman-Brown prediction formula is to be used in predict-ing the reliability of a certain number of judgments, certain assump-tions must be made. In the first place it is assumed that one judgeis just as competent to judge traits of character and personality in anindividual as another. This follows from the basic assumption thatthe test material is homogeneous, as is made in the original formula,since here instead of having test elements we have judges. Anotherassumption that follows directly from this is that an individual judgecan be treated as a constant factor, comparable to a test element.

Page 4: an empirical study of the validity of the spearman-brown formula as

190 The Journal of Educational Psychology

It was felt that we should have a group as large as possible withnot less than twenty-five who would judge each other conscientiouslyand who also knew one another well enough to be fairly accurate intheir judgments. Hence, a fraternal group was chosen. Here severalfactors entered which are possibly not the same as those encounteredin the actual field of personnel rating. In the first place all of thesemen knew each other more intimately than would ordinarily be trueof some ratings obtained in practice, since they had lived in the samehouse with one another for almost an entire semester. This factwould tend to raise the reliability of the ratings. On the other hand,the fact that each man was rated by every other man in the house, letpetty jealousy and feeling creep into the ratings in some instances.This would tend to lower the reliability of the scale. This difficultyis obviated in practice since the individual rated chooses his ownjudges, thus possibly eliminating those who would rate him low merelybecause of personal feeling.

The first step was to get the men to rate each other. In order tofacilitate the tabulation of the results each man in the fraternity wasgiven a number as a judge and a letter as a subject to be judged.It was first thought best to have each man judge six others at a time,selecting each group by chance, in order to eliminate the factor offatigue on the part of the judge. This was done, but with the firstand second groups of six a decided change in standards on the part ofthe judge was noted. That is, the judges tended to compare andrank the men within each group of six, instead of judging the menwithin the fraternity as a whole. In order to avoid this difficulty,all of the remaining rating blanks were given out at the same time, thusminimizing the possibility of errors due to the shifting of the standardsused by the individual in making his judgments.

The instructions given to the men were: "In order to carry out aninvestigation of the reliability of personal judgments, we are askingevery man in the fraternity to rate every other man, using the PurduePersonnel Rating Scale. The thing that we want is a conscientiousrating of the character and personality of each man, using the averageman within the fraternity as a standard. The rest of the instructionswill be found on the rating scale. You may be perfectly frank inyour judgments, since all ratings will be anonymous." It was thusmade clear to the men that each was to use his own conception ofthe average man within the fraternity as a standard by which he wasto judge the others

Page 5: an empirical study of the validity of the spearman-brown formula as

Speartnan-Broii'n Formula and Purdue Scale 191

A fine spirit and keen interest was displayed by everyone. Thiswas due to the fact that all were much interested in finding out whatthe rest of their fraternity brothers thought of them. This was possi-ble only if each gave a good rating of the rest. As a result, the ratingscame in promptly and were rather carefully filled out. It is reasonablycertain that each rating was as carefully analyzed as is usually true inratings made in actual practice.

After all the rating scales were received properly filled out, theindividual's score was computed by taking the sum of the rating oneach of the ten characteristics. After these data had been computedthey were arranged in tabular form, subjects judged on the x axis,and judges on the y axis. The arithmetic mean of all the judgmentson each subject were computed. The standard deviations (<r) of thedistribution as well as probable errors (PE) of the means then com-puted and arranged in tabular form as found in Table I. All tablesnot shown in this article because of limitations of space are on filein the Department of Education, Purdue University.

TABLE 1

Letters Indicate Subject. Average is Average of All Judgments on That Subject.a is the Standard Deviation of the Distribution. PE,, ia the Probable Error

of the Average

Subjectrated

ABC

nKFGHIJKLM

Average»

29 031 130 835 837 938 136 333 131 036 937 031 035 0

Distr.

4 314 975 424 904 844 554 063 863 754 745 164 355 10

PE., !

±.58± 67±.72± 6 6±.65±.61± 55± 52± 51± 64±.69± 59± 69

Subjectrated

fii0PQR

1 sTUV

i wXYZ

Averagea

33 434 833 831 835.330 938 333 836 835 934 133 831 9

Distr.

5.204 904.775.284 363 974 604 975 094.685 244 624.03

PE.,

± 70± 66±.64± 71±.59± 52± 62± 67+ 69± 63± 71± 62± 55

Averages of judgments varying in number from one to thirteenand randomly selected, were then correlated. That is, correlations ofone judgment vs. one judgment, two judgments vs. two judgments andso on to thirteen judgments vs. thirteen judgments were obtained.

Page 6: an empirical study of the validity of the spearman-brown formula as

192 The Journal of Educational Psychology

Column 2 of Table II gives the number of chance combinations corre-lated for the different groupings. A total of 96 r's was computed.

Table II is a summary of the results of the entire investigation,since it shows the predicted and experimental reliability of the scalewhen used with given numbers of judgments.

TABLE II.—CORRELATIONS PREDICTED AND OBBERVED

Ones . .Twos. . . .ThreesFoursFives . . . .SixesSevens .EightsNinesTensElevensTwelvesThirteens.. .

Number

2523

86

1065322222

Pre-dicted

r

401.500.572626677700

.728

.751

.770786801813

PEr

± 111+ 099+ 087± 080± 070+ 067± 062 '± 058 ;

± 051 ;+ 050± 048± 046

Ob-served

r

251388

.592538631661777667663731834865852

PEaverage

r

± 040± 020+ 043+ 041± 046± 013± 003± 040+ 027± 015± 041± 034± 044

Differ-ence

observed— pre-dicted

- 013+ 092- 034- 005- 016+ 077- 061- 088- 039+ 048+ 064+ 039

The first column of numerals is the number of correlations in theaverage.

Predicted r is reliability predicted by Spearman-Brown formula.Observed r is reliability obtained by experiment.Inspection of this table reveals certain interesting trends in the

data: (1) A relatively rapid increase in reliability in the first few groupsof correlated judgments. (2) A very close approximation to the pre-dicted value when the number of correlations is relatively large. (3)The last three values in the observed series are higher than those inthe predicted series. This, however, may be merely a chance fluctua-tion due to limited sampling.

A graphical representation of these results is given in Fig. 1. Itwill be seen that the experimental value for the reliability falls outsideof one probable error of the predicted value only three times, andthat it is always within one standard deviation. One reason for theirregularity of the curve of the experimental reliabilities is the relatively

Page 7: an empirical study of the validity of the spearman-brown formula as

Spearman-Brown Formula and Purdue Scale 193

Page 8: an empirical study of the validity of the spearman-brown formula as

194 The Journal of Educational Psychology

small number of correlations which were possible in the groups con-taining the higher numbers.

Figure 2 is made from the same data, but with the curve of pre-dicted values represented by a straight line in order to show that theprobable error of the correlations decreases as the coefficient of correla-tion increases. This fact is not evident from Fig. 1 since the PE ismeasured on the axes rather than on line perpendicular to the curve ofpredicted reliability. Holzinger and Clayton6 in a graph showing theresults of a similar study to test the validity of this formula whenapplied to spelling words did not seem to take this into account, butseemed to measure the probable error on a line perpendicular to thecurve at the given point. This did not give the true location of the PEcurve since all the distances must be measured on the abscissa orordinate.

SUMMARY

A review of previous experimental work designed to test the validityof the Spearman-Brown prediction formula shows it to give a meaning-ful prediction on such materials as mental test items, spelling words,lifted weights, true-false test items in language, and component unitsof rating scales.

The present investigation using judges as equivalent to test itemsin the sense required by the formula, concerned itself with correlationsbetween groups of judgments varying from 1 to 13. These judgmentswere obtained by having each man in a fraternity group of 26 judgeevery other man on certain traits of character and personality as givenin the Purdue Personnel Rating Scale.

CONCLUSIONS

1. The reliability of the average of a number of judgments withthis scale is higher than that of one judgment.

2. The Spearman-Brown formula predicts within two probableerrors or one standard deviation the empirical reliability obtained byexperiment up to and including thirteen judges.

3. Although the increase in reliability diminishes as the number ofjudgments is increased, the decrease in per cent of error of the averagejudgment continues to be appreciable.

4. If the Spearman-Brown formula holds for numbers of judgmentsbeyond thirteen, the number of judgments required for any desiredreliability can be readily determined. The practical import of this

Page 9: an empirical study of the validity of the spearman-brown formula as

SpearmanrBrown Formula and Purdue Scale 195

generalization is of considerable consequence in actual rating or judgingsituation.

REFERENCES

1. Kelley, T. I,.- "Statistical Method," page 353.2. Ibid., page 205.3. Monroe, W. S.: "Theory of Educational Measurements," page 205.4. Holzinger, K. L.: Notes on the Use of the Spearman Prophecy Formula for

Reliability. Journal of Educational Psychology, Vol. XIV, page 302.5. Holzinger and Clayton: Further Experimenta in the Application of Spear-

man's Prophecy Formula. Journal of Educational Psychology, Vol. XVI,page 289.

6. Kelley, T. L.: The Applicability of the Spearman-Brown Formula for theMeasurement of Reliability. Journal of Educational Psychology, Vol. XV,page 300.

7. Ruch, Ackerson and Jackson: An Empirical Study of the Spearman-BrownFormula as Applied to Educational Test Materials. Journal of Educa-tional Psychology, Vol. XVII, page 319.

8. Wood, Ben: Studies of Achievement Testa. Journal of Educational Psy-chology, Vol. XVII, page 263.

9. Furfey, Paul' An Improved Rating Scale. Journal of Educational Psy-chology, Vol. XVII, page 45.

10. Plice, Max Jennings: A Study of Some of the Psychological Aspects ofPersonnel Selection with Special Reference to Purdue University, page 50.(To be published in an early number of the Journal of Industrial Psychology.)

11. Monroe, W. S.: "Theory of Educational Measurements," page 325.12. Ibid., pagp 331.