
Psychological Bulletin 1976, Vol. 83, No. 6, 1053-1071

Critical Analysis of the Statistical and Ethical Implications of Various Definitions of Test Bias

John E. Hunter and Frank L. Schmidt
Michigan State University

This article defines three mutually incompatible ethical positions in regard to the fair and unbiased use of psychological tests for different groups such as blacks and whites. Five statistical definitions of test bias are also reviewed and are related to the three ethical positions. Each definition is critically examined for its weaknesses on either technical or social grounds. We ultimately argue that the attempt to define fair use without recourse to substantive and causal analysis is doomed to failure.

In the last several years there has been a series of articles devoted to the question of the fairness of employment and educational tests to minority groups (Cleary, 1968; Darlington, 1971; Thorndike, 1971). Although each of these articles came to an ethical conclusion, the basis for that ethical judgment was left unclear. If there were only one ethically defensible position, then this would pose no problem. But such is not the case. The articles that we review have a second common feature. Each writer attempts to establish a definition on purely statistical grounds, that is, on a basis that is independent of the content of test and criterion and that makes no explicit assumption about the causal explanation of the statistical relations found. We argue that this merely makes the substantive considerations implicit rather than explicit.

In this article we first describe three distinct ethical positions. We next examine five statistical definitions of test fairness in detail and show how each is based on one of these ethical positions. Finally, we examine the technical, social, and legal advantages and disadvantages of the various ethical positions and statistical definitions.

Frank L. Schmidt is now at the U.S. Civil Service Commission.

Requests for reprints should be sent to John E. Hunter, Department of Psychology, Olds Hall, Michigan State University, East Lansing, Michigan 48823.

THREE ETHICAL POSITIONS

Unqualified Individualism

The classic American definition of an objective advancement policy is giving the job to the person "best qualified to serve." Couched in the language of institutional selection procedures, this means that an organization should use whatever information it possesses to make a scientifically valid prediction of each individual's performance and always select those with the highest predicted performance. From this point of view, there are two ways in which an institution can act unethically. First, an institution may knowingly fail to use an available, more valid predictor; for example, it may select on the basis of appearance rather than scores on a valid ability test. Second, it may knowingly fail to use a more valid prediction equation based on its available information; for example, it may administer a more difficult literacy test to blacks than to whites and then use a cutoff score for both groups that assumes they both took the same test. In particular, if in fact race, sex, or ethnic group membership were a valid predictor of performance in a given situation over and above the effects of other measured variables, then the unqualified individualist would be ethically bound to use such a predictor.

Quotas

Most corporations and educational institutions are creatures of the state or city in which they function. Thus, it has been argued that they are ethically bound to act in a way that is "politically appropriate" to their location. In particular, in a city whose population is 45% black and 55% white, any selection procedure that admits any other ratio of blacks and whites is "politically biased" against one group or the other. That is, any politically well defined group has the "right" to ask and receive its "fair share" of any desirable product or position that is under state control. These fair share quotas may be based on population percentages or on other factors irrelevant to the predicted future performance of the selectees (Darlington, 1971; Thorndike, 1971).

Qualified Individualism

There is one variant of individualism that deserves separate discussion. This position notes that America is constitutionally opposed to discrimination on the basis of race, religion, national origin, or sex. A qualified individualist interprets this as an ethical imperative to refuse to use race, sex, and so on, as a predictor even if it were in fact scientifically valid to do so. Suppose, for example, that race were a valid predictor of some criterion; that is, assume that the mean difference between the races on the criterion is greater than that which would be predicted on the basis of the best ability test available. This would mean that the use of race in conjunction with the ability test would increase the multiple correlation with the criterion. That is, prediction would be better if separate regression lines were used for blacks and whites. To the unqualified individualist, on the other hand, failure to use race as a predictor would be unethical and discriminatory, since it would result in a less accurate prediction of the future performance of applicants and would "penalize" or underpredict performance of individuals from one of the applicant groups. The qualified individualist recognizes this fact but is ethically bound to use one overall regression line for ability and to ignore race. Thus, the qualified individualist relies solely on measures of ability and motivation to perform the job (e.g., scores on valid aptitude and achievement tests, assessment of past work experiences, etc.).

Definition of Discrimination

There is one very important point to be made before leaving this issue: The word discriminate is ambiguous. The qualified individualist interprets the word discriminate to mean treat differentially. Thus, he will not treat blacks and whites differently even if it is statistically warranted. However, the unqualified individualist also refuses to discriminate, but he uses a different definition of that word. The unqualified individualist interprets discriminate to mean treat unfairly. Thus, the unqualified individualist would say that if there is in fact a valid difference between the races that is not accounted for by available ability tests, then to refuse to recognize this difference is to penalize the higher performing of the two groups. Finally, the person who adheres to quotas will also refuse to discriminate, but he will use yet a third definition of that word. The person who endorses quotas interprets discriminate to mean select a higher proportion of persons from one group than from the other group. Thus, the adherents of all three ethical positions accept a constitutional ban against discrimination, but they differ in their views of how that ban is to be put into effect.

THREE ATTEMPTS TO DEFINE TEST FAIRNESS STATISTICALLY

In this section we briefly review three attempts to arrive at a strictly statistical criterion for a fair or unbiased test. For ease of presentation, the discussion uses comparison of blacks and whites. However, the reader should bear in mind that other demographic classifications, such as social class or sex, could be substituted with no loss of generality.

The Cleary Definition

Cleary (1968) defined a test to be unbiased only if the regression lines for blacks and whites are identical. The reason for this is brought out in Figure 1, which shows a hypothetical case in which the regression line for blacks lies above the line for whites and is parallel to it. Consider a white and a black subject, each of whom have a score of A on the test. If the white regression line were used to predict both criterion scores, then the black applicant would be underpredicted by an amount Δy, the difference between his expected score making use of the fact he is black and the expected score assigned by the white regression line. Actually, in this situation, in order for a white subject to have the same expected performance as a black whose score is A, the white subject must have a score of B.

That is, if the white regression line underpredicts black performance, then a white and a black are only truly equal in their expected performance if the white's test score is higher than the black's by an amount related to the amount of underprediction. Similarly, if the white regression line always overpredicts black performance, then a black subject has equal expected performance only if his test score is higher than the corresponding white subject's score by an amount related to the amount of overprediction. Thus, if the regression lines for blacks and whites are not equal, then each person will receive a statistically valid predicted criterion score only if separate regression lines are used for the two races. If the two regression lines have exactly the same slope, then this can be accomplished by predicting performance from two separate regression equations or from a multiple regression equation with test score and race as the predictors. If the slopes are not equal, then either separate equations must be used or the multiple regression equation must be expanded by the usual product term for moderator variables. Thus, we can view Cleary's definition of an unbiased test as an attempt to rule out disputes between qualified and unqualified individualism.

FIGURE 1. A case in which the white regression line underpredicts black performance. (The figure plots criterion performance against test score, with the black regression line lying above and parallel to the white regression line and the black and white test-score means marked.)

FIGURE 2. Regression artifacts produced by unreliability in a Cleary-defined unbiased test. (A is the common regression line for a perfectly reliable ability test. B and C are the regression lines for whites and blacks, respectively, for a test of reliability .50.)
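Cleary's criterion can be checked directly on data. The sketch below uses hypothetical numbers (slope .5, a .3 SD intercept shift favoring the higher group) and assumes NumPy is available; it fits separate least-squares lines per group and shows that a single pooled line underpredicts the group whose regression line is higher, which is exactly the situation of Figure 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: equal slopes, but one group's intercept is 0.3
# criterion-SD units higher. A single pooled line then underpredicts
# that group's performance, so the test is biased in Cleary's sense.
slope, shift = 0.5, 0.3
x_w = rng.normal(0.0, 1.0, n)
x_b = rng.normal(0.0, 1.0, n)
y_w = slope * x_w + rng.normal(0.0, 0.8, n)
y_b = slope * x_b + shift + rng.normal(0.0, 0.8, n)

# Separate least-squares fits recover the two regression lines.
slope_w, int_w = np.polyfit(x_w, y_w, 1)
slope_b, int_b = np.polyfit(x_b, y_b, 1)

# Pooled fit (one line for everyone), as a qualified individualist would use.
x = np.concatenate([x_w, x_b])
y = np.concatenate([y_w, y_b])
slope_p, int_p = np.polyfit(x, y, 1)

# Mean residual of the higher group under the pooled line: positive,
# i.e., the pooled line underpredicts that group's performance.
underprediction = np.mean(y_b - (slope_p * x_b + int_p))
```

With these numbers the fitted intercepts differ by about .3 and the pooled line underpredicts the higher group by about half the shift.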

If the predictors available to an institution are unbiased in Cleary's sense, then the question of whether to use race as a predictor does not arise. But if the predictors are biased, the recommended use of separate regression lines is clearly equivalent to using race as a predictor of performance. Thus, although Cleary may show a preference for tests that meet the requirements of both unqualified and qualified individualism, in the final analysis, her position is one of unqualified individualism.

A Cleary-defined unbiased test is ethically acceptable to those who advocate quotas only under very special circumstances. In addition to identical regression lines, blacks and whites must have equal means and equal standard deviations on the test, and this in turn implies equal means and standard deviations on the performance measure. Furthermore, the proportion of black and white applicants must be the same as their proportion in the relevant population. These are conditions that rarely occur.

Linn and Werts (1971) have pointed out an additional problem for Cleary's definition—the problem of defining fairness when using less than perfectly reliable tests. Suppose that a perfectly reliable measure of intelligence were in fact an unbiased predictor in Cleary's sense. But because perfect reliability is unattainable in practice, the test used in practice will contain a certain amount of error variance. Will the imperfect test be unbiased in terms of the regression equations for blacks and whites? If black applicants have lower mean IQs than white applicants, then the regression lines for the imperfect test will not be equal. This situation is illustrated in Figure 2. In this figure we see that if an unreliable test is used, then that test produces the double regression line of a biased test in which the white regression line overpredicts black performance. That is, by Cleary's definition, the unreliable test is biased against whites in favor of blacks.1,2

Cleary's critics question whether the failure to attain perfect reliability (impossible under any circumstances) should be adequate grounds for labeling a test as biased. But suppose we first consider this question from a different viewpoint. Suppose there were only one ethnic group, whites, for example. Assume that Bill has a true ability level of 115 and Jack has an ability of 110. If ability is a valid predictor of performance in this situation, then Bill has the higher expected performance; and if a perfectly reliable test is used, Bill will invariably be admitted ahead of Jack. But suppose that the reliability of the ability test is only .50. Then the two obtained scores will each vary randomly from their true values, and there is some probability that Bill's will be randomly low while Jack's is randomly high—that is, some probability that Jack will be admitted ahead of Bill. If the standard deviation of the observed ability scores is 15, then the difference between their observed scores has a mean of 5 and a standard deviation of 21. The probability of a negative difference is then .41. Thus, the probability that Bill is admitted ahead of Jack drops from 1.00 to .59.
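Taking the article's figures at face value (an observed-score difference with mean 5 and standard deviation 21), the quoted probabilities are a single normal-curve calculation; the sketch below, which assumes SciPy is available, is only a check of the stated numbers:

```python
from scipy.stats import norm

# Bill's true ability exceeds Jack's by 5 points; the article takes the
# observed difference (Bill minus Jack) to be normal with mean 5 and
# standard deviation 21 under a test reliability of .50.
p_jack_wins = norm.cdf(0.0, loc=5.0, scale=21.0)  # P(observed difference < 0)
p_bill_wins = 1.0 - p_jack_wins
```

This reproduces .41 and .59 to two decimals.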

The unreliable test is in fact sharply biased against better qualified applicants. This bias, however, is not directly racial or cultural in nature. It takes on the appearance of a racial bias only because the proportion of better qualified applicants is higher in the white group. Thus, the bias created by random error works against more applicants in the white group, and thus, on balance, the test is biased against that group as a whole. But at the individual level, such a test is no more biased against a well-qualified white than a well-qualified black. The question then is whether Cleary's (1968) definition is defective in some sense in labeling this situation as biased. If so, it may perhaps be desirable to modify the definition to apply only to bias beyond that expected on the basis of test reliability alone.

While on the topic of reliability, we should note that as the reliability approaches .00, the test becomes a random selection device and is hence utterly reprehensible to an individualist of either stripe. On the other hand, a totally unreliable test would select blacks in proportion to the number of black applicants and hence might well select in proportion to population quotas. Ironically, the argument that tests are biased against blacks because they are unreliable is not only false, it is exactly opposite to the truth.

1 This phenomenon would account for perhaps half of the overprediction of black grade-point average in the literature. In standard score units, the difference in intercepts due to unreliability is ΔY = (1 − r_XX)(μ_W − μ_B), where r_XX is the test reliability and μ_W − μ_B is the white-black mean difference on the criterion (about 1 SD). For r_XX = .80, this would be only .2 SD, whereas in the data reported in Linn (1973), the overprediction is about .37 SD.

2 The reader may wonder why we show so much concern with the reliability of the test and no concern with the reliability of the criterion. Actually, despite its large effect on the validity coefficient, no amount of unreliability in the criterion has any effect on the regression line of criterion on predictor. Let the true score equations for X and Y be X = T + e_1 and Y = U + e_2, and let the true score regression equation be U = αT + β. Then the observed regression line will not have the same coefficients. Let the observed regression line be Y = aX + b. The slope of the observed regression line will be

a = r_XY (σ_Y/σ_X) = r_TU (σ_T σ_U/σ_X²) = (r_TU σ_U/σ_T)(σ_T/σ_X)² = α r_XX.

That is, the slope of the observed regression line is the slope of the true score regression line multiplied by the reliability of X. However, note that the slope of the observed regression line is completely independent of the reliability of Y. The intercept of the observed regression line is given by

b = μ_Y − a μ_X = μ_U − r_XX α μ_T.

Thus, the intercept is also affected by the reliability of X but is completely independent of the reliability of Y. Since the slopes of the true score regression equations are equal (assuming equal standard deviations, as we have in this article), any differences in the regression lines will be equal to the difference between the intercepts and hence independent of r_YY. In the case in which the true score regression lines are the same, the difference between the observed regression lines is

b_W − b_B = (1 − r_XX)(μ_UW − μ_UB).
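The algebra of Footnote 2 can be checked by simulation. The sketch below uses hypothetical parameters (r_XX = .80, a 1 SD group difference, identical true-score regression lines with α = 1) and assumes NumPy is available; it confirms that the observed slope shrinks by the factor r_XX, that criterion unreliability is irrelevant, and that the intercepts separate by (1 − r_XX) times the group mean difference:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
rxx = 0.80  # test reliability
gap = 1.0   # white-black true-score mean difference, in within-group SD units

# True scores (unit within-group variance); the same true-score
# regression line U = T (alpha = 1, beta = 0) holds for both groups.
t_w = rng.normal(gap, 1.0, n)
t_b = rng.normal(0.0, 1.0, n)

# Observed test X = T + e with error variance chosen so that
# var(T)/var(X) = rxx, i.e. var(e) = (1 - rxx)/rxx for var(T) = 1.
e_sd = np.sqrt((1 - rxx) / rxx)
x_w = t_w + rng.normal(0.0, e_sd, n)
x_b = t_b + rng.normal(0.0, e_sd, n)

# Criterion Y = U + e2; its unreliability should not affect the lines.
y_w = t_w + rng.normal(0.0, 0.5, n)
y_b = t_b + rng.normal(0.0, 0.5, n)

slope_w, int_w = np.polyfit(x_w, y_w, 1)
slope_b, int_b = np.polyfit(x_b, y_b, 1)

# Footnote 2's result: bW - bB = (1 - rxx)(muUW - muUB) = .2 here.
intercept_gap = int_w - int_b
```

Despite the noisy criterion, both fitted slopes come out near α r_XX = .8 and the intercept gap near .2, matching Footnote 1's ΔY formula.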

Let us consider in detail the comparison of whites and blacks on the unreliable test. We first remark that it is a fact that on the average, whites with a given score have a higher mean performance than do blacks who have that same score. Thus, the use of a single regression line will in fact mean that whites near the cutoff will be denied admission in favor of blacks who will, on the average, not perform as well. Cleary's (1968) definition would clearly label such a situation biased. Furthermore, in this situation the partial correlation between race and performance with observed ability held constant is not zero. Thus, race makes a contribution to the multiple regression because with an unreliable ability test, race is in fact a valid predictor of performance after ability is partialed out. That is, from the point of view of unqualified individualism, the failure to use race as a second predictor is unethical. If the test is used with only one regression line, then the predictors are in fact biased against whites. If two regression lines are used, then each person is being considered solely on the basis of expected performance.

Thus, in summary, we feel that Cleary's critics have raised a false issue. To use an unreliable predictor is to blur valid differences between applicants, and an unreliable test is thus, to the extent of the unreliability, biased against people or groups of people who have high true scores on the predictor. Thus, from the point of view of an unqualified individualist, an unreliable test is indeed biased. On the other hand, a qualified individualist would object to this conclusion. Use of separate regression lines is statistically optimal because the unreliable test does not account for all the real differences on the true scores. But the qualified individualist is ethically prohibited from using race as a predictor and therefore can employ only a single regression equation. He can, however, console himself with the fact that the bias in the test is not specifically racial in nature. And, of course, he can attempt to raise the reliability of the test.

Thorndike's Definition

Thorndike (1971) began his discussion with the simplifying assumption that the slopes of the two regression lines are equal. There are then three cases. If the regression lines are identical, then the test satisfies Cleary's (1968) definition. If the regression line for blacks is higher than that for whites (as in Figure 1), then Thorndike labels the test "obviously unfair to the minor group." On the other hand, if the regression line for whites is higher than that for blacks, then he does not label the test as obviously unfair to whites. Instead, he has an extended argument that the use of two regression lines would be unfair to blacks. Clearly there must be an inconsistency in his argument, and indeed we ultimately show this.

Thorndike noticed that whereas using two regression lines is the only ethical solution from the point of view of unqualified individualism, it need not be required by an ethics of quotas. In particular, if the black regression line is lower, then blacks will normally show a lower mean on both predictor and criterion. Suppose that blacks are one standard deviation lower on both and that validity is .50 for both groups. If we knew the actual criterion scores and set the cutoff at the white mean on the criterion, then 50% of the whites and 16% of the blacks would be selected. If the white regression line were used for both groups, then 50% of the whites and 16% of the blacks would be selected. However, if blacks are selected using the black regression line, then because the black regression line lies .5 standard deviations below the white regression line, blacks will have to have a larger score to be selected (i.e., a cutoff two sigmas above the black mean instead of one), and hence fewer blacks will be selected. Thus, if the predictor score is used with two regression lines, then 50% of the whites but only 2% of the blacks will be admitted. Thorndike argued that this is unfair to blacks as a group. He then recommended that we throw out individualism as an ethical imperative and replace it with a specific kind of quota. The quota that he defined as the fair share for each group is the percentage of that group that would have been selected had the criterion itself been used or had the test had perfect validity. In the above situation, for example, Thorndike's definition would consider the selection procedure fair only if 16% of the black applicants were selected.
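The 50%, 16%, and 2% figures in this example follow from normal-curve areas. The check below assumes SciPy is available, with the regression-line algebra spelled out in the comments:

```python
from scipy.stats import norm

v, d = 0.50, 1.0  # validity; blacks d SD below whites on predictor and criterion

# Criterion cutoff at the white criterion mean:
white_on_criterion = norm.sf(0.0)  # 50% of whites clear it
black_on_criterion = norm.sf(d)    # ~16% of blacks clear it

# Put the white predictor mean at 0 and the black mean at -d. The white
# line is yhat = v*x; the black line, passing through (-d, -d), is
# yhat = v*x - (1 - v)*d, i.e. .5 SD lower here. Select on yhat >= 0.
black_white_line = norm.sf(0.0 + d)              # white line: cutoff 1 SD above black mean
black_own_line = norm.sf((1 - v) * d / v + d)    # own line: cutoff 2 SD above black mean
```

The last quantity is the 2% in the text (more precisely, about 2.3%).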

What Thorndike has rediscovered has long been known to biologists: Bayes's law is cruel. If one of two equally reproductive species has a probability of .49 of survival to reproduce and the other species has a probability of .50, then ultimately the first species will be extinct. Maximization in probabilistic situations is usually much more extreme than most individuals expect (Edwards & Phillips, 1964).

What then was Thorndike's contradiction? He labeled the case in which the black regression line was higher than the white line as "obviously unfair to the minor group." But his basis for this was presumably unqualified individualism. In effect, he said that if blacks perform higher than whites over and above the effects of measured ability, then this fact should be recognized and blacks should have a correspondingly higher probability of being selected. That is, if the black regression line is higher, then separate regression lines should be used. But if separate regression lines are used, then the number of whites selected would ordinarily be drastically reduced. In fact, the number of whites selected would be far below the Thorndike-defined fair share of slots. The mathematics of unequal regression lines (i.e., of Bayes's law) is the same for a high black curve as for a high white curve: The use of a single regression line lowers the validity of the prediction but tends to yield selection quotas that are much closer to the quotas that would have resulted from clairvoyance (i.e., much closer to the selection quotas that would have resulted had a perfectly valid test been available). Thus, Thorndike's inconsistency lies in his failure to apply his definition to the case in which the white regression line underpredicts black performance. The fact that because of a technicality (i.e., racial equality on the performance measure), this effect would not manifest itself in Thorndike's (1971) Case 1 should not be allowed to obscure this general principle.

If only one regression line is to be used, then a test will meet the Thorndike quotas only if the mean difference between the groups in standard score units is the same for both predictor and criterion. For this to hold for a Cleary-defined unbiased test, the validity of the test must be 1.00—a heretofore unattainable outcome.

Once Thorndike's position is shown to be a form of quota setting, then the obvious question is, Why his quotas? After all, the statement that 16% of the blacks can perform at the required level would not apply to the blacks actually selected and is in that sense irrelevant. In any event, it seems highly unlikely that this method of setting quotas would find support among those adherents of quotas who focus on population proportions as the proper basis of quota determination. Thorndikean quotas will generally be smaller than population-based quotas. On the other hand, Thorndike-determined quotas may have considerable appeal to large numbers of Americans as a compromise between the requirements of individualism and the social need to upgrade the employment levels of minority group members.

There is another question that must be raised concerning Thorndike's position: Is it ethically compatible with the use of imperfect selection devices? We will show that Thorndike's selection rule is contradictory to present test usage, that is, that according to Thorndike we must fill N vacancies not by taking the top N applicants but by making a much more complicated selection. Consider the following example: Assume that one is using a test score of 50 (mean = 50, SD = 10, r_XY = .50) as a cutoff and that the data show that 50% of those with a test score of 50 will be successful. Applicants with a score of 49 would all be rejected, though for r_XY = .50 about 48% of them would have succeeded had they been selected (a score of 49 is −.1 sigmas on X and implies an average score of −.05 on the criterion with a sigma of .87). Thus, applicants with a test score of 49 can then correctly state: If we were all admitted, then 48% of us would succeed. Therefore, according to Thorndike, 48% of us should be admitted. Yet we were all denied. Thus, you have been unfair to our group, those people with scores of 49 on the test. That is, strictly speaking, Thorndike's ethical position precludes the use of any predictor cutoff in selection, no matter how reasonably determined. Instead, from each predictor category one must select that percentage that would fall above the criterion cutoff if the test were perfectly valid. For example, if one wanted to select 50% of the applicants and the validity were .60, then one would have to take 77% of those who lie 1 SD above the mean, 50% of those within 1 SD of the mean, and 23% of those who fall 1 SD below the mean. And Thorndike's definition could be interpreted, of course, as requiring the use of even smaller intervals of test scores.
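Both sets of figures here are normal-curve computations. The sketch below assumes SciPy is available and evaluates the strata at z = +1, 0, and −1 on the predictor:

```python
from math import sqrt
from scipy.stats import norm

# Score of 49: z = -.1 on the predictor; with validity .50 the expected
# criterion z-score is -.05 and the conditional SD is sqrt(1 - .25) ~ .87.
rxy = 0.50
p49 = norm.sf(0.0, loc=rxy * -0.1, scale=sqrt(1 - rxy**2))  # ~48% would succeed

# Stratified admission rates at validity .60 for applicants 1 SD above,
# at, and 1 SD below the predictor mean, against a criterion cutoff of 0:
v = 0.60
s = sqrt(1 - v**2)
p_hi = norm.sf(0.0, loc=v * 1.0, scale=s)    # ~.77
p_mid = norm.sf(0.0, loc=0.0, scale=s)       # .50
p_lo = norm.sf(0.0, loc=-v * 1.0, scale=s)   # ~.23
```

These reproduce the 48%, 77%, 50%, and 23% in the text.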

There are several problems with this procedure. First, one must attempt to explain to applicants with objectively higher qualifications why they were not admitted—a rather difficult task and, from the point of view of individualism, an unethical one. Second, the general level of performance will be considerably lower than it would have been had the usual cutoff been used. In the previous example, the mean performance of the top 50% on the predictor would be .48 standard score units, whereas the mean performance of those selected by the Thorndike ethic would be .29. That is, in this example, using Thorndike's quotas has the effect of cutting the usefulness of the predictor by about 40%. (These calculations are shown in the appendix.)
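The .48 versus .29 comparison (derived analytically in the article's appendix) can also be reproduced by simulation. The sketch below assumes NumPy and SciPy are available; it contrasts admitting the top half on the predictor with admitting each applicant with probability P(success | predictor score), the smooth-stratum version of the Thorndikean rule:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 1_000_000
v = 0.60  # validity
x = rng.normal(size=n)
y = v * x + rng.normal(scale=np.sqrt(1 - v**2), size=n)

# Usual rule: admit the top half on the predictor.
mean_top = y[x > 0].mean()  # mean criterion performance ~ .48

# Thorndikean rule: admit each applicant with the probability that an
# applicant at that predictor value clears the criterion cutoff.
p_admit = norm.sf(0.0, loc=v * x, scale=np.sqrt(1 - v**2))
admitted = rng.random(n) < p_admit
mean_thorndike = y[admitted].mean()  # ~ .29
```

Both rules admit about half the applicants, but the mean performance of the admitted group drops from roughly .48 to roughly .29 standard score units.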

One possible reply to this criticism would be that Thorndike's definition need not be interpreted as requiring application to all definable groups. The definition is to be applied only to "legitimate minority groups," and this would exclude groups defined solely by obtained score on the predictor. If agreement could be reached that, for example, blacks, Chicanos, and Indians are the only recognized minority groups, the definition might be workable. But such an agreement is highly unlikely. On what grounds could we fairly exclude Polish, Italian, and Greek Americans, for example?

Perhaps an even more telling criticism can be made. In a college or university, performance below a certain level means a bitter tragedy for a student. In an employment situation, job failure can often be equally damaging to self-esteem. In the selection situation described above, the percentage of failure would be 25% if the top half were admitted, but one third if a Thorndikean admission rule were used. Furthermore, most of the increase in failures comes precisely from the poor-risk admissions. Their failure rate is two thirds. Thus, in the end, a Thorndikean rule may be even more unfair to those at the bottom than to those at the top.

Darlington's Definition

Darlington's (1971) first step was a restatement of the Cleary (1968) and Thorndike (1971) criteria for a "culturally" fair test in terms of correlation rather than regression. Let X be the predictor, Y the criterion, and C the indicator variable for culture (i.e., C = 1 for whites, C = 0 for blacks). He made the empirically plausible assumption that the groups have equal standard deviations on both predictor and criterion and that the validity of the predictor is the same for both groups (hence parallel regression lines). Darlington then correctly noted that Cleary's (1968) criterion for a fair test could be stated

r_YC·X = 0.

That is, there is no criterion difference between the races beyond that produced by their difference on X (if any). If all people are selected using a single regression line, then Thorndikean quotas are guaranteed by Darlington's "definition two," that is,

r_CX = r_CY.

That is, the racial difference on the predictor must equal the racial difference on the criterion in standard score units. However, if people are selected using multiple regression or separate regression lines, then this equation would not be correct. Instead there are two alternate conditions:

r_XY = 1 or r_CY = 0.

That is, if separate regression lines are used, then the percentages selected match Thorndike's quotas only if the test has perfect validity or if there are no differences between the groups on the criterion.³
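The conclusion of Footnote 3 can be checked numerically. The sketch below assumes unit within-group standard deviations, white means set to zero for convenience, and a 50% white selection ratio; `selected_vs_quota` is our own illustrative helper, not anything from Darlington.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def selected_vs_quota(alpha, y_gap):
    """Compare the fraction of blacks picked by a common multiple-regression
    cutoff with Thorndike's quota. alpha is the within-group validity; whites
    have mean 0 on both variables, blacks have criterion mean -y_gap, and the
    white selection ratio is 50%."""
    y_star = 0.0     # criterion cutoff: top half of whites
    yhat_star = 0.0  # regression cutoff that selects the same 50% of whites
    # Black predictor cutoff in standard scores, from the regression line:
    z_cut = (yhat_star + y_gap) / alpha
    selected = 1.0 - phi(z_cut)
    quota = 1.0 - phi(y_star + y_gap)  # Thorndike: match the criterion quantile
    return selected, quota

# Quotas match only when alpha = 1 or the criterion means are equal:
print(selected_vs_quota(0.6, 0.5))  # criterion gap: selection falls short of quota
print(selected_vs_quota(1.0, 0.5))  # perfect validity: match
print(selected_vs_quota(0.6, 0.0))  # no criterion gap: match
```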

Darlington then attacked the Cleary definition on two very questionable bases: (a) the reliability problem raised by Linn and Werts (1971), which was discussed in depth above, and (b) the contention that race itself would be a Cleary-defined fair test. Actually, if race were taken as the test, then there would be no within-groups variance on that predictor and hence no regression lines to compare. Thus, Cleary's definition cannot be applied to the case in which race itself is used as the predictor test.⁴ The nontrivial equivalent of this is a test whose sole contribution to predicting Y is the race difference on the mean of X, but for such a test the regression lines are perfectly horizontal and grossly discrepant. That is, in a real situation, Cleary's definition would rule that a purely racial test is biased.

Darlington's Definition 3 and Cole's Argument

Darlington (1971) proposed a third definition of test fairness, his Definition 3. This definition did not attract a great deal of attention until Cole (1973) offered a persuasive argument in its favor. We first present Darlington's definition, his justification of it, and our critique of that justification. We then consider Cole's argument.

If X is the test and Y is the criterion and if C, the variable of culture, is scored 0 for blacks, 1 for whites, then Darlington's Definition 3 can be written as follows: The test is fair if

r_XC·Y = 0.

His argument for this definition went as follows: The ability to perform well on the criterion is a composite of many abilities, as is the ability to do well on the test. If the partial correlation between test and race with the criterion partialed out is not zero, then

³ Because the groups have equal standard deviations on both predictor and criterion, assume for algebraic simplicity that the variables have been scaled so that all within-groups standard deviations are unity. This means that deviation scores are standard scores. Suppose that the selection ratio for whites has been determined. Then there is a corresponding standard score on Y, say Y*, such that the standard score Y* − Ȳ_W would cut off that percentage of whites. To select that same percentage of whites, there is a predictor score on the test, X*_W, such that

X*_W − X̄_W = Y* − Ȳ_W.

If the multiple regression equation is

Ŷ = αX + βC + γ,

then the multiple regression cutoff score is

Ŷ* = αX*_W + β + γ
   = α(X*_W − X̄_W + X̄_W) + β + γ
   = α(X*_W − X̄_W) + αX̄_W + β + γ.

Because multiple regression always matches the group mean perfectly,

Ȳ_W = αX̄_W + β + γ,

and hence

Ŷ* = α(X*_W − X̄_W) + Ȳ_W.

The predictor cutoff score for blacks is determined by

Ŷ* = αX*_B + γ.

Because multiple regression matches means,

Ȳ_B = αX̄_B + γ,

and hence the black predictor cutoff satisfies

Ŷ* = α(X*_B − X̄_B) + Ȳ_B.

Thorndike's quotas for blacks are obtained if the standard score for the predictor cutoff is the same as the standard score for the criterion cutoff, that is, if

X*_B − X̄_B = Y* − Ȳ_B.

Now we have in general

α(X*_B − X̄_B) = Ŷ* − Ȳ_B.

Thus, Thorndike's quotas obtain only if

α(Y* − Ȳ_B) = Ŷ* − Ȳ_B
            = [α(X*_W − X̄_W) + Ȳ_W] − Ȳ_B
            = α(Y* − Ȳ_W) + Ȳ_W − Ȳ_B.

This is true only if

α(Ȳ_W − Ȳ_B) = Ȳ_W − Ȳ_B.

Thus, Thorndike's quotas are obtained only if one of two things is true: either α = 1 or both sides are zero; that is, either α = 1 or Ȳ_W − Ȳ_B = 0. Because the variables were all scaled to have equal within-groups standard deviations, the regression weight α is in fact the within-groups predictor-criterion correlation. Thus, α = 1 means that the test has perfect validity. The equation Ȳ_W − Ȳ_B = 0 is equivalent to Ȳ_W = Ȳ_B, that is, no group difference on the criterion and hence r_CY = 0.

⁴ Darlington's error was subtle. He assumed that r_YC·C = 0 when in fact r_YC·C = 0/0, which is undefined.

it means that there is a larger difference between the races on the test than would be predicted by their difference on the criterion. Hence the test must be tapping abilities that are not relevant to the criterion but on which there are racial differences. Thus, the test is discriminatory.

Note that Darlington's argument makes use of assumptions about causal inference. If those assumptions about causality are in fact false, then his interpretation of the meaning of the partial correlation is no longer valid. Are his assumptions so plausible that they need not be backed up with evidence? Consider the time ordering of his argument. He is partialing the criterion from the predictor. In the case of college admissions, this means that he is calculating the correlation between race and entrance exam score, with GPA 4 years later being held constant. This is looking at the causal influence of the future on the past and is only valid in the context of very special theoretical assumptions. The definition would in fact be inappropriate even in the context of a concurrent validation study, since concurrent validities are typically derived only as convenient estimates of predictive validity. Thus, even when there is no time lag between predictor and criterion measurement, one is operating implicitly within the predictive validity model.

Let us explore this point more fully through the use of two concrete examples. First, consider a pro football coach attempting to evaluate the rookies who have joined the team as a result of the college draft. Since the players have all come from different schools, there are great differences in the kind and quality of training that they have had. Therefore, the coach cannot rely on how well they play their positions at that moment in time; they will undergo considerable change as they learn the ropes over the next few months. What the coach would like to know is exactly what their athletic abilities are without reference to the way they've learned to play to date. Suppose he decides to rely solely on the 40-yard dash as an indicator of football ability, that is, as a selection test. It is possible that he would then find that he was selecting a much larger percentage of blacks than he had using his judgment of current performance. Would this mean that the test is discriminatory against whites? That depends on the explanation for this outcome. Consider the defensive lineman on a passing play. His ability to reach the quarterback before he throws the ball depends not only on the speed necessary to go around the offensive lineman opposing him but also on sufficient arm strength to throw the offensive lineman to one side. Assume, for the sake of this example, that blacks are faster, on the average, than whites but that there are no racial differences in upper body strength. Since the 40-yard dash represents only speed and makes no measure of upper body strength, it cannot meet Darlington's substantive assumptions. That is, the 40-yard dash taps only the abilities on which there are racial differences and does not assess those that show no such differences.

How does the 40-yard dash behave statistically? If speed and upper body strength were the only factors in football ability and if the 40-yard dash were a perfect index of speed, then the correlations would satisfy r_YC·X = 0. That is, by Cleary's definition, the 40-yard dash would be an unbiased test. Since r_YC·X = 0, r_XC·Y cannot be zero and, hence, according to Darlington's definition the 40-yard dash is culturally unfair (i.e., biased against whites). (Since the number of whites selected would be less than the Thorndike quota, Thorndike too would call the test biased.) If the coach was aware that upper body strength was a key variable and was deliberately avoiding the use of a measure of upper body strength in a multiple regression equation, then the charge that the coach was deliberately selecting blacks would seem quite reasonable. But suppose that the nature of the missing predictor (i.e., upper body strength) was completely unknown. Would it then be fair to charge the coach with using an unfair test?
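The two partial correlations in this two-factor example can be computed exactly from the implied covariances. In the sketch below, the criterion is Y = speed + strength, the test X measures speed only, and the two equal-sized groups (coded C = 0/1) differ by d standard deviations on speed but not at all on strength; d = 1 is an assumed illustrative value.

```python
from math import sqrt

d = 1.0                     # assumed group difference on speed, in SD units
var_c = 0.25                # variance of a 50/50 group indicator
var_x = 1.0 + d * d / 4.0   # within-group variance 1 plus between-group part
var_y = 2.0 + d * d / 4.0   # speed plus independent strength
cov_xc = cov_yc = d / 4.0
cov_xy = var_x              # X is exactly the speed component of Y

r_xy = cov_xy / sqrt(var_x * var_y)
r_xc = cov_xc / sqrt(var_x * var_c)
r_yc = cov_yc / sqrt(var_y * var_c)

def partial(r_ab, r_ac, r_bc):
    """First-order partial correlation r_AB given C."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac**2) * (1 - r_bc**2))

r_yc_x = partial(r_yc, r_xy, r_xc)  # Cleary: group-criterion, test held fixed
r_xc_y = partial(r_xc, r_xy, r_yc)  # Darlington 3: group-test, criterion fixed
print(r_yc_x, r_xc_y)  # the first is zero, the second is clearly positive
```

So the test is unbiased by Cleary's criterion (r_YC·X = 0) yet "unfair" by Darlington's Definition 3 (r_XC·Y ≠ 0), exactly as the text argues.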

At this point, we should note a related issue raised by Linn and Werts (1971). They too considered the case in which the criterion is affected by more than one ability, one of which is not assessed by the test. If the test assessed only verbal ability and the only racial differences were on verbal ability, then the situation would be like that described in the preceding paragraph: The test would be unbiased by the Cleary definition, but unfair according to Darlington's Definition 3. However, if there are also racial differences on the unmeasured ability, then the test will not be unbiased by Cleary's definition. For example, if blacks were also lower, on the average, in numerical ability and numerical ability was not assessed by the entrance test, then the black regression line would fall below the white line, and the test would be biased against whites by Cleary's definition. According to Darlington's Definition 3, on the other hand, the verbal ability test would be fair if, and only if, the racial difference on the numerical test were of exactly the same magnitude in standard score units as the difference on the verbal test. If the difference on the missing ability were less than the difference on the observed ability, then Darlington's definition would label the test unfair to blacks, whereas if the difference on the missing ability were larger than the difference on the observed ability, then the test would be unfair to whites. Furthermore, if the two abilities being considered were not the only causal factors in the determination of the criterion (i.e., if personality or financial difficulties, etc., were also correlated), then these statements would no longer hold. Rather, the fairness of the ability test under consideration would depend not only on the size of racial differences on the unknown ability, but on the size of racial differences on the other unknown causal factors as well. That is, according to Darlington's Definition 3, the fairness of a test cannot be related to the causal determination of the criterion until a perfect multiple regression equation on known predictors has been achieved. That is, Darlington's definition can be statistically but not substantively evaluated in real situations.

For the purpose of illustration, we now consider a simplified theory of academic achievement in college. Suppose that the college entrance test were in fact a perfect measure of academic ability for high school seniors. Why is the validity not perfect? Consider three men of average ability. Sam meets and marries "wonder woman." She scrubs the floor, earns $200 a week, and worships the ground Sam walks on. Sam carries a B average. Bill dates from time to time, gets hurt a little, turns off on girls who like him once or twice, and generally has the average number of ups and downs. Bill carries a C average. Joe meets and marries "Wanda the witch." She lies around the house, continually nags Joe about money, and continually reminds him that he is "sexually inadequate." As Joe spends more and more time at the local bar, his grades drop to a D average, and he is ultimately thrown out of school. In a nutshell, the theory of academic achievement that we wish to consider is this: Achievement equals ability plus luck, where luck is a composite of few or no money troubles, sexual problems, automobile accidents, deaths in the family, and so on. There are few known correlations between luck and ability and few known correlations between luck and personality, but for simplicity of exposition, let us ignore these and assume that luck is completely independent of ability and personality. Then luck in this theory is the component of variance in academic performance that represents the situational factors in performance that arise after the test is taken and during the college experience. Some of these factors are virtually unpredictable: just what girl he happens to sit beside in class, whether he arrives at the personnel desk before or after the opening for the ideal part-time job is filled, whether he gets a good advisor or a poor one, and so on. Some of these factors may be predicted: family financial support in case of financial crisis, probable income of spouse (if any), family pressure for continuing in college in case of personal crisis, and so on. However, even those factors that may be predicted are potentialities and will not


actually be relevant unless other, nonpredictable events occur. Thus, there will inevitably be a large random component to the situational factors in performance that is not measurement error, but that has the same effect as measurement error in that it sets an upper bound on the validity of any predictor test battery even if it includes biographical information on family income, stability, and the like.

According to this theory, a difference between the black and white regression lines (over and above the effect of test unreliability) indicates that blacks are more likely to have bad luck than are whites. Before going on to the statistical questions, we note that because we have assumed a perfect ability test, there can be no missing ability in the following discussion. And because we have assumed that nonability differences are solely determined by luck, the entity referred to as "motivation" is in this model simply the concrete expression of luck in terms of overt behavior. That is, in the present theory, motivation is assumed to be wholly determined by luck and hence already included in the regression equation.

Now let us consider the statistical interpretations of the fairness of our hypothetical, perfectly valid (with respect to ability) and perfectly reliable test. Because on the average blacks are assumed to be unlucky as well as lower in ability, the racial difference in college achievement in this model will be greater than that predicted by ability alone, and hence the regression lines of college performance compared with ability will not be equal. The black regression line will be lower. Thus, according to Cleary, the test is biased against whites. According to Thorndike, the test may be approximately fair (perhaps slightly biased against blacks). According to Darlington, the test could be either fair or unfair: If the racial difference on luck were about the same in magnitude as the race difference on the ability test, then the test would be fair; but if the race difference on luck were less than the difference on ability, then the test would be unfair to blacks. That is, the Darlington assessment of the fairness of the test would not depend on the validity of the test in assessing ability, but on the relative harshness

of the personal-economic factors determining the amount of luck accorded the two groups. Darlington's statistical definition thus does not fit his substantive derivation in this context—unless one is willing to accept luck as an "ability" inherent to a greater extent in some applicants than in others.

The problem with Darlington's definition becomes even clearer if we alter slightly the theory of the above paragraph. Suppose that the world became more benign and that the tendency for blacks to have bad luck disappeared. Then, making the same assumptions as above (i.e., a perfect test and our theory of academic achievement), the regression curves would be equal and r_YC·X = 0. Thus, according to Cleary's definition, the test would be unbiased. Darlington's Definition 3 would then label the test unfair to blacks. This last statement is particularly interesting. In our theory we have assumed that exactly the same ability lay at the base of performance on both the test and later GPA. Yet it is not true in our theory that r_XC·Y = 0. Thus, this example has shown that Darlington's substantive interpretation of r_XC·Y does not hold with our additional assumption (of a nonstatistical nature), and hence his argument as to the substantive justification of his definition is not logically valid.

We note in passing that this last example poses a problem for Cleary's definition as well as for Darlington's. If the difference between the regression lines were in fact produced by group differences in luck, then would it be proper to label the test biased? And if this model were correct, how many unqualified individualists would feel comfortable using separate regression lines to take into account the fact that blacks have a tougher life (on the average) and hence make poorer GPAs than their ability would predict? In the case of both definitions, this analysis points up the necessity of substantive models and considerations. Statistical analyses alone can obscure as much as they illuminate.

Darlington's Definition 3 received little attention until a novel and persuasive argument in its favor was advanced by Cole (1973). Her argument was this: Consider those applicants who would be "successful" if selected.


Should not such individuals have equal probability of being selected regardless of racial or ethnic group membership? Under the assumption of equal slopes and standard deviations for the two groups, the answer to her question is in the affirmative only if the two regression lines of test on criterion are the same (and hence r_XC·Y = 0). That is, Cole's definition is the same as Cleary's with the roles of the predictor and criterion reversed. However, this similarity of statement does not imply compatibility—just the reverse. If there are differences between the races on either test or criterion, then the two definitions are compatible only if the test validity is perfect, so the two definitions almost invariably conflict.

Although Cole's argument sounds reasonable and has a great deal of intuitive appeal, it is flawed by a hidden assumption. Her definition assumes that differences between groups in probability of acceptance given later success if selected are due to discrimination based on group membership. Suppose that the two regression lines of criterion performance as a function of the test are equal (i.e., the test is Cleary-defined unbiased). If a black who would have been successful on the criterion is rejected and a white who fails the criterion is accepted, this need not imply discrimination. The black is not rejected because he is black but because he made a low score on the ability test. That is, the black is rejected because his ability at the time of the predictor test was indistinguishable from that of a group of other people (of both races) who, on the average, would have low scores on the criterion.

To make this point more strongly, we note that according to Cole's definition of a fair test, it is unethical to use a test of less than perfect validity. To illustrate this, consider the use of a valid ability test to predict academic achievement in any one group, say whites, applying for university admission. If the university decides to take only the people in the top half of the distribution of test scores, then acting under Cole's definition, applicants in the bottom half may well file suit charging discriminatory practice. According to Cole, an applicant who would be successful if selected should have the same probability of being selected regardless of group membership. That is, among the applicants who would have been successful had they been selected, there are two groups. One group of applicants has a probability of selection of 1.00 because their scores on the entrance exam are higher than the cutoff point. The other group of potentially successful applicants has a selection probability of .00 because their exam scores are lower than the cutoff point. According to Cole, we should ask: Why should a person who would be successful be denied a college berth merely because he had a low test score? After all it's success that counts, not test scores. But the fact is that for any statistical procedure that does not have perfect validity, there must always be people who will be incorrectly predicted to have low performance, that is, there will always be successful people whose predictor scores were down with the generally unsuccessful people instead of up with the generally successful people (and vice versa). In that sense, anything less than a perfect test will always be "unfair" to the potentially high achieving people who were rejected. It can be seen that lack of perfect validity functions in exactly the same way as test unreliability, discussed earlier.
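The size of this effect can be illustrated under a bivariate normal model. Assuming a validity of .60 and cutoffs at the mean on both test and criterion, the standard orthant-probability identity gives the chance that a would-be-successful applicant clears the test cutoff:

```python
from math import asin, pi

rho = 0.6  # assumed test validity
# P(X > 0 and Y > 0) for standard bivariate normal variables with
# correlation rho: the classical orthant-probability identity.
p_both = 0.25 + asin(rho) / (2.0 * pi)
# Half of all applicants would succeed (Y > 0), so conditioning:
p_selected_given_success = p_both / 0.5
print(round(p_selected_given_success, 3))  # roughly 0.70
```

That is, under these assumptions roughly 30% of the applicants who would have succeeded are rejected with certainty, which is the group-level "unfairness" Cole's definition objects to.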

As noted earlier in the case of Thorndike's definition, this problem can be partly overcome in practice if social consensus restrictions could be put on the defining of bona fide minority groups. But given the almost unlimited number of potentially definable social groups, it is unlikely that social or legal consensus could be reached limiting the application of this definition to blacks, Chicanos, American Indians, and a few other groups.

Basically Cole has noted the same fact that Thorndike noted: In order for a test with less than perfect validity to be fair to individuals, the test must be unfair to groups. In particular, in our example, the group of applicants who score below average on the test will have none of their members selected despite the fact that some of them would have shown successful performance if selected. It is thus unfair to this group. However, it is fair to each individual, since each is selected or rejected based on the best possible estimate of his future performance. It is perhaps important to note that this is not a problem produced by the use of psychological tests; it is a problem inherent in reality. Society and its institutions must make selection decisions. They are unavoidable. Elimination of valid psychological tests will usually mean their replacement with devices or methods with less validity (e.g., the interview), thus further increasing the unfairness to individuals and/or groups.

Darlington's Definition 4

The fourth concept of test bias discussed by Darlington (1971) defines a test as fair only if r_CX = 0. That is, by this definition, a test would be unfair if it showed any mean difference between the races at all, regardless of the size of difference that might exist on some criterion that is to be predicted. If the same cutoff score is to be used for blacks and whites, then this statistical criterion corresponds to the use of population-based quotas. If separate regression lines and hence separate cutoff scores are to be used, then mean differences on the test are irrelevant to the issue of quotas.

A Fifth Definition of Test Fairness

After defining and discussing four different statistical models of test fairness, Darlington (1971) turned to the commonly occurring prediction situation in which there is a difference favoring whites on both the test and the criterion and the black regression equation falls below that for whites. This situation is shown in Figure 3a. Noting that the use of separate regression equations (or the equivalent, use of multiple regression equations with race as a predictor), as required by Cleary's (1968) definition, would admit or select only an extremely small percentage of blacks, Darlington introduced his concept of the "culturally optimum" test. Darlington suggested that admissions officers at a university be asked to consider two potential graduating seniors, one white and the other black, and to indicate how much higher the white's GPA would have to be before the two candidates would be "equally desirable for selection" (p. 79). This number is symbolized K and given a verbal label such as "racial adjustment coefficient." Then in determining the

[Figure 3: four panels showing black and white regression lines. 3a. The Original Data; 3b, 3c, and 3d. The Doctored Data.]

FIGURE 3. Darlington's (1971) method of doctoring the data to define a culturally optimum test.

fairness of the test, K is first subtracted from the actual criterion scores (GPAs) of each of the white subjects. If these altered data satisfy Cleary's (1968) definition of a fair test, the test is considered culturally optimum.

Figure 3 illustrates the geometrical meaning of Darlington's doctored criterion. If the admissions officer chooses a value of K that is equal to ΔY in Figure 3a, then the altered data will appear as in Figure 3b, that is, there will be a single common regression line and the test as it stands will be culturally optimum. If, however, an overzealous admissions officer chooses a value of K greater than ΔY, then the doctored data will appear as in Figure 3c, that is, the test will appear to be biased against blacks according to Cleary's definition and will thus not be culturally optimum. Similarly, should an uncooperative admissions officer select a value of K < ΔY, then the altered data would look like Figure 3d and would thus appear to be biased against whites by Cleary's criterion and hence show the test not to be culturally optimum.

Thus, a cynic might well assume that in practice Darlington's definition of culturally optimum would simply lead to the selection of a value of K that would make whatever test was being used appear to be culturally optimum. But suppose that admissions officers


were willing to choose K without looking at the effect that their choices would have on the data. How then are the nonoptimum tests to be used? One might suppose that since Darlington is eager to doctor the criterion data, he would also be willing to doctor the predictor data as well. For example, the situation in Figure 3d could be remedied by simply subtracting a suitable constant from each black's test score. However, Darlington permits only the doctoring of criterion scores; he is opposed to doctoring the predictor scores.

If the predictor scores are not manipulated by a direct mathematical rescoring, then the same effect must be obtained by constructing a new test. Consider then the situation shown in Figure 3d. The new test must have the ultimate effect of giving blacks lower scores than they would have gotten on the original test, while leaving white scores untouched. Thus, the test constructor is in the awkward position of adding items that are biased against blacks in order to make the test fair!

Darlington was not unaware of this problem, though he dealt with it in a different place. Darlington would not label the test fair unless the two regression lines using the doctored criterion were equal. But what if we do not yet have a culturally fair test? What did Darlington recommend as an interim procedure for using an unfair test? He stated that the unfair test can be used in a fair way if it is combined with race in a multiple regression equation based on the doctored criterion (i.e., if separate doctored regression lines are used). Thus, he used the unequal doctored regression lines in much the same way as Cleary recommended use of the unequal undoctored regression lines. What does this procedure come to in the case in which the administrator has chosen a value of K that is too low to label the existing test culturally optimum? If the doctored regression line for blacks is still below the doctored regression line for whites, then the beta weight for blacks will still be negative and the multiple regression equation will implicitly subtract that weight from each black's score.

What would an administrator do if this were pointed out to him? We believe that he would react by increasing the value of K to make the doctored regression lines equal. That is, we think that the actual consequence of Darlington's recommendation for the fair use of an unfair test would be to further increase the likelihood of using a value of K that makes the doctored regression lines equal. That is, we believe that Darlington's definition of fair use, like his definition of fair test, is most likely to result in a harried administrator choosing K to eliminate the difference between the doctored regression lines. If we are right in this, then it means that Darlington's recommendations for fair use will lead in practice to simply labeling existing tests culturally optimum. We feel this bolsters our earlier argument that this is also the likely result of his basic procedure for defining a test as fair.

What is the upshot of Darlington's suggestion? From a mathematical point of view, adding or subtracting a constant to the criterion is exactly equivalent to adding or subtracting constants to the predictor, and this in turn is equivalent to using different cutoff scores for the two groups. Thus, in the last analysis, Darlington's method is simply an esoteric way of setting quotas, and hence the expense of constructing culturally optimum tests to do this is a waste of time and money.
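The equivalence is visible in two lines of algebra. In the sketch below, the slope, intercepts, K, and criterion cutoff are all assumed illustrative values: subtracting K from the white criterion scores simply raises the white test cutoff by K divided by the common slope, which is exactly a group-specific cutoff.

```python
a, b_white, b_black = 0.6, 0.2, -0.1  # assumed common slope and group intercepts
y_cut = 0.5                           # assumed criterion cutoff
K = 0.3                               # assumed racial adjustment coefficient

# Selection solves y_cut = a * x + intercept for x in each case:
x_white_undoctored = (y_cut - b_white) / a
x_white_doctored = (y_cut - (b_white - K)) / a  # doctored line: Y - K = aX + (b_white - K)
print(x_white_doctored - x_white_undoctored)    # the shift equals K / a
```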

ETHICAL POSITIONS, STATISTICAL DEFINITIONS, AND PROBLEMS

In this section, we briefly relate each ethical position to its appropriate statistical operation and point out some of the advantages and disadvantages of each approach.

Unqualified Individualism

The ethical imperative of individualism is to apply to each person that prediction procedure which is most valid for that person. Thus, white performance should be predicted by that test which has maximum validity for whites. Black performance should be predicted using that test which has maximum validity for blacks. The person with the highest predicted criterion score is then selected.

There is no reason why the test used to select blacks need be the same as that used to select whites. Indeed, if there is a more


valid test for blacks, then it is ethically wrong not to use it. Furthermore, in situations in which the mean black criterion performance is lower than that for whites, the number of blacks admitted is maximized by using that test which has maximum validity for blacks.

Consider the alternative. Suppose there is a group for which the test used has low validity. For simplicity assume no validity at all. Then, the predicted criterion score for everyone in that group is the same—the mean criterion score for that group. Thus, either everyone in that group is accepted or everyone in that group is rejected. If that group is in fact highly homogeneous on the criterion, then this is perfectly reasonable. But if the zero validity group has the same degree of spread on the criterion as other groups, then this lack of discrimination poses ethical problems: either a great many poor prospects are being admitted, or a great many excellent prospects are being overlooked. Because selection ratios are typically low (say 50% or less), this means that the use of a low validity test for some groups is likely to mean that that group is virtually eliminated. Thus, indeed, it is important to seek the maximum validity test for any group. But there is little evidence to suggest that different demographic groups will in fact require different tests. In an age of mass culture, this seems a very implausible hypothesis for most such groups. For example, the research evidence strongly indicates that differential validity by race is no more than a chance phenomenon (Schmidt, Berner, & Hunter, 1973). The same may later be shown with respect to other population subgroups, thus greatly reducing the scope of this problem. The problem would not thus be eliminated, however; although the same tests may be valid across population subgroups and regression slopes may be equal, there is much research evidence (Reilly, 1973; Schmidt & Hunter, 1974; Ruch, Note 1) that intercepts often differ significantly. That is, the same test may be a maximum validity test for many groups, but it need not therefore be unbiased by Cleary's definition. Thus, some adjustment for differences in group intercepts would still have to be made.
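The all-or-none character of zero-validity prediction is easy to see in a toy example (all numbers below are assumed purely for illustration): with a validity of zero, the least-squares prediction collapses to the group's criterion mean, so every member of the group receives the same predicted score.

```python
scores = [35.0, 42.0, 50.0, 61.0]  # assumed test scores within the group
criterion_mean, validity = 2.8, 0.0

mean_score = sum(scores) / len(scores)
# Regression prediction: group mean plus validity times the test deviation.
predictions = [criterion_mean + validity * (s - mean_score) for s in scores]
print(predictions)  # every applicant gets the same predicted score, 2.8
```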

Qualified Individualism

For the most part, the qualified individualist is also concerned with maximum validity. However, should there be a subgroup for whom there was low validity, it would pose greater problems for the qualified individualist because he cannot give different tests to different groups. Thus, should such a case ever be found, the qualified individualist would presumably respond by searching for a less valid test (for the population as a whole) that had less variability in subgroup validity.

There is another, more subtle but perhaps more real, problem that advocates of qualified individualism must face. The ethical imperative here requires that the prediction equation that has maximum validity for the entire population, without regard to group membership, be identified and employed. But there is a problem with this solution. Suppose, for example, that for a certain city college the black regression line falls below the white regression line; that is, race is a valid predictor for that college. Use of race as a predictor is, of course, forbidden to the qualified individualist, but there may be alternative ways of increasing the overall validity of the prediction equation that are equally objectionable. For example, if race is a valid predictor, then a properly coded version of the student's address may also be a valid predictor and increase overall validity. This "indirect indicator of race" would probably be detected and rejected, but a more subtle cue might not be properly identified. In particular, the most subtle problem is the one facing the test constructor: If the black regression line falls below the white regression line, then the introduction of items whose content is biased against blacks would increase the overall validity of the test. If the separate regression lines of the unqualified individualist are used, then racially biased test material would have no effect on the selection of applicants. But if that is forbidden, then material biased against blacks would lower the black scores on the predictor and hence make their scores using the white regression line more accurate. That is, the introduction of material biased against blacks would reduce the overprediction of black performance and hence raise the validity of a one-regression-line use of the test.
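This overprediction effect can be checked with a small simulation. The group labels, the half-standard-deviation intercept gap, and the size of the "bias" term below are arbitrary illustrative choices, not quantities from the text.

```python
import numpy as np

# Simulation of the one-regression-line argument: when one group's
# criterion line has a lower intercept, depressing that group's test
# scores raises the single-equation validity. All parameters are
# arbitrary illustrative choices.
rng = np.random.default_rng(0)
n = 100_000
group = rng.integers(0, 2, n)          # 1 = group with the lower line
x = rng.normal(0.0, 1.0, n)            # unbiased test score

# Criterion: common slope, but an intercept 0.5 lower for group 1,
# so a single regression line overpredicts that group's performance.
y = x - 0.5 * group + rng.normal(0.0, 1.0, n)

# "Biased" test: items that depress group-1 scores by the same 0.5.
x_biased = x - 0.5 * group

r_unbiased = np.corrcoef(x, y)[0, 1]
r_biased = np.corrcoef(x_biased, y)[0, 1]
print(round(r_unbiased, 3), round(r_biased, 3))  # biased version is higher
```

The biased test correlates more highly with the criterion precisely because it has absorbed the intercept difference, which is the authors' point: single-equation validity maximization can reward biased item content.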

The problem, in its general form, is that any measured variable which correlates with race, sex, religion, and so on (i.e., shows group differences) can be considered to be an indirect (and imperfect) indicator of group membership. Because he is forbidden to use group membership itself as a predictor even if valid, the qualified individualist may be tempted to substitute indirect indicators of group membership that may be unfair. How can he decide whether a given race-correlated predictor is fair or unfair? We discuss two such criteria: (a) Is the relation between predictor and criterion an "intrinsic" one? (b) Is the within-groups validity high enough?

The first criterion is the apparent intrinsicness of the relationship between the predictor and performance. If the predictor is a job sample test (e.g., a typing test) assessing the skills actually required on the job, there is little doubt that the relation is intrinsic. Scores on a written achievement test could also easily pass this test, as would a face-valid aptitude test. Scores on a weighted biographical information inventory, on the other hand, would be allowed only if they were able to meet the second, less subjective standard: high validity coefficients for both groups separately. Thus, the qualified individualist's answer to the question posed in the example above is that if the material to be added to the test appears to have an intrinsic relation to performance, it is ethically admissible. It is not biased against blacks as blacks but merely against applicants (of whatever race) who are less capable of performing well on the criterion. The fact that there happen to be proportionately more blacks with low ability than whites is an ethically irrelevant fact.

While the preceding distinctions are regarded as crucial among qualified individualists, they receive short shrift from those committed to other ethical positions. Those who are unqualified individualists will argue that parental income is an indicator of motivation to do well in school or on the job and that such motivation is surely intrinsic to high performance. That is, the unqualified individualist says that the question of whether a variable is intrinsically related to performance is subject to empirical test: If it is correlated with performance, then it is intrinsically related, whereas if it is not correlated, then it is not. Those who promote quotas will also reject the validity of the intrinsic-extrinsic distinction. They argue that ability just means whether or not you went to a good school and is thus highly contaminated with extrinsic elements. The qualified individualist may offer scientific theories in arguing his case, but his opponents will simply argue that the theories are wrong. And the data we have in 1975 will not decide the argument.

Is there a less subjective way to test for an intrinsic relation between the predictor and the criterion? Certainly one test is within-groups validity. If the relation is intrinsic, then there should be a correlation for each group separately. But how high should that correlation be? Certainly statistical significance is no answer. If a college were gathering data on a new test on an entering class of 4,000, then it would take a within-groups validity of only .01 to be significant. On the other hand, if we set some standard, such as .10, we run into another problem. If the within-groups validity of some piece of biographical information were .10, while the correlation with race were .70, it would be clear that most of the validity of the test would lie in its serving as an indirect indicator of race. Indeed, one might well consider requiring that the correlation with race be less than the within-groups validity.

But rather than continue searching for arbitrary standards, let us consider a purely statistical definition of an indirect indicator of race: X is an indirect indicator of race to the extent that it correlates more highly with race than is required by its relation with the criterion. By this definition X would be safe from a charge of being an indirect indicator if r_{XC·Y} = 0, that is, if it satisfied the Darlington-Cole definition of a fair test! And indeed the Darlington-Cole definition relates the within-groups validity to the test race difference. If the within-groups validity were .10, then by this definition, the biographical item would be an indirect indicator of race unless the standard score difference on the test were less than one tenth the standard score difference on the criterion. But for all practical purposes, this means that there cannot be any difference on the test at all!
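The condition r_{XC·Y} = 0 can be written out with the standard partial correlation formula: it holds exactly when the test-race correlation equals the product of the validity and the criterion-race correlation. In the sketch below, the .10 validity is the value from the text, while the criterion-race correlation of .30 is an assumed illustrative value.

```python
import math

def partial_corr(r_xc, r_xy, r_cy):
    """Partial correlation of X and C (race) with Y held constant."""
    return (r_xc - r_xy * r_cy) / math.sqrt((1 - r_xy**2) * (1 - r_cy**2))

# r_{XC.Y} = 0 holds exactly when r_xc = r_xy * r_cy, so with a
# within-groups validity of .10 the test-race correlation may be at
# most one tenth of the criterion-race correlation.
r_xy = 0.10               # within-groups validity (from the text)
r_cy = 0.30               # assumed criterion-race correlation (illustrative)
r_xc_fair = r_xy * r_cy   # largest "fair" test-race correlation: .03

print(partial_corr(r_xc_fair, r_xy, r_cy))  # exactly zero: the fair boundary
```

Any larger test-race correlation makes the partial correlation positive, and by this definition the predictor then counts as an indirect indicator of race.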

But that brings us full circle on the statistical question. The concept of an intrinsic relation is inherently a matter of causal arguments, and as we noted above, it cannot thus be assessed by any statistical procedure. Thus, as we noted in our discussion of the theoretical objections to the intrinsic-extrinsic distinction, most such arguments between adherents of different ethical positions will come down to scientific issues that cannot be resolved on the basis of the data at hand today.

Quotas

The main technical question for an adherent of quotas is, Whose? Once the quotas have been set, the only remaining ethical question is how to select from within each group. Although some would use random selection within groups, most would invoke individualism at this point. With this assumption, the optimal strategy for filling quotas can be stated: For each group, obtain predicted criterion performance using that test which has maximum validity for the given group. If the test with maximum validity for blacks is not the test with maximum validity for whites, then it is unethical to use the same test for both.
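The within-group strategy just stated can be sketched as a simple ranking procedure. The group names, quotas, and predicted scores below are all hypothetical; the predictions for each group are assumed to come from that group's maximum-validity test.

```python
# Sketch of quota filling with individualism within groups: each group
# is ranked on its own (maximum-validity) predicted criterion score,
# and the top candidates up to that group's quota are taken.
def fill_quotas(predicted_by_group, quotas):
    """predicted_by_group: {group: [(applicant_id, predicted_score), ...]}
    quotas: {group: number of slots}. Returns selected ids per group."""
    selected = {}
    for group, applicants in predicted_by_group.items():
        ranked = sorted(applicants, key=lambda a: a[1], reverse=True)
        selected[group] = [aid for aid, _ in ranked[:quotas[group]]]
    return selected

# Hypothetical applicants with group-specific predicted scores.
applicants = {
    "group_a": [("a1", 1.2), ("a2", 0.4), ("a3", -0.1)],
    "group_b": [("b1", 0.9), ("b2", -0.5), ("b3", 0.2)],
}
print(fill_quotas(applicants, {"group_a": 1, "group_b": 2}))
# {'group_a': ['a1'], 'group_b': ['b1', 'b3']}
```

Note that selection never compares applicants across groups; the only cross-group decision is the quota itself, which is exactly the ethical question the authors set aside here.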

The major problem for a quota-based system is that the criterion performance of selectees as a whole can be expected to be considerably lower than under unqualified or even qualified individualism. In college selection, for example, the poor-risk blacks who are admitted by a quota are more likely to fail than are the higher scoring whites who are rejected because of the quota. Thus, in situations in which low criterion performance carries a considerable penalty, being selected on the basis of quotas is a mixed blessing. Second, there is the effect on the institution. The greater the divergence between the quotas and the selection percentages based on actual expected performance, the greater the difference in mean performance of those selected. If lowered performance is met by increased rates of expulsion or firing, then the institution is relatively unaffected, but (a) the quotas are undone and (b) there is considerable anguish for those selected who don't make it.5 On the other hand, if the institution tries to adjust to the candidates selected by quotas, there may be great cost and inefficiency. Finally, there is one other problem that academic institutions must face. Quotas will inevitably lower the average performance of graduating seniors, and hence lower the prestige rating of the school. Similar considerations apply in the employment setting. In both cases, the effect of these changes on the broader society must also be considered. These effects are difficult to assess, but they may be quite significant.
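The expected performance cost can be illustrated with a small simulation comparing proportional within-group quotas with overall top-down selection. The group mean difference, the validity of about .6, and the 25% selection ratio are arbitrary illustrative choices.

```python
import numpy as np

# Compare mean criterion performance of selectees under (a) overall
# top-down selection on the test and (b) proportional within-group
# quotas. Group means, validity, and the selection ratio are arbitrary.
rng = np.random.default_rng(1)
n = 50_000
group = rng.integers(0, 2, n)            # 1 = lower-scoring group
x = rng.normal(-1.0 * group, 1.0)        # test; group-1 mean one SD lower
y = 0.6 * x + rng.normal(0.0, 0.8, n)    # criterion; validity roughly .6

ratio = 0.25                             # select the top 25%

# (a) top-down: one cutoff on the test for everyone.
cutoff = np.quantile(x, 1 - ratio)
top_down = y[x >= cutoff].mean()

# (b) quotas: the top 25% within each group separately.
quota_mask = np.zeros(n, dtype=bool)
for g in (0, 1):
    idx = np.where(group == g)[0]
    cut_g = np.quantile(x[idx], 1 - ratio)
    quota_mask[idx[x[idx] >= cut_g]] = True
quota = y[quota_mask].mean()

print(round(top_down, 2), round(quota, 2))  # quota mean is lower
```

The gap between the two means grows with the group difference and with the divergence between the quotas and performance-based selection rates, which is the relationship stated in the paragraph above.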

CONCLUDING REMARKS

We have presented three ethical positions with respect to the use of tests and other psychological devices in selecting people for entry into various kinds of institutions, and we have shown these ethical positions to be irreconcilable. We have also reviewed a number of attempts to define the fair or unbiased use of tests or other devices and have shown them to be related to different ethical positions. Moreover, we have shown that the scientific principles used to justify the statistical procedures vary considerably in their plausibility from one concrete selection situation to another. Indeed, we feel that we have shown that any purely statistical approach to the problem of test bias is doomed to rather immediate failure.

The dispute reviewed in this article is typical of ethical arguments: the resolution depends in part on irreconcilable values. Furthermore, even among those who agree on values there will be disagreements about the validity of certain relevant scientific theories that are not yet adequately tested. Thus, we feel that there is no way that this dispute can be objectively resolved. Each person must choose as he sees fit (and in fact we are divided). We do hope that we have clarified the issues to make the choice more explicitly related to the person's own values and beliefs.

5 Furthermore, the public image of the institution may suffer as much from the higher rate of expulsion as from the charge of discrimination in hiring. For example, if we read the trial records correctly, there is a company that deliberately reduced its entrance standards to hire more blacks. However, these people could not then pass the internal promotion tests and hence accumulated in the lowest level jobs in the organization. The government then took them to court for discriminatory promotion policies!

REFERENCE NOTE

1. Ruch, W. W. A re-analysis of published differential validity studies. In R. E. Biddle (Chair), Differential validation under Equal Employment Opportunity Commission and Office of Federal Contract Compliance testing and selection regulations. Symposium presented at the meeting of the American Psychological Association, Honolulu, September 1972.

REFERENCES

Cleary, T. A. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 1968, 5, 115-124.

Cole, N. S. Bias in selection. ACT Research Report No. 51. Iowa City, Iowa: The American College Testing Program, May 1972. (Also in Journal of Educational Measurement, 1973, 10, 237-255.)

Darlington, R. B. Another look at "cultural fairness." Journal of Educational Measurement, 1971, 8, 71-82.

Edwards, W., & Phillips, L. D. Man as transducer for probabilities in Bayesian command and control systems. In M. W. Shelly II & G. L. Bryan (Eds.), Human judgments and optimality. New York: Wiley, 1964.

Linn, R. L. Fair test use in selection. Review of Educational Research, 1973, 43, 139-161.

Linn, R. L., & Werts, C. E. Considerations for studies of test bias. Journal of Educational Measurement, 1971, 8, 1-4.

Reilly, R. R. A note on minority group test bias studies. Psychological Bulletin, 1973, 80, 130-132.

Schmidt, F. L., Berner, J. G., & Hunter, J. E. Racial differences in validity of employment tests: Reality or illusion? Journal of Applied Psychology, 1973, 58, 5-9.

Schmidt, F. L., & Hunter, J. E. Racial and ethnic bias in psychological tests: Divergent implications of two definitions of test bias. American Psychologist, 1974, 29, 1-8.

Thorndike, R. L. Concepts of culture-fairness. Journal of Educational Measurement, 1971, 8, 63-70.

APPENDIX

This appendix contains the mathematical calculation of the expected achievement level of the group that would be selected by the full application of Thorndike's criterion, that is, a group selected so that for each test score x, the number of people selected is proportional to the probability that persons at that test level would in fact be successful. The definition of successful used below is performance above average on the criterion. That is, the calculations below assume a 50% selection ratio.

For simplicity, both test and performance have been assumed to be measured in standard scores. The symbol φ(x) is the standard normal density function and the symbol Φ(x) is the standard normal cumulative distribution function. The symbol A is used for accepted (or selected for admission).

If the criterion of success is the top 50%, then in terms of standard scores, the success criterion is Y > 0. Thus, the conditional probability of being accepted is

P(A|X) = P[Y > 0 | X].

Let us simplify the following expressions by defining the parameter α by

α = r / √(1 − r²).

In particular, if r = .6, then

α = .6 / √(1 − .36) = .75.

We can then write

P(A|X) = P(z > −αx),

where z has a standard normal distribution. Thus,

P(A|X) = 1 − Φ(−αx) = Φ(αx).

If the number selected at each test score is P(A|X), then the overall selection ratio will be

P(A) = ∫ Φ(αx)φ(x) dx = 1/2.

The distribution of the test score among those selected is

f(x|A) = Φ(αx)φ(x) / P(A) = 2Φ(αx)φ(x).

Since X and Y are in standard score form, the regression of Y on X is given by E(Y|X) = rx. Thus, the mean criterion score among those selected will be

E(Y) = ∫ rx · 2Φ(αx)φ(x) dx = 2r ∫ xΦ(αx)φ(x) dx.

This is not an easy integral to calculate, and the calculation below will thus be broken into five steps. All integrals run from −∞ to +∞.

Step 1. First we apply the method of integration by parts. Define u(x) = −φ(x), so that u′(x) = xφ(x), and note that the derivative of Φ(αx) is αφ(αx). Then

∫ xΦ(αx)φ(x) dx = Φ(αx)u(x) − ∫ u(x) · αφ(αx) dx.

Step 2. Evaluated over the whole real line, the term Φ(αx)u(x) = −Φ(αx)φ(x) vanishes at both limits. Thus, we have the definite integral

∫ xΦ(αx)φ(x) dx = α ∫ φ(x)φ(αx) dx = (α/2π) ∫ e^(−(1+α²)x²/2) dx.

Step 3. Using the substitution x = t/β, where β = √(1 + α²), we can calculate the remaining integral:

∫ e^(−β²x²/2) dx = √(2π)/β.

Step 4. Thus, we can finally calculate the main integral:

∫ xΦ(αx)φ(x) dx = (α/2π)(√(2π)/β) = (1/√(2π))(α/√(1 + α²)).

Because α was defined to be α = r/√(1 − r²), we have 1 + α² = 1/(1 − r²) and hence

α/√(1 + α²) = r,

so that the main integral equals r/√(2π).

Step 5. Finally, we can use the main integral to calculate the expected achievement level:

E(Y) = 2r · [r/√(2π)] = 2r²/√(2π).

For r = .6, this formula yields E(Y) = .287.
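As a numerical check, the closed form E(Y) = 2r²/√(2π) can be compared against direct quadrature of 2r ∫ xΦ(αx)φ(x) dx. The integration limits and grid size below are arbitrary numerical choices; the integrand is negligible outside the chosen range.

```python
import math

# Numerical check of the appendix result E(Y) = 2r^2 / sqrt(2*pi)
# by trapezoid-rule quadrature of 2r * integral of x * Phi(a*x) * phi(x).
def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

r = 0.6
a = r / math.sqrt(1 - r * r)     # alpha = .75 for r = .6

# Trapezoid rule on [-8, 8]; the integrand is negligible outside.
n, lo, hi = 4000, -8.0, 8.0
h = (hi - lo) / n
total = 0.0
for i in range(n + 1):
    x = lo + i * h
    w = 0.5 if i in (0, n) else 1.0
    total += w * x * Phi(a * x) * phi(x)
numeric = 2 * r * total * h

closed = 2 * r * r / math.sqrt(2 * math.pi)
print(round(numeric, 4), round(closed, 4))  # both about .2872
```

The two values agree to the precision of the quadrature, confirming the five-step derivation above.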

(Received June 2, 1975)