1979 churchill a paradigm for developing better constructs in marketing

Measure and Construct Validity Studies

GILBERT A. CHURCHILL, JR,*

A critical element in the evolution of a fundamental body of knowledgein marketing, as well as for improved marketing practice, is the developmentof better measures of the variables with which marketers work. In this articlean appraach is outlined by which this goal can be achieved and portions

of the approach are illustrated in terms of a job satisfaction measure.

A Paradigm for Developing Better Measuresof Marketing Constructs

In an article in the April 1978 issue of the Journalof Marketing, Jacoby placed much of the blame forthe poor quality of some of the marketing literatureon the measures marketers use to assess their variablesof interest (p. 91):

More stupefying than the sheer number of our measuresis the ease with which they are proposed and theuncritical manner in which they are accepted. In pointof fact, mostof our measures are only measures becausesomeone says that they are, not because they havebeen shown to satisfy standard measurement criteria(validity, reliability, and sensitivity). Stated somewhatdifferently, most of our measures are no more sophisti-cated than first asserting that the number of pebblesa person can count in a ten-minute period is a measureof that person's intelligence; next, conducting a studyand fmding that people who can count many pebblesin ten minutes also tend to eat more; and, finally,concluding from this: people with high intelligence tendto eat more.

•Gilbert A. Churchill is Professor of Marketing, University ofWisconsin-Madison. The significant contributions of Michael Hous-ton, Shelby Hunt, John Nevin, and Michael Rothschild throughtheir comments on a draft of this article are gratefully acknowledged.as are the many helpful comments of the anonymous reviewers.

The AMA publications policy states: "No articie(s) will bepublished in the Journal of Marketing Research written by the Editoror the Vice President of Publications." The inclusion of this articlewas approved by the Board of Directors because: (1) the articlewas submitted before the author took over as Editor, (2) ihe authorplayed no part in its review, and (3) Michael Ray, who supervisedthe reviewing process for the special issue, formally requestedhe be allowed to publish it.

Burleigh Gardner, President of Social Research,Inc., makes a similar point with respect to attitudemeasurement in a recent issue of the Marketing News(May 5, 1978, p. 1):

Today the social scientists are enamored of numbersand counting. . .. Rarely do they stop and ask, "Whatlies behind the numbers?"When we talk about attitudes we are talking aboutconstructs of the mind as they are expressed in responseto our questions.But usually all we really know are the questions weask and the answers we get.

Marketers, indeed, seem to be choking on theirmeasures, as other articles in this issue attest. Theyseem to spend much effort and time operating bythe routine which computer technicians refer to asGIGO—garbage in, garbage out. As Jacoby so suc-cinctly puts it, "What does it mean if a finding issignificant or that the ultimate in statistical analyticaltechniques have been applied, if the data collectioninstrument generated invalid data at the outset?'' (1978,p. 90).

What accounts for this gap between the obviousneed for better measures and the lack of such mea-sures? The basic thesis of this article is that althoughthe desire may be there, the know-how is not. Thesituation in marketing seems to parallel the dilemmawhich psychologists faced more than 20 years ago,when Tryon (1957, p. 229) wrote:

If an investigator should invent a new psychologicaltest and then turn to any recent scholarly work for

64

Journal of Marketing ResearchVol. XVI (February 1979). 64-73

PARADIGM FOR DEVELOPING MEASURES OF MARKETING CONSTRUCTS 65

guidance on how to determine its reliability, he wouldconfront such an array of different formulations thathe would be unsure about how to proceed. After fiftyyears of psychological testing, the problem of discover-ing the degree to which an objective measure ofbehavior reliably differentiates individuals is still con-fused.

Psychology has made progress since that time.Attention has moved beyond simple questions ofreliability and now includes more "direct" assess-ments of validity. Unfortunately, the marketing litera-ture has been slow to reflect that progress. One ofthe main reasons is that the psychological literatureis scattered. The notions are available in many bitsand pieces in a variety of sources. There is nooverriding framework which the marketer can embraceto help organize the many definitions and measuresof reliability and validity into an integrated whole sothat the decision as to which to use and when isobvious.

This article is an attempt to provide such a frame-work. A procedure is suggested by which measuresof constructs of interest to marketers can be devel-oped. The emphasis is on developing measures whichhave desirable reliability and validity properties. Partof the article is devoted to clarifying these notions,particularly those related to validity; reliability notionsare well covered by Peter's article in this issue. Finally,the article contains suggestions about approaches onwhich marketers historically have relied in assessingthe quality of measures, but which they would dowell to consider abandoning in favor of some neweralternatives. The rationale as to why the newer al-ternatives are preferred is presented.

THE PROBLEM AND APPROACHTechnically, the process of measurement or opera-

tionalization involves "rules for assigning numbersto objects to represent quantities of attributes" (Nun-nally, 1967, p. 2). The definition involves two keynotions. First, it is the attributes of objects that aremeasured and not the objects themselves. Second,the definition does not specify the rules by whichthe numbers are assigned. However, the rigor withwhich the rules are specified and the skill with whichthey are applied determine whether the construct hasbeen captured by the measure.

Consider some arbitrary construct, C, such as cus-tomer satisfaction. One can conceive at any givenpoint in time that every customer has a "true" levelof satisfaction; call this level X,. Hopefully, eachmeasurement one makes will produce an observedscore. A",,, equal to the object's true score, Xj.Further, if there are differences between objects withrespect to their X,, scores, these differences wouldbe completely attributable to true differences in thecharacteristic one is attempting to measure, i.e., truedifferences in X,-. Rarely is the researcher so fortu-

nate. Much more typical is the measurement wherethe X^ score differences also reflect (Selltiz et al.,1976, p. 164-8):

1. True differences in other relatively stable charac-teristics which affect the score, e.g., a person swillingness to express his or her true feelings.

2. Differences due to transient personal factors, e.g.,a person's mood, state of fatigue.

3. Differences due to situational factors, e.g., whetherthe interview is conducted in the home or at a centralfacility.

4. Differences due to variations in administration, e.g.,interviewers who probe differently.

5. Differences due to sampling of items, e.g., thespecific items used on the questionnaire; if the itemsor the wording of those items were changed, theXg scores would also change.

6. Differences due to lack of clarity of measuringinstruments, e.g., vague or ambiguous questionswhich are interpreted differently by those respond-ing.

7. Differences due to mechanical factors, e.g., a checkmark in the wrong box or a response which is codedincorrectly.

Not all of these factors will be present in everymeasurement, nor are they limited to informationcollected by questionnaire in personal or telephoneinterviews. They arise also in studies in which self-ad-ministered questionnaires or observational techniquesare used. Although the impact of each factor on theA",, score varies with the approach, their impact ispredictable. They distort the observed scores awayfrom the true scores. Functionally, the relationshipcan be expressed as:

where:

X„ =

systematic sources of error such as stable char-acteristics of the object which affect its score,andrandom sources of error such as transient per-sonal factors which affect the object's score.

A measure is va//f/when the differences in observedscores reflect true differences on the characteristicone is attempting to measure and nothing else, thatis, X^ = Xj.. A measure is reliable to the extent thatindependent but comparable measures of the sametrait or construct of a given object agree. Reliabihtydepends on how much of the variation in scores isattributable to random or chance errors. If a measureis perfectly reliable, X^ = 0. Note that if a measureis valid, it is reliable, but that the converse is notnecessarily true because the observed score whenA"̂ = 0 could still equal Xj- + X^. Thus it is oftensaid that reliability is a necessary but not a sufficientcondition for validity. Rehability only provides nega-tive evidence of the validity of the measure. However,the ease with which it can be computed helps explain

JOURNAL OF MARKETING RESEARCH, FEBRUARY 1979

its popularity. Reliability is much more routinelyreported than is evidence, which is much more difficultto secure but which relates more directly to the validityof the measure.

The fundamental objective in measurement is toproduce X^ scores which approximate Xj- scores asclosely as possible. Unfortunately, the researchernever knows for sure what the A*7. scores are. Rather,the measures are always inferences. The quality ofthese inferences depends directly on the proceduresthat are used to develop the measures and the evidencesupporting their "goodness." This evidence typicallytakes the form of some reliability or vahdity index,of which there are a great many, perhaps too many.

The analyst working to develop a measure mustcontend with such notions as split-half, test-retest,and alternate forms rehability as well as with face,content, predictive, concurrent, pragmatic, construct,convergent, and discriminant validity. Because someof these terms are used interchangeably and othersare often used loosely, the analyst wishing to developa measure of some variable of interest in marketingfaces difficult decisions about how to proceed andwhat rehability and validity indices to calculate.

Figure 1 is a diagram of the sequence of steps that

can be followed and a list of some calculations that

should be performed in developing measures of mar-

Figure 1' SUGGESTED PROCEDURE FOR DEVELOPING BETTER

MEASURES

Recommended Coefficientsor TeGhniques

Specify domainof construct

Q^nerate sampleof items

Purifymeasure

Literature searchExperience surveyInsight stimulating examplesCritical incidentsFocus groups

Coefficient alphaFactor analyaia

Coilectdata

Assessreliability

Assessvalidity

Developnorms

Coefficient alphaSplit-half reliability

Hultitrait-multimathod niatrixCriterion validity

Average and other statisticssummariainq distribution of

keting constructs. The suggested sequence has workedwell in several instances in producing measures withdesirable psychometric properties (see Churchill etal., 1974, for one example). Some readers will un-doubtedly disagree with the suggested process or withthe omission of their favorite reliability or validitycoefficient. The following discussion, which detailsboth the steps and their rationale, shows that someof these measures should indeed be set aside becausethere are better alternatives or, if they are used, thatthey should at least be interpreted with the properawareness of their shortcomings.

The process suggested is only applicable to multi-item measures. This deficiency is not as serious asit might appear. Multi-item measures have much torecommend them. First, individual items usually haveconsiderable uniqueness or specificity in that eachitem tends to have only a low correlation with theattribute being measured and tends tc relate to otherattributes as well. Second, single items tend to cate-gorize people into a relatively small number of groups.For example, a seven-step rating scale can at mostdistinguish between seven levels of an attribute. Third,individual items typically have considerable measure-ment error; they produce unreliable responses in thesense that the same scale position is unlikely to bechecked in successive administrations of an instru-ment.

All three of these measurement difficulties can bediminished with multi-item measures: (1) the specific-ity of items can be averaged out when they arecombined, (2) by combining items, one can makerelatively fine distinctions among people, and (3) thereliability tends to increase and measurement errordecreases as the number of items in a combinationincreases.

The folly of using single-item measures is illustratedby a question posed by Jacoby (1978, p. 93):

How comfortable would we feel having our intelligenceassessed on the basis of our response to a singlequestion?" Yet that's exactly what we do in consumerresearch. . . . The literature reveals hundreds of in-stances in which responses to a single question sufficeto establish the person's level on the variable of interestand then serves as the basis for extensive analysisand entire articles.

. . . Given the complexity of our subject matter, whatmakes us think we can use responses to single items(or even to two or three items) as measures of theseconcepts, then relate these scores to a host of othervariables, arrive at conclusions based on such aninvestigation, and get away calling what we have done"quality research?"

In sum, marketers are much better served withmulti-item than single-item measures of their con-structs, and they should take the time to develop them.This conclusion is particularly true for those investi-gating behavioral relationships from a fundamental

PARADIGM FOR DEVELOPING MEASURES OF MARKETING CONSTRUCTS

as well as applied perspective, although it applies alsoto marketing practitioners.

SPECIFY DOMAIN OF THE CONSTRUCTThe first step in tbe suggested procedure for devel-

oping better measures involves specifying the domainof the construct. Tbe researcher must be exactingin delineating what is included in the definition andwhat is excluded. Consider, for example, the constructcustomer satisfaction, whicb lies at the heart of themarketing concept. Though it is a central notion inmodern marketing thought, it is also a construct whichmarketers bave not measured in exacting fashion.

Howard and Sbeth (1969. p. 145), for example, definecustomer satisfaction as

. . .the buyer's cognitive state of being adequately orinadequately rewarded in a buying situation for thesacrifice he has undergone. The adequacy is a conse-quence of matching actual past purchase and consump-tion experience with the reward that was expected fromthe brand In terms of its anticipated potential to satisfythe motives served by the particular product class.It includes not only reward from consumption of thebrand but any other reward received in the purchasingand consuming process.

Thus, satisfaction by their definition seems to beattitude. Further, in order to measure satisfaction,it seems necessary to measure both expectations atthe time of purchase and reactions at some time afterpurchase. If actual consequences equal or exceedexpected consequences, tbe consumer is satisfied, butif actual consequences fall short of expected conse-quences, the consumer is dissatisfied.

But what expectations and consequences should tbemarketer attempt to assess? Certainly one would wantto be reasonably exhaustive in the list of productfeatures to be included, incorporating such facets ascost, durability, quality, operating performance, andaesthetic features (Czepeiletal., 1974). But what aboutpurchasers' reactions to tbe sales assistance theyreceived or subsequent service by independent dealers,as would be needed, for example, after the purchaseof many small appliances? What about customerreaction to subsequent advertising or tbe expansionof the channels of distribution in wbich the productis available? What about the subsequent availabilityof competitors' alternatives wbich serve tbe sameneeds or the publishing of information about tbeenvironmental effects of using the product? To detailwhich of these factors would be included or howcustomer satisfaction should be operationalized isbeyond the scope of this article; ratber, tbe exampleemphasizes that the researcher must be exacting inthe conceptual specification of tbe construct and whatis and what is not included in tbe domain.

It is imperative, though, that researchers consultthe hterature when conceptualizing constructs andspecifying domains. Perhaps if only a few more had

done so, one of the main problems cited by Kollat,Engel, and Blackwell as impairing progress in con-sumer researcb—namely, tbe use of widely varyingdefinitions—could have been at least diminished (Kol-lat et al., 1970, p. 328-9).

Certainly definitions of constructs are means ratherthan ends in themselves. Yet the use of differentdefinitions makes it difficult to compare and accumu-late findings and thereby develop syntheses of whatis known. Researchers should bave good reasons forproposing additional new measures given the manyavailable for most marketing constructs of interest,and those publishing should be required to supplytheir rationale. Perhaps the older measures are inade-quate. The researcher should make sure this is thecase by conducting a thorough review of literaturein which the variable is used and should present adetailed statement of the reasons and evidence as towhy the new measure is better.

GENERA TE SAMPLE OF ITEMSTbe second step in tbe procedure for developing

better measures is to generate items which capturethe domain as specified. Those techniques that aretypically productive in exploratory research, includingliterature searches, experience surveys, and insight-stimulating examples, are generally productive here(Selltiz et al., 1976). Tbe literature should indicatehow the variable has been defined previously and howmany dimensions or components it has. The searchfor ways to measure customer satisfaction wouldinclude product brochures, articles in trade magazinesand newspapers, or results of product tests such asthose published by Consumer Reports. The experiencesurvey is not a probability sample but a judgmentsample of persons who can offer some ideas andinsights into tbe phenomenon. In measuring consumersatisfaction, it could include discussions with (I)appropriate people in the product group responsiblefor the product, (2) sales representatives, (3) dealers,(4) consumers, and (5) persons in marketing researchor advertising, as well as (6) outsiders who have aspecial expertise such as university or governmentpersonnel. The insight-stimulating examples could in-volve a comparison of competitors' products or adetailed examination of some particularly vehementcomplaints in unsolicited letters about performanceof the product. Examples which indicate sharp con-trasts or have striking features would be most produc-tive.

Critical incidents and focus groups also can be usedto advantage at the item-generation stage. To use thecritical incidents technique a large number of scenariosdescribing specific situations could be made up anda sample of experienced consumers would be askedwhat specific behaviors (e.g., product changes, war-ranty handling) would create customer satisfaction ordissatisfaction (Flanagan, 1954; Kerlinger, 1973, p.

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

68 JOURNAL OF MARKETING RESEARCH, FEBRUARY 1979

536). The scenarios might be presented to the respon-dents individually or 8 to 10 of them might be broughttogether in a focus group where the scenarios couldbe used to trigger open discussion among participants,although other devices might also be employed topromote discourse (Calder, 1977).

The emphasis at the early stages of item generationwould be to develop a set of items which tap eachofthe dimensions of the construct at issue. Further,the researcher probably would want to include itemswith slightly different shades of meaning because theoriginal list will be refined to produce the final measure.Experienced researchers can attest that seeminglyidentical statements produce widely different answers.By incorporating slightly different nuances of meaningin statements in the item pool, the researcher providesa better foundation for the eventual measure.

Near the end of the statement development stagethe focus would shift to item editing. Each statementwould be reviewed so that its wording would be asprecise as possible. Double-barreled statements wouldbe split into two single-idea statements, and if thatproved impossible the statement would be eliminatedaltogether. Some of the statements would be recastto be positively stated and others to be negativelystated to reduce "yea-" or "nay-" saying tendencies.The analyst's attention would also be directed atrefining those questions which contain an obvious"socially acceptable" response.

After the item pool is carefully edited, furtherrefinement would await actual data. The type of datacollected would depend on the type of scale usedto measure the construct.

PURIFY THE MEASUREThe calculations one performs in purifying a measure

depend somewhat on the measurement model oneembraces. The most logically defensible model is thedomain sampling model which holds that the purposeof any particular measurement is to estimate the scorethat would be obtained if all the items in the domainwere used (Nunnaliy, 1967, p. 175-81). The score thatany subject would obtain over the whole sampledomain is the person's true score, Xj..

In practice, though, one does not use all of theitems that could be used, but only a sample of them.To the extent that the sample of items correlates withtrue scores, it is good. According to the domainsamphng model, then, a primary source of measure-ment error is the inadequate sampling of the domainof relevant items.

Basic to the domain sampling model is the conceptof an infinitely large correlation matrix showing allcorrelations among the items in the domain. No singleitem is likely to provide a perfect representation ofthe concept, just as no single word can be used totest for differences in subjects' spelling abihties andno single question can measure a person's intelligence.

Rather, each item can be expected to have a certainamount of distinctiveness or specificity even thoughit relates to the concept.

The average correlation in this infinitely large ma-trix, r, indicates the extent to which some commoncore is present in the items. The dispersion of correla-tions about the average indicates the extent to whichitems vary in sharing the common core. The keyassumption in the domain sampling model is that allitems, if they belong to the domain of the concept,have an equal amount of common core. This statementimplies that the average correlation in each columnof the hypothetical matrix is the same and in turnequals the average correlation in the whole matrix(Ley, 1972, p. I l l ; Nunnally, 1967, p. 175-6). Thatis, if all the items in a measure are drawn from thedomain of a single construct, responses to those itemsshould be highly intercorrelated. Low mteritem cor-relations, in contrast, indicate that some items arenot drawn from the appropriate domain and are pro-ducing error and unreliability.

Coefficient AlphaThe recommended measure of the internal consis-

tency of a set of items is provided by coefficientalpha which results directly from the assumptions ofthe domain sampling model. See Peter's article in thisissue for the calculation of coefficient alpha.

Coefficient alpha absolutely should be the firstmeasure one calculates to assess the quality of theinstrument. It is pregnant with meaning because thesquare root of coefficient alpha is the estimatedcorrelation of the k-item test with errorless true scores(Nunnally, 1967. p. 191-6). Thus, a low coefficientalpha indicates the sample of items performs poorlyin capturing the construct which motivated the mea-sure. Conversely, a large alpha indicates that the A:-itemtest correlates well with true scores.

If alpha is low, what should the analyst do?' Ifthe item pool is sufficiently large, this outcome sug-gests that some items do not share equally in thecommon core and should be eliminated. The easiestway to find them is to calculate the correlation ofeach item with the total score and to plot thesecorrelations by decreasing order of magnitude. Itemswith correlations near zero would be eliminated.Further, items which produce a substantial or suddendrop in the item-to-total correlations would also bedeleted.

' What is "low" for alpha depends on the purpose of the research.For early stages of basic research. Nunnally (1967);iuggests reliabil-ities of .50 to .60 suffice and thai increasing reliabilities beyond.80 is probably wasteful. In many applied settings, however, whereimportant decisions are made with respect to specific test scores,"a reliability of .90 is the minimum that should be tolerated, anda reliability of .95 should be considered the desirable standard"(p. 226).

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Underline

abhigyan

Highlight

abhigyan

Highlight

abhigyan

Markup

Cancelled set by abhigyan

abhigyan

Highlight

abhigyan

Underline

PARADIGM FOR DEVELOPING MEASURES OF MARKETING CONSTRUCTS

If the construct had, say, five identifiable dimen-sions or components, coefficient alpha would becalculated for each dimension. The item-to-total cor-relations used to delete items would also be basedon the items in the component and the total scorefor that dimension. The total score for the constructwould be secured by summing the total scores forthe separate components. The reliability of the totalconstruct would not be measured through coefficientalpha, but rather through the formula for the reliabilityof linear combinations (Nunnally, 1967. p. 226-35).

Some analysts mistakenly calculate split-half reli-ability to assess the internal homogeneity of the mea-sure. That is, they divide the measure into two halves.The first half may be composed of all the even-num-bered items, for example, and the second half all theodd-numbered items. The analyst then calculates atotal score for eacb balf and correlates these totalscores across subjects. The problem with this approachis that the size of this correlation depends on tbeway the items are split to form the two halves. With,say, 10 items (a very smafl number for most measure-ments), there are 126 possible splits.^ Because eachof these possible divisions will likely produce a dif-ferent coefficient, what is the split-half reliability?Further, as the average of all of these coefficientsequals coefficient alpha, why not calculate coefficientalpha in the first place? It is almost as easy, is notarbitrary, and bas an important practical connotation.

Factor AnalysisSome analysts like to perform a factor analysis on

the data before doing anything else in the hope ofdetermining the number of dimensions underlying theconstruct. Factor analysis can indeed be used tosuggest dimensions, and the marketing literature isreplete witb articles reporting such use. Much lessprevalent is its use to confirm or refute componentsisolated by other means. For example, in discussinga test composed of items tapping two common factors,verbal fluency and number facility, Campbell (1976,p. 194) comments:

Recognizing multidimensionality when we see it is notalways an easy task. For example, rules for when tostop extracting factors are always arbitrary in somesense. Perhaps the wisest course is to always makethe comparison between the split half and internalconsistency estimates after first splitting the compo-nents into two halves on a priori grounds. That is,every effort should be made to balance the factor

^The number of possible splits with 2n items is given by the(2n)!

formula (Bohrnstedl, 1970). For the example cited.

10!/I = 5 and the formula reduces to

2(5!) (5!)

content of each half [part] before looking at componentintercorrelations.

When factor analysis is done before the purificationsteps suggested heretofore, there seems to be a ten-dency to produce many more dimensions than canbe conceptually identified. Tbis effect is partly dueto the "garbage items" which do not bave the commoncore but which do produce additional dimensions inthe factor analysis. Though this application may besatisfactory during the early stages of research ona construct, the use of factor analysis in a confirmatoryfashion would seem better at later stages. Further,theoretical arguments support the iterative process ofthe calculation of coefficient alpha, the eliminationof items, and the subsequent calculation of alpha untila satisfactory coefficient is achieved. Factor analysisthen can be used to confirm whether the number ofdimensions conceptualized can be verified empirically.

Iteration

The foregoing procedure can produce several out-comes. The most desirable outcome occurs when themeasure produces a satisfactory coefficient alpha (oralphas if there are multiple dimensions) and the dimen-sions agree with those conceptualized. The measureis then ready for some additional testing for whicha new sample of data should be collected. Second,factor analysis sometimes suggests that dimensionswhich were conceptualized as independent clearlyoverlap. In this case, the items which have pureloadings on the new factor can be retained and anew alpha calculated. If this outcome is satisfactory,additional testing with new data is indicated.

The third and least desirable outcome occurs whenthe alpha coefficient(s) is too low and restructuringof the items forming each dimension is unproductive.In this case, the appropriate strategy is to loop backto steps 1 and 2 and repeat the process to ascertainwhat might have gone wrong. Perhaps the constructwas not appropriately delineated. Perhaps the itempool did not sample all aspects of the domain. Perhapsthe emphases within the measure were somehowdistorted in editing. Perhaps the sample of subjectswas biased, or the construct so ambiguous as to defymeasurement. The last conclusion would suggest afundamental change in strategy, starting with a re-thinking of the basic relationships that motivated theinvestigation in the first place.

ASSESS RELIABILITY WITH NEW DATAThe major source of error within a test or measure

is the sampling of items. If the sample is appropriateand the items "look right," tbe measure is said tohave face or content validity. Adherence to the stepssuggested will tend to produce content valid measures.But that is not tbe whole story! What about transientpersonal factors, or ambiguous questions which pro-

abhigyan

Highlight

abhigyan

Highlight

abhigyan

Highlight

abhigyan

Underline


duce guessing, or any of the other extraneous influ-ences, other than the sampling of items, whicb tendto produce error in the measure?

Interestingly, all of the errors that occur within atest can be easily encompassed by the domain samplingmodel. All the sources of error occurring within ameasurement will tend to lower the average correlationamong the items within the test, but the averagecorrelation is all that is needed to estimate the reliabil-ity. Suppose, for example, that one of the items isvague and respondents have to guess its meaning.This guessing will tend to lower coefficient alpha,suggesting there is error in the measurement. Subse-quent calculation of item-to-total correlations will thensuggest this item for elimination.

Coefficient alpba is tbe basic statistic for determin-ing the reliability of a measure based on internalconsistency. Coefficient alpha does not adequatelyestimate, though, errors caused by factors extemalto the instrument, sucb as differences in testing situa-tions and respondents over time. If the researcherwants a reliability coefficient which assesses tbebetween-test error, additional data must be collected.It is also advisable to collect additional data to ruleout the possibility that tbe previous findings are dueto chance. If the construct is more than a measurementartifact, it should be reproduced when the purifiedsample of items is submitted to a new sample ofsubjects.

Because Peter's article treats tbe assessment ofreliability, it is not examined here except to suggestthat test-retest reliability should not be used. Tbe basicproblem witb straight test-retest reliability is respon-dents' memories. They will tend to reply to an itemthe same way in a second administration as they didin the first. Thus, even if an analyst were to puttogether an instrument in which the items correlatepoorly, suggesting there is no common core and thusno construct, it is possible and even probable thatthe responses to each item would correlate well acrossthe two measurements. The high correlation of thetotal scores on the two tests would suggest the measurehad small measurement error wben in fact very littleis demonstrated about vaUdity by straight test-retestcorrelations.

ASSESS CONSTRUCT VALIDITYSpecifying the domain of the construct, generating

items tbat exhaust the domain, and subsequentlypurifying the resulting scale should produce a measurewhich is content or face valid and reliable. It mayor may not produce a measure which has constructvalidity. Construct validity, which lies at the veryheart of tbe scientific process, is most directly relatedto the question of what the instrument is in factmeasuring—what construct, trait, or concept underliesa person's performance or score on a measure.

The preceding steps should produce an internally

consistent or internally homogeneous set of items.Consistency is necessary but not sufficient for con-struct validity (Nunnally, 1967, p. 92).

Rather, to estabhsb the construct validity of ameasure, tbe analyst also must determine (1) the extentto which the measure correlates with other measuresdesigned to measure the same thing and (2) whetherthe measure behaves as expected.

Correlations With Other MeasuresA fundamental principle in science is that any

particular construct or trait should be measurable byat least two, and preferably more, different methods.Otherwise the researcher has no way of knowingwhether tbe trait is anything but an artifact of themeasurement procedure. All the measurements of thetrait may not be equally good, but science continuallyemphasizes improvement of the measures of the vari-ables with which it works. Evidence of the convergentvalidity of the measure is provided by the extent towhich it correlates highly with other methods designedto measure the same construct.

The measures should bave not only convergentvalidity, but also discriminant validity. Discriminantvalidity is the extent to which the measure is indeednovel and not simply a reflection of some othervariable. As Campbell and Fiske (1959) persuasivelyargue, "Tests can be invalidated by too high correla-tions with other tests from which they were intendedto differ" (p. 81). Quite simply, scales that correlatetoo highly may be measuring the same rather thandifferent constructs. Discriminant validity is indicatedby "predictably low correlations between the measureof interest and other measures that are supposedlynot measuring the same variable or concept" (Heelerand Ray, p. 362).

A useful way of assessing the convergent anddiscriminant validity of a measure is through themultitrait-multimethod matrix, whicb is a matrix ofzero order correlations between different traits wheneach of the traits is measured by different methods(Campbell and Fiske, 1959). Table I, for example, isthe matrix for a Likert type of measure designed toassess salesperson job satisfaction (Churchill et al.,1974). The four essential elements of a multitrait-multimethod matrix are identified by the numbers intbe upper left corner of each partitioned segment.

Only the reliability diagonal (I) corresponding totbe Likert measure is shown; data were not collectedfor the thermometer scale because it was not of interestitself. The entries refiect the reliability of alternateforms admitiistered two weeks apart. If tbese areunavailable, coefficient alpba can be used.

Evidence about the convergent validity of a measureis provided in the validity diagonal (3) by the extentto which tbe correlations are significantly differentfrom zero and sufficiently large to encourage furtherexamination of vaUdity. The validity coefficients in

abhigyan

Highlight

abhigyan

Highlight

abhigyan

Highlight

PARADIGM FOR DEVELOPING MEASURES OF MARKETING CONSTRUCTS 71

Table 1MULTITRAIT-MULTIMETHOD MATRIX

Method 1—Likert Scale

Job Role RoleSatisfaction Conflict Ambiguity

Method 2—Themometer Scale

Job Bole RoleSatisfaction Conflict Ambiguity

Method 1 —Likert Scale

Method 2 —ThermameterScale

Job Satisfaction

Role Conflict

Role Ambiguity

Job Satisfaction

Role Conflict

Role Ambiguity -.170

Table 1 of .450, .395 and .464 are all significant atthe .01 level.

Discriminant validity, however, suggests three com-parisons, namely that:

1. Entries in the validity diagonal (3) should be higherthan the correlations that occupy the same row andcolumn in the heteromethod block (4). This is aminimum requirement as it simply means that thecorrelation between two different measures of thesame variable should be higher than the correlations"between that variable and any other variable whichhas neither trait nor method in common" (Campbelland Fiske, 1959. p. 82). The entries in Table 1 satisfythis condition.

2. The validity coefficients (3) should be higher thanthe correlations Ln the heterotrait-monomethod tri-angles (2) which suggests that the correlation withina trait measured by different methods must be higherthan the correlations between traits which havemethod in common. It is a more stringent require-ment than that involved in the heteromethod com-parisons of step I as the off-diagonal elements inthe monomethod blocks may be high because ofmethod variance. The evidence in Table 1 is consis-tent with this requirement.

3. The pattern of correlations should be the same inall of the heterotrait triangles, e.g.. both (2) and(4). This requirement is a check on the significanceof the traits when compared to the methods andcan be achieved by rank ordering the correlationcoefficients in each heterotrait triangle: though avisual Inspection often suffices, a rank order cor-

relation coefficient such as the coefficient of con-cordance can be computed if there are a great manycomparisons.

The last requirement is generally, though not com-pletely, satisfied by the data in Table I. Within eachheterotrait triangle, the pairwise correlations are con-sistent in sign. Further, when the correlations withineach heterotrait triangle are ranked from largest posi-tive to largest negative, the same order emerges exceptfor the lower left triangle in the heteromethod block.Here the correlation between job satisfaction and roleambiguity is higher, i.e., less negative, than thatbetween job satisfaction and role conflict whereasthe opposite was true in the other three heterotraittriangles (see Ford et al., 1975, p. 107, as to whythis single violation of the desired pattern may notrepresent a serious distortion in the measure).

Ideally, the methods and traits generating the multi-trait-multimethod matrix should be as independent aspossible (Campbell and Fiske, 1959, p. 103). Some-times the nature of the trait rules out the opportunityfor measuring it by different methods, thus introducingthe possibility of method variance. When this situationarises, the researcher's efforts should be directed toobtaining as much diversity as possible in terms ofdata sources and scoring procedures. If the traits arenot independent, the monomethod correlations willbe large and the heteromethod correlations betweentraits will also be substantial, and the evidence about


the discriminant validity of the measure will not beas easily established as when they are independent.Thus, Campbell and Fiske (1959, p. 103) suggest thatit is preferable to include at least two sets of indepen-dent traits in the matrix.

Does the Measure Behave as Expected?Intemal consistency is a necessary but insufficient

condition for construct validity. The observables mayall relate to the same construct, but that does notprove that they relate to the specific construct thatmotivated tbe research in the first place. A suggestedfinal step is to show that the measure behaves asexpected in relation to other constructs. Thus oneoften tries to assess whether the scale score candifferentiate the positions of "known groups" orwhether the scale correctly predicts some criterionmeasure (criterion validity). Does a salesperson's jobsatisfaction, as measured by the scale, for example,relate to the individual's likelihood of quitting? Itshould, according to what is known about dissatisfiedemployees; if it does not, then one might questionthe quality of the measure of salesperson job satisfac-tion. Note, though, there is circular logic in tbeforegoing argument. The argument rests on four sepa-rate propositions (Nunnally, 1967, p. 93):

1. The constructs job satisfaction (A) and likelihoodof quitting (B) are related.

2. The scale X provides a measure of A.3. y provides a measure of B.4. X and Y correlate positively.

Only the fourth proposition is directly examinedwith empirical data. To establish tbat X truly measuresA, one must assume that propositions 1 and 3 arecorrect. One must have a good measure for B, andthe theory relating A and B must be true. Thus, tbeanalyst tries to establish the construct validity of ameasure by relating it to a number of otber constructsand not simply one. Further, one also tries to usethose theories and hypotheses which have been suffi-ciently well scrutinized to inspire confidence in theirprobable truth. Thus, job satisfaction would not berelated to job performance because there is muchdisagreement about the relationsbip between theseconstructs (Schwab and Cummings, 1970).

DEVELOPING NORMSTypically, a raw score on a measuring instrument

used in a marketing investigation is not particularlyinformative about the position of a given object onthe characteristic being measured because the unitsin which the scale is expressed are unfamiliar. Forexample, what does a score of 350 on a lOO-itemLikert scale with 1-5 scoring imply about a salesper-son'sjob satisfaction? One would probably be temptedto conclude that because the neutral position is 3, a350 score with 100 statements implies slightly positive

attitude or satisfaction. The analyst should be cautiousin making such an interpretation, though. Supposethe 350 score represents the highest score everachieved on this instrument. Suppose it representsthe lowest score. Clearly there is a difference.

A better way of assessing the position of theindividual on the characteristic is to compare theperson's score with the score achieved by other people.The technical name for this process is "developingnorms," although it is something everyone does im-plicitly every day. Thus, by saying a person "sureis tall," one is saying the individual is much tallerthan others encountered previously. Each person hasa mental standard of what is average, and classifiespeople as tall or short on the basis of how they comparewith this mental standard.

In psychological measurement, such processes areformalized by making the implicit standards explicit.More particularly, meaning is imputed to a specificscore in unfamiliar units by comparing it witb tbetotal distribution of scores, and this distribution issummarized by calculating a mean and standard devia-tion as well as other statistics such as centile rankof any particular score (see Ghiselli, 1964, p. 37-102,for a particularly lucid and compelling argument abouttbe need and the procedures for norm development).

Norm quality is a function of both the number ofcases on which the average is based and their repre-sentativeness. The larger the number of cases, themore stable will be the norms and the more definitivewill be the conclusions that can be drawn, if the sampleis representative of the total group the norms are torepresent. Often it proves necessary to develop distinctnorms for separate groups, e.g., by sex or by occupa-tion. The need for such norms is particularly commonin basic research, although it sometimes arises inapplied marketing research as well.

Note that norms need not be developed if one wantsonly to compare salespersons i and / to determinewho is more satisfied, or to determine how a particularindividual's satisfaction has changed over time. Forthese comparisons, all one needs to do is comparethe raw scores.

SUMMARYAND CONCLUSIONSThe purpose of this article is to outline a procedure

which can be followed to develop better measuresof marketing variables. The framework represents anattempt to unify and bring together in one place thescattered bits of information on how one goes aboutdeveloping improved measures and how one assessestbe quaUty of the measures that have been advanced.

Marketers certainly need to pay more attention tomeasure development. Many measures with whichmarketers now work are woefully inadequate, as themany literature reviews suggest. Despite the time anddollar costs associated with following the processsuggested here, the payoffs with respect to the genera-

abhigyan

Highlight

PARADIGM FOR DEVELOPING MEASURES Of MARKETING CONSTRUCTS ntion of a core body of knowledge are substantial.As Torgerson (1958) suggests in discussing the orderingof the various sciences along a theoretical-correlationalcontinuum (p. 2):

It is more than a mere coincidence that the scienceswould order themselves in largely the same way ifthey were classified on the basis to which satisfactorymeasurement of their important variables has beenachieved. The development of a theoretical science. . . would seem to be virtually impossible unless itsvariables can be measured adequately.

Progress in the development of marketing as a sciencecertainly will depend on the measures marketers de-velop to estimate the variables of interest to them(Bartels. 1951; Buzzell. 1963; Converse, 1945; Hunt,1976).

Persons doing research of a fundamental nature arewell advised to execute the whole process suggestedhere. As scientists, marketers should be willing tomake this committment to "quality research." Thosedoing applied research perhaps cannot "afford" theexecution of each and every stage, although manyof their conclusions are then likely to be nonsense,one-time relationships. Though the point could beargued at length, researchers doing applied work andpractitioners could at least be expected to completethe process through step 4. The execution of steps1-4 can be accomplished with one-time, cross-section-al data and will at least indicate whether one or moreisolatable traits are being captured by the measuresas well as the quality with which these traits are beingassessed. At a minimum the execution of steps 1-4should reduce the prevalent tendency to apply ex-tremely sophisticated analysis to faulty data and there-by execute still another GIGO routine. And once steps1-4 are done, data collected with each applicationof the measuring instrument can provide more andmore evidence related to the other steps. As Ray pointsout in the introduction to this issue, marketing re-searchers are already collecting data relevant to steps5-8. They just need to plan data collection and analysismore carefully to contribute to improved marketingmeasures.

REFERENCES

Bartels, Robert. "Can Marketing Be a Science?," Journalof Marketing, 15 (January 1951), 319-28.

Bohrnstedt, George W. "Reliability and Validity Assessmentin Attitude Measurement," in Gene F. Summers, ed..Attitude Measurement. Chicago: Rand McNally andCompany, 1970, 80-99.

Buzzell, Robert D. "Is Marketing a Science," HarvardBusiness Review, 41 (January-February 1963), 32-48.

Calder, Bobby J. "Focus Groups and the Nature of Qualita-tive Marketing Rese&Tcii," Journal of Marketing Research,14 (August 1977), 353-64.

Campbell, Donald R. and Donald W. Fiske. "Convergentand Discriminant Validation by the Multitrait-MultimethodMatrix," Psychological Bulletin, 56 (1959), 81-105.

Campbell, John P. "Psychometric Theory," in Marvin D.Dunette, ed.. Handbook of Industrial and OrganizationalPsychology. Chicago: Rand McNally. Inc., 1976. 185-222.

Churchill, Gilbert A., Jr., Neil M. Ford, and Orville C.Walker, Jr. "Measuring the Satisfaction of IndustrialSalesmen," Journal of Marketing Research, 11 (August1974), 254-60.

Converse, Paul D. "The Development of a Science inMsiTkelin$,'" Journal of Marketing. 10 (July 1945), 14-23.

Cronbach, L. J. "Coefficient Alpha and the Internal Struc-ture of Tests," Psychometrika, 16 (1951), 297-334.

Czepeil, John A., Larry J. Rosenberg, and Adebayo Akerale."Perspectives on Consumer Satisfaction," in Ronald C.Curhan, ed., 1974 Combined Proceedings. Chicago: Amer-ican Marketing Association, 1974, 119-23.

Flanagan, J. "The Critical Incident Technique," Psychologi-cal Bulletin, 51 (1954), 327-58.

Ford, Neil M., Orville C. Walker. Jr. and Gilbert A.Churchill, Jr. "Expectation-Specific Measures of theIntersender Confiict and Role Ambiguity Experienced byIndustrial Salesmen," Journal of Business Research, 3(April 1975), 95-112.

Gardner, Burleigh B. "Attitude Research Lacks System toHelp It Make Sense," Marketing News, 11 (May 5, 1978),1 + .

Ghiselli, Edwin E. Theory of Psychological Measurement.New York: McGraw-Hill Book Company. 1964.

Heeler, Roger M. and Michael L. Ray. "Measure Validationin Marketing," Journal of Marketing Research, 9 (No-vember 1972), 361-70.

Howard, John A. and Jagdish N. Sheth. The Theory ofBuyer Behavior. New York: John Wiley & Sons, Inc.,1969.

Hunt, Shelby D. "The Nature and Scope of Marketing,"Journal of Marketing, 40 (July 1976), 17-28.

Jacoby, Jacob. "Consumer Research: A State of the ArtReview," Journal of Marketing, 42 (April 1978), 87-96.

Kerlinger, Fred N. Foundations of Behavioral Research,2nd ed. New York: Holt. Rinehart, Winston. Inc., 1973.

Kollat, David T., James F. Engel, and Roger D. Blackwell."Current Problems in Consumer Behavior Research,"Journal of Marketing Research 7 (August 1970), 327-32.

Ley, Philip. Quantitative Aspects of Psychological Assess-ment. London: Gerald Duckworth and Company, Ltd.,1972.

Nunnally, Jum C. Psychometric Theory. New York: Mc-Graw-Hill Book Company. 1967.

Schwab, D. P. and L. L. Cummings. "Theories of Perfor-mance and Satisfaction: A Review," Industrial Relations,9 (1970), 408-30.

Selltiz, Claire, Lawrence S. Wrightsman, and Stuart W.Cook. Research Methods in Social Relations, 3rd ed. NewYork: Holt, Rinehart, and Winston, 1976.

Torgerson, Warren S. Theory and Methods of Scaling. NewYork: John Wiley & Sons. Inc., 1958.

Tryon, Robert C. "Reliability and Behavior Domain Validity:Reformulation and Historical Critique," PsychologicalBulletin, 54 (May 1957), 229-49.

1979 churchill a paradigm for developing better constructs in marketing

Documents