
Evaluating measurement schemes

by Cem Kaner, J.D., Ph.D.

Perhaps the most respected book on measurement in computing, Fenton and Pfleeger's Software Metrics: A Rigorous and Practical Approach, defines measurement as follows:

"Measurement is the process by whichnumbers or symbols are assigned to attributes QUI C K L O O Kof entities in the real world in such a way as todescribe them according to clearly defined. Nine factors to assessrules "T'

h bl 'th thi d ti 't ' , th t a measurement schemee pro em Wl s e InllOn 18 athere are lots of available clear-cut rules, We. An examinationhave to select the right clearly defined rule,Otherwise we could measure "goodness of of three common measures

testers" by the clearly defined rule called"counting their bug reports," This (as we'll see in a few moments) is ridiculous,

I prefer the following definition:

Measurement is the assignment of numbers to attributes of objects or events according to a rule derived from a model or theory.

Fenton and Pfleeger do point out the issue of the need for a model. They discuss this as an issue of the definition of the attribute. All that I'm doing here is making the issue much more visible.

The invisibility of underlying measurement models has led people to use inadequate and inappropriate "metrics," deluding themselves and wreaking havoc on their staffs. For a good read on this as a general problem, read R.D. Austin's 1996 book, Measuring and Managing Performance in Organizations-or, for that matter, Scott Adams' comic strip Dilbert.


The Theory of Measurement

Measurement theory addresses problems that run through many disciplines, including computing. I learned about the theory of measurement primarily from Steve Link and A.B. Kristofferson, when I did my doctoral studies in psychophysics (also known as perceptual measurement). (For a thoughtful history of that field, read Link's 1992 book, The Wave Theory of Difference and Similarity.)

This article is a preliminary report of my attempts to pull together thinking from several disciplines into a more coherent, and I think more practical, approach to software-related measurement. My goal is to help you evaluate measurement schemes that people ask you to use, to help you explain why the bad ones shouldn't be imposed on your group, and to help you develop more useful alternatives.

In summary, I think that the theory underlying a measurement must take into account at least nine factors. This article defines these nine factors and applies them to a few examples.

The first five are intuitive:

1. The attribute to be measured. This is what you want to measure, such as the extent of testing that you've done, the complexity of the program you're testing, or the effectiveness of a tester.

2. The instrument that measures the attribute. Think of a measure as a reading from an instrument. For example, a stopwatch shows how much time has passed. Some people try to measure extent of testing with a coverage program, software complexity with a complexity program, or tester effectiveness with bug counts.

3. The relationship between the attribute and the instrument. What is your basis for saying that this instrument measures this attribute well? Here's a question that I find useful: Suppose that we increase the attribute by 20%. What do we expect to happen to the instrument, and why? How confident are we that the instrument reading (the measured value) will go up? What mechanism relates the attribute to the instrument?

4. The probable side effects of using this instrument to measure this attribute. When people realize that you are measuring something, how will they change their behavior to make the numbers look better? If you find a way to change the measurement by 20%, what has happened to the attribute? Some side effects are so bad that the result of trying to make the number bigger is to drive the attribute lower (for example, as discussed further below, driving people to increase bug count might increase the number of bugs reported-but make the testers much less effective).

5. The scope of the measurement, probably defined in terms of what this measurement will be used for.

The next four factors are more technical but are essential for an understanding of the attribute, the instrument, and their relationship:

6. The appropriate scale for the attribute. We'll discuss scale types in a moment.

7. The variation of the attribute. The attribute itself is probably subject to random fluctuations. You might be a more effective tester on Tuesday than on Wednesday. Therefore we need a theory of variation of the attribute.

8. The scale of the instrument. To be discussed shortly.

9. The variation of measurements made with this instrument. The act of taking measurements, using the instrument, carries random fluctuations. Thus, we need a theory of measurement error, or of variation associated with using and reading the instrument.

Example: Using a Ruler

Let's start with an example of the simplest case, measuring the length of a table with a one-foot ruler.

• Attribute: The attribute of interest is the length of the table.

• Instrument: The ruler is the measuring instrument. It's one foot long, so we'll have to lay it down six times to mark off a six-foot length.

• Theory of Relationship: The relationship between the attribute and the instrument is direct. They're both on the same scale and (except for random error) a change in the attribute results in a directly comparable change in the measured value. (Example: Cut the table down to four feet and the next time you measure it, the ruler measurement will read four feet.)

• Probable Side Effects: I don't anticipate any side effects of using a ruler to measure the length of the table.

• Attribute's Scale: We'll have more to say about scaling soon. For now, note that the length of the table can be measured on a ratio scale. A table that is six feet long is twice as long as one that is three feet long. If we doubled the lengths, the twelve-foot table would still be twice as long as the six-foot table. This preservation of the ratio relationship ("twice as long") when both items are multiplied is at the essence of the ratio scale.

• Attribute's Variation: If you measure the lengths of a few tables, you'll get a few slightly different measurements because there is some (not much, but a little) manufacturing variability in the lengths of the tables. Tables that are supposed to be six feet long might actually vary between 5.98 and 6.02 feet. The length of individual tables might not vary much, but if you went to an office supply store and asked for a six-foot table, the table that you would get would only approximate (perhaps very closely approximate) six feet.

• Instrument's Scale: The instrument (ruler) measures length on a ratio scale.

• Instrument's Variation: Try to measure a six-foot table with a one-foot ruler ten times. Record your result as precisely as you can. You'll probably get ten slightly different measurements, probably none of them exactly 6.00 feet. Even with ruler-based measurement, we have error and variability.

Example: The Scaling Problem

Suppose that Sandy, Joe, and Susan run in a race. Sandy comes in first, Joe comes in second, and Susan third. The race comes with prize money. Sandy gets $10,000, Joe gets $1000, and Susan gets $100.

• Attribute: Let's say that our attribute of interest is the speed of the runners.

• Instrument: The instrument is the simple observation of the order of the runners as they cross the finish line. There are two different sets of markers on this instrument (like the inch and centimeter markers on rulers). One set of markers says First, Second, and Third. The other markers are prize money amounts, say $10,000, $1000, and $100. You might prefer to measure speed with some other instrument, but (oops) you forgot your stopwatch and this is what you've got. (If this example looks oversimplified, please be patient with it; I'm trying to use something everyone understands in order to illustrate the problem that sometimes we are stuck with crude measuring instruments.)

• Theory of Relationship: The mechanism underlying the relationship between speed and position in the race is straightforward. Faster speed results in a better position (First, Second, Third).

• Probable Side Effects: I don't see obvious side effects (changing how people run races) in this particular case.

• Attribute's Scale: The attribute's scale is probably best expressed in terms of miles (or meters) per hour. If so, this is a ratio scale (one mile per four minutes equals fifteen miles per hour).

• Attribute's Variation: The race only gave us one sample of the speed of these runners. If they ran the same track and distance again tomorrow, they'd probably have different times.

• Instrument's Scale: This instrument operates on an ordinal scale-not a ratio scale, and not an interval scale. (For detailed discussions of scale types, see S.S. Stevens' 1976 book Psychophysics, and Fenton and Pfleeger's Software Metrics.) The following comparisons should give you a sense of the differences among the scales.

If we were operating on a ratio scale, we would be able to say that Susan (measured as $100) was 1/100th as fast as Sandy (measured as $10,000).

If we were operating on an interval scale, we could say that the difference between Susan and Joe (3-2=1) was the same as the difference between Joe and Sandy (2-1=1) and that the difference between Susan and Sandy (3-1=2) was twice the difference between Susan and Joe.

On an ordinal (or positional) scale, all we know is that Sandy crossed the line first, Joe crossed it second, and Susan crossed it third. We don't know how much faster Joe was than Susan, but we do know that he was not as fast as Sandy.

The ordinal scale doesn't tell us much about speed but it tells us more than a nominal (or categorical) scale. It would stretch the example too far to try to talk about a nominal scale for speed. But since we've been comparing the different scale types here, let's say a few words about nominal scales as well. Suppose that you had hundreds of different bug reports, and you sorted them into categories (this is a printing error, that is a usability error, etc.). We can assign numbers to the categories (printing = type 1; usability = type 2), but even though we might assign these numbers to the reports according to well-considered rules, the numbers are just labels.

The final scale to mention is the absolute scale. If you have one (1) pen, you have one (1) pen. If you cut it in half, you get a mess; not two halves of a working pen.

• Instrument's Variation: There is probably not much measurement error in this example, unless the race was very close.
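To make these distinctions concrete, here's a short Python sketch. (The sketch and its speed figures are my invented illustration, not data from the race; the point is only which arithmetic each scale supports.)

    # Scale types, illustrated with the race example. All data are
    # hypothetical; the point is which operations each scale supports.
    finish_position = {"Sandy": 1, "Joe": 2, "Susan": 3}         # ordinal
    prize_money = {"Sandy": 10_000, "Joe": 1_000, "Susan": 100}  # markers on the instrument

    # Ordinal: order comparisons are meaningful.
    assert finish_position["Sandy"] < finish_position["Joe"]     # Sandy beat Joe

    # Differences of positions are NOT meaningful: (3 - 2) == (2 - 1)
    # does not mean Joe beat Susan by as much as Sandy beat Joe.

    # Ratio: with actual speeds, ratios are meaningful.
    speed_mph = {"Sandy": 12.0, "Joe": 11.0, "Susan": 9.0}       # invented stopwatch data
    print(speed_mph["Sandy"] / speed_mph["Susan"])               # ~1.33x as fast: a valid claim

    # Misreading prize money as a ratio scale would say Susan ran 1/100th
    # as fast as Sandy -- exactly the mistake described above.
    print(prize_money["Susan"] / prize_money["Sandy"])           # 0.01, meaningless as speed

    # Nominal: category numbers are labels only (printing = 1, usability = 2);
    # neither order nor arithmetic applies to them.
    bug_category = {"report-101": 1, "report-102": 2}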


Measurement in the Real World

The examples of length and position in the race are toys. They are easy to figure out. The theories of relationship are clear-cut and the side effects are minimal.

When it comes to things that we would really like to measure, life is more difficult. Examples of the kinds of things that testers are routinely asked about include:

• Productivity: Who's doing a better job of testing or programming (or whatever)?

• Extent: How much testing has been done?

• Progress: Where are we relative to some plan?

• Reliability: What is the actual and probable future rate of failure?

• Usability: What, for example, is the probable user error rate?

• Support Burden: What will this cost to support?

How do we measure these? Each one involves complex issues. Typically, they involve a lot of judgment (which is subjective). Additionally, several of the most interesting dimensions involve human behavior. That's hardly a surprise-we are working in a field of human endeavor, called computing, whose essential work product is the stuff of mental creation. The essence of "quality" is "qualitative." As Gerald Weinberg wrote in the 1993 book Quality Software Management, "Quality is value to some person."

There is a bias in computing against measurements that involve subjective quantities. Somehow, some people have developed the idea that subjective issues are immeasurable and unscientific. (This article isn't the place to refute that directly; but if you're of that view, go read Link's wave theory book.)

Tom DeMarco provided an example of this bias almost twenty years ago. Although he has taken a different approach in his more recent book, Why Does Software Cost So Much?, his original presentation in 1982's Controlling Software Projects is a well-written, still influential example. DeMarco asked how to measure customer interface complexity, and provided an example of how not to do it-the development team were asked to rate the customer interface complexity of their own projects as normal, greater than normal, or less than normal. He pointed out some biases associated with using developers to rate their own code, and then concluded that "any exercise that tries to give a numeric value to an unquantum without doing any real measurement along the way is a bit of a fraud." An "unquantum" is, according to DeMarco, "a relevant factor that is unmeasured." Evidently, rankings by humans don't count as measurements. Instead, he said that measures like the following were "true metrics":

• Change Rate: customer-introduced changes per unit time

• Change Impact: unit cost of average change

• Customer Dissatisfaction: cost of change during a fixed period

I accept DeMarco's opinion that the developers' rankings of their own work are unusably biased, but to say on the basis of this that human ranking of complexity is some kind of unquantum because it isn't expressed in easy-to-count numbers that are much less directly related to the value we want to measure (that is, the complexity of the thing to humans) seems...well, I guess we just disagree [on that 1982 conclusion]. I think it would be interesting to ask customers who interacted with the system to rate the different areas' customer interface complexity. The fact that there are lots of ways to do this badly doesn't create an excuse for walking away from a fundamental point: that if you want to talk about the complexity of a human-machine interface, the human's sense of that complexity is a key measure-perhaps the most direct and important measure-of that complexity.

Software-related attributes often involve psychological or subjective components. Our measurements of them are questionable when they fail to take these factors into account.

Let's consider three common attempts to develop software metrics:

Example: Bug Counts and the Theory of Relationship

Should we measure the quality (productivity, efficiency, skill, etc.) of testers by counting how many bugs they find? Leading books on software measurement suggest that we compute "average reported defects/working day" and "tester efficiency" as "number of faults found per KLOC" or "defects found per hour." [See complete references at the end of this article, specifically Grady et al. 1987, Fenton et al. 1997, and Fenton et al. 1994.] These authors are referring to averages, not measures of individual performance, and they sometimes warn against individual results (because they might be unfair). However, I repeatedly run into managers (or, at least, the test managers who work for them) who compute these numbers and take them into account for decisions about raises, promotions, and layoffs. For that matter, are these even valid measures of the efficiency of the group as a whole?

Let's do the analysis and see the problems with this measure:

• Attribute: The attribute of interest is the goodness (skill, quality, effectiveness, efficiency, productivity) of the tester.

• Instrument: We don't have an obvious, easy-to-use, unambiguous direct measure of tester effectiveness, skill, etc. So instead we use a surrogate measure, something that is easy to count and that seems self-evidently related to the attribute of interest. In this case, the instrument is a counter of bug reports.

• Theory of Relationship: There is hardly any theory of relationship between the attribute and the instrument here. We count bugs because they are easy to count, not because this is an essential measure of the worth of the tester. Yes, testers should search for bugs. But when you increase a tester's true effectiveness by 20%, you might get someone who goes after harder-to-find bugs, takes on code that is more stable and has problems that are much more subtle, or who writes better test documentation, mentors the rest of the staff, spends more time building tools, or who does other great stuff that never reflects directly on her personal bug count. Bug counts would seriously mismeasure such testers.

• Attribute's Scale: I don't know. Neither do you.

• Attribute's Variation: I don't know. But there is variation. Joe probably does better work on some days than others.

• Instrument's Scale: Bug counts are an absolute scale (two half-reports do not equal one whole-report).

• Instrument's Variation: There's not much random variation in the counting of bug reports (although there is some, such as bugs classified as duplicates). There is systematic measurement-related variation, as we'll see in the discussion of side effects.

• Probable Side Effects: People are good at tailoring their behavior to whatever they're being measured against. If you ask a tester for more bugs, as Weinberg and Schulman have pointed out, you'll probably get more bugs. Probably you'll get more bugs that are minor, or similar to already reported bugs, or design quibbles-more chaff. But the bug count will go up. In general, this measurement system creates incentives for activities that generate certain results (lots of bugs) and disincentives for anything else. The predictable result is dysfunctional-people changing their behavior in ways that bring up their bug counts, at the expense of unmeasured variables that might have much more to do with the genuine effectiveness of a tester in a group. (Austin talks about side effects-or, as he calls it, "dysfunction"-in detail in his 1996 book.)

Problems like these have caused several measurement advocates (specifically Grady and Caswell, and Austin) to warn against measurement of attributes of individuals unless, as DeMarco suggested in 1995, the measurement is being done for the benefit of the individual (for genuine coaching or for discovery of trends) and otherwise kept private.

With only a weak theory of relationship between bug counts and tester goodness, and serious probable side effects, we should not use this measure (instrument).
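One more observation before moving on: part of the seduction of these surrogates is that they are trivial to compute. Here is a minimal sketch of the two "efficiency" formulas quoted above; the function names and figures are my invention, for illustration only.

    # "Tester efficiency" surrogates: faults per KLOC and defects per hour.
    # Trivial to compute, which is part of the appeal. Figures are invented.

    def faults_per_kloc(faults_found: int, lines_of_code: int) -> float:
        return faults_found / (lines_of_code / 1000)

    def defects_per_hour(defects_found: int, hours_tested: float) -> float:
        return defects_found / hours_tested

    print(faults_per_kloc(42, 28_000))   # 1.5 faults per KLOC
    print(defects_per_hour(42, 80.0))    # 0.525 defects per hour

    # Note what these numbers cannot see: hunting harder-to-find bugs,
    # writing test documentation, building tools, mentoring -- the
    # unmeasured work that a bug-count instrument tends to drive out.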

Example: Code Coverage

Suppose that you want to know how much testing has been done. How would you measure that?

One approach is to compute "code coverage." The most common definition of coverage involves the percentage of statements tested, or the percentage of statements plus branches tested. Supposedly, a higher percentage means more testing. Some people (vendors included) go further and foolishly say that 100% coverage means complete, or sufficient, testing.

There are several other types of coverage beyond statement and branch coverage (some examples are described in my 1995 article, listed in the complete references at the end of this feature). Each of these involves measuring the percentage of a certain type of test that you have run, or a certain level of thoroughness of checking for a specific type of error. We are never using the population of all possible tests of a product as our baseline when we compute code coverage-if we were, coverage would always be 0.00%, a rather boring number. But because we are not accounting for all possible tests, we can have a 100% covered product that still has undiscovered defects.
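To make that last point concrete, here's a small sketch (my invented illustration, not output from a real coverage tool): two tests that execute every statement of a function while a boundary bug survives untouched.

    # 100% statement coverage with an undiscovered defect (illustrative).

    def classify(age: int) -> str:
        # Bug: the intended boundary is age >= 18, not age > 18.
        if age > 18:
            return "adult"
        return "minor"

    # These two tests execute every statement (and both branches),
    # so a coverage tool reports 100%...
    assert classify(30) == "adult"
    assert classify(5) == "minor"

    # ...yet classify(18) returns "minor" where "adult" was intended.
    # Running the boundary test would find a new bug without raising
    # statement coverage at all.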

• Attribute: Extent of testing completed.

• Instrument: Number of tests run of a certain type, number of lines touched by the tests, etc. Depends on the definition of coverage.

• Theory of Relationship: We could represent the possible bugs in a product in terms of a disjoint collection of sets: E1, E2, E3, and so on through EN. I don't know how big N is. E1 is the set of all errors of type 1 (such as failure to initialize a variable). E2 is the set of all errors of type 2 (such as a syntax error in a line of code). A coverage measure tells you that you have made great progress against certain types of errors-perhaps at 100% coverage, you have tested for all possible errors of types E2, E3, and E4. As a normal part of the process, you will stumble over some other types of errors, so coverage-driven testing is broader than just the collection of errors that the tool is focused on. But still, suppose that we increase the extent of testing by running a bunch of tests that this particular coverage tool is insensitive to. For example, you can run through every line of code (statement coverage) without ever testing a boundary value. Go back to test all boundary values and you might find new bugs, but you won't increase statement coverage at all.

• Probable Side Effects: Brian Marick has repeatedly pointed out the side effects of driving testing by using coverage metrics. People focus their efforts on the types of tests that will drive up the percentages, and they tend to stop when they have reached the target percentage (85% "coverage" might be good enough in some companies, 95% the target in others). The result is that they stop testing when they have found most of some types of errors, without necessarily realizing that they have no assurance that they have found most of several other types of errors.

• Scope: Do these side effect problems mean that we should not use coverage tools? No, of course not. It depends on what we use coverage for. I think it is very useful to evaluate coverage in order to discover that some code is completely untested, but that it is also quite dangerous to use a coverage "metric" as an indicator of how close to completion you are.

In sum, our measures of the extent of testing-like so many measures that we take in software-are numeric. This might make them look more "scientific," but they are fundamentally judgment-driven. A theory of testing and of testing adequacy is embedded (often hidden) in such measures.

Example: Code Complexity

McCabe's complexity metric is often enough criticized as incomplete (see, for example, Fenton and Pfleeger in Software Metrics). But let's apply our model to this metric.

• Attribute: Let me suggest that "complexity" is a fundamentally psychological concept. If the term means anything, it deals with how complex the software is to a human. Indeed, the complexity metrics are sometimes explicitly referred to as measuring "psychological complexity."

• Instrument: A count of the number of nodes on the graph, essentially of the branches in the program, certainly looks at one aspect of complexity. It is easy to count...but does this counting give us a true measure of complexity?

• Theory of Relationship: We can easily drive up true complexity (for example, replacing all the variable names with random numbers) without affecting the branch-driven complexity measure at all. The relationship is weak and without much, if any, theoretical basis.

• Probable Side Effects: Distortions may occur in the code as we pay tremendous attention to a counter of paths-and less attention to the unmeasured complexity issues like clarity of expression, variable naming conventions, algorithmic sanity, and operating environment.
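For concreteness, here is a sketch of the branch-counting instrument. It is my invented illustration, and it uses a common shortcut (decision points plus one) rather than McCabe's full control-flow-graph definition. Renaming every identifier to a meaningless token leaves the count unchanged, even though the code becomes much harder for a human to read:

    # Branch counting as an instrument: cyclomatic complexity approximated
    # as decision points + 1 (a shortcut, not the full graph definition).
    import ast
    import textwrap

    def cyclomatic_complexity(source: str) -> int:
        decisions = (ast.If, ast.For, ast.While, ast.BoolOp)
        tree = ast.parse(textwrap.dedent(source))
        return 1 + sum(isinstance(node, decisions) for node in ast.walk(tree))

    clear = """
    def is_leap_year(year):
        if year % 400 == 0:
            return True
        if year % 100 == 0:
            return False
        return year % 4 == 0
    """

    # Same control flow, every identifier renamed to a meaningless token:
    obfuscated = clear.replace("is_leap_year", "f88").replace("year", "v41927")

    print(cyclomatic_complexity(clear))       # 3
    print(cyclomatic_complexity(obfuscated))  # 3: the instrument cannot tell
                                              # the readable version from the
                                              # obfuscated one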

In Sum

In his 1993 book, Making Software Measurement Work: Building an Effective Measurement Model, Bill Hetzel pointed out that

"Practitioner attitudes [toward measurement] tend to range from barely neutral to outright antagonistic. It is rare to find the practitioner who really thinks of measurement as a useful and indispensable tool for good software work. Most feel that they get back very little from the measurement activity...The psychological dislike and distrust our practitioners have about measurement is a significant challenge facing us. From my perspective, we've been pretty unsuccessful in serving working engineers and practitioners."

Hetzel suggests an alternative, bottom-up approach to measurement that is worth looking into. He wants to use metrics to stimulate questions, to explore the engineering activity, rather than to use them to immediately focus on setting targets and goals to control engineering.

The approach that I'm suggesting isn't incompatible with Hetzel's. Or with Fenton and Pfleeger's, or many other common approaches to software metrics. But what I'm proposing here is more explicit about some issues that we too often approach too casually:

Measures are not made acceptable simply because they are easy to compute and seem relevant. They are not valuable merely because they have something to do with the latest goal-of-the-week. They work when they actually relate to something we care about, and when the risks associated with taking the measures (the probable side effects), in the context of the scope of use of those measures, are insignificant compared to the value of information we actually obtain from them. To understand that value, we must understand the underlying relationship between the measure and the attribute measured.

This material was first publicly presented at the Pacific Northwest Quality Conference in October 1999. This model was reviewed and extended at the Eighth Los Altos Workshop on Software Testing in December, 1999. I thank the LAWST attendees, Chris Agruss, James Bach, Jaya Carl, Rocky Grober, Payson Hall, Elisabeth Hendrickson, Doug Hoffman, Bob Johnson, Mark Johnson, Brian Lawrence, Brian Marick, Hung Quoc Nguyen, Bret Pettichord, Melora Svoboda, and Scott Vernon, for their critical analyses. STQE

Cem Kaner, Ph.D., J.D., is the senior author of Testing Computer Software and of Bad Software: What to Do When Software Fails. He consults and teaches courses on software testing and practices law, focusing on the law of software quality. Contact him at [email protected], www.kaner.com, or www.badsoftware.com.
