experimental designusers.ece.utexas.edu/~perry/education/382c/l08.pdf · defining terms for...

Post on 11-Mar-2021

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 1© 2000-present, Dewayne E Perry

Experimental Design

Dewayne E PerryENS 623

Perry@ece.utexas.edu

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 2

Problems in Experimental Design

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 3

True Experimental DesignGoal: uncover causal mechanismsPrimary characteristic: random assignment to sampling unitsIf not random, then only Quasi Experimental Without randomization, cannot rule out some systematic biasesTypes of designs

Between subject designs: sampling units are subjected to one treatment eachWithin subjects designs: sampling units receive two or more treatments

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 4

The “No Effect” HypothesesSpecial place to the test of the hypothesis that the treatment is entirely without effectReason: in a randomized experiment, this test may be performed virtually without assumptions of any kind – ie, relying only on random assignmentContribution of randomization is clearest when expressed in terms of the test of no effect

Does not mean that such tests are of greater practical importanceIt sets randomized and non-randomized aspects in sharper contrast

Whereas inferences in non-randomized experiments require assumptions

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 5

The “No Effect” HypothesesTo say a treatment has no effect is to say that that each unit would exhibit the same value of the response whether assigned to treatment or to controlA change in response indicates the treatment has some effectWill discuss later various tests of the significance and the size of effects

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 6

Alternative HypothesesAs opposed to the null hypothesis, experimental (alternative) hypotheses take a stand

Different treatments behave differently (two tailed prediction)Predict what direction the expected differences take (one tailed)

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 7

Hypotheses and TheoryTheory is a large scale map, different areas represent general principlesHypotheses are like small sectional maps, focus on specific areasConceptual similarities

range from very explicit to very vaguefall back on hidden assumptions, regulative principlesGive directions to our observations

Some hypotheses spring from experimental observationsDon’t know where they will go

Others from theoryConceptual hypothesesRely on previous studies, theories

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 8

Creating HypothesesDefining terms

For observations to have value, abstractions have to be concretizedTwo types of definitions

Operational: x is defined in terms of test y under conditions zTheoretical: abstract constructs used

At some point must be operational for the experimentHypotheses are predictive statements about the expected outcomeThey call for a test and embed a conclusionExplicit statements are de rigueurWhen comparisons are predicted they have to be explicated

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 9

Stating the HypothesisConcomitant variation: X is a direct function of YComparative: other things being equal ….H0 and H1 (null and alternative) are mutually exclusive and exhaustive

Usually a specific H0 and a general H1Try to reject H0

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 10

Hypothesis Testing Errors2 types of errors: I and II

Type I – rejecting the null hypothesis when it is trueGreater psychologically important riskThink there is a relationship when there is notWaste time in blind alleys

Type II – accepting the null hypothesis when it is falseDeny a relationship when there is oneIn effect, reject useful results

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 11

VariablesIndependent and dependent

Dependent: effect in which the researcher is interestedIndependent: cause of the effectAny event or condition can be conceptualized as either an independent of dependent variable

Concerned about the effects of X on YIe, the causal effects of one on the otherBoth in the labs and in the field

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 12

Independent VariablesNo single or standard way of classifying variablesUseful categorization (not mutually exclusive):

BiologicalEg, affects of gender in mentoring developers

EnvironmentalEg, schedule pressure and fault insertion

HereditaryEg, IQ effects on complexity

Previous training and experienceEg, effects of first programming languages

MaturityEg, age and elegance of program structures

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 13

Independent VariablesManipulated variables

If an experiment, one expects manipulationIntentional and systematic variation

Naturally occurring variablesManipulated by real life experienceEg, desk versus meeting inspectionsContext: normal of exceptional conditions

Static group variablesPre-existing groups with identified characteristics:

Organismic variables: sex, age, weight, etcStatus variables: education, occupation, marital statusAttribute variables: diagnoses, personality traits, behaviors

Cannot be manipulated – but are selected to gain proper contrast groups

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 14

Independent VariablesAnalogous to experimental treatmentWhen used as a dependent variable, may make inferences as to how the group acquired its characteristics

Eg, overweight lowers self-esteemLower self esteem causes overweightness

Risk of causal inferences -Tempting but riskyDependent variable not an accurate descriptorAt best an association, connection, relationship, correlationExample of weight/esteem experiments

Case 1: high calorie diet -> check esteem• Ethical problemsCase 2: overweight + low calorie -> raise esteem• Doesn’t prove overweight, low esteemCase 3: overweight + success -> lower weight• High esteem, low weight doesn’t prove le/owMust be careful about the logic

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 15

Independent VariablesUnidirectional paths

Eg, height and self-esteem – cannot switchFixed by logic of antecedents and consequencesMultiple variables: income, age -> truancy, discipline

Need 2x2 analysesQuestion: does income discriminate truancy and discipline problems

One-way, non-causal enabling relationshipsEg, income – IQ -> incomeBut not vice versa

Two-way, sequential causationEg, success-failure and self-confidenceEg, baseball players slumps, hitting streaks

-> Causation established by experimentationManipulationUsing static variables is descriptive/relationalBut not experimental

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 16

Independent VariablesEstablishing levels of independent variables

First decision: categorical or continuousEg, age is continuous, sex is categoricalIf continuous data, whole range, dichotomous, or graduatedRisky as information is lostMay be theoretical reasons for categories

If hypothesis is state in categorical terms, then should be consistentIf a relationship, not appropriate to break into dichotomies or nominal categories if variable is continuous

For theoretical or rational, not statistical reasonsExamine how the levels of categories established

Should be consistent with hypothesisPossible groupings: extremes, ranges of categories, median split (as in IQ)

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 17

Independent VariablesContinuous full-range distribution

Sometimes linear correlations, sometimes notEg, learning (perhaps), visual acuity (not)

Theory driven levelsHypothesis stated consistently with current theory

Strength of independent variable (magnitude of effect)Extreme groupings tend to magnify effectIncreasing magnitude may reduce generalityCan more easily argue weaker to stronger (eg, stress) and have great generalityLevels of independent variable should match hypothesis

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 18

Dependent VariablesMany possible flavors, literally thousandsEg, learning new design techniques

Direction of observed changeAmount of changeThe ease with which change effectedPersistence of changes over time

2 general classesDiffusion – fan out

Eg, technology insertion and adoptionHierarchical variations – changes in ranking

Eg, changing roles in organizational structures

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 19

Practical ApplicationDistinguish two kinds of control groups

No treatmentOk for physical effectsProblems where belief may confound

PlaceboRule out belief effects

Practical decision is not easy which to useQuestion of greatest interestExperience or knowledge of the general areaEasy to make mistakes in a new area

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 20

Basic DesignsDesign 1

One shot case study: X OH- M- I(NR) S-

Deficient in terms of any reasonable controlsHistory may be alternative explanationMaturation not controlled forMay be changes in instruments or judgesUnknown state of participants

Instrumentation not a factor: no pre-measurementDesign 2

One group pretest: O X OSlight improvement, but no comparison

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 21

Basic DesignsDesign 3 – Solomon Design

True experimental, 4 groupI R O X OII R X OIII R O OIV R O

H+ M+ I+ S+All well controlled for

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 22

Basic DesignsSolomon provides elegant illustration of logic of control

Pretest performance scores in I & III to estimate pretest scores in II & IV

Requires a leap of faith even if scores the sameEven if differ greatly, II & IV could be equal to the mean of I & III

Use estimated pre-test scores to enrich factorial analysis of variance of post test scoresTells us if any confounding of pre-test and treatment

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 23

Basic DesignsPre/post test effects:

I+ II- III+ IV-Experimental treatment effects:

I+ II+ III- IV-Pretest & X sensitization:

I+ II- III- IV-Pretest sensitization =

Extraneous effects:I+ II+ III+ IV+

( ) ( )IVIIIIII YYYY −−−

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 24

Basic DesignsExternal Validity

All 3 suffer from the possible confounding of selection and treatmentDesign 3 - Solomon:

controls for confounding of treatment and pre-test sensitizationDesign 4:

I R O X O III R O OIV: H+ M+ I+ S+Deficient in pre-test sensitization – eg, problem in attitude change or learning experiments

Design 5II R X O IV R OIV: H+ M+ I+ S+Avoids pretest sensitization issues

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 25

Basic DesignsWithin subjects designs

Each subject receives all treatments in turnUseful in SWE/CS – repeated measures designAdvantages:

Same number of subjects used more effectivelyEach sampling unit serves as its own controlCan examine relationships longitudinally

Difficulties:Sensitization problems

Learning etcOrder of treatments may produce differences in successive measures

Another threat to IV in longitudinal studiesRegression towards mean

When linear relationship is imperfectEg, overweight people appear to lose weight, low IQs appear to become brighterObserved when variables consists of the same measure taken at two points in time and the correlation r < 1

382C Empirical Studies in Software Engineering Lecture 8

© 2000-present, Dewayne E Perry 26

Basic DesignsSolve threat by standard Z score

A raw score from which the sample mean has been subtracted and the difference then divided by the standard deviationRegression equation:

The estimated score of Y is predicted from the XY correlation r times the standard score of XIf there is a perfect correlation, the Z scores will be equivalent; otherwise not if r < 1

XXYY ZrZ =

top related