new chapter nine - media.usm.maine.edumedia.usm.maine.edu/~jbeaudry/research literacy/ch 9.pdf ·...

CHAPTER NINE

DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN

GROUP EXPERIMENTS

Chapter Objectives:

• Understand Null Hypothesis Significance Testing (NHST)

• Understand statistical significance and probabilities (p-values)

• Understand Type I and Type II Errors in hypothesis testing

• Understand t-tests, ANOVA / MANOVA, and ANCOVA / MANCOVA

and how they are used in analyzing experiments and quasi-experiments.

• Understand Effect Size Estimates

• Understand the difference between statistical and practical significance

• Understand Power Analysis

• Understand how validity (quality) of experiments and quasi-experiments is

evaluated

• Understand internal validity and threats

• Understand statistical conclusion validity

I. Data Analysis

Data analysis in experimental and quasi-experimental research is a multi-layered

process that begins with collecting data on the dependent variable after

the experiment has been implemented and ends with making inferences about

the statistical soundness and the meaning of the results. There are three

complementary approaches to statistical analysis.

Null Hypothesis Significance Testing (NHST) is the term applied to the set

of procedures that (1) establishes statistical significance of results and (2)

confirms or disconfirms an experiment’s hypothesis.

Effect Size refers to the magnitude of differences on the dependent

variable after a treatment.

Power Analysis refers to the probability of avoiding errors in conclusions

drawn from NHST.

Null Hypothesis Significance Testing: Significance Testing

• Statistical significance is a low probability that results are due to

sampling error or chance and a high probability that results are due to

the treatment.

Significance testing is the first step in NHST. It is important to note that

‘significance’ as used here does not mean importance. It merely reports on the

probability that outcomes are not due to error, and thus affirms that the outcomes

are real and are due to a treatment. Significance is based on probability.

• p represents the level of probability that the differences are due to

various sources of error (e.g., sampling). In most cases the lower the p-

value, the better.

• Alpha is the p-value that the researcher sets at the beginning of the

study as an acceptable level of probability that differences are due to

sampling error.

• p ≤ .05 is the conventional alpha used in educational research. This

means there is less than a 5% probability the differences are due to

sampling error and a high probability the differences are due to the

treatment.

Probabilities are derived from the normal distribution curve. Remember from

Chapter Six that 95% of values will randomly occur within the first two standard

deviations (plus or minus 2 standard deviations) on the curve.
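As a quick check of the 95% figure, the area under the standard normal curve can be computed directly. This is a minimal sketch using only Python's standard library; the function and variable names are illustrative and do not come from the chapter.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
curve = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Proportion of values within plus or minus k standard deviations."""
    return curve.cdf(k) - curve.cdf(-k)


within_two_sd = share_within(2.0)
within_1_96_sd = share_within(1.96)

print(f"within 2 SD:    {within_two_sd:.3f}")
print(f"within 1.96 SD: {within_1_96_sd:.3f}")
```

The "two standard deviations" in the text is a convenient rounding: the cut-off that leaves almost exactly 95% in the middle of the curve is 1.96 standard deviations.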

Figure 1. Normal Distribution Curve p= .05

By establishing alpha at .05, the researcher is setting a standard: a result counts as

significant only if it falls outside the range in which 95% of randomly occurring results (those occurring within the

shaded area) would be expected from sampling error or chance alone. Quantitative researchers use

inferential statistical tests to arrive at a p-value.

• Inferential statistics use probability theory to make inferences from data

about the likelihood that results occurred by chance and whether the

results can be generalized from the subjects in the sample to all of the

subjects in the population from which the sample was drawn.

The inferential tests that use the means and standard deviations of measured

outcomes to calculate statistical significance in experiments are the t-test,

ANOVA / MANOVA, and ANCOVA / MANCOVA.

t-tests

t-tests compare outcomes of two groups or two variables. There are two

types of t-tests: 1) t-tests for independent means that analyze the difference in

means between two groups and 2) t-tests for correlated means that analyze

differences in means for the same group before and after a treatment. The t-test

yields a t-value that leads to a p-value. For example, in a post-test-only true

experiment with randomly assigned treatment and control groups, the researcher

sets alpha at p ≤ .05. He uses a t-test to compare math scores of students who

were instructed with an innovative method to those who had traditional instruction;

the test yields t = 2.29. He then derives a p-value from the t-value to see if the alpha

has been achieved. The summary data display will look like this:

N = 60    Treatment    Control    t

Mean      2.89         2.54       2.29*

SD        4.0          4.1

df = 58    * p ≤ .05

The number of subjects in the study is designated by “N”, N = 60. The (*) that

appears next to the t-value directs us to the bottom of the chart where we see df

(58); this stands for degrees of freedom (the number of subjects minus the number of

groups: 60 − 2) and is used to find p = .05 on a Table of Critical t-Values.

A segment of that table for degrees of freedom from 40 to 75 is

presented below.

df Probability value (p)

.10 .05 .02 .01 .005 .002 .001

40 1.68 2.02 2.42 2.70 2.97 3.31 3.55

45 1.68 2.01 2.41 2.69 2.95 3.28 3.52

50 1.68 2.01 2.40 2.68 2.94 3.26 3.50

55 1.67 2.00 2.40 2.67 2.93 3.25 3.48

60 1.67 2.00 2.39 2.66 2.91 3.23 3.46

65 1.67 2.00 2.39 2.65 2.91 3.22 3.45

70 1.67 1.99 2.38 2.65 2.90 3.21 3.44

75 1.67 1.99 2.38 2.64 2.89 3.20 3.43

Table 1. Critical t-values (df = 40 - 75)

In our example df=58. When we look down the table we look at df=60, which is

close enough. When we scan across the table for a p= .05, the t-value listed is

2.00. Any t-value above that would signify statistical significance. Since t=2.29 in

our example, that indicates that the results meet the criterion of p ≤ .05 and are

significant.
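The table lookup described above can be sketched in a few lines of Python. The values are transcribed from Table 1; the function names are hypothetical, and choosing the nearest tabled df simply mirrors the "close enough" step in the text (a more conservative convention would round df down).

```python
# Table 1 transcribed: critical t-values keyed by degrees of freedom,
# one list entry per probability level in P_LEVELS.
P_LEVELS = [0.10, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001]
CRITICAL_T = {
    40: [1.68, 2.02, 2.42, 2.70, 2.97, 3.31, 3.55],
    45: [1.68, 2.01, 2.41, 2.69, 2.95, 3.28, 3.52],
    50: [1.68, 2.01, 2.40, 2.68, 2.94, 3.26, 3.50],
    55: [1.67, 2.00, 2.40, 2.67, 2.93, 3.25, 3.48],
    60: [1.67, 2.00, 2.39, 2.66, 2.91, 3.23, 3.46],
    65: [1.67, 2.00, 2.39, 2.65, 2.91, 3.22, 3.45],
    70: [1.67, 1.99, 2.38, 2.65, 2.90, 3.21, 3.44],
    75: [1.67, 1.99, 2.38, 2.64, 2.89, 3.20, 3.43],
}


def nearest_df(df):
    """Pick the tabled df closest to the study's df (58 maps to 60)."""
    return min(CRITICAL_T, key=lambda tabled: abs(tabled - df))


def critical_t(df, p):
    """Critical t-value for the nearest tabled df at probability level p."""
    return CRITICAL_T[nearest_df(df)][P_LEVELS.index(p)]


def is_significant(t, df, alpha=0.05):
    """True when the obtained t reaches the critical value for alpha."""
    return abs(t) >= critical_t(df, alpha)


def smallest_tabled_p(t, df):
    """Smallest tabled probability level that the obtained t reaches."""
    row = CRITICAL_T[nearest_df(df)]
    reached = [p for p, crit in zip(P_LEVELS, row) if abs(t) >= crit]
    return min(reached) if reached else None


print(critical_t(58, 0.05))       # critical value used in the example
print(is_significant(2.29, 58))   # obtained t compared with that value
print(smallest_tabled_p(2.29, 58))
```

For the worked example, t = 2.29 clears the .05 column (2.00) but not the .02 column (2.39), which is why the result is reported at the .05 level.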

ANOVA / MANOVA

ANOVA (Analysis of Variance) and MANOVA (Multivariate Analysis of

Variance) also compare outcomes between groups; ANOVA is more flexible than

t-tests in that it can compare outcomes for more than two groups. ANOVA

analyzes the differences between group means relative to the differences within

groups. MANOVA is used when the groups are compared on two or more

dependent variables at the same time. Both the

ANOVA and MANOVA yield an F-value that leads to a p-value.

For example, a researcher uses ANOVA to determine whether there are

differences in results from three approaches to reinforcing skills in math. The

sample is selected from two fourth grade classrooms in a school; students are

randomly assigned to three groups after receiving direct instruction. Group #1

works in cooperative learning groups; group #2 receives instruction from an

interactive computer program; and group #3 uses concept mapping strategies.

              Group #1 Cooperative    Group #2 Computer    Group #3 Concept Maps
              (n = 15)                (n = 14)             (n = 15)

Mean Scores   80                      80.5                 86*

SD            4                       4.1                  6

* F = 7.14; p = .002

The ANOVA generated an F-value of 7.14, which resulted in p = .002,

which is below the .05 alpha. The results are statistically significant, with the

concept-mapping group scoring highest. The conversion from F to p is based on a Table of

Critical F-values that, like the table of t-values, lists degrees of freedom and p-

values.
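The F-value in this example can be recomputed from the summary statistics alone. The sketch below is illustrative rather than the authors' procedure: it assumes the reported SDs are sample standard deviations (n - 1 in the denominator), and the helper names are hypothetical. The p-value helper uses a closed form that holds only when the numerator degrees of freedom equal 2 (three groups); in general a table or statistical software is needed.

```python
# Summary statistics from the three-group example: (n, mean, SD).
groups = {
    "cooperative": (15, 80.0, 4.0),
    "computer": (14, 80.5, 4.1),
    "concept maps": (15, 86.0, 6.0),
}


def one_way_anova(summary):
    """Return (F, df_between, df_within) from per-group n, mean, and SD."""
    total_n = sum(n for n, _, _ in summary.values())
    grand_mean = sum(n * m for n, m, _ in summary.values()) / total_n
    # Variation of the group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in summary.values())
    # Variation of subjects around their own group mean.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in summary.values())
    df_between = len(summary) - 1
    df_within = total_n - len(summary)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


def p_from_f_two_numerator_df(f_value, df_within):
    """Upper-tail p for F, valid only when the numerator df is exactly 2."""
    return (1 + 2 * f_value / df_within) ** (-df_within / 2)


f_value, df_b, df_w = one_way_anova(groups)
p_value = p_from_f_two_numerator_df(f_value, df_w)
print(f"F({df_b}, {df_w}) = {f_value:.2f}, p = {p_value:.4f}")
```

Under these assumptions the computation agrees with the reported F of 7.14 and a p-value of about .002.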

ANCOVA / MANCOVA

ANCOVA (Analysis of Covariance) and MANCOVA (Multivariate Analysis of

Covariance) are predominantly used in quasi-experimental studies, when there

is a likelihood that the groups are non-equivalent at the start; they are also used

in other experimental designs in order to ensure the equivalence of the groups.

These tests provide more certainty that the outcomes of a quasi-experiment are

not being affected by the differences, or variances, that exist in the subjects

before the experiment begins. An ANCOVA or MANCOVA is performed at the

end of the experiment to level the playing field; it is like a golf handicap that

accommodates differences in competitors before the round begins. ANCOVA

and MANCOVA yield an F-value that leads to a p-value.

For example, in a quasi-experimental study that uses two intact

classrooms to investigate the effect of an innovative approach to language arts

on the two variables of writing and reading scores, the researcher uses

MANCOVA to “level the playing field” between the two non-equivalent groups.

The display of data will look like this:

Treatment Control Pre Test Scores Writing Reading

M SD 30.87 12.85 25.17 12.71

M SD 23.35 13.84 19.71 11.02

Post Test Scores Writing Reading

M SD 36.32 12.89 36.89 13.79

M SD 28.61 12.16 26.42 10.05

*F (11); p= .0018

** F (20); p= .0001

The MANCOVA generated F-values of 11 and 20, which led to p = .0018

and p = .0001. These values are below the .05 alpha; the results are statistically

significant for both reading and writing. These F-values tell you that there are

statistically significant differences, but they do not specify exactly which groups are

different; researchers follow up with t-tests to make these comparisons.

One- and Two-Tailed Tests

Researchers choose between using a one-tailed or a two-tailed inferential test.

A one-tailed test uses one end, or tail, of the curve to generate a p ≤ .05; a

two-tailed test uses two ends, or tails, with p ≤ .025 on each end of the curve,

to generate p ≤ .05. The illustration below shows these probabilities.

Figure 2. One tailed tests at p=.05

Figure 3. Two-tailed test

The most objective test is the two-tailed test. It represents a higher standard and

degree of difficulty for results because the probability must be distributed to both

ends of the curve. A one-tailed test is more lenient because all of the probability

for the hypothesis test is on one side of the distribution curve. In the final analysis

what matters is the probability (p) value, and moreover, a theory that provides a

rationale for making a well-founded prediction.
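The difference in stringency between the two kinds of test can be made concrete by computing the cut-off on the curve for each. This is a sketch on the standard normal curve, as a simplification; t-tests use the t distribution, whose cut-offs are slightly larger for small samples. The names are illustrative.

```python
from statistics import NormalDist

ALPHA = 0.05
z = NormalDist()  # standard normal curve

# One-tailed: all of alpha sits in a single tail of the curve.
one_tailed_cutoff = z.inv_cdf(1 - ALPHA)

# Two-tailed: alpha is split, .025 in each tail.
two_tailed_cutoff = z.inv_cdf(1 - ALPHA / 2)

print(f"one-tailed cut-off: {one_tailed_cutoff:.3f} SD")
print(f"two-tailed cut-off: {two_tailed_cutoff:.3f} SD")
```

A result has to fall farther from the mean to be significant on the two-tailed test (about 1.96 standard deviations) than on the one-tailed test (about 1.645), which is the sense in which the two-tailed test is the higher standard.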

Null Hypothesis Significance Testing: Hypothesis Testing

Hypothesis testing is the next step in the process of null hypothesis

significance testing (NHST) and uses statistical significance to confirm or

disconfirm a hypothesis. As a reminder, here are the definitions that are useful in

this discussion.

• The hypothesis operationalizes the theory in terms of the independent

and dependent variables. A hypothesis makes a prediction about the

effect of the independent variable on the dependent variable and can be

stated as either a directional or non-directional hypothesis.

• The directional hypothesis predicts that the treatment will result in a

change on the dependent variable and that the change will be a positive

result of the experiment.

• The non-directional hypothesis predicts that a treatment will result in a

change in outcomes, but does not predict the direction of the change:

whether it will be positive or negative.

In hypothesis testing, the researcher uses a construct known as the null

hypothesis.

• The null hypothesis predicts there will be no significant difference

between the experimental group(s) and control group(s) as a result of

the treatment.

To confirm that a hypothesis is true, the researcher has to demonstrate that the null

is false. This is where significance testing comes into play. The researcher uses

the p-value derived from significance testing as the probability of obtaining the observed

results if the null were true. If p ≤ .05, such results would be unlikely under the null; therefore the researcher

can reject the null hypothesis and confirm that the hypothesis is true. In effect,

the null hypothesis is set up as a “straw man;” it is easier to prove something

false than it is to prove something true. Below is a graphic that summarizes the

multi-layered process of data analysis through NHST.

Summary of Null Hypothesis Significance Testing

Figure 3. Summary of Null Hypothesis Significance Testing

Type I and Type II Errors

Hypothesis testing is often used in decision-making, and researchers have

to guard against making inferential errors.

• Type-I error means that the researcher erroneously rejects the null

hypothesis. This is a false positive that leads to the incorrect

conclusion that the hypothesis has been confirmed.

• Type-II error occurs when the researcher erroneously accepts the null

hypothesis. This is a false negative that leads to the incorrect

conclusion that the hypothesis has not been confirmed.

Figure 4. Type I and Type II Errors; Decision-making about the null

hypothesis

A Type I error is called the alpha error because it usually occurs when the alpha

has been set too high (p = .05), leaving too much room for error. The researcher

can avoid this error by setting a lower alpha (p ≤ .025 or p ≤ .01). The Type II Error

is called the Beta error. To avoid this error, the researcher has to follow a set of

statistical procedures before the study begins. This is called power analysis and is

described in the section below.
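The link between alpha and the Type I error can be illustrated with a small simulation. In this sketch, both groups are drawn from the same population, so the null hypothesis is true by construction and every "significant" result is a false positive. The group size of 31 (df = 60) and the critical value of 2.00 are taken from the df = 60 row of Table 1; the function names, the number of trials, and the random seed are illustrative assumptions.

```python
import random


def mean_and_variance(values):
    """Sample mean and sample variance (n - 1 in the denominator)."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / (n - 1)


def pooled_t(a, b):
    """t statistic for two independent groups, using the pooled variance."""
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    na, nb = len(a), len(b)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (mean_a - mean_b) / (pooled * (1 / na + 1 / nb)) ** 0.5


def type_one_error_rate(n_per_group=31, trials=4000, critical_t=2.00, seed=0):
    """Share of trials that reject a null hypothesis that is actually true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Same population for both groups: any difference is sampling error.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(pooled_t(a, b)) > critical_t:
            rejections += 1
    return rejections / trials


rate = type_one_error_rate()
print(f"false positive rate: {rate:.3f}")
```

The simulated rate should land near .05, which is what setting alpha at .05 means in practice: roughly one in twenty experiments with no real effect will still cross the significance threshold.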

Effect Size Estimates and Power Analysis

NHST is not without its critics. As Thompson stated, "statistical

significance is not sufficiently useful to be invoked as the sole criterion for

evaluating the noteworthiness" of research (2002a, p. 66). Statistical significance

does tell whether differences exist, but it does not tell the size, or magnitude, of

those differences; nor does it provide insight into the ultimate meaningfulness of

the research. And it may obscure Type I and Type II Errors. Researchers use

effect size estimates and power analysis to address those concerns.

Effect Size Estimates

Effect size estimates convey "the magnitude of an effect or the strength of

a relationship" (APA, 2001, p. 25); they are meaningful estimations of impact.

Effect Size Estimates provide information that statistical significance does not;

they provide evidence about what is termed practical significance.

• Practical significance answers the question: Are the differences big

enough to have real meaning and to guide decision-making in

practical situations?

Expressed as ‘ES’ or ‘d,’ effect size is an essential statistical calculation in social

research. The fifth edition of the APA Publication Manual (2001) states, “The

general principle to be followed … is to provide the reader not only with

information about statistical significance but also with enough information to

assess the magnitude of the observed effect or relationship” (pp. 25-26).

There are several ways to calculate Effect Size. The formula commonly

used in educational research is called the Glass Δ (delta). This formula simply

calculates the difference between the means of the treatment and control groups

and divides the answer by the standard deviation of the control group (Glass,

1976; Cohen, 1968).

Glass Δ = [Mean (experiment) – Mean (control)] / Standard deviation (control)

The Glass Δ can be translated into units of standard deviation gain: the greater

the effect size, the greater the gain in SD units. By way of explanation: an effect

size of 1.0 is equivalent to a one standard deviation change in outcomes on a

normal distribution curve. An effect size of 0.5 is equivalent to ½ standard

deviation change; this would mean that the average pupil experiencing a

treatment would improve by almost one half ( ½ ) a standard deviation on a

standardized measure. For example a student would move from the 50th

percentile to the 67th percentile on the math SAT, from a score of around 520 to a

score of 570, as the result of a test preparation program.
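The formula and its percentile interpretation can be sketched as follows. The two means echo the SAT illustration above; the control-group standard deviation of 110 is a hypothetical value chosen for illustration, not a figure from the chapter, and the function name is likewise an assumption.

```python
from statistics import NormalDist


def glass_delta(mean_treatment, mean_control, sd_control):
    """Difference in means expressed in control-group SD units."""
    return (mean_treatment - mean_control) / sd_control


# Hypothetical inputs: means from the SAT illustration, SD of 110 assumed.
delta = glass_delta(mean_treatment=570, mean_control=520, sd_control=110)

# Percentile reached by a student who starts at the control-group mean
# (50th percentile) and gains `delta` standard deviations.
percentile = NormalDist().cdf(delta) * 100

print(f"Glass delta: {delta:.2f}")
print(f"new percentile: {percentile:.1f}")
```

With these assumed inputs the effect size is a little under one half of a standard deviation, which places the student at roughly the 67th percentile on a normal curve.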

Cohen developed the following guidelines for interpreting effect size. ES=0.5

is considered adequate for establishing difference of sufficient magnitude.

< 0.1 = trivial effect (less than 1/10 SD gain)

0.1 - 0.3 = small effect (up to ⅓ SD gain)

0.3 - 0.5 = moderate effect (up to ½ SD gain)

> 0.5 = large effect (above ½ SD gain)

Just as researchers usually have as their goal to achieve an alpha of p ≤ .05,

they usually strive for ES = 0.5. Effect Size is not only used to determine practical

significance; it is also used in calculating statistical power.

Statistical Power and Power Analysis

Statistical Power tells the likelihood that the researcher will avoid making a

Type II error: accepting as true a null hypothesis that is false. Power is expressed

as a statistic of probability; the higher the power, the greater the probability of

avoiding error. A power analysis allows a researcher to calculate the sample size

that will avoid a Type II error, given the desired alpha and effect size. A power of

0.80 is considered the acceptable threshold for avoiding error. Poor power

cannot be corrected after an experiment is completed; to be useful, a power analysis must be conducted

before the experiment begins.

1. Before the experiment begins, the researcher sets the desired alpha at .05

and the desired effect size at 0.50.

2. The researcher conducts a statistical power analysis to determine the

sample size necessary to reach power= 0.80.

3. The power analysis determines that a sample size of 65 subjects for each

experimental group would be adequate to reach a 0.80 power.

4. If the researcher cannot have access to a sample of that size, she can

increase power by raising the alpha (for example, to .10), by designing the

study to detect a larger effect size (such as 0.80), or by doing both.
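The sample-size step can be approximated with a standard normal-curve formula for comparing two group means with a two-tailed test. This is a sketch under that approximation, not the procedure used by any particular power-analysis program, and the function name is hypothetical. Exact t-based calculations give slightly larger numbers, which is consistent with the figure of about 65 per group quoted above.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed cut-off for alpha
    z_power = z.inv_cdf(power)          # cut-off for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


baseline = n_per_group(0.50)                   # alpha .05, ES 0.50
larger_effect = n_per_group(0.80)              # alpha .05, ES 0.80
higher_alpha = n_per_group(0.50, alpha=0.10)   # alpha .10, ES 0.50
lower_alpha = n_per_group(0.50, alpha=0.025)   # alpha .025, ES 0.50

print(baseline, larger_effect, higher_alpha, lower_alpha)
```

Reading the four results side by side shows how the required sample size responds to each choice: a larger expected effect or a more lenient alpha lowers it, while a stricter alpha raises it.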

The concept of power is crucial to the conduct of responsible and sound research.

An understanding of statistical power enables educators as consumers of

research to ask the right questions about what “the research says” and to make

informed judgments about effective interventions they might use in their own

practice.

II. Validity of True and Quasi Experiments

In evaluating the quality of an experiment, a reader has to consider the

overall validity of the study.

• Validity is the “approximate truth of propositions, inferences, or

conclusions” (Trochim).

There are three types of validity that taken together lead to the overall validity of

an experiment.

• Conclusion validity answers the question: Is there a relationship

between the independent and dependent variables?

• Internal validity answers the question: Is the relationship causal?

• External validity answers the question: Can we generalize findings to

other people and settings?

There are barriers to achieving each validity component that researchers have to

be mindful of before they can make a judgment about the quality of the validity

profile of the study.

• Threats to validity are factors that interfere with a study and

compromise our confidence about its results.

Conclusion Validity

Conclusion Validity is the degree to which it is reasonable to conclude that

there is a relationship between variables. Threats to conclusion validity

compromise our confidence that a relationship does exist between variables.

These threats include the following.

• Low reliability of measures: reliability < 0.7

• Low statistical power: inadequate sample size, alpha set too low

• ‘Fishing’: analyzing and re-analyzing data; making multiple

comparisons with the aim of finding significant results

• Mismatch of statistics to sample characteristics

Internal Validity

Internal validity concerns the level of control over causation that the

researcher has in an experiment. It is the degree to which results are due to a

causal relationship between variables and that the effect of the IV on the DV is

not due to any variables (extraneous variables) other than the independent

variable. An experiment that has strong internal validity can unambiguously

attribute the effects on the dependent variable to the treatment of the

independent variable.

Threats to internal validity compromise our confidence in saying that a

causal relationship exists between the independent and dependent variables. The

most common threats to internal validity are the following.

• History: An unanticipated event occurs while the experiment is in

progress

• Maturation: Normal developmental processes that affect subjects

differently over time

• Statistical Regression: Subjects at the extremes regress to the mean

during post-tests

• Selection: The groups are not equivalent at the beginning of the study

• Mortality: Subjects drop out of the study

• Testing: The pre-test sensitizes subjects to the post-test

• Instrumentation: Changes in the way the dependent variable is measured

• Compensatory Rivalry (“The John Henry Effect”): Social competition

motivates a group to over-perform and mask treatment effects

External Validity

External validity (or generalizability) is the degree to which the findings

can be applied to other people, settings, and times. External validity can be

approached in two ways: as population validity or as ecological validity.

• Population validity is the degree to which the results of an experiment

can be generalized to individuals who were not in the study.

• Ecological validity is the extent to which the results of an experiment

can be generalized to different settings.

Threats to external validity compromise confidence in stating whether the study’s

results are applicable to other people and settings. These threats include the

following:

• For population validity: not having a representative sample, randomly

selected from a target population.

• For ecological validity:

o The Hawthorne effect (also called reactivity): Outcomes are due to

the reaction of subjects to being studied and are not due to the

treatment.

o Experimenter effect: Outcomes are due to the characteristics of the

person conducting the study.

o Insufficient description of the conditions of the experiments: setting,

treatment, and measurement

Evaluating Validity

Evaluating the various validities of between-group experiments and

quasi-experiments is a complex undertaking. We recommend research

consumers form a judgment by using a framework that considers three criteria:

1) theory and treatment, 2) sampling, and 3) measurement.

Theory and Treatment:

• Fidelity to a theory that is supported by a well-referenced research review

and is not subject to modification (through “fishing”) enhances confidence

about conclusion validity.

• A solid theory that is supported by a well-referenced research review and

leads to a hypothesis that clearly states a causal relationship between

variables enhances confidence about internal and external validity.

• A theory-based treatment that is clearly described and strictly

implemented enhances confidence in internal and external validity.

Sampling:

• Adequate sample size and a good match between the sample and

statistical analysis enhance confidence about conclusion validity.

• A detailed description of people in the sample (how many and subject

characteristics) enhances confidence about internal and external

validity.

• A detailed description of the setting enhances confidence about

internal and ecological external validity.

• Control over sampling threats enhances confidence about internal and

external validity.

• Random assignment of the sample to the treatment condition

enhances confidence about internal and external validity.

• Random selection of the sample from a population enhances

confidence about population external validity.

Measurement:

• Reliable measures (r ≥ 0.7) enhance confidence in conclusion,

internal, and external validity.

• Valid measures enhance confidence in internal and external validity.

• Consistent measures enhance confidence in internal and external

validity.

The table below summarizes these criteria and may serve as a template for

evaluating validity.

Criterion | Conclusion Validity | Internal Validity | External Validity (Population) | External Validity (Ecological)

Theory and Treatment | Clear statement of hypothesis that predicts how IV will affect DV. No evidence of “fishing”. | Well-referenced research review leading to a solid causal theory or framework. Clear statement of hypothesis that predicts how IV will affect DV. Evidence of history threat or testing threat? | Well-referenced research review leading to a solid causal theory or framework. Clear statement of hypothesis that predicts how IV will affect DV. | Well-referenced research review leading to a solid causal theory or framework. Clear statement of hypothesis that predicts how IV will affect DV. Detailed description of the treatment and conditions of the study, including history threat. Evidence of Hawthorne Effect, Experimenter Effect, or insufficient description?

Sampling | Adequate sample size determined by power analysis and match of sample to statistics. | Adequate sample size determined by power analysis to avoid Type II Error. Detailed description of sample (size and subject characteristics) and setting. Random assignment to treatment condition. Evidence of maturation, selection, regression, or compensatory rivalry threats? | Detailed description of sample (size and subject characteristics). Random selection from a population and random assignment to treatment. | Detailed description of setting. Random assignment to treatment. Evidence of Hawthorne Effect, Experimenter Effect, or insufficient description of the setting?

Measurement | Reliability of measures ≥ .7 | Use of reliable, valid, and consistent measures for DV. Evidence of instrumentation threat? | Use of reliable, valid, and consistent measures for DV. | Use of reliable, valid, and consistent measures for DV.

Rating | H M L | H M L | H M L | H M L

Summary of Criteria for Evaluating Studies

Summary

• There are three complementary approaches to statistical analysis in

experimental and quasi-experimental research: Null Hypothesis

Significance Testing (NHST), Effect Size, and Power Analysis

• Inferential statistics build on descriptive statistics (means and standard

deviations) to determine the likelihood that results occurred by chance and

whether the results can be generalized.

• Inferential tests that are used in experiments and quasi-experiments are t-tests,

ANOVA/MANOVA, and ANCOVA/MANCOVA.

• Inferential tests yield values that are converted to probabilities of results

having occurred by chance.

• Experiments and quasi-experiments are evaluated for internal validity,

statistical conclusion validity, and external validity (defined as population and

ecological validity).

• There are several threats to internal validity that the researcher seeks to

control.

• Theory/treatment, sampling, and measurement are key elements in

evaluating validity of experiments and quasi-experiments.

Terms and Concepts

NHST statistical significance

Effect size estimates power analysis

p-value/probability alpha

t-test ANOVA/MANOVA

ANCOVA/MANCOVA table of critical values

degree of freedom one and two tailed tests

null hypothesis Type I and Type II Errors

internal validity threats to internal validity

statistical conclusion validity fishing

external validity

population validity ecological validity

Review, Consolidation, and Extension of Knowledge

1. In your own words, explain null hypothesis significance testing.

2. In your own words, explain Effect Size Estimates and Power Analysis.

3. In your own words explain the difference between statistical and

practical significance.

4. Read the data analysis/results section of the experimental study

you chose in the previous chapter and complete the Guide below.

5. Use the Guide as a template for writing a critique of about 1,000 words on

the experimental or quasi-experimental between group study you selected.

See the Appendix for an exemplar.

__________________________________________________________

Guide to Reading and Critiquing an Experimental

and Quasi-Experimental Group Study

Research Review and Theory:

What is the purpose of the research review?

Does it establish an underlying theory (big ideas) for the research?

Purpose and Design:

What is the purpose of the study?

Is there a hypothesis or a research question? If so, what is it? If not, can

you infer the question from the text of the article?

What is the basic research design and type?

What are the dependent and independent variables? Identify each type of

variable in the study. ( IV=, DV=)

Sampling:

How is the sample selected?

How is the sample assigned to the treatment condition(s): random or non-

random/intact group?

Who is in the sample? What are the characteristics of the sample?

What is the sample size?

Data Collection:

What measures are used for the dependent variable?

Are these standardized measures? Adapted measures? Newly-created

measures?

What is the format of the measure(s)?

Are there indications of validity and reliability of the measures? What are

they (r-values)?

Data Analysis and Results:

What statistical tests are used to analyze the data?

Were the results (p-values) significant or non-significant?

What does the researcher conclude about the findings?

Evaluation of Validity:

How do you evaluate internal validity? What are threats to internal

validity?

How do you evaluate statistical conclusion validity? What is your

rationale?

How do you evaluate external population validity? Ecological

validity? What is your rationale?

__________________________________________________________
