CHAPTER NINE
DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN
GROUP EXPERIMENTS
Chapter Objectives:
• Understand Null Hypothesis Significance Testing (NHST)
• Understand statistical significance and probabilities (p-values)
• Understand Type I and Type II Errors in hypothesis testing
• Understand t-tests, ANOVA / MANOVA, and ANCOVA / MANCOVA
and how they are used in analyzing experiments and quasi-experiments.
• Understand Effect Size Estimates
• Understand the difference between statistical and practical significance
• Understand Power Analysis
• Understand how validity (quality) of experiments and quasi-experiments is
evaluated
• Understand internal validity and threats
• Understand statistical conclusion validity
I. Data Analysis
Data analysis in experimental and quasi experiment research is a multi-
layered process that begins with collecting data on the dependent variable after
the experiment has been implemented and ends with making inferences about
the statistical soundness and the meaning of the results. There are three
complementary approaches to statistical analysis.
Null Hypothesis Significance Testing (NHST) is the term applied to the set
of procedures that (1) establishes statistical significance of results and (2)
confirms or disconfirms an experiment’s hypothesis.
Effect Size refers to the magnitude of differences on the dependent
variable after a treatment.
Power Analysis refers to the probability of avoiding errors in conclusions
drawn from NHST.
Null Hypothesis Significance Testing: Significance Testing
• Statistical significance is a low probability that results are due to
sampling error or chance and a high probability that results are due to
the treatment.
Significance testing is the first step in NHST. It is important to note that
‘significance’ as used here does not mean importance. It merely reports on the
probability that outcomes are not due to error, and thus affirms that the outcomes
are real and are due to a treatment. Significance is based on probability.
• p represents the level of probability that the differences are due to
various sources of error (e.g., sampling). In most cases the lower the p-
value, the better.
• Alpha is the p-value that the researcher sets at the beginning of the
study as an acceptable level of probability that differences are due to
sampling error.
• p ≤ .05 is the conventional alpha used in educational research. This
means there is less than a 5% probability the differences are due to
sampling error and a high probability the differences are due to the
treatment.
Probabilities are derived from the normal distribution curve. Remember from
Chapter Six that 95% of values will randomly occur within the first two standard
deviations (plus or minus 2 standard deviations) on the curve.
Figure 1. Normal Distribution Curve p= .05
By establishing alpha at .05, the researcher sets a standard: a result is treated
as significant only when it falls outside the shaded area, the range in which 95%
of randomly occurring results would fall because of sampling error or chance.
Quantitative researchers use inferential statistical tests to arrive at a p-value.
• Inferential statistics use probability theory to make inferences from data
about the likelihood that results occurred by chance and whether the
results can be generalized from the subjects in the sample to all of the
subjects in the population from which the sample was drawn.
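The link between the 95% figure and the .05 alpha described above can be checked numerically. The sketch below is only an illustration, assuming a standard normal curve (mean 0, standard deviation 1); the names `std_normal` and `central_area` are invented for this example and are not part of the chapter.

```python
from statistics import NormalDist

# Assumption: a standard normal curve (mean 0, SD 1).
std_normal = NormalDist(mu=0.0, sigma=1.0)


def central_area(z: float) -> float:
    """Proportion of values lying within +/- z standard deviations of the mean."""
    return std_normal.cdf(z) - std_normal.cdf(-z)


# Share of randomly occurring values within two standard deviations.
within_two_sd = central_area(2.0)

# Cutoff (in SD units) that leaves alpha = .05 split across the two tails.
alpha = 0.05
z_cutoff = std_normal.inv_cdf(1 - alpha / 2)

print(f"Area within +/- 2 SD: {within_two_sd:.4f}")
print(f"Cutoff leaving {alpha:.0%} in the two tails: {z_cutoff:.2f} SD")
```

The "two standard deviations" in the text is a rounding: the cutoff that leaves exactly 5% in the tails is slightly below 2 SD.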
The inferential tests that use the means and standard deviations on measured
outcomes to calculate statistical significance in experiments are the t-test,
ANOVA / MANOVA, and ANCOVA / MANCOVA.
t-tests
t-tests compare outcomes of two groups or two variables. There are two
types of t-tests: 1) t-tests for independent means that analyze the difference in
means between two groups and 2) t-test for correlated means that analyze
differences in means for the same group before and after a treatment. The t-test
yields a t-value that leads to a p-value. For example, in a post-test only true
experiment with randomly assigned treatment and control groups, the researcher
sets alpha at p ≤ .05. He uses a t-test to compare the math scores of students who
were instructed with an innovative method to those of students who had traditional
instruction; the test yields t = 2.29. He then derives a p-value from the t-value to
see if the alpha has been achieved. The summary data display will look like this:
N = 60     Treatment   Control   t
Mean       2.89        2.54      2.29*
SD         4.0         4.1
df = 58    * p = .05
The number of subjects in the study is designated by “N” (N = 60). The (*) that
appears next to the t-value directs us to the bottom of the chart, where we see df
(58); this stands for degrees of freedom (the number of subjects minus the number
of groups: 60 - 2) and is used to find p = .05 on a Table of Critical t-Values. A
segment of that table for degrees of freedom from 40 to 75 is
presented below.
df    Probability value (p)
      .10    .05    .02    .01    .005   .002   .001
40    1.68   2.02   2.42   2.70   2.97   3.31   3.55
45    1.68   2.01   2.41   2.69   2.95   3.28   3.52
50    1.68   2.01   2.40   2.68   2.94   3.26   3.50
55    1.67   2.00   2.40   2.67   2.93   3.25   3.48
60    1.67   2.00   2.39   2.66   2.91   3.23   3.46
65    1.67   2.00   2.39   2.65   2.91   3.22   3.45
70    1.67   1.99   2.38   2.65   2.90   3.21   3.44
75    1.67   1.99   2.38   2.64   2.89   3.20   3.43
Table 1. Critical t-values (df = 40 - 75)
In our example df=58. When we look down the table we look at df=60, which is
close enough. When we scan across the table for a p= .05, the t-value listed is
2.00. Any t-value above that would signify statistical significance. Since t=2.29 in
our example, that indicates that the results meet the criteria for ≤ .05 and are
significant.
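The table lookup just described can be sketched in a few lines. This is only an illustration of the procedure: the dictionary holds the p = .05 column of Table 1, the function names are hypothetical, and the choice of the nearest tabled df (rather than the next lower one, which would be more conservative) mirrors the "close enough" step in the text. Comparing the absolute value of t is an added assumption for a two-sided comparison.

```python
# p = .05 column of Table 1, keyed by degrees of freedom.
CRITICAL_T_05 = {
    40: 2.02, 45: 2.01, 50: 2.01, 55: 2.00,
    60: 2.00, 65: 2.00, 70: 1.99, 75: 1.99,
}


def critical_t(df: int) -> float:
    """Return the tabled critical t for the tabled df closest to `df`."""
    nearest_df = min(CRITICAL_T_05, key=lambda tabled: abs(tabled - df))
    return CRITICAL_T_05[nearest_df]


def is_significant(t_value: float, df: int) -> bool:
    """True when the observed t exceeds the critical value at p = .05."""
    return abs(t_value) > critical_t(df)


# The chapter's example: t = 2.29 with df = 58.
print(critical_t(58))            # tabled value used for df = 58
print(is_significant(2.29, 58))  # decision for the example
```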
ANOVA / MANOVA
ANOVA (Analysis of Variance) and MANOVA (Multivariate Analysis of
Variance) also compare outcomes between groups; ANOVA is more flexible than
t-tests in that it can compare outcomes for two or more groups. ANOVA
analyzes the differences between groups in relation to the differences within
groups. MANOVA is used when the groups are compared on two or more
dependent variables at the same time. Both the
ANOVA and MANOVA yield an F-value that leads to a p-value.
For example, a researcher uses ANOVA to determine whether there are
differences in results from three approaches to reinforcing skills in math. The
sample is selected from two fourth grade classrooms in a school; students are
randomly assigned to three groups after receiving direct instruction. Group #1
works in cooperative learning groups; group #2 receives instruction from an
interactive computer program; and group #3 uses concept mapping strategies.
              Group #1 Cooperative   Group #2 Computer   Group #3 Concept Maps
              (n = 15)               (n = 14)            (n = 15)
Mean Scores   80                     80.5                86 *
SD            4                      4.1                 6
* F = 7.14; p = .002
The ANOVA generated an F-value of 7.14, which resulted in p = .002,
which is below the .05 alpha. The results are statistically significant for the
concept-mapping group. The conversion from F to p is based on a Table of
Critical F-values that, like the table of t-values, lists degrees of freedom and
p-values.
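The F-value in this example can be recovered from the summary statistics alone. The sketch below is a worked illustration, not the researcher's actual procedure: it assumes the reported SDs are sample standard deviations (computed with n - 1), and the function name `one_way_f` and the tuple layout are invented for the example. Converting F to p still requires a table of critical F-values or statistical software, so that step is left out.

```python
# (label, n, mean, SD) for each group, taken from the example table.
groups = [
    ("Cooperative", 15, 80.0, 4.0),
    ("Computer", 14, 80.5, 4.1),
    ("Concept Maps", 15, 86.0, 6.0),
]


def one_way_f(groups):
    """Compute a one-way ANOVA F-value from group sizes, means, and SDs."""
    total_n = sum(n for _, n, _, _ in groups)
    grand_mean = sum(n * mean for _, n, mean, _ in groups) / total_n

    # Between-group variation: how far each group mean is from the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for _, n, mean, _ in groups)
    # Within-group variation: rebuilt from each group's SD (assumes n - 1 SDs).
    ss_within = sum((n - 1) * sd ** 2 for _, n, _, sd in groups)

    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


f_value, df_between, df_within = one_way_f(groups)
print(f"F({df_between}, {df_within}) = {f_value:.2f}")
```

With these inputs the computed F rounds to the 7.14 reported in the example.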
ANCOVA / MANCOVA
ANCOVA (Analysis of Covariance) and MANCOVA (Multivariate Analysis of
Covariance) are predominantly used in quasi-experimental studies, when there
is a likelihood that the groups are non-equivalent at the start; they are also used
in other experimental designs in order to ensure the equivalence of the groups.
These tests provide more certainty that the outcomes of a quasi-experiment are
not being affected by the differences, or variances, that exist in the subjects
before the experiment begins. An ANCOVA or MANCOVA is performed at the
end of the experiment to level the playing field; it is like a golf handicap that
accommodates differences in competitors before the round begins. ANCOVA
and MANCOVA yield an F-value that leads to a p-value.
For example, in a quasi-experimental study that uses two intact
classrooms to investigate the effect of an innovative approach to language arts
on the two variables of writing and reading scores, the researcher uses
MANCOVA to “level the playing field” between the two non-equivalent groups.
The display of data will look like this:
                    Writing            Reading
                    M       SD         M       SD
Pre Test Scores
  Treatment         30.87   12.85      25.17   12.71
  Control           23.35   13.84      19.71   11.02
Post Test Scores
  Treatment         36.32   12.89      36.89   13.79
  Control           28.61   12.16      26.42   10.05
* F (11); p = .0018
** F (20); p = .0001
The MANCOVA generated F-values of 11 and 20, which led to p = .0018
and p = .0001. These values are below the .05 alpha; the results are statistically
significant for both reading and writing. These F-values tell you that there are
statistically significant differences, but they do not specify exactly which groups
are different; researchers follow up with t-tests to make these comparisons.
One- and Two-Tailed Tests
Researchers choose between using a one-tailed or a two-tailed inferential test.
A one-tailed test uses one end, or tail, of the curve to generate p ≤ .05; a
two-tailed test uses two ends, or tails, with p ≤ .025 on each end of the
curve, to generate p ≤ .05. The illustration below shows these probabilities.
Figure 2. One tailed tests at p=.05
Figure 3. Two-tailed test
The most objective test is the two-tailed test. It represents a higher standard and
degree of difficulty for results because the probability must be distributed to both
ends of the curve. A one-tailed test is more lenient because all of the probability
for the hypothesis test is on one side of the distribution curve. In the final analysis
what matters is the probability (p) value, and moreover, a theory that provides a
rationale for making a well-founded prediction.
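The difference in difficulty between the two kinds of test can be seen in the cutoffs they imply. The sketch below is a rough illustration that assumes a standard normal curve (a large-sample approximation; a t-test would use slightly larger cutoffs), with variable names chosen for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # assumption: standard normal curve (mean 0, SD 1)
alpha = 0.05

# One-tailed: all of alpha sits in a single tail of the curve.
one_tailed_cutoff = std_normal.inv_cdf(1 - alpha)

# Two-tailed: alpha is split, with .025 in each tail.
two_tailed_cutoff = std_normal.inv_cdf(1 - alpha / 2)

print(f"One-tailed cutoff: {one_tailed_cutoff:.3f} SD")
print(f"Two-tailed cutoff: {two_tailed_cutoff:.3f} SD")
```

Because the two-tailed cutoff lies farther from the mean, a result has to be more extreme to count as significant, which is why the two-tailed test is the higher standard.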
Null Hypothesis Significance Testing: Hypothesis Testing
Hypothesis testing is the next step in the process of null hypothesis
significance testing (NHST) and uses statistical significance to confirm or
disconfirm a hypothesis. As a reminder, here are the definitions that are useful in
this discussion.
• The hypothesis operationalizes the theory in terms of the independent
and dependent variables. A hypothesis makes a prediction about the
effect of the independent variable on the dependent variable and can be
stated as either a directional or non-directional hypothesis.
• The directional hypothesis predicts that the treatment will result in a
change on the dependent variable and that the change will be a positive
result of the experiment.
• The non-directional hypothesis predicts that a treatment will result in a
change in outcomes, but does not predict the direction of the change:
whether it will be positive or negative.
In hypothesis testing, the researcher uses a construct known as the null
hypothesis.
• The null hypothesis predicts there will be no significant difference
between the experimental group(s) and control group(s) as a result of
the treatment.
To confirm that a hypothesis is true, the researcher has to demonstrate that the null
is false. This is where significance testing comes into play. The researcher uses
the p-value derived from significance testing as the probability level that the null
is true. If p ≤ .05 there is little probability the null is true; therefore the researcher
can reject the null hypothesis and confirm that the hypothesis is true. In effect,
the null hypothesis is set up as a “straw man;” it is easier to prove something
false than it is to prove something true. Below is a graphic that summarizes the
multi-layered process of data analysis through NHST.
Figure 3. Summary of Null Hypothesis Significance Testing
Type I and Type II Errors
Hypothesis testing is often used in decision-making, and researchers have
to guard against making inferential errors.
• Type-I error means that the researcher erroneously rejects the null
hypothesis. This is a false positive that leads to the incorrect
conclusion that the hypothesis has been confirmed.
• Type-II error occurs when the researcher erroneously accepts the null
hypothesis. This is a false negative that leads to the incorrect
conclusion that the hypothesis has not been confirmed.
Figure 4. Type I and Type II Errors; Decision-making about the null
hypothesis
A Type I error is called the alpha error because it usually occurs when the alpha
has been set too high (p = .05), leaving too much room for error. The researcher
can avoid this error by setting a lower alpha (p= .025 or p ≤.01). The Type II Error
is called the Beta error. To avoid this error, the researcher has to follow a set of
statistical procedures before the study begins. This is called power analysis and is
described in the section below.
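The connection between alpha and the Type I error can be illustrated with a small simulation. This is a sketch under stated assumptions, not a procedure from the chapter: both groups are drawn from the same population (so the null hypothesis is true by construction), scores are assumed normal with a known SD of 1, and the function name, group size, and number of simulated experiments are arbitrary choices for the example.

```python
import random
from statistics import NormalDist


def simulate_type_i_rate(alpha=0.05, n_per_group=30, n_experiments=2000, seed=1):
    """Share of simulated experiments that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cutoff
    standard_error = (2 / n_per_group) ** 0.5     # known SD = 1 in both groups
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups come from the same population: no real treatment effect.
        treatment = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        difference = sum(treatment) / n_per_group - sum(control) / n_per_group
        if abs(difference / standard_error) > cutoff:
            false_positives += 1  # a Type I error
    return false_positives / n_experiments


rate_at_05 = simulate_type_i_rate(alpha=0.05)
rate_at_01 = simulate_type_i_rate(alpha=0.01)
print(f"Type I error rate with alpha = .05: {rate_at_05:.3f}")
print(f"Type I error rate with alpha = .01: {rate_at_01:.3f}")
```

The share of false rejections should land near the alpha that was set, and it shrinks when a lower alpha is used.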
Effect Size Estimates and Power Analysis
NHST is not without its critics. As Thompson stated, "statistical
significance is not sufficiently useful to be invoked as the sole criterion for
evaluating the noteworthiness" of research (2002a, p. 66). Statistical significance
does tell whether differences exist, but it does not tell the size, or magnitude, of
those differences; nor does it provide insight into the ultimate meaningfulness of
the research. And it may obscure Type I and Type II Errors. Researchers use
effect size estimates and power analysis to address those concerns.
Effect Size Estimates
Effect size estimates convey "the magnitude of an effect or the strength of
a relationship" (APA, 2001, p. 25); they are meaningful estimations of impact.
Effect Size Estimates provide information that statistical significance does not;
they provide evidence about what is termed practical significance.
• Practical significance answers the question: Are the differences big
enough to have real meaning and to guide decision-making in
practical situations?
Expressed as ‘ES’ or ‘d,’ effect size is an essential statistical calculation in social
research. The fifth edition of the APA Publication Manual (2001) states, “The
general principle to be followed … is to provide the reader not only with
information about statistical significance but also with enough information to
assess the magnitude of the observed effect or relationship” (pp. 25-26).
There are several ways to calculate Effect Size. The formula commonly
used in educational research is called the Glass Δ (delta). This formula simply
calculates the difference between the means of the treatment and control groups
and divides the answer by the standard deviation of the control group (Glass,
1976; Cohen, 1968).
Glass Δ = [Mean (experiment) – Mean (control)] / Standard deviation (control)
The Glass Δ can be translated into units of standard deviation gain: the greater
the effect size, the greater the gain in SD units. By way of explanation: an effect
size of 1.0 is equivalent to a one standard deviation change in outcomes on a
normal distribution curve. An effect size of 0.5 is equivalent to a ½ standard
deviation change; this would mean that the average pupil experiencing a
treatment would improve by one half (½) a standard deviation on a
standardized measure. For example, a student would move from the 50th
percentile to about the 69th percentile on the math SAT, from a score of around
520 to a score of 570, as the result of a test preparation program.
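The Glass Δ calculation and its translation into percentiles can be sketched as follows. This is an illustration only: the control-group standard deviation of 100 points is an assumed value chosen to match the SAT example (it is not given in the text), scores are assumed to be normally distributed, and both function names are hypothetical.

```python
from statistics import NormalDist


def glass_delta(mean_treatment: float, mean_control: float, sd_control: float) -> float:
    """Glass's delta: the mean difference divided by the control group's SD."""
    return (mean_treatment - mean_control) / sd_control


def percentile_after_gain(effect_size: float) -> float:
    """Percentile reached by a student who starts at the 50th percentile
    and gains `effect_size` standard deviations (assumes normal scores)."""
    return 100 * NormalDist().cdf(effect_size)


# SAT example: 520 -> 570, with an ASSUMED control-group SD of 100 points.
delta = glass_delta(mean_treatment=570, mean_control=520, sd_control=100)
print(f"Glass delta = {delta:.2f}")
print(f"New percentile = {percentile_after_gain(delta):.1f}")
```

Under these assumptions a half standard deviation gain moves a student from the 50th percentile to roughly the 69th.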
Cohen developed the following guidelines for interpreting effect size. ES = 0.5
is considered adequate for establishing a difference of sufficient magnitude.
< 0.1 = trivial effect (below 1/10 SD gain)
0.1 - 0.3 = small effect (up to ⅓ SD gain)
0.3 - 0.5 = moderate effect (up to ½ SD gain)
> 0.5 = large effect (above ½ SD gain)
Just as researchers usually have as their goal to achieve an alpha of p ≤ .05,
they usually strive for ES = 0.5. Effect Size is not only used to determine practical
significance; it is also used in calculating statistical power.
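The interpretation guidelines listed above can be written as a simple lookup. The sketch below follows the cutoffs given in this chapter; because the listed ranges share their endpoints, the treatment of boundary values (0.3 counted as small, 0.5 as moderate) is an assumption, as are the function name and the use of the absolute value for negative effects.

```python
def classify_effect_size(effect_size: float) -> str:
    """Label an effect size using the chapter's guideline ranges."""
    magnitude = abs(effect_size)  # assumption: direction is ignored
    if magnitude < 0.1:
        return "trivial"
    if magnitude <= 0.3:   # assumption: upper boundary belongs to this range
        return "small"
    if magnitude <= 0.5:   # "large" is defined as above 0.5
        return "moderate"
    return "large"


for es in (0.05, 0.2, 0.5, 0.8):
    print(es, classify_effect_size(es))
```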
Statistical Power and Power Analysis
Statistical Power tells the likelihood that the researcher will avoid making a
Type II error: accepting as true a null hypothesis that is false. Power is expressed
as a probability; the higher the power, the greater the probability of
avoiding that error. A power analysis allows a researcher to calculate the sample
size needed to keep the risk of a Type II error acceptably low, given the desired
alpha and effect size. A power of 0.80 is considered the acceptable threshold.
Poor power cannot be corrected after an experiment is completed; to be useful,
the power analysis must be carried out before the experiment begins.
1. Before the experiment begins, the researcher sets the desired alpha at .05
and the desired effect size at 0.50.
2. The researcher conducts a statistical power analysis to determine the
sample size necessary to reach power= 0.80.
3. The power analysis determines that a sample size of 65 subjects for each
experimental group would be adequate to reach a 0.80 power.
4. If the researcher cannot have access to a sample of that size, she can
increase power by raising the alpha (for example, to .10), by designing the
study to detect a larger ES such as 0.80, or by doing both.
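The steps above can be approximated with a short calculation. This is a sketch, not the procedure a statistics package would use: it relies on a normal-curve approximation for a two-group, two-tailed comparison with equal group sizes, so it lands slightly below the figure of about 65 per group given in the steps. The function names are invented for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist

_std_normal = NormalDist()  # assumption: normal approximation to the t-test


def sample_size_per_group(effect_size: float, alpha: float = 0.05,
                          power: float = 0.80) -> int:
    """Approximate subjects needed in EACH group for a two-tailed comparison."""
    z_alpha = _std_normal.inv_cdf(1 - alpha / 2)
    z_power = _std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def approximate_power(n_per_group: int, effect_size: float,
                      alpha: float = 0.05) -> float:
    """Approximate power for a given group size, effect size, and alpha."""
    z_alpha = _std_normal.inv_cdf(1 - alpha / 2)
    return _std_normal.cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)


print(sample_size_per_group(0.50))  # alpha = .05, power = .80
print(sample_size_per_group(0.80))  # a larger effect needs fewer subjects
for a in (0.025, 0.05, 0.10):
    print(a, round(approximate_power(63, 0.50, alpha=a), 3))
```

Under this approximation, power for a fixed sample rises when the alpha is loosened and when the effect size being targeted is larger; a stricter alpha costs power.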
The concept of power is crucial to the conduct of responsible and sound research.
An understanding of statistical power enables educators as consumers of
research to ask the right questions about what “the research says” and to make
informed judgments about effective interventions they might use in their own
practice.
II. Validity of True and Quasi Experiments
In evaluating the quality of an experiment, a reader has to consider the
overall validity of the study.
• Validity is the “approximate truth of propositions, inferences, or
conclusions” (Trochim).
There are three types of validity that taken together lead to the overall validity of
an experiment.
• Conclusion validity answers the question: Is there a relationship
between the independent and dependent variables?
• Internal validity answers the question: Is the relationship causal?
• External validity answers the question: Can we generalize findings to
other people and settings?
There are barriers to achieving each validity component that researchers have to
be mindful of before they can make a judgment about the quality of the validity
profile of the study.
• Threats to validity are factors that interfere with a study and
compromise our confidence about its results.
Conclusion Validity
Conclusion Validity is the degree to which it is reasonable to conclude that
there is a relationship between variables. Threats to conclusion validity
compromise our confidence that a relationship does exist between variables.
These threats include the following.
• Low reliability of measures: reliability < 0.7
• Low statistical power: inadequate sample size, alpha set too low
• ‘Fishing’: analyzing and re-analyzing data; making multiple
comparisons with the aim of finding significant results
• Mismatch of statistics to sample characteristics
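The danger of ‘fishing’ listed above can be quantified: every additional comparison is another chance of a false positive. The sketch below assumes the comparisons are independent and each is tested at the same alpha; the function name is hypothetical.

```python
def familywise_error_rate(alpha: float, n_comparisons: int) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


for k in (1, 5, 10, 20):
    rate = familywise_error_rate(0.05, k)
    print(f"{k:>2} comparisons at alpha = .05 -> {rate:.2f}")
```

With ten independent comparisons at alpha = .05, the chance of at least one spurious "significant" result is about 40%, far above the 5% the researcher intended.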
Internal Validity
Internal validity concerns the level of control over causation that the
researcher has in an experiment. It is the degree to which results are due to a
causal relationship between variables and that the effect of the IV on the DV is
not due to any variables (extraneous variables) other than the independent
variable. An experiment that has strong internal validity can unambiguously
attribute the effects on the dependent variable to the treatment of the
independent variable.
Threats to internal validity compromise our confidence in saying that a
causal relationship exists between the independent and dependent variables. The
most common threats to internal validity are the following.
• History: An unanticipated event occurs while the experiment is in
progress
• Maturation: Normal developmental processes that affect subjects
differently over time
• Statistical Regression: Subjects at the extremes regress to the mean
during post-tests
• Selection: The groups are not equivalent at the beginning of the study
• Mortality: Subjects drop out of the study
• Testing: The pre-test sensitizes subjects to the post-test
• Instrumentation: Changes in the way the dependent variable is measured
• Compensatory Rivalry (“The John Henry Effect”): Social competition
motivates a group to over-perform and mask treatment effects
External Validity
External validity (or generalizability) is the degree to which the findings
can be applied to other people, settings, and times. External validity can be
approached in two ways: as population validity or as ecological validity.
• Population validity is the degree to which the results of an experiment
can be generalized to individuals who were not in the study.
• Ecological validity is the extent to which the results of an experiment
can be generalized to different settings.
Threats to external validity compromise confidence in stating whether the study’s
results are applicable to other people and settings. These threats include the
following:
• For population validity: not having a representative sample, randomly
selected from a target population.
• For ecological validity:
o The Hawthorne effect (also called reactivity): Outcomes are due to
the reaction of subjects to being studied and are not due to the
treatment.
o Experimenter effect: Outcomes are due to the characteristics of the
person conducting the study.
o Insufficient description of the conditions of the experiments: setting,
treatment, and measurement
Evaluating Validity
Evaluating the various validities of between-group experiments and
quasi-experiments is a complex undertaking. We recommend that research
consumers form a judgment by using a framework that considers three criteria:
1) theory and treatment, 2) sampling, and 3) measurement.
Theory and Treatment:
• Fidelity to a theory that is supported by a well-referenced research review
and is not subject to modification (through “fishing”) enhances confidence
about conclusion validity.
• A solid theory that is supported by a well-referenced research review and
leads to a hypothesis that clearly states a causal relationship between
variables enhances confidence about internal and external validity
• A theory-based treatment that is clearly described and strictly
implemented enhances confidence in internal and external validity.
Sampling:
• Adequate sample size and a good match between the sample and
statistical analysis enhance confidence about conclusion validity.
• A detailed description of people in the sample (how many and subject
characteristics) enhances confidence about internal and external
validity.
• A detailed description of the setting enhances confidence about
internal and ecological external validity.
• Control over sampling threats enhances confidence about internal and
external validity.
• Random assignment of the sample to the treatment condition
enhances confidence about internal and external validity.
• Random selection of the sample from a population enhances
confidence about population external validity.
Measurement:
• Reliable measures (r ≥ 0.7) enhance confidence in conclusion,
internal, and external validity.
• Valid measures enhance confidence in internal and external validity.
• Consistent measures enhance confidence in internal and external
validity.
The table below summarizes these criteria and may serve as a template for
evaluating validity.
Criterion: Theory and Treatment
  Conclusion Validity: Clear statement of hypothesis that predicts how IV will affect DV. No evidence of “fishing”.
  Internal Validity: Well-referenced research review leading to a solid causal theory or framework. Clear statement of hypothesis that predicts how IV will affect DV. Evidence of history threat or testing threat?
  External Validity (Population): Well-referenced research review leading to a solid causal theory or framework. Clear statement of hypothesis that predicts how IV will affect DV.
  External Validity (Ecological): Well-referenced research review leading to a solid causal theory or framework. Clear statement of hypothesis that predicts how IV will affect DV. Detailed description of the treatment and conditions of the study, including history threat. Evidence of Hawthorne Effect, Experimenter Effect, or insufficient description?
Criterion: Sampling
  Conclusion Validity: Adequate sample size determined by power analysis and match of sample to statistics.
  Internal Validity: Adequate sample size determined by power analysis to avoid Type II Error. Detailed description of sample (size and subject characteristics) and setting. Random assignment to treatment condition. Evidence of maturation, selection, regression, or compensatory rivalry threats?
  External Validity (Population): Detailed description of sample (size and subject characteristics). Random selection from a population and random assignment to treatment.
  External Validity (Ecological): Detailed description of setting. Random assignment to treatment. Evidence of Hawthorne Effect, Experimenter Effect, or insufficient description of the setting?
Criterion: Measurement
  Conclusion Validity: Reliability of measures ≥ .7.
  Internal Validity: Use of reliable, valid, and consistent measures for DV. Evidence of instrumentation threat?
  External Validity (Population): Use of reliable, valid, and consistent measures for DV.
  External Validity (Ecological): Use of reliable, valid, and consistent measures for DV.
Rating
  Conclusion Validity: H M L
  Internal Validity: H M L
  External Validity (Population): H M L
  External Validity (Ecological): H M L
Summary of Criteria for Evaluating Studies
Summary
• There are three complementary approaches to statistical analysis in
experimental and quasi experimental research: Null Hypothesis
Significance Testing (NHST), Effect Size, and Power Analysis
• Inferential statistics build on descriptive statistics (means and standard
deviations) to determine the likelihood that results occurred by chance and
whether the results can be generalized.
• Inferential tests that are used in experiments and quasi-experiments are
t-tests, ANOVA/MANOVA, and ANCOVA/MANCOVA.
• Inferential tests yield values that are converted to probabilities of results
having occurred by chance.
• Experiments and quasi-experiments are evaluated for internal validity,
statistical conclusion validity, and external validity (defined as population
and ecological validity).
• There are several threats to internal validity that the researcher seeks to
control.
• Theory/treatment, sampling, and measurement are key elements in
evaluating validity of experiments and quasi-experiments.
Terms and Concepts
NHST statistical significance
Effect size estimates power analysis
p-value/probability alpha
t-test ANOVA/MANOVA
ANCOVA/MANCOVA table of critical values
degree of freedom one and two tailed tests
null hypothesis Type I and Type II Errors
internal validity threats to internal validity
statistical conclusion validity fishing
external validity
population validity ecological validity
Review, Consolidation, and Extension of Knowledge
1. In your own words, explain null hypothesis significance testing.
2. In your own words, explain Effect Size Estimates and Power Analysis.
3. In your own words explain the difference between statistical and
practical significance.
4. Read the data analysis/results section of the experimental study
you chose in the previous chapter and complete the Guide below.
5. Use the Guide as a template for writing a critique of about 1,000 words of
the experimental or quasi-experimental between-group study you selected.
See the Appendix for an exemplar.
__________________________________________________________
Guide to Reading and Critiquing an Experimental
and Quasi- Experimental Group Study
Research Review and Theory:
What is the purpose of the research review?
Does it establish an underlying theory (big ideas) for the research?
Purpose and Design:
What is the purpose of the study?
Is there a hypothesis or a research question? If so, what is it? If not, can
you infer the question from the text of the article?
What is the basic research design and type?
What are the dependent and independent variables? Identify each type of
variable in the study. ( IV=, DV=)
Sampling:
How is the sample selected?
How is the sample assigned to the treatment condition(s): random or non-
random/intact group?
Who is in the sample? What are the characteristics of the sample?
What is the sample size?
Data Collection:
What measures are used for the dependent variable?
Are these standardized measures? Adapted measures? Newly-created
measures?
What is the format of the measure(s)?
Are there indications of validity and reliability of the measures? What are
they ( r-values)?
Data Analysis and Results:
What statistical tests are used to analyze the data?
Were the results (p-values) significant or non-significant?
What does the researcher conclude about the findings?
Evaluation of Validity:
How do you evaluate internal validity? What are threats to internal
validity?
How do you evaluate statistical conclusion validity? What is your
rationale?
How do you evaluate external population validity? Ecological
validity? What is your rationale?
__________________________________________________________