POWER AND EFFECT SIZE
Previous Weeks
A few weeks ago (week 9) I made a small chart outlining all the different statistical tests we’ve covered.
I want to complete that chart using information from the past week.
Most of this is a repeat, but a few new tests have been added.
It is important that you are familiar with these tests, know when they are appropriate to use, and know how to run (most of) them in SPSS.
You are excused from running ANCOVA and RM ANOVA.
When to use specific statistical tests…
# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (continuous) | 1 (continuous) | Association | Pearson Correlation (r)
1 (continuous) | 1 (continuous) | Prediction | Simple Linear Regression (m + b)
Multiple | 1 (continuous) | Prediction | Multiple Linear Regression (m + b)
# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When one group is a ‘known’ population = One-Sample t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are independent = Independent Samples t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are dependent = Paired Samples t-test
1 (grouping, ∞ levels) | 1 (continuous) | Group differences | One-Way ANOVA, with Post-Hoc (F ratio)
# of IV (format) | # of DV (format) | Examining… | Test/Notes
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences and interactions | Factorial ANOVA with Post-Hoc and/or Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders | ANCOVA (Analysis of Co-Variance) with Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders in a related sample (e.g., longitudinal) | Repeated Measures ANOVA with Estimated Marginal Means (F ratio)
Tonight…
A break from learning a new statistical ‘test’.
Focus will be on two critical statistical ‘concepts’:
Statistical Power: related to Alpha/Statistical Significance
Brief overview of Effect Size: statistically significant results vs meaningful results
First, a quick review of error in testing…
Example Hypothesis
Pretend my master’s thesis topic is the influence of exercise on body composition.
I believe people who exercise more will have lower percent body fat (%BF).
To study this:
I draw a sample and group subjects by how much they exercise: High and Low Exercise Groups (this is my IV).
I also assess %BF in each subject as a continuous variable (DV).
I plan to see if the two groups have different mean %BF.
My hypotheses (HO and HA):
HA: There is a difference in %BF between the groups
HO: There is not a difference in %BF between the groups
Example Continued
Now I’m going to run my statistical test, get my test statistic, and calculate a p-value.
I’ve set alpha at the standard 0.05 level.
By the way, what statistical test should I use…?
My final decision on my hypotheses is going to be based on that p-value:
I could reject the null hypothesis (accept HA)
I could accept the null hypothesis (reject HA)
Statistical Errors…
Since there are two potential decisions (and only one of them can be correct), there are two possible errors I can make:
Type I Error: We could reject the null hypothesis although it was really true (should have accepted null)
Type II Error: We could fail to reject the null hypothesis when it was really untrue (should have rejected null)
There are really 4 potential outcomes, based on what is “true” and what we “decide”
What is True | Our Decision: Reject HO | Our Decision: Accept HO
HO | Type I Error | Correct
HA | Correct | Type II Error
HA: There is a difference in %BF between the groups
HO: There is not a difference in %BF between the groups
Statistical Errors…
Remember: my final decision is based on the p-value
What is True | Our Decision: Reject HO | Our Decision: Accept HO
HO | Type I Error | Correct
HA | Correct | Type II Error
If p ≤ 0.05, our decision is reject HO
If p > 0.05, our decision is accept HO
Statistical Errors…
In my analysis, I find:
High Exercise Group mean %BF = 22%
Low Exercise Group mean %BF = 26%
p = 0.08
What is my decision? Accept HO
There is NOT a difference in %BF between the groups
Why is that my decision? The means ARE different?
I can’t be confident that the 4-percentage-point difference between the two groups is not due to random sampling error
Is it possible I’ve made an error in my decision?
Possible Error…?
If I did make an error, what type would it be? Type II Error
When you find a p-value greater than alpha, the only possible error is Type II error
When you find a p-value less than alpha, the only possible error is Type I error
Our p = 0.08, so we accepted HO
The only possible error is Type II
What is True | Our Decision: Reject HO | Our Decision: Accept HO
HO | Type I Error | Correct
HA | Correct | Type II Error
If p ≤ 0.05, our decision is reject HO
If p > 0.05, our decision is accept HO
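The decision rule and the 2x2 table above can be written out in a few lines. This is only a sketch of the logic on the slide; the function name `decide` is made up for illustration.

```python
def decide(p_value, alpha=0.05):
    """Return (decision, the only error type that decision could involve)."""
    if p_value <= alpha:
        # We reject HO, so the only way to be wrong is if HO was really true.
        return "reject HO", "Type I"
    # We accept HO, so the only way to be wrong is if HA was really true.
    return "accept HO", "Type II"


decision, possible_error = decide(0.08)  # the p-value from the thesis example
print(decision, "- only possible error:", possible_error)
```

Note that the function never tells you whether an error was actually made, only which kind is possible given the decision.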
Possible Error…?
Compare Type I and Type II error like this:
The only concern when you find statistical significance (p < 0.05) is Type I Error.
Is the difference between groups REAL or due to random sampling error?
Thankfully, the p-value tells you how likely a difference at least this large would be from random sampling error alone (that is, if HO were true).
In other words, comparing the p-value to alpha is how you keep the risk of Type I error under control.
But does the p-value tell you how likely Type II error is?
The probability of Type II error is better provided by Power.
Possible Error…?
The probability of Type II error is provided by Power.
Statistical Power is often loosely called β, but it is actually 1 - β (β itself is the probability of Type II error).
We will not discuss the specific calculation of power in this class; SPSS can calculate this for you.
Power (1 - β) is related to Alpha, but:
Alpha is the probability of having Type I error when HO is true. Lower number is better (i.e., 0.05 vs 0.01 vs 0.001).
Power is the probability of NOT having Type II error: the probability of being right (correctly rejecting the null hypothesis) when HA is true.
Higher number is better (typical goal is 0.80).
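One way to see what alpha and power mean is to simulate the thesis study many times. The sketch below is an illustration, not how SPSS computes power: the SD of 8 and the 30 subjects per group are invented numbers (the slides do not give them), and it uses a z-test with the SD treated as known in place of the t-test, so that only the Python standard library is needed.

```python
import random
from statistics import NormalDist


def rejection_rate(mean_a, mean_b, sd, n_per_group, alpha=0.05, reps=4000, seed=1):
    """Share of simulated studies that reject HO (two-sided z-test, SD treated as known)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    se = sd * (2 / n_per_group) ** 0.5             # standard error of the mean difference
    rejections = 0
    for _ in range(reps):
        mean_1 = sum(rng.gauss(mean_a, sd) for _ in range(n_per_group)) / n_per_group
        mean_2 = sum(rng.gauss(mean_b, sd) for _ in range(n_per_group)) / n_per_group
        if abs(mean_1 - mean_2) / se >= z_crit:
            rejections += 1
    return rejections / reps


# HO true (no real difference): every rejection is a Type I error.
type_i_rate = rejection_rate(24, 24, sd=8, n_per_group=30)
# HA true (a real 4-point difference): every rejection is a correct decision.
power = rejection_rate(22, 26, sd=8, n_per_group=30)
type_ii_rate = 1 - power

print(f"Type I error rate (should be near alpha): {type_i_rate:.3f}")
print(f"Power: {power:.3f}   Type II error rate: {type_ii_rate:.3f}")
```

Under these assumed numbers the study misses a real difference roughly half the time, which is well short of the usual 0.80 goal.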
Let’s continue this in the context of my ‘thesis’ example
Statistical Errors…
In my analysis, I found:
High Exercise Group mean %BF = 22%
Low Exercise Group mean %BF = 26%
p = 0.08
Decided to accept the null
What do I do when I don’t find statistical significance?
What happens when the result does not reflect expectations?
First, consider the situation
Should it be statistically significant?
The most obvious thing you need to consider is whether you REALLY should have found a statistically significant result.
Just because you wanted your test to be significant doesn’t mean it should be.
This wouldn’t be Type II error; it would just be the correct decision!
In my example, researchers have shown in several studies that exercise does influence %BF.
This result ‘should’ be statistically significant, right?
If the answer is yes, then you need to consider power.
In my ‘thesis’, this is probably an issue with Statistical Power.
This scenario plays out at least once a year between me and a grad student working on a thesis or research project:
How can I increase the chance that I will find statistically significant results?
Why was this analysis not statistically significant?
What can I do to decrease the chance of Type II error?
Several different factors influence power (your ability to detect a true difference).
How can I increase Power?
1) Increase Alpha level
Changing alpha from 0.05 to 0.10 will increase your power (better chance of finding significant results).
Downsides to increasing your alpha level? This will increase the chance of Type I error!
This is rarely acceptable in practice.
Only really an option when working in a new area:
Researchers are unsure of how to measure a new variable
Researchers are unaware of confounders to control for
How can I increase Power?
2) Increase N
Sample size is directly used when calculating p-values.
Including more subjects will increase your chance of finding statistically significant results.
Downsides to increasing sample size? More subjects means more time/money.
More subjects is ALWAYS a better option if possible.
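To get a feel for how many subjects "more" might mean, here is a rough sample-size sketch using the textbook normal-approximation formula for comparing two independent groups. The SD of 8 is an assumption for the thesis example, and a t-test based calculation (as in dedicated power software) would give a slightly larger answer.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(mean_diff, sd, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-sided, two-group comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # cutoff for significance
    z_power = z.inv_cdf(power)           # extra distance needed to reach the power goal
    d = mean_diff / sd                   # standardized mean difference
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Thesis example: a 4-point difference in %BF, with an ASSUMED SD of 8.
print(n_per_group(4, 8))               # goal of 0.80 power
print(n_per_group(4, 8, power=0.90))   # a stricter power goal needs more subjects
print(n_per_group(2, 8))               # a smaller true difference needs many more
```

The required N grows quickly as the expected difference shrinks, which is where the time/money downside comes from.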
How can I increase Power?
3) Use fewer groups/variables (simpler designs)
Related to sample size but different: ‘Use fewer groups’, NOT ‘Use fewer subjects’.
↑ groups negatively affects your degrees of freedom. Remember, df is calculated with # groups and # subjects.
Lots of variables, groups and interactions make it more difficult to find statistically significant differences.
The purpose of controlling the Family-wise error rate is to make it harder to find significant results!
Downsides to fewer groups/variables? Sometimes you NEED to make several comparisons and test for interactions; this is unavoidable.
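The cost of extra groups can be sketched with two small calculations: the one-way ANOVA degrees of freedom, and the family-wise error rate across all pairwise comparisons. The 60 subjects are a hypothetical study size, the family-wise formula assumes the comparisons are independent, and the Bonferroni adjustment is named here only as one common correction, not necessarily the post-hoc test used in class.

```python
from math import comb


def anova_df(n_subjects, n_groups):
    """Degrees of freedom (between groups, within groups) for a one-way ANOVA."""
    return n_groups - 1, n_subjects - n_groups


def familywise_error(alpha, n_comparisons):
    """Chance of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


n_subjects, alpha = 60, 0.05   # hypothetical study size
for n_groups in (2, 3, 5):
    n_pairs = comb(n_groups, 2)                     # pairwise comparisons among groups
    df_between, df_within = anova_df(n_subjects, n_groups)
    print(
        f"{n_groups} groups: df = ({df_between}, {df_within}), "
        f"{n_pairs} comparisons, "
        f"family-wise error = {familywise_error(alpha, n_pairs):.3f}, "
        f"Bonferroni alpha per comparison = {alpha / n_pairs:.4f}"
    )
```

With the same 60 subjects, going from 2 to 5 groups leaves fewer within-group df and forces a much stricter per-comparison alpha, so each individual difference is harder to detect.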
How can I increase Power?
4) Measure variables more accurately
If variables are poorly measured (sloppy work, broken equipment, outdated equipment, etc…) this increases measurement error.
More measurement error decreases confidence in the result.
For example, perhaps I underestimated %BF in my ‘low exercise’ group? This could lead to Type II Error.
More of an internal validity problem than a statistical problem.
Downsides to measuring more accurately? None, if you can afford the best tools.
How can I increase Power?
5) Decrease subject variability
Subjects will have various characteristics that may also be correlated with your variables: SES, sex, race/ethnicity, age, etc…
These variables can confound your results, making it harder to find statistically significant results.
When planning your sample (to enhance power), select subjects that are very similar to each other.
This is a reason why repeated measures tests and paired samples are more likely to have statistically significant results.
Downside to decreasing subject variability? It will decrease your external validity (generalizability).
If you only test women, your results do not apply to men.
How can I increase Power?
6) Increase magnitude of the mean difference
If your groups are not different enough, make them more different!
For example, instead of measuring just high and low exercisers, perhaps I compare marathon runners vs completely sedentary people?
Compare a ‘very’ high exercise group to a ‘very’ low exercise group.
Sampling at the extremes, getting rid of the middle group.
Downsides to using the extremes? Similar to decreasing subject variability, this will decrease your external validity.
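Several of these levers can be compared side by side with an approximate power formula for two independent groups. This is a sketch under assumptions: the baseline SD of 8 and N of 30 per group are invented for the thesis example, the formula is a normal approximation and not the exact t-test power, and the single SD term stands in for both measurement error and subject variability.

```python
from statistics import NormalDist


def power_two_groups(mean_diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two independent group means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # How many standard errors the true difference sits away from zero.
    shift = abs(mean_diff) / (sd * (2 / n_per_group) ** 0.5)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


baseline = {"mean_diff": 4, "sd": 8, "n_per_group": 30, "alpha": 0.05}
scenarios = {
    "baseline": baseline,
    "1) alpha raised to 0.10": {**baseline, "alpha": 0.10},
    "2) N doubled to 60 per group": {**baseline, "n_per_group": 60},
    "4/5) less noise, SD cut to 6": {**baseline, "sd": 6},
    "6) extreme groups, difference of 8": {**baseline, "mean_diff": 8},
}
results = {name: power_two_groups(**settings) for name, settings in scenarios.items()}
for name, value in results.items():
    print(f"{name}: power = {value:.2f}")
```

Every lever raises power above the baseline, but the formula says nothing about the costs listed above (Type I error, money, external validity), which still have to be weighed separately.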
Questions on Power/Increasing Power?
The Catch-22 of Power and P-values
I’ve mentioned this previously, but once you are able to draw a large sample, this will ruin the utility of p/statistical significance.
The larger your sample, the more likely you’ll find statistically significant results.
Sometimes minuscule differences between groups or tiny correlations are ‘significant’.
This becomes relevant once sample size grows to 100 to 150 subjects per group.
Once you approach 1000 subjects, it’s hard not to find p < 0.05.
Example from most highly cited paper in Psych, 2004…
This paper was the first to find a link between playing video games/TV and aggression in children:
Every correlation in this table except 1 has p < 0.05.
Do you remember what a correlation of 0.10 looks like?
r = 0.10
Do you see a relationship between these two variables?
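The same r = 0.10 can be checked against different sample sizes. The sketch below uses the Fisher z approximation for the p-value of a correlation (SPSS uses a t-test, which gives very similar answers at these sample sizes); the sample sizes are picked only for illustration.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_p_value(r, n):
    """Approximate two-sided p-value for a Pearson r, via the Fisher z transformation."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


for n in (50, 100, 400, 1000):
    p = correlation_p_value(0.10, n)
    verdict = "significant" if p <= 0.05 else "not significant"
    print(f"r = 0.10, n = {n}: p = {p:.4f} ({verdict}), r squared = {0.10 ** 2:.2f}")
```

The strength of the relationship never changes (1% of variance explained); only the sample size does, and that alone moves the result across the 0.05 line.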
What now?
This realization has led scientists to begin to avoid p-values (or at least avoid just reporting p-values).
Moving towards reporting with 95% confidence intervals.
Especially in areas of research where large samples are common (epidemiology, psychology, sociology, etc.).
Some people interpret ‘statistically significant’ as being ‘important’.
We’ve mentioned several times this is NOT true.
Statistically significant just means it’s likely not Type I error.
Can have ‘important’ results that aren’t statistically significant.
Effect Size
To get an idea of how ‘important’ a difference or association is, we can use Effect Size.
There are over 40 different types of effect size.
Depends on statistical test used.
SPSS will NOT always calculate effect size.
Effect size is like a ‘descriptive’ statistic that tells you about the magnitude of the association or group difference.
Not impacted by statistical significance.
Effect size can stay the same even if p-value changes.
Present the two together when possible.
The goal is not to teach you how to calculate effect size, but to understand how to interpret it when you see it.
Effect Size
Understanding effect size from correlations and regressions is easy (and you already know it): r², the coefficient of determination.
% Variance accounted for.
Pearson correlations between %BF and 3 variables: r = 0.54, r = -0.92, r = 0.70
Which of the three correlations has the most important association with %BF?
r² = 0.29, r² = 0.85, r² = 0.49
Interpreting Effect Size
Usually, guidelines are given for interpreting the effect size.
Help you to know how important the effect is.
Only a guide, you can use your own brain to compare.
In general, r² is interpreted as:
0.01 or smaller, a Trivial Effect
0.01 to 0.09, a Small Effect
0.09 to 0.25, a Moderate Effect
> 0.25, a Large Effect
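The guideline can be turned into a small lookup, shown here as a sketch (the function name `interpret_r2` is made up, and values that fall exactly on a cutoff are assigned to the lower category, which the slide leaves open). It is applied to the three %BF correlations and to the two regression models with r² of 0.29 and 0.15.

```python
def interpret_r2(r2):
    """Label an r squared value using the guideline from the slide."""
    if r2 <= 0.01:
        return "trivial"
    if r2 <= 0.09:
        return "small"
    if r2 <= 0.25:
        return "moderate"
    return "large"


# Squaring removes the sign, so r = -0.92 is the strongest association.
for r in (0.54, -0.92, 0.70):
    r2 = r ** 2
    print(f"r = {r:+.2f} -> r squared = {r2:.2f} ({interpret_r2(r2)})")

# The two regression models
for r2 in (0.29, 0.15):
    print(f"model r squared = {r2:.2f} ({interpret_r2(r2)})")
```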
Effect Size in Regression
Two regression equations contain 4 predictors of %BF. Each ‘model’ is statistically significant.
Here are their r² values: 0.29 and 0.15
Which has the largest effect size? Does either of the regression models have a large effect size?
The 0.29 model is the most important, and has a ‘large effect size’.
The 0.15 model is of ‘moderate’ importance.
Effect Size for Group Differences
Effect size in t-tests and ANOVAs is a bit more complicated.
In general, effect size is the ratio of the mean difference between two groups to the standard deviation.
Does this remind you of anything we’ve previously seen? Z-score = (Score - Mean)/SD
Effect size, when calculated this way, is basically determining how many standard deviations the two groups are different by.
E.g., effect size of 1 means the two groups are different by 1 standard deviation (this would be a big difference)!
Example
When working with t-tests, calculating effect size by the mean difference/SD is called Cohen’s d:
< 0.1, a Trivial effect
0.1 to 0.3, a Small effect
0.3 to 0.5, a Medium effect
> 0.5, a Large effect
The next slide is the result of a repeated measures t-test from a past lecture; we’ll calculate Cohen’s d.
Paired-Samples t-test Output
Mean difference = 2.9, Std. Deviation = 5.2
Cohen’s d = 0.55, a large effect size
Essentially, the weight loss program reduced body weight by just about half a standard deviation.
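The calculation from the paired-samples output is just one division. A minimal sketch, using the scale from this lecture (the function names are made up, and values exactly on a cutoff are handled in a way the slide does not specify):

```python
def cohens_d(mean_difference, sd):
    """Cohen's d: the mean difference expressed in standard deviation units."""
    return mean_difference / sd


def interpret_d(d):
    """Label Cohen's d using the scale given in this lecture."""
    size = abs(d)
    if size < 0.1:
        return "trivial"
    if size < 0.3:
        return "small"
    if size <= 0.5:
        return "medium"
    return "large"


d = cohens_d(2.9, 5.2)   # values read off the paired-samples t-test output
print(f"Cohen's d = {d:.2f} ({interpret_d(d)})")
```

With the rounded values shown on the output, the division comes out at roughly 0.56, in line with the 0.55 reported above.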
Other example
I sample a group of 100 ISU students and find their average IQ is 103.
Recall, the population mean for IQ is 100, SD = 15.
I run a one-sample t-test and find it to be statistically significant (p < 0.05).
However, effect size is… 0.2, or Small Effect
Interpretation: While this difference is likely not due to random sampling error, it’s not very important either.
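Both halves of this example can be reproduced from the four numbers given. One assumption to flag: because the population SD of 15 is treated as known, the sketch uses a z-test as a stand-in for the one-sample t-test run in SPSS, so the p-value is approximate.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, pop_mean, pop_sd, n = 103, 100, 15, 100

# Effect size: how far apart the means are, in SD units. N plays no role here.
d = (sample_mean - pop_mean) / pop_sd

# Test statistic: the same difference in standard-error units, which shrink as N grows.
z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
p = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Cohen's d = {d:.2f}")
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

The sample size appears only in the test statistic, which is why a 3-point difference can be statistically significant and still be a small effect.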
Other types of effect sizes
SPSS will not calculate Cohen’s d for t-tests.
However, it will calculate effect size for ANOVAs (if you request it).
Not Cohen’s d, but Partial Eta Squared (η²).
Similar to r², interpreted the same way (same scale).
Here is last week’s cancer example: Do Tumor Size and Lymph Node Involvement affect Survival Time?
I’ll re-run and request effect size…
Notice, η² can be used for the entire ‘model’, or each main effect and interaction individually.
How would you describe the effect of Tumor Size, or our interaction?
Trivial to Small Effect. How did we get a significant p-value?
Other factors not in our model are also very important.
Notice that the r² is equal to the η² of the full model.
The advantage of η² is that you can evaluate individual effects.
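Partial eta squared is the effect's sum of squares divided by the effect's sum of squares plus the error sum of squares. The sketch below uses invented sums of squares (the actual SPSS output for the cancer example is not reproduced here) and assumes a balanced design, so that the effect sums of squares add up to the model sum of squares.

```python
def partial_eta_squared(ss_effect, ss_error):
    """Partial eta squared = SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)


# HYPOTHETICAL sums of squares, for illustration only.
ss_effects = {"Tumor Size": 10.0, "Lymph Nodes": 40.0, "Interaction": 5.0}
ss_error = 200.0

ss_model = sum(ss_effects.values())   # holds for a balanced design
ss_total = ss_model + ss_error
r_squared = ss_model / ss_total

for name, ss in ss_effects.items():
    print(f"{name}: partial eta squared = {partial_eta_squared(ss, ss_error):.3f}")
print(f"Full model: partial eta squared = {partial_eta_squared(ss_model, ss_error):.3f}")
print(f"r squared = {r_squared:.3f}")
```

For the full model the two formulas are the same fraction, which is why r² and the model's η² match; the individual effects each get their own, smaller value.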
Effect Size Summary
Many other types of effect sizes are out there; I just wanted to show you the effect sizes most commonly used with the tests we know:
Correlation and Regression: r²
T-tests: Cohen’s d
ANOVA: Partial eta squared (η²) and/or r²
You are responsible for knowing:
The general theory behind effect sizes/why to use them
What tests they are associated with
How to interpret them
QUESTIONS ON POWER? EFFECT SIZE?
Upcoming…
In-class activity
Homework:
Cronk: Read Appendix A (pg. 115-19) on Effect Size
Holcomb Exercises 21 and 22
No out-of-class SPSS work this week
Things are slowing down; next week we’ll discuss non-parametric tests: Chi-Square and Odds Ratio