
POWER AND EFFECT SIZE

Previous Weeks

A few weeks ago (week 9) I made a small chart outlining all the different statistical tests we’ve covered. I want to complete that chart using information from the past week.

Most of this is a repeat, but a few new tests have been added.

It is important that you are familiar with these tests, know when they are appropriate to use, and know how to run (most of) them in SPSS.

You are excused from running ANCOVA and RM ANOVA.

When to use specific statistical tests…

# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (continuous) | 1 (continuous) | Association | Pearson Correlation (r)
1 (continuous) | 1 (continuous) | Prediction | Simple Linear Regression (m + b)
Multiple | 1 (continuous) | Prediction | Multiple Linear Regression (m + b)

# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When one group is a ‘known’ population = One-Sample t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are independent = Independent Samples t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are dependent = Paired Samples t-test
1 (grouping, ∞ levels) | 1 (continuous) | Group differences | One-Way ANOVA, with Post-Hoc (F ratio)

# of IV (format) | # of DV (format) | Examining… | Test/Notes
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences and interactions | Factorial ANOVA with Post-Hoc and/or Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders | ANCOVA (Analysis of Co-Variance) with Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders in a related sample | Repeated Measures ANOVA with Estimated Marginal Means (F ratio) (e.g., longitudinal)

Tonight…

A break from learning a new statistical ‘test’

Focus will be on two critical statistical ‘concepts’:

Statistical Power: related to Alpha/Statistical Significance

Brief overview of Effect Size: statistically significant results vs meaningful results

First, a quick review of error in testing…

Example Hypothesis

Pretend my master’s thesis topic is the influence of exercise on body composition

I believe people who exercise more will have lower %BF

To study this:

I draw a sample and group subjects by how much they exercise: High and Low Exercise Groups (this is my IV)

I also assess %BF in each subject as a continuous variable (DV)

I plan to see if the two groups have different mean %BF

My hypotheses (HO and HA):

HA: There is a difference in %BF between the groups

HO: There is not a difference in %BF between the groups

Example Continued

Now I’m going to run my statistical test, get my test statistic, and calculate a p-value

I’ve set alpha at the standard 0.05 level

By the way, what statistical test should I use…?

My final decision on my hypotheses is going to be based on that p-value:

I could reject the null hypothesis (accept HA)

I could fail to reject the null hypothesis (loosely, ‘accept HO’ and reject HA)

Statistical Errors…

Since there are two potential decisions (and only one of them can be correct), there are two possible errors I can make:

Type I Error: We could reject the null hypothesis although it was really true (should have accepted the null)

Type II Error: We could fail to reject the null hypothesis when it was really untrue (should have rejected the null)

There are really 4 potential outcomes, based on what is “true” and what we “decide”:

What is True | Our Decision: Reject HO | Our Decision: Accept HO
HO | Type I Error | Correct
HA | Correct | Type II Error

HA: There is a difference in %BF between the groups

HO: There is not a difference in %BF between the groups
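To make the top row of that table concrete, here is a minimal simulation sketch. It assumes both groups are drawn from the same population (so HO is true by construction), uses a simple z-test with a known SD of 1 rather than the t-test SPSS would run, and the function name, sample size, and seed are all made up for illustration. Under those assumptions, roughly alpha (about 5%) of the simulated studies should still reject HO, and every one of those rejections is a Type I error.

```python
import random
from statistics import NormalDist

def type_i_error_rate(n_per_group=30, alpha=0.05, n_sims=2000, seed=1):
    """Share of simulated studies that reject HO even though HO is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff, about 1.96
    se = (2 / n_per_group) ** 0.5                 # SE of the mean difference when SD = 1
    rejections = 0
    for _ in range(n_sims):
        # Both groups come from the SAME population, so any difference is sampling error.
        high = [rng.gauss(0, 1) for _ in range(n_per_group)]
        low = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (sum(high) / n_per_group - sum(low) / n_per_group) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims

rate = type_i_error_rate()
print(f"Rejected a true HO in {rate:.1%} of simulated studies")
```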


Remember: my final decision is based on the p-value

If p ≤ 0.05, our decision is reject HO

If p > 0.05, our decision is accept HO


In my analysis, I find:

High Exercise Group mean %BF = 22%

Low Exercise Group mean %BF = 26%

p = 0.08

What is my decision? Accept HO

We conclude there is NOT a statistically significant difference in %BF between the groups

Why is that my decision? The means ARE different!

I can’t be confident that the 4% difference between the two groups is not due to random sampling error

Is it possible I’ve made an error in my decision?

Possible Error…?

If I did make an error, what type would it be? Type II Error

When you find a p-value greater than alpha, the only possible error is Type II error

When you find a p-value less than alpha, the only possible error is Type I error

Our p = 0.08, we accepted HO

The only possible error is Type II
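The decision rule and the ‘only possible error’ logic above can be sketched in a few lines. The function name `decide` is just an illustration, not anything SPSS provides.

```python
def decide(p_value, alpha=0.05):
    """Return (decision, the only error type that decision could involve)."""
    if p_value <= alpha:
        # We rejected HO, so the only way to be wrong is if HO was really true.
        return "reject HO", "Type I"
    # We kept HO, so the only way to be wrong is if HA was really true.
    return "accept HO", "Type II"

decision, possible_error = decide(0.08)  # p-value from the thesis example
print(decision, "| only possible error:", possible_error)
```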


Possible Error…?

Compare Type I and Type II error like this:

The only concern when you find statistical significance (p < 0.05) is Type I Error

Is the difference between groups REAL or due to Random Sampling Error?

Thankfully, the p-value speaks to this directly: it is the probability of seeing a difference at least this large from random sampling error alone, if HO were true

In other words, the p-value is what you compare against alpha, and alpha is the Type I error rate you have agreed to tolerate when HO is true

But, does the p-value tell you how likely Type II error is? No. The probability of Type II error is better provided by Power

Possible Error…?

Probability of Type II error is provided by Power

The probability of Type II error is called β; Statistical Power is 1 – β

We will not discuss the specific calculation of power in this class

SPSS can calculate this for you

Power is related to Alpha, but:

Alpha is the probability of having Type I error (when HO is true)

Lower number is better (i.e., 0.05 vs 0.01 vs 0.001)

Power is the probability of NOT having Type II error (when HA is true)

The probability of being right (correctly rejecting the null hypothesis)

Higher number is better (typical goal is 0.80)
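You are not responsible for calculating power, but a rough sketch may help show how power, β, and alpha fit together. The block below uses a normal approximation for a two-sided comparison of two equal-sized groups; this is a simplification, not what SPSS computes. The standardized difference `d = 0.5` and `n_per_group = 20` are hypothetical values, since the thesis example never gives a standard deviation or sample size.

```python
from statistics import NormalDist

def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power for a two-sided, two-group comparison (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)    # cutoff implied by alpha
    shift = d * (n_per_group / 2) ** 0.5    # expected z statistic if the true effect is d
    # Probability the test statistic lands beyond either cutoff when HA is true.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

power = power_two_groups(d=0.5, n_per_group=20)  # hypothetical inputs
beta = 1 - power                                 # probability of Type II error
print(f"power = {power:.2f}, beta = {beta:.2f}")
```

With no true effect (`d = 0`) the same function returns alpha, which is one way to see that alpha and power are two views of the same rejection region.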

Let’s continue this in the context of my ‘thesis’ example

Statistical Errors…

In my analysis, I found:

High Exercise Group mean %BF = 22%

Low Exercise Group mean %BF = 26%

p = 0.08

Decided to accept the null

What do I do when I don’t find statistical significance?

What happens when the result does not reflect expectations?

First, consider the situation

Should it be statistically significant?

The most obvious thing you need to consider is whether you REALLY should have found a statistically significant result

Just because you wanted your test to be significant doesn’t mean it should be

This wouldn’t be Type II error; it would just be the correct decision!

In my example, researchers have shown in several studies that exercise does influence %BF

This result ‘should’ be statistically significant, right?

If the answer is yes, then you need to consider power

In my ‘thesis’: This result ‘should’ be statistically significant, right? Probably an issue with Statistical Power

This scenario plays out at least once a year between me and a grad student working on a thesis or research project:

How can I increase the chance that I will find statistically significant results?

Why was this analysis not statistically significant?

What can I do to decrease the chance of Type II error?

Several different factors influence power (your ability to detect a true difference)

How can I increase Power?

1) Increase Alpha level

Changing alpha from 0.05 to 0.10 will increase your power (better chance of finding significant results)

Downsides to increasing your alpha level? This will increase the chance of Type I error!

This is rarely acceptable in practice

Only really an option when working in a new area:

Researchers are unsure of how to measure a new variable

Researchers are unaware of confounders to control for

How can I increase Power?

2) Increase N

Sample size is directly used when calculating p-values

Including more subjects will increase your chance of finding statistically significant results

Downsides to increasing sample size? More subjects means more time/money

More subjects is ALWAYS a better option if possible
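The first two options (alpha and N) trade off against each other, and a rough sample-size sketch shows how. The block below uses the common normal-approximation formula for a two-sided, two-group comparison; the function name and the standardized difference `d = 0.5` are assumptions for illustration, and a t-based calculation in real software would give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_needed(d, alpha=0.05, power=0.80):
    """Approximate subjects per group needed to reach the target power."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # two-sided cutoff for alpha
    z_power = norm.inv_cdf(power)          # z value matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

for alpha in (0.05, 0.10):
    n = n_per_group_needed(d=0.5, alpha=alpha)  # d = 0.5 is hypothetical
    print(f"alpha = {alpha:.2f}: about {n} subjects per group for power 0.80")
```

Loosening alpha lowers the required N, and a smaller true difference raises it sharply, which is why ‘just add subjects’ gets expensive fast.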

How can I increase Power?

3) Use fewer groups/variables (simpler designs)

Related to sample size but different: ‘Use fewer groups’, NOT ‘Use fewer subjects’

More groups negatively affect your degrees of freedom

Remember, df is calculated with # groups and # subjects

Lots of variables, groups and interactions make it more difficult to find statistically significant differences

The purpose of controlling the Family-wise error rate is to make it harder to find significant results!

Downsides to fewer groups/variables? Sometimes you NEED to make several comparisons and test for interactions; this is unavoidable
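To see why more groups cost you power, here is a small sketch of how the family-wise error rate grows with the number of pairwise comparisons, and how strict a Bonferroni-style per-comparison alpha becomes. It treats the comparisons as independent, which is only an approximation, and the function name is made up for illustration.

```python
from math import comb

def familywise_error(n_groups, alpha=0.05):
    """Return (# pairwise comparisons, family-wise error rate, Bonferroni alpha)."""
    m = comb(n_groups, 2)             # every pair of groups gets compared
    fwer = 1 - (1 - alpha) ** m       # chance of at least one Type I error
    bonferroni_alpha = alpha / m      # per-comparison alpha that keeps FWER near alpha
    return m, fwer, bonferroni_alpha

for groups in (2, 3, 5):
    m, fwer, per_test = familywise_error(groups)
    print(f"{groups} groups: {m} comparisons, FWER = {fwer:.3f}, per-test alpha = {per_test:.4f}")
```

Each comparison has to clear a stricter cutoff as groups are added, so a real difference is easier to miss.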

How can I increase Power?

4) Measure variables more accurately

If variables are poorly measured (sloppy work, broken equipment, outdated equipment, etc.) this increases measurement error

More measurement error decreases confidence in the result

For example, perhaps I underestimated %BF in my ‘low exercise’ group? This could lead to Type II Error.

More of an internal validity problem than statistical problem

Downsides to measuring more accurately? None, if you can afford the best tools
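A rough sketch of why noisy measurement hurts: random measurement error adds to the spread of the scores, so the same mean difference becomes a smaller difference in standard deviation units. This assumes the error is random, independent of the true scores, and equal in both groups (a systematic bias, like underestimating one group, is a different problem). The 4-point mean difference comes from the thesis example; the true SD of 6 and the error SDs are hypothetical.

```python
def observed_effect_size(mean_diff, true_sd, error_sd):
    """Standardized difference after random measurement error inflates the SD."""
    observed_sd = (true_sd ** 2 + error_sd ** 2) ** 0.5  # variances add, not SDs
    return mean_diff / observed_sd

for error_sd in (0, 3, 6):  # hypothetical amounts of measurement error, in %BF units
    d = observed_effect_size(mean_diff=4, true_sd=6, error_sd=error_sd)
    print(f"error SD = {error_sd}: observed standardized difference = {d:.2f}")
```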

How can I increase Power?

5) Decrease subject variability

Subjects will have various characteristics that may also be correlated with your variables

SES, sex, race/ethnicity, age, etc.

These variables can confound your results, making it harder to find statistically significant results

When planning your sample (to enhance power), select subjects that are very similar to each other

This is a reason why repeated measures tests and paired samples are more likely to have statistically significant results

Downside to decreasing subject variability? Will decrease your external validity (generalizability)

If you only test women, your results do not apply to men
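The point about paired samples can be sketched by comparing the standard error of the mean difference in the two designs. The block assumes both sets of scores have the same SD and that the paired scores correlate at `rho`; the SD of 6, n of 20, and the correlations are all hypothetical.

```python
def se_independent(sd, n_per_group):
    """Standard error of the mean difference for two independent groups."""
    return sd * (2 / n_per_group) ** 0.5

def se_paired(sd, n_pairs, rho):
    """Standard error of the mean difference when the same subjects are measured twice."""
    return sd * (2 * (1 - rho) / n_pairs) ** 0.5

sd, n = 6, 20  # hypothetical values
print(f"independent groups: SE = {se_independent(sd, n):.2f}")
for rho in (0.5, 0.8):
    print(f"paired, correlation {rho}: SE = {se_paired(sd, n, rho):.2f}")
```

The more strongly the two measurements correlate, the smaller the standard error, so the same mean difference produces a larger test statistic.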

How can I increase Power?

6) Increase magnitude of the mean difference

If your groups are not different enough, make them more different!

For example, instead of measuring just high and low exercisers, perhaps I compare marathon runners vs completely sedentary people?

Compare a ‘very’ high exercise group to a ‘very’ low exercise group

Sampling at the extremes, getting rid of the middle group

Downsides to using the extremes? Similar to decreasing subject variability, this will decrease your external validity

Questions on Power/Increasing Power?

The Catch-22 of Power and P-values

I’ve mentioned this previously, but once you are able to draw a large sample, this will ruin the utility of p/statistical significance

The larger your sample, the more likely you’ll find statistically significant results

Sometimes minuscule differences between groups or tiny correlations are ‘significant’

This becomes relevant once sample size grows to 100 to 150 subjects per group

Once you approach 1000 subjects, it’s hard not to find p < 0.05

Example from most highly cited paper in Psych, 2004…

This paper was the first to find a link between playing video games/TV and aggression in children:

Every correlation in this table except 1 has p < 0.05

Do you remember what a correlation of 0.10 looks like?

r = 0.10

Do you see a relationship between these two variables?
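To see how the same tiny correlation drifts into ‘significance’ as the sample grows, here is a sketch using the Fisher z approximation for the p-value of a Pearson correlation. This is an approximation rather than the exact t-based test SPSS reports, and the sample sizes are arbitrary illustrations, not values from the paper.

```python
from math import atanh
from statistics import NormalDist

def correlation_p_value(r, n):
    """Approximate two-sided p-value for a Pearson r (Fisher z approximation)."""
    z = atanh(r) * (n - 3) ** 0.5          # Fisher-transformed r divided by its standard error
    return 2 * (1 - NormalDist().cdf(abs(z)))

for n in (50, 100, 400, 1000):
    p = correlation_p_value(0.10, n)
    print(f"r = 0.10, n = {n}: p = {p:.4f}, r squared = {0.10 ** 2:.2f}")
```

The correlation, and the 1% of variance it accounts for, never changes; only the p-value does.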

What now?

This realization has led scientists to begin to avoid p-values (or at least avoid just reporting p-values)

Moving towards reporting with 95% confidence intervals

Especially in areas of research where large samples are common (epidemiology, psychology, sociology, etc.)

Some people interpret ‘statistically significant’ as being ‘important’

We’ve mentioned several times this is NOT true

Statistically significant just means the result is unlikely to be due to random sampling error alone

Can have ‘important’ results that aren’t statistically significant

Effect Size

To get an idea of how ‘important’ a difference or association is, we can use Effect Size

There are over 40 different types of effect size

Depends on statistical test used

SPSS will NOT always calculate effect size

Effect size is like a ‘descriptive’ statistic that tells you about the magnitude of the association or group difference

Not impacted by statistical significance

Effect size can stay the same even if p-value changes

Present the two together when possible

The goal is not to teach you how to calculate effect size, but to understand how to interpret it when you see it

Effect Size

Understanding effect size from correlations and regressions is easy (and you already know it):

r², coefficient of determination

% Variance accounted for

Pearson correlations between %BF and 3 variables: r = 0.54, r = -0.92, r = 0.70

Which of the three correlations has the most important association with %BF?

r² = 0.29, r² = 0.85, r² = 0.49

The strongest is r = -0.92: the sign only gives the direction, and it accounts for 85% of the variance

Interpreting Effect Size

Usually, guidelines are given for interpreting the effect size

Help you to know how important the effect is

Only a guide, you can use your own brain to compare

In general, r² is interpreted as:

0.01 or smaller, a Trivial Effect

0.01 to 0.09, a Small Effect

0.09 to 0.25, a Moderate Effect

> 0.25, a Large Effect

Effect Size in Regression

Two regression equations contain 4 predictors of %BF. Each ‘model’ is statistically significant.

Here are their r² values: 0.29 and 0.15

Which has the largest effect size? Does either of the regression models have a large effect size?

The 0.29 model is the most important, and has a ‘large effect size’. The 0.15 model is of ‘moderate’ importance.
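The r squared guidelines can be written out as a small lookup. The function name is made up, and how a value sitting exactly on a cutoff (0.09 or 0.25) is labelled is my assumption, since the guideline ranges overlap at their edges.

```python
def interpret_r2(r2):
    """Label an r squared value using the class guidelines (edge handling is assumed)."""
    if r2 <= 0.01:
        return "trivial"
    if r2 <= 0.09:
        return "small"
    if r2 <= 0.25:
        return "moderate"
    return "large"

# The three %BF correlations: squaring removes the sign, so -0.92 is the strongest.
for r in (0.54, -0.92, 0.70):
    print(f"r = {r:+.2f}: r squared = {r ** 2:.2f} ({interpret_r2(r ** 2)})")

# The two regression models.
for r2 in (0.29, 0.15):
    print(f"model r squared = {r2:.2f} ({interpret_r2(r2)})")
```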

Effect Size for Group Differences

Effect size in t-tests and ANOVAs is a bit more complicated

In general, effect size is a ratio of the mean difference between two groups and the standard deviation

Does this remind you of anything we’ve previously seen? Z-score = (Score – Mean)/SD

Effect size, when calculated this way, is basically determining how many standard deviations the two groups are different by

E.g., an effect size of 1 means the two groups are different by 1 standard deviation (this would be a big difference)!

Example

When working with t-tests, calculating effect size by the mean difference/SD is called Cohen’s d:

< 0.1, Trivial effect

0.1-0.3, Small effect

0.3-0.5, Medium effect

> 0.5, Large effect

(These are the cutoffs used in this class; Cohen’s own widely cited benchmarks are 0.2 small, 0.5 medium, 0.8 large)

The next slide is the result of a repeated measures t-test from a past lecture; we’ll calculate Cohen’s d

Paired-Samples t-test Output

Mean difference = 2.9, Std. Deviation = 5.2

Cohen’s d = 2.9/5.2 ≈ 0.56, a large effect size

Essentially, the weight loss program reduced body weight by a little more than half a standard deviation
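The calculation from the paired-samples output can be sketched as follows. The function names are made up, the cutoffs are the ones given for this class, and how a value exactly on a cutoff is labelled is my assumption.

```python
def cohens_d(mean_difference, sd):
    """Mean difference expressed in standard deviation units."""
    return mean_difference / sd

def interpret_d(d):
    """Label Cohen's d using the class cutoffs (edge handling is assumed)."""
    size = abs(d)  # direction does not matter for magnitude
    if size < 0.1:
        return "trivial"
    if size < 0.3:
        return "small"
    if size <= 0.5:
        return "medium"
    return "large"

d = cohens_d(mean_difference=2.9, sd=5.2)  # values from the paired-samples output
print(f"Cohen's d = {d:.2f} ({interpret_d(d)})")
```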

Other example

I sample a group of 100 ISU students and find their average IQ is 103.

Recall, the population mean for IQ is 100, SD = 15.

I run a one-sample t-test and find it to be statistically significant (p < 0.05)

However, effect size is… (103 – 100)/15 = 0.2, or Small Effect

Interpretation: While this difference is likely not due to random sampling error, it’s not very important either
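This example also shows that effect size can stay put while the p-value moves with sample size. The sketch below uses a z-test with the known population SD of 15 as a stand-in for the one-sample t-test, because the slide does not give the sample SD; the n = 25 row is a hypothetical comparison.

```python
from statistics import NormalDist

def iq_example(sample_mean, n, pop_mean=100, pop_sd=15):
    """Return (effect size d, z statistic, two-sided p) using the known population SD."""
    d = (sample_mean - pop_mean) / pop_sd              # does not depend on n
    z = (sample_mean - pop_mean) / (pop_sd / n ** 0.5)  # grows with n
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return d, z, p

for n in (25, 100):  # n = 100 is the slide's sample; n = 25 is hypothetical
    d, z, p = iq_example(sample_mean=103, n=n)
    print(f"n = {n}: d = {d:.2f}, z = {z:.2f}, p = {p:.3f}")
```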

Other types of effect sizes

SPSS will not calculate Cohen’s d for t-tests

However, it will calculate effect size for ANOVAs (if you request it)

Not Cohen’s d, but Partial Eta Squared (η²)

Similar to r², interpreted the same way (same scale)

Here is last week’s cancer example

Do Tumor Size and Lymph Node Involvement affect Survival Time?

I’ll re-run and request effect size…

Notice, η² can be used for the entire ‘model’, or each main effect and interaction individually

How would you describe the effect of Tumor Size, or our interaction? Trivial to Small Effect

How did we get a significant p-value? Other factors not in our model are also very important

Notice that the r² is equal to the η² of the full model

The advantage of η² is that you can evaluate individual effects
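For reference, partial eta squared is the effect’s sum of squares divided by that same sum of squares plus the error sum of squares. The sketch below uses made-up sums of squares purely for illustration; they are NOT taken from the cancer example output.

```python
def partial_eta_squared(ss_effect, ss_error):
    """Share of (effect + error) variability attributable to the effect."""
    return ss_effect / (ss_effect + ss_error)

# Hypothetical sums of squares, not the values from the SPSS output.
hypothetical_effects = {"main effect A": 40.0, "main effect B": 150.0, "A x B interaction": 10.0}
ss_error = 960.0

for name, ss_effect in hypothetical_effects.items():
    eta2 = partial_eta_squared(ss_effect, ss_error)
    print(f"{name}: partial eta squared = {eta2:.3f}")
```

Because each effect is compared only against the error term, the individual values can be read on the same scale as r squared.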

Effect Size Summary

Many other types of effect sizes are out there. I just wanted to show you the effect sizes most commonly used with the tests we know:

Correlation and Regression: r²

T-tests: Cohen’s d

ANOVA: Partial eta squared (η²) and/or r²

You are responsible for knowing:

The general theory behind effect size/why to use them

What tests they are associated with

How to interpret them

QUESTIONS ON POWER? EFFECT SIZE?

Upcoming…

In-class activity

Homework:

Cronk: Read Appendix A (pg. 115-19) on Effect Size

Holcomb Exercises 21 and 22

No out-of-class SPSS work this week

Things are slowing down: next week we’ll discuss non-parametric tests

Chi-Square and Odds Ratio
