-
Statistics Toolbox in R
Professional Development Opportunity for the Flow Cytometry Core Facility
October 12, 2018
LKG ConsultingEmail: [email protected]
Website: www.consultinglkg.com
A Review of Analysis Techniques
for Scientific Research
-
The goal of this workshop is to give you the knowledge & tools to be confident in your ability to
collect & analyze your data as well as correctly interpret your results…
…Think of me as your new resource!
-
Laura Gray-Steinhauer
www.ualberta.ca/~lkgray
BSc in Mathematics, Statistics and Environmental Studies (UVIC, 2005)
MSc in Forest Biology and Management (UofA, 2008)
PhD in Forest Biology and Management (UofA, 2011)
Designated Professional Statistician with The Statistical Society of Canada (2014)
Research: Climate Change, Policy Evaluation, Adaptation, Mitigation, Risk management for forest resources, Conservation…
A little about me…
http://www.ualberta.ca/~lkgray
-
Workshop Schedule
8:15 – 8:30 Arrive at the Lab & Start up the computers
8:30 – 8:45 Welcome to the Workshop (housekeeping & today's goals)
8:45 – 9:15 Statistics Toolbox – Refresh useful vocabulary, introduce a decision tree to plan your analysis path
9:15 – 9:45 Hypothesis Testing – Refresher on p-values, Type I and Type II error, and statistical power
9:45 – 10:00 Break
10:00 – 11:00 Parametric versus Non-Parametric Tests – Testing for parametric assumptions, ANOVA, Permutational ANOVA
11:00 – 11:30 Multivariate Statistics – Introduction to principal component analysis (PCA)
11:30 – 1:00 Work period (questions are welcome)
After 1:00 Enjoy your weekend!
This may be A LOT of information to absorb OR we may not cover the specific topic you came to learn in class today.
Feel free to reach out to me via email with more questions: [email protected].
-
Workbook
• Yours to keep!
• R code is identified by Century Gothic font (everything else is Arial)
• Arbitrary object names are bold to indicate these could change depending on what you name your variables.
• Referenced data is provided at www.ualberta.ca/~lkgray
• Please contact me to obtain permission to redistribute content outside of the workshop attendees.
Topics Included:
• Descriptive statistics
• Confidence intervals
• Data distributions
• Parametric assumptions
• T-tests
• ANOVA
• ANCOVA
• Non-parametric tests
Topics Included:
• Permutational ANOVA & t-tests
• Z-test for proportions
• Chi-squared test
• Outlier tests and treatments
• Correlation
• Linear regression
• Multiple linear regression
• Akaike Information Criterion
Topics Included:
• Non-linear regression
• Logistic regression
• Binomial ANOVA
• Principal component analysis (PCA)
• Discriminant analysis
• Multivariate analysis of variance (MANOVA)
http://www.ualberta.ca/~lkgray
-
R Project Website
https://cran.r-project.org/index.html
-
https://www.rstudio.com/
RStudio (IDE: Integrated Development Environment) – Preferred among programmers; we will use it in this workshop
-
Statistics Toolbox
“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.”
Aaron Levenstein (Author)
-
Statistical Vocabulary
Statistical Term – Real World – Research World

• Population – Class of things (e.g. cancer patients) – What you want to learn about (e.g. cancer patients in Alberta)
• Sample – Group representing a class (e.g. 1000 cancer patients in Alberta) – What you actually study (e.g. 1000 cancer patients from 10 treatment centres in Alberta)
• Experimental Unit – Individual thing (e.g. each of the 1000 cancer patients) – Individual research subject (e.g. cancer patients n = 1000; hospital populations n = 10; depends on research question)
• Dependent Variable – Property of things (e.g. white blood cell count) – What you measure about subjects (e.g. white blood cell count)
• Independent Variable – Environment of things (e.g. treatment options, climate, etc.) – What you think might influence the dependent variable (e.g. amount of treatment, combination of treatments, etc.)
• Data – Values of variables – What you record / the information you collect
-
Other important statistical terms (also see Appendix 1 in your workbook)
• Experiment – any controlled process of study which results in data collection, and for which the outcome is unknown
• Descriptive statistics – numerical/graphical summary of data
• Inferential statistics – predict or control the values of variables (make conclusions with)
• Statistical inference – makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken
• Parameter – an unknown value (which needs to be estimated) used to represent a population characteristic (e.g. population mean)
• Statistic – an estimate of a parameter (e.g. mean of a sample)
• Sampling distribution (a.k.a. probability distribution or probability density function) – the probability associated with each possible value of a variable
• Error – difference between an observed (or calculated) value and its true (or expected) value
-
Statistics Toolbox
What is the goal of my analysis?
What kind of data do I have to answer my research question?
How many variables do I want to include in my analysis?
Does my data meet the analysis assumptions?
-
Analysis Goal – Parametric (assumptions met) – Non-Parametric (alternative if assumptions fail) – Binomial (binary data / event likelihood)

Describe data characteristics:
• Parametric: mean, standard deviation, standard error, etc.
• Non-parametric: median, quartiles, percentiles
• Binomial: proportions
• Probability distributions are always appropriate to describe data.
• Graphics are always appropriate to describe data.

Compare 2 distinct/independent groups:
• Parametric: t-test, paired t-test
• Non-parametric: Wilcoxon rank-sum test, Kolmogorov–Smirnov test, permutational t-test
• Binomial: z-test for proportions

Compare > 2 distinct/independent groups:
• Parametric: ANOVA, multi-way ANOVA, ANCOVA, blocking
• Non-parametric: Kruskal–Wallis test, Friedman rank test, permutational ANOVA
• Binomial: chi-squared test, binomial ANOVA

Estimate the degree of association between 2 variables:
• Parametric: Pearson's correlation
• Non-parametric: Spearman rank correlation, Kendall's rank correlation
• Binomial: logistic regression

Predict outcome based on relationship:
• Parametric: linear regression, multiple linear regression, non-linear regression
• Binomial: logistic regression, odds ratio
Statistics Toolbox
-
If you have a continuous response variable…
… and one predictor variable

Predictor is categorical:
• Two treatment levels – Parametric: t-test; Non-parametric: permutational t-test, Kolmogorov–Smirnov (KS) test, Wilcoxon test
• More than two treatment levels – Parametric: one-way ANOVA; Non-parametric: Kruskal–Wallis test, Friedman rank test

Predictor is continuous:
• Parametric: Pearson's correlation; Non-parametric: Spearman's rank correlation, Kendall's rank correlation
• Regression: linear/non-linear regression

What you get:
• P-value indicating if 2 groups are significantly different
• P-value indicating there is a significant effect of "treatment"; need pairwise comparisons to find where the difference between groups occurs
• Correlation coefficient indicating direction and magnitude of relationship
• "Goodness of fit" indicating how well the predictor is linked to the response (R2 or AIC)
-
… and two or more predictor variables

Predictor is categorical (two or more treatment levels for each predictor):
• Parametric: Multi-Way ANOVA (with blocking and/or ANCOVA)
• Non-parametric: Permutational ANOVA (with blocking)

Predictor is continuous:
• Multiple Regression

What you get (ANOVA):
• P-value indicating if there is a significant effect of each treatment.
• Size of a significant effect (no interactions).
• Need to consider the possibility of interactions.
• Need pairwise comparisons with adjusted p-values to determine the differences among treatments with interactions.
• Also get the effect of the blocking term and/or undesired covariate.
• Do not need to consider the interaction between treatments and blocks and/or covariates.

What you get (regression):
• Fit of how well predictors are linked to the response variable (Adjusted R2, AIC)
• P-values to indicate which predictors significantly affect the response variable.
-
If you have a categorical response variable…
… and one predictor variable

Predictor is categorical:
• Two treatment levels: Z-test for proportions
• Two or more treatment levels: chi-squared test, binomial ANOVA

Predictor is continuous:
• Logistic regression

… and two or more predictor variables:
• Logistic regression

What you get:
• P-value indicating if 2 groups are significantly different.
• P-value indicating there is a significant effect of each treatment; size of a significant effect (no interactions).
• Need to consider the possibility of interactions; need pairwise comparisons with adjusted p-values to determine the differences among treatments with interactions.
• P-value indicating there is a significant effect of "treatment"; need pairwise comparisons to find where the difference between groups occurs.
• Fit of how well predictors are linked to the response variable (Adjusted R2, AIC); p-values to indicate which predictors significantly affect the response variable.
-
Example research questions:
Does the yield of different lentil varieties differ between the 2 farms?
Do the varieties differ among themselves?
Does the density of the plants impact their average height?
The Lentil datasets (You are now a farmer)
[Figure: Farm 1 and Farm 2, each divided into plots; each plot is planted with one lentil variety (A, B, or C), and individual lentil plants are measured within each plot]
-
Datasets Available in R
• Over 100 datasets available for you to use
• We will use:
• iris: The famous (Fisher's or Anderson's) iris data set gives the sepal and
petal measurements for 50 flowers from each of Iris setosa, versicolor, and
virginica.
• USArrests: Data on US Arrests for violent crimes by US State.
-
Hypothesis Testing
"Statistics are no substitute for judgment." – Henry Clay (Former US Senator)
-
Formal hypothesis testing

[Figure: samples A and B drawn from a population; the y-axis shows mean height. Is the difference between the sample means due to random chance?]

H0: x̄A = x̄B
H1: x̄A ≠ x̄B

If the actual p < α, reject the null hypothesis (H0) and accept the alternative hypothesis (H1)
-
"Is this difference due to random chance?"

[Figure: distributions of mean height for population samples A and B]

P-value – the probability that the observed value, or one more extreme, is due to random chance

Theory: We can never really prove if the 2 samples are truly different or the same – we can only
ask if what we observe (or a greater difference) is due to random chance

How to interpret p-values:
P-value = 0.05 – "Yes, 1 out of 20 times."
P-value = 0.01 – "Yes, 1 out of 100 times."

The lower the probability that a difference is due to random chance, the more likely it is the result of an
effect (what we test for)

In other words: "Is random chance a plausible explanation?"
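These ideas can be tried directly in R. A minimal sketch with two simulated samples (the group sizes, means, and standard deviations below are invented for illustration):

```r
# Two hypothetical samples (simulated; not workshop data)
set.seed(42)
groupA <- rnorm(30, mean = 50, sd = 5)
groupB <- rnorm(30, mean = 54, sd = 5)

# Two-sample t-test: H0 is that the two group means are equal
result <- t.test(groupA, groupB)
result$p.value  # if this is small, random chance is an implausible explanation
```

Because the samples were simulated with different true means, the p-value will usually be small here; with identical true means it would be roughly uniform between 0 and 1.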
-
Type I Error – rejecting the null hypothesis (H0) when it is actually true
Type II Error – failing to reject the null hypothesis (H0) when it is not true

Remember: rejection or acceptance of a p-value (and therefore the chance you will
make an error) depends on the arbitrary α-level you choose
• Increasing the α-level increases the probability of making a Type I Error, but decreases
the probability of making a Type II Error

The α-level you choose is completely up to you (typically it is set at 0.05); however, it
should be chosen with consideration of the consequences of making a Type I or a Type
II Error.
Based on your study, would you rather err on the side of false positives or false
negatives?
                                 | Null hypothesis is true                            | Alternative hypothesis is true
Fail to reject the null hypothesis | ☺ Correct Decision                               | Incorrect Decision – False Negative (Type II Error)
Reject the null hypothesis         | Incorrect Decision – False Positive (Type I Error) | ☺ Correct Decision
-
Example: Will current forests adequately protect genetic resources under climate change?
Birch Mountain Wildlands
HO: Range of the current climate for the BMW protected area = Range of the BMW protected area under climate change
Ha: Range of the current climate for the BMW protected area ≠ Range of the BMW protected area under climate change
If we reject HO: the climate ranges are different, therefore genetic resources are not adequately protected and new
protected areas need to be created
Consequences if I make:
• Type I Error: Climates are actually the same and genetic resources are
indeed adequately protected in the BMW protected area – we created
new parks when we didn’t need to
• Type II Error: Climates are different and genetic resources are
vulnerable – we didn’t create new protected areas and we should have
From an ecological standpoint it is better to make a Type I Error, but from
an economic standpoint it is better to make a Type II Error
Which standpoint should I take?
-
Power is your ability to reject the null hypothesis when it is false (i.e.
your ability to detect an effect when there is one).
There are many ways to increase power:
1. Increase your sample size (sample more of the population)
2. Increase your alpha value (e.g. from 0.01 to 0.05) – watch for Type I
Error!
3. Use a one-tailed test (you know the direction of the expected effect)
4. Use a paired test (control and treatment are same sample)
Given you are testing whether or not what you observed (or greater)
is due to random chance, more data gives you a better
understanding of what is truly happening within the population;
therefore, increasing sample size will decrease the probability of making a
Type II Error
Statistical Power
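These trade-offs can be explored with base R's power.t.test(); the effect size (delta) and standard deviation below are invented for illustration:

```r
# Power of a two-sample t-test to detect a 2-unit difference (sd = 4), n = 20 per group
p1 <- power.t.test(n = 20, delta = 2, sd = 4, sig.level = 0.05)$power

# Doubling the sample size increases power
p2 <- power.t.test(n = 40, delta = 2, sd = 4, sig.level = 0.05)$power

# Lowering alpha from 0.05 to 0.01 decreases power (fewer Type I errors, more Type II)
p3 <- power.t.test(n = 20, delta = 2, sd = 4, sig.level = 0.01)$power

c(p1, p2, p3)
```

You can also solve the other way: power.t.test(power = 0.80, delta = 2, sd = 4) returns the per-group sample size needed to reach 80% power.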
-
BREAK 9:45 – 10:00
Go grab a coffee. Next we will cover specific tools in your new tool box.
-
Statistics Toolbox – Parametric versus Non-Parametric Tests
“He uses statistics as a drunken man uses lamp posts, for support rather than illumination.”
Andrew Lang (Scottish poet)
-
Univariate Test Options

Parametric
• Characteristics: analysis to test group means; based on raw data; more statistical power than non-parametric tests
• Assumptions: independent samples; Normality (of data OR errors); homogeneity of variances
• When to use: parametric assumptions are met; OR data are non-normal but the sample size is larger (CLT), provided equal variances are met
• Examples: t-test; ANOVA (one-way, two-way, paired)

Non-Parametric
• Characteristics: analysis to test group medians; based on ranked data; less statistical power
• Assumptions: independent samples
• When to use: parametric assumptions are not met; medians better represent your data (skewed data distribution); small sample size; ordinal data, ranked data, or outliers that you can't remove
• Examples: Wilcoxon rank-sum test; Kruskal–Wallis test; permutational tests (non-traditional)
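In R the two families are called the same way; a sketch with skewed, simulated samples (the exponential rates below are invented for illustration):

```r
# Skewed (exponential) samples, where medians may represent the data better
set.seed(1)
x <- rexp(15, rate = 1)
y <- rexp(15, rate = 0.5)

t.test(x, y)$p.value       # parametric: compares means using the raw data
wilcox.test(x, y)$p.value  # non-parametric: Wilcoxon rank-sum on ranked data
```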
-
Assumption #1: Independence of samples
"Your samples have to come from a randomized or randomly sampled design."
• Meaning rows in your data do NOT influence one another
• Address this with experimental design (3 main things to consider):
1. Avoid pseudoreplication and potential confounding factors by designing
your experiment in a randomized design
2. Avoid systematic arrangements, which are distinct patterns in how
treatments are laid out
• If your treatments affect one another, the individual treatment effects
could be masked or overinflated
3. Maintain temporal independence
• If you need to take multiple samples from one individual over time, record
and test your data considering the change in time (e.g. paired tests)
NOTE: An ANOVA needs to have at least 1 degree of freedom – this means you need at
least 2 reps per treatment to execute an ANOVA
Rule of Thumb: You need more rows than columns
-
The Normal Distribution

s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)
SD = √s²

Based on this curve:
• 68.27% of observations are within 1 stdev of x̄
• 95.45% of observations are within 2 stdev of x̄
• 99.73% of observations are within 3 stdev of x̄
For confidence intervals:
• 95% of observations are within 1.96 stdev of x̄

The base of parametric statistics
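The percentages on this slide can be verified with R's normal-distribution functions:

```r
# Proportion of a Normal distribution within 1, 2, and 3 standard deviations of the mean
pnorm(1) - pnorm(-1)  # ~0.6827
pnorm(2) - pnorm(-2)  # ~0.9545
pnorm(3) - pnorm(-3)  # ~0.9973

# The 1.96 used for 95% confidence intervals is the 97.5th percentile
qnorm(0.975)          # ~1.96
```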
-
Assumption #2: Data/Experimental errors are normally distributed

[Figure: Farm 1 – lentil heights for varieties A, B, and C; the residuals are the deviations of observations from their group means]

"If I were to repeat my sampling and calculate the means, those
means would be normally distributed."

Determine if the assumption is met by:
1. Looking at the residuals of your sample
2. Shapiro–Wilk test for normality – if your data are mainly unique values
3. D'Agostino–Pearson normality test – if you have lots of repeated values
4. Lilliefors normality test – mean and variance are unknown
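Option 2 is built into base R (the Lilliefors and D'Agostino–Pearson tests live in add-on packages, e.g. nortest provides lillie.test); a sketch using the built-in iris data as a stand-in for the lentil example:

```r
# Fit a one-way ANOVA on the built-in iris data, then test its residuals
model <- aov(Sepal.Length ~ Species, data = iris)
shapiro.test(residuals(model))  # H0: residuals are normally distributed
```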
-
Assumption #2: Data/Experimental errors are normally distributed – You may not need to worry about Normality?

[Figure: the t-distribution (a sampling distribution) compared with the normal distribution]

Central Limit Theorem: "Sample means tend to cluster around the central population value."

Therefore:
• When sample size is large, you can assume that x̄ is close to the value of μ
• With a small sample size you have a better chance of getting a mean that is far
off the true population mean

What does this mean?
• For large N, the assumption of Normality can be relaxed
• You may have decreased power to detect a difference among groups, BUT
your test is not really compromised if your residuals are not normal
• The assumption of Normality is important when:
1. N is very small
2. Data are highly non-normal
3. Significant outliers are present
4. The effect size is small
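The Central Limit Theorem is easy to see by simulation; the skewed exponential population below (true mean = 1) is invented for illustration:

```r
# Distributions of sample means from a skewed (exponential) population
set.seed(99)
means_n5   <- replicate(2000, mean(rexp(5)))    # small samples
means_n100 <- replicate(2000, mean(rexp(100)))  # large samples

# Means from large samples cluster much more tightly around the true mean of 1,
# and their distribution is close to bell-shaped despite the skewed population
sd(means_n5)
sd(means_n100)
```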
-
Assumption #3: Equal variances between groups/treatments

[Figure: two normal curves on a 0–24 scale, both with mean x̄ = 12, one with sA = 4 and one with sB = 6]

Let's say 5% of the A data fall above a given threshold, but >5% of the B data fall above the same threshold.
So with larger variances, you can expect a greater number of observations at the extremes of the
distributions.
This can have real implications on inferences we make from comparisons between groups.
-
Assumption #3: Equal variances between treatments

[Figure: Farm 1 – lentil heights for varieties A, B, and C; the residuals are the deviations of observations from their group means]

"Does the known probability of observations between my two samples hold true?"

Determine if the assumption is met by:
1. Looking at the residuals of your sample
2. Bartlett Test
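Option 2 in R, again with the built-in iris data standing in for the lentil example:

```r
# H0: the variance of Sepal.Length is the same across the three iris species
bartlett.test(Sepal.Length ~ Species, data = iris)
```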
-
Assumption #3: Equal variances between treatments – Testing for Equal Variances with Residual Plots

[Figure: four panels of predicted values (y-axis) versus observed values in original units (x-axis), each centred on a residual of 0]

• NORMAL distribution: equal number of points along observed; EQUAL variances: equal spread on either side of the mean predicted value = 0 → Good to go!
• NON-NORMAL distribution: unequal number of points along observed; EQUAL variances: equal spread on either side of the mean predicted value = 0 → Optional to fix
• NORMAL/NON-NORMAL: look at a histogram or run a test; UNEQUAL variances: cone shape – opening away from or towards zero → This needs to be fixed for ANOVA (transformations)
• OUTLIERS: points that deviate from the majority of data points → This needs to be fixed for ANOVA (transformations or removal)
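A sketch of how to draw such a residual plot in R (iris again stands in for the workshop data):

```r
model <- aov(Sepal.Length ~ Species, data = iris)

# Residuals versus fitted values: look for equal spread around zero (no cone shape)
plot(fitted(model), residuals(model),
     xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Histogram of residuals: look for a roughly bell-shaped distribution
hist(residuals(model))
```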
-
• Treatment – predictor variable (e.g. variety, fertilization, irrigation, etc.)
• Treatment level – groups within treatments (e.g. A, B, C or Control, 1xN, 2xN)
• Covariate – undesired, uncontrolled predictor variable; confounding
• F-value – variation between / variation within = mean square treatment / mean square error
• P-value – probability that the observed difference (or larger) in the treatment means is due to random chance
Analysis of Variance (ANOVA) – Vocabulary
-
Analysis of Variance (ANOVA)

[Figure: Farm 1 – lentil heights for varieties A, B, and C, with their group means and the overall mean; the residuals are the deviations of observations from their group means]

F-value = variation between / variation within = mean square treatment / mean square error = signal / noise

SIGNAL: variance between = [Σᵢⁿ (x̄ᵢ − x̄ALL)² / (n − 1)] × r
NOISE: variance within = Σᵢⁿ varianceᵢ / n
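In R the whole calculation is one call; a sketch with the built-in iris data standing in for the lentil yields:

```r
# One-way ANOVA: does mean Sepal.Length differ among the three iris species?
model <- aov(Sepal.Length ~ Species, data = iris)
summary(model)  # table of df, sums of squares, mean squares, F-value, and p-value
```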
-
Analysis of Variance (ANOVA)

Think of Pac-Man!
• All of the dots on the board represent the Total
Variation in your study
• Every treatment you use in your analysis is a
different Pac-Man player on the board
• The amount of dots each player eats represents the
variation between (e.g. the amount of variation each treatment can explain)
• The amount of dots left on the board after all
players have died represents the variation within
• If players have a big effect they will eat more dots, reducing the dots left on the board
(lowering the variation within) and increasing the F-value
• A large F-value indicates a significant difference

F = signal / noise = variance between / variance within
-
F Distribution (family of distributions)

F = signal / noise = variance between / variance within

variance between = [Σᵢⁿ (x̄ᵢ − x̄ALL)² / (n − 1)] × n obs. per treatment
variance within = Σᵢⁿ varianceᵢ / n

[Figure: the F distribution, with probability on the y-axis; values where signal < noise lie in the body of the curve and values where signal > noise lie in the upper tail. The tail area beyond the α = 0.05 critical value gives the p-value.]

P-value (percentiles, probabilities) in R: pf(F, df1, df2) and qf(p, df1, df2)
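The pf() and qf() calls work like this; the F-value and degrees of freedom below are invented for illustration:

```r
# p-value: probability of an F this large or larger under H0 (df1 = 2, df2 = 27)
Fval <- 4.5
1 - pf(Fval, 2, 27)

# Critical value: the F beyond which we reject at alpha = 0.05
qf(0.95, 2, 27)
```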
-
How to report results from an ANOVA

Source of Variation | df | Sum of Squares | Mean Squares | F-value   | P-value
Variety (A)         | 2  | 20             | 10           | 0.0263    | 0.9741
Farm (B)            | 1  | 435243         | 435243       | 1125.7085 |
-
Interaction plots – Different story under different conditions

[Figure: four panels plotting mean yield of varieties A and B on Farm1 versus Farm2, plus the overall farm averages]

1. VARIETY is significant (*); FARM is significant (*); FARM2 has better yield than FARM1; no interaction.
2. VARIETY is not significant; FARM is significant (*); VARIETY A is better on FARM2 and VARIETY B is better on FARM1; significant interaction.
3. VARIETY is significant (*); FARM is significant (*) – small difference; main effects are significant, BUT hard to interpret with overall means; significant interaction.
4. VARIETY is not significant; FARM is not significant; cannot distinguish a difference between VARIETY or FARM; no interaction.
-
Interaction plots – Different story under different conditions
• An interaction detects non-parallel lines
• Interaction plots are difficult to interpret for more than a 2-way ANOVA
• If the interaction effect is NOT significant then you can
just interpret the main effects
• BUT if you find a significant interaction you don't want
to interpret main effects, because the combination of
treatment levels results in different outcomes
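R's interaction.plot() draws exactly these pictures; a sketch on the built-in ToothGrowth data (supplement type and dose standing in for VARIETY and FARM):

```r
# Non-parallel lines suggest an interaction between the two factors
interaction.plot(factor(ToothGrowth$dose), ToothGrowth$supp, ToothGrowth$len,
                 xlab = "Dose", ylab = "Mean tooth length", trace.label = "Supp")

# The supp:factor(dose) row of the ANOVA table tests the interaction formally
summary(aov(len ~ supp * factor(dose), data = ToothGrowth))
```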
-
Pairwise comparisons – What to do when you have an interaction
a.k.a. pairwise t-tests

Number of comparisons: C = t(t − 1)/2, where t = number of treatment levels

Lentil example: 3 VARIETIES (A, B, and C) give the comparisons A–B, A–C, B–C:
C = t(t − 1)/2 = 3(2)/2 = 3

Probability of making a Type I Error in at least one comparison = 1 − probability of making no Type I
Error at all

Experiment-wise Type I Error for α = 0.05: probability of Type I Error = 1 − 0.95^C

Lentil example: probability of Type I Error = 1 − 0.95³ = 1 − 0.857 ≈ 0.14

Significantly increased probability of making an error!
Therefore pairwise comparisons lead to a compromised experiment-wise α-level.
You can correct for multiple comparisons by calculating adjusted p-values
(Bonferroni, Holm, etc.)
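The experiment-wise error calculation, and R's built-in p-value adjustment; the three raw p-values below are invented for illustration:

```r
# Experiment-wise Type I error for C = 3 comparisons at alpha = 0.05
C <- 3
1 - 0.95^C  # ~0.14

# Bonferroni adjustment of three hypothetical raw p-values (multiplies each by 3)
p.adjust(c(0.010, 0.040, 0.200), method = "bonferroni")

# Holm adjustment is less conservative
p.adjust(c(0.010, 0.040, 0.200), method = "holm")
```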
-
Pairwise comparisons – Tukey Honest Significant Differences (Test)
a.k.a. pairwise t-tests with adjusted p-values

• If we have a significant interaction effect – use these values
• If we have NO significant interaction effect – we can just look at the main effects

Only need to consider relevant pairwise comparisons – think about it logically
-
How to report a significant difference in a graph

Create a matrix of significance and use it to code your graph:

    W   X   Y   Z
W   -   NS  *   NS
X       -   *   NS
Y           -   NS
Z               -

Letter codes for the graph: W = A, X = A, Y = B, Z = A,B
Same letter = not significant; different letter = significant
-
Permutational Non-parametric tests
• PNPT make NO assumptions, therefore any data can be used
• PNPT work with absolute differences, a.k.a. distances
• Smaller values indicate similarity
• This makes the calculations equivalent to sums of squares

D = signal / noise = distance between groups / distance within groups

Calculating D (delta) & its distribution:
• For our test we compare D to an expected distribution of D the
same way we do when we calculate an F-value
• Use permutations (iterations) to generate the distribution of D from
our raw data
• Therefore the shape of the D distribution depends on your data
-
Permutational Non-parametric tests – Determining the distribution of D

• After you permute this process 5000 times (your choice) a distribution of D will emerge
• The shape depends on your data – it may be normal or not (doesn't matter)

[Figure: histogram of permuted D values, with the observed D = 10 marked]

4921 D calculations < 10 from permutations
79 D calculations ≥ 10 from permutations
P-value:
79/5000 = 0.0158
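The whole permutation procedure fits in a few lines of R; the two groups below are simulated for illustration:

```r
# Simulated two-group data (invented; stands in for the lentil yields)
set.seed(123)
y   <- c(rnorm(15, mean = 10), rnorm(15, mean = 13))
grp <- rep(c("A", "B"), each = 15)

# Observed D: absolute difference between the two group means
D_obs <- abs(diff(tapply(y, grp, mean)))

# Shuffle the group labels 5000 times and recompute D each time
D_perm <- replicate(5000, abs(diff(tapply(y, sample(grp), mean))))

# p-value: share of permuted D values at least as large as the observed D
mean(D_perm >= D_obs)
```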
-
Permutational Non-parametric tests in R

Permutational ANOVA in R:
library(lmPerm)
summary(aovp(YIELD ~ FARM * VARIETY, seqs = T))

Pairwise Permutational ANOVA in R:
out1 = aovp(YIELD ~ FARM * VARIETY, seqs = T)
TukeyHSD(out1)

• The option seqs = T calculates sequential sums of squares (similar to a regular ANOVA) – a good choice for balanced designs
• You can change the maximum number of iterations with the maxIter = option
-
Permutational Non-parametric tests
• For parametric tests we know what the Normal, t-, and F-distributions look like
• Therefore we can use the standard calculations (e.g. the t-value) to calculate statistics
• When we violate the known distribution we need some other curve to work with
• It is hard to estimate a theoretical distribution that fits your data
• The best solution is to permute your data to generate a distribution
• Permutational non-parametric statistics are just as powerful as parametric tests
• This technique is similar to bootstrapping, but the bootstrap resamples the data
rather than shuffling all observation classes
-
If permutational techniques are so good
why not always use them?
• You say permutational non-parametric tests are as
powerful as parametric statistics – YES they are!
• But they are still fairly new to statistical practice
• Still unknown or not understood among many users
• Best practice is to stick with parametric statistics when
you can; but when you can't, permutational tests are
great options!
-
Extension of the Statistics Toolbox – Multivariate Tests (Rotation-Based)
“Definition of Statistics: The science of producing unreliable facts from reliable figures.”
Evan Esar (Humorist & Writer)
-
• Population – Class of things (What you want to learn about)
• Sample – group representing a class (What you actually study)
• Experimental unit – individual research subject (e.g. location, entity, etc.)
• Response variable – property of thing that you believe is the result of predictors (What you actually measure – e.g. lentil height, patient response)
a.k.a dependent variable
• Predictor variable(s) – environment of things which you believe is influencing a response variable (e.g. climate, topography, drug combination, etc.)
a.k.a independent variable
• Error – difference between an observed (or calculated) value and its true (or expected) value
A Reminder from Univariate Statistics….
-
Rotation-based Methods

[Table: each row is an experimental unit, with a Data.ID, a Type column (e.g. regions, ecosystems, forest types, treatments, etc.), and columns Variable1, Variable2, Variable3, Variable4, … (e.g. frequency of species, climate variables, soil characteristics, nutrient concentrations, drug levels, etc.)]

In multivariate statistics:
• Variables can be either numeric or categorical (depends on the
technique)
• Focus is often placed on graphical representation of results
-
Rotation-based Methods

[Figure: scatterplot of Variable 2 against Variable 1; find an equation to rotate the data so that one axis explains multiple variables (e.g. a combined Variable 1,2 axis plotted against Variable 3), then repeat the rotation process (e.g. Variables 1, 2, 4, 9, 10 against Variables 3, 6, 8) to achieve the analysis objective]

Final results based on multiple variables give different inferences than 2 variables.
-
Objective of Rotation-based Methods

1. Rotate so that the new axis explains the greatest amount of variation within the data
→ Principal Component Analysis (PCA), Factor Analysis
2. Rotate so that the variation between groups is maximized
→ Discriminant Analysis (DISCRIM), Multivariate Analysis of Variance (MANOVA)
3. Rotate so that one dataset explains the most variation in another dataset
→ Canonical Correspondence Analysis (CCA)
-
Z1 = a11X1 + a12X2 + … + a1nXn

where Z1 is the first principal component (a column vector), X1, …, Xn are the column
vectors of the original variables, and the a values are the coefficients of the linear model.

• PCA Objective: find linear combinations of the original variables X1, X2, …, Xn to
produce components Z1, Z2, …, Zn that are uncorrelated, in order of their
importance, and that describe the variation in the original data.
• Principal components are the linear combinations of the original variables.
• Principal component 1 is NOT a replacement for variable 1 – all variables are
used to calculate each principal component.

For each component, the constraint a11² + a12² + … + a1n² = 1 ensures Var(Z1) is as large as possible.

The Math Behind PCA
-
• Z2 is calculated using the same formula and constraint on the a2n values;
however, there is an additional condition that Z1 and Z2 have zero
correlation for the data
• The correlation condition continues for all successive principal
components, i.e. Z3 is uncorrelated with both Z1 and Z2
• The number of principal components calculated will match the number
of predictor variables included in the analysis
• The amount of variation explained decreases with each successive
principal component
• Generally you base your inferences on the first two or three
components because they explain the most variation in your data
• Typically, when you include a lot of predictor variables, the last
couple of principal components explain very little (< 1%) of the
variation in your data – not useful variables
The Math Behind PCA
-
PCA in R: princomp(dataMatrix, cor = T/F)   (stats package)

• dataMatrix – the data matrix of predictor variables; you will assign the results to an
object once the PCs have been calculated
• cor – defines whether the PCs should be calculated using the correlation or the covariance
matrix (derived within the function from the data)
• You tend to use the covariance matrix when the variable scales are similar
and the correlation matrix when variables are on different scales
• The correlation matrix (cor = T) standardizes the data before the PCs are
calculated, removing the effect of the different units (note the function's
default is the covariance matrix, cor = F)

PCA in R
-
PCA in R
-
Loadings – these are the correlations between the original predictor variables
and the principal components (the eigenvectors)

Identifies which of the original variables are driving each principal component

Example:
Comp.1 – is negatively related to Murder, Assault, and Rape
Comp.2 – is negatively related to UrbanPop

PCA in R
-
Scores – these are the calculated principal components Z1, Z2, …, Zn
These are the values we plot to make inferences
PCA in R
-
Variance – the summary of the output displays the variance explained by each
principal component (the eigenvalues divided by the number of PCs)

Identifies how much weight you should put on your principal components

Example:
Comp.1 – 62%
Comp.2 – 25%
Comp.3 – 9%
Comp.4 – 4%

PCA in R
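The loadings, scores, and variance summaries above all come from one princomp() call; a sketch on the built-in USArrests data used in this workshop:

```r
# Variables are on different scales, so use the correlation matrix (cor = TRUE)
pca <- princomp(USArrests, cor = TRUE)

summary(pca)      # proportion of variance explained by each component
pca$loadings      # eigenvectors: which original variables drive each component
head(pca$scores)  # the rotated values Z1...Zn that get plotted
biplot(pca)       # points by Comp.1/Comp.2 scores, with variable arrows
```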
-
PCA in R – Biplot

• Data points are plotted by their Comp.1 and Comp.2 scores (row names displayed)
• The direction of the arrows (+/−) indicates the trend of the points (towards the
arrow indicates more of the variable)
• If vector arrows are perpendicular then the variables are not correlated

If your original variables do not have some level of correlation then PCA will NOT work
for your analysis – i.e. you won't learn anything!
-
WORK PERIOD 11:30 – 1:00
Follow the Workbook Examples for the Analyses You Are Interested In. Any questions?
-
Statistics Toolbox

What is the goal of my analysis?
… to characterize my data.
… to find if there is a significant difference between my groups.
… to see what predictor conditions are associated with my groups.

What kind of data do I have to answer my research question?
… continuous or discrete?
… binary data?

How many variables do I want to include in my analysis?
… multiple response variables?
… single treatment or multiple treatments?

Does my data meet the analysis assumptions?
… normally distributed?
… equal variances?
-
Thank You for Attending the Stats Workshop
If you have any further questions please feel free to contact me
Flow Cytometry Core Facility
LKG ConsultingEmail: [email protected]
Website: www.consultinglkg.com