-
Statistics Toolbox in R
Professional Development Opportunity for the Flow Cytometry Core Facility
October 12, 2018
LKG ConsultingEmail: [email protected]
Website: www.consultinglkg.com
A Review of Analysis Techniques
for Scientific Research
-
The goal of this workshop is to give you the knowledge & tools to be confident in your ability to
collect & analyze your data as well as correctly interpret your results…
…Think of me as your new resource!
-
Laura Gray-Steinhauer
www.ualberta.ca/~lkgray
BSc in Mathematics, Statistics and Environmental Studies (UVIC, 2005)
MSc in Forest Biology and Management (UofA, 2008)
PhD in Forest Biology and Management (UofA, 2011)
Designated Professional Statistician with The Statistical Society of Canada (2014)
Research: Climate Change, Policy Evaluation, Adaptation, Mitigation, Risk management for forest resources, Conservation…
A little about me…
http://www.ualberta.ca/~lkgray
-
Workshop Schedule
8:15 – 8:30 Arrive at the Lab & Start up the computers
8:30 – 8:45 Welcome to the Workshop (housekeeping & today's goals)
8:45 – 9:15 Statistics Toolbox – Refresh useful vocabulary, introduce a decision tree to plan your analysis path
9:15 – 9:45 Hypothesis Testing – Refresher on p-values, Type I and Type II error, and statistical power
9:45 – 10:00 Break
10:00 – 11:00 Parametric versus Non-Parametric Tests – Testing for parametric assumptions, ANOVA, Permutational ANOVA
11:00 – 11:30 Multivariate Statistics – Introduction to principal component analysis (PCA)
11:30 – 1:00 Work period (questions are welcome)
After 1:00 Enjoy your weekend!
This may be A LOT of information to absorb OR we may not cover the specific topic you came to learn in class today.
Feel free to reach out to me via email with more questions: [email protected].
-
Workbook
• Yours to keep!
• R code is identified by Century Gothic font (everything else is Arial)
• Arbitrary object names are bold to indicate these could change depending on what you name your variables.
• Referenced data is provided at www.ualberta.ca/~lkgray
• Please contact me to obtain permission to redistribute content outside of the workshop attendees.
Topics Included:
• Descriptive statistics
• Confidence intervals
• Data distributions
• Parametric assumptions
• T-tests
• ANOVA
• ANCOVA
• Non-parametric tests
Topics Included:
• Permutational ANOVA & t-tests
• Z-test for proportions
• Chi-squared test
• Outlier tests and treatments
• Correlation
• Linear regression
• Multiple linear regression
• Akaike Information Criterion
Topics Included:
• Non-linear regression
• Logistic regression
• Binomial ANOVA
• Principal component analysis (PCA)
• Discriminant analysis
• Multivariate analysis of variance (MANOVA)
http://www.ualberta.ca/~lkgray
-
R Project Website
https://cran.r-project.org/index.html
-
https://www.rstudio.com/
RStudio (IDE: Integrated Development Environment) – Preferred among programmers; we will use it in this workshop
-
Statistics Toolbox
“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.”
Aaron Levenstein (Author)
-
Statistical Vocabulary
Statistical Term – Real World – Research World

• Population – Class of things (e.g. cancer patients) – What you want to learn about (e.g. cancer patients in Alberta)
• Sample – Group representing a class (e.g. 1000 cancer patients in Alberta) – What you actually study (e.g. 1000 cancer patients from 10 treatment centres in Alberta)
• Experimental Unit – Individual thing (e.g. each of the 1000 cancer patients) – Individual research subject (e.g. cancer patients n = 1000; hospital populations n = 10; depends on research question)
• Dependent Variable – Property of things (e.g. white blood cell count) – What you measure about subjects (e.g. white blood cell count)
• Independent Variable – Environment of things (e.g. treatment options, climate, etc.) – What you think might influence the dependent variable (e.g. amount of treatment, combination of treatments, etc.)
• Data – Values of variables – What you record / the information you collect
-
Other important statistical terms (also see Appendix 1 in your workbook)
• Experiment – any controlled process of study which results in data collection, and for which the outcome is unknown
• Descriptive statistics – numerical/graphical summary of data
• Inferential statistics – predict or control the values of variables (make conclusions with)
• Statistical inference – makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken
• Parameter – an unknown value (which needs to be estimated) used to represent a population characteristic (e.g. population mean)
• Statistic – an estimate of a parameter (e.g. mean of a sample)
• Sampling distribution (a.k.a. probability distribution or probability density function) – the probability associated with each possible value of a variable
• Error – difference between an observed (or calculated) value and its true (or expected) value
-
Statistics Toolbox
What is the goal of my analysis?
What kind of data do I have to answer my research question?
How many variables do I want to include in my analysis?
Does my data meet the analysis assumptions?
-
Analysis Goal – Parametric (assumptions met) – Non-Parametric (alternative if assumptions fail) – Binomial (binary data / event likelihood)

Describe data characteristics:
• Parametric: mean, standard deviation, standard error, etc.
• Non-parametric: median, quartiles, percentiles
• Binomial: proportions
• Probability distributions are always appropriate to describe data.
• Graphics are always appropriate to describe data.

Compare 2 distinct/independent groups:
• Parametric: t-test, paired t-test
• Non-parametric: Wilcoxon rank-sum test, Kolmogorov–Smirnov test, permutational t-test
• Binomial: z-test for proportions

Compare > 2 distinct/independent groups:
• Parametric: ANOVA, multi-way ANOVA, ANCOVA, blocking
• Non-parametric: Kruskal–Wallis test, Friedman rank test, permutational ANOVA
• Binomial: chi-squared test, binomial ANOVA

Estimate the degree of association between 2 variables:
• Parametric: Pearson's correlation
• Non-parametric: Spearman rank correlation, Kendall's rank correlation
• Binomial: logistic regression

Predict outcome based on relationship:
• Parametric: linear regression, multiple linear regression, non-linear regression
• Binomial: logistic regression, odds ratio
Statistics Toolbox
-
If you have a continuous response variable…
… and one predictor variable

Predictor is categorical:
• Two treatment levels – Parametric: t-test; Non-parametric: permutational t-test, Kolmogorov–Smirnov (KS) test, Wilcoxon test
• More than two treatment levels – Parametric: one-way ANOVA; Non-parametric: Kruskal–Wallis test, Friedman rank test

Predictor is continuous:
• Parametric: Pearson's correlation; Non-parametric: Spearman's rank correlation, Kendall's rank correlation
• Regression: linear/non-linear regression

What you get:
• P-value indicating if 2 groups are significantly different
• P-value indicating there is a significant effect of "treatment"; need pairwise comparisons to find where the difference between groups occurs
• Correlation coefficient indicating direction and magnitude of relationship
• "Goodness of fit" indicating how well the predictor is linked to the response (R2 or AIC)
-
… and two or more predictor variables

Predictor is categorical (two or more treatment levels for each predictor):
• Parametric: Multi-Way ANOVA (with blocking and/or ANCOVA)
• Non-parametric: Permutational ANOVA (with blocking)

Predictor is continuous:
• Multiple Regression

What you get (ANOVA):
• P-value indicating if there is a significant effect of each treatment.
• Size of a significant effect (no interactions).
• Need to consider the possibility of interactions.
• Need pairwise comparisons with adjusted p-values to determine the differences among treatments with interactions.
• Also get the effect of the blocking term and/or undesired covariate.
• Do not need to consider the interaction between treatments and blocks and/or covariates.

What you get (regression):
• Fit of how well predictors are linked to the response variable (Adjusted R2, AIC)
• P-values to indicate which predictors significantly affect the response variable.
-
If you have a categorical response variable…
… and one predictor variable

Predictor is categorical:
• Two treatment levels: Z-test for proportions
• Two or more treatment levels: chi-squared test, binomial ANOVA

Predictor is continuous:
• Logistic regression

… and two or more predictor variables:
• Logistic regression

What you get:
• P-value indicating if 2 groups are significantly different.
• P-value indicating there is a significant effect of each treatment; size of a significant effect (no interactions).
• Need to consider the possibility of interactions; need pairwise comparisons with adjusted p-values to determine the differences among treatments with interactions.
• P-value indicating there is a significant effect of "treatment"; need pairwise comparisons to find where the difference between groups occurs.
• Fit of how well predictors are linked to the response variable (Adjusted R2, AIC); p-values to indicate which predictors significantly affect the response variable.
-
Example research questions:
Does the yield of different lentil varieties differ between the 2 farms?
Do the varieties differ among themselves?
Does the density of the plants impact their average height?
The Lentil datasets (You are now a farmer)
[Figure: Farm 1 and Farm 2, each divided into plots; each plot is planted with one lentil variety (A, B, or C), and individual lentil plants are measured within each plot]
-
Datasets Available in R
• Over 100 datasets available for you to use
• We will use:
• iris: The famous (Fisher's or Anderson's) iris data set gives the sepal and
petal measurements for 50 flowers from each of Iris setosa, versicolor, and
virginica.
• USArrests: Data on US Arrests for violent crimes by US State.
-
Hypothesis Testing
"Statistics are no substitute for judgment." – Henry Clay (Former US Senator)
-
Formal hypothesis testing

[Figure: samples A and B drawn from a population; the y-axis shows mean height. Is the difference between the sample means due to random chance?]

H0: x̄A = x̄B
H1: x̄A ≠ x̄B

If the actual p < α, reject the null hypothesis (H0) and accept the alternative hypothesis (H1)
-
"Is this difference due to random chance?"

[Figure: distributions of mean height for population samples A and B]

P-value – the probability that the observed value, or one more extreme, is due to random chance

Theory: We can never really prove if the 2 samples are truly different or the same – we can only
ask if what we observe (or a greater difference) is due to random chance

How to interpret p-values:
P-value = 0.05 – "Yes, 1 out of 20 times."
P-value = 0.01 – "Yes, 1 out of 100 times."

The lower the probability that a difference is due to random chance, the more likely it is the result of an
effect (what we test for)

In other words: "Is random chance a plausible explanation?"
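These ideas can be tried directly in R. A minimal sketch with two simulated samples (the group sizes, means, and standard deviations below are invented for illustration):

```r
# Two hypothetical samples (simulated; not workshop data)
set.seed(42)
groupA <- rnorm(30, mean = 50, sd = 5)
groupB <- rnorm(30, mean = 54, sd = 5)

# Two-sample t-test: H0 is that the two group means are equal
result <- t.test(groupA, groupB)
result$p.value  # if this is small, random chance is an implausible explanation
```

Because the samples were simulated with different true means, the p-value will usually be small here; with identical true means it would be roughly uniform between 0 and 1.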
-
Type I Error – rejecting the null hypothesis (H0) when it is actually true
Type II Error – failing to reject the null hypothesis (H0) when it is not true

Remember: rejection or acceptance of a p-value (and therefore the chance you will
make an error) depends on the arbitrary α-level you choose
• Increasing the α-level increases the probability of making a Type I Error, but decreases
the probability of making a Type II Error

The α-level you choose is completely up to you (typically it is set at 0.05); however, it
should be chosen with consideration of the consequences of making a Type I or a Type
II Error.
Based on your study, would you rather err on the side of false positives or false
negatives?
                                 | Null hypothesis is true                            | Alternative hypothesis is true
Fail to reject the null hypothesis | ☺ Correct Decision                               | Incorrect Decision – False Negative (Type II Error)
Reject the null hypothesis         | Incorrect Decision – False Positive (Type I Error) | ☺ Correct Decision
-
Example: Will current forests adequately protect genetic resources under climate change?
Birch Mountain Wildlands
HO: Range of the current climate for the BMW protected area = Range of the BMW protected area under climate change
Ha: Range of the current climate for the BMW protected area ≠ Range of the BMW protected area under climate change
If we reject HO: the climate ranges are different, therefore genetic resources are not adequately protected and new
protected areas need to be created
Consequences if I make:
• Type I Error: Climates are actually the same and genetic resources are
indeed adequately protected in the BMW protected area – we created
new parks when we didn’t need to
• Type II Error: Climates are different and genetic resources are
vulnerable – we didn’t create new protected areas and we should have
From an ecological standpoint it is better to make a Type I Error, but from
an economic standpoint it is better to make a Type II Error
Which standpoint should I take?
-
Power is your ability to reject the null hypothesis when it is false (i.e.
your ability to detect an effect when there is one).
There are many ways to increase power:
1. Increase your sample size (sample more of the population)
2. Increase your alpha value (e.g. from 0.01 to 0.05) – watch for Type I
Error!
3. Use a one-tailed test (you know the direction of the expected effect)
4. Use a paired test (control and treatment are same sample)
Given you are testing whether or not what you observed (or greater)
is due to random chance, more data gives you a better
understanding of what is truly happening within the population;
therefore, increasing sample size will decrease the probability of making a
Type II Error
Statistical Power
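These trade-offs can be explored with base R's power.t.test(); the effect size (delta) and standard deviation below are invented for illustration:

```r
# Power of a two-sample t-test to detect a 2-unit difference (sd = 4), n = 20 per group
p1 <- power.t.test(n = 20, delta = 2, sd = 4, sig.level = 0.05)$power

# Doubling the sample size increases power
p2 <- power.t.test(n = 40, delta = 2, sd = 4, sig.level = 0.05)$power

# Lowering alpha from 0.05 to 0.01 decreases power (fewer Type I errors, more Type II)
p3 <- power.t.test(n = 20, delta = 2, sd = 4, sig.level = 0.01)$power

c(p1, p2, p3)
```

You can also solve the other way: power.t.test(power = 0.80, delta = 2, sd = 4) returns the per-group sample size needed to reach 80% power.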
-
BREAK 9:45 – 10:00
Go grab a coffee. Next we will cover specific tools in your new tool box.
-
Statistics Toolbox – Parametric versus Non-Parametric Tests
“He uses statistics as a drunken man uses lamp posts, for support rather than illumination.”
Andrew Lang (Scottish poet)
-
Univariate Test Options

Parametric
• Characteristics: analysis to test group means; based on raw data; more statistical power than non-parametric tests
• Assumptions: independent samples; Normality (of data OR errors); homogeneity of variances
• When to use: parametric assumptions are met; OR data are non-normal but the sample size is larger (CLT), provided equal variances are met
• Examples: t-test; ANOVA (one-way, two-way, paired)

Non-Parametric
• Characteristics: analysis to test group medians; based on ranked data; less statistical power
• Assumptions: independent samples
• When to use: parametric assumptions are not met; medians better represent your data (skewed data distribution); small sample size; ordinal data, ranked data, or outliers that you can't remove
• Examples: Wilcoxon rank-sum test; Kruskal–Wallis test; permutational tests (non-traditional)
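In R the two families are called the same way; a sketch with skewed, simulated samples (the exponential rates below are invented for illustration):

```r
# Skewed (exponential) samples, where medians may represent the data better
set.seed(1)
x <- rexp(15, rate = 1)
y <- rexp(15, rate = 0.5)

t.test(x, y)$p.value       # parametric: compares means using the raw data
wilcox.test(x, y)$p.value  # non-parametric: Wilcoxon rank-sum on ranked data
```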
-
Assumption #1: Independence of samples
"Your samples have to come from a randomized or randomly sampled design."
• Meaning rows in your data do NOT influence one another
• Address this with experimental design (3 main things to consider):
1. Avoid pseudoreplication and potential confounding factors by designing
your experiment in a randomized design
2. Avoid systematic arrangements, which are distinct patterns in how
treatments are laid out
• If your treatments affect one another, the individual treatment effects
could be masked or overinflated
3. Maintain temporal independence
• If you need to take multiple samples from one individual over time, record
and test your data considering the change in time (e.g. paired tests)
NOTE: An ANOVA needs to have at least 1 degree of freedom – this means you need at
least 2 reps per treatment to execute an ANOVA
Rule of Thumb: You need more rows than columns
-
The Normal Distribution

s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)
SD = √s²

Based on this curve:
• 68.27% of observations are within 1 stdev of x̄
• 95.45% of observations are within 2 stdev of x̄
• 99.73% of observations are within 3 stdev of x̄
For confidence intervals:
• 95% of observations are within 1.96 stdev of x̄

The base of parametric statistics
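The percentages on this slide can be verified with R's normal-distribution functions:

```r
# Proportion of a Normal distribution within 1, 2, and 3 standard deviations of the mean
pnorm(1) - pnorm(-1)  # ~0.6827
pnorm(2) - pnorm(-2)  # ~0.9545
pnorm(3) - pnorm(-3)  # ~0.9973

# The 1.96 used for 95% confidence intervals is the 97.5th percentile
qnorm(0.975)          # ~1.96
```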
-
Assumption #2: Data/Experimental errors are normally distributed

[Figure: Farm 1 – lentil heights for varieties A, B, and C; the residuals are the deviations of observations from their group means]

"If I were to repeat my sampling and calculate the means, those
means would be normally distributed."

Determine if the assumption is met by:
1. Looking at the residuals of your sample
2. Shapiro–Wilk test for normality – if your data are mainly unique values
3. D'Agostino–Pearson normality test – if you have lots of repeated values
4. Lilliefors normality test – mean and variance are unknown
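Option 2 is built into base R (the Lilliefors and D'Agostino–Pearson tests live in add-on packages, e.g. nortest provides lillie.test); a sketch using the built-in iris data as a stand-in for the lentil example:

```r
# Fit a one-way ANOVA on the built-in iris data, then test its residuals
model <- aov(Sepal.Length ~ Species, data = iris)
shapiro.test(residuals(model))  # H0: residuals are normally distributed
```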
-
Assumption #2: Data/Experimental errors are normally distributed – You may not need to worry about Normality?

[Figure: the t-distribution (a sampling distribution) compared with the normal distribution]

Central Limit Theorem: "Sample means tend to cluster around the central population value."

Therefore:
• When sample size is large, you can assume that x̄ is close to the value of μ
• With a small sample size you have a better chance of getting a mean that is far
off the true population mean

What does this mean?
• For large N, the assumption of Normality can be relaxed
• You may have decreased power to detect a difference among groups, BUT
your test is not really compromised if your residuals are not normal
• The assumption of Normality is important when:
1. N is very small
2. Data are highly non-normal
3. Significant outliers are present
4. The effect size is small
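The Central Limit Theorem is easy to see by simulation; the skewed exponential population below (true mean = 1) is invented for illustration:

```r
# Distributions of sample means from a skewed (exponential) population
set.seed(99)
means_n5   <- replicate(2000, mean(rexp(5)))    # small samples
means_n100 <- replicate(2000, mean(rexp(100)))  # large samples

# Means from large samples cluster much more tightly around the true mean of 1,
# and their distribution is close to bell-shaped despite the skewed population
sd(means_n5)
sd(means_n100)
```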
-
Assumption #3: Equal variances between groups/treatments

[Figure: two normal curves on a 0–24 scale, both with mean x̄ = 12, one with sA = 4 and one with sB = 6]

Let's say 5% of the A data fall above a given threshold, but >5% of the B data fall above the same threshold.
So with larger variances, you can expect a greater number of observations at the extremes of the
distributions.
This can have real implications on inferences we make from comparisons between groups.
-
Assumption #3: Equal variances between treatments

[Figure: Farm 1 – lentil heights for varieties A, B, and C; the residuals are the deviations of observations from their group means]

"Does the known probability of observations between my two samples hold true?"

Determine if the assumption is met by:
1. Looking at the residuals of your sample
2. Bartlett Test
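Option 2 in R, again with the built-in iris data standing in for the lentil example:

```r
# H0: the variance of Sepal.Length is the same across the three iris species
bartlett.test(Sepal.Length ~ Species, data = iris)
```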
-
Assumption #3: Equal variances between treatments – Testing for Equal Variances with Residual Plots

[Figure: four panels of predicted values (y-axis) versus observed values in original units (x-axis), each centred on a residual of 0]

• NORMAL distribution: equal number of points along observed; EQUAL variances: equal spread on either side of the mean predicted value = 0 → Good to go!
• NON-NORMAL distribution: unequal number of points along observed; EQUAL variances: equal spread on either side of the mean predicted value = 0 → Optional to fix
• NORMAL/NON-NORMAL: look at a histogram or run a test; UNEQUAL variances: cone shape – opening away from or towards zero → This needs to be fixed for ANOVA (transformations)
• OUTLIERS: points that deviate from the majority of data points → This needs to be fixed for ANOVA (transformations or removal)
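A sketch of how to draw such a residual plot in R (iris again stands in for the workshop data):

```r
model <- aov(Sepal.Length ~ Species, data = iris)

# Residuals versus fitted values: look for equal spread around zero (no cone shape)
plot(fitted(model), residuals(model),
     xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Histogram of residuals: look for a roughly bell-shaped distribution
hist(residuals(model))
```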
-
• Treatment – predictor variable (e.g. variety, fertilization, irrigation, etc.)
• Treatment level – groups within treatments (e.g. A, B, C or Control, 1xN, 2xN)
• Covariate – undesired, uncontrolled predictor variable; confounding
• F-value – variation between / variation within = mean square treatment / mean square error
• P-value – probability that the observed difference (or larger) in the treatment means is due to random chance
Analysis of Variance (ANOVA) – Vocabulary
-
Analysis of Variance (ANOVA)

[Figure: Farm 1 – lentil heights for varieties A, B, and C, with their group means and the overall mean; the residuals are the deviations of observations from their group means]

F-value = variation between / variation within = mean square treatment / mean square error = signal / noise

SIGNAL: variance between = [Σᵢⁿ (x̄ᵢ − x̄ALL)² / (n − 1)] × r
NOISE: variance within = Σᵢⁿ varianceᵢ / n
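In R the whole calculation is one call; a sketch with the built-in iris data standing in for the lentil yields:

```r
# One-way ANOVA: does mean Sepal.Length differ among the three iris species?
model <- aov(Sepal.Length ~ Species, data = iris)
summary(model)  # table of df, sums of squares, mean squares, F-value, and p-value
```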
-
Analysis of Variance (ANOVA)

Think of Pac-Man!
• All of the dots on the board represent the Total
Variation in your study
• Every treatment you use in your analysis is a
different Pac-Man player on the board
• The amount of dots each player eats represents the
variation between (e.g. the amount of variation each treatment can explain)
• The amount of dots left on the board after all
players have died represents the variation within
• If players have a big effect they will eat more dots, reducing the dots left on the board
(lowering the variation within) and increasing the F-value
• A large F-value indicates a significant difference

F = signal / noise = variance between / variance within
-
F Distribution (family of distributions)

F = signal / noise = variance between / variance within

variance between = [Σᵢⁿ (x̄ᵢ − x̄ALL)² / (n − 1)] × n obs. per treatment
variance within = Σᵢⁿ varianceᵢ / n

[Figure: the F distribution, with probability on the y-axis; values where signal < noise lie in the body of the curve and values where signal > noise lie in the upper tail. The tail area beyond the α = 0.05 critical value gives the p-value.]

P-value (percentiles, probabilities) in R: pf(F, df1, df2) and qf(p, df1, df2)
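The pf() and qf() calls work like this; the F-value and degrees of freedom below are invented for illustration:

```r
# p-value: probability of an F this large or larger under H0 (df1 = 2, df2 = 27)
Fval <- 4.5
1 - pf(Fval, 2, 27)

# Critical value: the F beyond which we reject at alpha = 0.05
qf(0.95, 2, 27)
```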
-
How to report results from an ANOVA

Source of Variation | df | Sum of Squares | Mean Squares | F-value   | P-value
Variety (A)         | 2  | 20             | 10           | 0.0263    | 0.9741
Farm (B)            | 1  | 435243         | 435243       | 1125.7085 |
-
Interaction plots – Different story under different conditions

[Figure: four panels plotting mean yield of varieties A and B on Farm1 versus Farm2, plus the overall farm averages]

1. VARIETY is significant (*); FARM is significant (*); FARM2 has better yield than FARM1; no interaction.
2. VARIETY is not significant; FARM is significant (*); VARIETY A is better on FARM2 and VARIETY B is better on FARM1; significant interaction.
3. VARIETY is significant (*); FARM is significant (*) – small difference; main effects are significant, BUT hard to interpret with overall means; significant interaction.
4. VARIETY is not significant; FARM is not significant; cannot distinguish a difference between VARIETY or FARM; no interaction.
-
Interaction plots – Different story under different conditions
• An interaction detects non-parallel lines
• Interaction plots are difficult to interpret for more than a 2-way ANOVA
• If the interaction effect is NOT significant then you can
just interpret the main effects
• BUT if you find a significant interaction you don't want
to interpret main effects, because the combination of
treatment levels results in different outcomes
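R's interaction.plot() draws exactly these pictures; a sketch on the built-in ToothGrowth data (supplement type and dose standing in for VARIETY and FARM):

```r
# Non-parallel lines suggest an interaction between the two factors
interaction.plot(factor(ToothGrowth$dose), ToothGrowth$supp, ToothGrowth$len,
                 xlab = "Dose", ylab = "Mean tooth length", trace.label = "Supp")

# The supp:factor(dose) row of the ANOVA table tests the interaction formally
summary(aov(len ~ supp * factor(dose), data = ToothGrowth))
```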
-
Pairwise comparisons – What to do when you have an interaction
a.k.a. pairwise t-tests

Number of comparisons: C = t(t − 1)/2, where t = number of treatment levels

Lentil example: 3 VARIETIES (A, B, and C) give the comparisons A–B, A–C, B–C:
C = t(t − 1)/2 = 3(2)/2 = 3

Probability of making a Type I Error in at least one comparison = 1 − probability of making no Type I
Error at all

Experiment-wise Type I Error for α = 0.05: probability of Type I Error = 1 − 0.95^C

Lentil example: probability of Type I Error = 1 − 0.95³ = 1 − 0.857 ≈ 0.14

Significantly increased probability of making an error!
Therefore pairwise comparisons lead to a compromised experiment-wise α-level.
You can correct for multiple comparisons by calculating adjusted p-values
(Bonferroni, Holm, etc.)
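The experiment-wise error calculation, and R's built-in p-value adjustment; the three raw p-values below are invented for illustration:

```r
# Experiment-wise Type I error for C = 3 comparisons at alpha = 0.05
C <- 3
1 - 0.95^C  # ~0.14

# Bonferroni adjustment of three hypothetical raw p-values (multiplies each by 3)
p.adjust(c(0.010, 0.040, 0.200), method = "bonferroni")

# Holm adjustment is less conservative
p.adjust(c(0.010, 0.040, 0.200), method = "holm")
```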
-
Pairwise comparisons – Tukey Honest Significant Differences (Test)
a.k.a. pairwise t-tests with adjusted p-values

• If we have a significant interaction effect – use these values
• If we have NO significant interaction effect – we can just look at the main effects

Only need to consider relevant pairwise comparisons – think about it logically
-
How to report a significant difference in a graph

Create a matrix of significance and use it to code your graph:

    W   X   Y   Z
W   -   NS  *   NS
X       -   *   NS
Y           -   NS
Z               -

Letter codes for the graph: W = A, X = A, Y = B, Z = A,B
Same letter = not significant; different letter = significant
-
Permutational Non-parametric tests
• PNPT make NO assumptions, therefore any data can be used
• PNPT work with absolute differences, a.k.a. distances
• Smaller values indicate similarity
• This makes the calculations equivalent to sums of squares

D = signal / noise = distance between groups / distance within groups

Calculating D (delta) & its distribution:
• For our test we compare D to an expected distribution of D the
same way we do when we calculate an F-value
• Use permutations (iterations) to generate the distribution of D from
our raw data
• Therefore the shape of the D distribution depends on your data
-
Permutational Non-parametric tests – Determining the distribution of D

• After you permute this process 5000 times (your choice) a distribution of D will emerge
• The shape depends on your data – it may be normal or not (doesn't matter)

[Figure: histogram of permuted D values, with the observed D = 10 marked]

4921 D calculations < 10 from permutations
79 D calculations ≥ 10 from permutations
P-value:
79/5000 = 0.0158
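The whole permutation procedure fits in a few lines of R; the two groups below are simulated for illustration:

```r
# Simulated two-group data (invented; stands in for the lentil yields)
set.seed(123)
y   <- c(rnorm(15, mean = 10), rnorm(15, mean = 13))
grp <- rep(c("A", "B"), each = 15)

# Observed D: absolute difference between the two group means
D_obs <- abs(diff(tapply(y, grp, mean)))

# Shuffle the group labels 5000 times and recompute D each time
D_perm <- replicate(5000, abs(diff(tapply(y, sample(grp), mean))))

# p-value: share of permuted D values at least as large as the observed D
mean(D_perm >= D_obs)
```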
-
Permutational Non-parametric tests in R

Permutational ANOVA in R:
library(lmPerm)
summary(aovp(YIELD ~ FARM * VARIETY, seqs = T))

Pairwise Permutational ANOVA in R:
out1 = aovp(YIELD ~ FARM * VARIETY, seqs = T)
TukeyHSD(out1)

• The option seqs = T calculates sequential sums of squares (similar to a regular ANOVA) – a good choice for balanced designs
• You can change the maximum number of iterations with the maxIter = option
-
Permutational Non-parametric tests
• For parametric tests we know what the Normal, t-, and F-distributions look like
• Therefore we can use the standard calculations (e.g. the t-value) to calculate statistics
• When we violate the known distribution we need some other curve to work with
• It is hard to estimate a theoretical distribution that fits your data
• The best solution is to permute your data to generate a distribution
• Permutational non-parametric statistics are just as powerful as parametric tests
• This technique is similar to bootstrapping, but the bootstrap resamples the data
rather than shuffling all observation classes
-
If permutational techniques are so good
why not always use them?
• You say permutational non-parametric tests are as
powerful as parametric statistics – YES they are!
• But they are still fairly new to statistical practice
• Still unknown or not understood among many users
• Best practice is to stick with parametric statistics when
you can; but when you can't, permutational tests are
great options!
-
Extension of the Statistics Toolbox – Multivariate Tests (Rotation-Based)
“Definition of Statistics: The science of producing unreliable facts from reliable figures.”
Evan Esar (Humorist & Writer)
-
• Population – Class of things (What you want to learn about)
• Sample – group representing a class (What you actually study)
• Experimental unit – individual research subject (e.g. location, entity, etc.)
• Response variable – property of thing that you believe is the result of predictors (What you actually measure – e.g. lentil height, patient response)
a.k.a dependent variable
• Predictor variable(s) – environment of things which you believe is influencing a response variable (e.g. climate, topography, drug combination, etc.)
a.k.a independent variable
• Error – difference between an observed (or calculated) value and its true (or expected) value
A Reminder from Univariate Statistics….
-
Rotation-based Methods

[Table: each row is an experimental unit, with a Data.ID, a Type column (e.g. regions, ecosystems, forest types, treatments, etc.), and columns Variable1, Variable2, Variable3, Variable4, … (e.g. frequency of species, climate variables, soil characteristics, nutrient concentrations, drug levels, etc.)]

In multivariate statistics:
• Variables can be either numeric or categorical (depends on the
technique)
• Focus is often placed on graphical representation of results
-
Rotation-based Methods

[Figure: scatterplot of Variable 2 against Variable 1; find an equation to rotate the data so that one axis explains multiple variables (e.g. a combined Variable 1,2 axis plotted against Variable 3), then repeat the rotation process (e.g. Variables 1, 2, 4, 9, 10 against Variables 3, 6, 8) to achieve the analysis objective]

Final results based on multiple variables give different inferences than 2 variables.
-
Objective of Rotation-based Methods

1. Rotate so that the new axis explains the greatest amount of variation within the data
→ Principal Component Analysis (PCA), Factor Analysis
2. Rotate so that the variation between groups is maximized
→ Discriminant Analysis (DISCRIM), Multivariate Analysis of Variance (MANOVA)
3. Rotate so that one dataset explains the most variation in another dataset
→ Canonical Correspondence Analysis (CCA)
-
Z1 = a11X1 + a12X2 + … + a1nXn

where Z1 is the first principal component (a column vector), X1, …, Xn are the column
vectors of the original variables, and the a values are the coefficients of the linear model.

• PCA Objective: find linear combinations of the original variables X1, X2, …, Xn to
produce components Z1, Z2, …, Zn that are uncorrelated, in order of their
importance, and that describe the variation in the original data.
• Principal components are the linear combinations of the original variables.
• Principal component 1 is NOT a replacement for variable 1 – all variables are
used to calculate each principal component.

For each component, the constraint a11² + a12² + … + a1n² = 1 ensures Var(Z1) is as large as possible.

The Math Behind PCA
-
• Z2 is calculated using the same formula and constraint on the a2n values;
however, there is an additional condition that Z1 and Z2 have zero
correlation for the data
• The correlation condition continues for all successive principal
components, i.e. Z3 is uncorrelated with both Z1 and Z2
• The number of principal components calculated will match the number
of predictor variables included in the analysis
• The amount of variation explained decreases with each successive
principal component
• Generally you base your inferences on the first two or three
components because they explain the most variation in your data
• Typically, when you include a lot of predictor variables, the last
couple of principal components explain very little (< 1%) of the
variation in your data – not useful variables
The Math Behind PCA
-
PCA in R: princomp(dataMatrix, cor = T/F)   (stats package)

• dataMatrix – the data matrix of predictor variables; you will assign the results to an
object once the PCs have been calculated
• cor – defines whether the PCs should be calculated using the correlation or the covariance
matrix (derived within the function from the data)
• You tend to use the covariance matrix when the variable scales are similar
and the correlation matrix when variables are on different scales
• The correlation matrix (cor = T) standardizes the data before the PCs are
calculated, removing the effect of the different units (note the function's
default is the covariance matrix, cor = F)

PCA in R
-
PCA in R
-
Loadings – these are the correlations between the original predictor variables
and the principal components (the eigenvectors)

Identifies which of the original variables are driving each principal component

Example:
Comp.1 – is negatively related to Murder, Assault, and Rape
Comp.2 – is negatively related to UrbanPop

PCA in R
-
Scores – these are the calculated principal components Z1, Z2, …, Zn
These are the values we plot to make inferences
PCA in R
-
Variance – the summary of the output displays the variance explained by each
principal component (the eigenvalues divided by the number of PCs)

Identifies how much weight you should put on your principal components

Example:
Comp.1 – 62%
Comp.2 – 25%
Comp.3 – 9%
Comp.4 – 4%

PCA in R
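The loadings, scores, and variance summaries above all come from one princomp() call; a sketch on the built-in USArrests data used in this workshop:

```r
# Variables are on different scales, so use the correlation matrix (cor = TRUE)
pca <- princomp(USArrests, cor = TRUE)

summary(pca)      # proportion of variance explained by each component
pca$loadings      # eigenvectors: which original variables drive each component
head(pca$scores)  # the rotated values Z1...Zn that get plotted
biplot(pca)       # points by Comp.1/Comp.2 scores, with variable arrows
```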
-
PCA in R – Biplot

• Data points are plotted by their Comp.1 and Comp.2 scores (row names displayed)
• The direction of the arrows (+/−) indicates the trend of the points (towards the
arrow indicates more of the variable)
• If vector arrows are perpendicular then the variables are not correlated

If your original variables do not have some level of correlation then PCA will NOT work
for your analysis – i.e. you won't learn anything!
-
WORK PERIOD 11:30 – 1:00
Follow the Workbook Examples for the Analyses You Are Interested In. Any questions?
-
Statistics Toolbox

What is the goal of my analysis?
… to characterize my data.
… to find if there is a significant difference between my groups.
… to see what predictor conditions are associated with my groups.

What kind of data do I have to answer my research question?
… continuous or discrete?
… binary data?

How many variables do I want to include in my analysis?
… multiple response variables?
… single treatment or multiple treatments?

Does my data meet the analysis assumptions?
… normally distributed?
… equal variances?
-
Thank You for Attending the Stats Workshop
If you have any further questions please feel free to contact me
Flow Cytometry Core Facility
LKG ConsultingEmail: [email protected]
Website: www.consultinglkg.com