
  • Statistics Toolbox in

    Professional Development Opportunity for the Flow Cytometry Core Facility

    October 12, 2018

    LKG Consulting
    Email: [email protected]

    Website: www.consultinglkg.com

    A Review of Analysis Techniques for Scientific Research

  • The goal of this workshop is to give you the knowledge & tools to be confident in your ability to

    collect & analyze your data as well as correctly interpret your results…

    …Think of me as your new resource!


  • Laura Gray-Steinhauer

    www.ualberta.ca/~lkgray

    BSc in Mathematics, Statistics and Environmental Studies (UVIC, 2005)

    MSc in Forest Biology and Management (UofA, 2008)

    PhD in Forest Biology and Management (UofA, 2011)

    Designated Professional Statistician with The Statistical Society of Canada (2014)

    Research: Climate Change, Policy Evaluation, Adaptation, Mitigation, Risk management for forest resources, Conservation…

    A little about me…


  • Workshop Schedule

    8:15 – 8:30 Arrive to the Lab & Start up the computers

    8:30 – 8:45 Welcome to the Workshop (housekeeping & today’s goals)

    8:45 – 9:15 Statistics Toolbox – Refresh useful vocabulary, introduce a decision tree to plan your analysis path

    9:15 – 9:45 Hypothesis Testing – Refresher on p-values, Type I and Type II error, and statistical power

    9:45 – 10:00 Break

    10:00 – 11:00 Parametric versus Non-Parametric Tests – Testing for parametric assumptions, ANOVA, Permutational ANOVA

    11:00 – 11:30 Multivariate Statistics – Introduction to principal component analysis (PCA)

    11:30 – 1:00 Work period (questions are welcome)

    After 1:00 Enjoy your weekend!

    This may be A LOT of information to absorb OR we may not cover the specific topic you came to learn in class today.

    Feel free to reach out to me via email with more questions: [email protected].

  • Workbook

    • Yours to keep!

    • R code is identified by Century Gothic font (everything else is Arial)

    • Arbitrary object names are bold to indicate these could change depending on what you name your variables.

    • Referenced data is provided at www.ualberta.ca/~lkgray

    • Please contact me to obtain permission to redistribute content outside of the workshop attendees.

    Topics Included:

    • Descriptive statistics
    • Confidence intervals
    • Data distributions
    • Parametric assumptions
    • T-tests
    • ANOVA
    • ANCOVA
    • Non-parametric tests

    Topics Included:

    • Permutational ANOVA & T-tests
    • Z-test for Proportions
    • Chi-squared test
    • Outlier tests and treatments
    • Correlation
    • Linear regression
    • Multiple linear regression
    • Akaike Information Criterion

    Topics Included:

    • Non-linear regression
    • Logistic regression
    • Binomial ANOVA
    • Principal component analysis (PCA)
    • Discriminant analysis
    • Multivariate analysis of variance (MANOVA)


  • R Project Website

    https://cran.r-project.org/index.html

  • RStudio (IDE: Integrated Development Environment)

    Preferred among programmers; we will use it in this workshop

    https://www.rstudio.com/

  • Statistics Toolbox

    “Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.”

    Aaron Levenstein (Author)

  • Statistical Vocabulary

    Statistical Term | Real World | Research World

    Population | Class of things, e.g. cancer patients | What you want to learn about, e.g. cancer patients in Alberta

    Sample | Group representing a class, e.g. 1000 cancer patients in Alberta | What you actually study, e.g. 1000 cancer patients from 10 treatment centres in Alberta

    Experimental Unit | Individual thing, e.g. each of the 1000 cancer patients | Individual research subject, e.g. cancer patients n=1000; hospital populations n=10 (depends on research question)

    Dependent Variable | Property of things, e.g. white blood cell count | What you measure about subjects, e.g. white blood cell count

    Independent Variable | Environment of things, e.g. treatment options, climate, etc. | What you think might influence the dependent variable, e.g. amount of treatment, combination of treatments, etc.

    Data | Values of variables | What you record / the information you collect

  • • Experiment – any controlled process of study which results in data collection, and in which the outcome is unknown

    • Descriptive statistics – numerical/graphical summary of data

    • Inferential statistics – predict or control the values of variables (make conclusions with)

    • Statistical inference – makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken

    • Parameter – an unknown value (needs to be estimated) used to represent a population characteristic (e.g. population mean)

    • Statistic – estimation of parameter (e.g. mean of a sample)

    • Sampling distribution (aka. Probability distribution or Probability density function) –probability associated with each possible value of a variable

    • Error – the difference between an observed (or calculated) value and its true (or expected) value

    Other important statistical terms Also see Appendix 1 in your workbook

  • Statistics Toolbox

    What is the goal of my analysis?

    What kind of data do I have to answer my research question?

    How many variables do I want to include in my analysis?

    Does my data meet the analysis assumptions?

  • Statistics Toolbox

    Analysis Goal | Parametric (assumptions met) | Non-Parametric (alternative if assumptions fail) | Binomial (binary data / event likelihood)

    Describe data characteristics | Mean, standard deviation, standard error, etc. | Median, quartiles, percentiles | Proportions

    Compare 2 distinct/independent groups | T-test, paired t-test | Wilcoxon rank-sum test, Kolmogorov–Smirnov test, permutational t-test | Z-test for proportions

    Compare > 2 distinct/independent groups | ANOVA, multi-way ANOVA, ANCOVA, blocking | Kruskal–Wallis test, Friedman rank test, permutational ANOVA | Chi-squared test, binomial ANOVA

    Estimate the degree of association between 2 variables | Pearson’s correlation | Spearman rank correlation, Kendall’s rank correlation | Logistic regression

    Predict outcome based on relationship | Linear regression, multiple linear regression, non-linear regression | – | Logistic regression, odds ratio

    Probability distributions are always appropriate to describe data. Graphics are always appropriate to describe data.

  • If you have a continuous response variable…

    … and one predictor variable

    Predictor is categorical:
    • Two treatment levels: T-Test (parametric) | Permutational T-Test, Kolmogorov–Smirnov (KS) Test, Wilcoxon Test (non-parametric)
    • > Two treatment levels: One-Way ANOVA (parametric) | Kruskal–Wallis Test, Friedman Rank Test (non-parametric)

    Predictor is continuous:
    • Pearson’s Correlation (parametric) | Spearman’s Rank Correlation, Kendall’s Rank Correlation (non-parametric)
    • Linear/Non-linear Regression (regression) | Logistic Regression (binomial)

    What you get:

    • P-value indicating if 2 groups are significantly different
    • P-value indicating there is a significant effect of “treatment”
    • Need pairwise comparisons to find where the difference between groups occurs
    • Correlation coefficient indicating direction and magnitude of relationship
    • “Goodness of fit” indicating how well predictor is linked to response (R2 or AIC)

  • … and two or more predictor variables

    Predictor is categorical (two or more treatment levels for each predictor):
    • Multi-Way ANOVA, ANCOVA, Blocking (parametric) | Permutational ANOVA, Blocking (non-parametric)

    Predictor is continuous:
    • Multiple Regression (regression)

    What you get:

    • P-value indicating if there is a significant effect of each treatment.
    • Size of a significant effect (no interactions).
    • Need to consider the possibility of interactions.
    • Need pairwise comparisons with adjusted p-values to determine the difference among treatments with interactions.
    • Also get the effect of the blocking term and/or undesired covariate.
    • Do not need to consider the interaction between treatments and blocks and/or covariates.
    • Fit of how well predictors are linked to the response variable (Adjusted R2, AIC).
    • P-values to indicate which predictors significantly affect the response variable.

  • If you have a categorical response variable…

    … and one predictor variable

    Predictor is categorical:
    • Two treatment levels: Z-Test for Proportions
    • Two or more treatment levels: Chi-squared Test, Binomial ANOVA

    Predictor is continuous:
    • Logistic Regression

    … and two or more predictor variables

    Predictor is continuous:
    • Logistic Regression

    What you get:

    • P-value indicating if 2 groups are significantly different.
    • P-value indicating there is a significant effect of each treatment.
    • Size of a significant effect (no interactions).
    • Need to consider the possibility of interactions.
    • Need pairwise comparisons with adjusted p-values to determine the difference among treatments with interactions.
    • Fit of how well predictors are linked to the response variable (Adjusted R2, AIC).
    • P-values to indicate which predictors significantly affect the response variable.

  • Example research questions:

    Do yields of different lentil varieties differ at 2 farms?

    Do the varieties differ among themselves?

    Does the density of the plants impact their average height?

    The Lentil datasets (You are now a farmer)

    [Diagram: two farms (Farm 1 and Farm 2), each laid out in plots with one
    variety (A, B, or C) per plot, and individual lentil plants within each plot]

  • Datasets Available in R

    • Over 100 datasets are available for you to use
    • We will use:

    • iris: The famous (Fisher's or Anderson's) iris data set gives the sepal and

    petal measurements for 50 flowers from each of Iris setosa, versicolor, and

    virginica.

    • USArrests: Data on US Arrests for violent crimes by US State.

  • Hypothesis Testing

    “Statistics are no substitute for judgment.” – Henry Clay (Former US Senator)

  • Formal hypothesis testing

    Is the difference between sample A and sample B due to random chance?

    [Plot: mean height of population samples A and B]

    H0: x̄A = x̄B
    H1: x̄A ≠ x̄B

    If actual p < α, reject the null hypothesis (H0) and accept the alternative hypothesis (H1)
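A minimal sketch of this test in R, using the built-in iris data as a stand-in for two samples A and B (the choice of variable is purely illustrative):

```r
# Two-sample t-test of H0: mean(A) = mean(B) versus H1: mean(A) != mean(B)
a <- iris$Sepal.Length[iris$Species == "setosa"]
b <- iris$Sepal.Length[iris$Species == "versicolor"]

result <- t.test(a, b)
result$p.value < 0.05   # TRUE: p is below alpha, so we reject H0
```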

  • “Is this difference due to random chance?”

    [Plot: mean height of population samples A and B]

    P-value – the probability the observed value or larger is due to random chance

    Theory: We can never really prove if the 2 samples are truly different or the same – only

    ask if what we observe (or a greater difference) is due to random chance

    How to interpret p-values:

    P-value = 0.05 – “Yes, 1 out of 20 times.”

    P-value = 0.01 – “Yes, 1 out of 100 times.”

    The lower the probability that a difference is due to random chance, the more likely it is the result of an effect (what we test for)

    In other words: “Is random chance a plausible explanation?”

  • Type I Error – reject the null hypothesis (H0) when it is actually true

    Type II Error – failing to reject the null hypothesis (H0) when it is not true

    Remember rejection or acceptance of a p-value (and therefore the chance you will make an error) depends on the arbitrary α-level you choose

    • Increasing the α-level will increase the probability of making a Type I Error, but decrease the probability of making a Type II Error

    The α-level you choose is completely up to you (typically it is set at 0.05); however, it should be chosen with consideration of the consequences of making a Type I or a Type II Error.

    Based on your study, would you rather err on the side of false positives or false

    negatives?

                                   Null hypothesis is true          Alternative hypothesis is true

    Fail to reject the null        ☺ Correct Decision               Incorrect Decision
    hypothesis                                                      False Negative (Type II Error)

    Reject the null                Incorrect Decision               ☺ Correct Decision
    hypothesis                     False Positive (Type I Error)

  • Example: Will current forests adequately protect genetic resources under climate change?

    Birch Mountain Wildlands

    HO: Range of the current climate for the BMW protected area = Range of the BMW protected area under climate change

    Ha: Range of the current climate for the BMW protected area ≠ Range of the BMW protected area under climate change

    If we reject HO: Climate ranges are different, therefore genetic resources are not adequately protected and new

    protected areas need to be created

    Consequences if I make:

    • Type I Error: Climates are actually the same and genetic resources are

    indeed adequately protected in the BMW protected area – we created

    new parks when we didn’t need to

    • Type II Error: Climates are different and genetic resources are

    vulnerable – we didn’t create new protected areas and we should have

    From an ecological standpoint it is better to make a Type I Error, but from

    an economic standpoint it is better to make a Type II Error

    Which standpoint should I take?

  • Power is your ability to reject the null hypothesis when it is false (i.e.

    your ability to detect an effect when there is one).

    There are many ways to increase power:

    1. Increase your sample size (sample more of the population)

    2. Increase your alpha value (e.g. from 0.01 to 0.05) – watch for Type I

    Error!

    3. Use a one-tailed test (you know the direction of the expected effect)

    4. Use a paired test (control and treatment are same sample)

    Given you are testing whether or not what you observed (or greater) is due to random chance, more data gives you a better understanding of what is truly happening within the population; therefore increasing sample size will decrease the probability of making a Type II Error

    Statistical Power
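These levers can be explored with base R's power.t.test(); the 0.5-SD effect size below is just an illustrative choice:

```r
# How many samples per group to detect a 0.5-SD difference with 80% power?
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)$n
# roughly 64 per group

# Power achieved if we only collect n = 20 per group (same effect size)
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power
# well below the target 0.80
```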

  • BREAK9:45 – 10:00

    Go grab a coffee. Next we will cover specific tools in your new tool box.

  • Statistics ToolboxParametric versus Non-Parametric Tests

    “He uses statistics as a drunken man uses lamp posts, for support rather than illumination.”

    Andrew Lang (Scottish poet)

  • Univariate Test Options

    Characteristics:
    • Parametric – analysis to test group means; based on raw data; more statistical power than non-parametric tests
    • Non-Parametric – analysis to test group medians; based on ranked data; less statistical power

    Assumptions:
    • Parametric – independent samples; normality (data OR errors); homogeneity of variances
    • Non-Parametric – independent samples

    When to use?
    • Parametric – parametric assumptions are met; or non-normal data BUT a larger sample size (CLT), although equal variances must still be met
    • Non-Parametric – parametric assumptions are not met; medians better represent your data (skewed data distribution); small sample size; ordinal data, ranked data, or outliers that you can’t remove

    Examples:
    • Parametric – T-test; ANOVA (one-way, two-way, paired)
    • Non-Parametric – Wilcoxon Rank Sum Test; Kruskal–Wallis Test; permutational tests (non-traditional)

  • Assumption #1: Independence of samples

    “Your samples have to come from a randomized or randomly sampled design.”

    • Meaning rows in your data do NOT influence one another

    • Address this with experimental design (3 main things to consider)

    1. Avoid pseudoreplication and potential confounding factors by designing

    your experiment in a randomized design

    2. Avoid systematic arrangements, which are distinct patterns in how treatments are laid out.
    • If your treatments affect one another, the individual treatment effects could be masked or overinflated

    3. Maintain temporal independence
    • If you need to take multiple samples from one individual over time, record and test your data considering the change in time (e.g. paired tests)

    NOTE: An ANOVA needs to have at least 1 degree of freedom – this means you need at least 2 reps per treatment to execute an ANOVA

    Rule of Thumb: You need more rows than columns

  • The Normal Distribution

    s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)

    SD = √s²

    Based on this curve:

    • 68.27% of observations are within 1 stdev of x̄
    • 95.45% of observations are within 2 stdev of x̄
    • 99.73% of observations are within 3 stdev of x̄

    For confidence intervals:

    • 95% of observations are within 1.96 stdev of ҧ𝑥

    The base of parametric statistics
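The coverage figures above can be reproduced with the Normal distribution functions in base R:

```r
# Proportion of observations within k standard deviations of the mean
pnorm(1) - pnorm(-1)   # ~0.6827
pnorm(2) - pnorm(-2)   # ~0.9545
pnorm(3) - pnorm(-3)   # ~0.9973

# The 1.96 multiplier used for 95% confidence intervals
qnorm(0.975)           # ~1.96
```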

  • Assumption #2: Data/Experimental errors are normally distributed

    [Diagram: Farm 1 lentil yields by variety, showing the overall mean x̄ABC,
    the variety means x̄A, x̄B, x̄C, and the residuals around each mean]

    “If I were to repeat my sampling and calculate the means, those means would be normally distributed.”

    Determine if the assumption is met by:

    1. Looking at the residuals of your sample

    2. Shapiro–Wilk Test for Normality – if your data is mainly unique values

    3. D'Agostino–Pearson normality test – if you have lots of repeated values

    4. Lilliefors normality test – mean and variance are unknown
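A minimal check in base R, with the built-in iris data standing in for your sample (of the tests above, Shapiro–Wilk ships with base R; the others live in add-on packages):

```r
# Fit a one-way model, then inspect the residuals for normality
fit <- aov(Sepal.Length ~ Species, data = iris)

# Visual checks of the residuals
hist(residuals(fit))
qqnorm(residuals(fit)); qqline(residuals(fit))

# Shapiro-Wilk: a large p-value means no evidence against normality
shapiro.test(residuals(fit))
```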

  • Assumption #2: Data/Experimental errors are normally distributed – You may not need to worry about Normality?

    [Plot: t-distribution (sampling distribution) compared with the Normal distribution]

  • Assumption #2: Data/Experimental errors are normally distributed – You may not need to worry about Normality?

    Central Limit Theorem: “Sample means tend to cluster around the central population value.”

    Therefore:

    • When sample size is large, you can assume that ҧ𝑥 is close to the value of 𝜇• With a small sample size you have a better chance to get a mean that is far

    off the true population mean

    What does this mean?

    • For large N, the assumption for Normality can be relaxed

    • You may have decreased power to detect a difference among groups, BUT

    your test is not really compromised if your residuals are not normal

    • Assumption of Normality is important when:

    1. Very small N

    2. Data is highly non-normal

    3. Significant outliers are present

    4. Small effect size
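The theorem is easy to see by simulation; the exponential population below is just an arbitrary skewed example:

```r
set.seed(42)
population <- rexp(100000, rate = 1)   # heavily right-skewed, true mean = 1

# Draw 5000 samples of size 30 and keep each sample mean
sample_means <- replicate(5000, mean(sample(population, 30)))

hist(sample_means)   # roughly bell-shaped despite the skewed population
mean(sample_means)   # close to the population mean of 1
```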

  • Assumption #3: Equal variances between groups/treatments

    [Plot: two distributions over a 0–24 scale, A with x̄A = 12, sA = 4 and B with x̄B = 12, sB = 6]

    Let’s say 5% of the A data fall above this threshold

    But >5% of the B data fall above the same threshold

    So with larger variances, you can expect a greater number of observations at the extremes of the distributions

    This can have real implications on inferences we make from comparisons between groups

  • Assumption #3: Equal variances between treatments

    [Diagram: Farm 1 lentil yields by variety, showing the variety means and the residuals around each mean]

    “Does the known probability of observations between my two samples hold true?”

    Determine if the assumption is met by:

    1. Looking at the residuals of your sample

    2. Bartlett Test
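In R the Bartlett test is one line; the built-in iris data stands in for your own here:

```r
# H0: the variance of sepal length is the same in every species
bartlett.test(Sepal.Length ~ Species, data = iris)
# The small p-value here suggests unequal variances, so a transformation
# (or a non-parametric alternative) would be worth considering
```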

  • Assumption #3: Equal variances between treatments – Testing for Equal Variances with Residual Plots

    [Four residual plots: predicted values against observed values (original units)]

    • NORMAL distribution: equal number of points along observed
    • EQUAL variances: equal spread on either side of the mean predicted value = 0
    • Good to go!

    • NON-NORMAL distribution: unequal number of points along observed
    • EQUAL variances: equal spread on either side of the mean predicted value = 0
    • Optional to fix

    • NORMAL/NON-NORMAL: look at histogram or test
    • UNEQUAL variances: cone shape – away from or towards zero
    • This needs to be fixed for ANOVA (transformations)

    • OUTLIERS: points that deviate from the majority of data points
    • This needs to be fixed for ANOVA (transformations or removal)


  • • Treatment – predictor variable (e.g. variety, fertilization, irrigation, etc.)

    • Treatment level – groups within treatments (e.g. A,B,C or Control, 1xN, 2xN)

    • Covariate – undesired, uncontrolled predictor variable, confounding

    • F-value – variation between / variation within = mean square treatment / mean square error

    • P-value – probability that the observed difference (or larger) in the treatment means is due to random chance

    Analysis of Variance (ANOVA) – Vocabulary

  • Analysis of Variance (ANOVA)

    [Diagram: Farm 1 lentil yields by variety, showing the variety means and the residuals around each mean]

    F-value = variation between / variation within = mean square treatment / mean square error = signal / noise

    SIGNAL: variance between = [Σᵢⁿ (x̄ᵢ − x̄ALL)² / (n − 1)] × r

    NOISE: variance within = Σᵢⁿ varianceᵢ / n
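This signal-to-noise ratio is exactly what aov() computes; a minimal run on the built-in iris data (standing in for the lentil yields):

```r
# One-way ANOVA: does mean sepal length differ among the three species?
fit <- aov(Sepal.Length ~ Species, data = iris)
summary(fit)
# The F-value is the mean square for Species divided by the residual
# mean square; here the signal dwarfs the noise, so the p-value is tiny
```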

  • Analysis of Variance (ANOVA)

    Think of Pac-man!

    • All of the dots on the board represent the Total

    Variation in your study

    • Every treatment you use in your analysis is a

    different Pac-man player on the board

    • The amount of dots each player eats represents variation between (e.g. the amount of variation each treatment can explain)

    • The amount of dots left on the board after all players have died represents the variation within

    • If players have a big effect they will eat more dots, reducing dots left on the board (lowering variation within), increasing the F-value

    • A large F-value indicates a significant difference

    F = signal / noise = variance between / variance within


  • F Distribution (family of distributions)

    F = signal / noise = variance between / variance within

    variance between = [Σᵢⁿ (x̄ᵢ − x̄ALL)² / (n − 1)] × n obs. per treatment

    variance within = Σᵢⁿ varianceᵢ / n

    [Plot: probability density of the F distribution; signal < noise lies toward the
    left of the curve and signal > noise toward the right tail. P-values
    (percentiles, probabilities) come from pf(F, df1, df2) and critical values from
    qf(p, df1, df2); the 0.50 and 0.95 quantiles are marked, with α = 0.05 in the
    upper tail]
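pf() and qf() are the base-R functions named on the slide; the degrees of freedom below (2 and 147) are illustrative, matching a three-group design with 150 observations:

```r
# p-value for an observed F of 4.5 with df1 = 2, df2 = 147
1 - pf(4.5, 2, 147)   # upper-tail probability

# Critical F at alpha = 0.05: reject H0 if the observed F exceeds this
qf(0.95, 2, 147)      # about 3.06
```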

  • How to report results from an ANOVA

    Source of Variation | df | Sum of Squares | Mean Squares | F-value   | P-value
    Variety (A)         | 2  | 20             | 10           | 0.0263    | 0.9741
    Farm (B)            | 1  | 435243         | 43243        | 1125.7085 |

  • Interaction plots – Different story under different conditions

    [Four interaction plots of yield for varieties A and B at Farm1 and Farm2]

    1.
    • VARIETY is significant (*)
    • FARM is significant (*)
    • FARM2 has better yield than FARM1
    • No Interaction

    2.
    • VARIETY is not significant
    • FARM is significant (*)
    • VARIETY A is better on FARM2 and VARIETY B is better on FARM1
    • Significant Interaction

    3.
    • VARIETY is not significant
    • FARM is not significant
    • Cannot distinguish a difference between VARIETY or FARM
    • No Interaction

    4.
    • VARIETY is significant (*)
    • FARM is significant (*) – small difference
    • Main effects are significant, BUT hard to interpret with overall means
    • Significant Interaction

  • Interaction plots – Different story under different conditions

    • An interaction detects non-parallel lines

    • Difficult to interpret interaction plots for more than a 2-way ANOVA

    • If the interaction effect is NOT significant then you can

    just interpret the main effects

    • BUT if you find a significant interaction you don’t want

    to interpret main effects because the combination of

    treatment levels results in different outcomes

  • Pairwise comparisons – What to do when you have an interaction (a.k.a. pairwise t-tests)

    Number of comparisons:

    C = t(t − 1) / 2, where t = number of treatment levels

    Lentil Example: 3 VARIETIES (A, B, and C)

    A – B, A – C, B – C

    C = t(t − 1) / 2 = 3(2) / 2 = 3

    Probability of making a Type I Error in at least one comparison = 1 – probability of making no Type I Error at all

    Experiment-wise Type I Error for α = 0.05: probability of Type I Error = 1 − 0.95^C

    Lentil Example: probability of Type I Error = 1 − 0.95³ = 1 − 0.86 = 0.14

    Significantly increased probability of making an error!

    Therefore pairwise comparisons lead to a compromised experiment-wise α-level

    You can correct for multiple comparisons by calculating an adjusted p-value (Bonferroni, Holm, etc.)
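Both the error inflation and the correction can be checked in base R; the raw p-values below are made up for illustration:

```r
t <- 3                 # treatment levels
C <- t * (t - 1) / 2   # number of pairwise comparisons = 3
1 - 0.95^C             # experiment-wise Type I error, ~0.14

# Adjusting a set of raw pairwise p-values
p_raw <- c(0.010, 0.030, 0.200)
p.adjust(p_raw, method = "bonferroni")   # 0.03 0.09 0.60
p.adjust(p_raw, method = "holm")         # 0.03 0.06 0.20
```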

  • Pairwise comparisons – Tukey Honest Significant Differences Test (a.k.a. pairwise t-tests with adjusted p-values)

    If we have a significant

    interaction effect – use these

    values

    If we have NO significant

    interaction effect – we can just

    look at the main effects

    Only need to consider relevant pairwise comparisons – think about it logically
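In R this is TukeyHSD() applied to a fitted aov object; iris again stands in for the lentil data:

```r
fit <- aov(Sepal.Length ~ Species, data = iris)
TukeyHSD(fit)
# Each row is one pairwise comparison, with the difference in means, a 95%
# confidence interval, and a p-value already adjusted for the family of
# comparisons
```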

  • How to report a significant difference in a graph

    Create a matrix of significance and use it to code your graph:

        W    X    Y    Z
    W   -    NS   *    NS
    X        -    *    NS
    Y             -    NS
    Z                  -

    Letters over the bars: W = A, X = A, Y = B, Z = A,B

    Same letter = non-significant; different letter = significant

  • Permutational Non-parametric tests

    • PNPT make NO assumptions, therefore any data can be used
    • PNPT work with absolute differences, a.k.a. distances
    • Smaller values indicate similarity
    • Makes the calculations equivalent to sum-of-squares

    D = signal / noise = distance between groups / distance within groups

    Calculating D (delta) & its distribution

    • For our test we can compare D to an expected distribution of D the same way we do when we calculate an F-value
    • Use permutations (iterations) to generate the distribution of D from our raw data
    • Therefore the shape of the D distribution is dependent on your data


  • Permutational Non-parametric tests – Determining the distribution of D

    • After you permute this process 5000 times (your choice) a distribution of D will emerge
    • Shape depends on your data – may be normal or not (doesn’t matter)

    [Histogram of permuted D values, with the observed D = 10 marked]

    4921 D calculations < 10 from permutations

    79 D calculations ≥ 10 from permutations

    P-value: 79/5000 = 0.0158
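The whole procedure fits in a few lines of base R; the two yield vectors are hypothetical numbers chosen for illustration:

```r
set.seed(1)
yield_a <- c(5.1, 4.9, 5.3, 5.5, 5.0, 5.2)   # hypothetical yields, variety A
yield_b <- c(6.0, 6.3, 5.9, 6.1, 6.4, 5.8)   # hypothetical yields, variety B
observed_d <- abs(mean(yield_a) - mean(yield_b))

# Shuffle the group labels 5000 times and recompute D each time
pooled <- c(yield_a, yield_b)
perm_d <- replicate(5000, {
  shuffled <- sample(pooled)
  abs(mean(shuffled[1:6]) - mean(shuffled[7:12]))
})

# p-value: fraction of permuted D values at least as extreme as observed
mean(perm_d >= observed_d)
```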

  • Permutational Non-parametric tests in R

    Permutational ANOVA in R:

    library(lmPerm)
    summary(aovp(YIELD~FARM*VARIETY, seqs=T))

    Pairwise Permutational ANOVA in R:

    out1=aovp(YIELD~FARM*VARIETY, seqs=T)
    TukeyHSD(out1)

    The option seqs=T calculates sequential sums of squares (similar to regular ANOVA) – a good choice for balanced designs

    You can change the maximum number of iterations with the maxIter= option

  • Permutational Non-parametric tests

    • For parametric tests we know what the Normal, t-, and F-distributions look like

    • Therefore we can use the standard calculations (e.g. t-value) to calculate statistics

    • When we violate the known distribution we need some other curve to work with

    • It is hard to estimate a theoretical distribution that fits your data

    • The best solution is to permute your data to generate a distribution

    • Permutational non-parametric statistics are just as powerful as parametric tests

    • This technique is similar to bootstrapping – but the bootstrap resamples the data rather than shuffling all observation classes

  • If permutational techniques are so good

    why not always use them?

    • You say permutational non-parametric tests are as

    powerful as parametric statistics – YES They Are!

    • But they are still fairly new to statistical practice

    • Still unknown or not well understood among many users

    • Best practice is to stick with parametric statistics when

    you can, but when you can’t permutational tests are

    great options!

  • Extension of the Statistics ToolboxMultivariate Tests (Rotation-Based)

    “Definition of Statistics: The science of producing unreliable facts from reliable figures.”

    Evan Esar (Humorist & Writer)

  • • Population – Class of things (What you want to learn about)

    • Sample – group representing a class (What you actually study)

    • Experimental unit – individual research subject (e.g. location, entity, etc.)

    • Response variable – property of thing that you believe is the result of predictors (What you actually measure – e.g. lentil height, patient response)

    a.k.a dependent variable

    • Predictor variable(s) – environment of things which you believe is influencing a response variable (e.g. climate, topography, drug combination, etc.)

    a.k.a independent variable

    • Error – the difference between an observed (or calculated) value and its true (or expected) value

    A Reminder from Univariate Statistics….

  • Rotation-based Methods

    In Multivariate statistics:

    • Variables can be either numeric or categorical (depends on the technique)

    • Focus is often placed on graphical representation of results

    [Table sketch: each row is an experimental unit, with a Data.ID column (1–6),
    a Type column – e.g. regions, ecosystems, forest types, treatments – and
    several variable columns (Variable1 … Variable4) – e.g. frequency of species,
    climate variables, soil characteristics, nutrient concentrations, drug levels]

  • Final results based on multiple variables give different inferences than results based on only 2 variables

    • Find an equation to rotate the data so that a new axis explains multiple variables (e.g. an axis combining Variables 1 and 2, plotted against Variable 3)

    • Repeat the rotation process to achieve the analysis objective (e.g. an axis for Variables 1, 2, 4, 9, 10 plotted against an axis for Variables 3, 6, 8)

    [Illustration: the data matrix (Data.ID, Type, Variable1, Variable2, …) projected onto successively rotated composite axes]

    Rotation-based Methods

  • 1. Rotate so that the new axis explains the greatest amount of variation within the data

    Principal Component Analysis (PCA), Factor Analysis

    2. Rotate so that the variation between groups is maximized

    Discriminant Analysis (DISCRIM), Multivariate Analysis of Variance (MANOVA)

    3. Rotate so that one dataset explains the most variation in another dataset

    Canonical Correspondence Analysis (CCA)

    Objective of Rotation-based Methods

  • Z1 = a11X1 + a12X2 + … + a1nXn

    Z1 is the first principal component (a column vector); X1, …, Xn are the column vectors of the original variables; a11, …, a1n are the coefficients of the linear model

    • PCA Objective: Find linear combinations of the original variables X1, X2, …, Xn to produce components Z1, Z2, …, Zn that are uncorrelated, in order of their importance, and that describe the variation in the original data

    • Principal components are the linear combinations of the original variables

    • Principal component 1 is NOT a replacement for variable 1 – ALL variables are used to calculate each principal component

    • For each component, the constraint that a11² + a12² + … + a1n² = 1 ensures Var(Z1) is as large as possible

    The Math Behind PCA

  • Z2 is calculated using the same formula and constraint on the a2n values

    However, there is an additional condition that Z1 and Z2 have zero correlation for the data

    • The correlation condition continues for all successive principal components, i.e. Z3 is uncorrelated with both Z1 and Z2

    • The number of principal components calculated will match the number of predictor variables included in the analysis

    • The amount of variation explained decreases with each successive principal component

    • Generally you base your inferences on the first two or three components because they explain the most variation in your data

    • Typically when you include a lot of predictor variables, the last couple of principal components explain very little (< 1%) of the variation in your data – not useful variables

    The Math Behind PCA
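The constraints above (unit-length coefficient vectors, mutually uncorrelated components) are exactly what the eigenvectors of the correlation matrix provide. A minimal sketch in Python/numpy showing the mechanics (the workshop itself uses R's princomp; the toy data here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 observations of 3 variables, two of them strongly correlated
x = rng.normal(size=(100, 1))
data = np.hstack([x,
                  x + 0.1 * rng.normal(size=(100, 1)),
                  rng.normal(size=(100, 1))])

# PCA on the correlation matrix = PCA of the standardized data
z = (data - data.mean(axis=0)) / data.std(axis=0)
corr = np.cov(z, rowvar=False)

# Eigen-decomposition; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(corr)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # largest first

# Each coefficient vector a_i has unit length: a_i1^2 + ... + a_in^2 = 1
assert np.allclose(np.linalg.norm(eigvecs, axis=0), 1.0)

# Scores Z1, Z2, Z3: successive components are uncorrelated by construction
scores = z @ eigvecs
```

The variance of each score column equals its eigenvalue, and the off-diagonal covariances between score columns are (numerically) zero, which is the zero-correlation condition described above.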

  • PCA in R: princomp(dataMatrix, cor=TRUE/FALSE) (stats package)

    dataMatrix is the data matrix of predictor variables; you will assign the results to an object once the PCs have been calculated

    cor defines whether the PCs should be calculated using the correlation or the covariance matrix (derived within the function from the data)

    You tend to use the covariance matrix when the variable scales are similar, and the correlation matrix when variables are on different scales

    The correlation matrix standardizes the data before the PCs are calculated, removing the effect of different units (note that princomp's default is the covariance matrix, cor=FALSE)

    PCA in R

  • PCA in R

  • Loadings (the eigenvectors) – these are the correlations between the original predictor variables and the principal components

    They identify which of the original variables are driving each principal component

    Example:

    Comp.1 – is negatively related to Murder, Assault, and Rape

    Comp.2 – is negatively related to UrbanPop

    PCA in R

  • Scores – these are the calculated principal components Z1, Z2, …, Zn

    These are the values we plot to make inferences

    PCA in R

  • Variance – the summary of the output displays the variance explained by each principal component

    This identifies how much weight you should put on each principal component

    Example:

    Comp.1 – 62%

    Comp.2 – 25%

    Comp.3 – 9%

    Comp.4 – 4%

    (Each proportion is that component's eigenvalue divided by the sum of all the eigenvalues; for a correlation-matrix PCA that sum equals the number of PCs)

    PCA in R
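As a quick check on the arithmetic, the percentages can be reproduced from the eigenvalues directly. The eigenvalues below are hypothetical, chosen so that a 4-variable correlation-matrix PCA gives roughly the 62/25/9/4% split quoted above:

```python
import numpy as np

# Hypothetical eigenvalues from a 4-variable correlation-matrix PCA
# (for a correlation matrix the eigenvalues sum to the number of variables)
eigvals = np.array([2.48, 1.00, 0.36, 0.16])

prop_var = eigvals / eigvals.sum()  # variance explained per component
cum_var = np.cumsum(prop_var)       # cumulative variance explained
```

Here the first two components together carry about 87% of the variation, which is why inferences are usually based on just the first two or three.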

  • The biplot shows the data points by their Comp.1 and Comp.2 scores (displays row names)

    The direction of the arrows (+/–) indicates the trend of the points (towards the arrow indicates more of that variable)

    If vector arrows are perpendicular then the variables are not correlated

    If your original variables do not have some level of correlation then PCA will NOT work for your analysis – i.e. you won't learn anything!

    PCA in R - Biplot
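That warning can be turned into a quick pre-PCA screen: look at the correlation matrix before rotating anything. This is a generic sketch in Python (the worth_a_pca helper name and the 0.3 cut-off are my own illustrative choices, not from the workshop):

```python
import numpy as np

def worth_a_pca(data, threshold=0.3):
    """Quick screen before a PCA: is any pair of variables correlated?

    PCA summarizes shared variation; if no pair of variables reaches
    even a modest |r|, the rotation has nothing to compress and you
    won't learn anything from it. The 0.3 threshold is an arbitrary
    rule of thumb for this sketch.
    """
    r = np.corrcoef(data, rowvar=False)
    off_diag = r[~np.eye(r.shape[0], dtype=bool)]  # drop the 1.0 diagonal
    return bool(np.max(np.abs(off_diag)) >= threshold)

# Two strongly related columns -> PCA is worthwhile
related = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9],
                    [4.0, 8.2], [5.0, 9.8]])

# Two exactly uncorrelated columns -> PCA adds nothing
uncorrelated = np.array([[1.0, 1.0], [2.0, -1.0], [3.0, 1.0],
                         [4.0, -1.0], [5.0, 1.0]])
```

In practice you would run this kind of check (or simply inspect the correlation matrix) before committing to a PCA-based analysis.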

  • WORK PERIOD 11:30 – 1:00

    Follow the Workbook Examples for the Analyses You are Interested In. Any questions?

  • Statistics Toolbox – planning your analysis path:

    What is the goal of my analysis?
    … to characterize my data.
    … to find if there is a significant difference between my groups.
    … to see what predictor conditions are associated with my groups.

    What kind of data do I have to answer my research question?
    … continuous or discrete?
    … binary data?

    How many variables do I want to include in my analysis?
    … multiple response variables?
    … single treatment or multiple treatments?

    Does my data meet the analysis assumptions?
    … normally distributed?
    … equal variances?

  • Thank You for Attending the Stats Workshop

    If you have any further questions please feel free to contact me

    Flow Cytometry Core Facility

    LKG Consulting
    Email: [email protected]
    Website: www.consultinglkg.com