treatment comparisons
Treatment Comparisons
ANOVA can determine if there are differences among the treatments, but what is the nature of those differences?
Are the treatments measured on a continuous scale?
– Look at response surfaces (linear regression, polynomials)
Is there an underlying structure to the treatments?
– Compare groups of treatments using orthogonal contrasts or a limited number of preplanned mean comparison tests
– Use simultaneous confidence intervals on preplanned comparisons
Are the treatments unstructured?
– Use appropriate multiple comparison tests (today’s topic)
Variety Trials
In a breeding program, you need to examine large numbers of selections and then narrow to the best
In the early stages, selection is based on single plants or single rows of related plants. Seed and space are limited, so it is difficult to have replication
When numbers have been reduced and there is sufficient seed, you can conduct replicated yield trials and you want to be able to “pick the winner”
Comparison of Means
Pairwise Comparisons
– Least Significant Difference (LSD)
Simultaneous Confidence Intervals
– Tukey’s Honestly Significant Difference (HSD)
– Dunnett Test (making all comparisons to a control)
  • May be a one-sided or two-sided test
– Bonferroni Inequality
– Scheffé’s Test – can be used for unplanned comparisons
Other Multiple Comparison Tests - “Data Snooping”
– Fisher’s Protected LSD (FPLSD)
– Student-Newman-Keuls test (SNK)
– Waller and Duncan’s Bayes LSD (BLSD)
– False Discovery Rate Procedure
Often misused - intended to be used only for data from experiments with unstructured treatments
Multiple Comparison Tests
Fixed Range Tests – a constant value is used for all comparisons
– Application
  • Hypothesis Tests
  • Confidence Intervals
Multiple Range Tests – values used for comparison vary across a range of means
– Application
  • Hypothesis Tests
Type I vs Type II Errors
Type I error - saying something is different when it is really the same (false positive) (Paranoia)
– the rate at which this type of error is made is the significance level
Type II error - saying something is the same when it is really different (false negative) (Sloth)
– the probability of committing this type of error is designated $\beta$
– the probability that a comparison procedure will pick up a real difference is called the power of the test and is equal to $1-\beta$
Type I and Type II error rates are inversely related to each other
For a given Type I error rate, the rate of Type II error depends on
– sample size
– variance
– true differences among means
Nobody likes to be wrong...
Protection against Type I is choosing a significance level
Protection against Type II is a little harder because
– it depends on the true magnitude of the difference which is unknown
– choose a test with sufficiently high power
Reasons for not using LSD to make all possible comparisons
– the chance for a Type I error increases dramatically as the number of treatments increases
Pairwise Comparisons
Making all possible pairwise comparisons among t treatments
– # of comparisons: $\binom{t}{2} = \frac{t!}{2!\,(t-2)!} = \frac{t(t-1)}{2}$
If you have 10 treatments and want to look at all possible pairwise comparisons
– that would be t(t-1)/2 or 10(9)/2 = 45
– that’s quite a few more than t-1 df = 9
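As a quick check of how fast the number of pairs grows relative to the treatment degrees of freedom, here is a minimal sketch. The function name `n_pairwise` is made up for illustration; `math.comb` is the standard-library binomial coefficient.

```python
from math import comb

def n_pairwise(t: int) -> int:
    """Number of distinct pairs among t treatments, t(t-1)/2."""
    return t * (t - 1) // 2

# Compare the pair count with the treatment degrees of freedom (t - 1).
for t in (3, 6, 10):
    # The closed form should agree with the binomial coefficient C(t, 2).
    assert n_pairwise(t) == comb(t, 2)
    print(f"t = {t:2d}: {n_pairwise(t):2d} pairwise comparisons, {t - 1} treatment df")
```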
Comparisonwise vs Experimentwise Error
Comparisonwise error rate ($\alpha = \alpha_C$)
– measures the proportion of all differences that are expected to be declared real when they are not
Experimentwise error rate ($\alpha_E$)
– the risk of making at least one Type I error among the set (family) of comparisons in the experiment
– measures the proportion of experiments in which one or more differences are falsely declared to be significant
– the probability of being wrong increases as the number of means being compared increases
– also called familywise error rate (FWE)
Experimentwise error rate ($\alpha_E$)
Probability of no Type I errors = $(1-\alpha_C)^x$, where $\alpha_C$ is the comparisonwise error rate and x = number of pairwise comparisons
Max x = t(t-1)/2, where t = number of treatments
Probability of at least one Type I error: $\alpha_E = 1-(1-\alpha_C)^x$
– if t = 10: Max x = 45, $\alpha_E = 1-(1-0.05)^{45} \approx 90\%$
Comparisonwise vs Experimentwise Error
Dunn-Šidák Method: fix $\alpha_C$ to obtain the desired $\alpha_E$
Comparisonwise error rate: $\alpha_C = 1-(1-\alpha_E)^{1/x}$
– if t = 10: $\alpha_C = 1-(1-0.05)^{1/45} \approx 0.001$
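The two error-rate formulas are inverses of each other, which a few lines of arithmetic can confirm. The sketch below assumes independent comparisons, as the formulas themselves do; the function names are illustrative only.

```python
def experimentwise_rate(alpha_c: float, x: int) -> float:
    """P(at least one Type I error) over x comparisons, each run at alpha_c."""
    return 1.0 - (1.0 - alpha_c) ** x

def sidak_comparisonwise_rate(alpha_e: float, x: int) -> float:
    """Dunn-Sidak: per-comparison rate that yields experimentwise rate alpha_e."""
    return 1.0 - (1.0 - alpha_e) ** (1.0 / x)

t = 10
x = t * (t - 1) // 2                       # 45 possible pairwise comparisons

alpha_e = experimentwise_rate(0.05, x)     # every comparison tested at 5%
alpha_c = sidak_comparisonwise_rate(0.05, x)

print(f"x = {x}")
print(f"experimentwise rate when alpha_C = 0.05: {alpha_e:.3f}")
print(f"alpha_C needed for alpha_E = 0.05:       {alpha_c:.5f}")
```

With ten treatments, testing every pair at 5% gives roughly a 90% chance of at least one false positive, while holding the experimentwise rate at 5% requires each comparison to be tested at about 0.1%.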
Least Significant Difference
Calculating a t for testing the difference between two means:
$t_{calc} = (\bar{Y}_1 - \bar{Y}_2) / s_{\bar{Y}_1 - \bar{Y}_2}$
– Any difference for which $t_{calc} > t_\alpha$ would be declared significant
Further, $t_\alpha \, s_{\bar{Y}_1 - \bar{Y}_2}$ is the smallest difference for which significance would be declared
– Therefore $LSD = t_\alpha \, s_{\bar{Y}_1 - \bar{Y}_2}$
– For equal replication, where r is the number of observations forming each mean: $LSD = t_\alpha \sqrt{2\,MSE/r}$
Do’s and Don’ts of using LSD
LSD may be a valid test when
– Making a limited number of comparisons planned in advance of seeing the data
  • Comparing each treatment with the control*
  • Comparing adjacent ranked means
Unless the F test for treatments is significant**, the LSD should not be used for
– Making all possible pairwise comparisons
– Making more comparisons than df for treatments
* Dunnett’s test would give better control of experimentwise error
** Some would say that LSD should never be used unless the F test from ANOVA is significant
Pick the Winner
A plant breeder wanted to measure resistance to stem rust for six wheat varieties
– planted 5 seeds of each variety in each of four pots
– placed the 24 pots randomly on a greenhouse bench
– inoculated with stem rust
– measured seed yield per pot at maturity
Ranked Mean Yields (g/pot)
Variety   Rank   Mean Yield ($\bar{Y}_i$)   Difference ($\bar{Y}_{i-1} - \bar{Y}_i$)
F         1      95.3
D         2      94.0                       1.3
E         3      75.0                       19.0
B         4      69.0                       6.0
A         5      50.3                       18.7
C         6      24.0                       26.3
ANOVA
Source     df    MS          F
Variety    5     2,976.44    24.80**
Error      18    120.00
Compute LSD at 5% and 1%
$LSD_{0.05} = t_{0.05,\,18\,df}\sqrt{2\,MSE/r} = 2.101\sqrt{2(120)/4} = 16.27$
$LSD_{0.01} = t_{0.01,\,18\,df}\sqrt{2\,MSE/r} = 2.878\sqrt{2(120)/4} = 22.29$
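The two LSD values can be reproduced with a short calculation. This is only a sketch: the critical t values 2.101 and 2.878 are copied from the slide rather than computed, and the helper name `lsd` is made up for illustration.

```python
from math import sqrt

def lsd(t_crit: float, mse: float, r: int) -> float:
    """Least significant difference for equal replication: t * sqrt(2*MSE/r)."""
    return t_crit * sqrt(2.0 * mse / r)

MSE = 120.0   # error mean square from the ANOVA table (18 df)
R = 4         # pots (replicates) per variety

lsd_05 = lsd(2.101, MSE, R)   # two-sided t at alpha = 0.05, 18 df (from the slide)
lsd_01 = lsd(2.878, MSE, R)   # two-sided t at alpha = 0.01, 18 df (from the slide)

print(f"LSD(0.05) = {lsd_05:.2f} g/pot")
print(f"LSD(0.01) = {lsd_01:.2f} g/pot")
```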
Back to the data...
Variety   Rank   Mean Yield ($\bar{Y}_i$)   Difference ($\bar{Y}_{i-1} - \bar{Y}_i$)
F         1      95.3
D         2      94.0                       1.3
E         3      75.0                       19.0*
B         4      69.0                       6.0
A         5      50.3                       18.7*
C         6      24.0                       26.3**
LSD ($\alpha$ = 0.05) = 16.27
LSD ($\alpha$ = 0.01) = 22.29
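The starred table compares each adjacent difference with the two LSD values. The sketch below shows that bookkeeping, using the means and LSD values from the slides; the `flag` helper and the tuple layout are illustrative choices, not part of the original material.

```python
# Variety means (g/pot) and LSD values taken from the slides.
means = {"F": 95.3, "D": 94.0, "E": 75.0, "B": 69.0, "A": 50.3, "C": 24.0}
LSD_05, LSD_01 = 16.27, 22.29

def flag(diff: float) -> str:
    """'**' if diff exceeds the 1% LSD, '*' if it exceeds the 5% LSD, else ''."""
    if diff > LSD_01:
        return "**"
    if diff > LSD_05:
        return "*"
    return ""

# Rank from highest to lowest yield, then compare each mean with the next one.
ranked = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
adjacent = []
for (v_hi, m_hi), (v_lo, m_lo) in zip(ranked, ranked[1:]):
    diff = round(m_hi - m_lo, 1)
    adjacent.append((v_hi, v_lo, diff, flag(diff)))

for v_hi, v_lo, diff, stars in adjacent:
    print(f"{v_hi} - {v_lo}: {diff:5.1f}{stars}")
```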
Fisher’s protected LSD (FPLSD)
Uses comparisonwise error rate
Computed just like LSD but you don’t use it unless the F for treatments tests significant
$LSD = t_\alpha \sqrt{2\,MSE/r}$
So in our example data, any difference between means that is greater than 16.27 is declared to be significant
Tukey’s Honestly Significant Difference (HSD)
Uses an experimentwise error rate
From a table of Studentized range values (see handout), select a value of $Q_\alpha$ which depends on p (the number of means) and v (error df)
Compute: $HSD = Q_{\alpha,p,v}\sqrt{MSE/r}$
For any pair of means, if the difference is greater than HSD, it is significant
Use the Tukey-Kramer test with unequal sample size: $HSD = Q_{\alpha,p,v}\sqrt{\frac{MSE}{2}\left(\frac{1}{r_1}+\frac{1}{r_2}\right)}$
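A minimal sketch of both formulas follows. It assumes Q = 4.49 for p = 6 means and v = 18 error df (the value tabulated in the slides' SNK example) together with MSE = 120 and r = 4 from the wheat example; the unequal-replication case (r1 = 4, r2 = 3) is hypothetical and included only to exercise the Tukey-Kramer form.

```python
from math import sqrt

def tukey_hsd(q: float, mse: float, r: int) -> float:
    """HSD for equal replication: Q * sqrt(MSE / r)."""
    return q * sqrt(mse / r)

def tukey_kramer(q: float, mse: float, r1: int, r2: int) -> float:
    """Tukey-Kramer HSD for a pair of means with r1 and r2 observations."""
    return q * sqrt(mse / 2.0 * (1.0 / r1 + 1.0 / r2))

Q_6_18 = 4.49          # studentized range value for p = 6, v = 18 (from the slides)
MSE = 120.0

hsd = tukey_hsd(Q_6_18, MSE, 4)
hsd_unequal = tukey_kramer(Q_6_18, MSE, 4, 3)   # hypothetical: one mean lost a pot

print(f"HSD (r = 4)                   = {hsd:.2f}")
print(f"Tukey-Kramer (r1 = 4, r2 = 3) = {hsd_unequal:.2f}")
```

With equal replication the Tukey-Kramer expression reduces to the ordinary HSD, so a single function could serve both cases.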
Student-Newman-Keuls Test (SNK)
Rank the means from high to low (or low to high)
Compute t-1 significant differences, $SNK_j$, using the studentized values for the HSD:
$SNK_j = Q_{\alpha,k,v}\sqrt{MSE/r}$
where j = 1, 2, ..., t-1; k = 2, 3, ..., t; k = number of means in the range
Comparisons are made sequentially, beginning with the largest range and proceeding to the smallest
Uses experimentwise $\alpha$ for the extremes (Tukey’s HSD)
Uses comparisonwise $\alpha$ for adjacent means (LSD)
Rank          1      2      3      4      5      6
$\bar{Y}_i$   95.3   94.0   75.0   69.0   50.3   24.0
Student-Newman-Keuls Test (SNK)
Begin with the largest range (highest vs lowest, k = t)
– If less than SNK, stop! No comparisons are significant
– If greater than SNK, make comparisons for k = t-1
Continue in a stepwise manner (k = t-2, k = t-3, etc.)
When a comparison is not significant, all subsequent comparisons within that range are not significant
k=6   M1–M6
k=5   M1–M5   M2–M6
k=4   M1–M4   M2–M5   M3–M6
k=3   M1–M3   M2–M4   M3–M5   M4–M6
k=2   M1–M2   M2–M3   M3–M4   M4–M5   M5–M6
Using SNK with example data:
Variety   Rank   Mean Yield ($\bar{Y}_i$)
F         1      95.3 a
D         2      94.0 a
E         3      75.0 b
B         4      69.0 b
A         5      50.3 c
C         6      24.0 d
18 df for error
$se = \sqrt{MSE/r} = \sqrt{120/4} = 5.477$
SNK = Q*se
k      2       3       4       5       6
Q      2.97    3.61    4.00    4.28    4.49
SNK    16.27   19.77   21.91   23.44   24.59
5 + 4 + 3 + 2 + 1 = 15 comparisons
Compare F and E (k = 3): 95.3 – 75.0 = 20.3; 20.3 > 19.77, difference is significant
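The stepwise logic can be sketched as a loop over range widths, from widest to narrowest. This is one possible reading of the procedure, not a reference implementation: the Q values are copied from the slides' table rather than computed, and the function name `snk` and its return format are illustrative.

```python
from math import sqrt

# Ranked means and studentized range values Q(k, 18 df) copied from the slides.
varieties = ["F", "D", "E", "B", "A", "C"]
means = [95.3, 94.0, 75.0, 69.0, 50.3, 24.0]      # sorted high to low
Q = {2: 2.97, 3: 3.61, 4: 4.00, 5: 4.28, 6: 4.49}
se = sqrt(120.0 / 4)                              # sqrt(MSE / r)

def snk(means, q_table, se):
    """Return (significant, nonsignificant) lists of index ranges (i, j), i < j."""
    t = len(means)
    significant, nonsignificant = [], []
    for k in range(t, 1, -1):                     # widest range first
        for i in range(t - k + 1):
            j = i + k - 1
            # A range nested inside a non-significant range is not tested.
            if any(a <= i and j <= b for a, b in nonsignificant):
                continue
            if means[i] - means[j] > q_table[k] * se:
                significant.append((i, j))
            else:
                nonsignificant.append((i, j))
    return significant, nonsignificant

sig, nonsig = snk(means, Q, se)
for i, j in nonsig:
    print(f"{varieties[i]} and {varieties[j]} do not differ significantly")
print(f"{len(sig)} of 15 comparisons declared significant")
```

For the wheat data only F vs D and E vs B fail to reach significance, which matches the letter groupings a, a, b, b, c, d in the table.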
Waller-Duncan Bayes LSD (BLSD)
Do ANOVA and compute F (MST/MSE) with q (treatment df) and f (error df)
Choose error weight ratio, k
– k = 100 is comparable to a 5% significance level
– k = 500 is comparable to a 1% test
Obtain $t_b$ from Minimum-Average-Risk table (Petersen A7)
Compute $BLSD = t_b\sqrt{2\,MSE/r} = 1.93\sqrt{2(120)/4} = 14.95$
Any difference greater than BLSD is significant
Does not provide complete control of experimentwise Type I error
Reduces Type II error
Bonferroni Correction
Theory: $\alpha_E \le X \cdot \alpha_C$, where X = number of pairwise comparisons
To get critical probability value for significance: $\alpha_C = \alpha_E / X$, where $\alpha_E$ = maximum desired experimentwise error rate
Alternatively, multiply observed probability value by X and compare to $\alpha_E$ (values >1 are set to 1)
Advantages
– simple
– strict control of Type I error
Disadvantage
– very conservative, low power to detect differences
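Both forms of the correction (shrinking the threshold, or inflating the P values) can be sketched in a few lines. The P values below are hypothetical and chosen only for illustration, as are the function names.

```python
def bonferroni_threshold(alpha_e: float, x: int) -> float:
    """Per-comparison critical probability: alpha_E / X."""
    return alpha_e / x

def bonferroni_adjust(p_values):
    """Multiply each P value by the number of comparisons, capping at 1."""
    x = len(p_values)
    return [min(1.0, p * x) for p in p_values]

# Threshold for all 45 pairwise comparisons among 10 treatments.
alpha_c = bonferroni_threshold(0.05, 45)
print(f"alpha_C for X = 45: {alpha_c:.5f}")

# Experimentwise rate actually achieved if the 45 tests were independent.
realized = 1.0 - (1.0 - alpha_c) ** 45
print(f"realized alpha_E:   {realized:.4f}")

# Hypothetical observed P values from X = 5 comparisons.
p_obs = [0.0004, 0.012, 0.03, 0.2, 0.6]
adjusted = bonferroni_adjust(p_obs)
for p, p_adj in zip(p_obs, adjusted):
    verdict = "significant" if p_adj <= 0.05 else "not significant"
    print(f"P = {p:<6} adjusted = {p_adj:.3f}  {verdict}")
```

The realized experimentwise rate falls slightly below 0.05, which is the sense in which the correction is conservative.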
False Discovery Rate
[Figure: “False Positive Procedure”. Bars of probability (y-axis, 0.00 to 0.25) against rank i (x-axis, 1 to 21), with the “Reject H0” region marked]
Bars show P values for simple t tests among means
– Largest differences have the smallest P values
Line represents critical P values = (i/X) * $\alpha_E$, for ranks i = 1 to X
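The critical line (i/X) * alpha_E can be turned into a decision rule as follows. This sketch assumes the procedure pictured is the Benjamini-Hochberg step-up rule, which the slides do not state explicitly; the P values are hypothetical and the function name `fdr_reject` is made up for illustration.

```python
def fdr_reject(p_values, alpha_e=0.05):
    """Step-up rule: find the largest rank i with P(i) <= (i/X)*alpha_e and
    reject H0 for every comparison ranked at or below it."""
    x = len(p_values)
    order = sorted(range(x), key=lambda idx: p_values[idx])   # smallest P first
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / x * alpha_e:
            largest_rank = rank
    reject = [False] * x
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= largest_rank
    return reject

# Hypothetical P values from X = 5 comparisons.
p_obs = [0.001, 0.025, 0.028, 0.035, 0.5]
decisions = fdr_reject(p_obs)
for rank, (p, rej) in enumerate(zip(p_obs, decisions), start=1):
    critical = rank / len(p_obs) * 0.05
    print(f"rank {rank}: P = {p:<5} critical = {critical:.2f}  reject H0: {rej}")
```

Under this rule the second comparison is rejected even though its own P value (0.025) exceeds its critical value (0.02), because a higher-ranked comparison falls below the line. That step-up behaviour is what gives the procedure more power than Bonferroni.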
More Options!
Duncan’s New Multiple Range Test
– A multiple range test
– Less conservative than the SNK test
– Used to be popular, but no longer recommended
Dunnett’s Test
– Compare all treatments against a control
– Compare all treatments against the best treatment
– Conservative (controls Type I, not Type II error)
Scheffé’s Method
– Considers all possible contrasts among a set of treatments
– Can be used for comparisons that are not preplanned
– Very conservative!
Picking the Winner
FPLSD test is widely used, and widely abused
BLSD is preferred by some because
– It is a single value and therefore easy to use
– Larger when F indicates that the means are homogeneous and small when means appear to be heterogeneous
The False Discovery Rate (FDR) has nice features
– Good experimentwise Type I error control
– Good power (Type II error control)
– Common in genetic analyses where thousands of comparisons are made
Tukey’s HSD test
– Widely accepted and often recommended by statisticians
– May be too conservative if Type II error has more serious consequences than Type I error