Approximate Randomization tests
February 5th, 2013
Classic t-test
Why AR testing?
• Classic tests often assume that the variable follows a given distribution (Student's t, normal, …)
• This is roughly OK for recall, but not for precision or F-score
• The set of hypotheses that can be tested with non-parametric tests is limited
Illustration
• 30,000 runs, 1000 instances, 500 of class A
• True positives (TP): 400 (stdev: 80)
• False positives (FP): 60 (stdev: 15)
• Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are bounded by 0 and the number of instances.
Definitions
• Recall = correctly predicted A / A in reference = correctly predicted A / constant
  If the number of correctly predicted A (TP) is normal, recall is normal.
• Precision = correctly predicted A / A in system = TP / (TP + FP)
  This is a non-linear combination of TP and FP, so precision is not normal.
• F-score: a non-linear combination of recall and precision, so it is not normal either.
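The illustration and the definitions above can be checked with a small simulation. This is a minimal sketch, not the original experiment: the seed, the `skewness` helper and the choice to leave TP unclipped (so that recall stays exactly linear in TP) are assumptions made here. Skewness near 0 is consistent with a normal shape; a clearly non-zero value is not.

```python
import random
import statistics

def skewness(values):
    """Third standardized moment; close to 0 for a normally distributed sample."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)

def simulate(runs=30_000, n_class_a=500, seed=0):
    """Draw TP and FP from normal distributions and derive recall, precision, F-score."""
    rng = random.Random(seed)
    recall, precision, f_score = [], [], []
    for _ in range(runs):
        tp = rng.gauss(400, 80)  # true positives (assumption: not clipped)
        fp = rng.gauss(60, 15)   # false positives
        if tp <= 0 or fp < 0:    # skip the rare impossible draws
            continue
        r = tp / n_class_a       # recall: TP divided by a constant
        p = tp / (tp + fp)       # precision: non-linear in TP and FP
        recall.append(r)
        precision.append(p)
        f_score.append(2 * p * r / (p + r))
    return recall, precision, f_score

recall, precision, f_score = simulate()
for name, values in [("recall", recall), ("precision", precision), ("F-score", f_score)]:
    print(f"{name:9s} skewness = {skewness(values):+.3f}")
```

Recall should come out roughly symmetric, while precision should show a clear left skew.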
Approximate randomization test
• No assumption on distribution
• Can handle complicated statistics
• Only assumption: independence between shuffled elements
• References:
– Computer Intensive Methods for Testing Hypotheses, Noreen, 1989.
– More accurate tests for the statistical significance of result differences, Yeh, 2000.
Basic idea
• Exact randomization test
           Glass 1   Glass 2   Glass 3   Glass 4
Contents   Polish    Premium   Russian   Budget
Expert     Polish    Premium   Budget    Russian
Exact probability
H0: expert is independent of contents
P(ncorrect ≥ 2) = 7/24 = 0.29
Thus, do not reject H0 because the probability is larger than alpha=0.05.
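The exact probability can be recomputed by enumerating all 4! = 24 orderings of the expert's answers and counting those with at least as many correct glasses as the expert actually had. A minimal sketch; the function names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]

def n_correct(labels, reference):
    """Number of positions where the label matches the reference."""
    return sum(a == b for a, b in zip(labels, reference))

def exact_p_value(labels, reference):
    """Share of all permutations of `labels` that score at least the actual n_correct."""
    actual = n_correct(labels, reference)
    perms = list(permutations(labels))
    n_ge = sum(n_correct(perm, reference) >= actual for perm in perms)
    return Fraction(n_ge, len(perms))

p = exact_p_value(expert, contents)
print(p)  # 7/24
```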
Approximate probability
• The number of permutations is n!, so it grows very quickly with n
• If there are too many permutations to compute, approximate: P = (nge + 1) / (NS + 1)
– nge: number of times the pseudostatistic ≥ the actual statistic
– NS: number of shuffles
– +1: correction for validity
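The approximation can be sketched on the same glass example: instead of enumerating all permutations, draw NS random shuffles and apply P = (nge + 1) / (NS + 1). The default NS, the seed and the function names are assumptions of this sketch.

```python
import random

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]

def n_correct(labels, reference):
    """Number of positions where the label matches the reference."""
    return sum(a == b for a, b in zip(labels, reference))

def approximate_p_value(statistic, labels, reference, n_shuffles=10_000, seed=0):
    """Approximate randomization: P = (nge + 1) / (NS + 1)."""
    rng = random.Random(seed)
    actual = statistic(labels, reference)
    shuffled = list(labels)
    n_ge = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # one random permutation of the labels
        if statistic(shuffled, reference) >= actual:
            n_ge += 1          # pseudostatistic >= actual statistic
    return (n_ge + 1) / (n_shuffles + 1)

print(approximate_p_value(n_correct, expert, contents))
```

With enough shuffles the result should land close to the exact value of 7/24.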
DIFFERENT SETUPS
Translation to instances
• Each glass is an instance
• Contents and expert are two labeling systems
• Contents has an accuracy of 100%, expert has an accuracy of 50%
• The statistic is precision, F-score, recall, … instead of accuracy
Stratified shuffling
• For labeled instances, it makes no sense to move the class label of one instance to another instance
• Only shuffle labels within an instance, i.e. swap the labels that the two systems assigned to the same instance
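Stratified shuffling can be sketched as follows: for every instance, the labels of the two systems are swapped with probability 0.5, and the difference in the statistic is recomputed. The data, the two-sided comparison via `abs`, and the convention that precision is 0 when a system never predicts the label are assumptions of this sketch, not taken from the script.

```python
import random

def precision(predicted, reference, label):
    """Precision for `label`; defined as 0 when the label is never predicted (assumption)."""
    refs = [r for p, r in zip(predicted, reference) if p == label]
    if not refs:
        return 0.0
    return sum(r == label for r in refs) / len(refs)

def precision_a(predicted, reference):
    return precision(predicted, reference, "A")

def stratified_shuffle(sys1, sys2, rng):
    """Swap the two systems' labels per instance with probability 0.5."""
    out1, out2 = [], []
    for a, b in zip(sys1, sys2):
        if rng.random() < 0.5:
            a, b = b, a
        out1.append(a)
        out2.append(b)
    return out1, out2

def art_p_value(sys1, sys2, reference, statistic, n_shuffles=1000, seed=0):
    """Two-sided approximate randomization test on the difference in `statistic`."""
    rng = random.Random(seed)
    actual = abs(statistic(sys1, reference) - statistic(sys2, reference))
    n_ge = 0
    for _ in range(n_shuffles):
        s1, s2 = stratified_shuffle(sys1, sys2, rng)
        if abs(statistic(s1, reference) - statistic(s2, reference)) >= actual:
            n_ge += 1
    return (n_ge + 1) / (n_shuffles + 1)

# Hypothetical data, for illustration only
reference = ["A", "A", "B", "B", "A", "B"]
system1 = ["A", "A", "B", "A", "A", "B"]
system2 = ["A", "B", "B", "A", "B", "A"]
print(art_p_value(system1, system2, reference, precision_a))
```

Any statistic with the same `(predicted, reference)` signature, such as recall or F-score, can be passed in place of `precision_a`.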
MBT
• Assumption of independence between instances
• Shuffle per sentence rather than per token
Token   System 1   System 2
This    DT         NNS
is      VBZ        VB
nice    JJ         RB
.       .          .
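Because tags within a sentence are not independent, the unit that is swapped is the whole sentence. A minimal sketch, assuming each system's output is stored as a list of sentences, each a list of tags; the second sentence is hypothetical data added for illustration.

```python
import random

def shuffle_per_sentence(sys1_sentences, sys2_sentences, rng):
    """Swap the complete tag sequence of a sentence between systems with probability 0.5."""
    out1, out2 = [], []
    for tags1, tags2 in zip(sys1_sentences, sys2_sentences):
        if rng.random() < 0.5:
            tags1, tags2 = tags2, tags1
        out1.append(list(tags1))
        out2.append(list(tags2))
    return out1, out2

# First sentence: "This is nice ." from the slide; second sentence is hypothetical
system1 = [["DT", "VBZ", "JJ", "."], ["PRP", "VBZ", "."]]
system2 = [["NNS", "VB", "RB", "."], ["PRP", "VBP", "."]]

shuffled1, shuffled2 = shuffle_per_sentence(system1, system2, random.Random(0))
print(shuffled1)
print(shuffled2)
```

Tags from different systems are never mixed inside one sentence; the statistic is then computed over all tokens of the shuffled outputs.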
Term extraction
• Shuffling extracted terms between output of two term extraction systems
Reference System 1 System 2
happy happy sad
good good
lively happy
angry
Script
• http://www.clips.ua.ac.be/~vincent/software.html#art
• http://www.clips.ua.ac.be/scripts/art
• Options:
– Exact and approximate randomization tests
– Instance based, also for MBT
– Term extraction based
– Stratified shuffling
– Two-sided / one-sided (check code!)
Remarks on usage
• It makes no sense to shuffle if exact randomization can be computed
• The value of p depends on NS: the smallest attainable p is 1 / (NS + 1), so the larger NS, the lower p can be
• Validity check
– Sign test
– Re-test: to alleviate bad randomization
Sign test
• Can be compared with P for accuracy
• H0: correctness is independent of system, i.e. P(green) = 0.5
• Binomial test
(Figure: System 1 vs. System 2)
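The sign test can be sketched as a binomial test on the instances where exactly one system is correct. Dropping ties (both correct or both wrong) and reporting a two-sided p-value are assumptions of this sketch; the data are hypothetical.

```python
from math import comb

def sign_test(sys1_correct, sys2_correct):
    """Two-sided sign test; inputs are per-instance booleans (was the system correct?)."""
    wins1 = sum(a and not b for a, b in zip(sys1_correct, sys2_correct))
    wins2 = sum(b and not a for a, b in zip(sys1_correct, sys2_correct))
    n = wins1 + wins2        # ties are discarded
    k = max(wins1, wins2)
    # P(X >= k) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical data: system 1 alone correct 8 times, system 2 alone correct 2 times, 5 ties
sys1_correct = [True] * 8 + [False] * 2 + [True] * 5
sys2_correct = [False] * 8 + [True] * 2 + [True] * 5
print(sign_test(sys1_correct, sys2_correct))  # 0.109375
```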
Interpretation (1)
Reference   System 1   System 2
A           A          B
B           A          B
C           A          B
How much do these two systems differ based on precision for the A label?
- Maximally
- Intermediate
- Minimally
Interpretation (2)
Labels per instance (System 1's label, then System 2's) and precision for A
A    B    C    System 1   System 2   Δ
AB   AB   AB   1/3        0          1/3
BA   AB   AB   0          1          -1
AB   AB   BA   1/2        0          1/2
BA   BA   AB   0          1/2        -1/2
BA   AB   BA   0          1/2        -1/2
AB   BA   BA   1          0          1
BA   BA   BA   0          1/3        -1/3
AB   BA   AB   1/2        0          1/2
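The precision differences for this example can be recomputed by enumerating all 2^3 = 8 per-instance swaps. A minimal sketch; it assumes precision is 0 when a system never predicts A, and the one-sided and two-sided proportions at the end are an illustrative way to read the result.

```python
from fractions import Fraction
from itertools import product

reference = ["A", "B", "C"]
system1 = ["A", "A", "A"]
system2 = ["B", "B", "B"]

def precision_a(predicted, reference):
    """Precision for label A; 0 when A is never predicted (assumption)."""
    hits = [ref == "A" for pred, ref in zip(predicted, reference) if pred == "A"]
    return Fraction(sum(hits), len(hits)) if hits else Fraction(0)

def all_deltas(sys1, sys2, reference):
    """Precision difference (System 1 minus System 2) for every per-instance swap pattern."""
    deltas = []
    for swaps in product([False, True], repeat=len(reference)):
        s1 = [b if swap else a for a, b, swap in zip(sys1, sys2, swaps)]
        s2 = [a if swap else b for a, b, swap in zip(sys1, sys2, swaps)]
        deltas.append(precision_a(s1, reference) - precision_a(s2, reference))
    return deltas

deltas = all_deltas(system1, system2, reference)
actual = deltas[0]  # the first pattern swaps nothing, i.e. the observed outputs
one_sided = Fraction(sum(d >= actual for d in deltas), len(deltas))
two_sided = Fraction(sum(abs(d) >= abs(actual) for d in deltas), len(deltas))
print([str(d) for d in deltas])
print(one_sided, two_sided)
```

Under these assumptions, 4 of the 8 arrangements give a difference at least as large as the observed 1/3, and all 8 do so in absolute value.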
Conclusion
• Approximate randomization testing can be used for many applications.
• The basic idea is to assess how (im)probable the actual difference between two systems is when all possible permutations of the outputs are evaluated.
• The difference can be computed in many ways, as long as the shuffled elements are independent.