Approximate Randomization tests

February 5th, 2013

Classic t-test

Why AR testing?

• Classic tests often assume a given distribution (student t, normal, …) of the variable

• This is approximately valid for recall, but not for precision or F-score

• The range of hypotheses that can be tested with non-parametric tests is limited

Illustration

• 30,000 runs, 1000 instances, 500 of class A

• True positives (TP): 400 (stdev: 80)

• False positives (FP): 60 (stdev: 15)

• Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are bounded by 0 and the number of instances.

Definitions

• Recall = correctly predicted A / A in reference = correctly predicted A / constant. If the number of correctly predicted A is normal, recall is normal.

• Precision = correctly predicted A / A in system. "A in system" is TP + FP, so precision is a non-linear combination of TP and FP. Precision is not normal.

• F-score: non-linear combination of recall and precision. Not normal.
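The effect of these definitions can be checked with a small simulation. The sketch below is an illustration, not the code behind the slides: it draws TP and FP from the normal distributions of the illustration (without clipping them to the valid range, which is a simplifying assumption) and measures the skewness of each metric. The helper names `skewness` and `simulate` are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Mean cubed z-score; close to 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / stdev) ** 3 for v in values])


def simulate(runs=30_000, n_class_a=500, seed=0):
    """Draw TP and FP per run and derive recall, precision and F-score."""
    rng = random.Random(seed)
    recall, precision, f_score = [], [], []
    for _ in range(runs):
        tp = rng.gauss(400, 80)  # true positives, parameters from the illustration
        fp = rng.gauss(60, 15)   # false positives, parameters from the illustration
        r = tp / n_class_a       # linear in TP, so it stays normal
        p = tp / (tp + fp)       # ratio of TP to TP + FP, not linear
        recall.append(r)
        precision.append(p)
        f_score.append(2 * p * r / (p + r))
    return recall, precision, f_score


if __name__ == "__main__":
    for name, values in zip(("recall", "precision", "F-score"), simulate()):
        print(f"{name:9s} skewness = {skewness(values):+.3f}")
```

Recall should come out with skewness near zero, while precision shows a clear negative skew, which is what makes a normality assumption questionable for it.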

Approximate randomization test

• No assumption on distribution

• Can handle complicated statistics

• Only assumption: independence between shuffled elements

• References:
  – Computer Intensive Methods for Testing Hypotheses, Noreen, 1989.
  – More accurate tests for the statistical significance of results differences, Yeh, 2000.

Basic idea

• Exact randomization test

          Glass 1  Glass 2  Glass 3  Glass 4
Contents  Polish   Premium  Russian  Budget
Expert    Polish   Premium  Budget   Russian

Exact probability

H0: expert is independent of contents

P(ncorrect ≥ 2) = 7/24 = 0.29

Thus, do not reject H0 because the probability is larger than alpha=0.05.
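The exact probability can be reproduced by enumerating all 4! = 24 orderings of the expert's answers and counting those with at least as many correct glasses as the expert actually got. This is a minimal sketch; the function names `n_correct` and `exact_p` are hypothetical and not taken from the script.

```python
from fractions import Fraction
from itertools import permutations

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]


def n_correct(guess, truth):
    """Number of glasses for which the guess matches the contents."""
    return sum(g == t for g, t in zip(guess, truth))


def exact_p(guess, truth):
    """Share of all permutations of the guesses scoring >= the actual score."""
    actual = n_correct(guess, truth)
    perms = list(permutations(guess))
    n_ge = sum(n_correct(perm, truth) >= actual for perm in perms)
    return Fraction(n_ge, len(perms))


p = exact_p(expert, contents)
print(p, round(float(p), 2))  # 7/24, i.e. about 0.29
```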

Approximate probability

• The number of permutations is n!, which grows very quickly with n

• If there are too many permutations to compute, approximate: P = (nge + 1) / (NS + 1)
  – nge: number of times the pseudostatistic ≥ the actual statistic
  – NS: number of shuffles
  – +1: correction for validity
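Applied to the glass example, the approximation replaces the full enumeration by NS random shuffles. The sketch below assumes a uniform random shuffle of the expert's answers and a fixed seed for repeatability; `approximate_p` is a hypothetical name.

```python
import random

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]


def approximate_p(guess, truth, n_shuffles=9999, seed=0):
    """Approximate randomization: P = (nge + 1) / (NS + 1)."""
    rng = random.Random(seed)
    actual = sum(g == t for g, t in zip(guess, truth))
    shuffled = list(guess)
    n_ge = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # one random permutation of the answers
        pseudo = sum(g == t for g, t in zip(shuffled, truth))
        if pseudo >= actual:
            n_ge += 1
    return (n_ge + 1) / (n_shuffles + 1)


print(approximate_p(expert, contents))  # should be close to 7/24
```

With only 24 permutations the exact test is of course preferable here; the example only shows that the approximation converges to the same value.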

DIFFERENT SETUPS

Translation to instances

• Each glass is an instance

• Contents and expert are two labeling systems

• Contents has an accuracy of 100%, expert has an accuracy of 50%

• Statistic is precision, F-score, recall, … instead of accuracy

Stratified shuffling

• For labeled instances, it makes no sense to move the class label of one instance to another instance

• Only shuffle labels within an instance: swap the labels that the two systems assigned to that same instance
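A stratified shuffle can be sketched as follows: for every instance, the outputs of the two systems are swapped with probability 0.5, and the statistic is recomputed on the shuffled outputs. This is an illustrative sketch under assumptions, not the code of the script: it uses the absolute difference in precision (a two-sided test), defines precision as 0 when a label is never predicted, and the names `precision` and `stratified_ar_test` are hypothetical.

```python
import random


def precision(gold, predicted, label):
    """Precision for `label`; taken to be 0.0 if the label is never predicted."""
    hits = sum(1 for g, p in zip(gold, predicted) if p == label and g == label)
    n_predicted = sum(1 for p in predicted if p == label)
    return hits / n_predicted if n_predicted else 0.0


def stratified_ar_test(gold, sys1, sys2, label, n_shuffles=999, seed=0):
    """Two-sided approximate randomization test with per-instance swaps."""
    rng = random.Random(seed)
    actual = abs(precision(gold, sys1, label) - precision(gold, sys2, label))
    n_ge = 0
    for _ in range(n_shuffles):
        shuf1, shuf2 = [], []
        for a, b in zip(sys1, sys2):
            if rng.random() < 0.5:  # swap the two outputs for this instance
                a, b = b, a
            shuf1.append(a)
            shuf2.append(b)
        pseudo = abs(precision(gold, shuf1, label) - precision(gold, shuf2, label))
        if pseudo >= actual:
            n_ge += 1
    return (n_ge + 1) / (n_shuffles + 1)


gold = ["A"] * 50 + ["B"] * 50
perfect = list(gold)
inverted = ["B"] * 50 + ["A"] * 50
print(stratified_ar_test(gold, perfect, inverted, "A"))  # very small p
print(stratified_ar_test(gold, perfect, perfect, "A"))   # identical systems
```

Any other statistic (recall, F-score) can be substituted for `precision` without changing the shuffling.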

MBT

• Assumption of independence between instances: tokens within a sentence are not independent of each other

• So shuffle per sentence rather than per token

Token  System 1  System 2
This   DT        NNS
is     VBZ       VB
nice   JJ        RB
.      .         .
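Shuffling per sentence means that the complete tag sequences of the two systems are exchanged as one unit. A minimal sketch, assuming each system's output is a list of sentences and each sentence a list of tags; `shuffle_per_sentence` is a hypothetical name.

```python
import random


def shuffle_per_sentence(sents1, sents2, rng):
    """Swap whole tag sequences between two systems, sentence by sentence."""
    out1, out2 = [], []
    for tags1, tags2 in zip(sents1, sents2):
        if rng.random() < 0.5:  # one coin flip per sentence, not per token
            tags1, tags2 = tags2, tags1
        out1.append(tags1)
        out2.append(tags2)
    return out1, out2


system1 = [["DT", "VBZ", "JJ", "."]]
system2 = [["NNS", "VB", "RB", "."]]
shuffled1, shuffled2 = shuffle_per_sentence(system1, system2, random.Random(0))
print(shuffled1, shuffled2)
```

The tags of one sentence therefore never end up split between the two shuffled outputs.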

Term extraction

• Shuffle the extracted terms between the outputs of the two term extraction systems

Reference System 1 System 2

happy happy sad

good good

lively happy

angry

Script

• http://www.clips.ua.ac.be/~vincent/software.html#art

• http://www.clips.ua.ac.be/scripts/art

• Options:
  – Exact and approximate randomization tests
  – Instance based, also for MBT
  – Term extraction based
  – Stratified shuffling
  – Two-sided / one-sided (check code!)

Remarks on usage

• It makes no sense to shuffle if exact randomization can be computed

• The value of p depends on NS. The smallest attainable p is 1 / (NS + 1), so the larger NS, the lower p can be

• Validity check
  – Sign test
  – Re-test: to alleviate bad randomization

Sign test

• Can be compared with P for accuracy

• H0: correctness is independent of system, i.e. P(green) = 0.5

• Binomial test

[Figure: System 1 and System 2]
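The binomial test behind the sign test can be sketched as follows. The sketch assumes the usual sign-test setup: only instances on which exactly one system is correct are counted, and under H0 each of them favours either system with probability 0.5. The two-sided p-value is taken as twice the smaller tail, capped at 1. The function name `sign_test_p` and its arguments are hypothetical.

```python
from math import comb


def sign_test_p(only_sys1_correct, only_sys2_correct):
    """Two-sided binomial p-value for H0: P(system 1 wins) = 0.5."""
    n = only_sys1_correct + only_sys2_correct
    if n == 0:
        return 1.0  # the systems never disagree on correctness
    k = only_sys1_correct
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n  # P(X >= k)
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n  # P(X <= k)
    return min(1.0, 2 * min(lower, upper))


print(sign_test_p(8, 2))  # system 1 alone correct 8 times, system 2 alone 2 times
```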

Interpretation (1)

Reference  System 1  System 2
A          A         B
B          A         B
C          A         B

How much do these two systems differ based on precision for the A label?

- Maximally
- Intermediate
- Minimally

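Interpretation (2)

Each cell under Labels gives the System 1 label followed by the System 2 label for reference label A, B or C.

    Labels         Precision A
A    B    C    System 1  System 2  Δ
AB   AB   AB   1/3       0         1/3
BA   AB   AB   0         1         -1
AB   AB   BA   1/2       0         1/2
BA   BA   AB   0         1/2       -1/2
BA   AB   BA   0         1/2       -1/2
AB   BA   BA   1         0         1
BA   BA   BA   0         1/3       -1/3
AB   BA   AB   1/2       0         1/2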
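With three instances there are only 2^3 = 8 ways to swap the outputs per instance, so the exact test can be enumerated. The sketch below is a hypothetical illustration: it treats precision as 0 when a system never predicts A (an assumption that matches the zeros in the table) and compares every pseudo-difference with the actual difference of 1/3.

```python
from fractions import Fraction
from itertools import product

reference = ["A", "B", "C"]
system1 = ["A", "A", "A"]
system2 = ["B", "B", "B"]


def precision_a(gold, predicted):
    """Precision for label A; taken to be 0 if A is never predicted."""
    gold_of_predicted_a = [g for g, p in zip(gold, predicted) if p == "A"]
    if not gold_of_predicted_a:
        return Fraction(0)
    n_hits = sum(g == "A" for g in gold_of_predicted_a)
    return Fraction(n_hits, len(gold_of_predicted_a))


def all_deltas(gold, sys1, sys2):
    """Precision difference for every per-instance swap pattern."""
    deltas = []
    for swaps in product([False, True], repeat=len(gold)):
        s1 = [b if swap else a for a, b, swap in zip(sys1, sys2, swaps)]
        s2 = [a if swap else b for a, b, swap in zip(sys1, sys2, swaps)]
        deltas.append(precision_a(gold, s1) - precision_a(gold, s2))
    return deltas


deltas = all_deltas(reference, system1, system2)
actual = deltas[0]  # the first pattern contains no swaps
n_ge = sum(abs(d) >= abs(actual) for d in deltas)
p_two_sided = Fraction(n_ge, len(deltas))
print(actual, p_two_sided)
```

Every pseudo-difference is at least as large in absolute value as the actual one, so the two-sided p is 1: under this test the observed difference in precision is the least extreme outcome possible.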

Conclusion

• Approximate randomization testing can be used for many applications.

• The basic idea is to determine how (im)probable the actual difference between two systems is when all possible permutations of the outputs are evaluated.

• The difference can be computed in many ways, as long as the shuffled elements are independent.
