non-parametric tests - main <...

Non-parametric tests


B.H. Robbins Scholars Series

June 24, 2010

1 / 30


Outline

One Sample Test: Wilcoxon Signed-Rank

Two Sample Test: Wilcoxon–Mann–Whitney

Confidence Intervals

Summary

2 / 30


IntroductionI T-tests: tests for the means of continuous data

I One sample H0 : µ = µ0 versus HA : µ 6= µ0

I Two sample H0 : µ1 − µ2 = 0 versus HA : µ1 − µ2 6= 0

I Underlying these tests is the assumption that the data arisefrom a normal distribution

I T-tests do not actually require normally distributed data toperform reasonably well in most circumstances

I Parametric methods: assume the data arise from adistribution described by a few parameters (Normaldistribution with mean µ and variance σ2).

I Nonparametric methods: do not make parametric assumptions(most often based on ranks as opposed to raw values)

I We discuss non-parametric alternatives to the one and twosample t-tests.

3 / 30


Examples of when the parametric t-test goes wrong

I Extreme outliersI Example: t-test comparing two sets of measurements

I Sample 1: 1 2 3 4 5 6 7 8 9 10I Sample 2: 7 8 9 10 11 12 13 14 15 16 17 18 19 20

I Sample averages: 5.5 and 13.5, T-test p-value p = 0.000019I Example: t-test comparing two sets of measurements

I Sample 1: 1 2 3 4 5 6 7 8 9 10I Sample 2: 7 8 9 10 11 12 13 14 15 16 17 18 19 20 200

I Sample averages: 5.5 and 25.9, T-test p-value p = 0.12

4 / 30



I T-statistic

t =x1 − x2

s√

1n1

+ 1n2

I For two sample tests

s2 =(n1 − 1)s2

1 + (n2 − 1)s22

n1 + n2 − 2

I In the first datasetI s2

1 = 9.2, s22 = 17.5

I In the second datasetI s2

1 = 9.2, s22 = 2335

5 / 30



I Upper detection limitsI Example: Fecal calprotectin was being evaluated as a possible

biomarker of Crohn’s disease severityI Median can be calculated (mean cannot)

6 / 30


Feca

l Cal

prot

ectin

No or Mild Activity Moderate or Severe Activity

0

500

1000

1500

2000

2500

n = 8 n = 18

Above Detection Limit

7 / 30


When to use non-parametric methods

I With correct assumptions (e.g., normal distribution),parametric methods will be more efficient / powerful thannon-parametric methods but often not as much as you mightthink1

I If the normality assumption grossly violated, nonparametrictests can be much more efficient and powerful than thecorresponding parametric test

I Non-parametric methods provide a well-foundationed way todeal with circumstance in which parametric methods performpoorly.

The large-sample efficiency of the Wilcoxon test compared to the t test is 3π

=0.9549.

8 / 30


Non-parametric methods

I Many non-parametric methods convert raw values to ranksand then analyze ranks

I In case of ties, midranks are used, e.g., if the raw data were105 120 120 121 the ranks would be 1 2.5 2.5 4

Parametric Test Nonparametric Counterpart

1-sample t Wilcoxon signed-rank2-sample t Wilcoxon 2-sample rank-sumk-sample ANOVA Kruskal-WallisPearson r Spearman ρ

9 / 30



One sample tests

I Non-parametric analogue to the one sample t-test.

I Almost always used on paired data where the column ofvalues represents differences (e.g., D = Ypost − Ypre).

I Sign test: the simplest test for the median difference beingzero in the population

I Examine all values of D after discarding those in which D=0I Count the number of positive DsI Tests H0 :Prob[D > 0] = 1

2 versus HA :Prob[D > 0] 6= 12

I Under H0 it is equally likely in the population to have a valuebelow zero as it is to have a value above zero

I Note that it ignores magnitudes completely → it is inefficient(low power)

10 / 30



One sample tests: Wilcoxon signed rank

I In the pre-post analysisI D = pre - postI Retain the sign of D ( +/-)I Rank = rank of |D| (absolute value of D)I Signed rank, SR = Sign * RankI Base analyses on SR

I Observations with zero differences are ignored

I Example: A pre-post study

Post Pre D Sign Rank of |D| Signed Rank

3.5 4 0.5 + 1.5 1.54.5 4 -0.5 - 1.5 -1.54 5 1.0 + 4.0 4.0

3.9 4.6 0.7 + 3.0 3.0

11 / 30



One sample tests

I A good approximation to an exact P-value (not discussed)may be obtained by computing

z =

∑SRi√∑SR2

i

,

where the signed rank for observation i is SRi .

I We can then compare |z | to the normal distribution.

I Here, z = 7√29.5

= 1.29 and by surfstat the 2-tailed P-value

is 0.197I If all differences are positive or all are negative, the exact

2-tailed P-value is 12n−1

I This implies that n must exceed 5 for any possibility ofsignificance at the α = 0.05 level for a 2-tailed test

12 / 30



One sample tests

I Sleep DatasetI Compare the effects of two soporific drugs.I Each subject receives Drug 1 and Drug 2I Study question: Is Drug 1 or Drug 2 more effective at

increasing sleep?I Dependent variable: Difference in hours of sleep comparing

Drug 2 to Drug 1I H0 : For any given subject, the difference in hours of sleep is

equally likely to be positive or negative

13 / 30



Subject Drug 1 Drug 2 Diff (2-1) Sign Rank

1 1.9 0.7 −1.2 - 32 −1.6 0.8 2.4 + 83 −0.2 1.1 1.3 + 4.54 −1.2 0.1 1.3 + 4.55 −0.1 −0.1 0.0 NA NA6 3.4 4.4 1.0 + 27 3.7 5.5 1.8 + 78 0.8 1.6 0.8 + 19 0.0 4.6 4.6 + 910 2.0 3.4 1.4 + 6

Table: Hours of extra sleep on drugs 1 and 2, differences, signs and ranksof sleep study data

14 / 30



One sample / paired test exampleI Approximate p-value calculation

9∑i=1

SRi = 39,

√√√√ 9∑i=1

SR2i = 16.86

Z = 2.31, and the two sided test yields a p-value equal to2*(1-.989556)= 0.0209

I Wilcoxon signed rank test statistical program output

Wilcoxon signed rank test

data: sleep.data

V = 42, p-value = 0.02077

alternative hypothesis: true location is not equal to 0

I Thus, we reject H0 and conclude Drug 2 increases sleep bymore hours than Drug 1 (p = 0.02)

15 / 30



One sample / paired test example

I We could also perform sign test on sleep dataI If drugs are equally effective, we should have same number of

positives and negatives (e.g., Prob(D>0)=.5).I Analogous to coin flip example from last time.I In the observed data: 1 negative and 8 positives (we throw out

1 ‘no change’)I One sided p-value: probability of observing 0 or 1 negativesI Two sided p-value: probability of observing 0, 1, 8, or 9

negativesI p = 0.04, → reject H0 at α = 0.05

16 / 30



Wilcoxon signed rank test

I Assumes the distribution of differences is symmetric

I When the distribution is symmetric, the signed rank test testswhether the median difference is zero

I In general it tests that, for two randomly chosen observationsi and j with values (differences) xi and xj , that the probabilitythat xi + xj > 0 is 1

2

I The estimator that corresponds exactly to the test in allsituations is the pseudomedian, the median of all possiblepairwise averages of xi and xj , so one could say that thesigned rank test tests H0: pseudomedian=0

17 / 30



I To test H0 : η = η0, where η is the population median (not adifference) and η0 is some constant, we create the n valuesxi − η0 and feed those to the signed rank test, assuming thedistribution is symmetric

I When all nonzero values are of the same sign, the test reducesto the sign test and the 2-tailed P-value is ( 1

2 )n−1 where n isthe number of nonzero values

18 / 30



Two sample WMW test

I The Wilcoxon–Mann–Whitney (WMW) 2-sample rank sumtest is for testing for equality of central tendency of twodistributions (for unpaired data)

I Ranking is done by combining the two samples and ignoringwhich sample each observation came from

I Example:

Females 120 118 121 119Males 124 120 133

Ranks for Females 3.5 1 5 2Ranks for Males 6 3.5 7

19 / 30



Two sample WMW test

I Doing a 2-sample t-test using these ranks as if they were rawdata and computing the P-value against 4+3-2=5 d.f. willwork quite well

I Loosely speaking the WMW test tests whether the populationmedians of the two groups are the same

I More accurately and more generally, it tests whetherobservations in one population tend to be larger thanobservations in the other

I Letting x1 and x2 respectively be randomly chosenobservations from populations one and two, WMW testsH0 : C = 1

2 , where C =Prob[x1 > x2]

20 / 30



Two sample WMW test

I Wilcoxon rank sum test statistic

W = R − n1(n1 + 1)

2

where R is the sum of the ranks in group 1

I Under H0, µw = n1n22 and σw =

√n1n2(n1+n2+1)

12 , and

z =W − µwσw

follow a N(0,1) distribution.

21 / 30



Two sample WMW test

I The C index (concordance probability) may be estimated bycomputing

C =R̄ − n1+1

2

n2,

where R̄ is the mean of the ranks in group 1

I For the above data R̄ = 2.875 and C = 2.875−2.53 = 0.125

I We estimate: probability that a randomly chosen female has avalue greater than a randomly chosen male is 0.125.

22 / 30



Two sample WMW test: Example

I Fecal calprotectin being evaluated as a possible biomarker ofdisease severity

I Calprotectin measured in 26 subjects, 8 observed to haveno/mild activity by endoscopy

I Calprotectin has upper detection limit at 2500 unitsI A type of missing data, but need to keep in analysis

23 / 30




I Study question: Are calprotectin levels different in subjectswith no or mild activity compared to subjects with moderateor severe activity?

I Statement of the null hypothesisI H0 : Populations with no/mild activity have the same

distribution of calprotectin as populations withmoderate/severe activity

I H0 : C = 12

24 / 30



Feca

l Cal

prot

ectin

No or Mild Activity Moderate or Severe Activity

0

500

1000

1500

2000

2500 22.5

9

22.5

13

6

22.5

5

10

22.5

7

18

22.5

8

15

12

22.5

14

4

11

2

1617

22.522.5

31

Above Detection Limit

25 / 30




I Stat program output

Wilcoxon rank sum test

data: calpro by endo2

W = 23.5, p-value = 0.006257

alternative hypothesis: true location shift is not equal to 0

I W = 59.5− 8∗92 = 23.5

I A common (but loose) interpretation: People withmoderate/severe activity have higher median fecal calprotectinlevels than people with no/mild activity (p = 0.006).

26 / 30



Confidence Intervals for medians

I Confidence intervals for the (one sample) medianI Ranks of the observations are used to give approximate

confidence intervals for the median (See Altman book)I e.g., if n = 12, the 3rd and 10th largest values give a 96.1%

confidence intervalI For larger sample sizes, the lower ranked value (r) and upper

ranked value (s) to select for an approximate 95% confidenceinterval for the population median is

r =n

2− 1.96 ∗

√n

2and s = 1 +

n

2+ 1.96 ∗

√n

2

I e.g., if n = 100 then r = 40.2 and s = 60.8, so we would pickthe 40th and 61st largest values from the sample to specify a95% confidence interval for the population median

27 / 30




I Confidence intervals for the difference in two medians (twosamples)

I Considers all possible differences between sample 1 and sample2

FemaleMale 120 118 121 119124 4 6 3 5120 0 2 -1 1133 13 15 12 14

I An estimate of the median difference (males - females) is themedian of these 12 differences, with the 3rd and 10th largestvalues giving an (approximate) 95% CI

I Median estimate = 4.5, 95% CI = [1, 13]

I Specific formulas found in Altman, pages 40-41

28 / 30




I BootstrapI General method, not just for mediansI Non-parametric, does not assume symmetryI Iterative method that repeatedly samples from the original dataI Algorithm for creating a 95% CI for the difference in two

medians

1. Sample with replacement from sample 1 and sample 22. Calculate the difference in medians, save result3. Repeat Steps 1 and 2 1000 times

I A (naive) 95% CI is given by the 25th and 97.5th largest valuesof your 1000 median differences

I For the male/female data, median estimate = 4.5, 95% CI =[-0.5, 14.5], which agrees with the conclusion from a WMWrank sum test (p = 0.11).

29 / 30


Summary

Summary: non-parametric tests

I Wilcoxon signed rank test: alternative to the one sample t-test

I Wilcoxon Mann Whitney or rank sum test: alternative to thetwo sample t-test

I Attractive when parametric assumptions are believed to beviolated

I Drawback: if based on ranks, tests do not provide insight intoeffect size

I Non-parametric tests are attractive if all we care about isgetting a P-value

30 / 30

non-parametric tests - main <...

Documents