
The Chi-square test - 2

Peter Shaw

This test is noteworthy..

Because it works on nominal data. It requires COUNTS OF observations: how many fields were ploughed? How many plants were white and how many pink? How many people answered yes?

It does not work on ordinal or continuous data. If the data have units (kg, cm, moles etc.) you cannot use χ².

Oddities about χ²

I would genuinely advise that you do this one by hand not on a PC. The calculations are trivial, and I don’t trust a PC to run the correct model for me!

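Think about 100 darts hitting a dartboard AT RANDOM: where do they go?

[Figure: three dartboards, each with three zones. Hits per zone, smallest zone first: 1, 24, 75; 35, 30, 35; 75, 20, 5.]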

All of these patterns could occur by chance but they are not equally likely.

First pattern: plausible. Second: are you sure? Third: very unlikely.

To go beyond gut reactions, you need to calculate how many events (dart hits etc) you would EXPECT in each category.

[Figure: dartboard with zones covering 2%, 24% and 74% of the area.]

In this case we assume that the expected number of hits = (% area / 100) * total number of darts.

Real Life

At Drax Power Station I set up seeding trials on 6 mounds of industrial waste (back in 1991). Since 1995 orchids have flowered in these plots, but only at the base of each mound, never on the top.

Could this be due to chance? Yes! The question is how likely this is to be chance. This is a chi-squared problem.

[Figure labels: "Orchids in many places"; "No orchids".]

How to do it?

1: Set up H1, H0:
   H1: The distribution is non-random
   H0: The distribution is random
2: Define significance (p = 0.05)
3: For each category calculate how many events you would expect under H0. Call this E (for expected). Call the observed number of events O.
4: Calculate (O-E)²/E

Now find χ²

χ² = Σ (O-E)²/E

i.e. add up all the values of (O-E)²/E.

You need to find the df. This is N-1, where N = number of categories (“zones” on your mental dartboard).

Compare your value of χ² with the tabulated value: large values are significant.

My orchid data: Drax has 72 experimental plots, of which 12 support orchids. 36 plots are on the mound top, 36 at mound base.

Expected values: 12 orchid plots out of 72 should be 50:50 mound base : mound top, IF their distribution were random. Hence expected values are 6 orchid plots on mound tops, 6 at mound base. Observed values are in fact 0 and 12 respectively.

χ² = ?

χ² = (0-6)²/6 + (12-6)²/6 = 6 + 6 = 12, with 1 df. The critical value for χ² with 1 df at p = 0.05 is 3.84.

Calculated value > tabulated value, so the result is significant. I reject H0 and accept that the distribution of plants appears to be non-random.
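If you do want a PC to check the hand calculation, here is a minimal Python sketch of the orchid sum. The function name `chi_square` and the constant `CRITICAL_1DF_P05` are mine, not part of the slides; the critical value 3.84 is the tabulated figure quoted above.

```python
def chi_square(observed, expected):
    """Return (chi-square statistic, degrees of freedom) for a one-way test."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Add up (O - E)^2 / E over every category.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # N - 1, where N is the number of categories
    return statistic, df


# Orchid plots, in the order: mound top, mound base.
observed = [0, 12]
expected = [6, 6]
CRITICAL_1DF_P05 = 3.84  # tabulated critical value for 1 df at p = 0.05

statistic, df = chi_square(observed, expected)
significant = statistic > CRITICAL_1DF_P05
print(f"chi-square = {statistic:.2f}, df = {df}, significant = {significant}")
```

The point of keeping E as an explicit input is that you, not the PC, decide what the null model is.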

Plagiarism in psychology UGs??

At a programme board in 2005 a paper was tabled giving the results of a trial in which every single essay submitted to an anthropology module was carefully checked for any form of plagiarism. A series of quiet gasps went around as people saw the figures:

              % badly plagiarised   Raw data
Biosciences   19                    5/27
Psychology    39                    12/31

              O       total   E      O-E     (O-E)²/E
Biosciences   5.00    27.00   7.91   -2.91   1.07
Psychology    12.00   31.00   9.09   2.91    0.93
sum           17.00   58.00                  2.01

chi-sq = 2, df = 1, p>0.05 (in fact p>0.1)

This is a chi-squared question

In other words this pattern is non significant; it is easily within the expected range of random noise.
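The E column in the plagiarism table comes from sharing the 17 plagiarised essays between the two groups in proportion to how many essays each group submitted. A short Python sketch of that step, assuming this is how the table was built (the helper name `expected_from_totals` is mine):

```python
def expected_from_totals(observed, totals):
    """Share the total observed count between groups in proportion to group size."""
    grand_observed = sum(observed)
    grand_total = sum(totals)
    return [grand_observed * t / grand_total for t in totals]


observed = [5, 12]  # badly plagiarised essays: Biosciences, Psychology
totals = [27, 31]   # essays checked in each group

expected = expected_from_totals(observed, totals)
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(contributions)

print([round(e, 2) for e in expected])       # E for each group
print([round(c, 2) for c in contributions])  # (O-E)^2/E for each group
print(round(chi_sq, 2))                      # compare with 3.84 (1 df, p = 0.05)
```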

              % v poor refs   Raw data
Biosciences   30              8/27
Psychology    3               1/31

Now try the other half of the results; the number of students whose referencing was missing or poor.

Until 1999 a doctor called Harold Shipman worked as a GP in north Manchester. His mortality rates seemed a little high compared with a neighboring GP practice. The convention is that when deaths occur at home under GP supervision, another medical practice often signs the death certificate.

                            Shipman’s practice   Next GP practice along
People served               3100                 9800
Death certs signed / year   47                   14

One little catch: This concerns the “Expected” values. Remember that χ² involves the term (O-E)²/E. If you had a tiny value of E, χ² would be huge. (As E -> 0, χ² -> infinity). This is not a problem most of the time. However, it is a problem if E gets too small, and there is a rule of thumb to guide you here:

If E < 5, the χ² value may be unreliable.

The solution is normally simple: pool classes together until you get a big enough value of E.

Let’s take the dartboard example

We have 3 zones, comprising 2, 24 and 74% of the area.

If we throw 100 darts randomly at these, we expect them to contain 2, 24 and 74 darts respectively.

Our E values are hence 2, 24 and 74.

[Figure: the dartboard zones (74%, 24% and 2% of the area) and the three patterns of hits shown earlier.]

3 dartboards – let’s use the Chi-sq to assess their likelihood.

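Zone (% area)    74    24     2

Board 1    E     74    24     2
           O     75    24     1

Board 2    E     74    24     2
           O     35    30    35

Board 3    E     74    24     2
           O      5    20    75

The 2% zone has E = 2, which is below 5, so pool it with the 24% zone (E = 24 + 2 = 26):

Zone (% area)    74    24 + 2

Board 1    E     74    26
           O     75    25

Board 2    E     74    26
           O     35    65

Board 3    E     74    26
           O      5    95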

For the first board:

χ² = Σ (O-E)²/E = (25-26)²/26 + (75-74)²/74 = 1/26 + 1/74 = 0.052

Board 1: χ² = 0.052    Board 2: χ² = 79.05    Board 3: χ² = 247.5
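The pooling step can be sketched in Python as follows. The rule used here (merge the class with the smallest E into the next smallest until every E reaches 5) is my assumption about how to automate "pool classes together"; the slides only pool the 2% zone with the 24% zone. The names `pool_small_classes` and `chi_square` are illustrative.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def pool_small_classes(observed, expected, minimum=5):
    """Merge the smallest-E class into the next smallest until every E >= minimum.

    Pooling stops at two classes, because one class leaves nothing to test.
    """
    pairs = sorted(zip(expected, observed))  # smallest expected value first
    while len(pairs) > 2 and pairs[0][0] < minimum:
        (e1, o1), (e2, o2) = pairs[0], pairs[1]
        pairs = sorted([(e1 + e2, o1 + o2)] + pairs[2:])
    pooled_expected = [e for e, _ in pairs]
    pooled_observed = [o for _, o in pairs]
    return pooled_observed, pooled_expected


# Zones in the order 74%, 24%, 2% of the area; 100 darts thrown.
expected = [74, 24, 2]
boards = {
    "board 1": [75, 24, 1],
    "board 2": [35, 30, 35],
    "board 3": [5, 20, 75],
}

results = {}
for name, observed in boards.items():
    pooled_o, pooled_e = pool_small_classes(observed, expected)
    results[name] = chi_square(pooled_o, pooled_e)
    print(name, pooled_o, pooled_e, round(results[name], 3))
```

After pooling there are 2 classes, so each value is compared against the 1 df critical value.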

Your turn!

In Sheffield we surveyed tombstones with lichen cover. There were 2 types of stone: millstone and marble.

One day we found 80 marble and 120 millstone tombstones. 80 of the millstone ones had Lecanora conizaeoides but none of the marble ones did. Is this significant?

2 way chi-square

This is very common, but needs a little thought. Here we have a distribution of counts in 2 crossed categories (e.g. M/F * did/did not gain a score, habitat type * present/absent). It is possible to test H0: random distribution. If H0 is rejected you may conclude that the distribution is not random, but you can’t go on to identify which observations / classes / treatments are responsible for this effect.

M F

Habitat 1 25 5

Habitat 2 10 15

Habitat 3 5 35

Obs: M F Sum

Habitat 1 25 5 30

Habitat 2 10 15 25

Habitat 3 5 35 40

Sum 40 55 95

What are the constraints on this? That you sampled a certain total number of individuals, and a certain total fell into each gender and a certain total into each habitat. Given these totals we can predict the expected number of observations under a random distribution

Expected M F Sum

Habitat 1 30*40/95 30*55/95 30

Habitat 2 25*40/95 25*55/95 25

Habitat 3 40*40/95 40*55/95 40

Sum 40 55 95

Expected values:

            M          F          Sum
Habitat 1   12.63158   17.36842   30
Habitat 2   10.52632   14.47368   25
Habitat 3   16.84211   23.15789   40
Sum         40         55         95

Contributions to chi-squared, (O-E)²/E:

            M             F
Habitat 1   12.11074561   8.807814992
Habitat 2   0.026315789   0.019138756
Habitat 3   8.326480263   6.05562201
Sum         20.46354167   14.88257576

Answer: Chi-square = 35.34611742
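The two-way calculation (row total * column total / grand total for each E, then the usual sum) can be sketched in Python as below. The function name `two_way_chi_square` is mine. The degrees of freedom, (rows - 1) * (columns - 1), are not stated on the slide; that is the standard rule for a two-way table.

```python
def two_way_chi_square(table):
    """Return (chi-square, df, expected table) for a two-way table of counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    grand_total = sum(row_sums)

    statistic = 0.0
    expected = []
    for row, row_sum in zip(table, row_sums):
        # E for each cell = row total * column total / grand total
        expected_row = [row_sum * col_sum / grand_total for col_sum in col_sums]
        expected.append(expected_row)
        statistic += sum((o - e) ** 2 / e for o, e in zip(row, expected_row))

    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return statistic, df, expected


# Rows: habitats 1 to 3. Columns: M, F.
counts = [
    [25, 5],
    [10, 15],
    [5, 35],
]

statistic, df, expected = two_way_chi_square(counts)
print(round(statistic, 3), df)
print([[round(e, 5) for e in row] for row in expected])
```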

Yates’ Correction (the continuity correction)

This correction can probably be ignored under most circumstances. In fact I would never use it, preferring instead a home-grown Monte-Carlo approach (next slide..), but this correction could matter if you are using chi-square to look for associations between events and have a small sample size.

              Sp1 Present   Sp1 Absent   sum
Sp2 Present   a             b            a+b
Sp2 Absent    c             d            c+d
sum           a+c           b+d          N

If you crank through the chi-square calculation on this association matrix you find that

chi-square = (ad-bc)² * N / [(a+b)(c+d)(a+c)(b+d)]

Observed (O):

              Sp1 Present   Sp1 Absent   sum
Sp2 Present   0             20           20
Sp2 Absent    40            60           100
sum           40            80           120

Expected (E):

              Sp1 Present   Sp1 Absent   sum
Sp2 Present   6.7           13.3         20
Sp2 Absent    33.3          66.7         100
sum           40            80           120

Sum (O-E)²/E = 12

(ad-bc)² * N / [(a+b)(c+d)(a+c)(b+d)] = (0*60 - 20*40)² * 120 / (20 * 100 * 40 * 80) = 12   QED

Yates’ correction, contd!

Yates’ correction is a fudge applied to this calculation in cases where E values are <5 or N < 100. It goes as follows:

chi-square = (|ad-bc| - N/2)² * N / [(a+b)(c+d)(a+c)(b+d)]

Essentially this reduces the calculated chi-square value by reducing the (ad-bc)2 term on the top line, making the test more conservative (=more likely to accept H0). But I wouldn’t do it that way…
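To see how much the correction bites, here is a small Python sketch of the 2 x 2 shortcut formula with and without Yates’ correction, run on the 0 / 20 / 40 / 60 association matrix. The function name `chi_square_2x2` is mine, and clamping the corrected term at zero is a common convention that I am assuming; the slides do not mention it.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2 x 2 table [[a, b], [c, d]] via the shortcut formula."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' correction: shrink |ad - bc| by N/2 (never below zero; an assumption).
        diff = max(diff - n / 2, 0.0)
    return diff ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))


plain = chi_square_2x2(0, 20, 40, 60)
corrected = chi_square_2x2(0, 20, 40, 60, yates=True)
print(plain, round(corrected, 4))
```

The corrected value is always the smaller of the two, which is what "more conservative" means here.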

More on small samples

Each time we have a PhD student who studies primates in the field, they end up coming to me asking about chi-square analyses on small datasets where the E values are incorrigibly low. Since some M. Res. Primatology students will read this, the situation needs addressing.

The good news: the E>4 rule is over-cautious, and in my experience you can get away with E values as low as 2 and still get accurate confidence levels. How do I know? Because there is a back-door solution, not available in the books or major packages, which allows me blithely to ignore E<5 and Yates’ correction. It is a Monte-Carlo empirical determination of significance.

Monte-Carlo Chi-square

Remember the dartboard? A Monte-Carlo determination involves first calculating your actual χ² and ‘writing this down’ (in a PC memory). Let’s say that your data involves 20 observations. Then I get a PC to randomly ‘throw 20 darts’ at a dartboard of the same construction as your E values, and calculate a χ² value. This is a random χ² value. This is stored, and a second set of random darts is thrown, and a second χ² value calculated. This is repeated >200 times, to give you an empirical insight into what would be expected from random positioning of your 20 observations. All the PC then has to do is compare your real χ² value with these random ones to derive a safe, dependable significance level.

The catch is that you have to use a bit of old DOS code I wrote to do this, hence the reason why these students keep knocking on my door..

Ymke’s warthogs..

Number of respondents   Number of respondents reporting
to questionnaire        problems with warthog
39                      0
13                      4
10                      0
21                      5
7                       0
31                      3
Sum: 121                12

Number of respondents   Number of respondents reporting   Expected
to questionnaire        problems with warthog             warthog
39                      0                                 3.867768595
13                      4                                 1.289256198
10                      0                                 0.991735537
21                      5                                 2.082644628
7                       0                                 0.694214876
31                      3                                 3.074380165
Sum: 121                12                                12

Oh Dear!! Not an E value >4

So I loaded O and E values into my MC-chi-sq programme…

And the MC-output said: χ² = 15.34 (correct!) It calculated 1000 random χ² values based on “throwing 12 darts” at a board divided into 6 zones whose relative sizes were 3.87, 1.29,… 3.07 and found that 95% of them were <11.72 and 99% of them were <16.11.

So what was the significance of χ² = 15.34? The standard tables, crudely ignoring E<5 problems, give this a significance level of 0.01>p>0.005. The Monte-Carlo values put 15.34 between the 95% and 99% points, i.e. 0.05>p>0.01.
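The Monte-Carlo idea does not need the old DOS code. What follows is a minimal Python sketch of the same "throw darts at a board shaped like the E values" procedure, applied to the warthog data. It is not the original programme: the function names, the 2000 simulations and the fixed seed are all my choices, and the empirical p-value will wobble a little from seed to seed.

```python
import random


def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def monte_carlo_p(observed, expected, n_sims=2000, seed=1):
    """Return (real chi-square, share of random chi-squares at least as large)."""
    rng = random.Random(seed)
    n_darts = sum(observed)
    zones = range(len(expected))
    real = chi_square(observed, expected)

    at_least_as_large = 0
    for _ in range(n_sims):
        # Throw n_darts at a board whose zone sizes are proportional to E.
        hits = [0] * len(expected)
        for zone in rng.choices(zones, weights=expected, k=n_darts):
            hits[zone] += 1
        if chi_square(hits, expected) >= real:
            at_least_as_large += 1
    return real, at_least_as_large / n_sims


respondents = [39, 13, 10, 21, 7, 31]
observed = [0, 4, 0, 5, 0, 3]
# E = total reports * share of respondents in each group
expected = [sum(observed) * r / sum(respondents) for r in respondents]

real, p_value = monte_carlo_p(observed, expected)
print(round(real, 2), p_value)
```

More simulations give a steadier p-value at the cost of run time.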

Chi-square and model fitting

This is a whole lecture in itself! There is one simple, neat, easy-to-understand way to use chi-square to see whether two variables are associated. All that matters is that you can plot a graph of the variables: consider the scattergraph of the relationship between two arbitrary variables:

[Figure: scattergraph of plant mass (g) against fertiliser (g), divided into four quadrants by the median of each variable.]

How many points in each quadrant (sum = 40)? What is H0, and why?

H0: 10 in each

[Figure: a second scattergraph with quadrant counts of 77, 76, 35 and 35 observations.]

χ² = 30.9, 1 df ***
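A rough Python sketch of the quadrant method: split the points at the two medians, count each quadrant, and compare with an equal split. The helper names are mine, the six demo points are hypothetical, and skipping points that sit exactly on a median line is my assumption. The slide does not say which quadrant holds which count, so the order of `slide_counts` is arbitrary. This only reproduces the statistic; the slide quotes it with 1 df.

```python
from statistics import median


def quadrant_counts(xs, ys):
    """Count points in the four quadrants formed by the two medians.

    Points lying exactly on a median line are skipped (an assumption).
    Order: low x/low y, low x/high y, high x/low y, high x/high y.
    """
    mx, my = median(xs), median(ys)
    counts = [0, 0, 0, 0]
    for x, y in zip(xs, ys):
        if x == mx or y == my:
            continue
        counts[2 * (x > mx) + (y > my)] += 1
    return counts


def equal_split_chi_square(counts):
    """Chi-square against H0: the same expected count in every quadrant."""
    expected = sum(counts) / len(counts)
    return sum((o - expected) ** 2 / expected for o in counts)


# Hypothetical points, only to show the counting step.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 2, 3, 4, 5, 6]
demo_counts = quadrant_counts(xs, ys)

# Quadrant counts read off the slide.
slide_counts = [77, 76, 35, 35]
chi_sq = equal_split_chi_square(slide_counts)
print(demo_counts, round(chi_sq, 1))
```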