The chi-square test (χ²), Peter Shaw
TRANSCRIPT
This test is noteworthy because it works on nominal data. It requires COUNTS OF observations: how many fields were ploughed? How many plants were white and how many pink? How many people answered yes?
It does not work on ordinal or continuous data. If the data have units (kg, cm, moles etc.) you cannot use χ².
Oddities about χ²
I would genuinely advise that you do this one by hand not on a PC. The calculations are trivial, and I don’t trust a PC to run the correct model for me!
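Think about 100 darts hitting a dartboard AT RANDOM: where do they go?
[Figure: three dartboards, each with three zones; hits per zone, smallest zone first]
Board 1: 1, 24, 75 (Plausible)
Board 2: 35, 30, 35 (Are you sure?)
Board 3: 75, 20, 5 (Very unlikely)
All of these patterns could occur by chance but they are not equally likely.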
To go beyond gut reactions, you need to calculate how many events (dart hits etc) you would EXPECT in each category.
Zone areas: 2% of the board, 24% of the board, 74% of the board.
In this case we assume that expected number of hits = % area * total number of darts.
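The rule above can be sketched in a few lines of Python. The variable names are illustrative only; the zone areas and the 100 darts come from the slide.

```python
# Zone areas as fractions of the board (from the slide: 2%, 24%, 74%).
zone_area = [0.02, 0.24, 0.74]
n_darts = 100

# Expected hits per zone if darts land at random:
# fraction of area * total number of darts.
expected = [area * n_darts for area in zone_area]

for area, e in zip(zone_area, expected):
    print(f"{area:.0%} of area -> {e:.0f} expected hits")
```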
Real Life
At Drax Power Station I set up seeding trials on 6 mounds of industrial waste (back in 1991). Since 1995 orchids have flowered in these plots, but only on the bases of each mound, never on the top.
Could this be due to chance? Yes! The question is how likely this is to be chance. This is a chi-squared problem.
[Figure labels: "Orchids in many places" (mound bases), "No orchids" (mound tops)]
How to do it?
1: Set up H1 and H0:
   H1: The distribution is non-random
   H0: The distribution is random
2: Define significance (p = 0.05)
3: For each category calculate how many events you would expect under H0. Call this E (for expected). Call the observed number of events O.
4: Calculate (O - E)² / E for each category.
Now find χ²:
χ² = Σ (O - E)² / E
i.e. add up all the values of (O - E)² / E.
You need to find the df. This is N - 1, where N = number of categories ("zones" on your mental dartboard).
Compare your value of χ² with the tabulated value: large values are significant.
My orchid data:
Drax has 72 experimental plots, of which 12 support orchids. 36 plots are on the mound top, 36 at mound base.
Expected values: 12 orchid plots out of 72 should be split 50:50 mound base : mound top, IF their distribution were random. Hence expected values are 6 orchid plots on mound tops, 6 at mound base.
Observed values are in fact 0 and 12 respectively.
χ² = ?
χ² = (0 - 6)² / 6 + (12 - 6)² / 6 = 6 + 6 = 12, with 1 df.
The critical value for χ² with 1 df at p = 0.05 is 3.84.
Calculated value > tabulated value, so the result is significant. I reject H0 and accept that the distribution of plants appears to be non-random.
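For anyone who wants to check the hand calculation, here is a minimal Python sketch of the orchid example. The function name chi_square is my own. The p-value line uses a closed form that holds only for 1 df; it is a convenience and not part of the slide, which relies on tables.

```python
import math

def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Orchid plots observed on mound tops and at mound bases.
observed = [0, 12]
n_orchid_plots = sum(observed)

# 36 of the 72 plots are on tops and 36 at bases,
# so H0 predicts a 50:50 split of the 12 orchid plots.
expected = [n_orchid_plots * 36 / 72, n_orchid_plots * 36 / 72]

chi2 = chi_square(observed, expected)
df = len(observed) - 1

# Upper-tail probability; this erfc shortcut is valid for 1 df only.
p_value = math.erfc(math.sqrt(chi2 / 2))

CRITICAL_1DF_P05 = 3.84  # tabulated value quoted on the slide
print(f"chi-square = {chi2}, df = {df}, p = {p_value:.5f}")
print("significant" if chi2 > CRITICAL_1DF_P05 else "not significant")
```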
Plagiarism in psychology UGs??
At a programme board in 2005 a paper was tabled giving results of a trial in which every single essay submitted to an anthropology module was carefully checked for any form of plagiarism. A series of quiet gasps went around as people saw the figures:
              % badly plagiarised   Raw data
Biosciences   19                    5/27
Psychology    39                    12/31
              O       total   E      O-E     (O-E)²/E
Biosciences   5.00    27.00   7.91   -2.91   1.07
Psychology    12.00   31.00   9.09   2.91    0.93
sum           17.00   58.00                  2.01
χ² = 2.01, df = 1, p > 0.05 (in fact p > 0.1)
This is a chi-squared question
In other words this pattern is non significant; it is easily within the expected range of random noise.
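The table for the plagiarism figures can be reproduced with a short sketch. It follows the layout of the slide's table: each department's expected count is its number of essays times the pooled rate 17/58. The dictionary names are illustrative.

```python
# Badly plagiarised essays (O) and essays checked (total), per department.
observed = {"Biosciences": 5, "Psychology": 12}
essays = {"Biosciences": 27, "Psychology": 31}

# Under H0 both departments share the pooled rate, 17 / 58.
pooled_rate = sum(observed.values()) / sum(essays.values())
expected = {dept: essays[dept] * pooled_rate for dept in essays}

# One (O - E)^2 / E term per department, then add them up.
terms = {d: (observed[d] - expected[d]) ** 2 / expected[d] for d in essays}
chi2 = sum(terms.values())

for d in essays:
    print(f"{d}: O = {observed[d]}, E = {expected[d]:.2f}, term = {terms[d]:.2f}")
print(f"chi-square = {chi2:.2f} (critical value for 1 df at p = 0.05 is 3.84)")
```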
Now try the other half of the results: the number of students whose referencing was missing or poor.
              % v poor refs   Raw data
Biosciences   30              8/27
Psychology    3               1/31
Until 1999 a doctor called Harold Shipman worked as a GP in north Manchester. His mortality rates seemed a little high compared with a neighbouring GP practice. The practice is that when deaths occur at home under GP supervision another medical practice often signs the death certificate.
                            Shipman's practice   Next GP practice along
People served               3100                 9800
Death certs signed / year   47                   14
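The slide leaves this one without a worked answer. One way to frame it, assuming (my assumption, not stated in the source) that under H0 each practice's share of the death certificates is proportional to the number of people it serves, is sketched below.

```python
# Figures from the slide: people served and death certificates per year.
people_served = [3100, 9800]   # Shipman's practice, next practice along
observed = [47, 14]

# ASSUMPTION: under H0, deaths are shared out in proportion to people served.
total_deaths = sum(observed)
total_people = sum(people_served)
expected = [total_deaths * p / total_people for p in people_served]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

print(f"expected = {[round(e, 1) for e in expected]}")
print(f"chi-square = {chi2:.1f} with {df} df (critical value at p = 0.05 is 3.84)")
```

Both expected values are well above 5 here, so the rule of thumb about small E values is not a concern for this table.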
One little catch: this concerns the "Expected" values. Remember that χ² involves the term (O - E)² / E. If you had a tiny value of E, χ² would be huge. (As E -> 0, χ² -> infinity.) This is not a problem most of the time. However, it is a problem if E gets too small, and there is a rule of thumb to guide you here:
If E < 5, the χ² value may be unreliable.
The solution is normally simple: pool classes together until you get a big enough value of E.
Let’s take the dartboard example
We have 3 zones, comprising 2, 24 and 74% of the area.
If we throw 100 darts randomly at these, we expect them to contain 2, 24 and 74 darts respectively.
Our E values are hence 2, 24 and 74.
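[Figure: the three dartboards again, with zones of 74%, 24% and 2% of the area; hits per zone (smallest zone first) are 1/24/75, 35/30/35 and 75/20/5]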
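3 dartboards: let's use chi-square to assess their likelihood.
              74% zone   24% zone   2% zone
E             74         24         2
O (board 1)   75         24         1
O (board 2)   35         30         35
O (board 3)   5          20         75
E for the 2% zone is below 5, so pool it with the 24% zone:
              74% zone   24% + 2% zones
E             74         26
O (board 1)   75         25
O (board 2)   35         65
O (board 3)   5          95
For board 1: χ² = Σ (O - E)² / E = (25 - 26)² / 26 + (75 - 74)² / 74 = 1/26 + 1/74
Board 1: χ² = 0.052   Board 2: χ² = 79.05   Board 3: χ² = 247.5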
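The pooling step and the three χ² values can be checked with the sketch below. The helper names chi_square and pool_last_two are my own; the counts are those on the slide, listed in the order 74% zone, 24% zone, 2% zone.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def pool_last_two(counts):
    """Merge the last two classes into one (here the 24% and 2% zones)."""
    return counts[:-2] + [counts[-2] + counts[-1]]

expected = [74, 24, 2]
boards = {
    "board 1": [75, 24, 1],
    "board 2": [35, 30, 35],
    "board 3": [5, 20, 75],
}

# E = 2 is below 5, so pool the two smaller zones before testing.
pooled_expected = pool_last_two(expected)            # [74, 26]
results = {
    name: chi_square(pool_last_two(obs), pooled_expected)
    for name, obs in boards.items()
}

for name, value in results.items():
    print(f"{name}: chi-square = {value:.3f} (1 df, critical value 3.84)")
```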
Your turn!
In Sheffield we surveyed tombstones with lichen cover. There were 2 types of stone: millstone and marble.
One day we found 80 marble and 120 millstone tombstones. 80 of the millstone ones had Lecanora conizaeoides but none of the marble ones did. Is this significant?
2 way chi-square
This is very common, but needs a little thought. Here we have a distribution of counts in 2 crossed categories (e.g. M/F * did/did not gain a score, habitat type * present/absent). It is possible to test H0: random distribution. If H0 is rejected you may conclude that the distribution is not random, but you can't go on to identify which observations / classes / treatments are responsible for this effect.
Obs:        M    F    Sum
Habitat 1   25   5    30
Habitat 2   10   15   25
Habitat 3   5    35   40
Sum         40   55   95
What are the constraints on this? That you sampled a certain total number of individuals, and a certain total fell into each gender and a certain total into each habitat. Given these totals we can predict the expected number of observations under a random distribution
Expected    M          F          Sum
Habitat 1   30*40/95   30*55/95   30
Habitat 2   25*40/95   25*55/95   25
Habitat 3   40*40/95   40*55/95   40
Sum         40         55         95

Expected    M          F          Sum
Habitat 1   12.63158   17.36842   30
Habitat 2   10.52632   14.47368   25
Habitat 3   16.84211   23.15789   40
Sum         40         55         95

(O-E)²/E    M             F
Habitat 1   12.11074561   8.807814992
Habitat 2   0.026315789   0.019138756
Habitat 3   8.326480263   6.05562201
Sum         20.46354167   14.88257576
Answer: chi-square = 20.46354167 + 14.88257576 = 35.34611742, with df = (3 - 1) * (2 - 1) = 2.
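The two-way calculation is mechanical enough to sketch in Python: each expected value is row total * column total / grand total. The variable names are illustrative, and the p-value line uses a closed form that holds only when df = 2, as it does for a 3 x 2 table.

```python
import math

# Observed counts: rows are habitats 1-3, columns are M and F.
observed = [
    [25, 5],
    [10, 15],
    [5, 35],
]

row_totals = [sum(row) for row in observed]          # 30, 25, 40
col_totals = [sum(col) for col in zip(*observed)]    # 40, 55
grand_total = sum(row_totals)                        # 95

# Expected count in each cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper-tail probability; exp(-x / 2) is exact for 2 df only.
p_value = math.exp(-chi2 / 2)

print(f"chi-square = {chi2:.4f}, df = {df}, p = {p_value:.2e}")
```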
Yates' Correction (the continuity correction)
This correction can probably be ignored under most circumstances. In fact I would never use it, preferring instead a home-grown Monte-Carlo approach (next slide..), but this correction could matter if you are using chi-square to look for associations between events and have small sample size.
              Sp1 Present   Sp1 Absent   sum
Sp2 Present   a             b            a+b
Sp2 Absent    c             d            c+d
sum           a+c           b+d          N
If you crank through the chi-square calculation on this association matrix you find that
chi-square = (ad - bc)² * N / [(a+b)(c+d)(a+c)(b+d)]
O             Sp1 Present   Sp1 Absent   sum
Sp2 Present   0             20           20
Sp2 Absent    40            60           100
sum           40            80           120

E             Sp1 Present   Sp1 Absent   sum
Sp2 Present   6.7           13.3         20
Sp2 Absent    33.3          66.7         100
sum           40            80           120

Sum (O - E)² / E = 12
(0*60 - 20*40)² * 120 / (20 * 100 * 40 * 80) = 12 QED
Yates’ correction, contd!
Yates’ correction is a fudge applied to this calculation in cases where E values are <5 or N < 100. It goes as follows:
chi-square = (|ad - bc| - N/2)² * N / [(a+b)(c+d)(a+c)(b+d)]
Essentially this reduces the calculated chi-square value by reducing the (ad - bc)² term on the top line, making the test more conservative (= more likely to accept H0). But I wouldn't do it that way…
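To see how much the correction changes things, here is a small sketch applying both formulas to the 0 / 20 / 40 / 60 association matrix used earlier. The function name and the yates flag are my own. Clamping the corrected numerator at zero is a common safeguard that I have added; it is not on the slide.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Shortcut chi-square for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction: subtract N/2 before squaring.
        # max(..., 0) is an added safeguard so the term never goes negative.
        diff = max(diff - n / 2, 0)
    return diff ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))

plain = chi_square_2x2(0, 20, 40, 60)
corrected = chi_square_2x2(0, 20, 40, 60, yates=True)

print(f"uncorrected chi-square = {plain:.4f}")
print(f"with Yates' correction = {corrected:.4f}")
```

For this table the correction lowers the statistic but leaves it above the 1 df critical value of 3.84.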
More on small samples
Each time we have a PhD student who studies primates in the field, they end up coming to me asking about chi-square analyses on small datasets where the E values are incorrigibly low. Since some M. Res. Primatology students will read this, the situation needs addressing.
The good news: the E>4 rule is over-cautious, and in my experience you can get away with E values as low as 2 and still get accurate confidence levels.
How do I know? Because there is a back-door solution, not available in the books or major packages, which I use that allows me blithely to ignore E<5 and Yates' correction. It is a Monte-Carlo empirical determination of significance.
Monte-Carlo Chi-square
Remember the dartboard? A Monte-Carlo determination involves first calculating your actual χ² and 'writing this down' (in a PC memory). Let's say that your data involves 20 observations. Then I get a PC to randomly 'throw 20 darts' at a dartboard of the same construction as your E values, and calculate a χ² value. This is a random χ² value. This is stored, and a second set of random darts is thrown, and a second χ² value calculated. This is repeated >200 times, to give you an empirical insight into what would be expected from random positioning of your 20 observations. All the PC then has to do is compare your real χ² value with these random ones to derive a safe, dependable significance level.
The catch is that you have to use a bit of old DOS code I wrote to do this, hence the reason why these students keep knocking on my door..
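The procedure described above is easy to sketch in a modern language. What follows is NOT the DOS programme mentioned on the slide, just a minimal Python illustration of the same idea. The function names, the seed, the number of simulations and the observed counts (3, 7, 10) are all hypothetical choices of mine; the E values are 2%, 24% and 74% of 20 darts.

```python
import random

def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def monte_carlo_p(observed, expected, n_sims=2000, seed=1):
    """Fraction of random 'dart throws' whose chi-square >= the real one."""
    rng = random.Random(seed)
    n_obs = sum(observed)
    zones = range(len(expected))
    actual = chi_square(observed, expected)
    at_least_as_big = 0
    for _ in range(n_sims):
        # Throw n_obs darts; each zone is hit in proportion to its E value.
        counts = [0] * len(expected)
        for zone in rng.choices(zones, weights=expected, k=n_obs):
            counts[zone] += 1
        if chi_square(counts, expected) >= actual:
            at_least_as_big += 1
    return actual, at_least_as_big / n_sims

# Hypothetical example: 20 darts, zones of 2%, 24% and 74% of the board.
expected = [0.4, 4.8, 14.8]
observed = [3, 7, 10]          # made-up counts for illustration
actual, p_mc = monte_carlo_p(observed, expected)
print(f"chi-square = {actual:.2f}, Monte-Carlo p = {p_mc:.4f}")
```

Because the random values are generated from the same E values as the real data, small expected counts do not need pooling here.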
Ymke's warthogs..
Number of respondents   Number of respondents reporting
to questionnaire        problems with warthog
39                      0
13                      4
10                      0
21                      5
7                       0
31                      3
Sum: 121                12
Number of respondents   Number of respondents reporting   Expected
to questionnaire        problems with warthog             warthog
39                      0                                 3.867768595
13                      4                                 1.289256198
10                      0                                 0.991735537
21                      5                                 2.082644628
7                       0                                 0.694214876
31                      3                                 3.074380165
Sum: 121                12                                12
Oh Dear!! Not an E value >4
So I loaded O and E values into my MC-chi-sq programme…
And the MC output said: χ² = 15.34 (correct!). It calculated 1000 random χ² values based on "throwing 12 darts" at a board divided into 6 zones whose relative sizes were 3.87, 1.29, … 3.07 and found that 95% of them were < 11.72 and 99% of them were < 16.11.
So what was the significance of χ² = 15.34? The standard tables, crudely ignoring E<5 problems, give this a significance level of 0.01 > p > 0.005. By the Monte-Carlo cut-offs above, 15.34 lies between the 95% and 99% values, i.e. 0.05 > p > 0.01.
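The warthog figures can be re-run along the same lines. This is my own sketch, not the original MC-chi-sq programme, so the simulated cut-offs will differ a little from 11.72 and 16.11 depending on the seed and the number of simulations (both chosen arbitrarily here).

```python
import random

respondents = [39, 13, 10, 21, 7, 31]   # questionnaire respondents per group
problems = [0, 4, 0, 5, 0, 3]           # respondents reporting warthog problems

def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Expected problems per group, in proportion to the number of respondents.
n_problems = sum(problems)
expected = [n_problems * r / sum(respondents) for r in respondents]
actual = chi_square(problems, expected)

# Throw 12 'darts' at 6 zones sized by the E values, many times over.
rng = random.Random(0)        # arbitrary seed, for repeatability only
simulated = []
for _ in range(5000):         # arbitrary number of simulations
    counts = [0] * len(expected)
    for group in rng.choices(range(len(expected)), weights=expected, k=n_problems):
        counts[group] += 1
    simulated.append(chi_square(counts, expected))

simulated.sort()
q95 = simulated[int(0.95 * len(simulated))]
q99 = simulated[int(0.99 * len(simulated))]
p_mc = sum(s >= actual for s in simulated) / len(simulated)

print(f"chi-square = {actual:.2f}")
print(f"simulated 95% cut-off = {q95:.2f}, 99% cut-off = {q99:.2f}")
print(f"Monte-Carlo p = {p_mc:.4f}")
```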
Chi-square and model fitting
This is a whole lecture in itself! There is one simple, neat, easy-to-understand way to use chi-square to see whether two variables are associated. All that matters is that you can plot a graph of the variables: consider the scattergraph of the relationship between two arbitrary variables:
[Figure: scattergraph of plant mass (g) and fertiliser (g), divided into four quadrants by the median of each variable]
How many points in each quadrant (sum = 40)? What is H0, and why?
H0: 10 in each
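The slide gives no quadrant counts, so the sketch below makes up 40 points purely for illustration (the data, the seed and the variable names are all hypothetical). It splits the points at the two medians, counts each quadrant, and compares the counts with the H0 expectation of 10 in each.

```python
import random
import statistics

# HYPOTHETICAL data: 40 plots with plant mass loosely tied to fertiliser.
rng = random.Random(42)
fertiliser = [rng.uniform(0, 10) for _ in range(40)]
plant_mass = [2 * f + rng.gauss(0, 5) for f in fertiliser]

x_median = statistics.median(fertiliser)
y_median = statistics.median(plant_mass)

# Count the points falling in each of the four median-split quadrants.
counts = {
    ("low", "low"): 0,
    ("low", "high"): 0,
    ("high", "low"): 0,
    ("high", "high"): 0,
}
for x, y in zip(fertiliser, plant_mass):
    key = ("high" if x > x_median else "low",
           "high" if y > y_median else "low")
    counts[key] += 1

# H0: no association, so 40 points / 4 quadrants = 10 expected in each.
expected_per_quadrant = len(fertiliser) / 4
chi2 = sum((o - expected_per_quadrant) ** 2 / expected_per_quadrant
           for o in counts.values())

for quadrant, o in counts.items():
    print(f"fertiliser {quadrant[0]}, mass {quadrant[1]}: {o} points")
print(f"chi-square = {chi2:.2f}")
```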