z-test, t-test, chi-square test
TRANSCRIPT
-
STAT 2 Lecture 32:
Still more testing
-
Recall: the two-sample z-test
In 1998, a survey of a SRS of 200 Berkeley students found that they, on average, played hackeysack 4 hours a week (with an SD+ of 8 hours)
In 2008, a survey of a SRS of 200 Berkeley students found that they, on average, played hackeysack 2 hours a week (with an SD+ of 6 hours)
Has the average number of hours of hackeysack played per week by all Berkeley students changed from 1998 to 2008?
-
Setting up the two-sample z-test
Null hypothesis: the 1998 population average is the same as the 2008 population average (so the data are like draws from two boxes with the same average)
Alternative: the averages are different
Test statistic: the z-statistic for the difference in sample averages
Assume the 1998 and 2008 samples are independent
-
Calculations
Observed difference in averages = 4 - 2 = 2 hours
Expected difference in averages, assuming the null hypothesis = 0
SE of 1998 sample average = 8/sqrt(200) = 0.566
SE of 2008 sample average = 6/sqrt(200) = 0.424
SE of difference in sample averages = sqrt(0.566² + 0.424²) = 0.707
-
Calculations
z-statistic = (2 - 0)/0.707 = 2.83
From tables: P(Z < -2.83) = P(Z > 2.83) = 0.23%
Two-tailed P-value is 0.23% + 0.23% = 0.46%
There is a statistically significant difference; the average hours of hackeysack played has changed
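The whole calculation can be scripted. A minimal sketch in Python (the helper function and its name are mine, not from the course; only the summary statistics come from the slides):

```python
from math import sqrt, erfc

# Two-sample z-test from summary statistics, as on the slides.
# (Hypothetical helper; not part of the course materials.)
def two_sample_z(avg1, sd1, n1, avg2, sd2, n2):
    se1 = sd1 / sqrt(n1)              # SE of first sample average
    se2 = sd2 / sqrt(n2)              # SE of second sample average
    se_diff = sqrt(se1**2 + se2**2)   # SE of the difference
    z = (avg1 - avg2 - 0) / se_diff   # expected difference under the null is 0
    p = erfc(abs(z) / sqrt(2))        # two-tailed P-value from the normal curve
    return z, p

z, p = two_sample_z(4, 8, 200, 2, 6, 200)
print(round(z, 2), round(p * 100, 2))  # 2.83 0.47
```

The computed P-value is 0.47%; the slide's 0.46% differs only because each tail was rounded to 0.23% before adding.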
-
Today
Testing results of experiments
The chi-square test
-
I
Experimental averages
-
Example: vitamin C
A study of the effect of vitamin C on cold resistance is performed. 200 participants are randomly assigned to the treatment group of 100, which receives vitamin C pills, or the control group of 100, which receives placebos. Doctors and participants don't know who's in which group.
-
Example: vitamin C
The treatment group averaged 2.3 colds (SD 3.1)
The control group averaged 2.6 colds (SD 2.9)
Is this difference significant? Does vitamin C prevent colds?
-
A box model for controlled experiments
First imagine a box containing 200 tickets: one for each participant
Each ticket has TWO numbers written on it: a treatment number and a control number
If a participant is randomly assigned to the treatment group, we observe the treatment number; if randomly assigned to the control group, we observe the control number
-
A box model for controlled experiments
Null hypothesis: the average of all 200 of the treatment numbers is equal to the average of all 200 of the control numbers
Alternative: the treatment and control averages are not equal
-
Try the two-sample z-test
Let's do the two-sample z-test for now, and think about whether it's appropriate later
Test statistic = treatment group average - control group average
Observed value = 2.3 - 2.6 = -0.3
Expected value under null = 0
-
Standard errors
SE of treatment group average (ignoring correction factor) = 3.1/sqrt(100) = 0.31
SE of control group average = 2.9/sqrt(100) = 0.29
SE of difference = sqrt(0.31² + 0.29²) = 0.42
-
Try the two-sample z-test
z-statistic = (-0.3 - 0)/0.42 = -0.71
P(Z < -0.71) = 23.89%, so the P-value is 24%
The difference is explainable by chance: there's not enough evidence to show vitamin C prevents colds (even for this sample)
-
Was the two-sample z-test appropriate?
But wait! We drew 100 tickets from a box containing 200 tickets, so shouldn't we have used the correction factor?
Our standard error calculation assumed the two samples were independent, but they're not. The numbers you draw for the treatment tickets affect the numbers you get for the control tickets: if you're in the treatment group, you can't be in the control group!
IS IT ALL BALDERDASH?
-
Why the two-sample z-test still works for experiments
The two errors we stated on the previous page will magically (almost) cancel out
The two-sample z-test will give us a P-value that's marginally too large (conservative), but this is usually better than being too small
For a proof, see page 33 of the textbook appendix, if you like algebra.
-
Another example: are people rational?
This experiment was performed on 167 doctors in a summer course at Harvard
The doctors were asked to evaluate information presented on surgery and radiation treatment for cancer
Doctors were randomly divided into two groups; the groups were presented the information in different ways
-
Group A
Of 100 people having surgery: 10 will die during treatment, 32 will have died by one year, 66 will have died by five years
Of 100 people having radiation therapy: none will die during treatment, 23 will die by one year, 78 will die by five years
-
Group B
Of 100 people having surgery: 90 will survive the treatment, 68 will survive one year or longer, 34 will survive five years or longer
Of 100 people having radiation therapy: all will survive the treatment, 77 will survive one year or longer, 22 will survive five years or longer
-
The doctors' results
In Group A, 40 favoured surgery, 40 favoured radiation (50% vs 50%)
In Group B, 73 favoured surgery, 14 favoured radiation (83.9% vs 16.1%)
Can this difference be explained by chance?
-
Setting up the test
We'll use (Group B percentage - Group A percentage) as our test statistic
Null hypothesis: difference in percentages for all the doctors is zero
Alternative: difference is not zero
-
Calculations
Observed difference = 83.9% - 50% = 33.9%
Expected difference under null = 0%
SE of Group A percentage = sqrt(0.5 × 0.5/80) × 100% = 5.6%
SE of Group B percentage = sqrt(0.839 × 0.161/87) × 100% = 3.94%
SE of difference = sqrt(3.94² + 5.6²) = 6.84%
Note: some statisticians would pool both samples to get one estimate of the box SD (instead of one for each group). It doesn't make much difference, though.
-
Setting up the test
z-statistic = (33.9 - 0)/6.84 = 4.96
This is so big it's not even on our normal table, so the P-value will be minuscule (from computer, the P-value is 1/14000 of 1%)
These doctors were not rational (this doesn't necessarily mean that the whole population of doctors is irrational, but it might)
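The same arithmetic can be sketched from the counts (40 of 80 in Group A and 73 of 87 in Group B favoured surgery); the variable names here are mine:

```python
from math import sqrt, erfc

# Group A: 40 of 80 favoured surgery; Group B: 73 of 87.
pA, nA = 40 / 80, 80
pB, nB = 73 / 87, 87

seA = sqrt(pA * (1 - pA) / nA)   # about 0.056, i.e. 5.6 percentage points
seB = sqrt(pB * (1 - pB) / nB)   # about 0.039
se_diff = sqrt(seA**2 + seB**2)

z = (pB - pA) / se_diff          # test statistic: B minus A; null difference is 0
p = erfc(abs(z) / sqrt(2))       # two-tailed P-value
print(round(z, 2))               # 4.96
```

The P-value comes out around 7 × 10⁻⁷, consistent with the slide's "1/14000 of 1%".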
-
When can we use the two-sample z-test?
When we are examining the difference between two independent samples, OR
When we are examining the difference between two groups from a randomised experiment
NOT when we are examining the difference between two dependent samples that are not from a randomised experiment
In any case, we require a reasonably large sample size
-
Example
I have both the midterm and final scores for a large sample of past Stat 2 students. What test should I do for a difference between midterm and final scores?
Not an experiment
Dependent: midterm and final scores for the same student are obviously related; the data is paired
Can't do a two-sample z-test
Instead, find the difference between final and midterm scores, and compare the average difference to zero using a one-sample z-test
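A sketch of the paired approach, with made-up scores for ten hypothetical students (in practice the sample would need to be large for the z-test to apply):

```python
from math import sqrt, erfc

# Made-up midterm and final scores, for illustration only.
midterm = [55, 72, 64, 80, 49, 90, 67, 58, 75, 61]
final   = [60, 70, 71, 85, 52, 88, 72, 63, 79, 66]

# One difference per student: the data are paired, not independent.
diffs = [f - m for f, m in zip(final, midterm)]
n = len(diffs)
avg = sum(diffs) / n
sd = sqrt(sum((d - avg)**2 for d in diffs) / n)
se = sd / sqrt(n)

z = (avg - 0) / se            # null: the average difference is zero
p = erfc(abs(z) / sqrt(2))    # two-tailed P-value
```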
-
Aside: the two-sample t-test
Just as there's a two-sample z-test, there's a two-sample t-test for the average difference between two moderate-sized normal samples
We don't teach it because it's theoretically fraught; however, it's commonly used (inappropriately)
-
II
The chi-square test
-
So far
We've seen:
z-test: average/percentage/total/count, large sample
t-test: average/percentage/total/count, normal data
Two-sample z-test: difference between averages of independent samples or of experimental groups
-
Now
All the tests we've seen have examined parameters of some model
What if we want to test an entire chance model? Specifically, how well does the data we've observed fit a particular chance model?
Such tests are goodness-of-fit tests
-
Example: is this die loaded?
To see if a die is loaded, I roll it 60 times. I expect ten 1's, ten 2's, ten 3's, etc. I get:
four 1's, six 2's, 17 3's, 16 4's, eight 5's, nine 6's
Is the die loaded?
-
Why not do a z-test?
We could do a z-test to check that the sample average is what it should be (3.5)
However, there are some things this test won't pick up
e.g. the die has a 30% chance of a 1, a 30% chance of a 6, and a 10% chance of any other number
Mean will still be 3.5 even though it's loaded
-
The chi-square statistic
Again we want to compare what we observe to what we expect. The way we do this is by finding the chi-square statistic: calculate
(observed frequency - expected frequency)² / expected frequency
for each outcome, then take the sum of these.
Notes: Frequency just means the count in each category. Don't divide by the SE; that was the z- or t-statistic.
-
Calculating the chi-square statistic
For ones: (4 - 10)²/10 = 3.6
For twos: (6 - 10)²/10 = 1.6
For threes: (17 - 10)²/10 = 4.9
For fours: (16 - 10)²/10 = 3.6
For fives: (8 - 10)²/10 = 0.4
For sixes: (9 - 10)²/10 = 0.1
χ² = 3.6 + 1.6 + 4.9 + 3.6 + 0.4 + 0.1 = 14.2
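The slide's arithmetic in Python (a sketch; the counts are those from the data slide):

```python
# Observed counts of faces 1-6 in 60 rolls, vs 10 expected per face.
observed = [4, 6, 17, 16, 8, 9]
expected = [10] * 6

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e)**2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 1))  # 14.2
```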
-
Is this significant?
If we have used the true expected values, then χ² has an approximate chi-square distribution
Actually, like the t-distribution, the chi-square distribution is really a whole set of distributions
Here degrees of freedom = number of categories - 1 = 6 - 1 = 5
So if the null is true, our χ² has a chi-square distribution with 5 degrees of freedom
-
Is this significant?
We look up the chi-square distribution in tables or on a computer
We find the probability of getting a χ² with 5 degrees of freedom of more than 14.2 is 1.4%
This means the P-value is 1.4%. The die is loaded.
Note: chi-square tests are almost always one-tailed
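Without a table, the tail probability can be computed directly. A sketch using only the standard library, via the upper incomplete gamma recurrence for 5 degrees of freedom (the function name is mine; scipy.stats.chi2.sf(14.2, 5) would give the same answer):

```python
from math import sqrt, exp, erfc, pi

# P(chi-square with 5 df > x), built from the recurrence
# Gamma(s+1, y) = s*Gamma(s, y) + y**s * exp(-y), starting from
# Gamma(1/2, y) = sqrt(pi) * erfc(sqrt(y)).  Stdlib only.
def chi2_sf_5df(x):
    y = x / 2
    g_half  = sqrt(pi) * erfc(sqrt(y))
    g_3half = 0.5 * g_half + sqrt(y) * exp(-y)
    g_5half = 1.5 * g_3half + y**1.5 * exp(-y)
    return g_5half / (0.75 * sqrt(pi))   # Gamma(5/2) = (3/4)*sqrt(pi)

p = chi2_sf_5df(14.2)
print(round(p * 100, 1))  # 1.4 (percent), matching the table lookup
```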
-
When do we use the chi-square test?
We have data in categories (or that we can put into categories)
We have a (box) model for how many data points fall into each category
We want to see if our observed data matches this model
Note: for the test to be accurate, we need the expected number in each category to be at least 5
-
The structure of the chi-square test
Null hypothesis: often easiest to express in terms of tickets in a box. For the die, the null hypothesis was that rolls were like draws from the box
[ 1 2 3 4 5 6 ]
Alternative: data are not like draws from this box
-
The chi-square statistic
Draw a table of observed and expected frequencies for each category
Calculate
(observed frequency - expected frequency)² / expected frequency
for each category
Add up the results; this is χ²
-
The chi-square statistic
A high value of χ² means the data is different from what we expect
The exact distribution of χ² depends on the degrees of freedom (number of categories - 1)
We calculate a P-value using tables or a computer
-
Recap
-
Recap: the two-sample z-test
We use the two-sample z-test to test a null hypothesis about the difference in the averages of two boxes (often, that the averages are the same)
The test statistic is
z = (observed difference - expected difference) / SE of difference
-
Recap: the distribution of a difference
If the null hypothesis is that the true averages are the same, the expected difference is zero
If the SEs of the sample averages are a and b, the SE of their difference is sqrt(a² + b²)
This requires either independent samples, or a randomised experiment
With large samples, we can compare z to a standard normal to get a P-value
-
Recap: the chi-square test
If we have data in categories, and we wish to know if this data fits a null model, we calculate the chi-square statistic:
χ² = sum over all categories of (observed frequency - expected frequency)² / expected frequency
The degrees of freedom is the number of categories minus one
We find the P-value from a chi-square distribution using a table or computer
-
Monday
The chi-square test for independence
Problems with statistical tests