z-test, t-test, chi-square test
TRANSCRIPT
-
STAT 2 Lecture 32:
Still more testing
-
Recall: the two-sample z-test
In 1998, a survey of a SRS of 200 Berkeley students found that they, on average, played hackeysack 4 hours a week (with an SD+ of 8 hours)
In 2008, a survey of a SRS of 200 Berkeley students found that they, on average, played hackeysack 2 hours a week (with an SD+ of 6 hours)
Has the average number of hours of hackeysack played per week by all Berkeley students changed from 1998 to 2008?
-
Setting up the two-sample z-test
Null hypothesis: the 1998 population average is the same as the 2008 population average (so the data are like draws from two boxes with the same average)
Alternative: the averages are different
Test statistic: the z-statistic for the difference in sample averages
Assume the 1998 and 2008 samples are independent
-
Calculations
Observed difference in averages = 4 - 2 = 2 hours
Expected difference in averages, assuming the null hypothesis = 0
SE of 1998 sample average = 8/sqrt(200) = 0.566
SE of 2008 sample average = 6/sqrt(200) = 0.424
SE of difference in sample averages = sqrt(0.566² + 0.424²) = 0.707
-
Calculations
z-statistic = (2 - 0)/0.707 = 2.83
From tables: P(Z < -2.83) = P(Z > 2.83) = 0.23%
Two-tailed P-value is 0.23% + 0.23% = 0.46%
There is a statistically significant difference; the average hours of hackeysack played has changed
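The whole calculation can be scripted. A minimal sketch in Python (the helper function and its name are mine, not from the course; only the summary statistics come from the slides):

```python
from math import sqrt, erfc

# Two-sample z-test from summary statistics, as on the slides.
# (Hypothetical helper; not part of the course materials.)
def two_sample_z(avg1, sd1, n1, avg2, sd2, n2):
    se1 = sd1 / sqrt(n1)              # SE of first sample average
    se2 = sd2 / sqrt(n2)              # SE of second sample average
    se_diff = sqrt(se1**2 + se2**2)   # SE of the difference
    z = (avg1 - avg2 - 0) / se_diff   # expected difference under the null is 0
    p = erfc(abs(z) / sqrt(2))        # two-tailed P-value from the normal curve
    return z, p

z, p = two_sample_z(4, 8, 200, 2, 6, 200)
print(round(z, 2), round(p * 100, 2))  # 2.83 0.47
```

The computed P-value is 0.47%; the slide's 0.46% differs only because each tail was rounded to 0.23% before adding.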
-
Today
Testing results of experiments
The chi-square test
-
I
Experimental averages
-
Example: vitamin C
A study of the effect of vitamin C on cold resistance is performed. 200 participants are randomly assigned to the treatment group of 100, which receives vitamin C pills, or the control group of 100, which receives placebos. Doctors and participants don't know who's in which group.
-
Example: vitamin C
The treatment group averaged 2.3 colds (SD 3.1)
The control group averaged 2.6 colds (SD 2.9)
Is this difference significant? Does vitamin C prevent colds?
-
A box model for controlled experiments
First imagine a box containing 200 tickets: one for each participant
Each ticket has TWO numbers written on it: a treatment number and a control number
If a participant is randomly assigned to the treatment group, we observe the treatment number; if randomly assigned to the control group, we observe the control number
-
A box model for controlled experiments
Null hypothesis: the average of all 200 of the treatment numbers is equal to the average of all 200 of the control numbers
Alternative: the treatment and control averages are not equal
-
Try the two-sample z-test
Let's do the two-sample z-test for now, and think about whether it's appropriate later
Test statistic = treatment group average - control group average
Observed value = 2.3 - 2.6 = -0.3
Expected value under null = 0
-
Standard errors
SE of treatment group average (ignoring correction factor) = 3.1/sqrt(100) = 0.31
SE of control group average = 2.9/sqrt(100) = 0.29
SE of difference = sqrt(0.31² + 0.29²) = 0.42
-
Try the two-sample z-test
z-statistic = (-0.3 - 0)/0.42 = -0.71
P(Z < -0.71) = 23.89%, so the P-value is 24%
The difference is explainable by chance: there's not enough evidence to show vitamin C prevents colds (even for this sample)
-
Was the two-sample z-test appropriate?
But wait! We drew 100 tickets from a box containing 200 tickets, so shouldn't we have used the correction factor?
Our standard error calculation assumed the two samples were independent, but they're not. The numbers you draw for the treatment tickets affect the numbers you get for the control tickets: if you're in the treatment group, you can't be in the control group!
IS IT ALL BALDERDASH?
-
Why the two-sample z-test still works for experiments
The two errors we stated on the previous page will magically (almost) cancel out
The two-sample z-test will give us a P-value that's marginally too large (conservative), but this is usually better than being too small
For a proof, see page 33 of the textbook appendix, if you like algebra.
-
Another example: are people rational?
This experiment was performed on 167 doctors in a summer course at Harvard
The doctors were asked to evaluate information presented on surgery and radiation treatment for cancer
Doctors were randomly divided into two groups; the groups were presented the information in different ways
-
Group A
Of 100 people having surgery: 10 will die during treatment, 32 will have died by one year, 66 will have died by five years
Of 100 people having radiation therapy: none will die during treatment, 23 will die by one year, 78 will die by five years
-
Group B
Of 100 people having surgery: 90 will survive the treatment, 68 will survive one year or longer, 34 will survive five years or longer
Of 100 people having radiation therapy: all will survive the treatment, 77 will survive one year or longer, 22 will survive five years or longer
-
The doctors' results
In Group A, 40 favoured surgery, 40 favoured radiation (50% vs 50%)
In Group B, 73 favoured surgery, 14 favoured radiation (83.9% vs 16.1%)
Can this difference be explained by chance?
-
Setting up the test
We'll use (Group B percentage - Group A percentage) as our test statistic
Null hypothesis: difference in percentages for all the doctors is zero
Alternative: difference is not zero
-
Calculations
Observed difference = 83.9% - 50% = 33.9%
Expected difference under null = 0%
SE of Group A percentage = sqrt(0.5 × 0.5/80) × 100% = 5.6%
SE of Group B percentage = sqrt(0.839 × 0.161/87) × 100% = 3.94%
SE of difference = sqrt(3.94² + 5.6²) = 6.84%
Note: some statisticians would pool both samples to get one estimate of the box SD (instead of one for each group). It doesn't make much difference, though.
-
Setting up the test
z-statistic = (33.9 - 0)/6.84 = 4.96
This is so big it's not even on our normal table, so the P-value will be minuscule (from computer, the P-value is 1/14000 of 1%)
These doctors were not rational (this doesn't necessarily mean that the whole population of doctors is irrational, but it might)
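The same arithmetic can be sketched from the counts (40 of 80 in Group A and 73 of 87 in Group B favoured surgery); the variable names here are mine:

```python
from math import sqrt, erfc

# Group A: 40 of 80 favoured surgery; Group B: 73 of 87.
pA, nA = 40 / 80, 80
pB, nB = 73 / 87, 87

seA = sqrt(pA * (1 - pA) / nA)   # about 0.056, i.e. 5.6 percentage points
seB = sqrt(pB * (1 - pB) / nB)   # about 0.039
se_diff = sqrt(seA**2 + seB**2)

z = (pB - pA) / se_diff          # test statistic: B minus A; null difference is 0
p = erfc(abs(z) / sqrt(2))       # two-tailed P-value
print(round(z, 2))               # 4.96
```

The P-value comes out around 7 × 10⁻⁷, consistent with the slide's "1/14000 of 1%".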
-
When can we use the two-sample z-test?
When we are examining the difference between two independent samples, OR
When we are examining the difference between two groups from a randomised experiment
NOT when we are examining the difference between two dependent samples that are not from a randomised experiment
In any case, we require a reasonably large sample size
-
Example
I have both the midterm and final scores for a large sample of past Stat 2 students. What test should I do for a difference between midterm and final scores?
Not an experiment
Dependent: midterm and final scores for the same student are obviously related; the data is paired
Can't do a two-sample z-test
Instead, find the difference between final and midterm scores, and compare the average difference to zero using a one-sample z-test
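A sketch of the paired approach, with made-up scores for ten hypothetical students (in practice the sample would need to be large for the z-test to apply):

```python
from math import sqrt, erfc

# Made-up midterm and final scores, for illustration only.
midterm = [55, 72, 64, 80, 49, 90, 67, 58, 75, 61]
final   = [60, 70, 71, 85, 52, 88, 72, 63, 79, 66]

# One difference per student: the data are paired, not independent.
diffs = [f - m for f, m in zip(final, midterm)]
n = len(diffs)
avg = sum(diffs) / n
sd = sqrt(sum((d - avg)**2 for d in diffs) / n)
se = sd / sqrt(n)

z = (avg - 0) / se            # null: the average difference is zero
p = erfc(abs(z) / sqrt(2))    # two-tailed P-value
```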
-
Aside: the two-sample t-test
Just as there's a two-sample z-test, there's a two-sample t-test for the average difference between two moderate-sized normal samples
We don't teach it because it's theoretically fraught; however, it's commonly used (inappropriately)
-
II
The chi-square test
-
So far
We've seen:
z-test: average/percentage/total/count, large sample
t-test: average/percentage/total/count, normal data
Two-sample z-test: difference between averages of independent samples or of experimental groups
-
Now
All the tests we've seen have examined parameters of some model
What if we want to test an entire chance model? Specifically, how well does the data we've observed fit a particular chance model?
Such tests are goodness-of-fit tests
-
Example: is this die loaded?
To see if a die is loaded, I roll it 60 times. I expect ten 1's, ten 2's, ten 3's, etc. I get:
four 1's, six 2's, 17 3's, 16 4's, eight 5's, nine 6's
Is the die loaded?
-
Why not do a z-test?
We could do a z-test to check that the sample average is what it should be (3.5)
However, there are some things this test won't pick up
e.g. the die has a 30% chance of a 1, a 30% chance of a 6, and a 10% chance of any other number
Mean will still be 3.5 even though it's loaded
-
The chi-square statistic
Again we want to compare what we observe to what we expect. The way we do this is by finding the chi-square statistic: calculate
(observed frequency - expected frequency)² / expected frequency
for each outcome, then take the sum of these.
Notes: Frequency just means the count in each category. Don't divide by the SE; that was the z- or t-statistic.
-
Calculating the chi-square statistic
For ones: (4 - 10)²/10 = 3.6
For twos: (6 - 10)²/10 = 1.6
For threes: (17 - 10)²/10 = 4.9
For fours: (16 - 10)²/10 = 3.6
For fives: (8 - 10)²/10 = 0.4
For sixes: (9 - 10)²/10 = 0.1
χ² = 3.6 + 1.6 + 4.9 + 3.6 + 0.4 + 0.1 = 14.2
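The slide's arithmetic in Python (a sketch; the counts are those from the data slide):

```python
# Observed counts of faces 1-6 in 60 rolls, vs 10 expected per face.
observed = [4, 6, 17, 16, 8, 9]
expected = [10] * 6

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e)**2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 1))  # 14.2
```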
-
Is this significant?
If we have used the true expected values, then χ² has an approximate chi-square distribution
Actually, like the t-distribution, the chi-square distribution is really a whole set of distributions
Here degrees of freedom = number of categories - 1 = 6 - 1 = 5
So if the null is true, our χ² has a chi-square distribution with 5 degrees of freedom
-
Is this significant?
We look up the chi-square distribution in tables or on a computer
We find the probability of getting a χ² with 5 degrees of freedom of more than 14.2 is 1.4%
This means the P-value is 1.4%. The die is loaded.
Note: chi-square tests are almost always one-tailed
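Without a table, the tail probability can be computed directly. A sketch using only the standard library, via the upper incomplete gamma recurrence for 5 degrees of freedom (the function name is mine; scipy.stats.chi2.sf(14.2, 5) would give the same answer):

```python
from math import sqrt, exp, erfc, pi

# P(chi-square with 5 df > x), built from the recurrence
# Gamma(s+1, y) = s*Gamma(s, y) + y**s * exp(-y), starting from
# Gamma(1/2, y) = sqrt(pi) * erfc(sqrt(y)).  Stdlib only.
def chi2_sf_5df(x):
    y = x / 2
    g_half  = sqrt(pi) * erfc(sqrt(y))
    g_3half = 0.5 * g_half + sqrt(y) * exp(-y)
    g_5half = 1.5 * g_3half + y**1.5 * exp(-y)
    return g_5half / (0.75 * sqrt(pi))   # Gamma(5/2) = (3/4)*sqrt(pi)

p = chi2_sf_5df(14.2)
print(round(p * 100, 1))  # 1.4 (percent), matching the table lookup
```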
-
When do we use the chi-square test?
We have data in categories (or that we can put into categories)
We have a (box) model for how many data points fall into each category
We want to see if our observed data matches this model
Note: for the test to be accurate, we need the expected number in each category to be at least 5
-
The structure of the chi-square test
Null hypothesis: often easiest to express in terms of tickets in a box. For the die, the null hypothesis was that rolls were like draws from the box
[ 1 2 3 4 5 6 ]
Alternative: data are not like draws from this box
-
The chi-square statistic
Draw a table of observed and expected frequencies for each category
Calculate
(observed frequency - expected frequency)² / expected frequency
for each category
Add up the results; this is χ²
-
The chi-square statistic
A high value of χ² means the data is different from what we expect
The exact distribution of χ² depends on the degrees of freedom (number of categories - 1)
We calculate a P-value using tables or a computer
-
Recap
-
Recap: the two-sample z-test
We use the two-sample z-test to test a null hypothesis about the difference in the averages of two boxes (often, that the averages are the same)
The test statistic is
z = (observed difference - expected difference) / SE of difference
-
Recap: the distribution of a difference
If the null hypothesis is that the true averages are the same, the expected difference is zero
If the SEs of the sample averages are a and b, the SE of their difference is sqrt(a² + b²)
This requires either independent samples, or a randomised experiment
With large samples, we can compare z to a standard normal to get a P-value
-
Recap: the chi-square test
If we have data in categories, and we wish to know if this data fits a null model, we calculate the chi-square statistic:
χ² = sum over all categories of (observed frequency - expected frequency)² / expected frequency
The degrees of freedom is the number of categories minus one
We find the P-value from a chi-square distribution using a table or computer
-
Monday
The chi-square test for independence
Problems with statistical tests