py1pr1 stats lecture 4 handout

8/15/2019 PY1PR1 Stats Lecture 4 Handout

1/29

PY1PR1 lecture 4: Comparing two sample means

Dr David Field


2/29

Comparing two samples

• Researchers often begin with a hypothesis that two sample means will be different from each other
• In practice, two sample means will almost always be slightly different from each other
• Therefore, statistics are used to decide whether the observed difference between two samples is meaningful or not
• To do this, we test the null hypothesis that the two samples were both drawn randomly from the same population


3/29

Test statistics

• To test the null hypothesis we need to quantify the strength of the evidence against it
• This is done using test statistics
  - when the test statistic is larger, there is more evidence against the null hypothesis
• What makes test statistics different from other statistics is that they have known probability distributions when the null hypothesis is true
  - we know the p of a test statistic of 1 or > 1 occurring purely due to sampling variation from a null distribution
  - the p of a test statistic of 2 or > 2 will be lower than the p of a test statistic of > 1
  - if the p of the test statistic occurring purely due to sampling variation is < 0.05 (5%) the null hypothesis is rejected
• Test statistics with known probability distributions under the null hypothesis include z, t, r, and chi-square
  - Mean, Median, SD are not test statistics
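As a rough illustration of what "known probability distribution" means in practice, the sketch below looks up upper-tail probabilities for a z statistic. It assumes the null distribution is the standard normal and that the 5% cut-off is applied to one tail; the helper name `upper_tail_p` is made up for this example.

```python
from statistics import NormalDist

# Assumption: the test statistic is z, so its null distribution is standard normal.
null_distribution = NormalDist(mu=0.0, sigma=1.0)


def upper_tail_p(test_statistic: float) -> float:
    """Probability of a value this large or larger under the null distribution."""
    return 1.0 - null_distribution.cdf(test_statistic)


for value in (1.0, 2.0):
    print(f"p(z >= {value}) = {upper_tail_p(value):.4f}")

# Smallest z that a one-tailed 5% cut-off would reject (an assumed convention).
critical_z = null_distribution.inv_cdf(0.95)
print(f"one-tailed 5% critical z = {critical_z:.3f}")
```

A larger test statistic always gives a smaller p here, which is the sense in which it counts as stronger evidence against the null hypothesis.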


4/29

Confidence intervals as a test

• Lecture 2 explained how to calculate a 95% confidence interval around a single sample mean
  - this was achieved using the SE of an inferred sampling distribution of the mean
  - collecting two samples and calculating two separate confidence intervals establishes that the two samples are from different populations if the confidence intervals do not overlap
  - but it does not allow a conclusion to be reached when the confidence intervals do overlap
• To calculate a test statistic to directly test the null hypothesis we need to consider a slightly different sampling distribution
  - the sampling distribution of the difference between two means
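The overlap check described above can be sketched as follows. The means and SEs are one pair of samples of 5 from the handout's table of cat weights; the 1.96 multiplier is an assumption about how the 95% interval is formed (Lecture 2 may have used a different multiplier for small samples), and both function names are hypothetical.

```python
Z_95 = 1.96  # assumed multiplier for a 95% confidence interval


def confidence_interval(mean: float, se: float, z: float = Z_95) -> tuple[float, float]:
    """Lower and upper limits of the interval mean +/- z * SE."""
    return (mean - z * se, mean + z * se)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when the two intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


# Mean and SE (kg) for one UK and one Greek sample of 5 cats from the table.
uk = confidence_interval(4.1, 0.23)
greek = confidence_interval(3.7, 0.13)
overlap = intervals_overlap(uk, greek)

print(f"UK:    {uk[0]:.2f} to {uk[1]:.2f} kg")
print(f"Greek: {greek[0]:.2f} to {greek[1]:.2f} kg")
print("overlap" if overlap else "no overlap")
```

When the intervals overlap, as they do for this pair, the comparison is inconclusive, which is why a direct test statistic is needed.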


5/29

Sampling distribution of the difference between two means

• Normally, you are only able to measure 2 samples and calculate 2 means and the difference between them
• But test statistics are based on properties of an assumed underlying sampling distribution of the difference between two means
• The best way to understand test statistics is to consider unusual or artificial examples where full population data and sampling distributions are available
• Therefore


9/29

                              Weights of British cats (kg)    Weights of Greek cats (kg)
                              Mean    SD     SE               Mean    SD     SE
Small sample size (N = 5)     4.1     0.5    0.23             3.9     0.1    0.06
                              4.9     0.6    0.29             4.4     0.5    0.21
                              5.2     1.1    0.49             3.7     0.3    0.13
Large sample size (N = 12)    4.1     0.8    0.23             4.1     0.2    0.07
                              4.6     0.6    0.18             3.9     0.4    0.10
                              4.7     0.5    0.15             3.8     0.3    0.10


10/29

Sampling distribution of the difference between two means

• Take a large number of samples of 5 cats from the UK population
  - Arrange the samples in pairs and for each pair calculate the difference between the two means
  - Half the differences will be negative and half of them will be positive
  - Therefore the mean of this sampling distribution will be zero. This differs from the sampling distribution of a single sample mean, which has a mean equal to the underlying population mean
  - The sampling distribution of the difference between two means will be normally distributed
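The procedure above can be tried out with a small simulation. This is only a sketch under stated assumptions: cat weights are taken to be normally distributed, the population SD of 0.8 kg is the value the handout uses for UK cats, and the population mean of 4.5 kg is a hypothetical figure chosen for illustration.

```python
import random
import statistics

random.seed(4)  # fixed seed so the sketch is repeatable

POP_MEAN = 4.5   # hypothetical population mean weight in kg (not from the handout)
POP_SD = 0.8     # population SD in kg, the value the handout uses for UK cats
N = 5            # cats per sample
PAIRS = 20_000   # number of sample pairs to draw


def sample_mean(n: int) -> float:
    """Mean weight of n cats drawn at random from a normal population."""
    return statistics.fmean(random.gauss(POP_MEAN, POP_SD) for _ in range(n))


# Each entry is (mean of first sample) minus (mean of second sample).
differences = [sample_mean(N) - sample_mean(N) for _ in range(PAIRS)]

mean_diff = statistics.fmean(differences)
sd_diff = statistics.stdev(differences)
share_negative = sum(d < 0 for d in differences) / PAIRS

print(f"mean of differences: {mean_diff:.3f}")
print(f"SD of differences:   {sd_diff:.3f}")
print(f"share negative:      {share_negative:.3f}")
```

The mean of the differences should land close to zero whatever value is chosen for `POP_MEAN`, because the population mean cancels out when one sample mean is subtracted from the other.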


11/29

• ORIGINAL DISTRIBUTION is the population frequency distribution of weight differences between pairs of individual cats
• Black solid curves are sampling distributions of weight differences between 2 sample means, for samples of 4, 16, and 64 cats


12/29

Standard error of the difference between two sample means

$$ SE = \sigma \sqrt{\frac{1}{N_1} + \frac{1}{N_2}} $$

• σ (sigma) means the SD of the population of difference scores
• N1 and N2 are the two sample sizes
  - the formula allows the SE of the sampling distribution to be calculated when the two samples differ in size
• Like the SE of a single sample mean, this SE gets smaller as N increases and gets smaller as the SD gets smaller
• Smaller SE makes it easier to reject the null hypothesis


13/29

SE of the difference between mean kg for two samples of 5 UK cats

$$ SE = 0.8 \sqrt{\frac{1}{5} + \frac{1}{5}} = 0.506 \text{ kg} $$

• 1/5 (or 1/2, or 1/3, or 1/20) is a number less than 1
• The square root makes the number larger, but never makes it greater than 1
• So, whenever the two fractions add up to less than 1 (any samples of 3 or more), the population SD gets multiplied by a number smaller than 1, which is why the SE is smaller than the SD of the population
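The SE formula for a known population SD can be written as a short function. This is a minimal sketch; the function name `se_of_difference` is an illustrative choice, and the sample sizes other than 5 are included only to show how the SE shrinks as N grows.

```python
import math


def se_of_difference(sigma: float, n1: int, n2: int) -> float:
    """SE of the difference between two sample means when sigma is known."""
    return sigma * math.sqrt(1 / n1 + 1 / n2)


SIGMA_UK = 0.8  # population SD in kg, as used in the handout

# Two samples of 5 UK cats, the case worked through in the handout.
print(f"N1 = N2 = 5: SE = {se_of_difference(SIGMA_UK, 5, 5):.3f} kg")

# Larger samples give a smaller SE; the sample sizes may also differ.
for n1, n2 in ((12, 12), (30, 30), (5, 12)):
    print(f"N1 = {n1}, N2 = {n2}: SE = {se_of_difference(SIGMA_UK, n1, n2):.3f} kg")
```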


14/29


                              Mean    SD     SE
Small sample size (N = 5)     4.1     0.5    0.23   *
                              4.9     0.6    0.29
                              5.2     1.1    0.49
                              3.6     0.7    0.30   *

(* = the highlighted pair of samples)

• For the highlighted pair of samples the difference between the means is 0.5 kg
• What percentage of sample pairs have a difference of 0.5 kg or larger?
• If we expressed the difference of 0.5 kg in units of SE we could answer that question
• This is because the converted score is a Z score

Remember that in this theoretical example we know that both samples are from the same population, and the purpose is to calculate the p of a difference this big or bigger occurring when that is the case


15/29

Converting the difference between 2 sample means to a Z score

$$ Z = \frac{0.5}{0.8 \sqrt{\frac{1}{5} + \frac{1}{5}}} = 0.99 $$

The numerator is the difference between the means; the denominator is the SE formula.


16/29

16.1% of the total area under the normal curve corresponds to values of 0.99 or greater

16.1% of differences between means of sample size 5 will have Z scores greater than 0.99


17/29

From Z back to kg

• So, 16.1% of differences between pairs of samples of 5 drawn from the population of UK cats will be 0.5 kg or larger
• This is the same as saying the probability of a single comparison producing a difference of 0.5 kg or greater is 16.1%
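The conversion from kg to Z and on to a probability can be sketched in a few lines. The inputs are the handout's values; the last step, which converts a one-tailed 5% cut-off back into kg, is an added illustration rather than something the handout computes.

```python
import math
from statistics import NormalDist

sigma = 0.8        # population SD in kg (from the handout)
n1 = n2 = 5        # cats per sample
difference = 0.5   # observed difference between the two sample means, in kg

se = sigma * math.sqrt(1 / n1 + 1 / n2)   # SE of the difference
z = difference / se                       # the difference in units of SE
p_larger = 1 - NormalDist().cdf(z)        # share of Z scores this large or larger

print(f"SE = {se:.3f} kg, Z = {z:.2f}, p = {p_larger:.3f}")

# Going the other way: the difference in kg that sits exactly on an
# assumed one-tailed 5% cut-off.
critical_difference = NormalDist().inv_cdf(0.95) * se
print(f"difference at the 5% cut-off = {critical_difference:.2f} kg")
```

Because the code keeps Z unrounded, the probability it prints can differ in the last decimal from the handout's 16.1%, which is based on Z rounded to 0.99.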


18/29

What if the population SD (σ) is unknown?

• Usually, researchers only have two samples to compare, and the population parameters are unknown.
• In this situation the sample SD is used instead of the population SD, and the SE formula is modified

$$ SE = \sqrt{\frac{SD_1^2}{N_1} + \frac{SD_2^2}{N_2}} $$


19/29


                              Mean    SD     SE
Small sample size (N = 5)     4.1     0.5    0.23   *
                              4.9     0.6    0.29
                              5.2     1.1    0.49
                              3.6     0.7    0.30   *

(* = the highlighted pair of samples)

• For the highlighted pair of samples the mean difference is 0.5 kg
• The sample SDs will be used in the modified formula instead of the unknown population SD


20/29

Converting the difference between 2 means to a Z score when σ is unknown

$$ Z = \frac{0.5}{\sqrt{\frac{0.5^2}{5} + \frac{0.7^2}{5}}} = \frac{0.5}{0.38} = 1.29 $$


21/29

How much evidence is there against the null hypothesis?

• 9.8% of Z statistics are > 1.29, so we would not conclude that the two samples of cats are from different countries if we used the 5% cut off
• In this example, we know that the two samples were from the same population, so we can verify that this was the correct conclusion
• On the other hand, if two samples had a mean difference of 0.8 kg, then assuming the sample SDs remain the same, the resulting Z statistic would be 2.07
• Only 1.9% of Z statistics are greater than 2.07, and if we didn't know that the two samples came from the same population we would reject the null hypothesis, and by doing so commit a Type I error
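The two cases above can be reproduced with a small helper that uses the sample SDs in place of σ. This is a sketch: the function names are hypothetical, the 5% cut-off is applied to the upper tail only, and the unrounded results may differ slightly in the second decimal from the handout's figures.

```python
import math
from statistics import NormalDist

ALPHA = 0.05  # the 5% cut-off, applied here to the upper tail only


def z_from_samples(mean_diff: float, sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Z for a difference between two means, with the SE built from sample SDs."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return mean_diff / se


def upper_tail_p(z: float) -> float:
    """Share of Z statistics greater than z under the null hypothesis."""
    return 1 - NormalDist().cdf(z)


# Sample SDs of 0.5 and 0.7 kg, 5 cats per sample, as in the highlighted pair.
results = {}
for diff in (0.5, 0.8):
    z = z_from_samples(diff, 0.5, 5, 0.7, 5)
    p = upper_tail_p(z)
    results[diff] = (z, p)
    decision = "reject" if p < ALPHA else "do not reject"
    print(f"difference {diff} kg: Z = {z:.2f}, p = {p:.3f} -> {decision} the null")
```

Both samples here come from the same population, so the "reject" decision for the 0.8 kg difference is the Type I error described above.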


22/29


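                              Weights of British cats (kg)    Weights of Greek cats (kg)
                              Mean    SD     SE               Mean    SD     SE
Small sample size (N = 5)     4.1     0.5    0.23  *          3.9     0.1    0.06
                              4.9     0.6    0.29             4.4     0.5    0.21
                              5.2     1.1    0.49             3.7     0.3    0.13  *
Large sample size (N = 12)    4.1     0.8    0.23             4.1     0.2    0.07  *
                              4.6     0.6    0.18  *          3.9     0.4    0.10
                              4.7     0.5    0.15             3.8     0.3    0.10

(* = highlighted samples: UK 4.1 and Greek 3.7 for N = 5; UK 4.6 and Greek 4.1 for N = 12)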


23/29

The Z score of the difference between samples of 5 UK and 5 Greek cats

$$ Z = \frac{4.1 - 3.7}{\sqrt{\frac{0.5^2}{5} + \frac{0.3^2}{5}}} = \frac{0.4}{0.26} = 1.53 $$


24/29


• 6.3% of Z statistics are > 1.53, so we would be unable to conclude that the two samples of cats are from different countries if we used the 5% cut off
• In this example we know that the two samples were from different populations, so we have committed a Type II error by failing to reject the null hypothesis
• Type II errors like this are common when the sample size is small


25/29

The Z score of the difference between samples of 12 UK and 12 Greek cats

$$ Z = \frac{4.6 - 4.1}{\sqrt{\frac{0.6^2}{12} + \frac{0.2^2}{12}}} = \frac{0.5}{0.18} = 2.73 $$


26/29


• 0.32% of Z statistics are > 2.73, so we would conclude that the two samples of cats are from different countries if we used the 5% cut off
• In this example we know that the two samples were from different populations, so we have correctly rejected the null hypothesis
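The UK versus Greek comparisons at both sample sizes can be run side by side to show the effect of N. This is a sketch using the highlighted sample statistics from the table; the `Sample` class and `z_test` function are illustrative names, and the test is one-tailed at the 5% cut-off.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class Sample:
    mean: float  # sample mean weight in kg
    sd: float    # sample SD in kg
    n: int       # number of cats


def z_test(a: Sample, b: Sample) -> tuple[float, float]:
    """Return (Z, upper-tail p) for the difference between two sample means."""
    se = math.sqrt(a.sd ** 2 / a.n + b.sd ** 2 / b.n)
    z = (a.mean - b.mean) / se
    return z, 1 - NormalDist().cdf(z)


comparisons = {
    "N = 5": (Sample(4.1, 0.5, 5), Sample(3.7, 0.3, 5)),
    "N = 12": (Sample(4.6, 0.6, 12), Sample(4.1, 0.2, 12)),
}

outcomes = {}
for label, (uk, greek) in comparisons.items():
    z, p = z_test(uk, greek)
    outcomes[label] = (z, p)
    verdict = "reject the null" if p < 0.05 else "fail to reject the null"
    print(f"{label}: Z = {z:.2f}, p = {p:.4f} -> {verdict}")
```

The observed differences are similar in size (0.4 kg and 0.5 kg), so most of the change in Z comes from the smaller SE that the larger samples provide.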


27/29

Important caveat

• What I have described today is called a "Z test"
• But, the formula for estimating the SE of the difference between 2 means used in the Z test is only accurate when the individual sample sizes are 30 or more
  - This is because the estimate of the population SD is not accurate when the samples are smaller
• There is a different test that uses an accurate estimate of the SE when sample size is less than 30
  - the "t test", which is covered in the next lecture
• Because the t test produces the same results as the Z test when the sample size is > 30, computer programs like SPSS generally only give the option of a t test
• Both tests work on the same principle, but the Z test is less complicated and easier to understand
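One way to see why the caveat matters is to simulate many pairs of samples drawn from the same population and count how often the Z test rejects the null hypothesis. The sketch below assumes a normal population and a one-tailed 5% cut-off of Z > 1.645; the function names and the number of trials are arbitrary choices for illustration.

```python
import math
import random

random.seed(27)  # fixed seed so the sketch is repeatable

CUTOFF = 1.645   # one-tailed 5% cut-off for Z (an assumed convention)


def sample_variance(values: list[float]) -> float:
    """Sample variance (SD squared) with the usual N - 1 denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def false_positive_rate(n: int, trials: int = 20_000) -> float:
    """Share of same-population sample pairs whose Z exceeds the cut-off."""
    rejections = 0
    for _ in range(trials):
        a = [random.gauss(0.0, 1.0) for _ in range(n)]
        b = [random.gauss(0.0, 1.0) for _ in range(n)]
        se = math.sqrt(sample_variance(a) / n + sample_variance(b) / n)
        z = (sum(a) / n - sum(b) / n) / se
        if z > CUTOFF:
            rejections += 1
    return rejections / trials


rate_small = false_positive_rate(5)
rate_large = false_positive_rate(30)
print(f"N = 5:  rejected {rate_small:.1%} of true null hypotheses")
print(f"N = 30: rejected {rate_large:.1%} of true null hypotheses")
```

With samples of 5 the rejection rate comes out noticeably above the nominal 5%, while with samples of 30 it sits much closer to it, which is consistent with the sample size guideline above.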


28/29

General principle of test statistics

$$ \text{test statistic} = \frac{\text{variation in the DV due to the IV}}{\text{other variation in the data (error)}} $$

• All test statistics have known probability distributions when variation in the DV due to the IV is zero (i.e. the null hypothesis is true)
• Z has the distribution of the standard normal distribution
• Other test statistics have different shaped distributions, and different calculation formulas, but the general principle for converting the test statistic to a p value is the same.


29/29

List of statistical terms for revision

• This lecture made use of terms introduced in previous lectures, and only introduced one new term
  - sampling distribution of the difference between two means
