
Chi Squared Tests

Upload: denzel-slape

Post on 14-Dec-2015



Introduction

• Two statistical techniques are presented. Both are used to analyze nominal data.
– A goodness-of-fit test for a multinomial experiment.
– A contingency table test of independence.

• The test statistics in both cases follow the χ² distribution.

• The hypothesis tested involves the “success” probabilities p1, p2, …, pk of a multinomial distribution.

• The multinomial experiment is an extension of the binomial experiment.
– There are n independent trials.
– The outcome of each trial can be classified into one of k categories, called cells.
– The probability pi for an outcome to fall into cell i remains constant for each trial. By assumption, p1 + p2 + … + pk = 1.
– Trials in the experiment are independent.
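The conditions above can be illustrated with a small simulation. This is only a sketch: the function name `run_multinomial_experiment` and the seed are illustrative choices, and the probabilities are the pre-campaign market shares used in Example 16.1.

```python
import random
from collections import Counter

def run_multinomial_experiment(n, probabilities, seed=None):
    """Simulate n independent trials and return the number of outcomes per cell."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to 1")
    rng = random.Random(seed)
    cells = range(len(probabilities))
    # Each trial lands in exactly one of the k cells, with the same p_i every time.
    outcomes = rng.choices(cells, weights=probabilities, k=n)
    counts = Counter(outcomes)
    return [counts.get(i, 0) for i in cells]

# 200 trials over k = 3 cells with p = (.45, .40, .15); the seed is arbitrary.
counts = run_multinomial_experiment(200, [0.45, 0.40, 0.15], seed=1)
print(counts, sum(counts))
```

The cell counts always add up to n, because every trial falls into exactly one cell.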

Chi-Squared Goodness-of-Fit Test

• Our objective is to find out whether there is sufficient evidence to reject a pre-specified set of values for pi.

• The hypotheses:

H0: p1 = a1, p2 = a2, ..., pk = ak
H1: At least one pi ≠ ai

• The test builds on comparing the actual and the expected frequencies of occurrences in all cells.

An Example

• Example 16.1
– Two competing companies A and B have been dominant players in the market. Both companies conducted recent advertising campaigns on their products.
– Market shares before the campaigns were:
  • Company A = 45%
  • Company B = 40%
  • Other competitors = 15%.

• Example 16.1 – continued
– To study the effect of the campaigns on the market shares, a survey was conducted.
– 200 customers were asked to indicate their preference regarding the products advertised.
– Survey results:
  • 102 customers preferred company A’s product,
  • 82 customers preferred company B’s product,
  • 16 customers preferred the competitors’ products.

• Example 16.1 – continued

Can we conclude at the 5% significance level that the market shares were affected by the advertising campaigns?

• Solution
– The population investigated is the brand preferences.
– The data are nominal (A, B, or other).
– This is a multinomial experiment (three categories).
– The question of interest: Are p1, p2, and p3 different after the campaigns from their values prior to the campaigns?


• The hypotheses are:
H0: p1 = .45, p2 = .40, p3 = .15
H1: At least one pi changed.

The expected frequency for each category (cell) if the null hypothesis is true is shown below, together with the actual frequencies the sample returned:

Category     Expected frequency (ei)   Actual frequency (fi)
Company A    90 = 200(.45)             102
Company B    80 = 200(.40)             82
Others       30 = 200(.15)             16

• The statistic is:

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}, \quad \text{where } e_i = n p_i$$

Intuitively, this measures the extent of differences between the observed and the expected frequencies.

• The rejection region is:

$$\chi^2 > \chi^2_{\alpha,\,k-1}$$
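A minimal sketch of the statistic and the rejection rule, applied to the survey counts of Example 16.1. The function name `chi_squared_statistic` is an illustrative choice; the critical value 5.99147 is the table value quoted in this document for α = .05 and 2 degrees of freedom.

```python
def chi_squared_statistic(observed, null_probabilities):
    """Sum of (f_i - e_i)^2 / e_i over all cells, with e_i = n * p_i."""
    n = sum(observed)
    expected = [n * p for p in null_probabilities]
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

observed = [102, 82, 16]                 # f_i: companies A, B, others
null_probabilities = [0.45, 0.40, 0.15]  # p_i under H0

statistic = chi_squared_statistic(observed, null_probabilities)
critical_value = 5.99147  # chi-squared table value for alpha = .05, k - 1 = 2

print(round(statistic, 2))         # 8.18
print(statistic > critical_value)  # True: the statistic is in the rejection region
```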

• Example 16.1 – continued

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(102-90)^2}{90} + \frac{(82-80)^2}{80} + \frac{(16-30)^2}{30} = 8.18$$

$$\chi^2_{\alpha,\,k-1} = \chi^2_{.05,\,3-1} = 5.99147$$

The p-value = P(χ² > 8.18) = .01674

[this comes from Excel: =CHIDIST(8.18,2)]
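For 2 degrees of freedom the χ² tail probability has a simple closed form, P(χ² > x) = exp(−x/2), so both the critical value and the p-value can be checked without a statistics package. The sketch below relies on that identity, which holds for 2 degrees of freedom only; the function names are illustrative.

```python
import math

def chi2_tail_2df(x):
    """P(chi-squared with 2 df > x); exact for 2 degrees of freedom only."""
    return math.exp(-x / 2)

def chi2_critical_2df(alpha):
    """Value c with P(chi-squared with 2 df > c) = alpha (inverse of the tail above)."""
    return -2 * math.log(alpha)

critical_value = chi2_critical_2df(0.05)
p_value = chi2_tail_2df(8.18)

print(round(critical_value, 4))
print(round(p_value, 4))
```

The critical value agrees with the quoted 5.99147 to four decimal places, and the p-value is below α = .05, which matches rejecting H0.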

• Example 16.1 – continued

[Figure: density of the χ² distribution with 2 degrees of freedom. The rejection region (area α) lies to the right of 5.99; the p-value is the area to the right of 8.18.]

Conclusion: Since 8.18 > 5.99, there is sufficient evidence at the 5% significance level to reject the null hypothesis. At least one of the probabilities pi is different. Thus, at least two market shares have changed.

Required Conditions – The Rule of Five

• The test statistic used to perform the test is only approximately Chi-squared distributed.

• For the approximation to apply, the expected cell frequency has to be at least 5 for all cells (npi ≥ 5).

• If the expected frequency in a cell is less than 5, combine it with other cells.
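The rule of five can be checked mechanically before running the test. The sketch below uses made-up numbers (n = 50 and four hypothetical cells) purely for illustration; the helper names `cells_below_five` and `combine_cells` are not from the source.

```python
def cells_below_five(n, probabilities, minimum=5):
    """Indices of cells whose expected frequency n * p_i is below the minimum."""
    return [i for i, p in enumerate(probabilities) if n * p < minimum]

def combine_cells(observed, probabilities, indices):
    """Merge the listed cells into one cell, summing their counts and probabilities."""
    keep = [i for i in range(len(observed)) if i not in indices]
    merged_observed = [observed[i] for i in keep]
    merged_probabilities = [probabilities[i] for i in keep]
    merged_observed.append(sum(observed[i] for i in indices))
    merged_probabilities.append(sum(probabilities[i] for i in indices))
    return merged_observed, merged_probabilities

# Hypothetical data: expected frequencies are 25, 15, 6 and 4.
observed = [27, 14, 6, 3]
probabilities = [0.50, 0.30, 0.12, 0.08]

print(cells_below_five(50, probabilities))  # [3]: the last cell expects only 4

# Combine the small cell with its neighbour, then re-check.
observed, probabilities = combine_cells(observed, probabilities, [2, 3])
print(observed)                             # [27, 14, 9]
print(cells_below_five(50, probabilities))  # []
```

Combining cells reduces k, so the degrees of freedom k − 1 of the test drop accordingly.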

Chi-squared Test of a Contingency Table

• This test is used to test whether…
– two nominal variables are related?
– there are differences between two or more populations of a nominal variable?

• To accomplish the test objectives, we need to classify the data according to two different criteria.

• The idea is also based on goodness of fit.

• Example 16.2
– In an effort to better predict the demand for courses offered by a certain MBA program, it was hypothesized that students’ academic background affects their choice of MBA major and thus their course selection.
– A random sample of last year’s MBA students was selected. The following contingency table summarizes the relevant data.

The observed values

Degree   Accounting   Finance   Marketing   Total
BA       31           13        16          60
BENG     8            16        7           31
BBA      12           10        17          39
Other    10           5         7           22
Total    61           44        47          152

There are two ways to view this problem:
– If each undergraduate degree is considered a population, do these populations differ?
– If each classification is considered a nominal variable, are these two variables dependent?

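• Solution
– The hypotheses are:

H0: The two variables are independent
H1: The two variables are dependent

– The test statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$

where k is the number of cells in the contingency table.

– The rejection region

$$\chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)}$$

where r and c are the numbers of rows and columns in the contingency table.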

Since ei = npi but pi is unknown, we need to estimate the unknown probability from the data, assuming H0 is true.

Under the null hypothesis the two variables are independent:

P(Accounting and BA) = P(Accounting)*P(BA) = [61/152][60/152].

Undergraduate   MBA Major
Degree          Accounting   Finance   Marketing   Total   Probability
BA                                                 60      60/152
BENG                                               31      31/152
BBA                                                39      39/152
Other                                              22      22/152
Total           61           44        47          152
Probability     61/152       44/152    47/152

The number of students expected to fall in the cell “Accounting - BA” is
eAcct-BA = n(pAcct-BA) = 152(61/152)(60/152) = [61*60]/152 = 24.08

The number of students expected to fall in the cell “Finance - BBA” is
eFinance-BBA = n(pFinance-BBA) = 152(44/152)(39/152) = [44*39]/152 = 11.29

Estimating the expected frequencies

• The expected frequency of the cell in row i and column j of the contingency table is calculated by:

$$e_{ij} = \frac{(\text{Column } j \text{ total})(\text{Row } i \text{ total})}{\text{Sample size}}$$
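The formula can be sketched in a few lines, using the observed counts of Example 16.2. The function name `expected_frequencies` and the list-of-rows layout are illustrative choices, not part of the original slides.

```python
def expected_frequencies(observed):
    """e_ij = (row i total) * (column j total) / sample size, for every cell."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in column_totals] for r in row_totals]

# Rows: BA, BENG, BBA, Other. Columns: Accounting, Finance, Marketing.
observed = [
    [31, 13, 16],
    [8, 16, 7],
    [12, 10, 17],
    [10, 5, 7],
]

expected = expected_frequencies(observed)
print(round(expected[0][0], 2))  # 24.08  (Accounting - BA)
print(round(expected[2][1], 2))  # 11.29  (Finance - BBA)
```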

Calculation of the χ² statistic

Observed frequencies, with the expected frequencies in parentheses:

Undergraduate   MBA Major
Degree          Accounting   Finance      Marketing    Total
BA              31 (24.08)   13 (17.37)   16 (18.55)   60
BENG            8 (12.44)    16 (8.97)    7 (9.59)     31
BBA             12 (15.65)   10 (11.29)   17 (12.06)   39
Other           10 (8.83)    5 (6.37)     7 (6.80)     22
Total           61           44           47           152

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(31-24.08)^2}{24.08} + \cdots + \frac{(5-6.37)^2}{6.37} + \cdots + \frac{(7-6.80)^2}{6.80} = 14.70$$
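The full sum over all twelve cells can be reproduced with a short script. This is a sketch under the same layout assumption as before (one list per undergraduate degree); the function name `contingency_chi_squared` is illustrative.

```python
def contingency_chi_squared(observed):
    """Chi-squared statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, f in enumerate(row):
            e = row_totals[i] * column_totals[j] / n  # expected frequency e_ij
            statistic += (f - e) ** 2 / e
    return statistic

observed = [
    [31, 13, 16],  # BA
    [8, 16, 7],    # BENG
    [12, 10, 17],  # BBA
    [10, 5, 7],    # Other
]

statistic = contingency_chi_squared(observed)
print(f"{statistic:.2f}")  # 14.70
```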

• Solution – continued
– The critical value in our example is:

$$\chi^2_{\alpha,\,(r-1)(c-1)} = \chi^2_{.05,\,(4-1)(3-1)} = 12.5916$$

• Conclusion: Since χ² = 14.70 > 12.5916, there is sufficient evidence to infer at the 5% significance level that students’ undergraduate degree and their choice of MBA major (and thus their course selection) are dependent.
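The degrees of freedom and the p-value can be checked with the standard library alone. When the degrees of freedom are even, the χ² tail probability has the closed form exp(−x/2) · Σ (x/2)^j / j! for j = 0, …, df/2 − 1; the sketch below uses that identity, so the helper `chi2_tail_even_df` (an illustrative name) does not apply to odd degrees of freedom.

```python
import math

def chi2_tail_even_df(x, df):
    """P(chi-squared with df degrees of freedom > x); valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2
    term = 1.0   # (x/2)^j / j! for j = 0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

rows, columns = 4, 3                        # degrees x majors
df = (rows - 1) * (columns - 1)             # 6
p_value = chi2_tail_even_df(14.70, df)

print(df)
print(round(p_value, 4))
print(round(chi2_tail_even_df(12.5916, df), 3))  # tail area at the critical value
```

The tail area at 12.5916 comes out at about .05, consistent with it being the α = .05 critical value, and the p-value is below .05, consistent with rejecting H0.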

Using the computer

• Define a code to specify each nominal value. Input the data in columns, one column for each category.

Code:
Undergraduate degree: 1 = BA, 2 = BENG, 3 = BBA, 4 = OTHERS
MBA Major: 1 = ACCOUNTING, 2 = FINANCE, 3 = MARKETING

Degree   MBA Major
3        1
1        1
1        1
1        1
2        2
1        3
.        .
.        .

• Select the Chi squared / raw data option from Data Analysis Plus under Tools. See Xm16-02.

Contingency Table
         1    2    3    Total
1        31   13   16   60
2        8    16   7    31
3        12   10   17   39
4        10   5    7    22
Total    61   44   47   152
Test Statistic CHI-Squared = 14.7019
P-Value = 0.0227
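Outside Excel, the same tabulation step (coded raw data to contingency table) can be sketched as follows. Only the six coded rows visible in this transcript are used, so the resulting table is a partial illustration and not the full 152-student table; the names `DEGREES`, `MAJORS` and `tabulate` are illustrative.

```python
from collections import Counter

DEGREES = {1: "BA", 2: "BENG", 3: "BBA", 4: "OTHERS"}
MAJORS = {1: "ACCOUNTING", 2: "FINANCE", 3: "MARKETING"}

def tabulate(pairs):
    """Count (degree code, major code) pairs into a degrees-by-majors table."""
    counts = Counter(pairs)
    return [[counts[(d, m)] for m in sorted(MAJORS)] for d in sorted(DEGREES)]

# The first six (degree, major) rows shown above; the rest of the file is elided.
raw = [(3, 1), (1, 1), (1, 1), (1, 1), (2, 2), (1, 3)]

table = tabulate(raw)
for code, row in zip(sorted(DEGREES), table):
    print(DEGREES[code], row)
```

With the complete data file, the same function would return the 4 × 3 table of observed frequencies used in the test.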