
Chi Squared Tests

Introduction

• Two statistical techniques are presented. Both are used to analyze nominal data.
  – A goodness-of-fit test for a multinomial experiment.
  – A contingency table test of independence.

• The test statistics in both cases follow the $\chi^2$ distribution.

• The hypothesis tested involves the “success” probabilities $p_1, p_2, \ldots, p_k$ of a multinomial distribution.

• The multinomial experiment is an extension of the binomial experiment.
  – There are n independent trials.
  – The outcome of each trial can be classified into one of k categories, called cells.
  – The probability $p_i$ for an outcome to fall into cell i remains constant for each trial. By assumption, $p_1 + p_2 + \cdots + p_k = 1$.
  – Trials in the experiment are independent.
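The properties listed above can be illustrated with a small simulation. This is only a sketch: the function name `run_multinomial_experiment`, the seed, and the probabilities passed in are illustrative assumptions, not part of the slides.

```python
import random
from collections import Counter

def run_multinomial_experiment(n, probs, seed=None):
    """Simulate n independent trials; each lands in cell i with probability probs[i-1]."""
    rng = random.Random(seed)
    cells = range(1, len(probs) + 1)          # cells are numbered 1..k
    outcomes = rng.choices(cells, weights=probs, k=n)
    counts = Counter(outcomes)
    return [counts[c] for c in cells]         # observed frequency per cell

# Hypothetical example: n = 200 trials, k = 3 cells, probabilities summing to 1.
counts = run_multinomial_experiment(200, [0.45, 0.40, 0.15], seed=1)
print(counts, sum(counts))
```

Every trial uses the same probabilities and is drawn independently, so the cell counts always add up to n.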

Chi-Squared Goodness-of-Fit Test

• Our objective is to find out whether there is sufficient evidence to reject a pre-specified set of values for $p_i$.

• The hypotheses:
  H0: $p_1 = a_1, p_2 = a_2, \ldots, p_k = a_k$
  H1: At least one $p_i \neq a_i$

• The test builds on comparing actual frequency and the expected frequency of occurrences in all cells.

An Example

• Example 16.1
  – Two competing companies, A and B, have been dominant players in the market. Both companies conducted recent advertising campaigns on their products.
  – Market shares before the campaigns were:
    • Company A = 45%
    • Company B = 40%
    • Other competitors = 15%.

• Example 16.1 – continued
  – To study the effect of the campaigns on the market shares, a survey was conducted.
  – 200 customers were asked to indicate their preference regarding the products advertised.
  – Survey results:
    • 102 customers preferred company A’s product,
    • 82 customers preferred company B’s product,
    • 16 customers preferred a competitor’s product.

• Example 16.1 – continued

Can we conclude at 5% significance level that the market shares were affected by the advertising campaigns?

• Solution
  – The population investigated is the brand preferences.
  – The data are nominal (A, B, or other).
  – This is a multinomial experiment (three categories).
  – The question of interest: Are $p_1$, $p_2$, and $p_3$ different after the campaign from their values prior to the campaigns?


• The hypotheses are:
  H0: $p_1 = .45, p_2 = .40, p_3 = .15$
  H1: At least one $p_i$ changed.

The expected frequency for each category (cell) if the null hypothesis is true is shown below, next to the actual frequency the sample returned:

Category     Expected frequency    Actual frequency
Company A    90 = 200(.45)         102
Company B    80 = 200(.40)         82
Others       30 = 200(.15)         16

• The statistic is:

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}, \quad \text{where } e_i = np_i$$

Intuitively, this measures the extent of differences between the observed and the expected frequencies.

• The rejection region is:

$$\chi^2 > \chi^2_{\alpha, k-1}$$


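$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(102 - 90)^2}{90} + \frac{(82 - 80)^2}{80} + \frac{(16 - 30)^2}{30} = 8.18$$

$$\chi^2_{\alpha, k-1} = \chi^2_{.05, 3-1} = 5.99147$$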

The p-value = $P(\chi^2 > 8.18) \approx .0167$

[this comes from Excel: =CHIDIST(8.18,2)]


[Figure: $\chi^2$ density with 2 degrees of freedom. The rejection region (area alpha) lies to the right of 5.99; the p-value is the area to the right of 8.18.]

Conclusion: Since 8.18 > 5.99, there is sufficient evidence at the 5% significance level to reject the null hypothesis. At least one of the probabilities $p_i$ is different. Thus, at least two market shares have changed.
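The arithmetic of Example 16.1 can be checked with a short script. This is a minimal sketch using only the standard library; the function name `chi2_gof` is an illustrative assumption. It relies on the fact that, with exactly 2 degrees of freedom, the upper tail of the chi-squared distribution is exp(-x/2), so it does not generalize to other degrees of freedom as written.

```python
import math

def chi2_gof(observed, probs):
    """Return (statistic, expected frequencies) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in probs]            # e_i = n * p_i under H0
    stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return stat, expected

observed = [102, 82, 16]                          # survey results for A, B, others
null_probs = [0.45, 0.40, 0.15]                   # market shares before the campaigns
stat, expected = chi2_gof(observed, null_probs)

# Special case df = 2 only: P(X > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)
critical = -2 * math.log(0.05)                    # upper 5% point for df = 2

print(f"chi-squared = {stat:.2f}, critical value = {critical:.2f}")
print("reject H0" if stat > critical else "do not reject H0")
```

Because the statistic exceeds the critical value, the script reaches the same decision as the conclusion above.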

Required Conditions – The Rule of Five

• The test statistic used to perform the test is only approximately Chi-squared distributed.

• For the approximation to apply, the expected cell frequency has to be at least 5 for all cells ($np_i \geq 5$).

• If the expected frequency in a cell is less than 5, combine it with other cells.
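One way to apply the rule of five in code is sketched below. The slides do not say which cells to combine, so the policy used here (repeatedly merging the two cells with the smallest probabilities) is an assumption, as are the function name `combine_small_cells` and the sample data. In practice the cells to merge are usually chosen so that the combined category still makes sense.

```python
def combine_small_cells(observed, probs, min_expected=5.0):
    """Merge cells until every expected frequency n * p_i is at least min_expected."""
    n = sum(observed)
    # Sort by probability so the cell with the smallest expected count comes first.
    cells = sorted(zip(probs, observed))
    while len(cells) > 1 and n * cells[0][0] < min_expected:
        (p1, f1), (p2, f2) = cells[0], cells[1]
        # Assumed policy: fold the smallest cell into the next smallest one.
        cells = sorted([(p1 + p2, f1 + f2)] + cells[2:])
    merged_observed = [f for _, f in cells]
    merged_probs = [p for p, _ in cells]
    return merged_observed, merged_probs

# Hypothetical data: n = 60, expected counts 30, 18, 7.2, 3 and 1.8.
obs, probs = combine_small_cells([31, 17, 8, 3, 1], [0.50, 0.30, 0.12, 0.05, 0.03])
print(obs, [round(p, 2) for p in probs])
```

Note that merging reduces the number of cells k, so the degrees of freedom k - 1 used for the critical value shrink accordingly.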

Chi-squared Test of a Contingency Table

• This test is used to test whether
  – two nominal variables are related;
  – there are differences between two or more populations of a nominal variable.

• To accomplish the test objectives, we need to classify the data according to two different criteria.

• The idea is also based on goodness of fit.

• Example 16.2
  – In an effort to better predict the demand for courses offered by a certain MBA program, it was hypothesized that students’ academic background affects their choice of MBA major and, thus, their course selection.
  – A random sample of last year’s MBA students was selected. The following contingency table summarizes the relevant data.

The observed values

Undergraduate    MBA Major
Degree           Accounting    Finance    Marketing    Total
BA               31            13         16           60
BENG             8             16         7            31
BBA              12            10         17           39
Other            10            5          7            22
Total            61            44         47           152

There are two ways to view this problem:
  – If each undergraduate degree is considered a population, do these populations differ?
  – If each classification is considered a nominal variable, are these two variables dependent?

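• Solution
  – The hypotheses are:
    H0: The two variables are independent
    H1: The two variables are dependent
  – The test statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$

    where k is the number of cells in the contingency table.
  – The rejection region:

$$\chi^2 > \chi^2_{\alpha, (r-1)(c-1)}$$

    where r and c are the numbers of rows and columns in the contingency table.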

Since $e_i = np_i$ but $p_i$ is unknown, we need to estimate the unknown probability from the data, assuming H0 is true.

Under the null hypothesis the two variables are independent:

P(Accounting and BA) = P(Accounting) × P(BA) = [61/152][60/152].

Undergraduate    MBA Major
Degree           Accounting    Finance    Marketing    Total    Probability
BA                                                     60       60/152
BENG                                                   31       31/152
BBA                                                    39       39/152
Other                                                  22       22/152
Total            61            44         47           152
Probability      61/152        44/152     47/152

The number of students expected to fall in the cell “Accounting - BA” is
$e_{\text{Acct-BA}} = n\,p_{\text{Acct-BA}} = 152(61/152)(60/152) = [61 \times 60]/152 = 24.08$

The number of students expected to fall in the cell “Finance - BBA” is
$e_{\text{Finance-BBA}} = n\,p_{\text{Finance-BBA}} = 152(44/152)(39/152) = [44 \times 39]/152 = 11.29$

Estimating the expected frequencies

• The expected frequency of the cell in row i and column j of the contingency table is calculated by:

$$e_{ij} = \frac{(\text{Row } i \text{ total})(\text{Column } j \text{ total})}{\text{Sample size}}$$
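The row-total times column-total rule can be applied to the whole table at once. The sketch below uses the observed counts of Example 16.2; the variable names are illustrative choices.

```python
# Observed counts from Example 16.2 (columns: Accounting, Finance, Marketing).
observed = [
    [31, 13, 16],  # BA
    [8, 16, 7],    # BENG
    [12, 10, 17],  # BBA
    [10, 5, 7],    # Other
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# e_ij = (row i total)(column j total) / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 2) for e in row])
```

The first entry reproduces the 24.08 obtained for the “Accounting - BA” cell, and the “Finance - BBA” entry reproduces 11.29.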

Calculation of the $\chi^2$ statistic

Observed frequencies, with the expected frequencies in parentheses:

Undergraduate    MBA Major
Degree           Accounting    Finance       Marketing     Total
BA               31 (24.08)    13 (17.37)    16 (18.55)    60
BENG             8 (12.44)     16 (8.97)     7 (9.59)      31
BBA              12 (15.65)    10 (11.29)    17 (12.06)    39
Other            10 (8.83)     5 (6.37)      7 (6.80)      22
Total            61            44            47            152

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(31 - 24.08)^2}{24.08} + \cdots + \frac{(5 - 6.37)^2}{6.37} + \cdots + \frac{(7 - 6.80)^2}{6.80} = 14.70$$

• Solution – continued
  – The critical value in our example is:

$$\chi^2_{\alpha, (r-1)(c-1)} = \chi^2_{.05, (4-1)(3-1)} = 12.5916$$

• Conclusion: Since $\chi^2 = 14.70 > 12.5916$, there is sufficient evidence to infer at the 5% significance level that students’ undergraduate degree and MBA students’ course selection are dependent.
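The whole test of Example 16.2 can be reproduced with the standard library alone. This is a sketch under one stated assumption: the helper `chi2_sf_even_df` (a hypothetical name) uses a closed form for the chi-squared upper tail that is valid only when the degrees of freedom are even, which happens to hold here since (4-1)(3-1) = 6.

```python
import math

def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-squared variable; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    for j in range(1, df // 2):        # accumulates sum of half**j / j!
        term *= half / j
        total += term
    return math.exp(-half) * total

observed = [[31, 13, 16], [8, 16, 7], [12, 10, 17], [10, 5, 7]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (f - e)^2 / e over all cells, with e = row total * column total / n.
stat = sum(
    (f - r * c / n) ** 2 / (r * c / n)
    for row, r in zip(observed, row_totals)
    for f, c in zip(row, col_totals)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
p_value = chi2_sf_even_df(stat, df)

print(f"chi-squared = {stat:.4f}, df = {df}, p-value = {p_value:.4f}")
```

The statistic is computed from unrounded expected frequencies, so it can differ slightly from a hand calculation that uses the two-decimal values in the table.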

Using the computer

Define a code to specify each nominal value. Input the data in columns, one column for each category.

Code:
Undergraduate degree: 1 = BA, 2 = BENG, 3 = BBA, 4 = OTHERS
MBA Major: 1 = ACCOUNTING, 2 = FINANCE, 3 = MARKETING

Degree    MBA Major
3         1
1         1
1         1
1         1
2         2
1         3
.         .
.         .

Select the Chi squared / raw data Option from Data Analysis Plus under tools. See Xm16-02

Contingency Table
         1     2     3     Total
1        31    13    16    60
2        8     16    7     31
3        12    10    17    39
4        10    5     7     22
Total    61    44    47    152

Test Statistic CHI-Squared = 14.7019
P-Value = 0.0227
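Outside Excel, the same tabulation step (coded raw data to contingency table) can be sketched as follows. Only the six coded rows visible in the slide are used, so the resulting counts are not those of the full Xm16-02 file; the dictionary names are illustrative assumptions.

```python
from collections import Counter

# Coding scheme from the slide.
DEGREE = {1: "BA", 2: "BENG", 3: "BBA", 4: "OTHERS"}
MAJOR = {1: "ACCOUNTING", 2: "FINANCE", 3: "MARKETING"}

# (degree code, major code) pairs: only the rows shown in the slide, not the full data set.
raw = [(3, 1), (1, 1), (1, 1), (1, 1), (2, 2), (1, 3)]

counts = Counter(raw)
# One row per degree code, one column per major code; missing pairs count as 0.
table = [[counts[(d, m)] for m in MAJOR] for d in DEGREE]

for d, row in zip(DEGREE, table):
    print(DEGREE[d].ljust(8), row)
```

With the complete file, the same loop would produce the 4 × 3 table shown above.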
