Analysis of Contingency Tables

CHAPTER 2: BINARY LOGIT ANALYSIS OF CONTINGENCY TABLES

Prof. Alan Wan

Upload: sylvia-cheung

Post on 14-Apr-2017

Table of contents

1. Introduction

2. Two-way classification and PROC GENMOD
   2.1. PROC GENMOD: frequency weight syntax
   2.2. PROC GENMOD: event/trial syntax

3. Three-way classification

4. Class exercises


Introduction

• Contingency table: a table containing two or more variables of classification, and the purpose is to determine if these variables are related;

• Here is an example:

                                  Annual changes in stock prices
                                  Up           Down         Total
January changes          Up       22 (16.1)    1 (6.9)      23
in stock prices          Down     6 (11.9)     11 (5.1)     17
                         Total    28           12           40


• A table containing information of this sort can be used to test whether, as some financial analysts suggest, January is a good predictor of whether stock prices will go up or down over the entire year; i.e., we can test
  H0: whether or not stock prices go up over the entire year is the same regardless of their behaviour in January, vs.
  H1: otherwise.

• Expected frequencies (under H0) are shown in parentheses in the table.


• The expected frequencies under H0 are calculated as follows:

$$16.1 = \frac{28}{40} \times 23; \quad 6.9 = \frac{12}{40} \times 23; \quad 11.9 = \frac{28}{40} \times 17; \quad 5.1 = \frac{12}{40} \times 17;$$

• Why? Take 16.1 as an example;

• Note that $\Pr(Up_Y \cap Up_J) = \Pr(Up_Y \mid Up_J)\Pr(Up_J)$;

• But under independence (H0), $\Pr(Up_Y \mid Up_J) = \Pr(Up_Y)$. Hence

$$\Pr(Up_Y \cap Up_J) = \Pr(Up_Y)\Pr(Up_J) = \frac{28}{40} \cdot \frac{23}{40} = \frac{16.1}{40}.$$
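The expected frequencies above can be cross-checked outside SAS. The following is a minimal Python sketch (Python is not used in the original slides; the variable names are illustrative) that applies the rule "row total times column total, divided by the grand total" to the observed counts.

```python
# Observed counts: rows = January change (Up, Down),
# columns = annual change (Up, Down).
observed = [[22, 1],
            [6, 11]]

row_totals = [sum(row) for row in observed]         # January Up, January Down
col_totals = [sum(col) for col in zip(*observed)]   # Year Up, Year Down
n = sum(row_totals)                                 # grand total

# Under independence (H0): E_ij = row_total_i * col_total_j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 1) for e in row])
```

The printed values should match the figures shown in parentheses in the table.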


• This test can be conducted using the usual Pearson's Chi-square statistic:

$$\text{Pearson's } \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{(r-1)(c-1)},$$

where r and c are the numbers of rows and columns in the table respectively;

• For this example,

$$\sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i} = \frac{(22-16.1)^2}{16.1} + \frac{(1-6.9)^2}{6.9} + \frac{(6-11.9)^2}{11.9} + \frac{(11-5.1)^2}{5.1} = 16.96;$$

• Now, $\chi^2_{1,0.05} = 3.84$. Hence we reject H0 and conclude that stock price movements during the whole year are not independent of their movements in January of the year.
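As a hedged cross-check of the arithmetic above, the short Python sketch below (not part of the original slides) recomputes the Pearson statistic from the observed and expected counts. For the p-value it uses the identity that, with 1 degree of freedom, P(χ² > x) = erfc(√(x/2)); this shortcut applies only to the df = 1 case.

```python
import math

observed = [22, 1, 6, 11]
expected = [16.1, 6.9, 11.9, 5.1]

# Pearson's chi-square: sum over cells of (O - E)^2 / E.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Upper-tail probability for a chi-square variate with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(round(chi_sq, 4))
print(chi_sq > 3.84)   # True means H0 is rejected at the 5% level
```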


data stock;
  input f yp jp;
  datalines;
22 1 1
6 1 0
1 0 1
11 0 0
;
proc freq data=stock;
  weight f;
  tables yp*jp / chisq cmh;
run;

Statistics for Table of yp by jp

Statistic                      DF    Value      Prob
Chi-Square                     1     16.9577    <.0001
Likelihood Ratio Chi-Square    1     18.5678    <.0001
Continuity Adj. Chi-Square     1     14.2053    0.0002
Mantel-Haenszel Chi-Square     1     16.5338    <.0001
Phi Coefficient                      0.6511
Contingency Coefficient              0.5456
Cramer's V                           0.6511


PROC GENMOD: frequency weight syntax

• Consider the penalty data of Chapter 1. Suppose individual data are unavailable and all we have is the following table:

           Blacks    Non-blacks    Total
Death      28        22            50
Life       45        52            97
Total      73        74            147

• PROC GENMOD can estimate the Logit model for regressing DEATH on BLACKD when the data are contained in a contingency table;

• One way to invoke PROC GENMOD is to use the FREQ statement, which treats each input observation as if it appeared as many times as the frequency specified, in effect converting the table into individual-level data.


DATA CONT1;
  INPUT F BLACKD DEATH;
  DATALINES;
22 0 1
28 1 1
52 0 0
45 1 0
;
PROC GENMOD DATA=CONT1 DESCENDING;
  FREQ F;
  MODEL DEATH=BLACKD / D=B;
RUN;


The GENMOD Procedure

Model Information
Data Set                     WORK.CONT1
Distribution                 Binomial
Link Function                Logit
Dependent Variable           DEATH
Frequency Weight Variable    F
Observations Used            4
Sum Of Frequency Weights     147

Response Profile
Ordered Value    DEATH    Total Frequency
1                1        50
2                0        97

PROC GENMOD is modeling the probability that DEATH='1'.


Criteria For Assessing Goodness Of Fit

Criterion             DF     Value       Value/DF
Deviance              145    187.2704    1.2915
Scaled Deviance       145    187.2704    1.2915
Pearson Chi-Square    145    147.0000    1.0138
Scaled Pearson X2     145    147.0000    1.0138
Log Likelihood               -93.6352

Algorithm converged.

Analysis Of Parameter Estimates

Parameter    DF    Estimate    Standard Error    Wald 95% Confidence Limits    Chi-Square    Pr > ChiSq
Intercept    1     -0.8602     0.2543            -1.3587    -0.3617            11.44         0.0007
BLACKD       1     0.3857      0.3502            -0.3006    1.0721             1.21          0.2706
Scale        0     1.0000      0.0000            1.0000     1.0000

NOTE: The scale parameter was held fixed.
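The parameter estimates above can be reproduced by hand, because a logit model with a single binary regressor fits the 2×2 table exactly: the intercept is the log-odds of death for non-blacks, and the BLACKD coefficient is the log odds ratio. The Python sketch below (not part of the original slides; names are illustrative) shows this, using the standard large-sample standard errors built from reciprocal cell counts.

```python
import math

# Cell counts keyed by BLACKD (0 = non-black, 1 = black).
death = {0: 22, 1: 28}
life = {0: 52, 1: 45}

# Intercept: log-odds of a death sentence when BLACKD = 0.
intercept = math.log(death[0] / life[0])

# BLACKD coefficient: log odds ratio, black vs. non-black.
slope = math.log(death[1] / life[1]) - intercept

# Large-sample standard errors from reciprocal cell counts.
se_intercept = math.sqrt(1 / death[0] + 1 / life[0])
se_slope = math.sqrt(1 / death[0] + 1 / life[0] + 1 / death[1] + 1 / life[1])

# Wald chi-square for H0: BLACKD coefficient = 0.
wald_slope = (slope / se_slope) ** 2

print(round(intercept, 4), round(se_intercept, 4))
print(round(slope, 4), round(se_slope, 4), round(wald_slope, 2))
```

The results should agree with the GENMOD output to the reported precision.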


• As far as PROC GENMOD is concerned, only 4 observations have been read in;

• The actual number of observations, namely 147, is the sum of the frequencies. The FREQ statement expands the 4 input observations into 147 individual observations for ML estimation.


• The Deviance statistic is an LR statistic for testing whether there is a significant difference between the "estimated" (restricted) and "saturated" (unrestricted) models;

• $$\text{Deviance} = 2[\ln L(\hat{\beta}_S) - \ln L(\hat{\beta}_E)] \sim \chi^2_m,$$

where m is the difference in the number of parameters between the saturated and the estimated models;

• The saturated model is a model in which the number of unknown parameters equals the number of observations;

• Hence for a model estimated from individual data, there are n observations for n unknowns, resulting in $L(\hat{\beta}_S) = 1$, $\ln L(\hat{\beta}_S) = 0$ and $\text{Deviance} = -2\ln L(\hat{\beta}_E)$.
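To illustrate the last point, the Python sketch below (not part of the original slides) evaluates the maximised log-likelihood of the DEATH on BLACKD model directly from the cell counts and then forms the Deviance as −2 ln L. It relies on the fact that, for this model, the fitted probabilities equal the observed proportions within each BLACKD group.

```python
import math

# (death sentences, group total) for BLACKD = 0 and BLACKD = 1.
groups = [(22, 74), (28, 73)]

log_lik = 0.0
for deaths, total in groups:
    p_hat = deaths / total   # fitted probability of a death sentence
    log_lik += deaths * math.log(p_hat) + (total - deaths) * math.log(1 - p_hat)

# With individual data the saturated log-likelihood is 0,
# so the Deviance reduces to -2 * lnL of the estimated model.
deviance = -2 * log_lik

print(round(log_lik, 4), round(deviance, 4))
```

The two numbers should agree with the Log Likelihood and Deviance rows of the frequency-weight output.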


PROC GENMOD: event/trial syntax

• Instead of entering all 4 internal cell counts, we enter the cell frequencies for death sentences ("events") along with the column totals ("trials"):

DATA CONT1;
  INPUT DEATH TOTAL BLACKD;
  DATALINES;
22 74 0
28 73 1
;
PROC GENMOD DATA=CONT1;
  MODEL DEATH/TOTAL=BLACKD / D=B;
RUN;


The GENMOD Procedure

Model Information
Data Set                      WORK.CONT1
Distribution                  Binomial
Link Function                 Logit
Response Variable (Events)    DEATH
Response Variable (Trials)    TOTAL
Observations Used             2
Number Of Events              50
Number Of Trials              147

Criteria For Assessing Goodness Of Fit

Criterion             DF    Value     Value/DF
Deviance              0     0.0000    .
Scaled Deviance       0     0.0000    .
Pearson Chi-Square    0     0.0000    .
Scaled Pearson X2     0     0.0000    .
Log Likelihood              -93.6352

Algorithm converged.

Analysis Of Parameter Estimates

Parameter    DF    Estimate    Standard Error    Wald 95% Confidence Limits    Chi-Square    Pr > ChiSq
Intercept    1     -0.8602     0.2543            -1.3587    -0.3617            11.44         0.0007
BLACKD       1     0.3857      0.3502            -0.3006    1.0721             1.21          0.2706
Scale        0     1.0000      0.0000            1.0000     1.0000

NOTE: The scale parameter was held fixed.


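• With frequency weighting syntax:

$$L = \prod_{i=1}^{147} \left[\frac{1}{1 + e^{-(\beta_1 + \beta_2 \text{BLACKD}_i)}}\right]^{\text{DEATH}_i} \times \left[1 - \frac{1}{1 + e^{-(\beta_1 + \beta_2 \text{BLACKD}_i)}}\right]^{1 - \text{DEATH}_i}$$

• With event/trial syntax:

$$L = \left\{\left[\frac{1}{1 + e^{-(\beta_1 + \beta_2(\text{BLACKD}=0))}}\right]^{22} \left[1 - \frac{1}{1 + e^{-(\beta_1 + \beta_2(\text{BLACKD}=0))}}\right]^{52}\right\} \times \left\{\left[\frac{1}{1 + e^{-(\beta_1 + \beta_2(\text{BLACKD}=1))}}\right]^{28} \left[1 - \frac{1}{1 + e^{-(\beta_1 + \beta_2(\text{BLACKD}=1))}}\right]^{45}\right\}$$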
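The equivalence of the two likelihoods can be checked numerically. The Python sketch below (not part of the original slides; function names are illustrative) evaluates the log of each expression at the reported estimates, once over 147 individual records and once over the 2 grouped observations.

```python
import math

def prob(beta1, beta2, blackd):
    """Logit probability of DEATH = 1 for a given BLACKD value."""
    return 1 / (1 + math.exp(-(beta1 + beta2 * blackd)))

def loglik_individual(beta1, beta2, records):
    """Log-likelihood over individual (BLACKD, DEATH) records."""
    total = 0.0
    for blackd, death in records:
        p = prob(beta1, beta2, blackd)
        total += death * math.log(p) + (1 - death) * math.log(1 - p)
    return total

def loglik_grouped(beta1, beta2, groups):
    """Log-likelihood over grouped (BLACKD, events, trials) observations."""
    total = 0.0
    for blackd, events, trials in groups:
        p = prob(beta1, beta2, blackd)
        total += events * math.log(p) + (trials - events) * math.log(1 - p)
    return total

# Expand the 2x2 table into 147 individual records.
records = [(0, 1)] * 22 + [(0, 0)] * 52 + [(1, 1)] * 28 + [(1, 0)] * 45
groups = [(0, 22, 74), (1, 28, 73)]

beta1, beta2 = -0.8602, 0.3857   # estimates reported by PROC GENMOD
ll_individual = loglik_individual(beta1, beta2, records)
ll_grouped = loglik_grouped(beta1, beta2, groups)

print(round(ll_individual, 4), round(ll_grouped, 4))
```

Both values should coincide, and both should be close to the log-likelihood of −93.6352 reported under either syntax.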


• The two likelihood functions are of course algebraically identical, but PROC GENMOD treats the first likelihood as being based on 147 Bernoulli(p) observations, and the second as being based on 2 observations, each a product of Bernoulli(p) densities sharing a common value of BLACKD (BLACKD=0 for the first observation and BLACKD=1 for the second);

• Under the event/trial syntax, there are 2 observations for estimating 2 parameters. Hence the estimated model is the saturated model, resulting in a Deviance statistic of 0;

• The Deviance statistic is therefore uninformative for a two-way cross classification.


Three-way classification

• Consider the cross classification of race, gender and possession of a driver's license for a sample of 17- and 18-year-olds:

                      Driver's license
Race      Gender      Yes     No
White     Male        43      134
          Female      26      149
Black     Male        29      23
          Female      22      36

• Let YES represent the "event" of interest, and TOTAL=YES+NO represent the "trial".


DATA DRIVER;
  INPUT WHITE MALE YES NO;
  TOTAL = YES+NO;
  DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
PROC GENMOD DATA=DRIVER;
  MODEL YES/TOTAL=WHITE MALE / D=B;
RUN;


Model Information
Data Set                      WORK.DRIVER
Distribution                  Binomial
Link Function                 Logit
Response Variable (Events)    YES
Response Variable (Trials)    TOTAL
Observations Used             4
Number Of Events              120
Number Of Trials              462

Criteria For Assessing Goodness Of Fit

Criterion             DF    Value     Value/DF
Deviance              1     0.0583    0.0583
Scaled Deviance       1     0.0583    0.0583
Pearson Chi-Square    1     0.0583    0.0583
Scaled Pearson X2     1     0.0583    0.0583
Log Likelihood              -245.8974

Algorithm converged.

Analysis Of Parameter Estimates

Parameter    DF    Estimate    Standard Error    Wald 95% Confidence Limits    Chi-Square    Pr > ChiSq
Intercept    1     -0.4555     0.2221            -0.8909    -0.0201            4.20          0.0403
WHITE        1     -1.3135     0.2378            -1.7795    -0.8474            30.51         <.0001
MALE         1     0.6478      0.2250            0.2068     1.0889             8.29          0.0040
Scale        0     1.0000      0.0000            1.0000     1.0000

NOTE: The scale parameter was held fixed.
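For readers who want to see how such estimates arise, here is a minimal Python sketch (not part of the original slides) that maximises the grouped binomial log-likelihood by Newton-Raphson. This is only one possible algorithm and is an assumption here; the slides do not describe the internals of PROC GENMOD. The helper names `solve`, `fit_logit` and `log_lik` are hypothetical.

```python
import math

# Rows: (WHITE, MALE, YES, NO), taken from the table in the slides.
data = [(1, 1, 43, 134), (1, 0, 26, 149), (0, 1, 29, 23), (0, 0, 22, 36)]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_logit(rows, iterations=25):
    """Newton-Raphson for p = 1 / (1 + exp(-(b1 + b2*WHITE + b3*MALE)))."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        score = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for white, male, yes, no in rows:
            x = (1.0, white, male)
            total = yes + no
            z = sum(b * v for b, v in zip(beta, x))
            p = 1 / (1 + math.exp(-z))
            for j in range(3):
                score[j] += x[j] * (yes - total * p)
                for k in range(3):
                    info[j][k] += x[j] * x[k] * total * p * (1 - p)
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def log_lik(beta, rows):
    """Grouped binomial log-likelihood (binomial coefficients omitted)."""
    ll = 0.0
    for white, male, yes, no in rows:
        z = beta[0] + beta[1] * white + beta[2] * male
        p = 1 / (1 + math.exp(-z))
        ll += yes * math.log(p) + no * math.log(1 - p)
    return ll

beta = fit_logit(data)
ll_fit = log_lik(beta, data)
print([round(b, 4) for b in beta], round(ll_fit, 4))
```

The printed coefficients and log-likelihood should agree with the Intercept, WHITE, MALE and Log Likelihood entries in the output above.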


• There are 4 observations, 120 events (YES) and 462 trials (TOTAL);

• Both the race and gender coefficients are significantly different from zero;

• The present estimated model is not the saturated model, as there are 4 observations for 3 parameters. The estimated and saturated models differ by one parameter. Hence the Deviance statistic has df=1.


• How to construct the saturated model with the available data?

• The present model can be expanded to yield the saturated model by introducing the interaction term WHITE×MALE;

• The Deviance test is essentially a test of the significance of the interaction term.


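• Estimated model:

$$p_i = \frac{1}{1 + e^{-(\beta_1 + \beta_2 \text{WHITE}_i + \beta_3 \text{MALE}_i)}}$$

• Saturated model:

$$p_i = \frac{1}{1 + e^{-(\beta_1 + \beta_2 \text{WHITE}_i + \beta_3 \text{MALE}_i + \beta_4 \text{WHITE}_i \times \text{MALE}_i)}}$$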


• Testing the significance of the difference between the estimated and saturated models is the same as testing β4 = 0;

• The p-value corresponding to the Deviance statistic of 0.0583 can be computed using the following SAS commands:

data;
  chi = 1 - probchi(0.0583, 1);
  put chi;
run;

• This results in a p-value of 0.8092. Hence the interaction term between MALE and WHITE is not significantly different from zero.
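The same p-value can be obtained without SAS. The Python sketch below (not part of the original slides) uses the identity that, for 1 degree of freedom, the upper-tail chi-square probability equals erfc(√(x/2)); it is a cross-check for this df = 1 case only, not a general replacement for `probchi`.

```python
import math

deviance = 0.0583   # Deviance of the model without the interaction term

# Upper-tail probability of a chi-square variate with 1 degree of freedom.
p_value = math.erfc(math.sqrt(deviance / 2))

print(round(p_value, 4))
```

The result should be close to the 0.8092 reported in the slides.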


• To see this more clearly, let us fit the model explicitly with the interaction term:

DATA DRIVER;
  INPUT WHITE MALE YES NO;
  TOTAL = YES+NO;
  DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
PROC GENMOD DATA=DRIVER;
  MODEL YES/TOTAL=WHITE MALE WHITE*MALE / D=B;
RUN;


The GENMOD Procedure

Model Information
Data Set                      WORK.DRIVER
Distribution                  Binomial
Link Function                 Logit
Response Variable (Events)    YES
Response Variable (Trials)    TOTAL
Observations Used             4
Number Of Events              120
Number Of Trials              462

Criteria For Assessing Goodness Of Fit

Criterion             DF    Value     Value/DF
Deviance              0     0.0000    .
Scaled Deviance       0     0.0000    .
Pearson Chi-Square    0     0.0000    .
Scaled Pearson X2     0     0.0000    .
Log Likelihood              -245.8682

Algorithm converged.

Analysis Of Parameter Estimates

Parameter     DF    Estimate    Standard Error    Wald 95% Confidence Limits    Chi-Square    Pr > ChiSq
Intercept     1     -0.4925     0.2706            -1.0229    0.0379             3.31          0.0688
WHITE         1     -1.2534     0.3441            -1.9278    -0.5789            13.27         0.0003
MALE          1     0.7243      0.3888            -0.0378    1.4864             3.47          0.0625
WHITE*MALE    1     -0.1151     0.4765            -1.0491    0.8189             0.06          0.8092
Scale         0     1.0000      0.0000            1.0000     1.0000

NOTE: The scale parameter was held fixed.
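Because the model with the interaction term is saturated, its fitted log-odds equal the observed log-odds in each of the four cells, so the coefficients can be recovered as simple contrasts. The Python sketch below (not part of the original slides; names are illustrative) does this and also computes the large-sample standard error of the interaction coefficient from reciprocal cell counts.

```python
import math

# (YES, NO) counts keyed by (WHITE, MALE).
cells = {
    (1, 1): (43, 134),
    (1, 0): (26, 149),
    (0, 1): (29, 23),
    (0, 0): (22, 36),
}

# Observed log-odds of holding a license in each cell.
logit = {key: math.log(yes / no) for key, (yes, no) in cells.items()}

b1 = logit[(0, 0)]                                   # Intercept
b2 = logit[(1, 0)] - logit[(0, 0)]                   # WHITE
b3 = logit[(0, 1)] - logit[(0, 0)]                   # MALE
b4 = (logit[(1, 1)] - logit[(1, 0)]
      - logit[(0, 1)] + logit[(0, 0)])               # WHITE*MALE

# Standard error of the interaction: root of the sum of all reciprocal counts.
se_b4 = math.sqrt(sum(1 / yes + 1 / no for yes, no in cells.values()))

print(round(b1, 4), round(b2, 4), round(b3, 4), round(b4, 4), round(se_b4, 4))
```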


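• Now, to test H0: β4 = 0 vs. H1: otherwise, we apply the LR test:

$$\text{Deviance} = 2[-245.8682 - (-245.8974)] = 0.0584$$

• Also, the log of the odds is given by

$$Z_i = \beta_1 + \beta_2 \text{WHITE}_i + \beta_3 \text{MALE}_i + \beta_4 \text{WHITE}_i \times \text{MALE}_i.$$

So,

$$\frac{\partial Z_i}{\partial \text{WHITE}_i} = \beta_2 + \beta_4 \text{MALE}_i,$$

and

$$\frac{\partial Z_i}{\partial \text{MALE}_i} = \beta_3 + \beta_4 \text{WHITE}_i.$$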
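The LR statistic in the first bullet can be cross-checked with a short Python sketch (not part of the original slides). It computes the saturated log-likelihood directly from the cell counts, where fitted probabilities equal observed proportions, and takes the restricted log-likelihood of −245.8974 as given from the earlier GENMOD output.

```python
import math

# (YES, NO) counts for the four race-by-gender cells.
cells = [(43, 134), (26, 149), (29, 23), (22, 36)]

# Saturated model: fitted probability = observed proportion in each cell.
ll_saturated = sum(
    yes * math.log(yes / (yes + no)) + no * math.log(no / (yes + no))
    for yes, no in cells
)

ll_restricted = -245.8974   # reported for the model without WHITE*MALE

lr_statistic = 2 * (ll_saturated - ll_restricted)

print(round(ll_saturated, 4), round(lr_statistic, 4))
```

The small gap between 0.0584 here and the 0.0583 printed by GENMOD comes from rounding the log-likelihoods to four decimals.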


Hence the odds ratio estimates of WHITE and MALE are

$$e^{-1.2534 - 0.1151\,\text{MALE}_i} \quad \text{and} \quad e^{0.7243 - 0.1151\,\text{WHITE}_i}$$

respectively, with the following interpretations:

• The odds of having a license for white females are $e^{-1.2534} = 0.286$ times the odds for black females;

• The odds of having a license for white males are $e^{-1.2534 - 0.1151} = 0.2545$ times the odds for black males;

• The odds of having a license for black males are $e^{0.7243} = 2.063$ times the odds for black females;

• The odds of having a license for white males are $e^{0.7243 - 0.1151} = 1.839$ times the odds for white females.
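Since the model with the interaction term is saturated, these four odds ratios can also be read straight off the table of counts. The Python sketch below (not part of the original slides; names are illustrative) computes them as ratios of observed odds.

```python
# (YES, NO) counts keyed by (race, gender).
counts = {
    ("white", "male"): (43, 134),
    ("white", "female"): (26, 149),
    ("black", "male"): (29, 23),
    ("black", "female"): (22, 36),
}

def odds(race, gender):
    """Observed odds of holding a license in one cell."""
    yes, no = counts[(race, gender)]
    return yes / no

or_white_among_females = odds("white", "female") / odds("black", "female")
or_white_among_males = odds("white", "male") / odds("black", "male")
or_male_among_blacks = odds("black", "male") / odds("black", "female")
or_male_among_whites = odds("white", "male") / odds("white", "female")

for value in (or_white_among_females, or_white_among_males,
              or_male_among_blacks, or_male_among_whites):
    print(round(value, 4))
```

Up to rounding of the coefficients, these agree with the exponentiated estimates listed above.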


• The interaction term also affects the marginal effect on $p_i$. For example,

$$\frac{\partial p_i}{\partial \text{WHITE}_i} = f(Z_i)(\beta_2 + \beta_4 \text{MALE}_i),$$

where $f(Z_i) = p_i(1 - p_i)$ is the logistic density. In other words, the marginal change of $p_i$ with respect to a change of race from black to white depends on the gender of the person;

• Pearson's Chi-square goodness of fit test: see Tutorial 2
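To make the marginal-effect formula for WHITE concrete, the Python sketch below (not part of the original slides) evaluates f(Z)(β2 + β4 MALE) at the saturated-model estimates, taking f(Z) = p(1 − p) for the logit model. The choice to evaluate at WHITE = 0 is an assumption made for illustration, and because WHITE is a dummy variable the derivative is only an approximation to the discrete change in p.

```python
import math

# Estimates from the model with the WHITE*MALE interaction.
b1, b2, b3, b4 = -0.4925, -1.2534, 0.7243, -0.1151

def marginal_effect_white(white, male):
    """f(Z) * (b2 + b4 * MALE), with f(Z) = p * (1 - p) for the logit model."""
    z = b1 + b2 * white + b3 * male + b4 * white * male
    p = 1 / (1 + math.exp(-z))
    return p * (1 - p) * (b2 + b4 * male)

effect_for_females = marginal_effect_white(white=0, male=0)
effect_for_males = marginal_effect_white(white=0, male=1)

print(round(effect_for_females, 4), round(effect_for_males, 4))
```

The two values differ, which illustrates that the effect of race on the probability of holding a license depends on gender.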


Class exercises

1. Tutorial 2

2. 2004 Final Exam, Question 1

3. 2007 Final Exam, Question 1
