Topic 2: LOGIT analysis of contingency tables

Upload: britton-holmes

Post on 04-Jan-2016



Contingency table

A cross-classification table containing two or more variables of classification. The purpose is to determine whether these variables are related.

                              Change in stock prices in year
Change in stock prices
in January                    UP           DOWN         TOTAL

UP                            22 (16.1)     1 (6.9)     23
DOWN                           6 (11.9)    11 (5.1)     17
TOTAL                         28           12           40


A table of this sort can be used to test whether, as some financial analysts suggest, January is a good predictor of whether stock prices will go up or down over the entire year.

H0 : whether or not stock prices go up over the entire year is the same regardless of their behaviour in January

H1 : otherwise

Expected frequencies are shown in parentheses in the table.
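The expected frequencies under H0 follow from the table margins (row total times column total, divided by the grand total). A minimal Python sketch, assuming the rows are January (UP, DOWN) and the columns are the year (UP, DOWN); the function and variable names are illustrative and not part of the slides:

```python
# Observed counts: rows = January (UP, DOWN), columns = year (UP, DOWN).
observed = [[22, 1], [6, 11]]


def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 1) for e in row])
```

The rounded values should agree with the figures shown in parentheses in the table.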


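Pearson’s Chi-square statistic

$$P = \sum_{i=1}^{n} \frac{(f_i - e_i)^2}{e_i} \sim \chi^2_{(r-1)(c-1)}$$

where f_i and e_i are the observed and expected frequencies in cell i, n is the number of cells, and r and c are respectively the numbers of rows and columns in the table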


In our example,

$$\sum_{i=1}^{4} \frac{(f_i - e_i)^2}{e_i} = \frac{(22-16.1)^2}{16.1} + \frac{(1-6.9)^2}{6.9} + \frac{(6-11.9)^2}{11.9} + \frac{(11-5.1)^2}{5.1} = 16.96$$

Now $\chi^2_{(1,\,0.05)} = 3.84 < 16.96$, so we reject the null. In other words, based on this evidence the probability that stock prices will go up during the whole year does not seem to be independent of whether or not they go up in January.
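The arithmetic above can be checked with a short Python sketch. The names are illustrative; the critical value 3.84 is taken from the slide, and the p-value line relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so it applies to the 1-df case only:

```python
import math

# Cell order: (Jan UP, year UP), (Jan UP, year DOWN), (Jan DOWN, year UP), (Jan DOWN, year DOWN)
observed = [22, 1, 6, 11]
expected = [23 * 28 / 40, 23 * 12 / 40, 17 * 28 / 40, 17 * 12 / 40]


def pearson_chi_square(obs, exp):
    """Sum of (f_i - e_i)^2 / e_i over all cells."""
    return sum((f - e) ** 2 / e for f, e in zip(obs, exp))


statistic = pearson_chi_square(observed, expected)
critical_value = 3.84  # chi-square(1) at the 5% level, as quoted on the slide

# Upper-tail probability for 1 degree of freedom only.
p_value = math.erfc(math.sqrt(statistic / 2.0))

print(round(statistic, 2), statistic > critical_value)
```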


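/* F = cell count, YP = 1 if prices rose over the year, JP = 1 if prices rose in January */
DATA STOCK;
INPUT F YP JP;
DATALINES;
22 1 1
6 1 0
1 0 1
11 0 0
;
PROC FREQ DATA=STOCK;
WEIGHT F;
TABLES YP*JP/CHISQ CMH;
RUN;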


Two Way Table

Consider the following SAS program and OUTPUT:

DATA PENALTY;

INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';

INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;

PROC GENMOD DATA=PENALTY DESCENDING;

MODEL DEATH=BLACKD/D=B;

RUN;


But suppose we don’t have individual-level data. All we have is the following table:

           Blacks   Nonblacks   Total
Death        28        22         50
Life         45        52         97
Total        73        74        147
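With a single 0/1 regressor the logit model is saturated, so its maximum likelihood estimates can be read straight off the table: the intercept is the log odds of a death sentence for nonblack defendants, and the slope is the log odds ratio. A minimal Python sketch under that assumption (names are illustrative; the slides do not reproduce the GENMOD estimates, so no comparison is claimed here):

```python
import math

# Counts from the 2x2 table above.
death = {"black": 28, "nonblack": 22}
life = {"black": 45, "nonblack": 52}


def log_odds(successes, failures):
    """Log of the odds successes : failures."""
    return math.log(successes / failures)


# Logit of P(DEATH=1) = intercept + slope * BLACKD
intercept = log_odds(death["nonblack"], life["nonblack"])
slope = log_odds(death["black"], life["black"]) - intercept
odds_ratio = math.exp(slope)

print(round(intercept, 4), round(slope, 4), round(odds_ratio, 4))
```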


DATA CONT1;
INPUT F BLACKD DEATH;
DATALINES;
22 0 1
28 1 1
52 0 0
45 1 0
;
PROC GENMOD DATA=CONT1 DESCENDING;
FREQ F;
MODEL DEATH=BLACKD/D=B;
RUN;


Results are identical to those obtained previously. Alternatively, we can run the program

DATA CONT1;
INPUT DEATH TOTAL BLACKD;
DATALINES;
22 74 0
28 73 1
;
PROC GENMOD DATA=CONT1;
MODEL DEATH/TOTAL=BLACKD/D=B;
RUN;


And obtain output


Points to note:

Instead of replicating the observations, GENMOD treats the variable DEATH as having a Binomial distribution with the number of trials given by TOTAL.

Deviance is 0. Why?

Note that the deviance is a likelihood ratio statistic that compares the fitted model with a saturated model. In this case the fitted model is itself the saturated model, with two parameters for two data lines, so the two likelihoods coincide.
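The zero deviance can be illustrated numerically. The sketch below uses the standard binomial deviance formula and compares two sets of fitted probabilities for the two data lines: the observed proportions (what a two-parameter model reproduces exactly) and a pooled proportion (an intercept-only model, added here purely for contrast). All names are illustrative:

```python
import math


def binomial_deviance(successes, trials, fitted_probs):
    """2 * sum[ y*log(y/mu) + (n-y)*log((n-y)/(n-mu)) ] with mu = n*p."""
    total = 0.0
    for y, n, p in zip(successes, trials, fitted_probs):
        mu = n * p
        if y > 0:
            total += y * math.log(y / mu)
        if y < n:
            total += (n - y) * math.log((n - y) / (n - mu))
    return 2.0 * total


deaths = [22, 28]   # nonblack, black defendants
totals = [74, 73]

# Two parameters for two data lines: fitted probabilities equal observed proportions.
saturated_probs = [y / n for y, n in zip(deaths, totals)]
# Intercept-only model for contrast: one common probability.
pooled = sum(deaths) / sum(totals)

dev_saturated = binomial_deviance(deaths, totals, saturated_probs)
dev_intercept_only = binomial_deviance(deaths, totals, [pooled, pooled])

print(dev_saturated, dev_intercept_only)
```

The first value is zero up to floating-point error, while the intercept-only model leaves a positive deviance.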


Three Way Table

Consider the cross-classification table of race, gender and possession of a driver’s license for a sample of 17- and 18-year-olds.

                     Driver’s License
Race     Gender      Yes     No
White    Male         43    134
         Female       26    149
Black    Male         29     23
         Female       22     36


DATA DRIVER;
INPUT WHITE MALE YES NO;
TOTAL = YES+NO;
DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
PROC GENMOD DATA=DRIVER;
MODEL YES/TOTAL=WHITE MALE/D=B;
RUN;


Deviance = 0.0583 with a p-value of 0.8092033193. It can be obtained by executing the SAS program:

DATA _NULL_;
CHI = 1 - PROBCHI(0.0583,1);
PUT CHI;
RUN;

So there is no evidence of an interaction between the explanatory variables.
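The same p-value can be cross-checked outside SAS. A small Python sketch, assuming 1 degree of freedom as in the PROBCHI call; it uses the fact that a chi-square variable with 1 df is a squared standard normal, so the function below (an illustrative name) is valid for 1 df only:

```python
import math


def chi_square_upper_tail_1df(x):
    """P(X > x) for X ~ chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


p_value = chi_square_upper_tail_1df(0.0583)
print(round(p_value, 4))
```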


To see this more explicitly, let us fit the model with interaction:

DATA DRIVER;
INPUT WHITE MALE YES NO;
TOTAL = YES+NO;
DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
PROC GENMOD DATA=DRIVER;
MODEL YES/TOTAL=WHITE MALE WHITE*MALE/D=B;
RUN;


Interpretation

Coefficient of MALE is 0.6478

Exponentiating the coefficient yields 1.91

=> the estimated odds of having a driver’s license are nearly twice as large for males as for females, after adjusting for racial differences.


For WHITE, the highly significant adjusted odds ratio is exp[-1.3135]=0.269, indicating that the odds of having a driver’s license for whites are a little more than ¼ of the odds for blacks.
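Both odds ratios are simply the exponentiated coefficients. A minimal Python check, using the two estimates quoted on the slides (the dictionary names are illustrative):

```python
import math

# Coefficient estimates quoted on the slides (main-effects model).
coefficients = {"MALE": 0.6478, "WHITE": -1.3135}

# An odds ratio is exp(coefficient) for a 0/1 explanatory variable.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))
```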


Four Way Table

The analysis is slightly more complicated with four-way tables because more interactions are possible.

Consider the following table. Our goal is to estimate a LOGIT model for the dependence of working class identification on the other three variables.


                                               Identifies with the working class
Country   Occupation   Father’s Occupation     Yes     No    Total
France    Manual       Manual                   85     22     107
                       Non-Manual               44     21      65
          Non-Manual   Manual                   24     42      66
                       Non-Manual               17    154     171
U.S.      Manual       Manual                   24     63      87
                       Non-Manual               22     43      65
          Non-Manual   Manual                    1     84      85
                       Non-Manual                6    142     148


DATA WORKING;
INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL/D=B;
RUN;


The missing variables are the interaction terms: three 2-way interactions and one 3-way interaction. Because 3-way interactions cannot be interpreted easily, let’s see if we can get by with just the 2-way interactions.
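The count of interaction terms follows from choosing 2 or 3 of the three explanatory variables. A short Python sketch that enumerates them, written in the `A*B` form used in the MODEL statement (the list names are illustrative):

```python
from itertools import combinations

main_effects = ["FRANCE", "MANUAL", "FAMANUAL"]

# Every unordered pair gives a 2-way interaction; the full set gives the 3-way one.
two_way = ["*".join(pair) for pair in combinations(main_effects, 2)]
three_way = ["*".join(triple) for triple in combinations(main_effects, 3)]

print(two_way)
print(three_way)
```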


DATA WORKING;
INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL FRANCE*MANUAL FRANCE*FAMANUAL MANUAL*FAMANUAL/D=B;
RUN;


Examining the Wald Chi-squares, we find that FRANCE*FAMANUAL is highly significant, but the other interaction terms are much less so.


DATA WORKING;
INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL FRANCE*FAMANUAL/D=B;
RUN;


Interpretations of results

Coefficient for MANUAL:

exp(2.5155) = 12.4

=> Manual workers have an odds of identification with the working class that is more than 12 times the odds for non-manual workers

Coefficient for FRANCE*FAMANUAL:

$$\frac{\partial P_i}{\partial\,\mathrm{FAMANUAL}} = f(\cdot)\,(-0.3802 + 1.5061 \times \mathrm{FRANCE})$$


If FRANCE=0, then f(.)[-0.3802] represents the effect of FAMANUAL when the respondent lives in the U.S.

If FRANCE=1, then f(.)[1.13] represents the effect of FAMANUAL when the respondent lives in France; exp[1.13]=3.1.

In France, men whose fathers had a manual occupation have odds of identification that are more than three times the odds for men whose fathers did not have a manual occupation.
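The country-specific effect of FAMANUAL is the sum of its main-effect coefficient and the interaction coefficient times FRANCE. A minimal Python sketch using the two estimates quoted on the slides (function and variable names are illustrative):

```python
import math

# Estimates quoted on the slides.
b_famanual = -0.3802      # main effect of FAMANUAL
b_interaction = 1.5061    # FRANCE*FAMANUAL


def famanual_log_odds_effect(france):
    """Effect of FAMANUAL on the log odds for FRANCE = 0 (U.S.) or 1 (France)."""
    return b_famanual + b_interaction * france


for country, france in (("U.S.", 0), ("France", 1)):
    effect = famanual_log_odds_effect(france)
    print(country, round(effect, 4), round(math.exp(effect), 2))
```

For France the effect works out to about 1.13 on the log-odds scale, which is the figure used in the interpretation above.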


Overdispersion

Refers to the situation of lack of fit.

Causes of overdispersion:

Incorrectly specified model: more interactions or nonlinearity are needed in the model.

Lack of independence of observations due to unobserved heterogeneity at group level.


DATA POSTDOC;
INPUT NIH DOCS PDOCS;
DATALINES;
.5 8 1
.5 9 3
.835 16 1
.998 13 6
1.027 8 2
2.036 9 2
2.106 29 10
...
2.329 5 2
13.749 12 7
14.367 29 21
14.698 19 5
15.440 10 6
17.417 10 8
18.635 14 9
21.524 18 16
;
PROC GENMOD DATA=POSTDOC;
MODEL PDOCS/DOCS=NIH /D=B;
RUN;


Note that the deviance and Pearson χ² clearly indicate model mis-specification.

Because there’s only one independent variable, we don’t have the option of putting in interactions.

One can try allowing for nonlinearity by including powers of NIH in the model, but that won’t help.

It is quite possible that the lack of fit is due to a lack of independence in the observations.


There are many characteristics of biochemistry departments besides NIH funding that may have some bearing on whether their graduates seek and get postdoctoral training. Examples are the prestige of the department, whether the department is in an agricultural or medical school, the age of the department and so on.

Lack of independence of this kind produces what is called extra-binomial variation. The variance of the dependent variable will be greater than what is expected under the assumption of a binomial distribution.
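One common way to write this down, offered here as an assumption rather than something the slides state, is a quasi-binomial variance with a dispersion parameter. The symbols y_i (number of successes), n_i (number of trials), p_i (success probability) and φ (dispersion) are introduced only for this illustration:

```latex
% Binomial variance versus extra-binomial (overdispersed) variance
\operatorname{Var}(y_i) = n_i\, p_i\,(1 - p_i)
\qquad \text{(binomial)}
\\[4pt]
\operatorname{Var}(y_i) = \phi\, n_i\, p_i\,(1 - p_i), \quad \phi > 1
\qquad \text{(extra-binomial variation)}
```

Under this formulation φ = 1 recovers the ordinary binomial model, and values above 1 correspond to the overdispersion described above.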


Besides producing a large deviance, extra-binomial variation can result in underestimates of the standard errors and overestimates of the Chi-square statistics.

Method of adjustment: divide the Pearson Chi-square statistic by its degrees of freedom, take the square root of the result, and multiply all the standard errors by that number.
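The adjustment can be sketched in a few lines of Python, assuming the usual convention that the scale factor is the square root of the Pearson Chi-square divided by its degrees of freedom. The function names are illustrative, and the numbers in the usage lines are hypothetical values chosen only to make the arithmetic easy to follow; they are not results from the POSTDOC data:

```python
import math


def pearson_scale(pearson_chi_square, degrees_of_freedom):
    """Scale factor: sqrt(Pearson chi-square / degrees of freedom)."""
    return math.sqrt(pearson_chi_square / degrees_of_freedom)


def adjust_standard_errors(standard_errors, pearson_chi_square, degrees_of_freedom):
    """Inflate each standard error by the Pearson scale factor."""
    scale = pearson_scale(pearson_chi_square, degrees_of_freedom)
    return [se * scale for se in standard_errors]


# Hypothetical inputs: Pearson chi-square of 8.0 on 2 df, two unadjusted standard errors.
adjusted = adjust_standard_errors([0.5, 0.1], 8.0, 2)
print(adjusted)
```

With a scale factor above 1 every standard error grows, so the corresponding Wald Chi-squares shrink by the square of that factor.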


DATA POSTDOC;
INPUT NIH DOCS PDOCS;
DATALINES;
.5 8 1
.5 9 3
.835 16 1
.998 13 6
1.027 8 2
2.036 9 2
...
13.749 12 7
14.367 29 21
14.698 19 5
15.440 10 6
17.417 10 8
18.635 14 9
21.524 18 16
;
PROC GENMOD DATA=POSTDOC;
MODEL PDOCS/DOCS=NIH /D=B PSCALE;
RUN;
