
Post on 04-Jan-2016


Topic 2

LOGIT analysis of contingency tables

Contingency table

A cross-classification table containing two or more variables of classification. The purpose is to determine whether these variables are related.

                          Change in stock prices in year
Change in stock prices
in January                UP           DOWN         TOTAL
UP                        22 (16.1)    1 (6.9)      23
DOWN                      6 (11.9)     11 (5.1)     17
TOTAL                     28           12           40

A table of this sort can be used to test whether, as some financial analysts suggest, January is a good predictor of whether stock prices will go up or down in the entire year.

H0 : whether or not stock prices go up in the entire year is the same regardless of the behaviour in January

H1 : otherwise

Expected frequencies are shown in parentheses in the table.
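
As a hedged worked check, assuming the usual independence formula (row total times column total divided by the grand total, which the slides do not state explicitly), the expected frequency for the cell "UP in January, UP in year" can be reproduced as:

$$e_{11} = \frac{23 \times 28}{40} = 16.1$$

The other three cells follow the same pattern: $23 \times 12 / 40 = 6.9$, $17 \times 28 / 40 = 11.9$ and $17 \times 12 / 40 = 5.1$.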

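Pearson’s Chi-square statistic

$$P = \sum_{i=1}^{n} \frac{(f_i - e_i)^2}{e_i} \sim \chi^2_{(r-1)(c-1)}$$

where $f_i$ and $e_i$ are the observed and expected frequencies in cell $i$, $n$ is the number of cells, and $r$ and $c$ are respectively the numbers of rows and columns in the table.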

In our example,

$$\sum_{i=1}^{4} \frac{(f_i - e_i)^2}{e_i} = \frac{(22-16.1)^2}{16.1} + \frac{(1-6.9)^2}{6.9} + \frac{(6-11.9)^2}{11.9} + \frac{(11-5.1)^2}{5.1} = 16.96$$

Now $\chi^2_{(1,\,0.05)} = 3.84$, so we reject the null. In other words, based on this evidence the probability that stock prices will go up during the whole year does not seem to be independent of whether or not they go up in January.
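
The comparison with the critical value can be reproduced with a short DATA step. This is a minimal sketch, assuming the built-in SAS functions CINV (chi-square quantile) and PROBCHI (chi-square distribution function); the variable names CRIT and PVALUE are illustrative and do not appear in the original slides.

```sas
DATA _NULL_;
  /* Critical value of the chi-square distribution with 1 df, upper 5 percent tail */
  CRIT = CINV(0.95, 1);
  /* Upper-tail p-value for the Pearson statistic 16.96 computed in the example */
  PVALUE = 1 - PROBCHI(16.96, 1);
  /* Write both values to the SAS log */
  PUT CRIT= PVALUE=;
RUN;
```

The critical value written to the log should round to 3.84, and the p-value falls well below 0.05, consistent with rejecting H0.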

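/* F = cell frequency; YP = 1 if prices went up in the year; JP = 1 if prices went up in January */
DATA STOCK;
INPUT F YP JP;
DATALINES;
22 1 1
6 1 0
1 0 1
11 0 0
;
/* WEIGHT F tells PROC FREQ that each data line stands for F observations */
PROC FREQ DATA=STOCK;
WEIGHT F;
TABLES YP*JP/CHISQ CMH;
RUN;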

Two Way Table

Consider the following SAS program and OUTPUT:

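DATA PENALTY;
INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';
INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
/* DESCENDING models the probability that DEATH=1 rather than DEATH=0 */
/* D=B requests the binomial distribution, whose default link is the logit */
PROC GENMOD DATA=PENALTY DESCENDING;
MODEL DEATH=BLACKD/D=B;
RUN;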

But suppose we don’t have individual-level data. All we have is the following table:

         Blacks   Nonblacks   Total
Death      28        22         50
Life       45        52         97
Total      73        74        147
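
As a quick sanity check on the table, sketched here as a DATA step: for a single binary predictor, the logit coefficient equals the log of the sample odds ratio, so the BLACKD estimate can be anticipated directly from the four cell counts. The variable names below are illustrative and are not part of the original programs.

```sas
DATA _NULL_;
  /* Odds of a death sentence within each group, taken from the table */
  ODDS_BLACK    = 28 / 45;
  ODDS_NONBLACK = 22 / 52;
  /* Sample odds ratio and its natural log */
  ODDSRATIO    = ODDS_BLACK / ODDS_NONBLACK;
  LOGODDSRATIO = LOG(ODDSRATIO);
  /* Write both values to the SAS log */
  PUT ODDSRATIO= LOGODDSRATIO=;
RUN;
```

The odds ratio works out to about 1.47 (log odds ratio about 0.39), which is the value a logit model with BLACKD as the only predictor should recover from these counts.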

DATA CONT1;
INPUT F BLACKD DEATH;
DATALINES;
22 0 1
28 1 1
52 0 0
45 1 0
;
/* FREQ F treats each data line as F identical observations */
PROC GENMOD DATA=CONT1 DESCENDING;
FREQ F;
MODEL DEATH=BLACKD/D=B;
RUN;

Results are identical to those obtained previously. Alternatively, we can run the program

DATA CONT1;
INPUT DEATH TOTAL BLACKD;
DATALINES;
22 74 0
28 73 1
;
/* Events/trials syntax: DEATH events out of TOTAL trials per data line */
PROC GENMOD DATA=CONT1;
MODEL DEATH/TOTAL=BLACKD/D=B;
RUN;

and obtain the corresponding output.

Points to note:

Instead of replicating the observations, GENMOD treats the variable DEATH as having a binomial distribution with the number of trials given by TOTAL.

Deviance is 0. Why?

Note that the deviance is a likelihood ratio test statistic that compares the fitted model with a saturated model. In the previous case, the saturated model is also the fitted model, with two parameters for two data lines.
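
To make the zero deviance concrete, here is a hedged sketch using the standard binomial deviance formula, which the slides do not write out. The notation is introduced for illustration: $y_i$ is the number of deaths, $n_i$ the number of trials (TOTAL) and $\hat{\mu}_i$ the fitted number of deaths for data line $i$.

$$D = 2 \sum_{i=1}^{2} \left[ y_i \log\frac{y_i}{\hat{\mu}_i} + (n_i - y_i) \log\frac{n_i - y_i}{n_i - \hat{\mu}_i} \right]$$

With two parameters for two data lines, the fitted values reproduce the data exactly ($\hat{\mu}_1 = 22$, $\hat{\mu}_2 = 28$), so every logarithm is $\log 1 = 0$ and $D = 0$.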

Three Way Table

Consider the cross-classification table of race, gender and possession of a driver’s license for a sample of 17- and 18-year-olds.

                   Driver’s License
Race    Gender     Yes     No
White   Male        43    134
        Female      26    149
Black   Male        29     23
        Female      22     36

DATA DRIVER;
INPUT WHITE MALE YES NO;
TOTAL = YES+NO;
DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
PROC GENMOD DATA=DRIVER;
MODEL YES/TOTAL=WHITE MALE/D=B;
RUN;

Deviance = 0.0583 with 1 degree of freedom (4 data lines minus 3 estimated parameters), giving a p-value of 0.8092033193. The p-value can be obtained by executing the SAS program:

DATA;
CHI = 1 - PROBCHI(0.0583,1);
PUT CHI;
RUN;

So there is no evidence of an interaction between the explanatory variables.

To see this more explicitly, let us fit the model with interaction:

DATA DRIVER;
INPUT WHITE MALE YES NO;
TOTAL = YES+NO;
DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
PROC GENMOD DATA=DRIVER;
MODEL YES/TOTAL=WHITE MALE WHITE*MALE/D=B;
RUN;

Interpretation

Coefficient of MALE is 0.6478

Exponentiating the coefficient yields 1.91

=> the estimated odds of having a driver’s license are nearly twice as large for males as for females, after adjusting for racial differences.

For WHITE, the highly significant adjusted odds ratio is exp[-1.3135] = 0.269, indicating that the odds of having a driver’s license for whites are a little more than one quarter of the odds for blacks.
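
The two odds ratios quoted in this interpretation can be checked with a short DATA step. This is only an illustrative sketch that plugs in the coefficients reported in the text; the variable names are assumptions, and the reciprocal is added here to express the race effect from the other direction.

```sas
DATA _NULL_;
  /* Adjusted odds ratio for males versus females */
  OR_MALE = EXP(0.6478);
  /* Adjusted odds ratio for whites versus blacks */
  OR_WHITE = EXP(-1.3135);
  /* The same race effect expressed as blacks versus whites */
  OR_BLACK = 1 / OR_WHITE;
  /* Write all three values to the SAS log */
  PUT OR_MALE= OR_WHITE= OR_BLACK=;
RUN;
```

The first two values should round to 1.91 and 0.269. The reciprocal, about 3.7, says the estimated odds for blacks are nearly four times the odds for whites, after adjusting for gender.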

Four Way Table

The analysis is slightly more complicated with four-way tables because more interactions are possible.

Consider the following table. Our goal is to estimate a LOGIT model for the dependence of working class identification on the other three variables.

                                                Identifies with the working class
Country   Occupation    Father’s Occupation     Yes     No    Total
France    Manual        Manual                   85     22     107
                        Non-Manual               44     21      65
          Non-Manual    Manual                   24     42      66
                        Non-Manual               17    154     171
U.S.      Manual        Manual                   24     63      87
                        Non-Manual               22     43      65
          Non-Manual    Manual                    1     84      85
                        Non-Manual                6    142     148

DATA WORKING;
INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL/D=B;
RUN;

The missing variables are the interaction terms: three two-way interactions and one three-way interaction. Because three-way interactions cannot be interpreted easily, let’s see if we can get by with just the two-way interactions.

DATA WORKING;
INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL FRANCE*MANUAL FRANCE*FAMANUAL MANUAL*FAMANUAL/D=B;
RUN;

Examining the Wald Chi-squares, we find that FRANCE*FAMANUAL is highly significant, but the other two interaction terms are not nearly as significant.

DATA WORKING;
INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL FRANCE*FAMANUAL/D=B;
RUN;

Interpretations of results

Coefficient for MANUAL:

exp(2.5155) = 12.4

=> Manual workers have odds of identifying with the working class that are more than 12 times the odds for non-manual workers.

Coefficient for FRANCE*FAMANUAL:

$$\operatorname{logit}(P_i) = f(\cdot) + (-0.3802 + 1.5061 \times \mathrm{FRANCE}) \times \mathrm{FAMANUAL}$$

where $f(\cdot)$ stands for the remaining terms of the model.

If FRANCE=0, the coefficient -0.3802 represents the effect of FAMANUAL when the respondent lives in the U.S.

If FRANCE=1, the coefficient -0.3802 + 1.5061 = 1.13 represents the effect of FAMANUAL when the respondent lives in France; exp[1.13] = 3.1.

In France, the men whose fathers had a manual occupation have odds of identification that are more than three times the odds for men whose fathers did not have a manual occupation.
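
The two conditional effects can be written out as a short worked calculation. This sketch uses only the coefficients quoted in the text; the rounded odds ratios are computed here and are not taken from SAS output.

$$\text{U.S. } (\mathrm{FRANCE}=0): \quad \exp(-0.3802) \approx 0.68$$

$$\text{France } (\mathrm{FRANCE}=1): \quad \exp(-0.3802 + 1.5061) = \exp(1.1259) \approx 3.08$$

So the estimated effect of a father's manual occupation runs in opposite directions in the two countries: the odds of identification are multiplied by about 0.68 in the U.S. and by about 3.1 in France.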

Overdispersion

Refers to a situation of lack of fit.

Causes of overdispersion:

Incorrectly specified model: more interactions or nonlinearity are needed in the model.

Lack of independence of observations due to unobserved heterogeneity at the group level.

DATA POSTDOC;
INPUT NIH DOCS PDOCS;
DATALINES;
.5 8 1
.5 9 3
.835 16 1
.998 13 6
1.027 8 2
2.036 9 2
2.106 29 10
...
2.329 5 2
13.749 12 7
14.367 29 21
14.698 19 5
15.440 10 6
17.417 10 8
18.635 14 9
21.524 18 16
;
PROC GENMOD DATA=POSTDOC;
MODEL PDOCS/DOCS=NIH /D=B;
RUN;

Note that the deviance and Pearson Chi-square clearly indicate model mis-specification.

Because there’s only one independent variable, we don’t have the option of putting in interactions.

One can try allowing for nonlinearity by including powers of NIH in the model, but that won’t help.

It is quite possible that the lack of fit is due to a lack of independence in the observations.

There are many characteristics of biochemistry departments besides NIH funding that may have some bearing on whether their graduates seek and get postdoctoral training. Examples are the prestige of the department, whether the department is in an agricultural or medical school, the age of the department and so on.

Lack of independence of this kind produces what is called extra-binomial variation. The variance of the dependent variable will be greater than what is expected under the assumption of a binomial distribution.

Besides producing a large deviance, extra-binomial variation can result in underestimates of the standard errors and overestimates of the Chi-square statistics. Method of adjustment: divide the Pearson Chi-square statistic by its degrees of freedom, take the square root, and multiply all the standard errors by that number.
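
In symbols, and assuming the usual Pearson-based scale adjustment (the notation $X^2$, $\mathrm{df}$ and $\hat{\phi}$ is introduced here for illustration and does not appear in the slides):

$$\hat{\phi} = \sqrt{\frac{X^2}{\mathrm{df}}}, \qquad \mathrm{SE}_{\text{adj}}(\hat{\beta}) = \hat{\phi} \cdot \mathrm{SE}(\hat{\beta})$$

Each Wald Chi-square statistic is correspondingly divided by $\hat{\phi}^2 = X^2/\mathrm{df}$, which is why ignoring extra-binomial variation overstates the test statistics.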

DATA POSTDOC;
INPUT NIH DOCS PDOCS;
DATALINES;
.5 8 1
.5 9 3
.835 16 1
.998 13 6
1.027 8 2
2.036 9 2
...
13.749 12 7
14.367 29 21
14.698 19 5
15.440 10 6
17.417 10 8
18.635 14 9
21.524 18 16
;
/* PSCALE estimates the scale as the square root of Pearson Chi-square / DF */
PROC GENMOD DATA=POSTDOC;
MODEL PDOCS/DOCS=NIH /D=B PSCALE;
RUN;
