Log-linear Analysis - Analysing Categorical Data

These notes are based on Simkiss, D., Ebrahim, G. J. and Waterston, A. J. R. (Eds.) “Chapter 14: Analysing categorical data: Log-linear analysis”. Journal of Tropical Pediatrics, “Research methods II: Multivariate analysis” (pp. 144–153).

Log-linear Analysis

In log-linear analysis tables are formed that contain one-way, two-way, and higher order associations. The logarithm of the cell frequency is estimated by means of a linear equation (function in mathematical terminology). The log-linear model so developed starts with all the one-way, two-way, and higher order associations. The aim is to construct a model such that the cell frequencies in a contingency table are accounted for by the minimum number of terms. This is done by a process of backward elimination. What this means is that one begins with the maximum number of terms, and then drops a term in each round. Statisticians refer to it as the backward hierarchical method.
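As an illustration of the linear equation mentioned above, the saturated model for a table built from three factors could be written as follows. The notation is an assumption made here for illustration and does not appear in the original notes: $m_{ijk}$ is the expected frequency of the cell at levels $i$, $j$, $k$ of three factors labelled $A$, $B$ and $C$, and each $\lambda$ is a model parameter.

```latex
\log m_{ijk} = \lambda
  + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k}
  + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk}
  + \lambda^{ABC}_{ijk}
```

Backward elimination then amounts to asking whether the last term, and after it each two-way term, can be dropped without a significant loss of fit.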

In practice, one commences the analysis by including all the terms, that is, every main effect and every interaction among the variables. This is referred to as the saturated model. Because it has as many parameters as there are cells, it reproduces the observed cell frequencies exactly. Then the highest order interaction is removed, and its effect on how closely the model can now predict the cell frequencies is noted. This process of progressive elimination is continued.

Each time a term is removed a statistical test is performed to determine whether the accuracy of prediction falls to such an extent that the term most recently eliminated should be retained as one of the components of the final model. At each stage the assessment of goodness-of-fit is made by means of a statistic known as the likelihood ratio. The final model includes only the associations necessary to reproduce the observed frequencies.

The final model is evaluated by comparing the observed and expected frequencies for each cell using the likelihood ratio. As with the χ² test, small expected frequencies can lead to loss of power. It is recommended that all expected frequencies should be greater than 1, and that not more than 20% should be less than 5.
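The rule of thumb for expected frequencies can be checked mechanically. The sketch below is only an illustration: the function name `expected_counts_adequate` and the example numbers are hypothetical, not taken from the notes.

```python
def expected_counts_adequate(expected):
    """Apply the rule of thumb to a table (list of rows) of expected counts.

    Returns True when every expected count exceeds 1 and no more than
    20% of the cells have an expected count below 5.
    """
    flat = [e for row in expected for e in row]
    share_below_5 = sum(e < 5 for e in flat) / len(flat)
    return min(flat) > 1 and share_below_5 <= 0.20


# Hypothetical expected counts, made up purely to exercise the check.
comfortable = [[6.0, 7.0], [8.0, 9.0]]
too_sparse = [[0.8, 12.0], [9.0, 30.0]]

print(expected_counts_adequate(comfortable))
print(expected_counts_adequate(too_sparse))
```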

The Data

In a hospital accident and emergency service 176 subjects who attended for acute chest pain were enrolled in a study. Of these, 71 had abnormal electrocardiograms and 105 had normal ones. Of those with abnormal electrocardiograms, 57 were overweight as judged by their body mass index, and 14 were of normal weight. By comparison, out of the 105 subjects with normal electrocardiograms 40 were overweight and 65 were of normal weight.

In the first group of 71 subjects with abnormal electrocardiograms, out of the 57 overweight subjects 47 were smokers and 10 non-smokers. Amongst the 14 with normal weights 8 were smokers and 6 non-smokers. In the second group of 105 with normal electrocardiograms out of the 40 overweight subjects 25 were smokers and 15 non-smokers. Amongst the 65 with normal weights 35 were smokers and 30 non-smokers. The investigators wish to assess the contribution that overweight and smoking make to coronary artery disease.
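The eight cell counts described above can be held in a small table and the quoted subtotals checked against it. This is a minimal sketch: the helper `margin` and the use of codes 1 and 2 (1 = abnormal, overweight, smoker; 2 = normal, normal weight, non-smoker) are assumptions made for illustration.

```python
# Cell counts keyed by (ECG, BMI, SMOKE); code 1 = abnormal / overweight /
# smoker, code 2 = normal / normal weight / non-smoker (assumed coding).
counts = {
    (1, 1, 1): 47, (1, 1, 2): 10,
    (1, 2, 1): 8,  (1, 2, 2): 6,
    (2, 1, 1): 25, (2, 1, 2): 15,
    (2, 2, 1): 35, (2, 2, 2): 30,
}


def margin(counts, **fixed):
    """Sum the cells whose ecg/bmi/smoke codes match the fixed values."""
    names = ("ecg", "bmi", "smoke")
    total = 0
    for key, n in counts.items():
        cell = dict(zip(names, key))
        if all(cell[name] == value for name, value in fixed.items()):
            total += n
    return total


print("all subjects:", margin(counts))
print("abnormal ECG:", margin(counts, ecg=1))
print("abnormal ECG and overweight:", margin(counts, ecg=1, bmi=1))
print("normal ECG and normal weight:", margin(counts, ecg=2, bmi=2))
```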

The Data - Coding

ECG (electrocardiogram): 1 = Abnormal, 2 = Normal

BMI (body mass index): 1 = Overweight, 2 = Normal weight

Smoke: 1 = Smoker, 2 = Non-smoker

Initial Analysis

We first perform a simple cross-tabulation to check whether the frequencies in each cell are adequate to allow log-linear analysis. Since only summary data are available, use

Data > Weight Cases
Weight cases by the frequency variable Count

then

Analyze > Descriptive Statistics > Crosstabs
Select ECG for the rows, BMI for the columns and Smoke for the layer; finally, under Statistics, select Chi-square.

Initial Analysis

ECG * BMI * SMOKE Crosstabulation

SMOKE  ECG    Statistic       BMI 1   BMI 2   Total
1      1      Count           47      8       55
              Expected Count  34.4    20.6    55.0
       2      Count           25      35      60
              Expected Count  37.6    22.4    60.0
       Total  Count           72      43      115
              Expected Count  72.0    43.0    115.0
2      1      Count           10      6       16
              Expected Count  6.6     9.4     16.0
       2      Count           15      30      45
              Expected Count  18.4    26.6    45.0
       Total  Count           25      36      61
              Expected Count  25.0    36.0    61.0
Total  1      Count           57      14      71
              Expected Count  39.1    31.9    71.0
       2      Count           40      65      105
              Expected Count  57.9    47.1    105.0
       Total  Count           97      79      176
              Expected Count  97.0    79.0    176.0

Raw data plus expected values.
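The expected counts in the crosstabulation follow from the usual independence formula, row total times column total divided by the layer total. A minimal sketch, in which the function name `expected_counts` is an assumption made for illustration:

```python
def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows are ECG 1 and ECG 2; columns are BMI 1 and BMI 2.
smokers = [[47, 8], [25, 35]]
non_smokers = [[10, 6], [15, 30]]

for label, table in (("smokers", smokers), ("non-smokers", non_smokers)):
    expected = expected_counts(table)
    print(label, [[round(e, 1) for e in row] for row in expected])
```

Rounded to one decimal place, the results agree with the Expected Count rows for the two smoking layers.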

Chi-Square Tests

SMOKE  Test                          Value       df  Asymp. Sig. (2-sided)
1      Pearson Chi-Square            23.503 (a)  1   .000
       Continuity Correction (b)     21.670      1   .000
       Likelihood Ratio              24.906      1   .000
       Fisher's Exact Test
       Linear-by-Linear Association  23.298      1   .000
       N of Valid Cases              115
2      Pearson Chi-Square            4.151 (c)   1   .042
       Continuity Correction (b)     3.033       1   .082
       Likelihood Ratio              4.113       1   .043
       Fisher's Exact Test
       Linear-by-Linear Association  4.083       1   .043
       N of Valid Cases              61
Total  Pearson Chi-Square            30.472 (d)  1   .000
       Continuity Correction (b)     28.791      1   .000
       Likelihood Ratio              32.094      1   .000
       Fisher's Exact Test
       Linear-by-Linear Association  30.299      1   .000
       N of Valid Cases              176

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 20.57.
b. Computed only for a 2x2 table.
c. 0 cells (.0%) have expected count less than 5. The minimum expected count is 6.56.
d. 0 cells (.0%) have expected count less than 5. The minimum expected count is 31.87.
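The Pearson and likelihood ratio statistics in this output can be recomputed from the counts. The sketch below uses the standard textbook formulas; the function name `chi_square_tests` is an assumption, and this is not a description of how SPSS itself computes them.

```python
import math


def chi_square_tests(table):
    """Return (Pearson chi-square, likelihood ratio chi-square) for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    pearson = 0.0
    likelihood_ratio = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to the LR sum
                likelihood_ratio += 2 * observed * math.log(observed / expected)
    return pearson, likelihood_ratio


# Rows are ECG 1 and ECG 2; columns are BMI 1 and BMI 2.
tables = {
    "smokers": [[47, 8], [25, 35]],
    "non-smokers": [[10, 6], [15, 30]],
    "total": [[57, 14], [40, 65]],
}
for label, table in tables.items():
    pearson, lr = chi_square_tests(table)
    print(label, round(pearson, 3), round(lr, 3))
```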

From the results we infer that among both smokers and non-smokers there is an association between being overweight and an abnormal electrocardiogram.

How strong is the interaction between an abnormal electrocardiogram, smoking and being overweight?

Full Analysis

This question is better answered by log-linear analysis, as shown below:

Analyze > Loglinear > Model Selection
Select BMI, ECG and Smoke as the factors; do not forget to define the range [1, 2] in each case. Then proceed.

Hierarchical Loglinear Analysis - Design 1

Cell Counts and Residuals

SMOKE  BMI  ECG   Observed             Expected             Residuals  Std. Residuals
                  Count (a)   %        Count       %
1      1    1     47.500      27.0%    47.500      27.0%    .000       .000
            2     25.500      14.5%    25.500      14.5%    .000       .000
       2    1     8.500       4.8%     8.500       4.8%     .000       .000
            2     35.500      20.2%    35.500      20.2%    .000       .000
2      1    1     10.500      6.0%     10.500      6.0%     .000       .000
            2     15.500      8.8%     15.500      8.8%     .000       .000
       2    1     6.500       3.7%     6.500       3.7%     .000       .000
            2     30.500      17.3%    30.500      17.3%    .000       .000

a. For saturated models, .500 has been added to all observed cells.

The output commences with information about the number of cases, the factors and their levels. A hierarchical model is being fitted. In a hierarchical model it is sufficient to list the highest order terms, because every lower order term contained in them is included automatically. This list is called the “generating class” of the model.

K-Way and Higher-Order Effects

The likelihood ratio chi-square for the model with no parameters other than the mean is 69.822. The value for the model containing only the first order effects is 44.530. The difference 69.822 − 44.530 = 25.292 is displayed on the first line of the K-way Effects part of the table. The difference is a measure of how much the model improves when first order effects are included. The very small P value (reported as .000) means that the hypothesis that the first order effects are zero is rejected. In other words there is a first order effect.

                                K   df   Likelihood Ratio        Pearson
                                         Chi-Square    Sig.      Chi-Square
K-way and Higher Order Effects  1   7    69.822        .000      68.727
                                2   4    44.530        .000      46.724
                                3   1    1.389         .239      1.421
K-way Effects                   1   3    25.292        .000      22.004
                                2   3    43.142        .000      45.303
                                3   1    1.389         .239      1.421
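The first two likelihood ratio values in the table can be recomputed from the observed counts, since both models have closed-form expected counts: equal counts in every cell for the mean-only model, and the product of the one-way margins for the model with first order effects only. This is a minimal sketch; the variable and function names are assumptions made for illustration.

```python
import math

# Observed counts keyed by (SMOKE, BMI, ECG), codes 1/2 as in the notes.
observed = {
    (1, 1, 1): 47, (1, 1, 2): 25, (1, 2, 1): 8, (1, 2, 2): 35,
    (2, 1, 1): 10, (2, 1, 2): 15, (2, 2, 1): 6, (2, 2, 2): 30,
}
n = sum(observed.values())


def g2(expected):
    """Likelihood ratio chi-square: 2 * sum of observed * ln(observed/expected)."""
    return 2 * sum(o * math.log(o / expected[cell]) for cell, o in observed.items())


def one_way_total(axis, level):
    """Total count for one level of one factor (axis 0, 1 or 2)."""
    return sum(o for cell, o in observed.items() if cell[axis] == level)


# Mean-only model: every cell is expected to hold n / 8 subjects.
mean_only = {cell: n / len(observed) for cell in observed}

# First order effects only: the three factors are mutually independent.
independence = {
    cell: one_way_total(0, cell[0]) * one_way_total(1, cell[1])
    * one_way_total(2, cell[2]) / n ** 2
    for cell in observed
}

g2_mean = g2(mean_only)
g2_first = g2(independence)
print(round(g2_mean, 3), round(g2_first, 3), round(g2_mean - g2_first, 3))
```

Up to rounding, the three printed numbers should correspond to the values 69.822, 44.530 and 25.292 in the output.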

Similar reasoning is applied now to the question of second order effect. The addition of a second order effect improves the likelihood ratio chi-square by 43.142. This is also significant. But the addition of a third order term does not help. The P value is not significant. 

In log-linear analysis the change in the value of the likelihood ratio chi-square statistic when terms are removed from (or added to) the model is an indicator of their contribution. We saw this in multiple linear regression with regard to R². The difference is that in linear regression large values of R² are associated with good models. The opposite is the case in log-linear analysis: small values of the likelihood ratio chi-square mean a good model.

Backward Elimination Statistics

The purpose here is to find the unsaturated model that would provide the best fit to the data. This is done by checking that the model currently being tested does not give a significantly worse fit than its predecessor.

Step Summary

Step                     Effects                         Chi-Square  df
0     Generating Class   SMOKE*BMI*ECG                   .000        0
      Deleted Effect 1   SMOKE*BMI*ECG                   1.389       1
1     Generating Class   SMOKE*BMI, SMOKE*ECG, BMI*ECG   1.389       1
      Deleted Effect 1   SMOKE*BMI                       3.080       1
      Deleted Effect 2   SMOKE*ECG                       3.505       1
      Deleted Effect 3   BMI*ECG                         27.631      1
2     Generating Class   SMOKE*ECG, BMI*ECG              4.469       2
      Deleted Effect 1   SMOKE*ECG                       7.968       1
      Deleted Effect 2   BMI*ECG                         32.094      1
3     Generating Class   SMOKE*ECG, BMI*ECG              4.469       2

As a first step the procedure commences with the most complex term. In our case it is the three-way interaction BMI * ECG * SMOKING. Its elimination produces a chi-square change of 1.389, which has an associated significance level of 0.2386. Since this is greater than the criterion level of 0.05, the term is removed.
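The significance level attached to a chi-square change with 1 degree of freedom can be obtained from the complementary error function in the Python standard library. A minimal sketch, in which the function name `chi2_upper_tail_df1` is an assumption and the identity used holds only for 1 degree of freedom:

```python
import math


def chi2_upper_tail_df1(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom.

    For 1 df the variate is the square of a standard normal, so
    P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


criterion = 0.05
change = 1.389  # chi-square change when the three-way term is deleted
p_value = chi2_upper_tail_df1(change)
decision = "remove term" if p_value > criterion else "keep term"
print(round(p_value, 4), decision)
```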

The procedure moves on to the next hierarchical level, described under step 1. All two-way interactions between the three variables are tested. Removal of BMI * ECG would produce a large change of 27.631 in the likelihood ratio chi-square. The P value for that is highly significant (P < 0.0005). The smallest change (3.080) is related to the BMI * SMOKING interaction. This is removed next. The procedure continues until the final model, which contains the second order interactions BMI * ECG and ECG * SMOKING. Each time an estimate is obtained it is called an iteration. Iteration stops when the largest difference between successive estimates falls below a preset threshold, called the convergence criterion.
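The final model, with BMI * ECG and ECG * SMOKING but no BMI * SMOKING term, says that smoking and body mass index are independent within each electrocardiogram category. For that particular model the fitted counts have a closed form, so its likelihood ratio chi-square can be checked without iteration. This is a sketch under that assumption; the names `observed`, `total` and `fitted` are illustrative.

```python
import math

# Observed counts keyed by (SMOKE, BMI, ECG), codes 1/2 as in the notes.
observed = {
    (1, 1, 1): 47, (1, 1, 2): 25, (1, 2, 1): 8, (1, 2, 2): 35,
    (2, 1, 1): 10, (2, 1, 2): 15, (2, 2, 1): 6, (2, 2, 2): 30,
}


def total(smoke=None, bmi=None, ecg=None):
    """Sum the observed counts over every cell matching the given codes."""
    wanted = (smoke, bmi, ecg)
    return sum(
        count for cell, count in observed.items()
        if all(w is None or w == c for w, c in zip(wanted, cell))
    )


# Fitted counts when SMOKE and BMI are independent within each ECG level:
# (SMOKE-by-ECG total) * (BMI-by-ECG total) / (ECG total).
fitted = {
    (s, b, e): total(smoke=s, ecg=e) * total(bmi=b, ecg=e) / total(ecg=e)
    for (s, b, e) in observed
}

g2_final = 2 * sum(o * math.log(o / fitted[cell]) for cell, o in observed.items())
print(round(g2_final, 3))
```

The result should be close to the chi-square of 4.469 on 2 degrees of freedom reported for the final generating class.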

We conclude that being overweight and smoking each have a significant association with an abnormal electrocardiogram. However, in this particular group of subjects the association is stronger for being overweight.

Odds Ratio

We could have inferred this by calculating the odds ratio when we performed the cross tabulation.  The odds ratio is a measure of effect size, describing the strength of association or non-independence between two binary data values. It is used as a descriptive statistic, and plays an important role in logistic regression. Unlike other measures of association for paired binary data such as the relative risk, the odds ratio treats the two variables being compared symmetrically, and can be estimated using some types of non-random samples.

If we observe data in the form of a contingency table

         Y = 1      Y = 0
X = 1    $n_{11}$   $n_{10}$
X = 0    $n_{01}$   $n_{00}$

then the probabilities in the joint distribution can be estimated as

         Y = 1            Y = 0
X = 1    $\hat{p}_{11}$   $\hat{p}_{10}$
X = 0    $\hat{p}_{01}$   $\hat{p}_{00}$

where $\hat{p}_{ij} = n_{ij}/n$, with $n = n_{11} + n_{10} + n_{01} + n_{00}$ being the sum of all four cell counts.

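The sample log odds ratio is

$$L = \log\left(\frac{\hat{p}_{11}\,\hat{p}_{00}}{\hat{p}_{10}\,\hat{p}_{01}}\right) = \log\left(\frac{n_{11}\,n_{00}}{n_{10}\,n_{01}}\right).$$

The distribution of the log odds ratio is approximately normal with:

$$L \sim N\left(\log(OR),\ \sigma^{2}\right).$$

The standard error for the log odds ratio is approximately

$$SE = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{00}}}.$$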

The odds ratio calculation is shown below, separately for the two smoking layers of the cross-tabulation.

Smokers (SMOKE 1)

                        Cardiogram abnormal   Cardiogram normal
                        (ECG 1)               (ECG 2)
Overweight (BMI 1)      47                    25
Normal weight (BMI 2)   8                     35

Odds Ratio = (47 × 35) / (25 × 8) = 8.225, ln(Odds Ratio) = 2.11

Non-smokers (SMOKE 2)

                        Cardiogram abnormal   Cardiogram normal
                        (ECG 1)               (ECG 2)
Overweight (BMI 1)      10                    15
Normal weight (BMI 2)   6                     30

Odds Ratio = (10 × 30) / (15 × 6) = 3.33, ln(Odds Ratio) = 1.20
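The two calculations above (on the counts 47, 25, 8, 35 and 10, 15, 6, 30) can be reproduced with a few lines of code, together with the approximate standard error of the log odds ratio. The function name `odds_ratio_summary` is an assumption, and the 95% confidence interval based on z = 1.96 is an addition for illustration that the original notes do not report.

```python
import math


def odds_ratio_summary(n11, n10, n01, n00, z=1.96):
    """Odds ratio, its natural log, the approximate SE of the log, and a CI.

    The interval (assumed 95%, z = 1.96) is formed on the log scale and
    transformed back to the odds ratio scale.
    """
    odds_ratio = (n11 * n00) / (n10 * n01)
    log_or = math.log(odds_ratio)
    se = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    interval = (math.exp(log_or - z * se), math.exp(log_or + z * se))
    return odds_ratio, log_or, se, interval


# Cell order: row 1 / column 1, row 1 / column 2, row 2 / column 1, row 2 / column 2.
first = odds_ratio_summary(47, 25, 8, 35)
second = odds_ratio_summary(10, 15, 6, 30)

for label, result in (("first table", first), ("second table", second)):
    odds_ratio, log_or, se, interval = result
    print(label, round(odds_ratio, 3), round(log_or, 2), round(se, 3),
          tuple(round(limit, 2) for limit in interval))
```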

Comments

To perform a multi-way frequency analysis tables are formed that contain the one-way, two-way, three-way, and higher order associations. The log-linear model starts with all of the one-, two-, three-, and higher-way associations, and then eliminates as many of them as possible while still maintaining an adequate fit between expected and observed cell frequencies. In log-linear modelling the full model that includes all possible main effects and interactions fits the data exactly, with zero residual deviance. One then assesses whether a less full model fits the data adequately by comparing its residual deviance with the full model.

In our example, the three-way association tested was between category of electrocardiogram, body mass index, and smoking. It was eliminated because it was not found to be significant. After that the two-way associations were tested. Two of them (type of electrocardiogram with body mass index, and type of electrocardiogram with smoking) were found to be significant and were retained.

As we have seen the purpose of multi-way frequency analysis is to test for association among discrete variables. Once a preliminary search for association is completed by simple 2 x 2 contingency tables a model is fitted that includes only the associations necessary to reproduce the observed frequencies. 

In the above example, we have a data set with a binary response variable (Electrocardiogram abnormal/normal) and explanatory variables that are all categorical. In such a situation one has a choice between using logistic regression and log-linear modelling. For performing logistic regression rearrangement of the data is needed so that for each variable we have a column of 1’s and 0’s. 
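The rearrangement needed for logistic regression can be sketched as follows: each cell of the frequency table is expanded into one row per subject, with every variable recoded to 1 or 0. The column names `ecg_abnormal`, `overweight` and `smoker` are assumptions chosen for illustration.

```python
# One tuple per cell: (ecg_abnormal, overweight, smoker, frequency),
# where 1 = abnormal / overweight / smoker and 0 = the other category.
cells = [
    (1, 1, 1, 47), (1, 1, 0, 10), (1, 0, 1, 8), (1, 0, 0, 6),
    (0, 1, 1, 25), (0, 1, 0, 15), (0, 0, 1, 35), (0, 0, 0, 30),
]

# Repeat each cell's 0/1 pattern once per subject in that cell.
rows = [
    {"ecg_abnormal": ecg, "overweight": bmi, "smoker": smoke}
    for ecg, bmi, smoke, frequency in cells
    for _ in range(frequency)
]

print(len(rows), "rows;", sum(row["ecg_abnormal"] for row in rows), "with abnormal ECG")
```

The resulting list has one row per subject and could be passed to any logistic regression routine, with `ecg_abnormal` as the response.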

Other differences from logistic regression are:

1. There is no clear demarcation between outcome and explanatory variables in log-linear models.
2. Logistic regression allows continuous as well as categorical explanatory variables to be included in the regression analysis.