
CHAPTER 2 ST 544, D. Zhang

2 Contingency Tables

I. Probability Structure of a 2-way Contingency Table

I.1 Contingency tables

• X, Y: categorical variables. Y is usually random (except in a case-control study) and is the response; X can be random or fixed and usually acts like a covariate. X has I levels, Y has J levels.

• A contingency table for X,Y is an I × J table filled with data.

• For example,

             Y
          1     2     3
X    1   n11   n12   n13
     2   n21   n22   n23

             Y
          1     2
X    1   n11   n12
     2   n21   n22
     3   n31   n32

Slide 40


• For example, from a random sample of n = 1127 Americans, we have

the following contingency table:

Table 2.1. Cross-classification of belief in afterlife by gender

              Belief in afterlife
Gender        Yes     No/Undecided
Female        509     116
Male          398     104

• With a contingency table for X,Y , we would like to understand the

association between X and Y , the underlying probability structure of

the table, etc.

• For example, for the afterlife table, we would like to see if one gender

is more likely to believe in afterlife, or the overall proportion with belief

in afterlife in the population, etc.

Slide 41


I.2 Sampling schemes, types of studies, probability structure

• Sampling schemes - ways to get data (tables):

1. Multinomial sampling: From the population, we obtain a random

sample, then cross classify individuals to table cells.

⋆ An example on belief in afterlife from n = 1127 Americans:

Table 2.1. Cross-classification of belief in afterlife by gender

              Belief in afterlife
Gender        Yes     No/Undecided    Total
Female        509     116             625
Male          398     104             502

⋆ This is an example of multinomial sampling.

⋆ A study using this sampling method is called a cross-sectional study.

Slide 42

⋆ In general, a 2 × 2 table from multinomial sampling looks like

             Y
          1     2
X    1   n11   n12   n1+
     2   n21   n22   n2+
         n+1   n+2   n

where (n11, n12, n21, n22) are random variables that have a multinomial distribution with sample size n (n = n11 + n12 + n21 + n22) and probabilities

             Y
          1     2
X    1   π11   π12
     2   π21   π22

(π11, π12, π21, π22) define the probability structure of the contingency table.

Slide 43

The standard statistical model underlying analysis of contingency tables is to assume that (unconditional on the total count) the cell counts are independent Poisson random variables.

Once you impose a total cell count for the contingency table, or a row or column count, the resulting conditional distributions of the cell counts then become multinomial.

https://stats.stackexchange.com/questions/45479/pearsons-residuals


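⋆ The πij's can be estimated by pij = nij/n.

⋆ With multinomial sampling, we can estimate many relevant quantities:

$$\hat{P}[Y = 1] = \frac{n_{11} + n_{21}}{n} = \frac{n_{+1}}{n}$$

$$\hat{P}[X = 1] = \frac{n_{11} + n_{12}}{n} = \frac{n_{1+}}{n}$$

$$\hat{P}[Y = 1 \mid X = 1] = \frac{n_{11}}{n_{11} + n_{12}} = \frac{n_{11}}{n_{1+}}$$

$$\hat{P}[X = 1 \mid Y = 1] = \frac{n_{11}}{n_{11} + n_{21}} = \frac{n_{11}}{n_{+1}}, \ \ldots$$

⋆ For the afterlife example, we estimate that

$$\hat{P}[\text{belief in afterlife}] = \frac{509 + 398}{1127} = 80\%$$

$$\hat{P}[\text{belief in afterlife} \mid \text{Female}] = \frac{509}{509 + 116} = 81\%$$

$$\hat{P}[\text{belief in afterlife} \mid \text{Male}] = \frac{398}{398 + 104} = 79\%, \ \ldots$$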

Slide 44

Column totals for Table 2.1: 907 (Yes), 220 (No/Undecided), 1,127 (Total).

1. Find joint prob; 2. Find marginal prob; 3. Find conditional prob.
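The three steps above (joint, marginal, conditional) can be sketched in a few lines of Python. This is only an illustration using the Table 2.1 counts; the variable names are made up for this example and do not come from the course code.

```python
# Table 2.1: rows = gender (Female, Male), columns = belief (Yes, No/Undecided)
counts = [[509, 116],
          [398, 104]]

row_totals = [sum(row) for row in counts]          # n_{i+}
col_totals = [sum(col) for col in zip(*counts)]    # n_{+j}
n = sum(row_totals)                                # total sample size

# 1. Joint estimates p_ij = n_ij / n
joint = [[cell / n for cell in row] for row in counts]

# 2. Marginal estimate of P[belief = Yes]
p_yes = col_totals[0] / n

# 3. Conditional estimates of P[belief = Yes | gender]
p_yes_given_female = counts[0][0] / row_totals[0]
p_yes_given_male = counts[1][0] / row_totals[1]

print(f"n = {n}")
print(f"P[Yes]          = {p_yes:.3f}")
print(f"P[Yes | Female] = {p_yes_given_female:.3f}")
print(f"P[Yes | Male]   = {p_yes_given_male:.3f}")
```

These estimates are only all valid because the table comes from multinomial sampling; under product-multinomial sampling some of them would not be estimable.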


2. Product-multinomial sampling on X: For example, in a clinical

trial for heart disease, we randomly assign 200 patients to

treatment 1 and 100 patients to treatment 2 and may obtain

potential data like the following:

                          Y
               Better   No Change   Worse   Total
Treatment 1    n11      n12         n13     200
Treatment 2    n21      n22         n23     100

Here we have

(n11, n12, n13) ⊥ (n21, n22, n23)

(n11, n12, n13) ∼ multinomial(200, (π1, π2, π3)), π1 + π2 + π3 = 1

(n21, n22, n23) ∼ multinomial(100, (τ1, τ2, τ3)), τ1 + τ2 + τ3 = 1

(π1, π2, π3) and (τ1, τ2, τ3) define the probability structure of this

contingency table.

Slide 45


⋆ In general, the data look like

             Y
          1     2     3
X    1   n11   n12   n13   n1+
     2   n21   n22   n23   n2+

where n1+ and n2+, the sample sizes for X = 1 and X = 2, are fixed.

(n11, n12, n13) ⊥ (n21, n22, n23)

(n11, n12, n13) ∼ multinom(n1+, (π1, π2, π3)), π1 + π2 + π3 = 1

(n21, n22, n23) ∼ multinom(n2+, (τ1, τ2, τ3)), τ1 + τ2 + τ3 = 1

⋆ Since the likelihood of the π's and τ's is the product of the likelihood of the π's and the likelihood of the τ's, this sampling scheme is called product-multinomial sampling on X.

⋆ Clinical trials and cohort studies (prospective studies) all use this sampling scheme.

Slide 46

Prospective study: Participants are enrolled into the study before they develop the disease or outcome in question.


⋆ When X is also random (so has a distribution in the population), (π1, π2, π3) defines the conditional distribution of Y given X = 1, and (τ1, τ2, τ3) defines the conditional distribution of Y given X = 2.

⋆ With product-multinomial sampling on X, we can only estimate conditional probabilities of Y|X = x. Other probabilities are not estimable. For example, we cannot estimate P[Y = 1].

Slide 47


3. Product multinomial sampling on Y:

If Y represents a rare event, then a prospective study is inefficient.

For example, if we would like to investigate the association between

smoking and lung cancer and conduct a prospective study

                 Lung Cancer
                 Yes    No
Smoking   Yes    n11    n12    n1+
          No     n21    n22    n2+

then n11, n21 will be small unless n1+ and n2+ are very large.

This will yield an inefficient study.

Slide 48


⋆ We may consider a design such as the following one:

                 Lung Cancer
                 Yes          No
Smoking   Yes    n11          n12
          No     n21          n22
                 n+1 = 100    n+2 = 200

None of the cell counts will be small ⇒ efficient.

n11 ⊥ n12

n11 ∼ Bin(n+1, π1), π1 = P[smoking|case].

n12 ∼ Bin(n+2, π2), π2 = P[smoking|control].

⋆ We can still investigate the association between smoking and lung cancer using this design.

⋆ This sampling scheme is product-multinomial on Y.

⋆ Such a study is often called a case-control study.

Slide 49


⋆ In general,

                 Lung Cancer
                 Yes    No
Smoking   Yes    n11    n12
          No     n21    n22
                 n+1    n+2

where n+1 and n+2 are fixed.

n11 ⊥ n12

n11 ∼ Bin(n+1, π1), π1 = P [smoking|case].

n12 ∼ Bin(n+2, π2), π2 = P [smoking|control].

Slide 50

$$\hat{\pi}_1 = \frac{n_{11}}{n_{11} + n_{21}}, \qquad \hat{\pi}_2 = \frac{n_{12}}{n_{12} + n_{22}}$$


⋆ Example of a case-control study on MI (Table 2.4)

Table 2.4. Case-Control Study on MI

                 Myocardial Infarction
Ever Smoker      Case     Control
Yes              172      173
No               90       346
Total            262      519

where 262 is the sample size for MI cases and 519 is the sample size for controls.

⋆ From this study, we cannot estimate quantities such as

P[MI]

P[Ever Smoking]

P[MI|Ever smokers]

P[MI|Never smokers] ...

Slide 51


• Note: Multinomial sampling ⇒ product-multinomial sampling.

For example, if we have data from a multinomial sampling with sample

size n:

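             Y
          1     2
X    1   n11   n12
     2   n21   n22

             Y
          1     2
X    1   π11   π12
     2   π21   π22

Then we can view the data as coming from product-multinomial sampling on X or product-multinomial sampling on Y.

That is:

$$n_{11} \mid n_{1+} \sim \mathrm{Bin}\!\left(n_{1+}, \frac{\pi_{11}}{\pi_{11} + \pi_{12}}\right) \ \perp\ n_{21} \mid n_{2+} \sim \mathrm{Bin}\!\left(n_{2+}, \frac{\pi_{21}}{\pi_{21} + \pi_{22}}\right)$$

Or

$$n_{11} \mid n_{+1} \sim \mathrm{Bin}\!\left(n_{+1}, \frac{\pi_{11}}{\pi_{11} + \pi_{21}}\right) \ \perp\ n_{12} \mid n_{+2} \sim \mathrm{Bin}\!\left(n_{+2}, \frac{\pi_{12}}{\pi_{12} + \pi_{22}}\right)$$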

Slide 52


I.3 Sensitivity & Specificity in Diagnostic Tests

• In a diagnostic test, X = true disease status, Y = test result. Then we

can form a 2× 2 table:

                       Y
                Positive   Negative
X   Disease
    No Disease

• Using data from multinomial sampling or product-multinomial

sampling on X, we can estimate

Sensitivity = P [Y = Positive|X = Disease] (True positive rate)

Specificity = P [Y = Negative|X = No disease] (True negative rate)

• 1-Sensitivity = False negative rate, 1-Specificity = False positive rate.

Sensitivity and specificity tell us how accurate a test/device is.

The manufacturer of a test device usually provides these two measures.

Slide 53

Q: Find sensitivity and specificity.

The higher the sensitivity and specificity, the better the diagnostic test.


• However, a customer (or potential patient) may be more interested in

the following quantities:

P [X = Disease|Y = Positive] (PV+)

P [X = No disease|Y = Negative] (PV-)

• An accurate test may not yield high PV+ and/or PV-.

For example, assume a mammogram (for breast cancer, BR) has sensitivity = 0.86 and specificity = 0.88, and that P[breast cancer] = 0.01. Then, by Bayes' rule,

$$PV+ = P[X = BR \mid Y = +] = \frac{P[X = BR, Y = +]}{P[Y = +]}$$

$$= \frac{P[Y = + \mid X = BR]\,P[X = BR]}{P[Y = + \mid X = BR]\,P[X = BR] + P[Y = + \mid X = \text{No BR}]\,P[X = \text{No BR}]}$$

$$= \frac{0.86 \times 0.01}{0.86 \times 0.01 + (1 - 0.88) \times (1 - 0.01)} = 6.8\%$$

Similarly, PV- = 99.8% (without the test, P[No BR] = 0.99).

Slide 54

Positive Predictive Value (PV+) is the probability of disease in an individual with a positive test result. Negative Predictive Value (PV - ) is the probability of not having the disease when the test result is negative.
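The Bayes' rule calculation for PV+ and PV- can be checked numerically. The sketch below is illustrative only; the function name `predictive_values` is an assumption made for this example, not something from the slides.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PV+, PV-) computed with Bayes' rule.

    sensitivity = P[Y = + | disease], specificity = P[Y = - | no disease],
    prevalence  = P[disease].
    """
    # Total probability of a positive test result
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    p_neg = 1 - p_pos
    pv_pos = sensitivity * prevalence / p_pos
    pv_neg = specificity * (1 - prevalence) / p_neg
    return pv_pos, pv_neg


# Mammogram example from the slide
pv_pos, pv_neg = predictive_values(sensitivity=0.86, specificity=0.88,
                                   prevalence=0.01)
print(f"PV+ = {pv_pos:.1%}, PV- = {pv_neg:.1%}")
```

Rerunning the function with a larger prevalence shows how strongly PV+ depends on how common the disease is, even though sensitivity and specificity stay fixed.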


I.4 Independence of X and Y

• X and Y are random with the underlying probability structure

              Y
          1     2     ...   J
X    1   π11   π12    ...   π1J
     2   π21   π22    ...   π2J
     ...
     I   πI1   πI2    ...   πIJ

• X ⊥ Y

⇔ P[X = i, Y = j] = P[X = i] P[Y = j] for i = 1, 2, ..., I, j = 1, 2, ..., J.

⇔ πij = πi+ π+j for i = 1, 2, ..., I, j = 1, 2, ..., J (πi+ = πi1 + πi2 + ... + πiJ, π+j = π1j + π2j + ... + πIj).

⇔ P[Y = j|X = i] = P[Y = j|X = k] for all i, j, k.

Slide 55


• When X and Y are random 2-level cat. variables, the underlying

probability structure is

             Y
          1     2
X    1   π11   π12
     2   π21   π22

• X ⊥ Y ⇔ πij = πi+ π+j for i, j = 1, 2 (πi+ = πi1 + πi2, π+j = π1j + π2j).

We only need one of them, e.g. π11 = π1+ π+1

⇔ P[Y = 1|X = 1] = P[Y = 1|X = 2], i.e.

$$\pi_1 = \frac{\pi_{11}}{\pi_{1+}} = \frac{\pi_{21}}{\pi_{2+}} = \pi_2$$

Slide 56


II Comparing Proportions in 2× 2 Tables

II.1 Difference of proportions

• Given data from a multinomial sampling or product-multinomial

sampling on X

             Y
          1     2
X    1   n11   n12   n1+
     2   n21   n22   n2+

we would like to make inference on π1 − π2, where π1 = P[Y = 1|X = 1] is the success probability for row 1 and π2 = P[Y = 1|X = 2] is the success probability for row 2.

• X ⊥ Y ⇔ π1 − π2 = 0.

Slide 57


1. Estimate of π1 − π2:

$$p_1 - p_2 = \frac{n_{11}}{n_{1+}} - \frac{n_{21}}{n_{2+}}.$$

2. Estimated SE (standard error) of p1 − p2:

$$\widehat{SE}(p_1 - p_2) = \sqrt{p_1(1 - p_1)/n_{1+} + p_2(1 - p_2)/n_{2+}}$$

3. Large-sample (1 − α) CI for π1 − π2:

$$p_1 - p_2 \pm z_{\alpha/2}\,\widehat{SE}(p_1 - p_2).$$

If this CI does not contain 0, we can reject H0 : X ⊥ Y at

significance level α.

Slide 58

Recall the critical value for α = 0.05: zα/2 = qnorm(0.975) = 1.959964


• Example: Aspirin and heart attack.

In a 5-yr study, 22,000+ physicians were randomized (blinded) to the

placebo/aspirin (one tablet every other day) group:

                 Myocardial infarction
Treatment        Yes      No         Total
Placebo          189      10,845     11,034
Aspirin          104      10,933     11,037

1. Difference of MI probabilities between placebo and aspirin groups:

p1 − p2 = 189/11034 − 104/11037 = 0.0171 − 0.0094 = 0.0077.

2. Estimated standard error:

$$\widehat{SE} = \sqrt{0.0171(1 - 0.0171)/11034 + 0.0094(1 - 0.0094)/11037} = 0.0015.$$

3. Large-sample 95% CI for the difference of MI probabilities:

0.0077 ± 1.96 × 0.0015 = [0.0048, 0.0106].

⇒ Physicians in the placebo group are more likely to develop MI.

Slide 59
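As a rough check of the three steps for the aspirin data, here is a minimal Python sketch using only the standard library. The function name `diff_prop_ci` is hypothetical. Because it keeps full precision instead of the rounded 0.0077 and 0.0015, the interval it prints, about [0.0047, 0.0107], differs from the slide's in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist


def diff_prop_ci(n11, n1, n21, n2, alpha=0.05):
    """Large-sample (Wald) CI for pi1 - pi2 from two independent binomials.

    n11 successes out of n1 in row 1, n21 successes out of n2 in row 2.
    """
    p1, p2 = n11 / n1, n21 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return diff, se, (diff - z * se, diff + z * se)


# Aspirin study: MI counts and group sizes for placebo and aspirin
diff, se, (lo, hi) = diff_prop_ci(189, 11034, 104, 11037)
print(f"p1 - p2 = {diff:.4f}, SE = {se:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

The interval excludes 0, which agrees with the slide's conclusion.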


II.2 Relative Risk

• When both π1 and π2 are close to zero (rare event), the difference

π1 − π2 may not be very meaningful.

For example,

Case 1: π1 = 0.01, π2 = 0.001⇒ π1 − π2 = 0.009

Case 2: π1 = 0.41, π2 = 0.401⇒ π1 − π2 = 0.009

The above cases have the same difference π1 − π2. However, the

meanings are totally different.

• For rare events, a more relevant measure of difference is the relative risk (RR):

$$RR = \frac{\pi_1}{\pi_2}.$$

Slide 60

For example: (a) RR = 0.01/0.001 = 10; (b) RR = 0.41/0.401 = 1.022444.


• Properties of the relative risk (RR):

1. 0 < RR < ∞.

2. π1 > π2 ⇔ RR > 1;

π1 = π2 ⇔ RR = 1;

π1 < π2 ⇔ RR < 1.

3. X ⊥ Y ⇔ RR = 1.

• Estimate of RR: Given the 2 × 2 table from multinomial sampling or product-multinomial sampling on X, RR can be estimated by

$$\widehat{RR} = \frac{p_1}{p_2}.$$

Slide 61


• RR also has a nice interpretation. For the Aspirin Study, the RR estimate is

$$\widehat{RR} = \frac{p_1}{p_2} = \frac{0.0171}{0.0094} = 1.82.$$

⇒ Physicians receiving the placebo are 82% more likely to develop MI (over 5 yrs) than physicians receiving aspirin.

• The SE and CI for RR are complicated; Proc Freq calculates a CI for RR and other measures:

data table2_3;
  input group $ mi $ count @@;
  datalines;
placebo yes 189 placebo no 10845
aspirin yes 104 aspirin no 10933
;

title "Analysis of MI data";
proc freq data=table2_3 order=data;
  weight count;
  tables group*mi / norow nocol nopercent or;
run;

Slide 62


Output from the above SAS program:

The FREQ Procedure

Table of group by mi

group     mi
Frequency|yes     |no      |  Total
---------+--------+--------+
placebo  |    189 |  10845 |  11034
---------+--------+--------+
aspirin  |    104 |  10933 |  11037
---------+--------+--------+
Total         293    21778    22071

Statistics for Table of group by mi
Odds Ratio and Relative Risks

Statistic                    Value     95% Confidence Limits
------------------------------------------------------------------
Odds Ratio                   1.8321    1.4400    2.3308
Relative Risk (Column 1)     1.8178    1.4330    2.3059
Relative Risk (Column 2)     0.9922    0.9892    0.9953

Sample Size = 22071

A 95% CI for RR is [1.43, 2.31]. We are 95% confident that physicians receiving the placebo are at least 43% and at most 131% more likely to develop MI (over 5 yrs) than physicians receiving aspirin.

Slide 63

The sample relative risk has a sampling distribution that is highly skewed unless the sample sizes are quite large. Because of this, its confidence interval formula is rather complex.
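The slides leave the CI formula for RR to software. One standard large-sample approach, which is an assumption here since the slides do not state it, builds a Wald interval on the log scale with var(log RR) estimated by 1/n11 − 1/n1+ + 1/n21 − 1/n2+, then exponentiates. A minimal sketch (the function name is hypothetical) gives limits that agree with the "Relative Risk (Column 1)" row printed above:

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_risk_ci(n11, n1, n21, n2, alpha=0.05):
    """Estimate RR = p1/p2 with a log-scale Wald confidence interval."""
    rr = (n11 / n1) / (n21 / n2)
    # Delta-method estimate of the standard error of log(RR)
    se_log = sqrt(1 / n11 - 1 / n1 + 1 / n21 - 1 / n2)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return rr, (exp(log(rr) - z * se_log), exp(log(rr) + z * se_log))


# Aspirin study: placebo row first, aspirin row second
rr, (rr_lo, rr_hi) = relative_risk_ci(189, 11034, 104, 11037)
print(f"RR = {rr:.4f}, 95% CI = [{rr_lo:.4f}, {rr_hi:.4f}]")
```

Working on the log scale is what handles the skewness mentioned above: log RR is closer to normal than RR itself.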


II.3 Odds Ratio

• Odds of a probability π (of an event): if π = P(A), then

$$\omega = \frac{\pi}{1 - \pi} = \frac{\text{success prob}}{\text{failure prob}}$$

is called the odds of π (or of the event A). 0 < ω < ∞.

For example, if π = 0.75, then ω = 0.75/(1 − 0.75) = 3.

For a rare event (π ≈ 0), π ≈ ω.

• The event probability π is related to the odds ω by

$$\pi = \frac{\omega}{1 + \omega}.$$

For example, if ω = 4, then π = 4/(1 + 4) = 0.8.

Slide 64

When odds = 3.0, we expect to observe three successes for every one failure


• For the 2 × 2 table

             Y
          1     2
X    1
     2

the odds ratio between row 1 (π1 = P[Y = 1|X = 1]) and row 2 (π2 = P[Y = 1|X = 2]) is defined as

$$\theta = \frac{\text{odds}_1}{\text{odds}_2} = \frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)}.$$

• Properties of the odds ratio

1. 0 < θ < ∞.

2. π1 > π2 ⇔ θ > 1; π1 = π2 ⇔ θ = 1; π1 < π2 ⇔ θ < 1.

3. X ⊥ Y ⇔ θ = 1.

Slide 65

Values of θ farther from 1.0 in a given direction represent a stronger association.

When θ = 0.25, for example, the odds of success in row 1 are 0.25 times the odds of success in row 2, or equivalently 1/0.25 = 4.0 times as high in row 2 as in row 1.


• Given the 2× 2 table from multinomial sampling or

product-multinomial sampling on X:

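             Y
          1     2
X    1   n11   n12   n1+
     2   n21   n22   n2+

the odds ratio θ can be estimated by

$$\hat{\theta} = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)} = \frac{(n_{11}/n_{1+})/(1 - n_{11}/n_{1+})}{(n_{21}/n_{2+})/(1 - n_{21}/n_{2+})} = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11} n_{22}}{n_{12} n_{21}}.$$

• var(log θ̂) can be estimated by

$$\widehat{\mathrm{var}}(\log \hat{\theta}) = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}.$$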

Slide 66

Q: 95% CI for θ


• We can construct a (1 − α) CI for the true θ as follows:

1. Get a (1 − α) CI for log(θ):

$$\log \hat{\theta} \pm z_{\alpha/2}\,\widehat{SE}(\log \hat{\theta}).$$

2. Exponentiate both ends to get the CI for θ.

• For the Aspirin Study,

$$\hat{\theta} = \frac{189 \times 10933}{10845 \times 104} = 1.8321 \ (\approx \widehat{RR})$$

$$\widehat{\mathrm{var}}(\log \hat{\theta}) = \frac{1}{189} + \frac{1}{10845} + \frac{1}{104} + \frac{1}{10933} = 0.01509$$

95% CI for log θ: log(1.8321) ± 1.96 √0.01509 = [0.3647, 0.8462].

95% CI for θ: [e^0.3647, e^0.8462] = [1.44, 2.33].

Slide 67

The estimated odds of MI for those taking placebo equal 1.83 times the estimated odds for those taking aspirin. The estimated odds were 83% higher for the placebo group.

We estimate that the odds of MI are at least 44% higher when taking placebo than when taking aspirin.

Recall the critical value: zα/2 = qnorm(0.975) = 1.959964
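The two-step recipe (Wald interval for log θ, then exponentiate) is easy to script. The following is a minimal sketch with a hypothetical function name; for the aspirin table it should agree with the "Odds Ratio" row of the Proc Freq output shown earlier.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(n11, n12, n21, n22, alpha=0.05):
    """Sample odds ratio of a 2x2 table with a log-scale Wald CI."""
    theta = (n11 * n22) / (n12 * n21)
    # Estimated variance of log(theta-hat): sum of reciprocal cell counts
    var_log = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(var_log)
    return theta, var_log, (exp(log(theta) - half_width),
                            exp(log(theta) + half_width))


# Aspirin study: rows = placebo, aspirin; columns = MI yes, MI no
theta, var_log, (or_lo, or_hi) = odds_ratio_ci(189, 10845, 104, 10933)
print(f"theta-hat = {theta:.4f}, var(log theta-hat) = {var_log:.5f}")
print(f"95% CI = [{or_lo:.2f}, {or_hi:.2f}]")
```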


• Note 1: If we have multinomial sampling:

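             Y
          1     2
X    1   n11   n12
     2   n21   n22

             Y
          1     2
X    1   π11   π12
     2   π21   π22

the odds ratio θ can also be defined as

$$\theta = \frac{\pi_{11} \pi_{22}}{\pi_{12} \pi_{21}}.$$

The MLEs of the πij's are π̂ij = nij/n ⇒ the same estimate of θ:

$$\hat{\theta} = \frac{\hat{\pi}_{11} \hat{\pi}_{22}}{\hat{\pi}_{12} \hat{\pi}_{21}} = \frac{n_{11} n_{22}}{n_{12} n_{21}}.$$

• Note 2: If some of the nij's are small, add 0.5 to each cell and then re-calculate θ̂ and var(log θ̂), e.g.

$$\hat{\theta} = \frac{(n_{11} + 0.5)(n_{22} + 0.5)}{(n_{12} + 0.5)(n_{21} + 0.5)}$$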

Slide 68


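• The relationship between θ and RR:

$$\theta = \frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)} = \frac{\pi_1}{\pi_2} \times \frac{1 - \pi_2}{1 - \pi_1} = RR \times \frac{1 - \pi_2}{1 - \pi_1}$$

1. RR = 1 ⇔ θ = 1 ⇔ X ⊥ Y.

2. π1 > π2 ⇔ θ > RR > 1.

3. π1 < π2 ⇔ θ < RR < 1.

4. When π1 ≈ 0 & π2 ≈ 0 (rare events), θ ≈ RR.

[Number line: 0 < θ < RR < 1 when π1 < π2; 1 < RR < θ when π1 > π2.]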

Slide 69


• The odds ratio for case-control studies:

⋆ For the MI study (page 32)

Table 2.4. Case-Control Study on MI

                 Myocardial Infarction
Ever Smoker      Case     Control
Yes              172      173
No               90       346
Total            262      519

we know that we cannot estimate π1 = P[MI|Ever smokers] and π2 = P[MI|Never smokers], and hence cannot estimate RR = π1/π2.

⋆ However, we still want to assess the association between smoking and MI.

Slide 70

τ1 = P[Ever smoking|MI Case], τ2 = P[Ever smoking|MI Control]


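⋆ From the design, we can estimate

τ1 = P[Ever smoking|MI Case]: τ̂1 = 172/262 = 0.6565

τ2 = P[Ever smoking|MI Control]: τ̂2 = 173/519 = 0.3333

and the odds ratio between τ1 and τ2:

$$\theta^* = \frac{\tau_1/(1 - \tau_1)}{\tau_2/(1 - \tau_2)}: \qquad \hat{\theta}^* = \frac{\hat{\tau}_1/(1 - \hat{\tau}_1)}{\hat{\tau}_2/(1 - \hat{\tau}_2)} = \frac{n_{11} n_{22}}{n_{12} n_{21}} = 3.82.$$

⋆ It can be shown that

$$\theta^* = \frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)} = \theta$$

So we can use a case-control study to make inference on θ!

⋆ The formula for var(log θ̂) is the same:

$$\widehat{\mathrm{var}}(\log \hat{\theta}) = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}.$$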

Slide 71


⋆ Therefore, for the MI case-control study, the odds ratio of developing MI between ever smokers and never smokers is estimated as

θ̂ = 3.82.

$$\widehat{\mathrm{var}}(\log \hat{\theta}) = \frac{1}{172} + \frac{1}{173} + \frac{1}{90} + \frac{1}{346} = 0.0256.$$

95% CI for log θ:

log(3.82) ± 1.96 × √0.0256 = [1.02665, 1.65385]

95% CI for θ: [e^1.02665, e^1.65385] = [2.79, 5.227].

• Since MI is a rare event, RR ≈ θ, so

RR ≈ 3.82 ≈ 4.

That is, ever smokers are about 4 times as likely to develop MI as never smokers.

Slide 72

We estimate that the odds of MI are at least 179% higher for ever smokers than for never smokers.
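To illustrate why a case-control design still recovers θ, the sketch below computes the odds ratio two ways from Table 2.4: from the column-conditional proportions τ̂1 and τ̂2, and from the cross-product ratio. The variable names are made up for this example, and 1.96 is used for z as on the slide. The unrounded interval comes out near [2.79, 5.23].

```python
from math import exp, log, sqrt

# Table 2.4: rows = ever smoker (yes, no); columns = MI case, control
n11, n12 = 172, 173
n21, n22 = 90, 346

# Proportions that ARE estimable under sampling on Y (columns fixed)
tau1 = n11 / (n11 + n21)   # P[ever smoker | case]
tau2 = n12 / (n12 + n22)   # P[ever smoker | control]

# Odds ratio from the tau's and from the cross-product ratio
theta_from_tau = (tau1 / (1 - tau1)) / (tau2 / (1 - tau2))
theta_cross = (n11 * n22) / (n12 * n21)

# Same variance formula as for sampling on X
se_log = sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
z = 1.96
ci = (exp(log(theta_cross) - z * se_log), exp(log(theta_cross) + z * se_log))

print(f"tau1 = {tau1:.4f}, tau2 = {tau2:.4f}")
print(f"theta (from tau) = {theta_from_tau:.4f}, theta (cross-product) = {theta_cross:.4f}")
print(f"95% CI for theta = [{ci[0]:.2f}, {ci[1]:.2f}]")
```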


III χ2 Test for Independence between X and Y (nominal)

Suppose X and Y are random and have the prob structure:

              Y
          1     2     ...   J
X    1   π11   π12    ...   π1J
     2   π21   π22    ...   π2J
     ...
     I   πI1   πI2    ...   πIJ

Given data {nij}’s from a multinomial sampling, we would like to test

H0 : πij = πij(θ), for i = 1, .., I, and j = 1, ..., J , where θ is a parameter

vector with dim(θ) = k.

If dim(θ) = 0, then πij ’s are totally known under H0.

Slide 73

https://academo.org/demos/dice-roll-statistics/

Consider the null hypothesis (H0) that cell probabilities in a two-way contingency table equal certain fixed values {πij}. For a sample of size n with cell counts {nij}, the values {μij = nπij} are called expected frequencies. They represent the expected values {E(nij)} when H0 is true. To judge whether the data contradict H0, we compare {nij} to {μij}. If H0 is true, nij should be close to μij in each cell.


III.1 General Pearson χ2 test and LRT

• Let θ̂ be the MLE of θ under H0, and let μ̂ij = n πij(θ̂), where n = n++.

• If H0 is true and n is large, such that the μ̂ij's are reasonably large (μ̂ij ≥ 5), then the Pearson statistic

$$\chi^2 = \sum_{\text{all cells}} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}} \ \overset{H_0}{\sim}\ \chi^2_{df}$$

where df = IJ − 1 − dim(θ).

Reject H0 at level α if χ2 ≥ χ2_{df,α}.

• LRT

$$G^2 = 2 \sum_{\text{all cells}} n_{ij} \log\!\left(\frac{n_{ij}}{\hat{\mu}_{ij}}\right) \ \overset{H_0}{\sim}\ \chi^2_{df}.$$

• Calculation of df:

df = [# of unknown parameters under H1 ∪ H0] − [# of unknown parameters under H0].

Slide 74

For testing independence in r × c contingency tables, the approximate chi-squared sampling distributions of X2 and G2 have df = (r − 1)(c − 1).

The df value means: under H0, {πi+} and {π+j} determine the cell probabilities. There are r − 1 non-redundant row probabilities. Because they sum to 1, the first r − 1 determine the last one through πr+ = 1 − (π1+ + · · · + πr−1,+). Similarly, there are c − 1 non-redundant column probabilities, so, under H0, there are (r − 1) + (c − 1) parameters. The alternative hypothesis Ha states that there is not independence but does not specify a pattern for the rc cell probabilities. The probabilities are then solely constrained to sum to 1, so there are rc − 1 non-redundant parameters. The value of df is the difference between the number of parameters under (Ha and H0) and under H0, or df = (rc − 1) − [(r − 1) + (c − 1)] = rc − r − c + 1 = (r − 1)(c − 1).


Some χ2 distributions

Slide 75


III.2 Test of independence

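• X ⊥ Y ⇔ H0 : πij = πi+ π+j, i = 1, ..., I, j = 1, ..., J

• The MLEs of the πi+'s and π+j's are

$$\hat{\pi}_{i+} = \frac{n_{i+}}{n}, \qquad \hat{\pi}_{+j} = \frac{n_{+j}}{n}$$

• μ̂ij is equal to

$$\hat{\mu}_{ij} = n \hat{\pi}_{i+} \hat{\pi}_{+j} = \frac{n_{i+} n_{+j}}{n}$$

• Pearson χ2 and LRT:

$$\chi^2 = \sum_{\text{all cells}} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}, \qquad G^2 = 2 \sum_{\text{all cells}} n_{ij} \log\!\left(\frac{n_{ij}}{\hat{\mu}_{ij}}\right) \ \overset{H_0}{\sim}\ \chi^2_{df}$$

df = IJ − 1 − (I − 1 + J − 1) = (I − 1)(J − 1).

Reject H0 : X ⊥ Y if χ2 or G2 ≥ χ2_{df,α}.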

Slide 76

Note: For both test statistics, larger values provide stronger evidence against H0.

For both test statistics: p-value = 1 - pchisq(X2, df)

Q: Find X2 and G2, and then find the p-values.


• Note: With data {nij}’s from a multinomial sampling or

product-multinomial sampling on X, we can test H0 : X ⊥ Y by

testing

H0 : P [Y = j|X = i] = P [Y = j|X = k] for all i, j, k

(cond. dist. of Y given X is the same across all levels of X)

It can be shown that the Pearson χ2 and LRT test statistics are the same, with the same null distribution $\chi^2_{(I-1)(J-1)}$.

Slide 77


• Example: Gender gap in party identification

                     Y – Party Identification
X – Gender     Democrat   Independent   Republican   Total
Female         762        327           468          1557
Male           484        239           477          1200
Total          1246       566           945          n = 2757

Then μ̂11 = 1557 × 1246/2757 = 703.7,

μ̂12 = 1557 × 566/2757 = 319.6, etc.

$$\Rightarrow \chi^2 = \frac{(762 - 703.7)^2}{703.7} + \frac{(327 - 319.6)^2}{319.6} + \cdots = 30.1$$

$$G^2 = 2\,(762 \log(762/703.7) + 327 \log(327/319.6) + \cdots) = 30.0$$

$$\chi^2_{2,\,0.05} = 5.99$$

Both Pearson test and LRT reject H0 : X ⊥ Y at level 0.05.

Note: χ2 ≈ G2 even if H0 is likely not true.

Slide 78

This evidence of association would be rather unusual if the variables were truly independent. Both test statistics suggest that political party ID and gender are associated.

See Chap2 R codes for details
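The course's own R code is not reproduced in this transcript. As a stand-in, here is a minimal Python sketch (standard library only, illustrative variable names) that computes the expected counts, χ2, and G2 for the party identification table. It relies on the fact that for df = 2 the χ2 upper-tail probability has the closed form exp(−x/2), so no statistics package is needed; that shortcut does not apply to other df.

```python
from math import exp, log

# Rows = gender (Female, Male); columns = Democrat, Independent, Republican
observed = [[762, 327, 468],
            [484, 239, 477]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence: mu_ij = n_{i+} n_{+j} / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

pairs = [(o, e) for o_row, e_row in zip(observed, expected)
         for o, e in zip(o_row, e_row)]
x2 = sum((o - e) ** 2 / e for o, e in pairs)        # Pearson statistic
g2 = 2 * sum(o * log(o / e) for o, e in pairs)      # likelihood-ratio statistic

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (I - 1)(J - 1) = 2 here

# Closed-form upper-tail probability, valid ONLY because df == 2
p_x2 = exp(-x2 / 2)
p_g2 = exp(-g2 / 2)

print(f"X2 = {x2:.4f}, G2 = {g2:.4f}, df = {df}")
print(f"p-values: {p_x2:.2e} (Pearson), {p_g2:.2e} (LRT)")
```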


• SAS program for the example:

data table2_5;
  input gender $ party $ count @@;
  datalines;
female dem 762 female ind 327 female rep 468
male dem 484 male ind 239 male rep 477
;

title "Analysis of Party Identification data";
proc freq data=table2_5 order=data;
  weight count;
  tables gender*party / norow nocol nopercent chisq expected measures cmh;
run;

• Output from the above program:

Analysis of Party Identification data

The FREQ Procedure

Table of gender by party

gender    party
Frequency|
Expected |dem     |ind     |rep     |  Total
---------+--------+--------+--------+
female   |    762 |    327 |    468 |   1557
         | 703.67 | 319.65 | 533.68 |
---------+--------+--------+--------+
male     |    484 |    239 |    477 |   1200
         | 542.33 | 246.35 | 411.32 |
---------+--------+--------+--------+
Total        1246      566      945     2757

Slide 79


Statistics for Table of gender by party

Statistic                       DF     Value      Prob
------------------------------------------------------
Chi-Square                       2    30.0701    <.0001
Likelihood Ratio Chi-Square      2    30.0167    <.0001
Mantel-Haenszel Chi-Square       1    28.9797    <.0001
Phi Coefficient                        0.1044
Contingency Coefficient                0.1039
Cramer's V                             0.1044

Sample Size = 2757

Statistic                      Value      ASE
------------------------------------------------------
Gamma                          0.1710    0.0315
Kendall's Tau-b                0.0964    0.0180
Stuart's Tau-c                 0.1078    0.0202
Somers' D C|R                  0.1097    0.0205
Somers' D R|C                  0.0848    0.0158
Pearson Correlation            0.1025    0.0190
Spearman Correlation           0.1016    0.0190

Summary Statistics for gender by party

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic    Alternative Hypothesis     DF     Value      Prob
---------------------------------------------------------------
1            Nonzero Correlation         1    28.9797    <.0001
2            Row Mean Scores Differ      1    28.9797    <.0001
3            General Association         2    30.0592    <.0001

Slide 80


III.3 Cell residuals for a contingency table

• Under H0 : X ⊥ Y,

$$\hat{\mu}_{ij} = \frac{n_{i+} n_{+j}}{n}.$$

• Calculate standardized Pearson residuals:

$$e^{st}_{ij} = \frac{n_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}(1 - p_{i+})(1 - p_{+j})}}.$$

• Under H0 : X ⊥ Y, E(e^st_ij) ≈ 0, var(e^st_ij) ≈ 1, and e^st_ij behaves like a N(0, 1) variable.

• We can use e^st_ij to check the departure from H0 : X ⊥ Y.

• For the Party Identification example, p1+ = 1557/2757 = 0.565, p+1 = 1246/2757 = 0.452

$$\Rightarrow e^{st}_{11} = \frac{762 - 703.7}{\sqrt{703.7(1 - 0.565)(1 - 0.452)}} = 4.50$$

Slide 81

P-value=2*pnorm(-4.5) = 6.795346e-06

Under H0, we expect about 5% of the standardized residuals to be farther from 0 than ±2 by chance alone.

Q: Find est12
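The standardized Pearson residual formula can be applied to every cell at once. A minimal sketch follows, with illustrative variable names that are not taken from the course materials. With only two rows, the residuals in each column are exact negatives of each other, which the output makes visible.

```python
from math import sqrt

# Rows = gender (Female, Male); columns = Democrat, Independent, Republican
observed = [[762, 327, 468],
            [484, 239, 477]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

std_resid = []
for row, r in zip(observed, row_totals):
    resid_row = []
    for o, c in zip(row, col_totals):
        mu = r * c / n                          # expected count under independence
        denom = sqrt(mu * (1 - r / n) * (1 - c / n))
        resid_row.append((o - mu) / denom)      # standardized Pearson residual
    std_resid.append(resid_row)

for label, resid_row in zip(["Female", "Male"], std_resid):
    print(label, [round(e, 2) for e in resid_row])
```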


• We can use Proc Genmod of SAS to get the standardized Pearson residuals:

Proc Genmod order=data;
  class gender party;
  model count = gender party / dist=poisson link=log residuals;
run;

• Part of the output:

               Raw         Pearson     Deviance    Std Deviance   Std Pearson   Likelihood
Observation    Residual    Residual    Residual    Residual       Residual      Residual
1              58.328618   2.1988558   2.1694814   4.4419109      4.5020535     4.4877799
2              7.3547334   0.4113702   0.4098076   0.6967948      0.6994517     0.6985339
3              -65.68335   -2.84324    -2.904774   -5.430995      -5.315946     -5.34911
4              -58.32862   -2.504669   -2.551707   -4.586602      -4.502054     -4.528391
5              -7.354733   -0.468583   -0.470944   -0.702976      -0.699452     -0.701036
6              65.683351   3.2386734   3.157751    5.1831197      5.3159455     5.2670354

The observation order is for row 1, then row 2, etc.

Slide 82


• Put the standardized Pearson residuals in the original table:

                     Y – Party Identification
X – Gender     Democrat   Independent   Republican
Female         4.5        0.7           -5.3
Male           -4.5       -0.7          5.3

We see from the table that the independence model does not fit the data well. There are significantly more Democrat females (fewer males) than predicted by the independence model, and significantly fewer Republican females (more males) than predicted by the model.

Slide 83


IV Testing Independence for Ordinal Data

IV.1 X, Y are both ordinal random cat. variables; Mantel-Haenszel M2 (CMH1)

• Assign scores u1 < u2 < · · · < uI to X and v1 < v2 < · · · < vJ to Y

                        Y
            1(v1)  ...  j(vj)  ...  J(vJ)
     1(u1)
     ...
X    i(ui)               πij
     ...
     I(uI)

• Want to test H0 : X ⊥ Y given data such as

Slide 84

Let u1 ≤ u2 ≤ · · · ≤ ur denote scores for the rows, and v1 ≤ v2 ≤ · · · ≤ vc denote scores for the columns, having the same ordering as the categories.


             Y
         v1   v2   v3
    u1   2    1    3
X   u2   1    2    1
    u3   1    1    2

Patient   X    Y
1         u1   v1
2         u1   v1
3         u1   v2
4         u1   v3
5         u1   v3
6         u1   v3
7         u2   v1
8         u2   v2
9         u2   v2
10        u2   v3
11        u3   v1
12        u3   v2
13        u3   v3
14        u3   v3

Slide 85

Q: Find X-bar and Y-bar.


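• The Pearson correlation coefficient describes the linear relationship between X and Y and can be used to test H0 : X ⊥ Y:

$$r = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}},$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} \sum_{i=1}^{I} n_{i+} u_i = \sum_{i=1}^{I} p_{i+} u_i = \bar{u}$$

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{n} \sum_{j=1}^{J} n_{+j} v_j = \sum_{j=1}^{J} p_{+j} v_j = \bar{v}$$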

Slide 86

Correlation falls between −1 and +1. Independence between the variables implies that its population value ρ = 0. The larger |r| is, the farther the data fall from independence in the linear dimension.


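$$
\Longrightarrow \quad
r = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} p_{ij}(u_i-\bar{u})(v_j-\bar{v})}
         {\sqrt{\sum_{i=1}^{I} p_{i+}(u_i-\bar{u})^2 \, \sum_{j=1}^{J} p_{+j}(v_j-\bar{v})^2}}
$$

• It can be shown that under H0: X ⊥ Y,

$$
\sqrt{n-1}\; r \overset{a}{\sim} N(0,1), \qquad M^2 = (n-1)\,r^2 \overset{a}{\sim} \chi^2_1 .
$$

This is the Mantel-Haenszel test for H0: X ⊥ Y (cmh1 in SAS).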

• Note: We don’t have to expand the data to calculate r. Proc Freq

calculates r and M2.
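The note above can be checked by hand on the toy 3 × 3 table from slide 85. The following is a minimal Python sketch (the course itself uses SAS); the integer scores 1, 2, 3 and the function names `pearson_r` and `table_r` are illustrative assumptions, not part of the slides. It computes r once from the expanded patient-level records and once directly from the cell proportions.

```python
from math import sqrt

# Toy 3x3 table from slide 85 (rows X = u1..u3, columns Y = v1..v3).
table = [[2, 1, 3],
         [1, 2, 1],
         [1, 1, 2]]
u = [1, 2, 3]  # assumed row scores (not given on the slide)
v = [1, 2, 3]  # assumed column scores (not given on the slide)


def pearson_r(xs, ys):
    """Ordinary Pearson correlation from subject-level data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def table_r(table, u, v):
    """Same correlation computed from cell proportions p_ij and scores."""
    n = sum(map(sum, table))
    cells = [(table[i][j] / n, u[i], v[j])
             for i in range(len(u)) for j in range(len(v))]
    u_bar = sum(p * ui for p, ui, _ in cells)
    v_bar = sum(p * vj for p, _, vj in cells)
    num = sum(p * (ui - u_bar) * (vj - v_bar) for p, ui, vj in cells)
    den = sqrt(sum(p * (ui - u_bar) ** 2 for p, ui, _ in cells)
               * sum(p * (vj - v_bar) ** 2 for p, _, vj in cells))
    return num / den


# Expand the table into one (x, y) record per patient.
records = [(u[i], v[j])
           for i in range(3) for j in range(3)
           for _ in range(table[i][j])]
xs, ys = zip(*records)

n = len(records)
r_expanded = pearson_r(xs, ys)
r_table = table_r(table, u, v)
m2 = (n - 1) * r_table ** 2  # Mantel-Haenszel statistic, 1 df
print(n, r_expanded, r_table, m2)
```

Both routes should give the same r up to floating-point error, which is why expanding the data is unnecessary.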

Slide 87


• How to choose scores {ui} for X and {vj} for Y:

1. Any increasing (or decreasing) sequence is acceptable for {ui} and {vj}. The scores have to be chosen before analyzing the data.

2. Mid-ranks. For example,

                 Y
            1     2     3     ni+     ui
       1    2     1     3      6      3.5
X      2    1     2     1      4      8.5
       3    1     1     2      4     12.5
     n+j    4     4     6
      vj   2.5   6.5   11.5

   proc freq order=data;
     tables x*y / cmh1 scores=rank;
   run;

3. The default in SAS is “1, 2, · · · , I” for X and “1, 2, · · · , J” for Y.
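The mid-rank scores in item 2 can be reproduced from the marginal totals alone. The sketch below is illustrative Python (the helper name `midranks` is an assumption, and every category is assumed to have a positive count): all subjects in a category share the average of the ranks that the category occupies.

```python
def midranks(totals):
    """Mid-rank score for each category, given its marginal count."""
    scores, used = [], 0
    for count in totals:
        # Ranks used+1, ..., used+count are tied; their average is the mid-rank.
        scores.append(used + (count + 1) / 2)
        used += count
    return scores


row_scores = midranks([6, 4, 4])  # row totals n_{i+} of the example table
col_scores = midranks([4, 4, 6])  # column totals n_{+j}
print(row_scores, col_scores)
```

For the example table this returns 3.5, 8.5, 12.5 for the rows and 2.5, 6.5, 11.5 for the columns, matching the ui and vj shown above.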

Slide 88


• Note 1: M² only detects a “linear trend” between X and Y, whereas Pearson χ² and LRT G² detect any deviation from independence.

• Note 2: Proc corr of SAS uses (as the default)

$$
t = (n-2)^{1/2}\left(\frac{r^2}{1-r^2}\right)^{1/2}
$$

to test H0: ρ = 0 by comparing t to t_{n−2}. M² and t² are asymptotically equivalent under H0.

• From slide 80, M² = 28.98 using 1, 2 for gender and 1, 2, 3 for party identification. Reject H0: X ⊥ Y.

• Note 3: M² is for a two-sided test. We can use √(n−1) r for a one-sided test. From slide 80, √(n−1) r = √28.98 = 5.4 ⇒ reject H0: X ⊥ Y in favor of H1: ρ > 0 (even though r is only about 0.1).

Slide 89


• Example: Mother’s alcohol consumption and infant malformation (Table 2.7 on p. 42)

Alcohol               Malformation
Consumption    Present (Y = 1)    Absent (Y = 0)
0                    48               17,066
< 1                  38               14,464
1−2                   5                  788
3−5                   1                  126
≥ 6                   1                   37

χ² = 12.1 (p-value = 0.017), G² = 6.2 (p-value = 0.185) ⇒ mixed results.

Assigned scores: 0, 0.5, 1.5, 4, 7 for alcohol consumption and 0/1 for absent/present ⇒ r = 0.0142, M² = 6.6, p-value = P[χ²₁ ≥ M²] = 0.01.

Several cell counts are small, so the large-sample χ², G², M² tests may not be valid ⇒ exact test (later).
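As a rough cross-check of r and M² for this table, here is a small Python sketch that applies the table-based formula for r with the scores listed above. The variable names are illustrative; the p-value uses the identity P[χ²₁ ≥ x] = erfc(√(x/2)), which holds because a χ²₁ variable is a squared standard normal.

```python
from math import sqrt, erfc

# Table 2.7 counts: (malformation present, absent) for each alcohol level.
counts = [(48, 17066), (38, 14464), (5, 788), (1, 126), (1, 37)]
u = [0, 0.5, 1.5, 4, 7]  # alcohol scores from the slide
v = [1, 0]               # present = 1, absent = 0

n = sum(a + b for a, b in counts)
u_bar = sum((a + b) * ui for (a, b), ui in zip(counts, u)) / n
v_bar = sum(a * v[0] + b * v[1] for a, b in counts) / n

# Cross-product and sums of squares, weighting each cell by its count.
sxy = sum(c * (ui - u_bar) * (vj - v_bar)
          for row, ui in zip(counts, u) for c, vj in zip(row, v))
sxx = sum(sum(row) * (ui - u_bar) ** 2 for row, ui in zip(counts, u))
syy = sum(c * (vj - v_bar) ** 2 for row in counts for c, vj in zip(row, v))

r = sxy / sqrt(sxx * syy)
m2 = (n - 1) * r ** 2
p_value = erfc(sqrt(m2 / 2))  # P[chi-square with 1 df >= m2]
print(f"n={n}, r={r:.4f}, M2={m2:.4f}, p={p_value:.4f}")
```

The printed values should agree with the r, M², and p-value quoted above, up to rounding.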

Slide 90


• SAS program:

data table2_7;
  input alcohol malform count @@;
  datalines;
0 1 48 0 0 17066
0.5 1 38 0.5 0 14464
1.5 1 5 1.5 0 788
4 1 1 4 0 126
7 1 1 7 0 37
;

title "Analysis of infant malformation data";
proc freq data=table2_7;
  weight count;
  tables alcohol*malform / measures chisq cmh;
run;

• Part of the output:

Statistics for Table of alcohol by malform

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     4     12.0821    0.0168
Likelihood Ratio Chi-Square    4      6.2020    0.1846
Mantel-Haenszel Chi-Square     1      6.5699    0.0104

Statistic                              Value       ASE
------------------------------------------------------
Pearson Correlation                   0.0142    0.0106
Spearman Correlation                  0.0033    0.0059

Slide 91


IV.2 Trend test for I × 2 and 2× J tables

• For an I × 2 table where X is an I-level ordinal variable and Y is a 2-level variable (such as the infant malformation table), with data from multinomial sampling or product-multinomial sampling on X:

             Y
          1      0
    u1   n11    n12    n1+
X   u2   n21    n22    n2+
    ...
    uI   nI1    nI2    nI+

we can assign scores to X and any scores (usually 0/1) to Y ⇒ M².

Slide 92


• The Mantel-Haenszel M² can be derived in a different way (taken from Section 3.2.1). Consider

$$
\pi_i = P[Y = 1 \mid X = u_i].
$$

Assume a linear trend model for π_i:

$$
\pi_i = \alpha + \beta u_i .
$$

Then H0: X ⊥ Y ⟹ H0*: β = 0.

An unbiased estimate of π_i:

$$
\hat{\pi}_i = \frac{n_{i1}}{n_{i+}} = p_i \quad \leftarrow \text{sample proportion at } X = u_i .
$$

The trend model implies the following linear model for p_i:

$$
p_i = \alpha + \beta u_i + \varepsilon_i ,
$$

Slide 93


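$$
\mathrm{var}(\varepsilon_i) = \frac{\pi_i(1-\pi_i)}{n_{i+}}, \quad \text{which equals } \frac{\alpha(1-\alpha)}{n_{i+}} \text{ under } H_0^*: \beta = 0
$$

⟹ WLS (weighted LS, weighted by sample size n_{i+}) estimate of β:

$$
\hat{\beta} = \frac{\sum_{i=1}^{I} n_{i+}(u_i-\bar{u})(p_i-p)}{\sum_{i=1}^{I} n_{i+}(u_i-\bar{u})^2},
$$

where

$$
\bar{u} = \frac{1}{n}\sum_{i=1}^{I} n_{i+}u_i \quad \leftarrow \text{sample mean of } \{X_i\},
\qquad
p = \frac{n_{+1}}{n} \quad \leftarrow \text{pooled sample response rate.}
$$

var(β̂) under H0 can be estimated by

$$
\widehat{\mathrm{var}}_{H_0}(\hat{\beta}) = \frac{p(1-p)}{\sum_{i=1}^{I} n_{i+}(u_i-\bar{u})^2}.
$$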

Slide 94

WLS: weighted least squares


For testing H0*: β = 0, let’s use the Wald test

$$
Z = \frac{\hat{\beta}}{\sqrt{\widehat{\mathrm{var}}_{H_0}(\hat{\beta})}} .
$$

Under H0: X ⊥ Y, Z ∼ N(0, 1) or Z² ∼ χ²₁.

• Z² or Z is the Cochran-Armitage trend test.

It can be shown that Z² = n r². Remember M² = (n − 1) r²

$$
\Rightarrow \quad Z^2 = \frac{n}{n-1}\, M^2 \approx M^2 .
$$

• SAS program:

title "Trend test of infant malformation data";
proc freq data=table2_7 order=data;
  weight count;
  tables alcohol*malform / trend;
run;

Slide 95


• Part of the output:

Statistics for Table of alcohol by malform

Cochran-Armitage Trend Test
--------------------------
Statistic (Z)            2.5632
One-sided Pr >  Z        0.0052
Two-sided Pr > |Z|       0.0104

Sample Size = 32574

• We see that Z = 2.5632. Both the one-sided and two-sided p-values are significant. Since Z > 0, we conclude that β > 0.

We can confirm the relationship:

$$
Z^2 = \frac{n}{n-1}\, M^2 .
$$
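The relationship can also be confirmed numerically. The sketch below is illustrative Python (not the SAS implementation): it computes the WLS slope, its null variance, and Z from the formulas on slides 94 and 95, then compares Z² with n/(n − 1) · M², assuming 0/1 scores for Y so that the Y sum of squares is n p(1 − p).

```python
from math import sqrt

present = [48, 38, 5, 1, 1]             # n_{i1}: malformation present
absent = [17066, 14464, 788, 126, 37]   # n_{i2}: malformation absent
u = [0, 0.5, 1.5, 4, 7]                 # alcohol scores from the slide

n_i = [a + b for a, b in zip(present, absent)]  # row totals n_{i+}
n = sum(n_i)
p_i = [a / m for a, m in zip(present, n_i)]     # sample proportions
p = sum(present) / n                            # pooled response rate
u_bar = sum(m * ui for m, ui in zip(n_i, u)) / n

sxx = sum(m * (ui - u_bar) ** 2 for m, ui in zip(n_i, u))
sxy = sum(m * (ui - u_bar) * (pi - p) for m, ui, pi in zip(n_i, u, p_i))

beta_hat = sxy / sxx          # WLS slope estimate
var_h0 = p * (1 - p) / sxx    # estimated variance of beta_hat under H0
z = beta_hat / sqrt(var_h0)   # Cochran-Armitage statistic

# Correlation-based statistic with 0/1 scores for Y: syy = n * p * (1 - p).
r2 = sxy ** 2 / (sxx * n * p * (1 - p))
m2 = (n - 1) * r2
print(f"Z={z:.4f}, Z^2={z ** 2:.4f}, n/(n-1)*M2={n / (n - 1) * m2:.4f}")
```

The computed Z should match the SAS value 2.5632 up to rounding, and the last two printed numbers should coincide.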

Slide 96


• For a 2 × J table where X is a nominal or ordinal variable and Y is an ordinal variable, with data {nij} from multinomial sampling or product-multinomial sampling on X:

             Y
         v1    v2    · · ·   vJ
X   1    n11   n12   · · ·   n1J
    2    n21   n22   · · ·   n2J

We have a situation similar to the two-sample t-test for comparing the means of the Y scores between X = 1 and X = 2. It can be shown that t² ≈ M² (M² does not depend on the choice of scores for X).

If we use mid-ranks as the scores for Y, M² is the same as the Mann-Whitney test.

Slide 97


IV.3 Tests for nominal-ordinal tables

• X – nominal, Y – ordinal with data from multinomial sampling or

product-multinomial sampling on X such as:

             Y
         v1    v2    v3
    1    n11   n12   n13   n1+
X   2    n21   n22   n23   n2+
    3    n31   n32   n33   n3+

• H0: X ⊥ Y
     ⇓
  Cond. dists. of Y are the same across levels of X
     ⇓
  Mean scores of Y at X = i are the same across levels of X

• This is an ANOVA problem.

Slide 98


• We can use the ANOVA F-test to test X ⊥ Y:

$$
F = \frac{SST/(I-1)}{SSE/(n-I)} \overset{H_0}{\sim} F_{I-1,\,n-I}
$$

• Equivalently (for large n), we can use

$$
\chi^2 = \frac{SST}{SSE^*/(n-1)} \overset{H_0}{\sim} \chi^2_{I-1}
$$

where SSE* is the modified sum of squares of errors.

The test χ² is called cmh2 by SAS:

proc freq;
  weight count;
  tables x*y / cmh2;
run;

Slide 99

SST: sum of squares for treatment. SSE: sum of squares for error.


V. Exact Inference for Sparse Tables

V.1 Fisher’s exact test for 2× 2 tables

• X, Y – 2-level cat. variables with structure

            Y
          1     2
X   1    π11   π12
    2    π21   π22

• Want to test H0: X ⊥ Y given data that, WLOG, we assume come from multinomial sampling:

            Y
          1     2
X   1    n11   n12
    2    n21   n22

Slide 100


• When the {nij} are large, we can use the Pearson χ² or LRT G² to test H0: X ⊥ Y.

• However, when some cell counts {nij} are small, the exact dist. of χ² or LRT G² under H0 may be far from χ²₁ ⟹ use of the asymptotic dist. may give wrong conclusions.

• Fisher’s tea example: Fisher’s colleague, Muriel Bristol, claimed she could tell whether tea or milk was added to the cup first.

                Muriel’s Guess
                Milk    Tea
True    Milk      3      1      4
        Tea       1      3      4
                  4      4

Slide 101


• By the design of Fisher’s tea example, Pearson χ² or G² can take at most 5 different values (there are only 5 possible tables). Therefore, the χ²₁ approximation to the dist. of χ² or G² is very poor!

• Even if we assumed multinomial sampling, there would only be $\binom{8+3}{3} = 165$ tables. Moreover, the nij’s are small. The χ²₁ approximation of Pearson χ² or G² will still be very poor.

• Let us develop an exact test for testing H0: X ⊥ Y in this kind of sparse 2 × 2 table.

• Let us assume multinomial sampling; we would like to test H0: θ = 1 (X ⊥ Y) vs. the one-sided alternative Ha: θ > 1.

Slide 102


• With multinomial sampling, (n11, n12, n21, n22) are random variables (only the sum n = n++ is fixed).

• Under H0: θ = 1 (X ⊥ Y), πij = πi+ π+j, and there are two unknown parameters π1+, π+1. So the distribution of the data (n11, n12, n21, n22) is unknown even under H0.

• It can be shown that under H0: θ = 1 (X ⊥ Y), the conditional distribution of n11 | n1+, n+1 is totally known:

$$
P[n_{11} = t_0 \mid n_{1+}, n_{+1}] = \frac{\binom{n_{1+}}{t_0}\binom{n_{2+}}{n_{+1}-t_0}}{\binom{n}{n_{+1}}},
$$

where t0 is the observed value of n11. This is a hypergeometric distribution.
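The hypergeometric formula is easy to evaluate directly. Below is a minimal Python sketch; the function names `hypergeom_pmf` and `null_distribution` are illustrative, and the margins 4, 4, 4 are those of Fisher’s tea table.

```python
from math import comb


def hypergeom_pmf(t, n1p, n2p, np1):
    """P[n11 = t | n1+ = n1p, n2+ = n2p, n+1 = np1] under H0."""
    return comb(n1p, t) * comb(n2p, np1 - t) / comb(n1p + n2p, np1)


def null_distribution(n1p, n2p, np1):
    """Null probabilities for every value of n11 allowed by the margins."""
    lo = max(0, np1 - n2p)
    hi = min(n1p, np1)
    return {t: hypergeom_pmf(t, n1p, n2p, np1) for t in range(lo, hi + 1)}


# Fisher's tea example: n1+ = n2+ = n+1 = 4, n = 8.
tea = null_distribution(4, 4, 4)
for t, prob in tea.items():
    print(t, round(prob, 3))
```

For the tea margins the five probabilities are 1/70, 16/70, 36/70, 16/70, 1/70, and they sum to 1.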

Slide 103


V.2 P-values of Fisher’s exact tests:

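            Y
          1      2
X   1    n11    n12    n1+
    2    n21    n22    n2+
         n+1    n+2    n

• Simple algebra shows

$$
\hat{\theta} = \frac{n_{11}n_{22}}{n_{12}n_{21}}
 = \frac{n_{11}(n_{+2}-n_{1+}+n_{11})}{(n_{1+}-n_{11})(n_{+1}-n_{11})} \;\nearrow\; n_{11}
$$

(with the margins fixed, the estimated odds ratio is increasing in n11)

⟹ larger θ ⇔ larger n11

⟹ We should reject H0 in favor of Ha when n11 is large.

⟹ P-value = P[n11 ≥ t0 | n1+, n+1, H0]: one-sided Fisher’s exact test.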

Slide 104


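• For Fisher’s tea example, the one-sided p-value is:

$$
\begin{aligned}
\text{P-value} &= P[n_{11} \ge 3 \mid n_{1+}, n_{+1}, H_0] \\
 &= P[n_{11} = 3 \mid n_{1+}, n_{+1}, H_0] + P[n_{11} = 4 \mid n_{1+}, n_{+1}, H_0] \\
 &= \frac{\binom{4}{3}\binom{4}{1}}{\binom{8}{4}} + \frac{\binom{4}{4}\binom{4}{0}}{\binom{8}{4}} = 0.229 + 0.014 = 0.243
\end{aligned}
$$

Mid P-value = 0.229/2 + 0.014 = 0.129.

Note: In this example, n1+, n+1 are naturally fixed.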
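The one-sided and mid p-values can be reproduced in a few lines. This is an illustrative Python sketch for the tea margins; the helper name `pmf` is an assumption.

```python
from math import comb


def pmf(t, n1p=4, n2p=4, np1=4):
    """Hypergeometric null probability of n11 = t (tea margins by default)."""
    return comb(n1p, t) * comb(n2p, np1 - t) / comb(n1p + n2p, np1)


t0 = 3      # observed n11
t_max = 4   # largest n11 allowed by the margins

# One-sided p-value: probability of tables at least as extreme as observed.
p_one_sided = sum(pmf(t) for t in range(t0, t_max + 1))

# Mid p-value: only half of the observed table's probability is counted.
mid_p = 0.5 * pmf(t0) + sum(pmf(t) for t in range(t0 + 1, t_max + 1))

print(round(p_one_sided, 3), round(mid_p, 3))
```

In exact fractions these are 17/70 and 9/70, which round to the 0.243 and 0.129 above.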

Slide 105


• Two-sided Fisher’s exact test: H0: θ = 1 (X ⊥ Y) vs. the two-sided alternative Ha: θ ≠ 1.

Table   n11 = 0   n11 = 1   n11 = 2   n11 = 3   n11 = 4
Prob     0.014     0.229     0.514     0.229     0.014

• P-value of two-sided Fisher’s exact test:

$$
\text{P-value} = \sum_{n_{11}} P(n_{11})\, I\{P(n_{11}) \le P(t_0)\}
$$

= sum of table probs that are ≤ observed table prob.

p-value = P[n11 = 0] + P[n11 = 1] + P[n11 = 3] + P[n11 = 4] = 0.014 + 0.229 + 0.229 + 0.014 = 0.486.
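The two-sided rule (sum the probabilities of all tables no more likely than the observed one) translates directly into code. The Python sketch below is illustrative; the small relative tolerance when comparing probabilities is an assumption added to guard against floating-point ties and is not part of the slide’s definition.

```python
from math import comb


def pmf(t, n1p=4, n2p=4, np1=4):
    """Hypergeometric null probability of n11 = t (tea margins by default)."""
    return comb(n1p, t) * comb(n2p, np1 - t) / comb(n1p + n2p, np1)


t0 = 3
probs = {t: pmf(t) for t in range(0, 5)}
p_obs = probs[t0]

# Sum the probabilities of all tables that are no more likely than the
# observed table (tolerance is an assumption to handle ties safely).
p_two_sided = sum(pr for pr in probs.values() if pr <= p_obs * (1 + 1e-7))

print(round(p_two_sided, 4))
```

Only the table with n11 = 2 is excluded, so the result is 34/70 ≈ 0.486.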

Slide 106


• SAS program & output for Fisher’s exact test:

data table2_8;
  input pour $ guess $ count @@;
  datalines;
milk milk 3 milk tea 1
tea milk 1 tea tea 3
;

title "Analysis of Fisher's tea data";
proc freq data=table2_8;
  weight count;
  tables pour*guess / norow nocol nopercent chisq;
  exact fisher or;
run;

The FREQ Procedure

Table of pour by guess

pour      guess

Frequency|milk    |tea     |  Total
---------+--------+--------+
milk     |      3 |      1 |      4
---------+--------+--------+
tea      |      1 |      3 |      4
---------+--------+--------+
Total           4        4        8

Statistics for Table of pour by guess

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1      2.0000    0.1573
Likelihood Ratio Chi-Square    1      2.0930    0.1480

Slide 107


Fisher’s Exact Test
----------------------------------
Cell (1,1) Frequency (F)         3
Left-sided Pr <= F          0.9857
Right-sided Pr >= F         0.2429

Table Probability (P)       0.2286
Two-sided Pr <= P           0.4857

Odds Ratio
-----------------------------------
Odds Ratio                   9.0000

Asymptotic Conf Limits
95% Lower Conf Limit         0.3666
95% Upper Conf Limit       220.9270

Exact Conf Limits
95% Lower Conf Limit         0.2117
95% Upper Conf Limit       626.2435

Sample Size = 8

Note: We can also obtain an exact CI for the true θ.
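The asymptotic limits in the output are consistent with the usual Wald interval on the log odds ratio scale, log θ̂ ± z · SE with SE = √(1/n11 + 1/n12 + 1/n21 + 1/n22). The Python sketch below assumes that formula (the output does not state which method SAS uses) and does not attempt the exact interval, which requires inverting the conditional distribution.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Fisher's tea table.
n11, n12, n21, n22 = 3, 1, 1, 3

theta_hat = n11 * n22 / (n12 * n21)               # sample odds ratio
se_log = sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
z = NormalDist().inv_cdf(0.975)                   # about 1.96

# Wald interval on the log scale, mapped back to the odds ratio scale.
lower = exp(log(theta_hat) - z * se_log)
upper = exp(log(theta_hat) + z * se_log)
print(theta_hat, round(lower, 4), round(upper, 2))
```

With only 8 observations the interval is extremely wide, which is another sign that large-sample methods are unreliable here.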

Slide 108


V.3 Fisher’s exact tests can be conservative

• For Fisher’s tea example, the exact null distribution of n11 | n1+, n+1 is:

Table   n11 = 0   n11 = 1   n11 = 2   n11 = 3   n11 = 4
Prob     0.014     0.229     0.514     0.229     0.014

• If we would like to construct a one-sided test at significance level 0.05 (target type I error prob), then we would only reject H0: θ = 1 in favor of Ha: θ > 1 when n11 = 4. Therefore, the actual type I error prob is

P[n11 = 4 | H0, n1+, n+1] = 0.014 < 0.05.

So the test is very conservative!
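The gap between the nominal level and the actual size comes from the discreteness of n11. The illustrative Python sketch below finds the smallest critical value c with P[n11 ≥ c] ≤ 0.05 for the tea margins and reports the resulting size; the helper name `pmf` is an assumption.

```python
from math import comb


def pmf(t, n1p=4, n2p=4, np1=4):
    """Hypergeometric null probability of n11 = t (tea margins by default)."""
    return comb(n1p, t) * comb(n2p, np1 - t) / comb(n1p + n2p, np1)


alpha = 0.05
support = range(0, 5)

# Upper-tail probability P[n11 >= c] for each candidate critical value c.
tail = {c: sum(pmf(t) for t in support if t >= c) for c in support}

# Reject H0 when n11 >= critical; choose the smallest c that keeps size <= alpha.
critical = min(c for c, pr in tail.items() if pr <= alpha)
actual_size = tail[critical]
print(critical, round(actual_size, 3))
```

The next smaller critical value, c = 3, would give a size of 17/70 ≈ 0.243, so no rejection region attains a size close to 0.05.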

Slide 109


VI Association in Three-Way Tables

• X, Y – 2 categorical variables.

The X, Y (marginal) association may not reflect a causal relation. We need to adjust for a third variable Z, a confounding variable (related to both X and Y).

For example,

X = second-hand smoking
Y = lung cancer
Z = age, may be related to X and Y

                              Lung Cancer
                              Yes    No
Second Hand Smoking    Yes    π11    π12
                       No     π21    π22

Slide 110

For Chap2, skipped from here...


VI.1 Partial tables, conditional and marginal associations

• With 3 categorical variables X, Y and Z, there is an XY table at each level of Z. Together, they form the partial tables.

• Each partial table provides information on the conditional association between X and Y given Z = k.

• When collapsing the partial tables over Z, we get a 2-way XY (marginal) table. This table provides information on the marginal association between X and Y.

• We need to be aware that the conditional associations and the marginal association may be different!

Slide 111


• Death penalty example (Table 2.10). Data from Florida, 1976-1987. X = defendant’s race (W, B), Y = death penalty (Yes, No).

                 Y – Death Penalty
                 Yes     No
X – Race    W     53    430
            B     15    176

Death penalty rate for W = π1 = 53/(53 + 430) = 0.11
Death penalty rate for B = π2 = 15/(15 + 176) = 0.079

ψ = 1.40, θ = (53 × 176)/(430 × 15) = 1.45

⇒ White defendants are (40%) more likely to receive the death penalty than black defendants.

• Maybe the race of victims (Z) affects the XY association?

Slide 112


When Z = White, the XY table is

                 Y – Death Penalty
                 Yes     No
X – Race    W     53    414     π1 = 11.3%
            B     11     37     π2 = 22.9%

When Z = Black, the XY table is

                 Y – Death Penalty
                 Yes     No
X – Race    W      0     16     π1 = 0%
            B      4    139     π2 = 2.8%

• We see that the conditional associations and the marginal association between X and Y have different directions! This phenomenon is called Simpson’s paradox.
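The reversal can be verified directly from the counts. The following Python sketch (the data layout and names are illustrative) computes the death penalty rates within each partial table and in the collapsed marginal table.

```python
# (death penalty yes, no) by defendant race, within each victim race.
partial = {
    "white victims": {"W": (53, 414), "B": (11, 37)},
    "black victims": {"W": (0, 16), "B": (4, 139)},
}


def rate(yes, no):
    """Proportion of defendants receiving the death penalty."""
    return yes / (yes + no)


# Collapse over victim race (Z) to obtain the marginal XY table.
marginal = {
    race: tuple(sum(partial[z][race][k] for z in partial) for k in range(2))
    for race in ("W", "B")
}

for z, tab in partial.items():
    print(z, {race: round(rate(*cell), 3) for race, cell in tab.items()})
print("marginal", {race: round(rate(*cell), 3) for race, cell in marginal.items()})
```

Within each victim-race stratum the rate is higher for black defendants, while in the collapsed table it is higher for white defendants.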

Slide 113


• Reasons causing Simpson’s paradox: Z is related to both X and Y.

1. There are more white victims than black victims.

2. Given Z = white, about 90% of the defendants (X) are white.

3. Given Z = black, only about 10% of the defendants (X) are white.

4. More white defendants received the death penalty (X, Y are related).

Slide 114


VI.2 Conditional and marginal odds ratios

• When we have 2 × 2 × K tables for X, Y and Z, at Z = k the observed table for XY is

             Y
           1       2
X   1    n11k    n12k
    2    n21k    n22k

Then we have K conditional odds ratios that estimate the conditional associations between X and Y at Z = k:

$$
\hat{\theta}_{XY(k)} = \frac{n_{11k}\,n_{22k}}{n_{12k}\,n_{21k}} .
$$

Slide 115


The marginal XY table is

             Y
           1       2
X   1    n11+    n12+
    2    n21+    n22+

The marginal odds ratio estimates the marginal association between X and Y:

$$
\hat{\theta}_{XY} = \frac{n_{11+}\,n_{22+}}{n_{12+}\,n_{21+}} .
$$

Slide 116


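• For the death penalty example,

$$
\begin{aligned}
\hat{\theta}_{XY} &= 1.45 \\
\hat{\theta}_{XY(1)} &= \frac{53 \times 37}{11 \times 414} = 0.43 \\
\hat{\theta}_{XY(2)} &= \frac{0 \times 139}{4 \times 16} = 0 \\
\hat{\theta}^{\,mod}_{XY(2)} &= \frac{0.5 \times 139.5}{4.5 \times 16.5} = 0.94
\end{aligned}
$$

The modified estimate adds 0.5 to each cell of the Z = 2 table to handle the zero count.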
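These odds ratios are simple to recompute. The Python sketch below is illustrative; the `add` argument of the hypothetical helper `odds_ratio` implements the add-0.5-to-each-cell modification used for the table with a zero count.

```python
def odds_ratio(table, add=0.0):
    """Sample odds ratio of a 2x2 table, optionally adding a constant to each cell."""
    (a, b), (c, d) = table
    a, b, c, d = (x + add for x in (a, b, c, d))
    return a * d / (b * c)


# Rows: defendant W, B; columns: death penalty yes, no.
white_victims = ((53, 414), (11, 37))
black_victims = ((0, 16), (4, 139))

# Marginal table: add the two partial tables cell by cell.
marginal = tuple(
    tuple(x + y for x, y in zip(row_w, row_b))
    for row_w, row_b in zip(white_victims, black_victims)
)

theta_marginal = odds_ratio(marginal)
theta_1 = odds_ratio(white_victims)
theta_2 = odds_ratio(black_victims)
theta_2_mod = odds_ratio(black_victims, add=0.5)
print(round(theta_marginal, 2), round(theta_1, 2), theta_2, round(theta_2_mod, 2))
```

The marginal odds ratio is above 1 while both conditional odds ratios are below 1, which is the odds-ratio view of Simpson’s paradox in this example.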

Slide 117


VI.3 Conditional and marginal independence

• If X and Y are independent at every level of Z, then X and Y are called conditionally independent given Z.

If X, Y are 2-level variables, then X and Y conditionally independent ⇔ θXY(k) = 1, k = 1, 2, ..., K.

• X and Y are marginally independent if they are independent in the marginal XY table (ignoring Z).

If X, Y are 2-level variables, then X and Y marginally independent ⇔ θXY = 1.

Slide 118


• Example: Conditional independence ⇏ marginal independence.

            Y
           S     F
X   A     18    12
    B     12     8

θXY(1) = 1 ⇒ A = B

            Y
           S     F
X   A      2     8
    B      8    32

θXY(2) = 1 ⇒ A = B

Marginally,

            Y
           S     F
X   A     20    20
    B     20    40

θXY = 2 ⇒ A > B

Slide 119


• Example: Marginal independence ⇏ conditional independence.

            Y
           S     F
X   A      4     1
    B      9     6

θXY(1) = 8/3

            Y
           S     F
X   A      6     9
    B      1     4

θXY(2) = 8/3

Marginally,

            Y
           S     F
X   A     10    10
    B     10    10

θXY = 1 ⇒ A = B
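Both examples (conditional independence without marginal independence, and the reverse) can be checked numerically. The Python sketch below is illustrative, with hypothetical helper names `odds_ratio` and `collapse`.

```python
def odds_ratio(table):
    """Sample odds ratio of a 2x2 table given as ((a, b), (c, d))."""
    (a, b), (c, d) = table
    return a * d / (b * c)


def collapse(tables):
    """Marginal 2x2 table: cell-wise sum of the partial tables."""
    return tuple(
        tuple(sum(t[i][j] for t in tables) for j in range(2))
        for i in range(2)
    )


# Rows: X = A, B; columns: Y = S, F; one table per level of Z.
example1 = [((18, 12), (12, 8)), ((2, 8), (8, 32))]   # cond. indep. only
example2 = [((4, 1), (9, 6)), ((6, 9), (1, 4))]       # marg. indep. only

for name, tables in (("example 1", example1), ("example 2", example2)):
    conditional = [odds_ratio(t) for t in tables]
    marginal = odds_ratio(collapse(tables))
    print(name, conditional, marginal)
```

In the first example both conditional odds ratios equal 1 but the marginal one is 2; in the second both conditional odds ratios equal 8/3 but the marginal one is 1.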

Slide 120


VI.4 Homogeneous association

• Assume X, Y are 2-level variables.

Homogeneous association (in terms of θ), i.e., no interaction
     ⇕
θXY(1) = θXY(2) = · · · = θXY(K)

When the θXY(k) are not all the same, Z is called an effect modifier (there is interaction).

• Note: Under homogeneous association, we cannot claim

θXY = θXY(1) = θXY(2) = · · · = θXY(K).

See previous examples.

Slide 121