goodness of fit tests: independence

30
Goodness of Fit Tests: Independence Mathematics 47: Lecture 34 Dan Sloughter Furman University May 9, 2006 Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 1 / 10

Upload: others

Post on 03-Feb-2022

16 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Goodness of Fit Tests: Independence

Goodness of Fit Tests: IndependenceMathematics 47: Lecture 34

Dan Sloughter

Furman University

May 9, 2006

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 1 / 10

Page 2: Goodness of Fit Tests: Independence

Testing for independence

I Suppose X is a discrete random variable with r possible outcomesand Y is a random variable with c possible outcomes.

I For i = 1, 2, . . . , r and j = 1, 2, . . . , c , let

pij = P(X = i ,Y = j),

pi+ = pi1 + pi2 + · · ·+ pic = P(X = i),

andp+j = p1j + p2j + · · ·+ prj = P(Y = j).

I We want to test the hypothesis that X and Y are independent.

I That is, we wish to test

H0 : pij = pi+p+j for all i and j

HA : pij 6= pi+p+j for some i and j .

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 2 / 10

Page 3: Goodness of Fit Tests: Independence

Testing for independence

I Suppose X is a discrete random variable with r possible outcomesand Y is a random variable with c possible outcomes.

I For i = 1, 2, . . . , r and j = 1, 2, . . . , c , let

pij = P(X = i ,Y = j),

pi+ = pi1 + pi2 + · · ·+ pic = P(X = i),

andp+j = p1j + p2j + · · ·+ prj = P(Y = j).

I We want to test the hypothesis that X and Y are independent.

I That is, we wish to test

H0 : pij = pi+p+j for all i and j

HA : pij 6= pi+p+j for some i and j .

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 2 / 10

Page 4: Goodness of Fit Tests: Independence

Testing for independence

I Suppose X is a discrete random variable with r possible outcomesand Y is a random variable with c possible outcomes.

I For i = 1, 2, . . . , r and j = 1, 2, . . . , c , let

pij = P(X = i ,Y = j),

pi+ = pi1 + pi2 + · · ·+ pic = P(X = i),

andp+j = p1j + p2j + · · ·+ prj = P(Y = j).

I We want to test the hypothesis that X and Y are independent.

I That is, we wish to test

H0 : pij = pi+p+j for all i and j

HA : pij 6= pi+p+j for some i and j .

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 2 / 10

Page 5: Goodness of Fit Tests: Independence

Testing for independence

I Suppose X is a discrete random variable with r possible outcomesand Y is a random variable with c possible outcomes.

I For i = 1, 2, . . . , r and j = 1, 2, . . . , c , let

pij = P(X = i ,Y = j),

pi+ = pi1 + pi2 + · · ·+ pic = P(X = i),

andp+j = p1j + p2j + · · ·+ prj = P(Y = j).

I We want to test the hypothesis that X and Y are independent.

I That is, we wish to test

H0 : pij = pi+p+j for all i and j

HA : pij 6= pi+p+j for some i and j .

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 2 / 10

Page 6: Goodness of Fit Tests: Independence

Testing for independence (cont’d)

I To test the hypotheses, suppose we have a random sample of size nfrom the bivariate distribution of (X ,Y ).

I For i = 1, 2, . . . , r and j = 1, 2, . . . , c , let

nij = number of observations (X ,Y ) for which X = i and Y = j ,

ni+ = ni1 + ni2 + · · ·+ nic

= number of observations (X ,Y ) for which X = i ,

and

n+j = n1j + n2j + · · ·+ nrj

= number of observations (X ,Y ) for which Y = j .

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 3 / 10

Page 7: Goodness of Fit Tests: Independence

Testing for independence (cont’d)

I To test the hypotheses, suppose we have a random sample of size nfrom the bivariate distribution of (X ,Y ).

I For i = 1, 2, . . . , r and j = 1, 2, . . . , c , let

nij = number of observations (X ,Y ) for which X = i and Y = j ,

ni+ = ni1 + ni2 + · · ·+ nic

= number of observations (X ,Y ) for which X = i ,

and

n+j = n1j + n2j + · · ·+ nrj

= number of observations (X ,Y ) for which Y = j .

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 3 / 10

Page 8: Goodness of Fit Tests: Independence

Testing for independence (cont’d)

I We call the table of the values nij a contingency table:

1 2 · · · c Total

1 n11 n12 · · · n1c n1+

2 n21 n22 · · · n2c n2+...

......

. . ....

...r nr1 nr2 · · · nrc nr+

Total n+1 n+2 · · · n+c n

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 4 / 10

Page 9: Goodness of Fit Tests: Independence

Testing for independence (cont’d)

I Now the maximum likelihood estimators are

p̂i+ =ni+

n,

for i = 1, 2, . . . , r , and

p̂+j =n+j

n,

for j = 1, 2, . . . , c .

I Hence, under H0, the expected frequencies are

eij = n · ni+

n·n+j

n=

ni+n+j

n,

i = 1, 2, . . . r and j = 1, 2, . . . , c .

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 5 / 10

Page 10: Goodness of Fit Tests: Independence

Testing for independence (cont’d)

I Now the maximum likelihood estimators are

p̂i+ =ni+

n,

for i = 1, 2, . . . , r , and

p̂+j =n+j

n,

for j = 1, 2, . . . , c .

I Hence, under H0, the expected frequencies are

eij = n · ni+

n·n+j

n=

ni+n+j

n,

i = 1, 2, . . . r and j = 1, 2, . . . , c .

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 5 / 10

Page 11: Goodness of Fit Tests: Independence

Testing for independence (cont’d)

I We may evaluate either

−2 log(Λ) = 2r∑

i=1

c∑j=1

nij log

(nij

eij

)or

Q =r∑

i=1

c∑j=1

(nij − eij)2

eij.

I Under H0, both −2 log(Λ) and Q are, for large n, approximatelyχ2((r − 1)(c − 1)), where the degrees of freedom follow fromsubtracting the number of estimated parameters, that is,(r − 1) + (c − 1) = r + c − 2, from one less than the number of cells:

(rc − 1)− (r + c − 2) = rc − r − c + 1 = (r − 1)(c − 1).

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 6 / 10

Page 12: Goodness of Fit Tests: Independence

Testing for independence (cont’d)

I We may evaluate either

−2 log(Λ) = 2r∑

i=1

c∑j=1

nij log

(nij

eij

)or

Q =r∑

i=1

c∑j=1

(nij − eij)2

eij.

I Under H0, both −2 log(Λ) and Q are, for large n, approximatelyχ2((r − 1)(c − 1)), where the degrees of freedom follow fromsubtracting the number of estimated parameters, that is,(r − 1) + (c − 1) = r + c − 2, from one less than the number of cells:

(rc − 1)− (r + c − 2) = rc − r − c + 1 = (r − 1)(c − 1).

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 6 / 10

Page 13: Goodness of Fit Tests: Independence

Example

I After the Doll and Hill study of 1948 and 1949 on the connectionbetween lung cancer and smoking, R. A. Fisher raised questions aboutpossible links between genetics and smoking habits.

I In one study of 71 pairs of twins, he classified each pair in two ways:(1) whether they were identical or fraternal twins and (2) whetherthey had similar or dissimilar smoking habits.

I The following contingency table summarizes his results:

Like Habits Unlike Habits Total

Identical Twins 44 9 53Fraternal Twins 9 9 18

Total 53 18 71

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 7 / 10

Page 14: Goodness of Fit Tests: Independence

Example

I After the Doll and Hill study of 1948 and 1949 on the connectionbetween lung cancer and smoking, R. A. Fisher raised questions aboutpossible links between genetics and smoking habits.

I In one study of 71 pairs of twins, he classified each pair in two ways:(1) whether they were identical or fraternal twins and (2) whetherthey had similar or dissimilar smoking habits.

I The following contingency table summarizes his results:

Like Habits Unlike Habits Total

Identical Twins 44 9 53Fraternal Twins 9 9 18

Total 53 18 71

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 7 / 10

Page 15: Goodness of Fit Tests: Independence

Example

I After the Doll and Hill study of 1948 and 1949 on the connectionbetween lung cancer and smoking, R. A. Fisher raised questions aboutpossible links between genetics and smoking habits.

I In one study of 71 pairs of twins, he classified each pair in two ways:(1) whether they were identical or fraternal twins and (2) whetherthey had similar or dissimilar smoking habits.

I The following contingency table summarizes his results:

Like Habits Unlike Habits Total

Identical Twins 44 9 53Fraternal Twins 9 9 18

Total 53 18 71

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 7 / 10

Page 16: Goodness of Fit Tests: Independence

Example

I After the Doll and Hill study of 1948 and 1949 on the connectionbetween lung cancer and smoking, R. A. Fisher raised questions aboutpossible links between genetics and smoking habits.

I In one study of 71 pairs of twins, he classified each pair in two ways:(1) whether they were identical or fraternal twins and (2) whetherthey had similar or dissimilar smoking habits.

I The following contingency table summarizes his results:

Like Habits Unlike Habits Total

Identical Twins 44 9 53Fraternal Twins 9 9 18

Total 53 18 71

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 7 / 10

Page 17: Goodness of Fit Tests: Independence

Example (cont’d)

I The expected frequencies are

e11 =53 · 53

71= 39.56

e21 =53 · 18

71= 13.44

e12 =18 · 53

71= 13.44

e22 =18 · 18

71= 4.56.

I Table of expected frequencies:

Like Habits Unlike Habits Total

Identical Twins 39.56 13.44 53.00Fraternal Twins 13.44 4.56 18.00

Total 53.00 18.00 71.00

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 8 / 10

Page 18: Goodness of Fit Tests: Independence

Example (cont’d)

I The expected frequencies are

e11 =53 · 53

71= 39.56

e21 =53 · 18

71= 13.44

e12 =18 · 53

71= 13.44

e22 =18 · 18

71= 4.56.

I Table of expected frequencies:

Like Habits Unlike Habits Total

Identical Twins 39.56 13.44 53.00Fraternal Twins 13.44 4.56 18.00

Total 53.00 18.00 71.00

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 8 / 10

Page 19: Goodness of Fit Tests: Independence

Example (cont’d)

I The expected frequencies are

e11 =53 · 53

71= 39.56

e21 =53 · 18

71= 13.44

e12 =18 · 53

71= 13.44

e22 =18 · 18

71= 4.56.

I Table of expected frequencies:

Like Habits Unlike Habits Total

Identical Twins 39.56 13.44 53.00Fraternal Twins 13.44 4.56 18.00

Total 53.00 18.00 71.00

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 8 / 10

Page 20: Goodness of Fit Tests: Independence

Example (cont’d)

I To test the hypothesis H0 that the smoking habits and type of twinare independent, we compute the observed value of Q, namely,q = 7.76.

I If U is χ2(1), we have

p-value ≈ P(U ≥ 7.76) = 0.005342.

I Hence there is strong evidence to reject H0, and so to conclude thatthere is a genetic component to smoking habits.

I Note: we could have computed the observed value of −2 log(Λ):−2 log(λ) = 7.16, giving a p-value of 0.007455.

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 9 / 10

Page 21: Goodness of Fit Tests: Independence

Example (cont’d)

I To test the hypothesis H0 that the smoking habits and type of twinare independent, we compute the observed value of Q, namely,q = 7.76.

I If U is χ2(1), we have

p-value ≈ P(U ≥ 7.76) = 0.005342.

I Hence there is strong evidence to reject H0, and so to conclude thatthere is a genetic component to smoking habits.

I Note: we could have computed the observed value of −2 log(Λ):−2 log(λ) = 7.16, giving a p-value of 0.007455.

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 9 / 10

Page 22: Goodness of Fit Tests: Independence

Example (cont’d)

I To test the hypothesis H0 that the smoking habits and type of twinare independent, we compute the observed value of Q, namely,q = 7.76.

I If U is χ2(1), we have

p-value ≈ P(U ≥ 7.76) = 0.005342.

I Hence there is strong evidence to reject H0, and so to conclude thatthere is a genetic component to smoking habits.

I Note: we could have computed the observed value of −2 log(Λ):−2 log(λ) = 7.16, giving a p-value of 0.007455.

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 9 / 10

Page 23: Goodness of Fit Tests: Independence

Example (cont’d)

I To test the hypothesis H0 that the smoking habits and type of twinare independent, we compute the observed value of Q, namely,q = 7.76.

I If U is χ2(1), we have

p-value ≈ P(U ≥ 7.76) = 0.005342.

I Hence there is strong evidence to reject H0, and so to conclude thatthere is a genetic component to smoking habits.

I Note: we could have computed the observed value of −2 log(Λ):−2 log(λ) = 7.16, giving a p-value of 0.007455.

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 9 / 10

Page 24: Goodness of Fit Tests: Independence

Example (cont’d)

I To test the hypothesis H0 that the smoking habits and type of twinare independent, we compute the observed value of Q, namely,q = 7.76.

I If U is χ2(1), we have

p-value ≈ P(U ≥ 7.76) = 0.005342.

I Hence there is strong evidence to reject H0, and so to conclude thatthere is a genetic component to smoking habits.

I Note: we could have computed the observed value of −2 log(Λ):−2 log(λ) = 7.16, giving a p-value of 0.007455.

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 9 / 10

Page 25: Goodness of Fit Tests: Independence

Example (cont’d)

I Note: To perform the above test in R, if the first column of data is inthe vector x and the second column of data is in the vector y, we firstcreate a matrix z with the command > z <- cbind(x, y)

I Then the command > chisq.test(z, correct=FALSE) willperform the above analysis.

I The command > chisq.test(z) will perform the test as well, butwith a correction for continuity.

I Note: If the data are in an array in a file called twins.dat, then thecommand > z <- read.table("twins.dat") would read the datadirectly into z for analysis.

I Note: In R Commander, use the Contingency tables option underthe Statistics menu.

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 10 / 10

Page 26: Goodness of Fit Tests: Independence

Example (cont’d)

I Note: To perform the above test in R, if the first column of data is inthe vector x and the second column of data is in the vector y, we firstcreate a matrix z with the command > z <- cbind(x, y)

I Then the command > chisq.test(z, correct=FALSE) willperform the above analysis.

I The command > chisq.test(z) will perform the test as well, butwith a correction for continuity.

I Note: If the data are in an array in a file called twins.dat, then thecommand > z <- read.table("twins.dat") would read the datadirectly into z for analysis.

I Note: In R Commander, use the Contingency tables option underthe Statistics menu.

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 10 / 10

Page 27: Goodness of Fit Tests: Independence

Example (cont’d)

I Note: To perform the above test in R, if the first column of data is inthe vector x and the second column of data is in the vector y, we firstcreate a matrix z with the command > z <- cbind(x, y)

I Then the command > chisq.test(z, correct=FALSE) willperform the above analysis.

I The command > chisq.test(z) will perform the test as well, butwith a correction for continuity.

I Note: If the data are in an array in a file called twins.dat, then thecommand > z <- read.table("twins.dat") would read the datadirectly into z for analysis.

I Note: In R Commander, use the Contingency tables option underthe Statistics menu.

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 10 / 10

Page 28: Goodness of Fit Tests: Independence

Example (cont’d)

I Note: To perform the above test in R, if the first column of data is inthe vector x and the second column of data is in the vector y, we firstcreate a matrix z with the command > z <- cbind(x, y)

I Then the command > chisq.test(z, correct=FALSE) willperform the above analysis.

I The command > chisq.test(z) will perform the test as well, butwith a correction for continuity.

I Note: If the data are in an array in a file called twins.dat, then thecommand > z <- read.table("twins.dat") would read the datadirectly into z for analysis.

I Note: In R Commander, use the Contingency tables option underthe Statistics menu.

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 10 / 10

Page 29: Goodness of Fit Tests: Independence

Example (cont’d)

I Note: To perform the above test in R, if the first column of data is inthe vector x and the second column of data is in the vector y, we firstcreate a matrix z with the command > z <- cbind(x, y)

I Then the command > chisq.test(z, correct=FALSE) willperform the above analysis.

I The command > chisq.test(z) will perform the test as well, butwith a correction for continuity.

I Note: If the data are in an array in a file called twins.dat, then thecommand > z <- read.table("twins.dat") would read the datadirectly into z for analysis.

I Note: In R Commander, use the Contingency tables option underthe Statistics menu.

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 10 / 10

Page 30: Goodness of Fit Tests: Independence

Example (cont’d)

I Note: To perform the above test in R, if the first column of data is inthe vector x and the second column of data is in the vector y, we firstcreate a matrix z with the command > z <- cbind(x, y)

I Then the command > chisq.test(z, correct=FALSE) willperform the above analysis.

I The command > chisq.test(z) will perform the test as well, butwith a correction for continuity.

I Note: If the data are in an array in a file called twins.dat, then thecommand > z <- read.table("twins.dat") would read the datadirectly into z for analysis.

I Note: In R Commander, use the Contingency tables option underthe Statistics menu.

Dan Sloughter (Furman University) Goodness of Fit Tests: Independence May 9, 2006 10 / 10