TRANSCRIPT
F73DB3 CATEGORICAL DATA ANALYSIS
Workbook
Contents page
Preface
Aims
Summary
Content/structure/syllabus
plus other information
Background – computing (R)
hwu
Examples
Single classifications (1-13)
Two-way classifications (14-27)
Three-way classifications (28-32)
Example 2 Prussian cavalry deaths
(a) Numbers killed in each unit in each year – frequency table
Number killed:        0    1    2    3    4    5    Total
Frequency observed: 144   91   32   11    2    0      280
Example 2 Prussian cavalry deaths
(b) Numbers killed in each unit in each year – raw data
0 0 1 0 0 2 0 0 0 0 … 0 0 2 0 1 0 1 2 0 1 … 3 0 0 1 0 0 2 1 0 0 1 0 0 1 0 0 1 1 2 0 1 0 1 1
Example 2 Prussian cavalry deaths
(c) Total numbers killed each year
1875 ’76 ’77 ’78 ’79 ’80 ’81 ’82 ’83 ’84 ’85 ’86 ’87 ’88 ’89 ’90 ’91 ’92 ’93 ’94
3 5 7 9 10 18 6 14 11 9 5 11 15 6 11 17 12 15 8 4
Example 4 Political views
Category:    1 (very L)   2     3     4 (centre)   5     6    7 (very R)   Don’t know   Total
Frequency:   46         179   196   559          232   150   35           93           1490
Example 7 Vehicle repair visits
Number of visits:     0    1    2   3   4   5   6   Total
Frequency observed: 295  190   53   5   5   2   0     550
Example 15 Patients in clinical trial
Drug Placebo Total
Side-effects 15 4 19
No side-effects 35 46 81
Total 50 50 100
§1 INTRODUCTION
Data are counts/frequencies (not measurements)
Categories (explanatory variable)
Distribution in the cells (response)
Frequency distribution
Single classifications
Two-way classifications
Data may arise as
Bernoulli/binomial data (2 outcomes)
Multinomial data (more than 2 outcomes)
Poisson data
[+ Negative binomial data – the version with range x = 0, 1, 2, …]
2.1 Bernoulli trials and related distributions
Number of successes – binomial distribution
[Time before kth success – negative binomial distribution
Time to first success – geometric distribution]
Conditional distribution of success times
Poisson process with rate λ
Number of events in a time interval of length t, Nt, has a Poisson distribution with mean λt:
P(Nt = n) = e^(−λt) (λt)^n / n!,  n = 0, 1, 2, …
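The workbook's computing is in R; as a quick illustrative cross-check of the Poisson pmf above (a sketch in Python, outside the workbook, with λ and t chosen arbitrarily):

```python
from math import exp, factorial

def poisson_pmf(n, mean):
    """P(N = n) for a Poisson distribution with the given mean."""
    return exp(-mean) * mean**n / factorial(n)

# Events of a rate-lambda Poisson process in (0, t): Nt ~ Poisson(lambda * t)
lam, t = 2.0, 1.5
probs = [poisson_pmf(n, lam * t) for n in range(50)]
print(sum(probs))                               # pmf sums to (essentially) 1
print(sum(n * p for n, p in enumerate(probs)))  # mean is lambda * t = 3.0
```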
Poisson process with rate λ
Inter-event time, T, has an exponential distribution with parameter λ (mean 1/λ):
f(t) = λe^(−λt),  t > 0
given n events in time (0,t)
how many in time (0,s) (s < t)?
Conditional distribution of number of events
Answer: Ns | Nt = n ~ B(n, s/t)
X ~ Pn(λ), Y ~ Pn(μ), X, Y independent: then we know X + Y ~ Pn(λ + μ)
Given X + Y = n, what is the distribution of X?
Answer: X | X + Y = n ~ B(n, p) where p = λ/(λ + μ)
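This binomial conditional distribution (which also delivers the Ns | Nt result above, with p = s/t) can be checked numerically. A short Python sketch, with λ, μ and n chosen arbitrarily for illustration:

```python
from math import exp, factorial, comb

def pois(k, m):
    """Poisson pmf with mean m."""
    return exp(-m) * m**k / factorial(k)

lam, mu, n = 3.0, 5.0, 8
# P(X = k | X + Y = n) computed directly from the joint Poisson probabilities
p_sum = pois(n, lam + mu)                        # X + Y ~ Pn(lam + mu)
cond = [pois(k, lam) * pois(n - k, mu) / p_sum for k in range(n + 1)]
# Claimed answer: B(n, p) with p = lam / (lam + mu)
p = lam / (lam + mu)
binom = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(max(abs(a - b) for a, b in zip(cond, binom)))  # agreement to rounding error
```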
2.3 Inference for the Poisson distribution
Ni, i = 1, 2, …, r, i.i.d. Pn(λ), N = ΣNi
λ̂ = N/r, with E[λ̂] = λ, s.e.(λ̂) = √(λ/r)
λ̂ ≈ N(λ, λ/r)
2.4 Dispersion and LR tests for Poisson data
Homogeneity hypothesis
H0: the Ni's are i.i.d. Pn(λ) (for some unknown λ)
Dispersion statistic (M = sample mean):
X² = Σ(i=1..r) (Ni − M)²/M  ≈  χ² on r − 1 df
§3 SINGLE CLASSIFICATIONS
Binary classifications
(a) N1, N2 independent Poisson, with Ni ~ Pn(λi)
or
(b) fixed sample size, N1 + N2 = n, with N1 ~ B(n, p1) where p1 = λ1/(λ1 + λ2)
Qualitative categories
(a) N1 , N2, … , Nr independent Poisson, with
Ni ~ Pn(λi) or
(b) fixed sample size n, with joint multinomial distribution Mn(n;p)
Testing goodness of fit
H0: pi = πi, i = 1, 2, …, r
X² = Σ(i=1..r) (Ni − mi)²/mi = Σ(all cells) (observed frequency − expected frequency)²/expected frequency
This is the (Pearson) chi-square statistic
It is distributed (approximately) χ² on r − 1 df,
or χ² on r − 1 − k df when k parameters have been estimated
in order to fit the model and calculate the expected frequencies
Sparse data/small expected frequencies
ensure mi ≥ 1 for all cells, and
mi ≥ 5 for at least about 80% of the cells
if not – combine adjacent cells sensibly
Goodness-of-fit tests for frequency distributions
- very well-known application of the
Σ(all cells) (observed frequency − expected frequency)²/expected frequency
statistic (see Illustration 3.4 p 22/23)
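As an illustrative cross-check outside the workbook's R, the statistic can be computed for the Prussian cavalry frequency distribution of Example 2, fitting a Poisson with mean equal to the sample mean and (per the sparse-cell rule above) lumping the small cells together. A Python sketch:

```python
from math import exp, factorial

# Example 2 (Prussian cavalry deaths): observed frequencies of 0..5 deaths
obs = [144, 91, 32, 11, 2, 0]
n = sum(obs)                                        # 280 unit-years
mean = sum(k * f for k, f in enumerate(obs)) / n    # sample mean M = 0.7

def pois(k, m):
    return exp(-m) * m**k / factorial(k)

# Expected frequencies under Pn(M); small cells (3, 4, 5) combined into ">= 3"
exp_freq = [n * pois(k, mean) for k in range(3)]
exp_freq.append(n - sum(exp_freq))                  # all counts >= 3 lumped
obs_comb = obs[:3] + [sum(obs[3:])]
x2 = sum((o - e)**2 / e for o, e in zip(obs_comb, exp_freq))
print(mean, round(x2, 2))   # small X2: the Poisson model fits these data well
```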
Residuals (standardised)
ri = (Ni − mi)/√(mi(1 − mi/n))  ≈  N(0,1)
simpler version:
ri = (Ni − mi)/√mi  ≈  N(0,1)
MAJOR ILLUSTRATION 1: Publish and be modelled
Number of papers per author:  1    2    3   4   5   6   7   8   9   10   11
Number of authors:          1062  263  120  50  22  7   6   2   0   1    1
Model
P(X = x) = c λ^x / x!,  x = 1, 2, 3, …
MAJOR ILLUSTRATION 2: Birds in hedges
Hedge type i:         A     B     C     D     E     F     G
Hedge length (m) li:  2320  2460  2455  2805  2335  2645  2099
Number of pairs ni:   14    16    14    26    15    40    71
Model: Ni ~ Pn(λli)
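Under the model Ni ~ Pn(λli), the rate per metre λ is estimated (as the standard Poisson maximum-likelihood argument gives) by total count over total length. A quick Python sketch using the data above; the variable names are mine:

```python
lengths = [2320, 2460, 2455, 2805, 2335, 2645, 2099]
pairs   = [14, 16, 14, 26, 15, 40, 71]

# For Ni ~ Pn(lambda * li) the log-likelihood is maximised at
# lambda-hat = (total number of pairs) / (total hedge length)
lam_hat = sum(pairs) / sum(lengths)
fitted = [lam_hat * l for l in lengths]   # fitted means lambda-hat * li
print(lam_hat)
print([round(f, 1) for f in fitted])
```

Note how poorly hedge G's observed 71 pairs sits against a fitted mean proportional to its (shortest) length, which is what makes the example interesting.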
§4 TWO-WAY CLASSIFICATIONS
Example 14 Numbers of mice bearing tumours in treated and control groups
            Treated  Control  Total
Tumours           4        5      9
No tumours       12       74     86
Total            16       79     95
Example 15 Patients in clinical trial
Drug Placebo Total
Side-effects 15 4 19
No side-effects 35 46 81
Total 50 50 100
Patients in clinical trial – take 2
Drug Placebo Total
Side-effects 15 15 30
No side-effects 35 35 70
Total 50 50 100
4.1 Factors and responses
F × R tables
R × F , R × R
(F × F ?)
Qualitative, ordered, quantitative
Analysis the same – interpretation may be different
Exposed Not exposed Total
Disease n11 n12 n1●
No disease n21 n22 n2●
Total n●1 n●2 n●● = n
Notation (2 × 2 case, easily extended)
Three possibilities
One overall sample, each subject classified according to two attributes – this is R × R
Retrospective study
Prospective study (use of treated and control groups; drug and placebo etc)
4.2 Distribution theory and tests for r × s tables
(a) R × R case
(a1) Nij ~ Pn(μij), independent
or, with fixed table total
(a2) Condition on n = Σ nij :
N | n ~ Mn(n; p)
where N = {Nij}, p = {pij}.
(b) F × R case
Condition on the observed marginal totals n•j = Σi nij for the s categories of F (i.e. condition on n and the n•j)
⇒ s independent multinomials
Usual hypotheses
(a1) Nij ~ Pn(μij), independent
H0: variables/responses are independent
μij = μi• μ•j / μ••
(a2) Multinomial data (table total fixed)
H0: variables/responses are independent
P(row i and column j) = P(row i)P(column j)
(b) Condition on n and n•j (fixed column totals)
Nij ~ Bi(n•j , pij) j = 1,2, …, s ; independent
H0: response is homogeneous (pij = pi• for all j)
i.e. response has the same distribution for all levels of the factor
Tests of H0
The χ² (Pearson) statistic:
X² = Σ(i=1..r) Σ(j=1..s) (Nij − mij)²/mij
where mij = ni• n•j /n as before
OR: test based on the LR statistic Y2
Illustration: tonsils data – see p27
In R
Pearson/X²: read data in using “matrix”, then use “chisq.test”
LR Y²: calculate it directly (or get it from the results of fitting a “log-linear model” – see later)
4.3 The 2 × 2 table
Statistical tests
(a) Using Pearson’s χ²
Drug Placebo Total
Side-effects 15 4 19
No side-effects 35 46 81
Total 50 50 100
Yates (continuity) correction
Subtract 0.5 from |O – E| before squaring it
Performing the test in R
n.pat = matrix(c(15,35,4,46), 2, 2)
chisq.test(n.pat)
Under a random allocation
P(N = 4) = C(50,4) C(50,15) / C(100,19) = 50! 50! 19! 81! / (4! 46! 15! 35! 100!) = 0.0039
Note: probability = (product of marginal factorials) / (n! × product of cell factorials)
one-sided P-value = P(N ≤ 4) = 0.0047
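This exact (hypergeometric) calculation is easy to reproduce. A Python sketch for the patients table, with its margins 19/81 and 50/50 hard-coded:

```python
from math import comb

# Hypergeometric probability for the 2x2 patients table:
# P(N11 = k) with 19 side-effect patients allocated among 50 drug / 50 placebo
def p_cell(k):
    return comb(50, k) * comb(50, 19 - k) / comb(100, 19)

p4 = p_cell(4)                                   # point probability at 4
one_sided = sum(p_cell(k) for k in range(5))     # P(N <= 4)
print(round(p4, 4), round(one_sided, 4))
```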
4.4 Log odds, combining and collapsing tables, interactions
In the 2 × 2 table, the H0: independence condition is equivalent to μ11μ22 = μ12μ21
Let λ = log(μ11μ22 / μ12μ21)
Then we have H0: λ = 0
λ is the “log odds ratio”
The odds ratio is
μ11μ22 / μ12μ21
Sample equivalent (the observed/sample version of the odds ratio) is
n11 n22 / (n12 n21) = (n11/n21) / (n12/n22)
= (odds on for column 1) / (odds on for column 2)
The odds ratio (or log odds ratio) provides a measure of association for the factors in the table.
no association ⇔ odds ratio = 1 ⇔ log odds ratio = 0
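For the patients table of Example 15 the sample odds ratio and log odds ratio work out as follows (a quick Python sketch, outside the workbook's R):

```python
from math import log

# Example 15, patients in clinical trial
n11, n12, n21, n22 = 15, 4, 35, 46   # side-effects/no side-effects x drug/placebo
odds_ratio = (n11 * n22) / (n12 * n21)   # = (n11/n21) / (n12/n22)
log_or = log(odds_ratio)
print(round(odds_ratio, 3), round(log_or, 3))  # well away from 1 and from 0
```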
Interaction
An interaction exists between two factorswhen the effect of one factor is different at different levels of another factor.
§5 INTRODUCTION TO GENERALISED LINEAR MODELS (GLMs)
Normal linear model
Y|x ~ Normal with E[Y|x] = α + βx
or E[Y|x] = β0 + β1x1 + β2x2 + … + βrxr = βᵀx
i.e. E[Y|x] = μ(x) = βᵀx
We are explaining (x) using a linear predictor (a linear function of the explanatory data)
Generalised linear model
Now we set g(μ(x)) = βᵀx for some function g
We explain g(μ(x)) using a linear function of the explanatory data, where g is called the link function
e.g. modelling a Poisson mean μ we use a log link g(μ) = log μ
We use a linear predictor to explain log λ rather than λ itself: the model is
Y|x ~ Pn with mean λx
with log λx = α + βx
or log λx = βᵀx
This is a log-linear model
An example is a trend model in which we use log λi = α + βi
Another example is a cyclic model in which we use
log λi = β0 + β1 cos θi + β2 sin θi
§6 MODELS FOR SINGLE CLASSIFICATIONS
6.1 Single classifications - trend models
Data: numbers in r categories
Model: Ni , i = 1, 2, …, r,
independent Pn(λi)
Basic case
H0: λi’s equal  v  H1: λi’s follow a trend
Let Xj be the category of observation j
Under H0, P(Xj = i) = 1/r
Test based on X̄
see Illustration 6.1
It is a linear regression model for logλi
and a non-linear regression model for λi .
It is a generalised linear model.
Here the link between the parameter we are estimating and the linear predictor is the log function – it is a “log link”.
>n = c(15, 11, …, 1, 4)                      [response vector]
>r = length(n)
>i = 1:r                                     [explanatory vector]
>stress = glm(n ~ i, family = poisson)       [model]
>summary(stress)
Call:
glm(formula = n ~ i, family = poisson)       [model being fitted]
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9886  -0.9631   0.1737   0.5131   2.0362  [summary information on the residuals]
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   2.80316     0.14816   18.920   < 2e-16 ***
i            -0.08377     0.01680   -4.986  6.15e-07 ***   [information on the fitted parameters]
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 50.843 on 17 degrees of freedom
Residual deviance: 24.570 on 16 degrees of freedom          [deviances (Y² statistics)]
AIC: 95.825
Number of Fisher Scoring iterations: 4
Fitted mean is
λ̂i = exp(2.80316 − 0.08377i)
e.g. for date 6, i = 6 and fitted mean is exp(2.30054) = 9.980
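The fitted mean is easy to reproduce from the coefficients in the summary output. A quick Python check of the i = 6 calculation:

```python
from math import exp

# Fitted intercept and slope from summary(stress)
a, b = 2.80316, -0.08377

def fitted_mean(i):
    """Fitted Poisson mean exp(a + b*i) for the trend model."""
    return exp(a + b * i)

print(round(a + b * 6, 5))        # linear predictor at i = 6: 2.30054
print(round(fitted_mean(6), 3))   # fitted mean ~ 9.980
```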
Test of H0: no trend
The null fit: all fitted values equal (to the observed mean), Y² = 50.84 (≈ χ² on 17 df)
The trend model: fitted values exp(2.80316 − 0.08377i), Y² = 24.57 (≈ χ² on 16 df)
Crude 95% CI for slope is −0.084 ± 2(0.0168), i.e. −0.084 ± 0.034
6.2 Taking into account a deterministic denominator – using an “offset” for the “exposure”
Model: Nx ~ Pn(λx) where
E[Nx] = λx = Ex b θ^x
so log λx = log Ex + c + dx  (with c = log b, d = log θ)
See the Gompertz model example (p 40, data in Example 26)
We include a term “offset(logE)” in the formula for the linear predictor: in R
model = glm(n.deaths ~ age + offset(log(exposure)), family = poisson)
Fitted value is the estimate of the expected response per unit of exposure (i.e. per unit of the offset E)
§7 LOGISTIC REGRESSION
• for modelling proportions
• we have a binary response for each item and a quantitative explanatory variable
for example: dependence of the proportion of insects killed in a chamber on the concentration of a chemical present – we want to predict the proportion killed from the concentration
for example: dependence of the proportion of
– women who smoke – on age
– metal bars on test which fail – on pressure applied
– policies which give rise to claims – on sum insured
Model: # successes at value xi of explanatory variable: Ni ~ B(ni, πi)
We use a glm – we do not predict πi directly; we predict a function of πi called the logit of πi.
The logit function is given by:
logit(π) = log{π/(1 − π)}
It is the “log odds”.
See Illustration 7.1 p 43: proportion v dose
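The logit and its inverse (the logistic function) can be sketched in a few lines of Python as an illustration:

```python
from math import log, exp

def logit(p):
    """log odds: log(p / (1 - p)) for 0 < p < 1."""
    return log(p / (1 - p))

def inv_logit(x):
    """Logistic function: the inverse of the logit link."""
    return 1 / (1 + exp(-x))

print(logit(0.5))                       # 0.0 : even odds
print(round(inv_logit(logit(0.9)), 6))  # round trip recovers 0.9
```

The logit maps proportions in (0, 1) onto the whole real line, which is what lets us model it with an unconstrained linear predictor.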
[Scatterplot: observed proportion (prop, 0.2–1.0) against dose (3.8–4.8)]
This leads to the “logistic regression” model
log{πi/(1 − πi)} = a + bxi
[c.f. log linear model Ni ~ Poisson(λi) with log λi = a + bxi]
Data:
explanatory       # successes   group   observed
variable value                  size    proportion
x1                n11           n1      n11/n1
x2                n21           n2      n21/n2
…
xs                ns1           ns      ns1/ns
In R we declare the proportion of successes as the response and include the group sizes as a set of weights
drug.mod1 = glm(propdead ~ dose, weights = groupsize, family = binomial)
[explanatory vector is dose; note the family declaration]
RHS of model can be extended if required to include additional explanatory variables and factors
e.g. mod3 = glm(mat3 ~ age+socialclass+gender)
drug.mod – see output p44
Coefficients very highly significant (***) Null deviance 298 on 9df Residual deviance 17.2 on 8df
But … residual v fitted plot, and … fitted v observed proportions plot
[Plot: Residuals vs Fitted for glm(formula = num.mat ~ dose, family = binomial); points 10, 5 and 1 flagged]
[Plot: fitted values drug.mod2$fit (model with a quadratic term, dose^2) against observed proportions prop, both 0.2–1.0]
§8 MODELS FOR TWO-WAY AND THREE-WAY CLASSIFICATIONS
8.1 Log-linear models for two-way classifications
Nij ~ Pn(μij), i = 1, 2, …, r; j = 1, 2, …, s
H0: variables are independent
μij = μi• μ•j / μ••
We “explain” log μij in terms of additive effects: log μij = μ + αi + βj
Fitted values are the expected frequencies
m̂ij = exp(μ̂ + α̂i + β̂j)
Fitting process gives us the value of Y² = −2 log λ
Fitting a log-linear model
Nij ~ Pn(μij), independent, with log μij = μ + αi + βj
Declare the response vector (the cell frequencies) and the row/column codes as factors, then use > name = glm(…)
Tonsils data (Example 16)
n.tonsils = c(19,497,29,560,24,269)
rc = factor(c(1,2,1,2,1,2))
cc = factor(c(1,1,2,2,3,3))
tonsils.mod1 = glm(n.tonsils ~ rc + cc, family = poisson)
Call:
glm(formula = n.tonsils ~ rc + cc, family = poisson)
Deviance Residuals:
       1        2        3        4        5        6
-1.54915  0.34153 -0.24416  0.05645  2.11018 -0.53736
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   3.27998     0.12287   26.696   < 2e-16 ***
rc2           2.91326     0.12094   24.087   < 2e-16 ***
cc2           0.13232     0.06030    2.195    0.0282 *
cc3          -0.56593     0.07315   -7.737  1.02e-14 ***
---
Null deviance: 1487.217 on 5 degrees of freedom
Residual deviance: 7.321 on 2 degrees of freedom       [Y² = −2 log λ]
Patients data (Example 15)
> n.patients = c(15, 4, 35, 46)
> rc = factor(c(1, 1, 2, 2))
> cc = factor(c(1, 2, 1, 2))
> pat.mod1 = glm(n.patients ~ rc + cc, family = poisson)
Call:
glm(formula = n.patients ~ rc + cc, family = poisson)
Deviance Residuals:
      1        2        3        4
 1.6440  -2.0199  -0.8850   0.8457
Coefficients:
              Estimate  Std. Error   z value  Pr(>|z|)
(Intercept)  2.251e+00   2.502e-01     8.996   < 2e-16 ***
rc2          1.450e+00   2.549e-01     5.689  1.28e-08 ***
cc2          2.184e-10   2.000e-01  1.09e-09         1
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 49.6661 on 3 degrees of freedom
Residual deviance: 8.2812 on 1 degree of freedom
AIC: 33.172
fitted coefficients: coef(pat.mod1)
 (Intercept)          rc2          cc2
2.251292e+00 1.450010e+00 2.183513e-10
fitted values: fitted(pat.mod1)
   1    2    3    4
 9.5  9.5 40.5 40.5
Estimates are
μ̂ = 2.251292, α̂1 = 0, α̂2 = 1.450010, β̂1 = 0, β̂2 ≈ 0
Predictors for cells 1,1 and 1,2 are 2.251292 :
exp(2.251292) = 9.5
Predictors for cells 2,1 and 2,2 are 2.251292 + 1.450010 = 3.701302 :
exp(3.701302) = 40.5
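The fitted values can be recovered both ways: by exponentiating the linear predictors, and directly as the usual expected frequencies ni• n•j /n. A Python sketch for the patients table:

```python
from math import exp

# coef(pat.mod1): intercept and rc2 effect (cc2 is effectively 0)
mu_hat, a2_hat = 2.251292, 1.450010

print(round(exp(mu_hat), 1))            # cells 1,1 and 1,2 -> 9.5
print(round(exp(mu_hat + a2_hat), 1))   # cells 2,1 and 2,2 -> 40.5

# Same numbers as the direct expected frequencies ni. * n.j / n
print(19 * 50 / 100, 81 * 50 / 100)     # 9.5 40.5
```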
Residual deviance: 8.2812 on 1 degree of freedom
Y² for testing the model, i.e. for testing H0: response is homogeneous / column distributions are the same / no association between response and treatment group
The lower the value of the residual deviance, the better in general is the fit of the model. Here the fit of the additive model is very poor (we have of course already concluded that there is an association – P-value about 1%).
8.2 Two-way classifications - taking into account a deterministic denominator
See the grouse data (Illustration 8.3 p50, data in Example 25)
Model: Nij ~ Pn(λij) where
E[Nij] = λij = Eij exp(μ + αi + βj)
log E[Nij/Eij] = μ + αi + βj
i.e. log λij = log Eij + μ + αi + βj
We include a term “offset(logE)” in the formula for the linear predictor
Fitted value is the estimate of the expected response per unit of exposure (i.e. per unit of the offset E)
8.3 Log-linear models for three-way classifications
Each subject classified according to 3 factors/variables with r, s, t levels respectively
Nijk ~ Pn(μijk) with
log μijk = μ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk + (αβγ)ijk
[rst parameters]
Recall “interaction”
Model with two factors and an interaction (no longer additive) is
log μij = μ + αi + βj + (αβ)ij
8.4 Hierarchic log-linear models
Range of possible models/dependencies
From
1 Complete independence
model formula: A + B + C
link: log μijk = μ + αi + βj + γk
notation: [A][B][C]
df: rst − r − s − t + 2
Interpretation!
… through
2 One interaction (B and C say)
model formula: A + B*C
link: log μijk = μ + αi + βj + γk + (βγ)jk
notation: [A][BC]
df: rst − r − st + 1
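The two df formulas can be sanity-checked against the 2 × 2 × 2 lizards table used later, whose independence and one-interaction fits are reported on 4 df and 3 df respectively. A Python sketch:

```python
def df_independence(r, s, t):
    """Residual df for [A][B][C]: rst - r - s - t + 2."""
    return r * s * t - r - s - t + 2

def df_one_interaction(r, s, t):
    """Residual df for [A][BC]: rst - r - st + 1."""
    return r * s * t - r - s * t + 1

# 2x2x2 table: 4 df for complete independence, 3 df with one interaction
print(df_independence(2, 2, 2), df_one_interaction(2, 2, 2))
```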
Model selection: by backward elimination or forward selection through the hierarchy of models containing all 3 variables
saturated:      [ABC]
                [AB][AC][BC]
                [AB][AC]   [AB][BC]   [AC][BC]
                [AB][C]    [A][BC]    [AC][B]
independence:   [A][B][C]
Our models can include
mean (intercept)
+ factor effects
+ 2-way interactions
+ 3-way interaction
Illustration 8.4 Models for lizards data (Example 29)
liz = array(c(32, 86, 11, 35, 61, 73, 41, 70), dim = c(2, 2, 2))
n.liz = as.vector(liz)
s = factor(c(1,1,1,1,2,2,2,2))    [species]
d = factor(c(1,1,2,2,1,1,2,2))    [diameter of perch]
h = factor(c(1,2,1,2,1,2,1,2))    [height of perch]
Forward selection
liz.mod1 = glm(n.liz ~ s + d + h, family = poisson)
liz.mod2 = glm(n.liz ~ s*d + h, family = poisson)
liz.mod3 = glm(n.liz ~ s + d*h, family = poisson)
liz.mod4 = glm(n.liz ~ s*h + d, family = poisson)
liz.mod5 = glm(n.liz ~ s*d + s*h, family = poisson)
liz.mod6 = glm(n.liz ~ s*d + d*h, family = poisson)
Residual deviances:
liz.mod1 = glm(n.liz ~ s + d + h, family = poisson)       25.04 on 4 df
liz.mod2 = glm(n.liz ~ s*d + h, family = poisson)    †    12.43 on 3 df
liz.mod5 = glm(n.liz ~ s*d + s*h, family = poisson)  †     2.03 on 2 df
liz.mod6 = glm(n.liz ~ s*d + d*h, family = poisson)
> summary(liz.mod5)
Call:
glm(formula = n.liz ~ s * d + s * h, family = poisson)
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    3.4320      0.1601   21.436   < 2e-16 ***
s2             0.5895      0.1970    2.992  0.002769 **
d2            -0.9420      0.1738   -5.420  5.97e-08 ***
h2             1.0346      0.1775    5.827  5.63e-09 ***
s2:d2          0.7537      0.2161    3.488  0.000486 ***
s2:h2         -0.6967      0.2198   -3.170  0.001526 **
Null deviance: 98.5830 on 7 degrees of freedom
Residual deviance: 2.0256 on 2 degrees of freedom
MAJOR ILLUSTRATION 1
Number of papers per author:  1    2    3   4   5   6   7   8   9   10   11
Number of authors:          1062  263  120  50  22  7   6   2   0   1    1
Model
P(X = x) = c λ^x / x!,  x = 1, 2, 3, …
MAJOR ILLUSTRATION 2
Hedge type i:         A     B     C     D     E     F     G
Hedge length (m) li:  2320  2460  2455  2805  2335  2645  2099
Number of pairs ni:   14    16    14    26    15    40    71
Model: Ni ~ Pn(λli)
Model
Ni independent Pn(λi) with
λi = exp(a + b cos θi + c sin θi),  i = 1, …, r
i.e. log λi = a + b cos θi + c sin θi
Explanatory variable: the category/month i has been transformed into an angle θi
hwu
It is another example of a non-linear regression model for Poisson responses.
It is a generalised linear model.
Fitting in R
>n = c(40, 34, …, 33, 38)                                  [response vector]
>r = length(n)
>i = 1:r
>th = 2*pi*i/r                                             [explanatory vector]
>leuk = glm(n ~ cos(th) + sin(th), family = poisson)       [model]
              Male  Female  Total
Cinema often    22      21     43
Not often       20      12     32
Total           42      33     75
P(often|male) = 22/42 = 0.524
P(often|female) = 21/33 = 0.636
significant difference (on these numbers)?
is there an association between gender and cinema attendance?
Null hypothesis H0: no association between gender and cinema attendance
Alternative: not H0
Under H0 we expect 42 × 43/75 = 24.08 in cell 1,1 etc.
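The expected counts and the Yates-corrected Pearson statistic that chisq.test reports below can be reproduced by hand. A Python sketch for the cinema table:

```python
table = [[22, 21], [20, 12]]                 # cinema often / not often x male / female
row = [sum(r) for r in table]                # 43, 32
col = [sum(c) for c in zip(*table)]          # 42, 33
n = sum(row)                                 # 75
expected = [[ri * cj / n for cj in col] for ri in row]

# Pearson statistic with Yates continuity correction:
# subtract 0.5 from |O - E| before squaring
x2 = sum((abs(table[i][j] - expected[i][j]) - 0.5) ** 2 / expected[i][j]
         for i in range(2) for j in range(2))
print(expected[0][0])    # 24.08
print(round(x2, 4))      # 0.5522, as in the chisq.test output
```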
> matcinema = matrix(c(22,20,21,12), 2, 2)
> chisq.test(matcinema)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema
X-squared = 0.5522, df = 1, p-value = 0.4574
> chisq.test(matcinema)$expected
      [,1]  [,2]
[1,] 24.08 18.92     [null hypothesis can stand:
[2,] 17.92 14.08      no association between gender and cinema attendance]
              Male  Female  Total
Cinema often   110     105    215
Not often      100      60    160
Total          210     165    375
P(often|male) = 110/210 = 0.524
P(often|female) = 105/165 = 0.636
significant difference (on these numbers)?
more students, same proportions
> matcinema2 = matrix(c(110,100,105,60), 2, 2)
> chisq.test(matcinema2)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema2
X-squared = 4.3361, df = 1, p-value = 0.03731
> chisq.test(matcinema2)$expected
      [,1]  [,2]
[1,] 120.4  94.6     [null hypothesis is rejected:
[2,]  89.6  70.4      there IS an association between gender and cinema attendance]