Analysis of Categorical Data
Nick Jackson
University of Southern California
Department of Psychology
10/11/2013
Overview
Data Types
Contingency Tables
Logit Models
◦Binomial
◦Ordinal
◦Nominal
Things not covered (but still fit into the topic)
Matched pairs/repeated measures
◦McNemar's Chi-Square
Reliability
◦Cohen's Kappa
◦ROC
Poisson (Count) models
Categorical SEM
◦Tetrachoric Correlation
Bernoulli Trials
Data Types (Levels of Measurement)
Discrete/Categorical/Qualitative vs. Continuous/Quantitative

Nominal/Multinomial:
◦Properties: values arbitrary (no magnitude), no direction (no ordering)
◦Example: Race: 1=AA, 2=Ca, 3=As
◦Measures: mode, relative frequency

Rank Order/Ordinal:
◦Properties: values semi-arbitrary (no magnitude?), have direction (ordering)
◦Example: Likert scales (pronounced LICK-URT): 1-5, Strongly Disagree to Strongly Agree
◦Measures: mode, relative frequency, median. Mean?

Binary/Dichotomous/Binomial:
◦Properties: 2 levels; special case of ordinal or multinomial
◦Examples: Gender (multinomial), Disease (Y/N)
◦Measures: mode, relative frequency. Mean?
Contingency Tables
Often called two-way tables or cross-tabs
Have dimensions I x J
Can be used to test hypotheses of association between categorical variables

2 x 3 Table            Age Groups
Gender       <40 Years   40-50 Years   >50 Years
Female           25           68           63
Male            240          223          201

Code 1.1
Contingency Tables: Test of Independence
Chi-Square Test of Independence (χ²)
◦Calculate χ²
◦Determine DF: (I-1) * (J-1)
◦Compare to the χ² critical value for the given DF.

2 x 3 Table            Age Groups
Gender       <40 Years   40-50 Years   >50 Years   Total
Female           25           68           63      R1=156
Male            240          223          201      R2=664
Total         C1=265       C2=291       C3=264     N=820

χ² = Σ_{i=1}^{n} (O_i − E_i)² / E_i

Where: O_i = observed frequency, E_i = expected frequency, n = number of cells in the table

E_{i,j} = (R_i × C_j) / N
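The test above can be sketched in Python with scipy. The deck's own snippets (Code 1.1/1.2) are not reproduced here, so this is an assumed equivalent, not the original code:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the 2 x 3 gender-by-age table above.
observed = np.array([[25, 68, 63],      # Female
                     [240, 223, 201]])  # Male

# Expected counts come from E_ij = (R_i * C_j) / N; scipy computes
# the statistic, p-value, degrees of freedom, and expected table.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```

With these counts the statistic works out to about 23.4 on 2 degrees of freedom, matching the result reported on the next slide.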
Contingency Tables: Test of Independence
Pearson Chi-Square Test of Independence (χ²)
◦H0: No association
◦HA: Association… where, how?
Not appropriate when an expected (E_i) cell frequency is < 5
◦Use Fisher's exact test instead

2 x 3 Table            Age Groups
Gender       <40 Years   40-50 Years   >50 Years   Total
Female           25           68           63      R1=156
Male            240          223          201      R2=664
Total         C1=265       C2=291       C3=264     N=820

χ²(df=2) = 23.39, p < 0.001

Code 1.2
Contingency Tables: 2x2

                         Disorder (Outcome)
Risk Factor/Exposure     Yes     No      Total
Yes                       a       b      a+b
No                        c       d      c+d
Total                    a+c     b+d     a+b+c+d
Contingency Tables: Measures of Association

                  Depression
Alcohol Use     Yes      No     Total
Yes            a=25    b=10      35
No             c=20    d=45      65
Total            45      55     100

Probability:
Relative Risk (RR) = P(D|A) / P(D|not A) = 0.714 / 0.308 = 2.31

Contrasting probability: individuals who used alcohol were 2.31 times more likely to have depression than those who did not use alcohol.

Odds:
Odds Ratio (OR) = Odds(D|A) / Odds(D|not A) = 2.5 / 0.44 = 5.62

Contrasting odds: the odds of depression were 5.62 times greater in alcohol users compared to nonusers.
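Both measures can be computed directly from the 2 x 2 cell counts above; a minimal Python sketch (the deck's own code is not reproduced here):

```python
# Cell counts from the depression-by-alcohol table: a=25, b=10, c=20, d=45.
a, b, c, d = 25, 10, 20, 45

p_exposed = a / (a + b)        # P(Depression | Alcohol)    = 25/35
p_unexposed = c / (c + d)      # P(Depression | No alcohol) = 20/65
rr = p_exposed / p_unexposed   # relative risk, ~2.32

odds_exposed = a / b           # 25/10 = 2.5
odds_unexposed = c / d         # 20/45 ~ 0.444
odds_ratio = odds_exposed / odds_unexposed   # = (a*d)/(b*c) = 5.625
```

Note the cross-product shortcut: the OR always equals (a×d)/(b×c).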
Why Odds Ratios?

[Figure: RR and OR (y-axis, 2 to 6) plotted against the overall probability of depression (x-axis, 0 to .5). Holding a=25 and c=20 fixed while scaling b=10*i and d=45*i for i=1 to 45, the OR stays constant while the RR changes as the outcome becomes rarer.]
The Generalized Linear Model
General Linear Model (LM)
◦Continuous outcomes (DV)
◦Linear regression, t-test, Pearson correlation, ANOVA, ANCOVA
Generalized Linear Model (GLM)
◦John Nelder and Robert Wedderburn
◦Maximum likelihood estimation
◦Continuous, categorical, and count outcomes
◦Distribution family and link functions: error distributions that are not normal
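The maximum-likelihood fitting behind the GLM can be sketched with Newton-Raphson (Fisher scoring) for a logistic model. This is an illustration of the estimation idea on simulated data, not any particular package's implementation; all values below are invented:

```python
import numpy as np

# Simulate binary outcomes from a known logistic model, then recover
# the coefficients by Newton-Raphson (Fisher scoring).
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])     # design matrix with intercept
true_beta = np.array([-1.0, 0.8])        # illustrative "true" coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))     # fitted probabilities
    W = mu * (1 - mu)                    # binomial variance weights
    score = X.T @ (y - mu)               # gradient of the log-likelihood
    info = X.T @ (X * W[:, None])        # Fisher information
    beta = beta + np.linalg.solve(info, score)
# beta now holds the ML estimates, close to true_beta
```

In practice a library routine (e.g. a GLM fitter with a binomial family and logit link) does exactly this kind of iteration internally.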
Logistic Regression
"This is the most important model for categorical response data" –Agresti (Categorical Data Analysis, 2nd Ed.)
Binary response
Predicting probability (related to the probit model)
Assume (the usual):
◦Independence
◦NOT homoscedasticity or normal errors
◦Linearity (in the log odds)
◦Also… adequate cell sizes.
Logistic Regression: The Model

In terms of probability of success π(x):
π(x) = exp(α + βx) / (1 + exp(α + βx))

In terms of logits (log odds), the logit transform gives us a linear equation:
log( π(x) / (1 − π(x)) ) = α + βx
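A minimal sketch of the two equivalent forms of the model above; α, β, and x are arbitrary illustrative values, not fitted estimates:

```python
import math

def prob_success(alpha, beta, x):
    """pi(x) = exp(alpha + beta*x) / (1 + exp(alpha + beta*x))"""
    z = alpha + beta * x
    return math.exp(z) / (1 + math.exp(z))

def logit(p):
    """log odds: log(p / (1 - p))"""
    return math.log(p / (1 - p))

# Round trip: the logit of pi(x) recovers the linear predictor alpha + beta*x.
alpha, beta, x = -2.0, 0.5, 3.0
linear_predictor = logit(prob_success(alpha, beta, x))  # equals -0.5
```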
Logistic Regression: Example

                 Freq.    Percent
Not Depressed     672      81.95
Depressed         148      18.05

The output as logits
◦Logits: H0: β=0

Y=Depressed       Coef     SE      Z       P        CI
α (_constant)    -1.51    0.091   -16.7   <0.001   -1.69, -1.34

Conversion to probability: e^(-1.51) / (1 + e^(-1.51)) = 0.1805
Conversion to odds: e^(-1.51) = 0.220
Also = 0.1805/0.8195 = 0.22
What does H0: β=0 mean?

Code 2.1
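The conversions above can be reproduced directly from the frequency table:

```python
import math

# 148 of 820 respondents are depressed.
depressed, total = 148, 820
p = depressed / total                 # proportion depressed, ~0.1805
odds = p / (1 - p)                    # = 148/672, ~0.220
logit = math.log(odds)                # ~ -1.51, the fitted constant
p_back = math.exp(logit) / (1 + math.exp(logit))  # back to the proportion
```

An intercept-only logistic model just encodes the marginal proportion on the logit scale, which is why the fitted constant matches ln(148/672).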
Logistic Regression: Example
The output as ORs
◦Odds ratios: H0: β=1

Y=Depressed       OR      SE      Z       P        CI
α (_constant)    0.220   0.020   -16.7   <0.001   0.184, 0.263

                 Freq.    Percent
Not Depressed     672      81.95
Depressed         148      18.05

◦Conversion to probability: OR / (1 + OR) = 0.220/1.220 = 0.1805
◦Conversion to logit (log odds!): ln(OR) = logit; ln(0.220) = -1.51

Code 2.2
Logistic Regression: Example
Logistic regression w/ a single continuous predictor:

Y=Depressed       Coef     SE      Z       P        CI
α (_constant)    -2.24    0.489   -4.58   <0.001   -3.20, -1.28
β (age)           0.013   0.009    1.52    0.127   -0.004, 0.030

AS LOGITS:
Interpretation: a 1 unit increase in age results in a 0.013 increase in the log-odds of depression.
Hmmmm… I have no concept of what a log-odds is. Interpret as something else.
Logit > 0, so as age increases the risk of depression increases.

OR = e^0.013 = 1.013
For a 1 unit increase in age, there is a 1.013-fold increase in the odds of depression.
We could also say: for a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR − 1) × 100 = % change].

Code 2.3
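The coefficient-to-OR conversion above in two lines:

```python
import math

# Logit coefficient for age from the table above.
coef_age = 0.013
or_age = math.exp(coef_age)        # odds ratio per 1-year increase, ~1.013
pct_change = (or_age - 1) * 100    # ~1.3% increase in the odds per year
```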
Logistic Regression: GOF
Overall model likelihood-ratio chi-square
◦Omnibus test for the model
◦Overall model fit? Relative to other models
◦Compares the specified model with the null model (no predictors)
◦χ² = -2 × (LL0 − LL1), DF = k, the number of extra parameters estimated
Logistic Regression: GOF (Summary Measures)
Pseudo-R²
◦Not the same meaning as in linear regression
◦There are many of them (Cox and Snell / McFadden)
◦Only comparable within nested models of the same outcome
Hosmer-Lemeshow
◦For models with continuous predictors
◦Is the model a better fit than the NULL model? χ²
◦H0: good fit for the data, so we want p > 0.05
◦Order the predicted probabilities, group them (g=10) by quantiles, chi-square of Group × Outcome. DF = g − 2
◦Conservative (rarely rejects the null)
Pearson chi-square
◦For models with categorical predictors
◦Similar to Hosmer-Lemeshow
ROC - Area Under the Curve
◦Predictive accuracy/classification

Code 2.4
Logistic Regression: GOF (Diagnostic Measures)
Outliers in Y (outcome)
◦Pearson residuals: square root of the contribution to the Pearson χ²
◦Deviance residuals: square root of the contribution to the likelihood-ratio test statistic of a saturated model vs. the fitted model
Outliers in X (predictors)
◦Leverage (hat matrix/projection matrix): maps the influence of observed values on fitted values
Influential observations
◦Pregibon's delta-beta influence statistic
◦Similar to Cook's D in linear regression
Detecting problems
◦Residuals vs. predictors
◦Leverage vs. residuals
◦Boxplot of delta-beta

Code 2.5
Logistic Regression: GOF

Y=Depressed       Coef     SE      Z       P        CI
α (_constant)    -2.24    0.489   -4.58   <0.001   -3.20, -1.28
β (age)           0.013   0.009    1.52    0.127   -0.004, 0.030

log( π(depressed) / (1 − π(depressed)) ) = α + β1(age)

L-R χ² (df=1): 2.47, p = 0.1162
McFadden's R²: 0.0030
H-L GOF: number of groups: 10; H-L χ²: 7.12; DF: 8; p: 0.5233
Logistic Regression: Diagnostics
Linearity in the log-odds
◦Use a lowess (loess) plot
◦Depressed vs. Age

[Figure: lowess smoother of Depressed (logit scale) against age (20-80), bandwidth = .8, with the logit-transformed smooth overlaid.]

Code 2.6
Logistic Regression: Example
Logistic regression w/ a single categorical predictor:

Y=Depressed       OR      SE      Z       P        CI
α (_constant)    0.545   0.091   -3.63   <0.001   0.392, 0.756
β (male)         0.299   0.060   -5.99   <0.001   0.202, 0.444

AS OR:
Interpretation: the odds of depression for males are 0.299 times the odds for females.
We could also say: the odds of depression are (1 − 0.299 = 0.701) 70.1% lower in males compared to females.
Or… why not just make males the reference so the OR is above 1? Or we could just take the inverse and accomplish the same thing: 1/0.299 = 3.34.

Code 2.7
Ordinal Logistic Regression
Also called ordered logistic or the proportional odds model
Extension of the binary logistic model
>2 ordered responses
New assumption!
◦Proportional odds: the predictor's effect on the outcome is the same across levels of the outcome.
Example: BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
β(age) for Bmi3grp (1 vs. 2,3) = β(age) for Bmi3grp (1,2 vs. 3)
Ordinal Logistic Regression: The Model
◦A latent variable model (Y*)
◦j = number of levels − 1
◦logit( P(Y ≤ j) ) = α_j − βx (equivalently, logit( P(Y > j) ) = βx − α_j)
◦From the equation we can see that the odds ratio is assumed to be independent of the category j
Ordinal Logistic Regression: Example

AS LOGITS:
Y=bmi3grp            Coef      SE       Z       P        CI
β1 (age)            -0.026    0.006   -4.15   <0.001    -0.038, -0.014
β2 (blood_press)     0.012    0.005    2.48    0.013     0.002, 0.021
Threshold1/cut1     -0.696    0.6678                    -2.004, 0.613
Threshold2/cut2      0.773    0.6680                    -0.536, 2.082

For a 1 unit increase in blood pressure there is a 0.012 increase in the log-odds of being in a higher BMI category.

AS OR:
Y=bmi3grp            OR       SE       Z       P        CI
β1 (age)            0.974    0.006   -4.15   <0.001    0.962, 0.986
β2 (blood_press)    1.012    0.005    2.48    0.013     1.002, 1.022
Threshold1/cut1    -0.696    0.6678                    -2.004, 0.613
Threshold2/cut2     0.773    0.6680                    -0.536, 2.082

For a 1 unit increase in blood pressure the odds of being in a higher BMI category are 1.012 times greater.

Code 3.1
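The logit-to-OR conversion works exactly as in the binary model; using the coefficients from the table above:

```python
import math

# Ordinal-model coefficients for age and blood pressure.
b_age, b_bp = -0.026, 0.012
or_age = math.exp(b_age)   # ~0.974: odds of a higher BMI category per year
or_bp = math.exp(b_bp)     # ~1.012: odds of a higher BMI category per unit BP
```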
Ordinal Logistic Regression: GOF
Assessing the proportional odds assumption
◦Brant test of parallel regression
 H0: proportional odds, thus want p > 0.05
 Tests each predictor separately and overall
◦Score test of parallel regression
 H0: proportional odds, thus want p > 0.05
◦Approximate likelihood-ratio test
 H0: proportional odds, thus want p > 0.05

Code 3.2
Ordinal Logistic Regression: GOF
Pseudo-R²
Diagnostic measures
◦Performed on the j−1 binomial logistic regressions

Code 3.3
Multinomial Logistic Regression
Also called multinomial logit/polytomous logistic regression
Same assumptions as the binary logistic model
>2 non-ordered responses
◦Or you've failed to meet the proportional odds assumption of the ordinal logistic model
Multinomial Logistic Regression: The Model
◦j = levels of the outcome; J = reference level
◦log( π_j(x) / π_J(x) ) = α_j + β_j x, where x is a fixed setting of an explanatory variable
◦Notice how it appears we are estimating a relative risk and not an odds ratio. It's actually an OR.
◦Similar to conducting separate binary logistic models, but with better type 1 error control
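The baseline-category logits above determine every category's probability. A sketch using intercepts and slopes implied by illustrative odds ratios (the predictor value x is hypothetical):

```python
import math

# Two non-reference categories vs. a reference category J.
a2, b2 = math.log(1.219), math.log(1.126)   # level 2 vs. reference
a3, b3 = math.log(0.619), math.log(1.218)   # level 3 vs. reference
x = 2.0                                     # hypothetical predictor value

# exp(alpha_j + beta_j * x) for each non-reference category;
# the reference category contributes 1 to the denominator.
num2 = math.exp(a2 + b2 * x)
num3 = math.exp(a3 + b3 * x)
denom = 1.0 + num2 + num3

p_ref = 1.0 / denom
p2 = num2 / denom
p3 = num3 / denom   # the three probabilities sum to 1
```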
Multinomial Logistic Regression: Example
Does degree of supernatural belief indicate a religious preference?

Y=religion (ref=Catholic(1))    OR      SE      Z       P        CI
Protestant (2)
 β (supernatural)              1.126   0.090    1.47    0.141   0.961, 1.317
 α (_constant)                 1.219   0.097    2.49    0.013   1.043, 1.425
Evangelical (3)
 β (supernatural)              1.218   0.117    2.06    0.039   1.010, 1.469
 α (_constant)                 0.619   0.059   -5.02   <0.001   0.512, 0.746

AS OR:
For a 1 unit increase in supernatural belief, the odds of being Evangelical rather than Catholic are 1.218 times greater: a (OR − 1) × 100 = 21.8% increase in the odds.

Code 4.1
Multinomial Logistic Regression: GOF
Limited GOF tests
◦Look at the LR chi-square and compare nested models
◦"Essentially, all models are wrong, but some are useful" –George E.P. Box
Pseudo-R²
Similar to ordinal
◦Perform tests on the j−1 binomial logistic regressions
Resources
"Categorical Data Analysis" by Alan Agresti
UCLA Stat Computing: http://www.ats.ucla.edu/stat/