Analysis of Categorical Data
DESCRIPTION
Analysis of Categorical Data. Nick Jackson, University of Southern California, Department of Psychology, 10/11/2013. Overview: Data Types; Contingency Tables; Logit Models (Binomial, Ordinal, Nominal). Things not covered (but still fit into the topic): matched pairs/repeated measures.
TRANSCRIPT
Slide 1
Analysis of Categorical Data
Nick Jackson
University of Southern California
Department of Psychology
10/11/2013
Slide 2
Overview
Data Types
Contingency Tables
Logit Models
◦ Binomial
◦ Ordinal
◦ Nominal
Slide 3
Things not covered (but still fit into the topic)
Matched pairs/repeated measures
◦ McNemar’s Chi-Square
Reliability
◦ Cohen’s Kappa
◦ ROC
Poisson (Count) models
Categorical SEM
◦ Tetrachoric Correlation
Bernoulli Trials
Slide 4
Data Types (Levels of Measurement)
Discrete/Categorical/Qualitative vs. Continuous/Quantitative

Nominal/Multinomial:
Properties: values arbitrary (no magnitude), no direction (no ordering)
Example: Race: 1=AA, 2=Ca, 3=As
Measures: mode, relative frequency

Rank Order/Ordinal:
Properties: values semi-arbitrary (no magnitude?), have direction (ordering)
Example: Likert scales (pronounced LICK-URT): 1-5, Strongly Disagree to Strongly Agree
Measures: mode, relative frequency, median. Mean?

Binary/Dichotomous/Binomial:
Properties: 2 levels; special case of Ordinal or Multinomial
Examples: Gender (Multinomial), Disease (Y/N)
Measures: mode, relative frequency. Mean?
Slide 5
Contingency Tables
Often called two-way tables or cross-tabs. Have dimensions I x J. Can be used to test hypotheses of association between categorical variables.

2 x 3 Table: Age Groups
Gender    <40 Years   40-50 Years   >50 Years
Female        25           68           63
Male         240          223          201

Code 1.1
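The deck's own code listings (Code 1.1 and the rest) are not reproduced in this transcript; as a rough equivalent, the 2 x 3 table above can be rebuilt and totaled with pandas. The counts come from the slide; everything else here is a sketch, not the presenter's original code.

```python
# Rebuild the slide's 2 x 3 contingency table and add row totals;
# a pandas sketch standing in for the deck's (absent) Code 1.1.
import pandas as pd

table = pd.DataFrame(
    {"<40 Years": [25, 240], "40-50 Years": [68, 223], ">50 Years": [63, 201]},
    index=["Female", "Male"],
)
table["Total"] = table.sum(axis=1)  # row totals: 156 (Female), 664 (Male)
print(table)
```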
Slide 6
Contingency Tables: Test of Independence
Chi-Square Test of Independence (χ2)
◦ Calculate χ2
◦ Determine DF: (I-1) * (J-1)
◦ Compare to the χ2 critical value for the given DF.

2 x 3 Table: Age Groups
Gender    <40 Years   40-50 Years   >50 Years
Female        25           68           63      R1=156
Male         240          223          201      R2=664
           C1=265       C2=331       C3=264     N=820

χ2 = Σ (Oi - Ei)2 / Ei, summed over the n cells of the table

Where: Oi = observed freq, Ei = expected freq, n = number of cells in the table

Ei,j = (Ri * Cj) / N
Slide 7
Contingency Tables: Test of Independence
Pearson Chi-Square Test of Independence (χ2)
◦ H0: No association
◦ HA: Association… where, how?
Not appropriate when an expected (Ei) cell freq < 5
◦ Use Fisher’s exact test instead

2 x 3 Table: Age Groups
Gender    <40 Years   40-50 Years   >50 Years
Female        25           68           63      R1=156
Male         240          223          201      R2=664
           C1=265       C2=331       C3=264     N=820

χ2 (df=2) = 23.39, p < 0.001

Code 1.2
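A Python sketch of the same test with SciPy (the transcript does not include the deck's Code 1.2 listing). `chi2_contingency` computes the expected frequencies Ei,j = (Ri * Cj)/N and the statistic in one call.

```python
# Pearson chi-square test of independence on the 2 x 3 table above.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[25, 68, 63],      # Female
                     [240, 223, 201]])  # Male
chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(f"chi2(df={df}) = {chi2:.2f}, p = {p:.2g}")
```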
Slide 8
Contingency Tables: 2x2

                        Disorder (Outcome)
Risk Factor/Exposure    Yes     No
Yes                      a       b      a+b
No                       c       d      c+d
                        a+c     b+d     a+b+c+d
Slide 9
Contingency Tables: Measures of Association

               Depression
Alcohol Use    Yes      No
Yes           a=25    b=10      35
No            c=20    d=45      65
               45      55      100

Probability: P(D|A) = a/(a+b) = 25/35 = 0.714; P(D|not A) = c/(c+d) = 20/65 = 0.308
Odds: Odds(D|A) = a/b = 2.5; Odds(D|not A) = c/d = 0.44

Contrasting Probability:
Relative Risk (RR) = P(D|A) / P(D|not A) = 0.714 / 0.308 = 2.31
Individuals who used alcohol were 2.31 times more likely to have depression than those who did not use alcohol.

Contrasting Odds:
Odds Ratio (OR) = Odds(D|A) / Odds(D|not A) = 2.5 / 0.44 = 5.62
The odds of depression were 5.62 times greater in alcohol users compared to nonusers.
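Both measures follow directly from the cell counts; a plain-Python sketch using the table above:

```python
# RR and OR from the 2x2 table (a=25, b=10, c=20, d=45 as on the slide).
a, b = 25, 10   # alcohol users: depressed, not depressed
c, d = 20, 45   # nonusers:      depressed, not depressed

p_exposed = a / (a + b)        # P(D|A) ~ 0.714
p_unexposed = c / (c + d)      # P(D|not A) ~ 0.308
rr = p_exposed / p_unexposed   # relative risk ~ 2.3

odds_exposed = a / b           # 2.5
odds_unexposed = c / d         # ~ 0.444
odds_ratio = odds_exposed / odds_unexposed   # ~ 5.62; also (a*d)/(b*c)
```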
Slide 10
Why Odds Ratios?

[Figure: RR and OR plotted against the overall probability of depression (x-axis 0 to 0.5, y-axis roughly 2 to 6). The table cells b and d are scaled by i = 1 to 45 (a=25, b=10*i, c=20, d=45*i), so depression becomes rarer as i grows: row totals (25 + 10*i) and (20 + 45*i), depressed column total 45, non-depressed column total 55*i. The OR stays constant while the RR approaches the OR as the outcome becomes rare; the two diverge as the outcome becomes common.]
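The figure's construction can be sketched numerically: scaling b and d by the same factor i leaves the OR fixed at (a*d)/(b*c), while the RR climbs toward the OR as the outcome becomes rare.

```python
# Reconstruct the slide's simulation: a=25, b=10*i, c=20, d=45*i.
def rr_and_or(a, b, c, d):
    """Return (relative risk, odds ratio) for a 2x2 table."""
    rr = (a / (a + b)) / (c / (c + d))
    return rr, (a * d) / (b * c)

for i in (1, 10, 45):
    a, b, c, d = 25, 10 * i, 20, 45 * i
    prevalence = (a + c) / (a + b + c + d)   # shrinks as i grows
    rr, odds_ratio = rr_and_or(a, b, c, d)
    print(f"i={i:2d}  prevalence={prevalence:.3f}  RR={rr:.2f}  OR={odds_ratio:.3f}")
```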
Slide 11
The Generalized Linear Model
General Linear Model (LM)
◦ Continuous outcomes (DV)
◦ Linear regression, t-test, Pearson correlation, ANOVA, ANCOVA
Generalized Linear Model (GLM)
◦ John Nelder and Robert Wedderburn
◦ Maximum likelihood estimation
◦ Continuous, categorical, and count outcomes
◦ Distribution family and link functions
◦ Error distributions that are not normal
Slide 12
Logistic Regression
“This is the most important model for categorical response data” –Agresti (Categorical Data Analysis, 2nd Ed.)
Binary response
Predicting probability (related to the probit model)
Assume (the usual):
◦ Independence
◦ NOT homoscedasticity or normal errors
◦ Linearity (in the log odds)
◦ Also… adequate cell sizes.
Slide 13
Logistic Regression: The Model

In terms of probability of success π(x):
π(x) = e^(α + βx) / (1 + e^(α + βx))

In terms of logits (log odds):
logit[π(x)] = log( π(x) / (1 - π(x)) ) = α + βx

The logit transform gives us a linear equation.
Slide 14
Logistic Regression: Example
The output as logits
◦ Logits: H0: β=0

Y=Depressed      Coef    SE      Z      P        CI
α (_constant)   -1.51   0.091   -16.7  <0.001   -1.69, -1.34

               Freq.   Percent
Not Depressed   672    81.95
Depressed       148    18.05

Conversion to probability: e^-1.51 / (1 + e^-1.51) = 0.1805
Conversion to odds: e^-1.51 = 0.22
Also: odds = 0.1805 / 0.8195 = 0.22
What does H0: β=0 mean?

Code 2.1
Slide 15
Logistic Regression: Example
The output as ORs
◦ Odds ratios: H0: OR=1
◦ Conversion to probability: 0.220 / (1 + 0.220) ≈ 0.180
◦ Conversion to logit (log odds!): Ln(OR) = logit; Ln(0.220) = -1.51

Y=Depressed      OR      SE      Z      P        CI
α (_constant)   0.220   0.020   -16.7  <0.001   0.184, 0.263

               Freq.   Percent
Not Depressed   672    81.95
Depressed       148    18.05

Code 2.2
Slide 16
Logistic Regression: Example
Logistic regression w/ single continuous predictor:

Y=Depressed      Coef    SE      Z      P        CI
α (_constant)   -2.24   0.489   -4.58  <0.001   -3.20, -1.28
β (age)          0.013  0.009    1.52   0.127   -0.004, 0.030

AS LOGITS:
Interpretation: A 1 unit increase in age results in a 0.013 increase in the log-odds of depression. Hmmmm… I have no concept of what a log-odds is. Interpret as something else. Logit > 0, so as age increases the risk of depression increases.

AS OR:
OR = e^0.013 = 1.013. For a 1 unit increase in age, the odds of depression are multiplied by 1.013. We could also say: for a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR - 1) * 100 = % change].

Code 2.3
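The coefficient-to-OR arithmetic above is easy to check (the coefficient is taken from the slide's output; the 10-year line is an extra illustration, not on the slide):

```python
# Convert the age coefficient from the log-odds scale to an odds ratio.
import math

beta_age = 0.013
or_age = math.exp(beta_age)          # ~ 1.013
pct_change = (or_age - 1) * 100      # ~ 1.3% higher odds per year of age
or_decade = math.exp(beta_age * 10)  # ~ 1.14 for a 10-year increase
```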
Slide 17
Logistic Regression: GOF
Overall model likelihood-ratio chi-square
• Omnibus test for the model
• Overall model fit? Relative to other models
• Compares the specified model with the null model (no predictors)
• χ2 = -2 * (LL0 - LL1), DF = K parameters estimated
Slide 18
Logistic Regression: GOF (Summary Measures)
Pseudo-R2
◦ Not the same meaning as in linear regression.
◦ There are many of them (Cox and Snell, McFadden)
◦ Only comparable within nested models of the same outcome.
Hosmer-Lemeshow
◦ Models with continuous predictors
◦ Compares observed and predicted frequencies: is the model an adequate fit?
◦ H0: Good fit for the data, so we want p > 0.05
◦ Order the predicted probabilities, group them (g=10) by quantiles, chi-square of Group * Outcome. DF = g - 2
◦ Conservative (rarely rejects the null)
Pearson Chi-Square
◦ Models with categorical predictors
◦ Similar to Hosmer-Lemeshow
ROC-Area Under the Curve
◦ Predictive accuracy/classification

Code 2.4
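statsmodels has no built-in Hosmer-Lemeshow routine; a minimal sketch following the recipe on the slide (group the predicted probabilities into g=10 quantile groups, then compare observed and expected event counts), checked on simulated data from a correctly specified model:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    """H-L statistic: H0 is good fit, so a large p-value is reassuring."""
    df = pd.DataFrame({"y": y, "p": p_hat})
    df["grp"] = pd.qcut(df["p"], g, labels=False, duplicates="drop")
    grouped = df.groupby("grp")
    obs = grouped["y"].sum()    # observed events per group
    n = grouped["y"].count()
    exp = grouped["p"].sum()    # expected events per group
    stat = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
    dof = df["grp"].nunique() - 2
    return stat, chi2.sf(stat, dof)

# Illustrative check: probabilities from the true model should fit well.
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
p_true = 1 / (1 + np.exp(-(-1 + 0.5 * x)))
y = rng.binomial(1, p_true)
stat, pval = hosmer_lemeshow(y, p_true)
```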
Slide 19
Logistic Regression: GOF (Diagnostic Measures)
Outliers in Y (outcome)
◦ Pearson residuals: square root of the contribution to the Pearson χ2
◦ Deviance residuals: square root of the contribution to the likelihood-ratio test statistic of a saturated model vs the fitted model.
Outliers in X (predictors)
◦ Leverage (hat matrix/projection matrix): maps the influence of observed on fitted values
Influential observations
◦ Pregibon’s delta-beta influence statistic
◦ Similar to Cook’s D in linear regression
Detecting problems
◦ Residuals vs predictors
◦ Leverage vs residuals
◦ Boxplot of delta-beta

Code 2.5
Slide 20
Logistic Regression: GOF

Y=Depressed      Coef    SE      Z      P        CI
α (_constant)   -2.24   0.489   -4.58  <0.001   -3.20, -1.28
β (age)          0.013  0.009    1.52   0.127   -0.004, 0.030

log( π(depressed) / (1 - π(depressed)) ) = α + β1(age)

H-L GOF: number of groups: 10; H-L Chi2: 7.12; DF: 8; P: 0.5233
McFadden’s R2: 0.0030
L-R χ2 (df=1): 2.47, p = 0.1162
Slide 21
Logistic Regression: Diagnostics
Linearity in the log-odds
◦ Use a lowess (loess) plot
◦ Depressed vs Age

[Figure: lowess smoother of Depressed against age (20 to 80), bandwidth = 0.8, shown on the logit scale (roughly -3 to 1), with the logit-transformed smooth overlaid.]

Code 2.6
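A Python sketch of the lowess check (simulated data; the deck's plot comes from Stata with bandwidth 0.8): smooth the 0/1 outcome against age, then put the smooth on the logit scale, where approximate linearity supports the model's assumption.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(3)
age = rng.uniform(20, 80, size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.24 + 0.013 * age))))

smooth = lowess(y, age, frac=0.8)          # columns: sorted age, smoothed P
p_hat = np.clip(smooth[:, 1], 0.01, 0.99)  # keep probabilities off 0/1
logit_hat = np.log(p_hat / (1 - p_hat))    # plot this against smooth[:, 0]
```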
Slide 22
Logistic Regression: Example
Logistic regression w/ single categorical predictor:

Y=Depressed      OR      SE      Z      P        CI
α (_constant)   0.545   0.091   -3.63  <0.001   0.392, 0.756
β (male)        0.299   0.060   -5.99  <0.001   0.202, 0.444

AS OR:
Interpretation: The odds of depression for males are 0.299 times the odds for females.
We could also say: the odds of depression are (1 - 0.299 = 0.701) 70.1% lower in males compared to females.
Or… why not just make males the reference so the OR is greater than 1? Or we could just take the inverse and accomplish the same thing: 1/0.299 = 3.34.

Code 2.7
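The arithmetic for flipping the comparison is simple (OR taken from the slide's output):

```python
# Re-express a protective OR (<1) in more intuitive terms.
or_male = 0.299                  # odds of depression, males vs females
pct_lower = (1 - or_male) * 100  # ~ 70.1% lower odds in males
or_flipped = 1 / or_male         # ~ 3.34: same comparison, females vs males
```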
Slide 23
Ordinal Logistic Regression
Also called ordered logistic or the proportional odds model
Extension of the binary logistic model
>2 ordered responses
New assumption!
◦ Proportional odds: the predictor’s effect on the outcome is the same across levels of the outcome.
Example: BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
Bmi3grp (1 vs 2,3) = β(age)
Bmi3grp (1,2 vs 3) = β(age)
Slide 24
Ordinal Logistic Regression: The Model
◦ A latent variable model (Y*)
◦ j = number of levels - 1
◦ logit[ P(Y ≤ j) ] = αj - βx, for j = 1, …, J-1 (one cut point αj per category boundary, a single β)
◦ From the equation we can see that the odds ratio is assumed to be independent of the category j
Slide 25
Ordinal Logistic Regression: Example

AS LOGITS:
Y=bmi3grp           Coef     SE       Z      P        CI
β1 (age)           -0.026   0.006    -4.15  <0.001   -0.038, -0.014
β2 (blood_press)    0.012   0.005     2.48   0.013    0.002, 0.021
Threshold1/cut1    -0.696   0.6678                   -2.004, 0.613
Threshold2/cut2     0.773   0.6680                   -0.536, 2.082

AS OR:
Y=bmi3grp           OR      SE       Z      P        CI
β1 (age)           0.974   0.006    -4.15  <0.001   0.962, 0.986
β2 (blood_press)   1.012   0.005     2.48   0.013   1.002, 1.022
Threshold1/cut1   -0.696   0.6678                  -2.004, 0.613
Threshold2/cut2    0.773   0.6680                  -0.536, 2.082

For a 1 unit increase in blood pressure there is a 0.012 increase in the log-odds of being in a higher BMI category.
For a 1 unit increase in blood pressure the odds of being in a higher BMI category are 1.012 times greater.

Code 3.1
Slide 26
Ordinal Logistic Regression: GOF
Assessing the proportional odds assumption
◦ Brant test of parallel regression
  H0: Proportional odds, thus want p > 0.05
  Tests each predictor separately and overall
◦ Score test of parallel regression
  H0: Proportional odds, thus want p > 0.05
◦ Approximate likelihood-ratio test
  H0: Proportional odds, thus want p > 0.05

Code 3.2
Slide 27
Ordinal Logistic Regression: GOF
Pseudo R2
Diagnostic measures
◦ Performed on the j-1 binomial logistic regressions

Code 3.3
Slide 28
Multinomial Logistic Regression
Also called multinomial logit or polytomous logistic regression.
Same assumptions as the binary logistic model
>2 non-ordered responses
◦ Or you’ve failed to meet the proportional odds assumption of the ordinal logistic model
Slide 29
Multinomial Logistic Regression: The Model
◦ log( πj(x) / πJ(x) ) = αj + βj x, for j = 1, …, J-1
◦ j = levels of the outcome; J = reference level
◦ where x is a fixed setting of an explanatory variable
◦ Notice how it appears we are estimating a relative risk and not an odds ratio. It’s actually an OR.
◦ Similar to conducting separate binary logistic models, but with better type 1 error control
Slide 30
Multinomial Logistic Regression: Example
Does degree of supernatural belief indicate a religious preference?

Y=religion (ref=Catholic(1))    OR      SE      Z      P        CI
Protestant (2)
  β (supernatural)             1.126   0.090   1.47   0.141    0.961, 1.317
  α (_constant)                1.219   0.097   2.49   0.013    1.043, 1.425
Evangelical (3)
  β (supernatural)             1.218   0.117   2.06   0.039    1.010, 1.469
  α (_constant)                0.619   0.059  -5.02  <0.001    0.512, 0.746

AS OR:
For a 1 unit increase in supernatural belief, there is a [(OR - 1) * 100 = % change] 21.8% increase in the odds of being Evangelical rather than Catholic.

Code 4.1
Slide 31
Multinomial Logistic Regression: GOF
Limited GOF tests.
◦ Look at the LR chi-square and compare nested models.
◦ “Essentially, all models are wrong, but some are useful” –George E. P. Box
Pseudo R2
Similar to ordinal
◦ Perform tests on the j-1 binomial logistic regressions
Slide 32
Resources
“Categorical Data Analysis” by Alan Agresti
UCLA Stat Computing: http://www.ats.ucla.edu/stat/