Multivariate methods for analysis of categorical data for linguists
Natalia Levshina
Mainz, May 9 2016
Outline
1. What are categorical data like?
2. Exploratory analysis
• Correspondence Analysis
  • Simple (2 variables)
  • Multiple (> 2 variables)
  • Correspondence Analysis with supplementary points
• Multidimensional Scaling and exemplar-based semantic maps
3. Confirmatory analysis
• Logistic regression
• Conditional inference trees and random forests
Categorical data
• Binary (biological sex, rural or urban, left the village or not) or having more than 2 categories (case in Russian, gender, valency)
• Ordered (level of education) or unordered (case in Russian)
• Pervasive in linguistics
• Poorly described in statistical textbooks
Examples from WALS
http://wals.info/feature
“Classical” categorical variables:
• Feature 107A: Passive Constructions
• Feature 87A: Order of Adjective and Noun
• Feature 1A: Consonant Inventories
Tricky variables (might need recoding before analysis):
• Feature 30A: Number of Genders (loss of information in “5 and more”?)
• Feature 72A: Imperative-Hortative Systems (two variables in fact?)
• Feature 144U: Double Negation in Verb-Initial Languages (one category, one count)
• Feature 142A: Para-Linguistic Usages of Clicks (is it good to have “other or none”?)
Contingency table
• Cross-tabulated counts for two or more categorical variables (a table with two or more dimensions)
• Example: Dryer’s WALS F86 (WO GEN + Noun) and F87 (WO ADJ + Noun):
                 ADJ_Noun   Noun_ADJ   No dominant WO
GEN_Noun              232        342               38
Noun_GEN               65        342               26
No dominant WO         21         48               21
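A contingency table like the one above is simply a count of how often each pair of values co-occurs. A minimal sketch in Python, assuming one value per language for each variable; the five toy observations are hypothetical and are not taken from WALS:

```python
from collections import Counter

def crosstab(rows, cols):
    """Count how often each (row value, column value) pair occurs."""
    return Counter(zip(rows, cols))

# Hypothetical toy data: one entry per language (not real WALS values).
gen_order = ["GEN_Noun", "GEN_Noun", "Noun_GEN", "Noun_GEN", "GEN_Noun"]
adj_order = ["ADJ_Noun", "Noun_ADJ", "Noun_ADJ", "Noun_ADJ", "ADJ_Noun"]

table = crosstab(gen_order, adj_order)
for (gen, adj), count in sorted(table.items()):
    print(gen, adj, count)
```

Adding a third list to the `zip` call would give the cells of a 3-dimensional table in the same way.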
A 3-dimensional contingency table
• WALS F86 + F87 + F89 (WO Num + Noun): one 3 × 3 table for each value of F89 (Num_Noun, Noun_Num, No dominant WO)
                 ADJ_Noun   Noun_ADJ   No dominant WO
GEN_Noun                9         13                6
Noun_GEN                1         16                2
No dominant WO          0          3                1

                 ADJ_Noun   Noun_ADJ   No dominant WO
GEN_Noun               15        226               13
Noun_GEN               16        184                4
No dominant WO          1         28                0

                 ADJ_Noun   Noun_ADJ   No dominant WO
GEN_Noun              176         45               10
Noun_GEN               37        100               15
No dominant WO         15         10                5
Exploratory vs. confirmatory multivariate methods
• Exploratory:
  • most commonly, finding patterns in large complex data sets
• Confirmatory (or hypothesis-testing):
  • statistical inference, i.e. how confident can we be that we can reject the null hypothesis (frequentist statistics) or believe in the alternative hypothesis (Bayesian statistics)?
  • Frequentist statistics: p-values, confidence intervals
  • Bayesian statistics: posterior probabilities, credible intervals
Exploratory methods discussed today
• Correspondence Analysis
• Multidimensional Scaling based on Gower’s distances
Correspondence Analysis
• Represents associations between categorical variables on a map
• Simple CA: two variables
• Multiple CA: more than two variables
• Supplementary points: all kinds of metainformation
Colour terms in different registers of COCA
          spoken   fiction   academic    press
black      20335     41118      26892    73080
blue        4693     22093       3605    21210
brown       1185     10914       1201    11539
gray        1168     12140       1289     6559
green       3860     14398       4477    26837
orange       931      3496        474     5766
pink         962      7312        584     6356
purple       613      3366        429     3403
red         7230     25111       5621    34596
white      14474     40745      26336    54883
yellow      1349     10553       1855    10382
Simple CA map
Interpretation
• Rows (colours) are close to one another if they have similar profiles, i.e. similar proportions across the columns (spoken, fiction, academic, press).
  • black [20335, 41118, 26892, 73080], expressed as proportions [0.13, 0.25, 0.17, 0.45]
  • white [14474, 40745, 26336, 54883], expressed as proportions [0.11, 0.30, 0.19, 0.40]
  • gray [1168, 12140, 1289, 6559], expressed as proportions [0.06, 0.57, 0.06, 0.31]
Which colours have similar profiles?
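The row profiles can be recomputed directly from the counts. The sketch below divides each row by its total and then compares profiles with a plain Euclidean distance. That distance is a simplifying assumption for illustration only: Correspondence Analysis itself uses the chi-square distance, which additionally weights each column by its overall share.

```python
import math

# Counts from the COCA table, in the order: spoken, fiction, academic, press.
counts = {
    "black": [20335, 41118, 26892, 73080],
    "white": [14474, 40745, 26336, 54883],
    "gray": [1168, 12140, 1289, 6559],
}

def row_profile(row):
    """Express a row of counts as proportions of the row total."""
    total = sum(row)
    return [value / total for value in row]

def profile_distance(p, q):
    """Plain Euclidean distance between two profiles (a simplification)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

profiles = {colour: row_profile(row) for colour, row in counts.items()}
for colour, profile in profiles.items():
    print(colour, [round(p, 2) for p in profile])

print("black vs white:", round(profile_distance(profiles["black"], profiles["white"]), 3))
print("black vs gray: ", round(profile_distance(profiles["black"], profiles["gray"]), 3))
```

The smaller distance between black and white than between black and gray is what one would expect to see reflected on the map.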
Interpretation (cont.)
• The same holds for columns (registers).
• The absolute distances between columns and rows are not always meaningful (this depends on the type of CA map). What always matters, however, is the dimensional interpretation (e.g. in which quadrants do you find the rows and columns?).
• Less frequent categories are usually further from the origin.
• The absolute values of the coordinates and their sign usually do not matter.
What can you say about the map?
How good is the 2-dimensional solution?
• Dimension 1: 77.9% of variation explained
• Dimension 2: 19.2% of variation explained
• Dimension 3: 2.9% of variation explained
Total for dimensions 1 and 2: 97.1%
An excellent result!
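The percentages above are shares of the total variation (inertia) in the table. As a rough sketch, the total inertia of a simple CA equals the chi-square statistic of the table divided by the grand total, and the number of available dimensions is one less than the smaller of the number of rows and columns. Splitting the inertia across dimensions requires a singular value decomposition, which is not shown here.

```python
# Colour counts from the COCA table: spoken, fiction, academic, press.
table = [
    [20335, 41118, 26892, 73080],  # black
    [4693, 22093, 3605, 21210],    # blue
    [1185, 10914, 1201, 11539],    # brown
    [1168, 12140, 1289, 6559],     # gray
    [3860, 14398, 4477, 26837],    # green
    [931, 3496, 474, 5766],        # orange
    [962, 7312, 584, 6356],        # pink
    [613, 3366, 429, 3403],        # purple
    [7230, 25111, 5621, 34596],    # red
    [14474, 40745, 26336, 54883],  # white
    [1349, 10553, 1855, 10382],    # yellow
]

def total_inertia(table):
    """Chi-square statistic of the table divided by the grand total."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / n

# Maximum number of CA dimensions for an 11 x 4 table.
max_dims = min(len(table), len(table[0])) - 1

print("total inertia:", total_inertia(table))
print("maximum number of dimensions:", max_dims)
```

With 11 colours and 4 registers there can be at most three dimensions, which matches the three percentages listed.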
Exercise: WO (2 var)
• How can you interpret the plot? Is the CA solution good?
Stuhl oder Sessel? (German: “chair or armchair?”)
     Function   Age        Back   Soft   Arms   Upholstery   Mat_Seat
1    Eat        Adult      Low    No     No     No           Plastic
2    Eat        Children   Mid    No     No     No           Wood
3    NotSpec    Adult      Mid    No     Yes    No           Rattan
4    Eat        Adult      High   Yes    No     Yes          Fabric
5    Eat        Children   High   No     Yes    No           Plastic
6    Work       Adult      High   Yes    Yes    Yes          Fabric
7    NotSpec    Adult      Mid    No     No     No           Wood
8    Relax      Adult      High   Yes    Yes    Yes          Leather
9    Eat        Adult      Mid    No     No     No           Wood
10   Eat        Adult      Mid    No     No     Yes          Fabric
[188 observations, 16 variables in total]
What is shown?
• The black points are the values of the categorical variables (e.g. Rattan from Material, Work from Function).
• The grey points are the individual observations.
• Again, the safe interpretation is dimensional.
How good is the solution?
• Unfortunately, the traditional MCA inflates the variance (inertia), so the quality often seems lower than it actually is.
• Use the adjusted version of MCA, which provides a correction (see Levshina 2015: Ch. 19).
Supplementary points
• Represent variables or individuals that have a different nature from the rest (e.g. demographic information against linguistic variables, linguistic variables against the referential features).
• Are passive (do not influence the orientation of the axes).
• Can be plotted onto the maps.
Supplementary points
Confidence ellipses
Exercise: MCA of WO data (3 var)
• Interpret the map.
A fictitious example: data
A fictitious example: compute Gower’s distances
A fictitious example: MDS
Interpretation
• Large distances between points suggest little overlap of the formal expressions cross-linguistically; small distances suggest great overlap of the formal expressions cross-linguistically.
• The coordinates (the absolute magnitude and the sign) normally do not matter.
• What matters are the dimensions and clusters on the map. However, their interpretation is not provided by the algorithm. This is a task for a linguist (not always easy).
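The distances behind such a map can be sketched as follows. For purely categorical variables with equal weights, Gower's distance between two exemplars reduces to the share of variables on which they differ, counting only the variables observed for both. The exemplars and connectives below are hypothetical, and the MDS step that turns the distance matrix into map coordinates is not shown.

```python
def gower_distance(a, b):
    """Share of jointly observed variables on which two exemplars differ.

    None marks a missing value; such variables are skipped for that pair.
    """
    compared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not compared:
        raise ValueError("no variable is observed for both exemplars")
    mismatches = sum(1 for x, y in compared if x != y)
    return mismatches / len(compared)

# Hypothetical exemplars: the connective found in each of four languages.
exemplars = {
    "ex1": ["because", "parce que", "karena", None],
    "ex2": ["because", "car", "karena", "potomu chto"],
    "ex3": ["so", "donc", "jadi", "poetomu"],
}

names = list(exemplars)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = gower_distance(exemplars[first], exemplars[second])
        print(first, second, round(d, 3))
```

Skipping missing values pair by pair is one reason why missing data are less of a problem for this approach than for CA.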
A real example
• ParTy corpus of film and TED talk subtitles (see my website)
• English + ten other languages (including Finnish, French, Indonesian, Japanese, Mandarin, Russian, Turkish, Vietnamese)
• Causal connectives (because, so, so that, that’s why, etc.) – in total, 205 instances.
MDS map with English categories
MDS map with Indonesian categories
How to interpret the points?
• You can provide any metainformation (e.g. original sentences) in a clickable plot (package googleVis)
• See an example here: http://www.natalialevshina.com/plots/bubblechart1.html
Exploratory methods: summary
• CA:
  + Shows all variables (features of furniture) and individuals (furniture items) on one map
  + Shows the average positions of variables (features of furniture)
  + The number of variables is not very important
  - Very sensitive to outliers (rare categories)
  - Missing data can be a problem
• MDS:
  - Shows only individuals (instances of connectives) with at most one variable (French, Chinese, etc.)
  - Does not show the average positions of variable categories (language-specific causal connectives)
  - Too few variables do not create enough variation
  + Rare categories are not a big deal
  + Missing data are not a big deal (of course, to a reasonable extent)
Logistic regression
• Models the relationship between a categorical response (e.g. active or passive voice, going to or gonna) and one or more predictors (e.g. direct or indirect causation, spoken or written data, the country, formal or informal speech…)
- Two outcomes: binomial (dichotomous)
- Three and more: multinomial (polytomous)
A case study
• Causative verbs doen “do, make” and laten “let”
• Semantics: doen expresses more direct causation than laten
• Syntax: doen is used more often with intransitive verbs
• Geographic variation: causative doen occurs more frequently in Belgian Dutch
(1) Hij deed me denken aan mijn vader.
He did me think at my father
“He reminded me of my father.”
(2) Ik liet hem mijn huis schilderen.
I let him my house paint
“I had him paint my house.”
Data (first 6 observations)
    Aux     Country   Causation    EPTrans   EPTrans1
1   laten   NL        Inducive     Intr      Intr
2   laten   NL        Physical     Intr      Intr
3   laten   NL        Inducive     Tr        Tr
4   doen    BE        Affective    Intr      Intr
5   laten   NL        Inducive     Tr        Tr
6   laten   NL        Volitional   Intr      Intr
Table of coefficients
                          Coef     S.E.   Wald Z   Pr(>|Z|)
Intercept               1.8631   0.3771     4.94    <0.0001
Causation=Inducive     -3.3725   0.3741    -9.01    <0.0001
Causation=Physical      0.4661   0.6275     0.74     0.4576
Causation=Volitional   -3.7373   0.4278    -8.74    <0.0001
EPTrans=Tr             -1.2952   0.3394    -3.82     0.0001
Country=BE              0.7085   0.2841     2.49     0.0126
The coefficients of the variables are log-odds ratios. They show how the odds of doen against laten increase (if > 0) or decrease (if < 0) in comparison with the reference levels (Causation = Affective; EPTrans = Intr; Country = NL).
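To make the log-odds ratios easier to read, they can be exponentiated into odds ratios, and any combination of levels can be turned into a predicted probability with the inverse logit. A minimal sketch using the coefficients from the table, assuming (as the slide states) that the model predicts doen against laten; the helper `prob_doen` is a hypothetical name introduced for illustration:

```python
import math

# Coefficients from the table (log-odds of doen against laten).
coef = {
    "Intercept": 1.8631,
    "Causation=Inducive": -3.3725,
    "Causation=Physical": 0.4661,
    "Causation=Volitional": -3.7373,
    "EPTrans=Tr": -1.2952,
    "Country=BE": 0.7085,
}

# Odds ratios relative to the reference levels.
odds_ratios = {
    name: math.exp(value) for name, value in coef.items() if name != "Intercept"
}

def prob_doen(active_terms):
    """Predicted probability of doen; list only the non-reference levels."""
    log_odds = coef["Intercept"] + sum(coef[term] for term in active_terms)
    return 1 / (1 + math.exp(-log_odds))

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 2))

# Reference context: affective causation, intransitive verb, Netherlandic Dutch.
print("reference context:", round(prob_doen([]), 3))
# Inducive causation with a transitive verb, Netherlandic Dutch.
print("inducive + transitive:", round(prob_doen(["Causation=Inducive", "EPTrans=Tr"]), 3))
```

An odds ratio above 1 (e.g. for Country=BE) means higher odds of doen than at the reference level; a ratio below 1 means lower odds.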
Goodness of fit
• Goodness-of-fit measures provide an estimate of how well the model fits the data
• The most popular measures:
  • Pseudo-R² (here: 0.61), ranges from 0 to 1. Caution: it is usually lower for logistic regression models than its counterpart in linear regression.
  • A better option: Concordance index C (here: 0.89)
Goodness of fit: concordance index C
• Take all possible pairs that consist of one sentence with doen and one sentence with laten. The statistic C is the proportion of pairs in which the model predicts a higher probability of doen for the sentence with doen, and a higher probability of laten for the sentence with laten.
• Rule of thumb:
  C = 0.5          no discrimination
  0.7 ≤ C < 0.8    acceptable discrimination
  0.8 ≤ C < 0.9    excellent discrimination
  C ≥ 0.9          outstanding discrimination
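The pairwise definition of C can be written out directly. The sketch below assumes that the model's output is the predicted probability of doen and that tied pairs count as half, a common convention; the six probabilities and outcomes are hypothetical, not taken from the case study.

```python
def concordance_index(probs, outcomes):
    """Proportion of (doen, laten) pairs ranked correctly by the model.

    probs: predicted probabilities of doen.
    outcomes: 1 if the sentence contains doen, 0 if it contains laten.
    """
    doen = [p for p, y in zip(probs, outcomes) if y == 1]
    laten = [p for p, y in zip(probs, outcomes) if y == 0]
    if not doen or not laten:
        raise ValueError("both outcomes must occur at least once")
    score = 0.0
    for p in doen:
        for q in laten:
            if p > q:
                score += 1.0   # concordant pair
            elif p == q:
                score += 0.5   # tie counts as half (assumed convention)
    return score / (len(doen) * len(laten))

# Hypothetical predictions for three doen and three laten sentences.
probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.7]
outcomes = [1, 1, 1, 0, 0, 0]
c = concordance_index(probs, outcomes)
print(round(c, 3))
```

A model that ranks every doen sentence above every laten sentence reaches C = 1, while random guessing gives about 0.5.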
Exercise: nerd or geek?
• Data:
Noun   Num   Century   Register   Eval
nerd   pl    XX        ACAD       Neutral
geek   pl    XXI       MAG        Neutral
geek   pl    XX        NEWS       Neutral
geek   sg    XXI       MAG        Neutral
nerd   sg    XXI       SPOK       Neg
geek   sg    XX        SPOK       Positive
[1316 observations in total]
Table of coefficients
                   Coef     S.E.   Wald Z   Pr(>|Z|)
Intercept       -1.5038   0.3515    -4.28    <0.0001
Num=sg           0.2724   0.1291     2.11     0.0348
Century=XXI      0.8063   0.1220     6.61    <0.0001
Register=MAG     0.7457   0.3208     2.32     0.0201
Register=NEWS    0.5962   0.3301     1.81     0.0709
Register=SPOK    0.5729   0.3310     1.73     0.0835
Eval=Neutral    -0.0991   0.1942    -0.51     0.6098
Eval=Positive    1.5084   0.2375     6.35    <0.0001
The coefficients show the effect for geek vs. nerd!
Does the model fit well?
• Pseudo-R2 = 0.17
• C = 0.69
Trees and forests
• These are methods based on recursive binary partitioning.
• Why binary? The algorithm tests if any independent variable is associated with the response variable (e.g. geek or nerd) and chooses the one that has the strongest association with the response. It makes a binary split in this variable and splits the dataset into two subsets: those with value A and those with value B.
• Why recursive? The previous steps are repeated again and again until there are no more variables associated with the outcome at the given level of statistical significance (e.g. α = 0.05).
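One step of this procedure can be sketched as follows. Note the simplification: the sketch ranks predictors by a plain chi-square statistic, whereas conditional inference trees compare p-values from permutation tests and stop when none is significant. The four observations are hypothetical toy data, and the function names are introduced only for illustration.

```python
from collections import Counter

def chi_square(xs, ys):
    """Chi-square statistic of association between two categorical variables."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_totals = Counter(xs)
    y_totals = Counter(ys)
    stat = 0.0
    for x in x_totals:
        for y in y_totals:
            expected = x_totals[x] * y_totals[y] / n
            stat += (joint[(x, y)] - expected) ** 2 / expected
    return stat

def best_split_variable(data, response):
    """Return the predictor most strongly associated with the response."""
    ys = [row[response] for row in data]
    predictors = [name for name in data[0] if name != response]
    scores = {
        name: chi_square([row[name] for row in data], ys) for name in predictors
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy observations (not the real nerd/geek data).
data = [
    {"Noun": "geek", "Century": "XXI", "Num": "sg"},
    {"Noun": "geek", "Century": "XXI", "Num": "pl"},
    {"Noun": "nerd", "Century": "XX", "Num": "sg"},
    {"Noun": "nerd", "Century": "XX", "Num": "pl"},
]

best, scores = best_split_variable(data, "Noun")
# Binary split on the chosen variable; each subset would be processed again.
left = [row for row in data if row[best] == data[0][best]]
right = [row for row in data if row[best] != data[0][best]]
print(best, scores, len(left), len(right))
```

Applying the same function to `left` and `right` in turn is what makes the partitioning recursive.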
A case study
• Variation in the English causatives make + V, have + V and cause + to V
• 50 examples of each construction from a corpus = 150 observations in total
• 6 categorical variables:
  • CrSem: semantics of the Causer (Animate, Inanimate)
  • CeSem: semantics of the Causee (Animate, Inanimate)
  • CdEv: semantics of the infinitive (Mental, Physical, Social)
  • Neg: negation
  • Coref: coreferentiality between Cr and other participants (Yes, No)
  • Poss: possessive markers that suggest a possessive relationship between Cr and other participants (Yes, No)
Conditional inference tree
Goodness of fit: classification accuracy
[Table: observed outcomes cross-tabulated with predicted outcomes]
Accuracy = (35 + 42 + 24)/150 = 0.67
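The accuracy is the share of observations for which the predicted construction is also the observed one. A small sketch using only the numbers given above (the three correct counts and the 150 observations); comparing against the baseline of always guessing one construction is an added illustration, based on the 50 examples per construction in the case study:

```python
def accuracy(correct_counts, n_total):
    """Share of observations whose predicted outcome matches the observed one."""
    return sum(correct_counts) / n_total

# Correctly classified observations per construction, as given on the slide.
correct = [35, 42, 24]
n_total = 150

acc = accuracy(correct, n_total)
# Baseline: always predicting the same construction (50 of 150 are correct).
baseline = 50 / n_total

print("accuracy:", round(acc, 2))
print("baseline:", round(baseline, 2))
```

An accuracy of about two thirds is thus twice as good as the one-third baseline for three equally frequent outcomes.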
Random forests
• Consist of many trees (e.g. 1000), grown by repeating the conditional tree algorithm many times on random subsets of the observations and predictors.
• We can compute the conditional variable importance scores for each variable. ‘Conditional’, because they are computed given the impact of all other variables and interactions with them.
• Important: the variable importance scores are relative. They cannot be compared across different models.
Conditional variable importance
Confirmatory methods: summary
• Conditional inference trees and random forests are used in situations where the use of regression is problematic:
  • ‘Small n, large p’ (the maximum number of coefficients in the binary logistic regression model is the frequency of the less frequent response category divided by 10)
  • Complex interactions
  • Outliers
• Unlike regression, the partitioning methods do not return coefficients. However, it is possible to obtain relative variable importance measures.
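The ‘small n, large p’ rule of thumb mentioned above is simple arithmetic. A minimal sketch, assuming the result is rounded down; the function name and the example frequencies are hypothetical:

```python
def max_coefficients(n_outcome_a, n_outcome_b, per_coefficient=10):
    """Rule of thumb: frequency of the rarer outcome divided by 10, rounded down."""
    return min(n_outcome_a, n_outcome_b) // per_coefficient

# Hypothetical frequencies of the two response categories.
print(max_coefficients(1200, 116))
print(max_coefficients(50, 100))
```

If a model needs more coefficients than this allows, trees and forests may be the safer choice.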
Acknowledgements
• All analyses were performed with R, a free statistical environment available from https://cran.r-project.org/
Thanks for your attention!
References
• All methods and many of the case studies are discussed in my textbook:
  • Levshina, N. 2015. How to Do Linguistics with R: Data exploration and statistical analysis. Amsterdam: John Benjamins.
• Correspondence Analysis:
  • Greenacre, M. 2007. Correspondence Analysis in Practice (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.
• Multidimensional Scaling:
  • Borg, I. & Groenen, P. 2005. Modern Multidimensional Scaling: Theory and Applications (2nd ed.). New York: Springer.
• Logistic regression:
  • Hosmer, D. W., Lemeshow, S. & Sturdivant, R. X. 2013. Applied Logistic Regression. New York: Wiley.
• Conditional inference trees and random forests:
  • Tagliamonte, S. & Baayen, R. H. 2012. Models, forests and trees of York English: Was/were variation as a case study for statistical practice. Language Variation and Change 24(2): 135-178.
R code
• See the textbook and the companion website
• Exemplar-based MDS maps:
  • See my page on Academia.edu, paper “How to make semantic maps with R (based on contextual features of exemplars and Multidimensional Scaling)”