
Multivariate methods for analysis of categorical data

for linguists

Natalia Levshina

Mainz, May 9 2016

Outline

1. What are categorical data like?

2. Exploratory analysis
• Correspondence Analysis
  • Simple (2 variables)
  • Multiple (> 2 variables)
  • Correspondence Analysis with supplementary points
• Multidimensional Scaling and exemplar-based semantic maps

3. Confirmatory analysis
• Logistic regression
• Conditional inference trees and random forests

Categorical data

• Binary (biological sex, rural or urban, left the village or not) or having more than 2 categories (case in Russian, gender, valency)

• Ordered (level of education) or unordered (case in Russian)

• Pervasive in linguistics

• Poorly described in statistical textbooks

Examples from WALS

http://wals.info/feature

"Classical" categorical variables:

• Feature 107A: Passive Constructions
• Feature 87A: Order of Adjective and Noun
• Feature 1A: Consonant Inventories

Tricky variables (might need recoding before analysis):

• Feature 30A: Number of Genders (loss of information (“5 and more”)?)
• Feature 72A: Imperative-Hortative Systems (two variables in fact?)
• Feature 144U: Double Negation in Verb-Initial Languages (one category – one count)
• Feature 142A: Para-Linguistic Usages of Clicks (is it good to have “other or none”?)

Contingency table

• Cross-tabulated counts of two or more categorical variables (a table with two or more dimensions)

• Example: Dryer’s WALS F86 (WO GEN + Noun) and F87 (WO ADJ + Noun):

                  ADJ_Noun   Noun_ADJ   No dominant WO
GEN_Noun               232        342               38
Noun_GEN                65        342               26
No dominant WO          21         48               21
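A table like this is built by counting how often each combination of values occurs. The sketch below is in Python purely for illustration (the analyses in this talk were done in R), and the six observations are hypothetical, not WALS data.

```python
from collections import Counter

# Hypothetical observations: one (GEN order, ADJ order) pair per language.
# These six pairs are invented for illustration; they are not WALS data.
observations = [
    ("GEN_Noun", "ADJ_Noun"),
    ("GEN_Noun", "Noun_ADJ"),
    ("Noun_GEN", "Noun_ADJ"),
    ("GEN_Noun", "Noun_ADJ"),
    ("Noun_GEN", "ADJ_Noun"),
    ("Noun_GEN", "Noun_ADJ"),
]


def cross_tabulate(pairs):
    """Return row labels, column labels and a nested list of cell counts."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


rows, cols, table = cross_tabulate(observations)
print("".ljust(10), cols)
for label, cells in zip(rows, table):
    print(label.ljust(10), cells)
```

Adding a third variable to each tuple, and a third level of nesting, gives a 3-dimensional table in the same way.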

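A 3-dimensional contingency table

• WALS F86 + F87 + F89 (WO Num + Noun)

[Figure: three stacked two-way tables, one per value of F89 (Num_Noun, Noun_Num, No dominant WO). Counts per layer, in the order they appear on the slide:]

GEN_Noun      9    13     6
Noun_GEN      1    16     2

GEN_Noun     15   226    13
Noun_GEN     16   184     4

GEN_Noun    176    45    10
Noun_GEN     37   100    15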

Exploratory vs. confirmatory multivariate methods

• Exploratory:
  • most commonly, finding patterns in large, complex data sets

• Confirmatory (or hypothesis-testing):
  • statistical inference, i.e. how confident can we be that we can reject the null hypothesis (frequentist statistics) or believe in the alternative hypothesis (Bayesian statistics)?
  • Frequentist statistics: p-values, confidence intervals
  • Bayesian statistics: posterior probabilities, credible intervals

Exploratory methods discussed today

• Correspondence Analysis

• Multidimensional Scaling based on Gower’s distances

Correspondence Analysis

• Represents associations between categorical variables on a map

• Simple CA: two variables

• Multiple CA: more than two variables

• Supplementary points: all kinds of metainformation

Colour terms in different registers of COCA

          spoken   fiction   academic   press
black      20335     41118      26892   73080
blue        4693     22093       3605   21210
brown       1185     10914       1201   11539
gray        1168     12140       1289    6559
green       3860     14398       4477   26837
orange       931      3496        474    5766
pink         962      7312        584    6356
purple       613      3366        429    3403
red         7230     25111       5621   34596
white      14474     40745      26336   54883
yellow      1349     10553       1855   10382

Simple CA map

Interpretation

• Rows (colours) are close to one another if they have similar profiles.

• black [20335, 41118, 26892, 73080], expressed as proportions [0.13, 0.25, 0.17, 0.45]

• white [14474, 40745, 26336, 54883], expressed as proportions [0.11, 0.30, 0.19, 0.40]

• gray [1168, 12140, 1289, 6559], expressed as proportions [0.06, 0.57, 0.06, 0.31]

Which colours have similar profiles?
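A row profile is the row's counts divided by the row total. Here is a minimal sketch in Python (for illustration only; the talk itself used R), using the three rows quoted above. The function name `row_profile` is made up for this example.

```python
# Counts per register (spoken, fiction, academic, press), taken from the COCA table.
counts = {
    "black": [20335, 41118, 26892, 73080],
    "white": [14474, 40745, 26336, 54883],
    "gray": [1168, 12140, 1289, 6559],
}


def row_profile(row):
    """Divide each count by the row total, rounded to two decimals."""
    total = sum(row)
    return [round(x / total, 2) for x in row]


profiles = {colour: row_profile(row) for colour, row in counts.items()}
for colour, profile in profiles.items():
    print(colour, profile)
```

Note that CA compares profiles with a weighted (chi-squared) distance, not by eye, but printing the profiles is a quick sanity check before reading the map.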

Interpretation (cont.)

• The same holds for columns (registers).

• The absolute distances between columns and rows are not always meaningful (this depends on the type of CA map). What always matters, however, is the dimensional interpretation (e.g. in which quadrants do you find the rows and columns?).

• Less frequent categories are usually further from the origin.

• The absolute values of the coordinates and their sign usually do not matter.

What can you say about the map?

How good is the 2-dimensional solution?

• Dimension 1: 77.9% of variation explained

• Dimension 2: 19.2% of variation explained

• Dimension 3: 2.9% of variation explained

Total for dimensions 1 and 2: 97.1%

An excellent result!
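Percentages of this kind come from decomposing the table's inertia. The sketch below shows one standard way to compute them: simple CA as a singular value decomposition of the standardised residuals. It uses Python with NumPy for illustration only (the talk used R), and the figures it prints are not guaranteed to match the slide to the last decimal.

```python
import numpy as np

# Rows: black, blue, brown, gray, green, orange, pink, purple, red, white, yellow
# Columns: spoken, fiction, academic, press (counts from the COCA table)
N = np.array([
    [20335, 41118, 26892, 73080],
    [4693, 22093, 3605, 21210],
    [1185, 10914, 1201, 11539],
    [1168, 12140, 1289, 6559],
    [3860, 14398, 4477, 26837],
    [931, 3496, 474, 5766],
    [962, 7312, 584, 6356],
    [613, 3366, 429, 3403],
    [7230, 25111, 5621, 34596],
    [14474, 40745, 26336, 54883],
    [1349, 10553, 1855, 10382],
], dtype=float)

P = N / N.sum()        # correspondence matrix (proportions of the grand total)
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardised residuals: observed minus expected, scaled by sqrt(expected).
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)

# Squared singular values are the principal inertias of the dimensions.
singular_values = np.linalg.svd(S, compute_uv=False)
n_dims = min(N.shape) - 1              # at most min(rows, cols) - 1 dimensions
inertia = singular_values[:n_dims] ** 2
explained = 100 * inertia / inertia.sum()

for k, share in enumerate(explained, start=1):
    print(f"Dimension {k}: {share:.1f}% of inertia")
```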

Exercise: WO (2 var)

• How can you interpret the plot? Is the CA solution good?

Stuhl or Sessel? (German ‘chair’ vs. ‘armchair’)

     Function   Age        Back   Soft   Arms   Upholstery   Mat_Seat
 1   Eat        Adult      Low    No     No     No           Plastic
 2   Eat        Children   Mid    No     No     No           Wood
 3   NotSpec    Adult      Mid    No     Yes    No           Rattan
 4   Eat        Adult      High   Yes    No     Yes          Fabric
 5   Eat        Children   High   No     Yes    No           Plastic
 6   Work       Adult      High   Yes    Yes    Yes          Fabric
 7   NotSpec    Adult      Mid    No     No     No           Wood
 8   Relax      Adult      High   Yes    Yes    Yes          Leather
 9   Eat        Adult      Mid    No     No     No           Wood
10   Eat        Adult      Mid    No     No     Yes          Fabric

[188 observations, 16 variables in total]

What is shown?

• The black points are the values of the categorical variables (e.g. Rattan from Material, Work from Function).

• The grey points are the individual observations.

• Again, the safe interpretation is dimensional.

How good is the solution?

• Unfortunately, the traditional MCA inflates the variance (inertia), so the quality often seems lower than it actually is.

• Use the adjusted version of MCA, which provides a correction (see Levshina 2015: Ch. 19).
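To show what such a correction does, here is a sketch of one commonly cited adjustment, Benzécri's. It is an assumption that this is close to what the textbook uses (the textbook's version may differ, e.g. Greenacre's variant rescales the total differently). The eigenvalues are hypothetical, and Python is used for illustration only.

```python
def benzecri_adjust(eigenvalues, n_variables):
    """Benzecri's correction for eigenvalues from an indicator-matrix MCA.

    Eigenvalues at or below 1/Q (Q = number of variables) are set to zero;
    the others become ((Q / (Q - 1)) * (eigenvalue - 1/Q)) ** 2.
    """
    q = n_variables
    threshold = 1.0 / q
    return [
        ((q / (q - 1)) * (ev - threshold)) ** 2 if ev > threshold else 0.0
        for ev in eigenvalues
    ]


# Hypothetical eigenvalues for an MCA of Q = 4 variables (invented for illustration).
raw = [0.40, 0.30, 0.20, 0.10]
adjusted = benzecri_adjust(raw, n_variables=4)

raw_share = [100 * ev / sum(raw) for ev in raw]
adj_share = [100 * ev / sum(adjusted) for ev in adjusted]

for k, (before, after) in enumerate(zip(raw_share, adj_share), start=1):
    print(f"Dimension {k}: raw {before:.1f}%, adjusted {after:.1f}%")
```

With these invented numbers the first dimension goes from 40% of the raw inertia to a much larger share of the adjusted inertia, which is the effect described above.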

Supplementary points

• Represent variables or individuals that have a different nature from the rest (e.g. demographic information against linguistic variables, linguistic variables against the referential features).

• Are passive (do not influence the orientation of the axes).

• Can be plotted onto the maps.

Supplementary points

Confidence ellipses

Exercise: MCA of WO data (3 var)

• Interpret the map.

A fictitious example: data

A fictitious example: compute Gower’s distances

A fictitious example: MDS

Interpretation

• Large distances between points suggest little overlap of the formal expressions cross-linguistically; small distances suggest great overlap of the formal expressions cross-linguistically.

• The coordinates (the absolute magnitude and the sign) normally do not matter.

• What matters are the dimensions and clusters on the map. However, their interpretation is not provided by the algorithm. This is a task for a linguist (not always easy).
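The two steps (Gower's distances, then MDS) can be sketched as follows. For purely categorical variables, Gower's distance between two exemplars is the share of variables on which they differ, skipping variables that are missing in either. The data are hypothetical, classical (metric) MDS is an assumption (the maps in this talk may have used another MDS variant), and Python with NumPy is used for illustration only.

```python
import numpy as np

# Hypothetical data: 4 exemplars (contexts) x 3 languages (English, French,
# Indonesian). Each cell is the connective used; None marks a missing value.
data = [
    ["because", "parce que", "karena"],
    ["because", "parce que", "karena"],
    ["so", "donc", "jadi"],
    ["so", "alors", None],
]


def gower_categorical(a, b):
    """Share of mismatching variables among those observed in both exemplars."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return float("nan")
    return sum(x != y for x, y in pairs) / len(pairs)


n = len(data)
D = np.array([[gower_categorical(data[i], data[j]) for j in range(n)]
              for i in range(n)])


def classical_mds(distances, k=2):
    """Classical MDS: eigendecomposition of the double-centred squared distances."""
    size = distances.shape[0]
    centring = np.eye(size) - np.ones((size, size)) / size
    B = -0.5 * centring @ (distances ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]          # k largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale


coords = classical_mds(D)
print(np.round(D, 2))
print(np.round(coords, 2))
```

Exemplars 1 and 2 are expressed identically in all three languages, so their distance is 0 and they land on the same point of the map.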

A real example

• ParTy corpus of film and TED talk subtitles (see my website)

• English + ten other languages (Finnish, French, Indonesian, Japanese, Mandarin, Russian, Turkish, Vietnamese)

• Causal connectives (because, so, so that, that’s why, etc.) – in total, 205 instances.

MDS map with English categories

MDS map with Indonesian categories

How to interpret the points?

• You can provide any metainformation (e.g. original sentences) in a clickable plot (package googleVis)

• See an example here: http://www.natalialevshina.com/plots/bubblechart1.html

Exploratory methods: summary

• CA:
  + Shows all variables (features of furniture) and individuals (furniture items) on one map
  + Shows the average positions of variables (features of furniture)
  + The number of variables is not very important
  - Very sensitive to outliers (rare categories)
  - Missing data can be a problem

• MDS:
  - Shows only individuals (instances of connectives) with at most one variable (French, Chinese, etc.)
  - Does not show the average positions of variable categories (language-specific causal connectives)
  - Too few variables do not create enough variation
  + Rare categories are not a big deal
  + Missing data are not a big deal (of course, to a reasonable extent)

Logistic regression

• Models the relationship between a categorical response (e.g. active or passive voice, going to or gonna) and one or more predictors (e.g. direct or indirect causation, spoken or written data, the country, formal or informal speech…)

- Two outcomes: binomial (dichotomous)

- Three and more: multinomial (polytomous)

A case study

• Causative verbs doen “do, make” and laten “let”

• Semantics: doen expresses more direct causation than laten

• Syntax: doen is used more often with intransitive verbs

• Geographic variation: causative doen occurs more frequently in Belgian Dutch

(1) Hij deed me denken aan mijn vader.

He did me think at my father

“He reminded me of my father.”

(2) Ik liet hem mijn huis schilderen.

I let him my house paint

“I had him paint my house.”

Data (first 6 observations)

    Aux     Country   Causation    EPTrans   EPTrans1
1   laten   NL        Inducive     Intr      Intr
2   laten   NL        Physical     Intr      Intr
3   laten   NL        Inducive     Tr        Tr
4   doen    BE        Affective    Intr      Intr
5   laten   NL        Inducive     Tr        Tr
6   laten   NL        Volitional   Intr      Intr

Table of coefficients

                          Coef      S.E.     Wald Z   Pr(>|Z|)
Intercept                 1.8631    0.3771    4.94    <0.0001
Causation=Inducive       -3.3725    0.3741   -9.01    <0.0001
Causation=Physical        0.4661    0.6275    0.74     0.4576
Causation=Volitional     -3.7373    0.4278   -8.74    <0.0001
EPTrans=Tr               -1.2952    0.3394   -3.82     0.0001
Country=BE                0.7085    0.2841    2.49     0.0126

The coefficients of the variables are log odds ratios. They show by how much the log odds of doen against laten increase (if > 0) or decrease (if < 0) in comparison with the reference level (Causation = Affective; EPTrans = Intr; Country = NL).
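Exponentiating a coefficient gives an odds ratio, and summing the relevant coefficients gives the log odds for a particular context, which can be turned into a probability. The sketch below uses the coefficients from the table. Python is used for illustration only, and the helper `prob_doen` is a made-up name.

```python
import math

# Coefficients copied from the table (log odds of doen against laten).
coefs = {
    "Intercept": 1.8631,
    "Causation=Inducive": -3.3725,
    "Causation=Physical": 0.4661,
    "Causation=Volitional": -3.7373,
    "EPTrans=Tr": -1.2952,
    "Country=BE": 0.7085,
}

# Odds ratios: how many times the odds of doen change relative to the reference level.
odds_ratios = {name: math.exp(b) for name, b in coefs.items() if name != "Intercept"}


def prob_doen(*active_levels):
    """Predicted probability of doen; levels not listed stay at their reference value."""
    log_odds = coefs["Intercept"] + sum(coefs[level] for level in active_levels)
    return 1 / (1 + math.exp(-log_odds))


# Reference context: Affective causation, intransitive verb, the Netherlands.
p_reference = prob_doen()
# Inducive causation, transitive verb, Belgium.
p_example = prob_doen("Causation=Inducive", "EPTrans=Tr", "Country=BE")

print({name: round(value, 2) for name, value in odds_ratios.items()})
print(round(p_reference, 2), round(p_example, 2))
```

For example, the odds of doen are roughly twice as high in Belgian Dutch as in Netherlandic Dutch, other things being equal.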

Goodness of fit

• Provides an estimate of how well the model fits the data

• The most popular measures:
  • Pseudo-R² (here: 0.61), ranges from 0 to 1. Caution: it is usually lower for logistic regression models than its counterpart in linear regression.
  • A better option: Concordance index C (here: 0.89)

Goodness of fit: concordance index C

• Take all possible pairs that consist of one sentence with doen and one sentence with laten. The statistic C is the proportion of pairs in which the model predicts a higher probability of doen for the sentence with doen than for the sentence with laten.

• Rule of thumb:

C = 0.5          no discrimination
0.7 ≤ C < 0.8    acceptable discrimination
0.8 ≤ C < 0.9    excellent discrimination
C ≥ 0.9          outstanding discrimination
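The pairwise definition can be written out directly. In the sketch below the predicted probabilities are hypothetical, ties are counted as half a concordant pair (a common convention, assumed here), and Python is used for illustration only.

```python
from itertools import product


def concordance_index(probs_doen_sentences, probs_laten_sentences):
    """Proportion of (doen, laten) sentence pairs that the model ranks correctly.

    Both arguments hold the predicted probability of doen: the first for
    sentences that actually contain doen, the second for sentences with laten.
    """
    pairs = list(product(probs_doen_sentences, probs_laten_sentences))
    score = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p, q in pairs)
    return score / len(pairs)


# Hypothetical predicted probabilities of doen (invented for illustration).
p_doen_sentences = [0.9, 0.7, 0.4]
p_laten_sentences = [0.2, 0.5, 0.8, 0.1]

c_index = concordance_index(p_doen_sentences, p_laten_sentences)
print(c_index)
```

Here 9 of the 12 pairs are ranked correctly, so C = 0.75.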

Exercise: nerd or geek?

• Data:

Noun   Num   Century   Register   Eval
nerd   pl    XX        ACAD       Neutral
geek   pl    XXI       MAG        Neutral
geek   pl    XX        NEWS       Neutral
geek   sg    XXI       MAG        Neutral
nerd   sg    XXI       SPOK       Neg
geek   sg    XX        SPOK       Positive

[1316 observations in total]

Table of coefficients

                   Coef      S.E.     Wald Z   Pr(>|Z|)
Intercept         -1.5038    0.3515   -4.28    <0.0001
Num=sg             0.2724    0.1291    2.11     0.0348
Century=XXI        0.8063    0.1220    6.61    <0.0001
Register=MAG       0.7457    0.3208    2.32     0.0201
Register=NEWS      0.5962    0.3301    1.81     0.0709
Register=SPOK      0.5729    0.3310    1.73     0.0835
Eval=Neutral      -0.0991    0.1942   -0.51     0.6098
Eval=Positive      1.5084    0.2375    6.35    <0.0001

The coefficients show the effect for geek vs. nerd!

Does the model fit well?

• Pseudo-R² = 0.17

• C = 0.69

Trees and forests

• These are methods based on recursive binary partitioning.

• Why binary? The algorithm tests whether any independent variable is associated with the response variable (e.g. geek or nerd) and chooses the one that has the strongest association with the response. It makes a binary split in this variable and divides the dataset into two subsets: those with value A and those with value B.

• Why recursive? The previous steps are repeated again and again until there are no more variables associated with the outcome at the given level of statistical significance (e.g. α = 0.05).
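A single partitioning step can be sketched as follows. This is a simplified stand-in, not the conditional inference algorithm itself: real conditional inference trees select variables with permutation-test p-values, whereas this sketch compares plain chi-squared statistics, which is only fair here because every predictor is binary (1 degree of freedom). The data are hypothetical and Python is used for illustration only.

```python
from collections import Counter


def chi_square(xs, ys):
    """Pearson chi-squared statistic for two categorical variables."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_totals = Counter(xs)
    y_totals = Counter(ys)
    stat = 0.0
    for x in x_totals:
        for y in y_totals:
            expected = x_totals[x] * y_totals[y] / n
            stat += (joint[(x, y)] - expected) ** 2 / expected
    return stat


# Hypothetical observations (Num, Century, Noun), invented for illustration.
rows = (
    [("sg", "XX", "nerd")] * 3 + [("pl", "XX", "nerd")] * 2 + [("pl", "XX", "geek")]
    + [("pl", "XXI", "nerd")] + [("sg", "XXI", "geek")] * 3 + [("pl", "XXI", "geek")] * 2
)
response = [row[2] for row in rows]
predictors = {
    "Num": [row[0] for row in rows],
    "Century": [row[1] for row in rows],
}

# Step 1: find the predictor with the strongest association with the response.
scores = {name: chi_square(values, response) for name, values in predictors.items()}
best = max(scores, key=scores.get)

# Step 2: split only if the association is significant.
# 3.841 is the chi-squared critical value for 1 degree of freedom at alpha = 0.05.
should_split = scores[best] > 3.841

# Step 3: divide the data by the chosen variable; each subset would then be
# processed again in the same way (the recursive part).
subsets = {}
if should_split:
    for row, value in zip(rows, predictors[best]):
        subsets.setdefault(value, []).append(row)

print(best, round(scores[best], 2), {value: len(part) for value, part in subsets.items()})
```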

A case study

• Variation in the English causatives make + V, have + V and cause + to V

• 50 examples of each construction from a corpus = 150 observations in total

• 6 categorical variables:
  • CrSem: semantics of the Causer (Animate, Inanimate)
  • CeSem: semantics of the Causee (Animate, Inanimate)
  • CdEv: semantics of the infinitive (Mental, Physical, Social)
  • Neg: negation
  • Coref: coreferentiality between Cr and other participants (Yes, No)
  • Poss: possessive markers that suggest a possessive relationship between Cr and other participants (Yes, No)

Conditional inference tree

Goodness of fit: classification accuracy

[Table: predicted outcomes against observed outcomes]

Accuracy = (35 + 42 + 24)/150 = 0.67
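Accuracy is the sum of the diagonal of the predicted-by-observed table divided by the number of observations. In the sketch below only the diagonal (35, 42, 24) and the total of 150 come from the slide. The off-diagonal counts are hypothetical fillers, chosen so that each construction has 50 observed examples. Python is used for illustration only.

```python
# Rows: predicted construction; columns: observed construction.
# Diagonal counts (35, 42, 24) are from the slide; all off-diagonal counts
# are hypothetical and only make each column sum to 50.
confusion = [
    [35, 5, 14],
    [8, 42, 12],
    [7, 3, 24],
]


def accuracy(matrix):
    """Share of observations on the diagonal (predicted == observed)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


observed_totals = [sum(row[j] for row in confusion) for j in range(3)]
baseline = max(observed_totals) / sum(observed_totals)  # always guess the largest class

print(round(accuracy(confusion), 2), round(baseline, 2))
```

With three equally frequent constructions the baseline is about 0.33, so an accuracy of 0.67 is roughly twice as good as always guessing one construction.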

Random forests

• Consist of many trees (e.g. 1000), grown by repeating the conditional tree algorithm many times on random subsets of the data and of the predictors.

• We can compute the conditional variable importance scores for each variable. ‘Conditional’, because they are computed given the impact of all other variables and interactions with them.

• Important: the variable importance scores are relative. They cannot be compared across different models.

Conditional variable importance

Confirmatory methods: summary

• Conditional inference trees and random forests are used in situations where the use of regression is problematic:
  • ‘Small n, large p’ (the maximum number of coefficients in the binary logistic regression model is the frequency of the less frequent response category divided by 10)
  • Complex interactions
  • Outliers

• Unlike regression, the partitioning methods do not return coefficients. However, it is possible to obtain relative variable importance measures.
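The ‘small n, large p’ rule of thumb is simple arithmetic. The sketch below applies it to hypothetical counts; the function name and the example frequencies are invented, and Python is used for illustration only.

```python
def max_coefficients(n_outcome_a, n_outcome_b, events_per_coefficient=10):
    """Rule of thumb: frequency of the rarer outcome divided by 10 (rounded down)."""
    return min(n_outcome_a, n_outcome_b) // events_per_coefficient


# Hypothetical data set: 120 sentences with one outcome and 380 with the other.
print(max_coefficients(120, 380))
```

With these hypothetical counts the model should have no more than 12 coefficients. Counting five coefficients (excluding the intercept) for a model of the size shown earlier, the rarer outcome would need at least 50 occurrences.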

Acknowledgements

• All analyses were performed with R, a free statistical environment available from https://cran.r-project.org/

Thanks for your attention!


References

• All methods and many of the case studies are discussed in my textbook:
  • Levshina, N. 2015. How to Do Linguistics with R: Data exploration and statistical analysis. Amsterdam: John Benjamins.

• Correspondence Analysis:
  • Greenacre, M. 2007. Correspondence Analysis in Practice (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.

• Multidimensional Scaling:
  • Borg, I. & Groenen, P. 2005. Modern Multidimensional Scaling: Theory and Applications (2nd ed.). New York: Springer.

• Logistic regression:
  • Hosmer, D. W., Lemeshow, S. & Sturdivant, R. X. 2013. Applied Logistic Regression. New York: Wiley.

• Conditional inference trees and random forests:
  • Tagliamonte, S. & Baayen, R. H. 2012. Models, forests and trees of York English: Was/were variation as a case study for statistical practice. Language Variation and Change 24(2): 135-178.

R code

• See the textbook and the companion website

• Exemplar-based MDS maps:
  • See my page on Academia.edu, paper “How to make semantic maps with R (based on contextual features of exemplars and Multidimensional Scaling)”
