green acre 2007 mc a col


Upload: hubik38

Post on 23-Dec-2015


Page 1: Green Acre 2007 Mc a Col

Correspondence analysis and Related Methods – Part 2

1. What is multiple correspondence analysis (MCA)?

2. Why is MCA so useful as a method of visualizing questionnaire data?

3. How is MCA implemented in XLSTAT?

Page 2: Green Acre 2007 Mc a Col

• “Classical” or “simple” CA analyses the relationships between two variables, although the method is extended to analyse different forms of tabular data, for example the product–attribute data shown previously, as well as ratings and preferences, on an individual or aggregate level.

• Multiple CA analyses several categorical variables where we are interested in all the relationships within the set of variables, not between one set and another.

• The best way to understand the difference is to see the different data format for the MCA program in XLSTAT: these are individual-level responses to several questions.

Page 3: Green Acre 2007 Mc a Col

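[Figure: individual-level data sheet. One block of columns holds the responses to four questions concerning working women; a second block holds the demographic categories.]

Source: Family & Changing Gender Roles Survey, ISSP (1994)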

Page 4: Green Acre 2007 Mc a Col

Between-set versus within-set

• Questions: Should a woman work full-time, work part-time or stay at home (or missing data) [4 response categories]: (Q1) before she has children; (Q2) when she has a pre-school child; (Q3) when children are still at school; (Q4) when all children have left home.

• Demographics: Country [24], Sex [2], Age group [6]

• “between-set” means that there are two sets of variables and we are interested in the relationships between them, e.g. between demographics and the question responses

• “within-set” means that there is one set of variables and we are interested in the relationships amongst them, e.g. amongst the question responses. This is the multiple correspondence analysis (MCA) case.

Page 5: Green Acre 2007 Mc a Col

Between-set example: Simple CA

Q3: Should a woman with a child at school work full-time, part-time or stay at home?

W = work full-time, w = work part-time, H = stay at home, ? = DK/unsure/missing

COUNTRY               W       w       H       ?   Total
AUS                 256    1156     176     191    1779
DW                  101    1394     581     248    2324
DE                  278     691      62      66    1097
GB                  161     646      70     107     984
NIRL                126     394      75      52     647
USA                 482     686     107     172    1447
A                    84     632     202      59     977
H                   285     736     447      32    1500
I                   171     670     167      10    1018
IRL                 223     424     209      82     938
NL                  539    1205     143      81    1968
N                   487    1242     205     153    2087
S                   295     833      39     105    1272
CZ                  228     585     198      13    1024
SLO                 341     428     222      41    1032
PL                  431     425     589     152    1597
BG                  270     427     335      94    1126
RUS                 175    1154     550     119    1998
NZ                  120     754      72     101    1047
CDN                 566     497     108     269    1440
IL                  468     664      92      63    1287
J                   203     671     313     120    1307
E                   738    1012     514     230    2494
RP                  243     448     484      25    1200
Total              7271   17774    5960    2585   33590
Average profile   0.216   0.529   0.177   0.077       1

Source: Family & Changing Gender Roles Survey, ISSP (1994)
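As a rough illustration of what simple CA does with this table, the sketch below recomputes the average profile and the principal inertias from the counts above. It assumes the usual SVD formulation of CA (standardized residuals of the correspondence matrix); the function name `simple_ca` is a placeholder of ours and is not part of XLSTAT.

```python
import numpy as np

# Counts from the table above: 24 countries (rows) by W, w, H, ? (columns).
counts = np.array([
    [256, 1156, 176, 191], [101, 1394, 581, 248], [278, 691, 62, 66],
    [161, 646, 70, 107], [126, 394, 75, 52], [482, 686, 107, 172],
    [84, 632, 202, 59], [285, 736, 447, 32], [171, 670, 167, 10],
    [223, 424, 209, 82], [539, 1205, 143, 81], [487, 1242, 205, 153],
    [295, 833, 39, 105], [228, 585, 198, 13], [341, 428, 222, 41],
    [431, 425, 589, 152], [270, 427, 335, 94], [175, 1154, 550, 119],
    [120, 754, 72, 101], [566, 497, 108, 269], [468, 664, 92, 63],
    [203, 671, 313, 120], [738, 1012, 514, 230], [243, 448, 484, 25],
], dtype=float)


def simple_ca(table):
    """Return row masses, column masses and principal inertias of a two-way table."""
    P = table / table.sum()                  # correspondence matrix
    r = P.sum(axis=1)                        # row masses
    c = P.sum(axis=0)                        # column masses = average row profile
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    singular_values = np.linalg.svd(S, compute_uv=False)
    return r, c, singular_values ** 2        # squared singular values = principal inertias


r, c, principal_inertias = simple_ca(counts)
total_inertia = principal_inertias.sum()

print("average profile:", np.round(c, 3))
print("total inertia:", total_inertia)
print("share of first two axes:", principal_inertias[:2] / total_inertia)
```

The printed shares can be compared with the percentages reported on the map of the next slide.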

Page 6: Green Acre 2007 Mc a Col

Simple CA

Should a woman with a child at school work full-time, part-time or stay at home?

[Figure: CA map of the 24 countries and the four response categories W, w, H, ?. Principal inertias: 0.0737 (50.6%) and 0.0532 (36.5%); 87.1% inertia explained.]

Page 7: Green Acre 2007 Mc a Col

Simple CA of multiway tables

Should a woman with a child at school work full-time, part-time or stay at home?

W = work full-time, w = work part-time, H = stay at home, ? = DK/unsure/missing

COUNTRY               W       w       H       ?   Total
AUSm                117     596     114      82     909
AUSf                138     559      60     109     866
DWm                  43     675     357     123    1198
DWf                  58     719     224     125    1126
. . .
RPm                 347     445     294     111    1197
RPf                 390     566     218     118    1292
Total              7271   17774    5960    2585   33590
Average profile   0.216   0.529   0.177   0.077       1

• Each country is split by gender: 24×2 country-gender groups. We say the variables country and gender are interactively coded.

• The average profile stays the same, so the definitions of centre and geometric distance remain identical to the previous map; all that has been done is to split each country point into two profiles.
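The point that splitting rows leaves the centre of the map unchanged can be checked on a small made-up table. The counts below are hypothetical (two countries, each split by gender), not survey data; the sketch also shows that the total inertia can only grow when rows are split, which is what allows a share of inertia to be attributed to the gender split.

```python
import numpy as np


def total_inertia(table):
    """Total inertia = chi-square statistic divided by the grand total."""
    P = table / table.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    expected = np.outer(r, c)
    return ((P - expected) ** 2 / expected).sum()


# Hypothetical counts: rows are (country X, m), (country X, f), (country Y, m), (country Y, f);
# columns are W, w, H, ?.
split = np.array([[30, 60, 20, 10],
                  [40, 55, 10, 15],
                  [10, 70, 40, 20],
                  [15, 80, 25, 20]], dtype=float)

# Adding the male and female rows of each country gives the unsplit table.
merged = split.reshape(2, 2, 4).sum(axis=1)

avg_split = split.sum(axis=0) / split.sum()
avg_merged = merged.sum(axis=0) / merged.sum()
inertia_split = total_inertia(split)
inertia_merged = total_inertia(merged)

print("same average profile:", np.allclose(avg_split, avg_merged))
print("inertia merged / split:", inertia_merged, inertia_split)
print("share due to the split:", (inertia_split - inertia_merged) / inertia_split)
```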

Page 8: Green Acre 2007 Mc a Col

Simple CA of multiway tables

Should a woman with a child at school work full-time, part-time or stay at home?

[Figure: CA map of the 48 country-gender profiles (AUSm, AUSf, DWm, DWf, ...) and the four response categories W, w, H, ?. Principal inertias: 0.0797 (51.5%) and 0.0546 (35.3%); 86.8% inertia explained.]

• Ireland (IRL) has the largest M–F difference

• Bulgaria (BG) is the only country with a reverse M–F difference

• Inertia before: 0.1456

• Inertia with M–F split: 0.1546

• 5.8% due to M–F, i.e. (0.1546 − 0.1456) / 0.1546

Page 9: Green Acre 2007 Mc a Col

Simple CA of multiway tables

Should a woman with a child at school work full-time, part-time or stay at home?

• Interactive coding of country (24), gender (2) and age (6), giving 288 combinations

• Points tend to lie in a curved pattern (called arch or horseshoe)

• Points that lie inside the arch are polarized, e.g. PLm26-35: 32% W, 22% w, 32% H, but NZm>66: 7% W, 73% w, 15% H. Average: 22% W, 53% w, 18% H

[Figure: CA map of the 288 country-gender-age profiles and the four response categories W, w, H, ?, with labelled points CDNf<25, Hm>66, PLm>66, NZm>66, DEm<25 and PLm26-35. Principal inertias: 0.1301 (54.3%) and 0.0791 (33.0%); 87.3% inertia explained.]

Page 10: Green Acre 2007 Mc a Col

Stacked tables

Should a woman with a child at school work full-time, part-time or stay at home?

• Since the column margins of each table are identical (and the same as in the interactively coded tables before), the basic geometry remains the same. Only the detail is sacrificed here: all the information is collapsed into “main effects”.

• Each variable is separately cross-tabulated with the question and then stacked one on top of another.

[Figure: six tables stacked row-wise, each with columns W, w, H, ?: Country (24), Gender (2), Age (6), Education (7), Marital status (5), Social class (8).]

• Inertia of stacked table is the average of the inertias of its subtables
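The averaging property can be verified numerically. The sketch below uses randomly generated, purely hypothetical respondents (one question, two demographic variables), cross-tabulates each demographic with the question, stacks the two tables and compares inertias. It relies on every respondent appearing in every subtable, so that all subtables share the same column margins.

```python
import numpy as np


def total_inertia(table):
    """Total inertia = chi-square statistic divided by the grand total."""
    P = table / table.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    expected = np.outer(r, c)
    return ((P - expected) ** 2 / expected).sum()


def crosstab(rows, cols, n_rows, n_cols):
    """Contingency table of two integer-coded variables."""
    table = np.zeros((n_rows, n_cols))
    np.add.at(table, (rows, cols), 1)
    return table


# Hypothetical data: 500 respondents, answer coded 0..3 (W, w, H, ?).
rng = np.random.default_rng(0)
n = 500
answer = rng.integers(0, 4, n)
gender = rng.integers(0, 2, n)
age = rng.integers(0, 6, n)

t_gender = crosstab(gender, answer, 2, 4)
t_age = crosstab(age, answer, 6, 4)
stacked = np.vstack([t_gender, t_age])   # same columns, rows stacked

inertia_stacked = total_inertia(stacked)
inertia_mean = np.mean([total_inertia(t_gender), total_inertia(t_age)])
print(inertia_stacked, inertia_mean)
```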

Page 11: Green Acre 2007 Mc a Col

Stacked tables

• Tables can be stacked row-wise and column-wise, adding additional questions as columns

[Figure: the six demographic variables, Country (24), Gender (2), Age (6), Education (7), Marital status (5) and Social class (8), as row blocks, and the four questions as column blocks, each with categories W, w, H, ?. The questions are: Should a (married) woman before having children... / ... with a preschool child... / ... with a child at school ... / ... when her children have left home work full-time, part-time or stay at home?]

• 24 contingency tables in a 6 × 4 pattern; row margins and column margins are the same.

• Inertia of stacked table is the average of the inertias of its subtables

Page 12: Green Acre 2007 Mc a Col

Stacked tables

Women in the workplace and 6 demographic variables

• Relationships between each demographic variable and each question displayed jointly

• Relationships within questions and relationships within demographics not displayed explicitly

• Join categories of ordinal variable to see trends, for example age.

[Figure: CA map of the stacked table. Column points: response categories 1W to 4?. Row points: the 24 countries, gender (M, F), age groups A1 to A6, marital status (ma, wi, di, se, si), education E1 to E7 and social class S0 to S6 and S*. Principal inertias: 0.0188 (49.1%) and 0.0084 (21.9%); 71.0% inertia explained.]

Page 13: Green Acre 2007 Mc a Col

Multiple correspondence analysis (MCA)

Women in the workplace – 4 questions

West & East German samples only

• Response data is recoded as dummy variables

• One definition of MCA is that it is the CA of the indicator matrix

• N rows, Q questions, q-th question has Jq categories, total number of categories is J (N = 3415, Q = 4, Jq = 4 for all q, J = 16)

Original data    Indicator matrix
Questions        Qu. 1     Qu. 2     Qu. 3     Qu. 4
1  2  3  4       W w H ?   W w H ?   W w H ?   W w H ?
------------------------------------------------------
1  3  2  2       1 0 0 0   0 0 1 0   0 1 0 0   0 1 0 0
2  3  3  2       0 1 0 0   0 0 1 0   0 0 1 0   0 1 0 0
4  3  3  2       0 0 0 1   0 0 1 0   0 0 1 0   0 1 0 0
4  4  4  4       0 0 0 1   0 0 0 1   0 0 0 1   0 0 0 1
4  4  4  4       0 0 0 1   0 0 0 1   0 0 0 1   0 0 0 1
1  3  2  1       1 0 0 0   0 0 1 0   0 1 0 0   1 0 0 0
. . . and so on for 3415 rows
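The recoding into dummy variables can be sketched in a few lines. The example uses only the six rows printed above; the helper name `indicator_matrix` is ours, and it assumes every question has the same number of categories, coded 1 to 4.

```python
import numpy as np

# The six response patterns shown above (categories coded 1..4 = W, w, H, ?).
responses = np.array([[1, 3, 2, 2],
                      [2, 3, 3, 2],
                      [4, 3, 3, 2],
                      [4, 4, 4, 4],
                      [4, 4, 4, 4],
                      [1, 3, 2, 1]])


def indicator_matrix(data, n_categories):
    """Recode an N x Q matrix of category codes (1-based) into an N x J 0/1 matrix."""
    n, q = data.shape
    Z = np.zeros((n, q * n_categories), dtype=int)
    # Column of the chosen category: block offset of the question plus category index.
    cols = np.arange(q) * n_categories + (data - 1)
    Z[np.arange(n)[:, None], cols] = 1
    return Z


Z = indicator_matrix(responses, 4)
print(Z)
print("row sums (always Q):", Z.sum(axis=1))
```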

Page 14: Green Acre 2007 Mc a Col

MCA: XLSTAT initial output

Total inertia: 3

Eigenvalues and percentages of inertia:

                           F1       F2       F3       F4       F5   ...  F12
Eigenvalue              0.692    0.513    0.365    0.307    0.218
Inertia (%)            23.061   17.108   12.156   10.248    7.254
Cumulative %           23.061   40.169   52.325   62.573   69.827
Adjusted Inertia        0.347    0.123    0.023    0.006
Adjusted Inertia (%)   66.152   23.482    4.456    1.118
Cumulative %           66.152   89.634   94.090   95.208

Total inertia in MCA of indicator matrix Z = $\frac{J - Q}{Q} = \frac{16 - 4}{4} = 3$
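The value (J − Q)/Q depends only on the numbers of questions and categories, not on the responses themselves. The sketch below checks this on randomly generated, hypothetical responses with the same layout (Q = 4 questions, 4 categories each), assuming the standard SVD computation of CA; it is not XLSTAT's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
Q, Jq, N = 4, 4, 300

# Hypothetical responses coded 0..3; the first four rows make sure every category occurs.
data = np.vstack([np.tile(np.arange(Jq)[:, None], (1, Q)),
                  rng.integers(0, Jq, (N - Jq, Q))])

# Indicator matrix Z (N x J).
Z = np.zeros((N, Q * Jq))
Z[np.arange(N)[:, None], np.arange(Q) * Jq + data] = 1
J = Z.shape[1]

# CA of Z: squared singular values of the standardized residuals.
P = Z / Z.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
eigenvalues = np.linalg.svd(S, compute_uv=False) ** 2

print("total inertia:", eigenvalues.sum())
print("(J - Q) / Q  :", (J - Q) / Q)
```

Since there are at most J − Q non-trivial dimensions, the average eigenvalue is 1/Q, which is the threshold that appears in the adjustment of the principal inertias later in the deck.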

Page 15: Green Acre 2007 Mc a Col

Multiple correspondence analysis (MCA)

Women in the workplace – 4 questions

• If Z (N×J) is the indicator matrix, then the Burt matrix B (J×J) is $B = Z^{\mathsf{T}} Z$

• Alternative definition of MCA is that it is the CA of the Burt matrix

• Stacked matrix of all two-way contingency tables, including each variable with itself

Burt matrix

        1W    1w    1H    1?    2W    2w    2H    2?    3W    3w    3H    3?    4W    4w    4H    4?
1W    2500     0     0     0   172  1107  1130    91   355  1709   345    91  1766   537    40   157
1w       0   476     0     0     7   129   335     5    16   261   181    18   128   293    17    38
1H       0     0    79     0     1     6    72     0     1    17    61     0    14    21    38     6
1?       0     0     0   360     1    57   108   194     7    96    55   202    51    45     2   262
2W     172     7     1     1   181     0     0     0   127    48     4     2   165    15     0     1
2w    1107   129     6    57     0  1299     0     0   219   997    61    22   972   239    13    75
2H    1130   335    72   108     0     0  1645     0    24   988   573    60   760   615    84   186
2?      91     5     0   194     0     0     0   290     9    50     4   227    62    27     0   201
3W     355    16     1     7   127   219    24     9   379     0     0     0   360    14     1     4
3w    1709   261    17    96    48   997   988    50     0  2083     0     0  1348   566    23   146
3H     345   181    61    55     4    61   573     4     0     0   642     0   202   286    73    81
3?      91    18     0   202     2    22    60   227     0     0     0   311    49    30     0   232
4W    1766   128    14    51   165   972   760    62   360  1348   202    49  1959     0     0     0
4w     537   293    21    45    15   239   615    27    14   566   286    30     0   896     0     0
4H      40    17    38     2     0    13    84     0     1    23    73     0     0     0    97     0
4?     157    38     6   262     1    75   186   201     4   146    81   232     0     0     0   463
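To make the structure of B concrete, the sketch below forms the Burt matrix of the six response patterns printed on the indicator-matrix slide (not of the full 3415-row data set) and inspects two of its blocks. Variable names are ours.

```python
import numpy as np

# Six response patterns from the indicator-matrix slide (categories coded 1..4).
responses = np.array([[1, 3, 2, 2],
                      [2, 3, 3, 2],
                      [4, 3, 3, 2],
                      [4, 4, 4, 4],
                      [4, 4, 4, 4],
                      [1, 3, 2, 1]])
N, Q = responses.shape
Jq = 4

# Indicator matrix Z (N x J), then Burt matrix B = Z^T Z (J x J).
Z = np.zeros((N, Q * Jq), dtype=int)
Z[np.arange(N)[:, None], np.arange(Q) * Jq + (responses - 1)] = 1
B = Z.T @ Z

# Diagonal block: question 1 crossed with itself (category counts on the diagonal).
print(B[:Jq, :Jq])
# Off-diagonal block: contingency table of question 1 by question 2.
print(B[:Jq, Jq:2 * Jq])
```

Every Jq × Jq block sums to N, and the diagonal blocks are themselves diagonal because a respondent gives exactly one answer per question.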

Page 16: Green Acre 2007 Mc a Col

MCA (Burt matrix version)

Women in the workplace – 4 questions

• Relationships amongst (within) the set of questions are displayed jointly

• Missing value categories have strong association

• Results are the same for the Burt matrix; only the principal inertias change.

[Figure: MCA map of the 16 response categories 1W to 4?. Principal inertias: 0.479 (41.9%) and 0.263 (23.0%); 64.9% inertia explained (only 40.2% if indicator matrix analysed).]
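The statement that only the principal inertias change can be made precise: those of the Burt matrix are the squares of those of the indicator matrix (on the slides, 0.692² ≈ 0.479 and 0.513² ≈ 0.263). The sketch below checks this on hypothetical random responses, assuming the standard SVD computation of CA; `principal_inertias` is our helper name.

```python
import numpy as np


def principal_inertias(table):
    """Principal inertias of a CA, sorted in decreasing order."""
    P = table / table.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2


rng = np.random.default_rng(2)
Q, Jq, N = 4, 4, 300

# Hypothetical responses coded 0..3; the first four rows make sure every category occurs.
data = np.vstack([np.tile(np.arange(Jq)[:, None], (1, Q)),
                  rng.integers(0, Jq, (N - Jq, Q))])
Z = np.zeros((N, Q * Jq))
Z[np.arange(N)[:, None], np.arange(Q) * Jq + data] = 1
B = Z.T @ Z

lam_indicator = principal_inertias(Z)
lam_burt = principal_inertias(B)

print(np.round(lam_indicator[:3], 4))
print(np.round(lam_burt[:3], 4))
print("Burt = indicator squared:", np.allclose(lam_burt, lam_indicator ** 2, atol=1e-10))
```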

Page 17: Green Acre 2007 Mc a Col

Multiple correspondence analysis (MCA)

Women in the workplace – 4 questions

• Total inertia of Burt matrix is average of the inertias of its submatrices = 1.143

• Since the diagonal inertias are so high, this inflates the average, hence low percentages

• Percentage of variance explained is actually much higher: in MCA the overall inertia is inflated by the diagonal tables in the Burt matrix. The percentage is actually about 90%.

Burt matrix – inertias of each subtable (the Burt matrix itself is the one shown on Page 15)

          Qu. 1   Qu. 2   Qu. 3   Qu. 4
Qu. 1     3.000   0.363   0.424   0.644
Qu. 2     0.363   3.000   0.892   0.345
Qu. 3     0.424   0.892   3.000   0.480
Qu. 4     0.644   0.345   0.480   3.000
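The arithmetic behind the figure of 1.143 can be reproduced directly from the sixteen subtable inertias printed on this slide. The sketch also computes the average over the off-diagonal tables only, which is the quantity that the inflated diagonal values obscure.

```python
import numpy as np

# Inertias of the 4 x 4 grid of subtables, as printed on the slide.
sub = np.array([[3.000, 0.363, 0.424, 0.644],
                [0.363, 3.000, 0.892, 0.345],
                [0.424, 0.892, 3.000, 0.480],
                [0.644, 0.345, 0.480, 3.000]])
Q = sub.shape[0]

inertia_burt = sub.mean()                        # average over all Q*Q subtables
off_diagonal = sub[~np.eye(Q, dtype=bool)]       # the Q*(Q-1) tables between different questions
average_off_diagonal = off_diagonal.mean()

print("total inertia of Burt matrix:", inertia_burt)
print("average off-diagonal inertia:", average_off_diagonal)
```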

Page 18: Green Acre 2007 Mc a Col

Adjustment of principal inertias (eigenvalues)

We can rescale an existing MCA solution in order to best fit the off-diagonal tables. All we need is the total inertia of the Burt matrix, inertia(B), and the principal inertias $\lambda_k^2$ of the Burt matrix in the solution space.

If we have computed the solution on the indicator matrix Z (as in the MCA module of XLSTAT), the eigenvalues calculated are $\lambda_k$, so the squares of all the principal inertias of Z need to be summed in order to get inertia(B). If you have analysed the Burt matrix B, inertia(B) is the total inertia.

Here are the steps to rescale the solution:

1. Calculate the average off-diagonal inertia:

$$\text{average off-diagonal inertia} = \frac{Q}{Q-1}\left(\mathrm{inertia}(B) - \frac{J-Q}{Q^2}\right)$$

2. Calculate the adjusted principal inertias:

$$\text{adjusted principal inertias} = \left(\frac{Q}{Q-1}\right)^2\left(\lambda_k - \frac{1}{Q}\right)^2 \quad \text{for } \lambda_k > \frac{1}{Q} \text{ only}$$

3. Calculate adjusted percentages of inertia:

$$\text{adjusted percentages of inertia} = \frac{\text{adjusted principal inertias}}{\text{average off-diagonal inertia}}$$
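As a worked check of the rescaling, the sketch below applies the standard adjustment (average off-diagonal inertia Q/(Q−1) · (inertia(B) − (J−Q)/Q²), adjusted inertias (Q/(Q−1))² (λ_k − 1/Q)² for λ_k > 1/Q) to the numbers reported in this deck: the indicator-matrix eigenvalues F1 to F5 from the XLSTAT output and inertia(B) = 1.143. Because these inputs are rounded to three decimals, the results agree with the XLSTAT "Adjusted Inertia" rows only approximately.

```python
import numpy as np

Q, J = 4, 16
# Eigenvalues of the indicator matrix (F1..F5 of the XLSTAT output); F6..F12 are not listed.
indicator_eigenvalues = np.array([0.692, 0.513, 0.365, 0.307, 0.218])
# Total inertia of the Burt matrix, as reported on the subtable-inertia slide.
inertia_burt = 1.143

# Step 1: average off-diagonal inertia.
average_off_diagonal = Q / (Q - 1) * (inertia_burt - (J - Q) / Q ** 2)

# Step 2: adjust only the eigenvalues above the threshold 1/Q.
keep = indicator_eigenvalues > 1 / Q
adjusted = (Q / (Q - 1)) ** 2 * (indicator_eigenvalues[keep] - 1 / Q) ** 2

# Step 3: adjusted percentages of inertia.
adjusted_pct = 100 * adjusted / average_off_diagonal

print("average off-diagonal inertia:", round(average_off_diagonal, 4))
print("adjusted inertias:", np.round(adjusted, 4))
print("adjusted percentages:", np.round(adjusted_pct, 1))
```

The fifth eigenvalue, 0.218, lies below 1/Q = 0.25 and is therefore dropped, which matches the four adjusted values in the XLSTAT table.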

Page 19: Green Acre 2007 Mc a Col

MCA (adjusted)

Women in the workplace – 4 questions

[Figure: MCA map of the 16 response categories 1W to 4? with adjusted principal inertias 0.347 (66.2%) and 0.123 (23.5%); 89.7% inertia explained.]

Page 20: Green Acre 2007 Mc a Col

MCA (Burt matrix version)

Women in the workplace – 4 questions

[Figure: the Burt-matrix map of the 16 response categories 1W to 4?, repeated for comparison with the adjusted solution. Principal inertias: 0.479 (41.9%) and 0.263 (23.0%); 64.9% inertia explained.]

Page 21: Green Acre 2007 Mc a Col

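MCA

Women in the workplace – supplementary demographic groups

[Figure: supplementary points on the MCA map for country (DW, DE), gender (M, F), age groups A1 to A6, education E1 to E6 and E*, and marital status (ma, wi, di, se, si). Both axes run from −0.5 to 0.5.]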

Page 22: Green Acre 2007 Mc a Col

Related topics

1. Subset correspondence analysis
• restricting analysis to a subset of categories (e.g. all substantive responses excluding missing categories, or missing categories by themselves, or “middle” categories)

2. Square asymmetric tables
• mobility tables, brand-switching, migration...

3. Recoding of data before applying CA
• ratings, preferences, paired comparisons, continuous-scale data (ratio and interval)

4. Stability and inference
• concentration ellipses, convex hulls, permutation tests

5. Canonical correspondence analysis (CCA)
• CA with explanatory variables (combination of dimension reduction and regression)

Page 23: Green Acre 2007 Mc a Col

Subset correspondence analysis

For example, analysing the women working data but ignoring the missing values. This is NOT just a CA of the table without the missing value columns: the masses and metric of the complete matrix are maintained. In XLSTAT’s MCA program you are given a menu for selecting which categories you want to retain or omit.
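One way to see what "maintaining the masses and metric of the complete matrix" means is the sketch below: the standardized residuals are computed from the full indicator matrix, and only afterwards are the columns of the chosen subset selected and decomposed. The data are hypothetical random responses, the last category of each question plays the role of the missing category, and this is a minimal illustration rather than XLSTAT's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
Q, Jq, N = 4, 4, 300

# Hypothetical responses coded 0..3; the first four rows make sure every category occurs.
data = np.vstack([np.tile(np.arange(Jq)[:, None], (1, Q)),
                  rng.integers(0, Jq, (N - Jq, Q))])
Z = np.zeros((N, Q * Jq))
Z[np.arange(N)[:, None], np.arange(Q) * Jq + data] = 1
J = Z.shape[1]

# Masses and standardized residuals come from the COMPLETE matrix.
P = Z / Z.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Assumption for illustration: category 3 of every question is the "missing" one.
missing = np.arange(J) % Jq == Jq - 1
S_subset = S[:, ~missing]                 # keep only the substantive categories

subset_inertias = np.linalg.svd(S_subset, compute_uv=False) ** 2
inertia_subset = subset_inertias.sum()
inertia_missing = (S[:, missing] ** 2).sum()
inertia_total = (S ** 2).sum()

print("subset + missing:", inertia_subset + inertia_missing)
print("total inertia   :", inertia_total)
```

Because the metric is unchanged, the inertia of the subset and the inertia of the omitted categories add up to the total inertia of the full analysis.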

Page 24: Green Acre 2007 Mc a Col

Subset correspondence analysis

[Figure: subset CA map of the 12 substantive response categories 1W to 4H, with the missing categories omitted. Principal inertias: 0.1240 (70.0%) and 0.0241 (13.5%).]

Page 25: Green Acre 2007 Mc a Col

Canonical correspondence analysis (CCA)

This has the same objective as CA but restricts the CA solution to be (linearly) related to external predictor variables. For example, we want to find the best low-dimensional view of the responses which is related to age (either age group or the original age variable).
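A common way to impose this restriction, sketched below under our own assumptions, is to project the standardized residuals of the CA onto the space spanned by the (mass-weighted, centred) explanatory variables before taking the SVD. The responses and age groups are hypothetical random data, age group is dummy-coded, and the sketch is not a description of how XLSTAT computes CCA.

```python
import numpy as np

rng = np.random.default_rng(4)
Q, Jq, N, n_age = 4, 4, 300, 6

# Hypothetical responses (coded 0..3) and age group (coded 0..5) per respondent.
data = np.vstack([np.tile(np.arange(Jq)[:, None], (1, Q)),
                  rng.integers(0, Jq, (N - Jq, Q))])
age = rng.integers(0, n_age, N)

Z = np.zeros((N, Q * Jq))
Z[np.arange(N)[:, None], np.arange(Q) * Jq + data] = 1

# Standardized residuals of the unconstrained CA.
P = Z / Z.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Explanatory variables: age-group dummies, first group dropped as the reference.
X = np.eye(n_age)[age][:, 1:]
X_centred = X - r @ X                         # centre with respect to the row masses
X_weighted = np.sqrt(r)[:, None] * X_centred  # weight rows by the square roots of their masses

# Orthogonal projector onto the column space of the weighted explanatory variables.
projector = X_weighted @ np.linalg.pinv(X_weighted)
S_constrained = projector @ S

constrained_inertias = np.linalg.svd(S_constrained, compute_uv=False) ** 2
total_inertia = (S ** 2).sum()

print("constrained inertia:", constrained_inertias.sum())
print("total inertia      :", total_inertia)
print("constrained dimensions:", int((constrained_inertias > 1e-10).sum()))
```

The constrained inertia is the part of the total inertia that is linearly related to age group, and with six age groups there can be at most five constrained dimensions.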

Page 26: Green Acre 2007 Mc a Col

Canonical correspondence analysis (restricted to age group differences)

[Figure: CCA map of the response categories Q1-1 to Q4-4 and the age groups agegp-1 to agegp-6. Axis labels: 0.685 (63.5%) and 0.465 (18.4%).]