multivariate analysis of mixed data: the pcamixdata r …mchave100p/wordpress/wp... · multivariate...

32
PCAmix PCArot MFAmix Multivariate analysis of mixed data: The PCAmixdata R package Marie Chavent In collaboration with: Vanessa Kuentz, Amaury Labenne, Benoˆ ıt Liquet, J´ erˆ ome Saracco University of Bordeaux, France Inria Bordeaux Sud-Ouest, CQFD Team Irstea, UR ADBX, cestas, France The University of Queensland, Australia UseR!2015 June 30 - July 3, Aalborg, Denmark

Upload: hadien

Post on 10-Mar-2018

292 views

Category:

Documents


6 download

TRANSCRIPT

Page 1: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

Multivariate analysis of mixed data: ThePCAmixdata R package

Marie ChaventIn collaboration with: Vanessa Kuentz, Amaury Labenne, Benoıt

Liquet, Jerome Saracco

University of Bordeaux, FranceInria Bordeaux Sud-Ouest, CQFD Team

Irstea, UR ADBX, cestas, FranceThe University of Queensland, Australia

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 2: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

Multivariate analysis of a mixture of numerical andcategorical data

Three main functions:

Function PCAmix for principal component analysis (PCA) ofmixed data.↪→ Includes standard PCA and MCA (multiple componentanalysis) as special cases.

Function PCArot for orthogonal rotation in PCAmix.↪→ Includes standard varimax rotation and rotation in MCA asspecial cases.

Function MFAmix for multiple factor analysis (MFA) formultiple-table mixed data.

https://github.com/chavent/PCAmixdata

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 3: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

Outline

1 PCAmix

2 PCArot

3 MFAmix

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 4: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

Principal component analysis of mixed data

Several implementations already in R:

Function FAMD in the R package FactoMineR.

↪→ Implements the method designed by Pages (2004).

Function dudi.mix in the R package ade4.

↪→ Implements the method of Hill & Smith (1976).

Function PCAmix in the R package PCAmixdata.↪→ Implements a single PCA with metrics based on a GSVDof preprocessed data.

⇒ Three different coding scheme but identical numerical results.

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 5: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

A real data example

The gironde data are available in the package

library(PCAmixdata)

data(gironde)

They characterize living conditions in Gironde, a southwestregion in France.

542 cities are described with 27 variables separated into 4groups (Employment, Housing, Services, Environment).

↪→ Four datatables

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 6: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

A mixed data type example

The datatable housing is mixed.

↪→ 3 numerical and 2 categorical variables.

housing <- gironde$housing

head(housing)

## density primaryres owners houses council

## ABZAC 132 89 64 inf 90% sup 5%

## AILLAS 21 88 77 sup 90% inf 5%

## AMBARES 532 95 66 inf 90% sup 5%

## AMBES 101 94 67 sup 90% sup 5%

## ANDERNOS 552 62 72 inf 90% inf 5%

## ANGLADE 64 81 81 sup 90% inf 5%

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 7: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

Two data sets:

↪→ a numerical data matrix X1 of dimension 542× 3.

↪→ a categorical data matrix X2 of dimension 542× 2.

split<-splitmix(housing)

X1<-split$X.quanti

X2<-split$X.quali

head(X1)

## density primaryres owners

## ABZAC 132 89 64

## AILLAS 21 88 77

## AMBARES 532 95 66

## AMBES 101 94 67

## ANDERNOS 552 62 72

## ANGLADE 64 81 81

head(X2)

## houses council

## ABZAC inf 90% sup 5%

## AILLAS sup 90% inf 5%

## AMBARES inf 90% sup 5%

## AMBES sup 90% sup 5%

## ANDERNOS inf 90% inf 5%

## ANGLADE sup 90% inf 5%

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 8: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

The PCAmix method

An simple algorithm in three main steps

1 Preprocessing step.

2 GSVD (Generalized Singular Value Decomposition) step.

3 Scores processing step.

Some notations:

- Let X1 be a n × p1 numerical data matrix.

- Let X2 be a n × p2 categorical data matrix.

- Let m be the total number of categories.

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 9: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

Preprocessing step

1 Build a numerical data matrix Z = (Z1|Z2) of dimensionn × (p1 + m) with:

↪→ Z1 the standardized version of the matrix X1.↪→ Z2 the centered indicator matrix of the levels of X2.

2 Build the diagonal matrix N of the weights of the rows.

↪→ The n rows are weighted by 1n .

3 Build the diagonal matrix M of the weights of the columns.

↪→ The p1 first columns are weighted by 1.↪→ The m last columns are weighted by n

ns, with ns the number of

observations with level s.

↪→ The total variance is p1 + m − p2.

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 10: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

GSVD step

The GSVD (Generalized Singular Value Decomposition) of Z withthe metrics N and M gives the decompositon

Z = UDVt (1)

where

- D = diag(√λ1, . . . ,

√λr ) is the r × r diagonal matrix of the

singular values of ZMZtN and ZtNZM, and r denotes therank of Z;

- U is the n × r matrix of the first r eigenvectors of ZMZtNsuch that UtNU = Ir ;

- V is the p × r matrix of the first r eigenvectors of ZtNZMsuch that VtMV = Ir .

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 11: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

Scores processing step

1 The set of factor scores for rows is computed as:

F = UD.

↪→ Principal Component scores

2 The set of factor scores for colums is computed as:

A = MVD.

↪→ In standard PCA: A = VD.

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 12: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

The R function

args(PCAmix)

## function (X.quanti = NULL, X.quali = NULL, ndim = 5, rename.level = FALSE,

## weight.col = NULL, weight.row = NULL, graph = TRUE)

## NULL

PCAmix(X.quanti=X1,X.quali=X2,ndim=2)

##

## Call:

## PCAmix(X.quanti = X1, X.quali = X2, ndim = 2)

##

## Method = Principal Component of mixed data (PCAmix)

##

##

## name description

## [1,] "$eig" "eigenvalues of the principal components (PC) "

## [2,] "$ind" "results for the individuals (coord,contrib,cos2)"

## [3,] "$quanti" "results for the quantitative variables (coord,contrib,cos2)"

## [4,] "$levels" "results for the levels of the qualitative variables (coord,contrib,cos2)"

## [5,] "$quali" "results for the qualitative variables (contrib,relative contrib)"

## [6,] "$sqload" "squared loadings"

## [7,] "$coef" "coef of the linear combinations defining the PC"

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 13: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

Principal Component scores of the observations

obj <- PCAmix(X.quanti=X1,X.quali=X2,ndim=2)

head(obj$ind$coord)

## dim 1 dim 2

## ABZAC 2.36 0.024

## AILLAS -0.88 0.123

## AMBARES 2.62 0.800

## AMBES 0.93 0.919

## ANDERNOS 1.18 -2.481

## ANGLADE -1.01 -0.424

plot(obj,choice="ind")

−2 0 2 4 6 8 10

−6

−4

−2

02

Individuals component map

Dim 1 (50.54 %)

Dim

2 (

21.3

9 %

)ABZACAILLAS

AMBARESAMBES

ANDERNOS

ANGLADE

ARBANATSARBIS

ARCACHON

ARCINS

ARES

ARSAC ARTIGUES−PRES−BORDEAUXARTIGUES−DE−LUSSAC

ARVEYRESASQUES

AUBIACAUBIE−ET−ESPESSAS

AUDENGEAURIOLLES

AUROS

AVENSANAYGUEMORTE−LES−GRAVES

BAGASBAIGNEAUXBALIZACBARIE

BARONBARP

BARSACBASSANNE

BASSENSBAURECHBAYAS

BAYON−SUR−GIRONDEBAZAS

BEAUTIRAN

BEGADAN

BEGLES

BEGUEY

BELIN−BELIETBELLEBAT

BELLEFONDBELVES−DE−CASTILLONBERNOS−BEAULAC

BERSONBERTHEZ

BEYCHAC−ET−CAILLAUBIEUJAC BIGANOS

BILLAUXBIRACBLAIGNAC

BLAIGNAN

BLANQUEFORT

BLASIMON

BLAYE

BLESIGNAC

BOMMES

BONNETANBONZAC

BORDEAUX

BOSSUGAN

BOULIAC

BOURDELLES BOURG

BOURIDEYS

BOUSCAT

BRACH BRANNEBRANNENS

BRAUD−ET−SAINT−LOUIS

BROUQUEYRAN

BRUGES

BUDOS

CABANAC−ET−VILLAGRAINS

CABARA

CADARSAC CADAUJAC

CADILLAC

CADILLAC−EN−FRONSADAISCAMARSACCAMBES

CAMBLANES−ET−MEYNAC

CAMIAC−ET−SAINT−DENISCAMIRANCAMPS−SUR−L'ISLE

CAMPUGNAN

CANEJANCANTENAC

CANTOISCAPIAN

CAPLONG

CAPTIEUX

CARBON−BLANC

CARCANS

CARDAN

CARIGNAN−DE−BORDEAUX

CARS

CARTELEGUE

CASSEUIL

CASTELMORON−D'ALBRET

CASTELNAU−DE−MEDOCCASTELVIELCASTETS−EN−DORTHECASTILLON−DE−CASTETS

CASTILLON−LA−BATAILLE

CASTRES−GIRONDE

CAUDROTCAUMONT

CAUVIGNAC

CAVIGNAC

CAZALIS

CAZATS

CAZAUGITAT

CENAC

CENON

CERONSCESSAC

CESTAS

CEZAC

CHAMADELLECISSAC−MEDOC

CIVRAC−DE−BLAYE

CIVRAC−SUR−DORDOGNE

CIVRAC−EN−MEDOC

CLEYRAC

COIMERES

COIRAC

COMPS

COUBEYRAC

COUQUEQUES

COURPIAC

COURS−DE−MONSEGURCOURS−LES−BAINS

COUTRASCOUTURES

CREONCROIGNONCUBNEZAISCUBZAC−LES−PONTS

CUDOS

CURSAN

CUSSAC−FORT−MEDOCDAIGNAC

DARDENACDAUBEZE

DIEULIVOL

DONNEZACDONZACDOULEZONEGLISOTTES−ET−CHALAURES

ESCAUDES

ESCOUSSANSESPIETESSEINTES

ETAULIERS

EYNESSE

EYRANS

EYSINES

FALEYRASFARGUESFARGUES−SAINT−HILAIREFIEU

FLOIRAC

FLAUJAGUESFLOUDES

FONTETFOSSES−ET−BALEYSSACFOURS

FRANCS

FRONSACFRONTENACGABARNACGAILLAN−EN−MEDOCGAJACGALGONGANS

GARDEGAN−ET−TOURTIRACGAURIAC

GAURIAGUET

GENERAC

GENISSACGENSACGIRONDE−SUR−DROPTGISCOS

GORNAC

GOUALADE

GOURS

GRADIGNAN

GRAYAN−ET−L'HOPITAL

GREZILLACGRIGNOLS

GUILLACGUILLOS

GUITRES

GUJAN−MESTRAS

HAILLAN

HAUX

HOSTENS

HOURTIN

HUREILLATSISLE−SAINT−GEORGES

IZON

JAU−DIGNAC−ET−LOIRAC

JUGAZAN

JUILLAC

LABARDE

LABESCAU

BREDE

LACANAU

LADAUXLADOSLAGORCE

LANDE−DE−FRONSAC

LAMARQUELAMOTHE−LANDERRONLALANDE−DE−POMEROL

LANDERROUATLANDERROUET−SUR−SEGUR

LANDIRAS

LANGOIRAN LANGON

LANSAC

LANTON

LAPOUYADE

LAROQUE

LARTIGUE

LARUSCADELATRESNELAVAZAN

LEGE−CAP−FERRET

LEOGEATSLEOGNAN

LERM−ET−MUSSETLESPARRE−MEDOC

LESTIAC−SUR−GARONNE

LEVES−ET−THOUMEYRAGUES

LIBOURNELIGNAN−DE−BAZASLIGNAN−DE−BORDEAUXLIGUEUX

LISTRAC−DE−DUREZELISTRAC−MEDOC

LORMONT

LOUBENSLOUCHATS

LOUPES

LOUPIAC

LOUPIAC−DE−LA−REOLE

LUCMAU

LUDON−MEDOC

LUGAIGNAC

LUGASSON

LUGON−ET−L'ILE−DU−CARNAY

LUGOSLUSSAC

MACAUMADIRACMARANSINMARCENAIS

MARCILLAC MARGAUXMARGUERON

MARIMBAULT

MARIONS

MARSAS MARTIGNAS−SUR−JALLEMARTILLAC

MARTRES

MASSEILLESMASSUGAS

MAURIAC

MAZERES

MAZION

MERIGNAC

MERIGNAS

MESTERRIEUX

MIOSMOMBRIERMONGAUZY

MONPRIMBLANCMONSEGUR

MONTAGNE

MONTAGOUDINMONTIGNAC

MONTUSSAN

MORIZESMOUILLAC

MOULIETS−ET−VILLEMARTINMOULIS−EN−MEDOCMOULONMOURENS

NAUJAC−SUR−MER

NAUJAN−ET−POSTIAC

NEAC

NERIGEAN

NEUFFONSNIZAN

NOAILLAC

NOAILLAN

OMET

ORDONNAC

ORIGNE PAILLET

PAREMPUYRE

PAUILLAC

PEINTURES

PELLEGRUE

PERISSACPESSAC

PESSAC−SUR−DORDOGNE

PETIT−PALAIS−ET−CORNEMPS

PEUJARDPIAN−MEDOC

PIAN−SUR−GARONNE

PINEUILHPLASSAC

PLEINE−SELVE

PODENSAC

POMEROL

POMPEJAC

POMPIGNAC

PONDAURATPORCHERES

PORGE

PORTETSPOUT

PRECHAC

PREIGNACPRIGNAC−EN−MEDOCPRIGNAC−ET−MARCAMPS PUGNAC

PUISSEGUIN

PUJOLS−SUR−CIRONPUJOLS

PUY

PUYBARBANPUYNORMAND

QUEYRAC

QUINSAC

RAUZANREIGNAC

REOLERIMONSRIOCAUD

RIONS

RIVIERE

ROAILLAN

ROMAGNEROQUEBRUNEROQUILLERUCH

SABLONSSADIRAC

SAILLANSSAINT−AIGNAN

SAINT−ANDRE−DE−CUBZACSAINT−ANDRE−DU−BOISSAINT−ANDRE−ET−APPELLES

SAINT−ANDRONY

SAINT−ANTOINE

SAINT−ANTOINE−DU−QUEYRETSAINT−ANTOINE−SUR−L'ISLE

SAINT−AUBIN−DE−BLAYESAINT−AUBIN−DE−BRANNE

SAINT−AUBIN−DE−MEDOC

SAINT−AVIT−DE−SOULEGE

SAINT−AVIT−SAINT−NAZAIRESAINT−BRICESAINT−CAPRAIS−DE−BLAYE

SAINT−CAPRAIS−DE−BORDEAUX

SAINT−CHRISTOLY−DE−BLAYE

SAINT−CHRISTOLY−MEDOC

SAINT−CHRISTOPHE−DES−BARDES

SAINT−CHRISTOPHE−DE−DOUBLE

SAINT−CIBARDSAINT−CIERS−D'ABZAC

SAINT−CIERS−DE−CANESSE

SAINT−CIERS−SUR−GIRONDE

SAINTE−COLOMBE

SAINT−COME

SAINTE−CROIX−DU−MONTSAINT−DENIS−DE−PILE

SAINT−EMILIONSAINT−ESTEPHE

SAINT−ETIENNE−DE−LISSE

SAINTE−EULALIESAINT−EXUPERYSAINT−FELIX−DE−FONCAUDE

SAINT−FERME

SAINTE−FLORENCE

SAINTE−FOY−LA−GRANDESAINTE−FOY−LA−LONGUE

SAINTE−GEMME

SAINT−GENES−DE−BLAYESAINT−GENES−DE−CASTILLON

SAINT−GENES−DE−FRONSACSAINT−GENES−DE−LOMBAUDSAINT−GENIS−DU−BOIS

SAINT−GERMAIN−DE−GRAVESAINT−GERMAIN−D'ESTEUIL

SAINT−GERMAIN−DU−PUCHSAINT−GERMAIN−DE−LA−RIVIERESAINT−GERVAISSAINT−GIRONS−D'AIGUEVIVES

SAINTE−HELENE

SAINT−HILAIRE−DE−LA−NOAILLESAINT−HILAIRE−DU−BOIS

SAINT−HIPPOLYTE

SAINT−JEAN−DE−BLAIGNAC

SAINT−JEAN−D'ILLAC

SAINT−JULIEN−BEYCHEVELLE

SAINT−LAURENT−MEDOC

SAINT−LAURENT−D'ARCE

SAINT−LAURENT−DES−COMBES

SAINT−LAURENT−DU−BOISSAINT−LAURENT−DU−PLAN

SAINT−LEGER−DE−BALSONSAINT−LEONSAINT−LOUBERT

SAINT−LOUBES

SAINT−LOUIS−DE−MONTFERRAND

SAINT−MACAIRE

SAINT−MAGNESAINT−MAGNE−DE−CASTILLONSAINT−MAIXANTSAINT−MARIENSSAINT−MARTIAL

SAINT−MARTIN−LACAUSSADE

SAINT−MARTIN−DE−LAYESAINT−MARTIN−DE−LERM

SAINT−MARTIN−DE−SESCASSAINT−MARTIN−DU−BOIS

SAINT−MARTIN−DU−PUYSAINT−MEDARD−DE−GUIZIERES

SAINT−MEDARD−D'EYRANSSAINT−MEDARD−EN−JALLES

SAINT−MICHEL−DE−CASTELNAU

SAINT−MICHEL−DE−FRONSAC

SAINT−MICHEL−DE−RIEUFRET

SAINT−MICHEL−DE−LAPUJADE

SAINT−MORILLON

SAINT−PALAIS

SAINT−PARDON−DE−CONQUES

SAINT−PAUL

SAINT−PEY−D'ARMENS

SAINT−PEY−DE−CASTETSSAINT−PHILIPPE−D'AIGUILLE

SAINT−PHILIPPE−DU−SEIGNALSAINT−PIERRE−D'AURILLAC

SAINT−PIERRE−DE−BAT

SAINT−PIERRE−DE−MONS

SAINT−QUENTIN−DE−BARON

SAINT−QUENTIN−DE−CAPLONG

SAINTE−RADEGONDESAINT−ROMAIN−LA−VIRVEESAINT−SAUVEUR

SAINT−SAUVEUR−DE−PUYNORMAND

SAINT−SAVIN

SAINT−SELVE

SAINT−SEURIN−DE−BOURG

SAINT−SEURIN−DE−CADOURNE

SAINT−SEURIN−DE−CURSAC

SAINT−SEURIN−SUR−L'ISLE

SAINT−SEVESAINT−SULPICE−DE−FALEYRENS

SAINT−SULPICE−DE−GUILLERAGUESSAINT−SULPICE−DE−POMMIERS

SAINT−SULPICE−ET−CAMEYRAC

SAINT−SYMPHORIENSAINTE−TERRESAINT−TROJAN

SAINT−VINCENT−DE−PAUL

SAINT−VINCENT−DE−PERTIGNASSAINT−VIVIEN−DE−BLAYE

SAINT−VIVIEN−DE−MEDOC

SAINT−VIVIEN−DE−MONSEGUR SAINT−YZAN−DE−SOUDIAC

SAINT−YZANS−DE−MEDOC

SALAUNESSALIGNAC

SALLEBOEUF

SALLESSALLES−DE−CASTILLON

SAMONACSAUCATS

SAUGON

SAUMOS

SAUTERNES

SAUVE

SAUVETERRE−DE−GUYENNE

SAUVIACSAVIGNAC

SAVIGNAC−DE−L'ISLE

SEMENS

SENDETS

SIGALENS

SILLAS

SOULAC−SUR−MER

SOULIGNAC

SOUSSAC

SOUSSANSTABANAC

TAILLAN−MEDOC

TAILLECAVAT

TALAIS

TALENCE

TARGON

TARNES

TAURIAC

TAYAC TEICH

TEMPLE

TESTE−DE−BUCH

TEUILLACTIZAC−DE−CURTONTIZAC−DE−LAPOUYADE TOULENNETOURNE

TRESSES

TUZANUZESTE

VALEYRAC

VAYRES

VENDAYS−MONTALIVET

VENSAC

VERAC

VERDELAIS

VERDON−SUR−MER

VERTHEUILVIGNONET

VILLANDRAUT

VILLEGOUGEVILLENAVE−DE−RIONS

VILLENAVE−D'ORNON

VILLENEUVE

VIRELADEVIRSAC YVRACMARCHEPRIME

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 14: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

Loadings of the numerical variables

head(obj$quanti.cor)

## dim1 dim2

## density 0.704 0.25

## primaryres -0.019 0.97

## owners -0.858 0.13

#Component map with factor scores of the numerical columns

plot(obj,choice="cor")

−2 −1 0 1 2

−1.

0−

0.5

0.0

0.5

1.0

Correlation circle

Dim 1 (50.54 %)

Dim

2 (

21.3

9 %

) density

primaryres

owners

↪→ The (non standardized) loadings are correlations.

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 15: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

Scores of the levels of the categorical variables

head(obj$categ.coord)

## dim1 dim2

## inf 90% 1.63 -0.339

## sup 90% -0.42 0.087

## inf 5% -0.40 -0.065

## sup 5% 1.52 0.245

plot(obj,choice="levels")

−0.5 0.0 0.5 1.0 1.5 2.0

−0.

4−

0.3

−0.

2−

0.1

0.0

0.1

0.2

0.3

Levels component map

Dim 1 (50.54 %)

Dim

2 (

21.3

9 %

)

inf 90%

sup 90%

inf 5%

sup 5%

↪→ The barycentric property is still verified.

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 16: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

Contributions of the variables

#contributions of the variables

head(obj$sqload)

## dim1 dim2

## density 0.49550 0.061

## primaryres 0.00035 0.946

## owners 0.73651 0.017

## houses 0.68226 0.030

## council 0.61226 0.016

plot(obj,choice="sqload",coloring.var="type")

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Squared loadings

Dim 1 (50.54 %)

Dim

2 (

21.3

9 %

)

density

primaryres

ownershousescouncil

numericalcategorical

The contribution cjα of a variable j to the component α is:{cjα = a2

jα = Squared correlation if variable j is numerical,

cjα =∑

s∈Ijnnsa2sα = Correlation ratio if variable j is categorical.

↪→ Called squared loadings in varimax criterion for PC rotation.

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 17: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

Principal components prediction

Each principal component fα writes as a linear combination of thecolumns of X = (X1|G) where X1 is the numerical data matrix andG is the indicator matrix of the levels of the matrix X2:

fα = β0 +

p1+m∑j=1

βjxj

with:

β0 = −p1∑k=1

vkαxksk−

p1+m∑k=p1+1

vkα,

βj = vjα1

sj, for j = 1, . . . , p1

βj = vjαn

nj, for j = p1 + 1, . . . , p1 + m

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 18: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmix

A real data exampleThe PCAmix methodPrincipal component prediction

The method predict()Coefficients of the PC found on the learning set (without the 5 first cities)

test <- 1:5

obj2 <- PCAmix(X1[-test,],X2[-test,],graph=FALSE,ndim=3)

data.frame(obj2$coef)

## dim1 dim2 dim3

## const 3.64214 -9.02951 1.5950

## density 0.00086 0.00048 0.0015

## primaryres -0.00129 0.09355 -0.0179

## owners -0.05065 0.01207 -0.0047

## inf 90% 1.03579 -0.32092 -0.4617

## sup 90% -0.26076 0.08079 0.1162

## inf 5% -0.25129 -0.05706 0.2704

## sup 5% 0.96441 0.21898 -1.0379

Scores of the 5 first cities on the principal components

predict(obj2,X1[test,],X2[test,])

## dim1 dim2 dim3

## ABZAC 2.39 0.011 -1.595

## AILLAS -0.87 0.122 0.084

## AMBARES 2.65 0.795 -1.098

## AMBES 0.94 0.895 -1.164

## ANDERNOS 1.19 -2.466 0.800

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 19: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixVarimax orthogonal rotation in PCAmix

Outline

1 PCAmix

2 PCArot

3 MFAmix

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 20: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixVarimax orthogonal rotation in PCAmix

Varimax type rotation in PCAmix

Let us introduce

T an orthonormal rotation matrix: TT′ = T′T = Ikk is the number of dimensions in the rotation procedure

⇒ Rotate the loading matrix and the principal components sothat groups of variables appear: having high loadings on thesame component and negligible ones on the remainingcomponents.

⇒ In PCAmix rotated squared loadings cjα are correlations orcorrelation ratios for categorical variables.

⇒ The varimax function writes:

f (T) =k∑

α=1

p∑j=1

(cjα)2 − 1

p

k∑α=1

p∑j=1

cjα

2

. (2)

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 21: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixVarimax orthogonal rotation in PCAmix

Find the optimal rotation matrix T

The varimax rotation problem is formulated as

maxT

{f (T)|TT′ = T′T = Ik}, (3)

⇒ An iterative procedure based on successive planar rotations.

⇒ Direct solution for the optimal angle of rotation (Chavent &al. 2012).

⇒ Reduces to the Kaiser (1958) for numerical data.

⇒ Performs rotation in MCA for categorical data.

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 22: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixVarimax orthogonal rotation in PCAmix

The R function

obj <- PCAmix(X.quanti=X1, X.quali=X2, rename.level=TRUE, graph=FALSE)

rot <- PCArot(obj,dim=3)

rot

##

## Call:

## PCArot(obj = obj, dim = 3)

##

## Method = rotation after Principal Component of mixed data (PCAmix)

## number of iterations: 4

##

## name description

## [1,] "$eig" "variances of the 'ndim' first dimensions after rotation"

## [2,] "$ind" "results for the individuals after rotation (coord)"

## [3,] "$quanti" "results for the quantitative variables (coord) after rotation"

## [4,] "$levels" "results for the levels of the qualitative variables (coord) after rotation"

## [5,] "$quali" "results for the qualitative variables (coord) after rotation "

## [6,] "$sqload" "squared loadings after rotation"

## [7,] "$coef" "coef of the linear combinations defining the rotated PC"

## [8,] "$theta" "angle of rotation if 'dim'=2"

## [9,] "$T" "matrix of rotation"

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 23: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixVarimax orthogonal rotation in PCAmix

The method plot()

plot(obj,choice="sqload",coloring.var="type",

leg=TRUE,axes=c(1,3), posleg="topright",

main="Squared loadings before rotation")

0.0 0.2 0.4 0.6 0.8

−0.

10.

00.

10.

20.

30.

40.

5

Squared loadings before rotation

Dim 1 (50.54 %)

Dim

3 (

12.6

1 %

)

density

primaryres

owners

houses

council

numericalcategorical

plot(rot,choice="sqload",coloring.var="type",

leg=TRUE,axes=c(1,3),posleg="topright",

main="Squared loadings after rotation")

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Squared loadings after rotation

Dim 1 (38.52 %)

Dim

3 (

24.8

8 %

)

density

primaryres

owners

houses

council

numericalcategorical

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 24: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixVarimax orthogonal rotation in PCAmix

The method plot()

plot(obj,choice="ind",label=FALSE,axes=c(1,3),

main="Observations before rotation")

−2 0 2 4 6 8 10

−2

02

46

Observations before rotation

Dim 1 (50.54 %)

Dim

3 (

12.6

1 %

)

plot(rot,choice="ind",label=FALSE,axes=c(1,3),

main="Observations after rotation")

−1 0 1 2 3 4

−2

02

46

810

12

Observations after rotation

Dim 1 (38.52 %)D

im 3

(24

.88

%)

↪→ Prediction of scores of new observations on the rotatedprincipal components

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 25: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixVarimax orthogonal rotation in PCAmix

The method predict()Coefficients of the PC found on the learning set (without the 5 first cities)

obj2 <- PCAmix(X1[-test,],X2[-test,],graph=FALSE)

rot2 <- PCArot(obj2,dim=3)

data.frame(rot2$coef)

## dim1.rot dim2.rot dim3.rot

## const 2.00163 -9.3e+00 1.3741

## density -0.00094 3.9e-05 0.0022

## primaryres 0.00688 9.6e-02 -0.0014

## owners -0.03313 1.4e-02 -0.0228

## inf 90% 1.23247 -2.2e-01 -0.1816

## sup 90% -0.31027 5.5e-02 0.0457

## inf 5% -0.44031 -1.2e-01 0.1958

## sup 5% 1.68985 4.6e-01 -0.7514

Scores of the 5 first cities on the rotated principal components

predict(rot2,X1[test,],X2[test,])

## dim1.rot dim2.rot dim3.rot

## ABZAC 3.28 0.36 -0.865

## AILLAS -0.72 0.12 -0.223

## AMBARES 2.90 0.98 -0.037

## AMBES 1.73 1.15 -0.764

## ANDERNOS 0.33 -2.64 0.868

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 26: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixMultiple Factor Analysis for mixed data

Outline

1 PCAmix

2 PCArot

3 MFAmix

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 27: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixMultiple Factor Analysis for mixed data

Multiple Factor Analysis for mixed data

⇒ Analyze a set of observations described by several groups ofvariables.

⇒ Make all the groups of variables comparable in the PCAanalysis by introducing weights: the weight of a variable is theinverse of the variance of the first principal component of itsgroup.

⇒ In the function MFA in the package FactoMineR, the natureof the variables (categorical or numerical) can vary from onegroup to another, but the variables should be of the sametype within a given group.

⇒ MFAmix is able to handle mixed data within a group ofvariables.

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 28: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixMultiple Factor Analysis for mixed data

The R function

dat<-cbind(gironde$employment,gironde$housing, gironde$services,gironde$environment)

#definition of the groups of variables

class.var<-c(rep(1,9),rep(2,5),rep(3,9),rep(4,4))

#names of the groups of variables

names<-c("employment","housing","services","environment")

#Perform MFAmix

obj3<-MFAmix(data=dat,groups=class.var,name.groups=names,ndim=3,rename.level=TRUE,graph=FALSE)

obj3

## **Results of the Multiple Factor Analysis for mixed data (MFAmix)**

## The analysis was performed on 542 individuals, described by 27 variables

## *Results are available in the following objects :

##

## name description

## 1 "$eig" "eigenvalues"

## 2 "$eig.separate" "eigenvalues of the separate analyses"

## 3 "$separate.analyses" "separate analyses for each group of variables"

## 4 "$groups" "results for all the groups"

## 5 "$partial.axes" "results for the partial axes"

## 6 "$ind" "results for the individuals"

## 7 "$ind.partial" "results for the partial individuals"

## 8 "$quanti" "results for the quantitative variables"

## 9 "$levels" "results for the levels of the qualitative variables"

## 10 "$quali" "results for the qualitative variables"

## 11 "$sqload" "squared loadings"

## 12 "$global.pca" "results for the global PCA"

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 29: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixMultiple Factor Analysis for mixed data

Some graphical output

−1.0 −0.5 0.0 0.5 1.0

−1.

0−

0.5

0.0

0.5

1.0

(a) Correlation circle

Dim 1 (21.78 %)

Dim

2 (

10.9

9 %

)

employmenthousingservicesenvironment

farmers

tradesmen

managers

workers unemployed

middleempl

retired

employrateincome

density

primaryresowners

buildingwater

vegetation

agricul

−1.0 −0.5 0.0 0.5 1.0

−1.

0−

0.5

0.0

0.5

1.0

(b) Partial axes

Dim 1 (21.78 %)

Dim

2 (

10.9

9 %

)

dim 1.employment

dim 2.employment

dim 3.employment

dim 1.housing

dim 2.housing

dim 3.housing

dim 1.servicesdim 2.servicesdim 3.services

dim 1.environment

dim 2.environment

dim 3.environment

employmenthousingservicesenvironment

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

(c) Groups representation

Dim 1 (21.78 %)

Dim

2 (

10.9

9 %

)

employment

housing

services

environment

0 5 10 15

−10

−5

05

(d) Partial observations

Dim 1 (21.78 %)

Dim

2 (

10.9

9 %

)

employmenthousingservicesenvironment

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 30: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixMultiple Factor Analysis for mixed data

Other packages working with mixed data

⇒ ClustOfVar for the clustering of variables.

⇒ divclust for the divisive and monothetic clustering ofobservations.

⇒ ClustGeo for the clustering with geographical constraints(very soon available for mixed data).

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 31: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixMultiple Factor Analysis for mixed data

Some references

Beaton, D., Chin Fatt, C. R., Abdi, H. (2014). An ExPosition ofmultivariate analysis with the singular value decomposition in R.Computational Statistics & Data Analysis, 72, 176-189.

Chavent, M., Kuentz, V., Liquet B., Saracco, J. (2012), ClustOfVar: An RPackage for the Clustering of Variables. Journal of Statistical Software 50,1-16.

Chavent, M., Kuentz, V., Saracco, J. (2012), Orthogonal Rotation inPCAMIX. Advances in Data Analysis and Classification 6, 131-146.

Chavent, M., Kuentz, V., Saracco, J. (2012), Multivariate analysis ofmixed type data: The PCAmixdata R package. arXiv:1411.4911v1.

Dray, S., Dufour, A., 2007. The ade4 package: implementing the dualitydiagram for ecologists. Journal of Statistical Software 22 (4), 120.

Le, S., Josse, J., Husson, F., et al. (2008). Factominer: an R package formultivariate analysis. Journal of Statistical Software 25 (1), 118.

UseR!2015 June 30 - July 3, Aalborg, Denmark

Page 32: Multivariate analysis of mixed data: The PCAmixdata R …mchave100p/wordpress/wp... · Multivariate analysis of mixed data: The PCAmixdata R package ... (PCA) of mixed data. ... Function

PCAmixPCArot

MFAmixMultiple Factor Analysis for mixed data

THANK YOU !

UseR!2015 June 30 - July 3, Aalborg, Denmark