+ The right answer to the wrong question
The use of factor analysis and principal component analysis in the social sciences
Jonathan Rose
Research Fellow
University of Nottingham
+ Before we start
A note of caution from the introductory section of a chapter
on factor analysis and principal component analysis in The R
Book:
These techniques are not recommended unless you know exactly
what you are doing, and exactly why you are doing it. Beginners
are sometimes attracted to multivariate techniques because of the
complexity of the output they produce, making the classic mistake
of confusing the opaque for the profound. (Crawley, 2007: 731)
This may be overstating the case somewhat, but it is nonetheless
a healthy reminder. In extremis, people’s lives are being
staked on incorrect models (more on this later).
+ A fundamental conception of latent
variables
Latent structure: the possibility that the variance in the
observed variables (indicators) can be accounted for by a
smaller number of latent variables, which are conceivably of
a more fundamental nature.
These variables are ‘latent’ in the sense that they are not
observed, and may well be unobservable.
Think, for instance, of intelligence, trust, confidence, happiness,
etc.
Almost everything we are really interested in measuring is a
latent variable, even if we don’t use latent variable models.
+ What you want
A method to analyze the structure of data
Either by testing for a specific structure (confirmatory models), or
by attempting to discover a structure through various means
(exploratory models)
Understanding that structure will tell you which of your
indicators are ‘like’ the others, and which are ‘different’:
basically, what we can lump together and what we can’t.
+ Why do you need that?
More reliable measurement
Require fewer variables in an analysis
Avoid multicollinearity
Understand deep-seated processes that drive responses
Help with conceptualizing the world
Avoid spuriously high correlations caused by analyzing two
halves of the same whole as if they were in a cause and effect
relationship
+ This is going surprisingly well!
Now, we just move the variable names over. That big friendly
‘OK’ button looks so inviting. I bet if I press that I’ll get my
factor analysis… The defaults will be fine. What’s the worst
that can happen?
+ But what did we actually get here?
Remember the method of analysis we chose?
And remember the title of the options box?
Any guesses?
+ That’s right!
We got a Principal Components Analysis (PCA).
If you look carefully, there are clues that this is what you’re getting;
but they don’t make it anywhere near as explicit as it ought to
be…
+ I’ve heard of PCA – isn’t it basically
the same thing as a factor analysis?
No, despite how they are usually treated.
There are similarities, which we will discuss in a short while – but
the take-home message of the presentation is that PCA and FA are
fundamentally different things, even if the results can be similar in
some circumstances.
+ Terminological confusion
Factor analysis has one of the most confused and
contradictory terminologies of any analytical method
Confusion around principal components analysis and factor
analysis
Confusion between various kinds of factor analysis
Confusion as to what you get out (e.g. factors, components,
principal components…)
And that is without dealing with extraction system,
eigenvalues, factor retention criteria, loadings…
+ Perpetuating confusion
One of the things that perpetuates confusion is the habit of
introductory texts of deliberately conflating FA and PCA.
For example, in SPSS Survival Manual (2007, 3rd Ed.), Pallant
says, in the chapter called ‘Factor Analysis’, “I have chosen to
demonstrate principal components analysis in this chapter. If
you would like to explore other approaches further, see
Tabachnick and Fidell (2007)”.
Judging by sales, and the number of copies in the library at
Nottingham, this book is clearly a popular way to learn about
quantitative analysis using SPSS – but even in the FA chapter it
doesn’t discuss FA.
+ Perpetuating confusion
You might have seen in research papers people saying
things like: “we employed a principal components factor
analysis (PCF) to aggregate groups of attitudinal questions
that reflect a common cluster”. Or “We performed a principal
component factor analysis of all drug prescriptions during
the entire course of the illness in a representative sample of
naturalistically treated bipolar outpatients.” Or countless
other examples.
‘Principal components factor analysis’ basically doesn’t exist:
it is a conflation of PCA and FA – and it’s difficult to know
exactly what one gets when papers say that they did this.
+ But PCA and FA are similar, right?
Somewhat. Indeed, sometimes people argue “either that
there is almost no difference between principal components
and factor analysis, or that PCA is preferable (Arrindell & van
der Ende, 1985; Guadagnoli and Velicer, 1988; Schoenmann,
1990; Steiger, 1990; Velicer & Jackson, 1990).” (from Costello
& Osborne, 2005, Best Practices in Exploratory Factor Analysis)
However…
+ PCA vs Factor Analysis
Whilst there are overlaps, and sometimes the solutions are
similar, they are fundamentally different procedures. They
are different:
Conceptually
Mathematically
Practically
However, you should note that how different the two analyses will
be in practice is not easily specified beforehand.
+ Conceptual matters
A very general latent variable model
Applies to all kinds of latent variable models
Multiple causes of manifest items
But with an important shared cause
(note that this is slightly different from
how you might see such models
elsewhere).
+ The factor analytic conceptual
model
Conceptually much like other latent variable models
Unique components are included in the ‘error’; they are
standardly lumped together because in reality you cannot
separate them
+ The PCA conceptual model
Notably different from the FA model, and from the conceptual
model of latent variables
+ PCA and causality
It is also more difficult to interpret PCA as a causal model,
since PCA aims to give you a number of linear
combinations of the variables that capture the variance in
the set of items as a whole, rather than an analysis of shared
variance (as in FA). This breaks (standard) conceptual
models of causality.
In PCA there is no need for the relationship to be causal, and so it’s
not such a big deal when people introduce items that are
clearly not caused by an underlying factor.
+ Mathematics
The equations underlying the procedures reflect this
difference in approaches.
For factor analysis, the model is:
For PCA it is:
+ The mathematical differences
between FA and PCA
It’s easy to see that the equations are different. One includes
error and unique variance, and the other does not. But this
difference means that the analyses are not even conducted
upon the same information.
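As a sketch in one common textbook notation (the symbols are assumptions and may differ from those on the original slides), take p standardized items, m < p orthogonal common factors, R the correlation matrix of the items, Λ the matrix of loadings, Ψ the diagonal matrix of unique variances, and V and D the eigenvectors and eigenvalues of R:

```latex
% Factor analysis: each item x_j is modelled by m common factors plus a unique term e_j
\[
x_j = \lambda_{j1} F_1 + \lambda_{j2} F_2 + \dots + \lambda_{jm} F_m + e_j ,
\qquad
R = \Lambda \Lambda^{\top} + \Psi
\]

% PCA: each component C_k is an exact weighted sum of all p items, with no error term
\[
C_k = w_{k1} x_1 + w_{k2} x_2 + \dots + w_{kp} x_p ,
\qquad
R = V D V^{\top}
\]
```

On this reading, FA works from the reduced matrix R − Ψ, whose diagonal holds the communalities, while PCA decomposes R itself, with ones on the diagonal. That is the sense in which the two analyses do not use the same information.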
+ Different matrices, different
answers?
So, we have seen that the mathematics are different, and that
means that we use different matrices for our analysis – but
does that mean that we are likely to see radically different
results when we perform analyses?
According to Dunteman (1989) in the Sage green book on
PCA, “Both principal components analysis and factor analysis
give similar results if the communalities of the variables are
high and/or there are a large number of variables”
That high communalities make a difference is not surprising,
since they bring the diagonal of the analyzed matrix increasingly
close to 1 (which is what it is in PCA).
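The point about the diagonal can be checked numerically. The following is a minimal sketch, not a full factor analysis: it assumes a single common factor with known loadings, builds the implied correlation matrix, and compares first-dimension loadings from the full matrix (PCA, ones on the diagonal) with those from the reduced matrix (communalities on the diagonal). Using the true communalities is an idealization; real FA routines have to estimate them. The function names are made up for this illustration.

```python
import numpy as np

def one_factor_corr(loadings):
    """Correlation matrix implied by one common factor plus unique variance."""
    lam = np.asarray(loadings, dtype=float)
    r = np.outer(lam, lam)
    np.fill_diagonal(r, 1.0)
    return r

def first_loadings(matrix):
    """Loadings on the first dimension: top eigenvector times sqrt(eigenvalue)."""
    values, vectors = np.linalg.eigh(matrix)  # eigenvalues come back in ascending order
    v = vectors[:, -1] * np.sqrt(values[-1])
    return v if v.sum() >= 0 else -v          # the sign of an eigenvector is arbitrary

def pca_vs_fa(loadings):
    """Return (PCA loadings, FA-style loadings) for an idealized one-factor case."""
    lam = np.asarray(loadings, dtype=float)
    r = one_factor_corr(lam)
    pca = first_loadings(r)                   # PCA: ones on the diagonal
    reduced = r.copy()
    np.fill_diagonal(reduced, lam ** 2)       # FA: communalities on the diagonal
    fa = first_loadings(reduced)
    return pca, fa

for lam in ([0.9, 0.9, 0.9, 0.9], [0.4, 0.4, 0.4, 0.4]):
    pca, fa = pca_vs_fa(lam)
    print(lam[0], np.round(pca, 3), np.round(fa, 3))
```

With four items loading 0.9, the PCA loadings come out at roughly 0.93 against 0.90 from the reduced matrix. With loadings of 0.4 the gap widens to roughly 0.61 against 0.40, which is Dunteman's point in miniature.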
+ Practical matters
If there were no practical implications of the choice between FA and PCA, or only minor ones, there would be very little to worry about. Yes, one model might be formally inappropriate, but we use formally inappropriate models all the time: linear regression of dichotomous items, SEM of non-multivariate normal data, etc., etc.
Unfortunately, FA and PCA are particularly susceptible to small deviations – not really because of any mathematical quirk, but because of you. FA and PCA, perhaps more than any other method of analysis, require a significant degree of interpretation and theoretical consideration. Coefficients never fully speak for themselves, but they do so even less in FA/PCA than we are used to.
+ A worked example
Data on the psychological impact of Huntington’s Disease
1803 cases
Dealing with:
Depressed mood
Low self-esteem
Suicidal thoughts
Anxiety
Compulsions
Perseveration
Apathy
Aggression
Irritability
Hallucinations
Delusions
+ Research findings
Of articles which analyze similar data, or older versions of
the same dataset, “[a]ll the studies have shown distinct
factors for depression, executive functions and irritability.”
(Rickards et al., 2011)
This study finds 4 factors – depression, executive function,
irritability, and also psychosis.
We might take issue with the idea of extracting 4 dimensions
anyway (more on this after the lunch break), but let‟s take it
as read for the minute that there really are 4. Does the
decision to use FA, rather than PCA, make a difference?
+ Again on terminology…
Note that they call their article “Factor Analysis of…”, but
actually use PCA – as do most other people, as far as I can
tell (if you’re going to do it, at least report what you actually
do).
+ Similar, but not identical results
Compulsive behaviour, perseveration and apathy – or just
compulsive behaviour and perseveration?
Do hallucinations and delusions ‘go together’?
Notice also the changes in the coefficients for aggression
(‘dab’) and irritability (‘ib’)
+ Not just mathematical quirks
This has real implications – the differences we see here are
not ‘massive’, in a traditional sense, but they would have
genuine consequences for how we interpret the world
around us.
In the article published using this research, they chose to
bold coefficients over 0.4 – on this criterion, apathy doesn’t
warrant inclusion in the FA model at all.
Yet the objective of the analysis is to develop rating scales for
use in actual day-to-day treatment. Errors here are
potentially very serious.
+ And finally, differences in software
Whilst this may be a problem which is especially bad in SPSS,
the problem is far more general. Try looking through R
packages that perform these types of analysis and try to
figure out what you are actually getting from the routine
(Normalization as standard? Rotation as standard? Which
rotation? How many factors does it standardly extract?...)
I’ve been unable to easily replicate a PCA analysis between
SPSS and R (using prcomp or princomp) – although the code
suggests I should be seeing essentially the same thing.
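One possible contributor to mismatches of this kind, offered as a guess rather than a diagnosis of the case above, is that routines differ in which matrix they decompose and how they scale what they report. R's prcomp, for example, centres but does not standardize by default and returns unit-length eigenvectors, whereas SPSS by default analyzes the correlation matrix and reports loadings (eigenvectors multiplied by the square root of the eigenvalue). The sketch below reproduces both conventions on synthetic data. The data and variable names are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): three correlated items
# measured on very different scales.
common = rng.normal(size=(500, 1))
data = (common + rng.normal(size=(500, 3))) * np.array([1.0, 10.0, 100.0])

def top_component(matrix):
    """Largest eigenvalue and its unit-length eigenvector, with the sign fixed."""
    values, vectors = np.linalg.eigh(matrix)  # eigenvalues in ascending order
    v = vectors[:, -1]
    return values[-1], (v if v.sum() >= 0 else -v)

# Convention 1: covariance matrix, unit-length eigenvector
cov_value, cov_vector = top_component(np.cov(data, rowvar=False))

# Convention 2: correlation matrix, eigenvector rescaled into loadings
cor_value, cor_vector = top_component(np.corrcoef(data, rowvar=False))
cor_loadings = cor_vector * np.sqrt(cor_value)

print("covariance eigenvector: ", np.round(cov_vector, 3))
print("correlation eigenvector:", np.round(cor_vector, 3))
print("correlation loadings:   ", np.round(cor_loadings, 3))
```

The covariance-based vector is dominated by the item with the largest scale, the correlation-based vector weights the items about equally, and the loadings are the same direction on a different scale. All three are legitimately "the first principal component", which is why it pays to check the manual for each routine.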
+ Recommendations
We started with the advice that “[t]hese techniques are not
recommended unless you know exactly what you are doing,
and exactly why you are doing it.”
So what are practical ways to begin, to make sure you’re
doing it right?
Study the manuals, or if you can, the code itself. It is rarely obvious
what routines actually do from the main interface itself.
Think carefully about what you actually want to find out before
you analyze the data you have – do you want a causal model? Do
you care about ‘unique’ variance?
Test different options that you could have sensibly used to see
how big a difference it would make.
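One way to make "how big a difference" concrete is to compare two solutions both by an overall similarity index and by the substantive decision they lead to. The sketch below uses Tucker's congruence coefficient for the first and a 0.4 reporting cutoff for the second. The loadings are hypothetical numbers chosen for illustration, not results from the Huntington's data, and the function names are made up.

```python
import numpy as np

def congruence(a, b):
    """Tucker's congruence coefficient between two loading vectors (1.0 = same pattern)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def items_above(loadings, cutoff=0.4):
    """Indices of items whose absolute loading reaches the reporting cutoff."""
    return [i for i, x in enumerate(loadings) if abs(x) >= cutoff]

# Hypothetical loadings of three items on one dimension under two methods
pca_like = [0.75, 0.70, 0.45]
fa_like = [0.70, 0.62, 0.33]

print(round(congruence(pca_like, fa_like), 3))
print(items_above(pca_like), items_above(fa_like))
```

Here the two solutions are almost perfectly congruent, yet the third item passes the cutoff in one and fails it in the other. A high similarity index does not guarantee the same interpretation.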
+ tl;dr
Factor analysis is not the same as principal components
analysis.
They can lead to different conclusions.
So be careful.