How To: Use the psych package for Factor Analysis and data reduction

William Revelle
Department of Psychology
Northwestern University

April 24, 2017
Contents
1 Overview of this and related documents
  1.1 Jump starting the psych package – a guide for the impatient
2 Overview of this and related documents
3 Getting started
4 Basic data analysis
  4.1 Getting the data by using read.file
  4.2 Data input from the clipboard
  4.3 Basic descriptive statistics
  4.4 Simple descriptive graphics
    4.4.1 Scatter Plot Matrices
    4.4.2 Correlational structure
    4.4.3 Heatmap displays of correlational structure
  4.5 Polychoric, tetrachoric, polyserial and biserial correlations
5 Item and scale analysis
  5.1 Dimension reduction through factor analysis and cluster analysis
    5.1.1 Minimum Residual Factor Analysis
    5.1.2 Principal Axis Factor Analysis
    5.1.3 Weighted Least Squares Factor Analysis
    5.1.4 Principal Components analysis (PCA)
    5.1.5 Hierarchical and bi-factor solutions
    5.1.6 Item Cluster Analysis: iclust
  5.2 Confidence intervals using bootstrapping techniques
  5.3 Comparing factor/component/cluster solutions
  5.4 Determining the number of dimensions to extract
    5.4.1 Very Simple Structure
    5.4.2 Parallel Analysis
  5.5 Factor extension
6 Classical Test Theory and Reliability
  6.1 Reliability of a single scale
  6.2 Using omega to find the reliability of a single scale
  6.3 Estimating ωh using Confirmatory Factor Analysis
    6.3.1 Other estimates of reliability
  6.4 Reliability and correlations of multiple scales within an inventory
    6.4.1 Scoring from raw data
    6.4.2 Forming scales from a correlation matrix
  6.5 Scoring Multiple Choice Items
  6.6 Item analysis
7 Item Response Theory analysis
  7.1 Factor analysis and Item Response Theory
  7.2 Speeding up analyses
  7.3 IRT based scoring
8 Multilevel modeling
  8.1 Decomposing data into within and between level correlations using statsBy
  8.2 Generating and displaying multilevel data
9 Set Correlation and Multiple Regression from the correlation matrix
10 Simulation functions
11 Graphical Displays
12 Miscellaneous functions
13 Data sets
14 Development version and a users guide
15 Psychometric Theory
16 SessionInfo
1 Overview of this and related documents
To do basic and advanced personality and psychological research using R is not as complicated as some think. This is one of a set of "How To" documents for doing various things using R (R Core Team, 2017), particularly using the psych (Revelle, 2017) package.
The current list of How To's includes:

1. Installing R and some useful packages
2. Using R and the psych package to find ωh and ωt
3. Using R and the psych package for factor analysis and principal components analysis (This document)
4. Using the scoreItems function to find scale scores and scale statistics
5. An overview (vignette) of the psych package
1.1 Jump starting the psych package – a guide for the impatient
You have installed psych (section 3) and you want to use it without reading much more. What should you do?
1. Activate the psych package:

R code
library(psych)
2. Input your data (section 4.2). If your file name ends in .sav, .text, .txt, .csv, .xpt, .rds, .Rds, .rda or .RDATA, then just read it in directly using read.file. Or you can go to your friendly text editor or data manipulation program (e.g., Excel) and copy the data to the clipboard. Include a first line that has the variable labels. Paste it into psych using the read.clipboard.tab command:
R code
myData <- read.file()   # this will open a search window on your machine
                        # and read or load the file
# or first copy your file to your clipboard and then
myData <- read.clipboard.tab()   # if you have an Excel file
3. Make sure that what you just read is right. Describe it (section 4.3) and perhaps look at the first and last few lines. If you want to "view" the first and last few lines using a spreadsheet-like viewer, use quickView.
R code
describe(myData)
headTail(myData)
or
quickView(myData)
4. Look at the patterns in the data. If you have fewer than about 10 variables, look at the SPLOM (Scatter Plot Matrix) of the data using pairs.panels (section 4.4.1).
R code
pairs.panels(myData)
5. Find the correlations of all of your data:
• Descriptively (just the values) (section 4.4.2)

R code
lowerCor(myData)
• Graphically (section 4.4.3)

R code
corPlot(r)
6. Test for the number of factors in your data using parallel analysis (fa.parallel, section 5.4.2) or Very Simple Structure (vss, section 5.4.1).
R code
fa.parallel(myData)
vss(myData)
7. Factor analyze (see section 5.1) the data with a specified number of factors (the default is 1). The default method is minimum residual and the default rotation for more than one factor is oblimin. There are many more possibilities (see sections 5.1.1-5.1.3). Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm (Revelle, 1979) (see section 5.1.6). Also consider a hierarchical factor solution to find coefficient ω (see section 5.1.5).
R code
fa(myData)
iclust(myData)
omega(myData)
8. Some people like to find coefficient α as an estimate of reliability. This may be done for a single scale using the alpha function (see section 6.1). Perhaps more useful is the ability to create several scales as unweighted averages of specified items using the scoreItems function (see section 6.4) and to find various estimates of internal consistency for these scales, find their intercorrelations, and find scores for all the subjects.
R code
alpha(myData)   # score all of the items as part of one scale
myKeys <- make.keys(nvar=20, list(first = c(1,-3,5,-7,8:10), second = c(2,4,-6,11:15,-16)))
my.scores <- scoreItems(myKeys, myData)   # form several scales
my.scores   # show the highlights of the results
At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading the entire overview vignette helpful to get a broader understanding of what can be done in R using psych. Remember that the help command (?) is available for every function. Try running the examples for each help page.
2 Overview of this and related documents
The psych package (Revelle, 2017) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, prep), a draft of which is available at http://personality-project.org/r/book/.
Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.
Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP) or parallel analysis (fa.parallel) criteria. Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa. Bifactor and hierarchical factor structures may be estimated by using Schmid-Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust) and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.
Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (cor.plot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, and structure.diagram, as well as item response characteristics and item and test information characteristic curves using plot.irt and plot.poly.
3 Getting started
Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa or principal) or hierarchical factor models (using omega or schmid) is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.
4 Basic data analysis
A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions it is necessary to make the package active by using the library command:
R code
library(psych)
The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

R code
.First <- function(x) {library(psych)}
4.1 Getting the data by using read.file
Although many find copying the data to the clipboard and then using the read.clipboard functions (see below) helpful, an alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt or .sav, the file will be read correctly.
R code
my.data <- read.file()
If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command.
R code
my.data <- read.file(widths = c(4, rep(1, 35)))  # will read in a file without a header row
# and 36 fields, the first of which is 4 columns wide, the rest of which are 1 column each
If the file is a RData file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata), the object will be loaded. Depending on what was stored, this might be several objects. If the file is a .sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (.xpt or .XPT) it will be read. .csv files (comma separated files), normal .txt or .text files, and .data or .dat files will be read as well. These are assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).
To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read the file again, converting the value labels to numeric values.
R code
my.spss <- read.file(use.value.labels=TRUE)  # this will keep the value labels for .sav files
my.data <- read.file()  # just read the file and convert the value labels to numeric
You can quickView the my.spss object to see what it looks like, as well as my.data to see the numeric encoding.
R code
quickView(my.spss)  # to see the original encoding
quickView(my.data)  # to see the numeric encoding
4.2 Data input from the clipboard
There are, of course, many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:
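The base-R route just described can be sketched as follows (a sketch only; it assumes tab-delimited data with a header row are already on the clipboard):

```r
# Base R alternatives to the read.clipboard family of functions
my.data <- read.table("clipboard", header = TRUE, sep = "\t")      # on a PC
my.data <- read.table(pipe("pbpaste"), header = TRUE, sep = "\t")  # on a Mac
```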
read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).
For example, given a data set copied to the clipboard from a spreadsheet, just enter the command:
R code
my.data <- read.clipboard()
This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.
R code
my.data <- read.clipboard(sep="\t")   # define the tab option, or
my.tab.data <- read.clipboard.tab()   # just use the alternative function
For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns wide, the second is 2 columns, the next 5 are 1 column each, and the last 4 are 3 columns each).
R code
my.data <- read.clipboard.fwf(widths=c(5, 2, rep(1,5), rep(3,4)))
4.3 Basic descriptive statistics
Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.
describe reports means, standard deviations, medians, min, max, range, skew, kurtosis and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.
It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data cannot be overemphasized.
> library(psych)
> data(sat.act)
> describe(sat.act)   # basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41
4.4 Simple descriptive graphics
Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
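As an illustration of error.bars.by (a sketch; the grouping by gender and the axis labels are choices for this example, not package defaults):

```r
library(psych)
data(sat.act)
# Mean SATV and SATQ with 95% confidence intervals, plotted separately by gender
error.bars.by(sat.act[c("SATV", "SATQ")], group = sat.act$gender,
              xlab = "Scale", ylab = "Mean SAT score")
```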
4.4.1 Scatter Plot Matrices
Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (See Figure 1 for an example.)
pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.
4.4.2 Correlational structure
There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.
> lowerCor(sat.act)
          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00
When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:
> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])
> png('pairspanels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower, upper)
> round(both, 2)
          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA
It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:
> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)
          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA
4.4.3 Heatmap displays of correlational structure
Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> cor.plot(Thurstone, numbers=TRUE, main='9 cognitive variables from Thurstone')
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ, main='24 variables in a circumplex')
> dev.off()
null device
          1

Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.
If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
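A minimal sketch of these correlation functions, assuming the dichotomous ability items and polytomous bfi items that ship with psych:

```r
library(psych)
data(ability)   # 0/1 scored ability items
data(bfi)       # 1-6 scored personality items

tet  <- tetrachoric(ability[, 1:4])  # tetrachoric correlations plus thresholds (tau)
poly <- polychoric(bfi[, 1:5])       # polychoric correlations for polytomous items
R    <- cor.smooth(tet$rho)          # smooth the matrix if it is not positive semi-definite
```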
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

¹Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.
At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.
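The SVD route can be sketched in base R (illustrative only; random data are used so that the example is self-contained):

```r
set.seed(17)
X <- scale(matrix(rnorm(200 * 6), nrow = 200, ncol = 6))  # standardized data matrix
udv <- svd(X)                                             # X = U D V'
scores <- udv$u %*% diag(udv$d)                    # principal component scores (unscaled)
loadings <- udv$v %*% diag(udv$d) / sqrt(nrow(X) - 1)  # loadings of the correlation matrix
```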
Several of the functions in psych address the problem of data reduction:
fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.
principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.
factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

²Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
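The functions listed above might be combined as in the following sketch (the Thurstone correlation matrix ships with psych; n.obs is supplied because the input is a correlation matrix rather than raw data):

```r
library(psych)
data(Thurstone)

nfactors(Thurstone, n.obs = 213)                 # several tests of the number of factors
f3 <- fa(Thurstone, nfactors = 3, n.obs = 213)   # minres extraction, oblimin rotation
p3 <- principal(Thurstone, nfactors = 3)         # principal components, for comparison
factor.congruence(f3, p3)                        # cosines between the two sets of loadings
fa.diagram(f3)                                   # path diagram of the factor structure
```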
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix nRn be approximated by the product of a factor matrix nFk and its transpose plus a diagonal matrix of uniquenesses?

R = FF' + U²   (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", and "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
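Equation 1 can be checked directly from a fitted solution. For an oblique solution the model implied matrix is FΦF' plus the uniquenesses; the following sketch compares it to the observed correlations:

```r
library(psych)
data(Thurstone)

f3t <- fa(Thurstone, nfactors = 3, n.obs = 213)   # minres with oblimin (the defaults)
# With correlated factors, Equation 1 generalizes to R = F Phi F' + U^2
model <- f3t$loadings %*% f3t$Phi %*% t(f3t$loadings)
diag(model) <- diag(model) + f3t$uniquenesses
round(Thurstone - model, 2)   # residuals should be near zero (also stored in f3t$residual)
```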
5.1.2 Principal Axis Factor Analysis
An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
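For example (a sketch using the Thurstone data as elsewhere in this document):

```r
library(psych)
data(Thurstone)
f3.pa  <- fa(Thurstone, 3, n.obs = 213, fm = "pa")                # iterated principal axis
f3.sas <- fa(Thurstone, 3, n.obs = 213, fm = "pa", max.iter = 1)  # one iteration: SAS-style
```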
The solutions from fa, factor.minres, and factor.pa, as well as the principal function, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin and quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

    PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).
The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution, and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                   WLS1   WLS2   WLS3    h2    u2  com
Sentences         0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary        0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion   0.833  0.034  0.007 0.735 0.265 1.00
First.Letters    -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words   -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes          0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series     0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees         0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group     -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy

                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components analysis (PCA)
An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases: factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics based upon the residual off diagonal elements are much worse than for the fa solution.
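The total-variance versus common-variance distinction can be seen directly from the eigendecomposition. A minimal base R sketch (the 3 x 3 correlation matrix here is made up for illustration, not the Thurstone data): unrotated component loadings are the eigenvectors scaled by the square roots of the eigenvalues, and they reproduce the entire correlation matrix, unique variance included.

```r
# Unrotated principal component loadings from a correlation matrix:
# loadings = eigenvectors %*% diag(sqrt(eigenvalues)).
# R is a made-up 3 x 3 correlation matrix, not the Thurstone data.
R <- matrix(c(1.0, 0.6, 0.5,
              0.6, 1.0, 0.4,
              0.5, 0.4, 1.0), 3, 3)
e <- eigen(R)
L <- e$vectors %*% diag(sqrt(e$values))
L %*% t(L)      # reproduces R exactly: PCA models the total variance
rowSums(L^2)    # "communalities" are exactly 1; nothing is left as uniqueness
```

A factor model, in contrast, reproduces only the off diagonal (common) part of the matrix, so its rows of squared loadings sum to the communalities, which are less than 1.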
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general, factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor as affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general and group factors.
> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))
Figure 6: A higher order factor solution to the Thurstone 9 variable problem.
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
> om <- omega(Thurstone,n.obs=213)
Figure 7: A bifactor solution to the Thurstone 9 variable problem.
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1. Find the proximity (e.g. correlation) matrix,

2. Identify the most similar pair of items,

3. Combine this most similar pair of items to form a new variable (cluster),

4. Find the similarity of this cluster to all other items and clusters,

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β),

6. Purify the solution by reassigning items to the most similar cluster center.
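The agglomeration loop in steps 1-5 can be sketched in a few lines of base R. This is only a caricature (cluster.sketch is a made-up name, the composite similarity is a crude average, and there is no α/β stopping rule, which is the part that makes iclust iclust):

```r
# A bare-bones sketch of the agglomerative steps above.  The real iclust
# also checks that the reliability coefficients alpha and beta increase
# before accepting a merge, and purifies the final solution.
cluster.sketch <- function(R, n.clusters = 2) {
  R <- as.matrix(R)
  while (ncol(R) > n.clusters) {
    A <- abs(R); diag(A) <- 0
    ij <- which(A == max(A), arr.ind = TRUE)[1, ]   # step 2: closest pair
    i <- ij[1]; j <- ij[2]
    new <- (R[, i] + R[, j]) / 2    # steps 3-4: crude composite similarity
    R2 <- rbind(cbind(R, new), c(new, 1))
    nm <- paste(colnames(R)[i], colnames(R)[j], sep = "+")
    colnames(R2)[ncol(R2)] <- rownames(R2)[nrow(R2)] <- nm
    R <- R2[-c(i, j), -c(i, j), drop = FALSE]       # step 5: repeat
  }
  colnames(R)   # the composite names record which items were merged
}

set.seed(42)
x <- matrix(rnorm(600), ncol = 6)
colnames(x) <- paste0("V", 1:6)
cluster.sketch(cor(x), n.clusters = 3)
```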
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])
Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis show three large clusters and a smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61
a progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:
$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}$$
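For loading matrices the congruence coefficient is a one-liner in base R: normalize each column to unit length and take cross products. (psych provides this as factor.congruence; the toy loading matrix f below is made up for illustration.)

```r
# Congruence coefficients as column cosines, in base R.
congruence <- function(f1, f2) {
  f1 <- as.matrix(f1); f2 <- as.matrix(f2)
  n1 <- sweep(f1, 2, sqrt(colSums(f1^2)), "/")  # unit-length columns
  n2 <- sweep(f2, 2, sqrt(colSums(f2^2)), "/")
  t(n1) %*% n2                                  # cross products = cosines
}
f <- matrix(c(0.9, 0.8, 0.7, 0.0, 0.1, 0.0,
              0.0, 0.1, 0.0, 0.8, 0.7, 0.6), ncol = 2)
round(congruence(f, f), 2)   # diagonal is exactly 1
```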
Consider the case of a four factor solution and four cluster solution to the Big Five problem.
> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)
Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)
Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")
Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P  C20  C16  C15  C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15  0.54
C2 C15 C15  0.62
C3 C15 C15  0.54
C4 C15 C15  0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59 -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50  0.40 -0.32
N1 C16 C16  0.76
N2 C16 C16  0.75
N3 C16 C16  0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16  0.55
O1 C20 C21 -0.53
O2 C21 C21  0.44
O3 C20 C21  0.39 -0.62
O4 C21 C21 -0.33
O5 C21 C21  0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix

     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90 % confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy

                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))

     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include:
1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis), fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size, in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing, but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
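The parallel-analysis idea in criterion 3 is easy to sketch in base R. (This is only a sketch with a made-up function name; psych's fa.parallel does it properly, resampling the real data and comparing factor as well as component eigenvalues.)

```r
# A minimal parallel-analysis sketch: keep components whose observed
# eigenvalues exceed the mean eigenvalue of random data of the same size.
parallel.sketch <- function(x, n.iter = 20) {
  obs  <- eigen(cor(x))$values
  rand <- replicate(n.iter,
                    eigen(cor(matrix(rnorm(nrow(x) * ncol(x)), nrow(x))))$values)
  sum(obs > rowMeans(rand))   # count of eigenvalues above the random mean
}

set.seed(17)
x <- matrix(rnorm(200 * 6), 200)   # six "items"...
x <- x + rnorm(200)                # ...that all share one general factor
parallel.sketch(x)                 # retains at least the general factor
```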
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix, after the factors of interest are extracted, is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest which account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).
Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")
Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463    12       1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data, as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.
A more tedious problem, in terms of computation, is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6
Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation, found by set.cor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha$$
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e²j, and is

\[
\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum(1 - r_{smc}^2)}{V_x}
\]
The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
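Under the same definitions, λ6 can be sketched in a line or two of R using the psych smc function (assuming standardized items, so that Vx is the sum of the correlation matrix):

```r
library(psych)
R  <- sim.hierarchical()             # population correlation matrix for 9 items
G6 <- 1 - sum(1 - smc(R)) / sum(R)   # lambda 6 = 1 - sum(1 - r^2_smc)/Vx
```

This should agree with the G6(smc) value that alpha or splitHalf reports for the same matrix.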
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)
[Figure: path diagram titled "Factor analysis and extension" showing factors MR1 and MR2 defined by the odd items V1–V15 (loadings roughly ±0.5 to ±0.7) and extended to the even items V2–V16 (loadings roughly ±0.5 to ±0.7)]
Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include
alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.
guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 … μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).
omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)
cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and an n x n correlation matrix, find the correlations of the composite clusters.
scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha, coefficient G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.
score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds
total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the reliability if one item is dropped at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)
Reliability analysis
Call: alpha(x = r9)
  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63
 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83
 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013
 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)
  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5
 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68
 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171
 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)
  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2
 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93
 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043
 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures
Reliability analysis
Call: alpha(x = items$observed)
  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73
 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76
 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022
 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02
Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
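The steps just described can be sketched with psych functions (the variable names here are ours):

```r
library(psych)
R  <- sim.hierarchical()        # 9 x 9 correlation matrix with a general factor
sl <- schmid(R, nfactors = 3)   # oblique factoring + Schmid-Leiman transformation
g  <- sl$sl[, "g"]              # general factor loadings for each item
omega.h <- sum(g)^2 / sum(R)    # omega_h = (sum of g loadings)^2 / Vx
```

The omega function wraps exactly this kind of sequence (plus flipping, diagrams, and fit statistics).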
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the
method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p. 137)
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.
There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the \(\sqrt{r_{ab}}\), where \(r_{ab}\) is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
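For example, assuming r9 from the earlier simulation, the three options could be compared as:

```r
omega(r9, nfactors = 2)                    # default: "equal" general loadings
omega(r9, nfactors = 2, option = "first")  # equate g with the first group factor
omega(r9, nfactors = 2, option = "second") # ... or with the second group factor
```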
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness (u²) from factor analysis to find e²j. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)
Letting \(x = cg + Af + Ds + e\), then the communality of item j, based upon general as well as group factors, \(h_j^2 = c_j^2 + \sum f_{ij}^2\), and the unique variance for the item, \(u_j^2 = \sigma_j^2(1 - h_j^2)\), may be used to estimate the test reliability. That is, if \(h_j^2\) is the communality of item j, based upon general as well as group factors, then for standardized items, \(e_j^2 = 1 - h_j^2\) and

\[
\omega_t = \frac{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}
\]
Because \(h_j^2 \geq r_{smc}^2\), \(\omega_t \geq \lambda_6\).
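The ωt computation can be sketched in R from any factor solution (assuming standardized items, so that Vx is the sum of the correlation matrix):

```r
library(psych)
R  <- sim.hierarchical()                     # 9-item correlation matrix
f3 <- fa(R, nfactors = 3)                    # group factor solution
omega.t <- 1 - sum(f3$uniquenesses) / sum(R) # omega_t = 1 - sum(u^2)/Vx
```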
It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.
\[
\omega_h = \frac{\mathbf{1}cc'\mathbf{1}}{V_x}
\]
Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by
\[
\omega_{\inf} = \frac{\mathbf{1}cc'\mathbf{1}}{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}
\]
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om9
9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83
Schmid Leiman Factor loadings greater than  0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63
With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36
general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65
The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
> om9 <- omega(r9,title="9 simulated variables")
[Figure: bifactor diagram titled "9 simulated variables" showing g loading on all nine items (loadings roughly 0.3 to 0.7) and group factors F1 (V1–V3), F2 (V7–V9), and F3 (V4–V6) with loadings roughly 0.2 to 0.5]
Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71
Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09
RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97
Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41
 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9,n.obs=500)
Call: omegaSem(m = r9, n.obs = 500)
Omega 
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip, 
    digits = digits, title = title, sl = sl, labels = labels, 
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83
Schmid Leiman Factor loadings greater than  0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63
With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36
general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65
The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71
Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09
RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97
Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41
 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of 
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83
With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.
> splitHalf(r9)
Split half reliabilities 
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items) but totals can be found as well. Personality researchers should be encouraged
to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
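The choice is a single argument to scoreItems; a sketch using one of the five scales (keys construction shown inline):

```r
library(psych)
keys <- make.keys(28, list(Agree = c(-1, 2:5)), item.labels = colnames(bfi))
scoreItems(keys, bfi)                 # default: average item score per scale
scoreItems(keys, bfi, totals = TRUE)  # total (summed) score instead
```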
The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.
Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above except that it does not include the extra 3 demographic items. This is useful when scoring the raw items because the response frequencies for each category are reported, and for the demographic data.
This use of making multiple key matrices, and then combining them into one super matrix of keys, is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.
Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores
Call: scoreItems(keys = keys, items = bfi)
(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6
Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017
Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23
 Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6
Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5
Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal 
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the dataprint with the short option = FALSE
To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 16.)
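For example, assuming the scores object created above, the pieces can be pulled out by name:

```r
scores$corrected        # scale intercorrelations corrected for attenuation
scores$item.cor         # item by scale correlations
headTail(scores$scores) # the individual scale scores for each person
```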
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function, or the scoreItems function. The use of a keys matrix is the same as in the raw data case.
Consider the same bfi data set, but first find the correlations, and then use cluster.cor.
> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)
Call: cluster.cor(keys = keys, r.mat = r.bfi)
Scale intercorrelations corrected for attenuation 
 raw correlations below the diagonal, (standardized) alpha on the diagonal 
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz 
     2
Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)
Call: score.multiple.choice(key = iq.keys, data = iqitems)
(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25
item statistics 
         key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> #just convert the items to true or false 
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results
         vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
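A minimal base-R sketch (toy data and keys, not the psych implementation) of what scoreItems does conceptually with 0/1 scored items: reverse-key the negatively keyed items and average the item scores into a scale score.

```r
# Toy 0/1 item matrix; a real analysis would use scored ability or
# personality items.
set.seed(7)
items <- matrix(rbinom(100 * 4, 1, 0.6), ncol = 4,
                dimnames = list(NULL, c("i1", "i2", "i3", "i4")))
keys <- c(1, 1, -1, 1)                        # -1 marks a reverse-keyed item
scored <- items
scored[, keys < 0] <- 1 - scored[, keys < 0]  # reversed 0/1 becomes 1/0
scale.score <- rowMeans(scored)               # one score per subject
```

The real scoreItems additionally reports alpha, item-whole correlations, and handles missing responses; this sketch only shows the keying logic.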
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss) or parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data, or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
Although the response.frequencies output from score.multiple.choice is useful for examining the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty (or location) and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> # note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2)) # set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)
[Figure 17 about here: four panels (reason4, reason16, reason17, reason19) plotting P(response) for each response alternative (0-6) against theta.]
Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δ_j, and item discrimination, α_j, may be found from factor analysis of the tetrachoric correlations, where λ_j is just the factor loading on the first factor and τ_j is the normal threshold reported by the tetrachoric function:
$$\delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}} \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{2}$$
where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case, and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λ_j) and item thresholds (τ_j) are just
$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}} \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}$$
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.
> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal") # dichotomous items
> test <- irt.fa(d9$items)

> test
Item Response Analysis using Factor Analysis
Call: irt.fa(x = d9$items)
Summary information by factor and item
Factor = 1
               -3    -2    -1     0     1     2     3
V1           0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2           0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3           0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4           0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5           0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6           0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7           0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8           0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9           0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info    0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM          1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02  0.53  0.65  0.68  0.64  0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)
Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235
The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1
Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90% confidence intervals are 0.198 0.218
BIC = 1009.07
Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:
> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = bfi[11:15])
Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)
Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27
The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06
Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90% confidence intervals are 0.083 0.111
BIC = 96.23
The item information functions show that not all of the items are equally good (Figure 19):
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))
[Figure 18 about here: three panels plotting the 9 items (V1-V9) against the Latent Trait (normal scale): "Item parameters from factor analysis" (probability of response), "Item information from factor analysis", and "Test information -- item parameters from factor analysis".]
Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt, type="IIC")
[Figure 19 about here: "Item information from factor analysis for factor 1", information curves for E1-E5 (E1 and E2 negatively keyed) against the Latent Trait (normal scale).]
Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info, sort=TRUE)
Item Response Analysis using Factor Analysis
Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 21) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.
> iq.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = iq.tf)
Summary information by factor and item
Factor = 1
            -3    -2    -1     0     1     2     3
reason4   0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason16  0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason17  0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason19  0.05  0.17  0.41  0.46  0.22  0.07  0.02
> iq.irt <- irt.fa(iq.tf)
[Figure 20 about here: "Item information from factor analysis", information curves for the 16 ability items (reason4 through rotate8) against the Latent Trait (normal scale).]
Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter7   0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter33  0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter34  0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter58  0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix45  0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix46  0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix47  0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix55  0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate3   0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate4   0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate6   0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate8   0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info 0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM       1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49 0.80 0.85 0.82 0.61 -0.40
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)
Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0
The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09
Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90% confidence intervals are 0.126 0.135
BIC = 2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.
Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.
> v9 <- sim.irt(9, 1000, -2, 2, mod="normal") # dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho, 4)
[Figure 21 about here: an omega diagram linking the 16 ability items to a general factor g (loadings 0.4-0.7) and four group factors F1-F4.]
Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> # now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- score.irt(test, items)
> unitweighted <- score.irt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")
These results are seen in Figure 22
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complexanalysis of hierarchical (multilevel) data structures statsBy is a much simpler functionto give some of the basic descriptive statistics for two level models
This follows the decomposition of an observed correlation into the pooled correlation within groups (r_wg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:
$$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \tag{3}$$
Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df, pch='.', gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)
[Figure 22 about here: a pairs.panels scatter plot matrix, "Comparing true theta for IRT, Rasch and classically based scoring", relating True theta, irt theta, total, fit, rasch, total, and fit.]
Comparing true theta for IRT Rasch and classically based scoring
where r_xy is the normal correlation, which may be decomposed into within group and between group correlations, r_xy_wg and r_xy_bg, and η (eta) is the correlation of the data with the within group values or the group means.
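Equation (3) can be checked numerically in base R (toy grouped data, not the statsBy implementation): split each variable into group means and within-group deviations, and the eta-weighted sum of the two correlations reproduces the raw correlation exactly.

```r
# 10 groups of 20 observations; x and y share some group-level variance
set.seed(42)
g <- rep(1:10, each = 20)
x <- rnorm(200) + g / 2
y <- rnorm(200) + g / 2
xm <- ave(x, g); ym <- ave(y, g)   # between part: group means
xw <- x - xm;   yw <- y - ym       # within part: deviations from group means
r.wg <- cor(xw, yw)                # pooled within-group correlation
r.bg <- cor(xm, ym)                # between-group correlation
# etas are the ratios of component sd to total sd (equivalently, cor(x, xw))
eta.xw <- sd(xw) / sd(x); eta.yw <- sd(yw) / sd(y)
eta.xb <- sd(xm) / sd(x); eta.yb <- sd(ym) / sd(y)
r.xy <- eta.xw * eta.yw * r.wg + eta.xb * eta.yb * r.bg
```

The identity is exact because the within deviations are orthogonal to the group means, so the cross covariances vanish.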
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, and -1. These two sets of correlations are crossed such that V1, V4 and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5 and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2 and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.
The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)), or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
$$R^2 = 1 - \prod_{i=1}^{n}(1-\lambda_i)$$

where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix

$$R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}.$$
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. An alternative statistic based upon the average canonical correlation might then be more appropriate.
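A base-R sketch (hypothetical toy data, not the setCor implementation) of the set correlation R² from the eigenvalues of R_xx^{-1} R_xy R_yy^{-1} R_yx, checked against the equivalent determinant form 1 - |R| / (|R_xx| |R_yy|):

```r
# Toy data: two sets of three variables, with one criterion built from
# two of the predictors so the sets overlap.
set.seed(1)
d <- as.data.frame(matrix(rnorm(500 * 5), ncol = 5))
d$V6 <- d$V1 + d$V2 + rnorm(500)
R <- cor(d)
x <- 1:3; y <- 4:6                         # predictor and criterion sets
Rxx <- R[x, x]; Ryy <- R[y, y]; Rxy <- R[x, y]
M <- solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy)
ev <- Re(eigen(M)$values)                  # squared canonical correlations
R2.set <- 1 - prod(1 - ev)
R2.det <- 1 - det(R) / (det(Rxx) * det(Ryy))  # same quantity via determinants
```

The two routes agree because det(I - M) equals |R| / (|R_xx| |R_yy|) by the Schur complement.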
setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.
Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.
For this example, the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)
Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)
Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476
Compare this with the output from setCor:
> # compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs=700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)
Multiple Regression from matrix input
Beta weights
             ACT  SATV  SATQ
gender     -0.05 -0.03 -0.18
education   0.14  0.10  0.10
age         0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19
multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11
Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
            ACT SATV SATQ
gender     0.18 4.29 4.34
education  0.22 5.13 5.18
age        0.22 5.11 5.16

t of Beta Weights
             ACT  SATV  SATQ
gender     -0.27 -0.01 -0.04
education   0.65  0.02  0.02
age         0.15 -0.02 -0.02

Probability of t <
            ACT SATV SATQ
gender     0.79 0.99 0.97
education  0.51 0.98 0.98
age        0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05
degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohen's Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohen's Set Correlation:  7.26  9  1681.86
Unweighted correlation between the two sets = 0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric: that is, the R² is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth", it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, and sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.
sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette psych forsem
The default for sim.structure is to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
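The simplex correlation structure just described can be sketched in one line of base R (the helper name simplex.R is hypothetical, not psych's sim.simplex, which generates data rather than a correlation matrix):

```r
# R[i, j] = r^|i - j|: adjacent variables correlate r,
# variables n apart correlate r^n
simplex.R <- function(nvar, r) r^abs(outer(1:nvar, 1:nvar, "-"))
R <- simplex.R(6, 0.8)
```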
Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
Although coefficient ω_h is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
The logistic model is
$$P(x|\theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \tag{4}$$
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination, and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
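Equation (4) is short enough to sketch directly in base R (irt.4pl is a hypothetical name, not a psych function); at θ = δ the curve sits halfway between the two asymptotes.

```r
# Four parameter logistic item response function, equation (4)
irt.4pl <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1) {
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
}
p <- irt.4pl(theta = 0, delta = 0, alpha = 1.5, gamma = 0.2)  # halfway: 0.6
```

With the defaults (γ = 0, ζ = 1, α = 1) this reduces to the Rasch model.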
(Graphics of these may be seen in the demonstrations for the logistic function)
The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
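The 1.702 scaling can be verified numerically in base R: the logistic curve with slope 1.702 stays within about .01 of the normal ogive everywhere.

```r
# Maximum absolute difference between the scaled logistic and normal ogive
x <- seq(-4, 4, 0.01)
max.diff <- max(abs(plogis(1.702 * x) - pnorm(x)))
```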
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):
f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)

c <- iclust(Thurstone)
plot(c)           # a pretty boring plot
iclust.diagram(c) # a better diagram

c3 <- iclust(Thurstone, 3)
plot(c3)          # a more interesting plot

data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)

ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double") # for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")         # but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)
[Figure 23 about here: the "Demonstration of dia functions" plot, with labeled rectangles and ellipses connected by dia.arrow, dia.curve, and dia.curved.arrow paths.]
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index of psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.
df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LATEX table. May be used when Sweave is not convenient.
cor2latex Will format a correlation matrix in APA style in a LATEX table. See also fa2latex and irt2latex.
cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.
fisherz Convert a correlation to the corresponding Fisher z score
geometric.mean and harmonic.mean find the appropriate mean for working with different kinds of data.
ICC and cohen.kappa are typically used to find the reliability for raters.
headtail combines the head and tail functions to show the first and last lines of a dataset or output
topBottom Same as headtail Combines the head and tail functions to show the first andlast lines of a data set or output but does not add ellipsis between
mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.
p.rep finds the probability of replication for an F, t, or r and estimates effect size.
partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)
rangeCorrection will correct correlations for restriction of range
reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems.
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.
Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.
bfi 25 personality self report items taken from the International Personality Item Pool(ipiporiorg) were included as part of the Synthetic Aperture Personality Assessment(SAPA) web based personality assessment project The data from 2800 subjects areincluded here as a demonstration set for scale construction factor analysis and ItemResponse Theory analyses
satact Self reported scores on the SAT Verbal SAT Quantitative and ACT were col-lected as part of the Synthetic Aperture Personality Assessment (SAPA) web basedpersonality assessment project Age gender and education are also reported Thedata from 700 subjects are included here as a demonstration set for correlation andanalysis
epibfi A small data set of 5 scales from the Eysenck Personality Inventory 5 from a Big 5inventory a Beck Depression Inventory and State and Trait Anxiety measures Usedfor demonstrations of correlations regressions graphic displays
80
iq 14 multiple choice ability items were included as part of the Synthetic Aperture Person-ality Assessment (SAPA) web based personality assessment project The data from1000 subjects are included here as a demonstration set for scoring multiple choiceinventories and doing basic item statistics
galton Two of the earliest examples of the correlation coefficient were Francis Galtonrsquosdata sets on the relationship between mid parent and child height and the similarity ofparent generation peas with child peas galton is the data set for the Galton heightpeas is the data set Francis Galton used to ntroduce the correlation coefficient withan analysis of the similarities of the parent and child generation of 700 sweet peas
Dwyer Dwyer (1937) introduced a method for factor extension (see faextension thatfinds loadings on factors from an original data set for additional (extended) variablesThis data set includes his example
miscellaneous cities is a matrix of airline distances between 11 US cities and maybe used for demonstrating multiple dimensional scaling vegetables is a classicdata set for demonstrating Thurstonian scaling and is the preference matrix of 9vegetables from Guilford (1954) Used by Guilford (1954) Nunnally (1967) Nunnallyand Bernstein (1994) this data set allows for examples of basic scaling techniques
14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib/ and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,
> news(Version > "1.2.8", package="psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.
16 SessionInfo
This document was prepared using the following settings
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33  matrixcalc_1.0-3  MASS_7.3-45   grid_3.3.0   arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0     coda_0.18-1       minqa_1.2.4   mi_1.0       nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18      splines_3.3.0     lme4_1.1-12   sem_3.1-7    tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1  parallel_3.3.0  mnormt_1.5-4
References
Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405-432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439-458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447-473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245-276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78-98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173-178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430-450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255-282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121-132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41-54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179-185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283-300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537-549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231-258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309-317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153-175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676-1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481-495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57-74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39-73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403-414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27-49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika, 74(1):145-154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83-90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420-428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345-353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors". Psychological Review, 42(5):425-454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321-327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123-133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121-144.
1 Overview of this and related documents
To do basic and advanced personality and psychological research using R is not as complicated as some think. This is one of a set of "How To" guides for doing various things using R (R Core Team, 2017), particularly using the psych (Revelle, 2017) package.

The current list of How To's includes:
1. Installing R and some useful packages.
2. Using R and the psych package to find ωh and ωt.
3. Using R and the psych package for factor analysis and principal components analysis. (This document.)
4. Using the scoreItems function to find scale scores and scale statistics.
5. An overview (vignette) of the psych package.
1.1 Jump starting the psych package - a guide for the impatient

You have installed psych (section 3) and you want to use it without reading much more. What should you do?
1. Activate the psych package:
R code
library(psych)
2. Input your data (section 4.2). If your file name ends in .sav, .text, .txt, .csv, .xpt, .rds, .Rds, .rda or .RDATA, then just read it in directly using read.file. Or you can go to your friendly text editor or data manipulation program (e.g., Excel) and copy the data to the clipboard. Include a first line that has the variable labels. Paste it into psych using the read.clipboard.tab command.

R code
myData <- read.file()   # this will open a search window on your machine and read or load the file
# or
# first copy your file to your clipboard and then
myData <- read.clipboard.tab()   # if you have an excel file
3. Make sure that what you just read is right. Describe it (section 4.3) and perhaps look at the first and last few lines. If you want to "view" the first and last few lines using a spreadsheet-like viewer, use quickView.

R code
describe(myData)
headTail(myData)
# or
quickView(myData)
4. Look at the patterns in the data. If you have fewer than about 10 variables, look at the SPLOM (Scatter Plot Matrix) of the data using pairs.panels (section 4.4.1).

R code
pairs.panels(myData)
5. Find the correlations of all of your data.

• Descriptively (just the values) (section 4.4.2):
R code
lowerCor(myData)

• Graphically (section 4.4.3):
R code
corPlot(r)
6. Test for the number of factors in your data using parallel analysis (fa.parallel, section 5.4.2) or Very Simple Structure (vss, 5.4.1).

R code
fa.parallel(myData)
vss(myData)
7. Factor analyze the data (see section 5.1) with a specified number of factors (the default is 1). The default method is minimum residual, and the default rotation for more than one factor is oblimin. There are many more possibilities (see sections 5.1.1-5.1.3). Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm (Revelle, 1979) (see section 5.1.6). Also consider a hierarchical factor solution to find coefficient ω (see 5.1.5).

R code
fa(myData)
iclust(myData)
omega(myData)
8. Some people like to find coefficient α as an estimate of reliability. This may be done for a single scale using the alpha function (see 6.1). Perhaps more useful is the ability to create several scales as unweighted averages of specified items using the scoreItems function (see 6.4), and to find various estimates of internal consistency for these scales, find their intercorrelations, and find scores for all the subjects.

R code
alpha(myData)   # score all of the items as part of one scale
myKeys <- make.keys(nvar=20, list(first = c(1,-3,5,-7,8,10), second = c(2,4,-6,11,15,-16)))
my.scores <- scoreItems(myKeys, myData)   # form several scales
my.scores   # show the highlights of the results
At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading the entire overview vignette helpful to get a broader understanding of what can be done in R using psych. Remember that the help command (?) is available for every function. Try running the examples for each help page.
2 Overview of this and related documents
The psych package (Revelle, 2017) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, prep), a draft of which is available at http://personality-project.org/r/book.

Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares, and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP), or parallel analysis (fa.parallel) criteria. Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa. Bifactor and hierarchical factor structures may be estimated by using Schmid Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust), and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (corPlot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, and structure.diagram, as well as item response characteristics and item and test information characteristic curves using plot.irt and plot.poly.
3 Getting started
Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from, e.g., fa or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.
4 Basic data analysis
A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions it is necessary to make the package active by using the library command:

R code
library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

R code
.First <- function(x) {library(psych)}
4.1 Getting the data by using read.file

Although many find copying the data to the clipboard and then using the read.clipboard functions (see below) convenient, a helpful alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt, or .sav, the file will be read correctly.

R code
my.data <- read.file()
If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command.

R code
my.data <- read.file(widths = c(4, rep(1, 35)))  # will read in a file without a header row
   # and 36 fields, the first of which is 4 columns wide, the rest of which are 1 column each
If the file is an RData file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata) the object will be loaded. Depending what was stored, this might be several objects. If the file is a .sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (.xpt or .XPT) it will be read. .csv files (comma separated files), normal .txt or .text files, and .data or .dat files will be read as well. These are assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).
To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read the file again, converting the value labels to numeric values.

R code
my.spss <- read.file(use.value.labels=TRUE)  # this will keep the value labels for sav files
my.data <- read.file()   # just read the file and convert the value labels to numeric

You can quickView the my.spss object to see what it looks like, as well as my.data to see the numeric encoding.

R code
quickView(my.spss)   # to see the original encoding
quickView(my.data)   # to see the numeric encoding
4.2 Data input from the clipboard

There are of course many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).
For example, given a data set copied to the clipboard from a spreadsheet, just enter the command:

R code
my.data <- read.clipboard()
This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

R code
my.data <- read.clipboard(sep="\t")   # define the tab option, or
my.tab.data <- read.clipboard.tab()   # just use the alternative function
For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column each, and the last 4 are 3 columns).

R code
my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))
4.3 Basic descriptive statistics

Once the data are read in, describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.
describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.
> library(psych)
> data(sat.act)
> describe(sat.act)  # basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41
4.4 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as for communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
4.4.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 1 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.
4.4.2 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.
> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00
When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:
> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])
> png('pairs.panels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA
It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal.
> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA
4.4.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> corPlot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main='24 variables in a circumplex')
> dev.off()
null device
          1

Figure 3: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
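The underestimation is easy to demonstrate by simulation. In this hedged base-R sketch (illustrative variable names, not a psych function), two variables with a latent correlation of about .6 are dichotomized at their medians, and the φ coefficient of the dichotomized variables falls well short of .6:

```r
# Simulate two bivariate normal variables with latent correlation about .6
set.seed(42)
n <- 100000
x <- rnorm(n)
y <- 0.6 * x + sqrt(1 - 0.6^2) * rnorm(n)

# Dichotomize at the median and take the Pearson (phi) correlation
xd <- as.numeric(x > 0)
yd <- as.numeric(y > 0)
phi <- cor(xd, yd)

round(c(latent = cor(x, y), phi = phi), 2)  # phi is about .41, well below .6
```

The tetrachoric correlation, applied to the 2 x 2 table of xd and yd, is designed to recover the latent .6 under the bivariate normal assumption.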
Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.
If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigenvalues of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
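The smoothing idea can be sketched in a few lines of base R. This is a simplified version of what cor.smooth is described as doing above; the tolerance value and the improper example matrix are invented for illustration:

```r
# An improper "correlation" matrix (invented): its smallest eigenvalue is negative
R <- matrix(c(1, .9, .2,
              .9, 1, .9,
              .2, .9, 1), 3, 3)
smooth_cor <- function(R, eps = 1e-6) {
  e <- eigen(R, symmetric = TRUE)
  vals <- pmax(e$values, eps)              # force eigenvalues to be positive
  vals <- vals * length(vals) / sum(vals)  # rescale to sum to the number of variables
  Rs <- e$vectors %*% diag(vals) %*% t(e$vectors)
  cov2cor(Rs)                              # restore a unit diagonal
}
min(eigen(R)$values)              # negative: R is not positive semi-definite
min(eigen(smooth_cor(R))$values)  # positive after smoothing
```

The smoothed matrix is close to the original but is now positive definite and so can be factored.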
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)
¹ Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.
The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.
At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.
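The equivalence between the SVD of (standardized) data and the eigendecomposition of the correlation matrix can be verified directly. A base-R sketch with simulated data (the mixing matrix and sample size are arbitrary):

```r
set.seed(42)
# 200 observations of 3 correlated, standardized variables (simulated)
X <- scale(matrix(rnorm(200 * 3), 200, 3) %*% matrix(runif(9), 3, 3))
R <- cor(X)
# Principal component loadings: eigenvectors scaled by sqrt of eigenvalues
e <- eigen(R)
loadings <- e$vectors %*% diag(sqrt(e$values))
# The same eigenvalues come from the SVD of the standardized data matrix
d <- svd(X)$d
all.equal(sort(e$values), sort(d^2 / (nrow(X) - 1)))  # TRUE
```

This is why either route, the data matrix or the correlation matrix, yields the same components.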
Several of the functions in psych address the problem of data reduction
fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.
principal Principal Components Analysis reports the largest n eigenvectors rescaled by the square root of their eigenvalues.
factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the square root of the product of the sums of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

² Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigenvalues of a correlation matrix with those from random data.
fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.
nfactors A number of different tests for the number of factors problem are run.
fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.
fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.
iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix nRn be approximated by the product of a factor matrix nFk and its transpose plus a diagonal matrix of uniquenesses?

R = FF' + U²   (1)
The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
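Equation 1 can be illustrated numerically: build a model correlation matrix from a known loadings matrix and its implied uniquenesses, then check that the diagonal is 1. The single-factor loadings here are invented for illustration:

```r
# One factor, three variables: R = FF' + U^2
lambda <- matrix(c(.9, .8, .7), 3, 1)    # hypothetical factor loadings
U2 <- diag(1 - as.vector(lambda)^2)      # uniquenesses make the diagonal exactly 1
R <- lambda %*% t(lambda) + U2
R  # a proper 3 x 3 correlation matrix with off-diagonals .72, .63, .56
```

Fitting this R with one factor (e.g., fa(R, 1)) should recover loadings close to lambda, since R was generated exactly from the model.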
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigenvalues of the original correlation matrix. This is similar to what is done
in factanal but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", and "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
5.1.2 Principal Axis Factor Analysis
An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigenvalue decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
The solutions from the fa, the factor.minres and factor.pa, as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.
Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).
The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)

[SPLOM of the item loadings on MR1, MR2 and MR3, titled "Factor Analysis"; each axis runs from 0.0 to 0.8]
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)

[Path diagram titled "Factor Analysis": MR1 <- Sentences (0.9), Vocabulary (0.9), Sent.Completion (0.8); MR2 <- First.Letters (0.9), 4.Letter.Words (0.7), Suffixes (0.6); MR3 <- Letter.Series (0.8), Letter.Group (0.6), Pedigrees (0.5); factor correlations 0.5-0.6]
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components Analysis (PCA)
An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases. Factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than the fa solutions.
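The total-versus-common variance distinction can be seen numerically: the component eigenvalues of a correlation matrix always sum to the number of variables, so PCA "accounts for" all the variance by construction. A base-R sketch with an invented 3-variable correlation matrix:

```r
# A small correlation matrix (invented values for illustration)
R <- matrix(c(1, .6, .5,
              .6, 1, .4,
              .5, .4, 1), 3, 3)
e <- eigen(R)
sum(e$values)                              # = 3: components reproduce all the variance
pc1 <- e$vectors[, 1] * sqrt(e$values[1])  # first principal component loadings
sum(pc1^2)                                 # the squared loadings sum to the first eigenvalue
```

A factor solution of the same matrix would instead reproduce only the communalities, so its squared loadings would sum to less than 3.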
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general and group factors.
> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))

[Higher order path diagram titled "Omega": the 9 variables load on the group factors F1, F2 and F3 (loadings 0.2-0.9), which in turn load on a general factor g (0.8, 0.8, 0.7)]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
> om <- omega(Thurstone,n.obs=213)

[Bifactor (Schmid-Leiman) diagram titled "Omega": each of the 9 variables loads on the general factor g (0.5-0.7) as well as on one of the residualized group factors F1, F2 or F3 (0.2-0.6)]

Figure 7: A bifactor solution to the Thurstone 9 variable problem.
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or, in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.
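Steps 1-3 of the algorithm can be sketched in base R for the first agglomeration. The toy correlation matrix is invented, and unlike iclust this sketch does not check α and β before merging:

```r
# Step 1: a proximity (correlation) matrix for 4 items (invented values)
R <- matrix(c(1, .7, .2, .1,
              .7, 1, .3, .2,
              .2, .3, 1, .6,
              .1, .2, .6, 1), 4, 4,
            dimnames = list(paste0("V", 1:4), paste0("V", 1:4)))
# Step 2: identify the most similar pair (largest off-diagonal correlation)
off <- R
diag(off) <- NA                  # ignore the unit diagonal
pair <- which(off == max(off, na.rm = TRUE), arr.ind = TRUE)[1, ]
pair                             # V1 and V2: the first cluster
# Step 3: the new cluster variable would be the (possibly reversed) sum of V1 and V2
```

Steps 4-5 would then correlate this composite with the remaining items and repeat, which is what iclust automates along with its reliability criteria.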
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])
[ICLUST dendrogram of the 25 items: items combine into clusters C1-C21, each labeled with its α and β; the major clusters are C20 (α = 0.81, β = 0.63), C16 (α = 0.81, β = 0.76), C15 (α = 0.73, β = 0.67) and C21 (α = 0.41, β = 0.27)]
Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.
> summary(ic)  #show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OS X, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> r.poly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
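The bootstrap logic is simple to sketch in base R: resample cases with replacement, refit, and take quantiles of the replicated estimates. Here this is done for a single correlation rather than a full factor analysis, with simulated data:

```r
set.seed(17)
# Simulated data: y is a noisy linear function of x
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
boot_r <- replicate(1000, {
  i <- sample(n, replace = TRUE)  # resample cases with replacement
  cor(x[i], y[i])                 # refit the statistic on the bootstrap sample
})
quantile(boot_r, c(.025, .975))   # percentile 95% confidence interval
```

The n.iter option of fa does the analogous thing for every factor loading, which is why 20 iterations (as in Table 8) gives only rough intervals while 1000 is more conventional.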
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:
c(fi, fj) = Σ (k=1..n) fik fjk / √( Σ fik²  Σ fjk² )
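The congruence coefficient is easy to compute directly. A base-R version applied to two invented loading vectors:

```r
# Cosine of the angle between two loading vectors
congruence <- function(f, g) sum(f * g) / sqrt(sum(f^2) * sum(g^2))
f1 <- c(.9, .8, .7, .1)  # hypothetical loadings on one factor
f2 <- c(.8, .9, .6, .2)  # hypothetical loadings on a comparison factor
congruence(f1, f2)       # near 1: the factors are nearly identical
congruence(f1, -f1)      # exactly -1: the same factor, reflected
```

factor.congruence applies this formula to every pair of columns in two loading matrices, which is what is done in the comparison below.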
Consider the case of a four factor solution and four cluster solution to the Big Five problem.
> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[ICLUST dendrogram of the 25 items based on polychoric correlations: clusters C1-C23, each labeled with its α and β; the top-level clusters are C23 (α = 0.83, β = 0.58) and C22 (α = 0.81, β = 0.29)]
Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[ICLUST dendrogram with the solution constrained to 5 clusters; O4 remains unassigned]
Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")

[ICLUST dendrogram with the beta.size criterion set to 3; the structure is essentially that of Figure 9, with top-level clusters C23 (α = 0.83, β = 0.58) and C22 (α = 0.81, β = 0.29)]
Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.
> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P   C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit =  0.68   Pattern fit =  0.96   RMSR =  0.05
NULL
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                          MR2  MR1
Correlation of scores with factors       0.86 0.88
Multiple R square of scores with factors 0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
    low   MR2 upper   low   MR1 upper
A1 0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4 0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1 0.52  0.57  0.60 -0.08 -0.05 -0.02
C2 0.59  0.62  0.66 -0.04 -0.01  0.02
C3 0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor, and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))

      MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1  1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2  0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3  0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g    0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1   1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2   0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3   0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2   0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1  1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2  0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3  0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix, after the factors of interest are extracted, is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Average Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25], title="Very Simple Structure of a Big 5 inventory")

[VSS plot omitted: Very Simple Structure fit (0 to 1) plotted against the number of factors (1 to 8) for the complexity 1 through 4 solutions]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. The latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
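The core comparison that parallel analysis makes can be sketched in a few lines of base R. This is a deliberately stripped-down illustration, not the psych fa.parallel function (which also resamples the real data and compares factor as well as component eigen values); the function name here is ours:

```r
# Bare-bones parallel analysis: retain components whose observed eigen values
# exceed the mean eigen values of random normal data of the same dimensions.
parallel.sketch <- function(x, n.iter = 20) {
  obs <- eigen(cor(x))$values
  n <- nrow(x); p <- ncol(x)
  rand <- replicate(n.iter,
                    eigen(cor(matrix(rnorm(n * p), n, p)))$values)
  sum(obs > rowMeans(rand))        # number of components to retain
}

# simulated data with one strong general factor (made-up loadings of .8)
set.seed(42)
one.factor <- matrix(rnorm(500), 500, 6) * 0.8 + rnorm(500 * 6) * 0.6
parallel.sketch(one.factor)        # suggests 1 component
```

The key design point is that the criterion is relative: an observed eigen value is retained only if it beats what pure noise of the same size would produce, which is why, for large N, the random eigen values all approach 1 and the rule converges toward the eigen-value-of-1 rule.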
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25], main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Scree plot omitted: eigen values of principal components and factor analysis against factor/component number, for PC and FA solutions of the actual, simulated, and resampled data]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha$$
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
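The λ3 formula is easy to compute directly from a covariance (or correlation) matrix. A minimal base-R sketch (the function name is ours for illustration; psych's alpha function reports much more):

```r
# Coefficient alpha (Guttman's lambda 3) from a covariance matrix,
# following the formula above: (n/(n-1)) * (1 - tr(V)/Vx).
alpha.from.cov <- function(V) {
  n  <- ncol(V)
  Vx <- sum(V)                       # total test variance
  (n / (n - 1)) * (1 - sum(diag(V)) / Vx)
}

R3 <- matrix(0.5, 3, 3); diag(R3) <- 1   # three items intercorrelated at .5
alpha.from.cov(R3)                       # 0.75
```

Here tr(V) = 3 and Vx = 6, so α = (3/2)(1 − 3/6) = 0.75.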
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}$$
The squared multiple correlation is a lower bound for the item communality and as thenumber of items increases becomes a better estimate
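λ6 can likewise be sketched in a few lines of base R, using the standard identity that the smc of each item equals one minus the reciprocal of the corresponding diagonal element of the inverse correlation matrix (the function name is ours; see psych's guttman and splitHalf for the full versions):

```r
# Guttman's lambda 6 from a correlation matrix: smc via the inverse matrix,
# then 1 - sum(1 - smc)/Vx as in the formula above.
lambda6 <- function(R) {
  smc <- 1 - 1 / diag(solve(R))    # squared multiple correlations
  1 - sum(1 - smc) / sum(R)
}

R3 <- matrix(0.5, 3, 3); diag(R3) <- 1
lambda6(R3)    # 2/3: less than alpha (0.75), as expected with equal loadings
```

This small example illustrates the claim in the text: with equal item loadings, alpha > G6.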
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal, or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[,s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe = fe)

[Factor analysis and extension diagram omitted: two factors (MR1, MR2) defined by the odd-numbered variables on the left, extended to the even-numbered variables on the right]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6, and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
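The standardized form can be made concrete: standardized alpha depends only on the number of items and the average inter-item correlation (a sketch with a made-up function name, not the psych implementation):

```r
# Standardized alpha from the average inter-item correlation
# (the generalized Spearman-Brown form).
std.alpha <- function(R) {
  n    <- ncol(R)
  rbar <- mean(R[lower.tri(R)])    # average inter-item correlation
  n * rbar / (1 + (n - 1) * rbar)
}

R3 <- matrix(0.5, 3, 3); diag(R3) <- 1
std.alpha(R3)    # 0.75: equals raw alpha here because all item variances are 1
```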
More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:
alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.
scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha, coefficient G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as if one item is dropped at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n=500, raw=TRUE)$observed
> round(cor(r9), 2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

 raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
 0.77   0.8  0.83

Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

Item statistics
     n rawr std.r r.cor r.drop    mean   sd
V1 500 0.77  0.76  0.76   0.67  0.0327 1.02
V2 500 0.67  0.66  0.62   0.55  0.1071 1.00
V3 500 0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500 0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500 0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500 0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500 0.60  0.60  0.52   0.46  0.0481 1.01
V8 500 0.58  0.57  0.49   0.43  0.0526 1.07
V9 500 0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
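What reverse keying does can be shown concretely: an item keyed -1 is reflected around the sum of the lowest and highest possible responses before scoring. This is a sketch of the idea with names of our own choosing, not the psych implementation:

```r
# Reverse-key the items marked -1 in the keys vector: a response x on a
# scale running from low to high becomes (low + high) - x.
reverse.key <- function(items, keys, low = 1, high = 6) {
  items[, keys < 0] <- (low + high) - items[, keys < 0]
  items
}

x <- matrix(c(2, 5), 1, 2)         # one person, two items on a 1-6 scale
reverse.key(x, keys = c(1, -1))    # item 2 becomes 7 - 5 = 2
```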
> keys <- c(1, -1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

 raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
      0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
 0.18  0.43  0.68

Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

Item statistics
             n  rawr std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1, 1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

 raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
      0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
 0.76  0.84  0.93

Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

Item statistics
            n rawr std.r r.cor r.drop mean   sd
rating     30 0.78  0.76  0.75   0.67   65 12.2
complaints 30 0.84  0.81  0.82   0.74   67 13.3
privileges 30 0.70  0.68  0.60   0.56   53 12.2
learning   30 0.81  0.80  0.78   0.71   56 11.7
raises     30 0.85  0.86  0.85   0.79   65 10.4
critical   30 0.42  0.45  0.31   0.27   75  9.9
advance    30 0.60  0.62  0.56   0.46   43 10.3
It is useful, when considering items for a potential scale, to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500, short=FALSE, low=-2, high=2, categorical=TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

 raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
      0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
 0.68  0.72  0.76

Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

Item statistics
     n rawr std.r r.cor r.drop  mean   sd
V1 500 0.81  0.81  0.74   0.62 0.058 0.97
V2 500 0.74  0.75  0.63   0.52 0.012 0.98
V3 500 0.72  0.72  0.57   0.48 0.056 0.99
V4 500 0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013), based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be $\sqrt{r_{ab}}$, where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness ($u^2$) from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, $V_x$, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting $x = cg + Af + Ds + e$, then the communality of item j, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$, and the unique variance for the item, $u_j^2 = \sigma_j^2 (1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\mathbf{1}cc'\mathbf{1}' + \mathbf{1}AA'\mathbf{1}'}{V_x} = 1 - \frac{\sum (1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$
Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

$$\omega_h = \frac{\mathbf{1}cc'\mathbf{1}'}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{inf} = \frac{\mathbf{1}cc'\mathbf{1}'}{\mathbf{1}cc'\mathbf{1}' + \mathbf{1}AA'\mathbf{1}'}$$
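Given a Schmid-Leiman solution, both coefficients follow directly from the loadings. A minimal base-R sketch, assuming orthogonal group factors and standardized items (the function name and the loadings below are made up for illustration; the real computation is done by psych's omega):

```r
# omega_h and omega_t from Schmid-Leiman loadings: g is the vector of
# general factor loadings, A the matrix of group factor loadings.
omegas <- function(g, A) {
  u2 <- 1 - (g^2 + rowSums(A^2))   # uniquenesses: 1 - h2
  R  <- g %o% g + A %*% t(A)       # model-implied correlation matrix
  diag(R) <- 1
  Vx <- sum(R)                     # total test variance
  c(omega_h = sum(g)^2 / Vx,       # 1cc'1' / Vx
    omega_t = 1 - sum(u2) / Vx)    # 1 - sum(u2) / Vx
}

g <- rep(0.6, 4)                   # hypothetical general factor loadings
A <- matrix(c(0.4, 0.4, 0, 0,
              0, 0, 0.4, 0.4), 4, 2)   # two group factors
round(omegas(g, A), 2)             # omega_h 0.64, omega_t 0.79
```

Note that $\mathbf{1}cc'\mathbf{1}'$ is just the square of the sum of the general factor loadings, which is why `sum(g)^2` appears in the numerator.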
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
> om9 <- omega(r9, title="9 simulated variables")

[Omega diagram omitted: the nine variables load on three group factors (F1, F2, F3) and on a general factor g]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index = 0.026 and the 90% confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90% confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                g    F1    F2    F3
Correlation of scores with factors           0.84  0.59  0.61  0.54
Multiple R square of scores with factors     0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g   F1   F2   F3
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
gt omegaSem(r9nobs=500)
Call omegaSem(m = r9 nobs = 500)OmegaCall omega(m = m nfactors = nfactors fm = fm key = key flip = flip
digits = digits title = title sl = sl labels = labelsplot = plot nobs = nobs rotate = rotate Phi = Phi option = option)
Alpha 08G6 08Omega Hierarchical 069Omega H asymptotic 082Omega Total 083
Schmid Leiman Factor loadings greater than 02g F1 F2 F3 h2 u2 p2
V1 072 045 072 028 071V2 058 037 047 053 071V3 057 041 049 051 066V4 057 041 050 050 066V5 055 032 041 059 073V6 043 027 028 072 066V7 047 047 044 056 050V8 043 042 036 064 051V9 032 023 016 084 063
52
With eigenvalues ofg F1 F2 F3
248 052 047 036
generalmax 476 maxmin = 146mean percent general = 064 with sd = 008 and cv of 013Explained Common Variance of the general factor = 065
The degrees of freedom are 12 and the fit is 003The number of observations was 500 with Chi Square = 1586 with prob lt 02The root mean square of the residuals is 002The df corrected root mean square of the residuals is 003RMSEA index = 0026 and the 90 confidence intervals are NA 0055BIC = -5871
Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09 
RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97 
Measures of factor score adequacy 
                                                  g   F1*   F2*   F3*
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                  g  F1*  F2*  F3*
Omega total for total scores and subscales     0.83 0.79 0.56 0.65
Omega general for total scores and subscales   0.69 0.55 0.30 0.46
Omega group for total scores and subscales     0.12 0.24 0.26 0.19
 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83 
With loadings of 
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3* 
2.60 0.47 0.40 0.33 
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.
> splitHalf(r9)
Split half reliabilities  
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0 and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
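The reverse scoring arithmetic can be illustrated with a small sketch (illustrative only, not from the original vignette; a 1-6 response scale is assumed here):

```r
# Reverse scoring on an assumed 1-6 response scale: reversed = (max + min) - item.
x <- c(1, 2, 6)            # three raw responses (hypothetical)
reversed <- (6 + 1) - x    # subtract each response from max + min
reversed                   # 6 5 1
```

A response at the top of the scale (6) maps to the bottom (1), and the midpoint of the scale is unchanged.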
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged
to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+      Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, as they are for the demographic data.
This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores
Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6 reliability: 
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r : 
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5
Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal 
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.
To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores element of the scores object. (See Figure 16.)
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.
Consider the same bfi data set, but first find the correlations, and then use cluster.cor.
> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)
Call: cluster.cor(keys = keys, r.mat = r.bfi)
Scale intercorrelations corrected for attenuation 
 raw correlations below the diagonal, (standardized) alpha on the diagonal 
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz 
     2 
Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)
Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25
item statistics 
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results
          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data, or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)
[Figure 17 plot: four panels (reason.4, reason.16, reason.17, reason.19), each showing P(response) for the response alternatives as a function of theta.]
Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:
$$\delta_j = \frac{D\,\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{2}$$
where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just
$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}.$$
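As a quick worked check of these conversions (an illustrative calculation, not part of the original text), take an item with loading $\lambda_j = 0.6$ and threshold $\tau_j = 0.5$ in the normal ogive metric ($D = 1$):

$$\alpha_j = \frac{0.6}{\sqrt{1-0.6^2}} = \frac{0.6}{0.8} = 0.75, \qquad \delta_j = \frac{0.5}{0.8} = 0.625,$$

and applying the reverse transformation recovers the loading: $\lambda_j = 0.75/\sqrt{1+0.75^2} = 0.75/1.25 = 0.6$.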
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty:
> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)
> test
Item Response Analysis using Factor Analysis 

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor = 1 
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2 
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09 
The df corrected root mean square of the residuals is 0.1 

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90 % confidence intervals are 0.198 0.218
BIC = 1009.07
Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:
> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis 

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor = 1 
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05 
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04 
The df corrected root mean square of the residuals is 0.06 

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90 % confidence intervals are 0.083 0.111
BIC = 96.23
The item information functions show that not all of the items are equally good (Figure 19):
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))
[Figure 18 plot: three panels titled "Item parameters from factor analysis" (probability of response vs. latent trait for V1-V9), "Item information from factor analysis", and "Test information -- item parameters from factor analysis" (with a reliability axis), all against the latent trait (normal scale).]
Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt,type="IIC")
[Figure 19 plot: "Item information from factor analysis for factor 1", with information curves for -E1, -E2, E3, E4 and E5 against the latent trait (normal scale).]
Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info,sort=TRUE)
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor = 1 
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
More extensive IRT packages include ltm and eRm, and these should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.
> iq.irt
Item Response Analysis using Factor Analysis 

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor = 1 
             -3    -2    -1     0     1     2     3
reason.4   0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16  0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17  0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19  0.05  0.17  0.41  0.46  0.22  0.07  0.02
> iq.irt <- irt.fa(iq.tf)
[Figure 20 plot: "Item information from factor analysis", showing information curves for the 16 ability items (reason, letter, matrix and rotate items) against the latent trait (normal scale).]
Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7    0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33   0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34   0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58   0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45   0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46   0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47   0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55   0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3    0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4    0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6    0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8    0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info   0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM         1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49  0.80  0.85  0.82  0.61 -0.40
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85 
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08 
The df corrected root mean square of the residuals is 0.09 

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90 % confidence intervals are 0.126 0.135
BIC = 2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.
Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:
> v9 <- sim.irt(9,1000,-2.,2.,mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho,4)
[Figure 21 diagram: "Omega" solution in which the 16 ability items (rotate, letter, matrix and reason items) load on a general factor g (loadings roughly 0.4-0.7) and on four group factors F1-F4.]
Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")
These results are seen in Figure 22
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.
This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:
$$r_{xy} = \eta_{x_{wg}}\,\eta_{y_{wg}}\, r_{xy_{wg}} + \eta_{x_{bg}}\,\eta_{y_{bg}}\, r_{xy_{bg}} \tag{3}$$
Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df,pch='.',gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring",line=3)
[Figure 22 plot: pairs panel of True theta, irt theta, total, fit, rasch, total and fit, with pairwise correlations, titled "Comparing true theta for IRT, Rasch and classically based scoring".]
where $r_{xy}$ is the normal correlation, which may be decomposed into the within group and between group correlations $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and η (eta) is the correlation of the data with the within group values or the group means.
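A small numeric illustration of equation 3 (hypothetical values, not taken from a psych data set): suppose half of the variance of x and of y lies within groups and half between, so each $\eta^2 = 0.5$, with $r_{xy_{wg}} = 0.4$ and $r_{xy_{bg}} = 0.6$. Then

$$r_{xy} = \sqrt{.5}\,\sqrt{.5}\,(0.4) + \sqrt{.5}\,\sqrt{.5}\,(0.6) = 0.2 + 0.3 = 0.5,$$

an observed correlation lying between the within group and between group values, weighted by how much variance each level contributes.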
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4 and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5 and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2 and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.
The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)), or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the set.cor function. Set correlation is
$$R^2 = 1 - \prod_{i=1}^{n}(1-\lambda_i)$$
where $\lambda_i$ is the ith eigenvalue of the eigenvalue decomposition of the matrix

$$\mathbf{R} = R_{xx}^{-1}R_{xy}R_{yy}^{-1}R_{yx}.$$
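As a check on this formula (a worked calculation added here for illustration), the squared canonical correlations reported for the sat.act example later in this section (0.050, 0.033, 0.008) reproduce the printed set correlation:

$$R^2 = 1 - (1-0.050)(1-0.033)(1-0.008) \approx 1 - 0.911 = 0.089 \approx 0.09,$$

which matches the Cohen's Set Correlation R2 of 0.09 in the output.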
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.
set.cor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.
Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using set.cor. Two things to notice: set.cor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272, Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476
Compare this with the output from setCor.

> # compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs=700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohens Set Correlation R2  =  0.09
 Shrunken Set Correlation R2  =  0.08
 F and df of Cohens Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets =  0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric: that is, the R² is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.
sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.
sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and rⁿ for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar − 1 factors.
Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
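As an illustrative sketch (assuming the psych package is installed, and that the simulated object carries its generated scores in an `observed` element, following the sim.structure conventions), minor-factor data can be generated and then factored to see how well the major factors are recovered:

```r
library(psych)
set.seed(17)
sm <- sim.minor(nvar = 12, nfact = 3, n = 500)  # 3 major factors plus nuisance minor factors
f3 <- fa(sm$observed, nfactors = 3)             # attempt to recover the 3 major factors
fa.parallel(sm$observed)                        # how many factors does parallel analysis suggest?
```

Comparing the parallel-analysis suggestion with the known 3 major factors shows how minor factors can inflate the apparent dimensionality.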
Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four IRT simulations (sim.rasch, sim.irt, sim.npl and sim.npn) simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
The logistic model is
P(x|θᵢ, δⱼ, γⱼ, ζⱼ) = γⱼ + (ζⱼ − γⱼ) / (1 + e^(αⱼ(δⱼ−θᵢ)))    (4)
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), αⱼ is item discrimination and δⱼ is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
(Graphics of these may be seen in the demonstrations for the logistic function)
The normal model (sim.npn) calculates the probability using pnorm instead of the logistic function used in sim.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter set to 1.702 in the logistic model, the two models are practically identical.
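A minimal base-R sketch of equation 4, and of how closely the logistic model with α = 1.702 tracks the normal-ogive model (this is an illustration of the formula, not the psych implementation; the helper name p4pl is hypothetical):

```r
# 4-parameter logistic item response function (equation 4); illustrative sketch
p4pl <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1) {
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
}

theta <- seq(-3, 3, 0.1)
p_logistic <- p4pl(theta, delta = 0, alpha = 1.702)  # logistic with the 1.702 scaling
p_normal   <- pnorm(theta)                           # normal-ogive model, same difficulty
max(abs(p_logistic - p_normal))                      # discrepancy stays near 0.01
```

The 1.702 constant is the classical scaling that minimizes the maximum discrepancy between the logistic and normal-ogive curves.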
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g., (but not shown)
f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)

c <- iclust(Thurstone)
plot(c)             # a pretty boring plot
iclust.diagram(c)   # a better diagram

c3 <- iclust(Thurstone,3)
plot(c3)            # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)

ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels ="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels ="left to right")
> dia.curve(ll$top,ul$bottom,"double")  # for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")          # but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.
df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.
cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.
cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.
fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
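The arithmetic behind a few of these helpers, such as fisherz and the geometric and harmonic means, reduces to one-line base-R computations. The following sketch shows the underlying formulas (an illustration, not the psych source):

```r
# Fisher r-to-z transformation: z = atanh(r) = 0.5 * log((1 + r) / (1 - r))
z <- atanh(0.5)            # approximately 0.549

# Geometric and harmonic means for a vector of positive values
x <- c(2, 8)
g <- exp(mean(log(x)))     # geometric mean: sqrt(2 * 8) = 4
h <- 1 / mean(1 / x)       # harmonic mean: 2 / (1/2 + 1/8) = 3.2
```

The geometric mean is appropriate for multiplicative data (e.g., reaction-time ratios), the harmonic mean for rates.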
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and there are 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.
Thurstone Holzinger and Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.
bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.
sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.
epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.
galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.
Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.
miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
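As a quick sketch (assuming psych is installed), any of these data sets can be loaded and examined directly:

```r
library(psych)
data(bfi)             # 2800 subjects, 25 personality items plus demographics
dim(bfi)              # 2800 rows, 28 columns (25 items + gender, education, age)
describe(bfi[1:5])    # descriptive statistics for the first five items
```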
14 Development version and a user's guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib/ and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.
Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf
News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.2.8", package="psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book, An introduction to Psychometric Theory with Applications in R (Revelle, in prep), available at http://personality-project.org/r/book. More information about the use of some of the functions may be found in the book.
For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.
16 SessionInfo
This document was prepared using the following settings
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33  matrixcalc_1.0-3  MASS_7.3-45  grid_3.3.0     arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0     coda_0.18-1       minqa_1.2.4  mi_1.0         nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18      splines_3.3.0     lme4_1.1-12  sem_3.1-7      tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1  parallel_3.3.0  mnormt_1.5-4
References
Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405-432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439-458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447-473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245-276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd ed. edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78-98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173-178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430-450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255-282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121-132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41-54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179-185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283-300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537-549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231-258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309-317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153-175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676-1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481-495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57-74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39-73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403-414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27-49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika, 74(1):145-154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83-90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420-428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345-353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components--an alternative to "mathematical factors". Psychological Review, 42(5):425-454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321-327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123-133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121-144.
1 Overview of this and related documents
To do basic and advanced personality and psychological research using R is not as complicated as some think. This is one of a set of "How To" documents for doing various things using R (R Core Team, 2017), particularly using the psych (Revelle, 2017) package.
The current list of How To's includes:
1 Installing R and some useful packages
2 Using R and the psych package to find ωh and ωt
3 Using R and the psych package for factor analysis and principal components analysis (This document)
4 Using the scoreItems function to find scale scores and scale statistics
5 An overview (vignette) of the psych package
1.1 Jump starting the psych package--a guide for the impatient
You have installed psych (section 3) and you want to use it without reading much more. What should you do?
1 Activate the psych package:
R code
library(psych)
2 Input your data (section 4.2). If your file name ends in sav, text, txt, csv, xpt, rds, Rds, rda, or RDATA, then just read it in directly using read.file. Or you can go to your friendly text editor or data manipulation program (e.g., Excel) and copy the data to the clipboard. Include a first line that has the variable labels. Paste it into psych using the read.clipboard.tab command:

R code
myData <- read.file()   # this will open a search window on your machine and read or load the file
# or, first copy your file to your clipboard and then
myData <- read.clipboard.tab()   # if you have an excel file
3 Make sure that what you just read is right. Describe it (section 4.3) and perhaps look at the first and last few lines. If you want to "view" the first and last few lines using a spreadsheet like viewer, use quickView.

R code
describe(myData)
headTail(myData)
or
quickView(myData)
4 Look at the patterns in the data. If you have fewer than about 10 variables, look at the SPLOM (Scatter Plot Matrix) of the data using pairs.panels (section 4.4.1).

R code
pairs.panels(myData)
5 Find the correlations of all of your data.
• Descriptively (just the values) (section 4.4.2)
R code
lowerCor(myData)
• Graphically (section 4.4.3)
R code
corPlot(r)
6 Test for the number of factors in your data using parallel analysis (fa.parallel, section 5.4.2) or Very Simple Structure (vss, 5.4.1).

R code
fa.parallel(myData)
vss(myData)
7 Factor analyze (see section 5.1) the data with a specified number of factors (the default is 1); the default method is minimum residual, and the default rotation for more than one factor is oblimin. There are many more possibilities (see sections 5.1.1-5.1.3). Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm (Revelle, 1979) (see section 5.1.6). Also consider a hierarchical factor solution to find coefficient ω (see 5.1.5).

R code
fa(myData)
iclust(myData)
omega(myData)
8 Some people like to find coefficient α as an estimate of reliability. This may be done for a single scale using the alpha function (see 6.1). Perhaps more useful is the ability to create several scales as unweighted averages of specified items using the scoreItems function (see 6.4) and to find various estimates of internal consistency for these scales, find their intercorrelations, and find scores for all the subjects.

R code
alpha(myData)   # score all of the items as part of one scale
myKeys <- make.keys(nvar=20, list(first = c(1,-3,5,-7,8:10), second=c(2,4,-6,11:15,-16)))
my.scores <- scoreItems(myKeys, myData)   # form several scales
my.scores   # show the highlights of the results
At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading the entire overview vignette helpful to get a broader understanding of what can be done in R using psych. Remember that the help command (?) is available for every function. Try running the examples for each help page.
2 Overview of this and related documents
The psych package (Revelle, 2017) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, in prep), a draft of which is available at http://personality-project.org/r/book.
Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.
Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares, and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP), or parallel analysis (fa.parallel) criteria. Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa. Bifactor and hierarchical factor structures may be estimated by using Schmid Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust), and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.
Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (cor.plot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, and structure.diagram, as well as item response characteristics and item and test information characteristic curves (plot.irt and plot.poly).
3 Getting started
Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.
4 Basic data analysis
A number of psych functions facilitate the entry of data and finding basic descriptive statistics.
Remember, to run any of the psych functions, it is necessary to make the package active by using the library command:
R code
library(psych)
The other packages, once installed, will be called automatically by psych.
It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,
R code
.First <- function(x) {library(psych)}
4.1 Getting the data by using read.file
Although many find copying the data to the clipboard and then using the read.clipboard functions (see below) helpful, an alternative is to read the data in directly. This can be done using the read.file function which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt, or .sav, the file will be read correctly.
R code
my.data <- read.file()
If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command.
R code
my.data <- read.file(widths = c(4,rep(1,35)))  #will read in a file without a header row and 36 fields, the first of which is 4 columns, the rest of which are 1 column each
If the file is a RData file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata) the object will be loaded. Depending what was stored, this might be several objects. If the file is a .sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (.xpt or .XPT) it will be read. .csv files (comma separated files), normal .txt or .text files, .data, or .dat files will be read as well. These are assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).
To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read it again, converting the value labels to numeric values.
R code
my.spss <- read.file(use.value.labels=TRUE)  #this will keep the value labels for .sav files
my.data <- read.file()  #just read the file and convert the value labels to numeric
You can quickView the my.spss object to see what it looks like, as well as the my.data to see the numeric encoding.
R code
quickView(my.spss)  #to see the original encoding
quickView(my.data)  #to see the numeric encoding
4.2 Data input from the clipboard
There are, of course, many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:
read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).
For example, given a data set copied to the clipboard from a spreadsheet, just enter the command
R code
my.data <- read.clipboard()
This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.
R code
my.data <- read.clipboard(sep="\t")  #define the tab option, or
my.tab.data <- read.clipboard.tab()  #just use the alternative function
For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, the last 4 are 3 columns).
R code
my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))
4.3 Basic descriptive statistics
Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.
describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.
It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.
> library(psych)
> data(sat.act)
> describe(sat.act)  #basic descriptive statistics
          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41
4.4 Simple descriptive graphics
Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
4.4.1 Scatter Plot Matrices
Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean with the axis length reflecting one standard deviation of the x and y variables is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 1 for an example.)
pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.
4.4.2 Correlational structure
There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.
> lowerCor(sat.act)
          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00
When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:
> female <- subset(sat.act,sat.act$gender==2)
> male <- subset(sat.act,sat.act$gender==1)
> lower <- lowerCor(male[-1])
> png('pairspanels.png')
> pairs.panels(sat.act,pch='.')
> dev.off()
null device
          1
Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower,upper)
> round(both,2)
          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA
It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first (above the diagonal):
> diffs <- lowerUpper(lower,upper,diff=TRUE)
> round(diffs,2)
          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA
4.4.3 Heatmap displays of correlational structure
Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set which has a clear 3 factor solution (Figure 2) or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> cor.plot(Thurstone,numbers=TRUE,main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1
Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ,main='24 variables in a circumplex')
> dev.off()
null device
          1
Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.
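To make the distinction concrete, the φ coefficient can be computed directly from a 2 x 2 table of counts. The following base-R sketch is illustrative only: the helper name and counts are hypothetical and not part of psych. From the same table, tetrachoric would estimate a larger latent correlation under the bivariate normal assumption.

```r
# phi coefficient for a 2 x 2 table of two dichotomous items
phi2x2 <- function(tab) {
  n11 <- tab[1, 1]; n12 <- tab[1, 2]
  n21 <- tab[2, 1]; n22 <- tab[2, 2]
  (n11 * n22 - n12 * n21) /
    sqrt((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22))
}

tab <- matrix(c(40, 10, 10, 40), 2, 2)  # hypothetical item counts
phi2x2(tab)  # 0.6
```

This is the same value a Pearson correlation of the two 0/1 item vectors would give; the tetrachoric estimate for the same table would be noticeably higher.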
If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
A correlation matrix formed from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
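The idea behind this kind of eigenvalue smoothing can be sketched in a few lines of base R. This is a rough illustration, not the psych implementation (cor.smooth differs in its details): floor the negative eigenvalues at a small positive value, rescale the eigenvalues to sum to the number of variables, and rebuild the matrix.

```r
# Hypothetical helper: smooth an improper correlation matrix
smooth_cor <- function(R, floor = 100 * .Machine$double.eps) {
  e <- eigen(R, symmetric = TRUE)
  vals <- pmax(e$values, floor)             # force eigenvalues positive
  vals <- vals * length(vals) / sum(vals)   # rescale to sum to n variables
  S <- e$vectors %*% diag(vals) %*% t(e$vectors)
  cov2cor(S)                                # restore a unit diagonal
}

# An improper "correlation" matrix (its smallest eigenvalue is negative):
R <- matrix(c(1, 0.9, 0.1,
              0.9, 1, 0.9,
              0.1, 0.9, 1), 3, 3)
min(eigen(R, symmetric = TRUE)$values)              # negative
min(eigen(smooth_cor(R), symmetric = TRUE)$values)  # positive
```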
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.
At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.
Several of the functions in psych address the problem of data reduction.
fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will be eventually phased out.
principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.
factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.
fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.
nfactors A number of different tests for the number of factors problem are run.
fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.
fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.
iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
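The congruence calculation described above for factor.congruence can be sketched in base R. This is a hedged illustration (the helper name is hypothetical, not the psych internals): cross products of the loadings divided by the square roots of the sums of squared loadings, with no mean-centering.

```r
# Tucker-style congruence between columns of two loading matrices
congruence <- function(F1, F2) {
  num <- t(F1) %*% F2
  den <- sqrt(outer(colSums(F1^2), colSums(F2^2)))
  num / den
}

f1 <- matrix(c(0.9, 0.8, 0.1, 0.0), 4, 1)  # hypothetical loadings
f2 <- matrix(c(0.8, 0.9, 0.0, 0.1), 4, 1)
congruence(f1, f2)  # close to 1: nearly the same factor
```

Because the means are not subtracted, two factors with identical, all-positive loadings have a congruence of exactly 1 even when the loadings have no variance, which is where the congruence coefficient and the correlation coefficient part ways.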
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix nRn be approximated by the product of a factor matrix nFk and its transpose plus a diagonal matrix of uniquenesses:

R = FF' + U2    (1)
The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
5.1.2 Principal Axis Factor Analysis
An alternative, least squares algorithm, factor.pa (incorporated into fa as an option, fm = "pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
The solutions from the fa, the factor.minres, and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.
Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).
The unweighted least squares solution may be shown graphically using the fa.plot function which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)
[SPLOM of the item loadings on factors MR1, MR2, and MR3; plot title: Factor Analysis]
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)
[Path diagram titled "Factor Analysis": Sentences, Vocabulary, and Sent.Completion load on MR1 (.9, .9, .8); First.Letters, 4.Letter.Words, and Suffixes load on MR2 (.9, .7, .6); Letter.Series, Letter.Group, and Pedigrees load on MR3 (.8, .6, .5); the factor intercorrelations are about .6, .5, and .5]
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components analysis (PCA)
An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases. Factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.
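As noted earlier for the principal function, unrotated component loadings are the eigenvectors of R rescaled by the square roots of their eigenvalues, so the squared loadings on a component sum to its eigenvalue. A minimal base-R sketch (the helper name is hypothetical, not the psych implementation):

```r
# Unrotated PCA loadings: eigenvectors scaled by sqrt(eigenvalues)
pca_loadings <- function(R, n) {
  e <- eigen(R, symmetric = TRUE)
  e$vectors[, 1:n, drop = FALSE] %*% diag(sqrt(e$values[1:n]), n, n)
}

R <- matrix(0.5, 3, 3)
diag(R) <- 1          # equicorrelated toy matrix, r = .5
L1 <- pca_loadings(R, 1)
sum(L1^2)             # 2, the first eigenvalue (1 + 2 * .5)
```

Because every unit of variance, unique as well as common, is assigned to the components, these loadings are larger than the corresponding factor loadings, which is the contrast Table 4 illustrates.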
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general and group factors.
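In psych, this SEM-based approach is wrapped by the omegaSem function, discussed later in the section on estimating ωh using confirmatory factor analysis. A minimal sketch (it assumes the sem package is also installed, since omegaSem builds and fits a structural equation model internally):

```r
library(psych)  # omegaSem also requires the sem package to be installed

# Confirmatory bifactor solution: g plus group factors, fit by SEM
om.sem <- omegaSem(Thurstone, n.obs = 213)
```

The exploratory omega solution is used to specify the confirmatory model, so the two sets of estimates can be compared directly.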
> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))
[Path diagram omitted: "Omega" — a higher order general factor g (loadings about 0.7-0.8) above three group factors F1-F3, which in turn load on the nine Thurstone variables.]
Figure 6: A higher order factor solution to the Thurstone 9 variable problem
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
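A sketch of such a call (rotate="bifactor" hands the rotation to GPArotation; because of the local-optimum problem, the result should be checked against reruns from different starting configurations):

```r
library(psych)  # assumes GPArotation is also installed

# Exploratory bifactor rotation of the three factor Thurstone solution
f3.bi <- fa(Thurstone, 3, n.obs = 213, rotate = "bifactor")
```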
> om <- omega(Thurstone,n.obs=213)
[Path diagram omitted: "Omega" — each of the nine Thurstone variables loads on the general factor g (loadings about 0.5-0.7) as well as on one of the residualized group factors F1-F3.]
Figure 7: A bifactor solution to the Thurstone 9 variable problem
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known, somewhat pretentiously, as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.
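Steps 1-5 are the generic agglomerative recipe. A base R sketch with hclust (which lacks iclust's α/β stopping rule and purification step, so this is only an analogy) on simulated items:

```r
set.seed(42)
g <- rnorm(200)                                   # a common general factor
items <- sapply(1:6, function(i) 0.6 * g + rnorm(200))  # 6 items sharing that factor
r <- cor(items)                                   # step 1: proximity (correlation) matrix
d <- as.dist(1 - r)                               # turn similarity into a distance
hc <- hclust(d, method = "average")               # steps 2-5: repeatedly merge the closest pair
hc$merge                                          # the agglomeration history
```

plot(hc) would draw the resulting dendrogram, the base R analogue of the iclust tree shown in Figure 8.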
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])
[Dendrogram omitted: "ICLUST" — the 25 bfi items grouped into clusters (C20, C16, C15, C21, and their subclusters), each labeled with its α and β.]
Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.

> summary(ic)  #show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:
$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum_{k=1}^{n} f_{ik}^2 \sum_{k=1}^{n} f_{jk}^2}}$$
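For two loading vectors this is a one-line computation in base R (a sketch with made-up loadings; psych's factor.congruence does the same for whole loading matrices):

```r
congruence <- function(fi, fj) {
  # cosine of the angle between two vectors of loadings
  sum(fi * fj) / sqrt(sum(fi^2) * sum(fj^2))
}

# hypothetical loadings of the same four variables on two dimensions
f1 <- c(0.90, 0.80, 0.70, 0.10)
f2 <- c(0.85, 0.75, 0.65, 0.05)
congruence(f1, f2)  # near 1: essentially the same dimension
```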
Consider the case of a four factor solution and four cluster solution to the Big Five problem:
> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)
[Dendrogram omitted: "ICLUST using polychoric correlations" — clusters of the 25 bfi items, each labeled with its α and β.]
Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)
[Dendrogram omitted: "ICLUST using polychoric correlations for nclusters=5" — the five cluster solution.]
Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")
[Dendrogram omitted: "ICLUST beta.size=3" — clusters of the 25 bfi items with the beta.size criterion set to 3.]
Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15               0.54
C2 C15 C15               0.62
C3 C15 C15               0.54
C4 C15 C15        0.31  -0.66
C5 C15 C15 -0.30  0.36  -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50         0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))
      MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1  1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2  0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3  0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g    0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1   1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2   0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3   0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2   0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1  1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2  0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3  0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include
1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis): fa.parallel (Horn, 1965).
4) Plotting the magnitude of the successive eigen values and applying the scree test: a sudden drop in eigen values, analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face (Cattell, 1966).
5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
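A sketch of that idea (sim.minor is from psych; the exact arguments and returned elements used here are assumptions — check ?sim.minor):

```r
library(psych)
set.seed(17)
# 12 variables, 3 major factors, plus small nuisance "minor" factors
sim <- sim.minor(nvar = 12, nfact = 3, n = 500)
fa.parallel(sim$observed)  # may suggest more than the 3 major factors
```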
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).
Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")
[Plot omitted: "Very Simple Structure of a Big 5 inventory" — Very Simple Structure Fit (0 to 1) plotted against the number of factors (1 to 8), with separate curves for complexities 1 through 4.]
Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data, as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
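A sketch of the call (slow for 25 polytomous items and 2800 subjects, so the number of replications is kept small here):

```r
library(psych)
# parallel analysis based on polychoric rather than Pearson correlations
fa.parallel(bfi[1:25], cor = "poly", n.iter = 10)
```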
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors =  6  and the number of components =  6
[Plot omitted: "Parallel Analysis of a Big 5 inventory" — eigenvalues of principal components and factor analysis (y axis) by factor/component number (x axis), with lines for PC Actual Data, PC Simulated Data, PC Resampled Data, FA Actual Data, FA Simulated Data, and FA Resampled Data.]
Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors, and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by
$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\;\frac{V_x - tr(V_x)}{V_x} = \alpha$$
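Because the formula needs only the covariance matrix, it is easy to verify by hand (a base R sketch with a made-up equicorrelated matrix; psych's alpha function does this, and much more, from raw data):

```r
lambda3 <- function(V) {
  # V: covariance (or correlation) matrix of the n items
  n <- ncol(V)
  (n / (n - 1)) * (1 - sum(diag(V)) / sum(V))  # = alpha
}

V <- matrix(0.3, 4, 4)
diag(V) <- 1            # 4 standardized items, all intercorrelations .3
lambda3(V)              # (4/3) * (1 - 4/7.6) = 0.63...
```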
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation, or smc), or more precisely, the variance of the errors, e_j^2, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}$$
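The smcs come from the diagonal of the inverse correlation matrix, so λ6 is also straightforward to compute directly (a base R sketch on a made-up equicorrelated matrix; psych reports G6 as part of its alpha output):

```r
lambda6 <- function(R) {
  smc <- 1 - 1 / diag(solve(R))     # squared multiple correlation of each item
  1 - sum(1 - smc) / sum(R)         # V_x = sum(R) for standardized items
}

R <- matrix(0.4, 5, 5)
diag(R) <- 1
lambda6(R)                          # about 0.73 for this matrix
```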
The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal, or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)
[Path diagram omitted: "Factor analysis and extension" — factors MR1 and MR2, defined by the odd-numbered variables, are extended to the even-numbered variables.]
Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
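The difference is just whether the covariance or the correlation matrix goes into the λ3 formula (a base R sketch with simulated items, one of which is given a much larger variance; the function and data here are illustrative, not the psych implementation):

```r
alpha.general <- function(V) {
  # lambda_3 from a covariance matrix (raw) or correlation matrix (standardized)
  n <- ncol(V)
  (n / (n - 1)) * (1 - sum(diag(V)) / sum(V))
}

set.seed(1)
f <- rnorm(500)
x <- sapply(1:4, function(i) f + rnorm(500))  # four parallel items
x[, 4] <- x[, 4] * 5                          # inflate one item's variance
alpha.general(cov(x))  # raw alpha: pulled around by the unequal variances
alpha.general(cor(x))  # standardized alpha: based on correlations only
```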
More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include
alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.
guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 . . . μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.
score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
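A minimal sketch of scoreItems on the first ten bfi items (the keys.list follows the style of the psych documentation; the scale names are arbitrary):

```r
library(psych)
keys.list <- list(agree = c("-A1", "A2", "A3", "A4", "A5"),
                  conscientious = c("C1", "C2", "C3", "-C4", "-C5"))
keys <- make.keys(bfi[1:10], keys.list)  # -1/0/1 keying matrix
scores <- scoreItems(keys, bfi[1:10])
scores$alpha                             # alpha for each scale
```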
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates if one item is dropped at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n rawr std.r r.cor r.drop    mean   sd
V1 500 0.77  0.76  0.76   0.67  0.0327 1.02
V2 500 0.67  0.66  0.62   0.55  0.1071 1.00
V3 500 0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500 0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500 0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500 0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500 0.60  0.60  0.52   0.46  0.0481 1.01
V8 500 0.58  0.57  0.49   0.43  0.0526 1.07
V9 500 0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n  rawr std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics:
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics:
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02
Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf)
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find the general factor loadings, and then find ωh.
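These steps can be carried out by hand. The following sketch (assuming the simulated r9 correlation matrix from above and that the psych package is loaded) mirrors what omega does internally; it is an illustration, not the full function:

```r
# A sketch of the manual route to omega_h, assuming the r9 correlation
# matrix from above and the psych functions schmid and fa.
library(psych)
sl <- schmid(r9, nfactors = 3)   # oblique factoring + Schmid-Leiman transformation
g  <- sl$sl[, "g"]               # general factor loadings
Vx <- sum(r9)                    # variance of the total score (sum of all correlations)
omega.h <- sum(g)^2 / Vx         # general factor saturation
omega.h                          # compare with the value reported by omega(r9)
```

The squared sum of the general factor loadings, divided by the total variance of the composite, is exactly the ωh formula given later in this section.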
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
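As a minimal sketch (assuming the same r9 matrix), β is part of the printed iclust output:

```r
# Hierarchical item clustering of the simulated r9 matrix;
# the printed summary reports beta (the worst split half reliability)
# for the clusters that are formed.
library(psych)
ic <- iclust(r9)
ic
```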
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p. 137).
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.
There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be √rab, where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness (u²) from factor analysis to find e²j. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h²j = c²j + Σ f²ij, and the unique variance for the item, u²j = σ²j(1 − h²j), may be used to estimate the test reliability. That is, if h²j is the communality of item j, based upon general as well as group factors, then for standardized items, e²j = 1 − h²j and

    ωt = (1cc′1 + 1AA′1′) / Vx = 1 − Σ(1 − h²j) / Vx = 1 − Σu² / Vx
Because h²j ≥ r²smc, ωt ≥ λ6.
It is important to distinguish here between the two ω coefficients of McDonald (1978) and Equation 6.20a of McDonald (1999), ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

    ωh = 1cc′1 / Vx
Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

    ω∞ = 1cc′1 / (1cc′1 + 1AA′1′)
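The three formulas can be checked numerically in base R. The sketch below builds a small hypothetical Schmid-Leiman loading structure (general loadings c, group loadings A, both made up for illustration, not taken from the r9 example) and computes ωh, ωt and ω∞ directly from the definitions:

```r
# Illustrative check of the omega formulas with made-up loadings.
c.load <- c(.7, .6, .6, .5, .5, .4)          # hypothetical general factor loadings
A <- cbind(F1 = c(.4, .4, .4, 0, 0, 0),      # hypothetical group factor loadings
           F2 = c(0, 0, 0, .3, .3, .3))
h2 <- c.load^2 + rowSums(A^2)                # communalities
u2 <- 1 - h2                                 # uniquenesses
R  <- c.load %o% c.load + A %*% t(A)         # model implied correlation matrix
diag(R) <- 1
Vx <- sum(R)                                 # variance of the total score

omega.h   <- sum(c.load)^2 / Vx                                    # 1cc'1 / Vx
omega.t   <- (sum(c.load)^2 + sum(colSums(A)^2)) / Vx              # equals 1 - sum(u2)/Vx
omega.inf <- sum(c.load)^2 / (sum(c.load)^2 + sum(colSums(A)^2))   # infinite length test
round(c(omega.h, omega.t, omega.inf), 3)
```

Note that 1AA′1′ reduces to the sum of the squared column sums of A, and that ω∞ is simply ωh/ωt.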
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om.9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57           0.41 0.50 0.50 0.66
V5 0.55           0.32 0.41 0.59 0.73
V6 0.43           0.27 0.28 0.72 0.66
V7 0.47      0.47      0.44 0.56 0.50
V8 0.43      0.42      0.36 0.64 0.51
V9 0.32      0.23      0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
> om.9 <- omega(r9,title="9 simulated variables")

[Figure 15 path diagram: the 9 items loading on the general factor g and on the group factors F1, F2 and F3]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index =  0.026  and the 90% confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90% confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g   F1   F2   F3
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57           0.41 0.50 0.50 0.66
V5 0.55           0.32 0.41 0.59 0.73
V6 0.43           0.27 0.28 0.72 0.66
V7 0.47      0.47      0.44 0.56 0.50
V8 0.43      0.42      0.36 0.64 0.51
V9 0.32      0.23      0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90% confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90% confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g   F1   F2   F3
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19

Omega Hierarchical from a confirmatory model using sem =  0.72
Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g   F1   F2   F3   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57           0.45 0.53 0.47
V5 0.58           0.25 0.40 0.60
V6 0.43           0.26 0.26 0.74
V7 0.49      0.38      0.38 0.62
V8 0.44      0.45      0.39 0.61
V9 0.34      0.24      0.17 0.83

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from Ten Berge, and an estimate of the greatest lower bound.
> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
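The reverse scoring arithmetic can be illustrated in base R. For a hypothetical 1 to 6 response scale, a -1 keyed item is replaced by (max + min) − item:

```r
# Reverse scoring: the item is subtracted from max + min of its scale.
item     <- c(1, 2, 3, 4, 5, 6)   # hypothetical responses on a 1-6 scale
reversed <- (6 + 1) - item        # reverse scored version of the same responses
reversed
```

A response of 1 becomes 6, 2 becomes 5, and so on; item and its reversal always sum to max + min.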
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
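Whether means or totals are reported is controlled by the totals argument of scoreItems. A sketch, assuming the keys matrix constructed below with make.keys and the bfi data:

```r
library(psych)
mean.scores  <- scoreItems(keys, bfi)                 # default: average item score
total.scores <- scoreItems(keys, bfi, totals = TRUE)  # total item score instead
```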
The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.
Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and for the demographic data.
This use of making multiple key matrices, and then combining them into one super matrix of keys, is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.
Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6 reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data
print with the short option = FALSE
To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 16.)
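A sketch of pulling out components by name, using the scores object created above (component names are those returned by scoreItems):

```r
library(psych)
names(scores)        # list the components returned by scoreItems
scores$corrected     # scale correlations corrected for attenuation
head(scores$scores)  # individual scale scores for the first few participants
```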
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.
Consider the same bfi data set, but first find the correlations and then use cluster.cor.
> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
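A sketch of finding the item by scale correlations from the correlation matrix, assuming keys and r.bfi as defined above (component names may vary slightly across psych versions):

```r
library(psych)
cl <- cluster.loadings(keys, r.bfi)  # structure and pattern matrices
cl$loadings                          # correlations of each item with each scale
cl$pattern                           # item-scale correlations controlling for the other scales
```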
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2
Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors), and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
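For example, having converted the ability items to 0/1 scores, a hypothetical set of four content scales (the names and item groupings below are an illustration, not part of the original analysis) could be scored as:

```r
library(psych)
# Hypothetical keys: items 1-4 reasoning, 5-8 letters, 9-12 matrices, 13-16 rotation
ability.keys <- make.keys(16, list(reasoning = 1:4, letters = 5:8,
                                   matrices = 9:12, rotation = 13:16))
ability.scores <- scoreItems(ability.keys, iq.tf)  # iq.tf from the step above
ability.scores
```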
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data, or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
The use of Item Response Theory has become what is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)
[Figure 17 panels: probability of each response alternative, P(response), as a function of total score/theta for items reason.4, reason.16, reason.17 and reason.19]
Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function.
    δj = Dτj / √(1 − λj²),    αj = λj / √(1 − λj²)    (2)
where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just
    λj = αj / √(1 + αj²),    τj = δj / √(1 + αj²)
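These conversions are simple arithmetic and can be verified directly in base R. The sketch below converts a hypothetical loading and threshold to IRT parameters (normal metric, D = 1) and back:

```r
# Round trip between factor analytic and IRT parameterizations (normal ogive, D = 1).
lambda <- 0.6                          # hypothetical factor loading
tau    <- 0.5                          # hypothetical item threshold
alpha  <- lambda / sqrt(1 - lambda^2)  # item discrimination
delta  <- tau    / sqrt(1 - lambda^2)  # item difficulty
# converting back recovers the original loading and threshold
lambda.back <- alpha / sqrt(1 + alpha^2)
tau.back    <- delta / sqrt(1 + alpha^2)
c(alpha = alpha, delta = delta, lambda.back = lambda.back, tau.back = tau.back)
```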
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.
> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
              -3    -2    -1     0    1    2     3
V1          0.48  0.59  0.24  0.06 0.01 0.00  0.00
V2          0.30  0.68  0.45  0.12 0.02 0.00  0.00
V3          0.12  0.50  0.74  0.29 0.06 0.01  0.00
V4          0.05  0.26  0.74  0.55 0.14 0.03  0.00
V5          0.01  0.07  0.44  1.05 0.41 0.06  0.01
V6          0.00  0.03  0.15  0.60 0.74 0.24  0.04
V7          0.00  0.01  0.04  0.22 0.73 0.63  0.16
V8          0.00  0.00  0.02  0.12 0.45 0.69  0.31
V9          0.00  0.01  0.02  0.08 0.25 0.47  0.36
Test Info   0.98  2.14  2.85  3.09 2.81 2.14  0.89
SEM         1.01  0.68  0.59  0.57 0.60 0.68  1.06
Reliability -0.02 0.53  0.65  0.68 0.64 0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90% confidence intervals are 0.198 0.218
BIC = 1009.07
Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:
> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
             -3   -2   -1    0    1    2    3
E1         0.37 0.58 0.75 0.76 0.61 0.39 0.21
E2         0.48 0.89 1.18 1.24 0.99 0.59 0.25
E3         0.34 0.49 0.58 0.57 0.49 0.36 0.23
E4         0.63 0.98 1.11 0.96 0.67 0.37 0.16
E5         0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info  2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM        0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90% confidence intervals are 0.083 0.111
BIC = 96.23
The item information functions show that not all of the items are equally good (Figure 19).
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))
[Figure 18 panels: item characteristic curves ("Item parameters from factor analysis", probability of response vs. latent trait), item information curves ("Item information from factor analysis"), and total test information, for items V1-V9]
Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt,type="IIC")
Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info,sort=TRUE)
Item Response Analysis using Factor Analysis
Summary information by factor and item
Factor = 1
              -3   -2   -1    0    1    2    3
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02
More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. That is, the analysis is done not on the original data, but rather on the output of the previous analysis.
For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.
> iq.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis
Summary information by factor and item
Factor = 1
            -3   -2   -1    0    1    2    3
reason.4  0.04 0.20 0.59 0.59 0.19 0.04 0.01
reason.16 0.07 0.23 0.46 0.39 0.16 0.05 0.01
reason.17 0.06 0.27 0.69 0.50 0.14 0.03 0.01
reason.19 0.05 0.17 0.41 0.46 0.22 0.07 0.02
> iq.irt <- irt.fa(iq.tf)
Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7    0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33   0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34   0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58   0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45   0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46   0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47   0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55   0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3    0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4    0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6    0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8    0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info   0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM         1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49  0.80  0.85  0.82  0.61 -0.40
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)
Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0
The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09
Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90% confidence intervals are 0.126 0.135
BIC = 2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.
Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.
> v9 <- sim.irt(9,1000,-2,2,mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho,4)
Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")
These results are seen in Figure 22.
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.
This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:
r_xy = η_x.wg * η_y.wg * r_xy.wg + η_x.bg * η_y.bg * r_xy.bg        (3)
> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)

Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
where r_xy is the normal correlation, which may be decomposed into a within group and a between group correlation, r_xy.wg and r_xy.bg, and η (eta) is the correlation of the data with the within group values or with the group means.
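These two correlation matrices can be pulled directly from a statsBy analysis. A minimal sketch, using the sat.act data set grouped by education (the rwg and rbg elements of the output hold the pooled within-group and the between-group correlations):

```r
library(psych)
sb <- statsBy(sat.act, "education")
round(sb$rwg, 2)   # pooled within-group correlations
round(sb$rbg, 2)   # correlations of the group means
```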
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.
The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act,"education")) or when affect data are analyzed within and between an affect manipulation (statsBy(affect,"Film")).
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
R^2 = 1 − Π_{i=1}^{n} (1 − λ_i)
where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = R_xx^{-1} R_xy R_yy^{-1} R_xy'
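These two equations translate directly into a few lines of base R. The following is an illustrative sketch only (setCor handles this, plus significance tests, internally); R.full is any correlation matrix and x and y index the two sets:

```r
# Illustrative sketch: Cohen's set correlation from a correlation matrix
set.cor.sketch <- function(R.full, x, y) {
  Rxx <- R.full[x, x]; Ryy <- R.full[y, y]; Rxy <- R.full[x, y]
  M <- solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy)  # R.xx^-1 R.xy R.yy^-1 R.xy'
  lambda <- eigen(M)$values                          # squared canonical correlations
  1 - prod(1 - lambda)                               # R^2 = 1 - prod(1 - lambda_i)
}
```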
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.
setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.
Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.
For this example, the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)
Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)
Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476
Compare this with the output from setCor.
> #compare with mat.regress
> setCor(c(4:6),c(1:3),C, n.obs=700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)
Multiple Regression from matrix input
Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09
Multiple R
 ACT SATV SATQ
0.16 0.10 0.19
multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359
Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11
Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01
SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16
t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02
Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99
Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317
Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137
F
 ACT SATV SATQ
6.49 2.26 8.63
Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05
degrees of freedom of regression
[1]   3 696
Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6
Average squared canonical correlation = 0.03
Cohens Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohens Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets = 0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette, psych for sem.
The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
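As an illustrative sketch (argument and element names as documented in ?sim.minor; treat the details as assumptions), one can generate such a structure and ask whether fa recovers the three major factors:

```r
library(psych)
set.seed(17)                                       # for a reproducible simulation
minor <- sim.minor(nvar = 12, nfact = 3, n = 500)  # 3 major and (by default) 6 minor factors
f3 <- fa(minor$observed, nfactors = 3)             # do the 3 major factors emerge?
```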
Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four IRT simulations, sim.rasch, sim.irt, sim.npl, and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
The logistic model is
P(x|θ_i, δ_j, γ_j, ζ_j) = γ_j + (ζ_j − γ_j) / (1 + e^(α_j(δ_j − θ_i)))        (4)
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination, and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
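Equation 4 can be transcribed directly into R (a plain transcription for illustration, not one of the sim functions):

```r
# Probability of endorsing item j for a subject with ability theta (Equation 4)
p4pl <- function(theta, alpha = 1, delta = 0, gamma = 0, zeta = 1) {
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
}
p4pl(theta = 0, delta = 0)   # Rasch case: 0.5 when ability equals difficulty
```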
(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
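The near-equivalence is easy to check numerically with base R's plogis and pnorm; with the 1.702 scaling the two curves never differ by more than about .01:

```r
theta <- seq(-4, 4, .01)
max(abs(plogis(1.702 * theta) - pnorm(theta)))   # less than .01
```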
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):
f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)            #a pretty boring plot
iclust.diagram(c)  #a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)           #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram, and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels="left to right")
> dia.curve(ll$top,ul$bottom,"double")  #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")          #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list; look at the Index for psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.
df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.
cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.
cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor, and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.
fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean and harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
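As a quick sketch of the last of these (illustrative only; see ?superMatrix for the full argument list):

```r
library(psych)
A <- diag(2)           # a 2 x 2 identity matrix
B <- matrix(1, 2, 2)   # a 2 x 2 matrix of 1s
superMatrix(A, B)      # a 4 x 4 block matrix: A top left, B lower right, 0s elsewhere
```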
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.
Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.
bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.
sat.act Self reported scores on the SAT Verbal, SAT Quantitative, and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.
epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.
galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.
Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.
miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
14 Development version and a user's guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib/ and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.
Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf
News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,
> news(Version > "1.2.8",package="psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book, An introduction to Psychometric Theory with Applications in R (Revelle, in prep)). More information about the use of some of the functions may be found in the book.
For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.
16 SessionInfo
This document was prepared using the following settings:
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] psych_1.6.5
loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45           grid_3.3.0            arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0          coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12           sem_3.1-7             tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1 parallel_3.3.0        mnormt_1.5-4
References
Bechtoldt H (1961) An empirical study of the factor analysis stability hypothesis Psy-chometrika 26(4)405ndash432
Blashfield R K (1980) The growth of cluster analysis Tryon Ward and JohnsonMultivariate Behavioral Research 15(4)439 ndash 458
Blashfield R K and Aldenderfer M S (1988) The methods and problems of clusteranalysis In Nesselroade J R and Cattell R B editors Handbook of multivariateexperimental psychology (2nd ed) pages 447ndash473 Plenum Press New York NY
Bliese P D (2009) Multilevel Modeling in R (23) A Brief Introduction to R the multilevelpackage and the nlme package
Cattell R B (1966) The scree test for the number of factors Multivariate BehavioralResearch 1(2)245ndash276
Cattell R B (1978) The scientific use of factor analysis Plenum Press New York
Cohen J (1982) Set correlation as a general multivariate data-analytic method Multi-variate Behavioral Research 17(3)
Cohen J Cohen P West S G and Aiken L S (2003) Applied multiple regres-sioncorrelation analysis for the behavioral sciences L Erlbaum Associates MahwahNJ 3rd ed edition
Cooksey R and Soutar G (2006) Coefficient beta and hierarchical item clustering - ananalytical procedure for establishing and displaying the dimensionality and homogeneityof summated scales Organizational Research Methods 978ndash98
Cronbach L J (1951) Coefficient alpha and the internal structure of tests Psychometrika16297ndash334
Dwyer P S (1937) The determination of the factor loadings of a given test from theknown factor loadings of other tests Psychometrika 2(3)173ndash178
Everitt B (1974) Cluster analysis John Wiley amp Sons Cluster analysis 122 pp OxfordEngland
Fox J Nie Z and Byrnes J (2013) sem Structural Equation Models R package version31-3
Grice J W (2001) Computing and evaluating factor scores Psychological Methods6(4)430ndash450
Guilford J P (1954) Psychometric Methods McGraw-Hill New York 2nd edition
83
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.
Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.
Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.
Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.
Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.
Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.
Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.
Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.
Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.
MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.
Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.
McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.
Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.
Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.
Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.
Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.
Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.
Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.
Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.
Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.
Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.
Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.
Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.
Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, San Francisco.
Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. W. H. Freeman, San Francisco.
Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.
Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.
Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors." Psychological Review, 42(5):425–454.
Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.
Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.
Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.
Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.
or quickView(myData)
4. Look at the patterns in the data. If you have fewer than about 10 variables, look at the SPLOM (Scatter Plot Matrix) of the data using pairs.panels (section 4.4.1).
R code
pairs.panels(myData)
5. Find the correlations of all of your data:
• Descriptively (just the values) (section 4.4.2)
R code
lowerCor(myData)
• Graphically (section 4.4.3)
R code
corPlot(r)
6. Test for the number of factors in your data using parallel analysis (fa.parallel, section 5.4.2) or Very Simple Structure (vss, 5.4.1).
R code
fa.parallel(myData)
vss(myData)
7. Factor analyze (see section 5.1) the data with a specified number of factors (the default is 1); the default method is minimum residual, and the default rotation for more than one factor is oblimin. There are many more possibilities (see sections 5.1.1-5.1.3). Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm (Revelle, 1979) (see section 5.1.6). Also consider a hierarchical factor solution to find coefficient ω (see 5.1.5).
R code
fa(myData)
iclust(myData)
omega(myData)
8. Some people like to find coefficient α as an estimate of reliability. This may be done for a single scale using the alpha function (see 6.1). Perhaps more useful is the ability to create several scales as unweighted averages of specified items using the scoreItems function (see 6.4) and to find various estimates of internal consistency for these scales, find their intercorrelations, and find scores for all the subjects.
R code
alpha(myData)  # score all of the items as part of one scale
myKeys <- make.keys(nvar=20, list(first = c(1,-3,5,-7,8:10), second = c(2,4,-6,11:15,-16)))
my.scores <- scoreItems(myKeys, myData)  # form several scales
my.scores  # show the highlights of the results
At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading the entire overview vignette helpful to get a broader understanding of what can be done in R using the psych package. Remember that the help command (?) is available for every function. Try running the examples for each help page.
2 Overview of this and related documents
The psych package (Revelle, 2017) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, in prep), a draft of which is available at http://personality-project.org/r/book.
Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.
Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares, and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP), or parallel analysis (fa.parallel) criteria. Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa. Bifactor and hierarchical factor structures may be estimated by using Schmid Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust), and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.
Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (corPlot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, and structure.diagram, as well as item response characteristics and item and test information characteristic curves, plot.irt and plot.poly.
3 Getting started
Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful
packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.
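As a sketch, this one-time installation looks like the following (assuming a working internet connection; install.views is the relevant function from the ctv package):

```r
# One-time setup: install the ctv package, then use it to install the
# "Psychometrics" task view, which pulls in psych, GPArotation, sem,
# and many related packages in a single step
install.packages("ctv")
library(ctv)
install.views("Psychometrics")
```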
4 Basic data analysis
A number of psych functions facilitate the entry of data and finding basic descriptive statistics.
Remember, to run any of the psych functions, it is necessary to make the package active by using the library command:
R code
library(psych)
The other packages, once installed, will be called automatically by psych.
It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,
R code
.First <- function(x) library(psych)
4.1 Getting the data by using read.file
Although many find it convenient to copy the data to the clipboard and then use the read.clipboard functions (see below), a helpful alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt, or .sav the file will be read correctly.
R code
my.data <- read.file()
If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command.
R code
my.data <- read.file(widths = c(4, rep(1, 35)))  # will read in a file without a header row and 36 fields, the first of which is 4 columns wide, the rest of which are 1 column each
If the file is an R data file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata) the object will be loaded. Depending what was stored, this might be several objects. If the file is a .sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (.xpt or .XPT) it will be read. .csv files (comma separated files), normal .txt or .text files, and .data or .dat files will be read as well. These are
assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).
To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read the file again, converting the value labels to numeric values.
R code
my.spss <- read.file(use.value.labels=TRUE)  # this will keep the value labels for sav files
my.data <- read.file()  # just read the file and convert the value labels to numeric
You can quickView the my.spss object to see what it looks like, as well as my.data to see the numeric encoding.
R code
quickView(my.spss)  # to see the original encoding
quickView(my.data)  # to see the numeric encoding
4.2 Data input from the clipboard
There are, of course, many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:
read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).
For example, given a data set copied to the clipboard from a spreadsheet, just enter the command
R code
my.data <- read.clipboard()
This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values
were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.
R code
my.data <- read.clipboard(sep="\t")  # define the tab option, or
my.tab.data <- read.clipboard.tab()  # just use the alternative function
For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column each, and the last 4 are 3 columns).
R code
my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))
4.3 Basic descriptive statistics
Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.
describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.
It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.
> library(psych)
> data(sat.act)
> describe(sat.act)  # basic descriptive statistics
          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41
4.4 Simple descriptive graphics
Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels
function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
4.4.1 Scatter Plot Matrices
Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean with the axis length reflecting one standard deviation of the x and y variables is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (See Figure 1 for an example.)
pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.
4.4.2 Correlational structure
There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.
> lowerCor(sat.act)
          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00
When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:
> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])
> png('pairspanels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1
Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.
          edctn age   ACT   SATV SATQ
education  0.61 (sic)
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68 1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63 1.00
> both <- lowerUpper(lower, upper)
> round(both, 2)
          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA
It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:
> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)
          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA
4.4.3 Heatmap displays of correlational structure
Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> corPlot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1
Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main="24 variables in a circumplex")
> dev.off()
null device
          1
Figure 3: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.
If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
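These ideas can be illustrated with a minimal sketch (hypothetical simulated data; tetrachoric, cor.smooth, and the burt data set are the psych objects named above, and MASS::mvrnorm is assumed to be available for generating correlated normals):

```r
library(psych)
set.seed(17)
# simulate two latent normal variables correlated .6, then cut each at 0
latent <- MASS::mvrnorm(500, mu = c(0, 0), Sigma = matrix(c(1, .6, .6, 1), 2))
items <- (latent > 0) + 0
cor(items)              # the phi coefficient underestimates the latent .6
tetrachoric(items)$rho  # the tetrachoric estimate is much closer to .6
# smoothing a correlation matrix that is not positive semi-definite
data(burt)
cor.smooth(burt)        # adjusts the smallest eigen values and rescales
```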
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)
1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.
The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.
At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.
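The SVD view of the data reduction problem can be sketched in base R (a hypothetical illustration with random data, not an example from the package itself):

```r
# Data reduction via the singular value decomposition, in base R
set.seed(42)
X <- scale(matrix(rnorm(200 * 6), ncol = 6))  # a standardized data matrix
sv <- svd(X)
# a rank-2 approximation of X built from the two largest singular values
X.hat <- sv$u[, 1:2] %*% diag(sv$d[1:2]) %*% t(sv$v[, 1:2])
# the squared singular values, rescaled by n - 1, are the eigen values
# of the correlation matrix of X
eigen(cor(X))$values
sv$d^2 / (nrow(X) - 1)
```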
Several of the functions in psych address the problem of data reduction:
fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.
principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.
factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.
fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.
nfactors A number of different tests for the number of factors problem are run.
fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.
fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.
iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
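The cosine definition given above for factor.congruence can be checked by hand; a minimal sketch with made-up loading vectors:

```r
library(psych)
# Two hypothetical loading vectors on the same five items
f1 <- matrix(c(.8, .7, .6, .1, .0))
f2 <- matrix(c(.7, .8, .5, .2, .1))
# congruence = cross product / sqrt(product of the sums of squared loadings);
# unlike a correlation, the mean loading is not subtracted first
sum(f1 * f2) / sqrt(sum(f1^2) * sum(f2^2))
factor.congruence(f1, f2)  # should agree with the hand calculation
```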
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix nRn be approximated by the product of a factor matrix nFk and its transpose plus a diagonal matrix of uniquenesses?

R = FF′ + U²    (1)
The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done
in factanal but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
5.1.2 Principal Axis Factor Analysis
An alternative least squares algorithm, factor.pa (incorporated into fa as an option (fm="pa")), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
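A sketch of the calls just described (the Thurstone correlation matrix ships with psych; max.iter = 1 is the parameter named above for mimicking the one-iteration SAS behavior):

```r
library(psych)
# principal axis factoring of the 9-variable Thurstone problem
f3.pa <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "pa")
# a single iteration, for comparison with the SAS PA solution
f3.pa1 <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "pa", max.iter = 1)
```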
The solutions from the fa, factor.minres, and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.
Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
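A sketch of that transformation (assuming the default minres extraction shown earlier; the object name f3o is illustrative):

```r
library(psych)
f3 <- fa(Thurstone, nfactors = 3, n.obs = 213)  # minimum residual solution
f3o <- target.rot(f3)  # transform towards an independent cluster solution
f3o
```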
5.1.3 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)
[Figure 4 about here: pairwise scatter plots of the item loadings on factors MR1, MR2, and MR3]
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                   WLS1   WLS2   WLS3    h2    u2  com
Sentences         0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary        0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion   0.833  0.034  0.007 0.735 0.265 1.00
First.Letters    -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words   -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes          0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series     0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees         0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group     -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                WLS1  WLS2  WLS3
Correlation of scores with factors             0.964 0.923 0.902
Multiple R square of scores with factors       0.929 0.853 0.814
Minimum correlation of possible factor scores  0.858 0.706 0.627
> fa.diagram(f3t)
[Figure 5 about here: fa.diagram path diagram linking each of the nine variables to its factor (MR1, MR2, or MR3), with the factor intercorrelations shown as curved arrows]
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components analysis (PCA)
An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases: factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.
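The common-variance versus total-variance distinction is easy to demonstrate numerically. In this Python sketch (a made-up four-variable example, not the Thurstone matrix), a correlation matrix is generated from known factor loadings, and the first principal component loadings, computed as the leading eigenvector scaled by the square root of its eigenvalue, come out uniformly larger than the factor loadings that generated the matrix:

```python
import numpy as np

# correlation matrix generated by one common factor plus unique variance
lam = np.array([0.8, 0.7, 0.6, 0.5])   # true (generating) factor loadings
R = np.outer(lam, lam)
np.fill_diagonal(R, 1.0)

# first principal component loadings: eigenvector * sqrt(eigenvalue)
vals, vecs = np.linalg.eigh(R)
pc1 = np.abs(vecs[:, -1]) * np.sqrt(vals[-1])

# the component must also absorb the unique variance on the diagonal,
# so every component loading exceeds the corresponding factor loading
print(np.round(pc1, 2))
```

As the number of items per factor grows, this gap shrinks, which is the convergence noted in the text.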
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general, factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor as affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions that produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general and group factors.
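The arithmetic of the Schmid-Leiman step is simple enough to sketch. Assuming hypothetical first-order loadings F and second-order loadings g2 (these numbers are made up for illustration, not the Thurstone results), each item's g loading is its first-order loading times its factor's loading on g, and the residualized group loadings are rescaled by the factor's residual standard deviation:

```python
import numpy as np

# hypothetical three-group hierarchy: first-order pattern loadings F
# and second-order loadings g2 of those factors on a general factor
F = np.array([[0.9, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.8, 0.0],
              [0.0, 0.7, 0.0],
              [0.0, 0.0, 0.8],
              [0.0, 0.0, 0.6]])
g2 = np.array([0.8, 0.7, 0.6])

# Schmid-Leiman: item g loadings go through the second-order factor;
# group loadings are rescaled by each factor's residual variance
g_loadings = F @ g2
group_loadings = F * np.sqrt(1.0 - g2 ** 2)

print(np.round(g_loadings, 2))
print(np.round(group_loadings, 2))
```

Note that the communalities are preserved: for each item the squared g loading plus the squared group loadings reproduce the original h2, which is why the transformation only repartitions, rather than changes, the common variance.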
> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))
[Figure 6 about here: omega path diagram with the nine variables loading on group factors F1, F2, and F3, which in turn load on a general factor g]
Figure 6 A higher order factor solution to the Thurstone 9 variable problem
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
> om <- omega(Thurstone,n.obs=213)
[Figure 7 about here: bifactor (Schmid-Leiman) path diagram with each variable loading on the general factor g and on one residualized group factor F1, F2, or F3]
Figure 7 A bifactor factor solution to the Thurstone 9 variable problem
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1 Find the proximity (eg correlation) matrix
2 Identify the most similar pair of items
3 Combine this most similar pair of items to form a new variable (cluster)
4 Find the similarity of this cluster to all other items and clusters
5 Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or, in iclust, if there is a failure to increase reliability coefficients α or β)
6 Purify the solution by reassigning items to the most similar cluster center
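Steps 2 through 4 can be sketched with elementary covariance algebra. This Python fragment is a toy illustration, not the iclust implementation (whose merge and stopping rules are based on α and β): it finds the most correlated pair in a hypothetical 4-item matrix and correlates their unit-weighted composite with the remaining items.

```python
import numpy as np

# toy 4-item problem: items 0-1 form one cluster, items 2-3 another
R = np.array([[1.0, 0.6, 0.1, 0.1],
              [0.6, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.5],
              [0.1, 0.1, 0.5, 1.0]])

# step 2: find the most similar pair (largest off-diagonal correlation)
off = np.abs(R - np.diag(np.diag(R)))
i, j = np.unravel_index(np.argmax(off), off.shape)

# steps 3-4: combine the pair into a unit-weighted composite and
# correlate it with the remaining items (covariance of a sum)
others = [m for m in range(4) if m not in (i, j)]
sd_c = np.sqrt(2 + 2 * R[i, j])               # sd of the composite i + j
r_with_c = [(R[i, m] + R[j, m]) / sd_c for m in others]
print((i, j), np.round(r_with_c, 2))
```

Repeating this merge until the reliability-based criterion fails to improve, and then purifying the item assignments, gives the full procedure.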
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])
[Figure 8 about here: ICLUST dendrogram of the 25 bfi items, showing clusters C20, C16, C15, and C21 with their α and β values]
Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working, using
Table 5: The summary statistics from an iclust analysis show three large clusters and a smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps; this will take much longer.
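The bootstrap logic itself is generic: resample rows of the data with replacement, redo the analysis on each resample, and take percentiles of the replicated statistic. A minimal Python sketch, using a simple correlation in place of a full factor analysis (synthetic data; in fa each bootstrap draw would refactor the resampled matrix):

```python
import numpy as np

rng = np.random.default_rng(7)
# two correlated variables; we bootstrap the Pearson r, but loadings
# are resampled the same way, with a factoring step per draw
n = 200
x = rng.standard_normal(n)
y = 0.6 * x + 0.8 * rng.standard_normal(n)
data = np.column_stack([x, y])

boots = []
for _ in range(1000):
    idx = rng.integers(0, n, n)          # resample rows with replacement
    sample = data[idx]
    boots.append(np.corrcoef(sample, rowvar=False)[0, 1])

lo, hi = np.percentile(boots, [2.5, 97.5])
print(round(lo, 2), round(hi, 2))        # a 95% percentile interval for r
```

The width of the interval shrinks with sample size, not with the number of bootstrap replications; more replications only stabilize the percentile estimates.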
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:
\[
c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}
\]
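The congruence coefficient is thus a column-wise cosine, which can be sketched directly (the loadings below are hypothetical, merely to show the computation):

```python
import numpy as np

def congruence(F, G):
    """Matrix of congruence coefficients (cosines) between the columns
    of two loading matrices that share the same rows (variables)."""
    F = np.asarray(F, dtype=float)
    G = np.asarray(G, dtype=float)
    Fn = F / np.linalg.norm(F, axis=0)   # normalize each column
    Gn = G / np.linalg.norm(G, axis=0)
    return Fn.T @ Gn

F = np.array([[0.9, 0.0],
              [0.8, 0.1],
              [0.1, 0.7],
              [0.0, 0.6]])
print(np.round(congruence(F, F), 2))     # unit diagonal: each column matches itself
```

Unlike a correlation, the columns are not centered first, so congruence rewards agreement in the absolute pattern of loadings, not just their covariation.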
Consider the case of a four factor solution and a four cluster solution to the Big Five problem.
> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)
[Figure 9 about here: ICLUST dendrogram ("ICLUST using polychoric correlations") of the 25 bfi items, showing clusters with their α and β values]
Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)
[Figure 10 about here: ICLUST dendrogram ("ICLUST using polychoric correlations for nclusters=5"), with the number of clusters fixed at 5]
Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")
[Figure 11 about here: ICLUST dendrogram ("ICLUST beta.size=3"), with the beta criterion set to 3]
Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, and 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit =  0.68   Pattern fit =  0.96   RMSR =  0.05
NULL
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))
      MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1  1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2  0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3  0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g    0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1   1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2   0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3   0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2   0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1  1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2  0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3  0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include
1) Extracting factors until the chi square of the residual matrix is not significant
2) Extracting factors until the change in chi square from factor n to factor n+1 is notsignificant
3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965)
4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values, analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966)
5) Extracting factors as long as they are interpretable
6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979)
7) Using Wayne Velicerrsquos Minimum Average Partial (MAP) criterion (Velicer 1976)
8) Extracting principal components until the eigen value lt 1
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects, and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size, in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing, but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors, and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Average Partial correlation) of Velicer (1976).
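The simplification at the heart of VSS can be sketched as follows. This Python fragment is an illustrative toy, not the vss implementation: it keeps only the c largest loadings per row, reproduces the correlation matrix from the simplified loadings, and scores the fit by comparing the residual off-diagonal sums of squares to those of the original matrix.

```python
import numpy as np

def vss_fit(R, loadings, Phi, c):
    """VSS-style fit: zero all but the c largest loadings in each row,
    reproduce R from the simplified matrix, and compare residual
    off-diagonal sums of squares with those of R itself."""
    S = np.zeros_like(loadings)
    for row in range(loadings.shape[0]):
        keep = np.argsort(np.abs(loadings[row]))[-c:]
        S[row, keep] = loadings[row, keep]
    R_hat = S @ Phi @ S.T
    mask = ~np.eye(R.shape[0], dtype=bool)
    resid = R[mask] - R_hat[mask]
    return 1 - (resid ** 2).sum() / (R[mask] ** 2).sum()

# simple-structure toy loadings: complexity 1 already fits well,
# and complexity 2 (all loadings kept) fits perfectly
L = np.array([[0.8, 0.05], [0.7, 0.0], [0.05, 0.7], [0.0, 0.6]])
Phi = np.eye(2)
R = L @ L.T
np.fill_diagonal(R, 1.0)
print(round(vss_fit(R, L, Phi, 1), 3))
```

Profiling this fit across numbers of factors and values of c, as vss does, shows where adding factors stops improving the *simplified* solution, rather than the fully complex one.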
Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")
[Figure 12 about here: "Very Simple Structure of a Big 5 inventory", plotting Very Simple Structure fit against the number of factors for item complexities 1 through 4]
Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss

Very Simple Structure of  Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58  with  4  factors
VSS complexity 2 achieves a maximimum of 0.74  with  4  factors

The Velicer MAP achieves a minimum of 0.01  with  5  factors
BIC achieves a minimum of  -524.26  with  8  factors
Sample Size adjusted BIC achieves a minimum of  -117.56  with  8  factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with paran from the paran package. The latter uses SMCs (squared multiple correlations) to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using SMCs will tend to lead to too many factors.
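The core of parallel analysis is a comparison of observed eigenvalues with those of random data of the same dimensions. Here is a Python sketch using synthetic data with three built-in factors; it uses the principal-components form of the comparison, with the mean random eigenvalues as the criterion:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 9
# data with three real factors plus noise
F = rng.standard_normal((n, 3))
load = np.zeros((k, 3))
load[0:3, 0] = load[3:6, 1] = load[6:9, 2] = 0.8
X = F @ load.T + 0.6 * rng.standard_normal((n, k))

obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

# eigenvalues of correlation matrices of pure-noise data, same n and k
sims = np.array([
    np.sort(np.linalg.eigvalsh(
        np.corrcoef(rng.standard_normal((n, k)), rowvar=False)))[::-1]
    for _ in range(50)])
crit = sims.mean(axis=0)

# keep components whose eigenvalue exceeds the random criterion
n_factors = int(np.sum(obs > crit))
print(n_factors)
```

fa.parallel additionally resamples the real data and repeats the comparison for factor-analytic eigenvalues, but the logic is the same.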
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
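A minimal sketch of such a call (assuming the bfi item data from psych are available; argument names follow the fa.parallel help page):

```r
library(psych)
# Parallel analysis of the polychoric correlations of the 25 bfi items.
# cor = "poly" requests polychoric rather than Pearson correlations;
# n.iter sets the number of random replications (20, the default
# discussed in the text).  This can be slow for large data sets.
fa.parallel(bfi[1:25], cor = "poly", n.iter = 20,
            main = "Polychoric parallel analysis of the bfi")
```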
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6
[Scree plot: "Parallel Analysis of a Big 5 inventory". Eigenvalues of principal components and factor analysis are plotted against the Factor/Component Number for the actual, simulated, and resampled data (PC and FA versions of each).]
Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors, MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V}_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha$$
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
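The λ3 formula above is easy to check by hand. A minimal sketch (alpha.from.cov is a made-up name for illustration; the real alpha function does considerably more):

```r
# Guttman's lambda 3 (coefficient alpha) computed directly from a
# covariance (or correlation) matrix, following the formula above:
# alpha = n/(n-1) * (1 - tr(V)/Vx), where Vx is the sum of all elements.
alpha.from.cov <- function(C) {
  n  <- ncol(C)
  Vx <- sum(C)                       # total test variance
  (n / (n - 1)) * (1 - sum(diag(C)) / Vx)
}
C <- matrix(0.3, 3, 3)               # three parallel items with r = .3
diag(C) <- 1
alpha.from.cov(C)                    # 0.5625
```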
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum(1 - r_{smc}^2)}{V_x}$$
The squared multiple correlation is a lower bound for the item communality and as thenumber of items increases becomes a better estimate
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)
[Path diagram: "Factor analysis and extension". Factors MR1 and MR2, defined by the odd numbered variables (V1 to V15), are extended to the even numbered variables (V2 to V16), with loadings of roughly |0.5| to |0.7|.]
Figure 14: Factor extension applies factors from one set of variables (those on the left) to another set (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
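The λ6 formula can likewise be sketched directly, using psych's smc function for the squared multiple correlations (lambda6.from.r is a made-up name; the guttman and splitHalf functions report λ6 among their output):

```r
library(psych)
# Guttman's lambda 6 from a correlation matrix R:
# lambda6 = 1 - sum(1 - smc)/Vx, with Vx the sum of all elements of R.
lambda6.from.r <- function(R) {
  1 - sum(1 - smc(R)) / sum(R)
}
R <- matrix(0.3, 3, 3)    # three items with equal intercorrelations of .3
diag(R) <- 1
lambda6.from.r(R)         # about 0.46
```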
Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
More complete reliability analyses of a single scale can be done using the omega functionwhich finds ωh and ωt based upon a hierarchical factor analysis
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include:
alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.
guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 … μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lower bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).
omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)
cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and an n x n correlation matrix, find the correlations of the composite clusters.
scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.
score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as when each item is dropped one at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)
Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures
Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02
Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013), based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
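Those steps can be carried out one at a time. A sketch using the correlations of the simulated r9 data from the previous section (the omega function wraps all of this into a single call):

```r
library(psych)
R  <- cor(r9)                        # correlations of the simulated r9 data
f3 <- fa(R, nfactors = 3, rotate = "oblimin")   # oblique factor solution
sl <- schmid(R, nfactors = 3)        # Schmid-Leiman transformation
g  <- sl$sl[, "g"]                   # general factor loadings
omega.h <- sum(g)^2 / sum(R)         # omega_h = 1cc'1 / Vx
```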
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the
method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p. 137)
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.
There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the $\sqrt{r_{ab}}$ where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
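A sketch of such a call, using the simulated r9 data from section 6.1:

```r
# Two factor omega solution: equate the general factor with the first
# group factor rather than using the default "equal" weighting.
om2 <- omega(r9, nfactors = 2, option = "first")
```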
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness $u^2$ from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, $V_x$, into four parts: that due to a general factor, $g$, that due to a set of group factors, $f$ (factors common to some but not all of the items), specific factors, $s$, unique to each item, and $e$, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting $x = cg + Af + Ds + e$, then the communality of item $j$, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$, and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item $j$, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$
Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.
It is important to distinguish here between the two ω coefficients of McDonald (1978) and Equation 6.20a of McDonald (1999), ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.
$$\omega_h = \frac{\mathbf{1}cc'\mathbf{1}}{V_x}$$
Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by
$$\omega_{\infty} = \frac{\mathbf{1}cc'\mathbf{1}}{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}$$
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om9
9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total:           0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
> om9 <- omega(r9,title="9 simulated variables")
[Bifactor diagram: "9 simulated variables". The general factor g loads 0.3 to 0.7 on all nine variables, with group factors F1* (V1, V3, V2), F2* (V7, V8, V9), and F3* (V4, V5, V6) loading 0.2 to 0.5.]
Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index =  0.026  and the 90% confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90% confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g  F1*  F2*  F3*
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9,n.obs=500)
Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total:           0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63
With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90% confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90% confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g  F1*  F2*  F3*
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19

Omega Hierarchical from a confirmatory model using sem =  0.72
Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from ten Berge, and an estimate of the greatest lower bound.
> splitHalf(r9)
Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
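The reverse scoring identity is easy to verify; a minimal sketch with hypothetical responses on a 1 to 6 scale:

```r
item  <- c(1, 4, 6, 2)          # hypothetical responses on a 1 to 6 scale
max.x <- 6
min.x <- 1
reversed <- (max.x + min.x) - item   # reverse score: subtract from max + min
reversed                        # 6 3 1 5
```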
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.
Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, including those for the demographic data.
This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.
Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores
Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.
To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 16.)
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.
Consider the same bfi data set, but first find the correlations and then use cluster.cor.

> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)
Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix), or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
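A sketch of the call (assuming the keys and r.bfi objects defined above):

```r
# Structure and pattern matrices for the bfi scales
cl <- cluster.loadings(keys, r.bfi)
cl$loadings   # correlations of each item with each scale (structure)
cl$pattern    # item by scale correlations controlling for the other scales
```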
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2
Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iqitems)
> iq.keys <- c(4,4,4,6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)
Call: score.multiple.choice(key = iq.keys, data = iqitems)
(Unstandardized) Alpha
[1] 0.84
Average item correlation
[1] 0.25
item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results
          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
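To make that scoring step concrete, here is a minimal base-R sketch of the core computation scoreItems performs for a single positively keyed scale: summing the keyed items and computing coefficient alpha. The response matrix and keys below are made-up toy values, not the real iq.tf data.

```r
# Toy 0/1 item responses: 6 people, 4 items (not real data)
items <- matrix(c(1,1,1,1,
                  0,0,0,0,
                  1,1,1,0,
                  0,0,1,0,
                  1,1,1,1,
                  0,1,0,0), ncol = 4, byrow = TRUE)
colnames(items) <- c("i1","i2","i3","i4")

keys <- c(1, 1, 1, 1)     # all four items keyed positively on one scale

score <- items %*% keys    # total score per person

# coefficient alpha: (k/(k-1)) * (1 - sum(item variances)/var(total))
k <- ncol(items)
alpha <- (k/(k-1)) * (1 - sum(apply(items, 2, var)) / var(as.vector(score)))
alpha   # about 0.84 for these toy responses
```

scoreItems does much more (reverse keying, multiple scales at once, corrected intercorrelations), but this is the heart of the computation.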
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss) or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4,6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)
Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δ_j, and item discrimination, α_j, may be found from factor analysis of the tetrachoric correlations, where λ_j is just the factor loading on the first factor and τ_j is the normal threshold reported by the tetrachoric function:
δ_j = D τ_j / √(1 − λ_j²),        α_j = λ_j / √(1 − λ_j²)        (2)
where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λ_j) and item thresholds (τ_j) are just
λ_j = α_j / √(1 + α_j²),        τ_j = δ_j / √(1 + α_j²)
Consider 9 dichotomous items representing one factor, but differing in their levels of difficulty.
> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)
> test
Item Response Analysis using Factor Analysis
Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis
Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)
Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235
The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1
Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90 % confidence intervals are 0.198 0.218
BIC = 1009.07
Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:
> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis
Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)
Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27
The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06
Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90 % confidence intervals are 0.083 0.111
BIC = 96.23
The item information functions show that not all of the items are equally good (Figure 19).
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))
Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt,type="IIC")
Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info,sort=TRUE)
Item Response Analysis using Factor Analysis
Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 21).
> iq.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis
Summary information by factor and item
Factor = 1
             -3    -2    -1     0     1     2     3
reason.4   0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16  0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17  0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19  0.05  0.17  0.41  0.46  0.22  0.07  0.02
> iq.irt <- irt.fa(iq.tf)
Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7    0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33   0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34   0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58   0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45   0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46   0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47   0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55   0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3    0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4    0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6    0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8    0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info   0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM         1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49  0.80  0.85  0.82  0.61 -0.40
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)
Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0
The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09
Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90 % confidence intervals are 0.126 0.135
BIC = 2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.
Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:
> v9 <- sim.irt(9,1000,-2,2,mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho,4)
Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")
These results are seen in Figure 22
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.
This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:
r_xy = η_xwg · η_ywg · r_xywg + η_xbg · η_ybg · r_xybg        (3)
Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)
where r_xy is the normal correlation, which may be decomposed into a within group and a between group correlation, r_xywg and r_xybg, and η (eta) is the correlation of the data with the within group values or the group means.
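The decomposition in equation (3) can be checked numerically in base R. The sketch below builds a toy two-group data set (simulated for illustration only), splits each variable into its between component (the group means) and within component (deviations from the group means), and verifies that the observed correlation recomposes exactly.

```r
set.seed(42)
g <- rep(1:2, each = 50)              # two groups of 50
x <- rnorm(100) + 2 * (g == 2)        # x and y both shift with group,
y <- rnorm(100) + 2 * (g == 2)        # inducing a between-group correlation

x.bg <- ave(x, g); y.bg <- ave(y, g)  # between component: group means
x.wg <- x - x.bg;  y.wg <- y - y.bg   # within component: deviations

r.xy <- cor(x, y)
r.wg <- cor(x.wg, y.wg)               # pooled within-group correlation
r.bg <- cor(x.bg, y.bg)               # between-group correlation
eta.xw <- cor(x, x.wg); eta.yw <- cor(y, y.wg)
eta.xb <- cor(x, x.bg); eta.yb <- cor(y, y.bg)

# equation (3): the observed correlation recomposes exactly
r.xy - (eta.xw * eta.yw * r.wg + eta.xb * eta.yb * r.bg)   # ~ 0
```

statsBy performs this decomposition (and much more) for whole data frames; this sketch only shows why the identity holds: the within and between components are orthogonal, so the covariances simply add.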
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, and -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.
The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
R² = 1 − ∏_{i=1}^{n} (1 − λ_i)
where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix
R = R_xx⁻¹ R_xy R_yy⁻¹ R_yx
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.
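The computation behind the two equations above can be sketched in a few lines of base R. The 4 x 4 correlation matrix below is a made-up example, not the sat.act data, and set.cor.R2 is a hypothetical helper name, not a psych function.

```r
# Cohen's set correlation from a correlation matrix R, with x and y
# giving the column indices of the two sets.
set.cor.R2 <- function(R, x, y) {
  Rxx <- R[x, x]; Ryy <- R[y, y]; Rxy <- R[x, y]
  M <- solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy)
  lambda <- Re(eigen(M)$values)   # squared canonical correlations
  1 - prod(1 - lambda)
}

# An illustrative (positive definite) correlation matrix
R <- matrix(c(1.0, 0.5, 0.3, 0.2,
              0.5, 1.0, 0.4, 0.1,
              0.3, 0.4, 1.0, 0.3,
              0.2, 0.1, 0.3, 1.0), 4, 4)

set.cor.R2(R, x = 1:2, y = 3:4)   # variance shared between sets {1,2} and {3,4}
```

Because the eigenvalues of M are the squared canonical correlations, two independent sets (R_xy = 0) give R² = 0, and the result is unchanged if x and y are swapped, which is the symmetry noted below in the setCor output discussion.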
setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.
Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.
For this example, the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)
Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476
Compare this with the output from setCor:
> #compare with mat.regress
> setCor(c(4:6),c(1:3),C, n.obs=700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19
multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11
Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05
degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohen's Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohen's Set Correlation: 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R² is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette psych forsem
The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
Although coefficient ω_h is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four IRT simulations, sim.rasch, sim.irt, sim.npl, and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
The logistic model is
P(x | θ_i, δ_j, γ_j, ζ_j) = γ_j + (ζ_j − γ_j) / (1 + e^{α_j(δ_j − θ_i)})        (4)
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination, and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.
(Graphics of these may be seen in the demonstrations for the logistic function)
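A direct base-R transcription of equation (4) makes the role of each parameter easy to explore. This is only an illustrative sketch, not the psych implementation, and the parameter values below are made up.

```r
# Four-parameter logistic item response model, equation (4):
# gamma = lower asymptote (guessing), zeta = upper asymptote,
# alpha = discrimination, delta = difficulty (location)
p4pl <- function(theta, alpha = 1, delta = 0, gamma = 0, zeta = 1) {
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
}

p4pl(0, alpha = 1, delta = 0)       # 0.5: at theta = delta the 1PL gives .5
p4pl(0, delta = 0, gamma = 0.25)    # 0.625: halfway between gamma and zeta
```

At θ = δ the exponent is 0, so the probability is always halfway between the asymptotes, which is why item information peaks near an item's location.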
The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as by specific diagram functions, e.g. (but not shown):
f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)    #a pretty boring plot
iclust.diagram(c)    #a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)    #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram, and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels="left to right")
> dia.curve(ll$top,ul$bottom,"double")  #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")  #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.
df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LATEX table. May be used when Sweave is not convenient.
cor2latex Will format a correlation matrix in APA style in a LATEX table. See also fa2latex and irt2latex.
cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.
fisherz Converts a correlation to the corresponding Fisher z score.
geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.
ICC and cohen.kappa are typically used to find the reliability for raters.
headtail combines the head and tail functions to show the first and last lines of a data set or output.
topBottom Same as headtail: combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between them.
mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.
p.rep finds the probability of replication for an F, t, or r, and estimates the effect size.
partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also set.cor.)
rangeCorrection will correct correlations for restriction of range.
reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
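The quadrant layout described for superMatrix can be sketched in a few lines of base R; super_matrix here is an illustrative stand-in, not psych's implementation:

```r
# Illustrative base-R version of the "Super matrix" layout described above:
# A in the upper left, B in the lower right, and 0s in the other two quadrants.
# (super_matrix is a sketch for exposition, not psych's superMatrix.)
super_matrix <- function(A, B) {
  out <- matrix(0, nrow(A) + nrow(B), ncol(A) + ncol(B))
  out[seq_len(nrow(A)), seq_len(ncol(A))] <- A
  out[nrow(A) + seq_len(nrow(B)), ncol(A) + seq_len(ncol(B))] <- B
  out
}

A <- diag(2)            # a 2 x 2 identity matrix
B <- matrix(1, 3, 3)    # a 3 x 3 matrix of ones
super_matrix(A, B)      # a 5 x 5 matrix with zeros in the off-diagonal quadrants
```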
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative, and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generations of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
14 Development version and a user's guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib/ and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf
News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.2.8",package="psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book) An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html A short guide to R.
16 SessionInfo
This document was prepared using the following settings
> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45           grid_3.3.0            arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0          coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12           sem_3.1-7             tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1 parallel_3.3.0        mnormt_1.5-4
References
Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.
Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.
Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.
Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.
Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.
Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).
Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd ed edition.
Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.
Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.
Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.
Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.
Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.
Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.
Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.
Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.
Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.
Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.
Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.
Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.
Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.
Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.
MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.
Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.
McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.
Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.
Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.
Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.
Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.
Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.
Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.
Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.
Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.
Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.
Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.
Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.
Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.
Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.
Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.
Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.
Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors". Psychological Review, 42(5):425–454.
Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.
Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.
Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.
Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.
At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading the entire overview vignette helpful to get a broader understanding of what can be done in R using the psych package. Remember that the help command (?) is available for every function. Try running the examples for each help page.
2 Overview of this and related documents
The psych package (Revelle, 2017) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, prep), a draft of which is available at http://personality-project.org/r/book/.

Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP) or parallel analysis (fa.parallel) criteria. Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa. Bifactor and hierarchical factor structures may be estimated by using Schmid Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust) and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (cor.plot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, and structure.diagram, as well as item response characteristics and item and test information characteristic curves using plot.irt and plot.poly.
3 Getting started
Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful
packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.
4 Basic data analysis
A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions it is necessary to make the package active by using the library command:

R code
library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

R code
.First <- function(x) {library(psych)}
4.1 Getting the data by using read.file
Although many find copying the data to the clipboard and then using the read.clipboard functions (see below) helpful, an alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of txt, text, r, rds, rda, csv, xpt, or sav, the file will be read correctly.

R code
my.data <- read.file()

If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command.

R code
my.data <- read.file(widths = c(4,rep(1,35)))  #will read in a file without a header row
 #and 36 fields, the first of which is 4 columns, the rest of which are 1 column each

If the file is a RData file (with suffix of RData, Rda, rda, Rdata, or rdata), the object will be loaded. Depending what was stored, this might be several objects. If the file is a sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (xpt or XPT), it will be read. csv files (comma separated files), normal txt or text files, data or dat files will be read as well. These are
assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).

To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read it again, converting the value labels to numeric values.

R code
my.spss <- read.file(use.value.labels=TRUE)  #this will keep the value labels for sav files
my.data <- read.file()  #just read the file and convert the value labels to numeric

You can quickView the my.spss object to see what it looks like, as well as the my.data object to see the numeric encoding.

R code
quickView(my.spss)  #to see the original encoding
quickView(my.data)  #to see the numeric encoding
4.2 Data input from the clipboard
There are, of course, many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command

R code
my.data <- read.clipboard()
This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values
were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

R code
my.data <- read.clipboard(sep="\t")   #define the tab option, or
my.tab.data <- read.clipboard.tab()   #just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, the last 4 are 3 columns).

R code
my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))
4.3 Basic descriptive statistics
Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.
> library(psych)
> data(sat.act)
> describe(sat.act)  #basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41
4.4 Simple descriptive graphics
Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels
function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
4.4.1 Scatter Plot Matrices
Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (see Figure 1 for an example).

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.
4.4.2 Correlational structure
There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values, returns (invisibly) the full correlation matrix, and displays the lower off diagonal matrix.
> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00
When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act,sat.act$gender==2)
> male <- subset(sat.act,sat.act$gender==1)
> lower <- lowerCor(male[-1])
> png('pairspanels.png')
> pairs.panels(sat.act,pch='.')
> dev.off()
null device
          1
Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.
          edctn age   ACT   SATV  SATQ
education 1.00
age       0.61  1.00
ACT       0.16  0.15  1.00
SATV      0.02 -0.06  0.61  1.00
SATQ      0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education 1.00
age       0.52  1.00
ACT       0.16  0.08  1.00
SATV      0.07 -0.03  0.53  1.00
SATQ      0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower,upper)
> round(both,2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA
It is also possible to compare two matrices by taking their differences, displaying one below the diagonal and the difference of the second from the first above the diagonal.
> diffs <- lowerUpper(lower,upper,diff=TRUE)
> round(diffs,2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA
4.4.3 Heatmap displays of correlational structure
Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. It is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> cor.plot(Thurstone,numbers=TRUE,main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1
Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ,main='24 variables in a circumplex')
> dev.off()
null device
          1
Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
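The equivalence between φ and the Pearson correlation of dichotomous data can be verified directly in base R; the small data vectors here are made up purely for illustration:

```r
# phi computed two ways: as the Pearson correlation of two 0/1 variables,
# and directly from the cell counts of their 2 x 2 table.
# (x and y are illustrative toy data, not a psych data set.)
x <- c(1, 1, 1, 0, 0, 0, 0, 1)
y <- c(1, 0, 1, 0, 0, 1, 0, 1)

tab <- table(x, y)                      # 2 x 2 frequency table
n11 <- tab["1","1"]; n10 <- tab["1","0"]
n01 <- tab["0","1"]; n00 <- tab["0","0"]
phi <- (n11*n00 - n10*n01) /
       sqrt((n11+n10)*(n01+n00)*(n11+n01)*(n10+n00))

phi                # 0.5
cor(x, y)          # 0.5 -- the two computations agree
```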
Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.

If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

A correlation matrix built from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigenvalues of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
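The smoothing procedure just described can be sketched in base R: floor the non-positive eigenvalues, rescale the eigenvalues to again sum to the number of variables, reconstruct, and restore a unit diagonal. This illustrates the idea, not psych's exact cor.smooth algorithm:

```r
# Base-R sketch of eigenvalue smoothing (the idea behind psych's cor.smooth):
# smooth_cor is an illustrative helper, not the package function.
smooth_cor <- function(R, floor = .Machine$double.eps * 100) {
  e <- eigen(R, symmetric = TRUE)
  vals <- pmax(e$values, floor)               # raise non-positive eigenvalues
  vals <- vals * nrow(R) / sum(vals)          # rescale to sum to n variables
  Rs <- e$vectors %*% diag(vals) %*% t(e$vectors)
  cov2cor(Rs)                                 # put 1s back on the diagonal
}

# a deliberately non-positive-definite "correlation" matrix
R <- matrix(c(1, .9, .2,
              .9, 1, .9,
              .2, .9, 1), 3, 3)
min(eigen(R)$values)               # negative: R is not positive semi-definite
min(eigen(smooth_cor(R))$values)   # positive after smoothing
```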
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and ofscales and for finding various estimates of scale reliability These may be considered asproblems of dimension reduction (eg factor analysis cluster analysis principal compo-nents analysis) and of forming and estimating the reliability of the resulting compositescales
51 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictumcommonly attributed to William of Ockham to not multiply entities beyond necessity1 Thegoal for parsimony is seen in psychometrics as an attempt either to describe (components)
1Although probably neither original with Ockham nor directly stated by him (Thorburn 1918) Ock-hamrsquos razor remains a fundamental principal of science
15
or to explain (factors) the relationships between many observed variables in terms of amore limited set of components or latent factors
The typical data matrix represents multiple items or scales usually thought to reflect fewerunderlying constructs2 At the most simple a set of items can be be thought to representa random sample from one underlying domain or perhaps a small set of domains Thequestion for the psychometrician is how many domains are represented and how well doeseach item represent the domains Solutions to this problem are examples of factor analysis(FA) principal components analysis (PCA) and cluster analysis (CA) All of these pro-cedures aim to reduce the complexity of the observed data In the case of FA the goal isto identify fewer underlying constructs to explain the observed data In the case of PCAthe goal can be mere data reduction but the interpretation of components is frequentlydone in terms similar to those used when describing the latent variables estimated by FACluster analytic techniques although usually used to partition the subject space ratherthan the variable space can also be used to group variables to reduce the complexity ofthe data by forming fewer and more homogeneous sets of tests or items
At the data level the data reduction problem may be solved as a Singular Value Decom-position of the original matrix although the more typical solution is to find either theprincipal components or factors of the covariance or correlation matrices Given the pat-tern of regression weights from the variables to the components or from the factors to thevariables it is then possible to find (for components) individual component or cluster scoresor estimate (for factors) factor scores
Several of the functions in psych address the problem of data reduction
fa incorporates five alternative algorithms minres factor analysis principal axis factoranalysis weighted least squares factor analysis generalized least squares factor anal-ysis and maximum likelihood factor analysis That is it includes the functionality ofthree other functions that will be eventually phased out
principal Principal Components Analysis reports the largest n eigen vectors rescaled bythe square root of their eigen values
factorcongruence The congruence between two factors is the cosine of the angle betweenthem This is just the cross products of the loadings divided by the sum of the squaredloadings This differs from the correlation coefficient in that the mean loading is notsubtracted before taking the products factorcongruence will find the cosinesbetween two (or more) sets of factor loadings
vss Very Simple Structure Revelle and Rocklin (1979) applies a goodness of fit test todetermine the optimal number of factors to extract It can be thought of as a quasi-
2Cattell (1978) as well as MacCallum et al (2007) argue that the data are the result of many morefactors than observed variables but are willing to estimate the major underlying factors
16
confirmatory model in that it fits the very simple structure (all except the biggest cloadings per item are set to zero where c is the level of complexity of the item) of afactor pattern matrix to the original correlation matrix For items where the model isusually of complexity one this is equivalent to making all except the largest loadingfor each item 0 This is typically the solution that the user wants to interpret Theanalysis includes the MAP criterion of Velicer (1976) and a χ2 estimate
faparallel The parallel factors technique compares the observed eigen values of a cor-relation matrix with those from random data
faplot will plot the loadings from a factor principal components or cluster analysis(just a call to plot will suffice) If there are more than two factors then a SPLOMof the loadings is generated
nfactors A number of different tests for the number of factors problem are run
fadiagram replaces fagraph and will draw a path diagram representing the factor struc-ture It does not require Rgraphviz and thus is probably preferred
fagraph requires Rgraphviz and will draw a graphic representation of the factor struc-ture If factors are correlated this will be represented as well
iclust is meant to do item cluster analysis using a hierarchical clustering algorithmspecifically asking questions about the reliability of the clusters (Revelle 1979) Clus-ters are formed until either coefficient α Cronbach (1951) or β Revelle (1979) fail toincrease
511 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rankThat is can the correlation matrix nRn be approximated by the product of a factor matrix
nFk and its transpose plus a diagonal matrix of uniqueness
R = FF prime+U2 (1)
The maximum likelihood solution to this equation is found by factanal in the stats pack-age Five alternatives are provided in psych all of them are included in the fa functionand are called by specifying the factor method (eg fm=ldquominresrdquo fm=ldquopardquo fm=ldquordquowlsrdquofm=rdquoglsrdquo and fm=rdquomlrdquo) In the discussion of the other algorithms the calls shown are tothe fa function specifying the appropriate method
factorminres attempts to minimize the off diagonal residual correlation matrix by ad-justing the eigen values of the original correlation matrix This is similar to what is done
17
in factanal but uses an ordinary least squares instead of a maximum likelihood fit func-tion The solutions tend to be more similar to the MLE solutions than are the factorpa
solutions minres is the default for the fa function
A classic data set collected by Thurstone and Thurstone (1941) and then reanalyzed byBechtoldt (1961) and discussed by McDonald (1999) is a set of 9 cognitive variables witha clear bi-factor structure Holzinger and Swineford (1937) The minimum residual solutionwas transformed into an oblique solution using the default option on rotate which usesan oblimin transformation (Table 1) Alternative rotations and transformations includeldquononerdquo ldquovarimaxrdquo ldquoquartimaxrdquo ldquobentlerTrdquo and ldquogeominTrdquo (all of which are orthogonalrotations) as well as ldquopromaxrdquo ldquoobliminrdquo ldquosimplimaxrdquo ldquobentlerQ andldquogeominQrdquo andldquoclusterrdquo which are possible oblique transformations of the solution The default is to doa oblimin transformation although prior versions defaulted to varimax The measures offactor adequacy reflect the multiple correlations of the factors with the best fitting linearregression estimates of the factor scores (Grice 2001)
512 Principal Axis Factor Analysis
An alternative least squares algorithm factorpa (incorporated into fa as an option (fm= ldquopardquo) does a Principal Axis factor analysis by iteratively doing an eigen value decompo-sition of the correlation matrix with the diagonal replaced by the values estimated by thefactors of the previous iteration This OLS solution is not as sensitive to improper matri-ces as is the maximum likelihood method and will sometimes produce more interpretableresults It seems as if the SAS example for PA uses only one iteration Setting the maxiterparameter to 1 produces the SAS solution
The solutions from the fa the factorminres and factorpa as well as the principal
functions can be rotated or transformed with a number of options Some of these callthe GPArotation package Orthogonal rotations are varimax and quartimax Obliquetransformations include oblimin quartimin and then two targeted rotation functionsPromax and targetrot The latter of these will transform a loadings matrix towardsan arbitrary target matrix The default is to transform towards an independent clustersolution
Using the Thurstone data set three factors were requested and then transformed into anindependent clusters solution using targetrot (Table 2)
513 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals factor method ldquowlsrdquoweights the squared residuals by their uniquenesses This tends to produce slightly smaller
18
Table 1 Three correlated factors from the Thurstone 9 variable problem By default thesolution is transformed obliquely using oblimin The extraction method is (by default)minimum residualgt f3t lt- fa(Thurstone3nobs=213)gt f3tFactor Analysis using method = minresCall fa(r = Thurstone nfactors = 3 nobs = 213)Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 MR3 h2 u2 comSentences 091 -004 004 082 018 10Vocabulary 089 006 -003 084 016 10SentCompletion 083 004 000 073 027 10FirstLetters 000 086 000 073 027 104LetterWords -001 074 010 063 037 10Suffixes 018 063 -008 050 050 12LetterSeries 003 -001 084 072 028 10Pedigrees 037 -005 047 050 050 19LetterGroup -006 021 064 053 047 12
MR1 MR2 MR3SS loadings 264 186 150Proportion Var 029 021 017Cumulative Var 029 050 067Proportion Explained 044 031 025Cumulative Proportion 044 075 100
With factor correlations ofMR1 MR2 MR3
MR1 100 059 054MR2 059 100 052MR3 054 052 100
Mean item complexity = 12Test of the hypothesis that 3 factors are sufficient
The degrees of freedom for the null model are 36 and the objective function was 52 with Chi Square of 108197The degrees of freedom for the model are 12 and the objective function was 001
The root mean square of the residuals (RMSR) is 001The df corrected root mean square of the residuals is 001
The harmonic number of observations is 213 with the empirical chi square 058 with prob lt 1The total number of observations was 213 with MLE Chi Square = 282 with prob lt 1
Tucker Lewis Index of factoring reliability = 1027RMSEA index = 0 and the 90 confidence intervals are NA NABIC = -6151Fit based upon off diagonal values = 1Measures of factor score adequacy
MR1 MR2 MR3Correlation of scores with factors 096 092 090Multiple R square of scores with factors 093 085 081Minimum correlation of possible factor scores 086 071 063
19
Table 2 The 9 variable problem from Thurstone is a classic example of factoring wherethere is a higher order factor g that accounts for the correlation between the factors Theextraction method was principal axis The transformation was a targeted transformationto a simple cluster solutiongt f3 lt- fa(Thurstone3nobs = 213fm=pa)gt f3o lt- targetrot(f3)gt f3oCall NULLStandardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 PA3 h2 u2Sentences 089 -003 007 081 019Vocabulary 089 007 000 080 020SentCompletion 083 004 003 070 030FirstLetters -002 085 -001 073 0274LetterWords -005 074 009 057 043Suffixes 017 063 -009 043 057LetterSeries -006 -008 084 069 031Pedigrees 033 -009 048 037 063LetterGroup -014 016 064 045 055
PA1 PA2 PA3SS loadings 245 172 137Proportion Var 027 019 015Cumulative Var 027 046 062Proportion Explained 044 031 025Cumulative Proportion 044 075 100
PA1 PA2 PA3PA1 100 002 008PA2 002 100 009PA3 008 009 100
20
overall residuals In the example of weighted least squares the output is shown by using theprint function with the cut option set to 0 That is all loadings are shown (Table 3)
The unweighted least squares solution may be shown graphically using the faplot functionwhich is called by the generic plot function (Figure 4 Factors were transformed obliquelyusing a oblimin These solutions may be shown as item by factor plots (Figure 4 or by astructure diagram (Figure 5
gt plot(f3t)
MR1
00 02 04 06 08
00
02
04
06
08
00
02
04
06
08
MR2
00 02 04 06 08
00 02 04 06 08
00
02
04
06
08
MR3
Factor Analysis
Figure 4 A graphic representation of the 3 oblique factors from the Thurstone data usingplot Factors were transformed to an oblique solution using the oblimin function from theGPArotation package
A comparison of these three approaches suggests that the minres solution is more similarto a maximum likelihood solution and fits slightly better than the pa or wls solutionsComparisons with SPSS suggest that the pa solution matches the SPSS OLS solution but
21
Table 3 The 9 variable problem from Thurstone is a classic example of factoring wherethere is a higher order factor g that accounts for the correlation between the factors Thefactors were extracted using a weighted least squares algorithm All loadings are shown byusing the cut=0 option in the printpsych functiongt f3w lt- fa(Thurstone3nobs = 213fm=wls)gt print(f3wcut=0digits=3)Factor Analysis using method = wlsCall fa(r = Thurstone nfactors = 3 nobs = 213 fm = wls)Standardized loadings (pattern matrix) based upon correlation matrix
WLS1 WLS2 WLS3 h2 u2 comSentences 0905 -0034 0040 0822 0178 101Vocabulary 0890 0066 -0031 0835 0165 101SentCompletion 0833 0034 0007 0735 0265 100FirstLetters -0002 0855 0003 0731 0269 1004LetterWords -0016 0743 0106 0629 0371 104Suffixes 0180 0626 -0082 0496 0504 120LetterSeries 0033 -0015 0838 0719 0281 100Pedigrees 0381 -0051 0464 0505 0495 195LetterGroup -0062 0209 0632 0527 0473 124
WLS1 WLS2 WLS3SS loadings 2647 1864 1488Proportion Var 0294 0207 0165Cumulative Var 0294 0501 0667Proportion Explained 0441 0311 0248Cumulative Proportion 0441 0752 1000
With factor correlations ofWLS1 WLS2 WLS3
WLS1 1000 0591 0535WLS2 0591 1000 0516WLS3 0535 0516 1000
Mean item complexity = 12Test of the hypothesis that 3 factors are sufficient
The degrees of freedom for the null model are 36 and the objective function was 5198 with Chi Square of 1081968The degrees of freedom for the model are 12 and the objective function was 0014
The root mean square of the residuals (RMSR) is 0006The df corrected root mean square of the residuals is 001
The harmonic number of observations is 213 with the empirical chi square 0531 with prob lt 1The total number of observations was 213 with MLE Chi Square = 2886 with prob lt 0996
Tucker Lewis Index of factoring reliability = 10264RMSEA index = 0 and the 90 confidence intervals are NA NABIC = -6145Fit based upon off diagonal values = 1Measures of factor score adequacy
WLS1 WLS2 WLS3Correlation of scores with factors 0964 0923 0902Multiple R square of scores with factors 0929 0853 0814Minimum correlation of possible factor scores 0858 0706 0627
22
gt fadiagram(f3t)
Factor Analysis
Sentences
Vocabulary
SentCompletion
FirstLetters
4LetterWords
Suffixes
LetterSeries
LetterGroup
Pedigrees
MR1
09
09
08
MR2
09
07
06
MR308
06
05
06
05
05
Figure 5 A graphic representation of the 3 oblique factors from the Thurstone data usingfadiagram Factors were transformed to an oblique solution using oblimin
23
that the minres solution is slightly better At least in one test data set the weighted leastsquares solutions although fitting equally well had slightly different structure loadingsNote that the rotations used by SPSS will sometimes use the ldquoKaiser Normalizationrdquo Bydefault the rotations used in psych do not normalize but this can be specified as an optionin fa
514 Principal Components analysis (PCA)
An alternative to factor analysis which is unfortunately frequently confused with factoranalysis is principal components analysis Although the goals of PCA and FA are similarPCA is a descriptive model of the data while FA is a structural model Psychologiststypically use PCA in a manner similar to factor analysis and thus the principal functionproduces output that is perhaps more understandable than that produced by princomp inthe stats package Table 4 shows a PCA of the Thurstone 9 variable problem rotated usingthe Promax function Note how the loadings from the factor model are similar but smallerthan the principal component loadings This is because the PCA model attempts to accountfor the entire variance of the correlation matrix while FA accounts for just the commonvariance This distinction becomes most important for small correlation matrices (Factorloadings and PCA loadings converge as the number of items increases Factor loadings maybe thought of as the asymptotic principal component loadings as the number of items onthe component grows towards infinity) Also note how the goodness of fit statistics basedupon the residual off diagonal elements is much worse than the fa solution
515 Hierarchical and bi-factor solutions
For a long time structural analysis of the ability domain have considered the problem offactors that are themselves correlated These correlations may themselves be factored toproduce a higher order general factor An alternative (Holzinger and Swineford 1937Jensen and Weng 1994) is to consider the general factor affecting each item and thento have group factors account for the residual variance Exploratory factor solutions toproduce a hierarchical or a bifactor solution are found using the omega function Thistechnique has more recently been applied to the personality domain to consider such thingsas the structure of neuroticism (treated as a general factor with lower order factors ofanxiety depression and aggression)
Consider the 9 Thurstone variables analyzed in the prior factor analyses The correlationsbetween the factors (as shown in Figure 5 can themselves be factored This results in ahigher order factor model (Figure 6) An an alternative solution is to take this higherorder model and then solve for the general factor loadings as well as the loadings on theresidualized lower order factors using the Schmid-Leiman procedure (Figure 7) Yet
24
Table 4 The Thurstone problem can also be analyzed using Principal Components Anal-ysis Compare this to Table 2 The loadings are higher for the PCA because the modelaccounts for the unique as well as the common varianceThe fit of the off diagonal elementshowever is much worse than the fa resultsgt p3p lt-principal(Thurstone3nobs = 213rotate=Promax)gt p3pPrincipal Components AnalysisCall principal(r = Thurstone nfactors = 3 rotate = Promax nobs = 213)Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 h2 u2 comSentences 092 001 001 086 014 10Vocabulary 090 010 -005 086 014 10SentCompletion 091 004 -004 083 017 10FirstLetters 001 084 007 078 022 104LetterWords -005 081 017 075 025 11Suffixes 018 079 -015 070 030 12LetterSeries 003 -003 088 078 022 10Pedigrees 045 -016 057 067 033 21LetterGroup -019 019 086 075 025 12
RC1 RC2 RC3SS loadings 283 219 196Proportion Var 031 024 022Cumulative Var 031 056 078Proportion Explained 041 031 028Cumulative Proportion 041 072 100
With component correlations ofRC1 RC2 RC3
RC1 100 051 053RC2 051 100 044RC3 053 044 100
Mean item complexity = 12Test of the hypothesis that 3 components are sufficient
The root mean square of the residuals (RMSR) is 006with the empirical chi square 5617 with prob lt 11e-07
Fit based upon off diagonal values = 098
25
another solution is to use structural equation modeling to directly solve for the general andgroup factors
gt omh lt- omega(Thurstonenobs=213sl=FALSE)gt op lt- par(mfrow=c(11))
Omega
Sentences
Vocabulary
SentCompletion
FirstLetters
4LetterWords
Suffixes
LetterSeries
LetterGroup
Pedigrees
F1
09
09
08
04
F2
09
07
06
02F3
08
06
05
g
08
08
07
Figure 6 A higher order factor solution to the Thurstone 9 variable problem
Yet another approach to the bifactor structure is do use the bifactor rotation function ineither psych or in GPArotation This does the rotation discussed in Jennrich and Bentler(2011) Unfortunately this rotation will frequently achieve a local rather than globaloptimum and it is recommended to try multiple original configurations
26
gt om lt- omega(Thurstonenobs=213)
Omega
Sentences
Vocabulary
SentCompletion
FirstLetters
4LetterWords
Suffixes
LetterSeries
LetterGroup
Pedigrees
F1
06
06
05
02
F2
06
05
04
F306
05
03
g
07
07
07
06
06
06
06
05
06
Figure 7 A bifactor factor solution to the Thurstone 9 variable problem
27
516 Item Cluster Analysis iclust
An alternative to factor or components analysis is cluster analysis The goal of clusteranalysis is the same as factor or components analysis (reduce the complexity of the dataand attempt to identify homogeneous subgroupings) Mainly used for clustering peopleor objects (eg projectile points if an anthropologist DNA if a biologist galaxies if anastronomer) clustering may be used for clustering items or tests as well Introduced topsychologists by Tryon (1939) in the 1930rsquos the cluster analytic literature exploded inthe 1970s and 1980s (Blashfield 1980 Blashfield and Aldenderfer 1988 Everitt 1974Hartigan 1975) Much of the research is in taxonmetric applications in biology (Sneathand Sokal 1973 Sokal and Sneath 1963) and marketing (Cooksey and Soutar 2006) whereclustering remains very popular It is also used for taxonomic work in forming clusters ofpeople in family (Henry et al 2005) and clinical psychology (Martinent and Ferrand 2007Mun et al 2008) Interestingly enough it has has had limited applications to psychometricsThis is unfortunate for as has been pointed out by eg (Tryon 1935 Loevinger et al 1953)the theory of factors while mathematically compelling offers little that the geneticist orbehaviorist or perhaps even non-specialist finds compelling Cooksey and Soutar (2006)reviews why the iclust algorithm is particularly appropriate for scale construction inmarketing
Hierarchical cluster analysis forms clusters that are nested within clusters The resultingtree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows thenesting structure Although there are many hierarchical clustering algorithms in R (egagnes hclust and iclust) the one most applicable to the problems of scale constructionis iclust (Revelle 1979)
1 Find the proximity (eg correlation) matrix
2 Identify the most similar pair of items
3 Combine this most similar pair of items to form a new variable (cluster)
4 Find the similarity of this cluster to all other items and clusters
5 Repeat steps 2 and 3 until some criterion is reached (eg typicallly if only one clusterremains or in iclust if there is a failure to increase reliability coefficients α or β )
6 Purify the solution by reassigning items to the most similar cluster center
iclust forms clusters of items using a hierarchical clustering algorithm until one of twomeasures of internal consistency fails to increase (Revelle 1979) The number of clustersmay be specified a priori or found empirically The resulting statistics include the averagesplit half reliability α (Cronbach 1951) as well as the worst split half reliability β (Revelle1979) which is an estimate of the general factor saturation of the resulting scale (Figure 8)
28
Cluster loadings (corresponding to the structure matrix of factor analysis) are reportedwhen printing (Table 7) The pattern matrix is available as an object in the results
gt data(bfi)gt ic lt- iclust(bfi[125])
ICLUST
C20α = 081β = 063
C19α = 076β = 064
063
C11α = 072β = 069 06
C6077
E4 072E2 minus072
E1 minus077
C12α = 068β = 064
079C4073
O3 063O1 063
C909E5 063
E3 063
C18α = 071β = 05
069
C17α = 072β = 061
054
A4 07C10
α = 072β = 068
078A2 077
C5091A5 071
A3 071
A1 minus061
C16α = 081β = 076
C13α = 071β = 065
086
N5 075
C8086N4 072
N3 072
C3076N2 084
N1 084
C15α = 073β = 067
C14α = 063β = 058
077
C3 07
C1084C2 065
C1 065
C2minus079C5 069
C4 069
C21α = 041β = 027
C7032
O2 057O5 057
O4 minus042
Figure 8 Using the iclust function to find the cluster structure of 25 personality items(the three demographic variables were excluded from this analysis) When analyzing manyvariables the tree structure may be seen more clearly if the graphic output is saved as apdf and then enlarged using a pdf viewer
The previous analysis (Figure 8) was done using the Pearson correlation A somewhatcleaner structure is obtained when using the polychoric function to find polychoric corre-lations (Figure 9) Note that the first time finding the polychoric correlations some timebut the next three analyses were done using that correlation matrix (rpoly$rho) Whenusing the console for input polychoric will report on its progress while working using
29
Table 5 The summary statistics from an iclust analysis shows three large clusters andsmaller clustergt summary(ic) show the resultsICLUST (Item Cluster Analysis)Call iclust(rmat = bfi[125])ICLUST
Purified AlphaC20 C16 C15 C21080 081 073 061
Guttman Lambda6C20 C16 C15 C21082 081 072 061
Original BetaC20 C16 C15 C21063 076 067 027
Cluster sizeC20 C16 C15 C2110 5 5 5
Purified scale intercorrelationsreliabilities on diagonalcorrelations corrected for attenuation above diagonal
C20 C16 C15 C21C20 080 -0291 040 -033C16 -024 0815 -029 011C15 030 -0221 073 -030C21 -023 0074 -020 061
30
progressBar polychoric and tetrachoric when running on OSX will take advantageof the multiple cores on the machine and run somewhat faster
Table 6 The polychoric and the tetrachoric functions can take a long time to finishand report their progress by a series of dots as they work The dots are suppressed whencreating a Sweave documentgt data(bfi)gt rpoly lt- polychoric(bfi[125]) the indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage ofclustering techniques The problem is that the solutions differ The advantage is that thestructure of the items may be seen more clearly when examining the clusters rather thana simple factor solution
52 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statisticalinformation on the solutions Overall estimates of goodness of fit including χ2 and RMSEAare found in the fa and omega functions Confidence intervals for the factor loadings maybe found by doing multiple bootstrapped iterations of the original analysis This is doneby setting the niter parameter to the desired number of iterations This can be done forfactoring of Pearson correlation matrices as well as polychorictetrachoric matrices (SeeTable 8) Although the example value for the number of iterations is set to 20 moreconventional analyses might use 1000 bootstraps This will take much longer
53 Comparing factorcomponentcluster solutions
Cluster analysis factor analysis and principal components analysis all produce structurematrices (matrices of correlations between the dimensions and the variables) that can inturn be compared in terms of the congruence coefficient which is just the cosine of theangle between the dimensions
c fi f j =sum
nk=1 fik f jk
sum f 2ik sum f 2
jk
Consider the case of a four factor solution and four cluster solution to the Big Five prob-lem
gt f4 lt- fa(bfi[125]4fm=pa)gt factorcongruence(f4ic)
C20 C16 C15 C21PA1 092 -032 044 -040PA2 -026 095 -033 012
31
gt icpoly lt- iclust(rpoly$rhotitle=ICLUST using polychoric correlations)gt iclustdiagram(icpoly)
ICLUST using polychoric correlations
C23α = 083β = 058
C22α = 081β = 029
06
C21α = 048β = 035031
C704
O5 063O2 063
O4 minus052
C20α = 083β = 066
031
C18α = 076β = 056
063
A1 065C17
α = 077β = 065
minus068A4 073C10
α = 077β = 073
079A2 081
C3092A5 076
A3 076
C19α = 08β = 066
minus072C12
α = 072β = 068
07
C2075
O3 067O1 067
C909E5 066
E3 066
C11α = 077β = 073
minus068E1 08
C5minus091E4 076
E2 minus076
C15α = 077β = 071
minus049C14α = 067β = 06108
C4069
C2 07C1 07
C3 072
C1minus081C5 073
C4 073
C16α = 084β = 079
C13α = 074β = 069
088
N5 078
C8086N4 075
N3 075
C6077N2 087
N1 087
Figure 9 ICLUST of the BFI data set using polychoric correlations Compare this solutionto the previous one (Figure 8) which was done using Pearson correlations
32
gt icpoly lt- iclust(rpoly$rho5title=ICLUST using polychoric correlations for nclusters=5)gt iclustdiagram(icpoly)
ICLUST using polychoric correlations for nclusters=5
C20α = 083β = 066
C18α = 076β = 056
063
A1 065C17
α = 077β = 065
minus068A4 073C10
α = 077β = 073
079A2 081
C3092A5 076
A3 076
C19α = 08β = 066
minus072C12
α = 072β = 068
07
C2075
O3 067O1 067
C909E5 066
E3 066
C11α = 077β = 073
minus068E1 08
C5minus091E4 076
E2 minus076
C16α = 084β = 079
C13α = 074β = 069
088
N5 078
C8086N4 075
N3 075
C6077N2 087
N1 087
C15α = 077β = 071
C14α = 067β = 06108
C4069
C2 07C1 07
C3 072
C1minus081C5 073
C4 073
O4
C7O5 063O2 063
Figure 10 ICLUST of the BFI data set using polychoric correlations with the solutionset to 5 clusters Compare this solution to the previous one (Figure 9) which was donewithout specifying the number of clusters and to the next one (Figure 11) which was doneby changing the beta criterion
33
gt icpoly lt- iclust(rpoly$rhobetasize=3title=ICLUST betasize=3)
ICLUST betasize=3
C23α = 083β = 058
C22α = 081β = 029
06
C21α = 048β = 035031
C704
O5 063O2 063
O4 minus052
C20α = 083β = 066
031
C18α = 076β = 056
063
A1 065C17
α = 077β = 065
minus068A4 073C10
α = 077β = 073
079A2 081
C3092A5 076
A3 076
C19α = 08β = 066
minus072C12
α = 072β = 068
07
C2075
O3 067O1 067
C909E5 066
E3 066
C11α = 077β = 073
minus068E1 08
C5minus091E4 076
E2 minus076
C15α = 077β = 071
minus049C14α = 067β = 06108
C4069
C2 07C1 07
C3 072
C1minus081C5 073
C4 073
C16α = 084β = 079
C13α = 074β = 069
088
N5 078
C8086N4 075
N3 075
C6077N2 087
N1 087
Figure 11 ICLUST of the BFI data set using polychoric correlations with the beta criterionset to 3 Compare this solution to the previous three (Figure 8 9 10)
34
Table 7 The output from iclustincludes the loadings of each item on each cluster Theseare equivalent to factor structure loadings By specifying the value of cut small loadingsare suppressed The default is for cut=0sugt print(iccut=3)ICLUST (Item Cluster Analysis)Call iclust(rmat = bfi[125])
Purified AlphaC20 C16 C15 C21080 081 073 061
G6 reliabilityC20 C16 C15 C21083 100 067 038
Original BetaC20 C16 C15 C21063 076 067 027
Cluster sizeC20 C16 C15 C2110 5 5 5
Item by Cluster Structure matrixO P C20 C16 C15 C21
A1 C20 C20A2 C20 C20 059A3 C20 C20 065A4 C20 C20 043A5 C20 C20 065C1 C15 C15 054C2 C15 C15 062C3 C15 C15 054C4 C15 C15 031 -066C5 C15 C15 -030 036 -059E1 C20 C20 -050E2 C20 C20 -061 034E3 C20 C20 059 -039E4 C20 C20 066E5 C20 C20 050 040 -032N1 C16 C16 076N2 C16 C16 075N3 C16 C16 074N4 C16 C16 -034 062N5 C16 C16 055O1 C20 C21 -053O2 C21 C21 044O3 C20 C21 039 -062O4 C21 C21 -033O5 C21 C21 053
With eigenvalues ofC20 C16 C15 C2132 26 19 15
Purified scale intercorrelationsreliabilities on diagonalcorrelations corrected for attenuation above diagonal
C20 C16 C15 C21C20 080 -029 040 -033C16 -024 081 -029 011C15 030 -022 073 -030C21 -023 007 -020 061
Cluster fit = 068 Pattern fit = 096 RMSR = 005NULL
35
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))
     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include
1) Extracting factors until the chi square of the residual matrix is not significant
2) Extracting factors until the change in chi square from factor n to factor n+1 is notsignificant
3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis, fa.parallel) (Horn, 1965).
4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).
5) Extracting factors as long as they are interpretable
6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).
7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).
8) Extracting principal components until the eigen value < 1.
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
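The parallel analysis idea (criterion 3 above) can be sketched in a few lines of base R: compare the eigen values of the observed correlation matrix against the average eigen values of random data of the same dimensions. This is only an illustration on simulated data with assumed structure; fa.parallel does this properly (including resampling of the real data and factor, not just component, eigen values).

```
# Minimal base-R sketch of parallel analysis (illustration only;
# use fa.parallel for real work).
set.seed(42)
n <- 500; p <- 9
real <- matrix(rnorm(n * p), n, p)
real[, 1:4] <- real[, 1:4] + rnorm(n)   # build in one common factor
real[, 5:9] <- real[, 5:9] + rnorm(n)   # and a second one
obs.eigen <- eigen(cor(real))$values
# average eigen values of 20 random data sets of the same size
rand.eigen <- rowMeans(replicate(20,
  eigen(cor(matrix(rnorm(n * p), n, p)))$values))
n.factors <- sum(obs.eigen > rand.eigen)  # number of components to retain
```

Here the retained number is simply the count of observed eigen values that exceed their random counterparts.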
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Average Partial correlation) of Velicer (1976).
Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25], title="Very Simple Structure of a Big 5 inventory")
Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses SMCs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using SMCs will tend to lead to too many factors.
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25], main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6
Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation, found by setCor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha$$
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
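Because α is just a function of the item covariances, it can be computed directly from a covariance (or correlation) matrix following the λ3 formula above. A minimal base-R sketch, using a hypothetical 4-item correlation matrix with all intercorrelations equal to .3 (the alpha function reports this along with standard errors, λ6, and item statistics):

```
# Raw coefficient alpha from an item covariance (or correlation) matrix,
# following lambda_3 = n/(n-1) * (1 - tr(V_x)/V_x)
alpha.from.cov <- function(C) {
  n <- ncol(C)
  (n / (n - 1)) * (1 - sum(diag(C)) / sum(C))
}
R <- matrix(0.3, 4, 4)   # 4 hypothetical items, all r = .3
diag(R) <- 1
alpha.from.cov(R)        # about 0.63
```

Supplying a correlation matrix rather than a covariance matrix yields standardized alpha.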
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation, or smc), or more precisely, the variance of the errors, e_j^2, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}$$
The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
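The same logic can be sketched in base R: the smc of each item is 1 − 1/R⁻¹[j,j], and λ6 follows from the formula above (the smc and splitHalf functions in psych do this for you). For the same hypothetical equal-loading 4-item matrix used earlier, λ6 (0.56) comes out below α (0.63), as expected for equal item loadings:

```
# Guttman's lambda_6 from a correlation matrix, via the squared
# multiple correlation of each item with the remaining items
lambda6 <- function(R) {
  smc <- 1 - 1 / diag(solve(R))   # smc_j = 1 - 1/R^{-1}[j,j]
  1 - sum(1 - smc) / sum(R)       # V_x = sum of all elements of R
}
R <- matrix(0.3, 4, 4)            # hypothetical items, all r = .3
diag(R) <- 1
lambda6(R)                        # 0.5625
```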
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[,s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe=fe)
Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include
alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.
guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lower bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).
omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)
cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and an n x n correlation matrix, find the correlations of the composite clusters.
scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1) and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.
score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds
total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as when each item is dropped one at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n=500, raw=TRUE)$observed
> round(cor(r9), 2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1, -1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
0.18 0.43 0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1, 1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500, short=FALSE, low=-2, high=2, categorical=TRUE)  #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures
Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02
Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013), based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.
There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be √r_ab, where r_ab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness u^2 from factor analysis to find e_j^2. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting

$$x = cg + Af + Ds + e$$

then the communality of item j, based upon general as well as group factors,

$$h_j^2 = c_j^2 + \sum f_{ij}^2$$

and the unique variance for the item,

$$u_j^2 = \sigma_j^2 (1 - h_j^2),$$

may be used to estimate the test reliability. That is, if h_j^2 is the communality of item j, based upon general as well as group factors, then for standardized items, e_j^2 = 1 - h_j^2 and

$$\omega_t = \frac{\mathbf{1} cc' \mathbf{1}' + \mathbf{1} AA' \mathbf{1}'}{V_x} = 1 - \frac{\sum (1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$

Because h_j^2 ≥ r_smc^2, ωt ≥ λ6.
It is important to distinguish here between the two ω coefficients of McDonald (1978) and Equation 6.20a of McDonald (1999), ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.
$$\omega_h = \frac{\mathbf{1} cc' \mathbf{1}'}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{\inf} = \frac{\mathbf{1} cc' \mathbf{1}'}{\mathbf{1} cc' \mathbf{1}' + \mathbf{1} AA' \mathbf{1}'}$$
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
> om9 <- omega(r9, title="9 simulated variables")
Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index = 0.026 and the 90% confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90% confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9, n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total:           0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.026 and the 90% confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90% confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g   F1   F2   F3   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.
> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
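As a small numeric illustration of reverse scoring (the values here are made up, not from the bfi):

```r
# Reverse score a 1-6 Likert item: subtract each response from max + min = 7
item <- c(1, 2, 5, 6)
reversed <- (6 + 1) - item
reversed   # 6 5 2 1
```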
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copied into R.) Although not normally necessary, show the keys to understand what is happening.
Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, including those for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made, and then these can be combined with the keys for the particular scales.
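A minimal sketch of this idea, reusing the keys.25 matrix from above (the demographic keying itself is only illustrative):

```r
# Make a keys matrix for the three demographic variables and append it
# to the personality keys with superMatrix
demo.keys <- make.keys(3, list(gender = 1, education = 2, age = 3),
                       item.labels = c("gender", "education", "age"))
all.keys <- superMatrix(list(keys.25, demo.keys))
```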
Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores
Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.
To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores element of scores. (See Figure 16.)
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.
Consider the same bfi data set, but first find the correlations and then use cluster.cor:
> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)
Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:

              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix), or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
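A minimal sketch, assuming the keys and r.bfi objects created above (the element names here are assumed from the cluster.loadings output; check its help page):

```r
# Item by scale correlations for all items and all scales
item.loads <- cluster.loadings(keys, r.bfi)
item.loads$loadings   # the "structure" matrix
item.loads$pattern    # the "pattern" matrix
```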
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2
Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)
Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25
item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results
          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
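A hypothetical sketch of that step (the four content scales here are illustrative, not a recommended keying):

```r
# Score the 16 true/false items (iq.tf from above) into four content scales
tf.keys <- make.keys(16, list(reasoning = 1:4, letters = 5:8,
                              matrices = 9:12, rotation = 13:16),
                     item.labels = colnames(iq.tf))
tf.scales <- scoreItems(tf.keys, iq.tf)
```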
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data, or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
The use of Item Response Theory has become so pervasive that it is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)
[Figure 17 shows four panels: the probability of each response alternative as a function of theta for items reason.4, reason.16, reason.17, and reason.19.]
Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are of course just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:
δj = Dτj / √(1 − λj²)        αj = λj / √(1 − λj²)        (2)
where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just
λj = αj / √(1 + αj²)        τj = δj / √(1 + αj²)
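A worked example of these conversions for the normal ogive case (D = 1); the numbers are purely illustrative:

```r
lambda <- 0.6                          # factor loading for item j
tau    <- 0.5                          # threshold from tetrachoric()
alpha  <- lambda/sqrt(1 - lambda^2)    # discrimination: 0.75
delta  <- tau/sqrt(1 - lambda^2)       # difficulty: 0.625
# converting back recovers the factor model parameters
alpha/sqrt(1 + alpha^2)                # 0.6 = lambda
delta/sqrt(1 + alpha^2)                # 0.5 = tau
```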
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty:
> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)
> test
Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor =  1
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07
Similar analyses can be done for polytomous items such as those of the bfi extraversionscale
> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor =  1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04
The df corrected root mean square of the residuals is  0.06

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23
The item information functions show that not all of the items are equally good (Figure 19):
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))
[Figure 18 shows three panels for items V1-V9: item characteristic curves ("Item parameters from factor analysis"), item information curves, and the test information curve with its reliability scale, each plotted against the latent trait on the normal scale.]
Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt,type="IIC")
[Figure 19 shows the item information curves for items E1-E5 (E1 and E2 reversed) as a function of the latent trait on the normal scale.]
Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info,sort=TRUE)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor =  1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
More extensive IRT packages include ltm and eRm; they should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20), and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 21).
> iq.irt
Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor =  1
              -3    -2    -1     0     1     2     3
reason.4    0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16   0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17   0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19   0.05  0.17  0.41  0.46  0.22  0.07  0.02
> iq.irt <- irt.fa(iq.tf)
[Figure 20 shows the item information curves for the 16 ability items (the reason, letter, matrix, and rotate items) as a function of the latent trait on the normal scale.]
Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7    0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33   0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34   0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58   0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45   0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46   0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47   0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55   0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3    0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4    0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6    0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8    0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info   0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM         1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49  0.80  0.85  0.82  0.61 -0.40
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.85
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0

The root mean square of the residuals (RMSA) is  0.08
The df corrected root mean square of the residuals is  0.09

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.
Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:
> v9 <- sim.irt(9,1000,-2.,2.,mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho,4)
[Figure 21 shows the omega path diagram: a general factor g loading on all 16 ability items, with four group factors (F1: the rotate items; F2: the letter items; F3: the reason and matrix.47 items; F4: matrix.45 and matrix.46) and their loadings.]
Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")
These results are seen in Figure 22
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.
This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:
rxy = ηxwg ∗ ηywg ∗ rxywg + ηxbg ∗ ηybg ∗ rxybg        (3)
Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported, and then the IRT and total scoring systems.
> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)
[Figure 22 shows the pairs.panels scatter plot matrix comparing True theta, irt theta, total, fit, rasch, total, and fit, under the title "Comparing true theta for IRT, Rasch and classically based scoring".]
where rxy is the normal correlation, which may be decomposed into the within group and between group correlations rxywg and rxybg, and η (eta) is the correlation of the data with the within group values or the group means.
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, and -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.
The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)), or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
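A minimal sketch of the first of these (the element names are assumed from the statsBy output; see its help page):

```r
# Ability and SAT scores grouped by level of education
sb <- statsBy(sat.act, group = "education")
round(sb$rwg, 2)   # pooled within-group correlations
round(sb$rbg, 2)   # correlations of the group means
```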
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
R² = 1 − ∏ᵢ₌₁ⁿ (1 − λᵢ)
where λᵢ is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = Rxx⁻¹ Rxy Ryy⁻¹ Ryx
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.
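The formula above can be sketched directly in a few lines (an illustration of the algebra, not the setCor implementation):

```r
# Set correlation of the demographics (x) with the test scores (y)
# using the built-in sat.act data (columns: gender, education, age,
# ACT, SATV, SATQ)
R  <- cor(sat.act, use = "pairwise")
x  <- 1:3
y  <- 4:6
M  <- solve(R[x, x]) %*% R[x, y] %*% solve(R[y, y]) %*% t(R[x, y])
R2.set <- 1 - prod(1 - eigen(M)$values)   # Cohen's set correlation R2
```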
setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.
Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.
For this example, the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)
Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272, Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476
Compare this with the output from setcor
> #compare with mat.regress
> setCor(c(4:6),c(1:3),C, n.obs=700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008

Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohens Set Correlation R2  =  0.09
 Shrunken Set Correlation R2  =  0.08
 F and df of Cohens Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets =  0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, and sim.hierarchical to create a hierarchical factor model. sim.item is a more general item simulation, sim.minor simulates major and minor factors, sim.omega tests various examples of omega, sim.parallel compares the efficiency of various ways of determining the number of factors, and sim.rasch creates simulated Rasch data. sim.irt creates general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT). sim.structural is a general simulation of structural models, sim.anova is for ANOVA and lm simulations, and sim.VSS rounds out the set. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.
sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette, psych for sem.

The default for sim.structure is to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
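As a quick illustration (plain base R, not one of the sim functions), the model-implied simplex correlation matrix, in which variables n steps apart correlate r^n, can be built directly:

```r
# Model-implied simplex structure: variables n steps apart correlate r^n
# (here r = .8 and six variables)
r <- 0.8
n.var <- 6
R.simplex <- r^abs(outer(1:n.var, 1:n.var, "-"))
round(R.simplex[1, ], 2)  # 1.00 0.80 0.64 0.51 0.41 0.33
```

Each variable correlates .8 with its neighbor, .64 with variables two steps away, and so on, which is what makes a single latent r masquerade as many factors.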
Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
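A hedged base-R sketch of the idea (not the sim.minor code itself): one big factor on which all items load, plus a small factor shared by just two items, inflates one correlation above what the major factor alone implies.

```r
# One major factor (all items load .6) plus one minor factor (items 1 and 2
# load .3) -- the model-implied correlation matrix, not sampled data
major <- rep(0.6, 6)
minor <- c(0.3, 0.3, 0, 0, 0, 0)
R <- major %o% major + minor %o% minor
diag(R) <- 1
R[1, 2]  # 0.45: .36 from the major factor plus .09 from the minor factor
R[3, 4]  # 0.36: the major factor alone
```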
Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is
P(x | θi, δj, γj, ζj) = γj + (ζj − γj) / (1 + e^(αj(δj − θi)))        (4)
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), αj is item discrimination, and δj is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
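Equation 4 is easy to code directly. The following base-R sketch (the function name irt.4pl is hypothetical, not a psych function) may help in checking one's understanding of the parameters:

```r
# Four parameter logistic item response function, per equation (4)
irt.4pl <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1) {
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
}

# Rasch case (gamma = 0, zeta = 1, alpha = 1): when ability theta equals
# item difficulty delta, the probability of passing is exactly .5
irt.4pl(theta = 1, delta = 1)                # 0.5
irt.4pl(theta = 3, delta = 0, gamma = 0.25)  # guessing raises the lower bound
```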
(Graphics of these may be seen in the demonstrations for the logistic function)
The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter set to 1.702 in the logistic model, the two models are practically identical.
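This classic equivalence can be checked in a line or two of base R:

```r
# The logistic curve with a = 1.702 tracks the normal ogive to within .01
theta <- seq(-3, 3, 0.01)
max(abs(plogis(1.702 * theta) - pnorm(theta)))  # about 0.0095
```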
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as by specific diagram functions, e.g. (but not shown):
f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)     #a pretty boring plot
iclust.diagram(c)   #a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)    #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels ="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels ="left to right")
> dia.curve(ll$top,ul$bottom,"double")  #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")          #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean and harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail: combines the head and tail functions to show the first and last lines of a data set or output, but does not add an ellipsis between them.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates the effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems.
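A few of these helpers are simple enough to sketch in base R. The sketches below use hypothetical names (they are not the psych implementations) and show the underlying formulas: Fisher's r-to-z, the geometric and harmonic means, the Thorndike Case 2 restriction-of-range correction (assumed here to be the formula rangeCorrection applies), and the super matrix construction.

```r
# Fisher r-to-z: z = 0.5 * log((1 + r)/(1 - r)), identical to atanh(r)
fisher.z.sketch <- function(r) 0.5 * log((1 + r) / (1 - r))

# Geometric and harmonic means
geometric.mean.sketch <- function(x) exp(mean(log(x)))
harmonic.mean.sketch  <- function(x) 1 / mean(1 / x)

# Thorndike Case 2 correction for restriction of range (an assumption):
# r.c = r*k / sqrt(1 - r^2 + r^2 * k^2), with k = sd(unrestricted)/sd(restricted)
range.correction.sketch <- function(r, sd.u, sd.r) {
  k <- sd.u / sd.r
  r * k / sqrt(1 - r^2 + r^2 * k^2)
}

# Super matrix: A on the top left, B on the lower right, zeros elsewhere
super.matrix.sketch <- function(A, B) {
  rbind(cbind(A, matrix(0, nrow(A), ncol(B))),
        cbind(matrix(0, nrow(B), ncol(A)), B))
}

fisher.z.sketch(0.5)                 # about 0.549, same as atanh(0.5)
geometric.mean.sketch(c(2, 8))       # 4
harmonic.mean.sketch(c(2, 6))        # 3
range.correction.sketch(0.5, 2, 1)   # about 0.76: corrected upward
dim(super.matrix.sketch(diag(2), diag(3)))  # 5 5
```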
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data representing five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems) are also included. The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger and Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative, and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
14 Development version and a users guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.2.8",package="psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.
16 SessionInfo
This document was prepared using the following settings
gt sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33   matrixcalc_1.0-3  MASS_7.3-45   grid_3.3.0    arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0      coda_0.18-1       minqa_1.2.4   mi_1.0        nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18       splines_3.3.0     lme4_1.1-12   sem_3.1-7     tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1  parallel_3.3.0  mnormt_1.5-4
References
Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors." Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.
2 Overview of this and related documents
The psych package (Revelle, 2017) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, in prep), a draft of which is available at http://personality-project.org/r/book.

Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP) or parallel analysis (fa.parallel) criteria. Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa. Bifactor and hierarchical factor structures may be estimated by using Schmid Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust) and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945) as well as additional estimates (Revelle and Zinbarg, 2009) are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (cor.plot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, and structure.diagram, as well as item response characteristics and item and test information characteristic curves using plot.irt and plot.poly.
3 Getting started
Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.
4 Basic data analysis
A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions it is necessary to make the package active by using the library command:
R code
library(psych)
The other packages, once installed, will be called automatically by psych.
It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

R code
.First <- function(x) {library(psych)}
4.1 Getting the data by using read.file

Although many find copying the data to the clipboard and then using the read.clipboard functions (see below) helpful, an alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. For files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt, or .sav, the file will be read correctly.

R code
my.data <- read.file()
If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command.

R code
my.data <- read.file(widths = c(4,rep(1,35)))  #will read in a file without a header row and 36 fields, the first of which is 4 columns, the rest of which are 1 column each
If the file is a RData file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata) the object will be loaded. Depending what was stored, this might be several objects. If the file is a .sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (.xpt or .XPT) it will be read. .csv files (comma separated files), normal .txt or .text files, and .data or .dat files will be read as well. These are assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).
To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read the file again, converting the value labels to numeric values.

R code
my.spss <- read.file(use.value.labels=TRUE)  #this will keep the value labels for .sav files
my.data <- read.file()  #just read the file and convert the value labels to numeric
You can quickView the my.spss object to see what it looks like, as well as the my.data object to see the numeric encoding.

R code
quickView(my.spss)  #to see the original encoding
quickView(my.data)  #to see the numeric encoding
4.2 Data input from the clipboard

There are, of course, many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:
read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).
For example, given a data set copied to the clipboard from a spreadsheet, just enter the command
R code
my.data <- read.clipboard()
This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.
R code
my.data <- read.clipboard(sep="\t")   #define the tab option, or
my.tab.data <- read.clipboard.tab()   #just use the alternative function
For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns wide, the second is 2 columns, the next 5 are 1 column each, and the last 4 are 3 columns each).
R code
my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))
4.3 Basic descriptive statistics
Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.
describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.
It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data cannot be overemphasized.
> library(psych)
> data(sat.act)
> describe(sat.act)  #basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41
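Descriptive statistics may also be found separately for each group. A minimal sketch using describeBy, the grouping variant of describe in psych (the grouping variable here is gender, coded 1 and 2 in sat.act):

```r
library(psych)
data(sat.act)
# one describe() table per level of gender (1 = male, 2 = female)
describeBy(sat.act, sat.act$gender)
```

This is often a quick way to spot groups with very different means or variances before doing between-group comparisons.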
4.4 Simple descriptive graphics
Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMs) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
4.4.1 Scatter Plot Matrices
Scatter Plot Matrices (SPLOMs) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 1 for an example.)
pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.
4.4.2 Correlational structure
There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use="pairwise", method="pearson" as default values, returns (invisibly) the full correlation matrix, and displays the lower off diagonal matrix.
> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00
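If significance tests for these correlations are wanted, the corr.test function in psych reports the correlations together with the pairwise sample sizes and p values (with p values adjusted for multiple tests above the diagonal). A minimal sketch:

```r
library(psych)
data(sat.act)
# correlations, sample sizes, and probability values;
# pairwise deletion handles the 13 missing SATQ scores
ct <- corr.test(sat.act)
print(ct, short=FALSE)  #short=FALSE also shows the confidence intervals
```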
When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:
> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])
> png('pairspanels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1
Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower, upper)
> round(both, 2)
          education   age  ACT  SATV SATQ
education        NA  0.52 0.16  0.07 0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53 0.58
SATV           0.02 -0.06 0.61    NA 0.63
SATQ           0.08  0.04 0.60  0.68   NA
It is also possible to compare two matrices by taking their differences, displaying one below the diagonal and the difference of the second from the first above the diagonal:
> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)
          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA
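Whether any one of these differences is significant may be tested with the r.test function in psych, which (among other tests) compares two independent correlations given the two sample sizes. A minimal sketch, using the male and female education-age correlations of 0.61 and 0.52 from the matrices above (the subsets are rebuilt here so the snippet is self-contained):

```r
library(psych)
data(sat.act)
male <- subset(sat.act, sat.act$gender==1)
female <- subset(sat.act, sat.act$gender==2)
# test whether two independent correlations differ:
# r12 from the first sample (n), r34 from the second sample (n2)
r.test(n=nrow(male), r12=0.61, n2=nrow(female), r34=0.52)
```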
4.4.3 Heatmap displays of correlational structure
Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> cor.plot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1
Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ, main="24 variables in a circumplex")
> dev.off()
null device
          1
Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, φ will underestimate the value of the Pearson correlation applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.
If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigenvalues of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
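A minimal sketch of this smoothing, applied to the burt data set; inspecting the smallest eigenvalue before and after shows the effect:

```r
library(psych)
data(burt)
min(eigen(burt)$values)        # the smallest eigenvalue of burt is negative
burt.smoothed <- cor.smooth(burt)  # adjust the eigenvalues and rescale
min(eigen(burt.smoothed)$values)   # the smoothed matrix has only positive eigenvalues
```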
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum, commonly attributed to William of Ockham, to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)
¹ Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.
The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.
At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
Several of the functions in psych address the problem of data reduction:
fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.
principal Principal Components Analysis reports the largest n eigenvectors rescaled by the square root of their eigenvalues.
factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

² Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigenvalues of a correlation matrix with those from random data.
fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.
nfactors A number of different tests for the number of factors problem are run.
fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.
fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.
iclust is meant to do item cluster analysis using a hierarchical clustering algorithm, specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fails to increase.
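Several of these number-of-factors tools can be tried on the same data set. A minimal sketch, using the 25 bfi personality items that are analyzed later in this section:

```r
library(psych)
data(bfi)
fa.parallel(bfi[1:25])  # compare observed eigenvalues with those from random data
vss(bfi[1:25])          # Very Simple Structure fits, including the MAP criterion
nfactors(bfi[1:25])     # a battery of number-of-factors tests in one call
```

Since the various criteria frequently disagree, it is worth examining all of them before settling on a number of factors.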
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix ₙRₙ be approximated by the product of a factor matrix ₙFₖ and its transpose plus a diagonal matrix of uniquenesses?

R = FF′ + U²   (1)
The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function, specifying the appropriate method.
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigenvalues of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
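The rotation is chosen with the rotate argument to fa. A minimal sketch contrasting the default oblimin transformation with an orthogonal varimax rotation on the same data:

```r
library(psych)
f3.oblimin <- fa(Thurstone, 3, n.obs=213)                    # default: rotate="oblimin"
f3.varimax <- fa(Thurstone, 3, n.obs=213, rotate="varimax")  # orthogonal alternative
f3.oblimin$Phi   # inter-factor correlations from the oblique transformation
f3.varimax$Phi   # NULL: orthogonal factors are uncorrelated by construction
```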
5.1.2 Principal Axis Factor Analysis
An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigenvalue decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
The solutions from the fa, the factor.minres, and the factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax and quartimax. Oblique transformations include oblimin and quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.
Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix

                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).
The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)
[Figure 4 image: a SPLOM of the factor loadings, with panels labeled MR1, MR2, and MR3 and the title "Factor Analysis"]
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone, 3, n.obs = 213, fm="wls")
> print(f3w, cut=0, digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)
[Figure 5 image: a path diagram titled "Factor Analysis" showing the 9 Thurstone variables loading on the three oblique factors MR1, MR2, and MR3, with the correlations among the factors]
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
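The similarity of the solutions may be checked directly with factor.congruence. A minimal sketch comparing the minres and principal axis extractions of the Thurstone problem:

```r
library(psych)
f3.min <- fa(Thurstone, 3, n.obs=213, fm="minres")
f3.pa  <- fa(Thurstone, 3, n.obs=213, fm="pa")
# cosines between the two sets of loadings;
# values near 1 on the diagonal indicate nearly identical factors
factor.congruence(f3.min, f3.pa)
```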
5.1.4 Principal Components analysis (PCA)
An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar to, but smaller than, the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases: factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors, using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than for the fa results.
> p3p <- principal(Thurstone, 3, n.obs = 213, rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general and group factors.
> om.h <- omega(Thurstone, n.obs=213, sl=FALSE)
> op <- par(mfrow=c(1,1))
[Figure 6 image: a hierarchical diagram titled "Omega" showing the 9 Thurstone variables loading on group factors F1, F2, and F3, which in turn load on the general factor g]
Figure 6: A higher order factor solution to the Thurstone 9 variable problem.
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than a global optimum, and it is recommended to try multiple original configurations.
> om <- omega(Thurstone, n.obs=213)
[Figure 7 image: a bifactor diagram titled "Omega" showing the 9 Thurstone variables loading on the general factor g as well as on the residualized group factors F1, F2, and F3]
Figure 7: A bifactor solution to the Thurstone 9 variable problem.
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])
Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:

      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (See Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
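Conceptually, what setting n.iter does can be sketched in base R for a single correlation (a toy illustration of the bootstrap idea, not the psych internals): resample rows with replacement, re-estimate, and take empirical quantiles of the replicated estimates.

```r
set.seed(17)
x <- matrix(rnorm(200), ncol = 2)
x[, 2] <- x[, 2] + x[, 1]                 # induce a correlation between the columns

boot.r <- replicate(20, {                 # 20 iterations, as in Table 8
  i <- sample(nrow(x), replace = TRUE)    # resample subjects with replacement
  cor(x[i, 1], x[i, 2])                   # re-estimate on the bootstrap sample
})
ci <- quantile(boot.r, c(.025, .975))     # empirical 95% confidence interval
```

With loadings rather than a single correlation, the same resample-and-refit logic applies, which is why 1000 iterations take correspondingly longer.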
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}
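A direct base-R transcription of this coefficient (a hypothetical helper; the factor.congruence function handles whole loading matrices at once):

```r
# Congruence (cosine) of two loading vectors fi and fj
congruence <- function(fi, fj) sum(fi * fj) / sqrt(sum(fi^2) * sum(fj^2))

f1 <- c(.8, .7, .6, .1)
congruence(f1, f1)     # a vector is perfectly congruent with itself: 1
congruence(f1, -f1)    # reflecting a factor reverses the sign: -1
```

Unlike a correlation, the congruence coefficient does not center the loadings, so it is sensitive to the overall level of the loadings as well as their pattern.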
Consider the case of a four factor solution and four cluster solution to the Big Five problem:
> f4 <- fa(bfi[1:25], 4, fm="pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(r.poly$rho, title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)
Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(r.poly$rho, 5, title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)
Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(r.poly$rho, beta.size=3, title="ICLUST beta.size=3")
Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, and 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic, cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:

      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (See Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor, and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))

     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2   RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99  0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01  0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58  0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56  0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06  0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05  1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include:
1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).
4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).
5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
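The parallel analysis idea in point 3 can be sketched in base R (a toy illustration only; fa.parallel does this properly, comparing both factor and component eigenvalues and also resampling the real data):

```r
set.seed(42)
n <- 500; p <- 9
real <- matrix(rnorm(n * p), n, p)
real[, 1:3] <- real[, 1:3] + rnorm(n)   # one common factor shared by 3 items

ev.real <- eigen(cor(real))$values      # eigen values of the real data
# average eigen values over 20 random data sets of the same dimensions
ev.rand <- rowMeans(replicate(20,
  eigen(cor(matrix(rnorm(n * p), n, p)))$values))

n.keep <- sum(ev.real > ev.rand)        # dimensions exceeding the random baseline
```

This also makes the sample size sensitivity noted below concrete: as n grows, the random eigen values all shrink toward 1, so the comparison line flattens.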
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest but that there are also numerous minor factors of no substantive interest that nevertheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25], title="Very Simple Structure of a Big 5 inventory")
Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25], main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6
Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (See Figure 14).
Another way to examine the overlap between two sets is the use of set correlation found by set.cor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by
\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha
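The formula can be checked in a couple of lines of base R (a hypothetical helper, not the psych alpha function, which reports far more than this single number):

```r
# alpha from a covariance (or correlation) matrix V:
# (n/(n-1)) * (1 - tr(V)/Vx), where Vx is the total test variance sum(V)
alpha.from.cov <- function(V) {
  n <- ncol(V)
  (n / (n - 1)) * (1 - sum(diag(V)) / sum(V))
}

# Three parallel items with unit variances and r = .5:
V <- matrix(.5, 3, 3); diag(V) <- 1
alpha.from.cov(V)   # (3/2) * (1 - 3/6) = 0.75
```

With standardized items this gives standardized alpha; with raw covariances it gives raw alpha, which is why the two can differ when item variances differ.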
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j^2, and is

\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}
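This too can be sketched in base R using the standard identity that each item's smc is 1 minus the reciprocal of the corresponding diagonal element of the inverse correlation matrix (a hypothetical helper; psych's smc and guttman functions do this robustly, including for non-invertible matrices):

```r
# lambda6 from an item correlation matrix R
lambda6 <- function(R) {
  smc <- 1 - 1 / diag(solve(R))   # squared multiple correlation of each item
  Vx  <- sum(R)                   # variance of the standardized total score
  1 - sum(1 - smc) / Vx
}

R <- matrix(.5, 3, 3); diag(R) <- 1
lambda6(R)   # 2/3, less than the alpha of .75 for these equal loadings
```

This matches the statement below: for tests with equal item loadings, alpha > G6.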
The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[, s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe = fe)
Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6, and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include:
alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.
guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 … μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).
omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)
cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and a n x n correlation matrix, find the correlations of the composite clusters.
scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1) and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.
score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
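The keys-based scoring that cluster.cor and scoreItems rely on can be sketched in base R as a simple matrix product (toy data; a hypothetical simplification that treats a -1 key as plain subtraction, whereas the real functions reverse items around the scale midpoint and also handle missing data):

```r
# keys: one column per scale, one row per item (-1 = reversed, 0 = not used)
keys <- matrix(c(1, 1, -1, 0,     # scale S1: items 1, 2, and 3 reversed
                 0, 0,  1, 1),    # scale S2: items 3 and 4
               ncol = 2, dimnames = list(NULL, c("S1", "S2")))

items <- matrix(c(5, 4, 2, 3,
                  1, 2, 5, 4), nrow = 2, byrow = TRUE)  # 2 people, 4 items

scores <- items %*% keys          # sum scores per person per scale
```

One keys matrix thus scores any number of scales in a single pass, which is why the keys representation is so convenient for multi-scale inventories.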
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as when one item is dropped at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n=500, raw=TRUE)$observed
> round(cor(r9), 2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1, -1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1, 1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500, short=FALSE, low=-2, high=2, categorical=TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures
Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (See Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.
There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab where rab is the oblique correlation between the factors) or to "first" or "second" in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness (u²) from factor analysis to find e²j. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error).
Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h²j = c²j + Σf²ij, and the unique variance for the item, u²j = σ²j(1 − h²j), may be used to estimate the test reliability. That is, if h²j is the communality of item j, based upon general as well as group factors, then for standardized items, e²j = 1 − h²j and

    ωt = (1cc′1 + 1AA′1′) / Vx = 1 − Σ(1 − h²j) / Vx = 1 − Σu² / Vx
Because h²j ≥ r²smc, ωt ≥ λ6.
It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.
    ωh = 1cc′1 / Vx
Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by
    ωinf = 1cc′1 / (1cc′1 + 1AA′1′)
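These formulas may be checked numerically. The base-R sketch below uses a made-up general loading vector c and group loading matrix A for six standardized items (purely illustrative values) and computes ωh, ωt, and ωinf from the model-implied correlation matrix:

```r
cg <- c(.6, .6, .5, .5, .4, .4)      # hypothetical general factor loadings (c)
A  <- matrix(0, 6, 2)                # hypothetical group factor loadings
A[1:3, 1] <- c(.4, .3, .3)
A[4:6, 2] <- c(.4, .4, .3)
u2 <- 1 - (cg^2 + rowSums(A^2))      # item uniquenesses
R  <- cg %o% cg + A %*% t(A)         # model-implied correlation matrix
diag(R) <- 1
Vx <- sum(R)                         # variance of the unit-weighted total score
omega.h   <- sum(cg %o% cg) / Vx     # general factor saturation
omega.t   <- 1 - sum(u2) / Vx        # total reliability
omega.inf <- sum(cg %o% cg) / (sum(cg %o% cg) + sum(A %*% t(A)))
round(c(omega.h, omega.t, omega.inf), 2)   # 0.60 0.75 0.80
```

Note that ωt computed this way equals (1cc′1 + 1AA′1′)/Vx, as it must.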
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om9
9 simulated variables 
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8 
G.6:                   0.8 
Omega Hierarchical:    0.69 
Omega H asymptotic:    0.82 
Omega Total            0.83 

Schmid Leiman Factor loadings greater than  0.2 
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3* 
2.48 0.52 0.47 0.36 

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65 

The degrees of freedom are 12  and the fit is  0.03 
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02 
The df corrected root mean square of the residuals is  0.03
> om9 <- omega(r9,title="9 simulated variables")

[Figure: bifactor path diagram (g, F1, F2, F3, V1-V9) omitted]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors.
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09 

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97
Measures of factor score adequacy             
                                              g   F1*   F2*   F3*
Correlation of scores with factors         0.84  0.59  0.61  0.54
Multiple R square of scores with factors   0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41
Total, General and Subset omega for each subset
                                              g  F1*  F2*  F3*
Omega total for total scores and subscales 0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales 0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9,n.obs=500)
Call: omegaSem(m = r9, n.obs = 500)
Omega 
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip, 
    digits = digits, title = title, sl = sl, labels = labels, 
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8 
G.6:                   0.8 
Omega Hierarchical:    0.69 
Omega H asymptotic:    0.82 
Omega Total            0.83 
Schmid Leiman Factor loadings greater than  0.2 
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63
With eigenvalues of:
   g  F1*  F2*  F3* 
2.48 0.52 0.47 0.36 

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65 

The degrees of freedom are 12  and the fit is  0.03 
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02 
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71
Compare this with the adequacy of just a general factor and no group factors.
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09 

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97
Measures of factor score adequacy             
                                              g   F1*   F2*   F3*
Correlation of scores with factors         0.84  0.59  0.61  0.54
Multiple R square of scores with factors   0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41
Total, General and Subset omega for each subset
                                              g  F1*  F2*  F3*
Omega total for total scores and subscales 0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales 0.12 0.24 0.26 0.19
Omega Hierarchical from a confirmatory model using sem =  0.72
Omega Total from a confirmatory model using sem =  0.83 
With loadings of 
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83
With eigenvalues of:
   g  F1*  F2*  F3* 
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.
> splitHalf(r9)
Split half reliabilities  
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
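The reverse scoring arithmetic is easy to verify in base R (the 1 to 6 response scale here is hypothetical):

```r
resp <- c(1, 2, 6, 5)              # hypothetical responses on a 1 to 6 scale
mini <- 1                          # scale minimum
maxi <- 6                          # scale maximum
reversed <- (maxi + mini) - resp   # reverse scored responses: 6 5 1 2
reversed
```

A -1 weight in a keys matrix produces exactly this reversal before the items are averaged or summed.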
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items) but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R). Although not normally necessary, show the keys to understand what is happening.
Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and for the demographic data.
This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.
Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)
(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6
Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017
Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23
Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6
Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5
Scale intercorrelations corrected for attenuation 
raw correlations below the diagonal, alpha on the diagonal 
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data, 
print with the short option = FALSE.
To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 16).
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.
Consider the same bfi data set, but first find the correlations, and then use cluster.cor.
> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)
Scale intercorrelations corrected for attenuation 
raw correlations below the diagonal, (standardized) alpha on the diagonal 
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix), or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()

[Figure: pairwise scatter plot matrix of the five scale scores omitted]

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)
(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25
item statistics 
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> # just convert the items to true or false 
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  # compare to previous results
          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
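The sequence just described can be sketched in a few lines (assuming the psych package; the five bfi Agreeableness items are used purely as an example):

```r
library(psych)
describe(bfi[1:5])          # basic descriptive statistics
fa.parallel(bfi[1:5])       # how many dimensions to extract?
keys <- make.keys(5, list(Agree = c(-1, 2:5)))   # item 1 is reverse keyed
scoreItems(keys, bfi[1:5])  # alpha, G6, and item-whole correlations
```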
Although the response.frequencies output from score.multiple.choice is useful to examine, in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)
[Figure: four panels of response curves, P(response) against theta, for reason.4, reason.16, reason.17, and reason.19, omitted]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:
    δj = Dτj / √(1 − λ²j),    αj = λj / √(1 − λ²j)    (2)
where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

    λj = αj / √(1 + α²j),    τj = δj / √(1 + α²j)
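As a quick base-R check that the two parameterizations above are inverses of each other (the loading and threshold values are arbitrary; normal metric, so D = 1):

```r
D      <- 1                  # would be 1.702 for the logistic parameterization
lambda <- 0.7                # hypothetical factor loading
tau    <- 0.5                # hypothetical tetrachoric threshold
alpha  <- lambda / sqrt(1 - lambda^2)    # item discrimination
delta  <- D * tau / sqrt(1 - lambda^2)   # item difficulty
# converting back recovers the original loading and threshold
c(alpha / sqrt(1 + alpha^2), delta / sqrt(1 + alpha^2))   # 0.7 0.5
```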
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.
> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)
> test 
Item Response Analysis using Factor Analysis 

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
              -3    -2    -1     0    1    2     3
V1          0.48  0.59  0.24  0.06 0.01 0.00  0.00
V2          0.30  0.68  0.45  0.12 0.02 0.00  0.00
V3          0.12  0.50  0.74  0.29 0.06 0.01  0.00
V4          0.05  0.26  0.74  0.55 0.14 0.03  0.00
V5          0.01  0.07  0.44  1.05 0.41 0.06  0.01
V6          0.00  0.03  0.15  0.60 0.74 0.24  0.04
V7          0.00  0.01  0.04  0.22 0.73 0.63  0.16
V8          0.00  0.00  0.02  0.12 0.45 0.69  0.31
V9          0.00  0.01  0.02  0.08 0.25 0.47  0.36
Test Info   0.98  2.14  2.85  3.09 2.81 2.14  0.89
SEM         1.01  0.68  0.59  0.57 0.60 0.68  1.06
Reliability -0.02 0.53  0.65  0.68 0.64 0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2 
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09 
The df corrected root mean square of the residuals is  0.1 

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07
Similar analyses can be done for polytomous items such as those of the bfi extraversion scale:
> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis 

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
              -3    -2    -1    0    1    2    3
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05 
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04 
The df corrected root mean square of the residuals is  0.06 

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23
The item information functions show that not all of the items are equally good (Figure 19):
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Figure: three panels, item characteristic curves, item information curves, and test information for V1-V9, omitted]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt,type="IIC")

[Figure: item information curves for E1-E5 omitted]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info,sort=TRUE)
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
              -3    -2    -1     0    1    2    3
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02
More extensive IRT packages include ltm and eRm, and these should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 21) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.
> iq.irt 
Item Response Analysis using Factor Analysis 

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis 

Summary information by factor and item
 Factor =  1 
             -3   -2   -1    0    1    2    3
reason.4   0.04 0.20 0.59 0.59 0.19 0.04 0.01
reason.16  0.07 0.23 0.46 0.39 0.16 0.05 0.01
reason.17  0.06 0.27 0.69 0.50 0.14 0.03 0.01
reason.19  0.05 0.17 0.41 0.46 0.22 0.07 0.02
> iq.irt <- irt.fa(iq.tf)

[Figure: item information curves for the 16 ability items omitted]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7    0.04 0.16 0.44 0.52 0.23 0.06 0.02
letter.33   0.04 0.14 0.34 0.43 0.24 0.08 0.02
letter.34   0.04 0.17 0.48 0.54 0.22 0.06 0.01
letter.58   0.02 0.08 0.28 0.55 0.39 0.13 0.03
matrix.45   0.05 0.11 0.23 0.29 0.21 0.10 0.04
matrix.46   0.05 0.12 0.25 0.30 0.21 0.10 0.04
matrix.47   0.05 0.16 0.37 0.41 0.21 0.07 0.02
matrix.55   0.04 0.08 0.14 0.20 0.19 0.13 0.06
rotate.3    0.00 0.01 0.07 0.29 0.66 0.45 0.13
rotate.4    0.00 0.01 0.05 0.30 0.91 0.55 0.11
rotate.6    0.01 0.03 0.14 0.49 0.68 0.27 0.06
rotate.8    0.00 0.02 0.08 0.27 0.54 0.39 0.13
Test Info   0.55 1.95 4.99 6.53 5.41 2.57 0.71
SEM         1.34 0.72 0.45 0.39 0.43 0.62 1.18
Reliability -0.80 0.49 0.80 0.85 0.82 0.61 -0.40
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.85 
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0

The root mean square of the residuals (RMSA) is  0.08 
The df corrected root mean square of the residuals is  0.09 

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.
Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.
> v9 <- sim.irt(9, 1000, -2., 2., mod = "normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho, 4)
[Figure: omega diagram showing the general factor g and four group factors (F1-F4) with their loadings for the 16 ability items]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- score.irt(test, items)
> unitweighted <- score.irt(items = items, keys = rep(1, 9))
> scores.df <- data.frame(true = v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")
These results are seen in Figure 22.
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.
This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:
r_xy = η_x.wg * η_y.wg * r_xy.wg + η_x.bg * η_y.bg * r_xy.bg    (3)
[Figure: pairs.panels scatter plot matrix of True theta, irt theta, total, fit, rasch, total, and fit, with the pairwise correlations shown above the diagonal]

Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df, pch = ".", gap = 0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line = 3)
where r_xy is the normal correlation, which may be decomposed into the within group and between group correlations r_xy.wg and r_xy.bg, and η (eta) is the correlation of the data with the within group values or the group means.
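The identity in equation (3) can be checked numerically with a few lines of base R. This is a sketch for illustration only; the object names (xw, xb, eta.xw, etc.) are made up here and are not part of the psych package, which does this decomposition inside statsBy.

```r
## Numerical check of equation (3), using base R only (no psych functions).
set.seed(17)
group <- rep(1:20, each = 50)              # 20 groups of 50 cases each
x <- rnorm(1000) + 0.05 * group            # inject some between-group signal
y <- rnorm(1000) - 0.03 * group
xb <- ave(x, group); xw <- x - xb          # group means and within-group deviations
yb <- ave(y, group); yw <- y - yb
r.wg <- cor(xw, yw)                        # pooled within-group correlation
r.bg <- cor(xb, yb)                        # correlation of the group means
eta.xw <- sd(xw)/sd(x); eta.xb <- sd(xb)/sd(x)
eta.yw <- sd(yw)/sd(y); eta.yb <- sd(yb)/sd(y)
## the decomposition reproduces the raw correlation exactly
all.equal(eta.xw * eta.yw * r.wg + eta.xb * eta.yb * r.bg, cor(x, y))
```

The check works because the within-group deviations are orthogonal to the group means, so the covariance of x and y splits exactly into a within piece and a between piece.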
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4 and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5 and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2 and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.
The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
R^2 = 1 - prod(i = 1 to n) (1 - λ_i)

where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = Rxx^-1 Rxy Ryy^-1 Ryx
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.
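The algebra above can be sketched from any correlation matrix with base R alone. This is a hedged illustration of the formula, not the setCor implementation; the eigenvalues of Rxx^-1 Rxy Ryy^-1 Ryx are the squared canonical correlations, and the product formula agrees with the equivalent determinant identity.

```r
## Set correlation from a correlation matrix (base R sketch, not setCor itself)
set.seed(1)
d <- as.data.frame(matrix(rnorm(300 * 6), ncol = 6))
R <- cor(d)
x <- 1:3; y <- 4:6                       # the two sets of variables
M <- solve(R[x, x]) %*% R[x, y] %*% solve(R[y, y]) %*% R[y, x]
lambda <- Re(eigen(M)$values)            # squared canonical correlations
R2.set <- 1 - prod(1 - lambda)           # Cohen's set correlation R2
## the determinant identity gives the same answer
R2.det <- 1 - det(R) / (det(R[x, x]) * det(R[y, y]))
all.equal(R2.set, R2.det)                # TRUE
```

Averaging lambda rather than taking 1 - prod(1 - lambda) gives the average squared canonical correlation mentioned above, which is less inflated when only one pair of variables is strongly related.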
setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.
Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act, use = "pairwise")
> model1 <- lm(ACT ~ gender + education + age, data = sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272,  Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476
Compare this with the output from setCor.
> #compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs = 700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11
Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation = 0.03
 Cohens Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohens Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets = 0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.VSS. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.
sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.
sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette, psych for sem.
The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one parameter (r) needs to be estimated, the structure will have nvar-1 factors.
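The simplex pattern just described is easy to display directly. The snippet below is a base R illustration of the structure (not sim.simplex's own code): with r = .8, adjacent variables correlate .8, variables two apart correlate .64, and so on.

```r
## Build the simplex correlation pattern: r^n for variables n apart
r <- 0.8
simplex <- outer(1:6, 1:6, function(i, j) r^abs(i - j))
round(simplex, 2)
## the first row is 1.00 0.80 0.64 0.51 0.41 0.33
```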
Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models), depending upon the specification of the model.
The logistic model is
P(x|θ_i, δ_j, γ_j, ζ_j) = γ_j + (ζ_j - γ_j) / (1 + e^(α_j(δ_j - θ_i)))    (4)
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
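Equation (4) can be transcribed directly into a few lines of R. The helper below is a sketch for illustration (the name p4pl is made up here; it is not a psych function): gam, zeta, alpha and delta correspond to γ, ζ, α_j and δ_j.

```r
## The 4-parameter logistic model of equation (4):
## gam = guessing (lower asymptote), zeta = upper asymptote,
## alpha = discrimination, delta = difficulty, theta = latent trait
p4pl <- function(theta, delta, alpha = 1, gam = 0, zeta = 1) {
  gam + (zeta - gam) / (1 + exp(alpha * (delta - theta)))
}
p4pl(theta = 0, delta = 0)       # 0.5: at the item's difficulty, p = .5
p4pl(0, 0, gam = 0.25)           # 0.625: guessing raises the whole curve
```

With the Rasch defaults (gam = 0, zeta = 1, alpha = 1) only delta remains free, which is exactly the 1PL special case described above.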
(Graphics of these may be seen in the demonstrations for the logistic function.)
The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meanings of the parameters are otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
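That near-equivalence is easy to verify numerically: with a discrimination of 1.702, the scaled logistic and the normal ogive never differ by more than about .01 anywhere on the trait continuum.

```r
## Maximum discrepancy between the scaled logistic and the normal ogive
theta <- seq(-4, 4, by = 0.001)
max(abs(plogis(1.702 * theta) - pnorm(theta)))   # < 0.01 for all theta
```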
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g., (but not shown):
f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)            # a pretty boring plot
iclust.diagram(c)  # a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)           # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.
> xlim = c(0, 10)
> ylim = c(0, 10)
> plot(NA, xlim = xlim, ylim = ylim, main = "Demonstration of dia functions", axes = FALSE, xlab = "", ylab = "")
> ul <- dia.rect(1, 9, labels = "upper left", xlim = xlim, ylim = ylim)
> ll <- dia.rect(1, 3, labels = "lower left", xlim = xlim, ylim = ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim = xlim, ylim = ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim = xlim, ylim = ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim = xlim, ylim = ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim = xlim, ylim = ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim = xlim, ylim = ylim)
> br <- dia.rect(9, 1, "bottom right", xlim = xlim, ylim = ylim)
> dia.arrow(from = lr, to = ul, labels = "right to left")
> dia.arrow(from = ul, to = ur, labels = "left to right")
> dia.curved.arrow(from = lr, to = ll$right, labels = "right to left")
> dia.curved.arrow(to = ur, from = ul$right, labels = "left to right")
> dia.curve(ll$top, ul$bottom, "double")  # for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")          # but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)
[Figure: rectangles and ellipses labelled upper left, lower left, lower right, upper right, middle left, middle right, bottom left, and bottom right, connected by labelled arrows and curves]

Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.
df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.
cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.
fisherz Convert a correlation to the corresponding Fisher z score.
geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.
headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail: combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.
mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
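The block-diagonal layout described above can be sketched in a few lines of base R. The helper below is purely illustrative (the name super is made up here and it handles only two matrices); superMatrix itself is more general.

```r
## Combine A (top left) and B (lower right) with zeros in the other quadrants,
## mimicking what superMatrix does for two keys matrices
super <- function(A, B) {
  out <- matrix(0, nrow(A) + nrow(B), ncol(A) + ncol(B))
  out[1:nrow(A), 1:ncol(A)] <- A
  out[nrow(A) + 1:nrow(B), ncol(A) + 1:ncol(B)] <- B
  out
}
super(diag(2), matrix(1, 2, 2))   # a 4 x 4 block-diagonal matrix
```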
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as McDonald (1999). The original data are from Thurstone and Thurstone (1941) and reanalyzed by Bechtoldt (1961). Personality item data representing five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.
Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.
bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.
sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.
epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
14 Development version and a user's guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib/ and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.
Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.
News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8", package = "psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.
For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.
16 SessionInfo
This document was prepared using the following settings:
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45           grid_3.3.0            arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0          coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12           sem_3.1-7             tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1 parallel_3.3.0        mnormt_1.5-4
References
Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd ed. edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.
86
packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.
4 Basic data analysis
A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember that in order to run any of the psych functions, it is necessary to make the package active by using the library command:

R code
library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

R code
.First <- function(x) {library(psych)}
4.1 Getting the data by using read.file
Although many find copying the data to the clipboard and then using the read.clipboard functions (see below) helpful, an alternative is to read the data in directly. This can be done using the read.file function, which calls file.choose to find the file and then, based upon the suffix of the file, chooses the appropriate way to read it. Files with suffixes of .txt, .text, .r, .rds, .rda, .csv, .xpt, or .sav will be read correctly.

R code
my.data <- read.file()

If the file contains Fixed Width Format (fwf) data, the column information can be specified with the widths command:

R code
my.data <- read.file(widths = c(4,rep(1,35)))  # will read in a file without a header row
 # and 36 fields, the first of which is 4 columns, the rest of which are 1 column each
If the file is an RData file (with suffix of .RData, .Rda, .rda, .Rdata, or .rdata), the object will be loaded. Depending upon what was stored, this might be several objects. If the file is a .sav file from SPSS, it will be read with the most useful default options (converting the file to a data.frame and converting character fields to numeric). Alternative options may be specified. If it is an export file from SAS (.xpt or .XPT), it will be read. .csv files (comma separated files), normal .txt or .text files, and .data or .dat files will be read as well. These are assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).
To read SPSS files and keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read the file again, converting the value labels to numeric values.

R code
my.spss <- read.file(use.value.labels=TRUE)  # this will keep the value labels for .sav files
my.data <- read.file()  # just read the file and convert the value labels to numeric

You can quickView the my.spss object to see what it looks like, as well as the my.data object to see the numeric encoding:

R code
quickView(my.spss)  # to see the original encoding
quickView(my.data)  # to see the numeric encoding
4.2 Data input from the clipboard
There are, of course, many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste them into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions is perhaps more user friendly:
read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).
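For the triangular input functions, a minimal sketch (the correlation values below are made up purely for illustration):

```r
library(psych)
# Suppose this lower-triangular matrix (with diagonal) has been copied
# to the clipboard:
# 1.00
# 0.40 1.00
# 0.30 0.20 1.00
R <- read.clipboard.lower()  # R is now the full, symmetric 3 x 3 matrix
```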
For example, given a data set copied to the clipboard from a spreadsheet, just enter the command

R code
my.data <- read.clipboard()
This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function:

R code
my.data <- read.clipboard(sep="\t")  # define the tab option, or
my.tab.data <- read.clipboard.tab()  # just use the alternative function
For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns wide, the second is 2 columns, the next 5 are 1 column each, and the last 4 are 3 columns each):

R code
my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))
4.3 Basic descriptive statistics

Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.
describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.
> library(psych)
> data(sat.act)
> describe(sat.act)  # basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41
4.4 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as for communicating important results. Scatter Plot Matrices (SPLOMS), using the pairs.panels function, are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
4.4.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (See Figure 1 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.
4.4.2 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use="pairwise", method="pearson" as default values, and returns (invisibly) the full correlation matrix while displaying the lower off diagonal matrix.
> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00
When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:
> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])
> png('pairs.panels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age  ACT  SATV SATQ
education        NA  0.52 0.16  0.07 0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53 0.58
SATV           0.02 -0.06 0.61    NA 0.63
SATQ           0.08  0.04 0.60  0.68   NA
It is also possible to compare two matrices by taking their differences, displaying one matrix below the diagonal and the difference of the second from the first above the diagonal:
> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA
4.4.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> corPlot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main='24 variables in a circumplex')
> dev.off()
null device
          1

Figure 3: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, then φ will underestimate the value of the Pearson correlation applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.
If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
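A brief sketch of these functions, using the dichotomous lsat6 ability items supplied with psych in the bock data set (the arguments to draw.tetra below are illustrative values, not estimates from those data):

```r
library(psych)
data(bock)           # loads the lsat6 and lsat7 dichotomous item sets
tetrachoric(lsat6)   # tetrachoric correlations and the thresholds (tau)
draw.tetra(r = .5, t1 = 1, t2 = 1)  # a bivariate normal with cuts at z = 1 on each variable
```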
The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigenvalues of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
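A minimal sketch of that repair, using the burt data set included in psych:

```r
library(psych)
data(burt)                          # Burt's (1915) correlation matrix of 11 emotional variables
min(eigen(burt)$values)             # the smallest eigenvalue is (slightly) negative
burt.smoothed <- cor.smooth(burt)   # adjust the offending eigenvalues and rescale
min(eigen(burt.smoothed)$values)    # now non-negative
```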
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
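That equivalence can be sketched by comparing the first eigenvector of a correlation matrix, rescaled by the square root of its eigenvalue, with the first unrotated component from the principal function discussed below (the sign of a loading vector is arbitrary and may flip):

```r
library(psych)
R <- cor(sat.act[4:6], use = "pairwise")  # ACT, SATV, and SATQ
ev <- eigen(R)                            # eigendecomposition of the correlation matrix
ev$vectors[, 1] * sqrt(ev$values[1])      # first principal component loadings, by hand
principal(R, nfactors = 1)                # the same loadings (up to sign) from psych
```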
Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

principal Principal Components Analysis reports the largest n eigenvectors rescaled by the square root of their eigenvalues.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigenvalues of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix nRn be approximated by the product of a factor matrix nFk and its transpose plus a diagonal matrix of uniquenesses:

R = FF' + U^2   (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
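As an illustrative sketch, the methods may be compared on the Thurstone correlation matrix discussed below, with factor.congruence showing how similar the resulting loadings are:

```r
library(psych)
f.minres <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "minres")
f.ml     <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "ml")
factor.congruence(f.minres, f.ml)   # cosines between the two sets of loadings
```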
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigenvalues of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
5.1.2 Principal Axis Factor Analysis

An alternative, least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigenvalue decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
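A sketch of that call, assuming the Thurstone data set (max.iter = 1 stops after the first eigenvalue decomposition):

```r
library(psych)
f.sas <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "pa", max.iter = 1)  # one iteration, as in SAS
```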
The solutions from the fa, the factor.minres, and the factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone, 3, n.obs = 213, fm = "pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix

                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)

[Figure 4 here: a SPLOM of the MR1, MR2, and MR3 factor loadings, titled "Factor Analysis", with each axis running from 0.0 to 0.8.]

Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone, 3, n.obs = 213, fm = "wls")
> print(f3w, cut = 0, digits = 3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)

[Figure 5 here: a path diagram titled "Factor Analysis". MR1 is marked by Sentences (0.9), Vocabulary (0.9), and Sent.Completion (0.8); MR2 by First.Letters (0.9), 4.Letter.Words (0.7), and Suffixes (0.6); MR3 by Letter.Series (0.8), Letter.Group (0.6), and Pedigrees (0.5). The factor intercorrelations are 0.6, 0.5, and 0.5.]

Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.
515 Hierarchical and bi-factor solutions
For a long time structural analysis of the ability domain have considered the problem offactors that are themselves correlated These correlations may themselves be factored toproduce a higher order general factor An alternative (Holzinger and Swineford 1937Jensen and Weng 1994) is to consider the general factor affecting each item and thento have group factors account for the residual variance Exploratory factor solutions toproduce a hierarchical or a bifactor solution are found using the omega function Thistechnique has more recently been applied to the personality domain to consider such thingsas the structure of neuroticism (treated as a general factor with lower order factors ofanxiety depression and aggression)
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7).
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone, 3, n.obs = 213, rotate = "Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
Another solution is to use structural equation modeling to directly solve for the general and group factors.
> om.h <- omega(Thurstone, n.obs=213, sl=FALSE)
> op <- par(mfrow=c(1, 1))
[Path diagram omitted: the nine Thurstone variables (Sentences through Pedigrees) load on three group factors (F1, F2, F3), which in turn load on a single higher order general factor, g.]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem.
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
> om <- omega(Thurstone, n.obs=213)
[Path diagram omitted: the nine Thurstone variables load directly on the general factor g as well as on their residualized group factors F1, F2, and F3.]

Figure 7: A bifactor solution to the Thurstone 9 variable problem.
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxonometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1. Find the proximity (e.g. correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.
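One pass through the agglomerative steps above can be sketched in a few lines of base R (a toy illustration with simulated data, not the iclust implementation, which additionally evaluates the α and β of each proposed merge):

```r
# Toy sketch of one agglomerative step, using correlation as proximity.
set.seed(42)
x <- matrix(rnorm(200 * 4), 200, 4)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.5)      # items 1 and 2 are similar
colnames(x) <- paste0("V", 1:4)

# Step 1: the proximity (correlation) matrix
R <- cor(x)
diag(R) <- 0                                 # ignore self-similarity
# Step 2: the most similar pair of items
pair <- which(R == max(R), arr.ind = TRUE)[1, ]
# Step 3: combine the pair into a new unit-weighted composite (cluster)
cluster <- rowSums(x[, pair])
# Step 4: similarity of the new cluster to the remaining items
remaining <- setdiff(seq_len(ncol(x)), pair)
round(cor(cluster, x[, remaining]), 2)
# Step 5 would repeat this loop; iclust stops merging when doing so
# fails to increase alpha or beta.
```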
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])
[ICLUST dendrogram omitted: the 25 items form four clusters (C20, C16, C15, C21), with the α and β of each intermediate cluster shown at each merge.]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items. (The three demographic variables were excluded from this analysis.) When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that saved correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
       C20    C16   C15   C21
C20   0.80 -0.291  0.40 -0.33
C16  -0.24  0.815 -0.29  0.11
C15   0.30 -0.221  0.73 -0.30
C21  -0.23  0.074 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OS X, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices. (See Table 8.) Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
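The bootstrap logic behind n.iter can be sketched in base R (a toy example using a single correlation in place of a factor loading; the data here are simulated):

```r
# Minimal bootstrap sketch: resample rows, recompute the statistic,
# and take quantiles of the replicates as a confidence interval.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
boot_r <- replicate(1000, {
  i <- sample.int(n, replace = TRUE)   # resample cases with replacement
  cor(x[i], y[i])
})
quantile(boot_r, c(0.025, 0.975))      # 95% confidence interval
```

fa applies the same idea to each loading, refitting the factor model on each bootstrap sample.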
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:
r_c(f_i, f_j) = Σ_{k=1..n} f_ik f_jk / sqrt(Σ f_ik² · Σ f_jk²)
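For a single pair of loading vectors, the coefficient is a one-liner in base R (a sketch; psych supplies the full matrix version as factor.congruence):

```r
# The congruence coefficient: the cosine of the angle between two
# loading vectors (invariant to rescaling of either vector).
congruence <- function(f1, f2) {
  sum(f1 * f2) / sqrt(sum(f1^2) * sum(f2^2))
}
congruence(c(0.8, 0.7, 0.6), c(0.8, 0.7, 0.6))   # identical vectors: 1
congruence(c(0.8, 0.7, 0.6), c(1.6, 1.4, 1.2))   # rescaled vector: still 1
```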
Consider the case of a four factor solution and a four cluster solution to the Big Five problem.
> f4 <- fa(bfi[1:25], 4, fm="pa")
> factor.congruence(f4, ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(r.poly$rho, title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)
[ICLUST dendrogram omitted: the cluster structure of the 25 items based upon polychoric correlations.]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(r.poly$rho, 5, title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)
[ICLUST dendrogram omitted: the five cluster solution based upon polychoric correlations.]

Figure 10: ICLUST of the BFI data set using polychoric correlations, with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(r.poly$rho, beta.size=3, title="ICLUST beta.size=3")
[ICLUST dendrogram omitted: the cluster solution with beta.size = 3.]

Figure 11: ICLUST of the BFI data set using polychoric correlations, with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.0.

> print(ic, cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P     C20   C16   C15   C21
A1  C20 C20
A2  C20 C20  0.59
A3  C20 C20  0.65
A4  C20 C20  0.43
A5  C20 C20  0.65
C1  C15 C15               0.54
C2  C15 C15               0.62
C3  C15 C15               0.54
C4  C15 C15         0.31 -0.66
C5  C15 C15 -0.30   0.36 -0.59
E1  C20 C20 -0.50
E2  C20 C20 -0.61   0.34
E3  C20 C20  0.59              -0.39
E4  C20 C20  0.66
E5  C20 C20  0.50         0.40 -0.32
N1  C16 C16         0.76
N2  C16 C16         0.75
N3  C16 C16         0.74
N4  C16 C16 -0.34   0.62
N5  C16 C16         0.55
O1  C20 C21                    -0.53
O2  C21 C21                     0.44
O3  C20 C21  0.39              -0.62
O4  C21 C21                    -0.33
O5  C21 C21                     0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function. (See Table 9.)
Table 9: Congruence coefficients for oblique factor, bifactor, and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))
      MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1  1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2  0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3  0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g    0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1   1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2   0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3   0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2   0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1  1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2  0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3  0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include:
1) Extracting factors until the chi square of the residual matrix is not significant.
2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.
3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis): fa.parallel (Horn, 1965).
4) Plotting the magnitude of the successive eigen values and applying the scree test (asudden drop in eigen values analogous to the change in slope seen when scrambling up the
talus slope of a mountain and approaching the rock face) (Cattell, 1966).
5) Extracting factors as long as they are interpretable
6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).
7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).
8) Extracting principal components until the eigen value < 1.
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nevertheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).
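The simplification step can be sketched in base R (simplify_loadings is a hypothetical helper for illustration, not the vss code itself):

```r
# Keep only the c largest absolute loadings in each row of a loading
# matrix, zeroing the rest -- the "simplified" matrix that vss fits.
simplify_loadings <- function(L, c = 1) {
  t(apply(L, 1, function(row) {
    keep <- order(abs(row), decreasing = TRUE)[seq_len(c)]
    out <- rep(0, length(row))
    out[keep] <- row[keep]
    out
  }))
}
L <- rbind(c(0.7, 0.2, 0.1),
           c(0.1, 0.6, 0.4))
simplify_loadings(L, c = 1)   # each item keeps only its single largest loading
```

vss then asks how well this simplified matrix reproduces the original correlations for each candidate number of factors.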
Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25], title="Very Simple Structure of a Big 5 inventory")
[VSS plot omitted: Very Simple Structure fit as a function of the number of factors (1 to 8), plotted separately for item complexities 1 through 4.]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.
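The core comparison can be sketched in base R (a simplified version of the fa.parallel logic, using mean eigen values of random data and a toy one-factor data set, not the BFI):

```r
# Minimal parallel analysis sketch: retain components whose observed
# eigen values exceed the mean eigen values of random data of the
# same dimensions.
set.seed(123)
n <- 500; p <- 9
f <- rnorm(n)
x <- sapply(1:p, function(i) 0.6 * f + rnorm(n))  # one real factor
obs <- eigen(cor(x))$values
rand <- rowMeans(replicate(20,
  eigen(cor(matrix(rnorm(n * p), n, p)))$values))
sum(obs > rand)   # number of components suggested
```

fa.parallel does this for both the component and the common factor eigen values, and also resamples rows of the real data rather than only simulating normal data.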
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly, or by fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25], main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6
[Scree plot omitted: eigen values of principal components and factor analysis for the actual, simulated, and resampled data.]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation found by set.cor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

λ3 = (n/(n−1)) (1 − tr(Vx)/Vx) = (n/(n−1)) (Vx − tr(Vx))/Vx = α

where Vx is the total test variance and tr(Vx) is the sum of the item variances.
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
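The λ3 formula translates directly into base R (a sketch; the alpha function in psych reports this value as raw_alpha):

```r
# Cronbach's alpha computed from a covariance (or correlation) matrix,
# following the lambda 3 formula: (n/(n-1)) * (1 - tr(V)/sum(V)).
cronbach_alpha <- function(V) {
  n <- ncol(V)
  (n / (n - 1)) * (1 - sum(diag(V)) / sum(V))
}
# For n parallel items with intercorrelation r, alpha should equal the
# Spearman-Brown value n*r / (1 + (n-1)*r): with 4 items and r = .3,
# that is 1.2/1.9, about 0.63.
R <- matrix(0.3, 4, 4); diag(R) <- 1
cronbach_alpha(R)
```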
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation, or smc), or, more precisely, the variance of the errors, e_j², and is

λ6 = 1 − (Σ e_j²)/Vx = 1 − Σ(1 − r²smc)/Vx
The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
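Both the smc and λ6 can be computed in a few lines of base R (a sketch, using the fact that the smc of each variable is one minus the reciprocal of the corresponding diagonal element of the inverse correlation matrix):

```r
# smc from the inverse of the correlation matrix, then lambda 6 from
# the formula above (V_x is the total variance, the sum of R for
# standardized items).
smc <- function(R) 1 - 1 / diag(solve(R))
lambda6 <- function(R) 1 - sum(1 - smc(R)) / sum(R)

R <- matrix(0.3, 4, 4); diag(R) <- 1
smc(R)        # squared multiple correlation of each item on the rest
lambda6(R)
```

For this equicorrelated example λ6 is below α, consistent with the statement below that for tests with equal item loadings alpha > G6.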
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[,s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe=fe)
[Path diagram ("Factor analysis and extension") omitted: the factors MR1 and MR2, defined by the eight odd numbered variables on the left, are extended to the eight even numbered variables on the right.]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6, and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include:
alpha: Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.
guttman: Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).
omega: Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)
cluster.cor: Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and an n x n correlation matrix, find the correlations of the composite clusters.
scoreItems: Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha, coefficient G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.
score.multiple.choice: Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds
total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as if one item is dropped at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n=500, raw=TRUE)$observed
> round(cor(r9), 2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)
Reliability analysis
Call: alpha(x = r9)

 raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures
Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02
Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (See Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf )
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
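One way to see how small these differences typically are is to run omega under each factoring method and compare the ωh estimates side by side. This is a sketch only; it assumes the psych package is attached and that r9 is the simulated data set created above.

```r
# Compare omega hierarchical across factoring methods (assumes psych is
# loaded and r9 exists as in the earlier example).
om.minres <- omega(r9)              # default: minimum residual
om.pa     <- omega(r9, fm = "pa")   # principal axes
om.pc     <- omega(r9, fm = "pc")   # principal components
round(c(minres = om.minres$omega_h,
        pa     = om.pa$omega_h,
        pc     = om.pc$omega_h), 2)
```

The differences are usually in the second decimal place; any larger discrepancy is worth investigating.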
β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
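Because β is defined as the worst split half, it can be read off an iclust solution directly. A sketch, assuming psych is loaded and r9 is the simulated data from above; the $beta element name is how recent psych versions report it.

```r
# Estimate beta (worst split half reliability) via hierarchical item
# clustering (assumes psych is loaded; r9 from the earlier example).
ic <- iclust(cor(r9))   # cluster the item correlation matrix
ic$beta                 # beta for the resulting cluster(s)
```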
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab, where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
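A minimal sketch of the two group factor case (assumes psych is loaded; the 6 variable matrix below is a hypothetical subset of the default sim.hierarchical structure, not an example from the text):

```r
# With only two group factors the general loadings are under-identified,
# so omega warns and applies the chosen constraint (assumes psych loaded).
R6 <- sim.hierarchical()[1:6, 1:6]  # hypothetical 6 item, 2 group factor matrix
om.eq    <- omega(R6, nfactors = 2)                    # default: "equal"
om.first <- omega(R6, nfactors = 2, option = "first")  # g = first group factor
```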
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness uj² from factor analysis to find ej². This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is hj² = cj² + Σ fij², and the unique variance for the item, uj² = σj²(1 − hj²), may be used to estimate the test reliability. That is, if hj² is the communality of item j, based upon general as well as group factors, then for standardized items, ej² = 1 − hj² and
ωt = (1cc′1 + 1AA′1′)/Vx = 1 − Σ(1 − hj²)/Vx = 1 − Σuj²/Vx

Because hj² ≥ r²smc, ωt ≥ λ6.
It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

ωh = 1cc′1/Vx

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

ωinf = 1cc′1/(1cc′1 + 1AA′1′)
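These formulas are easy to verify by hand in base R. The following sketch (not psych code) builds an implied correlation matrix from hypothetical Schmid-Leiman loadings (round numbers chosen for illustration, not the r9 values) and applies the three omega formulas directly.

```r
# Base-R check of the omega formulas from hypothetical Schmid-Leiman loadings.
g  <- c(.7, .6, .6, .5, .5, .4, .5, .4, .3)      # general factor loadings (c)
A  <- matrix(0, 9, 3)                            # group factor loadings
A[1:3, 1] <- .4; A[4:6, 2] <- .4; A[7:9, 3] <- .3
R  <- g %*% t(g) + A %*% t(A)                    # implied correlation matrix
diag(R) <- 1                                     # unit variances on the diagonal
Vx <- sum(R)                                     # total test variance 1R1'
omega.h   <- sum(g)^2 / Vx                       # 1cc'1 / Vx
omega.t   <- (sum(g)^2 + sum(colSums(A)^2)) / Vx # (1cc'1 + 1AA'1') / Vx
omega.inf <- sum(g)^2 / (sum(g)^2 + sum(colSums(A)^2))
round(c(omega.h, omega.t, omega.inf), 2)         # 0.69 0.82 0.85
```

Note how omega.h and omega.t bracket the general factor saturation in the same way as the ωh and ωt reported by omega itself.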
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
> om9 <- omega(r9,title="9 simulated variables")
[Path diagram "9 simulated variables": the bifactor solution, with the general factor g and group factors F1-F3; the loadings match the Schmid-Leiman table above]
Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g   F1   F2   F3
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g   F1   F2   F3
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g   F1   F2   F3   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.
> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
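The max + min identity is easy to see with a small base R example (hypothetical ratings on a 1 to 6 scale, not psych code):

```r
# Reverse scoring by subtracting from (max + min): a base-R illustration
# with hypothetical ratings on a 1-6 scale.
item     <- c(1, 4, 6, 2)
reversed <- (6 + 1) - item   # max + min - item
reversed                     # 6 3 1 5
cor(item, reversed)          # -1: direction is flipped, spacing is preserved
```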
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
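Switching between the two metrics is a single argument to scoreItems. A sketch, assuming psych is loaded and a keys scoring matrix such as the bfi example constructed in this section:

```r
# Average vs. total scores in scoreItems (assumes psych is loaded and
# keys is a scoring matrix such as the bfi example in this section).
mean.scores  <- scoreItems(keys, bfi)                # default: item averages
total.scores <- scoreItems(keys, bfi, totals = TRUE) # sum scores instead
```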
The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.
Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+      Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+      Openness = c(21,-22,23,24,-25)),
+      item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+      Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, including for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.
Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6 reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See figure 16.)
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use cluster.cor.
> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix), or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2
Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iq.items)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iq.items)

Call: score.multiple.choice(key = iq.keys, data = iq.items)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iq.items,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
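The sequence just described can be sketched as a short script (assumes the psych package is attached; bfi ships with psych, and the 25 personality items are columns 1 to 25):

```r
# A sketch of the item analysis workflow (assumes psych is loaded).
describe(bfi[1:25])                 # item level descriptives
fa.parallel(bfi[1:25])              # parallel analysis: how many dimensions?
vss(bfi[1:25])                      # Very Simple Structure criterion
f5 <- fa(bfi[1:25], nfactors = 5)   # exploratory factor analysis
ic <- iclust(cor(bfi[1:25], use = "pairwise"))  # hierarchical item clustering
```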
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
The use of Item Response Theory has become the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iq.items)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iq.items,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iq.items[1:4],breaks=11)
[Four-panel plot: P(response) for each response alternative (0-6) as a function of theta, for items reason.4, reason.16, reason.17, and reason.19]
Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function.
δj = Dτj / √(1 − λj²)        αj = λj / √(1 − λj²)        (2)

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

λj = αj / √(1 + αj²)        τj = δj / √(1 + αj²)
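These conversions are simple enough to check in base R. The following sketch uses a hypothetical loading and threshold (normal ogive case, so D = 1) and shows that inverting the transformation recovers the original values:

```r
# Round trip between factor loading/threshold and IRT parameters
# (normal ogive case, D = 1; lambda and tau are hypothetical values).
lambda <- 0.6                      # factor loading lambda_j
tau    <- 0.3                      # tetrachoric threshold tau_j
a <- lambda / sqrt(1 - lambda^2)   # discrimination alpha_j (= 0.75 here)
d <- tau    / sqrt(1 - lambda^2)   # difficulty delta_j (= 0.375 here)
# inverting recovers the loading and threshold:
c(lambda.back = a / sqrt(1 + a^2), tau.back = d / sqrt(1 + a^2))  # 0.6 0.3
```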
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.
> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
               -3    -2    -1    0    1    2     3
V1           0.48  0.59  0.24 0.06 0.01 0.00  0.00
V2           0.30  0.68  0.45 0.12 0.02 0.00  0.00
V3           0.12  0.50  0.74 0.29 0.06 0.01  0.00
V4           0.05  0.26  0.74 0.55 0.14 0.03  0.00
V5           0.01  0.07  0.44 1.05 0.41 0.06  0.01
V6           0.00  0.03  0.15 0.60 0.74 0.24  0.04
V7           0.00  0.01  0.04 0.22 0.73 0.63  0.16
V8           0.00  0.00  0.02 0.12 0.45 0.69  0.31
V9           0.00  0.01  0.02 0.08 0.25 0.47  0.36
Test Info    0.98  2.14  2.85 3.09 2.81 2.14  0.89
SEM          1.01  0.68  0.59 0.57 0.60 0.68  1.06
Reliability -0.02  0.53  0.65 0.68 0.64 0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07
Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:
gt data(bfi)gt eirt lt- irtfa(bfi[1115])gt eirt
Item Response Analysis using Factor Analysis
Call irtfa(x = bfi[1115])Item Response Analysis using Factor Analysis
Summary information by factor and itemFactor = 1
-3 -2 -1 0 1 2 3E1 037 058 075 076 061 039 021E2 048 089 118 124 099 059 025E3 034 049 058 057 049 036 023E4 063 098 111 096 067 037 016E5 033 042 047 045 037 027 017Test Info 215 336 409 398 313 197 102SEM 068 055 049 050 056 071 099Reliability 053 070 076 075 068 049 002
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90% confidence intervals are 0.083 0.111
BIC = 96.23
The item information functions show that not all of the items are equally good (Figure 19).
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))
[Three panels, each plotted against the Latent Trait (normal scale): "Item parameters from factor analysis" (probability of response for V1-V9), "Item information from factor analysis" (item information for V1-V9), and "Test information -- item parameters from factor analysis" (test information, with reliability on the right axis).]
Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt, type="IIC")
[Item information curves for E1-E5, titled "Item information from factor analysis for factor 1", plotted against the Latent Trait (normal scale).]
Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info, sort=TRUE)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 21) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.
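In outline, the pattern is simply to pass the saved matrix to the next function (a sketch assuming the psych package is installed and that iq.tf holds the True/False scored items from score.multiple.choice, as in the example below):

```r
library(psych)              # assumption: psych is installed
iq.irt <- irt.fa(iq.tf)     # slow step: tetrachoric correlations are found here
om <- omega(iq.irt$rho, 4)  # fast: reuses the correlation matrix saved in $rho
```

The second call skips the expensive tetrachoric step entirely because it operates on a correlation matrix, not raw data.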
> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
             -3    -2    -1     0     1     2     3
reason.4   0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16  0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17  0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19  0.05  0.17  0.41  0.46  0.22  0.07  0.02
> iq.irt <- irt.fa(iq.tf)
[Item information curves for the 16 ability items (reason, letter, matrix, and rotate items), titled "Item information from factor analysis", plotted against the Latent Trait (normal scale).]
Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7    0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33   0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34   0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58   0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45   0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46   0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47   0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55   0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3    0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4    0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6    0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8    0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info   0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM         1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49  0.80  0.85  0.82  0.61 -0.40
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90% confidence intervals are 0.126 0.135
BIC = 2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.
> v9 <- sim.irt(9, 1000, -2., 2., mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho, 4)
[Omega diagram: a general factor g with loadings of roughly 0.4 to 0.7 on all 16 items, plus four group factors (F1: rotate items; F2: letter items; F3: reason items; F4: matrix items).]
Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- score.irt(test, items)
> unitweighted <- score.irt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")
These results are seen in Figure 22
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

rxy = ηxwg · ηywg · rxywg + ηxbg · ηybg · rxybg    (3)
Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported, and then the IRT and total scoring systems.

> pairs.panels(scores.df, pch=".", gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)
where rxy is the normal correlation, which may be decomposed into within group and between group correlations, rxywg and rxybg, and η (eta) is the correlation of the data with the within group values or the group means.
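A minimal sketch of this decomposition with statsBy, using the sat.act data grouped by education (assumes the psych package; the element names rwg, rbg, etabg follow the statsBy output, and str(sb) shows the full list):

```r
library(psych)                        # assumption: psych is installed
sb <- statsBy(sat.act, "education")   # group by level of education
sb$rwg    # pooled within-group correlations (rxywg)
sb$rbg    # correlations of the group means (rxybg)
sb$etabg  # correlation of the data with the group means (eta bg)
```

Together, rwg and rbg (weighted by the etas) reproduce the raw correlation, following equation 3.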
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4 and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5 and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2 and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
R2 = 1 − ∏_{i=1}^{n} (1 − λi)
where λi is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = Rxx⁻¹ Rxy Ryy⁻¹ Ryx
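The eigenvalue computation can be sketched directly in base R (illustrative only; the data are simulated here, and the product matrix is formed in the standard way for set correlation from a correlation matrix partitioned into an x set and a y set):

```r
set.seed(1)
X <- matrix(rnorm(300), 100, 3)   # x set: 3 variables
# y set: 2 variables, built to depend on the x set
Y <- X %*% matrix(rnorm(6), 3, 2) + matrix(rnorm(200), 100, 2)
R <- cor(cbind(X, Y))
x <- 1:3; y <- 4:5
M <- solve(R[x, x]) %*% R[x, y] %*% solve(R[y, y]) %*% R[y, x]
lambda <- Re(eigen(M)$values)  # eigenvalues are real in theory;
                               # Re() drops numerical imaginary parts
R2 <- 1 - prod(1 - lambda)     # Cohen's set correlation R^2
```

Because Y was constructed from X, R2 here is close to 1; with unrelated sets it would be near 0.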
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. Thus, an alternative statistic based upon the average canonical correlation might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476
Compare this with the output from setCor:

> #compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs=700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohen's Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohen's Set Correlation: 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, and sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.
sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
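As a quick sanity check with one of these, sim.congeneric can generate one-factor data whose loadings a factor analysis should roughly recover (a sketch assuming the psych package; the $observed element name follows sim.congeneric's documented output when short=FALSE, and str(cong) shows the full list):

```r
library(psych)  # assumption: psych is installed
set.seed(42)
# simulate 1000 cases from a one-factor model with known loadings
cong <- sim.congeneric(loads=c(.8, .7, .6, .5), N=1000, short=FALSE)
f1 <- fa(cong$observed, 1)  # loadings should be near .8, .7, .6, .5
f1$loadings
```

Because the truth is known, the recovered loadings can be compared directly to the generating values.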
Some of these functions are described in more detail in the companion vignette psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
The logistic model is
P(x|θi, δj, γj, ζj) = γj + (ζj − γj) / (1 + e^(αj(δj − θi)))    (4)
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), αj is item discrimination and δj is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.
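Equation 4 is easy to sketch in base R (p4pl below is a hypothetical helper written for illustration, not a psych function):

```r
# Four-parameter logistic item response function, following equation 4.
# gam = lower asymptote (guessing), zeta = upper asymptote,
# a = discrimination (alpha), d = difficulty (delta).
p4pl <- function(theta, a = 1, d = 0, gam = 0, zeta = 1) {
  gam + (zeta - gam) / (1 + exp(a * (d - theta)))
}
p4pl(0)                            # Rasch item at its own difficulty: 0.5
p4pl(0, a = 1.5, d = 1, gam = 0.2) # a 3PL-style item with guessing
```

Setting gam=0, zeta=1, a=1 reduces this to the Rasch model, as described above.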
(Graphics of these may be seen in the demonstrations for the logistic function)
The normal model (sim.npn) calculates the probability using pnorm instead of the logistic function used in sim.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
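The near-equivalence is easy to verify in base R:

```r
# Compare the normal ogive with a logistic curve scaled by D = 1.702.
theta <- seq(-4, 4, by = 0.01)
p_normal   <- pnorm(theta)
p_logistic <- 1 / (1 + exp(-1.702 * theta))
max(abs(p_normal - p_logistic))  # less than 0.01 everywhere
```

The maximum discrepancy over the whole trait range is under one percentage point, which is why the scaling factor D = 1.702 appears in the conversion formulas earlier in this section.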
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g., (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)             # a pretty boring plot
iclust.diagram(c)   # a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)            # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double") #for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up") #but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
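Two of these are easy to illustrate (a sketch assuming the psych package; fisherz is mathematically just atanh, and superMatrix block-combines its arguments):

```r
library(psych)                 # assumption: psych is installed
fisherz(0.5)                   # Fisher z of r = .5, the same value as...
atanh(0.5)                     # ...the inverse hyperbolic tangent
superMatrix(diag(2), diag(3))  # 5 x 5 block matrix: A top left, B lower right,
                               # zeros elsewhere
```

The block structure produced by superMatrix is exactly what is needed when stacking scoring keys from separate inventories.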
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iq.items). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative, and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iq.items 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
14 Development version and a user's guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.
News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.2.8", package="psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.
16 SessionInfo
This document was prepared using the following settings
> sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33  matrixcalc_1.0-3  MASS_7.3-45   grid_3.3.0     arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0     coda_0.18-1       minqa_1.2.4   mi_1.0         nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18      splines_3.3.0     lme4_1.1-12   sem_3.1-7      tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1  parallel_3.3.0  mnormt_1.5-4
References
Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.
assumed to have a header row of variable labels (header=TRUE). If the data do not have a header row, you must specify read.file(header=FALSE).
To read SPSS files and to keep the value labels, specify use.value.labels=TRUE. Unfortunately, this will make it difficult to do any analyses, and you might want to read the file again, converting the value labels to numeric values.
R code
my.spss <- read.file(use.value.labels=TRUE)  #this will keep the value labels for .sav files
my.data <- read.file()  #just read the file and convert the value labels to numeric
You can quickView the my.spss object to see what it looks like, as well as the my.data object to see the numeric encoding.
R code
quickView(my.spss)  #to see the original encoding
quickView(my.data)  #to see the numeric encoding
4.2 Data input from the clipboard
There are, of course, many ways to enter data into R. Reading from a local file using read.file is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:
read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).
For example, given a data set copied to the clipboard from a spreadsheet, just enter the command:
R code
my.data <- read.clipboard()
This will work if every data field has a value, and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.
R code
my.data <- read.clipboard(sep="\t")   #define the tab option, or
my.tab.data <- read.clipboard.tab()   #just use the alternative function
For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column each, and the last 4 are 3 columns).
R code
my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))
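read.clipboard.fwf hands the widths argument on to base R's read.fwf, so the same call can be tried without a clipboard by reading from an in-memory connection. The 24-character record below is made up purely to match the widths in the example above:

```r
# Hypothetical record: 5 cols, 2 cols, five 1-col fields, four 3-col fields
con <- textConnection("100012112345100200300400")
my.data <- read.fwf(con, widths = c(5, 2, rep(1, 5), rep(3, 4)))
my.data   # one row, 11 columns: 10001, 21, 1, 2, 3, 4, 5, 100, 200, 300, 400
```

read.fwf opens and closes the connection itself, so no explicit close() is needed here.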
4.3 Basic descriptive statistics
Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.
describe reports means, standard deviations, medians, min, max, range, skew, kurtosis and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.
It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data cannot be overemphasized.
> library(psych)
> data(sat.act)
> describe(sat.act)  #basic descriptive statistics
          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41
4.4 Simple descriptive graphics
Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMs) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
4.4.1 Scatter Plot Matrices
Scatter Plot Matrices (SPLOMs) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (See Figure 1 for an example).
pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.
4.4.2 Correlational structure
There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.
> lowerCor(sat.act)
          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00
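What lowerMat does for display purposes can be sketched in a few lines of base R; lower.mat below is a hypothetical stand-in for illustration, not the psych function itself:

```r
# Round a correlation matrix and blank out the redundant upper triangle
lower.mat <- function(R, digits = 2) {
  R <- round(R, digits)
  R[upper.tri(R)] <- NA        # NA prints as blank with na.print = ""
  print(R, na.print = "")
  invisible(R)
}

R <- cor(cbind(x = 1:10,
               y = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9),
               z = 10:1))
m <- lower.mat(R)   # shows only the lower triangle
```

The invisible return mirrors the convention, noted above for lowerCor, of returning the full matrix for further use while printing only the reduced display.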
When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:
> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])
> png('pairspanels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1
Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.
          edctn age   ACT   SATV  SATQ
education 1.00
age       0.61  1.00
ACT       0.16  0.15  1.00
SATV      0.02 -0.06  0.61  1.00
SATQ      0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV  SATQ
education 1.00
age       0.52  1.00
ACT       0.16  0.08  1.00
SATV      0.07 -0.03  0.53  1.00
SATQ      0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower, upper)
> round(both, 2)
          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA
It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:
> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)
          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA
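The idea behind lowerUpper and its diff option can also be sketched in base R; lower.upper here is a hypothetical stand-in, not the psych function:

```r
# Show one group's correlations below the diagonal and another group's
# (or the group difference) above the diagonal
lower.upper <- function(lower, upper, diff = FALSE) {
  if (diff) upper <- lower - upper
  both <- lower
  both[upper.tri(both)] <- upper[upper.tri(upper)]
  diag(both) <- NA       # the diagonal carries no comparison information
  both
}

nm <- c("education", "age")
m1 <- matrix(c(1, 0.61, 0.61, 1), 2, 2, dimnames = list(nm, nm))  # males
m2 <- matrix(c(1, 0.52, 0.52, 1), 2, 2, dimnames = list(nm, nm))  # females
round(lower.upper(m1, m2), 2)                # 0.61 below, 0.52 above
round(lower.upper(m1, m2, diff = TRUE), 2)   # difference 0.09 above
```

The example values match the education-age correlations for the two groups shown above.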
4.4.3 Heatmap displays of correlational structure
Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> corPlot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1
Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main='24 variables in a circumplex')
> dev.off()
null device
          1
Figure 3: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
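A circumplex of the kind sim.circ generates can be sketched directly: give each variable cosine and sine loadings on two latent factors at evenly spaced angles around a circle, so that correlations fall off and then rise again with angular separation. The settings below are illustrative, not the sim.circ defaults:

```r
set.seed(42)
n.var <- 24; n.obs <- 500
theta <- seq(0, 2 * pi, length.out = n.var + 1)[1:n.var]  # evenly spaced angles
f1 <- rnorm(n.obs); f2 <- rnorm(n.obs)                    # two latent factors
X <- sapply(theta, function(a) cos(a) * f1 + sin(a) * f2 + rnorm(n.obs))
r.circ <- cor(X)
# Variable 1 vs: an adjacent variable (high), one 90 degrees away (near 0),
# and one 180 degrees away (negative)
round(r.circ[1, c(2, 7, 13)], 2)
```

Plotting r.circ with a heat map reproduces the banded pattern described in the Figure 3 caption.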
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
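The underestimation that motivates the tetrachoric correlation is easy to demonstrate by simulation: correlate two continuous variables, then dichotomize them and compute the Pearson (φ) on the dichotomized versions. A base-R sketch, with an arbitrary cut at zero:

```r
set.seed(17)
n <- 10000
latent <- rnorm(n)
x <- latent + rnorm(n)          # two observed continuous variables
y <- latent + rnorm(n)          # sharing a latent source (r about .5)
r.continuous <- cor(x, y)
# Dichotomize at zero, as if these were pass/fail ability items
phi <- cor((x > 0) + 0, (y > 0) + 0)
round(c(continuous = r.continuous, phi = phi), 2)   # phi is noticeably smaller
```

For median cuts of a bivariate normal, φ is roughly (2/π) arcsin(ρ), about 0.33 when ρ = 0.5, which is what the tetrachoric correlation corrects back toward ρ.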
Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.
If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
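The smoothing steps just described can be sketched in base R: take the eigen decomposition, floor the offending eigen values at a small positive number, rescale so they again sum to the number of variables, and rebuild the matrix. smooth.cor below is a hypothetical stand-in for psych's cor.smooth, not its actual implementation:

```r
smooth.cor <- function(R, eps = 1e-6) {
  e <- eigen(R, symmetric = TRUE)
  if (min(e$values) >= eps) return(R)     # already positive definite
  vals <- pmax(e$values, eps)             # floor the negative eigen values
  vals <- vals * ncol(R) / sum(vals)      # rescale to sum to n variables
  R.s <- e$vectors %*% diag(vals) %*% t(e$vectors)
  cov2cor(R.s)                            # restore a unit diagonal
}

# An improper "correlation" matrix: its smallest eigen value is negative,
# as can happen with pairwise deletion or tetrachoric estimates
R <- matrix(c(1, 0.9, 0.1,
              0.9, 1, 0.9,
              0.1, 0.9, 1), 3, 3)
min(eigen(R)$values)               # negative: not positive semi-definite
min(eigen(smooth.cor(R))$values)   # positive after smoothing
```

The congruence transform in cov2cor preserves positive definiteness, so the smoothed matrix is a legitimate correlation matrix.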
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

¹ Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.
At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
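The Singular Value Decomposition view of data reduction can be illustrated directly: keeping only the first few singular values gives the best low-rank approximation of the data matrix in a least squares sense. A base-R sketch with data simulated from two latent dimensions:

```r
set.seed(1)
fx <- matrix(rnorm(100 * 2), 100, 2)        # scores on 2 latent dimensions
load <- matrix(rnorm(2 * 6), 2, 6)          # arbitrary loadings on 6 variables
X <- scale(fx %*% load + 0.3 * matrix(rnorm(100 * 6), 100, 6))
s <- svd(X)
k <- 2
X.hat <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])  # rank-2 reconstruction
# Proportion of the total sum of squares retained by the first k singular values
retained <- sum(s$d[1:k]^2) / sum(s$d^2)
round(retained, 2)
```

Because the data were built from two dimensions plus modest noise, the rank-2 reconstruction retains most of the total sum of squares, which is the sense in which two components summarize six variables.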
Several of the functions in psych address the problem of data reduction:
fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.
principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.
factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

² Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
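The congruence coefficient computed by factor.congruence, described above, is just the cross product of two loading vectors divided by the square roots of their sums of squares; unlike a correlation, the mean loading is not subtracted first. A base-R sketch (congruence is a hypothetical helper, not the psych function):

```r
# Cosine of the angle between columns of two loading matrices
congruence <- function(A, B) {
  A <- as.matrix(A); B <- as.matrix(B)
  crossprod(A, B) / sqrt(outer(colSums(A^2), colSums(B^2)))
}

f1 <- c(0.80, 0.70, 0.60, 0.10, 0.00)   # a factor from one solution
f2 <- c(0.79, 0.72, 0.58, 0.12, 0.02)   # nearly the same factor, re-estimated
f3 <- c(0.00, 0.10, 0.00, 0.70, 0.80)   # a different factor
round(congruence(cbind(f1), cbind(f2)), 2)   # near 1 for matching factors
round(congruence(cbind(f1), cbind(f3)), 2)   # near 0 for unrelated factors
```

Because the means are not removed, two factors with uniformly high loadings will show high congruence even when their correlation is modest, which is the distinction the text draws.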
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix ₙRₙ be approximated by the product of a factor matrix ₙFₖ and its transpose plus a diagonal matrix of uniquenesses?

R = FF′ + U²   (1)
The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych, all of them included in the fa function and called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
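Equation 1 can be checked numerically with factanal from the stats package: the loadings and uniquenesses of a maximum likelihood solution approximately reproduce the observed correlation matrix. A sketch using one of R's built-in data sets (the choice of mtcars variables is arbitrary):

```r
R <- cor(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
f <- factanal(covmat = R, factors = 2, n.obs = 32)
L <- loadings(f)                 # the factor matrix F
U2 <- diag(f$uniquenesses)       # diagonal matrix of uniquenesses, U^2
R.hat <- L %*% t(L) + U2         # R is approximated by FF' + U^2
max(abs(R - R.hat))              # off diagonal residuals are small
```

Note that FF′ is unchanged by orthogonal rotation, so the default varimax rotation does not affect the reconstruction.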
A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", and "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
5.1.2 Principal Axis Factor Analysis
An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
The solutions from the fa, the factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)

[Figure 4 appears here: a SPLOM of the MR1, MR2, and MR3 factor loadings, titled "Factor Analysis", with each axis running from 0.0 to 0.8.]
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone, 3, n.obs = 213, fm="wls")
> print(f3w, cut=0, digits=3)
Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)
[Figure 5 here: "Factor Analysis" path diagram. Sentences, Vocabulary and Sent.Completion load on MR1 (0.9, 0.9, 0.8); First.Letters, 4.Letter.Words and Suffixes load on MR2 (0.9, 0.7, 0.6); Letter.Series, Letter.Group and Pedigrees load on MR3 (0.8, 0.6, 0.5); the factor intercorrelations are 0.6, 0.5 and 0.5.]
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solution, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components analysis (PCA)
An alternative to factor analysis, which unfortunately is frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases: factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics based upon the residual off diagonal elements are much worse than for the fa solution.
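The unrotated component loadings that are then rotated are simply the eigenvectors of the correlation matrix rescaled by the square roots of the eigenvalues. A minimal sketch (the function name here is illustrative, not part of psych):

```r
# Unrotated principal component loadings from a correlation matrix R:
# eigenvectors scaled by the square roots of their eigenvalues, so the
# sum of squared loadings of each component equals its eigenvalue.
pca.loadings <- function(R, nfactors = 1) {
  e <- eigen(R)
  L <- sweep(e$vectors[, 1:nfactors, drop = FALSE], 2,
             sqrt(e$values[1:nfactors]), "*")
  rownames(L) <- colnames(R)
  L
}
```

Because all of the variance (not just the common variance) contributes to the eigenvalues, these loadings are larger than the corresponding factor loadings.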
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general, factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor as affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings, as well as the loadings on the residualized lower order factors, using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone, 3, n.obs = 213, rotate = "Promax")
> p3p

Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2  com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general and group factors.
> om.h <- omega(Thurstone, n.obs = 213, sl = FALSE)
> op <- par(mfrow = c(1, 1))
[Figure 6 here: "Omega" higher order diagram. The nine Thurstone variables load on the group factors F1, F2 and F3, which in turn load on the general factor g (0.8, 0.8, 0.7).]
Figure 6: A higher order factor solution to the Thurstone 9 variable problem.
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
> om <- omega(Thurstone, n.obs = 213)
[Figure 7 here: "Omega" bifactor diagram. Each of the nine variables loads on the general factor g (loadings 0.5 to 0.7) as well as on one of the residualized group factors F1, F2 or F3 (loadings 0.2 to 0.6).]
Figure 7: A bifactor solution to the Thurstone 9 variable problem.
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited application to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known, somewhat pretentiously, as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1 Find the proximity (eg correlation) matrix
2 Identify the most similar pair of items
3 Combine this most similar pair of items to form a new variable (cluster)
4 Find the similarity of this cluster to all other items and clusters
5 Repeat steps 2 and 3 until some criterion is reached (eg typically if only one cluster remains or in iclust if there is a failure to increase reliability coefficients α or β)
6 Purify the solution by reassigning items to the most similar cluster center
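The loop above can be sketched directly in R. This is an illustrative toy, not the psych implementation: it uses the mean inter-item correlation as the cluster similarity and simply merges until one cluster remains, whereas iclust stops when merging would fail to increase α or β and then purifies the solution.

```r
# Toy agglomerative item clustering over a correlation matrix R
# (R must be symmetric with row/column names).
cluster.sketch <- function(R, verbose = TRUE) {
  clusters <- as.list(colnames(R))              # step 1: proximities are in R
  while (length(clusters) > 1) {
    k <- length(clusters)
    sim <- matrix(-Inf, k, k)
    for (i in 1:(k - 1)) for (j in (i + 1):k)   # step 4: cluster similarity
      sim[i, j] <- mean(R[clusters[[i]], clusters[[j]]])
    best <- which(sim == max(sim), arr.ind = TRUE)[1, ]  # step 2: closest pair
    if (verbose) cat("merging:", clusters[[best[1]]], "+",
                     clusters[[best[2]]], "\n")
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])  # step 3
    clusters[[best[2]]] <- NULL
    # step 5: iclust would stop here if alpha or beta failed to increase
  }
  clusters[[1]]
}
```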
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])
[Figure 8 here: "ICLUST" dendrogram of the 25 bfi items, showing the α and β of each intermediate cluster as items are successively merged; the final clusters are C20 (Extraversion and Agreeableness items, α = 0.81, β = 0.63), C16 (Neuroticism items, α = 0.81, β = 0.76), C15 (Conscientiousness items, α = 0.73, β = 0.67) and C21 (Openness items, α = 0.41, β = 0.27).]
Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (rpoly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis show three large clusters and one smaller cluster.

> summary(ic)  # show the results

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
       C20    C16   C15   C21
C20   0.80 -0.291  0.40 -0.33
C16  -0.24  0.815 -0.29  0.11
C15   0.30 -0.221  0.73 -0.30
C21  -0.23  0.074 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> rpoly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
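The bootstrap idea itself is simple enough to sketch by hand: resample rows with replacement, refit, and take quantiles of the loadings. This toy (the function name is illustrative) ignores the alignment of factor order and sign across replications, which fa with n.iter handles internally:

```r
library(psych)  # for fa()

# Percentile bootstrap of factor loadings (toy version).
boot.loadings <- function(x, nfactors = 2, n.iter = 20) {
  reps <- replicate(n.iter, {
    xs <- x[sample(nrow(x), replace = TRUE), ]
    unclass(fa(xs, nfactors)$loadings)
  })
  # 2.5% and 97.5% quantiles for each loading
  apply(reps, c(1, 2), quantile, probs = c(.025, .975))
}
```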
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can, in turn, be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}$$
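The coefficient can be computed directly from two loading matrices; a minimal sketch (the function name is illustrative; factor.congruence in psych does this, with additional handling of missing values and of lists of solutions):

```r
# Congruence coefficients between the columns (factors) of two loading
# matrices F1 (p x n1) and F2 (p x n2): cross products of the columns,
# normalized by the column lengths.
congruence <- function(F1, F2) {
  t(F1) %*% F2 / sqrt(outer(colSums(F1^2), colSums(F2^2)))
}
```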
Consider the case of a four factor solution and four cluster solution to the Big Five problem.
> f4 <- fa(bfi[1:25], 4, fm = "pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(rpoly$rho, title = "ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)
[Figure 9 here: "ICLUST using polychoric correlations" dendrogram of the 25 bfi items, with the α and β of each intermediate cluster shown at each merge; the top level cluster is C23 (α = 0.83, β = 0.58).]
Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(rpoly$rho, 5, title = "ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)
[Figure 10 here: "ICLUST using polychoric correlations for nclusters=5" dendrogram; the solution stops at five clusters (C20, C19, C16, C15 and C7), with O4 left unclustered.]
Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(rpoly$rho, beta.size = 3, title = "ICLUST beta.size=3")
[Figure 11 here: "ICLUST beta.size=3" dendrogram; with the beta criterion set to 3 the items again merge up to the top level cluster C23 (α = 0.83, β = 0.58).]
Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic, cut = .3)

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter = 20)

Factor Analysis with confidence intervals using method = fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2  com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90 % confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))

     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include
1) Extracting factors until the chi square of the residual matrix is not significant
2) Extracting factors until the change in chi square from factor n to factor n+1 is notsignificant
3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965)
4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966)
5) Extracting factors as long as they are interpretable
6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979)
7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976)
8) Extracting principal components until the eigen value < 1
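Criterion 3 (parallel analysis) is easy to sketch for the principal components case; fa.parallel does this for both components and factors, and adds resampling of the real data. The helper name here is illustrative:

```r
# Toy parallel analysis: retain components whose observed eigen values
# exceed the mean eigen value of random normal data of the same size.
parallel.sketch <- function(x, n.iter = 20) {
  obs <- eigen(cor(x))$values
  sim <- replicate(n.iter,
    eigen(cor(matrix(rnorm(nrow(x) * ncol(x)), nrow(x))))$values)
  sum(obs > rowMeans(sim))   # suggested number of components
}
```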
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size, in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix, after the factors of interest are extracted, is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).
Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
[Figure 12 here: "Very Simple Structure of a Big 5 inventory" plot. Very Simple Structure fit (0.0 to 1.0) is plotted against the number of factors (1 to 8), with separate curves for item complexities 1, 2, 3 and 4.]
Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss
Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option set to "poly". By default, the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25], main = "Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6
[Figure 13 here: "Parallel Analysis of a Big 5 inventory" scree plot. Eigen values of principal components and factor analysis are plotted against factor/component number (1 to 25) for the actual, simulated, and resampled data.]
Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation, found by set.cor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\mathbf{V}_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\mathbf{V}_x)}{V_x} = \alpha$$
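Computed from an item covariance matrix, this is a one-liner. A sketch (the function name is illustrative; the alpha function in psych reports this along with λ6 and item statistics):

```r
# Coefficient alpha (Guttman's lambda 3) from an item covariance matrix C.
lambda3 <- function(C) {
  n <- ncol(C)
  Vx <- sum(C)   # variance of the total score: sum of all covariances
  (n / (n - 1)) * (1 - sum(diag(C)) / Vx)
}
```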
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation, or smc), or, more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}$$
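For standardized items, λ6 can be sketched directly from a correlation matrix, using the standard identity that each item's smc is one minus the reciprocal of the corresponding diagonal element of the inverse correlation matrix (the function name is illustrative; psych's smc computes the same quantity):

```r
# Guttman's lambda 6 from a correlation matrix R.
lambda6 <- function(R) {
  smc <- 1 - 1 / diag(solve(R))   # squared multiple correlation of each item
  1 - sum(1 - smc) / sum(R)       # sum(R) is Vx for standardized items
}
```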
The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal, or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[,s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe = fe)
[Figure 14 here: "Factor analysis and extension" diagram. The eight odd numbered variables define MR1 and MR2 (loadings ±0.5 to 0.6); the eight even numbered variables are extended onto those factors (loadings ±0.5 to 0.7).]
Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6 and G6* are positive functions of the number of items in a test, as well as of the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lower bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and an n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds
total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates if one item is dropped at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)
Reliability analysis
Call: alpha(x = r9)
 raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.8       0.8     0.8      0.31   4 0.013 0.017 0.63
 lower alpha upper 95% confidence boundaries
  0.77   0.8  0.83
Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013
Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reversed scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)
 raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
      0.43      0.52    0.75      0.14 1.1 0.13   58 5.5
 lower alpha upper 95% confidence boundaries
  0.18  0.43  0.68
Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171
Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)
 raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
      0.84      0.84    0.88      0.43 5.2 0.042   60 8.2
 lower alpha upper 95% confidence boundaries
  0.76  0.84  0.93
Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043
Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures
Reliability analysis
Call: alpha(x = items$observed)
 raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
      0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73
 lower alpha upper 95% confidence boundaries
  0.68  0.72  0.76
Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022
Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02
Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (See Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the
method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.
There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab, where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
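A minimal sketch of the two factor case (assuming the r9 data simulated earlier in this section; the option names are as described above):

```r
# Two factor omega solutions: the general factor may be defined by the
# "equal" loading constraint (the default) or equated with a group factor
om.equal <- omega(r9, nfactors = 2)                   # option = "equal"
om.first <- omega(r9, nfactors = 2, option = "first") # g = first group factor
om.equal$omega_h   # omega hierarchical under each (not well defined) model
om.first$omega_h
```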
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness (u^2) from factor analysis to find e_j^2. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)
Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h_j^2 = c_j^2 + Σ f_ij^2, and the unique variance for the item, u_j^2 = σ_j^2 (1 − h_j^2), may be used to estimate the test reliability. That is, if h_j^2 is the communality of item j, based upon general as well as group factors, then for standardized items, e_j^2 = 1 − h_j^2 and

    ωt = (1cc′1′ + 1AA′1′) / Vx = 1 − Σ(1 − h_j^2) / Vx = 1 − Σ u^2 / Vx
Because h_j^2 ≥ r_smc^2, ωt ≥ λ6.
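The last equality above can be checked by hand from any factor solution. The following is a sketch (assuming the r9 data simulated earlier): communalities do not depend on the rotation, and for standardized items Vx is the sum of the correlation matrix.

```r
# omega total as 1 - sum(u2)/Vx, using the uniquenesses of a 3 factor solution
f3 <- fa(r9, nfactors = 3)
Vx <- sum(cor(r9))
1 - sum(f3$uniquenesses) / Vx   # compare with the Omega Total from omega(r9)
```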
It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.
    ωh = 1cc′1′ / Vx
Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

    ωinf = 1cc′1′ / (1cc′1′ + 1AA′1′)
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om9
9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83
Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63
With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36
general/max  4.76   max/min =  1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65
The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
> om9 <- omega(r9,title="9 simulated variables")
[Figure 15 path diagram: "9 simulated variables" — V1–V9 load on group factors F1, F2, and F3 (loadings 0.2 to 0.5) and on the general factor g (loadings 0.3 to 0.7).]
Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71
Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09
RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97
Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41
Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9,n.obs=500)
Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83
Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63
With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36
general/max  4.76   max/min =  1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65
The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71
Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09
RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97
Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41
Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g   F1   F2   F3   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83
With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.
> splitHalf(r9)
Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
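The reverse scoring rule itself is simple arithmetic; a small sketch with hypothetical responses to an item scored on a 1 to 6 scale:

```r
# Reverse scoring: subtract the response from max + min of the response scale
item <- c(1, 2, 6)           # three hypothetical responses
reversed <- (6 + 1) - item   # scale maximum (6) plus minimum (1)
reversed                     # 6 5 1
```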
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged
to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copied into R.) Although not normally necessary, show the keys to understand what is happening.
Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend, for the moment, that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and for the demographic data.
This use of making multiple key matrices, and then combining them into one super matrix of keys, is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made, and then these can be combined with the keys for the particular scales.
Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores
Call: scoreItems(keys = keys, items = bfi)
(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6
Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017
Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23
Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6
Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5
Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data
print with the short option = FALSE
To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See figure 16.)
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.
Consider the same bfi data set, but first find the correlations, and then use cluster.cor.
> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)
Call: cluster.cor(keys = keys, r.mat = r.bfi)
Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)
Call: score.multiple.choice(key = iq.keys, data = iqitems)
(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25
item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results
          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
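For example, a sketch (assuming iq.keys as defined above; the grouping of the 16 items into four content subscales is for illustration only):

```r
# Score the 0/1 ability items into hypothetical content subscales
iq.tf <- score.multiple.choice(iq.keys, iqitems, score = FALSE)
sub.keys <- make.keys(16, list(reason = 1:4, letter = 5:8,
                               matrix = 9:12, rotate = 13:16))
scoreItems(sub.keys, iq.tf)   # alpha, G6, and intercorrelations per subscale
```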
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait, or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
The use of Item Response Theory has become what is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores.
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)
[Figure 17 here: four panels (reason.4, reason.16, reason.17, reason.19), each plotting P(response) for response alternatives 0-6 as a function of theta.]
Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:
δj = Dτj / √(1 − λj²)        αj = λj / √(1 − λj²)        (2)
where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just
λj = αj / √(1 + αj²)        τj = δj / √(1 + αj²)
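These transformations are easy to verify numerically. The following sketch is in Python rather than R, with made-up values; the helper names are mine, not part of psych. It converts a factor loading and threshold to normal-metric IRT parameters via Equation 2 and then inverts the transformation:

```python
import math

def fa_to_irt(lam, tau, D=1.0):
    """Equation 2: convert a factor loading (lambda) and tetrachoric
    threshold (tau) to IRT difficulty (delta) and discrimination (alpha).
    D = 1 gives the normal metric; D = 1.702 the logistic metric."""
    denom = math.sqrt(1.0 - lam ** 2)
    return D * tau / denom, lam / denom

def irt_to_fa(delta, alpha):
    """The inverse relations: recover the loading and threshold from
    normal-metric IRT parameters."""
    denom = math.sqrt(1.0 + alpha ** 2)
    return alpha / denom, delta / denom

# round trip with an arbitrary loading of .6 and threshold of .5
delta, alpha = fa_to_irt(0.6, 0.5)       # normal metric (D = 1)
lam, tau = irt_to_fa(delta, alpha)
print(round(lam, 6), round(tau, 6))      # recovers 0.6 and 0.5
```

The round trip works because 1 + α² = 1/(1 − λ²), so the two square-root factors cancel.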
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.
> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)
> test
Item Response Analysis using Factor Analysis
Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
               -3    -2    -1     0     1     2     3
V1           0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2           0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3           0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4           0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5           0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6           0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7           0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8           0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9           0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info    0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM          1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02  0.53  0.65  0.68  0.64  0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90 % confidence intervals are 0.198 0.218
BIC = 1009.07
Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:
> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
               -3    -2    -1     0     1     2     3
E1           0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2           0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3           0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4           0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5           0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info    2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM          0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability  0.53  0.70  0.76  0.75  0.68  0.49  0.02
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90 % confidence intervals are 0.083 0.111
BIC = 96.23
The item information functions show that not all of the items are equally good (Figure 19):
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))
[Figure 18 here: three panels plotted against the Latent Trait (normal scale): "Item parameters from factor analysis" (probability of response for items V1-V9), "Item information from factor analysis", and "Test information -- item parameters from factor analysis" (with reliability on the right axis).]
Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt,type="IIC")
[Figure 19 here: "Item information from factor analysis for factor 1": item information curves for -E1, -E2, E3, E4, and E5 plotted against the Latent Trait (normal scale).]
Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info,sort=TRUE)
Item Response Analysis using Factor Analysis
Summary information by factor and item
 Factor = 1
               -3    -2    -1     0     1     2     3
E2           0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4           0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1           0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3           0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5           0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info    2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM          0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability  0.53  0.70  0.76  0.75  0.68  0.49  0.02
More extensive IRT packages include ltm and eRm, which should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 21).
> iq.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
               -3    -2    -1     0     1     2     3
reason.4     0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16    0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17    0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19    0.05  0.17  0.41  0.46  0.22  0.07  0.02
> iq.irt <- irt.fa(iq.tf)
[Figure 20 here: "Item information from factor analysis": item information curves for the 16 ability items (reason.4 through rotate.8) plotted against the Latent Trait (normal scale).]
Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7     0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33    0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34    0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58    0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45    0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46    0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47    0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55    0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3     0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4     0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6     0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8     0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info    0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM          1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80  0.49  0.80  0.85  0.82  0.61 -0.40
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90 % confidence intervals are 0.126 0.135
BIC = 2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.
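The idea can be illustrated outside of psych. Below is a minimal Python sketch, not the score.irt implementation: the item parameters and the grid-based EAP (posterior mean) estimator are simplified assumptions of mine. It shows why IRT scoring separates two subjects whom a simple count of passed items would treat identically, when they answered items of very different difficulty:

```python
import math

def p2pl(theta, alpha, delta):
    """2PL probability of endorsement (normal-metric parameters,
    logistic approximation with the 1.702 scaling)."""
    return 1.0 / (1.0 + math.exp(-1.702 * alpha * (theta - delta)))

def eap_score(responses, alphas, deltas, lo=-4.0, hi=4.0, n=161):
    """EAP estimate of theta: posterior mean over a standard normal
    prior, skipping items coded None (not administered)."""
    step = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        th = lo + i * step
        like = math.exp(-0.5 * th * th)        # normal prior weight
        for r, a, d in zip(responses, alphas, deltas):
            if r is None:
                continue                        # item was not given
            p = p2pl(th, a, d)
            like *= p if r == 1 else (1.0 - p)
        num += th * like
        den += like
    return num / den

alphas = [1.0] * 9
deltas = [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2]
# two subjects, each passing 2 of 3 administered items -- but one saw
# the easiest items and the other the hardest
easy = [1, 1, 0] + [None] * 6
hard = [None] * 6 + [1, 1, 0]
print(eap_score(easy, alphas, deltas) < eap_score(hard, alphas, deltas))
```

Both subjects have the same total score (2 of 3), yet the EAP estimate is much higher for the subject who passed the hard items, which is the point of IRT-based scoring for sampled-item designs.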
Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.
> v9 <- sim.irt(9,1000,-2,2,mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho,4)
[Figure 21 here: the omega path diagram for the 16 ability items, showing loadings (roughly 0.4 to 0.7) on a general factor g and loadings on four group factors F1-F4.]
Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")
These results are seen in Figure 22
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.
This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:
rxy = ηxwg · ηywg · rxywg + ηxbg · ηybg · rxybg        (3)
Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.

> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)
[Figure 22 here: a pairs.panels display of True theta, irt theta, total, fit, rasch, total, and fit, with scatter plots below the diagonal and correlations above; the title is "Comparing true theta for IRT, Rasch and classically based scoring".]
where rxy is the normal correlation, which may be decomposed into a within group and a between group correlation, rxywg and rxybg, and η (eta) is the correlation of the data with the within group values or with the group means.
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, and -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.
The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
R² = 1 − ∏_{i=1}^{n} (1 − λi)
where λi is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = Rxx⁻¹ Rxy Ryy⁻¹ Rxy′.
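Since ∏(1 − λi) = det(I − M) for M = Rxx⁻¹ Rxy Ryy⁻¹ Ryx, the set correlation can be computed without an explicit eigendecomposition. A minimal Python sketch with illustrative made-up values (not the sat.act data):

```python
# 2 predictors (x) and 2 criteria (y): a correlation matrix partitioned
# into Rxx, Rxy, Ryy.  The numbers are hypothetical.
Rxx = [[1.0, 0.3], [0.3, 1.0]]
Ryy = [[1.0, 0.2], [0.2, 1.0]]
Rxy = [[0.5, 0.4], [0.3, 0.2]]          # rows index x, columns index y
Ryx = [[Rxy[j][i] for j in range(2)] for i in range(2)]   # transpose

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(A):                            # inverse of a 2x2 matrix
    d = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

# M = Rxx^-1 Rxy Ryy^-1 Ryx: its eigenvalues are the squared canonical
# correlations, and prod(1 - lambda_i) = det(I - M)
M = mat_mul(mat_mul(inv2(Rxx), Rxy), mat_mul(inv2(Ryy), Ryx))
I_minus_M = [[(1.0 if i == j else 0.0) - M[i][j] for j in range(2)]
             for i in range(2)]
R2_set = 1.0 - det2(I_minus_M)
print(round(R2_set, 3))
```

For these values the set correlation R² comes out a bit above .37; in practice setCor does this (and the significance tests) for you from a correlation or covariance matrix.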
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.
setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.
Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.
For this example, the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)
Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.2458  -3.2133   0.7769   3.5921   9.2630 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110    
education    0.47890    0.15235   3.143  0.00174 ** 
age          0.01623    0.02278   0.712  0.47650    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272,  Adjusted R-squared:  0.02301 
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476
Compare this with the output from setCor:
> #compare with mat.regress
> setCor(c(4:6),c(1:3),C, n.obs=700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights 
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R 
 ACT SATV SATQ 
0.16 0.10 0.19 

multiple R2 
   ACT   SATV   SATQ 
0.0272 0.0096 0.0359 

Unweighted multiple R 
 ACT SATV SATQ 
0.15 0.05 0.11 

Unweighted multiple R2 
 ACT SATV SATQ 
0.02 0.00 0.01 
SE of Beta weights 
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights 
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t < 
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2 
   ACT   SATV   SATQ 
0.0230 0.0054 0.0317 

Standard Error of R2 
   ACT   SATV   SATQ 
0.0120 0.0073 0.0137 

F 
 ACT SATV SATQ 
6.49 2.26 8.63 

Probability of F < 
     ACT     SATV     SATQ 
2.48e-04 8.08e-02 1.24e-05 
degrees of freedom of regression 
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations 
[1] 0.050 0.033 0.008
Chisq of canonical correlations 
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohens Set Correlation R2 =  0.09
Shrunken Set Correlation R2 =  0.08
F and df of Cohens Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets =  0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R² is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.VSS. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette, psych for sem.
The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
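Such a simplex correlation matrix, with r^|i−j| between variables i and j, is easy to construct directly. A short Python illustration with an arbitrary r of .8 and six variables (values are my own, for illustration only):

```python
# build a simplex correlation matrix: corr(i, j) = r ** |i - j|
r, nvar = 0.8, 6
simplex = [[round(r ** abs(i - j), 3) for j in range(nvar)]
           for i in range(nvar)]
for row in simplex:
    print(row)
# adjacent variables correlate r = .8; variables 3 apart correlate r^3 = .512
```

The correlations fall off geometrically away from the diagonal, which is what makes the structure hard for ordinary exploratory factor analysis to recover as a single latent variable.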
Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
The logistic model is
P(x|θi, δj, γj, ζj) = γj + (ζj − γj) / (1 + e^{αj(δj − θi)})        (4)
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), αj is item discrimination and δj is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
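A direct transcription of Equation 4 makes the roles of the four parameters concrete. This Python sketch (my own helper, not a psych function) evaluates the 4PL response probability and its Rasch special case:

```python
import math

def irt_4pl(theta, delta, alpha=1.0, gamma=0.0, zeta=1.0):
    """Equation 4: probability of endorsement under the 4-parameter
    logistic model.  gamma is the guessing floor, zeta the ceiling,
    alpha the discrimination, delta the difficulty."""
    return gamma + (zeta - gamma) / (1.0 + math.exp(alpha * (delta - theta)))

# Rasch special case: gamma = 0, zeta = 1, alpha = 1; only difficulty varies
print(irt_4pl(0.0, 0.0))                          # 0.5 when theta equals delta
print(irt_4pl(-3.0, 0.0, gamma=0.25) > 0.25)      # guessing floor holds
```

With the defaults this reduces exactly to the Rasch model, and setting gamma above zero shows why low-ability examinees still have a nonzero chance of a correct response on a multiple choice item.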
(Graphics of these may be seen in the demonstrations for the logistic function)
The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
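The closeness of the two parameterizations is easy to confirm: with the 1.702 scaling, the logistic curve and the normal ogive never differ by more than about .01. A quick Python check (the grid and tolerance are my own choices):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# maximum absolute difference between the normal ogive and the
# logistic with the 1.702 scaling, over a grid of trait values
max_diff = max(abs(normal_cdf(x) - logistic(1.702 * x))
               for x in [i / 100.0 for i in range(-400, 401)])
print(max_diff < 0.01)   # the two curves differ by less than .01
```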
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as by specific diagram functions, e.g. (but not shown):
f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)    #a pretty boring plot
iclust.diagram(c)   #a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)  #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels="left to right")
> dia.curve(ll$top,ul$bottom,"double")  #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")  #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headTail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headTail: combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r and estimates the effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems.
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and there are 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.
Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.
bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative, and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iq 14 multiple choice ability items were included as part of the Synthetic Aperture Person-ality Assessment (SAPA) web based personality assessment project The data from1000 subjects are included here as a demonstration set for scoring multiple choiceinventories and doing basic item statistics
galton Two of the earliest examples of the correlation coefficient were Francis Galtonrsquosdata sets on the relationship between mid parent and child height and the similarity ofparent generation peas with child peas galton is the data set for the Galton heightpeas is the data set Francis Galton used to ntroduce the correlation coefficient withan analysis of the similarities of the parent and child generation of 700 sweet peas
Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.
miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
1.4 Development version and a user's guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.
Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf
News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,
> news(Version > "1.2.8",package="psych")
1.5 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.
For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.
1.6 SessionInfo
This document was prepared using the following settings
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4    lattice_0.20-33       matrixcalc_1.0-3  MASS_7.3-45  grid_3.3.0   arm_1.8-6
 [7] nlme_3.1-127   stats4_3.3.0          coda_0.18-1       minqa_1.2.4  mi_1.0       nloptr_1.0.4
[13] Matrix_1.2-6   boot_1.3-18           splines_3.3.0     lme4_1.1-12  sem_3.1-7    tools_3.3.0
[19] abind_1.4-3    GPArotation_2014.11-1 parallel_3.3.0    mnormt_1.5-4
References
Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405-432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439-458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447-473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245-276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78-98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173-178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430-450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255-282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121-132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41-54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179-185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283-300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537-549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231-258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309-317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153-175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676-1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481-495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57-74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39-73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403-414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27-49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145-154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83-90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420-428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345-353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components--an alternative to "mathematical factors." Psychological Review, 42(5):425-454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321-327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123-133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121-144.
were just empty cells, then the data should be read in as tab delimited, or by using the read.clipboard.tab function.
R code
mydata <- read.clipboard(sep="\t")  #define the tab option, or
mytabdata <- read.clipboard.tab()   #just use the alternative function
For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, the last 4 are 3 columns).
R code
mydata <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))
4.3 Basic descriptive statistics
Once the data are read in, then describe will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.
describe reports means, standard deviations, medians, min, max, range, skew, kurtosis and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.
It is very important to describe your data before you continue on doing more complicated multivariate statistics. The problem of outliers and bad data can not be overemphasized.
> library(psych)
> data(sat.act)
> describe(sat.act)  #basic descriptive statistics
          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41
4.4 Simple descriptive graphics
Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels
function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
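As a minimal sketch of the group-means display just mentioned (assuming the sat.act data set distributed with psych; only default arguments are used):

```r
# Hypothetical illustration: plot SATV and SATQ means broken down by
# gender, with confidence intervals around each group mean.
library(psych)
data(sat.act)
error.bars.by(sat.act[c("SATV","SATQ")], group = sat.act$gender)
```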
4.4.1 Scatter Plot Matrices
Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (See Figure 1 for an example.)
pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.
4.4.2 Correlational structure
There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.
> lowerCor(sat.act)
          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00
When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:
> female <- subset(sat.act,sat.act$gender==2)
> male <- subset(sat.act,sat.act$gender==1)
> lower <- lowerCor(male[-1])
> png('pairspanels.png')
> pairs.panels(sat.act,pch='.')
> dev.off()
null device
          1
Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower,upper)
> round(both,2)
          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA
It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal.
> diffs <- lowerUpper(lower,upper,diff=TRUE)
> round(diffs,2)
          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA
4.4.3 Heatmap displays of correlational structure
Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> cor.plot(Thurstone,numbers=TRUE,main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1
Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ,main='24 variables in a circumplex')
> dev.off()
null device
          1
Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
Other estimated correlations based upon the assumption of bivariate normality with cutpoints include the biserial and polyserial correlation
If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
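A small sketch of the attenuation problem and the tetrachoric remedy; the latent correlation (.6), the cut at the mean, and the seed are arbitrary choices for illustration, not values from the text:

```r
# Dichotomize two bivariate normal variables at their means; the phi
# (Pearson) correlation is attenuated, while the tetrachoric estimate
# recovers something close to the latent .6.
library(psych)
library(MASS)
set.seed(17)
xy <- mvrnorm(5000, mu = c(0, 0), Sigma = matrix(c(1, .6, .6, 1), 2))
d  <- data.frame(x = as.numeric(xy[, 1] > 0), y = as.numeric(xy[, 2] > 0))
cor(d$x, d$y)        # phi: noticeably less than .6
tetrachoric(d)$rho   # estimate of the latent correlation
```

cor.smooth would be applied to a resulting matrix only if it turned out not to be positive semi-definite.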
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)
1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.
The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.
At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
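The SVD/eigen equivalence described above can be sketched directly; this simulation (arbitrary seed and dimensions, not from the text) compares the right singular vectors of the standardized data with the eigenvectors of its correlation matrix:

```r
# The first right singular vector of the scaled data matrix points in
# the same direction as the first eigenvector of R (up to sign).
set.seed(1)
X  <- scale(matrix(rnorm(200 * 4), 200, 4) + rnorm(200))  # 4 correlated variables
v1 <- svd(X)$v[, 1]               # from the singular value decomposition
e1 <- eigen(cor(X))$vectors[, 1]  # from the correlation matrix
round(abs(v1) - abs(e1), 6)       # differences should be essentially zero
```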
Several of the functions in psych address the problem of data reduction
fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm, specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
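Several of the functions listed above can be combined in a few lines; a sketch using the Thurstone data set already introduced (the fm options follow the fa help page):

```r
# Compare minres and principal axis extractions of the same three
# factors, then ask how many factors parallel analysis suggests.
library(psych)
f.mr <- fa(Thurstone, 3, n.obs = 213)              # minres (the default)
f.pa <- fa(Thurstone, 3, n.obs = 213, fm = "pa")   # principal axis
factor.congruence(f.mr, f.pa)        # cosines; diagonal should be near 1
fa.parallel(Thurstone, n.obs = 213)  # observed vs. random eigen values
```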
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the n × n correlation matrix R be approximated by the product of an n × k factor matrix F and its transpose, plus a diagonal matrix of uniquenesses?

R = FF′ + U²                                  (1)
The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done
in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
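Equation 1 can be checked numerically; this sketch uses an unrotated minres solution (with an oblique rotation the model would also require the factor intercorrelations):

```r
# Reconstruct R from the model: FF' + U^2 should be close to the
# observed Thurstone correlation matrix.
library(psych)
f3 <- fa(Thurstone, 3, n.obs = 213, rotate = "none")
implied <- f3$loadings %*% t(f3$loadings) + diag(f3$uniquenesses)
round(Thurstone - implied, 2)   # residuals: mostly zero to two decimals
```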
A classic data set collected by Thurstone and Thurstone (1941), then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
5.1.2 Principal Axis Factor Analysis
An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm = "pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the maxiter parameter to 1 produces the SAS solution.
The solutions from the fa, the factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.
Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).
The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)

[Item by factor scatter plots of the loadings on the three oblique factors MR1, MR2, and MR3, titled "Factor Analysis"]
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone, 3, n.obs = 213, fm = "wls")
> print(f3w, cut = 0, digits = 3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy

                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)

[Factor diagram "Factor Analysis": the nine Thurstone variables with loadings of roughly 0.5 to 0.9 on the three correlated factors MR1, MR2, and MR3, and interfactor correlations of about 0.5 to 0.6]
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solution, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components analysis (PCA)
An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases: factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.
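The total versus common variance distinction can be seen in a few lines of base R. This is a sketch; the 3 x 3 correlation matrix is made up for illustration, not the Thurstone data:

```r
# Hypothetical 3 x 3 correlation matrix (not the Thurstone data)
R <- matrix(c(1.0, 0.6, 0.5,
              0.6, 1.0, 0.4,
              0.5, 0.4, 1.0), 3, 3)
e <- eigen(R)
# Full principal component loading matrix: eigenvectors scaled by the
# square roots of the eigenvalues
L <- e$vectors %*% diag(sqrt(e$values))
# With all components retained, the component loadings account for ALL of
# each variable's variance (h2 = 1); a factor model would leave u2 > 0
rowSums(L^2)   # each element equals 1
```

Truncating L to fewer columns than variables is what principal does; a factor model instead fits only the off diagonal (common) part of R.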
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general, factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor as affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions that produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone, 3, n.obs = 213, rotate = "Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  RC1   RC2   RC3   h2   u2  com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general and group factors.
> om.h <- omega(Thurstone, n.obs = 213, sl = FALSE)
> op <- par(mfrow = c(1, 1))

[Omega diagram: the nine variables load on the group factors F1, F2, and F3, which in turn load 0.8, 0.8, and 0.7 on the higher order factor g]
Figure 6: A higher order factor solution to the Thurstone 9 variable problem.
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
> om <- omega(Thurstone, n.obs = 213)

[Omega diagram: the Schmid-Leiman bifactor solution, with each variable loading on the general factor g (loadings of roughly 0.5 to 0.7) and on one of the residualized group factors F1, F2, or F3]
Figure 7: A bifactor solution to the Thurstone 9 variable problem.
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.
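Steps 1 to 4 can be sketched in base R on simulated data. This is a toy illustration only; the real iclust also evaluates the reliability coefficients α and β at every merge:

```r
set.seed(42)
X <- matrix(rnorm(400), 100, 4)           # four hypothetical items
X[, 2] <- X[, 1] + rnorm(100, sd = 0.5)   # make items 1 and 2 similar
R <- cor(X)                               # step 1: proximity matrix
diag(R) <- NA                             # ignore self-correlations
pair <- which(R == max(R, na.rm = TRUE), arr.ind = TRUE)[1, ]  # step 2
cluster <- rowSums(X[, pair])             # step 3: pool the most similar pair
cor(cluster, X[, -pair])                  # step 4: cluster-item similarities
```

Iterating this merge, and stopping when the pooled scale's reliability no longer improves, is the essence of the algorithm.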
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])

[ICLUST dendrogram of the 25 personality items: four clusters, C20 (the extraversion and agreeableness items), C16 (neuroticism), C15 (conscientiousness), and C21 (openness), with each junction annotated with its α and β]
Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (rpoly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis show three large clusters and one smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations, reliabilities on diagonal, correlations corrected for attenuation above diagonal:

      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> rpoly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
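The underlying bootstrap logic can be sketched in base R with factanal. This is a toy one-factor example; psych's fa with the n.iter option does this, with more care, for the models discussed here:

```r
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 6), n, 6) + rnorm(n)   # six items sharing one factor
boot <- replicate(50, {
  Xb <- X[sample(n, replace = TRUE), ]       # resample rows with replacement
  abs(factanal(Xb, factors = 1)$loadings[1, 1])  # |.| guards against sign flips
})
quantile(boot, c(0.025, 0.975))              # bootstrapped 95% CI for one loading
```

Each replication refits the model to a resampled data set; the spread of the refitted loadings is the confidence interval.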
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = Σ_{k=1}^{n} f_{ik} f_{jk} / √(Σ f_{ik}² Σ f_{jk}²)
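The coefficient is easy to compute directly from this definition. The loading vectors below are hypothetical; factor.congruence applies the same formula to whole loading matrices:

```r
# Congruence coefficient: cosine of the angle between two loading vectors
congruence <- function(f, g) sum(f * g) / sqrt(sum(f^2) * sum(g^2))
f <- c(0.9, 0.8, 0.7, 0.1)   # made-up loadings on one factor
g <- c(0.8, 0.9, 0.6, 0.2)   # made-up loadings on another
congruence(f, g)    # close to 1: nearly the same direction
congruence(f, -g)   # close to -1: a reflected factor
```

Unlike a correlation, the loadings are not centered, so the coefficient measures similarity of direction, not of shape around the mean.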
Consider the case of a four factor solution and a four cluster solution to the Big Five problem.
> f4 <- fa(bfi[1:25], 4, fm = "pa")
> factor.congruence(f4, ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(rpoly$rho, title = "ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[ICLUST dendrogram "ICLUST using polychoric correlations": the 25 items join into a single hierarchy headed by cluster C23 (α = 0.83, β = 0.58), with α and β shown at each junction]
Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(rpoly$rho, 5, title = "ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[ICLUST dendrogram "ICLUST using polychoric correlations for nclusters=5": the solution constrained to five clusters, with O4 left unclustered]
Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(rpoly$rho, beta.size = 3, title = "ICLUST beta.size=3")

[ICLUST dendrogram "ICLUST beta.size=3": the full hierarchy again joins into cluster C23 (α = 0.83, β = 0.58)]
Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, and 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic, cut = .3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5
Item by Cluster Structure matrix:
    O   P     C20   C16   C15   C21
A1  C20 C20
A2  C20 C20  0.59
A3  C20 C20  0.65
A4  C20 C20  0.43
A5  C20 C20  0.65
C1  C15 C15              0.54
C2  C15 C15              0.62
C3  C15 C15              0.54
C4  C15 C15        0.31 -0.66
C5  C15 C15 -0.30  0.36 -0.59
E1  C20 C20 -0.50
E2  C20 C20 -0.61  0.34
E3  C20 C20  0.59             -0.39
E4  C20 C20  0.66
E5  C20 C20  0.50        0.40 -0.32
N1  C16 C16        0.76
N2  C16 C16        0.75
N3  C16 C16        0.74
N4  C16 C16 -0.34  0.62
N5  C16 C16        0.55
O1  C20 C21                   -0.53
O2  C21 C21                    0.44
O3  C20 C21  0.39             -0.62
O4  C21 C21                   -0.33
O5  C21 C21                    0.53
With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations, reliabilities on diagonal, correlations corrected for attenuation above diagonal:

      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter = 20)
Factor Analysis with confidence intervals using method = fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix

     MR2   MR1   h2   u2  com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy

                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))
     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2   RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99  0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01  0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58  0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56  0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06  0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05  1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include
1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the
talus slope of a mountain and approaching the rock face) (Cattell, 1966).
5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
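Criterion 3, parallel analysis, can be sketched in base R for simulated data with two known factors. This is an illustration only; fa.parallel implements the comparison properly for both factors and components:

```r
set.seed(17)
n <- 500; p <- 9
F1 <- rnorm(n); F2 <- rnorm(n)                      # two true factors
X <- cbind(sapply(1:5, function(i) F1 + rnorm(n)),  # five items on F1
           sapply(1:4, function(i) F2 + rnorm(n)))  # four items on F2
obs <- eigen(cor(X))$values
# mean eigenvalues of 20 random data sets of the same size
rand <- rowMeans(replicate(20, eigen(cor(matrix(rnorm(n * p), n, p)))$values))
sum(obs > rand)   # retained dimensions: recovers the 2 true factors
```

Only eigenvalues exceeding what chance alone produces in same-sized random data are kept.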
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).
Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")

[Plot "Very Simple Structure of a Big 5 inventory": Very Simple Structure Fit (0 to 1) against the number of factors (1 to 8), with separate curves for item complexities 1 through 4]
Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures, with a particular number of major factors but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option set to "poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25], main = "Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Scree plot "Parallel Analysis of a Big 5 inventory": eigenvalues of principal components and factor analysis for the actual, simulated, and resampled data, plotted against the factor/component number]
Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation, found by setcor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

λ3 = n/(n−1) · (1 − tr(V_x)/V_x) = n/(n−1) · (V_x − tr(V_x))/V_x = α
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
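As a check on the λ3 formula, alpha can be computed from a covariance matrix in a few lines of base R (a sketch with simulated τ-equivalent items):

```r
alpha_raw <- function(X) {               # Cronbach's alpha = Guttman's lambda_3
  V <- cov(X); n <- ncol(X)
  (n / (n - 1)) * (1 - sum(diag(V)) / sum(V))
}
set.seed(3)
g <- rnorm(300)                          # common true score
items <- sapply(1:6, function(i) g + rnorm(300))  # six parallel items, r ~ .5
alpha_raw(items)                         # about .86, per the Spearman-Brown formula
```

With average intercorrelation r = .5 and 6 items, Spearman-Brown gives 6(.5)/(1 + 5(.5)) = .857, which the sample estimate approximates.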
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e²_j, and is

λ6 = 1 − Σe²_j/Vx = 1 − Σ(1 − r²_smc)/Vx
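A parallel base-R sketch of this formula (illustrative, not the psych implementation): the smc values come from the diagonal of the inverse correlation matrix.

```r
# Guttman's lambda6: 1 - sum(1 - smc)/Vx, where smc_j is the squared multiple
# correlation of item j with the remaining items.
lambda6 <- function(R) {
  smc <- 1 - 1 / diag(solve(R))      # squared multiple correlations
  1 - sum(1 - smc) / sum(R)
}

R <- matrix(0.5, 3, 3)
diag(R) <- 1
round(lambda6(R), 2)                 # 0.67
```

Note that for these equal-loading items λ6 is below the α of 0.75 found above, as the text predicts.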
The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)
Factor analysis and extension
Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6, and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.
Alternative functions, scoreItems and cluster.cor, will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include:
alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.
guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 … μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).
omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)
cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and a n x n correlation matrix, find the correlations of the composite clusters.
scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1) and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.
score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as when each item is dropped in turn.
> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)
Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories; see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures
Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.
There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab, where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness, u², from factor analysis to find e²_j. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)
Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h²_j = c²_j + Σ f²_ij, and the unique variance for the item, u²_j = σ²_j(1 − h²_j), may be used to estimate the test reliability. That is, if h²_j is the communality of item j, based upon general as well as group factors, then for standardized items, e²_j = 1 − h²_j and

ωt = (1cc′1′ + 1AA′1′)/Vx = 1 − Σ(1 − h²_j)/Vx = 1 − Σu²/Vx
Because h²_j ≥ r²_smc, ωt ≥ λ6.
It is important to distinguish here between the two ω coefficients of McDonald (1978) and Equation 6.20a of McDonald (1999), ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.
ωh = 1cc′1′/Vx
Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by
ω∞ = 1cc′1′/(1cc′1′ + 1AA′1′)
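These three coefficients can be computed by hand from any orthogonal bifactor pattern. A base-R sketch, using made-up loadings g and A (illustrative, not estimated from the r9 data):

```r
# omega_h, omega_t, and omega_inf from an orthogonal bifactor pattern.
g <- c(.7, .6, .6, .5, .5, .4)              # general factor loadings (c)
A <- cbind(c(.4, .4, .4, 0, 0, 0),          # two group factors
           c(0, 0, 0, .4, .4, .4))
R <- g %*% t(g) + A %*% t(A)                # model-implied correlations
diag(R) <- 1                                # standardized items
Vx  <- sum(R)                               # total composite variance
gen <- sum(g)^2                             # 1cc'1'
grp <- sum(colSums(A)^2)                    # 1AA'1'
omega_h   <- gen / Vx
omega_t   <- (gen + grp) / Vx
omega_inf <- gen / (gen + grp)
round(c(omega_h, omega_t, omega_inf), 2)    # 0.64 0.81 0.79
```

Note that ωh ≤ ωt ≤ ω∞, as the formulas require.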
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om9
9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57           0.41 0.50 0.50 0.66
V5 0.55           0.32 0.41 0.59 0.73
V6 0.43           0.27 0.28 0.72 0.66
V7 0.47      0.47      0.44 0.56 0.50
V8 0.43      0.42      0.36 0.64 0.51
V9 0.32      0.23      0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general =  0.64  with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
> om9 <- omega(r9,title="9 simulated variables")
9 simulated variables
Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index = 0.026 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90 % confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g  F1*  F2*  F3*
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9,n.obs=500)
Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57           0.41 0.50 0.50 0.66
V5 0.55           0.32 0.41 0.59 0.73
V6 0.43           0.27 0.28 0.72 0.66
V7 0.47      0.47      0.44 0.56 0.50
V8 0.43      0.42      0.36 0.64 0.51
V9 0.32      0.23      0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general =  0.64  with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.026 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90 % confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g  F1*  F2*  F3*
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19

Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57           0.45 0.53 0.47
V5 0.58           0.25 0.40 0.60
V6 0.43           0.26 0.26 0.74
V7 0.49      0.38      0.38 0.62
V8 0.44      0.45      0.39 0.61
V9 0.34      0.24      0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from ten Berge, and an estimate of the greatest lower bound.
> splitHalf(r9)
Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items or can be specified by the user.
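The logic above can be sketched in a few lines of base R. The data and keys here are made up for illustration; scoreItems does this plus the reliability statistics described below.

```r
# Scale scores as weighted composites of -1/0/1 keyed items.
# Items are on a 1..6 scale, so a reversed item becomes (6 + 1) - item.
items <- matrix(c(6, 2, 5,
                  1, 5, 2), nrow = 2, byrow = TRUE)   # 2 people, 3 items
keys   <- c(1, -1, 1)                                 # item 2 is reverse keyed
maxmin <- 6 + 1                                       # maximum + minimum
adjust <- ifelse(keys < 0, maxmin, 0)                 # add max+min for reversed items
scored <- sweep(items, 2, keys, `*`) +
          matrix(adjust, nrow(items), length(keys), byrow = TRUE)
rowSums(scored)                                       # total scores: 16 5
```

Person 1's score is 6 + (7 − 2) + 5 = 16, the same result as reversing item 2 by hand before summing.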
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.
Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, as are those for the demographic data.
This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.
Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores
Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.
To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 16.)
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.
Consider the same bfi data set, but first find the correlations and then use cluster.cor.
> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)
Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
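The "structure" matrix can be computed from a correlation matrix and a keys matrix alone. A base-R sketch of this calculation (the function name and toy inputs are illustrative, not the psych implementation):

```r
# Item-by-scale ("structure") correlations from a correlation matrix and keys:
# item-scale covariances R %*% keys, rescaled by the scale standard deviations.
structure_loadings <- function(R, keys) {
  covs <- R %*% keys                           # item-scale covariances
  sds  <- sqrt(diag(t(keys) %*% R %*% keys))   # scale standard deviations
  sweep(covs, 2, sds, `/`)
}

R    <- diag(3)                                # 3 uncorrelated items
keys <- matrix(c(1, 1, 0), ncol = 1)           # one scale of items 1 and 2
round(structure_loadings(R, keys), 2)          # 0.71 0.71 0.00
```

With uncorrelated items, each keyed item correlates 1/√2 with its own two-item scale, as expected.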
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2
Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
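The scoring step itself is a simple comparison against a key. A base-R sketch with made-up responses (score.multiple.choice additionally reports distractor frequencies and basic test statistics):

```r
# Score multiple choice items as right (1) or wrong (0) against a key.
key <- c(4, 6)                          # correct alternative for 2 items
responses <- matrix(c(4, 3,             # person 1
                      5, 6),            # person 2
                    nrow = 2, byrow = TRUE)
scored <- t(t(responses) == key) * 1    # compare each item column to its key
rowMeans(scored)                        # proportion correct: 0.5 0.5
```

The transpose trick lets the key recycle down each person's responses; rowSums would give the total number right instead of the average.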
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)
Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> # just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  # compare to previous results
          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
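That step can be sketched as follows. The four-scale keys list below is hypothetical (grouping the 16 ability items by content); newer versions of psych accept a keys list directly, while older versions require converting it with make.keys first.

```r
library(psych)
data(iqitems)
# iq.keys as defined earlier in this section
iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
iq.tf <- score.multiple.choice(iq.keys, iqitems, score = FALSE)  # items as 0/1

# hypothetical keys list: group the 16 items into four content scales
iq.scale.keys <- list(reasoning = c("reason.4","reason.16","reason.17","reason.19"),
                      letters   = c("letter.7","letter.33","letter.34","letter.58"),
                      matrices  = c("matrix.45","matrix.46","matrix.47","matrix.55"),
                      rotation  = c("rotate.3","rotate.4","rotate.6","rotate.8"))
iq.scales <- scoreItems(iq.scale.keys, iq.tf)
iq.scales$alpha  # alpha reliability of each of the four scales
```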
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss) or parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
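The workflow just described can be sketched on the bfi extraversion items as a stand-in example; the reverse-keying of E1 and E2 follows the standard bfi keying, and function calls follow the psych conventions described in this vignette.

```r
library(psych)
data(bfi)
ext <- bfi[11:15]                      # the five extraversion items E1-E5

describe(ext)                          # basic descriptive statistics
fa.parallel(ext, fa = "both")          # parallel analysis for number of factors
f1 <- fa(ext)                          # one-factor (minres) solution

# score one scale, reverse keying E1 and E2
keys <- list(extraversion = c("-E1", "-E2", "E3", "E4", "E5"))
ext.scale <- scoreItems(keys, bfi)
ext.scale$item.cor                     # item-whole correlations
```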
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> # note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  # set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)
[Figure 17 plot: four panels (reason.4, reason.16, reason.17, reason.19), each showing P(response) as a function of theta, with one curve for each response alternative 0-6.]
Figure 17 The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are of course just the factor loadings. Item difficulty, δ_j, and item discrimination, α_j, may be found from factor analysis of the tetrachoric correlations, where λ_j is just the factor loading on the first factor and τ_j is the normal threshold reported by the tetrachoric function
$$\delta_j = \frac{D\,\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \qquad (2)$$
where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λ_j) and item thresholds (τ_j) are just
$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}$$
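These two conversions can be sketched as plain base-R functions (normal metric, D = 1; the names fa2irt and irt2fa are hypothetical helpers for illustration, not psych functions — psych's irt.fa does this internally):

```r
# convert a factor loading (lambda) and threshold (tau) to IRT discrimination
# (alpha) and difficulty (delta), and back; D = 1 for the normal ogive metric
fa2irt <- function(lambda, tau, D = 1) {
  list(alpha = lambda / sqrt(1 - lambda^2),
       delta = D * tau / sqrt(1 - lambda^2))
}
irt2fa <- function(alpha, delta) {
  list(lambda = alpha / sqrt(1 + alpha^2),
       tau    = delta / sqrt(1 + alpha^2))
}

p    <- fa2irt(lambda = 0.7, tau = 0.5)
back <- irt2fa(p$alpha, p$delta)   # round trip recovers lambda = 0.7, tau = 0.5
```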
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.
> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal")  # dichotomous items
> test <- irt.fa(d9$items)
gt test
Item Response Analysis using Factor Analysis
Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90% confidence intervals are 0.198 0.218
BIC = 1009.07
Similar analyses can be done for polytomous items such as those of the bfi extraversion scale.
> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
Factor analysis with Call fa(r = r nfactors = nfactors nobs = nobs rotate = rotatefm = fm)
Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90% confidence intervals are 0.083 0.111
BIC = 96.23
The item information functions show that not all of the items are equally good (Figure 19):
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))
[Figure 18 plot: three panels plotted against the Latent Trait (normal scale) for items V1-V9 -- "Item parameters from factor analysis" (probability of response), "Item information from factor analysis", and "Test information -- item parameters from factor analysis" (with reliability).]
Figure 18 A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is .50. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt,type="IIC")
[Figure 19 plot: "Item information from factor analysis for factor 1", item information plotted against the Latent Trait (normal scale) for items -E1, -E2, E3, E4, and E5.]
Figure 19 A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info,sort=TRUE)
Item Response Analysis using Factor Analysis
Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis (Figure 21) to show the sub-structure of the ability items.
> iq.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
reason.4    0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16   0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17   0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19   0.05  0.17  0.41  0.46  0.22  0.07  0.02
> iq.irt <- irt.fa(iq.tf)
[Figure 20 plot: "Item information from factor analysis", item information plotted against the Latent Trait (normal scale) for the 16 ability items (reason.4 ... rotate.8).]
Figure 20 A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7    0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33   0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34   0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58   0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45   0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46   0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47   0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55   0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3    0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4    0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6    0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8    0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info   0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM         1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49  0.80  0.85  0.82  0.61 -0.40
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90% confidence intervals are 0.126 0.135
BIC = 2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.
Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.
> v9 <- sim.irt(9,1000,-2,2,mod="normal")  # dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho,4)
[Figure 21 plot: Omega diagram showing a general factor g and four group factors F1-F4 with their loadings on the 16 ability items (rotate.3 ... matrix.55).]
Figure 21 An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> # now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")
These results are seen in Figure 22
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function that gives some of the basic descriptive statistics for two level models.
This follows the decomposition of an observed correlation into the pooled correlation within groups (r_wg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

$$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \qquad (3)$$
Figure 22 IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df,pch=".",gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring",line=3)
[Figure 22 plot: pairs.panels scatter plot matrix comparing True theta, irt theta, total, fit, rasch, total, and fit, titled "Comparing true theta for IRT, Rasch and classically based scoring".]
where r_xy is the normal correlation, which may be decomposed into a within group and a between group correlation, r_xy_wg and r_xy_bg, and η (eta) is the correlation of the data with the within group values or the group means.
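Equation (3) can be verified numerically in base R alone (simulated grouping; no psych functions are used, so the variable names below are illustrative):

```r
set.seed(42)
n.groups <- 10; n.per <- 50
group <- rep(1:n.groups, each = n.per)
x <- rnorm(n.groups)[group] + rnorm(n.groups * n.per)  # group effect + noise
y <- 0.5 * x + rnorm(n.groups * n.per)

gm.x <- ave(x, group); gm.y <- ave(y, group)  # group means, one per case
xw <- x - gm.x; yw <- y - gm.y                # within-group deviations

eta.xw <- sd(xw) / sd(x);   eta.yw <- sd(yw) / sd(y)    # within-group etas
eta.xb <- sd(gm.x) / sd(x); eta.yb <- sd(gm.y) / sd(y)  # between-group etas
r.wg <- cor(xw, yw)       # pooled within-group correlation
r.bg <- cor(gm.x, gm.y)   # between-group correlation (weighted by group size)

r.xy <- eta.xw * eta.yw * r.wg + eta.xb * eta.yb * r.bg
all.equal(r.xy, cor(x, y))  # TRUE: the decomposition reproduces the raw correlation
```

The identity is exact because the sum of cross products splits additively into within-group and between-group parts.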
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, and -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.
The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
$$R^2 = 1 - \prod_{i=1}^{n}(1-\lambda_i)$$
where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix

$$R = R_{xx}^{-1}\,R_{xy}\,R_{yy}^{-1}\,R_{yx}.$$
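A minimal base-R sketch of this computation on a small hypothetical correlation matrix (two predictors x, two criteria y; the values are made up for illustration):

```r
# hypothetical 4 x 4 correlation matrix: variables 1-2 are the x set, 3-4 the y set
R <- matrix(c(1.0, 0.3, 0.4, 0.2,
              0.3, 1.0, 0.2, 0.1,
              0.4, 0.2, 1.0, 0.3,
              0.2, 0.1, 0.3, 1.0), 4, 4)
x <- 1:2; y <- 3:4
Rxx <- R[x, x]; Ryy <- R[y, y]; Rxy <- R[x, y]

# eigenvalues of Rxx^-1 Rxy Ryy^-1 Ryx are the squared canonical correlations
M <- solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy)
lambda <- Re(eigen(M)$values)  # real for this product; Re() guards numeric noise
R2.set <- 1 - prod(1 - lambda)
```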
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case an alternative statistic based upon the average canonical correlation might be more appropriate.
setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.
Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.
For this example the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)
Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476
Compare this with the output from setCor:
> # compare with mat.regress
> setCor(c(4:6),c(1:3),C, n.obs=700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19
multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11
Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohen's Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohen's Set Correlation 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item (a more general item simulation), sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural (a general simulation of structural models), sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette psych forsem
The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
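The implied simplex correlation matrix can be written down directly in base R (hypothetical r = .8 with 6 variables):

```r
# simplex correlation structure: r^|i-j| between variables i and j
r <- 0.8
n.var <- 6
R.simplex <- r^abs(outer(1:n.var, 1:n.var, "-"))
round(R.simplex, 2)  # adjacent variables correlate .8, two apart .64, ...
```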
Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
Although coefficient ω_h is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four irt simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
The logistic model is
$$P(x|\theta_i,\delta_j,\gamma_j,\zeta_j) = \gamma_j + \frac{\zeta_j-\gamma_j}{1+e^{\alpha_j(\delta_j-\theta_i)}} \qquad (4)$$
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
(Graphics of these may be seen in the demonstrations for the logistic function)
The normal model (sim.npn) calculates the probability using pnorm instead of the logistic function used in sim.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
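A base-R sketch of equation (4) and of the 1.702 approximation follows; the parameter names mirror the text (γ, ζ, α, δ), and the two helper functions are illustrative, not psych's internal code.

```r
# 4-parameter logistic model of equation (4)
p.logistic <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1) {
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
}
# normal ogive counterpart
p.normal <- function(theta, delta, alpha = 1) pnorm(alpha * (theta - delta))

p.logistic(0, 0)  # Rasch model at theta = delta gives 0.5

# with alpha = 1.702 the logistic curve stays within about .01 of the normal ogive
th <- seq(-3, 3, 0.1)
max(abs(p.logistic(th, 0, alpha = 1.702) - p.normal(th, 0)))
```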
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as by specific diagram functions, e.g., (but not shown):
f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)            # a pretty boring plot
iclust.diagram(c)  # a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)           # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels="left to right")
> dia.curve(ll$top,ul$bottom,"double")  # for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")          # but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)
Figure 23 The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.
df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.
cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.
cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.
fisherz Convert a correlation to the corresponding Fisher z score
geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.
ICC and cohen.kappa are typically used to find the reliability for raters.
headtail combines the head and tail functions to show the first and last lines of a dataset or output
topBottom Same as headtail Combines the head and tail functions to show the first andlast lines of a data set or output but does not add ellipsis between
mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.
p.rep finds the probability of replication for an F, t, or r, and estimates effect size.
partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)
rangeCorrection will correct correlations for restriction of range
reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems.
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 16 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.
Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.
bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.
sat.act Self reported scores on the SAT Verbal, SAT Quantitative, and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.
epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.
galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.
Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.
miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
14 Development version and a user's guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib/ and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.
Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.
News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.2.8", package="psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.
For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.
16 SessionInfo
This document was prepared using the following settings:

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45           grid_3.3.0            arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0          coda_0.18-1           minqa_1.2.4           mi_1.0                nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12           sem_3.1-7             tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1 parallel_3.3.0        mnormt_1.5-4
References
Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd ed. edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.
functions are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.
4.4.1 Scatter Plot Matrices
Scatter Plot Matrices (SPLOMs) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (see Figure 1 for an example).
pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data) it is possible to specify that the plot character is a period to get a somewhat cleaner graphic.
4.4.2 Correlational structure
There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.
> lowerCor(sat.act)
          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00
When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:
> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])
> png('pairspanels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1
Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower, upper)
> round(both, 2)
          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA
It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:
> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)
          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA
4.4.3 Heatmap displays of correlational structure
Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> cor.plot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1
Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ, main='24 variables in a circumplex')
> dev.off()
null device
          1
Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.
If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
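As a brief sketch (assuming the psych package is attached), a tetrachoric correlation can be estimated directly from a 2 x 2 table of counts; the table below is hypothetical:

```r
library(psych)

# Hypothetical 2 x 2 table of counts for two dichotomous items
# (rows: item 1 wrong/right; columns: item 2 wrong/right)
tab <- matrix(c(40, 10,
                10, 40), nrow = 2, byrow = TRUE)
tetrachoric(tab)   # estimate of the latent bivariate-normal correlation
draw.tetra()       # illustrates the cut bivariate normal graphically
```

Because most observations fall on the agreeing diagonal, the estimated latent correlation is strongly positive, and noticeably larger than the φ coefficient computed from the same table.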
The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
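A minimal sketch of the smoothing step (assuming psych is attached); the 3 x 3 matrix below is deliberately improper, the kind of matrix pairwise deletion can produce:

```r
library(psych)

# An impossible "correlation" matrix: r12 = r13 = 0.9 together with
# r23 = -0.9 cannot all hold, so one eigen value is negative
bad <- matrix(c( 1.0,  0.9,  0.9,
                 0.9,  1.0, -0.9,
                 0.9, -0.9,  1.0), nrow = 3)
min(eigen(bad)$values)    # negative: not positive semi-definite
good <- cor.smooth(bad)   # adjust the eigen values and rescale
min(eigen(good)$values)   # now non-negative
```

The smoothed matrix keeps unit diagonals and stays close to the original off-diagonal values while being a proper correlation matrix.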
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)
1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
Several of the functions in psych address the problem of data reduction:
fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.
principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.
factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.
fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.
nfactors A number of different tests for the number of factors problem are run.
fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.
fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.
iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
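Assuming the psych package is attached, these functions fit together into a short data-reduction workflow; the calls below use the Thurstone correlation matrix discussed later in this section:

```r
library(psych)

# How many dimensions? Compare observed eigen values with random data
fa.parallel(Thurstone, n.obs = 213)

# Very Simple Structure (and MAP) criteria across candidate solutions
vss(Thurstone, n.obs = 213)

# Extract three factors and draw the path diagram
f3 <- fa(Thurstone, nfactors = 3, n.obs = 213)
fa.diagram(f3)
```

Because Thurstone is a correlation matrix rather than raw data, the n.obs argument supplies the sample size needed for the fit statistics.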
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix nRn be approximated by the product of a factor matrix nFk and its transpose plus a diagonal matrix of uniquenesses?

R = FF′ + U²  (1)
The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
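As an illustrative sketch (assuming psych is attached), the same correlation matrix can be fit with several of these methods and the resulting loadings compared with factor.congruence:

```r
library(psych)

# Fit the Thurstone matrix with three different extraction methods
f.min <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "minres")
f.pa  <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "pa")
f.wls <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "wls")

# Cosines between the loading matrices; values near 1 mean the
# methods recover essentially the same factors
factor.congruence(list(f.min, f.pa, f.wls))
```

For a well-behaved matrix such as this one, the diagonal blocks of the congruence matrix are very close to 1, which is the point made in the comparison of methods below.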
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
A classic data set collected by Thurstone and Thurstone (1941), and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
5.1.2 Principal Axis Factor Analysis
An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
The solutions from the fa, the factor.minres, and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.
Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).
The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)
[Figure 4 graphic: a SPLOM of the pairwise factor loading plots for factors MR1, MR2, and MR3, with axes running from 0.0 to 0.8 and the overall title "Factor Analysis".]
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)
[Figure 5 graphic: path diagram of the nine Thurstone variables and the three oblique factors MR1, MR2, and MR3]
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components analysis (PCA)
An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics based upon the residual off diagonal elements are much worse than the fa solution.
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analyses of the ability domain have considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general and group factors.
> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))
[Figure 6 graphic: path diagram with group factors F1, F2, F3 and a higher order factor g]
Figure 6: A higher order factor solution to the Thurstone 9 variable problem.
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
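The multiple-starts advice above can be sketched as follows. This is an illustrative workflow, not from the original text: the names f.un and sols are hypothetical, and the GPArotation functions bifactorQ and Random.Start are used to restart the oblique bifactor rotation from random orthonormal matrices.

```r
library(psych)
library(GPArotation)
# Sketch (assumed workflow): extract unrotated factors, then run the
# oblique bifactor rotation from several random starting matrices.
set.seed(42)
f.un <- fa(Thurstone, nfactors = 3, n.obs = 213, rotate = "none")
sols <- lapply(1:5, function(i)
  bifactorQ(unclass(f.un$loadings), Tmat = Random.Start(3)))
# If two runs reached different local optima, their loadings will not be
# simple permutations/reflections of one another:
factor.congruence(sols[[1]]$loadings, sols[[2]]$loadings)
```

Comparing the solutions with factor.congruence is one simple way to see whether the different starts converged to the same configuration.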
> om <- omega(Thurstone,n.obs=213)
[Figure 7 graphic: bifactor path diagram with a general factor g and group factors F1, F2, F3]
Figure 7: A bifactor solution to the Thurstone 9 variable problem.
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known, somewhat pretentiously, as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or, in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])
[Figure 8 graphic: ICLUST tree of the 25 BFI items, showing clusters C20, C16, C15, and C21 with their α and β values]
Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis show three large clusters and a smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations, reliabilities on diagonal, correlations corrected for attenuation above diagonal:
      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = Σ_{k=1}^{n} f_{ik} f_{jk} / √(Σ f_{ik}² Σ f_{jk}²)
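The congruence coefficient above can be computed directly in a few lines of base R. This is an illustrative sketch (the helper name congruence and the example loadings are made up); psych's factor.congruence applies the same computation to all pairs of columns of two loadings matrices.

```r
# Sketch: the congruence coefficient is the (uncentered) cosine between
# two columns of loadings.  Unlike a Pearson correlation, the columns
# are not centered before computing the cross product.
congruence <- function(f, g) sum(f * g) / sqrt(sum(f^2) * sum(g^2))

# Hypothetical loadings of five items on two similar factors:
f1 <- c(0.9, 0.8, 0.7, 0.1, 0.0)
f2 <- c(0.8, 0.9, 0.6, 0.2, 0.1)
round(congruence(f1, f2), 2)  # close to 1 for similar factors
```

Identical columns give a congruence of exactly 1, and reflecting one column flips the sign, which is why congruence matrices are read up to permutation and reflection.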
Consider the case of a four factor solution and four cluster solution to the Big Five problem.
> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)
[Figure 9 graphic: ICLUST tree of the 25 BFI items based on polychoric correlations]
Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)
[Figure 10 graphic: ICLUST tree based on polychoric correlations with nclusters=5]
Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")
[Figure 11 graphic: ICLUST tree based on polychoric correlations with beta.size=3]
Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, and 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15               0.54
C2 C15 C15               0.62
C3 C15 C15               0.54
C4 C15 C15         0.31 -0.66
C5 C15 C15 -0.30   0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61   0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50         0.40 -0.32
N1 C16 C16         0.76
N2 C16 C16         0.75
N3 C16 C16         0.74
N4 C16 C16 -0.34   0.62
N5 C16 C16         0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations, reliabilities on diagonal, correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68, Pattern fit = 0.96, RMSR = 0.05
NULL
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90 % confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))
     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2   RC1  RC2  RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74  1.00 0.08 0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57  0.04 0.99 0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51  0.06 0.02 0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69  1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54  0.04 0.99 0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53  0.10 0.01 0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99  0.69 0.58 0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74  1.00 0.08 0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57  0.04 0.99 0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51  0.06 0.02 0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00  0.71 0.56 0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71  1.00 0.06 0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56  0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis): fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
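Several of the criteria listed above can be compared in a single pass. As a hedged sketch (the object name nf is illustrative), psych's nfactors function reports the VSS fits, Velicer's MAP, and chi-square/BIC based statistics for a range of candidate solutions, while fa.parallel (shown later) adds the parallel analysis comparison:

```r
library(psych)
# Sketch: compare several number-of-factors criteria at once.
# nfactors() fits 1..n factor solutions and tabulates VSS, MAP, and
# chi-square/BIC statistics for each.
nf <- nfactors(bfi[1:25], n = 8)
nf  # prints which number of factors each criterion prefers
```

Because the criteria frequently disagree (as they do for the bfi data below), it is worth inspecting all of them rather than trusting any single rule.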
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix, after the factors of interest are extracted, is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")
[Figure 12 graphic: Very Simple Structure fit (y-axis, 0.0 to 1.0) by number of factors (x-axis, 1 to 8) for complexities 1 through 4]
Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss
Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option set to "poly". By default, the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
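The polychoric variant can be sketched as below. This is a hedged illustration of the cor = "poly" interface mentioned above; it is much slower than the Pearson version because each replication refits the polychoric correlations:

```r
library(psych)
# Sketch: parallel analysis on polychoric rather than Pearson
# correlations.  n.iter = 20 keeps the number of simulated and
# resampled data sets modest; expect this to take a while.
fa.parallel(bfi[1:25], cor = "poly", n.iter = 20)
```

For purely dichotomous items, cor = "tet" (tetrachoric) is the analogous option.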
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6
[Figure 13 graphic: eigen values of principal components and factor analysis by factor/component number, for actual, simulated, and resampled data]
Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation foundby setcor (discussed later)
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

λ3 = n/(n−1) · (1 − tr(Vx)/Vx) = n/(n−1) · (Vx − tr(Vx))/Vx = α
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
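The identity above can be checked directly from a covariance matrix. This is a hedged sketch (reversing A1 by 7 − x assumes the bfi items are scored 1 to 6, and the item selection is illustrative); psych's alpha function reports the same quantity as raw_alpha:

```r
library(psych)
# Sketch: raw alpha computed from the definition,
#   alpha = n/(n-1) * (1 - tr(V)/Vx),
# where Vx = sum(V) is the total test variance.
x <- bfi[, 1:5]            # five agreeableness items (illustrative)
x[, 1] <- 7 - x[, 1]       # reverse key A1 (assumes 1..6 scoring)
V <- cov(x, use = "pairwise")
n <- ncol(V)
(n / (n - 1)) * (1 - tr(V) / sum(V))  # compare alpha(x)$total$raw_alpha
```

Computing it by hand once makes clear why raw alpha is sensitive to unequal item variances, while "standardized" alpha, computed from correlations, is not.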
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation, or smc), or, more precisely, the variance of the errors, e_j², and is

λ6 = 1 − Σ e_j² / Vx = 1 − Σ(1 − r²_smc) / Vx
The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
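The λ6 formula above can be sketched in a few lines using psych's smc function. This is an illustrative computation on the (all positive) Thurstone correlation matrix; for standardized items, Vx is simply the sum of the correlation matrix:

```r
library(psych)
# Sketch of Guttman's lambda6: each item's error variance is estimated
# as 1 - smc, summed over items, and divided by the total test variance
# (sum of the correlation matrix for standardized items).
R <- Thurstone
1 - sum(1 - smc(R)) / sum(R)
```

The guttman and splitHalf functions report this same quantity as lambda 6 alongside the other Guttman bounds.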
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item's reliability
> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)
[Path diagram omitted: "Factor analysis and extension", showing the odd items (V1-V15) loading on factors MR1 and MR2, with those factors extended to the even items (V2-V16).]
Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6, and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
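The λ6 formula given earlier can also be computed directly from an item correlation matrix: each item's smc is 1 − 1/(R⁻¹)jj. A small illustrative sketch (Python for illustration; the function name and the Gauss-Jordan inversion helper are mine, not psych's implementation):

```python
def guttman_lambda6(R):
    """Guttman's lambda_6 from an item correlation matrix R (list of rows).

    smc_j = 1 - 1/(R^-1)_jj  (squared multiple correlation of item j on
    the remaining items); lambda_6 = 1 - sum(1 - smc_j) / V_x, where V_x,
    the variance of the standardized total score, is the sum of all of R.
    """
    n = len(R)
    # Gauss-Jordan inversion of R (augment with the identity, then reduce)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(R)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    inv_diag = [aug[j][n + j] for j in range(n)]
    smc = [1 - 1 / d for d in inv_diag]
    vx = sum(sum(row) for row in R)
    return 1 - sum(1 - s for s in smc) / vx
```

For two items correlating r = 0.5, smc = r² = 0.25 for each item and λ6 = 0.5; for an identity matrix (no shared variance) λ6 = 0, as the smc lower-bound argument implies.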
More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include
alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.
guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).
omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)
cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and an n x n correlation matrix, find the correlations of the composite clusters.
scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1) and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.
score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as when one item is dropped at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)
Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
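Why mis-keying one item drags alpha down can be seen numerically: flipping the key of an item that correlates positively with the rest leaves the item variances unchanged but shrinks the total-score variance, and hence alpha. A small sketch (Python for illustration; the function name is mine; reversal is implemented as max + min − x, the convention scoreItems uses):

```python
def keyed_alpha(items, keys):
    """Coefficient alpha after reverse-scoring items keyed -1.

    A -1 key replaces x with (max + min - x) for that item, min/max
    taken from the observed data; then
    alpha = n/(n-1) * (1 - sum(item variances) / variance(total)).
    """
    n = len(items[0])
    cols = [[row[j] for row in items] for j in range(n)]
    for j, k in enumerate(keys):
        if k == -1:  # reversal leaves the item variance unchanged
            lo, hi = min(cols[j]), max(cols[j])
            cols[j] = [lo + hi - x for x in cols[j]]

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(col[i] for col in cols) for i in range(len(items))]
    return n / (n - 1) * (1 - sum(var(c) for c in cols) / var(totals))
```

With two positively correlated items, keying one of them negatively can even push alpha well below zero, which is the signature of a mis-keyed item seen in the output above.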
> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures
Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02
Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf )
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the
method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.
There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab, where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness u² from factor analysis to find e_j². This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error).
Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h_j² = c_j² + Σf_ij², and the unique variance for the item, u_j² = σ_j²(1 - h_j²), may be used to estimate the test reliability. That is, if h_j² is the communality of item j, based upon general as well as group factors, then for standardized items, e_j² = 1 - h_j² and

ωt = (1cc'1 + 1AA'1')/Vx = 1 - Σ(1 - h_j²)/Vx = 1 - Σu²/Vx
Because h_j² ≥ r²_smc, ωt ≥ λ6.
It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.
ωh = 1cc'1 / Vx
Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by
ω∞ = 1cc'1 / (1cc'1 + 1AA'1')
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om9
9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max 4.76   max/min = 1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
> om9 <- omega(r9,title="9 simulated variables")
Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index = 0.026 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90 % confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9,n.obs=500)
Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max 4.76   max/min = 1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.026 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90 % confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from ten Berge, and an estimate of the greatest lower bound.
> splitHalf(r9)
Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged
to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
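The keyed-scoring logic just described (weights of -1/0/1, reversal as max + min − x, item means by default) can be sketched independently of any package. An illustrative Python version (function and argument names are mine, not the scoreItems API):

```python
def score_scales(items, keys, average=True):
    """Score scales from a people x items matrix and a keys matrix
    (items x scales, entries -1/0/1).

    A -1 key reverse-scores the item as (max + min - x), with min/max
    estimated from the data; scores are item means by default
    (totals if average=False).
    """
    n_items = len(items[0])
    lo = [min(r[j] for r in items) for j in range(n_items)]
    hi = [max(r[j] for r in items) for j in range(n_items)]
    out = []
    for row in items:
        person = []
        for s in range(len(keys[0])):
            vals = []
            for j in range(n_items):
                k = keys[j][s]
                if k == 1:
                    vals.append(row[j])
                elif k == -1:  # reverse score
                    vals.append(lo[j] + hi[j] - row[j])
            person.append(sum(vals) / len(vals) if average else sum(vals))
        out.append(person)
    return out
```

Reporting the mean keeps each scale on the metric of its items, which is why it is the default described above.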
The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally, using a spreadsheet, and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age 0 0 0 0 0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores
Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245    0.23
Conscientious  0.26          0.72         0.35      -0.305    0.30
Extraversion   0.46          0.26         0.76      -0.284    0.32
Neuroticism   -0.18         -0.23        -0.22       0.812   -0.12
Openness       0.15          0.19         0.22      -0.086    0.60
In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See figure 16.)
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.
Consider the same bfi data set, but first find the correlations and then use cluster.cor:
> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)
Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242    0.25
Conscientious  0.25          0.73         0.36      -0.287    0.30
Extraversion   0.47          0.27         0.76      -0.278    0.35
Neuroticism   -0.18         -0.22        -0.22       0.815   -0.11
Openness       0.16          0.20         0.24      -0.074    0.61
To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2
Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
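The correct/incorrect scoring step is simple: compare each response to the key and average the 0/1 results per item. A minimal illustrative sketch (Python; the function name is mine, loosely echoing what score.multiple.choice with score=FALSE followed by describe shows):

```python
def score_mc(key, responses):
    """Score multiple-choice responses against a key.

    key: list of correct alternatives, one per item.
    responses: people x items matrix of chosen alternatives.
    Returns the 0/1 scored matrix and the proportion correct per item
    (the item difficulty in classical test theory terms).
    """
    scored = [[int(resp == k) for resp, k in zip(row, key)]
              for row in responses]
    n = len(scored)
    item_means = [sum(col) / n for col in zip(*scored)]
    return scored, item_means
```

The per-item means correspond to the "mean" column of the item statistics reported below: harder items (such as the rotation items) have lower proportions correct.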
> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)
Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results
          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss) or parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
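The workflow just described can be sketched as a short script. This is an illustrative sketch, not from the vignette itself: it assumes the bfi data shipped with psych, and it keys the first ten bfi items (Agreeableness and Conscientiousness) into two hypothetical scales; newer psych versions use scoreItems (older versions call it score.items).

```r
library(psych)            # the package this vignette documents
data(bfi)
items <- bfi[1:10]        # A1-A5, C1-C5

describe(items)           # basic descriptive statistics
vss(items)                # Very Simple Structure criterion
fa.parallel(items)        # parallel analysis for number of factors
f2 <- fa(items, nfactors = 2)   # factor the items

# key two scales (negative numbers reverse-score an item)
keys <- make.keys(10, list(agree = c(-1, 2:5),
                           conscientious = c(6:8, -9, -10)))
scales <- scoreItems(keys, items)  # alpha, item-whole correlations, scores
summary(scales)
```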
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
The use of Item Response Theory has become so widespread that it is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iq.items)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iq.items, score=TRUE, short=FALSE)
> # note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  # set this to see the output for multiple items
> irt.responses(scores$scores, iq.items[1:4], breaks=11)
(Figure 17 appears here: four panels of response curves for reason.4, reason.16, reason.17, and reason.19, plotting P(response) for alternatives 0-6 against theta.)
Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δ_j, and item discrimination, α_j, may be found from factor analysis of the tetrachoric correlations, where λ_j is just the factor loading on the first factor and τ_j is the normal threshold reported by the tetrachoric function:
\delta_j = \frac{D\tau}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \qquad (2)
where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λ_j) and item thresholds (τ_j) are just
\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.
> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  # dichotomous items
> test <- irt.fa(d9$items)
> test
Item Response Analysis using Factor Analysis
Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12
62
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90% confidence intervals are 0.198 0.218
BIC = 1009.07
Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:
> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90% confidence intervals are 0.083 0.111
BIC = 96.23
The item information functions show that not all of the items are equally good (Figure 19):
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))
(Figure 18 appears here: three panels plotting item parameters, item information, and test information from the factor analysis for V1-V9 against the latent trait (normal scale).)
Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt, type="IIC")
(Figure 19 appears here: item information curves for E1-E5 plotted against the latent trait (normal scale).)
Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info, sort=TRUE)
Item Response Analysis using Factor Analysis
Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
More extensive IRT packages include ltm and eRm, which should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 21) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.
> iq.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
reason.4    0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16   0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17   0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19   0.05  0.17  0.41  0.46  0.22  0.07  0.02
> iq.irt <- irt.fa(iq.tf)
(Figure 20 appears here: item information curves for the 16 ability items (reason, letter, matrix, and rotate items) plotted against the latent trait (normal scale).)
Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7    0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33   0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34   0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58   0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45   0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46   0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47   0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55   0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3    0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4    0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6    0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8    0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info   0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM         1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49  0.80  0.85  0.82  0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90% confidence intervals are 0.126 0.135
BIC = 2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.
Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.
> v9 <- sim.irt(9, 1000, -2., 2., mod="normal")  # dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho, 4)
(Figure 21 appears here: an omega diagram for the 16 ability items, with a general factor g (loadings roughly 0.4-0.7) and four group factors F1-F4 for the rotate, letter, reason, and matrix items.)
Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> # now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- score.irt(test, items)
> unitweighted <- score.irt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")
These results are seen in Figure 22
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analyses of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.
This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:
r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \qquad (3)
Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df, pch=".", gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)
(Figure 22 appears here: a pairs.panels scatter plot matrix comparing true theta, IRT theta, total score, fit, and Rasch scores, titled "Comparing true theta for IRT, Rasch and classically based scoring".)
where r_xy is the normal correlation, which may be decomposed into a within group and a between group correlation, r_xywg and r_xybg, and η (eta) is the correlation of the data with the within group values or the group means.
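The decomposition in equation (3) can be seen directly in the output of statsBy. This is a hedged sketch, not from the text: it assumes the sat.act data set shipped with psych and the rwg/rbg elements of the statsBy output list.

```r
library(psych)
data(sat.act)

# decompose the sat.act correlations by level of education
sb <- statsBy(sat.act, group = "education")

sb$rwg   # pooled within group correlations
sb$rbg   # correlations of the group means (between groups)
```

Comparing sb$rwg and sb$rbg to the raw correlation matrix shows how the observed correlations mix the two levels.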
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.
The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
R^2 = 1 - \prod_{i=1}^{n} (1 - \lambda_i)
where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. An alternative statistic, based upon the average canonical correlation, might then be more appropriate.
setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.
Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.
For this example the analysis is done on the correlation matrix rather than the rawdata
> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)
Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476
Compare this with the output from setcor
> # compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs=700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)
Multiple Regression from matrix input
Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19
multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11
Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1] 3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1 5.6

Average squared canonical correlation = 0.03
Cohen's Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohen's Set Correlation: 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.
sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette psych forsem
The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
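The simplex correlation structure just described is easy to construct directly, which makes the r^n pattern concrete. This is an illustrative sketch, not from the text (the value r = .8 and the 6-variable size are made up; see ?sim.simplex in psych for the simulation function itself).

```r
# a 6-variable simplex: adjacent variables correlate r,
# variables n steps apart correlate r^n
r <- 0.8
R <- r^abs(outer(1:6, 1:6, "-"))
round(R, 2)
# the first row decays geometrically: 1.00 0.80 0.64 0.51 0.41 0.33
```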
Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
The logistic model is
P(x | \theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \qquad (4)
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.
(Graphics of these may be seen in the demonstrations for the logistic function)
The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
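The near-equivalence of the two models with D = 1.702 is easy to verify numerically. This is an illustrative sketch, not from the text; it compares the scaled logistic curve to the normal ogive over a grid of trait values.

```r
# with scaling constant D = 1.702 the logistic and normal ogive
# item response functions nearly coincide
theta <- seq(-3, 3, by = 0.01)
p.logistic <- plogis(1.702 * theta)   # 1 / (1 + exp(-1.702 * theta))
p.normal   <- pnorm(theta)            # normal ogive

max(abs(p.logistic - p.normal))       # maximum discrepancy is below .01
```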
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):
f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)            # a pretty boring plot
iclust.diagram(c)  # a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)           # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double")  # for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")          # but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.
df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.
cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.
cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.
fisherz Convert a correlation to the corresponding Fisher z score
geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.
ICC and cohen.kappa are typically used to find the reliability for raters.
headtail combines the head and tail functions to show the first and last lines of a dataset or output
topBottom Same as headtail Combines the head and tail functions to show the first andlast lines of a data set or output but does not add ellipsis between
mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.
p.rep finds the probability of replication for an F, t, or r, and estimates effect size.
partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)
rangeCorrection will correct correlations for restriction of range
reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
79
superMatrix Takes two or more matrices eg A and B and combines them into a ldquoSupermatrixrdquo with A on the top left B on the lower right and 0s for the other twoquadrants A useful trick when forming complex keys or when forming exampleproblems
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.
Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.
bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.
sat.act Self reported scores on the SAT Verbal, SAT Quantitative, and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.
epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.
galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.
Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.
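A sketch of factor extension with the Dwyer data, following the pattern of the fa.extension help example (the split of the 8 variables into 7 original plus 1 extension variable is assumed):

```r
library(psych)
Ro  <- Dwyer[1:7, 1:7]   # correlations among the original 7 variables
Roe <- Dwyer[1:7, 8]     # their correlations with the extension variable
fo  <- fa(Ro, nfactors = 2, rotate = "none")  # factor the original set
fa.extension(Roe, fo)    # loadings of the extended variable on those factors
```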
miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
14 Development version and a user's guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.
Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.
News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:
> news(Version > "1.2.8", package = "psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.
For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.
16 SessionInfo
This document was prepared using the following settings:
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33       matrixcalc_1.0-3      MASS_7.3-45   grid_3.3.0    arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0          coda_0.18-1           minqa_1.2.4   mi_1.0        nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18           splines_3.3.0         lme4_1.1-12   sem_3.1-7     tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1 parallel_3.3.0        mnormt_1.5-4
References
Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering: an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure: alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components – an alternative to "mathematical factors." Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.
> png('pairspanels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device
          1
Figure 1: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. The plot character was set to a period (pch='.') in order to make a cleaner graph.
          edctn  age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn  age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower, upper)
> round(both, 2)
          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA
It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:
> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)
          education   age  ACT  SATV  SATQ
education        NA  0.09 0.00 -0.05  0.05
age            0.61    NA 0.07 -0.03  0.13
ACT            0.16  0.15   NA  0.08  0.02
SATV           0.02 -0.06 0.61    NA  0.05
SATQ           0.08  0.04 0.60  0.68    NA
4.4.3 Heatmap displays of correlational structure
Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> corPlot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1
Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main='24 variables in a circumplex')
> dev.off()
null device
          1
Figure 3: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
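A small sketch of the tetrachoric correlation (the 2 x 2 cell counts here are invented for illustration, and the draw.tetra arguments are assumptions):

```r
library(psych)
tab <- matrix(c(40, 10, 20, 30), 2, 2)  # hypothetical 2 x 2 table of cell frequencies
tetrachoric(tab)                        # tetrachoric r and the estimated thresholds (taus)
draw.tetra(r = .5, t1 = 1, t2 = 1)      # picture the bivariate normal model with two cuts
```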
Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.
If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
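The burt correction can be sketched as follows (eigen is base R; cor.smooth and the burt data are in psych):

```r
library(psych)
min(eigen(burt)$values)     # slightly negative: burt is not positive semi-definite
burt.s <- cor.smooth(burt)  # adjust and rescale the eigen values
min(eigen(burt.s)$values)   # now non-negative
```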
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)
1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.
The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.
At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.
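As an illustrative sketch (not a psych function) of finding component loadings from an eigen decomposition of a correlation matrix, which is essentially what principal reports:

```r
library(psych)
R <- cor(sat.act[ , -1], use = "pairwise")  # drop the gender column; SATQ has missing values
e <- eigen(R)
k <- 2                                      # keep the two largest components
loadings <- e$vectors[ , 1:k] %*% diag(sqrt(e$values[1:k]))  # rescaled eigen vectors
rownames(loadings) <- colnames(R)
round(loadings, 2)                          # compare with principal(R, 2, rotate = "none")
```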
Several of the functions in psych address the problem of data reduction.
fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.
principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.
factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.
fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.
nfactors A number of different tests for the number of factors problem are run.
fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.
fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.
iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
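A quick tour of these functions on the Thurstone correlation matrix (the specific calls are a sketch; n.obs = 213 matches the later examples):

```r
library(psych)
f3 <- fa(Thurstone, nfactors = 3, n.obs = 213)  # minres factor analysis (the default)
p3 <- principal(Thurstone, nfactors = 3)        # principal components
factor.congruence(f3, p3)                       # cosines between the two sets of loadings
vss(Thurstone, n.obs = 213)                     # Very Simple Structure and MAP criteria
fa.parallel(Thurstone, n.obs = 213)             # observed vs. random-data eigen values
ic <- iclust(Thurstone)                         # hierarchical item cluster analysis
fa.diagram(f3)                                  # path diagram of the factor solution
```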
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix nRn be approximated by the product of a factor matrix nFk and its transpose plus a diagonal matrix of uniquenesses?

R = FF' + U^2                                        (1)
The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done
in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
A classic data set collected by Thurstone and Thurstone (1941), and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
5.1.2 Principal Axis Factor Analysis
An alternative least squares algorithm, factor.pa (incorporated into fa as an option (fm = "pa")), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
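That setting can be sketched as follows (a hypothetical call; max.iter is the relevant fa argument):

```r
library(psych)
# principal axis with a single iteration, said to reproduce the SAS PA solution
f1 <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "pa", max.iter = 1)
```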
The solutions from the fa, the factor.minres, and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.
Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone, 3, n.obs = 213, fm = "pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix

                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

    PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).
The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)

[Figure 4 appears here: a scatter plot matrix (SPLOM) of the MR1, MR2, and MR3 factor loadings, each axis running from 0.0 to 0.8, titled "Factor Analysis".]
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone, 3, n.obs = 213, fm = "wls")
> print(f3w, cut = 0, digits = 3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)

[Path diagram "Factor Analysis": the three oblique factors MR1, MR2, and MR3 with their loadings on the nine variables and the interfactor correlations]
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components analysis (PCA)
An alternative to factor analysis, which unfortunately is frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases: factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.
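The algebra behind unrotated components is just an eigendecomposition: component loadings are the eigenvectors of the correlation matrix rescaled by the square roots of their eigenvalues, so that keeping all components reproduces the correlation matrix exactly, unit diagonal included — which is why PCA "accounts for" the unique as well as the common variance. A minimal numeric sketch, shown here in Python/NumPy with a made-up 3 x 3 correlation matrix rather than the Thurstone data or the psych functions:

```python
import numpy as np

# Hypothetical 3-variable correlation matrix (illustration only)
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

# Eigendecomposition; reorder so eigenvalues are descending
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Unrotated component loadings: eigenvectors scaled by sqrt(eigenvalue)
loadings = vecs * np.sqrt(vals)

# With all components kept, the loadings reproduce R exactly,
# including the 1s on the diagonal (total, not just common, variance)
assert np.allclose(loadings @ loadings.T, R)

# The sum of squared loadings on each component is its eigenvalue
print(np.round((loadings ** 2).sum(axis=0), 3))
```

A factor model, by contrast, replaces the unit diagonal with communality estimates, which is why factor loadings come out smaller than component loadings.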
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general, factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general and group factors.
> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))
[Path diagram "Omega": a higher order factor g above the three group factors F1, F2, and F3, which in turn load on the nine Thurstone variables]
Figure 6: A higher order factor solution to the Thurstone 9 variable problem.
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
> om <- omega(Thurstone,n.obs=213)
[Path diagram "Omega": the bifactor (Schmid-Leiman) solution, with g loading directly on all nine variables and residualized group factors F1, F2, and F3]
Figure 7: A bifactor factor solution to the Thurstone 9 variable problem.
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1. Find the proximity (e.g. correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.
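The agglomerative loop in steps 1-5 can be sketched in a few lines. The toy below (Python, with a hypothetical 4-item correlation matrix, not the bfi data) uses mean inter-item correlation as the similarity measure and a fixed number of clusters as the stopping rule; the real iclust instead stops when neither α nor β would increase, and then purifies the solution:

```python
import numpy as np

# Hypothetical correlation matrix for 4 items: two highly correlated pairs
R = np.array([[1.0, 0.7, 0.2, 0.1],
              [0.7, 1.0, 0.1, 0.2],
              [0.2, 0.1, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])
clusters = [[0], [1], [2], [3]]       # step 1: every item is its own cluster

def sim(a, b):
    # similarity of two clusters: mean correlation between their items
    return np.mean([R[i, j] for i in a for j in b])

while len(clusters) > 2:              # toy stopping rule: two clusters remain
    # step 2: find the most similar pair of clusters
    pairs = [(i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))]
    i, j = max(pairs, key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
    # step 3: combine the pair into a new cluster; steps 4-5 happen
    # implicitly because sim() is recomputed on the next pass
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(sorted(sorted(c) for c in clusters))  # -> [[0, 1], [2, 3]]
```

With this block-structured matrix, the two highly correlated pairs are merged first, recovering the intended two clusters.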
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])
[ICLUST dendrogram: the 25 personality items grouped hierarchically into clusters C1-C21, with α and β reliabilities shown at each node]
Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations the first time takes some time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.

> summary(ic)  #show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations, reliabilities on diagonal, correlations corrected for attenuation above diagonal:
      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:
c_{f_i f_j} = Σ_{k=1}^{n} f_ik f_jk / √(Σ f_ik² Σ f_jk²)
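Since the congruence coefficient is just an uncentered cosine, it is trivial to compute outside of factor.congruence. A sketch in Python (the loading vectors are made up for illustration):

```python
import math

def congruence(f_i, f_j):
    # cosine of the angle between two columns of loadings:
    # the sum of cross-products over the square root of the product
    # of the two sums of squared loadings
    num = sum(a * b for a, b in zip(f_i, f_j))
    den = math.sqrt(sum(a * a for a in f_i) * sum(b * b for b in f_j))
    return num / den

# identical loading patterns: perfect congruence
print(round(congruence([0.9, 0.8, 0.1], [0.9, 0.8, 0.1]), 6))  # -> 1.0
# non-overlapping patterns: congruence of 0
print(round(congruence([1.0, 0.0], [0.0, 1.0]), 6))            # -> 0.0
```

Note that, unlike a correlation, the loadings are not centered first, so two factors with uniformly high loadings will always appear highly congruent.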
Consider the case of a four factor solution and four cluster solution to the Big Five problem.
> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[ICLUST dendrogram "ICLUST using polychoric correlations": the 25 items grouped hierarchically, with α and β reliabilities shown at each node]
Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[ICLUST dendrogram "ICLUST using polychoric correlations for nclusters=5": the 25 items grouped into five clusters, with α and β reliabilities shown at each node]
Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST, beta.size=3")

[ICLUST dendrogram "ICLUST, beta.size=3": the 25 items grouped hierarchically under the beta.size=3 criterion, with α and β reliabilities shown at each node]
Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1  C20 C20
A2  C20 C20  0.59
A3  C20 C20  0.65
A4  C20 C20  0.43
A5  C20 C20  0.65
C1  C15 C15              0.54
C2  C15 C15              0.62
C3  C15 C15              0.54
C4  C15 C15        0.31 -0.66
C5  C15 C15 -0.30  0.36 -0.59
E1  C20 C20 -0.50
E2  C20 C20 -0.61  0.34
E3  C20 C20  0.59             -0.39
E4  C20 C20  0.66
E5  C20 C20  0.50        0.40 -0.32
N1  C16 C16        0.76
N2  C16 C16        0.75
N3  C16 C16        0.74
N4  C16 C16 -0.34  0.62
N5  C16 C16        0.55
O1  C20 C21                   -0.53
O2  C21 C21                    0.44
O3  C20 C21  0.39             -0.62
O4  C21 C21                   -0.33
O5  C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations, reliabilities on diagonal, correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68, Pattern fit = 0.96, RMSR = 0.05
NULL
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
    low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))
      MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1  1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2  0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3  0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g    0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1   1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2   0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3   0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2   0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1  1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2  0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3  0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigators creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
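The logic of criterion 3, parallel analysis, can be sketched directly: compare the eigen values of the observed correlation matrix with the mean eigen values of random data of the same dimensions, and count how many observed eigen values exceed their random counterparts. A sketch in Python/NumPy on simulated two-factor data (this illustrates the idea only, not the resampling details of fa.parallel; the loadings and sample size are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects, n_vars = 500, 6

# Simulate "real" data with two factors (hypothetical structure)
f = rng.standard_normal((n_subjects, 2))
load = np.array([[0.7, 0.0], [0.7, 0.0], [0.7, 0.0],
                 [0.0, 0.7], [0.0, 0.7], [0.0, 0.7]])
X = f @ load.T + 0.7 * rng.standard_normal((n_subjects, n_vars))

# Eigen values of the observed correlation matrix, largest first
real_eigs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

# Mean eigen values of pure-noise data of the same size (20 replications)
rand_eigs = np.mean([
    np.sort(np.linalg.eigvalsh(np.corrcoef(
        rng.standard_normal((n_subjects, n_vars)), rowvar=False)))[::-1]
    for _ in range(20)], axis=0)

# Count how many observed eigen values exceed the random ones
n_retain = int(np.sum(real_eigs > rand_eigs))
print(n_retain)  # the two simulated factors should be recovered
```

Note that the random eigen values here hover near 1 rather than at 1, which is exactly why parallel analysis retains fewer components than the eigen-value-of-1 rule.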
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest but that there are also numerous minor factors of no substantive interest but that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).
Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Plot "Very Simple Structure of a Big 5 inventory": Very Simple Structure fit (0 to 1) as a function of the number of factors (1 to 8), plotted separately for complexity 1 through 4 solutions]
Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors =  6  and the number of components =  6

[Scree plot "Parallel Analysis of a Big 5 inventory": eigen values of principal components and factor analysis by factor/component number, for actual, simulated, and resampled data]
Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation foundby setcor (discussed later)
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

λ3 = n/(n−1) · (1 − tr(Vx)/Vx) = n/(n−1) · (Vx − tr(Vx))/Vx = α
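Here Vx is the variance of the total score (the sum of all elements of the item covariance matrix) and tr(Vx) is the sum of the item variances, so the formula is one line of code. A sketch in Python/NumPy with hypothetical τ-equivalent items (equal variances and covariances), not the psych alpha function:

```python
import numpy as np

def cronbach_alpha(cov):
    # lambda_3 / alpha: n/(n-1) * (1 - tr(Vx)/Vx), where Vx is the
    # variance of the total score (the sum of the covariance matrix)
    n = cov.shape[0]
    Vx = cov.sum()
    return n / (n - 1) * (1 - np.trace(cov) / Vx)

# Four tau-equivalent items: unit variances, all covariances 0.3
cov = np.full((4, 4), 0.3)
np.fill_diagonal(cov, 1.0)
print(round(cronbach_alpha(cov), 3))  # -> 0.632
```

As a check, the same value follows from the Spearman-Brown formula for four items with average r = .3: 4(.3)/(1 + 3(.3)) ≈ .632.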
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j², and is

λ6 = 1 − Σ e_j²/Vx = 1 − Σ(1 − r_smc²)/Vx
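For a correlation matrix, the smc of each item is available from the inverse: smc_j = 1 − 1/(R⁻¹)_jj. A sketch of λ6 in Python/NumPy (with four hypothetical equally correlated items, not the psych guttman function):

```python
import numpy as np

def guttman_lambda6(R):
    # squared multiple correlation of each item with the remaining items
    smc = 1 - 1 / np.diag(np.linalg.inv(R))
    # Vx: variance of the total score for standardized items
    Vx = R.sum()
    # lambda_6 = 1 - sum(1 - smc_j) / Vx
    return 1 - np.sum(1 - smc) / Vx

# Four equally correlated items (r = .3)
R = np.full((4, 4), 0.3)
np.fill_diagonal(R, 1.0)
print(round(guttman_lambda6(R), 4))  # -> 0.5625
```

For these equal-loading items λ6 (.5625) is smaller than α (.632), consistent with the statement below that, with equal item loadings, alpha > G6.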
The squared multiple correlation is a lower bound for the item communality and as thenumber of items increases becomes a better estimate
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Path diagram "Factor analysis and extension": factors MR1 and MR2 defined by the odd-numbered variables (left) and extended to the even-numbered variables (right)]
Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
More complete reliability analyses of a single scale can be done using the omega functionwhich finds ωh and ωt based upon a hierarchical factor analysis
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include:
alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.
guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).
omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)
cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and an n x n correlation matrix, find the correlations of the composite clusters.
scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha, coefficient G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.
score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates if one item is dropped at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)
Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1,-1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1,1,1,1,1,1,1)
> alpha(attitude,keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures
Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02
Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf )
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.
There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab where rab is the oblique correlation between the factors), or to "first" or "second" in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness (u²ⱼ) from factor analysis to find e²ⱼ. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f, (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h²ⱼ = c²ⱼ + Σᵢ f²ᵢⱼ, and the unique variance for the item, u²ⱼ = σ²ⱼ(1 − h²ⱼ), may be used to estimate the test reliability. That is, if h²ⱼ is the communality of item j, based upon general as well as group factors, then for standardized items, e²ⱼ = 1 − h²ⱼ and

    ωt = (1cc′1 + 1AA′1′) / Vx = 1 − Σ(1 − h²ⱼ)/Vx = 1 − Σu²ⱼ/Vx
Because h²ⱼ ≥ r²smc, ωt ≥ λ6.
It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.
    ωh = 1cc′1 / Vx
Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

    ω∞ = 1cc′1 / (1cc′1 + 1AA′1′)
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
     g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max 4.76  max/min = 1.46
mean percent general = 0.64  with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
> om9 <- omega(r9,title="9 simulated variables")
9 simulated variables
Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index = 0.026 and the 90% confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90% confidence intervals are 0.085 0.114
BIC = -79.7

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9,n.obs=500)
Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
     g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63
With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max 4.76  max/min = 1.46
mean percent general = 0.64  with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.026 and the 90% confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90% confidence intervals are 0.085 0.114
BIC = -79.7

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
     g  F1*  F2*  F3*   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.
> splitHalf(r9)
Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
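For instance, on a 1 to 6 response scale the max + min rule works out as follows (a toy base-R check; the responses are invented):

```r
item <- c(1, 2, 6, 5, 3)       # responses on a 1-6 scale
reversed <- 6 + 1 - item       # max + min - item
reversed                       # 6 5 1 2 4: the scale direction is flipped
```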
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.
Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and for the demographic data.
This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.
Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores
Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6* reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5
Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.
To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 16.)
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.
Consider the same bfi data set, but first find the correlations and then use cluster.cor:
> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)
Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2
Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors), and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)
Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf) #compare to previous results
          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2)) #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)
[Figure 17 here: four panels (reason.4, reason.16, reason.17, reason.19), each plotting P(response) against theta for response alternatives 0 through 6.]
Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δ_j, and item discrimination, α_j, may be found from factor analysis of the tetrachoric correlations where λ_j is just the factor loading on the first factor and τ_j is the normal threshold reported by the tetrachoric function:
δ_j = D τ_j / √(1 − λ_j²)        α_j = λ_j / √(1 − λ_j²)        (2)
where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λ_j) and item thresholds (τ_j) are just
λ_j = α_j / √(1 + α_j²)        τ_j = δ_j / √(1 + α_j²)
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)
> test
Item Response Analysis using Factor Analysis
Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
               -3    -2    -1     0     1     2     3
V1           0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2           0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3           0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4           0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5           0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6           0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7           0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8           0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9           0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info    0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM          1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02  0.53  0.65  0.68  0.64  0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90 % confidence intervals are 0.198 0.218
BIC = 1009.07
Similar analyses can be done for polytomous items such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
               -3    -2    -1     0     1     2     3
E1           0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2           0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3           0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4           0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5           0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info    2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM          0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability  0.53  0.70  0.76  0.75  0.68  0.49  0.02
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90 % confidence intervals are 0.083 0.111
BIC = 96.23
The item information functions show that not all of the items are equally good (Figure 19):
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))
[Figure 18 here: three panels ("Item parameters from factor analysis", "Item information from factor analysis", "Test information -- item parameters from factor analysis"), each plotted against the Latent Trait (normal scale) for items V1-V9.]
Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt,type="IIC")
[Figure 19 here: "Item information from factor analysis for factor 1", plotting Item Information against the Latent Trait (normal scale) for items -E1, -E2, E3, E4, E5.]
Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
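For example, a sketch reusing the extraversion analysis from above (cut is the parameter just described; the value of .5 is arbitrary):

```r
library(psych)
e.irt <- irt.fa(bfi[11:15])
# limit the item information plot to items with discriminations > .5
# and capture the invisibly returned average item information
high.info <- plot(e.irt, type = "IIC", cut = .5)
print(high.info)
```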
> print(e.info,sort=TRUE)
Item Response Analysis using Factor Analysis
Summary information by factor and item
Factor = 1
               -3    -2    -1     0     1     2     3
E2           0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4           0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1           0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3           0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5           0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info    2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM          0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability  0.53  0.70  0.76  0.75  0.68  0.49  0.02
More extensive IRT packages include ltm and eRm and should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 21) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.
> iq.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
reason.4    0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16   0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17   0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19   0.05  0.17  0.41  0.46  0.22  0.07  0.02
> iq.irt <- irt.fa(iq.tf)
[Figure 20 here: "Item information from factor analysis", plotting Item Information against the Latent Trait (normal scale) for reason.4, reason.16, reason.17, reason.19, letter.7, letter.33, letter.34, letter.58, matrix.45, matrix.46, matrix.47, matrix.55, rotate.3, rotate.4, rotate.6, and rotate.8.]
Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7    0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33   0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34   0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58   0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45   0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46   0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47   0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55   0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3    0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4    0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6    0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8    0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info   0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM         1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80  0.49  0.80  0.85  0.82  0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90 % confidence intervals are 0.126 0.135
BIC = 2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:
> v9 <- sim.irt(9,1000,-2.,2.,mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho,4)
[Figure 21 here: Omega diagram showing the general factor g (loadings of roughly .4 to .7 on all 16 ability items) and four group factors, F1-F4.]
Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")
These results are seen in Figure 22
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics, and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:
r_xy = η_xwg * η_ywg * r_xywg + η_xbg * η_ybg * r_xybg        (3)
Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.

> pairs.panels(scores.df,pch=".",gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring",line=3)
[Figure 22 here: pairs.panels plot "Comparing true theta for IRT, Rasch and classically based scoring", showing the correlations among the true theta, IRT theta, total score, fit, and Rasch score columns of scores.df.]
where r_xy is the normal correlation which may be decomposed into a within group and between group correlations, r_xywg and r_xybg, and η (eta) is the correlation of the data with the within group values or the group means.
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
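A minimal sketch of the first of these (statsBy and the sat.act data set are part of psych; the output element names rwg and rbg follow the decomposition described above):

```r
library(psych)
# decompose the sat.act correlations by level of education
sb <- statsBy(sat.act, group = "education")
sb$rwg  # pooled within group correlations
sb$rbg  # correlations of the group means (between groups)
```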
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
R² = 1 − ∏_{i=1}^{n} (1 − λ_i)

where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = R_xx⁻¹ R_xy R_yy⁻¹ R_yx
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476
Compare this with the output from setCor:
> #compare with mat.regress
> setCor(c(4:6),c(1:3),C, n.obs=700)

Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)
Multiple Regression from matrix input
Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19
multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11
Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohens Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohens Set Correlation  7.26  9  1681.86
Unweighted correlation between the two sets = 0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R² is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, and sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.
sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic situation, is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
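For instance, a sketch using sim.minor (the particular sizes here are arbitrary, and the element name observed is my reading of the sim.minor output):

```r
library(psych)
set.seed(42)
# 12 variables, 3 major factors, plus the default set of minor factors
minor <- sim.minor(nvar = 12, nfact = 3, n = 500)
# can parallel analysis still identify just the 3 major factors?
fa.parallel(minor$observed)
```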
Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
The logistic model is

P(x|θ_i, δ_j, γ_j, ζ_j) = γ_j + (ζ_j − γ_j) / (1 + e^(α_j(δ_j − θ_i)))        (4)
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.

(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
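Equation 4 is simple enough to write out directly (a sketch; p4pl is a hypothetical helper, not a psych function):

```r
# probability of endorsement under the 4 parameter logistic model (equation 4)
p4pl <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1) {
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
}
p4pl(theta = 0, delta = 0)  # Rasch case: .5 when ability equals difficulty
```

Setting gamma = 0, zeta = 1, and alpha = 1 reduces this to the Rasch model; a nonzero gamma gives the guessing floor of the 3PL.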
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)

c <- iclust(Thurstone)
plot(c)           #a pretty boring plot
iclust.diagram(c) #a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)          #a more interesting plot

data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)

ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels ="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels ="left to right")
> dia.curve(ll$top,ul$bottom,"double") #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")         #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$right)
> dia.curved.arrow(mr,lr$top)
[Figure 23 here: the rendered "Demonstration of dia functions" plot, with the labelled rectangles and ellipses connected by the arrows and curves drawn above.]
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
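A few of these helpers in action (a sketch, assuming the psych package is loaded):

```r
library(psych)
fisherz(.5)                    # Fisher r to z transform of r = .5 (about 0.549)
geometric.mean(c(1, 2, 4, 8))  # nth root of the product (here 2.83)
headtail(sat.act)              # first and last few rows, with ellipsis between
```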
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as McDonald (1999). The original data are from Thurstone and Thurstone (1941) and reanalyzed by Bechtoldt (1961). Personality item data representing five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtholdt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
14 Development version and a user's guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.
Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.
News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,
> news(Version > "1.2.8", package = "psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.
For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.
16 SessionInfo
This document was prepared using the following settings
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33   matrixcalc_1.0-3   MASS_7.3-45   grid_3.3.0    arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0      coda_0.18-1        minqa_1.2.4   mi_1.0        nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18       splines_3.3.0      lme4_1.1-12   sem_3.1-7     tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1   parallel_3.3.0   mnormt_1.5-4
References
Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.
Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.
Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.
Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.
Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.
Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).
Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd ed. edition.
Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.
Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.
Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.
Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.
Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.
Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.
Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.
Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.
Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.
Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.
Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.
Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.
Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.
Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.
MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.
Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.
McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.
Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.
Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.
Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.
Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.
Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.
Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.
Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.
Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.
Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.
Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.
Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.
Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.
Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.
Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.
Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.
Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors." Psychological Review, 42(5):425–454.
Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.
Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.
Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.
Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.
> lower <- lowerCor(male[-1])
          edctn age   ACT   SATV  SATQ
education 1.00
age       0.61  1.00
ACT       0.16  0.15  1.00
SATV      0.02 -0.06  0.61  1.00
SATQ      0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV  SATQ
education 1.00
age       0.52  1.00
ACT       0.16  0.08  1.00
SATV      0.07 -0.03  0.53  1.00
SATQ      0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower, upper)
> round(both, 2)
          education   age   ACT  SATV  SATQ
education        NA  0.52  0.16  0.07  0.03
age            0.61    NA  0.08 -0.03 -0.09
ACT            0.16  0.15    NA  0.53  0.58
SATV           0.02 -0.06  0.61    NA  0.63
SATQ           0.08  0.04  0.60  0.68    NA
It is also possible to compare two matrices by taking their differences, displaying one matrix below the diagonal and the difference of the second from the first above the diagonal.
> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)
          education   age   ACT  SATV  SATQ
education        NA  0.09  0.00 -0.05  0.05
age            0.61    NA  0.07 -0.03  0.13
ACT            0.16  0.15    NA  0.08  0.02
SATV           0.02 -0.06  0.61    NA  0.05
SATQ           0.08  0.04  0.60  0.68    NA
4.4.3 Heatmap displays of correlational structure
Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 2), or a simulated data set of 24 variables with a circumplex structure (Figure 3). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.
> png('corplot.png')
> cor.plot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1
Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ, main='24 variables in a circumplex')
> dev.off()
null device
          1
Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
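As a concrete illustration of why φ is just a Pearson correlation on dichotomous data, here is a minimal base-R sketch (the item scores are invented for illustration; psych's tetrachoric machinery is not involved):

```r
# phi from the cell counts of a 2 x 2 table equals the Pearson r of 0/1 scores
x <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 1)
y <- c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0)
n11 <- sum(x == 1 & y == 1)          # cell counts of the 2 x 2 table
n10 <- sum(x == 1 & y == 0)
n01 <- sum(x == 0 & y == 1)
n00 <- sum(x == 0 & y == 0)
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
phi                                   # 0.408, the same value as cor(x, y)
```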
Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
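The smoothing idea can be sketched in a few lines of base R. This shows only the general eigenvalue-adjustment logic, not psych's cor.smooth itself; the smooth_cor name and the example matrix are invented for illustration:

```r
# Raise the offending eigenvalues, rescale them to sum to the number of
# variables, and rebuild a matrix with a unit diagonal.
smooth_cor <- function(R, eps = 1e-12) {
  e <- eigen(R, symmetric = TRUE)
  vals <- e$values
  if (min(vals) >= eps) return(R)            # already positive definite
  vals[vals < eps] <- 100 * eps              # make them (barely) positive
  vals <- vals * ncol(R) / sum(vals)         # rescale to sum to n variables
  cov2cor(e$vectors %*% diag(vals) %*% t(e$vectors))
}
bad <- matrix(c(  1, .9, -.9,                # an improper "correlation" matrix:
                 .9,  1,  .9,                # r13 cannot be -.9 when
                -.9, .9,   1), 3, 3)         # r12 = r23 = .9
min(eigen(bad)$values)                       # negative: not positive semi-definite
min(eigen(smooth_cor(bad))$values)           # now essentially zero or above
```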
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.
At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
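The SVD route can be illustrated in base R. This sketch, with made-up random data, shows the rank-k least squares approximation that underlies component-style data reduction; the psych functions instead work from the correlation or covariance matrix:

```r
# Keep the k largest singular values for the best rank-k approximation
# of the data matrix (Eckart-Young theorem).
set.seed(42)
X <- matrix(rnorm(100 * 6), 100, 6)
s <- svd(X)
k <- 2
Xk <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
# The squared error equals the sum of the discarded squared singular values:
sum((X - Xk)^2)
sum(s$d[-(1:k)]^2)
```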
Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the square root of the sums of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
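The congruence coefficient that factor.congruence tabulates for each pair of factors can be sketched in base R (the congruence helper and the loadings below are invented for illustration):

```r
# A cosine, not a correlation: the means are not subtracted first.
congruence <- function(f1, f2) sum(f1 * f2) / sqrt(sum(f1^2) * sum(f2^2))
f1 <- c(.9, .8, .7, .1, .0)
f2 <- c(.8, .9, .6, .2, .1)
congruence(f1, f2)   # near 1 for similar loading patterns
cor(f1, f2)          # the correlation removes the mean loading first
```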
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

2 Cattell (1978), as well as MacCallum et al. (2007), argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.
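The logic of parallel analysis can be sketched in base R (the parallel_sketch function and the simulated one-factor data are invented for illustration; fa.parallel is considerably more complete):

```r
# Retain components whose observed eigenvalues exceed the mean
# eigenvalues of random data of the same dimensions.
parallel_sketch <- function(X, n.iter = 50) {
  n <- nrow(X); p <- ncol(X)
  obs <- eigen(cor(X))$values
  rand <- replicate(n.iter, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
  sum(obs > rowMeans(rand))           # suggested number of components
}
set.seed(17)
theta <- matrix(rnorm(500), 500, 1)   # one latent factor
X <- theta %*% t(rep(.8, 6)) + matrix(rnorm(500 * 6, sd = .6), 500, 6)
parallel_sketch(X)                    # suggests 1
```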
fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm, specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix nRn be approximated by the product of a factor matrix nFk and its transpose plus a diagonal matrix of uniquenesses?

R = FF′ + U²   (1)
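Equation (1) can be illustrated directly in base R: given loadings and uniquenesses, the model reproduces a proper correlation matrix (the one-factor loadings below are made up for illustration):

```r
lambda <- matrix(c(.9, .8, .7, .6), 4, 1)    # the factor matrix nFk, with k = 1
U2 <- diag(as.vector(1 - lambda^2))          # diagonal matrix of uniquenesses
R <- lambda %*% t(lambda) + U2               # equation (1)
round(R, 2)
diag(R)                 # all 1, because h2 + u2 = 1 for each variable
R[1, 2]                 # 0.72 = .9 * .8, the product of the two loadings
```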
The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
5.1.2 Principal Axis Factor Analysis
An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm = "pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
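The iterative eigen decomposition described above can be sketched in base R. This is an invented, one-factor caricature (the pa_sketch name is not a psych function; fa with fm="pa" is more general):

```r
pa_sketch <- function(R, n.iter = 50) {
  h2 <- 1 - 1 / diag(solve(R))        # start from squared multiple correlations
  for (i in 1:n.iter) {
    Rh <- R
    diag(Rh) <- h2                    # replace the diagonal with the
    e <- eigen(Rh, symmetric = TRUE)  # communalities from the previous round
    L <- e$vectors[, 1] * sqrt(e$values[1])
    h2 <- L^2                         # updated communality estimates
  }
  L * sign(sum(L))                    # fix the arbitrary eigenvector sign
}
R <- matrix(.6, 3, 3); diag(R) <- 1   # equicorrelated: true loadings sqrt(.6)
round(pa_sketch(R), 3)                # 0.775 0.775 0.775
```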
The solutions from the fa, the factor.minres, and the factor.pa, as well as the principal, functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin and quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)

[Figure 4 appears here: a scatter plot matrix (SPLOM) of the loadings on MR1, MR2, and MR3, with each axis running from 0 to 0.8, titled "Factor Analysis".]
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                   WLS1   WLS2   WLS3    h2    u2  com
Sentences         0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary        0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion   0.833  0.034  0.007 0.735 0.265 1.00
First.Letters    -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words   -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes          0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series     0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees         0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group     -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)

[Figure 5 appears here: a path diagram titled "Factor Analysis". Sentences, Vocabulary, and Sent.Completion load 0.9, 0.9, and 0.8 on MR1; First.Letters, 4.Letter.Words, and Suffixes load 0.9, 0.7, and 0.6 on MR2; Letter.Series, Letter.Group, and Pedigrees load 0.8, 0.6, and 0.5 on MR3. The factor intercorrelations of 0.6, 0.5, and 0.5 are shown as curved arrows.]
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components analysis (PCA)
An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than for the fa solution.
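The relation between component loadings and the eigen decomposition can be shown in base R: principal's unrotated loadings are the eigen vectors of the correlation matrix rescaled by the square roots of the eigen values (the random data here are for illustration only):

```r
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)
e <- eigen(cor(X))
loadings <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))
colSums(loadings^2)      # the sum of squared loadings of each component...
e$values[1:2]            # ...equals its eigenvalue
```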
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone, 3, n.obs = 213, rotate = "Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general and group factors.
> om.h <- omega(Thurstone, n.obs = 213, sl = FALSE)
> op <- par(mfrow = c(1, 1))

[Figure: omega diagram of the higher order solution, with the nine variables (Sentences through Pedigrees) loading on three group factors F1, F2, and F3, which in turn load on a general factor g.]

Figure 6: A higher order factor solution to the Thurstone 9 variable problem
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
> om <- omega(Thurstone, n.obs = 213)

[Figure: omega diagram of the Schmid-Leiman solution, with each of the nine variables loading on the general factor g as well as on one of the three residualized group factors F1, F2, and F3.]

Figure 7: A bifactor solution to the Thurstone 9 variable problem
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) reviews why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1. Find the proximity (e.g. correlation) matrix,

2. Identify the most similar pair of items,

3. Combine this most similar pair of items to form a new variable (cluster),

4. Find the similarity of this cluster to all other items and clusters,

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or in iclust if there is a failure to increase reliability coefficients α or β),

6. Purify the solution by reassigning items to the most similar cluster center.
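Steps 1 through 4 can be sketched in a few lines of base R with the built-in attitude data. This is only a toy illustration of the agglomeration idea; the real iclust decides when to merge using the α and β reliability criteria described in step 5, not the raw correlation used here.

```r
r <- cor(attitude)                  # step 1: proximity (correlation) matrix
diag(r) <- 0                        # ignore self-correlations
pair <- which(abs(r) == max(abs(r)), arr.ind = TRUE)[1, ]  # step 2: most similar pair
cluster <- rowMeans(scale(attitude[, pair]))               # step 3: combine into a cluster score
sim <- cor(cluster, attitude[, -pair])                     # step 4: similarity to remaining items
```

The loop would then continue, treating the new cluster score as if it were another item.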
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])

[Figure: ICLUST tree diagram of the 25 personality items, showing the hierarchical structure of clusters C1-C21 with the α and β of each cluster.]

Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (rpoly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis shows three large clusters and one smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OSX, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> rpoly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
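The logic behind n.iter can be sketched by hand in base R with the built-in attitude data: resample rows with replacement, refit, and take quantiles of the replicated loadings. For simplicity this sketch bootstraps the first principal component loading of the first variable rather than a full psych factor solution.

```r
set.seed(42)
boot_load <- replicate(200, {
  samp <- attitude[sample(nrow(attitude), replace = TRUE), ]  # resample rows
  ev <- eigen(cor(samp))
  abs(ev$vectors[1, 1] * sqrt(ev$values[1]))  # first PC loading of variable 1
})
quantile(boot_load, c(0.025, 0.975))  # a 95% bootstrapped confidence interval
```

fa(..., n.iter = ) does the same kind of resampling for every loading and interfactor correlation at once.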
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}
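The formula above translates directly into a few lines of matrix algebra; this is a minimal sketch, using a made-up toy loading matrix, of what psych's factor.congruence computes for you.

```r
# f1 and f2 are assumed to be loading matrices with the same variables (rows)
congruence <- function(f1, f2) {
  f1 <- as.matrix(f1); f2 <- as.matrix(f2)
  crossprod(f1, f2) / sqrt(outer(colSums(f1^2), colSums(f2^2)))
}

f <- matrix(c(0.8, 0.7, 0.1,
              0.1, 0.2, 0.9), 3, 2)  # toy loadings: 3 variables, 2 factors
round(congruence(f, f), 2)           # the diagonal is 1: each factor is fully congruent with itself
```

Because the coefficient is a cosine, it ignores differences in the overall size of the loadings and reflects only their pattern.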
Consider the case of a four factor solution and four cluster solution to the Big Five problem.
> f4 <- fa(bfi[1:25], 4, fm = "pa")
> factor.congruence(f4, ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(rpoly$rho, title = "ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[Figure: ICLUST tree diagram of the 25 bfi items based on polychoric correlations, showing clusters C1-C23 with their α and β values.]

Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(rpoly$rho, 5, title = "ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure: ICLUST tree diagram of the 25 bfi items based on polychoric correlations, with the solution constrained to 5 clusters.]

Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(rpoly$rho, beta.size = 3, title = "ICLUST beta.size=3")

[Figure: ICLUST tree diagram of the 25 bfi items based on polychoric correlations, with the beta.size criterion set to 3.]

Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic, cut = .3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1  C20 C20
A2  C20 C20  0.59
A3  C20 C20  0.65
A4  C20 C20  0.43
A5  C20 C20  0.65
C1  C15 C15               0.54
C2  C15 C15               0.62
C3  C15 C15               0.54
C4  C15 C15        0.31  -0.66
C5  C15 C15 -0.30  0.36  -0.59
E1  C20 C20 -0.50
E2  C20 C20 -0.61  0.34
E3  C20 C20  0.59              -0.39
E4  C20 C20  0.66
E5  C20 C20  0.50        0.40  -0.32
N1  C16 C16        0.76
N2  C16 C16        0.75
N3  C16 C16        0.74
N4  C16 C16 -0.34  0.62
N5  C16 C16        0.55
O1  C20 C21                    -0.53
O2  C21 C21                     0.44
O3  C20 C21  0.39              -0.62
O4  C21 C21                    -0.33
O5  C21 C21                     0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter = 20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
>
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))
     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy", that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best (Horn and Engstrom, 1979).
Techniques most commonly used include
1) Extracting factors until the chi square of the residual matrix is not significant
2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing, but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
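For what it is worth, criterion 8 takes one line of base R (shown here with the built-in attitude data only because it is so commonly used; as just noted, it is probably the worst of the criteria):

```r
# Count principal components of the correlation matrix with eigenvalue > 1
sum(eigen(cor(attitude))$values > 1)
```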
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest but that there are also numerous minor factors of no substantive interest, but that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).
Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")

[Figure: Very Simple Structure fit (y axis, 0 to 1) plotted against the number of factors (1 to 8), with separate curves for complexity 1 through 4 solutions.]

Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.
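The comparison that parallel analysis makes can be hand-rolled in a few lines of base R with the built-in attitude data. This is only a sketch of the eigenvalue comparison; fa.parallel does it properly, including the factor (as opposed to component) solutions and the resampled real data.

```r
set.seed(17)
n <- nrow(attitude); p <- ncol(attitude)
real <- eigen(cor(attitude))$values                            # observed eigenvalues
rand <- replicate(20, eigen(cor(matrix(rnorm(n * p), n, p)))$values)  # 20 random data sets
sum(real > rowMeans(rand))  # components whose eigenvalue exceeds the random mean
```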
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default, the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25], main = "Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure: scree plot of eigenvalues of principal components and of factor analysis for the actual data, simulated data, and resampled data.]

Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation, found by set.cor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
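The λ3 formula above translates directly into a few lines of R; this is a minimal sketch, using a made-up covariance matrix of three equally correlated standardized items, of the core computation that the alpha function wraps with much more reporting.

```r
# Alpha straight from the formula: Vx is the total test variance sum(V),
# tr(Vx) the sum of the item variances
alpha_from_cov <- function(V) {
  n <- ncol(V)
  (n / (n - 1)) * (1 - sum(diag(V)) / sum(V))
}

V <- matrix(0.3, 3, 3); diag(V) <- 1  # three standardized items, all r = .3
alpha_from_cov(V)                     # 0.5625, matching n*r/(1 + (n-1)*r)
```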
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j^2, and is

\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}
The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
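The λ6 formula is also short in R. A minimal sketch, again with a made-up correlation matrix of equally correlated items (psych's smc and guttman functions do this, and more, for you):

```r
lambda6 <- function(R) {
  smc <- 1 - 1 / diag(solve(R))  # squared multiple correlation of each item
  1 - sum(1 - smc) / sum(R)      # for standardized items, Vx = sum(R)
}

R <- matrix(0.3, 3, 3); diag(R) <- 1
lambda6(R)
```

The smc line uses the standard identity that the squared multiple correlation of item j with the other items is one minus the reciprocal of the jth diagonal element of the inverse correlation matrix.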
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[,s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe = fe)

[Figure: factor analysis and extension diagram; the two factors MR1 and MR2, defined by the eight odd-numbered variables on the left, are extended to the eight even-numbered variables on the right.]

Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
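The raw vs. standardized distinction is simply the same formula applied to the covariance vs. the correlation matrix, sketched here with the built-in attitude data:

```r
alpha_gen <- function(V) {
  n <- ncol(V)
  (n / (n - 1)) * (1 - sum(diag(V)) / sum(V))
}

raw <- alpha_gen(cov(attitude))  # raw alpha: sensitive to unequal item variances
std <- alpha_gen(cor(attitude))  # "standardized" alpha: correlations only
c(raw = raw, standardized = std)
```

The two agree only when all items have equal variances.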
More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include
alpha  Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman  Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega  Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor  Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems  Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice  Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds
total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the α if one item is dropped at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n = 500, raw = TRUE)$observed
> round(cor(r9), 2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
45
V3 500 065 065 061 053 -00462 101V4 500 065 065 059 053 -00091 098V5 500 064 064 058 052 -00012 103V6 500 055 055 046 041 -00499 099V7 500 060 060 052 046 00481 101V8 500 058 057 049 043 00526 107V9 500 047 048 036 032 00164 097
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1, -1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1, 1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500, short=FALSE, low=-2, high=2, categorical=TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf .)
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p. 137)
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.
There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be √rab, where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
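For instance, a minimal sketch (assuming the psych package and the simulated r9 data from above) of requesting such a two factor solution:

```r
# Sketch only: force a two-factor omega solution, equating the general
# factor with the first group factor; omega() will warn that the model
# is not well defined (see the discussion above).
library(psych)
om2 <- omega(r9, nfactors = 2, option = "first")
```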
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness, u², from factor analysis to find e²j. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h²j = c²j + Σ f²ij, and the unique variance for the item, u²j = σ²j(1 − h²j), may be used to estimate the test reliability. That is, if h²j is the communality of item j, based upon general as well as group factors, then for standardized items, e²j = 1 − h²j, and

    ωt = (1cc′1 + 1AA′1′)/Vx = 1 − Σ(1 − h²j)/Vx = 1 − Σu²/Vx

Because h²j ≥ r²smc, ωt ≥ λ6.
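As an illustrative check (not part of the vignette), ωt and ωh can be computed directly from Schmid-Leiman loadings in base R. The loadings below are rounded values copied from the omega output for these data shown later, so the results only approximate the reported 0.83 and 0.69:

```r
g <- c(.72, .58, .57, .57, .55, .43, .47, .43, .32)  # general factor loadings c
A <- matrix(0, 9, 3)                                 # group factor loadings
A[1:3, 1] <- c(.45, .37, .41)
A[4:6, 2] <- c(.41, .32, .27)
A[7:9, 3] <- c(.47, .42, .23)
h2 <- g^2 + rowSums(A^2)      # communalities h2_j = c2_j + sum of group loadings^2
u2 <- 1 - h2                  # uniquenesses for standardized items
R  <- g %o% g + A %*% t(A)    # model implied correlation matrix
diag(R) <- 1
Vx <- sum(R)                  # variance of the total (unit weighted) score
omega_t <- 1 - sum(u2) / Vx   # total reliability, about 0.83
omega_h <- sum(g %o% g) / Vx  # general factor saturation, about 0.71 here
                              # (0.69 with the unrounded loadings)
```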
It is important to distinguish here between the two ω coefficients of McDonald (1978) and Equation 6.20a of McDonald (1999), ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

    ωh = 1cc′1/Vx
Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

    ωinf = 1cc′1/(1cc′1 + 1AA′1′)
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om9
9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
> om9 <- omega(r9, title="9 simulated variables")

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index = 0.026 and the 90% confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors.
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90% confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9, n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.026 and the 90% confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors.
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.1 and the 90% confidence intervals are 0.085 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g   F1   F2   F3   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.
> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
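A tiny base R illustration (with hypothetical response values, not from the vignette) of that reverse scoring identity:

```r
# Reverse scoring flips responses about the scale midpoint:
# the reversed item is (max + min) - item.
item <- c(1, 2, 6, 5, 3)     # hypothetical responses on a 1 to 6 scale
reversed <- (6 + 1) - item
reversed                     # 6 5 1 2 4
```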
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28, list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15), Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10, list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15, list(Extraversion=c(-1,-2,3:5), Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1, keys.2))
The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, including for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made, and then these can be combined with the keys for the particular scales.
Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys, bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6 reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 16.)
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use cluster.cor.
> r.bfi <- cor(bfi, use="pairwise")
> scales <- cluster.cor(keys, r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix), or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
> png('scores.png')
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys, iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> # just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)  # compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
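A sketch of that follow-up step (assuming the psych package and the iq.tf object from above; the four content groupings are illustrative, not from the vignette):

```r
library(psych)
# group the 16 true/false items into four illustrative content scales
iq.keys.list <- list(reasoning = 1:4, letters = 5:8,
                     matrices = 9:12, rotation = 13:16)
keys <- make.keys(16, iq.keys.list, item.labels = colnames(iq.tf))
iq.scales <- scoreItems(keys, iq.tf)   # alphas and scale intercorrelations
```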
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> # note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  # set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)
Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:
    δj = Dτj/√(1 − λ²j)        αj = λj/√(1 − λ²j)        (2)

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

    λj = αj/√(1 + α²j)        τj = δj/√(1 + α²j)
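A small numeric check of these conversions in base R (not from the vignette), using an illustrative loading and threshold with D = 1, the normal ogive case:

```r
lambda <- 0.6                            # illustrative factor loading
tau    <- 0.5                            # illustrative normal threshold
a <- lambda / sqrt(1 - lambda^2)         # discrimination alpha_j = 0.75
d <- tau    / sqrt(1 - lambda^2)         # location delta_j = 0.625
# the inverse mapping recovers the factor analytic parameters:
lambda.back <- a / sqrt(1 + a^2)         # 0.6
tau.back    <- d / sqrt(1 + a^2)         # 0.5
```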
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.
> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  # dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor = 1
               -3    -2    -1     0    1    2     3
V1           0.48  0.59  0.24  0.06 0.01 0.00  0.00
V2           0.30  0.68  0.45  0.12 0.02 0.00  0.00
V3           0.12  0.50  0.74  0.29 0.06 0.01  0.00
V4           0.05  0.26  0.74  0.55 0.14 0.03  0.00
V5           0.01  0.07  0.44  1.05 0.41 0.06  0.01
V6           0.00  0.03  0.15  0.60 0.74 0.24  0.04
V7           0.00  0.01  0.04  0.22 0.73 0.63  0.16
V8           0.00  0.00  0.02  0.12 0.45 0.69  0.31
V9           0.00  0.01  0.02  0.08 0.25 0.47  0.36
Test Info    0.98  2.14  2.85  3.09 2.81 2.14  0.89
SEM          1.01  0.68  0.59  0.57 0.60 0.68  1.06
Reliability -0.02  0.53  0.65  0.68 0.64 0.53 -0.12

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.209 and the 90% confidence intervals are 0.198 0.218
BIC = 1009.07
Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:
gt data(bfi)gt eirt lt- irtfa(bfi[1115])gt eirt
Item Response Analysis using Factor Analysis
Call irtfa(x = bfi[1115])Item Response Analysis using Factor Analysis
Summary information by factor and itemFactor = 1
-3 -2 -1 0 1 2 3E1 037 058 075 076 061 039 021E2 048 089 118 124 099 059 025E3 034 049 058 057 049 036 023E4 063 098 111 096 067 037 016E5 033 042 047 045 037 027 017Test Info 215 336 409 398 313 197 102SEM 068 055 049 050 056 071 099Reliability 053 070 076 075 068 049 002
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.097 and the 90% confidence intervals are 0.083 0.111
BIC = 96.23
The item information functions show that not all of the items are equally good (Figure 19).
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))
[Figure 18 shows three panels plotted against the latent trait (normal scale): "Item parameters from factor analysis" (probability of response for V1-V9), "Item information from factor analysis", and "Test information -- item parameters from factor analysis" with a right-hand reliability axis.]
Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is .50. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt, type="IIC")
[Figure 19: "Item information from factor analysis for factor 1" -- information curves for items E1-E5 plotted against the latent trait (normal scale).]
Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info, sort=TRUE)
Item Response Analysis using Factor Analysis
Summary information by factor and item
Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 21).
> iq.irt
Item Response Analysis using Factor Analysis
Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
Factor = 1
             -3    -2    -1     0     1     2     3
reason.4   0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16  0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17  0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19  0.05  0.17  0.41  0.46  0.22  0.07  0.02
> iq.irt <- irt.fa(iq.tf)
[Figure 20: "Item information from factor analysis" -- information curves for the 16 ability items (the reason, letter, matrix, and rotate items) plotted against the latent trait (normal scale).]
Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7   0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33  0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34  0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58  0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45  0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46  0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47  0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55  0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3   0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4   0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6   0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8   0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info    0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM          1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80  0.49  0.80  0.85  0.82  0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90% confidence intervals are 0.126 0.135
BIC = 2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.
> v9 <- sim.irt(9, 1000, -2., 2., mod="normal")  # dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho, 4)
[Figure 21: Omega diagram -- a general factor g (loadings of roughly .4 to .7) and four group factors F1-F4 for the 16 ability items (rotate, letter, reason, and matrix items).]
Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> # now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- score.irt(test, items)
> unitweighted <- score.irt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")
These results are seen in Figure 22
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:
r_xy = η_x(wg) * η_y(wg) * r_xy(wg) + η_x(bg) * η_y(bg) * r_xy(bg)    (3)
Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.

> pairs.panels(scores.df, pch=".", gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)
where r_xy is the normal correlation, which may be decomposed into a within group correlation r_xy(wg) and a between group correlation r_xy(bg), and η (eta) is the correlation of the data with the within group values or the group means.
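Equation (3) can be verified with toy numbers: pick within and between group correlations and etas, and the observed correlation is just the weighted sum of the two components. A short sketch (in Python for concreteness; the numeric values are arbitrary illustrations, chosen so that the squared within and between etas for each variable sum to 1):

```python
import math

# Toy check of equation (3): the observed correlation r_xy is the within group
# correlation plus the between group correlation, each weighted by the etas.
eta_x_wg, eta_x_bg = math.sqrt(0.6), math.sqrt(0.4)
eta_y_wg, eta_y_bg = math.sqrt(0.5), math.sqrt(0.5)
r_wg, r_bg = 0.30, -0.20

r_xy = eta_x_wg * eta_y_wg * r_wg + eta_x_bg * eta_y_bg * r_bg
print(round(r_xy, 3))  # 0.075
```

Note how a positive within group correlation and a negative between group correlation can largely cancel, leaving an observed correlation near zero.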
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, and -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.
The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
R² = 1 − ∏_{i=1}^{n} (1 − λ_i)
where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = R_xx⁻¹ R_xy R_yy⁻¹ R_xy′
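Since the eigenvalues of this matrix are the squared canonical correlations between the two sets, the set correlation is easy to compute by hand. A small sketch (in Python for concreteness; the function name is ours), using the squared canonical correlations reported for the sat.act example below:

```python
def set_cor_r2(eigenvalues):
    """Cohen's set correlation: R^2 = 1 - prod(1 - lambda_i)."""
    prod = 1.0
    for lam in eigenvalues:
        prod *= 1.0 - lam
    return 1.0 - prod

# squared canonical correlations from the sat.act example
print(round(set_cor_r2([0.050, 0.033, 0.008]), 2))  # 0.09
```

This reproduces the "Set Correlation R2 = 0.09" reported by setCor for that example.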
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.
setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.
Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)
Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476
Compare this with the output from setCor.
> # compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs=700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use, std = std, square = square, main = main, plot = plot)
Multiple Regression from matrix input
Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19
multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11
Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05
degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohen's Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohen's Set Correlation: 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric: that is, the R² is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.VSS. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.
sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four IRT simulations, sim.rasch, sim.irt, sim.npl, and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
The logistic model is

P(x|θ_i, δ_j, γ_j, ζ_j) = γ_j + (ζ_j − γ_j) / (1 + e^{α_j(δ_j − θ_i)})    (4)
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination, and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
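Equation (4) is straightforward to evaluate directly. A sketch (in Python for concreteness; the function name is ours), with defaults giving the Rasch special case:

```python
import math

def p_logistic(theta, delta, alpha=1.0, gamma=0.0, zeta=1.0):
    """Probability of endorsement under the 4 parameter logistic model of
    equation (4); the defaults (gamma=0, zeta=1, alpha=1) give the Rasch model."""
    return gamma + (zeta - gamma) / (1.0 + math.exp(alpha * (delta - theta)))

# In the Rasch case the probability is .5 when ability matches difficulty:
print(p_logistic(theta=1.0, delta=1.0))  # 0.5
```

With a non-zero guessing parameter γ, the curve is bounded below by γ rather than 0, as in a multiple choice test.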
(Graphics of these may be seen in the demonstrations for the logistic function)
The normal model (sim.npn) calculates the probability using pnorm instead of the logistic function used in sim.npl, but the meaning of the parameters is otherwise the same. With the α parameter = 1.702 in the logistic model, the two models are practically identical.
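This equivalence is easy to confirm numerically: over a wide grid of latent trait values, the logistic curve with the 1.702 scaling and the normal ogive never differ by more than about .01. A quick check (sketched in Python):

```python
import math

def logistic(x, d=1.702):   # logistic model with the D scaling factor
    return 1.0 / (1.0 + math.exp(-d * x))

def normal_ogive(x):        # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

grid = [i / 100.0 for i in range(-400, 401)]
max_diff = max(abs(logistic(x) - normal_ogive(x)) for x in grid)
print(max_diff < 0.01)  # True
```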
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g., (but not shown):
f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)             # a pretty boring plot
iclust.diagram(c)   # a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)            # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram, and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double")  # for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")          # but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.
block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor, and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems.
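For readers who want to see the arithmetic behind a few of these helpers, here are rough equivalents of fisherz, geometric.mean, and harmonic.mean (sketched in Python; the Python names are ours, and these are illustrations of the formulas rather than the psych implementations):

```python
import math

def fisher_z(r):            # what psych's fisherz computes: atanh(r)
    return math.atanh(r)

def geometric_mean(xs):     # psych's geometric.mean: exp of the mean log
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):      # psych's harmonic.mean: reciprocal of mean reciprocal
    return len(xs) / sum(1.0 / x for x in xs)

print(round(fisher_z(0.5), 3))           # 0.549
print(round(geometric_mean([2, 8]), 6))  # 4.0
print(round(harmonic_mean([2, 8]), 1))   # 3.2
```

Note the familiar ordering for positive data: harmonic mean ≤ geometric mean ≤ arithmetic mean.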
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.
Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.
bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative, and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights; peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
14 Development version and a user's guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.2.8", package="psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html: A short guide to R.
16 SessionInfo
This document was prepared using the following settings
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4  lattice_0.20-33  matrixcalc_1.0-3  MASS_7.3-45  grid_3.3.0  arm_1.8-6
 [7] nlme_3.1-127  stats4_3.3.0  coda_0.18-1  minqa_1.2.4  mi_1.0  nloptr_1.0.4
[13] Matrix_1.2-6  boot_1.3-18  splines_3.3.0  lme4_1.1-12  sem_3.1-7  tools_3.3.0
[19] abind_1.4-3  GPArotation_2014.11-1  parallel_3.3.0  mnormt_1.5-4
References
Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd ed. edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537–549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very simple structure: alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components--an alternative to "mathematical factors." Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.
> png('corplot.png')
> cor.plot(Thurstone,numbers=TRUE,main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1
Figure 2: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well.
> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ,main="24 variables in a circumplex")
> dev.off()
null device
          1
Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial, and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
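As a sketch of these calls (assuming the bfi data set that ships with psych; the cut point of 3 used to dichotomize the items is arbitrary, and the mixed.cor argument layout has varied across psych versions):

> data(bfi)
> d <- ifelse(bfi[1:5] > 3, 1, 0)   #dichotomize the first five items (arbitrary cut)
> tet <- tetrachoric(d)             #tetrachoric correlations of the dichotomous items
> poly <- polychoric(bfi[1:5])      #polychoric correlations of the polytomous items
> mixed <- mixed.cor(x = bfi["age"], p = bfi[1:5])  #continuous age with polytomous items

Each of these returns the estimated correlation matrix (in the rho element) along with the estimated thresholds (tau).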
The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
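The burt data set mentioned above is included in psych, so the repair can be sketched as follows (eigen is base R; the default eig.tol of cor.smooth is assumed adequate here):

> data(burt)
> min(eigen(burt)$values)     #check for a negative or near-zero eigen value
> burt.s <- cor.smooth(burt)  #adjust the offending eigen values and rescale
> min(eigen(burt.s)$values)   #the smallest eigen value should now be positive

The smoothed matrix may then be factored in the usual way.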
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)

¹Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
15
or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.
At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.
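A minimal illustration of the eigen decomposition route, assuming the built-in Thurstone correlation matrix (note that the sign of an eigen vector is arbitrary):

> ev <- eigen(Thurstone)                     #eigen decomposition of the correlation matrix
> pc1 <- ev$vectors[,1] * sqrt(ev$values[1]) #first principal component loadings
> p1 <- principal(Thurstone, nfactors=1)     #the same loadings as reported by principal
> round(pc1 - as.vector(p1$loadings), 2)     #should agree, up to an arbitrary sign flip

That is, principal component loadings are just the eigen vectors rescaled by the square roots of the corresponding eigen values.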
Several of the functions in psych address the problem of data reduction
fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.
principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values.
factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss (Very Simple Structure; Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

²Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.
fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.
nfactors A number of different tests for the number of factors problem are run.
fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.
fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.
iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
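Putting several of these functions together, a typical exploratory sequence on the built-in Thurstone correlation matrix might look like this (a sketch, not a prescription):

> fa.parallel(Thurstone, n.obs=213)            #compare observed eigen values with random data
> f3 <- fa(Thurstone, nfactors=3, n.obs=213)   #minres extraction, oblimin rotation by default
> fa.diagram(f3)                               #path diagram of the solution
> f3.pa <- fa(Thurstone, nfactors=3, n.obs=213, fm="pa")  #principal axis alternative
> factor.congruence(f3, f3.pa)                 #cosines between the two sets of loadings

The congruence coefficients give a quick check that the alternative extraction methods recover essentially the same structure.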
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the n × n correlation matrix R be approximated by the product of an n × k factor matrix F and its transpose, plus a diagonal matrix of uniquenesses:

R = FF′ + U²   (1)
The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", and "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
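Requesting an alternative rotation is just a matter of the rotate argument; as a sketch (many of these options rely on the GPArotation package being installed):

> f3.none <- fa(Thurstone, 3, n.obs=213, rotate="none")     #unrotated minres solution
> f3.vm   <- fa(Thurstone, 3, n.obs=213, rotate="varimax")  #orthogonal rotation
> f3.gq   <- fa(Thurstone, 3, n.obs=213, rotate="geominQ")  #oblique, via GPArotation

Comparing the loadings across these objects shows how much the interpretation depends upon the rotation criterion.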
5.1.2 Principal Axis Factor Analysis
An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
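Assuming the iteration limit is exposed through the max.iter argument of fa, the one-iteration (SAS-like) principal axis solution can be sketched as:

> f3.pa  <- fa(Thurstone, 3, n.obs=213, fm="pa")              #iterated principal axis
> f3.sas <- fa(Thurstone, 3, n.obs=213, fm="pa", max.iter=1)  #single iteration, as in SAS

The two solutions will differ only slightly for well behaved matrices, but the iterated communality estimates are generally to be preferred.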
The solutions from the fa, the factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone,3,n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components analysis (PCA)
An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than the fa solution.
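The difference is easy to see by extracting both solutions and comparing them directly (a sketch using the same Thurstone matrix; factor.congruence gives the cosines between the two sets of loadings):

> p3 <- principal(Thurstone, 3, n.obs=213)  #components: model the total variance
> f3 <- fa(Thurstone, 3, n.obs=213)         #factors: model the common variance only
> factor.congruence(f3, p3)                 #similar direction, but the PCA loadings are larger

The congruence coefficients will be high, yet the component loadings exceed the factor loadings because the components absorb the unique variance as well.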
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general and group factors.
> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))
Figure 6: A higher order factor solution to the Thurstone 9 variable problem.
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple original configurations.
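A hedged sketch of this approach, extracting one more factor than the three group factors so that a general factor can emerge, and comparing it with the Schmid-Leiman orthogonalization from schmid (the number of factors chosen here is illustrative):

> f4.bi <- fa(Thurstone, 4, n.obs=213, rotate="bifactor")  #direct bifactor rotation
> sl <- schmid(Thurstone, 3, n.obs=213)                    #Schmid-Leiman from the higher order model

Because the bifactor rotation can land in a local optimum, rerunning from several starting configurations and keeping the best fitting solution is prudent.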
> om <- omega(Thurstone,n.obs=213)
Figure 7: A bifactor solution to the Thurstone 9 variable problem.
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxonometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1. Find the proximity (e.g., correlation) matrix.
2. Identify the most similar pair of items.
3. Combine this most similar pair of items to form a new variable (cluster).
4. Find the similarity of this cluster to all other items and clusters.
5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or, in iclust, if there is a failure to increase reliability coefficients α or β).
6. Purify the solution by reassigning items to the most similar cluster center.
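The agglomerative loop in steps 1-5 can be sketched in a few lines. This is an illustrative Python sketch, not the psych implementation: the function name `greedy_item_clusters` is hypothetical, it merges by the correlation of unit-weighted composites, and it omits the α/β stopping rule and the purification step.

```python
# Minimal sketch of the agglomerative item-clustering loop (not psych's iclust):
# repeatedly merge the most correlated pair of clusters into a unit-weighted
# composite until the requested number of clusters remains.
import numpy as np

def greedy_item_clusters(R, n_clusters=1):
    """R: item correlation matrix; returns a list of item-index clusters."""
    R = np.asarray(R, dtype=float)
    clusters = [[i] for i in range(R.shape[0])]
    while len(clusters) > n_clusters:
        best, pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ia, ib = clusters[a], clusters[b]
                # correlation of two unit-weighted composites
                cov = R[np.ix_(ia, ib)].sum()
                va = R[np.ix_(ia, ia)].sum()
                vb = R[np.ix_(ib, ib)].sum()
                r = cov / np.sqrt(va * vb)
                if r > best:
                    best, pair = r, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

A real implementation would stop when merging fails to increase α or β, rather than at a fixed number of clusters.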
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 7). The pattern matrix is available as an object in the results.
> data(bfi)
> ic <- iclust(bfi[1:25])
[Figure: ICLUST tree diagram of the 25 personality items, showing the hierarchical cluster structure with α and β for each cluster (e.g., C20: α = 0.81, β = 0.63)]
Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis shows three large clusters and one smaller cluster.
> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations, reliabilities on diagonal, correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OS X, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
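The bootstrapping idea itself is simple: resample subjects with replacement, recompute the statistic, and take percentile bounds. A minimal language-neutral sketch in Python (the function name `bootstrap_ci` is hypothetical, and a correlation stands in for a factor loading):

```python
# Percentile bootstrap confidence interval for a correlation coefficient,
# illustrating the resampling scheme used for bootstrapped factor loadings.
import numpy as np

def bootstrap_ci(x, y, n_iter=1000, alpha=0.05, seed=17):
    """Resample cases with replacement and return (lower, upper) bounds."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = []
    for _ in range(n_iter):
        idx = rng.integers(0, n, n)        # resample rows with replacement
        stats.append(np.corrcoef(x[idx], y[idx])[0, 1])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

In the factor-analytic case the statistic recomputed on each resample is the whole loading matrix (with factors realigned across replications), not a single correlation.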
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = Σ_{k=1}^{n} f_{ik} f_{jk} / √(Σ f_{ik}^2 Σ f_{jk}^2)
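The congruence coefficient is a cosine, so it is invariant to rescaling a whole column of loadings. A small illustrative Python sketch (the function name `factor_congruence` mirrors psych's factor.congruence but this is not its implementation):

```python
# Tucker congruence coefficients between the columns of two loading matrices:
# c_ij = (f_i . f_j) / sqrt(|f_i|^2 |f_j|^2), i.e. the cosine of the angle.
import numpy as np

def factor_congruence(F1, F2):
    """Return the matrix of congruence coefficients between columns."""
    F1, F2 = np.asarray(F1, float), np.asarray(F2, float)
    num = F1.T @ F2                                        # cross products
    den = np.sqrt(np.outer((F1 ** 2).sum(0), (F2 ** 2).sum(0)))
    return num / den
```

Because only the direction of each loading vector matters, congruence of a set of loadings with itself is 1, and doubling every loading of a factor leaves its congruences unchanged.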
Consider the case of a four factor solution and four cluster solution to the Big Five problem.
> f4 <- fa(bfi[1:25], 4, fm="pa")
> factor.congruence(f4, ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(r.poly$rho, title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)
[Figure: ICLUST tree diagram of the 25 BFI items based on polychoric correlations, with α and β for each cluster (e.g., C23: α = 0.83, β = 0.58)]
Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(r.poly$rho, 5, title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure: ICLUST tree diagram of the 25 BFI items based on polychoric correlations with the number of clusters set to 5]
Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(r.poly$rho, beta.size=3, title="ICLUST, beta.size=3")

[Figure: ICLUST tree diagram of the 25 BFI items based on polychoric correlations with beta.size set to 3]
Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, and 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.
> print(ic, cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O    P     C20   C16   C15   C21
A1  C20  C20
A2  C20  C20  0.59
A3  C20  C20  0.65
A4  C20  C20  0.43
A5  C20  C20  0.65
C1  C15  C15              0.54
C2  C15  C15              0.62
C3  C15  C15              0.54
C4  C15  C15        0.31 -0.66
C5  C15  C15 -0.30  0.36 -0.59
E1  C20  C20 -0.50
E2  C20  C20 -0.61  0.34
E3  C20  C20  0.59             -0.39
E4  C20  C20  0.66
E5  C20  C20  0.50        0.40 -0.32
N1  C16  C16        0.76
N2  C16  C16        0.75
N3  C16  C16        0.74
N4  C16  C16 -0.34  0.62
N5  C16  C16        0.55
O1  C20  C21             -0.53
O2  C21  C21                    0.44
O3  C20  C21  0.39             -0.62
O4  C21  C21                   -0.33
O5  C21  C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations, reliabilities on diagonal, correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.
> fa(bfi[1:10], 2, n.iter=20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.
> factor.congruence(list(f3t, f3o, om, p3p))

     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2   RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99  0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01  0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58  0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56  0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06  0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05  1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy", that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best (Horn and Engstrom, 1979).
Techniques most commonly used include:
1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).
4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).
5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1,000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).
Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25], title="Very Simple Structure of a Big 5 inventory")

[Figure: plot of Very Simple Structure fit (0 to 1) against the number of factors (1 to 8) for item complexities 1 through 4]
Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss
Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.
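The core comparison in parallel analysis is easily sketched: keep as many dimensions as have observed eigenvalues above the mean eigenvalues of random data of the same shape. A minimal Python sketch of the principal-components version (the function name `parallel_analysis` is hypothetical, and this omits the resampled-data and factor-analysis comparisons that fa.parallel also provides):

```python
# Horn's parallel analysis, component version: count how many eigenvalues of
# the observed correlation matrix exceed the mean eigenvalues of random
# normal data with the same number of subjects and variables.
import numpy as np

def parallel_analysis(X, n_iter=20, seed=1):
    """X: subjects-by-variables data matrix; returns suggested dimensions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    real = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    sim = np.zeros(p)
    for _ in range(n_iter):
        Z = rng.standard_normal((n, p))
        sim += np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]
    sim /= n_iter                       # mean random eigenvalue profile
    return int(np.sum(real > sim))
```

A common refinement replaces the mean random eigenvalues with their 95th percentile, which is slightly more conservative.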
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option="poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25], main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure: scree-type plot of the eigen values of principal components and factor analysis for the actual, simulated, and resampled data]
Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation found by set.cor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

λ3 = (n / (n - 1)) (1 - tr(V_x)/V_x) = (n / (n - 1)) (V_x - tr(V_x)) / V_x = α
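The λ3 formula above, with V_x the total test variance and tr(V_x) the sum of the item variances, translates directly into a few lines of code. An illustrative Python sketch (the function name `cronbach_alpha` is hypothetical; psych users would call alpha instead):

```python
# Coefficient alpha from a subjects-by-items score matrix, following
# lambda_3 = n/(n-1) * (1 - tr(V_x)/V_x), where V_x is the total test
# variance (the sum of all elements of the item covariance matrix).
import numpy as np

def cronbach_alpha(X):
    """X: subjects-by-items matrix of item scores; returns raw alpha."""
    V = np.cov(X, rowvar=False)          # item covariance matrix
    n = V.shape[0]                       # number of items
    return n / (n - 1) * (1 - np.trace(V) / V.sum())
```

As a sanity check, a "test" of perfectly parallel (identical) items has alpha exactly 1, since all the covariance is shared.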
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j^2, and is

λ6 = 1 - Σ e_j^2 / V_x = 1 - Σ (1 - r_smc^2) / V_x
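λ6 can be computed directly from a correlation matrix, using the standard identity that each item's smc is 1 - 1/r^jj, where r^jj is the corresponding diagonal element of the inverse correlation matrix. An illustrative Python sketch (the function name `guttman_lambda6` is hypothetical):

```python
# Guttman's lambda 6 from an item correlation matrix R:
#   smc_j = 1 - 1/diag(R^{-1})_j   (squared multiple correlation of item j)
#   lambda_6 = 1 - sum(1 - smc) / V_x, with V_x = sum of all elements of R.
import numpy as np

def guttman_lambda6(R):
    """R: item correlation matrix; returns Guttman's lambda 6."""
    R = np.asarray(R, float)
    smc = 1 - 1 / np.diag(np.linalg.inv(R))
    return 1 - np.sum(1 - smc) / R.sum()
```

For uncorrelated items every smc is 0 and λ6 is 0, as it should be; for equicorrelated items it lies between 0 and 1.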
The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[, s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe = fe)

[Figure: diagram titled "Factor analysis and extension" showing the two factors (MR1, MR2) defined by the odd-numbered variables on the left, extended to the even-numbered variables on the right, with loadings of about ±0.5 to ±0.7]
Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include:
alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.
guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (µ0 ... µ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).
omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)
cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and an n x n correlation matrix, find the correlations of the composite clusters.
scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.
score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates if one item is dropped at a time.
> set.seed(17)
> r9 <- sim.hierarchical(n=500, raw=TRUE)$observed
> round(cor(r9), 2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)
Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013
Item statistics:
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
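The arithmetic behind reverse keying is simply reflecting the item about the midpoint of its response range: a reversed item becomes (max + min) - item. An illustrative Python sketch (the function name `score_scale` is hypothetical; it assumes all items share a common response range):

```python
# Unit-weighted scale scoring with a keys vector of 1 (keep), -1 (reverse),
# and 0 (skip). A reversed item is reflected as (max + min) - item.
import numpy as np

def score_scale(X, keys, max_min=None):
    """X: subjects-by-items matrix; returns the mean score per subject."""
    X = np.asarray(X, float)
    keys = np.asarray(keys)
    if max_min is None:
        max_min = X.max() + X.min()      # assumes a common response range
    scored = np.where(keys == -1, max_min - X, X)
    return scored[:, keys != 0].mean(axis=1)
```

On a 1-5 response scale, for example, a response of 5 on a reversed item is scored as (5 + 1) - 5 = 1.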
> keys <- c(1, -1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

Item statistics:
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1, 1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)
Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

Item statistics:
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N=500, short=FALSE, low=-2, high=2, categorical=TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures
Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

Item statistics:
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02
Non missing response frequency for each item
     -2    -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the gen-eral factor saturation of a test Zinbarg et al (2005) httppersonality-project
orgrevellepublicationszinbargrevellepmet05pdf compare McDonaldrsquos ωh toCronbachrsquos α and Revellersquos β They conclude that ωh is the best estimate (See also Zin-barg et al (2006) and Revelle and Zinbarg (2009) httppersonality-projectorg
revellepublicationsrevellezinbarg08pdf )
One way to find ωh is to do a factor analysis of the original data set rotate the factorsobliquely factor that correlation matrix do a Schmid-Leiman (schmid) transformation tofind general factor loadings and then find ωh
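A minimal sketch of those steps, using a correlation matrix simulated with sim.hierarchical (the omega function wraps this whole sequence into a single call; the by-hand line for ωh assumes standardized items, so that Vx is simply the sum of all the correlations):

```r
library(psych)
R  <- sim.hierarchical()       # a 9 x 9 correlation matrix with a known hierarchical structure
sl <- schmid(R, nfactors = 3)  # oblique factor analysis followed by a Schmid-Leiman transformation
g  <- sl$sl[, "g"]             # the general factor loadings
omega.h <- sum(g)^2 / sum(R)   # general factor saturation, found by hand
omega(R, nfactors = 3)         # the same analysis in one call
```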
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex).

β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the
method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab, where rab is the oblique correlation between the factors), or to "first" or "second" in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness (u²) from factor analysis to find e²j. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error).

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h²j = c²j + Σi f²ij, and the unique variance for the item, u²j = σ²j(1 − h²j), may be used to estimate the test reliability. That is, if h²j is the communality of item j, based upon general as well as group factors, then for standardized items, e²j = 1 − h²j and

    ωt = (1cc′1 + 1AA′1′) / Vx = 1 − Σ(1 − h²j) / Vx = 1 − Σu² / Vx
Because h²j ≥ r²smc, ωt ≥ λ6.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

    ωh = 1cc′1 / Vx

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

    ω∞ = 1cc′1 / (1cc′1 + 1AA′1′)
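The three coefficients can be checked by hand from a Schmid-Leiman pattern. The general (c) and group (A) loadings below are illustrative values made up for this sketch, not output from any data set:

```r
g  <- c(.7, .6, .6, .5, .5, .5, .4, .4, .3)       # general factor loadings (c)
A  <- matrix(0, 9, 3)                             # group factor loadings (A)
A[1:3, 1] <- .4; A[4:6, 2] <- .4; A[7:9, 3] <- .3
u2 <- 1 - (g^2 + rowSums(A^2))                    # uniquenesses for standardized items
Vx <- sum(g)^2 + sum(colSums(A)^2) + sum(u2)      # total test variance
omega.h   <- sum(g)^2 / Vx                        # general factor saturation
omega.t   <- (sum(g)^2 + sum(colSums(A)^2)) / Vx  # equals 1 - sum(u2)/Vx
omega.inf <- sum(g)^2 / (sum(g)^2 + sum(colSums(A)^2))
round(c(omega.h, omega.t, omega.inf), 2)          # 0.69 0.82 0.85
```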
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
> om.9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
> om.9 <- omega(r9, title="9 simulated variables")
Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                  g    F1    F2    F3
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9, n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63
With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                  g    F1    F2    F3
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g   F1   F2   F3   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from Ten Berge, and an estimate of the greatest lower bound.
> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged
to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
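To make the reverse scoring arithmetic concrete, here is a toy example (not psych output) for an item answered on a 1 to 6 scale, where max + min = 7:

```r
responses <- c(1, 2, 5, 6)        # four answers to an item keyed -1
reversed  <- (6 + 1) - responses  # max + min - response
reversed                          # 6 5 2 1: low scorers now score high
```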
The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copied into R). Although not normally necessary, show the keys to understand what is happening.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
> keys <- make.keys(nvars=28, list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15), Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys

          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories, with keys based upon each individual inventory. Pretend, for the moment, that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10, list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15, list(Extraversion=c(-1,-2,3:5), Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1, keys.2))
The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, including those for the demographic data.

This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.

Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys, bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
ASE   0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6 reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5
Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 16).
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use cluster.cor.
> r.bfi <- cor(bfi, use="pairwise")
> scales <- cluster.cor(keys, r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix), or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
> png('scores.png')
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()
quartz
     2
Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors), and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys, iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
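For instance, the 16 ability items might be grouped into four subscales by item type. The keys list below is written only for illustration, and assumes the iq.keys and iqitems objects defined earlier:

```r
library(psych)
iq.tf <- score.multiple.choice(iq.keys, iqitems, score = FALSE)   # items as 0/1
iq.scale.keys <- make.keys(16, list(reasoning = 1:4, letters = 5:8,
                                    matrices = 9:12, rotation = 13:16))
iq.scales <- scoreItems(iq.scale.keys, iq.tf)   # alphas and subscale intercorrelations
```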
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data, or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
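Strung together, that sequence might look like this (a sketch using the bfi items and the keys matrix built earlier; each step is optional):

```r
library(psych)
describe(bfi[1:25])               # item distributions
f5 <- fa(bfi[1:25], 5)            # exploratory factor analysis with five factors
vss(bfi[1:25])                    # Very Simple Structure criterion
fa.parallel(bfi[1:25])            # parallel analysis
scores <- scoreItems(keys, bfi)   # many scales scored simultaneously
```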
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)
[Four-panel plot: probability of each response alternative for reason.4, reason.16, reason.17, and reason.19 as a function of theta]
Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

    δj = Dτj / √(1 − λj²),    αj = λj / √(1 − λj²)    (2)

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

    λj = αj / √(1 + αj²),    τj = δj / √(1 + αj²)
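These conversions are easy to verify numerically. The helper below is not a psych function, just a sketch of Equation 2 and its inverse for the normal ogive case (D = 1):

```r
fa2irt <- function(lambda, tau, D = 1) {  # loadings and thresholds -> IRT parameters
  k <- sqrt(1 - lambda^2)
  list(a = lambda / k,                    # discrimination alpha_j
       d = D * tau / k)                   # difficulty/location delta_j
}
p <- fa2irt(lambda = 0.6, tau = 0.5)      # a = 0.75, d = 0.625
lambda.back <- p$a / sqrt(1 + p$a^2)      # recovers lambda = 0.6
tau.back    <- p$d / sqrt(1 + p$a^2)      # recovers tau = 0.5
```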
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.
> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0    1    2     3
V1          0.48  0.59  0.24  0.06 0.01 0.00  0.00
V2          0.30  0.68  0.45  0.12 0.02 0.00  0.00
V3          0.12  0.50  0.74  0.29 0.06 0.01  0.00
V4          0.05  0.26  0.74  0.55 0.14 0.03  0.00
V5          0.01  0.07  0.44  1.05 0.41 0.06  0.01
V6          0.00  0.03  0.15  0.60 0.74 0.24  0.04
V7          0.00  0.01  0.04  0.22 0.73 0.63  0.16
V8          0.00  0.00  0.02  0.12 0.45 0.69  0.31
V9          0.00  0.01  0.02  0.08 0.25 0.47  0.36
Test Info   0.98  2.14  2.85  3.09 2.81 2.14  0.89
SEM         1.01  0.68  0.59  0.57 0.60 0.68  1.06
Reliability -0.02 0.53  0.65  0.68 0.64 0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07
Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:
> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3   -2   -1    0    1    2    3
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04
The df corrected root mean square of the residuals is  0.06

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23
The item information functions show that not all of the items are equally good (Figure 19):

These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))
[Three-panel plot: item characteristic curves, item information curves, and test information with reliability, for items V1 to V9 as a function of the latent trait]
Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt, type="IIC")
[Plot: item information from factor analysis for E1 to E5 as a function of the latent trait]
Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info, sort=TRUE)

Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3   -2   -1    0    1    2    3
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02
More extensive IRT packages include ltm and eRm, and should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 21) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.
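A sketch of this reuse (assuming iq.tf holds the scored true/false items from above; irt.fa stores the tetrachoric matrix in the rho element of its output):

```r
library(psych)
iq.irt <- irt.fa(iq.tf)     # slow: the tetrachoric correlations are found once
om <- omega(iq.irt$rho, 4)  # fast: reuses the saved correlation matrix
```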
> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0    1    2    3
reason.4    0.04  0.20  0.59  0.59 0.19 0.04 0.01
reason.16   0.07  0.23  0.46  0.39 0.16 0.05 0.01
reason.17   0.06  0.27  0.69  0.50 0.14 0.03 0.01
reason.19   0.05  0.17  0.41  0.46 0.22 0.07 0.02
> iq.irt <- irt.fa(iq.tf)
[Plot: "Item information from factor analysis". Item information curves for reason.4, reason.16, reason.17, reason.19, letter.7, letter.33, letter.34, letter.58, matrix.45, matrix.46, matrix.47, matrix.55, rotate.3, rotate.4, rotate.6, and rotate.8, plotted against the latent trait (normal scale).]
Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7    0.04 0.16 0.44 0.52 0.23 0.06 0.02
letter.33   0.04 0.14 0.34 0.43 0.24 0.08 0.02
letter.34   0.04 0.17 0.48 0.54 0.22 0.06 0.01
letter.58   0.02 0.08 0.28 0.55 0.39 0.13 0.03
matrix.45   0.05 0.11 0.23 0.29 0.21 0.10 0.04
matrix.46   0.05 0.12 0.25 0.30 0.21 0.10 0.04
matrix.47   0.05 0.16 0.37 0.41 0.21 0.07 0.02
matrix.55   0.04 0.08 0.14 0.20 0.19 0.13 0.06
rotate.3    0.00 0.01 0.07 0.29 0.66 0.45 0.13
rotate.4    0.00 0.01 0.05 0.30 0.91 0.55 0.11
rotate.6    0.01 0.03 0.14 0.49 0.68 0.27 0.06
rotate.8    0.00 0.02 0.08 0.27 0.54 0.39 0.13
Test Info   0.55 1.95 4.99 6.53 5.41 2.57 0.71
SEM         1.34 0.72 0.45 0.39 0.43 0.62 1.18
Reliability -0.80 0.49 0.80 0.85 0.82 0.61 -0.40

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.85
The number of observations was 1525 with Chi Square = 2801.05 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.742
RMSEA index = 0.131 and the 90 % confidence intervals are 0.126 0.135
BIC = 2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.

> v9 <- sim.irt(9, 1000, -2, 2, mod = "normal")  # dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho, 4)
[Omega diagram: the 16 ability items load on a general factor g (loadings 0.4 to 0.7) and on four group factors, F1 (the rotate items), F2 (the letter items), F3 and F4 (the reason and matrix items).]
Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord, ]
> # now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- score.irt(test, items)
> unitweighted <- score.irt(items = items, keys = rep(1, 9))
> scores.df <- data.frame(true = v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 22.
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.

8.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (r.wg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

r_xy = η_x.wg * η_y.wg * r_xy.wg + η_x.bg * η_y.bg * r_xy.bg    (3)
Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported, and then the IRT and total scoring systems.
> pairs.panels(scores.df, pch = ".", gap = 0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line = 3)
[pairs.panels SPLOM: scatter plots, histograms, and correlations among the True theta, irt theta, total, fit, rasch, total, and fit scores, with the title "Comparing true theta for IRT, Rasch and classically based scoring".]
where r_xy is the normal correlation, which may be decomposed into a within group and a between group correlation, r_xy.wg and r_xy.bg, and η (eta) is the correlation of the data with the within group values or the group means.
8.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5 and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
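A minimal sketch of such a two-level analysis, assuming the psych package and its sat.act data set are available (the choice of "education" as the grouping variable follows the example above):

```r
library(psych)
# two-level descriptive statistics for the sat.act data, grouped by education
sb <- statsBy(sat.act, group = "education")
sb$rwg  # pooled within-group correlations
sb$rbg  # correlations of the group means (between groups)
```

Printing sb also shows the group means and sample sizes per group.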
9 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

R2 = 1 - ∏_{i=1}^{n} (1 - λ_i)
where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = R_xx^(-1) R_xy R_yy^(-1) R_yx.
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use = "pairwise")
> model1 <- lm(ACT ~ gender + education + age, data = sat.act)
> summary(model1)
Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476

Compare this with the output from setCor.

> # compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs = 700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19
multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11
Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation = 0.03
 Cohens Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohens Set Correlation  7.26  9  1681.86
Unweighted correlation between the two sets = 0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.VSS. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette: psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four irt simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.
The logistic model is

P(x|θ_i, δ_j, γ_j, ζ_j) = γ_j + (ζ_j - γ_j) / (1 + e^(α_j(δ_j - θ_i)))    (4)
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)            # a pretty boring plot
iclust.diagram(c)  # a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)           # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.
> xlim = c(0, 10)
> ylim = c(0, 10)
> plot(NA, xlim = xlim, ylim = ylim, main = "Demonstration of dia functions", axes = FALSE, xlab = "", ylab = "")
> ul <- dia.rect(1, 9, labels = "upper left", xlim = xlim, ylim = ylim)
> ll <- dia.rect(1, 3, labels = "lower left", xlim = xlim, ylim = ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim = xlim, ylim = ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim = xlim, ylim = ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim = xlim, ylim = ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim = xlim, ylim = ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim = xlim, ylim = ylim)
> br <- dia.rect(9, 1, "bottom right", xlim = xlim, ylim = ylim)
> dia.arrow(from = lr, to = ul, labels = "right to left")
> dia.arrow(from = ul, to = ur, labels = "left to right")
> dia.curved.arrow(from = lr, to = ll$right, labels = "right to left")
> dia.curved.arrow(to = ur, from = ul$right, labels = "left to right")
> dia.curve(ll$top, ul$bottom, "double")  # for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")          # but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean and also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
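A minimal sketch (the two small matrices here are made up for illustration):

```r
library(psych)
a <- matrix(1, nrow = 2, ncol = 2)
b <- matrix(2, nrow = 3, ncol = 3)
# a 5 x 5 matrix: a in the top left, b in the lower right, 0s elsewhere
superMatrix(a, b)
```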
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtholdt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
14 Development version and a users guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.
News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.2.8", package = "psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html: A short guide to R.
16 SessionInfo
This document was prepared using the following settings:

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           lattice_0.20-33  matrixcalc_1.0-3  MASS_7.3-45   grid_3.3.0  arm_1.8-6
 [7] nlme_3.1-127          stats4_3.3.0     coda_0.18-1       minqa_1.2.4   mi_1.0      nloptr_1.0.4
[13] Matrix_1.2-6          boot_1.3-18      splines_3.3.0     lme4_1.1-12   sem_3.1-7   tools_3.3.0
[19] abind_1.4-3           GPArotation_2014.11-1  parallel_3.3.0  mnormt_1.5-4
References
Bechtoldt H (1961) An empirical study of the factor analysis stability hypothesis Psy-chometrika 26(4)405ndash432
Blashfield R K (1980) The growth of cluster analysis Tryon Ward and JohnsonMultivariate Behavioral Research 15(4)439 ndash 458
Blashfield R K and Aldenderfer M S (1988) The methods and problems of clusteranalysis In Nesselroade J R and Cattell R B editors Handbook of multivariateexperimental psychology (2nd ed) pages 447ndash473 Plenum Press New York NY
Bliese P D (2009) Multilevel Modeling in R (23) A Brief Introduction to R the multilevelpackage and the nlme package
Cattell R B (1966) The scree test for the number of factors Multivariate BehavioralResearch 1(2)245ndash276
Cattell R B (1978) The scientific use of factor analysis Plenum Press New York
Cohen J (1982) Set correlation as a general multivariate data-analytic method Multi-variate Behavioral Research 17(3)
Cohen J Cohen P West S G and Aiken L S (2003) Applied multiple regres-sioncorrelation analysis for the behavioral sciences L Erlbaum Associates MahwahNJ 3rd ed edition
Cooksey R and Soutar G (2006) Coefficient beta and hierarchical item clustering - ananalytical procedure for establishing and displaying the dimensionality and homogeneityof summated scales Organizational Research Methods 978ndash98
Cronbach L J (1951) Coefficient alpha and the internal structure of tests Psychometrika16297ndash334
Dwyer P S (1937) The determination of the factor loadings of a given test from theknown factor loadings of other tests Psychometrika 2(3)173ndash178
Everitt B (1974) Cluster analysis John Wiley amp Sons Cluster analysis 122 pp OxfordEngland
Fox J Nie Z and Byrnes J (2013) sem Structural Equation Models R package version31-3
Grice J W (2001) Computing and evaluating factor scores Psychological Methods6(4)430ndash450
Guilford J P (1954) Psychometric Methods McGraw-Hill New York 2nd edition
83
Guttman L (1945) A basis for analyzing test-retest reliability Psychometrika 10(4)255ndash282
Hartigan J A (1975) Clustering Algorithms John Wiley amp Sons Inc New York NYUSA
Henry D B Tolan P H and Gorman-Smith D (2005) Cluster analysis in familypsychology research Journal of Family Psychology 19(1)121ndash132
Holzinger K and Swineford F (1937) The bi-factor method Psychometrika 2(1)41ndash54
Horn J (1965) A rationale and test for the number of factors in factor analysis Psy-chometrika 30(2)179ndash185
Horn J L and Engstrom R (1979) Cattellrsquos scree test in relation to Bartlettrsquos chi-squaretest and other observations on the number of factors problem Multivariate BehavioralResearch 14(3)283ndash300
Jennrich R and Bentler P (2011) Exploratory bi-factor analysis Psychometrika76(4)537ndash549
Jensen A R and Weng L-J (1994) What is a good g Intelligence 18(3)231ndash258
Loevinger J Gleser G and DuBois P (1953) Maximizing the discriminating power ofa multiple-score test Psychometrika 18(4)309ndash317
MacCallum R C Browne M W and Cai L (2007) Factor analysis models as ap-proximations In Cudeck R and MacCallum R C editors Factor analysis at 100Historical developments and future directions pages 153ndash175 Lawrence Erlbaum Asso-ciates Publishers Mahwah NJ
Martinent G and Ferrand C (2007) A cluster analysis of precompetitive anxiety Re-lationship with perfectionism and trait anxiety Personality and Individual Differences43(7)1676ndash1686
McDonald R P (1999) Test theory A unified treatment L Erlbaum Associates MahwahNJ
Mun E Y von Eye A Bates M E and Vaschillo E G (2008) Finding groupsusing model-based cluster analysis Heterogeneous emotional self-regulatory processesand heavy alcohol use risk Developmental Psychology 44(2)481ndash495
Nunnally J C (1967) Psychometric theory McGraw-Hill New York
Nunnally J C and Bernstein I H (1994) Psychometric theory McGraw-Hill New Yorkrdquo
3rd edition
84
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure: alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of Books in Biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of Books in Biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components: an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.
> png('circ.plot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ, main='24 variables in a circumplex')
> dev.off()
null device
          1
Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.
4.5 Polychoric, tetrachoric, polyserial and biserial correlations
The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, φ will underestimate the value of the Pearson correlation applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process. A simple generalization of this to the case of multiple cuts is the polychoric correlation.
Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.
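The attenuation of φ is easy to demonstrate by simulation. The following base-R sketch (illustrative only, not psych code; all variable names are made up) dichotomizes a bivariate normal variable at its mean and compares φ to the latent correlation:

```r
# Dichotomizing bivariate normal data attenuates the Pearson correlation:
# the phi coefficient falls well below the latent correlation.
set.seed(42)
n   <- 50000
rho <- 0.6                                 # latent correlation
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)  # bivariate normal with cor rho
phi <- cor(as.numeric(x > 0), as.numeric(y > 0))  # phi: Pearson on 0/1 data
c(latent = rho, phi = round(phi, 2))       # phi is noticeably smaller than rho
```

The tetrachoric correlation inverts this process, estimating the latent ρ from the observed cell frequencies.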
If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial and polyserial correlations.
A correlation matrix formed from a number of tetrachoric or polychoric correlations will sometimes not be positive semi-definite. This will also happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigenvalues of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
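The smoothing idea can be sketched in a few lines of base R (a minimal illustration of the approach just described, not the actual cor.smooth code; the function name and tolerance are assumptions):

```r
# Sketch of the smoothing idea: clamp negative eigenvalues, rescale the
# eigenvalues to sum to the number of variables, rebuild the matrix, and
# restore a unit diagonal.  (Illustrative only -- not psych's cor.smooth.)
smooth_cor <- function(R, eps = 1e-4) {
  e    <- eigen(R, symmetric = TRUE)
  vals <- pmax(e$values, eps)               # force all eigenvalues positive
  vals <- vals * length(vals) / sum(vals)   # rescale to sum to n variables
  S    <- e$vectors %*% diag(vals) %*% t(e$vectors)
  cov2cor(S)                                # put 1s back on the diagonal
}

# An improper "correlation" matrix: its smallest eigenvalue is negative
R <- matrix(c(1.0, 0.9, 0.2,
              0.9, 1.0, 0.9,
              0.2, 0.9, 1.0), 3, 3)
min(eigen(R, symmetric = TRUE)$values)   # negative: not positive semi-definite
Rs <- smooth_cor(R)
min(eigen(Rs, symmetric = TRUE)$values)  # now positive
```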
5 Item and scale analysis
The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.
5.1 Dimension reduction through factor analysis and cluster analysis
Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components)
¹Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.
or to explain (factors) the relationships between many observed variables in terms of amore limited set of components or latent factors
The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the simplest, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.
At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.
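The equivalence between the two routes can be verified in base R (an illustration with random data; the variable names are made up):

```r
# The SVD of the standardized data matrix and the eigen decomposition of
# the correlation matrix carry the same information: the squared singular
# values, divided by n - 1, are the eigenvalues of R.
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 5), n, 5)
Z <- scale(X)                                  # standardize each column
d <- svd(Z)$d                                  # singular values of the data
ev <- eigen(cor(X), symmetric = TRUE)$values   # eigenvalues of R
cbind(d^2 / (n - 1), ev)                       # identical, column by column
```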
Several of the functions in psych address the problem of data reduction
fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

principal Principal Components Analysis reports the largest n eigenvectors rescaled by the square root of their eigenvalues.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the square root of the product of their summed squared loadings. It differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.
vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

²Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.
fa.parallel The parallel factors technique compares the observed eigenvalues of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fails to increase.
5.1.1 Minimum Residual Factor Analysis
The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix ₙRₙ be approximated by the product of a factor matrix ₙFₖ and its transpose plus a diagonal matrix of uniquenesses?

R = FF′ + U²    (1)
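Equation 1 is easy to verify numerically: with hypothetical loadings whose communalities plus uniquenesses sum to 1, the reproduced R is a proper correlation matrix whose common part has reduced rank (an illustrative base-R sketch, not psych code):

```r
# Build R = FF' + U^2 from hypothetical rank-2 loadings and check that the
# diagonal is exactly 1 (communality + uniqueness = 1 for each item).
lambda <- matrix(c(0.8, 0.7, 0.6, 0.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 0.7, 0.6, 0.5), nrow = 6, ncol = 2)
U2 <- diag(1 - rowSums(lambda^2))   # uniquenesses
R  <- lambda %*% t(lambda) + U2
diag(R)                             # all 1: a proper correlation matrix
qr(lambda %*% t(lambda))$rank       # the common part has rank 2
```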
The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych, all of them included in the fa function and called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigenvalues of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.
A classic data set collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999) is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 1). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation, although prior versions defaulted to varimax. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).
5.1.2 Principal Axis Factor Analysis
An alternative least squares algorithm, factor.pa (incorporated into fa as an option, fm="pa"), does a Principal Axis factor analysis by iteratively doing an eigenvalue decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
The solutions from the fa, factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique transformations include oblimin, quartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.
Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 2).
5.1.3 Weighted Least Squares Factor Analysis
Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller
Table 1: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> f3t <- fa(Thurstone, 3, n.obs=213)
> f3t
Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63
Table 2: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
> f3o <- target.rot(f3)
> f3o
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix

                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00
overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 3).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 4). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 4) or by a structure diagram (Figure 5).
> plot(f3t)
Figure 4: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.
A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but
Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone, 3, n.obs = 213, fm="wls")
> print(f3w, cut=0, digits=3)
Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix

                   WLS1   WLS2   WLS3    h2    u2  com
Sentences         0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary        0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion   0.833  0.034  0.007 0.735 0.265 1.00
First.Letters    -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words   -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes          0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series     0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees         0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group     -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with MLE Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627
> fa.diagram(f3t)
Figure 5: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.
that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
5.1.4 Principal Components Analysis (PCA)
An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Psychologists typically use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 4 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar to, but smaller than, the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. (Factor loadings and PCA loadings converge as the number of items increases; factor loadings may be thought of as the asymptotic principal component loadings as the number of items on the component grows towards infinity.) Also note how the goodness of fit statistics based upon the residual off diagonal elements are much worse than for the fa solution.
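What principal reports can be sketched in base R (illustrative only; principal itself adds rotation, sorting and fit statistics): the unrotated component loadings are the eigenvectors of R scaled by the square roots of the eigenvalues, and keeping all components reproduces R exactly, which is the sense in which PCA models the entire variance.

```r
# Unrotated component loadings from the eigen decomposition of R.
set.seed(2)
X <- matrix(rnorm(300 * 4), 300, 4)
R <- cor(X)
e <- eigen(R, symmetric = TRUE)
loadings <- e$vectors %*% diag(sqrt(e$values))  # component loadings
max(abs(loadings %*% t(loadings) - R))          # essentially 0
```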
5.1.5 Hierarchical and bi-factor solutions
For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions producing a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression and aggression).
Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 5) can themselves be factored. This results in a higher order factor model (Figure 6). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 7). Yet
Table 4: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 2. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone, 3, n.obs = 213, rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix

                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98
another solution is to use structural equation modeling to directly solve for the general andgroup factors
> om.h <- omega(Thurstone, n.obs=213, sl=FALSE)
> op <- par(mfrow=c(1,1))
Figure 6: A higher order factor solution to the Thurstone 9 variable problem.
Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011). Unfortunately, this rotation will frequently achieve a local rather than global optimum, and it is recommended to try multiple starting configurations.
> om <- omega(Thurstone, n.obs=213)
Figure 7: A bifactor solution to the Thurstone 9 variable problem.
5.1.6 Item Cluster Analysis: iclust
An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited application to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.
Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):
1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.
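The first four steps can be sketched in a few lines of base R (an illustration with made-up data, not the iclust implementation; iclust itself adds the α/β stopping rule and the purification step):

```r
# Steps 1-4: proximity matrix, most similar pair, merge, re-evaluate.
set.seed(3)
X <- matrix(rnorm(100 * 4), 100, 4)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.5)   # make items 1 and 2 very similar

r <- cor(X)                               # step 1: the proximity matrix
diag(r) <- NA                             # ignore the self-correlations
pair <- which(r == max(r, na.rm = TRUE), arr.ind = TRUE)[1, ]  # step 2
cluster <- rowMeans(X[, pair])            # step 3: combine into a cluster
cor(cluster, X[, -pair])                  # step 4: cluster-item similarity
sort(pair)                                # items 1 and 2 were merged
```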
iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 8).
Cluster loadings (corresponding to the structure matrix of factor analysis) are reportedwhen printing (Table 7) The pattern matrix is available as an object in the results
> data(bfi)
> ic <- iclust(bfi[1:25])
Figure 8: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.
The previous analysis (Figure 8) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 9). Note that finding the polychoric correlations takes some time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using
Table 5: The summary statistics from an iclust analysis show three large clusters and a smaller cluster.
> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
reliabilities on diagonal
correlations corrected for attenuation above diagonal:
      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61
progressBar. polychoric and tetrachoric, when running on OS X, will take advantage of the multiple cores on the machine and run somewhat faster.
Table 6: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function
A comparison of these four cluster solutions suggests both a problem and an advantage ofclustering techniques The problem is that the solutions differ The advantage is that thestructure of the items may be seen more clearly when examining the clusters rather thana simple factor solution
5.2 Confidence intervals using bootstrapping techniques
Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 8). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.
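The bootstrap logic itself is simple and can be sketched in base R for a single correlation (an illustration with simulated data; the n.iter argument of fa applies the same case-resampling idea to every loading):

```r
# Percentile bootstrap confidence interval for a correlation coefficient.
set.seed(4)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
boot_r <- replicate(1000, {
  i <- sample.int(n, replace = TRUE)   # resample cases with replacement
  cor(x[i], y[i])                      # refit the statistic on the resample
})
ci <- quantile(boot_r, c(0.025, 0.975))  # 95% confidence interval
ci
```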
5.3 Comparing factor/component/cluster solutions
Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of the congruence coefficient, which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}
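To make the formula concrete, here is a sketch computing the congruence coefficient directly from its definition for two hypothetical loading vectors, and checking it against factor.congruence:

```r
library(psych)
# Two hypothetical loading vectors (illustrative values only)
fx <- c(0.7, 0.6, 0.5, 0.1)
fy <- c(0.6, 0.7, 0.4, 0.0)
# The congruence coefficient: the cosine of the angle between the vectors
# (unlike a correlation, the loadings are not centered)
cong <- sum(fx * fy) / sqrt(sum(fx^2) * sum(fy^2))
# factor.congruence does the same computation for whole loading matrices
factor.congruence(matrix(fx), matrix(fy))
```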
Consider the case of a four factor solution and a four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25], 4, fm = "pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
> ic.poly <- iclust(r.poly$rho, title = "ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)
[Figure: ICLUST dendrogram titled "ICLUST using polychoric correlations", showing the cluster tree with the α and β of each cluster and the item loadings]
Figure 9: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 8), which was done using Pearson correlations.
> ic.poly <- iclust(r.poly$rho, 5, title = "ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)
[Figure: ICLUST dendrogram titled "ICLUST using polychoric correlations for nclusters=5"]
Figure 10: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 9), which was done without specifying the number of clusters, and to the next one (Figure 11), which was done by changing the beta criterion.
> ic.poly <- iclust(r.poly$rho, beta.size = 3, title = "ICLUST, beta.size=3")
[Figure: ICLUST dendrogram titled "ICLUST, beta.size=3"]
Figure 11: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 8, 9, and 10).
Table 7: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic, cut = .3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O    P     C20   C16   C15   C21
A1  C20  C20
A2  C20  C20   0.59
A3  C20  C20   0.65
A4  C20  C20   0.43
A5  C20  C20   0.65
C1  C15  C15               0.54
C2  C15  C15               0.62
C3  C15  C15               0.54
C4  C15  C15         0.31 -0.66
C5  C15  C15  -0.30  0.36 -0.59
E1  C20  C20  -0.50
E2  C20  C20  -0.61  0.34
E3  C20  C20   0.59             -0.39
E4  C20  C20   0.66
E5  C20  C20   0.50        0.40 -0.32
N1  C16  C16         0.76
N2  C16  C16         0.75
N3  C16  C16         0.74
N4  C16  C16  -0.34  0.62
N5  C16  C16         0.55
O1  C20  C21                    -0.53
O2  C21  C21                     0.44
O3  C20  C21   0.39             -0.62
O4  C21  C21                    -0.33
O5  C21  C21                     0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
Table 8: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter = 20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with MLE Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.078 and the 90% confidence intervals are 0.071 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.46 -0.40 -0.36
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.06 -0.03  0.00  0.74  0.77  0.80
A4  0.10  0.15  0.18  0.41  0.44  0.48
A5 -0.02  0.02  0.07  0.58  0.62  0.66
C1  0.52  0.57  0.60 -0.08 -0.05 -0.02
C2  0.59  0.62  0.66 -0.04 -0.01  0.02
C3  0.50  0.54  0.57 -0.01  0.03  0.08
C4 -0.70 -0.66 -0.62 -0.02  0.01  0.03
C5 -0.62 -0.57 -0.53 -0.08 -0.05 -0.01

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.28     0.32  0.37
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90
A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 9).
Table 9: Congruence coefficients for oblique factor, bifactor, and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))

     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00
5.4 Determining the number of dimensions to extract
How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).
Techniques most commonly used include:
1) Extracting factors until the chi square of the residual matrix is not significant.
2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.
3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).
4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).
5) Extracting factors as long as they are interpretable.
6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).
7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).
8) Extracting principal components until the eigen value < 1.
Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples (N > 1000) all of the eigen values of random factors will be very close to 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
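As a sketch, several of these criteria can be applied to the same data set and compared side by side; the functions below are those discussed in the text:

```r
library(psych)
# Apply several number-of-factors criteria to the same data.
# The criteria need not agree, as discussed above.
vss(bfi[1:25])          # VSS complexity 1 and 2, plus Velicer's MAP
fa.parallel(bfi[1:25])  # parallel analysis (Horn, 1965)
```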
An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
5.4.1 Very Simple Structure
The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Average Partial correlation) of Velicer (1976).
Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 12). However, the MAP criterion suggests that 5 is optimal.
> vss <- vss(bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
[Figure: plot titled "Very Simple Structure of a Big 5 inventory"; x-axis: Number of Factors (1-8), y-axis: Very Simple Structure Fit (0-1), with separate lines for complexity 1 through 4]
Figure 12: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.
> vss
Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit RMSEA  BIC   SABIC complex eChisq  SRMR eCRMS  eBIC
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.123 9648 10522.1     1.0  23881 0.119 0.125 21698
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.100 5287  6084.5     1.2  12432 0.086 0.094 10440
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.087 3200  3924.3     1.3   7232 0.066 0.075  5422
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.074 1731  2385.1     1.4   3750 0.047 0.057  2115
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.055  281   869.3     1.6   1495 0.030 0.038    27
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.043 -296   228.5     1.7    670 0.020 0.027  -639
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.037 -463     1.2     1.9    448 0.016 0.023  -711
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.032 -524  -117.6     1.9    289 0.013 0.020  -727
5.4.2 Parallel Analysis
An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 13). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.
A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly or fa.parallel with the cor option set to "poly". By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
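A sketch of the polychoric case (the cor = "poly" form assumes a recent version of psych; older versions used fa.parallel.poly):

```r
library(psych)
# Parallel analysis on polychoric correlations.  Each replication refits the
# polychoric correlations, so this is much slower than the Pearson case;
# n.iter = 20 keeps the run time manageable.
fa.parallel(bfi[1:25], cor = "poly", n.iter = 20)
```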
5.5 Factor extension
Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created
> fa.parallel(bfi[1:25], main = "Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6
[Figure: scree-style plot titled "Parallel Analysis of a Big 5 inventory"; x-axis: Factor/Component Number, y-axis: eigenvalues of principal components and factor analysis; lines for the PC and FA eigen values of the actual, simulated, and resampled data]
Figure 13: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.
to represent one two-dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 14).
Another way to examine the overlap between two sets is the use of set correlation, found by setcor (discussed later).
6 Classical Test Theory and Reliability
Surprisingly, 112 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).
α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(V_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(V_x)}{V_x} = \alpha
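The formula can be checked directly. This sketch computes raw α from an item covariance matrix (using the attitude data set as an arbitrary example); the result can be compared with what the alpha function reports:

```r
library(psych)
# Raw alpha from the formula above:
# alpha = n/(n-1) * (1 - tr(V)/Vx), where Vx = 1'V1 is the total test variance
V <- cov(attitude)
n <- ncol(V)
raw.alpha <- (n / (n - 1)) * (1 - tr(V) / sum(V))
raw.alpha   # compare with the raw_alpha reported by alpha(attitude)
```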
Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. For τ equivalent items, alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.
Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j^2, and is

\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}
The squared multiple correlation is a lower bound for the item communality and as thenumber of items increases becomes a better estimate
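Following the same logic, λ6 may be sketched directly from the squared multiple correlations (for standardized items, Vx is the sum of the correlation matrix; attitude is used only as an arbitrary example):

```r
library(psych)
# Guttman's lambda 6 from the formula above, for standardized items
R <- cor(attitude)
lambda6 <- 1 - sum(1 - smc(R)) / sum(R)
lambda6   # compare with the G6(smc) value reported by alpha(attitude)
```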
G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability
> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[,s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe = fe)
[Figure: path diagram titled "Factor analysis and extension", showing factors MR1 and MR2 defined by the odd-numbered variables (left) and extended to the even-numbered variables (right)]
Figure 14: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.
the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
Alpha, G6, and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
More complete reliability analyses of a single scale can be done using the omega functionwhich finds ωh and ωt based upon a hierarchical factor analysis
Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.
Functions for examining the reliability of a single scale or a set of scales include:
alpha: Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.
guttman: Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lower bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).
omega: Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)
cluster.cor: Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and an n x n correlation matrix, find the correlations of the composite clusters.
scoreItems: Given a matrix or data.frame of k keys for m items (-1, 0, 1) and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.
score.multiple.choice: Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
6.1 Reliability of a single scale
A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as when each item is dropped in turn.
> set.seed(17)
> r9 <- sim.hierarchical(n = 500, raw = TRUE)$observed
> round(cor(r9), 2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97
Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that item 2 is to be scored (incorrectly) negatively.
> keys <- c(1, -1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
       0.43      0.52    0.75      0.14 1.1 0.13   58 5.5

 lower alpha upper     95% confidence boundaries
  0.18  0.43  0.68

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r  S/N alpha se
rating           0.32      0.44    0.67     0.114 0.77    0.170
complaints-      0.80      0.80    0.82     0.394 3.89    0.057
privileges       0.27      0.41    0.72     0.103 0.69    0.176
learning         0.14      0.31    0.64     0.069 0.44    0.207
raises           0.14      0.27    0.61     0.059 0.38    0.205
critical         0.36      0.47    0.76     0.130 0.90    0.139
advance          0.21      0.34    0.66     0.079 0.51    0.171

 Item statistics
             n raw.r std.r r.cor r.drop mean   sd
rating      30  0.60  0.60  0.60   0.33   65 12.2
complaints- 30 -0.57 -0.58 -0.74  -0.74   50 13.3
privileges  30  0.66  0.65  0.54   0.42   53 12.2
learning    30  0.80  0.79  0.78   0.64   56 11.7
raises      30  0.82  0.83  0.85   0.69   65 10.4
critical    30  0.50  0.53  0.35   0.27   75  9.9
advance     30  0.74  0.75  0.71   0.58   43 10.3
Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if item 2 is dropped, then the reliability is improved substantially. This suggests that item 2 was incorrectly scored. Doing the analysis again with item 2 positively scored produces much more favorable results.
> keys <- c(1, 1, 1, 1, 1, 1, 1)
> alpha(attitude, keys)

Reliability analysis
Call: alpha(x = attitude, keys = keys)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3
It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.
> items <- sim.congeneric(N = 500, short = FALSE, low = -2, high = 2, categorical = TRUE)  #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2    -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0
6.2 Using omega to find the reliability of a single scale
Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 15) or omegaSem for a confirmatory analysis using the sem package (Fox et al., 2013) based upon the exploratory solution from omega.
McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)
One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal). For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)
β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis, a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded regardless of the method chosen to estimate ωh is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).
Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.
There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √r_ab, where r_ab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. 2007. To do this in omega, add the option="first" or option="second" to the call.
Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.
In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.
McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness u² from factor analysis to find e_j^2. This is based on a decomposition of the variance of a test score, V_x, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting

x = cg + Af + Ds + e

then the communality of item j, based upon general as well as group factors, is

h_j^2 = c_j^2 + \sum f_{ij}^2

and the unique variance for the item,

u_j^2 = \sigma_j^2 (1 - h_j^2),

may be used to estimate the test reliability. That is, if h_j^2 is the communality of item j, based upon general as well as group factors, then for standardized items, e_j^2 = 1 - h_j^2 and

\omega_t = \frac{\mathbf{1}cc'\mathbf{1} + \mathbf{1}AA'\mathbf{1}'}{V_x} = 1 - \frac{\sum (1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}
Because h²_j ≥ r²_smc, ωt ≥ λ6.
It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.
ωh = 1cc'1 / Vx
Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

ω∞ = 1cc'1 / (1cc'1 + 1AA'1')
In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.
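These identities are easy to check numerically. Below is a minimal sketch (in Python rather than R, and with made-up bifactor loadings c and A rather than the simulated r9 data) that builds the model implied correlation matrix and verifies that the two expressions for ωt agree, and that the asymptotic omega equals ωh/ωt:

```python
import numpy as np

# Hypothetical bifactor loadings (illustrative only, not the r9 data):
# c holds the general factor loadings, A the loadings on two group factors.
c = np.array([0.7, 0.6, 0.5, 0.6, 0.5, 0.4])
A = np.array([[0.4, 0.0], [0.4, 0.0], [0.3, 0.0],
              [0.0, 0.4], [0.0, 0.35], [0.0, 0.3]])

h2 = c**2 + (A**2).sum(axis=1)          # communalities h2_j
u2 = 1 - h2                             # uniquenesses u2_j
R = np.outer(c, c) + A @ A.T            # common part of the correlation matrix
np.fill_diagonal(R, 1.0)                # standardized items: diagonal is 1
Vx = R.sum()                            # variance of the total score

omega_h = c.sum()**2 / Vx                                # 1cc'1 / Vx
omega_t = (c.sum()**2 + (A.sum(axis=0)**2).sum()) / Vx   # (1cc'1 + 1AA'1') / Vx
omega_t_alt = 1 - u2.sum() / Vx                          # 1 - sum(u2_j) / Vx
omega_inf = c.sum()**2 / (c.sum()**2 + (A.sum(axis=0)**2).sum())
```

The two forms of ωt agree because the diagonal of the implied matrix is h²_j + u²_j = 1 for standardized items.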
> om9
9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
> om9 <- omega(r9,title="9 simulated variables")

[Bifactor diagram omitted: items V1-V9 with paths from three group factors (F1-F3) and from the general factor g]

Figure 15: A bifactor solution for 9 simulated variables with a hierarchical structure.
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                  g    F1    F2    F3
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
6.3 Estimating ωh using Confirmatory Factor Analysis
The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.
> omegaSem(r9,n.obs=500)

Call: omegaSem(m = r9, n.obs = 500)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63
With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.026  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.1  and the 90 % confidence intervals are  0.085 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                  g    F1    F2    F3
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19
Omega Hierarchical from a confirmatory model using sem =  0.72
Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g   F1   F2   F3   h2   u2
V1 0.74 0.40           0.71 0.29
V2 0.58 0.36           0.47 0.53
V3 0.56 0.42           0.50 0.50
V4 0.57      0.45      0.53 0.47
V5 0.58      0.25      0.40 0.60
V6 0.43      0.26      0.26 0.74
V7 0.49           0.38 0.38 0.62
V8 0.44           0.45 0.39 0.61
V9 0.34           0.24 0.17 0.83

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33
6.3.1 Other estimates of reliability
Other estimates of reliability are found by the splitHalf function. These are described in more detail in Revelle and Zinbarg (2009). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.
> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72
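The logic behind these estimates can be sketched directly: for standardized items, the split half reliability of a split into halves A and B is Guttman's λ4 = 2(1 − (VA + VB)/Vx). The Python below (a hypothetical 4-item correlation matrix, not the r9 data) evaluates every possible split:

```python
from itertools import combinations
import numpy as np

# Hypothetical 4-item correlation matrix (illustrative only)
R = np.array([[1.0, 0.4, 0.3, 0.3],
              [0.4, 1.0, 0.3, 0.3],
              [0.3, 0.3, 1.0, 0.4],
              [0.3, 0.3, 0.4, 1.0]])
n = R.shape[0]
Vx = R.sum()                            # variance of the total score

lambda4 = []
for half in combinations(range(n), n // 2):
    a = list(half)
    b = [i for i in range(n) if i not in half]
    Va = R[np.ix_(a, a)].sum()          # variance of half A
    Vb = R[np.ix_(b, b)].sum()          # variance of half B
    lambda4.append(2 * (1 - (Va + Vb) / Vx))

best, worst, avg = max(lambda4), min(lambda4), float(np.mean(lambda4))
```

splitHalf reports exactly these three summaries (maximum, minimum, and average split half reliability) over all possible splits.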
6.4 Reliability and correlations of multiple scales within an inventory
A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.
6.4.1 Scoring from raw data
To score these five scales from the 25 items, use the scoreItems function with the helper function make.keys. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.
First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copied into R.) Although not normally necessary, show the keys to understand what is happening.
Note that the number of items to specify in the make.keys function is the total number of items in the inventory. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
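The underlying bookkeeping is simple to sketch. The Python below (hypothetical scales and responses, not the bfi keys) mimics what make.keys does with its signed, 1-based item lists, and applies the reverse scoring rule (max + min minus the item) before averaging:

```python
import numpy as np

def make_keys(n_vars, scales):
    """Turn signed 1-based item lists into an items x scales matrix of -1/0/1."""
    keys = np.zeros((n_vars, len(scales)), dtype=int)
    for col, items in enumerate(scales.values()):
        for item in items:
            keys[abs(item) - 1, col] = 1 if item > 0 else -1
    return keys

def score_scale(responses, key_col, lo=1, hi=6):
    """Mean item score for one scale; -1 keyed items are reverse scored."""
    used = key_col != 0
    vals = responses[:, used].astype(float)
    flip = key_col[used] == -1
    vals[:, flip] = (lo + hi) - vals[:, flip]   # reverse score: (max + min) - x
    return vals.mean(axis=1)

keys = make_keys(6, {"ScaleA": [-1, 2, 3], "ScaleB": [4, 5, -6]})
responses = np.array([[2, 5, 4, 3, 2, 6]])      # one subject, items on a 1-6 scale
scoreA = score_scale(responses, keys[:, 0])     # mean of (7-2, 5, 4)
scoreB = score_scale(responses, keys[:, 1])     # mean of (3, 2, 7-6)
```

The real make.keys and scoreItems do considerably more (missing data handling, reliability estimates, totals vs. means), but the weight matrix they build has exactly this form.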
> keys <- make.keys(nvars=28,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25)),
+    item.labels=colnames(bfi))
> keys
          Agree Conscientious Extraversion Neuroticism Openness
A1           -1             0            0           0        0
A2            1             0            0           0        0
A3            1             0            0           0        0
A4            1             0            0           0        0
A5            1             0            0           0        0
C1            0             1            0           0        0
C2            0             1            0           0        0
C3            0             1            0           0        0
C4            0            -1            0           0        0
C5            0            -1            0           0        0
E1            0             0           -1           0        0
E2            0             0           -1           0        0
E3            0             0            1           0        0
E4            0             0            1           0        0
E5            0             0            1           0        0
N1            0             0            0           1        0
N2            0             0            0           1        0
N3            0             0            0           1        0
N4            0             0            0           1        0
N5            0             0            0           1        0
O1            0             0            0           0        1
O2            0             0            0           0       -1
O3            0             0            0           0        1
O4            0             0            0           0        1
O5            0             0            0           0       -1
gender        0             0            0           0        0
education     0             0            0           0        0
age           0             0            0           0        0
The use of multiple key matrices for different inventories is facilitated by using the superMatrix function to combine two or more matrices. This allows convenient scoring of large data sets combining multiple inventories with keys based upon each individual inventory. Pretend for the moment that the big 5 items were made up of two inventories, one consisting of the first 10 items, the second the last 18 items (15 personality items + 3 demographic items). Then the following code would work:
> keys.1 <- make.keys(10,list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10)))
> keys.2 <- make.keys(15,list(Extraversion=c(-1,-2,3:5),Neuroticism=c(6:10),
+    Openness = c(11,-12,13,14,-15)))
> keys.25 <- superMatrix(list(keys.1,keys.2))
The resulting keys matrix is identical to that found above, except that it does not include the extra 3 demographic items. This is useful when scoring the raw items, because the response frequencies for each category are reported, and for the demographic data.
This use of making multiple key matrices and then combining them into one super matrix of keys is particularly useful when combining demographic information with items to be scored. A set of demographic keys can be made and then these can be combined with the keys for the particular scales.
Now use these keys in combination with the raw data to score the items, calculate basic reliability and intercorrelations, and find the item-by-scale correlations for each item and each scale. By default, missing data are replaced by the median for that variable.
> scores <- scoreItems(keys,bfi)
> scores

Call: scoreItems(keys = keys, items = bfi)

(Unstandardized) Alpha:
      Agree Conscientious Extraversion Neuroticism Openness
alpha   0.7          0.72         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    Agree Conscientious Extraversion Neuroticism Openness
ASE 0.014         0.014        0.013       0.011    0.017

Average item correlation:
          Agree Conscientious Extraversion Neuroticism Openness
average.r  0.32          0.34         0.39        0.46     0.23

Guttman 6 reliability:
         Agree Conscientious Extraversion Neuroticism Openness
Lambda.6   0.7          0.72         0.76        0.81      0.6

Signal/Noise based upon av.r:
             Agree Conscientious Extraversion Neuroticism Openness
Signal/Noise   2.3           2.6          3.2         4.3      1.5
Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.70          0.36         0.63      -0.245     0.23
Conscientious  0.26          0.72         0.35      -0.305     0.30
Extraversion   0.46          0.26         0.76      -0.284     0.32
Neuroticism   -0.18         -0.23        -0.22       0.812    -0.12
Openness       0.15          0.19         0.22      -0.086     0.60
In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.
To see the additional information (the raw correlations, the individual scores, etc.), it may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (see Figure 16).
6.4.2 Forming scales from a correlation matrix
There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.
Consider the same bfi data set, but first find the correlations, and then use cluster.cor.
> r.bfi <- cor(bfi,use="pairwise")
> scales <- cluster.cor(keys,r.bfi)
> summary(scales)

Call: cluster.cor(keys = keys, r.mat = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              Agree Conscientious Extraversion Neuroticism Openness
Agree          0.71          0.35         0.64      -0.242     0.25
Conscientious  0.25          0.73         0.36      -0.287     0.30
Extraversion   0.47          0.27         0.76      -0.278     0.35
Neuroticism   -0.18         -0.22        -0.22       0.815    -0.11
Openness       0.16          0.20         0.24      -0.074     0.61
To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
quartz
     2

Figure 16: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.
6.5 Scoring Multiple Choice Items
Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39
> # just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  # compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01
Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
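The scoring step itself is just a comparison with the key. A minimal Python sketch (hypothetical key and responses, not the SAPA items):

```python
import numpy as np

key = np.array([4, 4, 6, 3])              # hypothetical correct alternatives
responses = np.array([[4, 2, 6, 3],       # subject 1
                      [4, 4, 1, 3],       # subject 2
                      [1, 4, 6, 2]])      # subject 3

scored = (responses == key).astype(int)   # 1 = correct, 0 = incorrect
totals = scored.sum(axis=1)               # number correct per subject

# Distractor analysis for one item: proportion choosing each alternative
alts, counts = np.unique(responses[:, 1], return_counts=True)
props = counts / len(responses)
```

The response frequency table printed above is just this proportion computed for every alternative of every item.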
6.6 Item analysis
Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or, based upon correlation matrices, using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait, or just the total score. This may be done by using irt.responses (Figure 17).
7 Item Response Theory analysis
Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 18).
> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> # note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  # set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Four-panel figure omitted: P(response) as a function of theta for reason.4, reason.16, reason.17, and reason.19, with one curve per response alternative]

Figure 17: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.
7.1 Factor analysis and Item Response Theory
If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δ_j, and item discrimination, α_j, may be found from factor analysis of the tetrachoric correlations, where λ_j is just the factor loading on the first factor and τ_j is the normal threshold reported by the tetrachoric function:

δ_j = Dτ_j / √(1 − λ_j²),          α_j = λ_j / √(1 − λ_j²)          (2)

where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case, and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λ_j) and item thresholds (τ_j) are just

λ_j = α_j / √(1 + α_j²),          τ_j = δ_j / √(1 + α_j²).
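For the normal ogive case (D = 1) the two parameterizations are exact inverses, which is easy to verify. A short Python sketch with a hypothetical loading and threshold:

```python
import math

D = 1.0    # normal ogive metric; use 1.702 for the logistic parameterization

def fa_to_irt(lam, tau):
    """Factor loading and threshold -> IRT discrimination and difficulty."""
    denom = math.sqrt(1 - lam**2)
    return lam / denom, D * tau / denom        # alpha_j, delta_j

def irt_to_fa(alpha, delta):
    """The inverse mapping for the normal model."""
    denom = math.sqrt(1 + alpha**2)
    return alpha / denom, delta / denom        # lambda_j, tau_j

lam, tau = 0.6, -0.5                           # hypothetical item values
alpha, delta = fa_to_irt(lam, tau)             # 0.75, -0.625
lam2, tau2 = irt_to_fa(alpha, delta)           # back to 0.6, -0.5
```

The round trip works because 1 + α_j² = 1/(1 − λ_j²).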
Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.
> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal")  # dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.209  and the 90 % confidence intervals are  0.198 0.218
BIC =  1009.07
Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:
> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04
The df corrected root mean square of the residuals is  0.06

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.097  and the 90 % confidence intervals are  0.083 0.111
BIC =  96.23
The item information functions show that not all of the items are equally good (Figure 19):
These procedures can be generalized to more than one factor by specifying the number of
> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Three-panel figure omitted: item characteristic curves, item information curves, and test information with a reliability scale, for items V1-V9]

Figure 18: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.
> e.info <- plot(e.irt,type="IIC")

[Figure omitted: item information curves for E1-E5]

Figure 19: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.
factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.
> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02
More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.
7.2 Speeding up analyses
Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 20) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items.
> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = iq.tf)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor =  1
             -3    -2    -1     0     1     2     3
reason.4   0.04  0.20  0.59  0.59  0.19  0.04  0.01
reason.16  0.07  0.23  0.46  0.39  0.16  0.05  0.01
reason.17  0.06  0.27  0.69  0.50  0.14  0.03  0.01
reason.19  0.05  0.17  0.41  0.46  0.22  0.07  0.02
> iq.irt <- irt.fa(iq.tf)

[Figure omitted: item information curves for the 16 ability items (reason, letter, matrix, and rotate items)]

Figure 20: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.
letter.7    0.04  0.16  0.44  0.52  0.23  0.06  0.02
letter.33   0.04  0.14  0.34  0.43  0.24  0.08  0.02
letter.34   0.04  0.17  0.48  0.54  0.22  0.06  0.01
letter.58   0.02  0.08  0.28  0.55  0.39  0.13  0.03
matrix.45   0.05  0.11  0.23  0.29  0.21  0.10  0.04
matrix.46   0.05  0.12  0.25  0.30  0.21  0.10  0.04
matrix.47   0.05  0.16  0.37  0.41  0.21  0.07  0.02
matrix.55   0.04  0.08  0.14  0.20  0.19  0.13  0.06
rotate.3    0.00  0.01  0.07  0.29  0.66  0.45  0.13
rotate.4    0.00  0.01  0.05  0.30  0.91  0.55  0.11
rotate.6    0.01  0.03  0.14  0.49  0.68  0.27  0.06
rotate.8    0.00  0.02  0.08  0.27  0.54  0.39  0.13
Test Info   0.55  1.95  4.99  6.53  5.41  2.57  0.71
SEM         1.34  0.72  0.45  0.39  0.43  0.62  1.18
Reliability -0.80 0.49  0.80  0.85  0.82  0.61 -0.40
Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 104  and the objective function was  1.85
The number of observations was  1525  with Chi Square =  2801.05  with prob <  0

The root mean square of the residuals (RMSA) is  0.08
The df corrected root mean square of the residuals is  0.09

Tucker Lewis Index of factoring reliability =  0.742
RMSEA index =  0.131  and the 90 % confidence intervals are  0.126 0.135
BIC =  2038.76
7.3 IRT based scoring
The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for score.irt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The score.irt function applies basic IRT to this problem.
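The point can be made with a toy example (hypothetical 2PL parameters and a simple grid search, not the actual score.irt machinery). Two subjects each pass 1 of the 2 items they were given, so their classical scores are equal, but the subject who passed the harder item earns the higher trait estimate:

```python
import numpy as np

a = np.array([1.2, 1.0, 0.8])     # hypothetical discriminations
d = np.array([-1.0, 0.0, 1.0])    # hypothetical difficulties (easy to hard)

def p(theta):
    """2PL probability of a correct response to each item."""
    return 1.0 / (1.0 + np.exp(-a * (theta - d)))

def theta_mle(resp, administered):
    """Grid-search ML estimate of theta using only the administered items."""
    grid = np.linspace(-4, 4, 801)
    ll = [np.sum(administered * (resp * np.log(p(t)) +
                                 (1 - resp) * np.log(1 - p(t)))) for t in grid]
    return grid[int(np.argmax(ll))]

# Both subjects: 1 correct out of 2 administered, so equal total scores
easy_passer = theta_mle(np.array([1, 0, 0]), np.array([1, 1, 0]))
hard_passer = theta_mle(np.array([0, 0, 1]), np.array([0, 1, 1]))
```

With tailored (incomplete) data like the SAPA samples, this difference between IRT and total-score estimates is exactly what score.irt is designed to capture.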
Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:
> v9 <- sim.irt(9,1000,-2,2,mod="normal")  # dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> om <- omega(iq.irt$rho,4)

[Omega diagram omitted: the 16 ability items with four group factors (F1-F4) and the general factor g]

Figure 21: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> # now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- score.irt(test,items)
> unitweighted <- score.irt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")
These results are seen in Figure 22
8 Multilevel modeling
Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.
8.1 Decomposing data into within and between level correlations using statsBy
There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.
This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:
$$ r_{xy} = \eta_{x_{wg}} \ast \eta_{y_{wg}} \ast r_{xy_{wg}} + \eta_{x_{bg}} \ast \eta_{y_{bg}} \ast r_{xy_{bg}} \qquad (3) $$
> pairs.panels(scores.df, pch='.', gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)

Figure 22: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems. [Scatter plot matrix residue removed: the panels compare True theta, irt theta, total, fit, rasch, total, and fit.]
where $r_{xy}$ is the normal correlation which may be decomposed into a within group and a between group correlation, $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and η (eta) is the correlation of the data with the within group values or the group means.
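This decomposition can be verified numerically with base R. The sketch below uses simulated grouped data (all object names are illustrative): each variable is split into its within group deviations and its group means, and the two weighted level correlations reproduce the raw correlation exactly. statsBy computes these same quantities for real data.

```r
set.seed(42)
g <- rep(1:20, each = 10)                       # 20 groups of 10 observations
x <- rnorm(200) + rep(rnorm(20), each = 10)     # toy data with group effects
y <- 0.5 * x + rnorm(200)

xm <- ave(x, g); ym <- ave(y, g)                # between parts: group means
xw <- x - xm;    yw <- y - ym                   # within parts: deviations

eta.xw <- sd(xw)/sd(x); eta.xb <- sd(xm)/sd(x)  # eta = cor of data with each part
eta.yw <- sd(yw)/sd(y); eta.yb <- sd(ym)/sd(y)
rwg <- cor(xw, yw); rbg <- cor(xm, ym)          # within and between correlations

cor(x, y)                                        # the raw correlation ...
eta.xw * eta.yw * rwg + eta.xb * eta.yb * rbg    # ... equals the weighted sum
```

The within and between parts are orthogonal, so the identity holds to machine precision, not just approximately.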
8.2 Generating and displaying multilevel data
withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1 while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.
sim.multilevel will generate simulated data with a multilevel structure.
The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.
Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
9 Set Correlation and Multiple Regression from the correlation matrix
An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is
$$ R^2 = 1 - \prod_{i=1}^{n} (1 - \lambda_i) $$
where $\lambda_i$ is the ith eigenvalue of the eigenvalue decomposition of the matrix
$$ R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}. $$
Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case an alternative statistic, based upon the average canonical correlation, might be more appropriate.
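Given any correlation matrix, the set correlation can be computed directly in base R. The sketch below uses random stand-in data; the split into a three variable predictor set and a three variable criterion set is purely illustrative. The eigenvalues of the product matrix are the squared canonical correlations:

```r
set.seed(17)
raw <- matrix(rnorm(700 * 6), ncol = 6)   # stand-in for raw data
R <- cor(raw)
x <- 1:3; y <- 4:6                        # predictor and criterion column sets

# eigenvalues of Rxx^-1 Rxy Ryy^-1 Ryx are the squared canonical correlations
M <- solve(R[x, x]) %*% R[x, y] %*% solve(R[y, y]) %*% R[y, x]
lambda <- Re(eigen(M)$values)
R2.set <- 1 - prod(1 - lambda)            # Cohen's set correlation R2
R2.set
```

With independent random data, each λ is near zero and so is the set correlation; highly related sets push one or more λ toward 1, which is what inflates R² in the problem case described above.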
setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.
Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.
For this example, the analysis is done on the correlation matrix rather than the raw data.
> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)
Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272,  Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476
Compare this with the output from setCor.
> #compare with mat.regress
> setCor(c(4:6), c(1:3), C, n.obs=700)
Call: setCor(y = y, x = x, data = data, z = z, n.obs = n.obs, use = use,
    std = std, square = square, main = main, plot = plot)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05
degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohen's Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohen's Set Correlation: 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01
Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
10 Simulation functions
It is particularly helpful when trying to understand psychometric concepts to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.
A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item (a more general item simulation), sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural (a general simulation of structural models), sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.
sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.
sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.
Some of these functions are described in more detail in the companion vignette, psych for sem.
The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
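A simplex correlation matrix of this form is easy to construct directly. The base R sketch below builds one by hand (the function name is illustrative; sim.simplex generates data rather than just the matrix):

```r
# correlations are r for adjacent variables and r^n for variables n apart
simplex.R <- function(nvar, r) r^abs(outer(1:nvar, 1:nvar, "-"))
round(simplex.R(4, 0.8), 2)
# first row: 1.00 0.80 0.64 0.51 -- each step away multiplies by r
```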
Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.
The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models), depending upon the specification of the model.
The logistic model is
$$ P(x | \theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \qquad (4) $$
where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.
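Equation 4 can be evaluated directly. The sketch below is a plain base R implementation (irf4pl is an illustrative name, not a psych function):

```r
irf4pl <- function(theta, alpha = 1, delta = 0, gamma = 0, zeta = 1) {
  # gamma: lower asymptote (guessing); zeta: upper asymptote;
  # alpha: discrimination; delta: difficulty
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
}
irf4pl(theta = 0)                # Rasch item at its difficulty: 0.5
irf4pl(theta = 0, gamma = 0.25)  # guessing raises the probability to 0.625
```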
(Graphics of these may be seen in the demonstrations for the logistic function)
The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
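This near equivalence under the 1.702 scaling is easy to check numerically (a base R sketch):

```r
theta <- seq(-4, 4, 0.01)
logistic <- 1 / (1 + exp(-1.702 * theta))  # logistic with a = 1.702
ogive <- pnorm(theta)                      # normal ogive model
max(abs(logistic - ogive))                 # below 0.01 everywhere
```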
11 Graphical Displays
Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as by specific diagram functions, e.g. (but not shown):
f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)    #a pretty boring plot
iclust.diagram(c)    #a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)    #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)
The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.
> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double")  #for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")  #but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$right)
> dia.curved.arrow(mr, lr$top)
Figure 23: The basic graphic capabilities of the dia functions are shown in this figure.
12 Miscellaneous functions
A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail: combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems.
13 Data sets
A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.

Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.
iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1994), this data set allows for examples of basic scaling techniques.
14 Development version and a user's guide
The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.
News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.2.8", package="psych")
15 Psychometric Theory
The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.
For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.
16 SessionInfo
This document was prepared using the following settings:
gt sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats  graphics  grDevices  utils  datasets  methods  base

other attached packages:
[1] psych_1.6.5

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4   lattice_0.20-33   matrixcalc_1.0-3   MASS_7.3-45   grid_3.3.0   arm_1.8-6
 [7] nlme_3.1-127  stats4_3.3.0      coda_0.18-1        minqa_1.2.4   mi_1.0       nloptr_1.0.4
[13] Matrix_1.2-6  boot_1.3-18       splines_3.3.0      lme4_1.1-12   sem_3.1-7    tools_3.3.0
[19] abind_1.4-3   GPArotation_2014.11-1   parallel_3.3.0   mnormt_1.5-4
References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405-432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439-458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447-473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel Modeling in R (2.3): A Brief Introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245-276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78-98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173-178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England.

Fox, J., Nie, Z., and Byrnes, J. (2013). sem: Structural Equation Models. R package version 3.1-3.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430-450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255-282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121-132.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41-54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179-185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283-300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4):537-549.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231-258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309-317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153-175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676-1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481-495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57-74.

Revelle, W. (2017). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. http://cran.r-project.org/web/packages/psych. R package version 1.7.3.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39-73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403-414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27-49. Springer, New York, NY.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika, 74(1):145-154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83-90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420-428.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345-353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors". Psychological Review, 42(5):425-454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321-327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123-133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121-144.